Functional Role – Site Reliability Engineer – Developer - Analyst
Region – Bengaluru
Your Impact
Site Reliability Engineering (SRE) is an engineering discipline that combines software and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems. At Goldman Sachs, SRE is responsible for the availability and reliability of our firm's most critical platform services, and ensures they meet the requirements of our internal and external users. We look for engineers who are motivated to collaborate with our businesses to build and run sustainable production systems, which can evolve and adapt to changes in our fast-paced, global business environment.
The SRE team develops and maintains platforms that enable GS engineering teams to meet observability requirements and manage SLAs. As part of SRE Platforms, the team is responsible for designing, developing, and operating distributed systems that provide observability for Goldman’s mission-critical applications and platform services. These systems span on-premises datacentres and multiple public cloud environments. We design and build highly scalable tools that provide the following functions to our global engineering teams:
Alerting
Metrics and monitoring
Log collection and analysis
Tracing
The products and services we provide to our internal customers are used by thousands of engineers every day. We believe that reliability is the most important feature of any system, and we are devoted to giving our engineers the tools they need to build and operate reliable products.
How You Will Fulfil Your Potential
As a developer in the SRE team, you will work with internal customers, vendors, product owners, and SREs to design and develop a large-scale distributed system that handles alert generation, metrics collection, log collection, and trace events. You will run a production environment spanning cloud and on-prem datacentres. You will define observability features and drive their implementation.
Basic Qualifications
2+ years of relevant work experience.
Proficiency in one or more of the following: Java, Python, Go, JavaScript, Spring framework.
Proficiency in using Terraform for infrastructure deployment and management.
Excellent programming skills: developing, debugging, testing, and optimizing code.
Experience with algorithms, data structures, and software design.
Experience with distributed systems design, maintenance, and troubleshooting.
Preferred Experience
Knowledge of cloud-native solutions in AWS or GCP
Experience with products like Prometheus, Grafana, PagerDuty
Experience with databases like PostgreSQL, MongoDB, Elasticsearch
Experience with open-source messaging systems like RabbitMQ and/or Kafka
Systems experience in UNIX/Linux and networking, especially in scaling for performance and debugging complex distributed systems