Site Reliability Engineer – UNIX
UBS
We are seeking a highly experienced Site Reliability Engineer (SRE) to join our technology team in a mission-critical financial environment. This role is ideal for someone who has a proven track record of building and operating reliable, scalable systems in regulated industries such as banking or financial services.
As a Senior SRE, you will be responsible for ensuring the availability, performance, and resilience of our platforms. You’ll collaborate with engineering, infrastructure, and security teams to build systems that are secure, observable, and automated, while championing a culture of operational excellence.
Key Responsibilities
• Design, implement, and maintain highly available and fault-tolerant systems in a financial environment.
• Define and monitor Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs) to ensure system reliability and customer satisfaction.
• Identify, measure, and reduce toil, proactively eliminating repetitive manual tasks through automation.
• Lead incident response, post-mortems, and root cause analysis for production issues.
• Collaborate with development teams to embed reliability into the software development lifecycle.
• Integrate with observability platforms (e.g., Prometheus, Grafana, ELK, Datadog) to ensure end-to-end visibility of systems and services.