Malaysia
3 days ago
Senior SRE Lead (Application Support)

Job Summary:

We are seeking a highly skilled and experienced Senior Site Reliability Engineer (SRE) to join our dynamic team. In this role, you will be responsible for ensuring the reliability, performance, and scalability of our applications in a production environment. You will collaborate closely with development, operations, and product teams to design and implement robust application support strategies, troubleshoot complex issues, and enhance system performance.

As a Senior SRE, you will leverage your expertise in application support, automation, and monitoring tools to maintain high availability and reliability of our services. You will also play a key role in our SRE transformation by providing training, mentoring, and coaching to team members. Additionally, you will lead pilot SRE adoption initiatives for application teams, fostering a culture of reliability and operational excellence.

 

Key Responsibilities:

Pilot SRE adoption in traditional application support teams:

Study and assess the existing application architecture to identify areas for improvement in reliability and performance.
Derive and document critical user journeys to understand the most important paths through the application, ensuring that SRE practices align with user experience.
Define and establish SLOs that reflect the reliability and performance expectations for the application, ensuring alignment with business goals.
Identify and implement SLIs that will be used to measure the performance and reliability of the application against the defined SLOs.
Develop and manage error budgets to balance the trade-off between innovation and reliability, ensuring that teams understand the implications of their changes on system performance.
Build and maintain a backlog of identified toil (manual, repetitive tasks) to prioritize automation and process improvements that enhance operational efficiency.
Provide coaching and mentoring to current production support teams, helping them adopt SRE principles and practices to improve their operational capabilities.
Work closely with application teams to facilitate the adoption of SRE practices, ensuring that they understand the benefits and methodologies involved.
Create and deliver training materials and sessions to educate teams on SRE concepts, tools, and best practices.
Foster a culture of continuous improvement by regularly reviewing processes, gathering feedback, and iterating on SRE practices to enhance application reliability.

System Uptime & Reliability:

Deliver regional uptime, performance, and availability targets, ensuring that SLIs, SLOs, SLAs, and error budgets are in place and met across all critical services.
Proactively monitor and address risks that may impact the reliability of services, minimizing downtime and service disruptions.
Provide guidance on system architecture, fault tolerance, and disaster recovery processes to ensure robustness.
Define and monitor Critical User Journeys (CUJs) in collaboration with business and product teams to ensure they are tracked and optimized for reliability.
Build CUJ-level metrics and telemetry into all relevant services to track the end-to-end user experience.

Monitoring & Observability:

Implement observability tools and platforms for monitoring application health and user experience across the region.
Ensure that CUJ-level metrics and telemetry are built into all relevant services to track the end-to-end user experience.
Create actionable dashboards, alerts, and reports using industry-standard observability tools, including OpenTelemetry.
Ensure that all critical systems are observable, measurable, and can be monitored proactively.

Automation & Tooling:

Develop automation of operational tasks such as deployments, failover, scaling, and remediation across the region.
Advocate for and help teams build efficient tools and processes that reduce manual intervention, improve productivity, and eliminate bottlenecks.
Identify high-toil areas and eliminate them across teams by promoting automation, process improvements, and tools that increase operational efficiency.

Incident Response & Prevention:

Provide technical support for incident response and resolution.
Implement and execute blameless post-mortems.

Requirements:

12+ years of hands-on experience as an SRE in an application support team.
Strong Technical Background: Proficiency in software development skills and principles, application production support, SDLC best practices, and Agile methodology.
Hands-on SRE Skills: Familiarity with implementing SRE concepts, including SLOs, SLIs, error budgets, incident management, and blameless post-mortems.
Application Architecture Knowledge: Ability to analyze and understand application architectures to identify areas for improvement.
Monitoring and Observability Tools: Experience with monitoring, logging, and observability tools to track application performance.
Automation Skills: Proficiency in scripting and automation tools (e.g., Python, Bash, Terraform) to reduce toil and improve operational efficiency.
Incident Response and Troubleshooting: Strong problem-solving skills to effectively respond to incidents and perform root cause analysis.
Collaboration and Communication: Excellent interpersonal skills to work effectively with cross-functional teams and communicate technical concepts clearly.
Coaching and Mentoring: Ability to train and mentor team members in SRE practices and foster a culture of reliability.
Agile Methodologies: Familiarity with Agile development practices and experience working in Agile teams.
Continuous Improvement Mindset: A proactive approach to identifying areas for improvement and implementing changes to enhance reliability and performance.