Shanghai, Shanghai, China
1 day ago
Senior Site Reliability Engineer

Digital Business Services (DBS)

Our GCIO organisation plays a critical role for the bank. This team partners with the businesses to build the platforms, systems, and products that our customers use every day. We keep people’s money and data safe, and are at the forefront of driving innovation for our businesses, customers, and colleagues.

We are currently seeking an experienced professional to join our team.

In this role, you will:

System Reliability and Automation:
•Design, develop, and implement automation tools and scripts to reduce manual operational tasks ("toil") and enhance system resilience.
•Ensure high availability (e.g., 99.99% uptime) of critical banking applications, including core banking, payment systems, and global platforms and local systems.
•Conduct capacity planning and chaos engineering to test and improve system resilience under failure conditions.
Incident Management and Response:
•Participate in on-call rotations to respond to production incidents, troubleshoot issues, and conduct post-mortems to prevent recurrence.
•Collaborate with production support teams for rapid incident resolution and escalate complex issues to application teams or vendors as needed.
Collaboration and Coordination:
•Work closely with production support teams to streamline incident handling and integrate automated solutions into support processes.
•Partner with application development teams to embed reliability practices into the software development lifecycle (SDLC).
•Engage with the bank's operational resilience project team to align on initiatives for regulatory compliance, disaster recovery, and system robustness.
•Coordinate with global and regional SRE and DevOps teams to ensure consistency in tools, processes, and standards across distributed banking systems.
Monitoring and Observability:
•Implement and maintain monitoring solutions to track service-level indicators (SLIs) and ensure service-level objectives (SLOs) are met.
•Analyze system performance metrics and proactively address potential issues to maintain operational stability.
Process Improvement:
•Drive continuous improvement in reliability practices, including automation, incident response, and problem management processes.
•Contribute to error budget discussions to balance reliability with innovation in banking systems.
Compliance and Security:
•Ensure systems adhere to China's regulatory requirements (e.g., Cybersecurity Law, data localization) and global banking standards.
•Implement secure coding practices and collaborate with security teams to protect sensitive financial data.
