Take on a critical role in shaping the future of a globally recognized firm and make a direct, significant impact in a space designed for top achievers in site reliability.
As a Lead Site Reliability Engineer at JPMorgan Chase within the Commercial and Investment Bank's Markets Tech Group, you hold a leadership role in your team, demonstrate strong knowledge across multiple technical domains, and advise others on the technical and business issues facing them. You lead and conduct resiliency design reviews, break complex problems down into digestible work for other engineers, act as a technical lead for medium to large-sized products, and provide advice and mentoring to other engineers.
Job responsibilities
- Design, code, test and deliver software to automate manual operational work, including self-healing and resiliency patterns for engineering teams
- Collaborate with others to create and implement observability and reliability designs for complex systems that are robust and stable and do not incur additional toil or technical debt
- Demonstrate site reliability principles and practices every day and mentor technologists within the organization
- Troubleshoot priority and escalation incidents, facilitate blameless post-mortems, and ensure permanent closure of incidents and subsequent problem tasks
- Engage with development teams throughout their SDLC and evangelize building software for reliability and scale, ensuring minimal refactoring or changes
- Identify application patterns and analytics in support of better service level objectives
- Design automated software and product upgrades, change management and release management solutions
- Work towards becoming an expert on the applications and platforms in your remit by understanding their interdependencies and limitations and driving the evolution and debugging of their critical components
Required qualifications, capabilities, and skills
- Formal training or certification on software engineering concepts and 5+ years of applied experience
- Experience working with a major public cloud provider (Amazon Web Services) and infrastructure as code (Terraform)
- Experience working in a hybrid deployment environment (on-premises and public cloud)
- Advanced understanding of site reliability culture and principles, with a track record of implementing site reliability within an application or platform and applying key SRE concepts such as SLOs and error budgets
- Advanced knowledge of and experience with observability capabilities across applications (metrics, tracing, SLOs), alerting and telemetry collection, and the ability to design critical and golden signal monitoring (Datadog)
- Strong communication skills and a desire to mentor and educate others on site reliability engineering principles and practices
Preferred qualifications, capabilities, and skills
- Experience defining non-functional standards and blueprints related to supportability, such as logging, alerting and resiliency patterns
- Working knowledge of infrastructure components (e.g. routers, load balancers, cloud products, container systems, compute, storage, and networks)
- Ability to partner with and influence architecture teams in defining non-functional application supportability standards
- AWS Cloud Certification, Linux Foundation CKA/CKAD, Terraform Associate and other relevant certifications are a plus