Senior Site Reliability Engineer - AWS - Jersey City, New Jersey

Jersey City, New Jersey

3 days ago

Bank of America

Job Description:

At Bank of America, we are guided by a common purpose to help make financial lives better through the power of every connection. We do this by driving Responsible Growth and delivering for our clients, teammates, communities and shareholders every day.

Being a Great Place to Work is core to how we drive Responsible Growth. This includes our commitment to being an inclusive workplace, attracting and developing exceptional talent, supporting our teammates’ physical, emotional, and financial wellness, recognizing and rewarding performance, and how we make an impact in the communities we serve.

Bank of America is committed to an in-office culture with specific requirements for office-based attendance, while allowing an appropriate level of flexibility for our teammates and businesses based on role-specific considerations.

At Bank of America, you can build a successful career with opportunities to learn, grow, and make an impact. Join us!

About Us

Bank of America offers career growth opportunities to learn, develop, and make a meaningful impact.

At Bank of America, we are guided by a common purpose to help make financial lives better through the power of every connection. Responsible Growth is how we run our business and deliver for our clients, teammates, communities, and shareholders every day.

One of the keys to driving Responsible Growth is being a great place to work for our teammates around the world. We're dedicated to creating an inclusive workplace where individuals with a broad range of backgrounds and experiences can thrive. We invest heavily in our teammates and their families by offering competitive benefits to support their physical, emotional, and financial well-being.

Bank of America believes both in the importance of working together and offering flexibility to our employees. We use a multi-faceted approach for flexibility, depending on the various roles in our organization. Join us!

The Enterprise Cloud Platforms Team

Bank of America's Enterprise Cloud Platform team designs, builds, and maintains PaaS environments which span multiple cloud service providers. We provide our customers with innovative platforms which allow for a faster time-to-market and reduced complexity.

As part of our team, you will participate in a high-quality, customer-focused engineering culture with an emphasis on security and resiliency. You will have multiple opportunities to make a large impact on the evolution of next-generation Cloud services for Bank of America and explore emerging technologies that benefit our global customers.

Position Summary

Our team seeks experienced Senior Site Reliability Engineers (SREs) to design, build, and maintain our next-generation AWS platform. This role provides the opportunity to work with a wide range of technologies and to integrate a variety of in-house and commercial services that must interact seamlessly with each other, giving you room to innovate and be creative. This job is responsible for partnering with leaders across engineering and technology to define objective reliability goals for services. Key responsibilities include composing observability designs through instrumentation and dashboards, identifying root causes of complex or high-impact issues, partnering with cross-functional teams to deliver sustainable design patterns, and driving early adoption of non-functional production support requirements. Job expectations include automating services to improve reliability and efficiency and influencing a culture of innovation and continuous improvement.

Responsibilities:

- Designs solutions to visualize key production support metrics enabling Operational Readiness and Site Reliability Engineer teams to identify scenarios requiring intervention
- Develops software solutions and/or improved processes to address work identified as ‘toil’ by collaborating with key partners to identify, track and remediate processes to free time allocated to reliability
- Partners with Development and Infrastructure teams to create error budget policies prioritizing reliability stories that fall below Service Level Objective (SLO) thresholds and suggests code optimizations, additional instrumentation and/or logging structures to gain service reliability visibility
- Identifies and plans for capacity bottlenecks, vulnerabilities and opportunities for reliability improvement, such as low level error rates and 'noise', and reduces manual support effort and/or improves system reliability
- Assesses monitoring for new changes with development partners and works with monitoring tools team to monitor dashboards and enhance application and system monitoring designs
- Engages as a subject matter expert in incident triage efforts, failure scenario modelling and works with the Problem Manager to diagnose root causes for complex/high impact incident/problem management investigations
- Collaborates with Development and Infrastructure teams to understand technical solutions and develop Service Level Indicators and SLOs to measure/improve the reliability of the services they support

As part of a growing team with colleagues that are fun, smart, hardworking, and driven, you'll be expected to:

- Collaborate with a diverse set of engineers, architects, and teams to design, develop, test, and implement secure, scalable, and highly available Infrastructure as Code, Observability as Code, and Operations as Code solutions for the bank's AWS Platform
- Design and implement highly scalable, automated continuous integration and continuous delivery pipelines
- Be responsible for all aspects of reliability, collaborating with technical experts, key stakeholders, and team members to resolve complex problems, owning issues through permanent resolution
- Have a deep understanding of SRE practices, service level indicators, and service level objectives, proactively defining them to prevent customer impact
- Analyze diverse data sets and create visualizations to drive platform improvements
- Implement infrastructure, configuration, and network as code for the applications and platforms in your remit
- Identify opportunities to eliminate toil and automate the triage of issues to improve overall operational stability
- Collaborate with others to identify, analyze, and resolve platform vulnerabilities
- Proactively promote the adoption of site reliability engineering best practices within the team and organization
- Participate in 24x7 on-call coverage following a follow-the-sun model and perform blameless Postmortems (RCAs) as needed

Required Skills

- 15 years of combined experience in SRE, software development, or infrastructure engineering (10 years with an advanced degree in Computer Science or related technical field)
- Significant, hands-on experience building and maintaining an enterprise AWS platform and its services using infrastructure as code, automation, and native services related to compute, storage, networking, security, and observability
- Extensive experience with monitoring tools such as Grafana, Prometheus, Splunk, or Dynatrace, as well as AWS native tools like CloudWatch, X-Ray, and CloudWatch Logs
- Significant proficiency with using Infrastructure as Code tools such as Terraform to build cloud infrastructure solutions, automating cloud deployments, and delivering artifacts like AMI and Container Images
- Proficient in implementing CI/CD pipelines with tools such as Git and Terraform, and familiarity with using a GitOps model
- Proficient in at least one programming language such as Python, Java/Spring Boot, or .NET
- Advanced knowledge of networking (firewalls, DNS, Load Balancing, Proxies, etc.)
- Wide understanding of Linux & Windows operating systems including shell scripting
- Excellent interpersonal, organizational and communication (written, verbal, and presentation) skills are a must

Desired Skills

- Strong experience working with a complex IAM infrastructure, including Active Directory, AWS IAM Identity Center, and PingFederate or other SSO solutions
- Substantial experience in implementing, monitoring, and maintaining a highly scalable and resilient Data Services platform on AWS (DynamoDB, AWS Glue Data Catalog, or Bedrock and other AWS AI services)
- Extensive experience with EC2, Storage solutions including S3
- Substantial familiarity with Amazon EKS, AWS Lambda and Step Functions, CloudFormation templates
- Understanding of cost management, inventory management, RBAC, and SIEM model
- A proven ability to work independently with minimal supervision and as part of a team with direct responsibilities and an ability to quickly prioritize and adapt to changes in project scope

Skills:

- Architecture
- Collaboration
- Innovative Thinking
- Result Orientation
- Solution Design
- Adaptability
- Analytical Thinking
- Influence
- Stakeholder Management
- Technical Strategy Development
- Automation
- DevOps Practices
- Production Support
- Project Management
- Risk Management

Shift:

1st shift (United States of America)

Hours Per Week:
