Site Reliability Engineer
WESCO
The Site Reliability Engineer will be responsible for ensuring the availability, reliability, and performance of our customer-facing software applications. This role combines planning, engineering, monitoring, incident response, and administration to create highly scalable and fault-tolerant systems.
**Responsibilities:**
+ Ensure the high availability and reliability of the production environment by monitoring system health and performance
+ Provide primary operational support for large-scale distributed software applications
+ Facilitate incident resolution via triage, communication, engagement, escalation, and documentation
+ Partner with platform administration (both internal and external) to define and achieve stability and scalability objectives
+ Collaborate with technical and quality teams to improve services by identifying areas of risk and helping to define and proactively implement solutions
+ Drive continual improvement in system performance by setting service level objectives in collaboration with a performance center of practice and/or product development teams
+ Participate in system design, capacity planning, and platform management
+ Analyze and publish metrics from operating systems and applications to assist in performance tuning and fault finding
+ Pursue opportunities for automation and process improvements
**Qualifications:**
+ Bachelor’s degree in information technology (or demonstrable equivalent work experience)
+ Experience providing first-level incident response and troubleshooting with technical teams to resolve end-user issues
+ Proficiency with enterprise system monitoring software (examples: New Relic, Nagios, SolarWinds, Dynatrace, Datadog, Azure Monitor, Splunk)
+ Experience with cloud-based infrastructure, databases, and applications
+ Experience with performance tuning and fault finding in large-scale distributed systems
+ Experience designing, implementing, and managing performance testing practices, including the supporting tools and frameworks
+ Knowledge of disaster recovery planning and execution
+ Ability to effectively work in a highly matrixed organization
+ Excellent verbal and written communication skills
+ Strong understanding of coding, automation, and engineering principles to build resilient, self-healing systems
+ Familiarity with DevOps practices and tools
+ Experience with Jira (or an equivalent work management tool)
+ Experience with Confluence (or an equivalent knowledge management tool)