Job Description
Manager - Site Reliability Engineer (SRE) – Reliability & Automation
The Opportunity
Based in Hyderabad, join a global healthcare biopharma company and be part of a 130-year legacy of success backed by ethical integrity, forward momentum, and an inspiring mission to achieve new milestones in global healthcare.
Be part of an organisation driven by digital technology and data-backed approaches that support a diversified portfolio of prescription medicines, vaccines, and animal health products.
Drive innovation and execution excellence. Be part of a team that is passionate about using data, analytics, and insights to drive decision-making, and that creates custom software to help us tackle some of the world's greatest health threats.
Our Technology Centres focus on creating a space where teams can come together to deliver business solutions that save and improve lives. An integral part of our company’s IT operating model, Tech Centres are globally distributed locations where each IT division has employees to enable our digital transformation journey and drive business outcomes. These locations, in addition to the other sites, are essential to supporting our business and strategy.
A focused group of leaders in each Tech Centre helps ensure we can manage and improve each location, from investing in the growth, success, and well-being of our people, to making sure colleagues from each IT division feel a sense of belonging, to managing critical emergencies. Together, we must leverage the strength of our team to collaborate globally, optimize connections, and share best practices across the Tech Centres.
Role Overview:
We are looking for a dedicated Site Reliability Engineer (SRE) to ensure the reliability, scalability and operational excellence of our data applications hosted on AWS and traditional datacentres. You will own release management, automate infrastructure and deployment processes using Python, implement observability solutions and enforce compliance standards to maintain robust and highly available systems.
What you will do in this role:
Reliability Engineering: Design, implement and maintain systems that ensure high availability, fault tolerance and scalability of data applications across cloud (AWS) and on-premises environments.
Release & Deployment Management: Manage and automate release pipelines, coordinate deployments and ensure smooth rollouts with minimal downtime.
DevOps Automation: Develop and maintain automation scripts and tools (primarily in Python) to streamline infrastructure provisioning, configuration management and operational workflows.
Observability & Monitoring: Build and enhance monitoring, logging and alerting systems to proactively detect and resolve system issues, ensuring optimal performance and uptime.
Ensure reliability and scalability of ETL pipelines, including orchestration, scheduling, and dependency management.
Automate deployment, rollback, and version control of ETL jobs and workflows.
Implement monitoring and alerting for ETL job success, latency, and data quality metrics.
Collaborate with data engineering teams to troubleshoot ETL failures and optimize pipeline performance.
Maintain documentation and compliance related to ETL data lineage and processing.
Incident Management & Root Cause Analysis: Lead incident response efforts, perform thorough root cause analysis and implement preventive measures to avoid recurrence.
Compliance & Security: Ensure systems comply with organizational policies and regulatory requirements, including data governance, security best practices and audit readiness.
Collaboration: Work closely with development, data engineering and operations teams to align reliability goals with business objectives and support continuous improvement.
Documentation & Knowledge Sharing: Maintain clear documentation of infrastructure, processes and incident reports to facilitate team knowledge and operational consistency.
Monitoring & Troubleshooting: Implement and maintain monitoring and alerting for database health, query performance, and resource utilization; lead troubleshooting and root cause analysis for database-related incidents.
What you should have:
Bachelor’s degree in Computer Science, Engineering, Information Technology, or a related field.
4+ years of experience in Site Reliability Engineering, DevOps, or related roles focused on infrastructure reliability and automation.
Strong proficiency in Python for automation and scripting.
Experience with ETL orchestration tools such as Apache Airflow, AWS Glue, or similar.
Understanding of data pipeline architectures and common failure modes.
Familiarity with data quality and lineage concepts in ETL processes.
Hands-on experience with AWS cloud services (IAM, S3, Lambda, CloudWatch, Airflow, etc.) and traditional datacenter environments.
Expertise in release management and CI/CD pipelines using tools such as Jenkins, GitLab CI, or similar.
Deep knowledge of observability tools and frameworks (e.g., Prometheus, Grafana, ELK stack, Datadog).
Solid understanding of infrastructure as code (IaC) tools such as Terraform, CloudFormation, or Ansible.
Experience with container orchestration platforms (e.g., Kubernetes, Docker Swarm) is a plus.
Strong incident management skills with a focus on root cause analysis and remediation.
Familiarity with compliance frameworks and security best practices in cloud and on-premises environments.
Excellent communication skills to collaborate effectively across technical and non-technical teams.
Preferred Qualifications
Advanced degree in a relevant technical field.
Certifications such as AWS Certified DevOps Engineer, ITIL v3/v4, or similar.
Experience working in Agile Scrum or Kanban environments.
Knowledge of security and database administration in cloud and hybrid environments.
Why Join Us?
Play a critical role in ensuring the reliability and scalability of mission-critical data applications.
Work with cutting-edge cloud and on-premises technologies.
Collaborate with a passionate team focused on operational excellence.
Opportunities for professional growth and continuous learning.
Our technology teams operate as business partners, proposing ideas and innovative solutions that enable new organizational capabilities. We collaborate internationally to deliver services and solutions that help everyone be more productive and enable innovation.
Who we are
We are known as Merck & Co., Inc., Rahway, New Jersey, USA in the United States and Canada and MSD everywhere else. For more than a century, we have been inventing for life, bringing forward medicines and vaccines for many of the world's most challenging diseases. Today, our company continues to be at the forefront of research to deliver innovative health solutions and advance the prevention and treatment of diseases that threaten people and animals around the world.
What we look for
Imagine getting up in the morning for a job as important as helping to save and improve lives around the world. Here, you have that opportunity. You can put your empathy, creativity, digital mastery, or scientific genius to work in collaboration with a diverse group of colleagues who pursue and bring hope to countless people who are battling some of the most challenging diseases of our time. Our team is constantly evolving, so if you are among the intellectually curious, join us—and start making your impact today.
#HYD IT 2025
Search Firm Representatives Please Read Carefully
Merck & Co., Inc., Rahway, NJ, USA, also known as Merck Sharp & Dohme LLC, Rahway, NJ, USA, does not accept unsolicited assistance from search firms for employment opportunities. All CVs / resumes submitted by search firms to any employee at our company without a valid written search agreement in place for this position will be deemed the sole property of our company. No fee will be paid in the event a candidate is hired by our company as a result of an agency referral where no pre-existing agreement is in place. Where agency agreements are in place, introductions are position specific. Please, no phone calls or emails.
Employee Status: Regular
Relocation:
VISA Sponsorship:
Travel Requirements:
Flexible Work Arrangements: Hybrid
Shift:
Valid Driving License:
Hazardous Material(s):
Required Skills: Availability Management, Capacity Management, Change Controls, Configuration Management (CM), Design Applications, Incident Management, Information Technology (IT) Infrastructure, IT Service Management (ITSM), Software Configurations, Software Development, Software Development Life Cycle (SDLC), Solution Architecture, System Administration, System Designs
Preferred Skills:
Job Posting End Date: 09/15/2025
*A job posting is effective until 11:59:59 PM on the day BEFORE the listed job posting end date. Please ensure you apply to a job posting no later than the day BEFORE the job posting end date.
Requisition ID: R359245