Bangalore
1 day ago
Site Reliability Engineer, AWS - Python, Terraform, DataDog

We are seeking a seasoned Site Reliability Engineer (SRE) with a deep understanding of observability solutions, particularly DataDog, to help drive performance, reliability, and efficiency across our platforms. The ideal candidate is a hands-on engineer experienced in log migration, tenant consolidation, tool rationalization, and FinOps for observability platforms, with a strong grasp of full-stack development and automation.

Key Responsibilities

Observability & Monitoring

Design, implement, and scale full-stack observability solutions using DataDog across multi-cloud environments (AWS, Azure).

Lead efforts in tenant consolidation and rationalization of existing observability tools and platforms.

Migrate logs from legacy systems to DataDog, ensuring integrity, traceability, and security.

Build and optimize telemetry pipelines for metrics, logs, and traces.

Automation & Engineering

Create and maintain automation scripts for operational and monitoring tasks using Python, Bash, PowerShell, Jenkins DSL, Groovy, Ansible, and Terraform.

Automate repetitive tasks and improve system reliability and performance through smart tooling.

Collaboration & Governance

Work closely with development and infrastructure teams to embed observability best practices in the CI/CD lifecycle.

Establish and enforce governance and FinOps principles for observability tools.

Recommend and implement tool consolidation strategies to streamline operations and reduce cost.

Reliability Engineering

Design, build, and maintain highly available, scalable, and resilient systems.

Participate in on-call rotations and manage system support during maintenance windows.

Required Experience & Skills

Professional Experience

6+ years in SRE or software engineering roles, preferably managing production-grade systems at scale.

Minimum 3 years of hands-on experience with DataDog or equivalent observability platforms (SaaS implementations preferred).

Technical Skills

Proficient in at least one modern programming language (e.g., Python, Java, Go).

Deep expertise in DataDog, including dashboarding, alerting, and integrations.

Strong understanding of infrastructure as code using Terraform or similar tools.

Experience with multi-cloud architectures (AWS, Azure) and telemetry systems.

Familiarity with full-stack development (backend, frontend, APIs, services).

Automation Tools

Scripting and automation experience with Python, Bash, PowerShell, Ansible, and CI/CD pipelines (Jenkins, GitLab).
