United States
19 hours ago
Site Reliability Developer 5

Join our team as a Senior Infrastructure SRE and play a critical role in ensuring the reliability, performance, and scalability of our next-generation GPU cloud. You will be responsible for operating and maintaining the world’s largest deployments of cutting-edge GPU hardware (H100, GB200 and future generations), supporting a diverse range of AI/ML workloads. This is a hands-on role requiring deep expertise in large-scale infrastructure, networking, and troubleshooting.

What You’ll Do:

Operate and maintain our GPU cloud infrastructure, ensuring high availability and optimal performance.
Design, implement, and maintain comprehensive monitoring systems for hardware health, network utilization, and workload performance.
Develop automation tools to streamline provisioning, configuration management, and incident response.
Troubleshoot complex issues in a distributed environment, working closely with internal teams to resolve incidents quickly and effectively.
Collaborate with network engineering teams to optimize InfiniBand/RoCE networks for high-performance AI workloads.
Manage and assist Level 1/2 Validation Engineers with troubleshooting complex issues.
Travel to deployment sites internationally to assist with initial setup, testing, and ongoing support.
Work closely with customers to tune their workloads and resolve performance bottlenecks.
Coordinate with manufacturing partners on hardware diagnostics, firmware updates, and RMA processes.

Impact:

You will be a key contributor to the success of Oracle’s AI cloud offering, directly impacting our customers' ability to innovate and solve challenging problems. Your work will be critical in ensuring the reliability, performance, and scalability of our GPU infrastructure, enabling us to deliver a world-class cloud experience.

We are looking for individuals who:

Are passionate about building and operating large-scale infrastructure.
Have a strong sense of ownership and are driven to solve complex problems.
Are excellent communicators and collaborators.
Thrive in a fast-paced, dynamic environment.
