The Oracle Cloud Infrastructure (OCI) Compute team is responsible for providing bare metal and virtual machines, on both CPUs and GPUs, at scale to our customers. With rapid growth in machine learning, the demand for GPUs and CPUs is exploding, making the performance and efficiency of cloud-scale services a critical area of investment.
The Core Architecture team focuses on identifying performance and efficiency constraints across the entire lifecycle of compute services, from inventory management and capacity ingestion through placement, repair, and decommissioning. Consulting engineers are responsible for performing deep analysis of business problems and for proposing and incubating new automated solutions that address the needs of some of our largest customers.
You will take the lead in defining the architecture for the brand-new host lifecycle management capabilities that will power the next generation of the Compute Control Plane. This initiative spans multiple Compute domains, from GPU validation to repairs, and you will drive engineers from these organizations to build cohesive microservice-based solutions that enable Compute to scale with growing customer demand.
We are looking for a hands-on senior engineer with technical breadth, proven experience in solving cloud-scale problems, and distributed systems design and implementation experience, who can build fault-tolerant solutions that will form the foundations of the next generation of Compute offerings. The candidate is expected to have strong written and verbal communication skills, the ability to lead projects across organizational boundaries, and experience representing their work to senior leaders.
Qualifications:
BS or MS degree in Computer Science/Engineering or a related IT field or equivalent experience relevant to functional area.
10+ years of development experience with large scale, highly available distributed systems
Proficiency with Cloud-based Data Store primitives
Proficiency in Java programming patterns
Experience with operating distributed services at scale
Expertise in Linux and operating systems
Systematic problem-solving approach, strong communication skills, strong ownership and drive
Deep understanding of service metrics and alarms through the development of dashboards, service KPIs, alarming systems
Propose, scope, design and direct automation, optimizations, and enhancements
Mentor junior engineers
Preferred Qualifications:
Experience in management and automation of end-to-end CPU/GPU lifecycles at scale
Proficiency with Cloud and CI/CD environments
Proficiency with Terraform, Docker
Proficiency with modern build tools and pipelines
Proficiency building multi-tenant, virtualized infrastructure
Proficiency with change control management and mature operating processes
Proficiency with Security including Identity, SSL and certificates
Proficiency with Database and Data Stores
Career level - IC5