San Francisco, CA
Senior Software Engineer, ML Training Platform
About the Team

DoorDash is building the world’s most reliable on-demand logistics engine. Behind the scenes, our Machine Learning Platform (MLP) powers critical real-time decision-making for millions of orders each day, supporting business-critical use cases like Ads, Groceries, Logistics, Fraud, and Search.

About the Role

As a Senior Software Engineer on the team, you will take ownership of major projects within our ML Training Platform, creating reliable, extensible solutions for data transformations, distributed model training, and rapid experimentation in production. You’ll collaborate closely with ML Engineers, Platform & Infra engineers, and partner teams to ensure our platform supports high-volume, GPU-accelerated training in a fast-evolving environment.

This is a hybrid opportunity in San Francisco, Sunnyvale, or Seattle.

You’re excited about this opportunity because you will…

Drive Key Training Initiatives – Own and deliver significant sub-projects that enhance our platform’s performance, reliability, and ease of use.
Architect & Implement Scalable Solutions – Design resilient pipelines for distributed model training (e.g., PyTorch, LightGBM) on Kubernetes, optimizing for both short-term speed and long-term maintainability.
Collaborate with Cross-Functional Teams – Work with ML engineers, Data Scientists, and product stakeholders to refine requirements, set realistic milestones, and ensure smooth delivery.
Set a High Bar for Quality & Reliability – Lead by example with clean, high-performance code, thorough design reviews, and a focus on observability, incident mitigation, and continuous improvement.
Mentor & Influence – Help level up peers by sharing knowledge, driving best practices, and contributing to a supportive team culture that values empathy and technical excellence.

We’re excited about you because…

6+ years of industry experience in software engineering, with a deep understanding of distributed systems and data-intensive ML pipelines in production.
Hands-On ML Platform/Infra Experience – You’re familiar with modern machine learning stacks (e.g., PyTorch, LightGBM, TensorFlow) and have built or maintained large-scale training environments.
Strong CS fundamentals – You excel at crafting solutions that handle scale, complexity, and reliability challenges.
Proven Project Ownership – You can break down complex initiatives, estimate accurately, and deliver major projects with minimal oversight.
Collaboration & Communication – You’re adept at partnering across functions, setting expectations, and ensuring alignment among diverse stakeholders.
Thrive on Continuous Improvement – You proactively identify gaps, reduce technical debt, and optimize resource usage, balancing cost and performance.

Nice To Haves

GPU Acceleration – Experience with GPU-enabled training and its associated performance optimizations.
MLOps Tooling – Familiarity with orchestration and tracking frameworks such as Metaflow, MLflow, Dagster, or Airflow.
Large-Scale Data Processing – Knowledge of Spark, Hadoop, or other distributed data processing technologies.
Monitoring & Observability – Proficiency with metrics and alerting solutions (e.g., Prometheus, Grafana).
Cloud Platforms – Experience with AWS or GCP for scalable compute, container orchestration, and cost management.

 

Notice to Applicants for Jobs Located in NYC or Remote Jobs Associated With Office in NYC Only

We use Covey as part of our hiring and/or promotional process for jobs in NYC, and certain features may qualify it as an AEDT in NYC. As part of the hiring and/or promotion process, we provide Covey with job requirements and candidate-submitted applications. We used Covey Scout for Inbound from August 21, 2023, through December 21, 2023, and resumed using it on June 29, 2024.

The Covey tool has been reviewed by an independent auditor. Results of the audit may be viewed here: Covey
