Sunnyvale, CA, US
Sr Software Dev Engineer, Edge AI ML Platform (Level 6), Edge AI
Are you passionate about building infrastructure that trains the next generation of large language models for edge devices? Join our Edge AI team at Amazon Devices (Lab126) where you'll architect and implement distributed training systems that scale to hundreds of billions of parameters. Your work will enable novel distillation and compression techniques that transform these massive models into efficient versions that run on constrained edge devices.

- Lead the development of our distributed training platform for large language models up to 400B parameters
- Design high-performance training systems that produce models optimized for edge deployment
- Collaborate with ML scientists to create compression pipelines that maintain model quality while reducing size
- Drive innovation in both large-scale training and edge-optimized model deployment

Key job responsibilities
- Architect and implement distributed training systems that efficiently scale across hundreds or thousands of GPUs
- Design and optimize data parallelism, tensor parallelism, and pipeline parallelism strategies for large language models
- Implement memory optimization techniques like activation recomputation, ZeRO, and mixed precision training
- Develop infrastructure that supports novel distillation and compression techniques for edge deployment
- Create evaluation frameworks to measure performance of compressed models on target edge hardware
- Collaborate with ML scientists to optimize training for downstream compression requirements
- Benchmark and profile training configurations to maximize throughput and GPU utilization
- Build pipelines that connect large-scale training to edge model deployment workflows
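To give a flavor of the data-parallelism work described above: in data-parallel training, each worker computes gradients on its own shard of a batch, and the gradients are averaged across workers (the effect of an all-reduce) before a shared parameter update. The following is a minimal pure-Python sketch of that idea on a toy one-parameter model; the model, worker count, and function names are illustrative, not part of any Amazon system.

```python
def local_gradient(w, shard):
    # Gradient of mean squared error for a toy 1-parameter linear model
    # y = w * x, averaged over this worker's shard of (x, target) pairs.
    n = len(shard)
    return sum(2 * (w * x - t) * x for x, t in shard) / n

def all_reduce_mean(grads):
    # Stand-in for the collective all-reduce that averages gradients
    # across data-parallel workers.
    return sum(grads) / len(grads)

def train_step(w, shards, lr=0.01):
    # Each worker computes a local gradient; the averaged gradient
    # drives one synchronized parameter update.
    grads = [local_gradient(w, s) for s in shards]
    g = all_reduce_mean(grads)
    return w - lr * g

# Batch of (x, target) pairs for the target function y = 3x,
# split across two hypothetical workers.
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0), (4.0, 12.0)]
shards = [data[:2], data[2:]]
w = 0.0
for _ in range(200):
    w = train_step(w, shards)
print(round(w, 3))  # converges to 3.0
```

In a real system the all-reduce runs over NCCL across nodes, and tensor and pipeline parallelism further shard the model itself rather than the data.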

A day in the life
You'll start your day analyzing performance metrics from overnight training runs, identifying bottlenecks that are limiting throughput on our GPU clusters. After a quick stand-up with the team, you might pair with an ML scientist to implement a new parallelism strategy that reduces memory usage while maintaining computational efficiency.

In the afternoon, you could collaborate with the model compression team to ensure your training infrastructure produces checkpoints optimized for their distillation pipeline. You might debug a communication issue causing training instability across nodes, then optimize a custom CUDA kernel to improve attention computation speed.
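For context on the attention computation mentioned above, here is the naive form of scaled dot-product attention, softmax(QKᵀ/√d)V, written in pure Python. A custom CUDA kernel would fuse and accelerate exactly this arithmetic; the code below is a readable sketch, not production math.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    # Naive scaled dot-product attention: for each query row,
    # score against every key, normalize, and mix the value rows.
    d = len(Q[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out
```

Each output row is a convex combination of the value rows, weighted by query-key similarity, which is why memory traffic over the score matrix dominates and makes kernel fusion worthwhile.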

Your work bridges the gap between massive-scale model training and efficient edge deployment, enabling AI capabilities that would otherwise be impossible on resource-constrained devices. By optimizing the training infrastructure, you directly impact how quickly we can iterate on new models and compression techniques, accelerating our path to delivering AI features to millions of Amazon devices.

About the team
The Edge AI team at Lab126 is responsible for developing the next generation of AI capabilities for Amazon devices. We're a diverse group of engineers and scientists working at the intersection of machine learning, distributed systems, and hardware optimization. Our mission is to bring powerful AI capabilities to Amazon devices while maintaining privacy, reducing latency, and optimizing for resource constraints.

We tackle the full AI pipeline - from training massive models at scale to compressing and distilling them for efficient edge deployment. This end-to-end approach allows us to optimize each stage of the process specifically for our target devices, achieving capabilities that would be impossible with off-the-shelf solutions.

Our team culture values deep technical expertise combined with practical problem-solving. We embrace challenges that others might consider impossible, and we're not afraid to question conventional approaches when better solutions exist. We work in a collaborative environment where ideas are valued regardless of title, and we take pride in building systems that scale efficiently from research to production.