We are looking for a Senior Software Engineer with deep expertise in High-Performance Computing (HPC) and Artificial Intelligence (AI) networking performance, particularly across InfiniBand-based GPU clusters. You will be a key technical leader focused on understanding, analyzing, and optimizing the performance of distributed workloads running at massive scale, often involving tens of thousands of GPUs interconnected via high-speed networks.
This role requires strong familiarity with Message Passing Interface (MPI), NVIDIA Collective Communications Library (NCCL), collective communication algorithms, and the underlying transport technologies, specifically Remote Direct Memory Access (RDMA) over InfiniBand. You should have extensive experience with network-level debugging, topology-aware optimization, and low-latency, high-throughput communication tuning in Linux environments. If you enjoy solving hard problems at the convergence of network systems and distributed applications, we want to talk to you.
Microsoft’s mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.