Santa Clara, CA, USA
3 days ago
Software Developer 5

OCI is driving development of next-generation hyperscale GPU data centers built on Nvidia and AMD GPUs. OCI enables popular AI services such as OpenAI on GPU compute servers. We are looking for engineers experienced in working with GPU device drivers and the runtime libraries (CUDA and ROCm). You must understand GPU architectural concepts such as UVM and host-to-device and device-to-host interactions, and be able to quantify performance issues in all such interactions. We are looking for strong experience in building GPU drivers and the Linux kernels that interact with the GPU stack, and in debugging the functional and performance issues that occur in them when running GPU AI/ML/inference workloads. The candidate should be able to use all standard performance and stress-testing tools, such as DCGM and the NCCL and RCCL suites. In addition, we are looking for experience debugging and diagnosing system issues reported via RAS events raised by the GPU BMC and other monitoring agents. The candidate should have breadth of knowledge in BIOS, CPU, and GPU BMC, and must show strong proficiency in C programming and working knowledge of Python or another scripting language used in AI/GPU environments.
