Tenable AI adds a powerful new layer of visibility, context, and control to the Tenable One Exposure Management Platform, helping organizations govern usage, enforce policy, and control exposure across both the AI they use and the AI they build.
Your Role
We’re looking for an experienced MLOps / DevOps Engineer to design and manage the infrastructure powering large-scale machine learning systems. You’ll be responsible for deploying GPU-heavy models (including LLMs) on cost-efficient, production-grade infrastructure, supporting both ML workflows and application artifact delivery.
You’ll work with cutting-edge technologies like vLLM, Triton, SageMaker, ClearML, Karpenter, KEDA, and EKS, balancing performance, scalability, and cost.
What You’ll Do
Deploy and manage LLMs and deep learning models using vLLM, Triton Inference Server, and custom API endpoints.
Build and maintain GPU-aware autoscaling clusters using AWS EKS, Karpenter, and KEDA, optimizing for cost-efficiency and performance.
Develop CI/CD pipelines using Jenkins and GitHub Actions to automate ML model delivery and application deployments.
Orchestrate training, fine-tuning, and inference jobs on AWS SageMaker and ClearML, with support for experiment tracking, versioning, and reproducibility.
Support backend teams in deploying app artifacts and runtime environments; implement rollback and release strategies.
Integrate observability tooling (e.g., Prometheus, Grafana, ELK, or OpenTelemetry) to monitor both infrastructure health and model performance.
Collaborate with SREs to enforce high availability, disaster recovery, and incident response procedures for mission-critical AI services.
What You’ll Need
6+ years of experience in DevOps, MLOps, or infrastructure roles with a focus on ML model delivery.
Proven hands-on experience deploying GPU-based models (LLMs, vision models, transformers) using vLLM or Triton.
Deep knowledge of AWS EKS and Kubernetes, with practical experience configuring Karpenter and KEDA for auto-scaling GPU workloads.
Experience building pipelines with Jenkins and GitHub Actions, and managing releases for both ML and application codebases.
Familiarity with AWS SageMaker, ClearML, or similar platforms for ML orchestration and experimentation.
Strong scripting and automation skills in Python and Bash, plus working knowledge of containerization (Docker).
Solid grasp of networking, IAM, and cloud security fundamentals.
Infrastructure-as-code experience using Terraform or Pulumi in production environments.