Senior Site Reliability Engineer
@ NVIDIA

Hybrid
$271,000
Full Time
Posted 1 day ago

Job Details

About the Role

NVIDIA has been transforming computer graphics, PC gaming, and accelerated computing for more than 25 years. Today, leveraging AI to define the next era of computing, NVIDIA is driving the future of technology.

The DGX Cloud team is seeking a Senior Site Reliability Engineer to maintain high-performance DGX Cloud clusters for AI researchers and enterprise clients across multiple cloud platforms. The role involves supporting and optimizing large-scale Kubernetes clusters, managing capacity, and improving operational efficiency.

Responsibilities

  • Support large-scale Kubernetes services from creation to launch.
  • Build, implement, and support operational and reliability aspects of Kubernetes clusters.
  • Define SLOs/SLIs and monitor error budgets with streamlined reporting.
  • Measure and monitor service availability, latency, and system health post-launch.
  • Operate GPU workloads across AWS, GCP, Azure, OCI, and private clouds.
  • Scale systems using automation and propose improvements for reliability.
  • Lead triage and root-cause analysis of high-severity incidents with balanced incident response.
  • Participate in on-call rotations to support production services.

Requirements

  • BS in Computer Science or related field, or equivalent experience.
  • 12+ years of production service operations experience.
  • Expert-level knowledge of Kubernetes, containerization, and microservices.
  • Experience with infrastructure automation tools such as Terraform, Ansible, Chef, or Puppet.
  • Proficiency in Python or Go.
  • In-depth understanding of Linux, TCP/IP networking, and cloud security standards.
  • Strong troubleshooting skills in DNS, network, Kubernetes, and systems issues.
  • Knowledge of SRE principles, including SLOs, SLIs, error budgets, and incident handling.
  • Experience with observability stacks (e.g., Prometheus, Grafana, ELK, Datadog).

Stand Out Factors

  • Experience with GPU-accelerated clusters using KubeVirt.
  • Application of generative-AI techniques to reduce operational toil.
  • Skills in automating incident response with Shoreline or StackStorm.
  • Expertise in GPU workload orchestration and large-scale resource management.

Additional Information

Competitive salary, equity, and benefits are offered. Applications accepted until September 6, 2025. NVIDIA is an equal opportunity employer committed to diversity.

Key Skills/Competencies

  • Kubernetes
  • SRE
  • GPU Workloads
  • Automation
  • Observability
  • Linux
  • Networking
  • Cloud Platforms
  • Terraform
  • Incident Management

How to Get Hired at NVIDIA

🎯 Tips for Getting Hired

  • Customize your resume: Highlight relevant Kubernetes and cloud experience.
  • Showcase SRE expertise: Detail your incident management and system monitoring skills.
  • Emphasize automation skills: Include Terraform and scripting projects.
  • Research NVIDIA culture: Learn about AI innovations and team environment.

📝 Interview Preparation Advice

Technical Preparation

  • Review Kubernetes cluster management.
  • Practice infrastructure automation using Terraform.
  • Study GPU workload orchestration methods.
  • Familiarize yourself with cloud observability tools.

Behavioral Questions

  • Describe handling high-severity incidents.
  • Explain past experience in incident resolution.
  • Discuss collaboration during on-call rotations.
  • Detail a postmortem process example.
