Senior Site Reliability Engineer @ NVIDIA
Hybrid
$271,000
Full Time
Posted 1 day ago
Job Details
About the Role
NVIDIA has been transforming computer graphics, PC gaming, and accelerated computing for more than 25 years. Today, leveraging AI to define the next era of computing, NVIDIA is driving the future of technology.
The DGX Cloud team is seeking a Senior Site Reliability Engineer to maintain high-performance DGX Cloud clusters for AI researchers and enterprise clients across multiple cloud platforms. This role involves supporting and optimizing large-scale Kubernetes clusters, capacity management, and operational efficiency.
Responsibilities
- Support large-scale Kubernetes services from creation to launch.
- Build, implement, and support operational and reliability aspects of Kubernetes clusters.
- Define SLOs/SLIs and monitor error budgets with streamlined reporting.
- Measure and monitor service availability, latency, and system health post-launch.
- Operate GPU workloads across AWS, GCP, Azure, OCI, and private clouds.
- Scale systems using automation and propose improvements for reliability.
- Lead triage and root-cause analysis of high-severity incidents with balanced incident response.
- Participate in on-call rotations for production services support.
Requirements
- BS in Computer Science or related field, or equivalent experience.
- 12+ years of production service operations experience.
- Expert-level knowledge of Kubernetes, containerization, and microservices.
- Experience with infrastructure automation tools like Terraform, Ansible, Chef, or Puppet.
- Proficiency in Python or Go.
- In-depth understanding of Linux, TCP/IP networking, and cloud security standards.
- Strong troubleshooting skills in DNS, network, Kubernetes, and systems issues.
- Knowledge of SRE principles, including SLOs, SLIs, error budgets, and incident handling.
- Experience with observability stacks (e.g., Prometheus, Grafana, ELK, Datadog).
Stand Out Factors
- Experience with GPU-accelerated clusters using KubeVirt.
- Application of generative-AI techniques to reduce operational toil.
- Skills in automating incident response with Shoreline or StackStorm.
- Expertise in GPU workload orchestration and large-scale resource management.
Additional Information
Competitive salary, equity, and benefits are offered. Applications accepted until September 6, 2025. NVIDIA is an equal opportunity employer committed to diversity.
Key skills/competencies
- Kubernetes
- SRE
- GPU Workloads
- Automation
- Observability
- Linux
- Networking
- Cloud Platforms
- Terraform
- Incident Management
How to Get Hired at NVIDIA
🎯 Tips for Getting Hired
- Customize your resume: Highlight relevant Kubernetes and cloud experience.
- Showcase SRE expertise: Detail incident and system monitoring skills.
- Emphasize automation skills: Include Terraform and scripting projects.
- Research NVIDIA culture: Learn about AI innovations and team environment.
📝 Interview Preparation Advice
Technical Preparation
- Review Kubernetes cluster management.
- Practice infrastructure automation using Terraform.
- Study GPU workload orchestration methods.
- Familiarize yourself with cloud observability tools.
Behavioral Questions
- Describe handling high-severity incidents.
- Explain past experience in incident resolution.
- Discuss collaboration during on-call rotations.
- Detail a postmortem process example.
Frequently Asked Questions
What experience is needed for the Senior Site Reliability Engineer role at NVIDIA?
Does NVIDIA require specific programming skills for this role?
How important is experience with GPU workloads at NVIDIA?
What qualifications make a candidate stand out for this role at NVIDIA?
How does the on-call rotation work for the Senior Site Reliability Engineer at NVIDIA?