Senior Site Reliability Engineer
@ NVIDIA

Hybrid
$200,000
Full Time
Posted 7 hours ago

Job Details

About NVIDIA

NVIDIA is widely considered one of the technology world’s most desirable employers with a legacy of transforming computer graphics, PC gaming, and accelerated computing for over 30 years. Today, NVIDIA is driving the next era of computing by leveraging AI to empower innovations in robotics, self-driving cars, and more.

Role Overview: Senior Site Reliability Engineer

As a Senior Site Reliability Engineer, you will be a key member of the AI Infrastructure Production engineering team, responsible for developing and maintaining large-scale systems that support critical AI Infrastructure use cases. You will drive reliability, operability, and scalability on global public and private clouds.

Key Responsibilities

  • Develop and maintain large-scale systems for AI Infrastructure.
  • Implement SRE fundamentals including incident management and performance optimization.
  • Build automation tools to enhance observability and reduce manual overhead.
  • Establish frameworks for operational maturity and lead incident response protocols.
  • Mentor peers and collaborate with diverse engineering teams.

Qualifications

  • A minimum of 12 years in software development, SRE, or production engineering, with a degree in Computer Science or a related field (or equivalent experience).
  • Proficiency in Python and one other language (C/C++, Go, Perl, or Ruby) is required.
  • Solid expertise in Linux or Windows systems engineering and in cloud platforms such as AWS, OCI, Azure, or GCP.
  • An understanding of SRE principles such as error budgets, SLOs, and SLAs is required.
  • Hands-on experience with Infrastructure as Code tools like Terraform CDK is required.
  • Familiarity with observability platforms (ELK, Prometheus, Loki) and CI/CD systems (GitLab) is essential.

Preferred Experience

  • Involvement in AI training, inferencing, and data infrastructure services.
  • Proficiency with deep learning frameworks like PyTorch, TensorFlow, JAX, and Ray.
  • Background in hardware health monitoring and system reliability.
  • Experience scaling distributed systems with strict SLAs.
  • Expertise in incident, change, and problem management.

Key Skills/Competencies

AI Infrastructure, SRE, Scalability, Automation, Observability, Cloud, Python, Linux, Terraform, CI/CD

How to Get Hired at NVIDIA

🎯 Tips for Getting Hired

  • Research NVIDIA's culture: Understand their mission and innovation legacy.
  • Customize your resume: Highlight SRE and AI infrastructure expertise.
  • Prepare for technical interviews: Focus on cloud and coding challenges.
  • Showcase leadership: Demonstrate mentoring and incident management skills.

📝 Interview Preparation Advice

Technical Preparation

  • Review cloud platform architectures.
  • Practice Python and a secondary language.
  • Study Infrastructure as Code tools.
  • Revisit observability and CI/CD setups.

Behavioral Questions

  • Describe a challenging incident response.
  • Explain teamwork in high-pressure situations.
  • Share a time you improved system efficiency.
  • Detail an experience mentoring peers.

Frequently Asked Questions