Senior Site Reliability Engineer
@ NVIDIA

Hybrid

$200,000

Full Time

Posted 22 days ago

Job Details

About NVIDIA

NVIDIA is widely considered one of the technology world’s most desirable employers, with a legacy of transforming computer graphics, PC gaming, and accelerated computing for over 30 years. Today, NVIDIA is driving the next era of computing by applying AI to innovations in robotics, self-driving cars, and more.

Role Overview: Senior Site Reliability Engineer

As a Senior Site Reliability Engineer, you will be a key member of the AI Infrastructure Production engineering team, responsible for developing and maintaining large-scale systems that support critical AI Infrastructure use cases. You will drive reliability, operability, and scalability on global public and private clouds.

Key Responsibilities

Develop and maintain large-scale systems for AI Infrastructure.
Implement SRE fundamentals including incident management and performance optimization.
Build automation tools to enhance observability and reduce manual overhead.
Establish frameworks for operational maturity and lead incident response protocols.
Mentor peers and collaborate with diverse engineering teams.

Qualifications

Minimum of 12 years in software development, SRE, or production engineering, with a degree in Computer Science or a related field (or equivalent experience).
Proficiency in Python and one other language (C/C++, Go, Perl, or Ruby) is required.
Solid expertise in Linux or Windows systems engineering and in cloud platforms including AWS, OCI, Azure, or GCP.
Understanding of SRE principles such as error budgets, SLOs, and SLAs is required.
Hands-on experience with Infrastructure as Code tools like Terraform CDK.
Familiarity with observability platforms (ELK, Prometheus, Loki) and CI/CD systems (GitLab) is essential.

Preferred Experience

Involvement in AI training, inferencing, and data infrastructure services.
Proficiency with deep learning frameworks like PyTorch, TensorFlow, JAX, and Ray.
Background in hardware health monitoring and system reliability.
Experience scaling distributed systems with strict SLAs.
Expertise in incident, change, and problem management.

Key Skills and Competencies

AI Infrastructure, SRE, Scalability, Automation, Observability, Cloud, Python, Linux, Terraform, CI/CD

How to Get Hired at NVIDIA

🎯 Tips for Getting Hired

Research NVIDIA's culture: Understand their mission and innovation legacy.
Customize your resume: Highlight SRE and AI infrastructure expertise.
Prepare for technical interviews: Focus on cloud and coding challenges.
Showcase leadership: Demonstrate mentoring and incident management skills.

📝 Interview Preparation Advice

Technical Preparation

Review cloud platform architectures.
Practice Python and a secondary language.
Study Infrastructure as Code tools.
Revisit observability and CI/CD setups.

Behavioral Questions

Describe a challenging incident response.
Explain teamwork in high-pressure situations.
Share a time you improved system efficiency.
Detail an experience mentoring peers.
