Senior Site Reliability Engineer @ NVIDIA
Job Details
About NVIDIA
NVIDIA is widely considered one of the technology world’s most desirable employers, with a legacy of more than 30 years of transforming computer graphics, PC gaming, and accelerated computing. Today, NVIDIA is driving the next era of computing by leveraging AI to empower innovations in robotics, self-driving cars, and more.
Role Overview: Senior Site Reliability Engineer
As a Senior Site Reliability Engineer, you will be a key member of the AI Infrastructure Production Engineering team, responsible for developing and maintaining large-scale systems that support critical AI Infrastructure use cases. You will drive reliability, operability, and scalability across global public and private clouds.
Key Responsibilities
- Develop and maintain large-scale systems for AI Infrastructure.
- Implement SRE fundamentals including incident management and performance optimization.
- Build automation tools to enhance observability and reduce manual overhead.
- Establish frameworks for operational maturity and lead incident response protocols.
- Mentor peers and collaborate with diverse engineering teams.
Qualifications
- A minimum of 12 years of experience in software development, SRE, or production engineering, with a degree in Computer Science or a related field (or equivalent experience).
- Proficiency in Python and one other language (C/C++, Go, Perl, or Ruby).
- Solid expertise in Linux or Windows systems engineering and in cloud platforms such as AWS, OCI, Azure, or GCP.
- An understanding of SRE principles such as error budgets, SLOs, and SLAs.
- Hands-on experience with Infrastructure as Code tools like Terraform CDK.
- Familiarity with observability platforms (ELK, Prometheus, Loki) and CI/CD systems (GitLab).
Preferred Experience
- Involvement in AI training, inferencing, and data infrastructure services.
- Proficiency with deep learning frameworks like PyTorch, TensorFlow, JAX, and Ray.
- Background in hardware health monitoring and system reliability.
- Experience scaling distributed systems with strict SLAs.
- Expertise in incident, change, and problem management.
Key Skills/Competencies
AI Infrastructure, SRE, Scalability, Automation, Observability, Cloud, Python, Linux, Terraform, CI/CD
How to Get Hired at NVIDIA
🎯 Tips for Getting Hired
- Research NVIDIA's culture: Understand their mission and innovation legacy.
- Customize your resume: Highlight SRE and AI infrastructure expertise.
- Prepare for technical interviews: Focus on cloud and coding challenges.
- Showcase leadership: Demonstrate mentoring and incident management skills.