Senior Site Reliability Engineer @ NVIDIA
Santa Clara, CA
$250,000
On Site
Full Time
Posted 7 hours ago
Job Details
About the Role
The Senior Site Reliability Engineer role at NVIDIA focuses on designing, building, and maintaining large-scale production systems. Using a combination of software and systems engineering practices, you will ensure high efficiency and uptime across internal and external GPU cloud services. This role requires expertise in systems, networking, coding, databases, capacity management, continuous delivery, deployment, and cloud technologies such as Kubernetes and OpenStack.
What You'll Be Doing
- Design, implement, and support operational and reliability aspects of large-scale Kubernetes clusters.
- Engage in the full lifecycle of services from inception to refinement.
- Maintain live services by monitoring availability, latency, and system health.
- Scale systems through automation and continuous improvement practices.
- Participate in on-call rotation and conduct blameless postmortems.
What We Need To See
- BS degree in Computer Science or related field or equivalent experience.
- Over 10 years of experience in infrastructure automation and distributed systems design.
- Proficiency in Python, Go, Perl, or Ruby, with in-depth knowledge of Linux, networking, and containers.
Ways To Stand Out
- Experience with large-scale distributed systems and cloud technologies such as Kubernetes, OpenStack, and Docker.
- Strong problem-solving and communication skills, and a proactive mindset.
Key Skills/Competency
Kubernetes, Automation, Monitoring, Distributed Systems, Networking, Linux, Cloud, Python, Reliability, Incident Management
How to Get Hired at NVIDIA
🎯 Tips for Getting Hired
- Customize Your Resume: Highlight SRE and cloud experience.
- Research NVIDIA: Understand their products and culture.
- Showcase Technical Skills: Emphasize Kubernetes and automation expertise.
- Prepare for On-call Scenarios: Review incident response cases.
📝 Interview Preparation Advice
Technical Preparation
- Review Kubernetes deployment and scaling strategies.
- Practice automation scripting in Python and Go.
- Study Linux system tuning and containerization.
- Familiarize yourself with distributed systems troubleshooting.
Behavioral Questions
- Describe a challenging incident and its resolution.
- Explain teamwork during high-pressure situations.
- Discuss a time you improved system reliability.
- Illustrate lessons from a postmortem meeting.
Frequently Asked Questions
What qualifications does NVIDIA seek for a Senior Site Reliability Engineer role?
How important is automation in NVIDIA's Senior Site Reliability Engineer position?
What programming skills should a candidate have for the Senior Site Reliability Engineer position at NVIDIA?
How does NVIDIA support continuous learning for a Senior Site Reliability Engineer?
What does on-call rotation mean for the Senior Site Reliability Engineer role at NVIDIA?