Senior Site Reliability Engineer
@ NVIDIA

Santa Clara, CA
$250,000
On Site
Full Time
Posted 7 hours ago

Job Details

About the Role

The Senior Site Reliability Engineer role at NVIDIA focuses on designing, building, and maintaining large-scale production systems. Using a combination of software and systems engineering practices, you will ensure high efficiency and uptime across internal and external GPU cloud services. This role requires expertise in systems, networking, coding, databases, capacity management, continuous delivery and deployment, and cloud technologies such as Kubernetes and OpenStack.

What You'll Be Doing

  • Design, implement, and support the operational and reliability aspects of large-scale Kubernetes clusters.
  • Engage in the full lifecycle of services from inception to refinement.
  • Maintain live services with monitoring of availability, latency, and system health.
  • Scale systems through automation and continuous improvement practices.
  • Participate in on-call rotation and conduct blameless postmortems.

What We Need To See

  • BS degree in Computer Science or a related field, or equivalent experience.
  • Over 10 years of experience in infrastructure automation and distributed systems design.
  • Proficiency in Python, Go, Perl, or Ruby, with in-depth knowledge of Linux, networking, and containers.

Ways To Stand Out

  • Experience with large-scale distributed systems and cloud technologies such as Kubernetes, OpenStack, and Docker.
  • Strong problem-solving and communication skills, and a proactive mindset.

Key Skills/Competency

Kubernetes, Automation, Monitoring, Distributed Systems, Networking, Linux, Cloud, Python, Reliability, Incident Management

How to Get Hired at NVIDIA

🎯 Tips for Getting Hired

  • Customize Your Resume: Highlight SRE and cloud experience.
  • Research NVIDIA: Understand their products and culture.
  • Showcase Technical Skills: Emphasize Kubernetes and automation expertise.
  • Prepare for On-call Scenarios: Review incident response cases.

📝 Interview Preparation Advice

Technical Preparation

  • Review Kubernetes deployment and scaling strategies.
  • Practice automation scripting in Python and Go.
  • Study Linux system tuning and containerization.
  • Familiarize yourself with distributed systems troubleshooting.

Behavioral Questions

  • Describe a challenging incident and its resolution.
  • Explain teamwork during high-pressure situations.
  • Discuss a time you improved system reliability.
  • Illustrate lessons from a postmortem meeting.

Frequently Asked Questions