Senior Site Reliability Engineer, ML Platforms
@ NVIDIA

Hybrid
$150,000
Full Time
Posted 16 hours ago

Job Details

About the Role

This Senior Site Reliability Engineer role at NVIDIA focuses on ML platforms: you will design, build, and maintain large-scale production systems that support advanced data science and machine learning applications.

What You’ll Be Doing

  • Develop software solutions to ensure system reliability and operability.
  • Gain deep insights into system operations including scalability, interactions, and failure modes.
  • Create tools and automation to reduce manual operational tasks.
  • Establish frameworks, processes, and methodologies to enhance operational maturity.
  • Define and track reliability metrics for continuous improvement.
  • Manage capacity and performance across cloud environments globally.
  • Build improved observability tools for faster issue resolution.
  • Practice sustainable incident response and conduct blameless postmortems.

What We Need To See

At least 6 years of experience in SRE, cloud platforms, or DevOps with large-scale microservices; a strong background in incident and change management; and proficiency in automation and coding in languages such as Python and Go.

Ways To Stand Out

  • Experience operating large-scale distributed systems under strict SLAs.
  • Excellent coding skills in Python and Go.
  • Expertise in CI/CD systems and Infrastructure as Code.
  • Strong interpersonal skills and effective, data-driven communication.

Key Skills/Competency

  • SRE
  • Cloud Platforms
  • DevOps
  • Distributed Systems
  • Observability
  • Automation
  • Python
  • Go
  • Incident Management
  • CI/CD

How to Get Hired at NVIDIA

🎯 Tips for Getting Hired

  • Research NVIDIA's culture: Study their mission and recent achievements.
  • Tailor your resume: Highlight experience in SRE and cloud operations.
  • Showcase technical skills: Emphasize Python, Go, and automation.
  • Prepare for interviews: Practice scenario-based SRE questions.

📝 Interview Preparation Advice

Technical Preparation

  • Review Kubernetes and OpenStack basics.
  • Study distributed systems and capacity management.
  • Practice Python and Go coding challenges.
  • Explore automation and incident management tools.

Behavioral Questions

  • Describe a challenging incident resolution experience.
  • Explain teamwork in high-pressure scenarios.
  • Discuss your approach to problem-solving.
  • Share examples of effective communication.

Frequently Asked Questions