12 days ago

Senior HPC DevOps Engineer

NVIDIA

Hybrid
Full Time
$150,000

Job Overview

Job Title: Senior HPC DevOps Engineer
Job Type: Full Time
Offered Salary: $150,000
Location: Hybrid

Job Description

About the Role

NVIDIA is seeking a Senior HPC DevOps and Network Engineer to contribute to the development of future supercomputers and HPC clusters. This role is pivotal in driving advancements in artificial intelligence and GPU computing, focusing on large-scale system design and performance tuning for compute-intensive tasks. You will collaborate with researchers, developers, and customers, using cutting-edge accelerated computing and Deep Learning platforms to enhance workflows and create innovative solutions. You will also work alongside HPC, OS, GPU compute, and systems specialists to architect and deploy high-performance platforms.

What You’ll Be Doing

  • Design, implement, and maintain large-scale HPC/AI clusters with advanced monitoring, logging, and alerting systems.
  • Utilize and develop Infrastructure as Code (IaC) tools for scalable and repeatable deployments.
  • Develop and maintain CI/CD pipelines to automate and streamline deployment processes.
  • Create automation scripts and tools for deployment, configuration management, and operational monitoring.
  • Develop complex network automation solutions.
  • Perform in-depth troubleshooting from bare metal to application levels, ensuring system reliability.
  • Act as a technical expert, promoting and sharing best practices within internal teams.
  • Support R&D activities, including proofs of concept (POCs) and proofs of value (POVs) for future enhancements.

What We Need To See

  • Bachelor's degree in Computer Science, Engineering, or a related field, with 5+ years of experience.
  • Deep knowledge of HPC and AI technologies, including CPUs, GPUs, high-speed interconnects, and related software.
  • Advanced proficiency in programming and scripting languages with a strong grasp of object-oriented programming.
  • Familiarity with CI/CD tools such as Jenkins, and configuration management tools such as Ansible, Puppet, or Chef.
  • Excellent knowledge of Windows and Linux (Red Hat/CentOS, Ubuntu), networking, and OS-level security.
  • Deep understanding of networking protocols like InfiniBand and Ethernet.
  • Experience with workload schedulers (e.g., Slurm) and orchestration tools (e.g., Kubernetes).
  • Background with storage solutions such as Lustre, GPFS, ZFS, and XFS.
  • Expertise in virtual systems (VMware, Hyper-V, KVM, Citrix).
  • Familiarity with major cloud platforms (AWS, Azure, Google Cloud).

Ways To Stand Out From The Crowd

  • Proven networking experience or strong knowledge acquired through professional networking training.
  • Insight into CPU and/or GPU architecture.
  • Container expertise, including Kubernetes and microservice technologies.
  • Experience with GPU-specific hardware/software like DGX and CUDA.
  • Background with RDMA fabrics (InfiniBand or RoCE).

Commitment to Diversity and Inclusion

NVIDIA is dedicated to fostering diversity and an inclusive environment. We do not discriminate based on race, religion, color, national origin, sex, gender, gender expression, sexual orientation, age, marital status, veteran status, or disability. We offer reasonable accommodations to ensure equal participation in our application and interview processes, and for all aspects of employment.

Join NVIDIA and contribute to technology that's pushing boundaries and making a significant global impact.

Key skills/competency: Senior HPC DevOps Engineer, High-Performance Computing, AI Clusters, DevOps, Network Engineering, Infrastructure as Code, CI/CD, Automation, Scripting, Linux, GPU Computing, NVIDIA

Tags:

HPC
DevOps
AI
GPU Computing
Network Engineer
Infrastructure as Code
CI/CD
Automation
Linux
Kubernetes
InfiniBand
Slurm
NVIDIA

How to Get Hired at NVIDIA

  • Tailor your resume: Highlight experience with HPC, AI, DevOps, networking, and specific tools like Jenkins, Ansible, Slurm, and Kubernetes.
  • Showcase technical depth: Emphasize your understanding of CPUs, GPUs, high-speed interconnects, InfiniBand, Ethernet, and storage solutions.
  • Quantify achievements: Use data to demonstrate the impact of your automation and optimization efforts on system reliability and efficiency.
  • Prepare for technical questions: Be ready to discuss complex troubleshooting scenarios and your experience with NVIDIA's technology stack.
  • Research NVIDIA's culture: Understand their commitment to innovation, diversity, and their role in advancing AI and GPU computing.
