6 days ago

Senior Solutions Architect - Cloud Infrastructure

NVIDIA

Hybrid
Full Time
$350,000

Job Overview

Job Title: Senior Solutions Architect - Cloud Infrastructure
Job Type: Full Time
Category: Commerce
Experience: 5 Years
Degree: Master
Offered Salary: $350,000
Location: Hybrid

Job Description

What You'll Be Doing

As a Senior Solutions Architect - Cloud Infrastructure at NVIDIA, you will be a recognized technical expert and trusted advisor for NVIDIA's GPU-accelerated cloud offerings and high-performance networking solutions. You will help clients build resilient cloud infrastructures that collect and utilize system data, specifically optimized for AI Factory deployments. A key aspect of this role involves architecting and validating high-performance interconnect solutions using accelerated networking technologies such as InfiniBand, RoCE (RDMA over Converged Ethernet), and GPUDirect, which are essential for efficient large-scale AI training and inference workloads.

You will lead several complex projects, collaborating closely with engineering groups to achieve successful builds, resolve issues, and deploy solutions into production, with a strong focus on developing robust tools for observability and failure recovery. Furthermore, you will work hand-in-hand with Sales Account Managers to lead customer proof-of-concept evaluations, particularly for Microsoft/Azure-focused opportunities. This role demands leadership through project ownership, including defining projects of varying scope and complexity, and coordinating experiments, tests, and evaluations to solve customer challenges. You will also develop research collaboration programs with key customers and partners, serve as an internal reference for datacenter, large-scale computing, and networking solutions within the NVIDIA technical community, and mentor less experienced team members while promoting cross-departmental collaboration.

What We Need To See

  • 12+ years of experience in cloud infrastructure engineering, AI/ML systems, or large-scale distributed systems (may be less with highly relevant industry experience).
  • A BS in Computer Science, Electrical Engineering, Mathematics, or Physics, or equivalent experience.
  • Recognized expertise in cloud computing and large-scale computing systems, coupled with a strong understanding of high-performance networking architectures, including InfiniBand, RDMA, RoCE, or similar low-latency interconnect technologies crucial for AI/HPC workloads.
  • Proficiency in Linux, Windows Subsystem for Linux, and Windows is required.
  • A passion for machine learning and AI, with the drive to continually learn and apply new technologies.
  • Excellent interpersonal skills, including the ability to explain complex technical topics to non-experts and influence collaborators at the executive level.
  • A proven track record of successfully managing multiple engagements during the implementation of new technology and products into complex projects.

Ways To Stand Out From The Crowd

  • Extensive knowledge of Microsoft Azure, especially its GPU-accelerated and HPC services.
  • Skilled in deploying and managing cloud-native solutions on leading cloud platforms, concentrating on GPU-accelerated workloads.
  • Expertise in crafting and optimizing AI Factory architectures, including network fabric design, GPUDirect RDMA, NCCL tuning, and multi-node training performance optimization.
  • Expertise with orchestration tools like Slurm and Kubernetes, along with familiarity with NVIDIA's DGX Cloud, Base Command Platform, and its ecosystem.
  • Hands-on experience crafting telemetry systems and failure recovery mechanisms for large-scale cloud infrastructures, using observability tools such as Grafana, Prometheus, and OpenTelemetry, or equivalent experience.
  • Contributions to open-source projects demonstrating proficiency in cloud-AI/infrastructure engineering, with a recognized standing as a leader in cloud infrastructure or AI/ML fields.

Key skills/competency

  • Cloud Infrastructure Engineering
  • AI/ML Systems
  • High-Performance Networking
  • GPU-accelerated Computing
  • InfiniBand/RDMA/RoCE
  • Microsoft Azure
  • Kubernetes/Slurm
  • Observability Tools (Grafana, Prometheus)
  • Project Leadership
  • Technical Advising

Tags:

Solutions Architect
Cloud Infrastructure
AI/ML Systems
High-Performance Networking
GPU Computing
InfiniBand
RDMA
RoCE
Microsoft Azure
Kubernetes
Linux
Project Leadership
Technical Consulting
Observability
Failure Recovery
Distributed Systems
HPC
GPUDirect
Slurm
Grafana

How to Get Hired at NVIDIA

  • Research NVIDIA's culture: Study their mission, values, recent news, and employee testimonials on LinkedIn and Glassdoor.
  • Tailor your resume: Customize your resume to highlight experience in cloud infrastructure, AI/ML systems, and high-performance networking relevant to NVIDIA.
  • Showcase technical expertise: Demonstrate proficiency in Linux, cloud platforms, InfiniBand, RDMA, and orchestration tools like Kubernetes.
  • Prepare for behavioral questions: Be ready to discuss project leadership, problem-solving, collaboration, and influencing stakeholders at NVIDIA.
  • Highlight impact and innovation: Frame your experiences around successful project implementations, driving innovation, and contributing to complex technical challenges.
