Senior Solutions Architect - Cloud Infrastructure
NVIDIA

Job Description
What You'll Be Doing
As a Senior Solutions Architect - Cloud Infrastructure at NVIDIA, you will be a recognized technical expert and trusted advisor for NVIDIA's GPU-accelerated cloud offerings and high-performance networking solutions. You will help clients build resilient cloud infrastructures that collect and utilize system data, specifically optimized for AI Factory deployments. A key aspect of this role involves architecting and validating high-performance interconnect solutions using accelerated networking technologies such as InfiniBand, RoCE (RDMA over Converged Ethernet), and GPUDirect, which are essential for efficient large-scale AI training and inference workloads.
You will lead several complex projects, collaborating closely with engineering groups to achieve successful builds, resolve issues, and deploy solutions into production, with a strong focus on developing robust tools for observability and failure recovery. Furthermore, you will work hand-in-hand with Sales Account Managers to lead customer proof-of-concept evaluations, particularly for Microsoft/Azure-focused opportunities. This role demands leadership through project ownership, including defining projects of varying scope and complexity, and coordinating experiments, tests, and evaluations to solve customer challenges. You will also develop research collaboration programs with key customers and partners, serve as an internal reference for datacenter, large-scale computing, and networking solutions within the NVIDIA technical community, and mentor less experienced team members while promoting cross-departmental collaboration.
What We Need To See
- 12+ years of experience in cloud infrastructure engineering, AI/ML systems, or large-scale distributed systems (fewer years may be acceptable with highly relevant industry experience).
- A BS in Computer Science, Electrical Engineering, Mathematics, or Physics, or equivalent experience.
- Recognized expertise in cloud computing and large-scale computing systems, coupled with a strong understanding of high-performance networking architectures, including InfiniBand, RDMA, RoCE, or similar low-latency interconnect technologies crucial for AI/HPC workloads.
- Proficiency in Linux, Windows Subsystem for Linux, and Windows is required.
- A passion for machine learning and AI, with the drive to continually learn and apply new technologies.
- Excellent interpersonal skills, including the ability to explain complex technical topics to non-experts and influence collaborators at the executive level.
- A proven track record of successfully managing multiple engagements during the implementation of new technology and products into complex projects.
Ways To Stand Out From The Crowd
- Extensive knowledge of Microsoft Azure, especially its GPU-accelerated and HPC services.
- Skilled in deploying and managing cloud-native solutions on leading cloud platforms, concentrating on GPU-accelerated workloads.
- Expertise in crafting and optimizing AI Factory architectures, including network fabric design, GPUDirect RDMA, NCCL tuning, and multi-node training performance optimization.
- Expertise with orchestration tools such as Slurm and Kubernetes, along with familiarity with NVIDIA's DGX Cloud, Base Command Platform, and their ecosystem.
- Hands-on experience crafting telemetry systems and failure recovery mechanisms for large-scale cloud infrastructures, including observability tools such as Grafana, Prometheus, and OpenTelemetry, or equivalent experience.
- Contributions to open-source projects demonstrating proficiency in cloud-AI/infrastructure engineering, with a recognized standing as a leader in cloud infrastructure or AI/ML fields.
Key skills/competencies
- Cloud Infrastructure Engineering
- AI/ML Systems
- High-Performance Networking
- GPU-accelerated Computing
- InfiniBand/RDMA/RoCE
- Microsoft Azure
- Kubernetes/Slurm
- Observability Tools (Grafana, Prometheus)
- Project Leadership
- Technical Advising
How to Get Hired at NVIDIA
- Research NVIDIA's culture: Study their mission, values, recent news, and employee testimonials on LinkedIn and Glassdoor.
- Tailor your resume: Customize your resume to highlight experience in cloud infrastructure, AI/ML systems, and high-performance networking relevant to NVIDIA.
- Showcase technical expertise: Demonstrate proficiency in Linux, cloud platforms, InfiniBand, RDMA, and orchestration tools like Kubernetes.
- Prepare for behavioral questions: Be ready to discuss project leadership, problem-solving, collaboration, and influencing stakeholders at NVIDIA.
- Highlight impact and innovation: Frame your experiences around successful project implementations, driving innovation, and contributing to complex technical challenges.