12 hours ago

AI Test Architect

NVIDIA

Hybrid
Full Time
$220,000

Job Overview

Job Title: AI Test Architect
Job Type: Full Time
Category: Commerce
Experience: 5 Years
Degree: Master
Offered Salary: $220,000
Location: Hybrid

Job Description

AI Test Architect at NVIDIA

NVIDIA is seeking an AI Test Architect to join the E2E Verification group. In this role, you will profile innovative large-scale distributed training on NVIDIA AI End-to-End solutions within extensive supercomputing clusters. You will provide critical insights into at-scale system design and tuning mechanisms for large-scale compute runs.

This position offers the opportunity to work with the latest Accelerated Computing and Deep Learning software and hardware platforms. You will collaborate with researchers, developers, and customers to refine workflows and create new, differentiated solutions. Interaction with HPC, OS, Switch, HCA, CPU, GPU compute, and systems specialists will be crucial for architecting, developing, and bringing up large-scale performance platforms.

What You’ll Be Doing

  • Profiling, benchmarking, and analyzing deep learning models to identify optimization opportunities in performance, efficiency, and accuracy, with a strong emphasis on networking aspects.
  • Collaborating closely with data scientists, researchers, development, and automation teams to design and implement scalable training pipelines and frameworks that demonstrate large-scale high-performance networking capabilities.
  • Staying up-to-date with the latest advancements in deep learning algorithms, architectures, NVIDIA GPU technologies, and high-performance networking solutions.
  • Optimizing deep learning models for performance, memory usage, and power efficiency while maximizing high-performance networking features on NVIDIA supercomputers.
  • Providing insights and recommendations based on the analysis of large-scale training results, specifically focusing on networking bottlenecks and optimizations, to improve model outcomes and achieve business objectives.
  • Collaborating with hardware engineers to guide the development and integration of efficient networking solutions for deep learning, including exploring network architecture optimizations and leveraging technologies such as RDMA or InfiniBand.

What We Need To See

  • B.Sc. in Computer Science, Software Engineering, or equivalent experience.
  • Strong understanding and practical experience with machine learning algorithms and techniques, with a specialization in deep learning and expertise in high-performance networking.
  • 8+ years of overall experience, including CUDA programming for deep learning frameworks such as TensorFlow and PyTorch, combined with expertise in networking libraries and protocols.
  • Ability to profile and optimize deep learning workflows, focusing on networking-related bottlenecks and optimizations, to improve overall performance and efficiency.
  • Exceptional analytical and problem-solving skills, with keen attention to detail, particularly in identifying and resolving networking performance issues.
  • Excellent communication and collaboration skills, enabling effective teamwork and cooperation.
  • Familiarity with supercomputers, parallel computing, distributed systems, and high-performance networking technologies like RDMA or InfiniBand.

Ways To Stand Out From The Crowd

  • Demonstrated experience in successfully profiling and optimizing large-scale deep learning training on NVIDIA supercomputers, with a significant focus on high-performance networking enhancements.
  • Experience with distributed deep learning, distributed training frameworks, or large-scale data pipelines enhanced by high-performance networking solutions.
  • Expertise in optimizing networking parameters, such as bandwidth, latency, or congestion control, for deep learning workloads.
  • Familiarity with NVIDIA's networking technologies, such as Mellanox InfiniBand, and their integration with deep learning workflows.
  • Strong understanding of high-performance networking protocols and standards and their application to deep learning.

Key skills/competencies

  • AI
  • Deep Learning
  • Distributed Training
  • Performance Optimization
  • High-Performance Networking
  • CUDA Programming
  • TensorFlow
  • PyTorch
  • Supercomputing
  • InfiniBand/RDMA

Tags:

AI Test Architect
Profiling
Benchmarking
Optimization
Distributed Training
Deep Learning
Networking
Supercomputing
System Design
Data Analysis
Collaboration
CUDA
TensorFlow
PyTorch
RDMA
InfiniBand
HPC
Linux
Python
C++
GPU

How to Get Hired at NVIDIA

  • Research NVIDIA's culture: Study their mission, values, recent news, and employee testimonials on LinkedIn and Glassdoor to understand their AI leadership.
  • Tailor your resume: Highlight extensive experience in deep learning, high-performance networking, and CUDA programming, emphasizing large-scale distributed systems.
  • Showcase distributed systems expertise: Provide concrete examples of optimizing complex, large-scale AI training workflows, especially focusing on networking enhancements.
  • Prepare for technical deep dives: Be ready to discuss your expertise in AI algorithms, HPC, GPU technologies, and specific networking protocols like InfiniBand and RDMA.
  • Demonstrate problem-solving: Share instances where you identified and resolved critical performance or networking bottlenecks in challenging supercomputing environments.
