Senior Cloud Services Software Engineer @ NVIDIA
Job Details
Overview
Join NVIDIA's DGX Cloud Team and contribute to the infrastructure powering innovative AI research. As a Senior Cloud Services Software Engineer, you will develop and optimize AI infrastructure services to deliver peak performance and resiliency for DGX Cloud.
Responsibilities
- Develop solutions integrating machine learning, distributed systems, and HPC.
- Design and optimize microservices orchestrated by Kubernetes for large-scale AI workflows.
- Co-design and implement APIs for integration with NVIDIA's resiliency stacks.
- Create abstractions for long-running training jobs with auto-restart capabilities.
- Develop modular services deployable on on-premises AI clusters.
Requirements
- Bachelor's degree in Computer Science or a related field.
- At least 12 years of hands-on experience in backend development with languages such as Python, Go, or C/C++.
- Proven record of building large-scale distributed systems.
- Experience with cloud platforms (AWS, Azure, GCP) and container technologies such as Docker and Kubernetes.
- Experience with HPC/AI platforms such as Slurm.
Preferred Qualifications
- Experience with DL frameworks and orchestrators (PyTorch, TensorFlow, JAX, Ray).
- Background in framework plugin architectures and cluster scheduler integration.
- Deep understanding of NVIDIA GPUs, network technologies, and failure patterns.
- Practical experience with AI models and AI-based tools, plus code contributions.
About NVIDIA
NVIDIA leads groundbreaking developments in AI, HPC, and visualization. Work with world-class engineers to shape the future of technology.
Key Skills/Competency
- Distributed Systems
- Backend Development
- Cloud Computing
- Kubernetes
- Python
- Go
- Microservices
- High-Performance Computing
- API Development
- AI Infrastructure
How to Get Hired at NVIDIA
🎯 Tips for Getting Hired
- Research NVIDIA's culture: Review its mission, values, and recent projects.
- Customize your resume: Highlight backend, cloud, and AI experience.
- Tailor your portfolio: Showcase distributed systems and microservices.
- Prepare for interviews: Familiarize yourself with Kubernetes and HPC concepts.