Senior Cloud Services Software Engineer
NVIDIA
Original Job Summary
Overview
Join NVIDIA's DGX Cloud Team and contribute to the infrastructure powering innovative AI research. As a Senior Cloud Services Software Engineer, you will develop and optimize AI infrastructure services to deliver peak performance and resiliency for DGX Cloud.
Responsibilities
- Develop solutions integrating machine learning, distributed systems, and HPC.
- Design and optimize microservices orchestrated by Kubernetes for large-scale AI workflows.
- Co-design and implement APIs for integration with NVIDIA's resiliency stacks.
- Create abstractions for long-running training jobs with auto-restart capabilities.
- Develop modular services deployable on on-premises AI clusters.
Requirements
A Bachelor's degree in Computer Science or a related field and at least 12 years of hands-on backend development experience with languages such as Python, Go, or C/C++. A proven record of building large-scale distributed systems, plus experience with cloud platforms (AWS, Azure, GCP), container technologies such as Docker and Kubernetes, and HPC/AI platforms such as Slurm.
Preferred Qualifications
- Experience with DL frameworks and orchestrators (PyTorch, TensorFlow, JAX, Ray).
- Background in framework plugin architectures and cluster scheduler integration.
- Deep understanding of NVIDIA GPUs, network technologies, and failure patterns.
- Practical experience with AI models and AI-based tools, plus code contributions.
About NVIDIA
NVIDIA leads groundbreaking developments in AI, HPC, and visualization. Work with world-class engineers to shape the future of technology.
Key Skills/Competencies
- Distributed Systems
- Backend Development
- Cloud Computing
- Kubernetes
- Python
- Go
- Microservices
- High-Performance Computing
- API Development
- AI Infrastructure
How to Get Hired at NVIDIA
- Research NVIDIA's culture: Review its mission, values, and recent projects.
- Customize your resume: Highlight backend, cloud, and AI experience.
- Tailor your portfolio: Showcase distributed systems and microservices.
- Prepare for interviews: Familiarize yourself with Kubernetes and HPC concepts.