
Software Engineer, TPU Infrastructure, Google Cloud
Google · Hyderabad, Telangana, India
This listing has closed.
- On-site
- Full-time
- $150,000 / year
Job highlights
- Design and build scalable software for Cloud TPU infrastructure.
- Architect highly available distributed systems for ML workloads.
- Develop telemetry and tooling for SLOs and SLAs.
- Collaborate across teams to manage accelerator capacity.
- Implement reliable ML infrastructure for massive-scale training.
About the role
Google's software engineers develop the next-generation technologies that change how billions of users connect, explore, and interact with information and one another. Our products need to handle information at massive scale, and extend well beyond web search. We're looking for engineers who bring fresh ideas from all areas, including information retrieval, distributed computing, large-scale system design, networking and data storage, security, artificial intelligence, natural language processing, UI design and mobile; the list goes on and is growing every day. As a software engineer, you will work on a specific project critical to Google’s needs with opportunities to switch teams and projects as you and our fast-paced business grow and evolve. We need our engineers to be versatile, display leadership qualities and be enthusiastic to take on new problems across the full stack as we continue to push technology forward.
The TPU Infra team is the engine behind Google’s AI Hypercomputer, responsible for the technical strategy and execution of the Machine Learning (ML) Compute IaaS platforms. In this role, you will architect, implement, and lead the infrastructure software solutions that manage the massive global TPU fleet.
Google Cloud accelerates every organization’s ability to digitally transform its business and industry. We deliver enterprise-grade solutions that leverage Google’s cutting-edge technology, and tools that help developers build more sustainably. Customers in more than 200 countries and territories turn to Google Cloud as their trusted partner to enable growth and solve their most critical business problems.
Responsibilities
- Design and build scalable software capabilities to manage the availability, scheduling, and reliability of the Cloud TPU Hypercomputer stack (VMs, networking, storage, GKE, etc.).
- Architect infrastructure solutions to ensure industry-leading availability guarantees for large-scale training and inference workloads.
- Develop telemetry and tooling to establish service level objectives (SLOs) and service level agreements (SLAs), and to enable rapid debugging of complex infrastructure issues across the fleet.
- Collaborate with platform, hardware, networking, and SRE teams to scale and manage accelerator capacity, including new TPU generations, and ensure a seamless experience for customers.
- Design and implement reliable ML infrastructure that enables training and serving cutting-edge models at massive scale. Troubleshoot complex distributed system issues across the stack (hardware, kernel, network), and build the automation, tooling, and telemetry needed to turn operational findings into permanent software fixes and improved SLOs.
Minimum qualifications
- Bachelor’s degree or equivalent practical experience.
- 2 years of experience in backend infrastructure development.
- Experience developing in general-purpose programming languages such as C++, Go, or Python.
- Experience with algorithms, data structures, software development, and distributed computing.
Preferred qualifications
- Experience designing reliable, fault-tolerant, and high-performance distributed systems.
- Experience building cloud-based services, ideally on GCP.
- Experience with large-scale distributed systems or Machine Learning (ML) systems (training and serving for computer vision, speech recognition, natural language processing, machine translation models).
- Experience with reliability, large-scale distributed systems, Go, Google Cloud Platform, tensor processing units (TPUs), and service level objectives.
Key skills/competency
- Software Engineering
- TPU Infrastructure
- Backend Development
- Distributed Systems
- Cloud Services
- Machine Learning Infrastructure
- Google Cloud Platform (GCP)
- Reliability Engineering
- Scalability
- Algorithm Design
Skills & topics
- Software Engineer
- TPU Infrastructure
- Backend Development
- Distributed Systems
- Cloud Computing
- Machine Learning
- Google Cloud
- GCP
- C++
- Go
- Python
- Algorithms
- Data Structures
- System Design
- Reliability Engineering
How to get hired
- Tailor your resume: Highlight your experience in backend infrastructure, C++, Go, Python, algorithms, and distributed computing, aligning with Google Cloud's needs.
- Showcase cloud and ML expertise: Emphasize any experience with GCP, designing fault-tolerant distributed systems, or working with large-scale ML systems.
- Prepare for technical interviews: Be ready to discuss algorithms, data structures, software development principles, and distributed computing concepts. Practice coding challenges.
- Understand Google's culture: Research Google's commitment to innovation, scale, and user impact. Prepare to discuss your leadership qualities and enthusiasm for tackling new problems.
- Network strategically: Connect with Google Cloud employees on LinkedIn to gain insights into the team and culture.
Technical preparation
- Practice C++, Go, or Python coding challenges.
- Review algorithms and data structure concepts.
- Study distributed computing principles.
- Prepare to discuss cloud infrastructure design.
Behavioral questions
- Describe a complex infrastructure problem you solved.
- How do you ensure reliability in distributed systems?
- Give an example of leading a technical project.
- How do you handle ambiguity and evolving priorities?
Frequently asked questions
- What are the key responsibilities for a Software Engineer on the TPU Infrastructure team at Google?
- As a Software Engineer on the TPU Infrastructure team at Google, you will be responsible for designing and building scalable software capabilities for the Cloud TPU Hypercomputer stack, architecting infrastructure solutions for high availability in ML workloads, developing telemetry for SLOs/SLAs, collaborating with various teams to manage accelerator capacity, and implementing reliable ML infrastructure for large-scale model training and serving.
- What technical skills are essential for this Software Engineer role at Google?
- Essential technical skills for this role include experience in backend infrastructure development, proficiency in general-purpose coding languages like C++, Go, or Python, and a strong understanding of algorithms, data structures, software development, and distributed computing. Experience with cloud-based services, particularly GCP, and large-scale distributed or ML systems is highly preferred.
- How does Google Cloud leverage its TPU Infrastructure team?
- Google Cloud utilizes its TPU Infrastructure team as the core engine for its AI Hypercomputer. This team is crucial for the technical strategy and execution of Machine Learning (ML) Compute IaaS platforms, managing a massive global fleet of TPUs to enable cutting-edge AI development and deployment.
- What kind of projects can I expect to work on as a Software Engineer at Google Cloud?
- As a Software Engineer at Google Cloud, you will work on specific projects critical to Google's needs, focusing on managing and scaling the Cloud TPU Hypercomputer stack. This includes areas like VM management, networking, storage, GKE, and ensuring industry-leading availability for training and inference workloads.
- What is Google's approach to equal opportunity employment for this Software Engineer position?
- Google is committed to equal opportunity and affirmative action, valuing diversity regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, or veteran status. Google also considers qualified applicants with criminal histories, consistent with legal requirements.
- How can I demonstrate my experience with large-scale distributed systems for this Google role?
- To demonstrate your experience, highlight projects where you designed, implemented, or managed fault-tolerant, high-performance distributed systems. Mention specific challenges overcome, the scale of the systems, and any impact on reliability or efficiency, particularly if related to cloud platforms like GCP or ML systems.