
Principal Software Engineer
Microsoft · United States
- Hybrid
- Full-time
- $139,900 - $274,800 / year (U.S. base pay range)
- United States
Job highlights
- Design and build high-volume, low-latency telemetry pipelines.
- Analyze and improve event pipeline performance and reliability.
- Drive engineering excellence in large-scale supercomputers.
- Collaborate with strategic customers and internal teams.
- Enable innovation in AI and HPC cloud services.
About the role
Overview
The Microsoft Azure High Performance Computing & AI Engineering (HPC & AI Eng) team manages the core platform and fleet of AI High Performance Computing products for demanding customer workloads. The AI Customer Experience (AICE) engineering team is responsible for flagship supercomputers used by top-tier AI customers, enabling breakthroughs like ChatGPT and achieving recognition in Top500, MLPerf, and Graph500 rankings. As a Principal Supercomputing Software Engineer, you will design and develop high-volume, low-latency telemetry pipelines, integrate with existing systems, and correlate data to provide immediate insights into customer-facing issues across the infrastructure stack. This includes identifying issues from datacenter events to hardware and networking subsystem events that impact job reliability and cause job interrupts.
In this role, you will leverage exceptional design and development expertise, with a strong background in large-scale High-Performance Computing (HPC) & GPU systems, cloud computing platforms, and high-performance data processing infrastructure. This is a unique opportunity to gain hands-on experience managing supercomputers at the largest scale. As a key technical leader, you will engage directly with strategic customers, influencing their business outcomes and driving engineering improvements across the Azure ecosystem to benefit the broader fleet. Your work will be pivotal in enabling the next wave of growth and innovation in AI and HPC in the cloud.
Microsoft's mission is to empower every person and every organization on the planet to achieve more. We foster a growth mindset, innovate to empower others, and collaborate to achieve shared goals. Our values of respect, integrity, and accountability drive an inclusive culture where everyone can thrive.
Responsibilities
- Architect, design, and develop high-volume, low-latency end-to-end event pipelines for early detection of events causing job interrupts and impacting reliability.
- Analyze existing event pipelines to assess the fidelity, granularity, and latency of critical events.
- Contribute to improving key metrics such as Job Mean Time to Interrupt, Nodes in Service, and Mean Time to Resolve on flagship supercomputers by enabling data scientists and domain experts to utilize telemetry for issue identification, hypothesis testing, and result synthesis.
- Partner with cross-organizational teams to evaluate telemetry and latency, driving the architecture, design, development, and deployment of end-to-end solutions for core infrastructure, including current and next-generation datacenter, IT hardware, power, and cooling technologies.
- Drive engineering and operational excellence by addressing issues and incorporating learnings from strategic customer usage scenarios to enhance product features and capabilities.
- Collaborate with teams on continuous learning and improvement programs by leading the resolution of complex incidents, driving root cause analyses, and championing initiatives to minimize future customer impact.
Qualifications
Required Qualifications
- Bachelor's Degree in Computer Science or a related technical field AND 6+ years of technical engineering experience, including coding in languages such as C, C++, C#, Java, JavaScript, or Python, OR equivalent experience.
- 5+ years of hands-on experience designing and developing high-volume, low-latency pipelines using products like AzPubSub, Event Hubs, Azure Stream Analytics, Kafka, Grafana, Prometheus, or equivalent.
- 3+ years of experience with AI/HPC system management, High-Speed Networks, HPC Storage, OR managing Cloud Infrastructure.
- Ability to meet Microsoft, customer, and/or government security screening requirements is mandatory for this role. This includes passing the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter.
Other Qualifications
- Bachelor's Degree in Computer Science or a related technical field AND 10+ years of technical engineering experience with coding in languages such as C, C++, C#, Java, JavaScript, or Python; OR Master's Degree in Computer Science or a related technical field AND 8+ years of technical engineering experience with coding in languages such as C, C++, C#, Java, JavaScript, or Python; OR equivalent experience.
- 5+ years of experience operating AI/HPC systems, developing and running AI/HPC applications on clusters, or operating Cloud Infrastructure.
- 3+ years of experience in multiple Data Center technologies: power, cooling, IT hardware, telemetry.
Key skills/competency
- High Performance Computing (HPC)
- Artificial Intelligence (AI)
- Software Engineering
- Telemetry Pipelines
- Low Latency Systems
- Cloud Infrastructure
- System Management
- Data Processing
- Customer Engagement
- Problem Solving
Skills & topics
- Principal Software Engineer
- Software Engineering
- High Performance Computing
- HPC
- AI
- Artificial Intelligence
- Telemetry
- Cloud Computing
- Azure
- System Management
- Low Latency
- C++
- Python
- Kafka
- Prometheus
- Event Hubs
- Grafana
- Azure Stream Analytics
- Datacenter
- Infrastructure
How to get hired
- Tailor your resume: Highlight your experience with C++, Python, low-latency telemetry pipelines, and cloud infrastructure, aligning keywords with the Principal Software Engineer role at Microsoft.
- Showcase HPC/AI expertise: Emphasize your background in AI/HPC system management, high-speed networks, or cloud infrastructure, detailing specific projects and quantifiable achievements.
- Prepare for technical interviews: Be ready to discuss system design, data structures, algorithms, and your experience with relevant technologies like Kafka, Prometheus, or Azure services.
- Demonstrate problem-solving skills: Prepare examples of how you’ve resolved complex incidents, driven root cause analyses, and improved key metrics in large-scale systems.
- Understand Microsoft's culture: Research Microsoft's mission, values, and growth mindset to articulate how your working style aligns during behavioral interviews.
Frequently asked questions
- What is the salary range for a Principal Software Engineer at Microsoft?
- The typical base pay range for a Principal Software Engineer at Microsoft in the U.S. is USD $139,900 - $274,800 annually. For the San Francisco Bay area and New York City metropolitan area, the range is USD $188,000 - $304,200. Specific compensation can vary based on location and other factors.
- What specific technical skills are most critical for this Principal Software Engineer role at Microsoft?
- The most critical technical skills for this Principal Software Engineer position are designing and developing high-volume, low-latency telemetry pipelines, coding in languages such as C++ and Python, and familiarity with tools such as Kafka, Prometheus, Event Hubs, and Azure Stream Analytics. Experience in AI/HPC system management or cloud infrastructure is also highly valued.
- What is the work arrangement for the Principal Software Engineer position at Microsoft?
- The listing is tagged as hybrid. The job description itself does not spell out the on-site schedule, which may depend on team policies and the candidate's location. Further clarification would be available during the interview process.
- How does Microsoft assess candidates for Principal Software Engineer roles?
- Microsoft typically assesses candidates through a combination of technical interviews focusing on coding, system design, and problem-solving, alongside behavioral interviews to evaluate cultural fit, leadership potential, and alignment with Microsoft's values. For this role, expect deep dives into your experience with large-scale systems, telemetry, and HPC/AI technologies.
- What are the security requirements for this Principal Software Engineer role at Microsoft?
- Candidates must meet Microsoft, customer, and/or government security screening requirements. This specifically includes passing the Microsoft Cloud Background Check upon hire and every two years thereafter.
- What is the application process for the Principal Software Engineer job at Microsoft?
- To apply for the Principal Software Engineer role at Microsoft, you should submit your resume and application through Microsoft's careers portal. Applications are accepted on an ongoing basis until the position is filled, and the role will remain open for a minimum of 5 days.
- What is the career growth potential for a Principal Software Engineer at Microsoft?
- As a Principal Software Engineer at Microsoft, you are in a senior individual contributor role with significant technical leadership responsibilities. Career growth can involve deepening technical expertise, moving into architect roles, or potentially transitioning into management roles within the company's vast engineering organization.
- What kind of impact does this Principal Software Engineer role have at Microsoft?
- This Principal Software Engineer role manages critical AI supercomputing infrastructure, directly influences customer success in AI breakthroughs, and drives improvements across the Azure ecosystem that benefit a broad range of high-performance computing workloads.