Site Reliability Engineer
Microsoft
Job Description
About the Role: Site Reliability Engineer at Microsoft
Join the team dedicated to maintaining Microsoft 365's seamless operation within sovereign cloud environments, where reliability, scalability, and security are paramount. As a Site Reliability Engineer, you will contribute to distributed systems operating at a massive scale, focusing on automating operations, developing robust disaster recovery capabilities, and engineering solutions to eliminate manual toil and enhance service delivery. Leverage your expertise in large-scale systems to help establish the gold standard for sovereign cloud reliability.
The M365 Sovereign Clouds organization is at the forefront of building secure productivity solutions for the world's most critical customers. As an integral part of Azure Silver and Microsoft Sovereign Clouds, we are responsible for delivering and operating the entire Microsoft 365 suite, encompassing Office 365, Exchange, Outlook, Teams, SharePoint, OneDrive, and Purview, all within highly regulated sovereign cloud infrastructures. Our team thrives on innovation and problem-solving, transforming complex challenges into high-performance, reliable services that empower our sovereign cloud clientele. Our culture is built on a growth mindset, innovation, collaboration, and inclusion, recognizing that diverse perspectives drive our most impactful work.
Within the Security & Compliance team, you will collaborate with fellow engineers on systems designed to protect M365 sovereign cloud customers from prevalent threats such as phishing, malware, spam, and data governance complexities. These critical systems process and safeguard millions of messages and documents daily. Our sub-teams offer compelling opportunities to engage with highly intricate systems that enable advanced information protection and data governance for our customers.
Who You Are
- Passionate about distributed systems and adept at working with highly scalable services.
- Enjoys new technological challenges and is motivated to solve them effectively.
- Excited about developing superior software and continuously refining development, integration, and deployment processes.
- A self-starter who excels in a bottom-up, fast-paced, and technically demanding environment.
- An effective collaborator with proven experience in fostering technical partnerships across diverse teams.
- Committed to ensuring exceptional customer satisfaction through technical excellence.
Microsoft's mission is to empower every person and every organization on the planet to achieve more. Our employees embody a growth mindset, innovate to empower others, and collaborate to achieve shared goals. We uphold our values of respect, integrity, and accountability daily, cultivating an inclusive culture where everyone can thrive.
Responsibilities
- Responds to incidents during regular on-call rotations by identifying impact levels, troubleshooting, mitigating, and deploying fixes for root causes.
- Notifies product teams of major customer-impacting issues and escalates critical issues affecting multiple components or features.
- Communicates incident details and resolutions through post-mortem reports and review meetings.
- Independently writes code and scripts to automate scalable operations processes like monitoring, alerting, and deployments.
- Designs, develops, and maintains telemetry pipelines and monitoring tools for operations metrics such as availability, reliability, performance, and efficiency.
- Performs analyses using existing tools and models to generate insights for product development and operations improvements.
- Monitors the operational impact of changes, including Time-to-X metrics (such as time to detect or time to mitigate).
- Troubleshoots problems affecting availability, security, reliability, performance, and efficiency using existing tools and AI/ML capabilities.
- Proposes solutions to resolve and prevent recurring issues, collaborating with SRE and product engineering teams.
- Independently creates, tests, and deploys changes via safe deployment processes (SDP) to enhance code quality, observability, security, reliability, and operability.
- Shares insights and best practices through documentation, code/design reviews, incident drills, and regular meetings.
- Engages with product engineering teams through code/design reviews, meetings, on-call rotations, and incident responses across development and operations cycles.
- Utilizes technical knowledge, security best practices, AI/ML, and telemetry analyses to suggest code and design improvements across products.
Required Qualifications
- Master's Degree in Computer Science or a related technical field AND 3+ years of technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python. OR
- Bachelor's Degree in Computer Science or a related technical field AND 5+ years of technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python. OR
- Equivalent experience.
- 2+ years of technical experience working with large-scale cloud or distributed systems.
Other Requirements
Security Clearance Requirements
- Candidates must meet Microsoft, customer, and/or government security screening requirements for this role.
- Active TS/SCI clearance is required, with willingness and eligibility to upgrade to TS/SCI (with polygraph).
- Maintaining the TS/SCI (with polygraph) clearance is mandatory for this position.
- Failure to obtain or maintain appropriate clearance and/or customer screening may result in employment action up to and including termination.
- Clearance verification is required prior to an offer of employment.
- This position requires successful completion of the Microsoft Cloud background check upon hire/transfer and every two years thereafter.
Key Skills/Competency
- Distributed Systems
- Cloud Operations
- Site Reliability Engineering (SRE)
- Automation
- Incident Response
- Telemetry & Monitoring
- Security & Compliance
- Azure Cloud
- Python/C#/Java
- Disaster Recovery
How to Get Hired at Microsoft
- Research Microsoft's culture: Study their mission, values, recent news, and employee testimonials on LinkedIn and Glassdoor to align with their ethos.
- Tailor your SRE resume: Highlight extensive experience with distributed systems, cloud platforms, robust automation, and critical incident management.
- Showcase deep technical prowess: Prepare to discuss complex technical challenges, specific coding skills, and effective problem-solving approaches in interviews.
- Emphasize a growth mindset: Demonstrate a continuous eagerness to learn, adapt to emerging technologies, and foster strong team collaboration.
- Prepare for behavioral questions: Practice responses that clearly reflect Microsoft's core values of respect, integrity, and accountability in professional scenarios.