Senior Software Engineer, Site Reliability Engineering, Vertex
Job Overview

Job Description
About the Senior Software Engineer, Site Reliability Engineering, Vertex Role
Site Reliability Engineering (SRE) at Google combines software and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems. This role ensures that Google Cloud's services, both internally critical and externally visible, maintain the reliability and uptime that customers need while continuing to improve rapidly. SREs also closely monitor system capacity and performance.
Much of the software development focuses on optimizing existing systems, building robust infrastructure, and eliminating manual work through automation. As a Senior Software Engineer, Site Reliability Engineering, Vertex, you'll tackle unique scale challenges at Google Cloud, leveraging your expertise in coding, algorithms, analysis, and large-scale system design. The SRE team thrives on intellectual curiosity, problem-solving, and openness, fostering collaboration and encouraging big thinking and risk-taking in a blame-free environment. We promote self-direction on meaningful projects while providing essential support and mentorship for continuous learning and growth.
As part of the Vertex First-Party (1P) GenAI SRE team, you will collaborate with individuals passionate about shaping the future of artificial intelligence, generative AI, and machine learning platforms. Your work will drive production excellence through SRE principles, building and supporting groundbreaking AI/ML tools on the rapidly growing Vertex GenAI platform. This unified artificial intelligence platform empowers industries and organizations to transform, solve real-world problems, and scale ML models faster for both internal teams and Google Cloud customers.
Behind everything users see online is the architecture built by the Technical Infrastructure team. From developing and maintaining data centers to building next-generation Google platforms, this team makes Google's product portfolio possible. We pride ourselves on being engineers' engineers, constantly improving our networks to ensure the best and fastest user experience.
Responsibilities
- Drive improvements across the entire service life-cycle, from inception and design through deployment, operation, and refinement.
- Enable services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning, and launch reviews.
- Monitor and measure availability, latency, and overall system health to maintain live services.
- Scale systems sustainably through automation, and evolve them by pushing for changes that improve reliability and velocity.
- Respond to incidents sustainably and conduct blameless postmortems.
Minimum Qualifications
- Bachelor’s degree in Computer Science, a related field, or equivalent practical experience.
- 5 years of experience with software development in one or more programming languages.
- 3 years of experience in designing, analyzing, and troubleshooting large-scale distributed systems.
- 2 years of experience leading projects and providing technical leadership.
Preferred Qualifications
- Master's degree in Computer Science or Engineering.
- Experience developing and supporting Google-scale production systems.
- Experience enhancing and supporting large production systems on cluster management systems.
- Experience in software engineering and development with C++, Python, Go, Generic Configuration Language (GCL), and APIs.
- Experience with networking, capacity, and performance.
- Experience in large-scale system and architecture design, and system integrations or migrations.
Key Skills/Competencies
- Site Reliability Engineering (SRE)
- Distributed Systems
- Software Development (C++, Python, Go)
- Automation
- System Design
- Capacity Planning
- Incident Management
- Generative AI (GenAI)
- Machine Learning Platforms
- Technical Leadership
How to Get Hired at Google
- Research Google's culture: Study their mission, values, recent news, and employee testimonials on LinkedIn and Glassdoor, especially concerning SRE principles and innovation.
- Tailor your resume: Highlight extensive experience with large-scale distributed systems, SRE methodologies, and proficiency in programming languages like C++, Python, and Go.
- Showcase problem-solving skills: Prepare to discuss in detail your experience with incident management, root cause analysis, and how you have automated solutions in previous SRE roles.
- Practice system design: Google SRE interviews heavily emphasize designing, analyzing, and troubleshooting complex, Google-scale systems, so prepare for these technical challenges.
- Demonstrate technical leadership: Be ready to share concrete examples of how you have led projects, mentored team members, and driven technical initiatives effectively.