Staff Site Reliability Engineer
OVO
Job Description
Role Overview
At OVO, we're tackling one of humanity's biggest challenges: the climate crisis. The Site Reliability Engineering team is central to this mission, driving OVO's customer-focused technology transformation by building and maintaining scalable, efficient, and reliable platforms. Our goal is to enhance system reliability, performance, and cost-efficiency, enabling teams to confidently deliver robust services in GCP. This focus on smart and efficient cloud usage also contributes significantly to reducing CO2 emissions, in line with OVO's Plan Zero.
As a Staff Site Reliability Engineer, you will ensure our systems are reliable, scalable, and efficient. You'll maintain high service availability, improve performance, and optimize monitoring and incident response. Your expertise will support continuous improvement, proactively resolve issues, and strengthen infrastructure resilience.
Where you'll work
This is a hub-based hybrid role. We expect hub-based employees to be in the office at least once a week and to attend OVO Connection events in person. You'll be assigned to one of our three hub offices: Bristol, Glasgow, or London. Each hub offers accessible spaces designed to inspire connection and foster innovation.
Teamworking for the planet: OVO's Plan Zero
Everything we do at OVO revolves around our Plan Zero mission. The Site Reliability Engineering team plays a crucial role in achieving this by ensuring the reliability, performance, and cost-efficiency of our systems, which helps reduce CO2 emissions.
Your Core Responsibilities
- Developing, Refining, and Automating Monitoring Systems: Design, manage, and enhance monitoring, alerting, and observability systems (e.g., Datadog, Prometheus, Grafana), ensuring meaningful insights and effective alerting. Automate repetitive monitoring tasks to boost efficiency.
- Managing SLOs/SLIs and Improving Incident Response: Define and track Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for key services, contributing to better reliability insights. Refine incident response processes, support on-call operations, and improve tooling and communication during incidents.
- Incident Management and Post-Mortem Analysis: Lead or support technical response efforts in resolving complex production incidents. Conduct blameless post-mortems to uncover root causes and drive lasting improvements.
- Cost Optimisation Implementation: Assess infrastructure usage and apply approved strategies to optimise cloud costs, balancing resource efficiency with performance and reliability.
- Capacity Planning, Performance Tuning & Resilience: Utilize monitoring and load testing data to support capacity planning, recommend performance improvements, and help implement resilience best practices across systems.
- Collaboration and Knowledge Sharing: Work closely with engineering, QA, security, and product teams to embed reliability practices, document key processes, and mentor peers.
- Design Review Input: Participate in design reviews, offering guidance on improving reliability, scalability, and day-to-day operability within system architecture.
- Community of Practice: Actively contribute to your Community of Practice, leading discussions, sharing experiences, mentoring others, and shaping content and capability growth.
What you'll bring
- Software Engineering Background: Professional experience in programming languages such as Python, TypeScript, Go, or Java, applying software best practices (CI/CD, unit testing, code reviews) to infrastructure.
- Experienced with the Cloud: Hands-on experience navigating public cloud ecosystems (AWS, GCP, or Azure), understanding cloud-native networking and storage. Demonstrated understanding of distributed system failure modes and fault-tolerant design.
- Infrastructure as Code (IaC) Expert: Advanced experience with Terraform, Pulumi, or Crossplane to manage at-scale infrastructure.
- Data-Driven Mindset: Proficient in using metrics and logs to drive engineering decisions, with an understanding of SLOs and error budgets.
- Problem Solver: Enjoys complex debugging, capable of deep dives into the Linux kernel or network stack to diagnose performance bottlenecks.
- Mentor & Advocate: Passionate about teaching "The SRE Way" to engineers, fostering service reliability ownership.
- Efficiency and Cost Engineering Mindset: Treats capacity planning, performance tuning, and cost optimization as software engineering challenges, leaning towards building "efficiency-as-code."
What's in it for you?
We offer a competitive salary banding of £64,070 - £84,569, dependent on your skills and experience, plus an on-target bonus of 15% linked to OVO's collective performance and Plan Zero goals. We provide 9% Flex Pay on top of your salary, with 4% auto-enrolled into your pension and the remaining 5% for flexible benefits, additional pension contributions, or cash.
Our comprehensive benefits package includes 34 days holiday (including bank holidays), healthcare options, critical illness cover, life assurance, gym membership, travel insurance, discount dining, home & tech loans, and discounts on OVO Energy plans, solar, smart thermostats, and EV chargers.
We also support your commute with ultra-low emission car leasing, cycle to work schemes, and public transport season ticket loans.
At OVO, we foster a culture of belonging with 8 employee-led networks dedicated to creating an inclusive and diverse workplace.
Key skills/competency
- Site Reliability Engineering
- Cloud Platforms (GCP, AWS, Azure)
- Infrastructure as Code (Terraform)
- Monitoring & Observability (Datadog, Prometheus, Grafana)
- Incident Management
- System Resilience
- Cost Optimisation
- Capacity Planning
- Distributed Systems
- Programming (Python, Go, Java, TypeScript)
How to Get Hired at OVO
- Research OVO's culture: Study their mission, values, recent news, and employee testimonials on LinkedIn and Glassdoor.
- Tailor your Staff SRE resume: Highlight extensive experience in SRE principles, cloud platforms (GCP), Infrastructure as Code (Terraform), and relevant programming languages.
- Showcase problem-solving skills: Prepare detailed examples of how you've debugged complex production incidents and driven lasting systemic improvements.
- Demonstrate OVO alignment: Articulate how your expertise in efficiency and cost optimization supports OVO's Plan Zero and climate crisis mission.
- Prepare for technical depth: Be ready to discuss distributed systems design, fault tolerance, observability tools, and cloud-native networking.