4 days ago

Staff Site Reliability Engineer

OVO

Hybrid
Full Time
£64,070

Job Overview

Job Title: Staff Site Reliability Engineer
Job Type: Full Time
Category: Commerce
Experience: 5 Years
Degree: Master
Offered Salary: £64,070
Location: Hybrid

Job Description

Staff Site Reliability Engineer at OVO

Role Overview

At OVO, we're tackling one of humanity's biggest challenges: the climate crisis. The Site Reliability Engineering team is central to this mission, driving OVO's customer-focused technology transformation by building and maintaining scalable, efficient, and reliable platforms. Our goal is to enhance system reliability, performance, and cost-efficiency, enabling teams to confidently deliver robust services in GCP. Smart, efficient cloud usage also contributes significantly to reducing CO2 emissions, in line with OVO's Plan Zero.

As a Staff Site Reliability Engineer, you will ensure our systems are reliable, scalable, and efficient. You'll maintain high service availability, improve performance, and optimize monitoring and incident response. Your expertise will support continuous improvement, proactively resolve issues, and strengthen infrastructure resilience.

Where you'll work

This is a Hub Based - Hybrid role. We expect hub-based employees to be in the office at least once a week and attend OVO Connection events in-person. You'll be assigned to one of our three hub offices: Bristol, Glasgow, or London. Each hub offers accessible spaces designed to inspire connection and foster innovation.

Teamworking for the planet: OVO's Plan Zero

Everything we do at OVO revolves around our Plan Zero mission. The Site Reliability Engineering team plays a crucial role in achieving this by ensuring the reliability, performance, and cost-efficiency of our systems, which helps reduce CO2 emissions.

Your Core Responsibilities

  • Developing, Refining, and Automating Monitoring Systems: Design, manage, and enhance monitoring, alerting, and observability systems (e.g., Datadog, Prometheus, Grafana), ensuring meaningful insights and effective alerting. Automate repetitive monitoring tasks to boost efficiency.
  • Managing SLOs/SLIs and Improving Incident Response: Define and track Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for key services, contributing to better reliability insights. Refine incident response processes, support on-call operations, and improve tooling and communication during incidents.
  • Incident Management and Post-Mortem Analysis: Lead or support technical response efforts in resolving complex production incidents. Conduct blameless post-mortems to uncover root causes and drive lasting improvements.
  • Cost Optimisation Implementation: Assess infrastructure usage and apply approved strategies to optimize cloud costs, balancing resource efficiency with performance and reliability.
  • Capacity Planning, Performance Tuning & Resilience: Utilize monitoring and load testing data to support capacity planning, recommend performance improvements, and help implement resilience best practices across systems.
  • Collaboration and Knowledge Sharing: Work closely with engineering, QA, security, and product teams to embed reliability practices, document key processes, and mentor peers.
  • Design Review Input: Participate in design reviews, offering guidance on improving reliability, scalability, and day-to-day operability within system architecture.
  • Community of Practice: Actively contribute to your Community of Practice, leading discussions, sharing experiences, mentoring others, and shaping content and capability growth.

What you'll bring

  • Software Engineering Background: Professional experience in programming languages such as Python, TypeScript, Go, or Java, applying software best practices (CI/CD, unit testing, code reviews) to infrastructure.
  • Experienced with the Cloud: Hands-on experience navigating public cloud ecosystems (AWS, GCP, or Azure), understanding cloud-native networking and storage. Demonstrated understanding of distributed system failure modes and fault-tolerant design.
  • Infrastructure as Code (IaC) Expert: Advanced experience with Terraform, Pulumi, or Crossplane to manage at-scale infrastructure.
  • Data-Driven Mindset: Proficient in using metrics and logs to drive engineering decisions, with an understanding of SLOs and error budgets.
  • Problem Solver: Enjoys complex debugging, capable of deep dives into the Linux kernel or network stack to diagnose performance bottlenecks.
  • Mentor & Advocate: Passionate about teaching "The SRE Way" to engineers, fostering service reliability ownership.
  • Efficiency and Cost Engineering Mindset: Treats capacity planning, performance tuning, and cost optimization as software engineering challenges, leaning towards building "efficiency-as-code."

What's in it for you?

We offer a competitive salary banding of £64,070 - £84,569, dependent on your skills and experience, plus an on-target bonus of 15% linked to OVO's collective performance and Plan Zero goals. We provide 9% Flex Pay on top of your salary, with 4% auto-enrolled into your pension and the remaining 5% for flexible benefits, additional pension contributions, or cash.

Our comprehensive benefits package includes 34 days holiday (including bank holidays), healthcare options, critical illness cover, life assurance, gym membership, travel insurance, discount dining, home & tech loans, and discounts on OVO Energy plans, solar, smart thermostats, and EV chargers. We also support your commute with ultra-low emission car leasing, cycle to work schemes, and public transport season ticket loans.

At OVO, we foster a culture of belonging with 8 employee-led networks dedicated to creating an inclusive and diverse workplace.

Key skills/competencies

  • Site Reliability Engineering
  • Cloud Platforms (GCP, AWS, Azure)
  • Infrastructure as Code (Terraform)
  • Monitoring & Observability (Datadog, Prometheus, Grafana)
  • Incident Management
  • System Resilience
  • Cost Optimisation
  • Capacity Planning
  • Distributed Systems
  • Programming (Python, Go, Java, TypeScript)

Tags:

Site Reliability Engineer
Automation
Observability
Incident Management
Capacity Planning
Performance Tuning
Cost Optimization
System Resilience
Cloud Operations
Monitoring
GCP
Datadog
Prometheus
Grafana
Terraform
Python
Go
Kubernetes
AWS
Distributed Systems

How to Get Hired at OVO

  • Research OVO's culture: Study their mission, values, recent news, and employee testimonials on LinkedIn and Glassdoor.
  • Tailor your Staff SRE resume: Highlight extensive experience in SRE principles, cloud platforms (GCP), Infrastructure as Code (Terraform), and relevant programming languages.
  • Showcase problem-solving skills: Prepare detailed examples of how you've debugged complex production incidents and driven lasting systemic improvements.
  • Demonstrate OVO alignment: Articulate how your expertise in efficiency and cost optimization supports OVO's Plan Zero and climate crisis mission.
  • Prepare for technical depth: Be ready to discuss distributed systems design, fault tolerance, observability tools, and cloud-native networking.
