Staff Site Reliability Engineer - Incident Management & Reliability

Confluent

Hybrid

Full Time

CA$244,800

Job Overview

Job Title: Staff Site Reliability Engineer - Incident Management & Reliability

Job Type: Full Time

Offered Salary: CA$244,800

Location: Hybrid

Job Description

Staff Site Reliability Engineer - Incident Management & Reliability at Confluent

We’re not just building better tech. We’re rewriting how data moves and what the world can do with it. With Confluent, data doesn’t sit still. Our platform puts information in motion, streaming in near real-time so companies can react faster, build smarter, and deliver experiences as dynamic as the world around them.

It takes a certain kind of person to join this team. Those who ask hard questions, give honest feedback, and show up for each other. No egos, no solo acts. Just smart, curious humans pushing toward something bigger, together.

One Confluent. One Team. One Data Streaming Platform.

About The Role

Confluent Cloud processes millions of events per second across AWS, GCP, and Azure. When incidents happen in a multi-cloud streaming platform, they happen at scale—data in motion, exactly-once semantics, and cascading failure modes that require deep systems thinking. We need an expert-level engineer who can drive proactive reliability improvements that prevent these incidents before they occur.

This role combines hands-on technical work with strategic program ownership. You'll spend roughly 75% of your time on engineering: building automation, improving tooling, analyzing systemic failure patterns, and designing reliability improvements. The remaining 25% is teaching and coordination: coaching teams through post-mortems, training incident commanders, and evolving our incident response practices.

You'll be part of a global team with follow-the-sun coverage and clean handoffs that keep everyone working sustainable hours. This role sits within Cloud Architecture and Reliability - Supportability, a horizontal team that owns reliability standards and tooling across engineering. You're the person who makes us need incident management less.

What You Will Do

Analyze systemic failure patterns and design reliability improvements that prevent incident recurrence
Own Rootly configuration, workflows, and integrations with PagerDuty, Jira, Confluence, and Slack
Define and maintain SLO/SLA frameworks; use error budgets to guide reliability investments
Own standards, practices, and continuous improvement of incident response across engineering
Edit and review customer-facing incident documents (CRCAs) to ensure quality and clarity
Develop and deliver training programs; coach teams through post-mortems
Partner with engineering leaders to elevate reliability practices org-wide

What You Will Bring

10+ years of relevant experience in SRE, incident management, or reliability engineering
Cloud experience with at least one of AWS, GCP, or Azure (we run all three)
Experience navigating reliability/incident programs at organizations with 500+ engineers
Deep expertise with incident management tooling (Rootly, PagerDuty, or similar)
Strong understanding of distributed systems and failure modes at scale
Deep experience with observability: metrics, logging, tracing
Kubernetes and container orchestration experience
Understanding of CI/CD pipelines and release processes
Strong written communication (design docs, runbooks, post-mortems)
Experience driving org-wide process and cultural changes
Kafka/event streaming expertise preferred, or demonstrated rapid mastery of complex systems

Ready to build what's next? Let’s get in motion.

Come As You Are

Belonging isn’t a perk here. It’s the baseline. We work across time zones and backgrounds, knowing the best ideas come from different perspectives. And we make space for everyone to lead, grow, and challenge what’s possible.

We’re proud to be an equal opportunity workplace. Employment decisions are based on job-related criteria, without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, veteran status, or any other classification protected by law.

Key skills/competency

Site Reliability Engineering (SRE)
Incident Management
Reliability Engineering
Distributed Systems
Cloud Computing (AWS, GCP, Azure)
Kubernetes
Observability (Metrics, Logging, Tracing)
CI/CD Pipelines
Kafka/Event Streaming
Post-mortems

Tags: Staff Site Reliability Engineer, SRE, Incident Management, Reliability Engineering, System Design, Incident Response, Automation, Observability, Post-mortems, SLO/SLA, Distributed Systems, Cloud Operations, Coaching, Process Improvement, AWS, GCP, Azure, Kubernetes, Kafka, Rootly, PagerDuty, Jira, Confluence, Slack, CI/CD

How to Get Hired at Confluent

Research Confluent's culture: Study their mission, values, recent news, and employee testimonials on LinkedIn and Glassdoor.
Tailor your resume for SRE roles: Highlight experience in distributed systems, incident management, and cloud platforms like AWS, GCP, or Azure.
Showcase reliability engineering expertise: Emphasize proactive incident prevention, SLO/SLA frameworks, and post-mortem analysis.
Prepare for technical interviews: Focus on system design, incident response scenarios, and deep dives into Kubernetes or Kafka.
Demonstrate collaborative leadership: Be ready to discuss coaching teams, driving process changes, and cross-functional partnerships.

Frequently Asked Questions

Find answers to common questions about this job opportunity

01. What is the primary focus of the Staff Site Reliability Engineer - Incident Management & Reliability role at Confluent?

02. What cloud platforms does Confluent Cloud operate on, and why is this relevant for the Staff SRE role?

03. How does Confluent approach incident management and continuous improvement within engineering?

04. What kind of experience with incident management tooling is expected for this Confluent position?

05. What is the team structure like for the Staff Site Reliability Engineer - Incident Management & Reliability at Confluent?

06. How important is Kafka or event streaming expertise for this Staff SRE role at Confluent?

07. What is the expected balance between technical work and strategic leadership in this Staff SRE role?

This job post expired on March 16, 2026