Site Reliability Engineer

Scale.jobs · Seattle, WA

On site
Full-time
$150,000 / year
Seattle, WA

Job highlights

Build and maintain multi-region cloud infrastructure.
Optimize Kubernetes orchestration platforms.
Develop automated CI/CD pipelines.
Implement observability frameworks for alerting.
Automate operations with Go or Python.

About the role

The role focuses on building and maintaining the infrastructure platforms that support high-throughput, low-latency services across multi-region cloud environments. The team is responsible for ensuring the reliability, scalability, and performance of production systems, moving away from manual operations toward automated, self-healing software-defined infrastructure.

The engineer in this role will collaborate directly with product engineering teams to architect reliable services, design robust CI/CD pipelines, and establish clear observability standards using modern SRE methodologies.

Key Responsibilities

Design, provision, and manage multi-region cloud infrastructure using Terraform to ensure high availability and disaster recovery readiness.
Optimize and maintain container orchestration platforms using Kubernetes (EKS/GKE), including service meshes and ingress controllers.
Develop and maintain automated CI/CD pipelines using GitLab CI, GitHub Actions, or Jenkins to support continuous deployment and zero-downtime releases.
Build and mature comprehensive observability frameworks using Prometheus, Grafana, Jaeger, and the ELK stack for proactive alerting and rapid incident resolution.
Participate in a shared blameless on-call rotation, conducting deep-dive post-mortems and implementing long-term engineering fixes to prevent recurrence of production issues.
Write clean, maintainable automation tools and scripts in Go or Python to eliminate toil and automate manual operational processes.

What We Are Looking For

3–6 years of experience in a Site Reliability Engineering, DevOps, or Systems Engineering role supporting production environments at scale.
Strong proficiency with infrastructure as code (IaC), specifically Terraform, and container orchestration with Kubernetes.
Proficient in at least one software development language, preferably Go or Python, for systems automation and tooling.
Deep understanding of Linux systems internals, networking protocols (TCP/IP, DNS, HTTP/S, BGP), and cloud security best practices.
Experience implementing and tuning observability tools (Prometheus, Datadog, or OpenTelemetry) to establish SLIs, SLOs, and error budgets.
BS or MS in Computer Science, Engineering, or a related technical discipline, or equivalent practical experience.
Bonus: Experience managing relational and non-relational databases (PostgreSQL, DynamoDB, Redis) at scale, or expertise with service mesh architectures like Istio.
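
For candidates less familiar with the SLI, SLO, and error budget terms in the requirements above, here is a minimal sketch of how the three relate for a request-based availability target. The class name, field names, and sample numbers are illustrative assumptions and do not come from the posting or from any specific tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Hypothetical helper: request-based error budget for one SLO window."""

    slo_target: float      # e.g. 0.999 means "99.9% of requests succeed"
    total_requests: int    # requests served in the window
    failed_requests: int   # requests that violated the SLI in the window

    @property
    def sli(self) -> float:
        """Measured success ratio (the SLI) over the window."""
        if self.total_requests == 0:
            return 1.0  # no traffic, so nothing has failed
        return 1 - self.failed_requests / self.total_requests

    @property
    def allowed_failures(self) -> float:
        """Number of failures the SLO tolerates in this window."""
        return (1 - self.slo_target) * self.total_requests

    @property
    def budget_remaining(self) -> float:
        """Fraction of the error budget still unspent (negative if overspent)."""
        allowed = self.allowed_failures
        if allowed == 0:
            return 0.0 if self.failed_requests else 1.0
        return 1 - self.failed_requests / allowed


if __name__ == "__main__":
    window = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
    print(f"SLI: {window.sli:.5f}")
    print(f"Allowed failures: {window.allowed_failures:.0f}")
    print(f"Budget remaining: {window.budget_remaining:.0%}")
```

With a 99.9% target over one million requests, about 1,000 failures are tolerated, so 250 failures leave roughly three quarters of the budget unspent. A negative remaining budget would mean the SLO has been breached for the window.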

Key skills/competency

Site Reliability Engineering
DevOps
Infrastructure as Code (IaC)
Terraform
Kubernetes
CI/CD
Observability
Prometheus
Grafana
Go or Python

Skills & topics

Site Reliability Engineer
SRE
DevOps
Cloud Infrastructure
Terraform
Kubernetes
CI/CD
Observability
Prometheus
Grafana
Go
Python
Automation
Linux
Networking

How to get hired

Tailor your resume: Highlight experience with Terraform, Kubernetes, and CI/CD.
Showcase automation skills: Emphasize proficiency in Go/Python and IaC principles.
Demonstrate SRE expertise: Detail experience with observability tools and on-call rotations.
Prepare for technical interviews: Review Linux internals, networking, and cloud security.
Express your interest: Clearly articulate your passion for site reliability and system automation.

Technical preparation

Master Terraform for IaC.
Deep dive into Kubernetes concepts.
Practice Go/Python for automation.
Review Linux, networking, and cloud security.

Behavioral questions

Describe a complex system failure you resolved.
How do you handle on-call incidents calmly?
Explain your approach to automating toil.
How do you collaborate with product engineers?

Frequently asked questions

What are the core responsibilities of a Site Reliability Engineer at Scale.jobs?

As a Site Reliability Engineer at Scale.jobs, you will be responsible for building and maintaining infrastructure platforms, ensuring system reliability, scalability, and performance. This includes designing cloud infrastructure, optimizing Kubernetes, developing CI/CD pipelines, implementing observability, and automating operational processes using Go or Python.

What specific technologies will I use as a Site Reliability Engineer at Scale.jobs?

You will work with a modern tech stack including Terraform for infrastructure as code, Kubernetes (EKS/GKE) for orchestration, CI/CD tools like GitLab CI, GitHub Actions, or Jenkins, and observability tools such as Prometheus, Grafana, Jaeger, and the ELK stack. Proficiency in Go or Python for automation is also key.

What level of experience is required for this Site Reliability Engineer role at Scale.jobs?

Scale.jobs is looking for candidates with 3–6 years of experience in Site Reliability Engineering, DevOps, or Systems Engineering roles supporting production environments at scale. Strong proficiency with Terraform and Kubernetes is essential.

Is there an on-call rotation for the Site Reliability Engineer position at Scale.jobs?

Yes, this role involves participating in a shared, blameless on-call rotation. The focus is on conducting deep-dive post-mortems and implementing long-term engineering fixes to prevent future production issues.

What are the 'nice-to-have' skills for a Site Reliability Engineer at Scale.jobs?

Bonus points are awarded for experience managing relational and non-relational databases (PostgreSQL, DynamoDB, Redis) at scale, or for expertise with service mesh architectures like Istio. Familiarity with Datadog or OpenTelemetry is also beneficial.

How does Scale.jobs approach system reliability and automation?

Scale.jobs is committed to moving away from manual operations towards automated, self-healing software-defined infrastructure. The focus is on building reliable services, robust CI/CD pipelines, and clear observability standards using modern SRE methodologies.
