
Site Reliability Engineer
Scale.jobs · Seattle, WA
- On-site
- Full-time
- $150,000 / year
Job highlights
- Build and maintain multi-region cloud infrastructure.
- Optimize Kubernetes orchestration platforms.
- Develop automated CI/CD pipelines.
- Implement observability frameworks for alerting.
- Automate operations with Go or Python.
About the role
The role focuses on building and maintaining the infrastructure platforms that support high-throughput, low-latency services across multi-region cloud environments. The team is responsible for the reliability, scalability, and performance of production systems, and is moving away from manual operations toward automated, self-healing, software-defined infrastructure.
The engineer in this role will collaborate directly with product engineering teams to architect reliable services, design robust CI/CD pipelines, and establish clear observability standards using modern SRE methodologies.
Key Responsibilities
- Design, provision, and manage multi-region cloud infrastructure using Terraform to ensure high availability and disaster recovery readiness.
- Optimize and maintain container orchestration platforms built on Kubernetes (EKS/GKE), including service meshes and ingress controllers.
- Develop and maintain automated CI/CD pipelines using GitLab CI, GitHub Actions, or Jenkins to support continuous deployment and zero-downtime releases.
- Build and mature observability frameworks using Prometheus, Grafana, Jaeger, and the ELK stack for proactive alerting and rapid incident resolution.
- Participate in a shared blameless on-call rotation, conducting deep-dive post-mortems and implementing long-term engineering fixes to prevent recurrence of production issues.
- Write clean, maintainable automation tools and scripts in Go or Python to eliminate toil and automate manual operational processes.
What We Are Looking For
- 3–6 years of experience in a Site Reliability Engineering, DevOps, or Systems Engineering role supporting production environments at scale.
- Strong proficiency with infrastructure as code (IaC), specifically Terraform, and container orchestration with Kubernetes.
- Proficient in at least one software development language, preferably Go or Python, for systems automation and tooling.
- Deep understanding of Linux systems internals, networking protocols (TCP/IP, DNS, HTTP/S, BGP), and cloud security best practices.
- Experience implementing and tuning observability tools (Prometheus, Datadog, or OpenTelemetry) to establish SLIs, SLOs, and error budgets.
- BS or MS in Computer Science, Engineering, or a related technical discipline, or equivalent practical experience.
- Bonus: Experience managing relational and non-relational databases (PostgreSQL, DynamoDB, Redis) at scale, or expertise with service mesh architectures like Istio.
Key skills and competencies
- Site Reliability Engineering
- DevOps
- Infrastructure as Code (IaC)
- Terraform
- Kubernetes
- CI/CD
- Observability
- Prometheus
- Grafana
- Go or Python
Skills & topics
- Site Reliability Engineer
- SRE
- DevOps
- Cloud Infrastructure
- Terraform
- Kubernetes
- CI/CD
- Observability
- Prometheus
- Grafana
- Go
- Python
- Automation
- Linux
- Networking
How to get hired
- Tailor your resume: Highlight experience with Terraform, Kubernetes, and CI/CD.
- Showcase automation skills: Emphasize proficiency in Go/Python and IaC principles.
- Demonstrate SRE expertise: Detail experience with observability tools and on-call rotations.
- Prepare for technical interviews: Review Linux internals, networking, and cloud security.
- Express your interest: Clearly articulate your passion for site reliability and system automation.
Frequently asked questions
- What are the core responsibilities of a Site Reliability Engineer at Scale.jobs?
- As a Site Reliability Engineer at Scale.jobs, you will be responsible for building and maintaining infrastructure platforms, ensuring system reliability, scalability, and performance. This includes designing cloud infrastructure, optimizing Kubernetes, developing CI/CD pipelines, implementing observability, and automating operational processes using Go or Python.
- What specific technologies will I use as a Site Reliability Engineer at Scale.jobs?
- You will work with a modern tech stack including Terraform for infrastructure as code, Kubernetes (EKS/GKE) for orchestration, CI/CD tools like GitLab CI/GitHub Actions/Jenkins, and observability tools such as Prometheus, Grafana, Jaeger, and the ELK stack. Proficiency in Go or Python for automation is also key.
- What level of experience is required for this Site Reliability Engineer role at Scale.jobs?
- Scale.jobs is looking for candidates with 3–6 years of experience in Site Reliability Engineering, DevOps, or Systems Engineering roles supporting production environments at scale. Strong proficiency with Terraform and Kubernetes is essential.
- Is there an on-call rotation for the Site Reliability Engineer position at Scale.jobs?
- Yes, this role involves participating in a shared, blameless on-call rotation. The focus is on conducting deep-dive post-mortems and implementing long-term engineering fixes to prevent future production issues.
- What are the 'nice-to-have' skills for a Site Reliability Engineer at Scale.jobs?
- Bonus points are awarded for experience managing relational and non-relational databases (PostgreSQL, DynamoDB, Redis) at scale, or for expertise with service mesh architectures like Istio. Familiarity with Datadog or OpenTelemetry is also beneficial.
- How does Scale.jobs approach system reliability and automation?
- Scale.jobs is committed to moving away from manual operations towards automated, self-healing software-defined infrastructure. The focus is on building reliable services, robust CI/CD pipelines, and clear observability standards using modern SRE methodologies.
