
Site Reliability Engineer
Cognition · San Francisco, CA
- On site
- Full-time
- $280,000 / year
Job highlights
- Own production reliability for AI products.
- Manage CI/CD and developer infrastructure.
- Implement SLOs, incident response, and on-call.
- Utilize cloud infrastructure and IaC skills.
- Drive reliability culture and automation.
About the role
About Cognition
We are an applied AI lab building end-to-end software agents. We're the team behind Devin, the first AI software engineer, and Windsurf, an AI-native IDE. These products represent our vision for AI that doesn't just assist engineers, but works alongside them as a genuine teammate.
Our team is small and talent-dense: world-class competitive programmers, former founders, and researchers from the frontier of AI, including Scale AI, Palantir, Cursor, Google DeepMind, and others.
Role Mission
Devin and Windsurf are used by hundreds of thousands of developers every day. When something goes wrong, it goes wrong for all of them at once. This role exists to make sure that doesn't happen, and when it does, to make sure it's resolved faster than anyone expects.
You will own both the production reliability of our user-facing products and the platform engineering that lets our team ship quickly and confidently. That means SLOs, incident response, and on-call on one side, and CI/CD pipelines, deployment infrastructure, and developer tooling on the other. At Cognition, these are not separate jobs. The best SREs here understand that reliability is engineered in, not bolted on.
What You'll Accomplish
- Production Reliability: Define and own SLOs, SLIs, and error budgets for Devin and Windsurf. Build the monitoring, alerting, and observability systems that give the team a clear, honest picture of service health at all times.
- Incident Response and On-Call: Lead incident response with speed and clarity. Run blameless postmortems that turn outages into durable improvements. Build the runbooks and tooling that make on-call sustainable and effective.
- Platform Engineering and CI/CD: Own the deployment pipelines, release infrastructure, and internal developer tooling that let the team ship fast without breaking things. Reduce toil systematically so engineers spend time on work that matters.
- Infrastructure as Code: Manage cloud infrastructure through code. Build reproducible, auditable, version-controlled environments that scale with the product and eliminate configuration drift.
- Capacity Planning and Performance: Model growth, forecast resource needs, and ensure the infrastructure stays ahead of demand. Profile and improve system performance before users feel it.
- Security and Reliability as One: Treat security not as a separate concern but as a reliability requirement. Ensure that misconfigurations, vulnerabilities, and access failures are caught and remediated with the same urgency as outages.
- Reliability Culture: Partner closely with product and engineering teams to build reliability in from the start. Be the person who catches the single point of failure in the architecture review before it becomes a page at 2am.
Exceptional Candidates Have Demonstrated
- Deep experience running production systems at scale: SLOs, error budgets, on-call rotations, and incident command
- Strong software engineering fundamentals; SRE at Cognition means writing real code, not just configuring tools
- Proficiency with cloud infrastructure (AWS, GCP, or Azure), container orchestration (Kubernetes), and infrastructure as code (Terraform or equivalent)
- Experience building and owning CI/CD pipelines and deployment infrastructure for fast-moving product teams
- Strong observability instincts: knows how to instrument systems, build useful dashboards, and design alerts that surface signal without generating noise
- A track record of reducing toil systematically through automation, not just working around it
- Comfort owning incidents end to end: detection, triage, mitigation, resolution, and postmortem
- Enough product empathy to understand what reliability means from a user's perspective, not just an infrastructure one
- Experience with developer-facing products or platforms is a strong plus
Resources & Environment
- Small, highly selective team shipping products used by hundreds of thousands of developers daily
- High ownership and high trust: you'll set the reliability bar, not inherit someone else's standards
- The environment rewards engineers who are proactive, systematic, and treat reliability as a craft, not a checklist
Compensation & Benefits
- Base Salary: $260,000 - $300,000 + significant early-stage equity
- Medical, Dental, Vision: Fully paid for you and your dependents
- 401(k): Company match included
- Perks: Private chef, cozy slippers, endless snacks, and more
Equal Opportunity
Cognition is an equal opportunity employer. We do not discriminate on the basis of race, color, religion, sex, sexual orientation, gender identity, national origin, age, disability, veteran status, or any other protected characteristic under applicable law. We are committed to providing reasonable accommodations for candidates with disabilities throughout the hiring process. Please let us know if you need one.
Key skills and competencies
- Site Reliability Engineering (SRE)
- Production Systems at Scale
- Software Engineering Fundamentals
- Cloud Infrastructure (AWS, GCP, Azure)
- Kubernetes
- Infrastructure as Code (Terraform)
- CI/CD Pipelines
- Observability and Monitoring
- Incident Response
- Developer Tooling
Skills & topics
- Site Reliability Engineer
- SRE
- Production Systems
- Software Engineering
- Cloud Infrastructure
- AWS
- GCP
- Azure
- Kubernetes
- Terraform
- CI/CD
- Observability
- Incident Response
- Automation
- Developer Tools
- AI
- Software Agents
- Devin
- Windsurf
- Platform Engineering
How to get hired
- Tailor your resume: Highlight experience with production systems, SLOs, incident response, and cloud infrastructure relevant to Cognition's AI products.
- Showcase software engineering skills: Emphasize your ability to write code for SRE tasks, not just configure tools.
- Demonstrate IaC and CI/CD expertise: Provide examples of managing cloud infrastructure with tools like Terraform and building robust deployment pipelines.
- Quantify achievements: Use data to show the impact of your work in reducing toil, improving reliability, or speeding up deployments.
- Research Cognition's mission: Understand their vision for AI software agents like Devin and Windsurf to align your application with their goals.
Frequently asked questions
- What is the typical career growth for a Site Reliability Engineer at Cognition?
- At Cognition, Site Reliability Engineers can grow by taking on more complex system ownership, leading critical incident responses, and shaping the future of platform engineering for Cognition's AI products. Given the company's focus on high ownership and a selective team, there's significant potential for impact and advancement.
- What are the main challenges a Site Reliability Engineer will face at Cognition?
- The primary challenges for a Site Reliability Engineer at Cognition involve ensuring the extreme reliability of rapidly evolving AI products used by hundreds of thousands of developers. This includes managing complex production systems, leading incident response for critical services, and building scalable infrastructure in a fast-paced environment.
- How does Cognition approach on-call rotations for Site Reliability Engineers?
- Cognition emphasizes making on-call sustainable and effective. This involves building robust runbooks and tooling, ensuring clear incident response procedures, and implementing systems that minimize unnecessary pages. The goal is to ensure that when incidents occur, they are managed with speed and clarity, and lead to durable improvements.
- What specific AI products will a Site Reliability Engineer be supporting at Cognition?
- A Site Reliability Engineer at Cognition will be directly responsible for the production reliability and platform engineering of Devin, the first AI software engineer, and Windsurf, an AI-native IDE. These are Cognition's flagship products, used daily by a large developer community.
- Does Cognition require specific cloud provider experience for the Site Reliability Engineer role?
- Proficiency with cloud infrastructure is essential, but Cognition is open to candidates with deep experience in AWS, GCP, or Azure. The focus is on your ability to manage cloud infrastructure effectively using Infrastructure as Code principles, rather than on experience with one particular vendor.
- How important is software engineering ability for a Site Reliability Engineer at Cognition?
- Software engineering fundamentals are critical for SREs at Cognition. The role involves writing real code to build and improve systems, automate tasks, and solve complex reliability challenges, rather than solely configuring existing tools. Demonstrating strong coding skills is a key requirement.
