
Site Reliability Engineer (SRE) - Causal AI - MarTech/AdTech
Three Pillars Recruiting · San Francisco, CA
- On site
- Full-time
- $150,000 / year
Job highlights
- Build and maintain scalable infrastructure for ML workloads.
- Improve system reliability with automation and observability.
- Own and evolve CI/CD pipelines and deployment processes.
- Implement monitoring, alerting, and incident response.
- Collaborate to drive performance and reliability culture.
About the role
Site Reliability Engineer
The hiring company is seeking an experienced Site Reliability Engineer (SRE) to help scale its platform with reliability, observability, and operational excellence at the core. You'll partner with engineers and data scientists to build, automate, and maintain the infrastructure that powers the core platform, including data pipelines, ML workloads, and real-time analytics systems. This is a hands-on, high-impact role with visibility across the stack and the opportunity to shape the future of the company's infrastructure and operations.
Key Responsibilities
- Design, build, and maintain scalable infrastructure to support real-time analytics and machine learning workloads
- Improve system reliability and performance through automation, observability, and proactive capacity planning
- Own and evolve CI/CD pipelines, deployment automation, rollback mechanisms, and config management
- Implement and maintain monitoring, alerting, and incident response processes (SLOs, runbooks, on-call rotations)
- Collaborate across engineering and data science teams to drive a culture of performance and reliability
- Ensure security, compliance, and operational readiness across the company's cloud infrastructure
- Drive post-incident analysis and continuous improvement initiatives
What Will Help You Succeed
- 8+ years of experience in SRE, DevOps, or infrastructure engineering roles
- 5+ years of experience with datacenter operations and/or system and network administration
- Experience with containerization (Docker) and orchestration (Kubernetes)
- Strong knowledge of Linux systems, networking, and systems performance tuning
- Solid understanding of infrastructure-as-code (e.g., Terraform, Ansible)
- Good programming skills and the ability to apply sound coding principles to IaC and scripting code, using tools and languages such as Terraform, Ansible, Bash (shell scripting), and/or Python
- Experience with monitoring and observability stacks (e.g., Prometheus, Grafana, Datadog, ELK, OpenTelemetry)
- Proficiency with CI/CD tools and pipelines (e.g., GitHub Actions, ArgoCD)
- Ability to debug complex systems and automate solutions in scripting languages
- Excellent communication skills and the ability to work cross-functionally
Nice-to-Have
- Experience with cloud and managed services (e.g., AWS)
- Experience supporting data-intensive platforms (Spark, Airflow, Kafka, etc.)
- Familiarity with security practices for cloud-native applications and infrastructure
- Experience in high-compliance or SOC 2 environments
Key skills and competencies
- Site Reliability Engineering
- DevOps
- Infrastructure Engineering
- Cloud Computing
- Containerization (Docker)
- Orchestration (Kubernetes)
- Infrastructure as Code (IaC)
- CI/CD
- Monitoring & Observability
- System Administration
Skills & topics
- Site Reliability Engineer
- SRE
- DevOps
- Infrastructure Engineering
- Cloud Computing
- Kubernetes
- Docker
- Terraform
- Ansible
- Prometheus
- Grafana
- Datadog
- Linux
- Automation
- CI/CD
How to get hired
- Tailor your resume: Highlight SRE, DevOps, and infrastructure experience. Quantify achievements in reliability and automation.
- Showcase technical skills: Emphasize experience with Kubernetes, Docker, IaC (Terraform, Ansible), and monitoring tools.
- Demonstrate problem-solving: Provide examples of debugging complex systems and automating solutions with scripting.
- Prepare for technical interviews: Be ready to discuss system design, incident response, and troubleshooting scenarios.
- Highlight collaboration: Showcase experience working cross-functionally with engineering and data science teams.
Frequently asked questions
- What are the core responsibilities of a Site Reliability Engineer at Three Pillars Recruiting?
- As a Site Reliability Engineer at Three Pillars Recruiting, you will be responsible for designing, building, and maintaining scalable infrastructure for real-time analytics and machine learning workloads. This includes improving system reliability through automation, owning CI/CD pipelines, implementing monitoring and alerting, and collaborating with engineering and data science teams to ensure operational excellence.
- What technical skills are most important for this Site Reliability Engineer role?
- The posting asks for 8+ years in SRE/DevOps and 5+ years in datacenter operations or system and network administration. The most important technical skills are experience with Docker and Kubernetes, strong Linux and networking knowledge, proficiency in infrastructure-as-code (Terraform, Ansible), scripting (Bash, Python), and experience with monitoring stacks such as Prometheus or Datadog.
- Does Three Pillars Recruiting offer opportunities for professional growth in this SRE position?
- The posting describes a hands-on, high-impact position with visibility across the stack and the opportunity to shape the future of infrastructure and operations. The work covers ML workloads and real-time analytics systems, and the responsibilities include driving post-incident analysis and continuous improvement initiatives.
- What is the expected experience level for the Site Reliability Engineer position at Three Pillars Recruiting?
- The ideal candidate for the Site Reliability Engineer position at Three Pillars Recruiting will have at least 8 years of experience in SRE, DevOps, or infrastructure engineering roles, with a minimum of 5 years in datacenter operations or system/network administration.
- How does Three Pillars Recruiting approach collaboration for their Site Reliability Engineers?
- Three Pillars Recruiting fosters a collaborative environment for their Site Reliability Engineers. You will partner closely with engineers and data scientists to build and maintain the core platform infrastructure, ensuring a shared culture of performance and reliability across teams.
