What are the key responsibilities for a Site Reliability Engineer at Drivetrain?

As a Site Reliability Engineer at Drivetrain, you will be responsible for architecting and managing multi-cloud infrastructure on AWS and GCP, leading Kubernetes orchestration, implementing CI/CD pipelines, driving automation with Python and Terraform, and enhancing the observability stack with tools like Prometheus and Grafana. You will also own incident response and champion a culture of reliability.

What cloud platforms does Drivetrain utilize for its infrastructure?

Drivetrain runs a multi-cloud infrastructure on Amazon Web Services (AWS) and Google Cloud Platform (GCP), and the role requires deep, proven proficiency in both. This includes expertise in services such as EKS, GKE, EC2, Compute Engine, RDS, and Cloud SQL.

What level of experience is required for the Senior Site Reliability Engineer role at Drivetrain?

The role requires 5+ years of hands-on experience in Site Reliability Engineering, DevOps, or Cloud Infrastructure, preferably within a fast-paced SaaS environment. This experience should include advanced knowledge of containerization, automation, and observability tools.

How does Drivetrain foster a positive work culture for its remote employees?

Drivetrain is a remote-first company that emphasizes a supportive culture through remote-friendliness, open and transparent communication, an idea-friendly environment that encourages risk-taking and learning, and a customer-centric approach. The company trusts its employees and gives them a great degree of autonomy.

What are the essential technical skills for this Site Reliability Engineer position?

Essential technical skills include expert-level knowledge of Docker and Kubernetes, strong programming skills in Python, extensive experience with Terraform for IaC, hands-on expertise with Prometheus, Grafana, and log aggregation stacks (ELK/EFK), and a solid understanding of cloud networking and security principles.

How does Drivetrain approach hiring and decision-making?

Drivetrain may use AI tools to assist in reviewing applications and assessing candidates, but human judgment remains central to the hiring process. Final hiring decisions are always made by humans.

What is the company's background and funding status?

Drivetrain was founded in 2021 by ex-Googlers. It is a fast-growing SaaS company backed by leading venture capital firms.

Site Reliability Engineer - SRE at Drivetrain | Apply at Drivetrain

Senior Site Reliability Engineer at Drivetrain

Drivetrain is on a mission to empower businesses to make better decisions. Our financial planning & decision-making platform helps companies scale and achieve their targets predictably.

Drivetrain is a remote-first company headquartered in the San Francisco Bay Area. Founded in 2021 by a couple of ex-Googlers, Drivetrain is a fast-growing company on a trajectory for success with backing from leading venture capital firms.

Drivetrain provides a great culture for its employees to thrive in and be happy.

💜 Remote-friendly: Drivetrain brings together the best and the brightest, no matter where they are, and gives them a great degree of autonomy. We trust our people.
🗣️ Open & transparent: We know that when our creators have access to all the information they need, their best work will emerge.
👏 Idea-friendly: We provide an environment to explore new ideas, to take risks, to make mistakes, and to learn, so you can succeed. Anyone in the company can come up with great ideas and become a catalyst for positive change. We let the best ideas win.
👥 Customer-centric: We follow a product-led growth strategy, continuously learning from our customers and collaborating to build the amazing software that Drivetrain is.

About the Role

As a Senior Site Reliability Engineer at Drivetrain, you will be a cornerstone of our engineering organization, ensuring our fast-growing SaaS platform remains highly available, performant, and secure. At this stage of our growth, scaling infrastructure efficiently while maintaining the rigorous security and reliability standards required for financial data is paramount. You will take ownership of our multi-cloud infrastructure, drive automation, champion observability, and collaborate closely with development teams to build a culture of reliability from code commit to production.

Key Responsibilities

Cloud Infrastructure & Orchestration

Multi-Cloud Management: Architect, manage, and continuously optimize highly available cloud infrastructure across both AWS and GCP. Balance workload demands to ensure maximum cost-efficiency, scalability, and strict security compliance across both platforms.
Advanced Kubernetes Orchestration: Lead the design, deployment, and management of scalable Kubernetes clusters. Utilize configuration management tools like Kustomize to enforce standardized, repeatable, and automated deployment configurations across all environments.
Service Mesh & Security Integration: Implement and maintain service mesh technologies (e.g., Istio, Linkerd) to secure, control, and observe service-to-service communication. Drive container security best practices, including image scanning, runtime protection, and strict RBAC enforcement.

CI/CD & Automation

Pipeline Engineering: Architect, maintain, and optimize robust CI/CD pipelines using Git and Jenkins. Focus on reducing deployment friction, accelerating release velocity, and enforcing automated testing and security gates.
Infrastructure as Code (IaC): Treat infrastructure as software. Write, review, and maintain Terraform modules to provision and manage cloud resources predictably and safely.
Operational Automation: Aggressively reduce operational toil. Develop robust Python scripts and tooling to automate routine maintenance, data backups, scaling operations, and system recovery processes.

Observability & Reliability

Comprehensive Monitoring: Design and enhance our observability stack to provide deep, real-time insights into system health. Manage and scale tools including Prometheus, Grafana, ELK/EFK stack, AWS CloudWatch, and GCP Operations Suite.
Reliability Engineering: Spearhead reliability initiatives critical to a scaling SaaS platform. Drive rigorous capacity planning exercises to stay ahead of growth.
Incident Management & SLOs: Own the incident response lifecycle. Facilitate blameless postmortems to extract actionable learnings. Define, track, and enforce SLIs, SLOs, and SLAs, ensuring the platform consistently meets its reliability guarantees.
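For candidates less familiar with the SLO vocabulary above, here is a minimal Python sketch of an error-budget calculation, one common way an SLO is tracked. The function name, the request-based SLI, and the sample numbers are illustrative assumptions and do not describe Drivetrain's actual tooling or targets.

```python
def error_budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Return the fraction of the error budget still unspent.

    Hypothetical helper: assumes a request-based availability SLI.
    A negative result means the budget has been overspent.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    if total_requests <= 0:
        # No traffic in the window, so nothing has been spent.
        return 1.0
    # Failures the SLO tolerates over this window.
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


# Illustrative numbers only: a 99.9% target over one million requests
# tolerates about 1,000 failures, so 250 failures spend a quarter of it.
remaining = error_budget_remaining(0.999, 1_000_000, 250)
print(f"Error budget remaining: {remaining:.1%}")
```

A value near zero or below would typically prompt a team to slow releases and prioritize reliability work until the budget recovers.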

Collaboration & Leadership

DevOps Culture: Act as an embedded reliability advocate. Collaborate closely with software engineers early in the development lifecycle to ensure applications are designed for deployability, scalability, and resilience.
Continuous Improvement: Proactively identify system bottlenecks and architectural weaknesses. Contribute to process improvements, build internal developer tooling, and maintain comprehensive documentation to elevate team productivity and system understanding.

Required Proficiency & Qualifications

Experience: 5+ years of hands-on experience in Site Reliability Engineering, DevOps, or Cloud Infrastructure roles, preferably within a fast-paced SaaS environment.
Cloud Platforms: Deep, proven proficiency in AWS (EC2, EKS, RDS, VPC, IAM, S3) AND GCP (GKE, Compute Engine, Cloud SQL, IAM, Cloud Storage). Ability to navigate and optimize multi-cloud architectures.
Containerization: Expert-level knowledge of Docker and Kubernetes, including advanced deployment strategies and lifecycle management.
Automation/IaC: Strong programming skills in Python and extensive experience with Terraform.
Observability: Hands-on expertise building dashboards and alerting systems using Prometheus, Grafana, and log aggregation stacks (ELK/EFK).
Networking & Security: Solid understanding of cloud networking (VPC peering, load balancing, DNS) and zero-trust security principles in a containerized environment.

Sounds exciting? Apply at careers@drivetrain.ai. It may just be the next best decision you’ve ever made!

We may use artificial intelligence (AI) tools to support parts of the hiring process, such as reviewing applications, analyzing resumes, or assessing responses. These tools assist our recruitment team but do not replace human judgment. Final hiring decisions are ultimately made by humans. If you would like more information about how your data is processed, please contact us.

Key skills/competency

Site Reliability Engineering
DevOps
Cloud Infrastructure
AWS
GCP
Kubernetes
Docker
Terraform
Python
Observability

Tailor your resume: Highlight your 5+ years of SRE experience, AWS/GCP proficiency, and Kubernetes expertise. Quantify achievements in automation and reliability.
Showcase technical skills: Emphasize your experience with Python, Terraform, Prometheus, Grafana, and CI/CD tools like Jenkins.
Demonstrate collaboration: Provide examples of how you've embedded DevOps culture and worked with development teams.
Apply strategically: Send your application to careers@drivetrain.ai, clearly stating your interest in the Senior Site Reliability Engineer role.
Prepare for technical interviews: Be ready to discuss multi-cloud architecture, Kubernetes, infrastructure as code, and incident management scenarios.
