
SRE Lead Engineer
Litmus7 · San Francisco, CA
- On site
- Full-time
- $150,000 / year
Job highlights
- Own production reliability as the Lead SRE Engineer.
- Manage incidents and operational improvements.
- Use Dynatrace and Splunk for observability and analysis.
- Improve monitoring and alerting systems.
- Mentor SRE teams and communicate with stakeholders.
About the role
Lead SRE Engineer
We are looking for a hands-on Lead SRE Engineer to work onsite and own production reliability, observability, incident response, and operational improvements for enterprise-scale ecommerce systems. The candidate should be technically strong and able to lead P1/P2 incident triage, work directly with client stakeholders, guide offshore teams, and drive improvements across monitoring, alerting, dashboards, runbooks, and automation.
Key Responsibilities
- Lead onsite production triage for critical incidents and coordinate with application, infrastructure, DevOps, database, network, and offshore teams.
- Monitor and support business-critical ecommerce flows such as checkout, order capture, payment, inventory, promotions, and fulfilment integrations.
- Use Dynatrace and Splunk to analyse logs, metrics, traces, service health, latency, failure rates, and downstream dependencies.
- Build and maintain dashboards for SRE operations, service owners, and leadership visibility.
- Improve alerting by reducing noise, defining meaningful thresholds, and aligning alerts with customer impact and SLOs.
- Drive root cause analysis, post-incident reviews, corrective actions, and preventive improvements.
- Create and maintain runbooks, SOPs, troubleshooting guides, and operational playbooks.
- Identify automation and AI-assisted triage opportunities to improve incident response and operational efficiency.
- Mentor SRE/support engineers and ensure smooth onsite-offshore coordination and handovers.
- Communicate incident status, business impact, risks, and next steps clearly to client stakeholders.
Required Skills
- 8+ years of experience in SRE, production support, DevOps, platform engineering, or application operations.
- Strong hands-on experience with Dynatrace and Splunk.
- Good understanding of microservices, APIs, distributed systems, Kubernetes, containers, and cloud platforms.
- Experience supporting high-volume ecommerce or enterprise production systems.
- Strong knowledge of incident management, root cause analysis, monitoring, alerting, and SLO/SLA practices.
- Ability to analyse application performance issues including latency, throughput, error rates, pod restarts, CPU/memory, database latency, and third-party dependency issues.
- Strong communication skills with the ability to explain technical issues to both engineering and leadership teams.
- Experience leading onsite-offshore coordination and mentoring engineers.
Preferred Skills
- Retail or ecommerce domain experience.
- Experience with order capture, checkout, payment, inventory, or OMS flows.
- Knowledge of Dynatrace DQL, Grail, Smartscape, Davis AI, OpenPipeline, and SLOs.
- Experience with ServiceNow, Jira, PagerDuty, Teams, or similar incident-management integrations.
- Scripting or automation experience using Python, shell scripting, or similar tools.
- Exposure to AI-assisted triage, self-healing, or runbook automation.
Key skills/competency
- Site Reliability Engineering (SRE)
- Production Support
- DevOps
- Incident Management
- Root Cause Analysis
- Monitoring and Alerting
- Dynatrace
- Splunk
- Kubernetes
- Ecommerce Systems
Skills & topics
- SRE Lead Engineer
- Site Reliability Engineering
- Production Support
- DevOps
- Incident Management
- Root Cause Analysis
- Monitoring
- Alerting
- Dynatrace
- Splunk
- Kubernetes
- Ecommerce
How to get hired
- Tailor your resume: Highlight SRE, production support, and DevOps experience. Emphasize Dynatrace, Splunk, and e-commerce system expertise.
- Showcase leadership: Detail experience leading incident response, mentoring teams, and client communication. Quantify achievements where possible.
- Prepare for technical questions: Review microservices, Kubernetes, cloud platforms, and performance analysis. Be ready to discuss incident scenarios.
- Understand the company: Research Litmus7's focus on e-commerce and enterprise-scale systems. Align your answers with their operational needs.
Frequently asked questions
- What are the primary responsibilities of a Lead SRE Engineer at Litmus7?
  As a Lead SRE Engineer at Litmus7, you will be responsible for production reliability, observability, incident response, and driving operational improvements for enterprise-scale e-commerce systems. This includes leading incident triage, monitoring critical business flows, analyzing performance with tools like Dynatrace and Splunk, improving alerting and dashboards, and mentoring SRE teams.
- What technical skills are essential for this Lead SRE Engineer role at Litmus7?
  Essential technical skills include 8+ years of experience in SRE, production support, or DevOps, with strong hands-on experience in Dynatrace and Splunk. A good understanding of microservices, Kubernetes, containerization, and cloud platforms is crucial. Experience supporting high-volume e-commerce systems and knowledge of incident management and monitoring practices are also required.
- Does Litmus7 offer opportunities for career growth for SRE Engineers?
  Litmus7 emphasizes leading operational improvements and mentoring SRE/support engineers. This role provides opportunities to drive significant impact on critical e-commerce systems, refine advanced monitoring and automation strategies, and develop leadership skills, which can pave the way for further career advancement within the company.
- What kind of e-commerce experience is beneficial for this role?
  Experience supporting high-volume e-commerce or enterprise production systems is highly beneficial. Specific experience with business-critical flows like checkout, order capture, payment, inventory, promotions, and fulfillment integrations is preferred, along with a general understanding of the retail or e-commerce domain.
- How does Litmus7 handle incident management and response for its e-commerce platforms?
  Litmus7 focuses on onsite production triage for critical incidents, coordinating across various technical teams and offshore counterparts. The role involves utilizing tools like Dynatrace and Splunk for analysis, improving alerting to reduce noise, and driving root cause analysis and post-incident reviews to implement corrective and preventive measures.
