
Site Reliability Engineering Lead, Specialist
Vanguard · Malvern, PA
- On site
- Full-time
- $150,000 / year
Job highlights
- Lead SRE initiatives for millions of investors.
- Architect enterprise-scale resiliency solutions.
- Implement OpenTelemetry and AI diagnostics.
- Automate incident response and infrastructure.
- Shape next-gen client experiences at Vanguard.
About the role
Join Vanguard's Personal Investor Technologies Site Reliability Engineering team and lead cutting-edge SRE initiatives impacting hundreds of applications and millions of investors. You will architect and build enterprise-scale resiliency solutions, driving our ambitious 2026 roadmap. This is an opportunity to combine deep technical expertise with strategic influence: designing OpenTelemetry integrations, implementing distributed tracing at scale, automating incident response, and pioneering AI-enhanced diagnostics and analysis. Work alongside a collaborative, technically focused team where your innovations in resilience engineering will shape Vanguard's next generation of client experiences.
At Vanguard, we pride ourselves on delivering an exceptional client experience to all investors. At the core of this experience are systems that reside in a technically complex and constantly evolving resiliency landscape. Passionate, technically skilled engineers are at the center of our resiliency operations, and we are looking to grow our team.
We are seeking an experienced engineer with broad, end-to-end software development experience, including operating applications in a microservices environment in production at scale. This role goes beyond feature implementation; it requires someone who can design, build, and support resilient systems from the ground up.
As a Senior Reliability Engineer at Vanguard, you will play a critical role in solving impactful operational problems. You are curious and take a proactive approach to identifying problems and making improvements. You balance innovative thinking with pragmatism and understand the long-term impacts of technical decisions. You communicate complex ideas clearly and collaborate effectively to deliver scalable solutions.
Core Responsibilities
- Improve resiliency engineering practices across platforms and applications, including resilient application design patterns, system observability and deployment strategies.
- Detect, troubleshoot, and resolve incidents.
- Develop automation for incident response and infrastructure management.
- Develop and support OpenTelemetry integrations for multiple application platforms (browser, ECS, Lambda, etc.) and languages (JavaScript, Java).
- Contribute to architectural decisions and support implementation of solutions.
Skills And Qualifications
- Expertise in JavaScript (server-side and client-side execution environments) or Java.
- Working knowledge of Python (or similar scripting language).
- Strong knowledge of resiliency engineering techniques for both platforms and applications.
- Experience troubleshooting complex production issues and implementing effective mitigations.
- Hands-on experience with AWS services and cloud infrastructure.
- Familiarity with OpenTelemetry specification and core APIs.
- Practical experience developing and operating software in distributed systems environments.
Special Factors
- Sponsorship: Vanguard is not offering visa sponsorship for this position.
About Vanguard
At Vanguard, we don't just have a mission—we're on a mission. To work for the long-term financial wellbeing of our clients. To lead through products and services that transform our clients' lives. To learn and develop our skills as individuals and as a team. From Malvern to Melbourne, our mission drives us forward and inspires us to be our best.
How We Work
Vanguard has implemented a hybrid working model for the majority of our crew members, designed to capture the benefits of enhanced flexibility while enabling in-person learning, collaboration, and connection. We believe our mission-driven and highly collaborative culture is a critical enabler to support long-term client outcomes and enrich the employee experience.
Key skills/competency
- Site Reliability Engineering
- Resiliency Engineering
- OpenTelemetry
- Distributed Tracing
- Automation
- Incident Response
- AWS
- JavaScript
- Java
- Python
How to get hired
- Tailor your resume: Highlight expertise in JavaScript, Java, Python, resiliency engineering, and AWS services. Quantify achievements in improving system reliability.
- Showcase SRE experience: Emphasize your background in operating microservices at scale, incident response automation, and OpenTelemetry implementation.
- Demonstrate problem-solving: Prepare examples of troubleshooting complex production issues and implementing effective, scalable solutions.
- Understand Vanguard's culture: Research their mission, values, and hybrid work model. Align your application with their client-focused approach.
- Prepare for technical interviews: Be ready to discuss distributed systems, cloud infrastructure, and SRE best practices.
Frequently asked questions
- What are the key technical skills required for the Site Reliability Engineering Lead role at Vanguard?
- The Site Reliability Engineering Lead role at Vanguard requires strong expertise in JavaScript or Java, a working knowledge of Python, and deep experience in resiliency engineering for both platforms and applications. You'll also need hands-on experience with AWS services, cloud infrastructure, troubleshooting complex production issues, and familiarity with OpenTelemetry.
- Does Vanguard offer visa sponsorship for the Site Reliability Engineering Lead position?
- No, Vanguard is not offering visa sponsorship for this Site Reliability Engineering Lead position.
- What is the work environment like for a Site Reliability Engineering Lead at Vanguard?
- Vanguard employs a hybrid working model, balancing flexibility with in-person collaboration. The Site Reliability Engineering team is technically focused and collaborative, working on cutting-edge SRE initiatives.
- What kind of impact can a Site Reliability Engineering Lead have at Vanguard?
- As a Site Reliability Engineering Lead at Vanguard, you will lead initiatives impacting hundreds of applications and millions of investors. You will architect enterprise-scale resiliency solutions, drive the 2026 roadmap, and shape the next generation of client experiences through innovations in resilience engineering.
- How important is experience with OpenTelemetry for this Site Reliability Engineering Lead role?
- Experience with OpenTelemetry is important, as a core responsibility includes developing and supporting OpenTelemetry integrations for multiple application platforms and languages. Familiarity with the OpenTelemetry specification and core APIs is expected.
- What is Vanguard's approach to site reliability and resiliency engineering?
- Vanguard describes the systems at the core of its client experience as residing in a technically complex and constantly evolving resiliency landscape. It seeks passionate, technically skilled engineers to improve resiliency engineering practices, incident detection and resolution, and automation.
