
Site Reliability Engineer Lead - Senior Vice President
Citi · New York, NY
- On-site
- Full-time
- $265,000 / year
Job highlights
- Lead SRE strategy and operations for critical systems.
- Drive observability, resiliency, and recovery initiatives.
- Ensure operational resilience and compliance.
- Collaborate with development for application reliability.
- Provide technical leadership and strategic guidance.
About the role
Senior Vice President Site Reliability Engineer Lead
The Site Reliability Engineer (SRE) is a strategic professional accountable for the daily operations, architectural resilience, and overall implementation of SRE principles in a complex, critical, and large-scale multi-disciplinary environment. This role requires a comprehensive understanding of multiple technology domains and how they interact to achieve business objectives. As a recognized technical authority, you will apply an in-depth understanding of the business impact of technical contributions and provide advice and counsel on strategic solutions.
We are seeking a passionate and experienced SRE to join our Production Management team. In this role, you will be instrumental in enhancing the reliability, performance, and efficiency of our applications and services. You will drive our strategy for end-to-end observability and resiliency, collaborating across the organization to ensure our services are stable, scalable, and fault-tolerant. This is a key role that will influence strategic decisions and foster a culture of technical excellence and accountability.
Key Responsibilities
Culture & Strategy
- Foster a culture of transparency, innovation, and accountability that encourages continuous improvement.
- Communicate the progress and impact of SRE initiatives to stakeholders at all levels.
- Operate effectively within a highly regulated environment, ensuring compliance with all relevant requirements.
Resiliency & Recovery
- Ensure critical business applications meet stringent operational resilience requirements, including adherence to defined impact tolerances.
- Oversee advanced recovery testing, including Production Swing Tests, Data Recovery Tests, and chaos engineering practices.
- Drive the adoption and development of automation, such as One Touch Recovery solutions, to minimize recovery time.
- Partner with development teams to leverage cloud-native services and established resiliency patterns to enhance application reliability.
Observability & Performance
- Collaborate across the organization to develop and scale observability solutions using modern tools for metrics, logging, and tracing.
- Partner with development teams to effectively instrument applications, providing deep insights into system health and performance.
Essential Skills
- Deep understanding of SRE concepts, including SLOs, SLIs, error budgets, and toil reduction.
- Demonstrable experience with Disaster Recovery planning, resiliency testing, and fault-tolerant distributed system design.
- Proficiency in deploying, managing, and troubleshooting applications on OpenShift/Kubernetes.
- Hands-on experience with modern observability tools (e.g., Prometheus, Grafana, Loki, Mimir, Tempo, AppDynamics).
- Experience with Infrastructure as Code (IaC), configuration management, and automation tools (e.g., Ansible, Terraform).
- Experience creating, modifying, and managing Helm charts for application deployment.
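For candidates brushing up on the SLO and error budget concepts named above, the arithmetic can be sketched as follows. This is a minimal illustration and not part of Citi's posting: the function names, the 99.9% target, and the 30-day window are all assumptions chosen for the example.

```python
# Illustrative sketch only: names, targets, and windows are hypothetical.

def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Downtime (in minutes) an availability SLO permits over the window."""
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_remaining(slo_target: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent for a request-based SLI.

    The SLI here is good_events / total_events. A negative result means
    the budget has been overspent.
    """
    if total_events == 0:
        return 1.0  # no traffic, so nothing has been spent
    allowed_failures = (1.0 - slo_target) * total_events
    actual_failures = total_events - good_events
    if allowed_failures == 0:
        return 1.0 if actual_failures == 0 else 0.0
    return 1.0 - actual_failures / allowed_failures


if __name__ == "__main__":
    # A 99.9% target over 30 days leaves roughly 43.2 minutes of budget.
    print(round(error_budget_minutes(0.999, 30), 1))
    # 500 failures out of 1,000,000 requests spends about half the budget.
    print(round(budget_remaining(0.999, 999_500, 1_000_000), 2))
```

In practice the SLI would come from an observability stack such as the tools listed above rather than from hand-entered counts.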
Desired Skills
- Experience with major public cloud providers (e.g., Google Cloud, AWS, Azure).
- Proven experience delivering software and infrastructure using Agile frameworks.
- Experience presenting technical strategy to senior and executive-level audiences.
- Experience writing or maintaining code in Java, Python, Go, or similar languages.
Qualifications
- 10+ years of significant professional experience in production management, software development, or an equivalent field, with a strong focus on Site Reliability Engineering.
- Expertise in analyzing complex application, database, network, and OS issues within large-scale, customer-facing systems.
- A service-oriented attitude combined with excellent problem-solving and strategic thinking skills.
- Strong communication and diplomacy skills, with a proven ability to work effectively across multiple business and technical teams.
Key Skills/Competency
- Site Reliability Engineering
- Production Management
- Cloud Native Services
- Observability
- Resiliency
- Disaster Recovery
- Kubernetes
- Prometheus
- Terraform
- Python
Skills & topics
- Site Reliability Engineer
- SRE Lead
- Production Management
- Cloud Engineering
- Observability
- Resiliency
- Kubernetes
- Terraform
- Prometheus
- Senior Vice President
How to get hired
- Tailor your resume: Highlight SRE experience, cloud proficiency, and leadership skills relevant to Citi's needs.
- Showcase technical expertise: Emphasize experience with Kubernetes, observability tools, and IaC.
- Demonstrate leadership: Provide examples of strategic thinking and cross-functional collaboration.
- Prepare for interviews: Be ready to discuss SRE principles, system design, and recovery strategies.
- Understand the environment: Research Citi's commitment to operational resilience and regulated industries.
Frequently asked questions
- What are the key responsibilities of a Senior Vice President Site Reliability Engineer Lead at Citi?
- The Senior Vice President Site Reliability Engineer Lead at Citi is responsible for the daily operations, architectural resilience, and implementation of SRE principles in large-scale environments. This includes fostering a culture of continuous improvement, ensuring operational resilience, overseeing recovery testing, and developing observability solutions. The role also involves collaborating with development teams and providing strategic guidance.
- What technical skills are essential for the Senior Vice President Site Reliability Engineer Lead position at Citi?
- Essential technical skills include a deep understanding of SRE concepts (SLOs, SLIs, error budgets, toil reduction), experience with Disaster Recovery planning and resiliency testing, proficiency in OpenShift/Kubernetes, hands-on experience with observability tools (e.g., Prometheus, Grafana), Infrastructure as Code and automation tools (e.g., Terraform, Ansible), and Helm chart management.
- Does Citi's Senior Vice President Site Reliability Engineer Lead role require cloud experience?
- Not strictly. Experience with major public cloud providers such as Google Cloud, AWS, or Azure is listed under desired skills rather than essential skills for the Senior Vice President Site Reliability Engineer Lead role at Citi, so it strengthens an application but is not stated as a requirement.
- What kind of experience is expected for the Senior Vice President Site Reliability Engineer Lead at Citi?
- The role requires at least 10 years of significant professional experience in production management, software development, or a related field, with a strong focus on Site Reliability Engineering. Expertise in analyzing complex system issues in large-scale, customer-facing environments is also crucial.
- How does Citi approach operational resilience for its critical applications, and how does the SRE Lead contribute?
- Citi emphasizes stringent operational resilience requirements for critical business applications, including defined impact tolerances. The SRE Lead plays a pivotal role by overseeing advanced recovery testing, driving automation for faster recovery, and partnering with development teams to embed resiliency into application design using cloud-native services and best practices.
- What is the expected impact of the Senior Vice President Site Reliability Engineer Lead on Citi's technology strategy?
- The Senior Vice President Site Reliability Engineer Lead is expected to influence strategic decisions related to SRE, observability, and resiliency. They will foster technical excellence, drive innovation, and communicate the impact of SRE initiatives across the organization, ensuring services are stable, scalable, and fault-tolerant.
