Site Reliability Engineer, Data, Cloud & Developer Experience
Blackstone

Job Description
About Blackstone
Blackstone is the world’s largest alternative asset manager, focused on creating positive economic impact and long-term value for investors, companies, and communities. With $1.1 trillion in assets under management across various investment vehicles globally, Blackstone leverages extraordinary people and flexible capital to solve problems. Further information is available at www.blackstone.com.
Role: Site Reliability Engineer, Data, Cloud & Developer Experience
Blackstone's Site Reliability Engineering team is dedicated to enhancing the reliability of systems and services to meet critical business needs. This involves close collaboration with development and engineering teams to integrate SRE practices and principles. In this role, you will identify and resolve emerging problems, deploy and maintain robust observability systems and pipelines, and optimize operations and support for various services and platforms. You will also pursue new opportunities to drive efficiency and business value, focusing on the selection, implementation, and maintenance of key observability tooling. Continuous evaluation of the firm's requirements in observability, monitoring, alerting, resilience, and recovery is essential.
You will work alongside service owners to design, implement, and manage services for continuous improvement, ensuring requisite reliability through clear definitions and measurable targets. This includes planning and practicing disaster recovery scenarios and responding to incidents in real time. Guiding the postmortem process will be crucial to mitigate risks, prevent future disruptions, and improve the on-call experience. A core aim is to eliminate manual work, improve operational efficiency, and ensure high-quality outputs across all activities.
Key Responsibilities:
- Assist technical leadership in promoting and integrating SRE methodologies firm-wide.
- Incorporate observability standards into code and deployment pipelines.
- Contribute to the evolution of SRE standards adopted by all teams.
- Partner with colleagues across various roles and reporting lines to boost service reliability and operational efficiency.
- Provide direct assistance to developers and engineers, leveraging AI assistants.
- Implement instrumentation and deliver comprehensive performance insights to service owners.
- Ensure monitoring and alerting accurately reflect service reliability for users and facilitate effective on-call operations.
- Implement strategic observability tools and manage overhead in maintenance and cost.
- Participate in on-call rotations and respond to system incidents to maintain service availability and minimize operational impact.
- Utilize automation to manage, maintain, and scale SRE systems with minimal human intervention.
- Foster a blameless team culture while contributing to postmortem discussions and reporting.
Qualifications:
- 2+ years of professional experience in Infrastructure Engineering, Software Engineering, DevOps Engineering, or Platform Engineering.
- Proficient in writing automation scripts and troubleshooting code (e.g., Python, C#, TypeScript).
- Skilled in using coding assistants and chat models (e.g., those from Anthropic or OpenAI).
- Proficiency with public cloud providers, with strong AWS experience required and Azure experience preferred.
- Experience with Configuration-as-Code, infrastructure management, and CI/CD tooling (e.g., Terraform, Puppet, GitLab CI).
- Hands-on experience with Docker and container schedulers, including AWS ECS & EKS.
- Excellent troubleshooting skills for Linux and Windows environments, combined with networking experience.
- Familiarity with observability tools such as Grafana, Prometheus, and Splunk.
- Comfortable managing incidents under pressure and collaborating effectively during postmortems.
- Demonstrated excellent communication and organizational skills.
- Exhibits curiosity and motivation to enhance systems and processes through a sense of shared ownership.
Key skills/competencies
- Site Reliability Engineering
- Cloud Platforms (AWS)
- DevOps Practices
- Automation Scripting
- Observability Tools (Grafana, Prometheus, Splunk)
- Incident Management
- CI/CD
- Terraform
- Docker and container schedulers (AWS ECS, EKS)
- Python
How to Get Hired at Blackstone
- Research Blackstone's culture: Study their mission, values, recent news, and employee testimonials on LinkedIn and Glassdoor to understand their high-performance environment.
- Tailor your SRE resume: Highlight extensive experience with AWS, Python, automation, and observability tools relevant to a Site Reliability Engineer at Blackstone.
- Showcase problem-solving skills: Prepare specific examples of how you've handled critical incidents, optimized systems, and driven efficiency improvements.
- Master technical fundamentals: Be ready to discuss your hands-on experience with Terraform, Docker, container schedulers (AWS ECS and EKS), and CI/CD, demonstrating deep cloud engineering knowledge.
- Emphasize collaboration and ownership: Illustrate your ability to partner with development teams, foster a blameless culture, and take shared ownership of system reliability.