Site Reliability Engineer II @ Atlan
Job Details
About the Role
At Atlan, our Site Reliability Engineer II is a key member of the Platform & Reliability Engineering team. You will strengthen alert management and incident response capabilities to ensure fast, reliable, and uninterrupted customer experiences.
Your Mission at Atlan
As a Site Reliability Engineer II, you will:
- Own and operate end-to-end system reliability
- Manage incidents within defined SLAs (60 minutes for Critical, 180 minutes for High)
- Enhance observability by refining monitoring systems and alerts
- Automate operations and incident workflows to eliminate manual tasks
- Collaborate with Platform, Observability, and Product Engineering teams
- Contribute to documentation and playbooks for process improvement
What Makes You a Great Fit
You have proven experience managing alerts, incidents, and root cause analysis in production environments. You have hands-on experience with cloud platforms (AWS, GCP, or Azure) and Kubernetes, along with expertise in monitoring tools such as Prometheus, Grafana, ELK/EFK, or Datadog. Strong scripting skills (Python, Bash, or another shell) and excellent communication abilities are essential.
Why You'll Love Working at Atlan
Joining Atlan means making a real impact from day one while working on a modern tech stack that includes Kubernetes, Terraform, Prometheus, and Datadog. You will work with world-class engineers in a learning culture, enjoy autonomy, and have a clear growth path from SRE II to principal levels.
About Atlan
At Atlan, we transform data chaos into clarity for Fortune 500 leaders and hyper-growth startups alike. Backed by top investors and recognized by Gartner and Forrester, we are a fully remote company trusted by global leaders like Cisco, Nasdaq, and HubSpot.
Key skills/competencies
reliability, incident response, automation, monitoring, Kubernetes, cloud, scripting, observability, troubleshooting, documentation
How to Get Hired at Atlan
🎯 Tips for Getting Hired
- Research Atlan's culture: Understand their mission, values, and tech stack.
- Customize your resume: Highlight cloud, Kubernetes, and automation skills.
- Prepare for technical interviews: Practice incident management and scripting challenges.
- Showcase collaboration: Emphasize teamwork and communication experience.