Job Overview
Job Title: Senior Site Reliability Engineer
Job Type: Full Time
Offered Salary: ₹0
Location: Hybrid

Job Description
About AgileEngine
AgileEngine is an Inc. 5000 company that creates award-winning software for Fortune 500 brands and trailblazing startups across 17+ industries. We rank among the leaders in areas like application development and AI/ML, and our people-first culture has earned us multiple Best Place to Work awards.
Why Join Us
If you're looking for a place to grow, make an impact, and work with people who care, we'd love to meet you!
About the Role
We are looking for a Senior Site Reliability Engineer to strengthen our platform reliability and observability capabilities. You will own the design and operation of monitoring infrastructure (including Datadog APM, alerting, and distributed tracing) across Kubernetes-based microservices on AWS. The role spans backend engineering and SRE practice in roughly a 65/35 split, with direct involvement in CI/CD integration and observability automation. You will also support internal teams in adopting monitoring best practices as we modernize our R&D platform.
What You Will Do
- Design, build, and maintain scalable backend and platform components;
- Implement and manage observability solutions across distributed systems;
- Configure dashboards, alerts, and APM for tracing, metrics, and logging;
- Monitor and improve system reliability, scalability, and performance;
- Deploy, operate, and maintain services in Kubernetes environments;
- Integrate observability tools into CI/CD pipelines and cloud infrastructure;
- Automate monitoring and operational workflows using scripting;
- Provide operational and training support for observability platforms, especially Datadog;
- Collaborate with engineering teams to improve system visibility and reliability practices.
Must Haves
- 4+ years of experience with Python, Node.js, or Java;
- Hands-on experience with API integrations;
- Strong experience in Kubernetes environments;
- Experience with Datadog or similar tools such as Prometheus and Grafana;
- Ability to configure dashboards, alerts, and APM;
- Experience monitoring containerized and microservices architectures;
- Hands-on experience with AWS;
- Experience integrating observability tools into cloud environments;
- Experience with CI/CD integrations for observability;
- Ability to automate monitoring and operational tasks using scripting;
- Upper-intermediate English level.
Nice to Haves
- Experience owning and operating an internal engineering platform, especially observability platforms;
- Demonstrated ownership of reliability, scalability, and performance;
- Ability to proactively lead maintenance and platform improvements;
- Experience installing and configuring Datadog agents and integrations;
- Experience managing API keys and secure configurations;
- Experience managing user roles and access controls;
- Familiarity with Go (Golang);
- Experience with additional observability tools such as New Relic, Dynatrace, Elastic Stack, or Splunk.
Perks and Benefits
- Remote work & Local connection: Work where you feel most productive and connect with your team in periodic meet-ups to strengthen your network and connect with other top experts.
- Legal presence in India: We ensure full local compliance with a structured, secure work environment tailored to Indian regulations.
- Competitive Compensation in INR: Fair compensation in INR with dedicated budgets for your personal growth, education, and wellness.
- Innovative Projects: Leverage the latest tech and create cutting-edge solutions for world-recognized clients and the hottest startups.
Key skills/competencies
- Senior Site Reliability Engineer
- Python
- Node.js
- Java
- Kubernetes
- AWS
- Datadog
- Prometheus
- Grafana
- CI/CD
How to Get Hired at AgileEngine
- Tailor your resume: Highlight your 4+ years of experience with Python, Node.js, or Java, and showcase your Kubernetes and AWS expertise.
- Emphasize SRE skills: Detail your experience with Datadog, Prometheus, Grafana, API integrations, and CI/CD for observability.
- Quantify achievements: Use numbers to demonstrate your impact on system reliability, scalability, and performance.
- Prepare for technical questions: Be ready to discuss distributed systems, microservices architectures, and monitoring automation.
- Showcase collaboration: Highlight your experience supporting internal teams and improving system visibility.