
Software Engineer ID61984
AgileEngine · Chennai, Tamil Nadu, India
- Hybrid
- Full-time
- ₹1,500,000 / year
Job highlights
- Strengthen platform reliability and observability capabilities.
- Design and operate monitoring infrastructure on AWS.
- Work with Kubernetes, Datadog, APM, and tracing.
- Take part in CI/CD integration and observability automation.
- Support teams in adopting monitoring best practices.
About the role
About AgileEngine
AgileEngine is an Inc. 5000 company that creates award-winning software for Fortune 500 brands and trailblazing startups across 17+ industries. We rank among the leaders in areas like application development and AI/ML, and our people-first culture has earned us multiple Best Place to Work awards.
Why Join Us
If you're looking for a place to grow, make an impact, and work with people who care, we'd love to meet you!
About the Role
We are looking for a Senior Site Reliability Engineer to strengthen our platform reliability and observability capabilities. You will own the design and operation of monitoring infrastructure, including Datadog APM, alerting, and distributed tracing, across Kubernetes-based microservices on AWS. The role spans backend engineering and SRE practice in roughly a 65/35 split, with direct involvement in CI/CD integration and observability automation. You will also support internal teams in adopting monitoring best practices as we modernize our R&D platform.
What You Will Do
- Design, build, and maintain scalable backend and platform components;
- Implement and manage observability solutions across distributed systems;
- Configure dashboards, alerts, and APM for tracing, metrics, and logging;
- Monitor and improve system reliability, scalability, and performance;
- Deploy, operate, and maintain services in Kubernetes environments;
- Integrate observability tools into CI/CD pipelines and cloud infrastructure;
- Automate monitoring and operational workflows using scripting;
- Provide operational and training support for observability platforms, especially Datadog;
- Collaborate with engineering teams to improve system visibility and reliability practices.
Must Haves
- 4+ years of experience with Python, Node.js, or Java;
- Hands-on experience with API integrations;
- Strong experience in Kubernetes environments;
- Experience with Datadog or similar tools such as Prometheus and Grafana;
- Ability to configure dashboards, alerts, and APM;
- Experience monitoring containerized and microservices architectures;
- Hands-on experience with AWS;
- Experience integrating observability tools into cloud environments;
- Experience with CI/CD integrations for observability;
- Ability to automate monitoring and operational tasks using scripting;
- Upper-intermediate English level.
Nice to Haves
- Experience owning and operating an internal engineering platform, especially observability platforms;
- Demonstrated ownership of reliability, scalability, and performance;
- Ability to proactively lead maintenance and platform improvements;
- Experience installing and configuring Datadog agents and integrations;
- Experience managing API keys and secure configurations;
- Experience managing user roles and access controls;
- Familiarity with Go (Golang);
- Experience with additional observability tools such as New Relic, Dynatrace, Elastic Stack, or Splunk.
Perks and Benefits
- Remote work & Local connection: Work where you feel most productive and join your team at periodic meet-ups to strengthen your network and connect with other top experts.
- Legal presence in India: We ensure full local compliance with a structured, secure work environment tailored to Indian regulations.
- Competitive Compensation in INR: Fair compensation in INR with dedicated budgets for your personal growth, education, and wellness.
- Innovative Projects: Leverage the latest tech and create cutting-edge solutions for world-recognized clients and the hottest startups.
Key skills/competency
- Senior Site Reliability Engineer
- Python
- Node.js
- Java
- Kubernetes
- AWS
- Datadog
- Prometheus
- Grafana
- CI/CD
Skills & topics
- Site Reliability Engineer
- SRE
- Kubernetes
- AWS
- Datadog
- Python
- Node.js
- Java
- Observability
- Monitoring
- CI/CD
- APM
- Backend Engineering
- Microservices
- Remote
How to get hired
- Tailor your resume: Highlight your 4+ years of experience with Python, Node.js, or Java, and showcase your Kubernetes and AWS expertise.
- Emphasize SRE skills: Detail your experience with Datadog, Prometheus, Grafana, API integrations, and CI/CD for observability.
- Quantify achievements: Use numbers to demonstrate your impact on system reliability, scalability, and performance.
- Prepare for technical questions: Be ready to discuss distributed systems, microservices architectures, and monitoring automation.
- Showcase collaboration: Highlight your experience supporting internal teams and improving system visibility.
Technical preparation
- Master Python, Node.js, or Java for backend tasks.
- Deepen Kubernetes and AWS cloud infrastructure knowledge.
- Gain hands-on experience with Datadog and APM.
- Practice API integrations and CI/CD pipeline automation.
Behavioral questions
- Describe a major system outage you resolved.
- How do you prioritize monitoring alerts?
- How do you collaborate with development teams?
- Share an example of improving system reliability.
Frequently asked questions
- What programming languages does AgileEngine primarily use for the Senior Site Reliability Engineer role?
- For this Senior Site Reliability Engineer position at AgileEngine, we require at least 4 years of experience with Python, Node.js, or Java. Familiarity with Go (Golang) is also considered a plus.
- What are the key monitoring and observability tools mentioned for this Senior Site Reliability Engineer role?
- The key tools for this role include Datadog (APM, tracing, alerting, dashboards), Prometheus, and Grafana. Experience with these or similar observability tools is essential for success in this position.
- Does this Senior Site Reliability Engineer position require experience with cloud platforms?
- Yes, hands-on experience with AWS is a must-have for this Senior Site Reliability Engineer role. You will be integrating observability tools into our cloud environments and deploying services on Kubernetes.
- What is the expected level of English proficiency for this role at AgileEngine?
- An upper-intermediate English level is required for this Senior Site Reliability Engineer position. This ensures effective communication with global teams and clients.
- What is the work arrangement for the Senior Site Reliability Engineer position at AgileEngine?
- The listing is tagged as hybrid. The perks describe working where you feel most productive, with periodic meet-ups organized to foster team connection and networking.
- What kind of projects can I expect to work on as a Senior Site Reliability Engineer at AgileEngine?
- As a Senior Site Reliability Engineer, you will work on innovative projects leveraging the latest technologies to create cutting-edge solutions for Fortune 500 brands and startups.
- What compensation and benefits are offered for the Senior Site Reliability Engineer role?
- AgileEngine offers competitive compensation in INR, with dedicated budgets for personal growth, education, and wellness. The role also includes remote work flexibility and opportunities for professional development.