5 days ago

Senior Site Reliability Engineer

AgileEngine

Hybrid
Full Time
₹0

Job Overview

Job Title: Senior Site Reliability Engineer
Job Type: Full Time
Offered Salary: ₹0
Location: Hybrid

Job Description

About AgileEngine

AgileEngine is an Inc. 5000 company that creates award-winning software for Fortune 500 brands and trailblazing startups across 17+ industries. We rank among the leaders in areas like application development and AI/ML, and our people-first culture has earned us multiple Best Place to Work awards.

Why Join Us

If you're looking for a place to grow, make an impact, and work with people who care, we'd love to meet you!

About the Role

We are looking for a Senior Site Reliability Engineer to strengthen our platform reliability and observability capabilities. You will own the design and operation of monitoring infrastructure (including Datadog APM, alerting, and distributed tracing) across Kubernetes-based microservices on AWS. The role is roughly 65% backend engineering and 35% SRE practice, with direct involvement in CI/CD integration and observability automation. You will also support internal teams in adopting monitoring best practices as we modernize our R&D platform.

What You Will Do

  • Design, build, and maintain scalable backend and platform components;
  • Implement and manage observability solutions across distributed systems;
  • Configure dashboards, alerts, and APM for tracing, metrics, and logging;
  • Monitor and improve system reliability, scalability, and performance;
  • Deploy, operate, and maintain services in Kubernetes environments;
  • Integrate observability tools into CI/CD pipelines and cloud infrastructure;
  • Automate monitoring and operational workflows using scripting;
  • Provide operational and training support for observability platforms, especially Datadog;
  • Collaborate with engineering teams to improve system visibility and reliability practices.

Must Haves

  • 4+ years of experience with Python, Node.js, or Java;
  • Hands-on experience with API integrations;
  • Strong experience in Kubernetes environments;
  • Experience with Datadog or similar tools such as Prometheus and Grafana;
  • Ability to configure dashboards, alerts, and APM;
  • Experience monitoring containerized and microservices architectures;
  • Hands-on experience with AWS;
  • Experience integrating observability tools into cloud environments;
  • Experience with CI/CD integrations for observability;
  • Ability to automate monitoring and operational tasks using scripting;
  • Upper-intermediate English level.

Nice to Haves

  • Experience owning and operating an internal engineering platform, especially observability platforms;
  • Demonstrated ownership of reliability, scalability, and performance;
  • Ability to proactively lead maintenance and platform improvements;
  • Experience installing and configuring Datadog agents and integrations;
  • Experience managing API keys and secure configurations;
  • Experience managing user roles and access controls;
  • Familiarity with Go (Golang);
  • Experience with additional observability tools such as New Relic, Dynatrace, Elastic Stack, or Splunk.

Perks and Benefits

  • Remote work & Local connection: Work where you feel most productive, and join periodic meet-ups to connect with your team and other top experts.
  • Legal presence in India: We ensure full local compliance with a structured, secure work environment tailored to Indian regulations.
  • Competitive Compensation in INR: Fair compensation in INR with dedicated budgets for your personal growth, education, and wellness.
  • Innovative Projects: Leverage the latest tech and create cutting-edge solutions for world-recognized clients and the hottest startups.

Key skills/competency

  • Senior Site Reliability Engineer
  • Python
  • Node.js
  • Java
  • Kubernetes
  • AWS
  • Datadog
  • Prometheus
  • Grafana
  • CI/CD

Tags:

Site Reliability Engineer
SRE
Kubernetes
AWS
Datadog
Python
Node.js
Java
Observability
Monitoring
CI/CD
APM
Backend Engineering
Microservices
Remote

How to Get Hired at AgileEngine

  • Tailor your resume: Highlight your 4+ years of experience with Python, Node.js, or Java, and showcase your Kubernetes and AWS expertise.
  • Emphasize SRE skills: Detail your experience with Datadog, Prometheus, Grafana, API integrations, and CI/CD for observability.
  • Quantify achievements: Use numbers to demonstrate your impact on system reliability, scalability, and performance.
  • Prepare for technical questions: Be ready to discuss distributed systems, microservices architectures, and monitoring automation.
  • Showcase collaboration: Highlight your experience supporting internal teams and improving system visibility.
