
Engineer - SRE
hackajob · Gurugram, Haryana, India
This listing has closed.
- On site
- Full-time
- $120,000 / year
Job highlights
- Ensure reliable, fast, and smooth distributed systems.
- Implement monitoring, logging, and tracing solutions.
- Automate operational tasks and improve system stability.
- Collaborate with cross-functional engineering teams.
- Utilize observability tools for system visibility and performance.
About the role
About Dunnhumby Ltd
dunnhumby is the global leader in Customer Data Science, partnering with the world’s most ambitious retailers and brands to put the customer at the heart of every decision. We combine deep insight, advanced technology, and close collaboration to help our clients grow, innovate, and deliver measurable value for their customers. dunnhumby employs nearly 2,500 experts in offices throughout Europe, Asia, Africa, and the Americas working for transformative, iconic brands such as Tesco, Coca-Cola, Nestlé, Unilever and Metro.
About Tesco Media
Tesco Media is building a world-class, self-serve B2B advertising platform that enables retailers and brands to plan, activate, and measure omnichannel retail media campaigns. Retail Media is transforming how advertisers connect with consumers through personalized and targeted campaigns across retailers' digital and physical touchpoints. Retail Media Measurement plays a pivotal role in ensuring the effectiveness of these campaigns, driving value for advertisers, retailers, and consumers alike.
Role Overview
We are looking for a Platform Reliability Engineer (SRE) who can help us keep our large, distributed systems reliable, fast, and running smoothly. In this role, you will help improve how we build and maintain reliable systems, strengthen system stability, create automation standards, and guide engineering teams on SRE best practices. You will work closely with platform, backend, security, and product teams to ensure our services are stable, easy to monitor, and always available. You will use tools like Prometheus, Grafana, Elastic, and New Relic to improve system visibility, manage incidents, and boost overall performance.
Key Responsibilities
- Implement monitoring, logging, and tracing for applications, services, and infrastructure.
- Build dashboards and alerts to monitor system health and performance.
- Support production systems and participate in incident response activities.
- Troubleshoot operational issues using logs, metrics, and system diagnostics.
- Work with engineering teams to onboard services into monitoring platforms.
- Assist in defining alert thresholds and reducing unnecessary alert noise.
- Maintain monitoring configurations and ensure operational documentation is up to date.
- Support post-incident reviews and implement improvements in monitoring coverage.
- Automate routine operational tasks where possible.
Required Experience
- 3–5 years of experience in infrastructure operations, monitoring, or Site Reliability Engineering.
- Experience working with Infrastructure as Code tools such as Terraform.
- Familiarity with cloud platforms such as GCP or Azure.
- Understanding of APIs, service monitoring, and system logs.
- Experience supporting production environments and incident response processes.
- Strong written and verbal communication skills with the ability to collaborate across teams.
Preferred Experience
- Experience with observability tools such as Grafana, Prometheus, Splunk, or New Relic.
- Experience supporting distributed systems or microservices.
- Exposure to automation or scripting for operational tasks.
- Experience working in Media, SaaS, or streaming environments.
What You Can Expect From Us
We won’t just meet your expectations. We’ll defy them. So you’ll enjoy the comprehensive rewards package you’d expect from a leading technology company. But also, a degree of personal flexibility you might not expect. Plus, thoughtful perks, like flexible working hours and your birthday off. You’ll also benefit from an investment in cutting-edge technology that reflects our global ambition. But with a nimble, small-business feel that gives you the freedom to play, experiment and learn.
And we don’t just talk about diversity and inclusion. We live it every day – with thriving networks including dh Gender Equality Network, dh Proud, dh Family, dh One, dh Enabled and dh Thrive as the living proof. We want everyone to have the opportunity to shine and perform at their best throughout our recruitment process. Please let us know how we can make this process work best for you.
Our approach to Flexible Working
At dunnhumby, we value and respect difference and are committed to building an inclusive culture by creating an environment where you can balance a successful career with your commitments and interests outside of work. We believe that you will do your best at work if you have a work / life balance. Some roles lend themselves to flexible options more than others, so if this is important to you please raise this with your recruiter, as we are open to discussing agile working opportunities during the hiring process.
For further information about how we collect and use your personal information, please see our Privacy Notice, which can be found (here).
Key skills/competency
Site Reliability Engineering, Infrastructure Operations, Monitoring, Terraform, GCP, Azure, APIs, Service Monitoring, System Logs, Incident Response
Skills & topics
- Site Reliability Engineer
- SRE
- Platform Reliability Engineer
- Infrastructure
- Monitoring
- Observability
- Automation
- Cloud
- GCP
- Azure
- Terraform
- Incident Response
- Kubernetes
- DevOps
How to get hired
- Tailor your resume: Highlight 3-5 years of experience in SRE, infrastructure operations, or monitoring, emphasizing Terraform and cloud platforms (GCP/Azure).
- Showcase relevant skills: Detail your experience with APIs, service monitoring, system logs, and production environment support.
- Demonstrate collaboration: Provide examples of working with platform, backend, security, and product teams.
- Prepare for technical questions: Be ready to discuss incident response processes and SRE best practices.
- Research dunnhumby: Understand their role in Customer Data Science and their work with major retailers.
Technical preparation
- Practice Terraform for infrastructure as code.
- Review GCP/Azure services and best practices.
- Understand Prometheus and Grafana for monitoring.
- Prepare to troubleshoot distributed systems.
Behavioral questions
- Describe a complex system outage you resolved.
- How do you prioritize competing operational tasks?
- How do you guide engineering teams on SRE best practices?
- Share an example of automating a repetitive task.
Frequently asked questions
- What are the core responsibilities of a Platform Reliability Engineer at dunnhumby?
- The core responsibilities include implementing and maintaining monitoring, logging, and tracing systems, building dashboards and alerts, supporting production environments, troubleshooting operational issues, and automating routine tasks to ensure the reliability and performance of large, distributed systems.
- What technical skills are most important for this Platform Reliability Engineer role?
- Key technical skills include 3-5 years of experience in SRE or infrastructure operations, proficiency with Infrastructure as Code tools like Terraform, familiarity with cloud platforms such as GCP or Azure, and a strong understanding of APIs, service monitoring, and system logs.
- What kind of observability tools does dunnhumby use?
- dunnhumby utilizes tools such as Prometheus, Grafana, Elastic, and New Relic to enhance system visibility, manage incidents, and improve overall performance. Experience with these or similar tools is highly valued.
- Does dunnhumby offer flexible working arrangements for this Platform Reliability Engineer position?
- Yes, dunnhumby is committed to building an inclusive culture and offers flexible working opportunities. Candidates interested in agile working arrangements are encouraged to discuss this with their recruiter.
- What is dunnhumby's approach to diversity and inclusion for potential Platform Reliability Engineers?
- dunnhumby actively lives diversity and inclusion through various employee networks and strives to create an environment where everyone has the opportunity to perform at their best throughout the recruitment process. They are open to making accommodations to ensure the process works best for you.
- What makes working at dunnhumby unique for a Platform Reliability Engineer?
- You can expect a comprehensive rewards package, personal flexibility, and thoughtful perks like flexible working hours and your birthday off. dunnhumby also offers an investment in cutting-edge technology within a nimble environment that encourages experimentation and learning.
- What is the expected experience level for the Platform Reliability Engineer role?
- The role requires 3-5 years of experience in infrastructure operations, monitoring, or Site Reliability Engineering. Preferred experience includes working with observability tools, supporting distributed systems, and automation.