Senior AI Engineer, APM Experiences
Datadog
Job Description
The Opportunity
Datadog’s APM Experiences team owns the core product experience for Application Performance Monitoring, including distributed tracing, service representation, and more. We’re building a new wave of AI-powered capabilities that help customers detect, resolve, and prevent performance issues faster. As a Senior AI Engineer, APM Experiences, you will lead end-to-end development of LLM- and Agent-based features that can:
- Debug and investigate application performance issues down to the root cause, as both a developer assistant and a fully autonomous agent
- Proactively recommend performance and reliability optimizations to prevent the next incident
- Automatically create intelligent monitors and SLOs for the most important business flows and critical paths
This is a highly product-minded engineering role, working from problem discovery and UX all the way to reliable, scalable production systems.
What You’ll Do
- Shape AI experiences for APM. Design and ship LLM/agentic workflows that analyze traces, metrics, logs, and other telemetry to generate diagnoses, explanations, and guided fixes.
- Own the full loop. Prototype quickly, define success metrics and evals, run experiments, iterate, and ultimately productionize for scale and reliability.
- Build robust agent systems. Develop tools, retrieval and planning strategies, and guardrails; manage prompts/evals; design fallbacks and human-in-the-loop paths.
- Integrate with Datadog’s platform. Leverage surfaces like Trace Explorer, Service Catalog, monitors, and workflows to deliver end-to-end value in the APM UI.
- Partner deeply. Collaborate with PM, Design, and partner teams to build cohesive experiences.
- Raise the bar on engineering. Write performant, maintainable backend code, own services in production, and improve reliability for high-throughput, low-latency data systems.
Who You Are
You are a product-minded engineer who ships AI to production with:
- 4+ years of experience building backend or real-time ML systems; you value simplicity, correctness, and performance.
- Proven experience delivering LLM/agent features to production (prompting, tooling, evals, safety/guardrails).
- Comfort owning user journeys, iterating from prototype → alpha → GA, and measuring impact with clear product metrics.
You have strong ML / applied science fundamentals, including:
- Solid grasp of the ML lifecycle (task definition, dataset collection, modeling, evaluation, deployment, iteration) and statistics (experiment design, confidence intervals).
- Experience choosing/modeling the right technique for the job (e.g., anomaly detection, ranking/recommendation, NLP), and knowing when a heuristic beats a model.
- Fluency with offline/online evals for AI systems, including the ability to build reliable golden sets and automated regression tests.
You are distributed systems & observability savvy, with:
- Experience with microservices performance: tracing, latency breakdowns, concurrency, and resiliency patterns.
- Proficiency in Go, Java, or Python; strong API/service design; production ops (monitoring, alerting, on-call rotation).
Nice to have:
- Hands-on experience with distributed tracing stacks (OpenTelemetry/Datadog APM), profilers, and logs/metrics pipelines.
- Exposure to planning/agent frameworks, tool-use orchestration, RAG, and retrieval/indexing for observability data.
- Familiarity with SLO/SLA practices and incident response.
Benefits and Growth
Datadog offers a range of benefits and growth opportunities:
- Build tools for software engineers, using them to accelerate development.
- Influence product direction and make a significant business impact.
- Work with skilled, knowledgeable, and supportive teammates.
- Enjoy competitive global benefits and continuous professional development.
(Benefits may vary based on country and employment nature.)
Datadog provides a competitive salary and equity package, with actual compensation based on skills, qualifications, and experience. Comprehensive and inclusive benefits include healthcare, dental, parental planning, mental health, 401(k) with match, paid time off, fitness reimbursements, and a discounted employee stock purchase plan.
About Datadog
Datadog (NASDAQ: DDOG) is a global SaaS business known for growth and profitability. Our mission is to break down silos and solve complexity in the cloud age by enabling digital transformation, cloud migration, and infrastructure monitoring. Built by engineers, for engineers, Datadog is used by organizations of all sizes across industries. We champion professional development, diversity of thought, innovation, and work excellence. Join a collaborative, pragmatic, and thoughtful community to solve tough problems, take smart risks, and celebrate one another.
Key skills/competencies
- LLM Development
- Agentic Workflows
- Application Performance Monitoring (APM)
- Distributed Tracing
- Machine Learning Lifecycle
- Go/Java/Python
- Microservices
- Observability
- Experiment Design
- Prompt Engineering
How to Get Hired at Datadog
- Research Datadog's culture: Study their mission, values, recent news, and employee testimonials on LinkedIn and Glassdoor.
- Tailor your resume: Customize your resume to highlight experience in AI, APM, LLMs, and distributed systems, aligning with Datadog's needs.
- Showcase your AI projects: Prepare to discuss specific examples of shipping LLM/agent features to production and your ML lifecycle expertise in Datadog interviews.
- Demonstrate system savvy: Emphasize your proficiency in Go, Java, or Python and your experience with microservices, observability, and production operations.
- Practice behavioral responses: Be ready to share examples of collaboration, problem-solving, and driving impact in product-minded engineering roles.