9 days ago

Senior AI Engineer, APM Experiences

Datadog

On Site
Full Time
$210,000
New York, NY

Job Overview

Job Title: Senior AI Engineer, APM Experiences
Job Type: Full Time
Category: Commerce
Experience: 5 Years
Degree: Master
Offered Salary: $210,000
Location: New York, NY

Job Description

The Opportunity

Datadog’s APM Experiences team owns the core product experience for Application Performance Monitoring, including distributed tracing, service representation, and more. We’re building a new wave of AI-powered capabilities that help customers detect, resolve, and prevent performance issues faster. As a Senior AI Engineer, APM Experiences, you will lead end-to-end development of LLM- and Agent-based features that can:

  • Debug and investigate application performance issues down to the root cause, as both a developer assistant and a fully autonomous agent
  • Proactively recommend performance and reliability-based optimizations to prevent the next incident
  • Automatically create intelligent monitors and SLOs for the most important business flows and critical paths

This is a highly product-minded engineering role, working from problem discovery and UX all the way to reliable, scalable production systems.

What You’ll Do

  • Shape AI experiences for APM. Design and ship LLM/agentic workflows that analyze traces, metrics, logs, and other telemetry to generate diagnoses, explanations, and guided fixes.
  • Own the full loop. Prototype quickly, define success metrics and evals, run experiments, iterate, and ultimately productionize for scale and reliability.
  • Build robust agent systems. Develop tools, retrieval and planning strategies, and guardrails; manage prompts/evals; design fallbacks and human-in-the-loop paths.
  • Integrate with Datadog’s platform. Leverage surfaces like Trace Explorer, Service Catalog, monitors, and workflows to deliver end-to-end value in the APM UI.
  • Partner deeply. Collaborate with PM, Design, and partner teams to build cohesive experiences.
  • Raise the bar on engineering. Write performant, maintainable backend code, own services in production, and improve reliability for high-throughput, low-latency data systems.

Who You Are

You are a product-minded engineer who ships AI to production with:

  • 4+ years building backend or real-time ML systems; you value simplicity, correctness, and performance.
  • Proven experience delivering LLM/agent features to production (prompting, tooling, evals, safety/guardrails).
  • Comfort owning user journeys, iterating from prototype → alpha → GA, and measuring impact with clear product metrics.

You have strong ML / applied science fundamentals, including:

  • Solid grasp of the ML lifecycle (task definition, dataset collection, modeling, evaluation, deployment, iteration) and statistics (experiment design, confidence intervals).
  • Experience choosing/modeling the right technique for the job (e.g., anomaly detection, ranking/recommendation, NLP), and knowing when a heuristic beats a model.
  • Fluency with offline/online evals for AI systems, including the ability to build reliable golden sets and automated regression tests.

You are distributed systems & observability savvy, with:

  • Experience with microservices performance: tracing, latency breakdowns, concurrency, and resiliency patterns.
  • Proficiency in Go, Java, or Python; strong API/service design; production ops (monitoring, alerting, on-call rotation).

Nice to have:

  • Hands-on experience with distributed tracing stacks (OpenTelemetry/Datadog APM), profilers, and logs/metrics pipelines.
  • Exposure to planning/agent frameworks, tool-use orchestration, RAG, and retrieval/indexing for observability data.
  • Familiarity with SLO/SLA practices and incident response.

Benefits and Growth

Datadog offers a range of benefits and growth opportunities:

  • Build tools for software engineers, and use those same tools to accelerate development.
  • Influence product direction and make a significant business impact.
  • Work with skilled, knowledgeable, and supportive teammates.
  • Enjoy competitive global benefits and continuous professional development.

(Benefits may vary based on country and employment nature.)

Datadog provides a competitive salary and equity package, with actual compensation based on skills, qualifications, and experience. Comprehensive and inclusive benefits include healthcare, dental, parental planning, mental health, 401(k) with match, paid time off, fitness reimbursements, and a discounted employee stock purchase plan.

About Datadog

Datadog (NASDAQ: DDOG) is a global SaaS business known for growth and profitability. Our mission is to break down silos and solve complexity in the cloud age by enabling digital transformation, cloud migration, and infrastructure monitoring. Built by engineers, for engineers, Datadog is used by organizations of all sizes across industries. We champion professional development, diversity of thought, innovation, and work excellence. Join a collaborative, pragmatic, and thoughtful community to solve tough problems, take smart risks, and celebrate one another.

Key skills/competencies

  • LLM Development
  • Agentic Workflows
  • Application Performance Monitoring (APM)
  • Distributed Tracing
  • Machine Learning Lifecycle
  • Go/Java/Python
  • Microservices
  • Observability
  • Experiment Design
  • Prompt Engineering

Tags:

Senior AI Engineer
APM
Machine Learning
LLM
Distributed Tracing
Go
Python
Java
Observability
Microservices
Product Development
Agent Systems
Prompt Engineering
Anomaly Detection
Cloud Monitoring
SaaS
Data Systems
System Design
Production Operations
Software Engineering

How to Get Hired at Datadog

  • Research Datadog's culture: Study their mission, values, recent news, and employee testimonials on LinkedIn and Glassdoor.
  • Tailor your resume: Customize your resume to highlight experience in AI, APM, LLMs, and distributed systems, aligning with Datadog's needs.
  • Showcase your AI projects: Prepare to discuss specific examples of shipping LLM/agent features to production, along with your ML lifecycle expertise.
  • Demonstrate system savvy: Emphasize your proficiency in Go, Java, or Python and your experience with microservices, observability, and production operations.
  • Practice behavioral responses: Be ready to share examples of collaboration, problem-solving, and driving impact in product-minded engineering roles.
