16 hours ago

AI Site Reliability Engineer

Leidos

Hybrid
Full Time
$190,000

Job Overview

Job Title: AI Site Reliability Engineer
Job Type: Full Time
Offered Salary: $190,000
Location: Hybrid

Job Description

AI Site Reliability Engineer at Leidos

The U.S. Navy’s Service Management, Integration, and Transport (SMIT) program has an opening for an AI Site Reliability Engineer on a high-visibility DoD program that provides engineering support to the Navy Marine Corps Intranet (NMCI), the largest information technology (IT) network in the world. This position will provide many opportunities to challenge and grow your skills.

The AI Site Reliability Engineer (AI-SRE) is responsible for integrating artificial intelligence and machine learning capabilities into Site Reliability Engineering (SRE) operations to improve system reliability, availability, performance, and operational efficiency. This role serves as a horizontal enabler across SRE pods, leveraging AI-driven insights to reduce operational toil, accelerate incident response, enhance observability, and enable predictive reliability engineering. The AI-SRE partners closely with infrastructure, network, application, cyber, and platform SRE teams to transform operational data into actionable intelligence while ensuring AI solutions are safe, explainable, auditable, and aligned with SRE principles.

Key Responsibilities

  • AIOps & Observability Intelligence: Design, develop, and maintain AI/ML models for anomaly detection, trend analysis, and signal correlation across metrics, logs, traces, and events. Reduce alert noise through intelligent alert grouping, suppression, and prioritization. Enhance observability platforms with AI-generated insights supporting SLO and error-budget management.
  • AI-Assisted Incident Management: Implement AI-driven incident classification, enrichment, and summarization. Provide probable root-cause analysis recommendations based on historical and real-time telemetry. Support on-call and incident response teams with AI-guided remediation suggestions. Contribute AI insights to post-incident reviews and reliability improvement plans.
  • Automation & Ops-as-Code Enablement: Apply AI techniques to identify repetitive operational tasks and automation opportunities. Assist in generating, validating, and optimizing automation playbooks and workflows. Analyze automation execution data to improve success rates, resiliency, and reuse.
  • Knowledge Management & Runbook Intelligence: Build and maintain AI-searchable knowledge repositories containing runbooks, SOPs, lessons learned, and historical incident data. Enable natural-language access to operational knowledge for SREs and operations staff. Reduce dependency on tribal knowledge through intelligent documentation and discovery.
  • Predictive Reliability Engineering: Develop predictive models for capacity planning, failure forecasting, configuration risk, and reliability debt identification. Support proactive remediation strategies to prevent incidents before customer impact. Assist SRE leadership in data-driven prioritization of reliability investments.
  • Governance, Security & Trust: Ensure AI solutions adhere to organizational security, compliance, and data-handling policies. Establish guardrails for AI recommendations, human-in-the-loop decision making, and automation execution. Promote transparency, explainability, and auditability of AI-driven operational decisions.

Required Qualifications

Education and Requirements
  • Bachelor’s degree in Computer Science, Engineering, Information Systems, Data Science, or related discipline
  • 5+ years in Site Reliability Engineering, DevOps, IT Operations, or Systems Engineering
  • 2+ years applying AI/ML techniques in operational, analytics, or automation contexts
  • Demonstrated experience supporting production systems in high-availability environments
  • Active Secret clearance required to be considered for the position
Technical Skills
  • Proficiency in data analysis tooling
  • Experience with machine learning fundamentals (anomaly detection, clustering, time-series analysis, NLP)
  • Familiarity with observability platforms (metrics, logs, traces, events)
  • Experience with automation frameworks and infrastructure-as-code concepts
  • Strong understanding of distributed systems and operational telemetry

Key Skills and Competencies

  • AI/ML
  • Site Reliability Engineering (SRE)
  • AIOps
  • Anomaly Detection
  • Incident Management
  • Automation
  • Observability
  • Predictive Analytics
  • Data Analysis
  • Distributed Systems

Tags:

AI Site Reliability Engineer
SRE
AI/ML
AIOps
Observability
Incident Management
Automation
Predictive Reliability
Data Analysis
Machine Learning
Distributed Systems
DevOps
Infrastructure-as-Code
NLP
System Reliability
Telemetry
Python
Cloud

How to Get Hired at Leidos

  • Research Leidos's mission: Understand Leidos's commitment to national security and innovation, particularly within DoD programs like SMIT.
  • Tailor your resume: Highlight SRE, AI/ML, and DevOps experience, emphasize work with high-availability production systems, and note your active Secret clearance.
  • Showcase technical acumen: Prepare to discuss data analysis, machine learning fundamentals, observability platforms, and automation frameworks.
  • Emphasize problem-solving: Be ready to share examples of how you've used AI to improve system reliability, reduce toil, or enhance incident response.
  • Demonstrate security awareness: Understand the importance of governance, security, and trust in AI solutions for government contracts.
