3 days ago

Lead Site Reliability Engineer

Mattermost

Hybrid
Full Time
$185,000

Job Overview

Job Title: Lead Site Reliability Engineer
Job Type: Full Time
Category: Commerce
Experience: 5 Years
Degree: Master
Offered Salary: $185,000
Location: Hybrid

Job Description

About Mattermost

At Mattermost, we build the #1 collaborative workflow solution for defense, intelligence, security, and critical infrastructure organizations. Trusted by governments, financial institutions, and technology companies, our platform enables secure, efficient operations for the world’s most critical teams.

We’re dedicated to empowering organizations to operate with confidence, reducing risks, and accelerating productivity. Guided by our core values of Customer Obsession, Earn Trust, Self Awareness, Ownership and High Impact, we collaborate closely with our customers to deliver solutions that meet complex needs and drive success.

To learn more, visit www.mattermost.com

The Role: Lead Site Reliability Engineer

Mattermost is seeking an experienced and visionary Lead Site Reliability Engineer (SRE) to guide the architecture, reliability, and operational excellence of the infrastructure powering our secure, mission-critical collaboration platform.

In this role, you will provide technical leadership across our SRE function, driving strategic initiatives for scalability, observability, performance, and automation across cloud and hybrid environments. You will mentor engineers, establish best practices, and collaborate closely with development, security, and operations teams to ensure our customers in defense, government, and critical infrastructure sectors experience exceptional reliability and performance.

Responsibilities Include

  • Define the strategy, architecture, and roadmap for Mattermost’s site reliability engineering function, aligning infrastructure initiatives with product and business goals.
  • Lead the design, deployment, and optimization of production-grade containerized workloads, infrastructure-as-code, and compliant cloud environments for regulated domains (e.g., FedRAMP, DoD).
  • Establish and evolve observability, monitoring, and alerting frameworks to ensure performance, reliability, and capacity planning at scale.
  • Drive incident management processes, including on-call rotations, root cause analysis, and systemic reliability improvements.
  • Partner with security and compliance teams to meet data sovereignty, security, and regulatory requirements.
  • Champion automation and operational excellence to improve efficiency, reduce risk, and scale operations.
  • Oversee cloud cost management and capacity planning to optimize infrastructure spending while meeting performance targets.
  • Build and maintain a developer platform that enables fast, secure software delivery and improves application stability in production.
  • Mentor and coach SRE team members, fostering a culture of learning, collaboration, and technical excellence.

Requirements

  • BS in Computer Science, Cybersecurity, Software Engineering, or a related technical field, or equivalent experience, with 5+ years of relevant experience in site reliability engineering, DevOps, or cloud infrastructure roles.
  • Proven expertise in container orchestration platforms, ideally Kubernetes.
  • Extensive experience with infrastructure-as-code, ideally Terraform.
  • Strong background in cloud platforms, ideally AWS.
  • Demonstrated experience designing and implementing monitoring, alerting, and performance optimization strategies.
  • Exceptional troubleshooting and incident management skills for distributed systems.
  • Proficiency in at least one scripting or programming language for automation.
  • Excellent communication skills with a track record of influencing cross-functional teams.
  • Experience leading globally distributed teams in a remote-first environment.
  • For candidates residing in the U.S.: This role may require the ability to obtain and maintain a U.S. government security clearance in the future. As such, U.S. applicants must be U.S. citizens and eligible under applicable clearance requirements. Applicants must meet eligibility requirements for access to export-controlled information as defined by U.S. export control laws, including EAR and ITAR.

Preferences

  • Familiarity with observability stacks such as Grafana and Prometheus.
  • Experience designing high-availability, disaster recovery, and scaling architectures.
  • Exposure to GCP and Azure cloud environments.
  • Leadership experience in highly regulated industries such as defense, finance, or critical infrastructure.
  • Experience with U.S. federal compliance frameworks and authorization processes, including FedRAMP, DoD ATO, NIST 800-53, and related government standards.
  • Experience preparing, delivering, and maintaining software offerings through AWS Marketplace and other cloud provider marketplaces (e.g., Azure Marketplace, Google Cloud Marketplace), including packaging, compliance validation, and ongoing operational support.
  • Open-source contributions in reliability, DevOps, or infrastructure tooling.
  • Certifications in cloud infrastructure, reliability, or DevOps engineering (e.g., CKA, CKAD, AWS Certified Solutions Architect).

Key skills/competencies

  • Site Reliability Engineering
  • Kubernetes
  • Terraform
  • AWS
  • Cloud Infrastructure
  • Distributed Systems
  • Incident Management
  • Observability (Grafana/Prometheus)
  • Automation
  • Compliance (FedRAMP, DoD)

Tags:

Site Reliability Engineer
SRE
Lead Engineer
Scalability
Observability
Automation
Kubernetes
AWS
Terraform
Incident Management
Compliance
Security
Architecture
Distributed Systems
Grafana
Prometheus
GCP
Azure
Container Orchestration
Infrastructure as Code

How to Get Hired at Mattermost

  • Research Mattermost's culture: Study their mission, values, recent news, and employee testimonials on LinkedIn and Glassdoor.
  • Tailor your resume: Highlight SRE, Kubernetes, AWS, and Terraform expertise, along with leadership in distributed systems.
  • Showcase regulated industry experience: Emphasize any familiarity with FedRAMP, DoD compliance, or critical infrastructure sectors.
  • Prepare for technical deep-dives: Focus on architectural design, incident management, and performance optimization strategies.
  • Demonstrate leadership and collaboration: Discuss experience mentoring, influencing cross-functional teams, and working in remote environments.
