Senior Site Reliability Engineer
@ Rocket.Chat

Remote
$150,000
Full Time
Posted 8 hours ago

Job Details

Overview

The Senior Site Reliability Engineer at Rocket.Chat plays a crucial role in ensuring reliability, scalability, and performance across all critical systems and services. Reporting to the Head of Infrastructure and Deployment, you will be a key member of the Engineering team.

Your Responsibilities

As a Senior Site Reliability Engineer, you will:

  • Enhance reliability, performance, and scalability of Rocket.Chat's ecosystem.
  • Design, develop, and maintain Kubernetes operators and manage core infrastructure.
  • Implement robust monitoring, alerting, logging, and automation for operational efficiency.
  • Lead incident management, on-call response, and blameless post-mortems.
  • Collaborate with cross-functional teams to integrate reliability practices early in the product lifecycle.

Mandatory Hard Skills

  • Expertise in Kubernetes and cloud platforms (AWS, GCP, Azure, OVH).
  • Proficiency in programming/scripting languages (Go, Python, Bash).
  • Experience with monitoring tools (Prometheus, Grafana, Loki) and IaC (Terraform, Pulumi, Ansible).
  • Solid networking fundamentals and security principles.
  • Familiarity with databases like MongoDB or Redis.

Desirable Skills & Soft Skills

  • Knowledge of chaos engineering and disaster recovery planning.
  • Experience with agile tools like Jira.
  • A proactive, collaborative mindset and strong problem-solving skills.
  • Leadership and clear communication, even during stressful incidents.
  • Data-driven decision-making and accountability.

What You'll Do

  • Engineer and operate deployment and platform services.
  • Manage and optimize core infrastructure and associated tools.
  • Ensure service reliability through SLOs, error budgets, and robust monitoring.
  • Automate operations and reduce manual toil.
  • Foster cross-functional collaboration and implement advanced reliability practices.

Benefits

  • Fully remote and flexible working hours.
  • Flexible paid time off, holidays, and vacation.
  • Company laptop and remote benefits.
  • Access to Talki, courses, books, stock options, and a multicultural environment.
  • Vibrant company culture and competitive compensation based on location.

Key Skills/Competencies

Kubernetes, AWS, GCP, Python, Terraform, CI/CD, Monitoring, Distributed Systems, Automation, Incident Management

How to Get Hired at Rocket.Chat

🎯 Tips for Getting Hired

  • Customize your resume: Tailor skills and projects to Rocket.Chat requirements.
  • Showcase SRE expertise: Highlight Kubernetes, cloud, and automation experience.
  • Research Rocket.Chat: Understand their open-source communication platform and culture.
  • Prepare for technical interviews: Practice incident management and system design questions.

📝 Interview Preparation Advice

Technical Preparation

Review Kubernetes architecture and operator development.
Practice using cloud platforms like AWS and GCP.
Brush up on scripting languages like Python and Go.
Familiarize yourself with IaC tools and CI/CD pipeline setups.

Behavioral Questions

Describe a time you led incident management.
Explain your approach to cross-team collaboration.
Discuss how you resolved a complex system outage.
Share your experience with proactive problem prevention.
