10 days ago

Observability Platform Engineer

Nscale

Hybrid
Full Time
£85,000

Job Overview

Job Title: Observability Platform Engineer
Job Type: Full Time
Category: Commerce
Experience: 5 Years
Degree: Master
Offered Salary: £85,000
Location: Hybrid

Job Description

About Nscale

Nscale is the GPU cloud engineered for AI. We offer high-performance, cost-efficient infrastructure designed for modern AI workloads, blending the power of bespoke supercomputers with the flexibility of cloud services. Our vertically integrated platform spans GPU-dense, energy-efficient data centres through Kubernetes and Slurm orchestration to AI-ready services.

We thrive on a culture of relentless innovation, ownership, and accountability, where every team member takes pride in their work and drives it with excellence and urgency. As an Nscaler, you’ll build trust through openness and transparency, where everyone is inspired to do their best work. If you join our team, you’ll be contributing to building the technology that powers the future.

About The Role (Job Purpose)

As an Observability Platform Engineer, you will design, build, and manage the systems that surface deep visibility into Nscale’s infrastructure and AI workloads. You’ll treat observability as a product, partnering with engineering and SRE teams to ensure our monitoring, logging, tracing, and alerting platforms are robust, scalable, and easy to use.

This role requires hands-on engineering experience combined with empathy for how other teams consume observability data. You’ll ensure infrastructure health, reliability, and performance by enabling proactive insights and reducing operational friction.

What You’ll Do

  • Design, build, and support scalable observability infrastructure (metrics, logs, traces, alerts).
  • Collaborate with internal teams to embed observability as a seamless product across GPU clusters, Kubernetes, Slurm, and AI services.
  • Implement and refine monitoring and alerting patterns that improve system reliability and strengthen the team's reliability culture.
  • Maintain production and pre-production observability clusters and help others adopt best practices.
  • Automate observability pipelines using IaC tools and scripting for repeatability and consistency.
  • Troubleshoot observability platform issues and support incident remediation efforts.
  • Serve as an advocate for observability best practices, training teams on effective usage and instrumentation.

About You

Skills / Experience
  • 2–5 years of experience in Software Engineering, SRE, DevOps, or observability-related roles.
  • Proficiency in at least one scripting or programming language (Python, Go, Bash).
  • Experience with Kubernetes or containerised environments.
  • Familiarity with on-call responsibilities, triaging, and escalating live production issues.
  • Comfortable with observability tooling such as Grafana, Prometheus, Loki, OpenTelemetry, ClickHouse, Elastic, Thanos, and VictoriaMetrics.
  • Strong communication and collaboration skills, able to empathise with users of observability systems and translate needs into solutions.

Preferred
  • Hands-on experience operating observability infrastructure at scale.
  • Knowledge of Infrastructure-as-Code (e.g. Terraform) to automate deployments.
  • Exposure to streaming systems or pipelines for observability data.

Key skills/competencies

  • Observability
  • Platform Engineering
  • Kubernetes
  • Prometheus
  • Grafana
  • OpenTelemetry
  • SRE
  • DevOps
  • Infrastructure as Code (IaC)
  • Python/Go/Bash

How to Get Hired at Nscale

  • Research Nscale's culture: Study their mission, values, recent news, and employee testimonials on LinkedIn and Glassdoor.
  • Tailor your resume for observability: Customize your resume to highlight experience in SRE, DevOps, Kubernetes, and observability tooling like Prometheus and Grafana, matching keywords from the Observability Platform Engineer job description.
  • Showcase problem-solving skills: Prepare to discuss specific examples of how you've designed, built, and troubleshot complex observability systems.
  • Understand Nscale's AI focus: Familiarize yourself with GPU cloud, AI workloads, and distributed systems architecture as Nscale is a GPU cloud engineered for AI.
  • Prepare for technical and behavioral interviews: Practice explaining your technical decisions and demonstrating strong communication and collaboration skills.
