
Staff ML Platform Engineer

Shopify

Remote
Full Time
$180,000

Job Overview

Job Title: Staff ML Platform Engineer
Job Type: Full Time
Offered Salary: $180,000
Location: Remote

Job Description

About The Role

Shopify is the commerce platform that powers millions of merchants worldwide. Behind the product experience are ML systems that drive recommendations, search, and personalization at massive scale.

We build the compute and serving layer behind these systems: multi-node GPU training clusters, real-time inference with strict latency budgets, and the performance engineering that keeps it all efficient at scale. Our models serve hundreds of millions of buyers, and the infrastructure we build directly impacts how merchants grow their businesses.

The Role

You will own the core infrastructure that ML Engineers depend on to train and serve models: GPU training clusters, real-time serving systems, and the performance and reliability layer underneath both. You'll sit between ML Engineers who need fast iteration and production systems that need to stay up during events like Black Friday/Cyber Monday, where traffic and stakes peak simultaneously.

This role carries real technical authority. You'll make architectural decisions about how we scale training and serving, set standards for infrastructure quality, and be the person the team relies on when systems need to scale by an order of magnitude. You'll mentor engineers across the team, drive alignment on infrastructure direction across multiple workstreams, and influence technical strategy beyond your immediate team. You'll also raise the engineering bar through hiring and technical reviews.

What You'll Do

Training Infrastructure
  • Design and operate GPU training pipelines on Kubernetes, including multi-node distributed training on GPU clusters
  • Own training reliability: checkpointing, fault tolerance, preemption recovery, and resource scheduling
  • Optimize training performance: mixed precision, kernel tuning, data loading throughput, and cluster utilization. You own compute efficiency; data correctness and freshness are owned by the operations side of the team.
  • Build abstractions that let ML Engineers launch and iterate on training runs with minimal friction
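To make the fault-tolerance bullets above concrete: the core of preemption recovery is checkpointing state atomically and resuming from the latest checkpoint on restart. The sketch below is illustrative only (none of this code is from the posting); a real trainer would checkpoint model and optimizer state with `torch.save` rather than JSON, but the write-then-rename pattern is the same.

```python
import json
import os
import tempfile

class CheckpointManager:
    """Minimal atomic-checkpoint sketch: write to a temp file, then
    rename, so a preempted job never observes a half-written checkpoint."""

    def __init__(self, ckpt_dir):
        self.ckpt_dir = ckpt_dir
        os.makedirs(ckpt_dir, exist_ok=True)
        self.path = os.path.join(ckpt_dir, "latest.json")

    def save(self, step, state):
        # os.replace is atomic on POSIX: readers see the old checkpoint
        # or the new one, never a partial write.
        fd, tmp = tempfile.mkstemp(dir=self.ckpt_dir)
        with os.fdopen(fd, "w") as f:
            json.dump({"step": step, "state": state}, f)
        os.replace(tmp, self.path)

    def restore(self):
        # Returns (step, state), or (0, None) on a fresh start.
        if not os.path.exists(self.path):
            return 0, None
        with open(self.path) as f:
            ckpt = json.load(f)
        return ckpt["step"], ckpt["state"]

def train(mgr, total_steps):
    """Toy training loop that resumes from the last checkpoint,
    standing in for a real distributed training step."""
    step, state = mgr.restore()
    state = state if state is not None else {"loss": None}
    while step < total_steps:
        step += 1
        state["loss"] = 1.0 / step   # placeholder for real work
        if step % 10 == 0:           # periodic checkpoint
            mgr.save(step, state)
    return step, state
```

If the job is preempted mid-run, calling `train` again restores the last saved step and continues, which is the behavior the reliability bullet asks you to guarantee at cluster scale.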
Serving Infrastructure
  • Build and maintain model serving infrastructure for real-time recommendation and LLM inference, with strict latency and throughput requirements
  • Optimize serving cost and performance: batching strategies, model compilation, GPU right-sizing, and autoscaling
  • Ensure serving systems meet availability and latency targets under peak traffic
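The batching bullet above is a trade-off between throughput (bigger batches) and latency (bounded queueing time). Here is a deterministic toy simulation of that greedy policy, not production code and not Shopify's implementation; real stacks such as vLLM or Triton do continuous batching inside the engine.

```python
def form_batches(arrivals, max_batch, max_wait):
    """Greedy dynamic batching: close a batch when it reaches max_batch
    requests, or when the oldest queued request has waited max_wait.
    `arrivals` are monotonically increasing arrival timestamps (seconds);
    returns a list of (dispatch_time, batch_of_arrival_times)."""
    batches, current, opened = [], [], None
    for t in arrivals:
        if current and t - opened >= max_wait:
            # The oldest request's wait budget expired before this
            # arrival: dispatch what we have instead of holding latency.
            batches.append((opened + max_wait, current))
            current, opened = [], None
        if opened is None:
            opened = t
        current.append(t)
        if len(current) == max_batch:
            batches.append((t, current))
            current, opened = [], None
    if current:
        batches.append((opened + max_wait, current))
    return batches
```

Raising `max_batch` improves GPU utilization per forward pass; raising `max_wait` trades p99 latency for fuller batches, which is exactly the knob this role would own under peak traffic.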
Platform & Developer Experience
  • Build internal tools and platforms that accelerate the model development lifecycle
  • Define infrastructure patterns and best practices adopted across the team
  • Improve the inner loop for ML Engineers: faster iteration from code change to training result to production evaluation
Technical Leadership
  • Drive cross-team technical strategy for ML infrastructure: identify the next set of problems before they become blockers
  • Mentor and up-level engineers on the team through pairing, design reviews, and setting technical standards
  • Contribute to hiring: screen candidates, conduct technical interviews, and calibrate the engineering bar
  • Write technical proposals and RFCs that shape infrastructure direction across the organization

Required

  • 7+ years in software engineering, with 5+ years focused on ML infrastructure or distributed systems
  • Deep hands-on experience with GPU training at scale: distributed training, checkpointing, fault recovery, and performance tuning. You've debugged real problems like NCCL hangs, gradient synchronization issues, or data loading bottlenecks.
  • Strong Kubernetes skills: pod specs, GPU scheduling, resource quotas, debugging scheduling failures, and operating stateful GPU workloads
  • Production model serving experience: you've built or operated serving systems behind real user traffic with latency constraints
  • Solid Python and systems fundamentals; comfortable reading and modifying PyTorch training code
  • Experience designing infrastructure abstractions used by other engineers
  • Demonstrated technical leadership: you've driven architecture decisions, written technical proposals, and influenced engineering direction beyond your immediate team
  • Track record of mentoring engineers and raising the technical bar on a team
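The Kubernetes bullet above mentions pod specs and GPU scheduling. For concreteness, here is a minimal single-GPU training pod manifest, written as a Python dict so it stays in one language with the other sketches (real manifests are YAML; the image name and labels are invented for illustration).

```python
# Illustrative single-GPU training pod. GPUs are requested through the
# device-plugin extended resource name "nvidia.com/gpu"; for extended
# resources, Kubernetes requires limits (requests default to match).
gpu_training_pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "trainer-0", "labels": {"job": "demo-train"}},
    "spec": {
        "restartPolicy": "OnFailure",  # restart after transient failures
        "containers": [{
            "name": "trainer",
            "image": "example.registry/train:latest",  # hypothetical image
            "command": ["python", "train.py"],
            "resources": {
                "limits": {
                    "nvidia.com/gpu": 1,
                    "memory": "32Gi",
                    "cpu": "8",
                },
            },
        }],
    },
}
```

Debugging scheduling failures, one of the listed skills, usually starts here: a pod stuck in `Pending` with an unsatisfiable GPU limit, quota, or node-affinity constraint.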

Preferred

  • Experience with cloud-native ML orchestration (SkyPilot, Ray, or similar)
  • Hands-on with LLM serving stacks (vLLM, TensorRT-LLM, Triton, or equivalent)
  • Experience with model compression in production (quantization, pruning, distillation)
  • Experience operating recommendation or retrieval systems at scale
  • Track record of building internal platforms adopted by other teams

How We Work

  • You'll pair directly with ML Engineers. Understanding their models well enough to build the right infrastructure abstractions is part of the job.
  • We prefer automation over runbooks. If a process can be scripted, it should be.
  • On-call is shared. When you're on rotation, your scope is GPU cluster health, training failures, and serving availability; you own it end to end.
  • You'll profile GPU kernels, chase p99 latency regressions, and care about FLOPS utilization. This is a deeply technical infrastructure role.
  • Research and production are the same codebase. You'll see your infrastructure decisions reflected in real model quality and real merchant outcomes.
  • Shopify operates on high trust and low process. You'll have real ownership and the autonomy to make decisions, not just execute tickets.
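The p99 regressions mentioned above are just percentiles over latency samples. A minimal nearest-rank implementation, shown purely for illustration (not Shopify code):

```python
def percentile(samples, p):
    """Nearest-rank percentile, the definition most latency dashboards
    use: the smallest sample with at least p% of samples at or below it.
    Assumes integer p in 1..100."""
    if not samples:
        raise ValueError("need at least one sample")
    xs = sorted(samples)
    rank = -(-p * len(xs) // 100)  # ceil(p/100 * n) via integer math
    return xs[max(rank - 1, 0)]
```

Tail percentiles move with the worst few requests, which is why batching, autoscaling, and kernel-level optimizations get judged against p99 rather than the mean.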

What Success Looks Like

  • In 3 months: You've onboarded to training and serving infrastructure, shipped at least one meaningful improvement to reliability or performance, and can independently debug issues across the GPU stack.
  • In 6 months: You own a major infrastructure subsystem (training cluster or serving platform). Researchers are training faster or serving more reliably because of changes you've made.
  • In 12 months: You've shaped the technical roadmap for ML infrastructure and influenced engineering direction beyond the team. Other engineers across the organization come to you for architectural guidance. The platform scales to the next generation of models because of the systems and standards you've put in place. You've made the team stronger through hiring and mentorship.

About Shopify

Opportunity is not evenly distributed. Shopify puts independence within reach for anyone with a dream to start a business. We propel entrepreneurs and enterprises to scale the heights of their potential. Since 2006, we’ve grown to over 8,300 employees and generated over $1 trillion in sales for millions of merchants in 175 countries.

This is life-defining work that directly impacts people’s lives as much as it transforms your own. It puts the power of the few in the hands of the many, builds a future with more voices rather than fewer, and creates more choices instead of an elite option.

About You

Moving at our pace brings a lot of change, complexity, and ambiguity—and a little bit of chaos. Shopifolk thrive on that and are comfortable being uncomfortable. That means Shopify is not the right place for everyone.

Before you apply, consider if you can:

  • Care deeply about what you do and about making commerce better for everyone
  • Excel by seeking professional and personal hypergrowth
  • Keep up with an unrelenting pace (we measure progress in weeks, not quarters)
  • Be resilient and resourceful in face of ambiguity and thrive on (rather than endure) change
  • Bring critical thought and opinion
  • Put AI agents and tools to work on the tasks they're built for, and focus on the work only humans can do
  • Embrace differences and disagreement to get shit done and move forward
  • Work digital-first for your daily work

We may use AI-enabled tools to screen, select, and assess applications. All AI outputs are reviewed and validated by our recruitment team.

Key skills/competencies

  • Staff ML Platform Engineer
  • Machine Learning Infrastructure
  • Distributed Systems
  • GPU Training
  • Kubernetes
  • Model Serving
  • Performance Engineering
  • Python
  • Technical Leadership
  • Platform Development

How to Get Hired at Shopify

  • Tailor your resume: Highlight ML infrastructure, distributed systems, GPU training, and Kubernetes experience. Quantify achievements.
  • Showcase technical leadership: Emphasize architectural decisions, mentorship, and cross-team influence.
  • Prepare for technical interviews: Review Python, systems fundamentals, PyTorch, and Kubernetes concepts. Practice debugging distributed systems.
  • Understand Shopify's culture: Research their mission, values, and how they operate with high trust and low process.
  • Craft a compelling cover letter: Express your passion for commerce and your ability to thrive in a fast-paced, ambiguous environment.
