
Staff ML Platform Engineer

Shopify

Remote
Full Time
$180,000

Job Overview

Job Title: Staff ML Platform Engineer
Job Type: Full Time
Offered Salary: $180,000
Location: Remote

Job Description

About The Role

Shopify is the commerce platform that powers millions of merchants worldwide. Behind the product experience are ML systems that drive recommendations, search, and personalization at massive scale.

We build the compute and serving layer behind these systems: multi-node GPU training clusters, real-time inference with strict latency budgets, and the performance engineering that keeps it all efficient at scale. Our models serve hundreds of millions of buyers, and the infrastructure we build directly impacts how merchants grow their businesses.

The Role

You will own the core infrastructure that ML Engineers depend on to train and serve models: GPU training clusters, real-time serving systems, and the performance and reliability layer underneath both. You'll sit between ML Engineers who need fast iteration and production systems that need to stay up during events like Black Friday/Cyber Monday, where traffic and stakes peak simultaneously.

This role carries real technical authority. You'll make architectural decisions about how we scale training and serving, set standards for infrastructure quality, and be the person the team relies on when systems need to scale by an order of magnitude. You'll mentor engineers across the team, drive alignment on infrastructure direction across multiple workstreams, and influence technical strategy beyond your immediate team. You'll also raise the engineering bar through hiring and technical reviews.

What You'll Do

Training Infrastructure
  • Design and operate GPU training pipelines on Kubernetes, including multi-node distributed training on GPU clusters
  • Own training reliability: checkpointing, fault tolerance, preemption recovery, and resource scheduling
  • Optimize training performance: mixed precision, kernel tuning, data loading throughput, and cluster utilization. You own compute efficiency; data correctness and freshness are owned by the operations side of the team.
  • Build abstractions that let ML Engineers launch and iterate on training runs with minimal friction
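To make the fault-tolerance bullets above concrete: the core of preemption recovery is checkpointing state atomically and resuming from the latest checkpoint on restart. The sketch below is illustrative only (none of this code is from the posting); a real trainer would checkpoint model and optimizer state with `torch.save` rather than JSON, but the write-then-rename pattern is the same.

```python
import json
import os
import tempfile

class CheckpointManager:
    """Minimal atomic-checkpoint sketch: write to a temp file, then
    rename, so a preempted job never observes a half-written checkpoint."""

    def __init__(self, ckpt_dir):
        self.ckpt_dir = ckpt_dir
        os.makedirs(ckpt_dir, exist_ok=True)
        self.path = os.path.join(ckpt_dir, "latest.json")

    def save(self, step, state):
        # os.replace is atomic on POSIX: readers see the old checkpoint
        # or the new one, never a partial write.
        fd, tmp = tempfile.mkstemp(dir=self.ckpt_dir)
        with os.fdopen(fd, "w") as f:
            json.dump({"step": step, "state": state}, f)
        os.replace(tmp, self.path)

    def restore(self):
        # Returns (step, state), or (0, None) on a fresh start.
        if not os.path.exists(self.path):
            return 0, None
        with open(self.path) as f:
            ckpt = json.load(f)
        return ckpt["step"], ckpt["state"]

def train(mgr, total_steps):
    """Toy training loop that resumes from the last checkpoint,
    standing in for a real distributed training step."""
    step, state = mgr.restore()
    state = state if state is not None else {"loss": None}
    while step < total_steps:
        step += 1
        state["loss"] = 1.0 / step   # placeholder for real work
        if step % 10 == 0:           # periodic checkpoint
            mgr.save(step, state)
    return step, state
```

If the job is preempted mid-run, calling `train` again restores the last saved step and continues, which is the behavior the reliability bullet asks you to guarantee at cluster scale.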
Serving Infrastructure
  • Build and maintain model serving infrastructure for real-time recommendation and LLM inference, with strict latency and throughput requirements
  • Optimize serving cost and performance: batching strategies, model compilation, GPU right-sizing, and autoscaling
  • Ensure serving systems meet availability and latency targets under peak traffic
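The batching bullet above is a trade-off between throughput (bigger batches) and latency (bounded queueing time). Here is a deterministic toy simulation of that greedy policy, not production code and not Shopify's implementation; real stacks such as vLLM or Triton do continuous batching inside the engine.

```python
def form_batches(arrivals, max_batch, max_wait):
    """Greedy dynamic batching: close a batch when it reaches max_batch
    requests, or when the oldest queued request has waited max_wait.
    `arrivals` are monotonically increasing arrival timestamps (seconds);
    returns a list of (dispatch_time, batch_of_arrival_times)."""
    batches, current, opened = [], [], None
    for t in arrivals:
        if current and t - opened >= max_wait:
            # The oldest request's wait budget expired before this
            # arrival: dispatch what we have instead of holding latency.
            batches.append((opened + max_wait, current))
            current, opened = [], None
        if opened is None:
            opened = t
        current.append(t)
        if len(current) == max_batch:
            batches.append((t, current))
            current, opened = [], None
    if current:
        batches.append((opened + max_wait, current))
    return batches
```

Raising `max_batch` improves GPU utilization per forward pass; raising `max_wait` trades p99 latency for fuller batches, which is exactly the knob this role would own under peak traffic.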
Platform & Developer Experience
  • Build internal tools and platforms that accelerate the model development lifecycle
  • Define infrastructure patterns and best practices adopted across the team
  • Improve the inner loop for ML Engineers: faster iteration from code change to training result to production evaluation
Technical Leadership
  • Drive cross-team technical strategy for ML infrastructure: identify the next set of problems before they become blockers
  • Mentor and up-level engineers on the team through pairing, design reviews, and setting technical standards
  • Contribute to hiring: screen candidates, conduct technical interviews, and calibrate the engineering bar
  • Write technical proposals and RFCs that shape infrastructure direction across the organization

Required

  • 7+ years in software engineering, with 5+ years focused on ML infrastructure or distributed systems
  • Deep hands-on experience with GPU training at scale: distributed training, checkpointing, fault recovery, and performance tuning. You've debugged real problems like NCCL hangs, gradient synchronization issues, or data loading bottlenecks.
  • Strong Kubernetes skills: pod specs, GPU scheduling, resource quotas, debugging scheduling failures, and operating stateful GPU workloads
  • Production model serving experience: you've built or operated serving systems behind real user traffic with latency constraints
  • Solid Python and systems fundamentals; comfortable reading and modifying PyTorch training code
  • Experience designing infrastructure abstractions used by other engineers
  • Demonstrated technical leadership: you've driven architecture decisions, written technical proposals, and influenced engineering direction beyond your immediate team
  • Track record of mentoring engineers and raising the technical bar on a team
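The Kubernetes bullet above mentions pod specs and GPU scheduling. For concreteness, here is a minimal single-GPU training pod manifest, written as a Python dict so it stays in one language with the other sketches (real manifests are YAML; the image name and labels are invented for illustration).

```python
# Illustrative single-GPU training pod. GPUs are requested through the
# device-plugin extended resource name "nvidia.com/gpu"; for extended
# resources, Kubernetes requires limits (requests default to match).
gpu_training_pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "trainer-0", "labels": {"job": "demo-train"}},
    "spec": {
        "restartPolicy": "OnFailure",  # restart after transient failures
        "containers": [{
            "name": "trainer",
            "image": "example.registry/train:latest",  # hypothetical image
            "command": ["python", "train.py"],
            "resources": {
                "limits": {
                    "nvidia.com/gpu": 1,
                    "memory": "32Gi",
                    "cpu": "8",
                },
            },
        }],
    },
}
```

Debugging scheduling failures, one of the listed skills, usually starts here: a pod stuck in `Pending` with an unsatisfiable GPU limit, quota, or node-affinity constraint.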

Preferred

  • Experience with cloud-native ML orchestration (SkyPilot, Ray, or similar)
  • Hands-on with LLM serving stacks (vLLM, TensorRT-LLM, Triton, or equivalent)
  • Experience with model compression in production (quantization, pruning, distillation)
  • Experience operating recommendation or retrieval systems at scale
  • Track record of building internal platforms adopted by other teams

How We Work

  • You'll pair directly with ML Engineers. Understanding their models well enough to build the right infrastructure abstractions is part of the job.
  • We prefer automation over runbooks. If a process can be scripted, it should be.
  • On-call is shared. When you're on rotation, your scope is GPU cluster health, training failures, and serving availability; you own it end to end.
  • You'll profile GPU kernels, chase p99 latency regressions, and care about FLOPS utilization. This is a deeply technical infrastructure role.
  • Research and production are the same codebase. You'll see your infrastructure decisions reflected in real model quality and real merchant outcomes.
  • Shopify operates on high trust and low process. You'll have real ownership and the autonomy to make decisions, not just execute tickets.
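The p99 regressions mentioned above are just percentiles over latency samples. A minimal nearest-rank implementation, shown purely for illustration (not Shopify code):

```python
def percentile(samples, p):
    """Nearest-rank percentile, the definition most latency dashboards
    use: the smallest sample with at least p% of samples at or below it.
    Assumes integer p in 1..100."""
    if not samples:
        raise ValueError("need at least one sample")
    xs = sorted(samples)
    rank = -(-p * len(xs) // 100)  # ceil(p/100 * n) via integer math
    return xs[max(rank - 1, 0)]
```

Tail percentiles move with the worst few requests, which is why batching, autoscaling, and kernel-level optimizations get judged against p99 rather than the mean.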

What Success Looks Like

  • In 3 months: You've onboarded to training and serving infrastructure, shipped at least one meaningful improvement to reliability or performance, and can independently debug issues across the GPU stack.
  • In 6 months: You own a major infrastructure subsystem (training cluster or serving platform). Researchers are training faster or serving more reliably because of changes you've made.
  • In 12 months: You've shaped the technical roadmap for ML infrastructure and influenced engineering direction beyond the team. Other engineers across the organization come to you for architectural guidance. The platform scales to the next generation of models because of the systems and standards you've put in place. You've made the team stronger through hiring and mentorship.

About Shopify

Opportunity is not evenly distributed. Shopify puts independence within reach for anyone with a dream to start a business. We propel entrepreneurs and enterprises to scale the heights of their potential. Since 2006, we’ve grown to over 8,300 employees and generated over $1 trillion in sales for millions of merchants in 175 countries.

This is life-defining work that directly impacts people’s lives as much as it transforms your own. It puts the power of the few in the hands of the many, builds a future with more voices rather than fewer, and creates more choices instead of an elite option.

About You

Moving at our pace brings a lot of change, complexity, and ambiguity—and a little bit of chaos. Shopifolk thrive on that and are comfortable being uncomfortable. That means Shopify is not the right place for everyone.

Before you apply, consider if you can:

  • Care deeply about what you do and about making commerce better for everyone
  • Excel by seeking professional and personal hypergrowth
  • Keep up with an unrelenting pace (we measure progress in weeks, not quarters)
  • Be resilient and resourceful in face of ambiguity and thrive on (rather than endure) change
  • Bring critical thought and opinion
  • Put AI agents and tools to work on the tasks they're built for, and focus on the work only humans can do
  • Embrace differences and disagreement to get shit done and move forward
  • Work digital-first for your daily work

We may use AI-enabled tools to screen, select, and assess applications. All AI outputs are reviewed and validated by our recruitment team.

Key skills/competencies

  • Staff ML Platform Engineer
  • Machine Learning Infrastructure
  • Distributed Systems
  • GPU Training
  • Kubernetes
  • Model Serving
  • Performance Engineering
  • Python
  • Technical Leadership
  • Platform Development

How to Get Hired at Shopify

  • Tailor your resume: Highlight ML infrastructure, distributed systems, GPU training, and Kubernetes experience. Quantify achievements.
  • Showcase technical leadership: Emphasize architectural decisions, mentorship, and cross-team influence.
  • Prepare for technical interviews: Review Python, systems fundamentals, PyTorch, and Kubernetes concepts. Practice debugging distributed systems.
  • Understand Shopify's culture: Research their mission, values, and how they operate with high trust and low process.
  • Craft a compelling cover letter: Express your passion for commerce and your ability to thrive in a fast-paced, ambiguous environment.
