Senior/Staff Machine Learning Engineer, ML Training Platform

Pluralis Research

Hybrid
Full Time
A$200,000

Job Overview

Job Title: Senior/Staff Machine Learning Engineer, ML Training Platform
Job Type: Full Time
Category: Commerce
Experience: 5 Years
Degree: Master
Offered Salary: A$200,000
Location: Hybrid

Job Description

About Pluralis Research

Pluralis Research carries out foundational research on Protocol Learning: multi-participant training of foundation models where no single participant has, or can ever obtain, a full copy of the model. The purpose of Protocol Learning is to facilitate the creation of community-trained and community-owned frontier models with self-sustaining economics.

The Role: Senior/Staff Machine Learning Engineer, ML Training Platform

We're looking for Senior/Staff engineers with 5+ years of experience in distributed systems and large-scale ML training. You'll be implementing a novel substrate for training distributed ML models that works over consumer-grade internet connections.

Responsibilities

Distributed Training Architecture & Optimization

  • Design and implement large-scale distributed training systems optimized for heterogeneous hardware operating under low-bandwidth, high-latency conditions.
  • Develop and optimize model-parallel training strategies (data, tensor, pipeline parallelism) with custom sharding techniques that minimize communication overhead.
  • Optimize GPU utilization, memory efficiency, and compute performance across distributed nodes.
  • Implement robust checkpointing, state synchronization, and recovery mechanisms for long-running, fault-prone training jobs (a minimal sketch follows this list).
  • Build monitoring and metrics systems to track training progress, model quality, and system bottlenecks.
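
To give a flavour of the checkpointing work, here is a minimal sketch of the atomic write-and-rename pattern commonly used for fault-prone jobs, assuming a PyTorch stack; the function names and layout are illustrative, not Pluralis's actual design.

```python
import os
import tempfile

import torch


def save_checkpoint(model, optimizer, step, path):
    """Atomically persist training state.

    Writing to a temp file and renaming means a crash mid-write can
    never leave a truncated checkpoint behind: os.replace is atomic
    on POSIX filesystems when source and target share a filesystem.
    """
    state = {
        "step": step,
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
    }
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    try:
        with os.fdopen(fd, "wb") as f:
            torch.save(state, f)
        os.replace(tmp_path, path)  # atomic swap into place
    except BaseException:
        os.unlink(tmp_path)  # clean up the partial file
        raise


def load_checkpoint(model, optimizer, path):
    """Restore state if a checkpoint exists; return the step to resume from."""
    if not os.path.exists(path):
        return 0
    state = torch.load(path, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"]
```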

Decentralized Networking & Resilience

  • Architect resilient training systems where nodes can fail, networks can partition, and participants can dynamically join or leave.
  • Design and optimize peer-to-peer topologies for decentralized coordination across non-co-located nodes.
  • Implement NAT traversal, peer discovery, dynamic routing, and connection lifecycle management (see the liveness sketch after this list).
  • Profile and optimize communication patterns to reduce latency and bandwidth overhead in multi-participant environments.
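
As a rough illustration of the membership problem these bullets describe, below is a minimal peer-liveness sketch built on asyncio; the PeerTable class, the interval constants, and the reaper loop are hypothetical, not the protocol's real mechanism.

```python
import asyncio
import time

HEARTBEAT_INTERVAL = 5.0   # seconds between pings (illustrative value)
PEER_TIMEOUT = 15.0        # declare a peer dead after this much silence


class PeerTable:
    """Track when each peer was last heard from."""

    def __init__(self):
        self.last_seen: dict[str, float] = {}

    def record_heartbeat(self, peer_id: str) -> None:
        # Called whenever any message arrives from a peer.
        self.last_seen[peer_id] = time.monotonic()

    def live_peers(self) -> set[str]:
        now = time.monotonic()
        return {p for p, t in self.last_seen.items() if now - t < PEER_TIMEOUT}


async def reap_dead_peers(table: PeerTable) -> None:
    """Turn silence into an explicit membership change.

    A crashed node or a network partition shows up here as a dropped
    peer, which the training layer can react to (re-shard, re-route)
    instead of blocking forever on a dead connection.
    """
    while True:
        await asyncio.sleep(HEARTBEAT_INTERVAL)
        for peer in set(table.last_seen) - table.live_peers():
            del table.last_seen[peer]
```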

What You’ll Bring

  • Strong experience building and operating distributed systems in production.
  • Hands-on expertise with distributed training frameworks (FSDP, DeepSpeed, Megatron, or similar).
  • Deep understanding of model parallelism (data, tensor, pipeline parallelism).
  • Expert-level Python with production experience (concurrency, error handling, retry logic, clean architecture); a representative retry pattern is sketched after this list.
  • Strong networking fundamentals: P2P systems, gRPC, routing, NAT traversal, distributed coordination.
  • Experience optimizing GPU workloads, memory management, and large-scale compute efficiency.
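
For concreteness, one example of the production-Python idioms in question: a retry decorator with exponential backoff and jitter, a standard defence against transient failures on consumer-grade links. Everything here, including fetch_gradient_shard, is a hypothetical sketch rather than code from the role.

```python
import functools
import logging
import random
import time

log = logging.getLogger(__name__)


def retry(max_attempts=5, base_delay=0.5,
          exceptions=(ConnectionError, TimeoutError)):
    """Retry a flaky call, doubling the delay each attempt and adding
    jitter so many clients do not retry in lockstep."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except exceptions as exc:
                    if attempt == max_attempts:
                        raise  # out of retry budget; surface the failure
                    delay = base_delay * 2 ** (attempt - 1)
                    delay += random.uniform(0, delay)  # full jitter
                    log.warning("attempt %d/%d failed (%s); retrying in %.1fs",
                                attempt, max_attempts, exc, delay)
                    time.sleep(delay)
        return wrapper
    return decorator


@retry(max_attempts=3)
def fetch_gradient_shard(peer_url: str) -> bytes:
    ...  # placeholder for an actual network call
```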

What We Offer

  • Equity-heavy compensation with meaningful ownership in a mission-driven company.
  • Competitive base salary for senior engineering roles in Australia.
  • Visa sponsorship available for exceptional candidates.
  • Remote-first with optional access to our Melbourne hub.
  • World-class team: teammates were previously at Google, Amazon, Microsoft, and leading startups.

Backed by Union Square Ventures and other tier-1 investors, we're a world-class, deeply technical team of ML researchers and engineers. Pluralis is unapologetically ideological: we believe the world is a better place if we succeed at what we are attempting, and that Protocol Learning is the only plausible way to prevent a handful of massive corporations from monopolising model development, access, and release, and achieving massive economic capture. If this resonates, please apply.

Key Skills/Competency

  • Distributed Systems
  • Machine Learning Training
  • Model Parallelism
  • GPU Optimization
  • Peer-to-Peer Networking
  • Python (Expert)
  • Fault Tolerance
  • System Resilience
  • Low-Bandwidth Optimization
  • Large-Scale Compute

Tags:

Machine Learning Engineer
Distributed Systems
ML Training
Model Parallelism
GPU Optimization
Peer-to-Peer Networking
Python
Fault Tolerance
System Resilience
Low-Bandwidth Optimization
Large-Scale Compute
FSDP
DeepSpeed
Megatron
gRPC
NAT Traversal
Distributed Coordination
Data Parallelism
Tensor Parallelism
Pipeline Parallelism
Checkpointing

How to Get Hired at Pluralis Research

  • Research Pluralis Research's mission: Study their vision for Protocol Learning and its ideological foundation for community-owned models.
  • Tailor your resume: Highlight extensive experience in distributed systems, large-scale ML training, and networking, emphasizing achievements.
  • Showcase relevant projects: Detail practical experience with distributed training frameworks like FSDP, DeepSpeed, or custom parallelization strategies.
  • Prepare for technical depth: Expect rigorous discussions on distributed ML architecture, GPU optimization, P2P networking, and fault tolerance.
  • Emphasize problem-solving: Be ready to discuss challenges and solutions for building resilient systems in low-bandwidth, high-latency, and decentralized environments.
