AI Training Infrastructure Engineer

Dex

Remote
Full Time
€175,000

Job Overview

Job Title: AI Training Infrastructure Engineer
Job Type: Full Time
Category: Commerce
Experience: 5 Years
Degree: Master's
Offered Salary: €175,000
Location: Remote

Job Description

AI Training Infrastructure Engineer

This role is with one of Dex’s trusted Partner companies. We work closely with their teams to truly understand their culture, goals, and what they’re looking for, so we can match you with the right opportunity and give you context about the role before you commit to a process.

If you're interested, sign up to Dex to apply.

Dex is an AI recruiter agent that helps you run your job search. Tell Dex your stack, seniority, and what you want to build, and it will manage your applications and surface other opportunities that fit.

About the Company

A well-funded generative AI company building foundational models that create high-quality sound, speech, and music directly from video.

Their technology enables creators, platforms, and gaming companies to generate realistic, synchronised audio for previously silent or dynamically produced video content. Backed by leading global investors, the company is scaling rapidly across engineering and product.

The Opportunity

This role sits at the core of the model training stack. You will design and optimise the infrastructure that enables large-scale training of generative audio and video models. The focus is on GPU-level performance, distributed systems, and building scalable pipelines that allow researchers to iterate efficiently.

Your work will directly influence training throughput, cost efficiency, and model performance.

What You’ll Do

  • Design and optimise distributed training strategies across varying model sizes and compute constraints.
  • Profile and debug GPU workloads to improve utilisation and throughput.
  • Improve end-to-end training pipelines, including data loading, distributed execution, checkpointing, and logging.
  • Architect and maintain scalable ML training clusters (SLURM-based).
  • Implement experiment tracking, model versioning, and reproducibility systems.
  • Optimise PyTorch code and inference pathways for performance and efficiency.
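To make the distributed-execution theme above concrete, here is a minimal sketch of a ring all-reduce, the collective that underpins data-parallel gradient synchronisation on real GPU clusters (where a library such as NCCL does this work). Everything here is illustrative, simulated in pure Python rather than taken from the company's stack:

```python
# Ring all-reduce, simulated in pure Python: each "worker" holds a
# gradient vector, and after the collective every worker holds the
# element-wise sum across all workers.

def ring_all_reduce(grads):
    """grads: one equal-length list per worker; length must divide
    evenly by the world size. Returns the summed gradients per worker."""
    n = len(grads)                        # world size
    assert len(grads[0]) % n == 0, "length must be divisible by world size"
    chunk = len(grads[0]) // n            # each worker "owns" one chunk
    grads = [list(g) for g in grads]      # copy so caller's lists survive

    # Phase 1: reduce-scatter. After n-1 steps, worker i holds the
    # fully reduced values for chunk (i + 1) % n.
    for step in range(n - 1):
        for i in range(n):
            src = (i - step) % n          # chunk worker i forwards this step
            dst = (i + 1) % n             # its neighbour in the ring
            for j in range(src * chunk, (src + 1) * chunk):
                grads[dst][j] += grads[i][j]

    # Phase 2: all-gather. Completed chunks circulate around the ring
    # until every worker has every fully reduced chunk.
    for step in range(n - 1):
        for i in range(n):
            src = (i + 1 - step) % n
            dst = (i + 1) % n
            for j in range(src * chunk, (src + 1) * chunk):
                grads[dst][j] = grads[i][j]
    return grads

# Two workers with gradients [1, 2, 3, 4] and [10, 20, 30, 40]:
print(ring_all_reduce([[1, 2, 3, 4], [10, 20, 30, 40]]))
# both workers end up with [11, 22, 33, 44]
```

The design point worth noticing: each worker transfers 2(n-1)/n of the gradient volume regardless of world size, which is why the ring variant scales to large clusters where a naive all-to-all would not.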

What They’re Looking For

  • Strong hands-on experience optimising large-scale training workloads.
  • Deep understanding of GPU architecture, memory hierarchies, and performance bottlenecks.
  • Experience balancing compute-bound vs memory-bound workloads.
  • Expertise in distributed training and parallelism strategies.
  • Strong systems thinking across data pipelines, storage, and cluster orchestration.
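The compute-bound vs memory-bound distinction above can be made concrete with a back-of-the-envelope roofline check: a kernel is compute-bound when its arithmetic intensity (FLOPs per byte moved) exceeds the machine balance (peak FLOP/s divided by memory bandwidth). The hardware numbers below are hypothetical placeholders, not specs of any machine mentioned in this posting:

```python
# Roofline back-of-the-envelope for a GEMM. Hardware numbers are
# illustrative placeholders, not the specs of a particular GPU.

PEAK_FLOPS = 300e12              # 300 TFLOP/s, hypothetical tensor-core peak
MEM_BW = 2e12                    # 2 TB/s, hypothetical HBM bandwidth
BALANCE = PEAK_FLOPS / MEM_BW    # FLOPs/byte needed to stay compute-bound

def matmul_intensity(m, n, k, bytes_per_elem=2):
    """Arithmetic intensity of an (m,k) @ (k,n) matmul in fp16."""
    flops = 2 * m * n * k                                    # multiply + add per output
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)   # read A, B; write C
    return flops / bytes_moved

big = matmul_intensity(4096, 4096, 4096)   # large square GEMM
small = matmul_intensity(8, 4096, 4096)    # skinny GEMM, e.g. tiny batch

print(f"machine balance: {BALANCE:.0f} FLOPs/byte")
print(f"4096^3 GEMM: {big:.0f} FLOPs/byte "
      f"-> {'compute' if big > BALANCE else 'memory'}-bound")
print(f"8x4096x4096 GEMM: {small:.0f} FLOPs/byte "
      f"-> {'compute' if small > BALANCE else 'memory'}-bound")
```

Under these assumed numbers the large GEMM lands well above the balance point (compute-bound) while the skinny GEMM falls far below it (memory-bound), which is exactly the kind of reasoning that guides batching, fusion, and precision choices in a training stack.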

Nice to have:

  • Experience implementing custom GPU kernels.
  • Familiarity with diffusion or autoregressive models.
  • Experience managing SLURM clusters at scale.
  • Knowledge of high-performance storage systems for ML workloads.

Why It’s Compelling

  • Foundational role shaping the infrastructure behind next-generation generative models.
  • Significant autonomy and technical ownership.
  • Backed by top-tier investors with strong early traction.
  • Competitive compensation (€150k–€200k + equity).
  • Remote-first with periodic collaboration in Berlin.

Key Skills and Competencies

  • Infrastructure Design
  • AI Training
  • GPU Optimization
  • Distributed Systems
  • ML Pipelines
  • PyTorch
  • SLURM
  • Generative Models
  • Performance Tuning
  • Scalability

Tags:

AI Training Infrastructure Engineer
Infrastructure Design
Distributed Training
GPU Performance
ML Pipeline
Experiment Tracking
Model Versioning
Code Optimization
Systems Thinking
Workload Debugging
Scalability
PyTorch
SLURM
CUDA
Python
Linux
Cloud Computing
Git
Deep Learning
Generative AI
Distributed Computing

How to Get Hired at Dex

  • Research Dex's partner company: Study their mission, values, recent news, and employee testimonials on LinkedIn and Glassdoor.
  • Tailor your resume: Highlight expertise in large-scale ML infrastructure, GPU optimization, and distributed systems.
  • Showcase relevant projects: Provide examples of work with PyTorch, SLURM, and high-performance computing in AI.
  • Prepare for technical deep-dives: Expect questions on GPU architecture, distributed training strategies, and debugging complex workloads.
  • Demonstrate systems thinking: Articulate how you approach end-to-end training pipelines and cluster orchestration.
