
Member of Engineering Pre-training Data Research

Poolside

Job Overview

Job Title: Member of Engineering, Pre-training Data Research
Job Type: Full Time
Offered Salary: $150,000
Location: Hybrid

Job Description

About Poolside

Poolside is building Artificial General Intelligence (AGI) and aims to be a leader in this transformative decade. Our strategy focuses on accelerating software development with agentic systems, coding assistants, and frontier models, serving security-conscious enterprises.

About Our Team

We are a globally distributed team across Europe and North America, with monthly in-person collaboration in Paris. Our multidisciplinary team comprises research, engineering, and business experts united by a shared passion for our mission. We foster a culture of low ego, kindness, hard work, and intellectual curiosity.

About The Role

Join our data team as a Member of Engineering focused on improving the quality of the datasets used to train our models. Your primary mission will be to raise pretraining dataset quality through experimentation, synthetic data generation, and data mix optimization. You will collaborate closely with the Pretraining, Posttraining, Evals, and Product teams to define data needs aligned with model capabilities and use cases. Staying current with dataset design and pretraining research is essential, as you will lead research initiatives through experiments and deploy technical engineering solutions. You will have access to a performant distributed data pipeline and a large GPU cluster.

Your Mission

To deliver large, high-quality, and diverse datasets of natural language and source code for training Poolside models and coding agents.

Responsibilities

  • Follow the latest research related to LLMs and data quality.
  • Stay familiar with relevant open-source datasets and models.
  • Design and implement complex pipelines for data generation, ensuring diversity and optimizing resources.
  • Collaborate with Pretraining, Posttraining, Evals, and Product teams for rapid feedback loops.
  • Conduct and analyze data ablations or training experiments to improve dataset quality using quantitative insights.

Skills & Experience

  • Strong machine learning and engineering background.
  • Experience with Large Language Models (LLMs), including transformer architectures, how LLMs learn, data ablations, scaling laws, mid-training, and post-training techniques.
  • Experience training reasoning and agentic models.
  • Experience with evaluations (evals) that track model capabilities (general knowledge, reasoning, math, coding, long-context, etc.).
  • Experience in building trillion-scale pretraining datasets, including data curation, deduplication, data mixing, tokenization, curriculum, and data repetition impact.
  • Excellent programming skills in Python.
  • Strong prompt engineering skills.
  • Experience with large-scale GPU clusters and distributed data pipelines.
  • An obsession with data quality.
  • Research experience: Author of scientific papers in applied deep learning, LLMs, source code generation, etc. (nice to have).
  • Ability to freely discuss the latest papers and engage in detailed technical discussions.
  • Reasonably opinionated and able to express informed viewpoints.

Process

  • Intro call with a Founding Engineer.
  • Technical Interview(s) with a Member of Engineering.
  • Team fit call with the People team.
  • Final interview with a Founding Engineer.

Benefits

  • Fully remote work & flexible hours.
  • 37 days/year of vacation & holidays.
  • Health insurance allowance for you & dependents.
  • Company-provided equipment.
  • Well-being, learning, and home office allowances.
  • Frequent team get-togethers.
  • Diverse & inclusive people-first culture.

Key skills/competency

  • Large Language Models (LLMs)
  • Machine Learning
  • Data Quality
  • Python
  • Distributed Data Pipelines
  • GPU Clusters
  • Prompt Engineering
  • Data Curation
  • Transformer Architectures
  • Research Initiatives

Tags:

Machine Learning
LLM
Data Research
Python
Data Engineering
AI
AGI
Distributed Systems
GPU
Pretraining
Data Curation
Prompt Engineering

How to Get Hired at Poolside

  • Tailor your resume: Highlight your ML, LLM, Python, and data quality experience. Quantify achievements in dataset generation and pipeline development.
  • Showcase research: Emphasize any publications or significant research contributions in applied deep learning or LLMs.
  • Prepare for technical interviews: Brush up on transformer architectures, LLM training techniques, data curation, and distributed systems.
  • Demonstrate collaboration: Be ready to discuss how you work with cross-functional teams to define data needs and feedback loops.
  • Research Poolside's mission: Understand their AGI goal and how data quality is central to it.
