Member of Engineering Pre-training Data Research
Poolside
Job Description
About Poolside
Poolside is building Artificial General Intelligence (AGI) and aims to be a leader in this transformative decade. Our strategy focuses on accelerating software development with agentic systems, coding assistants, and frontier models, serving security-conscious enterprises.
About Our Team
We are a globally distributed team across Europe and North America, with monthly in-person collaboration in Paris. Our multidisciplinary team comprises research, engineering, and business experts united by a shared passion for our mission. We foster a culture of low ego, kindness, hard work, and intellectual curiosity.
About The Role
Join our data team as a Member of Engineering focused on enhancing the quality of datasets for training our models. Your primary mission will be to improve pretraining dataset quality through experimentation, synthetic data generation, and data mix optimization. You will collaborate closely with Pretraining, Posttraining, Evals, and Product teams to define data needs aligned with model capabilities and use cases. Staying current with dataset design and pretraining research is crucial, as you'll lead research initiatives through experiments and deploy technical engineering solutions. You will have access to a performant distributed data pipeline and a large GPU cluster.
Your Mission
To deliver large, high-quality, and diverse datasets of natural language and source code for training Poolside models and coding agents.
Responsibilities
- Follow the latest research related to LLMs and data quality.
- Be familiar with relevant open-source datasets and models.
- Design and implement complex pipelines for data generation, ensuring diversity and optimizing resources.
- Collaborate with Pretraining, Posttraining, Evals, and Product teams for rapid feedback loops.
- Conduct and analyze data ablations or training experiments to improve dataset quality using quantitative insights.
Skills & Experience
- Strong machine learning and engineering background.
- Experience with Large Language Models (LLMs), including transformer architectures, how LLMs learn, data ablations, scaling laws, mid-training, and post-training techniques.
- Experience training reasoning and agentic models.
- Experience with evals tracking model capabilities (general knowledge, reasoning, math, coding, long-context, etc.).
- Experience in building trillion-scale pretraining datasets, including data curation, deduplication, data mixing, tokenization, curriculum, and data repetition impact.
- Excellent programming skills in Python.
- Strong prompt engineering skills.
- Experience with large-scale GPU clusters and distributed data pipelines.
- Strong obsession with data quality.
- Research experience: Author of scientific papers in applied deep learning, LLMs, source code generation, etc. (nice to have).
- Ability to freely discuss the latest papers and engage in detailed technical discussions.
- Reasonably opinionated and able to express informed viewpoints.
Process
- Intro call with a Founding Engineer.
- Technical Interview(s) with a Member of Engineering.
- Team fit call with the People team.
- Final interview with a Founding Engineer.
Benefits
- Fully remote work & flexible hours.
- 37 days/year of vacation & holidays.
- Health insurance allowance for you & dependents.
- Company-provided equipment.
- Well-being, learning, and home office allowances.
- Frequent team get-togethers.
- Diverse & inclusive people-first culture.
Key skills and competencies
- Large Language Models (LLMs)
- Machine Learning
- Data Quality
- Python
- Distributed Data Pipelines
- GPU Clusters
- Prompt Engineering
- Data Curation
- Transformer Architectures
- Research Initiatives
How to Get Hired at Poolside
- Tailor your resume: Highlight your ML, LLM, Python, and data quality experience. Quantify achievements in dataset generation and pipeline development.
- Showcase research: Emphasize any publications or significant research contributions in applied deep learning or LLMs.
- Prepare for technical interviews: Brush up on transformer architectures, LLM training techniques, data curation, and distributed systems.
- Demonstrate collaboration: Be ready to discuss how you work with cross-functional teams to define data needs and feedback loops.
- Research Poolside's mission: Understand their AGI goal and how data quality is central to it.