Staff Research Engineer, Pre-training Data

Reddit, Inc.

Hybrid
Full Time
$275,000

Job Overview

Job Title: Staff Research Engineer, Pre-training Data
Job Type: Full Time
Category: Commerce
Experience: 5 Years
Degree: Master
Offered Salary: $275,000
Location: Hybrid

Job Description

Staff Research Engineer, Pre-training Data at Reddit, Inc.

Reddit is a vibrant community built on shared interests, passion, and trust, fostering open and authentic conversations. With over 100,000 active communities and approximately 121 million daily active unique visitors, Reddit stands as a significant source of information on the internet.

Reddit is actively expanding its teams with top talent. This particular role offers complete remote flexibility within the United States. For those residing near our physical offices in San Francisco, Los Angeles, New York City, or Chicago, you are welcome to work from the office as often as you prefer.

The AI Engineering team at Reddit is driving a strategic initiative to develop Reddit-native foundational Large Language Models (LLMs). This team operates at the crossroads of applied research and large-scale infrastructure, with the mission to train models that deeply comprehend Reddit’s distinctive culture, language, and community structure. Joining this team means collaborating with distinguished engineers and safety experts to build the core engine of Reddit’s AI future, creating foundational models that will enhance Safety & Moderation, Search, Ads, and the next generation of user products.

As a Staff Research Engineer, Pre-training Data, you will be instrumental in defining the technical strategy and architecture for the data curriculum pipelines that power our advanced foundation models. This role involves working at the intersection of distributed infrastructure, multimodal processing, and mathematics, where you will design systems to transform Reddit’s extensive corpus of human conversation—encompassing petabytes of text, images, and video—into high-quality training signals. Your work will extend beyond simple text processing to engineer solutions that respect the intricate, tree-structured nature of Reddit threads, ensuring our models capture the nuances of community interaction effectively.

Responsibilities

  • Architect and implement high-throughput, deterministic data sampling systems for distributed training clusters at frontier-model scale.
  • Design and execute dynamic curriculum learning strategies, building systems that automatically adjust data distributions (text vs. multimodal) during training to enhance model stability and reasoning.
  • Engineer logic for serializing Reddit’s complex conversational trees (threads, subreddits, cross-posts) into optimal training contexts, developing topological data processing strategies that preserve semantic relationships for model understanding.
  • Formulate and validate statistical hypotheses concerning data mixtures, applying advanced sampling theory to minimize bias and maximize token quality.
  • Design the “Safety-First” ingestion layer: construct automated pipelines for PII redaction, toxicity signals, and quality deduplication upstream of training, collaborating closely with Safety and Moderation Engineering teams.
  • Bridge the gap between research and engineering by transforming theoretical sampling insights into robust, low-latency production infrastructure.
  • Mentor senior engineers and researchers on system design, numerical correctness, and performance optimization in distributed Python/Rust environments.

Required Qualifications

  • 8+ years of software engineering experience focusing on machine learning infrastructure, data science at scale, or LLM pre-training.
  • Expert proficiency in Python and distributed data processing frameworks (e.g., Ray Data, Spark, or custom high-performance dataloaders).
  • Experience managing unstructured and semi-structured data at scale (including text, code, images, and audio/video).
  • Strong mathematical foundation in probability, statistics, and importance sampling theory.
  • Deep understanding of pre-training dynamics and data quality/ordering impact on model performance.
  • Experience with graph data structures or serializing conversation trees is highly valued.

Nice To Have

  • Experience with JAX or PyTorch internals related to distributed data loading.
  • Experience with multimodal datasets (image/video + text) and vision-language preprocessing.
  • Proficiency in Rust or C++ for performance-critical data path optimization.
  • Published research or significant practical experience in active learning or automated data selection.

Benefits

  • Comprehensive Healthcare Benefits and Income Replacement Programs
  • 401k with Employer Match
  • Global Benefit programs covering workspace, professional development, and caregiving support
  • Family Planning Support
  • Gender-Affirming Care
  • Mental Health & Coaching Benefits
  • Flexible Vacation & Paid Volunteer Time Off
  • Generous Paid Parental Leave

Pay Transparency

This job posting may encompass more than one career level. In addition to base salary, this role is eligible for equity in the form of restricted stock units and may include a commission component depending on the position offered. Reddit provides a comprehensive benefits package to U.S.-based employees, including medical, dental, and vision insurance, a 401(k) program with employer match, ample vacation time, and parental leave. For further details, please visit https://www.redditinc.com/careers/.

For transparency, we provide base salary ranges for all US-based job postings. Our standard base pay ranges are determined by function, level, and country location, benchmarked against similar growth-stage companies. Final offer amounts are influenced by various factors such as skills, depth of work experience, and relevant licenses/credentials, and may differ from the listed amounts.

The base salary range for this position is $230,000 to $322,000 USD.

Please note that for select roles and locations, interviews may be recorded, transcribed, and summarized by artificial intelligence. Candidates will have the option to opt out of these processes prior to any scheduled interviews. During the interview, we will collect personal information including Identifiers, Professional and Employment-Related Information, Sensory Information (audio/video recording), and any other information you choose to share. This data will be used solely for evaluating your application. We do not sell your personal information or share it for third-party marketing. Interview recordings will be deleted promptly after a hiring decision is made. Refer to our Candidate Privacy Policy for more information.

Reddit is an equal opportunity employer committed to building a diverse workforce. We are dedicated to providing reasonable accommodations for qualified individuals with disabilities and disabled veterans during the job application process. If you require accommodation due to a disability, please inform your recruiter.

Key skills/competencies

  • Machine Learning Infrastructure
  • LLM Pre-training
  • Distributed Data Processing
  • Python Programming
  • Unstructured Data
  • Statistical Modeling
  • Graph Data Structures
  • Multimodal AI
  • Data Pipeline Architecture
  • System Design

Tags:

Staff Research Engineer
Machine Learning Infrastructure
LLM Pre-training
Data Pipelines
Distributed Systems
Data Engineering
Statistical Modeling
Curriculum Learning
PII Redaction
Toxicity Detection
Performance Optimization
Python
Ray Data
Spark
JAX
PyTorch
Rust
C++
Graph Databases
Multimodal AI
Large Language Models

How to Get Hired at Reddit, Inc.

  • Research Reddit, Inc.'s culture: Study their mission, values, recent news, and employee testimonials on LinkedIn and Glassdoor. Understand Reddit's unique community-driven ethos.
  • Tailor your resume: Customize your resume for Reddit, Inc.'s AI Engineering team by highlighting experience in LLM pre-training, distributed data processing, and handling unstructured data.
  • Showcase technical expertise: Prepare to discuss your deep proficiency in Python, distributed frameworks like Ray Data or Spark, and a strong mathematical foundation in statistics relevant to data sampling and quality.
  • Demonstrate problem-solving: Be ready to share specific examples of architecting complex data pipelines, especially those involving graph data structures or multimodal datasets.
  • Prepare for behavioral questions: Practice articulating how you've mentored peers, bridged research and engineering gaps, and contributed to a "safety-first" approach in data initiatives, aligning with Reddit, Inc.'s values.
