Question 1

What is the primary focus of the Staff Research Engineer, Pre-training Data role at Reddit, Inc.?

Accepted Answer

This role is centered on defining the technical strategy and architecture for the data curriculum pipelines that power Reddit's next-generation, Reddit-native foundational Large Language Models (LLMs). It involves transforming petabytes of diverse data into high-quality training signals.

Question 2

What core technical skills are essential for this Staff Research Engineer position at Reddit, Inc.?

Accepted Answer

Candidates need expert proficiency in Python, distributed data processing frameworks (like Ray Data or Spark), and substantial experience with unstructured and semi-structured data (text, images, video). A strong mathematical background in probability, statistics, and importance sampling theory is also critical.

Question 3

Does Reddit, Inc. offer remote work for the Staff Research Engineer, Pre-training Data role?

Accepted Answer

Yes, this position is completely remote-friendly within the United States. While there are physical offices in San Francisco, Los Angeles, New York City, and Chicago, coming into the office is optional.

Question 4

What is the expected salary range for a Staff Research Engineer, Pre-training Data at Reddit, Inc.?

Accepted Answer

The base salary range for this US-based position is $230,000 to $322,000 USD. Depending on the specific offer, compensation also includes equity in the form of restricted stock units and potential commissions.

Question 5

How does Reddit, Inc.'s AI Engineering team leverage LLMs?

Accepted Answer

The AI Engineering team at Reddit is building its own foundational LLMs, designed to understand the unique culture, language, and structure of Reddit communities. These models will power critical functions such as Safety & Moderation, Search, Ads, and future user products.

Question 6

What kind of data challenges will a Staff Research Engineer, Pre-training Data face at Reddit, Inc.?

Accepted Answer

You will work with petabytes of Reddit's unique corpus, including complex conversational trees (threads, subreddits, cross-posts), text, images, and video. The challenge is to process this multimodal, tree-structured data to preserve semantic relationships for optimal model understanding and training.
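As a rough illustration of what "preserving semantic relationships" in tree-structured data can mean, the sketch below flattens a small comment tree into text while keeping reply depth visible through indentation. The `Comment` class, the `serialize_thread` function, and the bracketed-author format are all hypothetical names chosen for this example; nothing here reflects Reddit's actual data model or pipeline.

```python
from dataclasses import dataclass, field


@dataclass
class Comment:
    """Hypothetical comment node: an author, a body, and nested replies."""
    author: str
    body: str
    replies: list["Comment"] = field(default_factory=list)


def serialize_thread(comment, depth=0):
    """Depth-first walk that returns one indented line per comment."""
    lines = [f"{'  ' * depth}[{comment.author}] {comment.body}"]
    for reply in comment.replies:
        # Each reply is indented one level deeper than its parent.
        lines.extend(serialize_thread(reply, depth + 1))
    return lines


# Illustrative thread (made-up content).
thread = Comment("op", "What is importance sampling?", [
    Comment("alice", "A way to reweight samples.", [Comment("op", "Thanks!")]),
    Comment("bob", "See any statistics text."),
])
serialized = "\n".join(serialize_thread(thread))
```

Indentation is only one possible encoding of the parent-child relationship; explicit parent identifiers or special separator tokens are alternatives a real pipeline might prefer.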

Question 7

What is Reddit, Inc.'s approach to data safety in LLM pre-training for this role?

Accepted Answer

A key responsibility is designing a "Safety-First" ingestion layer, which includes building automated pipelines for PII redaction, incorporating toxicity signals, and performing quality deduplication upstream of training, in close collaboration with Safety and Moderation Engineering.
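To make the ingestion steps above more concrete, here is a minimal sketch of two of them: redacting one kind of PII (email addresses, via a simple regular expression) and dropping exact duplicates by hashing normalized text. The function names, the `[EMAIL]` placeholder, and the normalization rule are assumptions made for illustration, not a description of Reddit's pipeline. Production systems typically cover many more PII types and use near-duplicate detection rather than exact hashing.

```python
import hashlib
import re

# Simplified email pattern, for illustration only; real PII detection is broader.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_pii(text):
    """Replace email addresses with a placeholder token."""
    return EMAIL_RE.sub("[EMAIL]", text)


def dedup_key(text):
    """Hash of lowercased, whitespace-collapsed text (exact-match dedup only)."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def clean_corpus(docs):
    """Redact each document, then keep only the first copy of each duplicate."""
    seen = set()
    kept = []
    for doc in docs:
        redacted = redact_pii(doc)
        key = dedup_key(redacted)
        if key not in seen:
            seen.add(key)
            kept.append(redacted)
    return kept


# Made-up sample documents.
docs = [
    "Mail me at jane@example.com",
    "mail me at  JANE@example.com",
    "Unrelated post",
]
cleaned = clean_corpus(docs)
```

Redacting before hashing means two posts that differ only in the redacted detail collapse into one entry, which may or may not be desirable depending on the dataset.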

Question 8

Is prior experience with Graph data structures beneficial for this Reddit, Inc. role?

Accepted Answer

Yes, experience working with Graph data structures or serializing conversation trees is highly valued for this role, as it directly relates to processing the complex, interconnected nature of Reddit's community interactions.

Question 9

What are some 'nice to have' qualifications for the Staff Research Engineer, Pre-training Data position at Reddit, Inc.?

Accepted Answer

Beneficial qualifications include experience with JAX or PyTorch internals for distributed data loading, multimodal datasets, Rust or C++ for performance optimization, and published research or practical experience in active learning or automated data selection.

Staff Research Engineer, Pre-training Data

Reddit, Inc.
