Senior LLM Evaluation Engineer
Braintrust

Job Description
Introduction
This is a contracting engagement – initially 6 months – with potential for a long-term engagement. We are building and evaluating state-of-the-art large language models (LLMs) and are looking for experienced software engineers to join our evaluation and annotation team. This role sits at the intersection of real-world software engineering, model evaluation, and applied AI, and is critical to improving model reliability, reasoning, and code quality.

You will design challenging coding tasks, evaluate model outputs against rigorous benchmarks, identify failure modes, and contribute to reinforcement learning and model improvement workflows. This is not a junior annotation role: we are looking for practitioners with deep hands-on coding experience who can think like both an engineer and an evaluator.
What You’ll Do
- Create high-quality coding prompts and reference answers (benchmark-style, e.g., SWE-bench-like problems).
- Evaluate LLM outputs for code generation, refactoring, debugging, and implementation tasks.
- Identify and document model failures, edge cases, and reasoning gaps.
- Perform head-to-head evaluations between private LLMs (Mistral-based) and leading external models.
- Build or configure coding environments to support evaluation and reinforcement learning (RL).
- Follow detailed annotation and evaluation guidelines with high consistency.
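To make the responsibilities above concrete, here is a minimal sketch of what benchmark-style evaluation of model-generated code can look like in practice. All names here (`Task`, `grade_solution`, the sample `clamp` task) are illustrative examples, not part of Braintrust's actual tooling:

```python
# Minimal sketch of a benchmark-style grading harness for LLM code outputs.
# A task bundles a prompt, an expected entry-point function, and reference tests;
# the grader executes the model's code and records pass/fail plus failure notes.
from dataclasses import dataclass, field


@dataclass
class Task:
    prompt: str                                 # what the model was asked to do
    entry_point: str                            # function the solution must define
    tests: list = field(default_factory=list)   # (args, expected) pairs


def grade_solution(task: Task, solution_code: str) -> dict:
    """Run a model-generated solution against reference tests and report results."""
    namespace: dict = {}
    try:
        exec(solution_code, namespace)          # execute the candidate code
    except Exception as e:
        return {"passed": 0, "total": len(task.tests),
                "failures": [f"did not execute: {e!r}"]}
    fn = namespace.get(task.entry_point)
    if not callable(fn):
        return {"passed": 0, "total": len(task.tests),
                "failures": [f"missing entry point {task.entry_point!r}"]}
    passed, failures = 0, []
    for args, expected in task.tests:
        try:
            got = fn(*args)
            if got == expected:
                passed += 1
            else:
                failures.append(
                    f"{task.entry_point}{args} -> {got!r}, expected {expected!r}")
        except Exception as e:
            failures.append(f"{task.entry_point}{args} raised {e!r}")
    return {"passed": passed, "total": len(task.tests), "failures": failures}


# Example: a trivial task graded against a model's output.
task = Task(
    prompt="Write clamp(x, lo, hi) that returns x limited to the range [lo, hi].",
    entry_point="clamp",
    tests=[((5, 0, 10), 5), ((-3, 0, 10), 0), ((42, 0, 10), 10)],
)
model_output = "def clamp(x, lo, hi):\n    return max(lo, min(x, hi))\n"
report = grade_solution(task, model_output)
print(f'{report["passed"]}/{report["total"]} tests passed')  # 3/3 tests passed
```

Real evaluation pipelines add sandboxing, timeouts, and richer rubrics (style, reasoning quality, head-to-head preference judgments), but the core loop is the same: a reference task, structured criteria, and a reproducible verdict with documented failures.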
What We’re Looking For
- 5+ years of professional software development experience.
- Strong Python skills (required).
- Knowledge of at least one additional programming language (bonus).
- 1+ year of coding annotation and/or LLM evaluation experience (part-time OK) for a major frontier AI lab or AI infrastructure company.
- Prior code reviewer experience is a plus.
- Proven ability to apply structured evaluation criteria and write clear technical feedback.
- Fluent in English (written and spoken).
- Team lead or mentoring experience is a strong plus.
Why This Role
- Work hands-on with cutting-edge LLMs.
- Apply real-world engineering judgment to model evaluation and improvement.
- High-impact, technical work with a focused, senior team.
Key Skills/Competencies
- LLM Evaluation
- Coding Annotation
- Software Engineering
- Python Programming
- Model Improvement
- Debugging
- Code Quality
- Benchmark Development
- Reinforcement Learning
- Technical Feedback
How to Get Hired at Braintrust
- Research Braintrust's culture: Study their mission, values, recent news, and employee testimonials on LinkedIn and Glassdoor.
- Tailor your resume strategically: Highlight your LLM evaluation, Python, and software development experience for Braintrust.
- Showcase your technical depth: Provide concrete examples of code review, debugging, and identifying complex model failures.
- Prepare for in-depth technical interviews: Be ready to discuss your experience with LLM outputs, coding tasks, and evaluation methodologies.
- Demonstrate strong communication: Practice articulating clear technical feedback and structured evaluation criteria effectively.