Senior LLM Evaluation Engineer

Braintrust


Job Overview

Job Title: Senior LLM Evaluation Engineer
Job Type: Contractor
Category: Commerce
Experience: 5 Years
Degree: Master
Offered Salary: €120,000
Location: Hybrid

Job Description

Introduction

This is a contracting engagement, initially six months, with potential for a long-term extension. We are building and evaluating state-of-the-art large language models (LLMs) and are looking for experienced software engineers to join our evaluation and annotation team. The role sits at the intersection of real-world software engineering, model evaluation, and applied AI, and is critical to improving model reliability, reasoning, and code quality.

You will design challenging coding tasks, evaluate model outputs against rigorous benchmarks, identify failure modes, and contribute to reinforcement learning and model improvement workflows. This is not a junior annotation role: we are looking for practitioners with deep hands-on coding experience who can think like both an engineer and an evaluator.

What You’ll Do

  • Create high-quality coding prompts and reference answers (benchmark-style, e.g. SWE-Bench-like problems).
  • Evaluate LLM outputs for code generation, refactoring, debugging, and implementation tasks.
  • Identify and document model failures, edge cases, and reasoning gaps.
  • Perform head-to-head evaluations between private LLMs (Mistral-based) and leading external models.
  • Build or configure coding environments to support evaluation and reinforcement learning (RL).
  • Follow detailed annotation and evaluation guidelines with high consistency.
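To make the evaluation work above concrete, here is a minimal, purely illustrative sketch of a head-to-head code-evaluation harness: model-generated solutions are executed against reference assertions and scored by pass rate. All names (`Candidate`, `run_tests`, the toy `add` task) are hypothetical, not part of Braintrust's actual tooling.

```python
# Illustrative sketch of benchmark-style evaluation: run each model's
# candidate code against reference test assertions and report a pass rate.
from dataclasses import dataclass

@dataclass
class Candidate:
    model: str   # e.g. a private Mistral-based model vs. an external one
    source: str  # the code the model produced

def run_tests(source: str, tests: list[str]) -> float:
    """Execute candidate code, then each test snippet; return the pass rate."""
    namespace: dict = {}
    try:
        exec(source, namespace)        # load the candidate's definitions
    except Exception:
        return 0.0                     # code that doesn't even run scores zero
    passed = 0
    for test in tests:
        try:
            exec(test, namespace)      # each test is a bare assertion
            passed += 1
        except Exception:
            pass                       # failed assertion or runtime error
    return passed / len(tests)

# Head-to-head comparison on a toy task: "implement add(a, b)".
tests = ["assert add(2, 3) == 5", "assert add(2, -3) == -1"]
a = Candidate("model-a", "def add(a, b):\n    return a + b")
b = Candidate("model-b", "def add(a, b):\n    return a + abs(b)")  # buggy output
scores = {c.model: run_tests(c.source, tests) for c in (a, b)}
print(scores)  # model-a passes both tests; model-b fails the second
```

In practice the same pattern scales up: sandboxed execution instead of raw `exec`, SWE-Bench-style repository tasks instead of toy functions, and structured failure annotations alongside the numeric score.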

What We’re Looking For

  • 5+ years of professional software development experience.
  • Strong Python skills (required).
  • Knowledge of at least one additional programming language (bonus).
  • 1+ year of coding annotation and/or LLM evaluation experience (part-time OK) for a major frontier AI lab or AI infrastructure company.
  • Prior code reviewer experience is a plus.
  • Proven ability to apply structured evaluation criteria and write clear technical feedback.
  • Fluent in English (written and spoken).
  • Team lead or mentoring experience is a strong plus.

Why This Role

  • Work hands-on with cutting-edge LLMs.
  • Apply real-world engineering judgment to model evaluation and improvement.
  • High-impact, technical work with a focused, senior team.

Key Skills and Competencies

  • LLM Evaluation
  • Coding Annotation
  • Software Engineering
  • Python Programming
  • Model Improvement
  • Debugging
  • Code Quality
  • Benchmark Development
  • Reinforcement Learning
  • Technical Feedback

Tags:

LLM Evaluation Engineer, LLM evaluation, coding, annotation, software engineering, model improvement, debugging, code generation, prompt design, benchmark development, technical feedback, Python, LLM, Mistral, AI models, machine learning, coding environments, Git, evaluation tools, deep learning, software development

How to Get Hired at Braintrust

  • Research Braintrust's culture: Study their mission, values, recent news, and employee testimonials on LinkedIn and Glassdoor.
  • Tailor your resume strategically: Highlight your LLM evaluation, Python, and software development experience for Braintrust.
  • Showcase your technical depth: Provide concrete examples of code review, debugging, and identifying complex model failures.
  • Prepare for in-depth technical interviews: Be ready to discuss your experience with LLM outputs, coding tasks, and evaluation methodologies.
  • Demonstrate strong communication: Practice articulating clear technical feedback and structured evaluation criteria effectively.
