AI Evaluation Engineer
FirstIgnite
Hybrid
Full Time
$140,000
Job Overview
Job Title: AI Evaluation Engineer
Job Type: Full Time
Offered Salary: $140,000
Location: Hybrid

Job Description
About FirstIgnite
FirstIgnite is the AI-powered business development platform for university technology transfer offices (TTOs). We help research institutions turn breakthroughs into partnerships, licenses, and companies by combining deep LLM-driven workflows with the relationships that actually move deals forward. Our product suite spans expert discovery, grants search, and AI-driven outreach, all built on a modern, agentic stack. We ship fast, we measure everything, and we believe evaluations are the difference between AI features that demo well and AI features that work in production.
The Role
We're hiring an AI Evaluation Engineer to own the quality bar for every LLM-powered feature we ship. You'll design, build, and scale the infrastructure that tells us, with evidence, whether a prompt change, model swap, or agent refactor made things better or worse.
This is a high-leverage role. Every customer-facing AI capability at FirstIgnite flows through your evals. You'll work directly with the Head of Engineering and partner closely with product, applied AI, and the full-stack team to establish evaluation as a first-class discipline across the company.
What You'll Do
- Build evaluation infrastructure: Design and maintain eval suites using Promptfoo, LLM-as-judge methodologies, and custom harnesses for features like our expert search system, natural language grants search, and AI SDR agents.
- Define what 'good' means: Partner with product and domain experts to translate fuzzy customer outcomes ("does this surface the right principal investigator?") into precise, measurable rubrics.
- Own the feedback loop: Instrument production traffic, curate golden datasets from real customer interactions, and build pipelines that turn user behavior into regression tests.
- Ship quickly under uncertainty: We routinely run 48-hour eval sprints for greenfield features with no production traffic. You'll be comfortable bootstrapping quality signal from scratch.
- Model and prompt evaluation: Run rigorous A/B comparisons across models (OpenAI, Anthropic, open-weight), prompt strategies, and agent architectures. Quantify tradeoffs between cost, latency, and quality.
- Agent evaluation: Help us measure multi-step agent behavior built on the OpenAI Agents SDK, Vercel AI SDK, and Temporal Cloud — including tool-use correctness, trajectory quality, and end-to-end task completion.
- Raise the floor for the team: Create templates, documentation, and tooling so every engineer can write and run evals as part of normal development. Evals should feel as natural as unit tests.
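The golden-dataset feedback loop described above can be sketched as a simple regression gate that runs like a unit test. Everything here is illustrative: `toy_search` stands in for a real expert-search system, and the cases and the 0.8 pass-rate threshold are invented, not FirstIgnite's.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class GoldenCase:
    query: str            # a real customer query, curated from production
    expected_expert: str  # the principal investigator a reviewer marked correct

def run_eval(system: Callable[[str], list[str]], cases: list[GoldenCase],
             pass_threshold: float = 0.8) -> bool:
    """Pass if the system surfaces the expected expert in its top-5
    results for at least `pass_threshold` of the golden cases."""
    hits = sum(1 for c in cases if c.expected_expert in system(c.query)[:5])
    pass_rate = hits / len(cases)
    print(f"pass rate: {pass_rate:.0%} ({hits}/{len(cases)})")
    return pass_rate >= pass_threshold

# Stub system for illustration: always returns the same ranked list.
def toy_search(query: str) -> list[str]:
    return ["Dr. Alice Chen", "Dr. Bob Kumar", "Dr. Carol Diaz"]

cases = [
    GoldenCase("battery cathode materials", "Dr. Alice Chen"),
    GoldenCase("CRISPR delivery vectors", "Dr. Bob Kumar"),
    GoldenCase("perovskite solar cells", "Dr. Zoe Park"),  # a known miss
]
run_eval(toy_search, cases)
```

Wired into CI, a gate like this turns curated production failures into tests that block a prompt or model change from regressing silently.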
Requirements
- 3+ years of engineering experience, with at least 1 year focused on LLM/ML evaluation, applied AI, or data quality systems.
- Hands-on experience with LLM evaluation frameworks — Promptfoo, Braintrust, LangSmith, OpenAI Evals, DeepEval, or equivalent in-house tooling.
- Strong grasp of LLM-as-judge methodology, including its failure modes (position bias, verbosity bias, judge-model drift) and how to mitigate them.
- Statistical literacy — you know the difference between a real regression and noise, and you can design experiments that answer the question actually being asked.
- Product instincts. You can sit with a customer success call transcript, identify the three failure modes that matter, and ship an eval for each by end of week.
- Strong written communication. Evals are useless if the engineers shipping features don't trust or read the results.
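One common way to answer the "real regression or noise" question on paired per-case eval scores is a sign-flip permutation test; a minimal sketch follows, with invented scores and an arbitrary choice of 10,000 permutations.

```python
import random

def permutation_test(before: list[float], after: list[float],
                     n_perm: int = 10_000, seed: int = 0) -> float:
    """Two-sided p-value for the paired mean score difference, estimated
    by randomly flipping the sign of each per-case delta."""
    rng = random.Random(seed)
    deltas = [a - b for a, b in zip(after, before)]
    observed = abs(sum(deltas) / len(deltas))
    hits = 0
    for _ in range(n_perm):
        flipped = [d if rng.random() < 0.5 else -d for d in deltas]
        if abs(sum(flipped) / len(flipped)) >= observed:
            hits += 1
    return hits / n_perm

# Per-case judge scores before and after a prompt change (made up).
before = [0.82, 0.75, 0.90, 0.68, 0.77, 0.85, 0.71, 0.80]
after  = [0.80, 0.74, 0.89, 0.70, 0.76, 0.84, 0.69, 0.81]
p = permutation_test(before, after)
print(f"p = {p:.3f}")
```

A large p-value here says the apparent drop is consistent with noise at this sample size; the real lever is usually a bigger golden set, not a fancier test.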
Preferred Qualifications
- Experience evaluating retrieval systems (RAG, hybrid search, reranking) — especially over structured or semi-structured domains like research, grants, or patents.
- Exposure to agent orchestration frameworks (Temporal, LangGraph, OpenAI Agents SDK) and the specific challenges of evaluating multi-step, tool-using systems.
- Background in information retrieval, search relevance, or a research-adjacent domain.
- Experience building internal tooling or dashboards that non-engineers (PMs, domain experts) actually use to label and review model outputs.
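For the retrieval-evaluation experience mentioned above, two standard offline metrics are recall@k and mean reciprocal rank (MRR). A minimal sketch, with made-up grant IDs and relevance labels:

```python
def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the labeled-relevant items that appear in the top-k results."""
    return len(set(ranked[:k]) & relevant) / len(relevant)

def mrr(ranked_lists: list[list[str]], relevant_sets: list[set[str]]) -> float:
    """Mean reciprocal rank of the first relevant hit per query (0 if no hit)."""
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc in enumerate(ranked, start=1):
            if doc in relevant:
                total += 1 / rank
                break
    return total / len(ranked_lists)

# One query's ranked results and its labeled-relevant set (illustrative).
ranked = ["grant-17", "grant-03", "grant-99", "grant-42"]
relevant = {"grant-03", "grant-42"}
print(recall_at_k(ranked, relevant, k=2))  # 0.5: one of two relevant in top 2
print(mrr([ranked], [relevant]))           # 0.5: first relevant hit at rank 2
```

Recall@k measures coverage of the top of the list; MRR rewards putting the first right answer high, which often matters more for "does this surface the right PI" style queries.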
Why This Role
- You'll be the first dedicated evals hire. The scope, standards, and tooling are yours to define.
- AI quality is existential for our product. This isn't a compliance role tucked into a corner — it's directly on the critical path to revenue.
- Small, senior team. ~10 engineers, distributed globally, with a strong bias toward shipping and measuring.
- Direct access to real-world, high-stakes LLM use cases — research discovery, grants, outbound — across a customer base that deeply values accuracy.
Application Instructions
As part of your LinkedIn application submission, please use Loom to record a video and email it to us within the next 48 hours. The recording should include both your camera (showing yourself) and your screen as you walk us through a project or piece of code you're proud of.
Please email the video to careers@firstignite.com with the title of the position you are applying for in the subject line. We look forward to your submission!
How to Get Hired at FirstIgnite
- Tailor your resume: Highlight your LLM/ML evaluation, applied AI, or data quality systems experience, quantifying achievements.
- Showcase evaluation skills: Detail hands-on experience with frameworks like Promptfoo, Braintrust, or LangSmith in your application.
- Demonstrate product instinct: Prepare examples of translating user needs into measurable evaluation rubrics during interviews.
- Prepare your Loom video: Record a video showcasing a proud project or code, demonstrating your technical and communication skills.
- Follow application instructions: Email your Loom video to careers@firstignite.com with the job title in the subject line.