7 days ago

Short Term Consultant, Agent Evaluation Specialist

The World Bank Group

Hybrid
Temporary
$250,000

Job Overview

Job Title: Short Term Consultant, Agent Evaluation Specialist
Job Type: Temporary
Category: Commerce
Experience: 5 Years
Degree: Master
Offered Salary: $250,000
Location: Hybrid

Job Description

Background: The World Bank Group's Institute for Economic Development

The World Bank Group's Institute for Economic Development (IED) is at the forefront of building advanced AI tools and agents. These span critical areas such as knowledge curation, research synthesis, and co-creation with practitioners. As these tools are integrated more widely, ensuring their reliability and accuracy becomes paramount.

The Challenge of AI Agent Evaluation

AI agents often fail in subtle, insidious ways: not by crashing, but by taking wrong turns, missing crucial context, or expressing unwarranted confidence. A research advisor that gives subtly misleading guidance does far more harm than one that overtly malfunctions. The core need is for a specialist who can rigorously answer the question "Is this agent actually good?" rather than relying on intuition.

Objective of the Role

The primary objective for the Short Term Consultant, Agent Evaluation Specialist is to establish robust evaluation frameworks and infrastructure. This will enable IED to thoroughly assess and continuously improve the quality and effectiveness of its AI agents, ensuring they meet their intended purposes reliably.

Scope of Work

  • Define precise criteria for what constitutes "good" across various agent types, recognizing that a research advisor's success metrics differ from an interview agent's.
  • Design targeted evaluation tasks that specifically address real-world failure modes, moving beyond simplistic "happy path" testing.
  • Construct efficient evaluation pipelines utilizing diverse grader types, including automated code-based checks, LLM-as-judge methodologies when appropriate, and human review for essential calibration.
  • Conduct in-depth reviews of agent transcripts to pinpoint recurring failure patterns and identify actionable opportunities for enhancement.
  • Effectively communicate evaluation findings to the team, translating complex results into practical, implementable changes for agent improvement.
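As a rough illustration of the pipeline idea in the scope above, the following Python sketch combines a code-based check, a stand-in for an LLM-as-judge grader, and a comparison against human labels for calibration. It is not taken from IED's tooling: every name here (`Transcript`, `cites_a_source`, `stub_judge`, the `[source:` marker, and the sample data) is a hypothetical chosen for the example, and the judge is a plain function rather than a real model call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Transcript:
    """One agent run to be graded (hypothetical shape)."""
    task_id: str
    output: str


Grader = Callable[[Transcript], bool]


def cites_a_source(t: Transcript) -> bool:
    """Code-based check: pass if the output contains a citation marker.
    The '[source:' convention is an assumption made for this sketch."""
    return "[source:" in t.output


def stub_judge(t: Transcript) -> bool:
    """Stand-in for an LLM-as-judge grader. A real judge would call a model;
    this stub simply fails outputs containing the word 'definitely'."""
    return "definitely" not in t.output.lower()


def run_graders(
    transcripts: List[Transcript], graders: Dict[str, Grader]
) -> Dict[str, Dict[str, bool]]:
    """Apply every grader to every transcript, keyed by task id."""
    return {
        t.task_id: {name: grade(t) for name, grade in graders.items()}
        for t in transcripts
    }


def agreement_with_humans(
    results: Dict[str, Dict[str, bool]],
    grader_name: str,
    human_labels: Dict[str, bool],
) -> float:
    """Fraction of human-labelled tasks where the grader matches the human."""
    shared = [task_id for task_id in human_labels if task_id in results]
    if not shared:
        raise ValueError("no overlap between graded tasks and human labels")
    matches = sum(
        results[task_id][grader_name] == human_labels[task_id]
        for task_id in shared
    )
    return matches / len(shared)


# Made-up sample data, purely illustrative.
transcripts = [
    Transcript("t1", "Growth rose 3% [source: report A]."),
    Transcript("t2", "This policy will definitely work."),
    Transcript("t3", "Evidence is mixed [source: study B], but it definitely helps."),
]
graders = {"cites_a_source": cites_a_source, "judge": stub_judge}
results = run_graders(transcripts, graders)

# Human review used to calibrate the judge, not to replace it.
human_labels = {"t1": True, "t2": False, "t3": True}
agreement = agreement_with_humans(results, "judge", human_labels)
```

In this sketch the human reviewer accepts `t3` while the stub judge rejects it, so the agreement rate falls below 1.0. A disagreement like that is the kind of signal that would prompt a closer read of the transcript and a revision of the judge's rubric.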

Deliverables

Deliverables will be flexible and tailored to project requirements, potentially encompassing comprehensive evaluation frameworks, meticulously curated test datasets, detailed grading rubrics, functional evaluation pipelines, insightful quality reports, and concise technical notes.

Qualifications

Required:

  • Proven experience in evaluating ML or LLM systems, with a clear understanding of the distinction between superficial metrics and true system efficacy.
  • Ability to design objective rubrics and define success criteria for complex tasks where the concept of "correctness" may not be immediately obvious.
  • A deep commitment to obsessively reviewing agent traces, recognizing this as the indispensable method for validating grader effectiveness.
  • Existing authorization to work in your current country of residence.

Preferred:

  • Practical experience with LLM-as-judge patterns and a nuanced understanding of their inherent limitations.
  • A background in user research, psychometrics, or QA engineering, providing a strong foundation for robust evaluation.
  • Familiarity with prominent observability platforms such as Langfuse, LangSmith, or similar tools used in AI development.
  • At least one relevant publication or research paper demonstrating expertise.

Duration and Schedule

This is a short-term consulting engagement with a maximum level of effort of 20 days. The period of engagement is anticipated from March 1, 2026, to June 30, 2026, with potential for extension. This position is fully remote.

Application Process and Next Steps

Interested applicants must apply exclusively through the link below. Please do not apply through LinkedIn.

https://survey.wb.surveycto.com/collect/ied_ai_hub_stc_application_2026?caseid=

Applicants who can demonstrate prior experience in evaluating agentic systems are highly preferred. In your cover letter, be sure to include a link to your portfolio, GitHub profile, or relevant research papers you are proud of. Shortlisted candidates may be invited to participate in a competency-based assessment.

Key skills/competencies

  • AI Evaluation
  • LLM Systems
  • Evaluation Frameworks
  • Quality Assurance
  • Research Synthesis
  • Test Dataset Design
  • Rubric Development
  • Agent Trace Analysis
  • Failure Mode Analysis
  • Actionable Insights

Tags:

Agent Evaluation Specialist
AI evaluation
LLM evaluation
evaluation frameworks
quality assurance
research synthesis
test design
rubric development
agent trace analysis
failure analysis
actionable insights
AI
Machine Learning
LLM
Natural Language Processing
Langfuse
LangSmith
observability platforms
Python
data analysis
deep learning

How to Get Hired at The World Bank Group

  • Research The World Bank Group's mission: Study their commitment to global development, values, and the specific work of the Institute for Economic Development to align your application.
  • Tailor your resume and cover letter: Emphasize your direct experience in AI/LLM system evaluation, rubric design, and agent trace analysis, highlighting quantifiable achievements relevant to The World Bank Group's AI initiatives.
  • Showcase your portfolio: Provide clear links to your GitHub, research papers, or a portfolio demonstrating practical experience with agentic system evaluation and robust methodological approaches.
  • Highlight analytical rigor: Prepare to discuss how you define 'good' for AI agents and translate complex evaluation results into actionable, impactful improvements.
  • Prepare for competency-based assessment: Focus on demonstrating your problem-solving abilities, analytical thinking, and communication skills, particularly in explaining complex technical concepts clearly.
