Short Term Consultant, Agent Evaluation Specialist
The World Bank Group

Job Description
Background: The World Bank Group's Institute for Economic Development
The World Bank Group's Institute for Economic Development (IED) is at the forefront of building advanced AI tools and agents, spanning critical areas such as knowledge curation, research synthesis, and co-creation with practitioners. As these tools become more deeply integrated into IED's work, ensuring their reliability and accuracy becomes paramount.
The Challenge of AI Agent Evaluation
AI agents often fail in subtle, insidious ways—not through crashes, but by taking wrong turns, missing crucial context, or exhibiting inappropriate confidence. A research advisor that provides subtly misleading guidance is far more detrimental than one that overtly malfunctions. The core need is for a specialist who can rigorously answer, "Is this agent actually good?" beyond mere intuition.
Objective of the Role
The primary objective for the Short Term Consultant, Agent Evaluation Specialist is to establish robust evaluation frameworks and infrastructure. This will enable IED to thoroughly assess and continuously improve the quality and effectiveness of its AI agents, ensuring they meet their intended purposes reliably.
Scope of Work
- Define precise criteria for what constitutes "good" across various agent types, recognizing that a research advisor's success metrics differ from an interview agent's.
- Design targeted evaluation tasks that specifically address real-world failure modes, moving beyond simplistic "happy path" testing.
- Construct efficient evaluation pipelines utilizing diverse grader types, including automated code-based checks, LLM-as-judge methodologies when appropriate, and human review for essential calibration.
- Conduct in-depth reviews of agent transcripts to pinpoint recurring failure patterns and identify actionable opportunities for enhancement.
- Effectively communicate evaluation findings to the team, translating complex results into practical, implementable changes for agent improvement.
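The evaluation pipeline described above can be sketched minimally as follows. This is an illustrative assumption, not part of the role description: the agent, the test cases, and the `code_grader` check are invented stand-ins for whatever systems the consultant would actually evaluate.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    prompt: str      # input given to the agent
    reference: str   # key fact the output should contain

def code_grader(output: str, case: EvalCase) -> float:
    """Automated code-based check: does the output contain the reference fact?"""
    return 1.0 if case.reference.lower() in output.lower() else 0.0

def run_pipeline(agent: Callable[[str], str],
                 cases: list[EvalCase],
                 graders: list[Callable[[str, EvalCase], float]]) -> dict:
    """Run each case through the agent, score with every grader,
    and report the mean score per grader."""
    totals = [0.0] * len(graders)
    for case in cases:
        output = agent(case.prompt)
        for i, grade in enumerate(graders):
            totals[i] += grade(output, case)
    return {f"grader_{i}": t / len(cases) for i, t in enumerate(totals)}

# Toy agent and cases, purely for demonstration
toy_agent = lambda prompt: "The World Bank was founded in 1944."
cases = [EvalCase("When was the World Bank founded?", "1944"),
         EvalCase("Where is it headquartered?", "Washington")]

scores = run_pipeline(toy_agent, cases, [code_grader])
print(scores)  # {'grader_0': 0.5}
```

In practice the grader list would mix automated checks like this with LLM-as-judge scorers and sampled human review, with the human labels used to calibrate the automated ones.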
Deliverables
Deliverables will be tailored to project requirements and may include evaluation frameworks, curated test datasets, grading rubrics, working evaluation pipelines, quality reports, and short technical notes.
Qualifications
Required:
- Proven experience in evaluating ML or LLM systems, with a clear understanding of the distinction between superficial metrics and true system efficacy.
- Ability to design objective rubrics and define success criteria for complex tasks where the concept of "correctness" may not be immediately obvious.
- A deep commitment to obsessively reviewing agent traces, recognizing this as the indispensable method for validating grader effectiveness.
- Existing authorization to work in your current country of residence.
Preferred:
- Practical experience with LLM-as-judge patterns and a nuanced understanding of their inherent limitations.
- A background in user research, psychometrics, or QA engineering, providing a strong foundation for robust evaluation.
- Familiarity with prominent observability platforms such as Langfuse, LangSmith, or similar tools used in AI development.
- At least one relevant publication or research paper demonstrating expertise.
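The LLM-as-judge pattern named above can be illustrated with a minimal sketch. The rubric criteria are invented for illustration, and the judge's reply is simulated; a real pipeline would call a model API at that point.

```python
import json

# Hypothetical rubric; the criteria here are illustrative, not prescribed
RUBRIC = """Score the answer 1-5 on each criterion and reply as JSON:
{"accuracy": int, "grounding": int, "calibration": int}
- accuracy: factual correctness of the claims
- grounding: claims are supported by the provided context
- calibration: stated confidence matches the evidence"""

def build_judge_prompt(question: str, answer: str, context: str) -> str:
    """Assemble the grading prompt sent to the judge model."""
    return (f"{RUBRIC}\n\nContext:\n{context}\n\n"
            f"Question: {question}\nAnswer: {answer}")

def parse_judge_reply(reply: str) -> dict:
    """Parse the judge's JSON reply; fail loudly on malformed output
    so broken grades never silently enter the results."""
    scores = json.loads(reply)
    assert set(scores) == {"accuracy", "grounding", "calibration"}
    return scores

# Simulated judge reply standing in for a model API call
reply = '{"accuracy": 4, "grounding": 3, "calibration": 5}'
prompt = build_judge_prompt("What is IED?", "An institute.", "IED is ...")
print(parse_judge_reply(reply))  # {'accuracy': 4, 'grounding': 3, 'calibration': 5}
```

A known limitation of this pattern, which the role alludes to, is that judge models drift and flatter: their scores need periodic spot-checking against human review of the same transcripts.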
Duration and Schedule
This is a short-term consulting engagement with a maximum level of effort of 20 days. The period of engagement is anticipated from March 1, 2026, to June 30, 2026, with potential for extension. This position is fully remote.
Application Process and Next Steps
Interested applicants must submit their applications exclusively via the provided link: https://survey.wb.surveycto.com/collect/ied_ai_hub_stc_application_2026?caseid= Please do not apply through LinkedIn. Applicants who can demonstrate prior experience in evaluating agentic systems are strongly preferred. Be sure to include in your cover letter a link to a portfolio, GitHub profile, or relevant research paper you are proud of. Shortlisted candidates may be invited to participate in a competency-based assessment.
Key skills/competency
- AI Evaluation
- LLM Systems
- Evaluation Frameworks
- Quality Assurance
- Research Synthesis
- Test Dataset Design
- Rubric Development
- Agent Trace Analysis
- Failure Mode Analysis
- Actionable Insights