Applied Research Scientist, LLM Evaluation & Post-Training
Innodata Inc.

Job Description
Who We Are
Innodata (NASDAQ: INOD) is a leading data engineering company, serving over 2,000 customers with operations in 13 cities globally. We are a premier AI technology solutions provider, partnering with 4 out of 5 of the world’s biggest technology companies, alongside leaders in financial services, insurance, technology, law, and medicine. By integrating advanced machine learning and artificial intelligence (ML/AI) technologies, a global team of subject matter experts, and a high-security infrastructure, we are committed to delivering clean and optimized digital data solutions across all industries. Innodata offers both robust digital data solutions and user-friendly, high-quality platforms.
Our global workforce, comprising over 3,000 employees, spans the United States, Canada, United Kingdom, the Philippines, India, Sri Lanka, Israel, and Germany. We are positioned for significant growth in the coming years.
Position Summary
Innodata is significantly expanding its Generative AI (GenAI) research capabilities to advance state-of-the-art evaluation and post-training methodologies for large language models (LLMs) and multimodal systems. As an Applied Research Scientist, LLM Evaluation & Post-Training, you will lead research and experimentation into how evaluation design, measurement strategies, and feedback signals contribute to model improvement.
This opportunity is ideal for a technically rigorous researcher with deep expertise in modern LLM evaluation and post-training techniques. You will be expected to translate research insights into practical methods for both customer solutions and internal platform innovation. This role involves working across human-in-the-loop and AI-augmented workflows, collaborating with Language Data Scientists and AI/ML Research Engineers to design and validate evaluation frameworks that achieve measurable model gains.
The ideal candidate possesses strong experimental and statistical judgment, combined with hands-on technical proficiency. You should be capable of engaging as a peer with research and engineering stakeholders at leading AI companies.
Who We’re Looking For
You bring 5+ years of relevant experience, including graduate research, in applied ML research, research science, or advanced ML experimentation, with a significant focus on LLM evaluation, benchmarking, alignment, or post-training. You have a proven track record of designing high-quality experiments, interpreting results rigorously, and translating findings into practical improvements.
You are adept at navigating both research and product/customer contexts, capable of identifying critical methodological questions, structuring a comprehensive research agenda, and collaborating with engineers and data experts for execution. You understand that effective evaluation goes beyond simple metrics, encompassing measurement validity, robustness, stress testing, and alignment with real-world usage.
You are excited by frontier challenges, including long-context, cross-modal, and dynamic multi-turn evaluations, and by the prospect of building new benchmark datasets and evaluation frameworks that will become strategic assets for Innodata and its clients.
Your approach to experimentation is implementation-minded, and you are comfortable collaborating closely with engineers to productionize methods and research outputs as appropriate.
Responsibilities
- Define and execute a comprehensive research agenda focused on LLM evaluation and post-training, with a strong emphasis on evaluation-driven model improvement.
- Design rigorous experiments to study the impact of various evaluation methodologies on fine-tuning and post-training outcomes.
- Develop and validate robust evaluation frameworks for LLM and multimodal systems, encompassing benchmark/task design, scoring methods, judge/model-assisted evaluation, human evaluation protocols, and robustness/stress testing.
- Lead research into advanced evaluation domains, including long-context, cross-modal, and dynamic multi-turn evaluations.
- Investigate the effectiveness and limitations of current evaluation techniques, proposing improved methodologies with clear consideration of validity and scalability tradeoffs.
- Analyze model behavior and identify failure patterns, generating actionable recommendations for model improvement and evaluation redesign.
- Collaborate closely with AI/ML Research Engineers to transform research methods into scalable evaluation and post-training pipelines.
- Partner with Language Data Scientists to integrate human-in-the-loop and synthetic data/evaluation strategies into research programs.
- Engage with customer technical stakeholders to understand their evaluation goals, review methodologies, and provide expert recommendations.
- Contribute to the development of internal benchmark datasets, evaluation frameworks, and reusable research assets.
- Produce high-quality technical documentation, internal research reports, and client-facing materials that clearly explain methods, results, assumptions, and limitations.
- Contribute to thought leadership and best practices in LLM evaluation, post-training, and GenAI quality measurement.
Qualifications
- MS/PhD in Computer Science, Machine Learning, Statistics, Applied Mathematics, AI, or a related quantitative scientific field (PhD strongly preferred).
- 5+ years of relevant experience in applied research or research science in ML/AI, with substantial work in LLMs or foundation models.
- Demonstrated experience with LLM evaluation, benchmarking, alignment, post-training, or model quality research.
- Strong foundation in experimental design, statistical analysis, and scientific reasoning for ML systems.
- Proficient coding skills in Python for research experimentation and analysis (e.g., data processing, evaluation pipelines, statistical analysis, visualization).
- Experience working with modern ML tooling/frameworks (e.g., PyTorch, Hugging Face, JAX/TensorFlow as applicable) sufficient to design and execute model/evaluation experiments.
- Ability to evaluate and compare human and automated evaluation methods, understanding tradeoffs in cost, reliability, validity, and scalability.
- Experience designing evaluation studies and protocols that are reproducible across datasets, model versions, and evaluation runs.
- Ability to collaborate directly with technical stakeholders including research scientists, ML engineers, data scientists, and customer technical counterparts.
- Strong communication skills and ability to present nuanced technical conclusions, assumptions, and limitations clearly.
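As an illustration of the "human vs. automated evaluation" tradeoff analysis named above, the sketch below computes Cohen's kappa between human graders and an LLM judge on the same responses. This is a minimal example for context, not part of the posting; the labels and data are hypothetical, and it assumes a simple binary pass/fail grading scheme.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters over the same items."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if each rater labeled independently at their own base rates.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in freq_a)
    return (observed - expected) / (1 - expected)

# Hypothetical data: human grades vs. an LLM judge's grades on the same ten responses.
human = ["pass", "pass", "fail", "pass", "fail", "pass", "fail", "fail", "pass", "pass"]
judge = ["pass", "pass", "fail", "fail", "fail", "pass", "fail", "pass", "pass", "pass"]
kappa = cohens_kappa(human, judge)  # → ~0.583, moderate agreement
```

Raw percent agreement (80% here) overstates reliability when one label dominates; kappa corrects for agreement expected by chance, which is the kind of validity consideration the role calls out.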
Technical Skills
- Evaluation Science & Benchmarking: Experience designing benchmark datasets, test suites, or evaluation frameworks for language or multimodal models. Deep understanding of metric design, scoring reliability, and measurement validity. Experience with human evaluation methods and quality assurance considerations (e.g., rubric design, inter-rater reliability, adjudication frameworks).
- LLM / Post-Training: Understanding of post-training methods and how training objectives interact with evaluation outcomes. Ability to reason about model behavior, failure modes, and tradeoffs across tasks/domains. Familiarity with alignment and robustness considerations in model evaluation.
- Quantitative Analysis: Strong statistical analysis skills (sampling, uncertainty, significance testing where appropriate, error analysis, metric interpretation). Ability to synthesize complex experimental findings into actionable recommendations.
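The "sampling, uncertainty, significance testing" skills above can be illustrated with a paired bootstrap over per-prompt scores, a common way to attach uncertainty to a benchmark gap between two models. This is a standard-technique sketch for context, not a method prescribed by the posting; the model scores are hypothetical 0/1 correctness values.

```python
import random

def paired_bootstrap(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Bootstrap the score gap between two models graded on the same prompts.

    Returns (mean_diff, p_value), where p_value is the fraction of resamples
    in which model A fails to beat model B (a rough one-sided test).
    """
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample prompts with replacement
        diffs.append(sum(scores_a[i] - scores_b[i] for i in idx) / n)
    mean_diff = sum(scores_a[i] - scores_b[i] for i in range(n)) / n
    p_value = sum(d <= 0 for d in diffs) / n_resamples
    return mean_diff, p_value

# Hypothetical per-prompt correctness for two models on the same ten prompts.
model_a = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
model_b = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
mean_diff, p_value = paired_bootstrap(model_a, model_b)
```

Pairing on the same prompts matters: resampling prompts (rather than models independently) preserves per-item correlation, so the uncertainty estimate reflects benchmark size rather than overstating the gap's significance.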
Preferred Skills
- Hands-on experience running or supporting fine-tuning/post-training experiments (SFT, preference optimization, RLHF/RLAIF-style workflows).
- Experience with multimodal evaluation (e.g., text-image, audio, video).
- Experience with long-context benchmarking/evaluation and real-world context management challenges.
- Experience designing multi-turn, interactive, or agentic evaluation protocols.
- Published research and/or open-source benchmark contributions in LLM evaluation, post-training, alignment, or related areas.
- Experience in customer-facing applied research, technical consulting, or cross-functional product/research collaborations.
- Familiarity with safety, trustworthiness, and governance considerations in GenAI evaluation.
How This Role Partners With The Team
This role involves close collaboration with:
- Language Data Scientists: Bringing expertise in language data, human evaluation workflows, multilingual/multimodal process design, and data quality operations.
- AI/ML Research Engineers: Implementing scalable training/evaluation systems and connecting research methods to production-grade pipelines.
- Business and Customer Teams: Relying on Innodata for expert consultation and credible, technically rigorous GenAI solutions.
- Internal R&D and Platform Teams: Transforming research outputs into reusable frameworks, benchmarks, and differentiated offerings.
Key Skills/Competencies
- LLM Evaluation
- Post-Training Methods
- Experimental Design
- Statistical Analysis
- Python Programming
- Machine Learning (ML)
- Artificial Intelligence (AI)
- Benchmark Development
- Multimodal Systems
- Generative AI (GenAI)