Applied Research Scientist, LLM Evaluation & Post-Training
Innodata Inc.

Job Description
Who We Are
Innodata (NASDAQ: INOD) is a leading data engineering company, serving over 2,000 customers with operations in 13 cities globally. We are a premier AI technology solutions provider, partnering with 4 out of 5 of the world’s biggest technology companies, alongside leaders in financial services, insurance, technology, law, and medicine. By integrating advanced machine learning and artificial intelligence (ML/AI) technologies, a global team of subject matter experts, and a high-security infrastructure, we are committed to delivering clean and optimized digital data solutions across all industries. Innodata offers both robust digital data solutions and user-friendly, high-quality platforms.
Our global workforce, comprising over 3,000 employees, spans the United States, Canada, United Kingdom, the Philippines, India, Sri Lanka, Israel, and Germany. We are positioned for significant growth in the coming years.
Position Summary
Innodata is significantly expanding its Generative AI (GenAI) research capabilities to advance state-of-the-art evaluation and post-training methodologies for large language models (LLMs) and multimodal systems. As an Applied Research Scientist, LLM Evaluation & Post-Training, you will lead research and experimentation into how evaluation design, measurement strategies, and feedback signals contribute to model improvement.
This opportunity is ideal for a technically rigorous researcher with deep expertise in modern LLM evaluation and post-training techniques. You will be expected to translate research insights into practical methods for both customer solutions and internal platform innovation. This role involves working across human-in-the-loop and AI-augmented workflows, collaborating with Language Data Scientists and AI/ML Research Engineers to design and validate evaluation frameworks that achieve measurable model gains.
The ideal candidate possesses strong experimental and statistical judgment, combined with hands-on technical proficiency. You should be capable of engaging as a peer with research and engineering stakeholders at leading AI companies.
Who We’re Looking For
You bring 5+ years of relevant experience, including graduate research, in applied ML research, research science, or advanced ML experimentation, with a significant focus on LLM evaluation, benchmarking, alignment, or post-training. You have a proven track record of designing high-quality experiments, interpreting results rigorously, and translating findings into practical improvements.
You are adept at navigating both research and product/customer contexts, capable of identifying critical methodological questions, structuring a comprehensive research agenda, and collaborating with engineers and data experts for execution. You understand that effective evaluation goes beyond simple metrics, encompassing measurement validity, robustness, stress testing, and alignment with real-world usage.
You are excited by frontier challenges, including long-context, cross-modal, and dynamic multi-turn evaluations, and by the prospect of building new benchmark datasets and evaluation frameworks that will become strategic assets for Innodata and its clients.
Your approach to experimentation is implementation-minded, and you are comfortable collaborating closely with engineers to productionize methods and research outputs as appropriate.
Responsibilities
- Define and execute a comprehensive research agenda focused on LLM evaluation and post-training, with a strong emphasis on evaluation-driven model improvement.
- Design rigorous experiments to study the impact of various evaluation methodologies on fine-tuning and post-training outcomes.
- Develop and validate robust evaluation frameworks for LLM and multimodal systems, encompassing benchmark/task design, scoring methods, judge/model-assisted evaluation, human evaluation protocols, and robustness/stress testing.
- Lead research into advanced evaluation domains, including long-context, cross-modal, and dynamic multi-turn evaluations.
- Investigate the effectiveness and limitations of current evaluation techniques, proposing improved methodologies with clear consideration of validity and scalability tradeoffs.
- Analyze model behavior and identify failure patterns, generating actionable recommendations for model improvement and evaluation redesign.
- Collaborate closely with AI/ML Research Engineers to transform research methods into scalable evaluation and post-training pipelines.
- Partner with Language Data Scientists to integrate human-in-the-loop and synthetic data/evaluation strategies into research programs.
- Engage with customer technical stakeholders to understand their evaluation goals, review methodologies, and provide expert recommendations.
- Contribute to the development of internal benchmark datasets, evaluation frameworks, and reusable research assets.
- Produce high-quality technical documentation, internal research reports, and client-facing materials that clearly explain methods, results, assumptions, and limitations.
- Contribute to thought leadership and best practices in LLM evaluation, post-training, and GenAI quality measurement.
Qualifications
- MS/PhD in Computer Science, Machine Learning, Statistics, Applied Mathematics, AI, or a related quantitative scientific field (PhD strongly preferred).
- 5+ years of relevant experience in applied research or research science in ML/AI, with substantial work in LLMs or foundation models.
- Demonstrated experience with LLM evaluation, benchmarking, alignment, post-training, or model quality research.
- Strong foundation in experimental design, statistical analysis, and scientific reasoning for ML systems.
- Proficient coding skills in Python for research experimentation and analysis (e.g., data processing, evaluation pipelines, statistical analysis, visualization).
- Experience working with modern ML tooling/frameworks (e.g., PyTorch, Hugging Face, JAX/TensorFlow as applicable) sufficient to design and execute model/evaluation experiments.
- Ability to evaluate and compare human and automated evaluation methods, understanding tradeoffs in cost, reliability, validity, and scalability.
- Experience designing evaluation studies and protocols that are reproducible across datasets, model versions, and evaluation runs.
- Ability to collaborate directly with technical stakeholders including research scientists, ML engineers, data scientists, and customer technical counterparts.
- Strong communication skills and ability to present nuanced technical conclusions, assumptions, and limitations clearly.
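As an illustration of the "human vs. automated evaluation" tradeoff analysis named above, the sketch below computes Cohen's kappa between human graders and an LLM judge on the same responses. This is a minimal example for context, not part of the posting; the labels and data are hypothetical, and it assumes a simple binary pass/fail grading scheme.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters over the same items."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if each rater labeled independently at their own base rates.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in freq_a)
    return (observed - expected) / (1 - expected)

# Hypothetical data: human grades vs. an LLM judge's grades on the same ten responses.
human = ["pass", "pass", "fail", "pass", "fail", "pass", "fail", "fail", "pass", "pass"]
judge = ["pass", "pass", "fail", "fail", "fail", "pass", "fail", "pass", "pass", "pass"]
kappa = cohens_kappa(human, judge)  # → ~0.583, moderate agreement
```

Raw percent agreement (80% here) overstates reliability when one label dominates; kappa corrects for agreement expected by chance, which is the kind of validity consideration the role calls out.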
Technical Skills
- Evaluation Science & Benchmarking: Experience designing benchmark datasets, test suites, or evaluation frameworks for language or multimodal models. Deep understanding of metric design, scoring reliability, and measurement validity. Experience with human evaluation methods and quality assurance considerations (e.g., rubric design, inter-rater reliability, adjudication frameworks).
- LLM / Post-Training: Understanding of post-training methods and how training objectives interact with evaluation outcomes. Ability to reason about model behavior, failure modes, and tradeoffs across tasks/domains. Familiarity with alignment and robustness considerations in model evaluation.
- Quantitative Analysis: Strong statistical analysis skills (sampling, uncertainty, significance testing where appropriate, error analysis, metric interpretation). Ability to synthesize complex experimental findings into actionable recommendations.
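The "sampling, uncertainty, significance testing" skills above can be illustrated with a paired bootstrap over per-prompt scores, a common way to attach uncertainty to a benchmark gap between two models. This is a standard-technique sketch for context, not a method prescribed by the posting; the model scores are hypothetical 0/1 correctness values.

```python
import random

def paired_bootstrap(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Bootstrap the score gap between two models graded on the same prompts.

    Returns (mean_diff, p_value), where p_value is the fraction of resamples
    in which model A fails to beat model B (a rough one-sided test).
    """
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample prompts with replacement
        diffs.append(sum(scores_a[i] - scores_b[i] for i in idx) / n)
    mean_diff = sum(scores_a[i] - scores_b[i] for i in range(n)) / n
    p_value = sum(d <= 0 for d in diffs) / n_resamples
    return mean_diff, p_value

# Hypothetical per-prompt correctness for two models on the same ten prompts.
model_a = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
model_b = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
mean_diff, p_value = paired_bootstrap(model_a, model_b)
```

Pairing on the same prompts matters: resampling prompts (rather than models independently) preserves per-item correlation, so the uncertainty estimate reflects benchmark size rather than overstating the gap's significance.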
Preferred Skills
- Hands-on experience running or supporting fine-tuning/post-training experiments (SFT, preference optimization, RLHF/RLAIF-style workflows).
- Experience with multimodal evaluation (e.g., text-image, audio, video).
- Experience with long-context benchmarking/evaluation and real-world context management challenges.
- Experience designing multi-turn, interactive, or agentic evaluation protocols.
- Published research and/or open-source benchmark contributions in LLM evaluation, post-training, alignment, or related areas.
- Experience in customer-facing applied research, technical consulting, or cross-functional product/research collaborations.
- Familiarity with safety, trustworthiness, and governance considerations in GenAI evaluation.
How This Role Partners With The Team
This role involves close collaboration with:
- Language Data Scientists: Bringing expertise in language data, human evaluation workflows, multilingual/multimodal process design, and data quality operations.
- AI/ML Research Engineers: Implementing scalable training/evaluation systems and connecting research methods to production-grade pipelines.
- Business and Customer Teams: Relying on Innodata for expert consultation and credible, technically rigorous GenAI solutions.
- Internal R&D and Platform Teams: Transforming research outputs into reusable frameworks, benchmarks, and differentiated offerings.
Key Skills/Competencies
- LLM Evaluation
- Post-Training Methods
- Experimental Design
- Statistical Analysis
- Python Programming
- Machine Learning (ML)
- Artificial Intelligence (AI)
- Benchmark Development
- Multimodal Systems
- Generative AI (GenAI)