7 days ago

Senior Software Engineer, LLM Evaluation

Keystone Recruitment

Hybrid
Contractor
$250,000

Job Overview

Job Title: Senior Software Engineer, LLM Evaluation
Job Type: Contractor
Category: Commerce
Experience: 5 Years
Degree: Master's
Offered Salary: $250,000
Location: Hybrid

Job Description

Senior Software Engineer, LLM Evaluation at Keystone Recruitment

One of our global AI research clients is seeking a Senior Software Engineer specializing in LLM evaluation. This remote contract opportunity focuses on improving the performance and reliability of large language models in real-world software engineering contexts. You will develop evaluation and benchmarking datasets that assess AI-generated code and help strengthen model reliability across production-grade engineering workflows.

Role Overview

In this role, you will build high-quality datasets for training and benchmarking large language models. You will work closely with researchers to curate code examples, provide precise technical solutions, and refine AI-generated outputs across multiple programming languages. The role combines hands-on software engineering with structured AI evaluation and research collaboration.

Key Responsibilities

  • Curate and develop realistic software engineering tasks across languages such as Python, JavaScript (including React), C/C++, Java, Rust, and Go.
  • Review, evaluate, and refine AI-generated code for efficiency, scalability, correctness, and maintainability.
  • Collaborate with cross-functional research teams to improve AI-driven coding solutions, measured against industry performance benchmarks.
  • Design verification mechanisms to automatically validate software engineering solutions.
  • Analyze stages of the software development lifecycle (architecture design, API design, prototyping, production deployment, monitoring, and maintenance) and evaluate model performance across these stages.
  • Build internal tools or agents to detect code quality issues and error patterns.

Requirements

  • Several years of professional software engineering experience.
  • At least 2 years of continuous full-time experience at a product-focused technology company.
  • Strong expertise in building and deploying scalable, production-grade applications.
  • Deep understanding of software architecture, debugging, performance optimization, and code review standards.
  • Experience working with modern development workflows and tooling.
  • Strong written and verbal communication skills for documenting structured evaluation feedback.

Engagement Details

  • Flexible engagement: minimum 10 hours per week, up to 40 hours per week.
  • Partial overlap with Pacific Time required.
  • Contractor engagement (no medical or paid leave benefits).
  • Initial duration: 1 month, with potential extension based on performance and project needs.

Key Skills/Competencies

  • Large Language Models (LLM)
  • AI Evaluation
  • Software Engineering
  • Code Review
  • Python
  • JavaScript/React
  • C/C++/Java/Rust/Go
  • Software Architecture
  • Debugging
  • Performance Optimization

Tags:

Senior Software Engineer LLM Evaluation
LLM evaluation
AI model assessment
code review
dataset curation
software architecture
performance optimization
debugging
technical solutions
verification mechanisms
error detection
Python
JavaScript
React
C++
Java
Rust
Go
AI/ML platforms
development workflows
tooling

How to Get Hired at Keystone Recruitment

  • Research Keystone Recruitment's clients: Understand the types of AI research and technology companies Keystone Recruitment partners with.
  • Tailor your resume for LLM evaluation: Highlight experience in AI, machine learning, and specifically Large Language Model evaluation or development.
  • Showcase diverse language proficiency: Emphasize expertise in Python, JavaScript, C/C++, Java, Rust, and Go within your portfolio.
  • Prepare for technical depth: Practice discussing software architecture, performance optimization, and code review best practices for AI-driven systems.
  • Demonstrate strong communication: Be ready to articulate structured evaluation feedback and collaborative experiences with research teams.