Senior Software Engineer, LLM Evaluation
Nexus Consulting
Job Overview

Job Description
As a Senior Software Engineer specializing in LLM Evaluation, you will join one of Nexus Consulting's global AI research clients. This critical role involves developing and refining advanced evaluation and benchmarking datasets to enhance the real-world performance of large language models in software engineering scenarios. You will specifically focus on assessing AI-generated code and strengthening model reliability across various production-grade engineering workflows.
This is an hourly contract position, offered on a remote basis, with flexible engagement options from a minimum of 10 hours up to 40 hours per week. A partial overlap with Pacific Time is required for collaboration.
Role Overview
In this role, you will build high-quality datasets used for both training and benchmarking large language models. You will work closely with research teams to curate relevant code examples, develop precise technical solutions, and refine AI-generated outputs across a diverse set of programming languages. The position combines hands-on software engineering with structured AI model evaluation and collaborative research.
Key Responsibilities
- Curate and develop realistic software engineering tasks across multiple languages, including Python, JavaScript (and React), C/C++, Java, Rust, and Go.
- Review, evaluate, and refine AI-generated code for critical attributes such as efficiency, scalability, correctness, and maintainability.
- Collaborate with cross-functional research teams to improve AI-driven coding solutions, measured against established industry performance benchmarks.
- Design robust verification mechanisms capable of automatically validating complex software engineering solutions.
- Analyze various stages of the software development lifecycle, including architecture design, API design, prototyping, production deployment, monitoring, and maintenance, to evaluate model performance throughout.
- Build internal tools or agents specifically designed to detect common code quality issues and identify recurring error patterns.
Requirements
- Several years of professional software engineering experience.
- At least 2 years of continuous, full-time experience gained at a product-focused technology company.
- Strong expertise in building and successfully deploying scalable, production-grade applications.
- Deep understanding of fundamental software architecture principles, effective debugging techniques, performance optimization strategies, and established code review standards.
- Proven experience working within modern development workflows and utilizing contemporary tooling.
- Strong written and verbal communication skills, essential for documenting structured evaluation feedback and collaborating effectively.
Engagement Details
- Flexible engagement: minimum 10 hours per week, up to 40 hours per week.
- Partial overlap with Pacific Time required to facilitate team collaboration.
- This is a contractor engagement; no medical or paid leave benefits are provided.
- Initial duration: 1 month, with strong potential for extension based on performance and evolving project needs.
Key Skills and Competencies
- LLM Evaluation
- AI-Generated Code Analysis
- Software Engineering
- Python, JavaScript, C/C++, Java, Rust, Go
- Software Architecture
- Performance Optimization
- Code Review
- Dataset Curation
- Debugging
- Verification Mechanism Design
How to Get Hired at Nexus Consulting
- Research Nexus Consulting's clients: Study their mission, values, recent news, and employee testimonials on LinkedIn and Glassdoor, especially focusing on their AI research initiatives.
- Tailor your resume: Highlight deep software engineering, LLM evaluation, and diverse language skills (Python, JavaScript, C/C++, Java, Rust, Go) for this specialized role.
- Showcase relevant projects: Demonstrate experience with AI-generated code analysis, LLM benchmarking, or building robust evaluation frameworks.
- Prepare for technical deep-dives: Expect in-depth questions on software architecture, debugging complex systems, performance optimization, and code quality standards.
- Emphasize communication: Practice articulating technical feedback clearly and demonstrating collaborative problem-solving, crucial for working with research teams.