Towards Evaluation Engineering

An Empirical Study of ML Evaluation Harnesses in the Wild


Abstract

Machine learning (ML) evaluation depends on evaluation harnesses: software systems that orchestrate model invocation, data loading, metric computation, and result reporting. Despite their critical role in ML infrastructure, no software engineering (SE) study has investigated evaluation harnesses as software products. We coin the term Evaluation Engineering (EvalEng) for this emerging SE concern and present the first large-scale empirical study of evaluation harnesses in the wild. We identify 57 evaluation harnesses, extract a 5-stage workflow model (34 operational strategies across 9 workflow steps), and mine 19,638 GitHub issues using an LLM-based classifier calibrated against human consensus (κ > 0.87). Our analysis reveals that unimplemented feature gaps (24.3%), documentation deficiencies (20.3%), and validation gaps (17.2%) account for 61.7% of all operational challenges, and that these challenges shift from environment-related concerns in early workflow stages to scoring-related concerns in later stages. We distill actionable implications for harness developers, benchmark curators, and SE researchers.
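The agreement figure above (κ > 0.87) can be illustrated with a small sketch. The page does not say which κ variant was used, so this assumes Cohen's kappa between one set of human consensus labels and one set of LLM labels. The function name and the toy labels are hypothetical and are not taken from the study.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long label sequences (assumed variant)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the two raters' marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical toy labels, for illustration only.
human = ["doc", "doc", "feature", "validation"]
llm = ["doc", "doc", "feature", "feature"]
kappa = cohen_kappa(human, llm)
print(round(kappa, 2))
```

A value near 1 means the classifier agrees with the human labels far more often than chance alone would predict.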

Motivation

Methodology

A 4-stage empirical protocol combining systematic collection, qualitative modeling, large-scale mining, and LLM-assisted classification.

Research Questions

RQ1

Evaluation Workflow Model

What is the operational workflow for evaluation harness execution?
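As a point of reference for what a harness does, the following is a minimal sketch built only from the four responsibilities named in the abstract: data loading, model invocation, metric computation, and result reporting. It does not reproduce the study's 5-stage model, whose stages are not listed here. All names (`load_data`, `toy_model`, `exact_match`, `run_harness`) and the toy examples are hypothetical.

```python
from typing import Callable, Iterable

Example = tuple[str, str]  # (input prompt, reference answer)


def load_data() -> list[Example]:
    # Data loading: a hypothetical in-memory dataset stands in for a benchmark.
    return [("2+2", "4"), ("3+3", "6"), ("capital of France", "Paris")]


def toy_model(prompt: str) -> str:
    # Model invocation: a lookup table stands in for a real model call.
    answers = {"2+2": "4", "3+3": "7"}
    return answers.get(prompt, "")


def exact_match(prediction: str, reference: str) -> float:
    # Metric computation: 1.0 on an exact string match, else 0.0.
    return 1.0 if prediction.strip() == reference.strip() else 0.0


def run_harness(
    model: Callable[[str], str],
    data: Iterable[Example],
    metric: Callable[[str, str], float],
) -> dict:
    # Orchestration: invoke the model on each example and score the output.
    scores = [metric(model(prompt), reference) for prompt, reference in data]
    mean = sum(scores) / len(scores) if scores else 0.0
    # Result reporting: return a small summary record.
    return {"n": len(scores), "mean_score": mean}


report = run_harness(toy_model, load_data(), exact_match)
print(report)
```

Real harnesses add provisioning, caching, and error handling around each of these steps, which is where many of the operational challenges studied here arise.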

The 5-Stage Execution Workflow


Strategy Adoption Gaps

Under-adopted strategies signal engineering debt

Harness Archetypes (n=57)

Four distinct categories of evaluation harnesses

The Four Harness Archetypes

RQ2

Root Causes of Operational Challenges

What are the root causes of operational challenges in evaluation harnesses?

61.7% of all issues explained by top 3 root causes

All Root Cause Categories (n = 16,560 issues)

Percentage of classified issues per root cause
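The per-category percentages and the combined share of the top root causes could be computed along the following lines. This is a sketch, not the study's analysis code: the function names and the toy labels are hypothetical. Summing the three rounded percentages from the abstract gives 61.8 rather than the reported 61.7. The gap is presumably because the headline figure was computed from unrounded counts, though the page does not say.

```python
from collections import Counter


def root_cause_shares(labels):
    """Return (root cause, percentage) pairs, sorted by descending share."""
    counts = Counter(labels)
    total = sum(counts.values())
    return [(name, 100 * n / total) for name, n in counts.most_common()]


def top_k_share(labels, k):
    """Combined percentage of the k most frequent root causes."""
    return sum(pct for _, pct in root_cause_shares(labels)[:k])


# Hypothetical toy labels, for illustration only.
toy_labels = ["feature_gap"] * 5 + ["documentation"] * 3 + ["validation"] * 2
print(root_cause_shares(toy_labels))
print(top_k_share(toy_labels, 2))

# Rounded percentages as reported in the abstract.
reported_top3 = [24.3, 20.3, 17.2]
print(round(sum(reported_top3), 1))
```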


RQ3

Stage-wise Challenge Distribution

How do root cause distributions vary across evaluation workflow stages?


Key Insight

Operational challenges follow a predictable arc: early stages (S0 Provisioning) are dominated by environment incompatibility and external dependency breakage (36.2% combined), while later stages (S3 Assessment) shift toward algorithmic errors (25.9%) and validation gaps (22.5%). Engineers should expect different failure modes depending on where in the pipeline they are working.

Top Root Causes by Workflow Stage

Relative distribution of root causes per stage
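A relative distribution per stage normalizes root cause counts within each stage, so every stage sums to 100%. The sketch below shows one way to compute it, assuming each classified issue carries a stage label and a root cause label. The function name and the toy issue list are hypothetical, and the stage identifiers `S0` and `S3` are borrowed from the key insight above purely as labels.

```python
from collections import Counter, defaultdict


def stage_distribution(issues):
    """Map each stage to {root cause: percentage of that stage's issues}."""
    per_stage = defaultdict(Counter)
    for stage, cause in issues:
        per_stage[stage][cause] += 1
    result = {}
    for stage, counts in per_stage.items():
        total = sum(counts.values())
        # Normalize within the stage so shares are comparable across stages.
        result[stage] = {
            cause: round(100 * n / total, 1) for cause, n in counts.most_common()
        }
    return result


# Hypothetical toy issues as (stage, root cause) pairs, for illustration only.
toy_issues = (
    [("S0", "environment")] * 3
    + [("S0", "dependency")]
    + [("S3", "algorithmic")]
    + [("S3", "validation")]
)
distribution = stage_distribution(toy_issues)
print(distribution)
```

Normalizing within each stage, rather than over all issues, keeps stages with few issues from being visually drowned out by stages with many.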

