An Empirical Study of ML Evaluation Harnesses in the Wild
Machine learning (ML) evaluation depends on evaluation harnesses: software systems that orchestrate model invocation, data loading, metric computation, and result reporting. Despite their critical role in ML infrastructure, no software engineering (SE) study has investigated evaluation harnesses as software products. We coin the term Evaluation Engineering (EvalEng) for this emerging SE concern and present the first large-scale empirical study of evaluation harnesses in the wild. We identify 57 evaluation harnesses, extract a 5-stage workflow model (34 operational strategies across 9 workflow steps), and mine 19,638 GitHub issues using an LLM-based classifier calibrated against human consensus (κ > 0.87). Our analysis reveals that unimplemented feature gaps (24.3%), documentation deficiencies (20.3%), and validation gaps (17.2%) account for 61.7% of all operational challenges, and that these challenges shift from environment-related concerns in early workflow stages to scoring-related concerns in later stages. We distill actionable implications for harness developers, benchmark curators, and SE researchers.
A 4-stage empirical protocol combining systematic collection, qualitative modeling, large-scale mining, and LLM-assisted classification.
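The calibration step in this protocol compares classifier labels against human consensus labels using an agreement statistic κ. As a minimal sketch, assuming κ refers to Cohen's kappa between two label sequences (the page does not say which variant was used), agreement could be computed as below. The function name `cohens_kappa` and the toy labels are hypothetical and are not data from the study.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)

    # Observed agreement: fraction of items where both raters chose the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected chance agreement from each rater's marginal label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    p_e = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if p_e == 1.0:
        # Both raters used a single identical label everywhere; treat as full agreement.
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


# Toy labels for illustration only (hypothetical, not taken from the study).
human = ["feature_gap", "docs", "validation", "docs", "feature_gap", "validation"]
model = ["feature_gap", "docs", "validation", "feature_gap", "feature_gap", "validation"]
kappa = cohens_kappa(human, model)
print(f"kappa = {kappa:.2f}")
```

In practice a calibration loop would compare the resulting κ against a chosen threshold before applying the classifier to the full issue set.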
What is the operational workflow for evaluation harness execution?
Under-adopted strategies signal engineering debt
Four distinct categories of evaluation harnesses
What are the root causes of operational challenges in evaluation harnesses?
Percentage of classified issues per root cause
How do root cause distributions vary across evaluation workflow stages?
Operational challenges follow a predictable arc: early stages (S0 Provisioning) are dominated by environment incompatibility and external dependency breakage (36.2% combined), while later stages (S3 Assessment) shift toward algorithmic errors (25.9%) and validation gaps (22.5%). Engineers should expect different failure modes depending on where in the pipeline they are working.
Relative distribution of root causes per stage
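A per-stage relative distribution like the one described here can be sketched as a normalization of issue counts within each workflow stage. The example below is a minimal illustration, assuming each classified issue is reduced to a `(stage, root_cause)` pair. The function name, the label strings, and the toy records are hypothetical and do not reproduce the study's data.

```python
from collections import Counter, defaultdict


def root_cause_share_by_stage(issues):
    """Return {stage: {root_cause: percentage}}; each stage's shares sum to 100."""
    counts = defaultdict(Counter)
    for stage, cause in issues:
        counts[stage][cause] += 1

    shares = {}
    for stage, cause_counts in counts.items():
        total = sum(cause_counts.values())
        # Normalize within the stage so stages with many issues do not dominate.
        shares[stage] = {c: 100.0 * k / total for c, k in cause_counts.items()}
    return shares


# Toy records for illustration only (hypothetical, not the study's data).
toy_issues = [
    ("S0", "environment_incompatibility"),
    ("S0", "environment_incompatibility"),
    ("S0", "dependency_breakage"),
    ("S0", "documentation"),
    ("S3", "algorithmic_error"),
    ("S3", "validation_gap"),
    ("S3", "validation_gap"),
    ("S3", "documentation"),
]
shares = root_cause_share_by_stage(toy_issues)
for stage in sorted(shares):
    print(stage, shares[stage])
```

Normalizing within each stage, rather than across all issues, is what makes the distributions comparable between stages that receive very different issue volumes.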
If you use this work, please cite:
{{ bibtex }}