ONEBench to Test Them All: Sample-Level Benchmarking Over Open-Ended Capabilities
Today's paper introduces ONEBench, a new approach to evaluating AI models that moves beyond traditional fixed test datasets. The method consolidates individual evaluation datasets into a unified, expandable sample pool that can test various model capabilities. This approach allows for more flexible and comprehensive evaluation of AI models while addressing challenges like dataset bias and overfitting.
Method Overview
ONEBench works by treating individual test samples as atomic units that can be combined in various ways to create custom evaluation benchmarks. Instead of using fixed test sets, the method maintains a large pool of annotated samples that can be queried based on specific capabilities of interest.
The approach faces two main challenges: heterogeneity and incompleteness. Heterogeneity refers to mixing different types of measurements (binary, numeric, and ordinal), while incompleteness refers to comparing models that were evaluated on different subsets of the test data.
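To make the sample-pool idea concrete, here is a minimal sketch of what one pooled sample could look like. The schema (the PoolSample class, capability_tags, scores, and so on) is an illustrative assumption, not the paper's actual data format; it simply shows how heterogeneous metric types and sparse per-model results can coexist in one pool.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Literal

@dataclass
class PoolSample:
    """One atomic test sample in the pool (illustrative schema, not the paper's code)."""
    sample_id: str
    prompt: str
    capability_tags: List[str]                            # e.g. ["arithmetic", "reasoning"]
    metric_type: Literal["binary", "numeric", "ordinal"]  # heterogeneous measurement types
    scores: Dict[str, float] = field(default_factory=dict)  # model name -> score; sparse

# A model missing from `scores` was simply never run on that sample;
# this sparsity is the incompleteness the aggregation must tolerate.
pool = [
    PoolSample("s1", "What is 17 * 24?", ["arithmetic"], "binary",
               {"model_a": 1.0, "model_b": 0.0}),
    PoolSample("s2", "Summarize the following article ...", ["summarization"], "numeric",
               {"model_a": 0.71, "model_c": 0.64}),
]
```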
To address these challenges, ONEBench employs social choice theory, treating data samples as voters expressing preferences among models. The method converts various measurements into ordinal rankings and uses a Plackett-Luce framework to aggregate scores effectively. This approach ensures reliable model comparisons even with relatively small amounts of data and can handle missing measurements efficiently.
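The aggregation step can be illustrated with a short, self-contained sketch of Plackett-Luce fitting via Hunter's MM updates, which is the standard way to estimate this model from ordinal rankings. This is a generic implementation of the technique the paper builds on, not the authors' code; the function name plackett_luce_mm and the toy rankings are assumptions.

```python
import numpy as np

def plackett_luce_mm(rankings, n_models, n_iters=200, tol=1e-8):
    """Estimate Plackett-Luce strengths from ordinal rankings (Hunter's MM updates).

    Each ranking lists model indices from best to worst and may cover only a
    subset of the models (a partial "vote"), which is how missing measurements
    show up in practice.
    """
    gamma = np.full(n_models, 1.0 / n_models)
    for _ in range(n_iters):
        wins = np.zeros(n_models)   # how often each model is chosen at some stage
        denom = np.zeros(n_models)  # MM denominator accumulated per model
        for r in rankings:
            for t in range(len(r) - 1):   # the last-ranked item makes no choice
                remaining = r[t:]         # candidate set at this stage
                z = gamma[remaining].sum()
                wins[r[t]] += 1.0
                denom[remaining] += 1.0 / z
        new_gamma = np.where(denom > 0, wins / np.maximum(denom, 1e-12), gamma)
        new_gamma /= new_gamma.sum()
        if np.abs(new_gamma - gamma).max() < tol:
            return new_gamma
        gamma = new_gamma
    return gamma

# Each test sample acts as a "voter": here three samples rank models 0, 1, 2.
sample_votes = [[0, 1, 2], [0, 2, 1], [1, 0]]   # the last vote is partial
strengths = plackett_luce_mm(sample_votes, n_models=3)
print(np.argsort(-strengths))                   # aggregate ranking, best model first
```

Because each sample only ranks the models it was actually scored on, partial votes slot into the same update, which is what makes the scheme tolerant of sparse measurements.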
To evaluate models on a capability of interest, the system employs a personalized concept-querying framework built on two mechanisms:
1. Semantic Search: k-NN lookups are performed in embedding space (all-MiniLM-L6-v2 for language tasks and SigLIP-B16 for vision-language tasks). The system retrieves the top-k most relevant samples, using tuned cosine-similarity thresholds of 0.3 for ONEBench-LLM and 0.7 for ONEBench-LMM (a minimal retrieval sketch follows this list).
2. Metadata Search: Benchmarks equipped with detailed metadata (e.g., MMMU) are queried based on constraints such as image type, question type, field, or subfield. For datasets with limited metadata (e.g., COCO), retrieval relies on other available descriptors.
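Below is a minimal sketch of the semantic-search mechanism, assuming a sentence-transformers setup. The sample_pool variable, the query_samples helper, and the top_k value are illustrative assumptions; the all-MiniLM-L6-v2 encoder and the 0.3 threshold are the values reported for ONEBench-LLM.

```python
from sentence_transformers import SentenceTransformer, util

# `sample_pool` is assumed to be a list of PoolSample-like records with a `prompt` field.
encoder = SentenceTransformer("all-MiniLM-L6-v2")   # embedding model used for ONEBench-LLM
pool_embeddings = encoder.encode(
    [s.prompt for s in sample_pool],
    convert_to_tensor=True,
    normalize_embeddings=True,
)

def query_samples(query, top_k=1000, threshold=0.3):
    """k-NN lookup in embedding space, keeping hits above the cosine-similarity threshold."""
    query_embedding = encoder.encode(query, convert_to_tensor=True, normalize_embeddings=True)
    hits = util.semantic_search(query_embedding, pool_embeddings, top_k=top_k)[0]
    return [h["corpus_id"] for h in hits if h["score"] >= threshold]

relevant_idx = query_samples("multi-step arithmetic word problems")
```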
These mechanisms retrieve representative samples tailored to each evaluation query. The models' ordinal rankings on those samples are then aggregated with the Plackett-Luce model, producing a model ranking for every query.
Results
The paper demonstrates that ONEBench's aggregation algorithm produces rankings that correlate strongly with traditional average scores on homogeneous datasets. The rankings remain robust even when over 95% of measurements are missing, which can cut evaluation costs by up to 20 times without significantly affecting the results. The method has been implemented in two variants: ONEBench-LLM for language models and ONEBench-LMM for vision-language models.
Conclusion
ONEBench presents a novel approach to evaluating foundation models by enabling dynamic, sample-level testing that can grow alongside rapidly developing AI capabilities. For more information, please consult the full paper.
Congrats to the authors for their work!
Ghosh, Adhiraj, et al. "ONEBench to Test Them All: Sample-Level Benchmarking Over Open-Ended Capabilities." arXiv preprint arXiv:2412.06745 (2024).