ONEBench to Test Them All: Sample-Level Benchmarking Over Open-Ended Capabilities
Today's paper introduces ONEBench, a new approach to evaluating AI models that moves beyond traditional fixed test datasets. The method consolidates individual evaluation datasets into a unified, expandable sample pool that can test various model capabilities. This approach allows for more flexible and comprehensive evaluation of AI models while addressing challenges like dataset bias and overfitting.
Method Overview
ONEBench works by treating individual test samples as atomic units that can be combined in various ways to create custom evaluation benchmarks. Instead of using fixed test sets, the method maintains a large pool of annotated samples that can be queried based on specific capabilities of interest.
The approach faces two main challenges: heterogeneity and incompleteness. Heterogeneity refers to mixing different types of measurements (binary, numeric, and ordinal), while incompleteness refers to comparing models that were evaluated on different subsets of the test data.
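To make the sample-pool idea concrete, here is a minimal sketch of what one pooled sample could look like. The schema (the PoolSample class, capability_tags, scores, and so on) is an illustrative assumption, not the paper's actual data format; it simply shows how heterogeneous metric types and sparse per-model results can coexist in one pool.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Literal

@dataclass
class PoolSample:
    """One atomic test sample in the pool (illustrative schema, not the paper's code)."""
    sample_id: str
    prompt: str
    capability_tags: List[str]                            # e.g. ["arithmetic", "reasoning"]
    metric_type: Literal["binary", "numeric", "ordinal"]  # heterogeneous measurement types
    scores: Dict[str, float] = field(default_factory=dict)  # model name -> score; sparse

# A model missing from `scores` was simply never run on that sample;
# this sparsity is the incompleteness the aggregation must tolerate.
pool = [
    PoolSample("s1", "What is 17 * 24?", ["arithmetic"], "binary",
               {"model_a": 1.0, "model_b": 0.0}),
    PoolSample("s2", "Summarize the following article ...", ["summarization"], "numeric",
               {"model_a": 0.71, "model_c": 0.64}),
]
```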
To address these challenges, ONEBench employs social choice theory, treating data samples as voters expressing preferences among models. The method converts various measurements into ordinal rankings and uses a Plackett-Luce framework to aggregate scores effectively. This approach ensures reliable model comparisons even with relatively small amounts of data and can handle missing measurements efficiently.
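The aggregation step can be illustrated with a short, self-contained sketch of Plackett-Luce fitting via Hunter's MM updates, which is the standard way to estimate this model from ordinal rankings. This is a generic implementation of the technique the paper builds on, not the authors' code; the function name plackett_luce_mm and the toy rankings are assumptions.

```python
import numpy as np

def plackett_luce_mm(rankings, n_models, n_iters=200, tol=1e-8):
    """Estimate Plackett-Luce strengths from ordinal rankings (Hunter's MM updates).

    Each ranking lists model indices from best to worst and may cover only a
    subset of the models (a partial "vote"), which is how missing measurements
    show up in practice.
    """
    gamma = np.full(n_models, 1.0 / n_models)
    for _ in range(n_iters):
        wins = np.zeros(n_models)   # how often each model is chosen at some stage
        denom = np.zeros(n_models)  # MM denominator accumulated per model
        for r in rankings:
            for t in range(len(r) - 1):   # the last-ranked item makes no choice
                remaining = r[t:]         # candidate set at this stage
                z = gamma[remaining].sum()
                wins[r[t]] += 1.0
                denom[remaining] += 1.0 / z
        new_gamma = np.where(denom > 0, wins / np.maximum(denom, 1e-12), gamma)
        new_gamma /= new_gamma.sum()
        if np.abs(new_gamma - gamma).max() < tol:
            return new_gamma
        gamma = new_gamma
    return gamma

# Each test sample acts as a "voter": here three samples rank models 0, 1, 2.
sample_votes = [[0, 1, 2], [0, 2, 1], [1, 0]]   # the last vote is partial
strengths = plackett_luce_mm(sample_votes, n_models=3)
print(np.argsort(-strengths))                   # aggregate ranking, best model first
```

Because each sample only ranks the models it was actually scored on, partial votes slot into the same update, which is what makes the scheme tolerant of sparse measurements.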
To evaluate models on a capability of interest, the system employs a personalized concept-querying framework built on two mechanisms:
1. Semantic Search: k-NN lookups are performed in embedding space (all-MiniLM-L6-v2 for language tasks and SigLIP-B16 for vision-language tasks). The system retrieves the top-k most relevant samples, using tuned cosine-similarity thresholds of 0.3 for ONEBench-LLM and 0.7 for ONEBench-LMM (a minimal retrieval sketch follows this list).
2. Metadata Search: Benchmarks equipped with detailed metadata (e.g., MMMU) are queried based on constraints such as image type, question type, field, or subfield. For datasets with limited metadata (e.g., COCO), retrieval relies on other available descriptors.
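Below is a minimal sketch of the semantic-search mechanism, assuming a sentence-transformers setup. The sample_pool variable, the query_samples helper, and the top_k value are illustrative assumptions; the all-MiniLM-L6-v2 encoder and the 0.3 threshold are the values reported for ONEBench-LLM.

```python
from sentence_transformers import SentenceTransformer, util

# `sample_pool` is assumed to be a list of PoolSample-like records with a `prompt` field.
encoder = SentenceTransformer("all-MiniLM-L6-v2")   # embedding model used for ONEBench-LLM
pool_embeddings = encoder.encode(
    [s.prompt for s in sample_pool],
    convert_to_tensor=True,
    normalize_embeddings=True,
)

def query_samples(query, top_k=1000, threshold=0.3):
    """k-NN lookup in embedding space, keeping hits above the cosine-similarity threshold."""
    query_embedding = encoder.encode(query, convert_to_tensor=True, normalize_embeddings=True)
    hits = util.semantic_search(query_embedding, pool_embeddings, top_k=top_k)[0]
    return [h["corpus_id"] for h in hits if h["score"] >= threshold]

relevant_idx = query_samples("multi-step arithmetic word problems")
```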
These mechanisms retrieve representative samples tailored to each evaluation query. The models' ordinal rankings on those samples are then aggregated with the Plackett-Luce model, producing a model ranking for every query.
Results
The paper demonstrates that ONEBench's aggregation algorithm produces rankings that correlate strongly with traditional average scores on homogeneous datasets. The rankings remain robust even when over 95% of measurements are missing, which can cut evaluation costs by up to 20 times without significantly affecting the results. The method has been implemented in two variants: ONEBench-LLM for language models and ONEBench-LMM for vision-language models.
Conclusion
ONEBench presents a novel approach to evaluating foundation models by enabling dynamic, sample-level testing that can grow alongside rapidly developing AI capabilities. For more information, please consult the full paper.
Congrats to the authors for their work!
Ghosh, Adhiraj, et al. "ONEBench to Test Them All: Sample-Level Benchmarking Over Open-Ended Capabilities." arXiv preprint arXiv:2412.06745 (2024).