Evaluating Large Language Models - extending the Open LLM Leaderboard
In the ever-evolving landscape of technology, technology leaders face a wide range of critical decisions concerning GenAI. In my last blog I used the example of deciding between prompt engineering and fine-tuning.
A more foundational decision is which LLM to choose for which use-case. There are quite a number of options to choose from, and new models are introduced rapidly. The good news is that the community has also come up with a number of ways to evaluate these models. One such example is the Open LLM Leaderboard.
This leaderboard scores models on the following four criteria (copied from the website above):
These are excellent criteria and a good starting point for enterprises to shortlist models and to keep track of model quality over time.
However, there are a couple of additional criteria that enterprises should look into before deciding to onboard a model. These criteria focus on measuring how good a fit the model is for your organisation.
For example, you wouldn't buy a car based only on the quality of its engine; there are plenty of other things you look into before you buy. Choosing a large language model is no different.
A few of these additional criteria are:
So, why am I writing a blog on this topic? I believe that when onboarding a new technology or platform, teams such as enterprise architects, technology leaders, implementation teams and business stakeholders should look at the solution holistically.
So, how do we enable this?
With the ServiceNow platform, one of the many workflows we offer is Application Portfolio Management (APM). This helps organisations onboard, roll out, maintain and decommission technologies.
One aspect of APM is the ability to score applications, which is exactly what we want to do when evaluating and onboarding LLMs.
These scores are calculated automatically from data in the system and then combined with scoring from stakeholders.
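To make the idea concrete, here is a minimal sketch of what blending automated and stakeholder scores could look like. The criteria names, weights and 1-5 scale below are my own assumptions for illustration; this is not ServiceNow APM's actual data model or API.

```python
# Hypothetical sketch: blend automated scores (from platform data) with
# stakeholder scores (from assessments) into one score per criterion.
# Criteria, weights and the 1-5 scale are illustrative assumptions.

AUTOMATED = {"functional_fit": 3.8, "technical_fit": 4.2}      # derived from system data
STAKEHOLDER = {"functional_fit": 3.2, "business_value": 4.5}   # assessment / survey input

def combine_scores(automated: dict, stakeholder: dict, auto_weight: float = 0.5) -> dict:
    """Blend the two sources per criterion; criteria present in only one
    source simply keep that source's score."""
    combined = {}
    for criterion in set(automated) | set(stakeholder):
        if criterion in automated and criterion in stakeholder:
            combined[criterion] = (auto_weight * automated[criterion]
                                   + (1 - auto_weight) * stakeholder[criterion])
        else:
            combined[criterion] = automated.get(criterion, stakeholder[criterion])
    return combined

if __name__ == "__main__":
    for name, score in sorted(combine_scores(AUTOMATED, STAKEHOLDER).items()):
        print(f"{name}: {score:.2f}")
```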
These scores are very useful, not just at the onboarding step, but during the whole lifecycle of the LLM at the organisation. Scores enable teams to act when gaps are identified. For example, if there is a gap in how actual users perceive and use AI (which can happen due to a lack of user training), it indicates that more organisational change management is required to deploy the model properly within the user community.
Below is one such output of evaluating LLMs (business value vs functional fit) using APM.
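For readers who want a feel for this kind of view, here is a rough sketch of how a business value vs functional fit chart could be approximated outside the platform. The model names and scores are invented for demonstration, and this is not how APM itself generates the output.

```python
# Illustrative only: plot invented LLM scores on a business value vs
# functional fit quadrant (1-5 scale assumed).
import matplotlib.pyplot as plt

models = {
    "Model A": (4.2, 3.5),  # (functional_fit, business_value)
    "Model B": (3.1, 4.4),
    "Model C": (2.5, 2.8),
}

fig, ax = plt.subplots()
for name, (fit, value) in models.items():
    ax.scatter(fit, value)
    ax.annotate(name, (fit, value), textcoords="offset points", xytext=(5, 5))

# Quadrant lines at the midpoint of the assumed 1-5 scale
ax.axvline(3, linestyle="--", color="grey")
ax.axhline(3, linestyle="--", color="grey")
ax.set_xlim(1, 5)
ax.set_ylim(1, 5)
ax.set_xlabel("Functional fit")
ax.set_ylabel("Business value")
ax.set_title("LLM portfolio: business value vs functional fit")
plt.show()
```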
Beyond scoring, there are further benefits of APM, for example creating a governance framework for how new technologies are onboarded and off-boarded. More on this in my upcoming blogs.