Evaluating Large Language Models - extending the Open LLM Leaderboard
In the ever-evolving landscape of technology, technology leaders face a wide range of critical decisions concerning GenAI. In my last blog I used the example of deciding between prompt engineering and fine-tuning.
A more foundational decision is which LLM to choose for which use-case. There are quite a number of options to choose from, and new models are introduced rapidly. The good news is that the community has also come up with a number of ways to evaluate these models. One such example is the Open LLM Leaderboard.
This leaderboard scores models on the following four criteria (copied from the website above):
These are excellent criteria and a good starting point for enterprises to shortlist models and to keep track of model quality over time.
However, there are a couple of additional criteria that enterprises should look into before deciding to onboard a model. These criteria focus on measuring how good a fit the model is for your organisation.
For example, you wouldn't buy a car based only on the quality of its engine; there are plenty of other things you look into before you buy. Choosing a large language model is no different.
A few of these additional criteria are:
So, why am I writing a blog on this topic? I believe that when onboarding a new technology or platform, teams such as enterprise architects, technology leaders, implementation teams and business stakeholders should look at the solution holistically.
So, how do we enable this?
With the ServiceNow platform, one of the many workflows we offer is Application Portfolio Management (APM). This helps organisations onboard, roll out, maintain and decommission technologies.
One aspect of APM is the ability to score applications, which is exactly what we want to do when evaluating and onboarding LLMs.
These scores are calculated automatically from data in the system and then combined with scoring from stakeholders.
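To make the idea concrete, here is a minimal sketch of what blending automated and stakeholder scores could look like. The criteria names, weights and 1-5 scale below are my own assumptions for illustration; this is not ServiceNow APM's actual data model or API.

```python
# Hypothetical sketch: blend automated scores (from platform data) with
# stakeholder scores (from assessments) into one score per criterion.
# Criteria, weights and the 1-5 scale are illustrative assumptions.

AUTOMATED = {"functional_fit": 3.8, "technical_fit": 4.2}      # derived from system data
STAKEHOLDER = {"functional_fit": 3.2, "business_value": 4.5}   # assessment / survey input

def combine_scores(automated: dict, stakeholder: dict, auto_weight: float = 0.5) -> dict:
    """Blend the two sources per criterion; criteria present in only one
    source simply keep that source's score."""
    combined = {}
    for criterion in set(automated) | set(stakeholder):
        if criterion in automated and criterion in stakeholder:
            combined[criterion] = (auto_weight * automated[criterion]
                                   + (1 - auto_weight) * stakeholder[criterion])
        else:
            combined[criterion] = automated.get(criterion, stakeholder[criterion])
    return combined

if __name__ == "__main__":
    for name, score in sorted(combine_scores(AUTOMATED, STAKEHOLDER).items()):
        print(f"{name}: {score:.2f}")
```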
These scores are very useful, not just at the onboarding step, but during the whole lifecycle of the LLM at the organisation. Scores enable teams to act when gaps are identified. For example, if there is a gap in how actual users perceive and use AI (which can happen due to a lack of user training), it indicates that more organisational change management is required to deploy the model properly within the user community.
Below is one such output of evaluating LLMs (business value vs functional fit) using APM.
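For readers who want a feel for this kind of view, here is a rough sketch of how a business value vs functional fit chart could be approximated outside the platform. The model names and scores are invented for demonstration, and this is not how APM itself generates the output.

```python
# Illustrative only: plot invented LLM scores on a business value vs
# functional fit quadrant (1-5 scale assumed).
import matplotlib.pyplot as plt

models = {
    "Model A": (4.2, 3.5),  # (functional_fit, business_value)
    "Model B": (3.1, 4.4),
    "Model C": (2.5, 2.8),
}

fig, ax = plt.subplots()
for name, (fit, value) in models.items():
    ax.scatter(fit, value)
    ax.annotate(name, (fit, value), textcoords="offset points", xytext=(5, 5))

# Quadrant lines at the midpoint of the assumed 1-5 scale
ax.axvline(3, linestyle="--", color="grey")
ax.axhline(3, linestyle="--", color="grey")
ax.set_xlim(1, 5)
ax.set_ylim(1, 5)
ax.set_xlabel("Functional fit")
ax.set_ylabel("Business value")
ax.set_title("LLM portfolio: business value vs functional fit")
plt.show()
```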
Beyond scoring, there are further benefits of APM, for example creating a governance framework for how new technologies are onboarded and off-boarded. More on this in my upcoming blogs.