How to test the latest AI models

Every few days, a new AI model seems to drop from one of the big AI companies. This week it was Anthropic’s turn with Claude 3.7 Sonnet. Social media then explodes with people saying it’s either the best thing in the history of time or massively overhyped. But as a normal person, how do you work out whether a new AI model is actually better than what you were using before? 

Thankfully for most of us, you don’t need a brain the size of a planet or a PhD in machine learning to test AI. It will be put through the wringer with benchmarks like Humanity’s Last Exam (https://agi.safe.ai/) for you! But to find out for yourself as an individual, you just need a bit of curiosity, a decent amount of time and a structured approach.

Here’s how to test if an AI model is worth the hype... 

Test it on what you actually do 

If you’re a lawyer, get it to summarise a contract. If you’re in training, ask it to create a lesson plan. If you’re in marketing, see if it can write ad copy that sounds even remotely human. 

Have a list of your day-to-day tasks and use cases in your back pocket and use the same ones each time. 

Most new AI models perform well in general tasks, but can they do your specific tasks well? 

Quick test: Ask it to draft an email you’d actually send. If you still need to rewrite most of it, the model probably isn’t saving you time. 
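
If you want to make this repeatable, one option is to keep those standard prompts in a small script and run the same list against every new model. Below is a minimal Python sketch, assuming the official OpenAI Python SDK and an OPENAI_API_KEY environment variable; the model name and the prompts are placeholders you would swap for your own.

```python
# A minimal sketch of a personal benchmark: the same day-to-day prompts run
# against whichever model you want to evaluate.
# Assumes the official OpenAI Python SDK (pip install openai) and an
# OPENAI_API_KEY environment variable; model name and prompts are placeholders.
from openai import OpenAI

client = OpenAI()

# Your standard, real-world test prompts. Keep these fixed between models.
TEST_PROMPTS = [
    "Draft a short email rescheduling Thursday's project meeting to Friday.",
    "Summarise the key risks in this contract clause: <paste clause here>",
    "Write a 30-word ad for a local plumber that doesn't sound robotic.",
]

def run_personal_benchmark(model_name: str) -> None:
    """Send every saved prompt to one model and print the raw outputs."""
    for prompt in TEST_PROMPTS:
        response = client.chat.completions.create(
            model=model_name,
            messages=[{"role": "user", "content": prompt}],
        )
        print(f"--- {model_name} | {prompt[:40]}")
        print(response.choices[0].message.content)

run_personal_benchmark("gpt-4o")  # placeholder model name
```

The point is consistency: because the prompts never change, any difference in output quality is down to the model, not the question.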

Push it with edge cases 

Find out where the AI model’s boundaries are. What is its ‘jagged frontier’ (with thanks to Ethan Mollick for the phrase)? When does it start going haywire? 

  • Give it an ambiguous request (‘Write a summary of this article’ but don’t give it the article). Does it ask for clarification or just write a nonsense summary of something else? 

  • Ask it to handle nuance (‘Explain AI regulation to a 10-year-old and then to a CEO’). Can it adjust its tone and depth? 

If it falls apart on these, it’s probably not as advanced as it claims. 

Test its knowledge 

New models boast about their knowledge cut-off dates, but that doesn’t mean they correctly understand the information they have been trained on. 

  • Ask it about recent events (e.g. ‘What happened in UK politics last week?’ - actually don’t ask that, it’s too depressing). 

  • Get it to summarise a niche topic you know well. For me it’s instructions on how to make a Roorkhee chair. As niche as you can get. 

AI is confident even when it's wrong, so if it’s misrepresenting things you know, it’s probably unreliable elsewhere too. 

Measure how much effort and time it actually saves you 

The best AI tools don’t just generate text; they make your work and life easier. 

  • Ask it to help you plan your next holiday, including flights, hotel comparisons, and an itinerary. 

  • Get it to rewrite a bad piece of writing into something clear and professional. 

If you’re spending as much time fixing the AI’s output as you would have spent writing or researching from scratch, it’s not a game-changer. 

Compare models side by side 

The easiest way to see if a new AI model is better? Run the exact same prompt across different models. 

I have five different AI models open at any one time on a dedicated screen. Probably overkill, but it really highlights the differences and capabilities of each. 

  • For example, I would try ChatGPT (GPT-4), Claude, Gemini, Copilot and Le Chat with the same request. 

  • Look at response quality, depth, accuracy, and how much editing you need to do. 

This gives you an instant reality check on whether the new model is actually an upgrade or just marketing rubbish. 
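
If you’re comfortable with a little code, you can also run the comparison programmatically instead of juggling browser tabs. Here’s a minimal sketch, assuming the official OpenAI and Anthropic Python SDKs are installed and API keys are set; the model names are placeholders to replace with whatever you’re comparing.

```python
# A minimal side-by-side comparison sketch. Assumes the official OpenAI and
# Anthropic Python SDKs (pip install openai anthropic) and API keys set as
# OPENAI_API_KEY / ANTHROPIC_API_KEY. Model names below are placeholders.
from openai import OpenAI
import anthropic

openai_client = OpenAI()
anthropic_client = anthropic.Anthropic()

def ask_openai(prompt: str, model: str = "gpt-4o") -> str:
    response = openai_client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def ask_anthropic(prompt: str, model: str = "claude-3-7-sonnet-latest") -> str:
    message = anthropic_client.messages.create(
        model=model,
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return message.content[0].text

prompt = "Explain AI regulation to a 10-year-old, then to a CEO."
print("=== OpenAI ===\n" + ask_openai(prompt))
print("=== Anthropic ===\n" + ask_anthropic(prompt))
```

You still judge the outputs by eye; the script just guarantees every model sees exactly the same prompt.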

Use a software tool to support you

If you have the budget, there is a breed of software tools such as arthur.ai that can help you evaluate model performance, but these don’t necessarily cover the things you need to test for your role.

The best AI is the one that helps you do what you need to do 

For a moment, forget benchmarks and marketing claims. The best way to test AI is to see how well it fits into your life and work. If it works for you in your role, that’s a win – stick with it... until the next one 😵💫 

Final thought - don't forget good AI governance. Stay within the boundaries of your AI policy when testing new models.

Keep up to date with what’s happening in AI by signing up to our newsletter: https://iwantmore.ai/ai-newsletter 

 

 
