Evaluating prompts locally with Ollama and PromptLab
Developing a Gen AI app that works reliably is a big challenge in today's industry. Even the most advanced AI models can generate misleading, inaccurate, or harmful outputs if not properly evaluated. Over the last two years, we’ve seen the rise of some great evaluation libraries, yet end-to-end prompt experimentation and productionization remain unnecessarily complex. As a result, many of us rely on manual prompt testing and ‘vibe checks’ instead of proper evaluation before deploying to production.
In this article, we shall see how to use PromptLab to implement end-to-end prompt experimentation locally.
Code
The complete code is available here - promptlab/samples/quickstart
Consider starring the PromptLab repo to show your support and encourage the developers.
Tools
Ollama - runs the inference and embedding models locally.
PromptLab - manages prompt templates, datasets, experiments, and the local studio.
Ragas - provides the evaluation metrics.
Use Case
We've all heard of overburdened teachers struggling to review countless essays. In this article, we'll develop a prompt to help them generate effective feedback more efficiently.
To better understand the use case, let's examine a sample essay and the feedback provided by the teacher.
Topic
------
An interest or hobby that you enjoy
Submitted Essay
------------------
One hobby I really enjoy is playing chess. I love how every game is different and makes me think carefully about my next move. It’s exciting to plan strategies and try to outsmart my opponent. Sometimes, I play against friends, and other times, I practice online. Chess helps me stay focused and teaches me patience. I also enjoy watching professional chess matches to learn new tactics. It feels great when I win a game after a tough battle. Playing chess is not just fun—it also makes me smarter!
Feedback
----------
Grammar & Spelling – The grammar and spelling are mostly correct, with no major mistakes. Sentences are structured well, making the essay easy to read.
Clarity & Fluency – The writing flows smoothly, and ideas are expressed clearly. However, some sentences could be slightly more detailed to enhance explanation.
Content & Relevance – The essay stays on topic and explains why chess is enjoyable. It includes personal experiences and reasons, making it engaging. Adding more examples of specific strategies or memorable games could make it even stronger.
Structure & Organization – The essay has a clear beginning, middle, and end. The introduction introduces the topic well, and the conclusion wraps it up nicely. A transition between some sentences could improve the overall flow.
Now let's try to write a prompt to generate similar feedback.
System Prompt
-----------------
You are a helpful assistant who can provide feedback on essays.
User Prompt
--------------
The essay topic is - An interest or hobby that you enjoy.
The submitted essay is - One hobby I really enjoy is playing chess. I love how every game is different and makes me think carefully about my next move. It’s exciting to plan strategies and try to outsmart my opponent. Sometimes, I play against friends, and other times, I practice online. Chess helps me stay focused and teaches me patience. I also enjoy watching professional chess matches to learn new tactics. It feels great when I win a game after a tough battle. Playing chess is not just fun—it also makes me smarter!
Now write feedback on this essay.
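If you want to try this prompt once by hand (after setting up Ollama as described in the next section), here is a minimal sketch using the ollama Python client. Note that this client is not part of the PromptLab setup itself; it's assumed to be installed separately with pip install ollama.

import ollama

essay_topic = "An interest or hobby that you enjoy"
essay = "One hobby I really enjoy is playing chess. ..."  # full essay text from above

# Send the system and user prompts to the locally running llama3.2 model.
response = ollama.chat(
    model="llama3.2",
    messages=[
        {"role": "system", "content": "You are a helpful assistant who can provide feedback on essays."},
        {"role": "user", "content": f"The essay topic is - {essay_topic}.\nThe submitted essay is - {essay}\nNow write feedback on this essay."},
    ],
)
print(response["message"]["content"])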
As you might expect, running this prompt as-is may not give you the results you want. So, we shall create multiple versions of this prompt and evaluate them to find the best-performing one.
Setup Environment
1. Install Ollama on your machine.
2. Download the two models PromptLab needs (the pull commands are shown after this list).
a. Inference model - we are using llama3.2; feel free to try other models.
b. Embedding model - we are using nomic-embed-text; feel free to try other models.
3. Install the PromptLab package.
Though not mandatory, it's highly recommended to install the PromptLab package inside a Python virtual environment.
C:\Users>pip install promptlab
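The two models from step 2 can be downloaded with the Ollama CLI, for example:

C:\Users>ollama pull llama3.2
C:\Users>ollama pull nomic-embed-text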
Prepare Evaluation Dataset
The first step in prompt evaluation is creating an evaluation dataset - and it may also be the most challenging. It requires domain expertise (e.g., being a teacher), historical data (such as past exam questions), and, most importantly, patience.
PromptLab expects a JSONL file as the evaluation dataset, with a mandatory id column. Here is the schema for this particular dataset:
{"id": "0", "essay_topic":"--------", "essay": "--------", "feedback":"--------"}
You can check the file here - promptlab/test/dataset/essay_feedback.jsonl
The dataset is quite small: it contains only three essays and their corresponding feedback. In practice, a much larger dataset is needed for proper evaluation.
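To illustrate the expected format, a dataset like this can be produced with a few lines of plain Python (the field contents are abbreviated here; the real file is linked above):

import json

# Each record pairs a submitted essay with the feedback a teacher actually wrote,
# which later serves as the reference answer during evaluation.
records = [
    {
        "id": "0",
        "essay_topic": "An interest or hobby that you enjoy",
        "essay": "One hobby I really enjoy is playing chess. ...",
        "feedback": "Grammar & Spelling - The grammar and spelling are mostly correct. ...",
    },
    # two more records omitted for brevity
]

with open("essay_feedback.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")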
Evaluation Metrics
The next big challenge is coming up with proper metrics to evaluate the outcome of the prompt. PromptLab supports integration with Ragas, a popular evaluation library, so you can use any metric from Ragas; the full list of available metrics is in the Ragas documentation.
For this use case, we are using the following metrics for evaluation:
SemanticSimilarity
NonLLMStringSimilarity
RougeScore
These metrics are chosen for demonstration purposes; in practice, they may not be the most suitable for the use case.
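To get a feel for what one of these metrics computes, here is a rough sketch that scores a single response with Ragas directly. This is not needed when using PromptLab, which wires the metrics up for you; the import paths and class names below follow the recent Ragas API and may differ slightly between versions.

import asyncio
from ragas import SingleTurnSample
from ragas.metrics import NonLLMStringSimilarity

async def main():
    # Compare model-generated feedback against the teacher's reference feedback.
    sample = SingleTurnSample(
        response="The essay is clear, stays on topic, and has good grammar...",
        reference="Grammar & Spelling - The grammar and spelling are mostly correct...",
    )
    scorer = NonLLMStringSimilarity()
    print(await scorer.single_turn_ascore(sample))

asyncio.run(main())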
Using PromptLab
Now it's time to get started with coding. The complete code is available here - promptlab/samples/quickstart. The README has a detailed explanation of the code.
Two main areas of the code are:
1. Registering assets
We can register two types of assets in PromptLab - Prompt Template and Dataset. An asset can have multiple versions. In this sample, we have created two versions of the prompt template.
First version
system_prompt_v1 = 'You are a helpful assistant who can provide feedback on essays.'
user_prompt_v1 = '''The essay topic is - <essay_topic>.
The submitted essay is - <essay>
Now write feedback on this essay.
'''
Second version
system_prompt_v2 = '''You are a helpful assistant who can provide feedback on essays. You follow the criteria below while writing feedback.
Grammar & Spelling - The essay should have correct grammar, punctuation, and spelling.
Clarity & Fluency - Ideas should be expressed clearly, with smooth transitions between sentences and paragraphs.
Content & Relevance - The essay should stay on topic, answer the prompt effectively, and include well-developed ideas with supporting details or examples.
Structure & Organization - The essay should have a clear introduction, body paragraphs, and conclusion. Ideas should be logically arranged, with a strong thesis statement and supporting arguments.
'''
user_prompt_v2 = '''The essay topic is - <essay_topic>.
The submitted essay is - <essay>
Now write feedback on this essay.
'''
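The <essay_topic> and <essay> tags are placeholders that PromptLab fills in from the dataset columns with the same names. Conceptually, the substitution amounts to nothing more than this (an illustrative sketch, not PromptLab's internal code):

# Illustrative only: PromptLab performs this substitution for every record in the dataset.
record = {
    "essay_topic": "An interest or hobby that you enjoy",
    "essay": "One hobby I really enjoy is playing chess. ...",
}

user_prompt = user_prompt_v2
for column, value in record.items():
    user_prompt = user_prompt.replace(f"<{column}>", value)

print(user_prompt)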
2. Running experiments
Once the dataset and prompt templates are registered, we can define an experiment and run it. An experiment is defined as a JSON object.
Here is the experiment definition:
{
"model": {
"type": "ollama",
"inference_model_deployment": "llama3.2",
"embedding_model_deployment": "nomic-embed-text"
},
"prompt_template": {
"id": prompt_template_name,
"version": prompt_template_version
},
"dataset": {
"id": dataset_name,
"version": dataset_version
},
"evaluation": [
{
"type": "ragas",
"metric": "SemanticSimilarity",
"column_mapping": {
"response": "$inference",
"reference": "feedback"
}
},
{
"type": "ragas",
"metric": "NonLLMStringSimilarity",
"column_mapping": {
"response": "$inference",
"reference": "feedback"
}
},
{
"type": "ragas",
"metric": "RougeScore",
"column_mapping": {
"response": "$inference",
"reference": "feedback"
}
}
]
}
In the column_mapping sections above, $inference refers to the output the model generates for each dataset record, while feedback is the reference column from the dataset, so every metric compares the generated feedback against the teacher's feedback. First, we run the experiment with the first version of the prompt, and then we run it again with the second version. This gives us the opportunity to compare how the two prompts perform.
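Since the two runs differ only in the prompt template version, the experiment definition can be reused. A small sketch of how the two definitions might be built (the asset names and version numbers here are illustrative; the actual call that runs an experiment is in the quickstart sample):

import copy

# Illustrative asset ids and versions - use the values returned when you registered the assets.
base_experiment = {
    "model": {
        "type": "ollama",
        "inference_model_deployment": "llama3.2",
        "embedding_model_deployment": "nomic-embed-text",
    },
    "prompt_template": {"id": "essay_feedback_prompt", "version": 1},
    "dataset": {"id": "essay_feedback_dataset", "version": 1},
    "evaluation": [
        {
            "type": "ragas",
            "metric": "SemanticSimilarity",
            "column_mapping": {"response": "$inference", "reference": "feedback"},
        },
        # NonLLMStringSimilarity and RougeScore entries omitted for brevity
    ],
}

# Second experiment: identical except for the prompt template version.
experiment_v2 = copy.deepcopy(base_experiment)
experiment_v2["prompt_template"]["version"] = 2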
We can launch the Prompt Lab Studio locally to check the details of the assets and the experiments.
In this example, the studio is launched at http://localhost:8000/.
We can clearly see that the second version of the prompt scores better on almost all metrics. We can continue experimenting with other prompts and metrics to build a much better prompt for our use case.
Conclusion
Prompt experimentation doesn’t have to be overly complex or require additional cloud services. PromptLab aims to offer a simple, standalone tool for reliably productionizing prompts. In upcoming articles, we’ll explore other aspects of prompt engineering, such as collaboration, integrating with existing projects, and adapting to their DevOps workflows.