Evaluating prompts locally with Ollama and PromptLab

Developing a Gen AI app that works reliably is a big challenge today. Even the most advanced models can generate misleading, inaccurate, or harmful outputs if they are not properly evaluated. Over the last two years, we've seen the rise of some great evaluation libraries, yet end-to-end prompt experimentation and productionization remain unnecessarily complex. As a result, many of us still rely on manual prompt testing and 'vibe checks' instead of proper evaluation before deploying to production.

In this article, we'll see how to use PromptLab, together with Ollama, to run end-to-end prompt experimentation locally.

Code

The complete code is available here - promptlab/samples/quickstart

Consider starring the PromptLab repo to show your support and encourage the developers.

Tools

  1. Ollama - to serve LLMs locally
  2. PromptLab - a lightweight Python library to manage prompts and experiments

Use Case

Ever heard of overburdened teachers struggling to review countless essays? In this article, we’ll develop a prompt to help them generate effective feedback more efficiently.

To better understand the use case, let's examine a sample essay and the feedback provided by the teacher.

Topic
------
An interest or hobby that you enjoy

Submitted Essay
------------------
One hobby I really enjoy is playing chess. I love how every game is different and makes me think carefully about my next move. It’s exciting to plan strategies and try to outsmart my opponent. Sometimes, I play against friends, and other times, I practice online. Chess helps me stay focused and teaches me patience. I also enjoy watching professional chess matches to learn new tactics. It feels great when I win a game after a tough battle. Playing chess is not just fun—it also makes me smarter!

Feedback
----------
Grammar & Spelling – The grammar and spelling are mostly correct, with no major mistakes. Sentences are structured well, making the essay easy to read. 

Clarity & Fluency – The writing flows smoothly, and ideas are expressed clearly. However, some sentences could be slightly more detailed to enhance explanation. 

Content & Relevance – The essay stays on topic and explains why chess is enjoyable. It includes personal experiences and reasons, making it engaging. Adding more examples of specific strategies or memorable games could make it even stronger. 

Structure & Organization – The essay has a clear beginning, middle, and end. The introduction introduces the topic well, and the conclusion wraps it up nicely. A transition between some sentences could improve the overall flow.         

Now let's try to write a prompt to generate similar feedback.

System Prompt
-----------------
You are a helpful assistant who can provide feedback on essays.

User Prompt
--------------
The essay topic is - An interest or hobby that you enjoy.

The submitted essay is - One hobby I really enjoy is playing chess. I love how every game is different and makes me think carefully about my next move. It’s exciting to plan strategies and try to outsmart my opponent. Sometimes, I play against friends, and other times, I practice online. Chess helps me stay focused and teaches me patience. I also enjoy watching professional chess matches to learn new tactics. It feels great when I win a game after a tough battle. Playing chess is not just fun—it also makes me smarter!

Now write feedback on this essay.        

As you can probably guess, running this prompt as-is may not give you the results you expect. So we'll create multiple versions of the prompt and evaluate them to find the best-performing one.
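If you want to sanity-check the baseline yourself once Ollama is installed and running (setup steps follow below), here is a minimal sketch that sends the prompt to the locally served llama3.2 model using the official ollama Python client (pip install ollama). This is purely exploratory - PromptLab handles the inference step for us during experiments.

import ollama  # pip install ollama

system_prompt = 'You are a helpful assistant who can provide feedback on essays.'
user_prompt = '''The essay topic is - An interest or hobby that you enjoy.

The submitted essay is - One hobby I really enjoy is playing chess. (full essay text goes here)

Now write feedback on this essay.'''

# Send the prompt to the locally served llama3.2 model via the Ollama server
response = ollama.chat(
    model='llama3.2',
    messages=[
        {'role': 'system', 'content': system_prompt},
        {'role': 'user', 'content': user_prompt},
    ],
)

print(response['message']['content'])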

Setup Environment

1. Install Ollama on your machine.

2. Download the two models that PromptLab needs (pull commands shown below).

a. Inference model - we are using llama3.2, feel free to try other models.

b. Embedding model - we are using nomic-embed-text, feel free to try other models.
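With Ollama installed, both models can be pulled from the command line:

C:\Users>ollama pull llama3.2
C:\Users>ollama pull nomic-embed-text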

3. Install the PromptLab package.

Though not mandatory, it's highly recommended to install the PromptLab package inside a virtual Python environment.
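For example, on Windows you could create and activate one (named promptlab-env here, but any name works) before running the install command below:

C:\Users>python -m venv promptlab-env
C:\Users>promptlab-env\Scripts\activate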

C:\Users>pip install promptlab        

Prepare Evaluation Dataset

The first step in prompt evaluation is creating an evaluation dataset - and it may also be the most challenging. It requires domain expertise (e.g., being a teacher), historical data (such as past exam questions), and, most importantly, patience.

PromptLab expects a JSONL file as the evaluation dataset, and it needs a mandatory id column. Here is the schema for this particular dataset.

{"id": "0", "essay_topic":"--------", "essay": "--------", "feedback":"--------"}        

You can check the file here - promptlab/test/dataset/essay_feedback.jsonl

The dataset is quite small: it has only three essays and their corresponding feedback. In practice, a much larger dataset is needed for proper evaluation.
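As a rough illustration of the format, the sample essay from earlier would become one record like this (a sketch using Python's built-in json module; the long text fields are abbreviated here):

import json

record = {
    "id": "0",
    "essay_topic": "An interest or hobby that you enjoy",
    "essay": "One hobby I really enjoy is playing chess. (full essay text)",
    "feedback": "Grammar & Spelling - The grammar and spelling are mostly correct... (full feedback text)",
}

# Each line of the .jsonl file is one self-contained JSON object
with open("essay_feedback.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")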

Evaluation Metrics

The next big challenge is coming up with proper metrics to evaluate the outcome of the prompt. PromptLab integrates with Ragas, a popular evaluation library, so you can use any metric from Ragas (see List of available metrics - Ragas).

For this use case, we are using the following metrics for evaluation:

  1. Semantic Similarity
  2. Non LLM String Similarity
  3. ROUGE Score

These metrics are chosen for demonstration purposes; in practice, they may not be the most suitable for the use case.
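If you want to get a feel for one of these metrics before wiring it into an experiment, you can score a single response/reference pair directly with Ragas. Here is a minimal sketch, assuming ragas 0.2+ with its SingleTurnSample API and the rouge_score package installed; PromptLab computes the metrics for you during an experiment, so this step is optional.

import asyncio

from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import RougeScore

# Compare a model-generated feedback against the teacher's reference feedback
sample = SingleTurnSample(
    response="The essay is clear, stays on topic, and has mostly correct grammar.",
    reference="Grammar & Spelling - The grammar and spelling are mostly correct... (full feedback text)",
)

score = asyncio.run(RougeScore().single_turn_ascore(sample))
print(score)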

Using PromptLab

Now it's time to get started with the code. The complete code is available here - promptlab/samples/quickstart. The README has a detailed explanation of the code.

The two main parts of the code are:

1. Registering assets

We can register two types of assets in PromptLab - Prompt Template and Dataset. An asset can have multiple versions. In this sample, we have created two versions of the prompt template.

First version

system_prompt_v1 = 'You are a helpful assistant who can provide feedback on essays.'
user_prompt_v1 = '''The essay topic is - <essay_topic>.

The submitted essay is - <essay>
Now write feedback on this essay.
'''        

Second version

system_prompt_v2 = '''You are a helpful assistant who can provide feedback on essays. You follow the criteria below while writing feedback.                    
Grammar & Spelling - The essay should have correct grammar, punctuation, and spelling.
Clarity & Fluency - Ideas should be expressed clearly, with smooth transitions between sentences and paragraphs.
Content & Relevance - The essay should stay on topic, answer the prompt effectively, and include well-developed ideas with supporting details or examples.
Structure & Organization - The essay should have a clear introduction, body paragraphs, and conclusion. Ideas should be logically arranged, with a strong thesis statement and supporting arguments.
'''
user_prompt_v2 = '''The essay topic is - <essay_topic>.

The submitted essay is - <essay>
Now write feedback on this essay.
'''        
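The <essay_topic> and <essay> placeholders in the user prompt correspond to columns of the evaluation dataset and are filled in for every record when an experiment runs. Purely to illustrate that substitution (PromptLab does it internally), a record could be rendered like this:

user_prompt_v2 = '''The essay topic is - <essay_topic>.

The submitted essay is - <essay>
Now write feedback on this essay.
'''

# A single record from the evaluation dataset (text abbreviated)
record = {
    "essay_topic": "An interest or hobby that you enjoy",
    "essay": "One hobby I really enjoy is playing chess. (full essay text)",
}

# Replace each <column> placeholder with the value from the record
rendered_prompt = user_prompt_v2
for column, value in record.items():
    rendered_prompt = rendered_prompt.replace(f"<{column}>", value)

print(rendered_prompt)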

2. Running experiments

Once the dataset and prompt templates are registered, we can define and run an experiment. An experiment is defined as a JSON object.

Here is the experiment definition. In each column_mapping, $inference refers to the output generated by the model for a record, and feedback is the reference column from the dataset:

{
    "model": {
        "type": "ollama",
        "inference_model_deployment": "llama3.2",
        "embedding_model_deployment": "nomic-embed-text"
    },
    "prompt_template": {
        "id": prompt_template_name,
        "version": prompt_template_version
    },
    "dataset": {
        "id": dataset_name,
        "version": dataset_version
    },
    "evaluation": [
        {
            "type": "ragas",
            "metric": "SemanticSimilarity",
            "column_mapping": {
                "response": "$inference",
                "reference": "feedback"
            }
        },
        {
            "type": "ragas",
            "metric": "NonLLMStringSimilarity",
            "column_mapping": {
                "response": "$inference",
                "reference": "feedback"
            }
        },
        {
            "type": "ragas",
            "metric": "RougeScore",
            "column_mapping": {
                "response": "$inference",
                "reference": "feedback"
            }
        }
    ]
}        

First, we run the experiment with the first version of the prompt, and then run it again with the second version. This gives us the opportunity to compare how the two prompts perform.

We can launch PromptLab Studio locally to check the details of the assets and the experiments.

In this example, the studio is launched at http://localhost:8000/.

(Screenshot: PromptLab Studio showing the evaluation scores for both experiments)

We can clearly see that the second version of the prompt scores better on almost all metrics. We can continue experimenting with other prompts and metrics to build an even better prompt for our use case.

Conclusion

Prompt experimentation doesn’t have to be overly complex or require additional cloud services. PromptLab aims to offer a simple, standalone tool for reliably productionizing prompts. In upcoming articles, we’ll explore other aspects of prompt engineering, such as collaboration, integrating with existing projects, and adapting to their DevOps workflows.
