AI Quick Actions: Evaluating Mistral 7B Instruct
All views are my own and not representative of Oracle.
Welcome to the fourth and final blog in the series exploring how we can start utilising the AI Quick Actions capabilities within OCI Data Science to evaluate the Mistral 7B Instruct v0.1 model without having to write a line of code.
If you have not already checked it out, give my previous blog a read here: AI Quick Actions: Fine-Tuning Mistral 7B Instruct.
With a deployed model, you can create a model evaluation to measure its performance. You can choose a dataset from Object Storage or upload one from the storage of the notebook session you're working in. BERTScore and ROUGE are just some of the evaluation metrics available for measuring model performance, and the evaluation result can be saved to Object Storage. You can also set the model evaluation parameters and, under advanced options, choose the compute instance shape for the evaluation and optionally enter a stop sequence.
Oracle Cloud Infrastructure (OCI) Data Science is a fully managed platform for teams of data scientists to build, train, deploy, and manage machine learning (ML) models using Python and open source tools. Use a JupyterLab-based environment to experiment and develop models. Scale up model training with NVIDIA GPUs and distributed training. Take models into production and keep them healthy with ML operations (MLOps) capabilities, such as automated pipelines, model deployments, and model monitoring. [1]
AI Quick Actions are a suite of actions that together can be used to deploy, evaluate and fine tune foundation models in OCI Data Science. AI Quick Actions target a user who wants to quickly leverage the capabilities of AI. They aim to expand the reach of foundation models to a broader set of users by providing a streamlined, code-free and efficient environment for working with foundation models. AI Quick Actions can be accessed from the Data Science Notebook. [2]
To get the full end-to-end guide, sample code and prerequisites for trying this out yourself, check out my GitHub assets available here.
Let's get started!
To start with we will need to create two OCI Object Storage Buckets to store the model results and dataset respectively.
Login to your OCI Console and from the Menu navigate to Storage and then Buckets. Here you can create a new Object Storage Bucket.
Give the Bucket a Name and Enable Object Versioning. Click Create Bucket.
Now we can repeat the same steps to create another Bucket to store the data we will use for evaluation. We can then select this data Bucket.
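If you prefer to script this step instead of clicking through the console, the same two Buckets can be created with the OCI Python SDK. This is a minimal sketch: the compartment OCID and bucket names are placeholders, and it assumes you have a valid OCI config file (or resource principal) with permissions to manage Object Storage.

```python
import oci

# Load the default OCI config (~/.oci/config); inside a Notebook Session you
# could use oci.auth.signers.get_resource_principals_signer() instead.
config = oci.config.from_file()
object_storage = oci.object_storage.ObjectStorageClient(config)

namespace = object_storage.get_namespace().data
compartment_id = "ocid1.compartment.oc1..example"  # placeholder OCID

# Create one bucket for the evaluation results and one for the evaluation
# dataset, both with Object Versioning enabled, mirroring the console steps.
for bucket_name in ["model-evaluation-results", "evaluation-dataset"]:
    details = oci.object_storage.models.CreateBucketDetails(
        name=bucket_name,
        compartment_id=compartment_id,
        versioning="Enabled",
    )
    object_storage.create_bucket(
        namespace_name=namespace, create_bucket_details=details
    )
```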
We can click on our data Bucket to upload our dataset. The dataset must be in JSONL format and must include the required 'prompt' and 'completion' columns; optionally, you can include a 'category' column. If a dataset file with the same name already exists in the bucket, it is replaced by the new file. Note that for fine-tuning, a dataset must contain a minimum of 100 records.
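To make the expected format concrete, here is a minimal sketch of what a few records in the evaluation dataset might look like, written out and sanity-checked with plain Python. The file name, prompts, completions and categories are made-up examples.

```python
import json

# Hypothetical example records: 'prompt' and 'completion' are required,
# 'category' is optional and is used to break down the metrics in the report.
records = [
    {"prompt": "What is 12 x 8?", "completion": "96", "category": "Math"},
    {"prompt": "Name the capital of France.", "completion": "Paris", "category": "Geography"},
]

# Write one JSON object per line (JSONL), the format AI Quick Actions expects.
with open("evaluation_dataset.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Quick sanity check that every line parses and has the required keys.
with open("evaluation_dataset.jsonl") as f:
    for line in f:
        row = json.loads(line)
        assert {"prompt", "completion"} <= row.keys()
```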
Then we can navigate to the Launcher within our OCI Data Science Notebook Session and open the AI Quick Actions Extension.
We can then select the Evaluations Tab from the AI Quick Actions Menu.
Click on Create Evaluation.
Give the evaluation a name and description, and select an existing deployed model. If you want to know how to deploy a model via AI Quick Actions, check out the first blog in this series here: AI Quick Actions: Deploying Mistral 7B Instruct.
We can then choose from multiple evaluation metrics such as BERTScore, BLEU Score, Perplexity Score, Text Readability, and ROUGE. For this blog I have kept things simple and selected BERTScore.
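AI Quick Actions computes BERTScore for you, but if you want an intuition for what the metric measures (similarity between the model's response and the reference completion using contextual BERT embeddings), here is a rough sketch using the open source bert-score package. The candidate and reference strings are made-up examples, not output from the evaluation job.

```python
# pip install bert-score
from bert_score import score

# Hypothetical model responses vs. reference completions from the dataset.
candidates = ["Paris is the capital of France."]
references = ["The capital of France is Paris."]

# Returns precision, recall and F1 tensors, one value per candidate/reference pair.
P, R, F1 = score(candidates, references, lang="en")
print(f"BERT F1: {F1.mean().item():.4f}")
```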
Then select the location of our evaluation dataset within the Object Storage Bucket we created earlier.
We can then create a new experiment.
Select the location where we want to store the results of the model evaluation. Here we select the Object Storage Bucket we created earlier for evaluation results. Click Next.
Here you can define your parameters for the LLM; I have left these as the default. Then select the instance shape to run the evaluation; I have also left this as the default, VM.Standard.E3.Flex. Click Next and then Submit.
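For context on what those LLM parameters control: the evaluation job sends each prompt in the dataset to your deployed model using the generation settings you chose. Below is a rough, hypothetical sketch of such a request against a model deployment endpoint; the endpoint URL is a placeholder, and the parameter names and values are illustrative rather than the exact schema AI Quick Actions uses.

```python
import oci
import requests

# Sign requests with the default OCI config; inside a Notebook Session a
# resource principal signer could be used instead.
config = oci.config.from_file()
signer = oci.signer.Signer(
    tenancy=config["tenancy"],
    user=config["user"],
    fingerprint=config["fingerprint"],
    private_key_file_location=config["key_file"],
)

# Placeholder endpoint for the model deployment created in the first blog.
endpoint = "https://modeldeployment.<region>.oci.customer-oci.com/<deployment-ocid>/predict"

# Illustrative generation parameters, mirroring the defaults left in the UI.
payload = {
    "prompt": "What is 12 x 8?",
    "max_tokens": 256,
    "temperature": 0.7,
    "top_p": 0.9,
    "stop": [],
}

response = requests.post(endpoint, json=payload, auth=signer)
print(response.json())
```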
This will kick off the evaluation Job and the Lifecycle State will be In-Progress. Once the evaluation Job is complete, the Lifecycle State will be updated to Succeeded.
If we scroll down we will see the evaluation metrics displayed. In our case, the BERTScore.
One of the greatest features of the Model Evaluation within AI Quick Actions is that it automatically creates an HTML Evaluation Report, which gets stored within our previously selected Object Storage Bucket location. We can navigate to our Model Evaluation Results location and download the Evaluation Report.
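You can download the report from the console as shown, or pull it down programmatically. Here is a minimal sketch with the OCI Python SDK; the bucket and object names are placeholders for whatever location you selected for the evaluation results.

```python
import oci

config = oci.config.from_file()
object_storage = oci.object_storage.ObjectStorageClient(config)
namespace = object_storage.get_namespace().data

# Placeholder names: substitute your results bucket and the report's object path.
bucket_name = "model-evaluation-results"
object_name = "evaluation-report/report.html"

# Fetch the object and stream it to a local file we can open in a browser.
response = object_storage.get_object(namespace, bucket_name, object_name)
with open("evaluation_report.html", "wb") as f:
    for chunk in response.data.raw.stream(1024 * 1024, decode_content=False):
        f.write(chunk)
```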
We can open this HTML Evaluation Report in our browser and take a look. Here we can see a description of the Evaluation Metrics and an overview of the Evaluation Metrics calculated.
We also get a Box Plot of the BERT F1 Score broken down by the different Categories defined in our Evaluation Dataset. We can see it performed better on Math related questions compared to the NULL Category.
We can also get a list of all the parameters the Model was invoked with.
Finally, we get a list of each individual sample in our Evaluation Dataset, showing the Prompt, the reference Completion, and the Response generated by the Model.
As you can see, you're now able to evaluate an LLM from OCI Data Science using the AI Quick Actions capabilities without having to write a single line of code.
To get the full end-to-end guide, sample data and prerequisites for trying this out yourself, check out my GitHub assets available here.
Thank you for staying tuned through my four-part series on how you can start to use the AI Quick Actions capabilities within OCI Data Science to deploy, fine-tune and evaluate foundation models through a no-code interface to speed up experimentation and time to value.
Ismail Syed
Oracle Specialist Leader EMEA - Data Science, Vector & ML
References: