Navigating fine-tuning options on Azure AI

With the rising adoption of AI across industries, there is an increasing need for verticalization: adapting general-purpose models to specialized tasks, industry-specific language, and unique use cases. Fine-tuning addresses this requirement by tailoring pretrained language models, both large (LLMs) and small (SLMs), to specific scenarios. The process improves accuracy, reduces latency, and can lower costs by aligning models with the data and the use case.

Why fine-tune? There are many models available today. Rather than simply choosing the model that scores highest on popular benchmarks, it often makes more sense to select a smaller model and customize it to your needs. Even with careful prompt design and retrieval-augmented generation (RAG), a generic model may not always deliver adequate results; that is usually a clear indication that fine-tuning could help.

Fine-tuning is particularly beneficial and often leads to substantial performance improvements in scenarios such as:

- Domain-specific applications: Legal, medical, scientific texts, especially when proprietary datasets are available.

- Narrow task requirements: For example, complex classification tasks or sentiment analysis. If your prompts become very long (e.g., more than 1,000 tokens), fine-tuning can shorten them and improve performance.

- Low-resource languages and dialects: Fine-tuning boosts performance for languages or dialects inadequately represented in general model training.

Azure offers two main ways to fine-tune models:

"Managed" Fine-Tuning on Azure

Azure's managed fine-tuning services simplify the customization process by handling the underlying infrastructure and optimization techniques. This option is ideal for users looking for an efficient, low-maintenance solution. Compute is fully serverless and billed per hour of consumption. The key advantage of the managed option is that users can focus on the use case and the data, while Azure handles most of the complexity of setting up the LLMOps processes.

- Azure OpenAI Service: Enables fine-tuning of Azure OpenAI models, allowing customization to specific datasets for improved performance (a minimal job-submission sketch follows this list).

- Model-as-a-Service (MaaS): Through Azure AI Foundry, users can access and fine-tune various models from different providers, including Meta Llama and Microsoft's Phi series. This approach offers flexibility in selecting models that best fit specific use cases. 
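
For illustration, here is a minimal sketch of submitting a fine-tuning job to Azure OpenAI with the openai Python SDK; the endpoint, API version, file name, and base model below are placeholders, so check the current documentation for supported values.

from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    api_key="<your-api-key>",
    api_version="2024-10-21",  # example API version
)

# Upload a JSONL file of chat-formatted training examples
training_file = client.files.create(
    file=open("train.jsonl", "rb"),
    purpose="fine-tune",
)

# Submit the fine-tuning job against a supported base model
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",  # example base model
)
print(job.id, job.status)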

For MaaS models, two further options exist: Serverless API and Managed Compute. Unless a dialog explicitly prompts you to choose (as shown below) when you select fine-tuning for a supported model in AI Foundry, Managed Compute is the default option.

[Figure: Fine-tuning options for supported models in AI Foundry]

Serverless API means the fine-tuned model is deployed on fully managed infrastructure, priced on a pay-per-token basis, along with a fixed hosting fee. This option eliminates infrastructure management concerns for both fine-tuning and model deployment.
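
Once a fine-tuned model is deployed behind a serverless API, calling it looks like any other chat-completions endpoint. Here is a minimal sketch using the azure-ai-inference SDK; the endpoint URL and key are placeholders.

from azure.ai.inference import ChatCompletionsClient
from azure.ai.inference.models import SystemMessage, UserMessage
from azure.core.credentials import AzureKeyCredential

client = ChatCompletionsClient(
    endpoint="https://<your-deployment>.<region>.models.ai.azure.com",
    credential=AzureKeyCredential("<your-api-key>"),
)

response = client.complete(
    messages=[
        SystemMessage(content="You are a domain-specific assistant."),
        UserMessage(content="Summarize the key clauses of this contract."),
    ],
)
print(response.choices[0].message.content)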

[Figure: Example pricing for the managed fine-tuning option for the Phi model family]

Managed compute means the portal supports fine-tuning for the model, but the actual infrastructure for training and hosting the fine-tuned model must be provisioned and managed by the user. Most models in the catalog are supported via this option.

Built-In Fine-Tuning Techniques on Azure for the "managed" fine-tuning option

Azure incorporates several advanced techniques to optimize the fine-tuning process:

- Low-Rank Adaptation (LoRA): Ideal when computational resources are limited and a model needs to be fine-tuned efficiently. LoRA reduces the number of trainable parameters by decomposing weight updates into lower-rank matrices, making training faster and less resource-intensive without significantly impacting performance. This approach is particularly beneficial for adapting large language models to specific tasks or domains where a moderate amount of labeled data is available (see the minimal sketch after this list).

- Direct Preference Optimization (DPO): Suited for scenarios where aligning the model's outputs with human preferences is crucial, especially for subjective elements like tone, style, or specific content preferences. DPO adjusts model weights based on pairwise human preference data, streamlining alignment without the need for a separate reward model. This technique is useful when human preference data is available and the goal is to improve user satisfaction with the model's responses.

- Reinforcement Fine-Tuning: Applicable in dynamic environments where the model must adapt to evolving data patterns. This technique employs reinforcement learning strategies to refine model behavior based on feedback from its performance, making it suitable for tasks where predefined labels are scarce, and the model needs to learn from interactions over time.

- Distillation: Useful when the aim is to deploy models in resource-constrained environments. Model distillation uses the outputs of large, complex models to fine-tune smaller, more efficient ones, allowing the smaller models to perform comparably on specific tasks while significantly reducing cost and latency. This approach makes sense when high performance must be maintained despite limited computational resources or a need for faster inference.

The list of techniques is constantly growing. Not every technique is supported for every model, so for up-to-date information it is best to consult the most recent announcements.
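
As an illustration of the parameter-efficient idea behind LoRA, here is a minimal sketch using the Hugging Face peft library; the base model name and target modules are assumptions for illustration, and the managed Azure options apply LoRA for you under the hood.

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load a base model (microsoft/phi-2 is just an example choice)
base_model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")

# Configure the low-rank adapters
lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank decomposition
    lora_alpha=16,                        # scaling factor for the LoRA updates
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt (model-dependent)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

# Wrap the base model; only the small adapter matrices are trained
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the base weights

Because only the adapter weights are updated, the same base model can serve multiple tasks by swapping in different adapters.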

Self-Managed Fine-Tuning on Azure

For users requiring greater control over the fine-tuning process, such as using specialized fine-tuning techniques or open-source packages, Azure provides the infrastructure to manage compute resources and customization logic independently. This approach is suitable for complex scenarios needing tailored solutions.

Azure Machine Learning (Azure ML): Offers a robust platform for orchestrating fine-tuning workflows, supporting various models and customization techniques. Users are responsible for configuring compute resources, managing data pipelines, and implementing training logic. This option provides the "classic" data science development experience.
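
As a rough sketch of what this looks like in practice, the Azure ML v2 SDK (azure-ai-ml) lets you submit your own training script as a command job; the workspace details, script, environment, and compute names below are placeholders.

from azure.ai.ml import MLClient, command
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

job = command(
    code="./src",                             # folder containing your train.py
    command="python train.py --epochs 3",     # your own fine-tuning logic
    environment="<your-environment>@latest",  # registered or curated environment
    compute="<your-gpu-cluster>",             # GPU compute target
)

returned_job = ml_client.jobs.create_or_update(job)
print(returned_job.studio_url)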

Fine-tuning involving multi-GPU training

For large-scale models demanding substantial computational power, Azure supports advanced optimization strategies for improving the efficiency of multi-GPU training:

ONNX Runtime

Optimizes model execution across different hardware platforms, enhancing performance and scalability during training and inference. It facilitates model portability and efficient execution by converting models into an optimized intermediate representation compatible with various hardware accelerators.

- Accelerated Training: By optimizing computation graphs and using efficient kernels, ORT can accelerate training by more than 1.5x compared to standard PyTorch implementations.

- Memory Efficiency: ORT's memory optimization strategies enable the training of larger models or the use of increased batch sizes within the same hardware constraints, thereby enhancing throughput.

- Flexible Hardware Support: ORT supports both NVIDIA and AMD GPUs, allowing simple adaptation across different hardware setups without extensive code modifications.

Integrating ORT into your PyTorch training script is straightforward:

from onnxruntime.training import ORTModule

# Define your PyTorch model (YourModel is a placeholder for your own module)
model = YourModel()

# Wrap the model with ORTModule; the training loop stays unchanged, but
# forward and backward passes now run through the ONNX Runtime backend
model = ORTModule(model)

This minimal change can lead to substantial performance improvements. ONNX Runtime can also simply be enabled in the Azure AI portal for the managed fine-tuning options.



DeepSpeed Optimization

Integrates with Azure ML to enable efficient training of massive models by managing memory consumption and computational load effectively. It employs parallelism strategies and memory-optimization techniques that can dramatically decrease memory requirements, allowing fine-tuning on smaller VMs. Key features include:

- Memory Optimization: DeepSpeed's Zero Redundancy Optimizer (ZeRO) partitions optimizer states, gradients, and model parameters across multiple GPUs, significantly reducing memory redundancy. DeepSpeed supports three ZeRO stages (a minimal configuration sketch follows this list).

ZeRO is implemented in stages, each one progressively increasing how much of the training state (parameters, gradients, optimizer states) is partitioned or “sharded” across devices: stage 1 shards the optimizer states, stage 2 additionally shards the gradients, and stage 3 also shards the model parameters themselves. The ultimate goal is to remove as much redundancy as possible without making the training communication overhead too large or too complicated to manage.

- Scalability: DeepSpeed supports training across thousands of GPUs, enabling the scaling of models without linear increases in training time or resource consumption.
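
For the self-managed route, ZeRO is enabled through a small configuration passed to deepspeed.initialize. The sketch below shows stage 2 with a toy model; in practice the script is launched with the deepspeed CLI across your GPUs, and the batch size and precision settings are illustrative.

import torch
import deepspeed

# Placeholder model; in practice this is the transformer being fine-tuned
model = torch.nn.Linear(1024, 1024)

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,            # shard optimizer states and gradients across GPUs
        "overlap_comm": True,  # overlap communication with computation
    },
}

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)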

Both ONNX Runtime and DeepSpeed can be enabled for managed fine-tuning options during the fine-tuning job submission in the portal:


Fine-tuning notes

  • Any fine-tuning process starts with data collection and curation. Since different fine-tuning methods have different data requirements, it is useful to at least have a general idea of the type of method to be used (supervised, unsupervised, or reinforcement learning) to prepare a suitable dataset. The data needs to be cleaned thoroughly, with quality double-checked, even when using distilled outputs from another model. If there is not enough data but there is a small set of high-quality examples, techniques like synthetic data generation can help expand your dataset.
  • Model selection. This part is always a little tricky. A helpful starting point, if there are no specific model constraints, is to choose from the Azure model catalog, especially models supporting managed fine-tuning. Additionally, Azure AI Foundry provides an in-built model comparison tool that lets you quickly evaluate and compare models across multiple tasks.

[Figure: Supported models for chat completion fine-tuning on AI Foundry]

[Figure: Benchmark comparison of models in Azure AI Foundry]

  • Training. When using the managed fine-tuning option in Azure AI, training setup and job submission are straightforward. If you haven't conducted prior tests, it’s usually a good practice to start with default hyperparameters and perform an initial run on a smaller subset of your data to quickly assess performance.
  • Validation and evaluation. Like other data science tasks, fine-tuning requires careful examination of loss curves and hyperparameter tuning to prevent overfitting. A potential issue to watch out for is catastrophic forgetting, which happens when a fine-tuned model loses some previously learned general knowledge. Parameter-efficient techniques, such as LoRA, can help mitigate this issue, but for other methods, especially full supervised fine-tuning, it's something to keep in mind. A best practice for evaluation is maintaining a "golden dataset" of high-quality question-answer pairs and testing your model's performance thoroughly both before and after fine-tuning (a minimal sketch follows these notes).
  • Deployment. Deploying your fine-tuned model efficiently is crucial for achieving the desired benefits in production. Select a deployment strategy that matches the use case, for example, serverless API deployments if minimal infrastructure management is preferred, or managed compute if greater control is needed. Optimize the model for inference using quantization or similar techniques to reduce latency and costs. When choosing managed compute, carefully select GPU machines with sufficient memory and processing power to accommodate your model's size and inference demands. Undersized GPUs can lead to performance bottlenecks, latency issues, or failures during inference.
  • Monitoring. Once your fine-tuned model is deployed, it is recommended to configure continuous monitoring. Establish clear performance metrics and track model predictions regularly to detect any drift or performance degradation promptly. Scheduling periodic retraining and updating your model with new data ensures sustained accuracy and reliability over time. Automated alerts and dashboards can also help manage and streamline monitoring efforts.
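
To make the golden-dataset idea from the validation note concrete, here is a minimal sketch of a before/after evaluation loop; score_answer is a deliberately trivial placeholder metric that you would replace with exact match, an LLM-as-judge, or a task-specific score.

import json

def score_answer(answer: str, reference: str) -> float:
    # Trivial placeholder metric: normalized exact match
    return float(answer.strip().lower() == reference.strip().lower())

def evaluate(generate, golden_path="golden.jsonl"):
    # generate is a callable mapping a question string to an answer string
    scores = []
    with open(golden_path) as f:
        for line in f:
            example = json.loads(line)  # expects {"question": ..., "reference": ...}
            answer = generate(example["question"])
            scores.append(score_answer(answer, example["reference"]))
    return sum(scores) / len(scores)

# Compare the same golden set before and after fine-tuning, e.g.:
# baseline = evaluate(base_model_generate)
# tuned = evaluate(finetuned_model_generate)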

There are many ways to fine-tune, each suitable for different scenarios. Selecting the right approach and specific guidance will ultimately depend on the unique data, requirements, and use case.


Always check the latest guidance to pick the right solution for your situation, and in case of doubt, let us know! :)
