Navigating fine-tuning options on Azure AI
With the rising adoption of AI across various industries, there is an increasing need for verticalization - adapting general-purpose models to specialized tasks, industry-specific language, and unique use cases. Fine-tuning can be used to address this requirement and tailor pretrained language models, both large (LLMs) and small (SLMs), to specific scenarios. This process improves accuracy, reduces latency, and can lower costs by aligning models to the data and use case.
Why fine-tune? There are many models available today. Rather than simply choosing the model that scores highest on popular benchmarks, it often makes more sense to select a smaller model and customize it to your needs. Even with careful prompt design and retrieval-augmented generation (RAG), a generic model may not always deliver adequate results. This is usually a clear indication that fine-tuning could help.
Fine-tuning is particularly beneficial and often leads to substantial performance improvements in scenarios such as:
- Domain-specific applications: Legal, medical, scientific texts, especially when proprietary datasets are available.
- Narrow task requirements: For example, complex classification tasks or sentiment analysis. If your prompts become very long (e.g., more than 1,000 tokens), fine-tuning can simplify them and enhance performance.
- Low-resource languages and dialects: Fine-tuning boosts performance for languages or dialects inadequately represented in general model training.
Azure offers two main ways to fine-tune models:
"Managed" Fine-Tuning on Azure
Azure's managed fine-tuning services simplify the customization process by handling the underlying infrastructure and optimization techniques. This option is ideal for users looking for an efficient, low-maintenance solution. Compute is fully serverless and charged per hour based on consumption. The key advantage of the managed option is that users can focus on the use case and the data, while Azure handles most of the complexity of setting up the LLMOps processes.
- Azure OpenAI Service: Enables fine-tuning of Azure OpenAI models, allowing customization to specific datasets for improved performance (a minimal SDK sketch follows this list).
- Model-as-a-Service (MaaS): Through Azure AI Foundry, users can access and fine-tune various models from different providers, including Meta Llama and Microsoft's Phi series. This approach offers flexibility in selecting models that best fit specific use cases.
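As an illustration of how lightweight the managed flow is, an Azure OpenAI fine-tuning job can be started with a few SDK calls. Below is a minimal sketch, assuming an Azure OpenAI resource and a chat-formatted train.jsonl file already exist; the endpoint, key, API version, and base model name are placeholders to adapt:

from openai import AzureOpenAI

# Placeholders: use your own resource endpoint, key, and a supported API version
client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    api_key="<your-api-key>",
    api_version="2024-08-01-preview",
)

# Upload a JSONL file of chat-formatted training examples
training_file = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")

# Start the managed fine-tuning job on a base model that supports fine-tuning
job = client.fine_tuning.jobs.create(training_file=training_file.id, model="gpt-4o-mini")
print(job.id, job.status)

Azure handles the training run itself; you only monitor the job and deploy the resulting model.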
For MaaS models, two further options exist: Serverless API and Managed Compute. Unless you are explicitly prompted with a dialog offering the choice when selecting fine-tuning for a supported model in AI Foundry, Managed Compute is the default option.
Serverless API means the fine-tuned model is deployed on fully managed infrastructure, priced on a pay-per-token basis, along with a fixed hosting fee. This option eliminates infrastructure management concerns for both fine-tuning and model deployment.
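Once deployed this way, the fine-tuned model is consumed like any other pay-per-token endpoint. Here is a minimal sketch using the azure-ai-inference package; the endpoint URL and key are placeholders taken from the deployment details in AI Foundry:

from azure.ai.inference import ChatCompletionsClient
from azure.ai.inference.models import UserMessage
from azure.core.credentials import AzureKeyCredential

# Placeholders: endpoint URL and key come from the serverless deployment in AI Foundry
client = ChatCompletionsClient(
    endpoint="https://<your-deployment>.<region>.models.ai.azure.com",
    credential=AzureKeyCredential("<your-key>"),
)

response = client.complete(
    messages=[UserMessage(content="Classify the sentiment of this review: ...")]
)
print(response.choices[0].message.content)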
Managed compute means that the portal supports the fine-tuning option for the model; however, the actual infrastructure for training and hosting the fine-tuned model must be managed by the user. Most of the models are supported via this option.
Built-In Fine-Tuning Techniques on Azure for the "managed" fine-tuning option
Azure incorporates several advanced techniques to optimize the fine-tuning process:
- Low-Rank Adaptation (LoRA): Ideal when computational resources are limited and a model needs to be fine-tuned efficiently. LoRA reduces the number of trainable parameters by decomposing weight matrices into lower-rank structures, making the training process faster and less resource-intensive without significantly impacting performance. This approach is particularly beneficial for adapting large language models to specific tasks or domains where there is a moderate amount of labeled data (a minimal open-source sketch follows this list).
- Direct Preference Optimization (DPO): Suited for scenarios where aligning the model's outputs with human preferences is crucial, especially when dealing with subjective elements like tone, style, or specific content preferences. DPO adjusts model weights based on binary human feedback, streamlining alignment without the need for complex reward models. This technique is beneficial when you have access to human preference data and aim to enhance user satisfaction with the model's responses.
- Reinforcement Fine-Tuning: Applicable in dynamic environments where the model must adapt to evolving data patterns. This technique employs reinforcement learning strategies to refine model behavior based on feedback from its performance, making it suitable for tasks where predefined labels are scarce, and the model needs to learn from interactions over time.
- Distillation: Useful when the aim is to deploy models in resource-constrained environments. Model distillation involves using the outputs of large, complex models to fine-tune smaller, more efficient ones, allowing the smaller models to perform comparably on specific tasks while significantly reducing cost and latency. This approach makes sense when high performance must be maintained despite limited computational resources or a need for faster inference.
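To make the first of these techniques concrete, here is a minimal LoRA sketch built on the open-source Hugging Face PEFT library. This illustrates the idea rather than Azure's internal implementation (the managed service applies LoRA for you); the base model and target modules below are assumptions:

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Assumed base model; any causal LM from the Hugging Face Hub works similarly
base_model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

lora_config = LoraConfig(
    r=8,                                    # rank of the low-rank decomposition
    lora_alpha=16,                          # scaling factor for the adapter updates
    target_modules=["qkv_proj", "o_proj"],  # assumed attention projections for this model
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

# Only the small adapter matrices are trained; the base weights stay frozen
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all parameters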
The list of techniques is constantly growing. Not every technique is supported for every model, so for up-to-date information it is best to consult the recent announcements.
Self-Managed Fine-Tuning on Azure
For users requiring greater control over the fine-tuning process, such as using specialized fine-tuning techniques or open-source packages, Azure provides the infrastructure to manage compute resources and customization logic independently. This approach is suitable for complex scenarios needing tailored solutions.
Azure Machine Learning (Azure ML): Offers a robust platform for orchestrating fine-tuning workflows, supporting various models and customization techniques. Users are responsible for configuring compute resources, managing data pipelines, and implementing training logic. This option provides the "classic" data science development experience.
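As an example of this experience, a custom fine-tuning script can be submitted as an Azure ML command job with the azure-ai-ml SDK. Below is a minimal sketch, assuming an existing workspace, a GPU cluster named gpu-cluster, and a train.py script in ./src; the environment name is a placeholder for a curated or custom environment available in your workspace:

from azure.ai.ml import MLClient, command
from azure.identity import DefaultAzureCredential

# Placeholders: point the client at your subscription, resource group, and workspace
ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace-name>",
)

# Wrap the custom training script as a command job running on a GPU cluster
job = command(
    code="./src",
    command="python train.py --epochs 3",
    environment="<curated-or-custom-environment>@latest",
    compute="gpu-cluster",
    display_name="self-managed-finetuning",
)

returned_job = ml_client.jobs.create_or_update(job)
print(returned_job.studio_url)  # follow the run in Azure ML studio

Here you own the training logic inside train.py, which is exactly what makes this option suitable for specialized techniques and open-source packages.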
Fine-tuning involving multi-GPU training
For large-scale models demanding substantial computational power, Azure supports advanced optimization strategies for improving the efficiency of multi-GPU training:
ONNX Runtime
ONNX Runtime (ORT) optimizes model execution across different hardware platforms, enhancing performance and scalability during training and inference. It facilitates model portability and efficient execution by converting models into an optimized intermediate representation compatible with various hardware accelerators.
- Accelerated Training: By optimizing computation graphs and utilizing efficient kernels, ORT can reduce training times by over 1.5 times compared to standard PyTorch implementations.
- Memory Efficiency: ORT's memory optimization strategies enable the training of larger models or the use of increased batch sizes within the same hardware constraints, thereby enhancing throughput.
- Flexible Hardware Support: ORT supports both NVIDIA and AMD GPUs, allowing simple adaptation across different hardware setups without extensive code modifications.
Integrating ORT into your PyTorch training script is straightforward:
import torch
from onnxruntime.training import ORTModule  # shipped with the onnxruntime-training package

# Define your PyTorch model; any torch.nn.Module works (a stand-in is used here)
model = torch.nn.Linear(768, 2)

# Wrap the model with ORTModule; the rest of the training loop stays standard PyTorch
model = ORTModule(model)
This minimal change can lead to substantial performance improvements. ONNX Runtime can also simply be enabled in the Azure AI portal for the managed fine-tuning options.
DeepSpeed Optimization
DeepSpeed integrates with Azure ML to enable efficient training of massive models by managing memory consumption and computational load effectively. It employs parallelism strategies and memory optimization techniques that can dramatically decrease memory requirements, allowing fine-tuning on smaller VMs. Key features include:
- Memory Optimization: DeepSpeed's Zero Redundancy Optimizer (ZeRO) partitions model states, gradients, and optimizer states across multiple GPUs, significantly reducing memory redundancy. DeepSpeed supports three ZeRO stages: stage 1 partitions optimizer states, stage 2 additionally partitions gradients, and stage 3 also partitions the model parameters themselves (a minimal configuration sketch follows this list).
- Scalability: DeepSpeed supports training across thousands of GPUs, enabling the scaling of models without linear increases in training time or resource consumption.
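As a sketch of what enabling ZeRO looks like in a self-managed script, DeepSpeed takes a small configuration at initialization; the script is then launched with the deepspeed CLI. The stage choice, batch size, and stand-in model below are illustrative assumptions:

import torch
import deepspeed

# A stand-in model; in practice this is the LLM being fine-tuned
model = torch.nn.Linear(768, 768)

# Illustrative configuration: ZeRO stage 2 partitions optimizer states and gradients
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "zero_optimization": {"stage": 2},
    "fp16": {"enabled": True},
}

# DeepSpeed wraps the model into a training engine that handles the partitioning
# (run with: deepspeed train.py)
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)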
Both ONNX Runtime and DeepSpeed can be enabled for the managed fine-tuning options during fine-tuning job submission in the portal.
Fine-tuning notes
There are many ways to fine-tune, each suitable for different scenarios. Selecting the right approach and specific guidance will ultimately depend on the unique data, requirements, and use case.
Always check the latest guidance to pick the right solution for your situation, and in case of doubt - let us know! :)