Large Language Models vs Small Language Models

LLMs vs SLMs
Nathan Bijnens
Sr Manager, Data & AI
Microsoft

Language Calculator
User input: "Will my sleeping bag work for my trip to Patagonia next month?"
Context: historical weather lookup, behavior, context data, output structure, profile data
Prompt engineering ("the art of asking questions") + add your own data
Prompt -> LLM -> Completion
Output: "Yes, your Elite Eco sleeping bag is rated to 21.6F, which is below the average low temperature in Patagonia in September."
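
The prompt flow on this slide (user input plus behavior, profile data, context data and output structure, all sent to the LLM as one prompt) could be sketched roughly as follows. This is only an illustration: the `PromptParts` and `build_prompt` names, the section labels, and the behavior and output-structure strings are assumptions, and the weather-lookup result is left as a placeholder because the slide does not give it.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PromptParts:
    """The pieces the slide lists as inputs to the prompt (names are hypothetical)."""
    user_input: str
    behavior: str = ""
    output_structure: str = ""
    context_data: List[str] = field(default_factory=list)
    profile_data: List[str] = field(default_factory=list)


def _bullets(items: List[str]) -> str:
    return "\n".join(f"- {item}" for item in items)


def build_prompt(parts: PromptParts) -> str:
    """Concatenate the non-empty parts into one prompt, with the user input last."""
    sections = []
    if parts.behavior:
        sections.append("Behavior:\n" + parts.behavior)
    if parts.profile_data:
        sections.append("Profile data:\n" + _bullets(parts.profile_data))
    if parts.context_data:
        sections.append("Context data:\n" + _bullets(parts.context_data))
    if parts.output_structure:
        sections.append("Output structure:\n" + parts.output_structure)
    sections.append("User input:\n" + parts.user_input)
    return "\n\n".join(sections)


EXAMPLE = PromptParts(
    user_input="Will my sleeping bag work for my trip to Patagonia next month?",
    behavior="You are a helpful outdoor-gear assistant.",   # assumed wording
    output_structure="Answer in one or two sentences.",      # assumed wording
    profile_data=["Owns: Elite Eco sleeping bag, rated to 21.6F"],
    # Placeholder: the slide does not state the looked-up temperature.
    context_data=["<result of the historical weather lookup goes here>"],
)

if __name__ == "__main__":
    print(build_prompt(EXAMPLE))
```

The resulting string is what would be sent to the model as the prompt; the completion is whatever the model returns.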

SLMs
Nested diagram: Artificial Intelligence contains Machine Learning, which contains Deep Learning.
1956: Artificial Intelligence. The field of computer science that seeks to create intelligent machines that can replicate or exceed human intelligence.
1997: Machine Learning. A subset of AI that enables machines to learn from existing data and improve upon that data to make decisions or predictions.
2012: Deep Learning. A machine learning technique in which layers of neural networks are used to process data and make decisions.
2022: Large Language Models (LLMs). For the first time we are able to capture and model knowledge. Further, we observe emergent behaviors as we scale up.
2023: Small Language Models. Advent of Phi SLMs, tiny but mighty language models that challenge the status quo! 🧨

What makes it small?
Feature | Large Language Models (LLMs) | Small Language Models (SLMs)
Amount of Parameters | Billions to Trillions | Millions to Billions
Use Cases | Complex tasks like text generation, translation, question answering, and summarization | Specific tasks
Costs | High computational and operational costs due to extensive resource requirements | Lower costs, suitable for resource-constrained environments
Training Time | Several weeks to months, depending on the model size and computational resources | Shorter training times, often a few days to weeks
Training Dataset Sizes | Massive datasets including books, articles, websites, and other forms of text | Smaller datasets, often task-specific or domain-specific
Inference Speed | Slower | Faster
Deployment | Requires powerful hardware (GPUs/TPUs) and cloud infrastructure | Can run on edge devices, CPUs, and less powerful GPUs
Accuracy | High accuracy and performance on a | Good performance on specific tasks,

Phi
Small Language Models
Groundbreaking performance for size, with frictionless availability
Models: Phi-3-mini (3.8B), Phi-3-vision (4.2B), Phi-3-MoE (6.6B), Phi-4 (14B)
Instruction Tuned, RAI Safety Aligned
Available on: Azure AI Model Catalog, Hugging Face, Ollama, NVIDIA NIM, ONNX Runtime
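
To make the parameter counts on this slide concrete, here is a rough back-of-the-envelope sketch of how much memory the weights alone would need at different precisions (parameters times bytes per parameter). The function name and the precision table are assumptions for illustration; the estimate ignores activations, the KV cache and runtime overhead, and the MoE entry is left out because for mixture-of-experts models the footprint depends on total rather than active parameters.

```python
# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

# Dense model sizes quoted on the slide, in billions of parameters.
PHI_SIZES_BILLIONS = {"Phi-3-mini": 3.8, "Phi-3-vision": 4.2, "Phi-4": 14.0}


def weight_memory_gb(params_billions: float, precision: str = "fp16") -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    # 1e9 parameters * bytes per parameter / 1e9 bytes per GB cancels out.
    return params_billions * BYTES_PER_PARAM[precision]


if __name__ == "__main__":
    for name, size in PHI_SIZES_BILLIONS.items():
        row = ", ".join(
            f"{p}: {weight_memory_gb(size, p):.1f} GB" for p in BYTES_PER_PARAM
        )
        print(f"{name} ({size}B) -> {row}")
```

Under these assumptions a 3.8B model needs roughly 7.6 GB in fp16 and about 1.9 GB at 4-bit, which is the kind of footprint that makes CPU and edge deployment plausible.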

Models & availability across platforms
Model | Input | Context Length | Azure AI (MaaS) | Azure ML (MaaP) | ONNX | Hugging Face | Ollama | Nvidia NIM
Phi-3-vision-128k-instruct | Text+Image | 128k | Playground & Deployment | Playground, Deployment & Finetuning | CUDA, CPU, DirectML | Download | -NA- | NIM APIs
Phi-3-mini-4k-instruct | Text | 4k | Playground & Deployment | Playground, Deployment & Finetuning | CUDA, Web | Playground & Download | GGUF | NIM APIs
Phi-3-mini-128k-instruct | Text | 128k | Playground & Deployment | Playground, Deployment & Finetuning | CUDA | Download | -NA- | NIM APIs
Phi-3-small-8k-instruct | Text | 8k | Playground & Deployment | Playground, Deployment & Finetuning |  |  |  |
Phi-3-small-128k-instruct |  |  |  | Playground, Deployment & Finetuning |  |  |  |
Phi-3-medium-4k-instruct |  |  |  | Playground, Deployment & Finetuning | CUDA, CPU, DirectML | Download | -NA- | NIM APIs
Phi-3-medium-128k-instruct |  |  |  | Playground, Deployment & Finetuning | CUDA, CPU, DirectML | Download | -NA- | -NA-
Phi-4 | Text | 16k | Playground & Deployment | Playground, Deployment & Finetuning | -NA- | Download | Download | -NA-
Phi-silica, which was announced at //build, is based on Phi models and is optimized for Windows NPUs. Application developers can leverage Phi-silica via in-box Windows APIs. Phi-silica is not available on Azure and is therefore out of scope for this presentation.
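
One practical use of the table above is picking a model whose context window is large enough for a given request. The sketch below does that using only the context lengths that are legible in the table (4k to 128k); the `models_fitting` helper is hypothetical, and "k" is treated as 1,000 tokens for simplicity, which is an assumption rather than the models' exact limits.

```python
# Context lengths as listed in the table; "k" is approximated as 1,000 tokens.
CONTEXT_TOKENS = {
    "Phi-3-vision-128k-instruct": 128_000,
    "Phi-3-mini-4k-instruct": 4_000,
    "Phi-3-mini-128k-instruct": 128_000,
    "Phi-3-small-8k-instruct": 8_000,
    "Phi-4": 16_000,
}


def models_fitting(prompt_tokens: int, reserved_output_tokens: int = 0) -> list:
    """Return models whose context window covers prompt plus reserved output.

    Results are ordered from the smallest sufficient window to the largest,
    with ties broken alphabetically by model name.
    """
    needed = prompt_tokens + reserved_output_tokens
    fits = [(limit, name) for name, limit in CONTEXT_TOKENS.items() if limit >= needed]
    return [name for _, name in sorted(fits)]


if __name__ == "__main__":
    # A hypothetical 10,000-token prompt rules out the 4k and 8k variants.
    print(models_fitting(10_000))
```

Ordering by the smallest sufficient window is a design choice here: shorter-context variants are usually the cheaper option when the prompt fits.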

Benchmark numbers
Category | Benchmark | phi-4 (14B) | phi-3 (14B) | Qwen 2.5 (14B instruct) | GPT-4o-mini | Llama-3.3 (70B instruct) | Qwen 2.5 (72B instruct) | GPT-4o
Popular Aggregated Benchmark | MMLU | 84.8 | 77.9 | 79.9 | 81.8 | 86.3 | 85.3 | 88.1
Science | GPQA | 56.1 | 31.2 | 42.9 | 40.9 | 49.1 | 49.0 | 50.6
Math | MGSM | 80.6 | 53.5 | 79.6 | 86.5 | 89.1 | 87.3 | 90.4
Math | MATH | 80.4 | 44.6 | 75.6 | 73.0 | 66.3 | 80.0 | 74.6
Code Generation | HumanEval | 82.6 | 67.8 | 72.1 | 86.2 | 78.9 | 80.4 | 90.6
Factual Knowledge | SimpleQA | 3.0 | 7.6 | 5.4 | 9.9 | 20.9 | 10.2 | 39.4
Reasoning | DROP | 75.5 | 68.3 | 85.5 | 79.3 | 90.2 | 76.7 | 80.9

Chart: Benchmark numbers (axis 0 to 100) for MMLU, GPQA, MGSM, MATH, HumanEval, SimpleQA, and DROP, comparing phi-4 (14B), phi-3 (14B), Qwen 2.5 (14B instruct), GPT-4o-mini, Llama-3.3 (70B instruct), Qwen 2.5 (72B instruct), and GPT-4o.
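
The benchmark numbers above are easier to reason about when queried programmatically. The following sketch copies the scores from the table and asks two questions: which model leads each benchmark, and how far phi-4 is from another model. The helper names (`leader`, `benchmarks_led_by`, `gap_to`) are made up for this example.

```python
MODELS = [
    "phi-4 (14B)", "phi-3 (14B)", "Qwen 2.5 (14B instruct)", "GPT-4o-mini",
    "Llama-3.3 (70B instruct)", "Qwen 2.5 (72B instruct)", "GPT-4o",
]

# Scores copied from the benchmark table, one value per entry in MODELS.
SCORES = {
    "MMLU":      [84.8, 77.9, 79.9, 81.8, 86.3, 85.3, 88.1],
    "GPQA":      [56.1, 31.2, 42.9, 40.9, 49.1, 49.0, 50.6],
    "MGSM":      [80.6, 53.5, 79.6, 86.5, 89.1, 87.3, 90.4],
    "MATH":      [80.4, 44.6, 75.6, 73.0, 66.3, 80.0, 74.6],
    "HumanEval": [82.6, 67.8, 72.1, 86.2, 78.9, 80.4, 90.6],
    "SimpleQA":  [3.0, 7.6, 5.4, 9.9, 20.9, 10.2, 39.4],
    "DROP":      [75.5, 68.3, 85.5, 79.3, 90.2, 76.7, 80.9],
}


def leader(benchmark: str) -> str:
    """Model with the highest score on one benchmark."""
    row = SCORES[benchmark]
    return MODELS[row.index(max(row))]


def benchmarks_led_by(model: str) -> list:
    """Benchmarks (in table order) on which the given model scores highest."""
    return [b for b in SCORES if leader(b) == model]


def gap_to(model: str, other: str) -> dict:
    """Score difference (model minus other) per benchmark, rounded to 0.1."""
    i, j = MODELS.index(model), MODELS.index(other)
    return {b: round(row[i] - row[j], 1) for b, row in SCORES.items()}


if __name__ == "__main__":
    print(benchmarks_led_by("phi-4 (14B)"))
    print(gap_to("phi-4 (14B)", "GPT-4o"))
```

On these numbers the 14B phi-4 leads the science (GPQA) and MATH rows despite its size, while it trails clearly on factual recall (SimpleQA).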

Benefits of Small Language Models
Low compute footprint; can run on older GPUs
Ultra-low latency thanks to their small size
Easy on your wallet, and hence business-viable
Can be deployed on-prem or on edge devices
Easier and more affordable to customize
The only model <5B that offers long context!

Some Use Cases for Small Language Models
Text Prediction
Named Entity Recognition
Summarization
Domain-Specific Tasks

Mistral - Ministral
Ministral-3B (3.6B)
131k token context length
$0.04 / M tokens (input and output)
Ministral-8B (8B)
$0.04 / M tokens (input and output)
Use cases: Internet-less assistant, On-Device translation, Local Analytics

Meta - Llama
Llama-3.2-1B (1B)
Llama-3.2-3B (3B)
Llama-3.2-11B-Vision (11B)
$0.37 / M tokens (est.)
128k token context length
Use cases: Internet-less assistant, Multilingual dialogue, Image/Text to Text
Fast / Slower but smarter
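
The per-million-token prices quoted on the Ministral and Llama slides translate into a bill with simple arithmetic: tokens divided by one million, times the price. The sketch below shows that calculation; the function name and the 50M-token monthly workload are hypothetical, and the prices are simply the figures from the slides, which may no longer be current.

```python
def cost_usd(total_tokens: int, price_per_million_usd: float) -> float:
    """Cost of processing `total_tokens` at a flat price per million tokens."""
    return total_tokens / 1_000_000 * price_per_million_usd


# Prices as quoted on the slides (USD per million tokens).
SLIDE_PRICES = {
    "Ministral (slide price)": 0.04,
    "Llama-3.2 (slide estimate)": 0.37,
}

if __name__ == "__main__":
    monthly_tokens = 50_000_000  # hypothetical workload, not from the slides
    for name, price in SLIDE_PRICES.items():
        print(f"{name}: ${cost_usd(monthly_tokens, price):.2f} per month")
```

A flat rate for input and output is assumed here, matching how the Ministral slide states its price; providers that price input and output separately would need two terms.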

DeepSeek
DeepSeek-R1 (671B)
Open-source model
Advanced reasoning capabilities
Not an SLM!
Distilled Models
R1-Distill-Qwen-1.5B
R1-Distill-Qwen-7B
R1-Distill-Qwen-14B
R1-Distill-Qwen-32B
R1-Distill-Llama-8B
R1-Distill-Llama-70B

Important Considerations
Understand the Problem at Hand
  Identify the problem you are solving
  Determine missing capabilities, skills, and behaviors
Evaluation and Benchmarks
  Make sure you can measure what you are enabling
  Use LLMs as a judge
  Use BabelBench with 300+ tasks, track general capabilities
Invest in Better Data
  Focus on higher quality, not quantity
  Less finetuning data is better to keep general capabilities
  Leverage LLMs to generate data
  Human annotations if available
No Free Lunch
  Fine-tuning reduces general capability over time
  Model forgets knowledge outside target domain
  Loss of general "thinking" skills
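
The "Use LLMs as a judge" point above can be sketched without committing to any particular model or API. In this minimal example the judge is just a callable that takes a prompt string and returns the model's reply; the prompt wording, the 1 to 5 scale, and the function names are all assumptions made for illustration, and the demo wires in a stub instead of a real model.

```python
import re
from typing import Callable, Iterable, Optional

# Hypothetical grading prompt; real setups usually need more careful rubrics.
JUDGE_TEMPLATE = (
    "You are grading an answer.\n"
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Candidate answer: {candidate}\n"
    "Reply with a single integer score from 1 (wrong) to 5 (fully correct)."
)


def judge_score(question: str, reference: str, candidate: str,
                judge: Callable[[str], str]) -> Optional[int]:
    """Ask the judge model for a 1-5 score; return None if no score is found."""
    reply = judge(JUDGE_TEMPLATE.format(
        question=question, reference=reference, candidate=candidate))
    match = re.search(r"\b([1-5])\b", reply)
    return int(match.group(1)) if match else None


def mean_score(examples: Iterable[dict], judge: Callable[[str], str]) -> Optional[float]:
    """Average judge score over examples, skipping replies that did not parse."""
    scores = []
    for ex in examples:
        score = judge_score(ex["question"], ex["reference"], ex["candidate"], judge)
        if score is not None:
            scores.append(score)
    return sum(scores) / len(scores) if scores else None


if __name__ == "__main__":
    # Stub judge standing in for a call to a larger model.
    def fake_judge(prompt: str) -> str:
        return "Score: 4"

    data = [{"question": "2+2?", "reference": "4", "candidate": "four"}]
    print(mean_score(data, fake_judge))
```

Running the same harness on a target-domain set and on a general-capability set, before and after fine-tuning, is one way to watch for the loss of general skills that the "No Free Lunch" points warn about.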

Thank you
Nathan Bijnens
Sr Manager, Data & AI
Microsoft
Nathan.Bijnens@microsoft.com
