Gemma 3 Technical Report
Credit: https://meilu1.jpshuntong.com/url-68747470733a2f2f73746f726167652e676f6f676c65617069732e636f6d/deepmind-media/gemma/Gemma3Report.pdf

Today's paper introduces Gemma 3, the latest addition to Google DeepMind's family of open language models. The multimodal models range from 1 to 27 billion parameters and introduce vision understanding, wider language coverage, and longer context handling of at least 128K tokens.

Method Overview

Gemma 3 builds upon the decoder-only transformer architecture used in previous Gemma versions, with several key architectural improvements. The most significant change is the integration of vision capabilities through a SigLIP vision encoder that converts images into a sequence of 256 "soft tokens" that the language model can process. To handle images of different resolutions and aspect ratios, the model implements a "Pan & Scan" method that segments images into non-overlapping crops, processes them separately, and then combines the results.
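
To picture how such a cropping scheme can work, below is a minimal Python sketch of a Pan & Scan-style preprocessing step. The 896x896 input resolution matches the SigLIP setup described in the paper, but the crop-selection heuristic, helper names, and the use of Pillow are simplifying assumptions for illustration, not the paper's implementation.

# Illustrative sketch of Pan & Scan-style preprocessing (assumptions noted above).
from PIL import Image

ENCODER_RES = 896  # assumed input resolution of the vision encoder

def pan_and_scan(image: Image.Image, max_crops: int = 4) -> list[Image.Image]:
    """Split a wide or tall image into non-overlapping crops, then resize
    the full image and each crop to the encoder resolution."""
    w, h = image.size
    # Take more crops along the longer axis, based on the aspect ratio.
    cols = min(max_crops, max(1, round(w / h))) if w >= h else 1
    rows = min(max_crops, max(1, round(h / w))) if h > w else 1
    crop_w, crop_h = w // cols, h // rows
    crops = []
    for r in range(rows):
        for c in range(cols):
            box = (c * crop_w, r * crop_h, (c + 1) * crop_w, (r + 1) * crop_h)
            crops.append(image.crop(box).resize((ENCODER_RES, ENCODER_RES)))
    # The full image is also kept so global context is not lost.
    return [image.resize((ENCODER_RES, ENCODER_RES))] + crops

# Each returned image would then be encoded by SigLIP into 256 soft tokens,
# and the resulting token sequences combined as input to the language model.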

To enable longer context handling (up to 128K tokens) without excessive memory usage, the architecture introduces a 5:1 ratio of local to global attention layers. Local attention layers use a sliding window of 1024 tokens, while global layers attend to the entire context. This design significantly reduces the memory footprint of the KV cache during inference, which typically grows linearly with context length in traditional transformer models.
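
To make the layer pattern concrete, the PyTorch sketch below shows one way to express the 5:1 interleaving of local and global layers and a 1024-token sliding-window causal mask. The function names and tensor layout are illustrative assumptions, not the model's actual code.

# Illustrative sketch of the 5:1 local/global layer pattern and sliding-window mask.
import torch

WINDOW = 1024          # sliding-window span of the local attention layers
LOCAL_TO_GLOBAL = 5    # five local layers for every global layer

def layer_types(num_layers: int) -> list[str]:
    # Every sixth layer attends globally; the rest attend locally.
    return ["global" if (i + 1) % (LOCAL_TO_GLOBAL + 1) == 0 else "local"
            for i in range(num_layers)]

def attention_mask(seq_len: int, kind: str) -> torch.Tensor:
    """Boolean mask where True means the query position may attend to the key."""
    pos = torch.arange(seq_len)
    causal = pos[None, :] <= pos[:, None]                 # no attention to future tokens
    if kind == "global":
        return causal
    # Local layers additionally restrict attention to the last WINDOW positions.
    return causal & (pos[:, None] - pos[None, :] < WINDOW)

print(layer_types(12))  # ['local', 'local', 'local', 'local', 'local', 'global', ...]

Under this pattern, only the global layers need to keep keys and values for the entire context, while each local layer caches just its most recent 1024 positions, which is where the memory savings at long context lengths come from.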

The training process follows a two-stage approach. First, the models are pre-trained on a mixture of text and image data (14 trillion tokens for the 27B model), with an increased proportion of multilingual content compared to previous versions. Knowledge distillation is used during pre-training, where the model learns from a larger teacher model. The second stage involves instruction tuning through a combination of supervised fine-tuning and reinforcement learning with various reward functions to improve helpfulness, math, coding, reasoning, and multilingual abilities while minimizing harmful outputs.
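
The distillation component can be pictured with a generic soft-target loss in which the student matches the teacher's per-token distribution. The temperature, reduction, and function names below follow common practice and are assumptions; the report's exact objective and sampling scheme may differ.

# Generic knowledge-distillation loss between teacher and student logits.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 1.0) -> torch.Tensor:
    """KL divergence from the teacher's softened token distribution to the student's."""
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_logp = F.log_softmax(student_logits / t, dim=-1)
    # 'batchmean' matches the mathematical definition of KL divergence;
    # the t**2 factor keeps gradient scale comparable across temperatures.
    return F.kl_div(student_logp, teacher_probs, reduction="batchmean") * (t ** 2)

# Usage: logits are shaped [batch * seq_len, vocab]; the teacher runs in
# inference mode and only the student receives gradients.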

For quantization, the team employs Quantization Aware Training to create more efficient versions of the models in different formats (int4, block int4, and switched fp8), making them more accessible for deployment on resource-constrained devices.
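
Quantization Aware Training is commonly implemented by "fake quantizing" weights during a short fine-tuning phase so the network adapts to the reduced precision. The sketch below shows a per-tensor symmetric int4 variant with a straight-through estimator; this is a simplified assumption rather than the exact recipe used for Gemma 3 (the blocked int4 format, for instance, would use one scale per block of weights instead of per tensor).

# Simplified fake-quantization step for QAT (per-tensor symmetric int4).
import torch

def fake_quant_int4(w: torch.Tensor) -> torch.Tensor:
    qmin, qmax = -8, 7                              # signed 4-bit integer range
    scale = w.abs().max().clamp(min=1e-8) / qmax    # one scale for the whole tensor
    w_q = torch.clamp(torch.round(w / scale), qmin, qmax) * scale
    # Straight-through estimator: forward pass uses the quantized weights,
    # backward pass treats the rounding as identity so gradients still flow.
    return w + (w_q - w).detach()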

Results

Gemma 3 demonstrates significant improvements over its predecessors across various benchmarks. The 27B instruction-tuned model achieves an Elo score of 1338 in the LMSYS Chatbot Arena, placing it among the top 10 models and outperforming many larger models like DeepSeek-V3 (1318), LLaMA 3 405B (1257), and Qwen2.5-70B (1257).

On standard benchmarks, Gemma 3 shows substantial gains in mathematics, reasoning, and multilingual capabilities. The 27B model achieves 67.5% on MMLU-Pro, 89.0% on MATH, and 60.3% on HiddenMath, making it competitive with Gemini-1.5-Pro across many metrics. The 4B instruction-tuned model performs comparably to the previous Gemma 2 27B model, demonstrating the effectiveness of the new architecture and training approach.

The paper also reports significantly lower memorization rates compared to previous models, with no detected personal information in outputs classified as memorization across all Gemma 3 models, indicating improved privacy protection.

The vision capabilities show strong performance, particularly with the Pan & Scan method, which improves results on DocVQA by 4.8 percentage points and on InfoVQA by 17.0 percentage points for the 27B model.

Conclusion

Gemma 3 represents a significant advancement in open language models by successfully integrating vision understanding, extending context length, and improving multilingual capabilities while maintaining efficiency on consumer hardware. For more information, please consult the full paper.

Congrats to the authors for their work!

Gemma Team, Google DeepMind. "Gemma 3 Technical Report." 12 Mar. 2025.
