Fast Conformer Architecture, Its Applications, and the Rise of the Conversational and Voice AI Stack
Image courtesy: Conformer-1


Introduction

Raising a toddler, as I write this, is fun for me. Parenting a toddler for the second time, and over time, I have come to appreciate motor and cognitive skills, speech recognition, and other abilities that build on cognition. As adults we do things in a split second that babies can't. Skills that are mechanical and demand reasoning, reflexes, and muscle movement are hard to acquire, and regaining them, say after trauma, is not easy either.

A fun fact: the skills of a waiter or a butler serving drinks, cups, and plates are hard to replace even with the most sophisticated AI on the planet, while a software engineer's coding skills will likely be disrupted, and perhaps replaced, by an AI program much sooner. Honestly, a waiter's or a butler's job is more secure than a software engineer's, especially since software engineers are paid far better than waiters or butlers; that pay disparity is an incentive to replace expensive coding skills with a fully automated program.

In this post, I will draw an analogy between babies and neural networks and delve into use cases like Automatic Speech Recognition, Text to Speech, and Speech to Text that are contingent on cognitive skills, and into what it takes to train a neural network to grasp these hard skills: the architecture behind the scenes and relevant use cases, with real-world examples and code samples.


Conformer Architecture - When a Convolutional Neural Net Meets a Transformer

For those who recall raising their toddlers: we used a combination of pictures, art forms, and shapes, together with text, speech, audio, and video, so that toddlers could recognize letters and sounds and then slowly recognize words from pictures and shapes. Neural networks, and the way they are trained to recognize language and speech, are no different.

If we look at the Automatic Speech Recognition problem statement, it is like combining two or more modalities and generating a third modality that the neural network gets trained on.

In my previous posts I have gone in depth into the Transformer architecture and its attention mechanism, and into Convolutional Neural Nets (CNNs), and how they have enabled neural networks to learn language and understand images. Check them out if you would like a detailed understanding of their inner workings.

Here I am going to focus on the Conformer architecture: how it has evolved, its use cases, and code samples.

So what is a Conformer architecture?

Simply put, the Conformer architecture is a neural network design that combines the strengths of Convolutional Neural Networks (CNNs) and Transformers to effectively model both local and global dependencies in sequence data. This significantly improves a model's capability, particularly for tasks like automatic speech recognition.

The Conformer is a hybrid architecture that bridges the gap between CNNs and Transformers and leverages the strengths of both worlds.


Why is a CNN-based or Transformer-based model alone not enough for Automatic Speech Recognition?

CNNs excel at capturing local features and patterns, while Transformers are known for their ability to model long-range dependencies with self-attention mechanisms. These two complementary strengths, when combined, become a powerful tool for comprehending language, detecting voice activity, and recognizing speech.


A Conformer block, as shown in the figure below, is the fundamental building block of the Conformer architecture. It typically consists of a feed-forward network (FFN), a multi-headed self-attention (MHSA) module, a 1D convolution module, and another feed-forward network (FFN).


[Figure: A Conformer block]
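
To make the block concrete, here is a minimal PyTorch sketch of a Conformer block. This is my own simplification for illustration, not the NeMo or Google implementation: relative positional encoding, dropout, and padding masks are omitted, and the layer sizes are arbitrary.

import torch
import torch.nn as nn

class ConvModule(nn.Module):
    # Conformer convolution module: pointwise conv + GLU, depthwise conv,
    # BatchNorm, activation, pointwise conv.
    def __init__(self, d_model, kernel_size=31):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.pointwise1 = nn.Conv1d(d_model, 2 * d_model, 1)
        self.depthwise = nn.Conv1d(d_model, d_model, kernel_size,
                                   padding=kernel_size // 2, groups=d_model)
        self.bn = nn.BatchNorm1d(d_model)
        self.act = nn.SiLU()
        self.pointwise2 = nn.Conv1d(d_model, d_model, 1)

    def forward(self, x):                      # x: (batch, time, d_model)
        y = self.norm(x).transpose(1, 2)       # -> (batch, d_model, time)
        y = nn.functional.glu(self.pointwise1(y), dim=1)
        y = self.act(self.bn(self.depthwise(y)))
        return self.pointwise2(y).transpose(1, 2)

class ConformerBlock(nn.Module):
    # FFN -> MHSA -> Conv -> FFN, each in a residual branch; the two
    # feed-forward networks contribute half-step residuals.
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.ffn1 = self._ffn(d_model)
        self.attn_norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.conv = ConvModule(d_model)
        self.ffn2 = self._ffn(d_model)
        self.final_norm = nn.LayerNorm(d_model)

    @staticmethod
    def _ffn(d_model, mult=4):
        return nn.Sequential(nn.LayerNorm(d_model),
                             nn.Linear(d_model, mult * d_model), nn.SiLU(),
                             nn.Linear(mult * d_model, d_model))

    def forward(self, x):                      # x: (batch, time, d_model)
        x = x + 0.5 * self.ffn1(x)
        a = self.attn_norm(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]
        x = x + self.conv(x)
        x = x + 0.5 * self.ffn2(x)
        return self.final_norm(x)

# Example: roughly 2 seconds of audio as 200 feature frames of width 256.
block = ConformerBlock()
print(block(torch.randn(1, 200, 256)).shape)   # torch.Size([1, 200, 256])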


The Conformer architecture has demonstrated strong performance, but its computational and memory efficiency can still be a challenge, particularly for long sequences of speech and text, because of the quadratic complexity of the attention mechanism. However, a new downsampling-based architecture, the Fast Conformer, has evolved: the input sequence is downsampled before attention, and the attention mechanism can be scaled linearly.
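
The intuition behind the downsampling is easy to show in code. The toy sketch below is again my own simplification rather than NVIDIA's Fast Conformer subsampling module (which operates on 2-D spectrogram features): it reduces the frame sequence 8x with strided depthwise-separable convolutions before any attention runs, so every attention layer sees one eighth of the frames.

import torch
import torch.nn as nn

class DepthwiseSubsampler(nn.Module):
    # Three strided depthwise-separable conv stages, each halving the
    # time dimension: 2 x 2 x 2 = 8x downsampling overall.
    def __init__(self, d_model=256):
        super().__init__()
        stages = []
        for _ in range(3):
            stages += [
                nn.Conv1d(d_model, d_model, kernel_size=3, stride=2,
                          padding=1, groups=d_model),        # depthwise
                nn.Conv1d(d_model, d_model, kernel_size=1),  # pointwise
                nn.SiLU(),
            ]
        self.net = nn.Sequential(*stages)

    def forward(self, x):                       # x: (batch, time, d_model)
        return self.net(x.transpose(1, 2)).transpose(1, 2)

feats = torch.randn(1, 800, 256)                # ~8 s of 10 ms frames
short = DepthwiseSubsampler()(feats)
print(short.shape)                              # torch.Size([1, 100, 256])
# Self-attention cost grows with the square of the sequence length, so
# 8x fewer frames means roughly 64x fewer attention scores per layer.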

Conformer-based models are used for tasks such as automatic speech recognition (ASR), spoken language understanding, and speech translation, and as a backbone for self-supervised learning.

Conformer-1 - a model that leverages Transformer and convolutional layers for speech recognition.

The Conformer [1] is a neural net for speech recognition that was published by Google Brain in 2020. The Conformer builds upon the now-ubiquitous Transformer architecture. By integrating convolutional layers into the Transformer architecture, the Conformer can capture both local and global dependencies while being a relatively size-efficient neural net architecture.

Real World Examples

NVIDIA recently open-sourced its encoder-decoder model Canary. Canary uses a Fast Conformer encoder and a Transformer decoder.

NVIDIA NeMo Canary is a family of multilingual, multi-tasking models that achieves state-of-the-art performance on multiple benchmarks. With 1 billion parameters, Canary-1B supports automatic speech-to-text recognition (ASR) in 4 languages (English, German, French, Spanish) and translation from English to German/French/Spanish and from German/French/Spanish to English, with or without punctuation and capitalization (PnC).
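
Because Canary is multi-tasking, you tell it what to do per utterance. Inference is typically driven by a JSON-lines manifest; the field names in the hedged sketch below follow the Canary-1B model card, but treat them as assumptions to verify against the NeMo release you install.

import json

# One manifest line per utterance: taskname "asr" keeps source and target
# language the same, "s2t_translation" translates, and "pnc" toggles
# punctuation and capitalization. Paths and durations are placeholders.
entries = [
    {"audio_filepath": "meeting_en.wav", "duration": 12.3,
     "taskname": "asr", "source_lang": "en", "target_lang": "en",
     "pnc": "yes", "answer": "na"},
    {"audio_filepath": "meeting_en.wav", "duration": 12.3,
     "taskname": "s2t_translation", "source_lang": "en", "target_lang": "de",
     "pnc": "yes", "answer": "na"},
]
with open("input_manifest.json", "w") as f:
    for entry in entries:
        f.write(json.dumps(entry) + "\n")

# The manifest path is then passed to the model's transcribe() call,
# e.g. canary_model.transcribe("input_manifest.json", batch_size=16)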

I used this open-source model to create an Automatic Speech Recognition app.


I built my app using the following ingredients (a minimal sketch of how they fit together follows the list):

a. ffmpeg and libsndfile1 - packages to read, write, record, convert, and stream audio

b. NeMo Canary open-source model - for speech-to-text inference

c. Gradio for the app UI - check out my Automatic Speech Recognition app here

d. A 2x16 CPU compute instance for deploying and hosting the Docker container
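
Here is a minimal sketch of how those pieces fit together. This is a reconstruction under assumptions rather than the app's exact code: the NeMo transcribe() keyword arguments and return type vary across releases, and the Gradio wiring shown is just one way to build the UI.

import subprocess
import gradio as gr
from nemo.collections.asr.models import EncDecMultiTaskModel

# Load the open-source Canary-1B checkpoint once at startup.
canary = EncDecMultiTaskModel.from_pretrained("nvidia/canary-1b")

def transcribe(audio_path):
    # ffmpeg resamples whatever Gradio recorded or uploaded to 16 kHz mono WAV.
    wav = "input_16k.wav"
    subprocess.run(["ffmpeg", "-y", "-i", audio_path,
                    "-ar", "16000", "-ac", "1", wav], check=True)
    # Returns a list with one transcript per input file; some NeMo releases
    # return Hypothesis objects instead of plain strings.
    return str(canary.transcribe([wav], batch_size=1)[0])

demo = gr.Interface(fn=transcribe,
                    inputs=gr.Audio(type="filepath"),
                    outputs="text",
                    title="Canary-1B Automatic Speech Recognition")
demo.launch(server_name="0.0.0.0", server_port=7860)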

Check out the app, experiment with it, and share your feedback and comments.


Conclusion

The Conversational AI stack is improving rapidly. Systems that interact with users by speaking and listening will drive many new applications. Conformer-based models make it easier for developers to build systems that deliver voice-in, voice-out experiences. The Fast Conformer architecture helps with linearly scalable attention and efficient speech recognition. Canary, a new family of open-source multilingual models, sets a new standard in speech-to-text recognition and translation. This can be super helpful for building quick-and-dirty prototypes. It is still hard to control the output of voice-in, voice-out models. We will see agentic workflows getting integrated with Conformer-based models, which will help control model input and output better.




