Fast Conformer Architecture, Its Applications, and the Rise of the Conversational and Voice AI Stack
Image courtesy: Conformer-1


Introduction

Raising a toddler, as I write this, is fun for me. Parenting a toddler for the second time, and over time, I have come to appreciate motor and cognitive skills, speech recognition, and other abilities that build on cognition. As adults we do things in a split second that babies can't. Skills that are mechanical and demand reasoning, reflexes, and muscle movement are hard to acquire, and regaining them, say after trauma, is not easy either.

A fun fact: the skills of a waiter or a butler serving drinks, cups, and plates are hard to replace even with the most sophisticated AI on the planet, while a software engineer's coding skills will likely be disrupted, and perhaps replaced, by an AI program much sooner. Honestly, a waiter's or a butler's job is more secure than a software engineer's, especially since software engineers are paid far better than waiters or butlers; that pay disparity is an incentive to replace expensive coding skills with a fully automated program.

In this post, I will draw an analogy between babies and neural networks and delve into use cases like Automatic Speech Recognition, Text to Speech, and Speech to Text that are contingent on cognitive skills, and into what it takes to train a neural network to grasp these hard skills: the architecture behind the scenes and relevant use cases, with real-world examples and code samples.


Conformer Architecture - When a Convolutional Neural Net Meets a Transformer

For those who recall raising their toddlers: we used a combination of pictures, art forms, and shapes, together with text, speech, audio, and video, so that toddlers could recognize letters and sounds and then slowly recognize words from pictures and shapes. Neural networks, and the way they are trained to recognize language and speech, are no different.

If we look at the Automatic Speech Recognition problem statement, it is like combining two or more modalities and generating a third modality that the neural network gets trained on.

In my previous posts I have gone in depth into the Transformer architecture and its attention mechanism, and into Convolutional Neural Nets (CNNs), and how they have enabled neural networks to learn language and understand images. Check them out if you would like a detailed understanding of their inner workings.

Here I am going to focus on the Conformer architecture: how it has evolved, its use cases, and code samples.

So what is a Conformer architecture?

Simply put, the Conformer architecture is a neural network design that combines the strengths of Convolutional Neural Networks (CNNs) and Transformers to effectively model both local and global dependencies in sequence data. This significantly improves a model's capability, particularly for tasks like automatic speech recognition.

The Conformer is a hybrid architecture that bridges the gap between CNNs and Transformers and leverages the strengths of both worlds.


Why is a CNN-based or Transformer-based model alone not enough for Automatic Speech Recognition?

CNNs excel at capturing local features and patterns, while Transformers are known for their ability to model long-range dependencies with self-attention mechanisms. These two complementary strengths, when combined, become a powerful tool for comprehending language, detecting voice activity, and recognizing speech.


A Conformer block, as shown in the figure below, is the fundamental building block of the Conformer architecture. It typically consists of a feed-forward network (FFN), a multi-headed self-attention (MHSA) module, a 1D convolution module, and another feed-forward network (FFN).


[Figure: A Conformer block]
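
To make the block concrete, here is a minimal PyTorch sketch of a Conformer block. This is my own simplification for illustration, not the NeMo or Google implementation: relative positional encoding, dropout, and padding masks are omitted, and the layer sizes are arbitrary.

import torch
import torch.nn as nn

class ConvModule(nn.Module):
    # Conformer convolution module: pointwise conv + GLU, depthwise conv,
    # BatchNorm, activation, pointwise conv.
    def __init__(self, d_model, kernel_size=31):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.pointwise1 = nn.Conv1d(d_model, 2 * d_model, 1)
        self.depthwise = nn.Conv1d(d_model, d_model, kernel_size,
                                   padding=kernel_size // 2, groups=d_model)
        self.bn = nn.BatchNorm1d(d_model)
        self.act = nn.SiLU()
        self.pointwise2 = nn.Conv1d(d_model, d_model, 1)

    def forward(self, x):                      # x: (batch, time, d_model)
        y = self.norm(x).transpose(1, 2)       # -> (batch, d_model, time)
        y = nn.functional.glu(self.pointwise1(y), dim=1)
        y = self.act(self.bn(self.depthwise(y)))
        return self.pointwise2(y).transpose(1, 2)

class ConformerBlock(nn.Module):
    # FFN -> MHSA -> Conv -> FFN, each in a residual branch; the two
    # feed-forward networks contribute half-step residuals.
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.ffn1 = self._ffn(d_model)
        self.attn_norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.conv = ConvModule(d_model)
        self.ffn2 = self._ffn(d_model)
        self.final_norm = nn.LayerNorm(d_model)

    @staticmethod
    def _ffn(d_model, mult=4):
        return nn.Sequential(nn.LayerNorm(d_model),
                             nn.Linear(d_model, mult * d_model), nn.SiLU(),
                             nn.Linear(mult * d_model, d_model))

    def forward(self, x):                      # x: (batch, time, d_model)
        x = x + 0.5 * self.ffn1(x)
        a = self.attn_norm(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]
        x = x + self.conv(x)
        x = x + 0.5 * self.ffn2(x)
        return self.final_norm(x)

# Example: roughly 2 seconds of audio as 200 feature frames of width 256.
block = ConformerBlock()
print(block(torch.randn(1, 200, 256)).shape)   # torch.Size([1, 200, 256])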


The Conformer architecture has demonstrated strong performance, but its computational and memory efficiency can still be a challenge, particularly for long sequences of speech and text, because of the quadratic complexity of the attention mechanism. However, a new downsampling-based architecture, the Fast Conformer, has evolved: the input sequence is downsampled before attention, and the attention mechanism can be scaled linearly.
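
The intuition behind the downsampling is easy to show in code. The toy sketch below is again my own simplification rather than NVIDIA's Fast Conformer subsampling module (which operates on 2-D spectrogram features): it reduces the frame sequence 8x with strided depthwise-separable convolutions before any attention runs, so every attention layer sees one eighth of the frames.

import torch
import torch.nn as nn

class DepthwiseSubsampler(nn.Module):
    # Three strided depthwise-separable conv stages, each halving the
    # time dimension: 2 x 2 x 2 = 8x downsampling overall.
    def __init__(self, d_model=256):
        super().__init__()
        stages = []
        for _ in range(3):
            stages += [
                nn.Conv1d(d_model, d_model, kernel_size=3, stride=2,
                          padding=1, groups=d_model),        # depthwise
                nn.Conv1d(d_model, d_model, kernel_size=1),  # pointwise
                nn.SiLU(),
            ]
        self.net = nn.Sequential(*stages)

    def forward(self, x):                       # x: (batch, time, d_model)
        return self.net(x.transpose(1, 2)).transpose(1, 2)

feats = torch.randn(1, 800, 256)                # ~8 s of 10 ms frames
short = DepthwiseSubsampler()(feats)
print(short.shape)                              # torch.Size([1, 100, 256])
# Self-attention cost grows with the square of the sequence length, so
# 8x fewer frames means roughly 64x fewer attention scores per layer.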

Conformer-based models are used for tasks such as automatic speech recognition (ASR), spoken language understanding, and speech translation, and as a backbone for self-supervised learning.

Conformer-1 - a model that leverages Transformer and convolutional layers for speech recognition.

The Conformer [1] is a neural net for speech recognition that was published by Google Brain in 2020. The Conformer builds upon the now-ubiquitous Transformer architecture. By integrating convolutional layers into the Transformer architecture, the Conformer can capture both local and global dependencies while being a relatively size-efficient neural net architecture.

Real World Examples

NVIDIA recently open-sourced its encoder-decoder model Canary. Canary uses a Fast Conformer encoder and a Transformer decoder.

NVIDIA NeMo Canary is a family of multilingual, multi-tasking models that achieves state-of-the-art performance on multiple benchmarks. With 1 billion parameters, Canary-1B supports automatic speech-to-text recognition (ASR) in 4 languages (English, German, French, Spanish) and translation from English to German/French/Spanish and from German/French/Spanish to English, with or without punctuation and capitalization (PnC).
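
Because Canary is multi-tasking, you tell it what to do per utterance. Inference is typically driven by a JSON-lines manifest; the field names in the hedged sketch below follow the Canary-1B model card, but treat them as assumptions to verify against the NeMo release you install.

import json

# One manifest line per utterance: taskname "asr" keeps source and target
# language the same, "s2t_translation" translates, and "pnc" toggles
# punctuation and capitalization. Paths and durations are placeholders.
entries = [
    {"audio_filepath": "meeting_en.wav", "duration": 12.3,
     "taskname": "asr", "source_lang": "en", "target_lang": "en",
     "pnc": "yes", "answer": "na"},
    {"audio_filepath": "meeting_en.wav", "duration": 12.3,
     "taskname": "s2t_translation", "source_lang": "en", "target_lang": "de",
     "pnc": "yes", "answer": "na"},
]
with open("input_manifest.json", "w") as f:
    for entry in entries:
        f.write(json.dumps(entry) + "\n")

# The manifest path is then passed to the model's transcribe() call,
# e.g. canary_model.transcribe("input_manifest.json", batch_size=16)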

I used this open-source model to create an Automatic Speech Recognition app.


I built my app using the following ingredients (a minimal sketch of how they fit together follows the list):

a. ffmpeg and libsndfile1 - packages to read, write, record, convert, and stream audio

b. NeMo Canary open-source model - for speech-to-text inference

c. Gradio for the app UI - check out my Automatic Speech Recognition app here

d. A 2x16 CPU compute instance for deploying and hosting the Docker container
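
Here is a minimal sketch of how those pieces fit together. This is a reconstruction under assumptions rather than the app's exact code: the NeMo transcribe() keyword arguments and return type vary across releases, and the Gradio wiring shown is just one way to build the UI.

import subprocess
import gradio as gr
from nemo.collections.asr.models import EncDecMultiTaskModel

# Load the open-source Canary-1B checkpoint once at startup.
canary = EncDecMultiTaskModel.from_pretrained("nvidia/canary-1b")

def transcribe(audio_path):
    # ffmpeg resamples whatever Gradio recorded or uploaded to 16 kHz mono WAV.
    wav = "input_16k.wav"
    subprocess.run(["ffmpeg", "-y", "-i", audio_path,
                    "-ar", "16000", "-ac", "1", wav], check=True)
    # Returns a list with one transcript per input file; some NeMo releases
    # return Hypothesis objects instead of plain strings.
    return str(canary.transcribe([wav], batch_size=1)[0])

demo = gr.Interface(fn=transcribe,
                    inputs=gr.Audio(type="filepath"),
                    outputs="text",
                    title="Canary-1B Automatic Speech Recognition")
demo.launch(server_name="0.0.0.0", server_port=7860)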

Check out the app, experiment with it, and share your feedback and comments.


Conclusion

The Conversational AI stack is improving rapidly. Systems that interact with users by speaking and listening will drive many new applications. Conformer-based models make it easier for developers to build systems that deliver voice-in, voice-out experiences. The Fast Conformer architecture helps with linearly scalable attention and efficient speech recognition. Canary, a new family of open-source multilingual models, sets a new standard in speech-to-text recognition and translation. This can be super helpful for building quick-and-dirty prototypes. It is still hard to control the output of voice-in, voice-out models. We will see agentic workflows getting integrated with Conformer-based models, which will help control model input and output better.




