Vaibhav Srivastav’s Post

Vaibhav Srivastav

GPU poor @ Hugging Face

5mo

Fuck yeah! MaskGCT - New open SoTA Text to Speech model! 🔥

> Zero-shot voice cloning
> Emotional TTS
> Trained on 100K hours of data
> Long form synthesis
> Variable speed synthesis
> Bilingual - Chinese & English
> Available on Hugging Face

Fully non-autoregressive architecture:

> Stage 1: Predicts semantic tokens from text, using tokens extracted from a speech self-supervised learning (SSL) model
> Stage 2: Predicts acoustic tokens conditioned on the semantic tokens.

Synthesised: "Would you guys personally like to have a fake fireplace, an electric one, in your house? Or would you rather have a real fireplace? Let me know down below. Okay everybody, that's all for today's video and I hope you guys learned a bunch of furniture vocabulary!"

TTS scene keeps getting lit! 🐐
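For anyone wondering what "fully non-autoregressive" with two stages could look like in code, here is a rough toy sketch. It is not MaskGCT's implementation: the post does not say how decoding works, so the iterative masked decoding below is an assumed, commonly used non-autoregressive scheme, and every name in it (`toy_predictor`, `masked_parallel_decode`, `synthesize`, the vocabulary sizes) is hypothetical. The "model" is a random stand-in, so the output tokens are meaningless; only the control flow is the point.

```python
import random

MASK = -1        # marks a position that has not been predicted yet
SEM_VOCAB = 64   # hypothetical semantic codebook size
AC_VOCAB = 256   # hypothetical acoustic codebook size


def toy_predictor(tokens, condition, vocab_size, rng):
    """Stand-in for a trained network: one (token, confidence) guess per position.

    A real model would look at `tokens` and `condition`; this toy ignores them.
    """
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def masked_parallel_decode(length, condition, vocab_size, steps=4, seed=0):
    """Fill `length` masked slots over a few parallel passes, not left to right."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    for step in range(1, steps + 1):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        proposals = toy_predictor(tokens, condition, vocab_size, rng)
        # Commit only the most confident guesses each pass; commit all on the last.
        if step == steps:
            keep = len(masked)
        else:
            keep = max(1, len(masked) // (steps - step + 1))
        best = sorted(masked, key=lambda i: proposals[i][1], reverse=True)[:keep]
        for i in best:
            tokens[i] = proposals[i][0]
    return tokens


def synthesize(text, sem_len=12):
    """Two stages as listed in the post: text -> semantic -> acoustic tokens."""
    # Stage 1: predict semantic tokens conditioned on the text.
    semantic = masked_parallel_decode(sem_len, text, SEM_VOCAB, seed=0)
    # Stage 2: predict acoustic tokens conditioned on the semantic tokens.
    acoustic = masked_parallel_decode(len(semantic), semantic, AC_VOCAB, seed=1)
    return semantic, acoustic


if __name__ == "__main__":
    semantic, acoustic = synthesize("Would you rather have a real fireplace?")
    print(len(semantic), len(acoustic))
```

The sketch simplifies heavily: it uses one acoustic token per semantic token and a single codebook, and it skips the SSL model that would supply the semantic tokens for training and for the voice prompt.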

22 Comments

Refat Ametov

Driving Business Automation & AI Integration | Co-founder of Devstark and SpreadSimple | Stoic Mindset

5mo

This is next level! The combination of zero-shot voice cloning and emotional TTS opens up so many possibilities. Great to see such innovation in the open-source space.

Ming-Tsung Wu

Director / Application Dept.

5mo

The TTS model is like the last mile of the telephone line in the early days of telecommunication.

Raj Panchal

Full Stack Engineer | MERN, Next JS, AWS, GCP, Three.js, OpenAI API | Empowering Businesses with advanced solutions and Interactive Web Technologies

5mo

It takes 45 seconds. Can there be something faster with a GPU? The quality doesn't need to be SOTA; the speed does.

Robin Williams

*

5mo

Woooottt 🔥🔥

Onkar Susladkar

MLE@yellow.ai | Research Scholar Northwestern University | Teaching and Research Assistant Indian Institute of Technology Roorkee | RA @IIT jodhpur | RA@IISc Banglore

5mo

can someone share the paper or code

Hyperstack

5mo

This sounds amazing! Excited to see how it transforms TTS 👀

Shreya Kar

Data Scientist | Senior Consultant at Machine Learning Reply

5mo

Sourabh Zanwar this we can use

David Mezzetti

Founder/CEO at NeuML

5mo

cc-by-nc for those who care

Manash Mishra

ML Engineer @ Paisabazaar | Ex RA NLP @ IIT BHU

5mo

Jalem Raj Rohit “Emotional TTS”


More Relevant Posts

Marc Berlin Guerrero

Liberation through automation | No-code AI solutions that scale your business | Voiceflow | Make.com | AI Automations | AI Conversational Assistant
5mo
introducing outeTTS-0.1-350M: text-to-speech without the baggage 🎙️. this sleek model uses pure language modeling, ditching the usual heavy architectures. imagine creating custom voices with just a few seconds of audio - that's zero-shot voice cloning for you! 🤯 with just 350 million parameters, it rivals much larger models. is this the future of tts tech? what do you think about the move towards smaller, smarter models? let's discuss. 👇
NVIDIA AI

1,238,924 followers
11mo
Leverage the power of Canary, a multilingual speech-to-text model that ranks at the top of the HuggingFace Open ASR Leaderboard with an average word error rate of 6.67%, outperforming all other open-source models by a wide margin. Explore the model and learn how to use it. https://nvda.ws/49ET2rb #SpeechRecognition #Transcription

New Standard for Speech Recognition and Translation from the NVIDIA NeMo Canary Model | NVIDIA Technical Blog

developer.nvidia.com
Ibrahim Sobh - PhD

🎓 Senior Expert of Artificial Intelligence, Valeo Group | LinkedIn Top Voice | Machine Learning | Deep Learning | Data Science | Computer Vision | NLP | Developer | Researcher | Lecturer
11mo
💎 Overview of Vision-Language Models (#VLM) ✅ Vision-Language Models (#VLMs) integrate both visual (image) and textual (language) information processing. They are designed to understand and generate content that involves both #images and #text, enabling them to perform tasks like #imagecaptioning, #visualquestionanswering, and #texttoimage generation. ✅ This primer offers an overview of their architecture and how they differ from Large Language Models (LLMs). 👉 https://lnkd.in/dnv6425Y ♻️𝒌𝒊𝒏𝒅𝒍𝒚 𝒓𝒆𝒑𝒐𝒔𝒕♻️
2 Comments
Hauska.io

95 followers
10mo
Explore the capabilities of Enhanced Natural Language Prompting in Hauska Design Pro. Transform text descriptions into detailed visual designs effortlessly. Simplify your workflow and bring your ideas to life faster than ever. https://hubs.ly/Q02zkkqV0
Yu Kanazawa, PhD

The University of Osaka / 大阪大学 - Associate Professor (Lecturer) / 講師
2mo
A coding scheme for L2 listening: top-down, bottom-up, affective, multimodal, and environmental 🔓 Siegel, J., Kuteeva, M., & Siegel, A. (2025). Making sense of non-comprehension issues while listening: A data-based coding scheme. Research Methods in Applied Linguistics, 4(1), Article 100181. https://lnkd.in/gKuXmmhX

Maria Kuteeva

Professor of English Linguistics, Department of English, Stockholm University
2mo

Hot off the press and open access! The first publication resulting from our VR-funded ReMoDEL project on the dynamic nature of comprehension in EMI lectures (Joe Siegel (PI), Maria Kuteeva and Aki Siegel). This article presents a coding scheme resulting from our first data collection using the idiodynamic method. https://lnkd.in/dC27znY6

Making sense of non-comprehension issues while listening: A data-based coding scheme

sciencedirect.com
Alex Pillen

Associate Professor, Anthropology of Language, Kurmanji
10mo
Architecture of language: Forced perspective https://lnkd.in/e9yjyFRk

FORCED PERSPECTIVE

alexpillen.medium.com
Ramin Mehran

Tech Lead @ Google DeepMind Multi-Modal perception/generation, AI Breakdown Podcaster
11mo Edited
In this episode, we discuss TextSquare: Scaling up Text-Centric Visual Instruction Tuning by @Jingqun Tang, Chunhui Lin, Zhen Zhao, Shu Wei, Binghong Wu, Qi Liu, Hao Feng, Yang Li, Siqi Wang, Lei Liao, Wei Shi, Yuliang Liu, Hao Liu, Yuan Xie, Xiang Bai, @Can Huang. The paper describes advancements in text-centric visual question answering using a novel dataset called Square-10M, developed to improve Multimodal Large Language Models (MLLMs) through instruction tuning. The dataset, generated with closed-source MLLMs, employs a method named Square that covers Self-Questioning, Answering, Reasoning, and Evaluation for data construction. Experiments on the dataset indicated significant performance enhancements over existing models, highlighting the importance of the quantity of reasoning data in VQA for enhancing accuracy and reducing errors in model responses.

arxiv preprint - TextSquare: Scaling up Text-Centric Visual Instruction Tuning

podbean.com
Xiaoyu Liu

Member of Technical Staff, Waveforms AI
6mo Edited
Check out our new paper on "Joint Semantic Knowledge Distillation and Masked Acoustic Modeling for Full-band Speech Restoration with Improved Intelligibility"! We introduce a generative model addressing denoising, dereverberation, bandwidth extension, and declipping, producing high-quality full-band (44.1 kHz) speech. Our approach involves masked audio token modeling with a language model. Notably, we propose self-supervised semantic knowledge distillation, which significantly enhances speech intelligibility without requiring transcribed data or increasing model size and inference time. Find the paper and samples at the provided link!

2409.09357

arxiv.org
Himanshu Ramchandani

I help businesses with AI Engineering & Consulting.
10mo
GPT-4o Key takeaways 🚀

In the research paper, the human response time across languages: 208 ms
Recognizing and identifying various speakers in an audio recording
Targeted alteration of input photos in a Photoshop-like manner
Much superior to GPT-4 Turbo in languages other than English
Significantly better text rendering on created images and fonts
Capacity to communicate human feelings and vocal abilities
Benchmarks for MMLU/HumanEval have slightly improved
50% less expensive and twice as quick as GPT-4 Turbo
Capability of audio recording human emotions
3D modeling and image creation
Summarization of the lecture

Read the Analysis here: https://lnkd.in/dGxz-Mhf

Vaibhav Srivastav

54,664 followers
