Fuck yeah! MaskGCT - New open SoTA Text to Speech model! 🔥

> Zero-shot voice cloning
> Emotional TTS
> Trained on 100K hours of data
> Long form synthesis
> Variable speed synthesis
> Bilingual - Chinese & English
> Available on Hugging Face

Fully non-autoregressive architecture:

> Stage 1: Predicts semantic tokens from text, using tokens extracted from a speech self-supervised learning (SSL) model
> Stage 2: Predicts acoustic tokens conditioned on the semantic tokens

Synthesised: "Would you guys personally like to have a fake fireplace, an electric one, in your house? Or would you rather have a real fireplace? Let me know down below. Okay everybody, that's all for today's video and I hope you guys learned a bunch of furniture vocabulary!"

TTS scene keeps getting lit! 🐐
The TTS model is like the last mile of the telephone line in the early days of telecommunication.
It takes 45 seconds. Is there something faster on a GPU? The quality doesn't need to be SOTA; the speed does.
Woooottt 🔥🔥
Can someone share the paper or code?
This sounds amazing! Excited to see how it transforms TTS 👀
Sourabh Zanwar this we can use
cc-by-nc for those who care
Jalem Raj Rohit “Emotional TTS”
This is next level! The combination of zero-shot voice cloning and emotional TTS opens up so many possibilities. Great to see such innovation in the open-source space.