One of the top 5 winning teams in the Multi-speaker, Multi-lingual Indic TTS with Voice Cloning grand challenge, organized as part of IEEE ICASSP 2024.
In this paper, we describe the TTS models developed by NVIDIA for the MMITS-VC [1] (Multi-speaker, Multi-lingual Indic TTS with Voice Cloning) 2024 Challenge. In Tracks 1 and 2, we use RAD-MMM [2] to perform few-shot TTS by additionally training on 5 minutes of target-speaker data. In Track 3, we use P-Flow [3] to perform zero-shot TTS by training on the challenge dataset as well as external datasets. We use HiFi-GAN [4] vocoders for all submissions. RAD-MMM performs competitively on Tracks 1 and 2, while P-Flow ranks first on Track 3, with a mean opinion score (MOS) of 4.4 and a speaker similarity score (SMOS) of 3.62.
Check out our ICASSP 2024 talk in Seoul, South Korea: Scaling NVIDIA’s Multi-speaker Multi-lingual TTS Systems with Zero-Shot TTS to Indic Languages
Check out my Substack blog post about this project here.
This work was featured in press releases by NVIDIA here and here, and by Analytics India Magazine here.
Here is a 3-second speaker reference from a native Kannada speaker (female):
Here is the same speaker, speaking in various Indic languages:
Language: Input Text | Audio |
---|---|
Marathi: मी थायलंडला चाललोय, बाबांनी सगळं नियोजन करून ठेवलं आहे. | |
Kannada: ನನ್ನ ಹದಿನೆಂಟು ತಿಂಗಳ ವಯಸ್ಸಿನವನಾಗಿದ್ದಾಗ ಸಹೋದರ ಸಂಖ್ಯೆಗಳನ್ನು ಸಾಧ್ಯವಾಯಿತು ಓದಲು | |
Bengali: পরিস্থিতিকে বিজ্ঞানের এমন বলা হয় জিরো শ্যাডো ডে ভাষায় | |
[2] Rohan Badlani, Rafael Valle, Kevin J. Shih, João Felipe Santos, Siddharth Gururani, and Bryan Catanzaro, “RADMMM: Multilingual multiaccented multispeaker text to speech,” arXiv, 2023. Samples for RADMMM available here.
[3] Sungwon Kim, Kevin J. Shih, Rohan Badlani, João Felipe Santos, Evelina Bakhturina, Mikyas T. Desta, Rafael Valle, Sungroh Yoon, and Bryan Catanzaro, “P-Flow: A fast and data-efficient zero-shot TTS through speech prompting,” in Thirty-seventh Conference on Neural Information Processing Systems, 2023. Samples from P-Flow available here.
[4] Jungil Kong, Jaehyeon Kim, and Jaekyoung Bae, “HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis,” Advances in Neural Information Processing Systems, vol. 33, pp. 17022–17033, 2020.