Multi-speaker Multi-lingual Zero-shot Indic TTS

One of the top 5 winning teams in the Multi-speaker, Multi-lingual Indic TTS with Voice Cloning Grand Challenge, organized as part of IEEE ICASSP 2024.

View the paper on arXiv 2401.13851

Scaling NVIDIA’s Multi-speaker Multi-lingual TTS Systems with Zero-Shot TTS to Indic Languages

In this paper, we describe the TTS models developed by NVIDIA for the MMITS-VC [1] (Multi-speaker, Multi-lingual Indic TTS with Voice Cloning) 2024 Challenge. In Tracks 1 and 2, we utilize RAD-MMM [2] to perform few-shot TTS by training additionally on 5 minutes of target speaker data. In Track 3, we utilize P-Flow [3] to perform zero-shot TTS by training on the challenge dataset as well as external datasets. We use HiFi-GAN [4] vocoders for all submissions. RAD-MMM performs competitively on Tracks 1 and 2, while P-Flow ranks first on Track 3, with a mean opinion score (MOS) of 4.4 and a speaker similarity score (SMOS) of 3.62.
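
For readers unfamiliar with the metric, a mean opinion score is the arithmetic mean of listener ratings, typically collected on a 1 to 5 scale. The sketch below is illustrative only: the function name `mean_opinion_score`, the normal-approximation confidence interval, and the ratings are assumptions made for this example, not the challenge's evaluation protocol or data.

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (MOS, 95% CI half-width) for listener ratings on a 1-5 scale.

    Hypothetical helper for illustration; not the challenge's scoring code.
    """
    if not ratings:
        raise ValueError("need at least one rating")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie in [1, 5]")

    mos = statistics.fmean(ratings)
    if len(ratings) < 2:
        # A single rating carries no spread information.
        return mos, 0.0

    # Normal approximation: 1.96 * sample standard deviation / sqrt(n).
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


# Made-up ratings from five hypothetical listeners for one utterance.
example_ratings = [4, 5, 3, 4, 4]
mos, ci = mean_opinion_score(example_ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

A speaker similarity score (SMOS) can be aggregated the same way, with listeners rating how closely the synthesized voice matches the reference speaker instead of rating naturalness.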

Check out our ICASSP 2024 talk in Seoul, South Korea: Scaling NVIDIA’s Multi-speaker Multi-lingual TTS Systems with Zero-Shot TTS to Indic Languages

Check out my Substack blog post about this project here.

This work was featured in NVIDIA press releases here and here, and in Analytics India Magazine here.

Audio Samples

Track 1: Few-shot TTS + VC on the challenge dataset with RAD-MMM

Here is a 3-second speaker reference from a native Kannada speaker (female):

Here is the same speaker, speaking in various Indic languages:

Audio Language: Input Text
Marathi: मी थायलंडला चाललोय, बाबांनी सगळं नियोजन करून ठेवलं आहे.
Kannada: ನನ್ನ ಹದಿನೆಂಟು ತಿಂಗಳ ವಯಸ್ಸಿನವನಾಗಿದ್ದಾಗ ಸಹೋದರ ಸಂಖ್ಯೆಗಳನ್ನು ಸಾಧ್ಯವಾಯಿತು ಓದಲು
Bengali: পরিস্থিতিকে বিজ্ঞানের এমন বলা হয় জিরো শ্যাডো ডে ভাষায়

Track 2: Few-shot TTS + VC on challenge + external datasets with RAD-MMM

Here is a 3-second speaker reference from a native Kannada speaker (female):

Here is the same speaker, speaking in various Indic languages:

Audio Language: Input Text
Marathi: मी थायलंडला चाललोय, बाबांनी सगळं नियोजन करून ठेवलं आहे.
Kannada: ನನ್ನ ಹದಿನೆಂಟು ತಿಂಗಳ ವಯಸ್ಸಿನವನಾಗಿದ್ದಾಗ ಸಹೋದರ ಸಂಖ್ಯೆಗಳನ್ನು ಸಾಧ್ಯವಾಯಿತು ಓದಲು
Bengali: পরিস্থিতিকে বিজ্ঞানের এমন বলা হয় জিরো শ্যাডো ডে ভাষায়

Track 3: Zero-shot TTS + VC on challenge + external datasets with P-Flow

Here is a 3-second speaker reference from a native Kannada speaker (female):

Here is the same speaker, speaking in various Indic languages:

Audio Language: Input Text
Marathi: मी थायलंडला चाललोय, बाबांनी सगळं नियोजन करून ठेवलं आहे.
Kannada: ನನ್ನ ಹದಿನೆಂಟು ತಿಂಗಳ ವಯಸ್ಸಿನವನಾಗಿದ್ದಾಗ ಸಹೋದರ ಸಂಖ್ಯೆಗಳನ್ನು ಸಾಧ್ಯವಾಯಿತು ಓದಲು
Bengali: পরিস্থিতিকে বিজ্ঞানের এমন বলা হয় জিরো শ্যাডো ডে ভাষায়

Collaborators

  1. Rohan Badlani, Machine Learning Researcher, NVIDIA
  2. Sungwon Kim, Applied Deep Learning Research Scientist, NVIDIA
  3. Rafael Valle, Research Manager and Scientist, NVIDIA
  4. Bryan Catanzaro, Vice President, Applied Deep Learning Research, NVIDIA

References

  1. MMITS-VC 2024 challenge webpage

  2. Rohan Badlani, Rafael Valle, Kevin J. Shih, João Felipe Santos, Siddharth Gururani, and Bryan Catanzaro, “RAD-MMM: Multilingual multiaccented multispeaker text to speech,” arXiv, 2023. Samples for RAD-MMM available here

  3. Sungwon Kim, Kevin J. Shih, Rohan Badlani, João Felipe Santos, Evelina Bakhturina, Mikyas T. Desta, Rafael Valle, Sungroh Yoon, and Bryan Catanzaro, “P-Flow: A fast and data-efficient zero-shot TTS through speech prompting,” in Thirty-seventh Conference on Neural Information Processing Systems, 2023. Samples from P-Flow available here

  4. Jungil Kong, Jaehyeon Kim, and Jaekyoung Bae, “HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis,” Advances in Neural Information Processing Systems, vol. 33, pp. 17022–17033, 2020.