One of the top 5 teams in the Lightweight, Multi-speaker, Multi-lingual Indic Text-to-Speech Grand Challenge, organized as part of IEEE ICASSP 2023.
We introduce VANI (वाणी), a very lightweight, multi-lingual, accent-controllable speech synthesis system. Our model builds upon the disentanglement strategies proposed in RADMMM [1] and supports explicit control of accent, language, speaker, and fine-grained F0 and energy features for speech synthesis. We use the Indic languages dataset, released for LIMMITS 2023 as part of the ICASSP Signal Processing Grand Challenge, to synthesize speech in 3 different languages. Our model supports transferring the language of a speaker while retaining their voice and adopting the native accent of the target language. We use the large-parameter RADMMM model for Track 1 and the lightweight VANI model for Tracks 2 and 3 of the competition.
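To make the control surface concrete, here is a minimal sketch of the kind of explicit conditioning described above: speaker identity, text language, target accent, and fine-grained F0/energy knobs. All names here (`VaniConditioning`, `cross_lingual_request`, the speaker and language codes) are illustrative assumptions, not the actual VANI API.

```python
from dataclasses import dataclass

@dataclass
class VaniConditioning:
    """Disentangled attributes a synthesis request would condition on."""
    text: str
    speaker: str            # voice identity, e.g. "telugu_male" (illustrative)
    language: str           # language of the input text, e.g. "hi" (illustrative)
    accent: str             # accent to render, e.g. "hi" for native Hindi accent
    f0_shift: float = 0.0   # fine-grained pitch (F0) control, illustrative units
    energy_scale: float = 1.0  # fine-grained energy control

def cross_lingual_request(text: str, speaker: str, target_lang: str) -> VaniConditioning:
    """Transfer a speaker to a new language: keep their voice identity,
    but adopt the target language's native accent."""
    return VaniConditioning(text=text, speaker=speaker,
                            language=target_lang, accent=target_lang)

# A Telugu male speaker synthesizing Hindi text with a native Hindi accent.
req = cross_lingual_request("नमस्कार", speaker="telugu_male", target_lang="hi")
print(req.speaker, req.language, req.accent)
```

Because speaker, language, and accent are independent inputs, the same voice can be paired with any language/accent combination, which is what the cross-lingual transfer above relies on.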
Check out our NVIDIA GTC 2024 talk in San Jose, California, USA: "Speaking in Every Language: A Quick-Start Guide to TTS Models for Accented, Multilingual Communication" [S62517].
Check out my Substack blog post about this project here.
Here are the input texts (and their respective translations):
_Audio samples: each input text is synthesized for six speakers (Native Hindi Female/Male, Native Marathi Female/Male, Native Telugu Female/Male) in each of the three languages; the embedded audio players are not reproduced in this text version._
The authors would like to thank Roman Korostik for discussions on spectrogram enhancement, Evelina Bakhturina for discussions on dataset cleanup, and Sirisha Rella for Telugu evaluation.
[1] Rohan Badlani, Rafael Valle, Kevin J. Shih, João Felipe Santos, Siddharth Gururani, and Bryan Catanzaro, "RADMMM: Multilingual Multiaccented Multispeaker Text to Speech," arXiv, 2023.