MOS vs. AB: Evaluating Text-to-Speech Systems Reliably Using Clustered Standard Errors was accepted for INTERSPEECH 2023, in Dublin.
I think this is a very interesting paper. Everyone doing research in TTS has to evaluate their TTS systems, and one decision that always pops up is: what method do we choose? Do we go for an MOS test, rating the system by itself, or for a side-by-side (AB) comparison, comparing the new system to another one (or to recorded speech)?
How do you choose one over the other? Does it matter? Is one more robust or more sensitive than the other?
If these considerations have ever occurred to you... read the paper ;-)
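To give a flavor of what "clustered standard errors" means here: ratings in a listening test are not independent, because the same listener rates many utterances. A minimal sketch of a cluster-robust standard error for a mean rating, clustering on the listener (this is my own toy illustration, not the paper's analysis code, and the function name and data are made up):

```python
import math

def clustered_se_of_mean(scores, clusters):
    """Cluster-robust standard error of a mean rating.

    scores:   individual ratings (e.g. MOS scores on a 1-5 scale)
    clusters: parallel list of cluster ids (e.g. which listener gave
              each rating); ratings within a cluster may be correlated.
    """
    n = len(scores)
    mean = sum(scores) / n
    # Sum the residuals within each cluster first, then square the sums:
    # correlated errors inside a cluster are not treated as if they
    # were independent observations.
    per_cluster = {}
    for s, c in zip(scores, clusters):
        per_cluster[c] = per_cluster.get(c, 0.0) + (s - mean)
    var = sum(t * t for t in per_cluster.values()) / (n * n)
    return mean, math.sqrt(var)

# Two listeners: one consistently generous, one consistently harsh.
scores   = [5, 5, 4, 5, 2, 3, 2, 2]
clusters = ["A", "A", "A", "A", "B", "B", "B", "B"]
mean, se = clustered_se_of_mean(scores, clusters)
```

On this toy data the clustered standard error (about 0.88) is much larger than the naive i.i.d. one (0.5): ignoring the listener clustering would make the test look far more precise than it really is.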
I am really looking forward to giving a talk at the research group I was part of when I did my PhD. I wouldn't be surprised, by the way, if the audience is not completely up to speed on all the ins and outs of speech synthesis/TTS (I certainly wasn't back when I was still there). So it is an interesting challenge for me to come up with a nice talk anyway!
Slides coming soon...
Training Text-To-Speech Systems From Synthetic Data: A Practical Approach For Accent Transfer Tasks by Lev Finkelstein, Heiga Zen, Norman Casagrande, Chun-an Chan, Ye Jia, myself, Alexey Petelin, Jonathan Shen, Vincent Wan, Yu Zhang, Yonghui Wu and Rob Clark was accepted to INTERSPEECH 2022.
This paper is about transferring the accent of one speaker to another speaker who does not have that accent, while preserving the speaker characteristics of the target speaker.
High-quality transfer models are available, but they are typically expensive to run, and they can have reliability issues.
Other models may be more efficient and reliable, but they might not be as good at accent transfer.
This paper shows how to use speech data generated by the high-quality but expensive model to train an efficient and reliable one.
Improving the Prosody of RNN-based English Text-To-Speech Synthesis by Incorporating a BERT Model by me, Manish Sharma and Rob Clark is an attempt to marry the two worlds of Natural Language Understanding (NLU) and Text-To-Speech.
The idea is that the prosody of synthetic speech improves if a BERT model is involved, as BERT models incorporate syntactic and semantic (world) knowledge.
StrawNet: Self-Training WaveNet for TTS in Low-Data Regimes by Manish Sharma, me and Rob Clark is about distilling Parallel WaveNet models.
Parallel WaveNet student models are typically distilled using the original dataset the teacher WaveNet model was trained on.
This doesn't work all that well if that dataset is relatively small, and the idea of this paper is to add additional synthesized speech samples (generated by the teacher model) to the dataset used for distilling the student model. Nice and simple, and it works!
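Schematically, the self-training idea looks something like this (a sketch of my own, not the paper's code; the function and the stand-in teacher are made up for illustration):

```python
def build_distillation_set(recordings, teacher_synthesize, extra_texts):
    """Combine real recordings with teacher-synthesized samples.

    recordings:         (text, audio) pairs from the original small dataset
    teacher_synthesize: callable text -> audio, standing in for running
                        inference with the trained teacher WaveNet
    extra_texts:        unseen texts the teacher turns into extra samples
    """
    synthetic = [(text, teacher_synthesize(text)) for text in extra_texts]
    # The student is then distilled on real + synthetic samples together.
    return recordings + synthetic

# Stand-in teacher; in reality this would be teacher WaveNet inference.
teacher = lambda text: f"synthetic-audio<{text}>"
distill_data = build_distillation_set(
    [("hello world", "real-audio-001")],
    teacher,
    ["extra sentence one", "extra sentence two"],
)
```

The point is only that the distillation set grows beyond the original recordings: the teacher's own outputs stand in for the data it never had.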
This paper describes the variational auto-encoder (VAE) network currently used for text-to-speech (TTS) synthesis in the Google Assistant for the most frequently used voices.
Enjoy reading it!
On Friday, December 15, 2017, I successfully defended my thesis, Text Understanding for Computers, at the Agnietenkapel in Amsterdam.
Many thanks to my committee members: prof. dr. Krisztian Balog (University of Stavanger), prof. dr. Antal van den Bosch (Radboud University, Meertens Instituut), prof. dr. Franciska de Jong (Utrecht University), dr. Evangelos Kanoulas (University of Amsterdam), dr. Christof Monz (University of Amsterdam), prof. dr. Khalil Sima'an (University of Amsterdam), dr. Aleksandr Chuklin (Google Research) and dr. Claudia Hauff (Delft University of Technology). Also, many thanks to my co-promotor Joris van Eijnatten (Utrecht University), and most of all, to my supervisor Maarten de Rijke.
Here is a PDF of the book.
Stay tuned for the PDF...
The final slides are now available on nn4ir.com.
Here is the pre-print on arXiv.
Here is the link to the interview on the New Scientist website.
BTW, I also designed the logo... ;-)
Camera-ready PDFs will follow shortly.
The algorithms I developed to monitor changes in vocabulary over time will be implemented in a tool that provides access to a corpus of digitized historical Dutch newspapers (covering the last four centuries) used by digital humanities scholars.