This paper is based on work Paul did as part of an internship in our team at Google in London.
The paper is about the length of pauses between sentences. Speech synthesis is typically done sentence by sentence, so one has to decide how much silence to put between these sentences. What we discovered during the internship is that people do not seem to be very sensitive to differences in the lengths of these pauses between sentences, unless the difference is huge.
This is good news in a way (as in: you might not have to care about this much), but on the other hand it is a pity if you are working on models that predict the lengths of these pauses, as any improvement your model makes is unlikely to be picked up by raters using the current evaluation methods.
I am proud that the first ever internship I supervised led to interesting findings, and a nice publication too. Way to go Paul!
MOS vs. AB: Evaluating Text-to-Speech Systems Reliably Using Clustered Standard Errors was accepted for INTERSPEECH 2023, in Dublin.
This is a very interesting paper I think. Everyone doing research in TTS has to evaluate their TTS systems. One decision that will always pop up is: what method do we choose? Do we go for an MOS test, testing the system by itself, or do we go for a side-by-side comparison, comparing the new system to another one (or to recorded speech).
How to choose one over the other? Does it matter? Is one more robust or more sensitive than the other?
If these considerations have ever occurred to you... read the paper ;-)
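To give a flavour of the "clustered standard errors" part of the title: listeners each rate many stimuli, so their ratings are not independent, and clustering the standard errors by listener accounts for that. Here is a tiny illustrative sketch (mine, not the paper's code), using statsmodels:

```python
# Toy example: MOS-style ratings for two systems, clustered by listener.
# This illustrates the general idea, not the analysis from the paper.
import pandas as pd
import statsmodels.formula.api as smf

ratings = pd.DataFrame({
    "listener": ["l1", "l1", "l2", "l2", "l3", "l3", "l4", "l4"],
    "system":   ["A",  "B",  "A",  "B",  "A",  "B",  "A",  "B"],
    "score":    [3.0,  4.0,  3.5,  4.5,  4.0,  4.0,  3.0,  5.0],
})

# OLS of score on system, with standard errors clustered by listener:
# ratings from the same listener are correlated, so treating them as
# independent would make the system difference look more certain than it is.
fit = smf.ols("score ~ C(system)", data=ratings).fit(
    cov_type="cluster", cov_kwds={"groups": ratings["listener"]}
)
print(fit.summary())
```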
I am really looking forward to giving a talk at the research group I was part of when I did my PhD. I wouldn't be surprised, btw, if the audience is not completely up to speed with all the ins and outs of speech synthesis/TTS (I certainly wasn't back when I was still there). So, it is an interesting challenge for me to come up with a nice talk anyway!
Slides coming soon...
Training Text-To-Speech Systems From Synthetic Data: A Practical Approach For Accent Transfer Tasks by Lev Finkelstein, Heiga Zen, Norman Casagrande, Chun-an Chan, Ye Jia, myself, Alexey Petelin, Jonathan Shen, Vincent Wan, Yu Zhang, Yonghui Wu and Rob Clark was accepted to INTERSPEECH 2022.
This paper is about transferring the accent of one speaker to another speaker, who does not have that accent, while preserving the speaker characteristics of the target speaker.
High quality transfer models are available, but they are typically expensive to run, and they can have reliability issues.
Other models may be more efficient and reliable, but they might not be as good at accent transfer.
This paper shows how to use speech data generated by the high-quality but expensive model to train an efficient and reliable model.
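Very roughly, the recipe looks like this (a schematic sketch with made-up function names, not the actual pipeline from the paper): run the expensive transfer model offline to build a synthetic corpus in the target speaker's voice but with the source accent, then train the efficient model on that corpus as if it were ordinary data.

```python
# Schematic sketch of the two-stage recipe; all function names are hypothetical.

def build_synthetic_corpus(texts, transfer_model, target_speaker, accent):
    """Step 1: run the slow, high-quality accent-transfer model offline."""
    corpus = []
    for text in texts:
        audio = transfer_model.synthesize(text, speaker=target_speaker, accent=accent)
        corpus.append((text, audio))
    return corpus

def train_efficient_model(model, corpus, epochs=10):
    """Step 2: train the fast, reliable model on the synthetic corpus."""
    for _ in range(epochs):
        for text, audio in corpus:
            model.train_step(text, audio)
    return model
```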
Improving the Prosody of RNN-based English Text-To-Speech Synthesis by Incorporating a BERT Model by me, Manish Sharma and Rob Clark is an attempt to marry the two worlds of Natural Language Understanding (NLU) and Text-To-Speech.
The idea is that the prosody of synthetic speech improves if a BERT model is involved, as BERT models incorporate syntactic and semantic (world) knowledge.
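The rough shape of the idea, in a sketch (using the Hugging Face BERT implementation for illustration; the paper uses an RNN-based TTS stack, so this is not the actual setup): compute contextual BERT embeddings for the input text and feed them to the synthesis model as additional conditioning, next to the usual linguistic features.

```python
# Illustration only: extract BERT features to condition a TTS model on.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

text = "The old man the boat."  # garden-path sentence where prosody really matters
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = bert(**inputs)

# One contextual vector per wordpiece; a TTS front end would align these with
# the phoneme sequence and concatenate them to the usual linguistic features.
bert_features = outputs.last_hidden_state  # shape: (1, num_wordpieces, 768)
print(bert_features.shape)
```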
StrawNet: Self-Training WaveNet for TTS in Low-Data Regimes by Manish Sharma, me and Rob Clark is about distilling Parallel WaveNet models.
Parallel WaveNet student models are typically distilled using the original dataset the teacher WaveNet model was trained on.
This doesn't work all that well if that dataset is relatively small, and the idea of this paper is to add additional synthesized speech samples (generated by the teacher model) to the dataset used for distilling the student model. Nice and simple, and it works!
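In sketch form (hypothetical helper names, not the paper's code), the change is essentially in how the distillation set is built:

```python
# Schematic sketch of the StrawNet idea: pad out a small distillation set with
# speech synthesized by the teacher WaveNet itself. Helper names are made up.

def build_distillation_set(original_data, teacher, extra_texts):
    """Original recordings plus teacher-generated samples for extra texts."""
    synthetic = [(text, teacher.synthesize(text)) for text in extra_texts]
    return list(original_data) + synthetic

def distill_student(student, teacher, distillation_set, steps=100_000):
    """Standard density distillation, but over the augmented set."""
    for step in range(steps):
        text, audio = distillation_set[step % len(distillation_set)]
        student.distill_step(teacher, text, audio)
    return student
```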
The blog post is based on our SSW10 paper.
This paper describes the variational auto-encoder (VAE) network used currently for text-to-speech (TTS) synthesis in the Google Assistant for the most frequently used voices.
The talk will be about my work on byte-level machine reading models.
The slides are over here.
In my first ever blogpost, published on Medium, I try to explain how byte-level models work, how they compare to character-level NLP models, and to word-level models.
Enjoy reading it!
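As a tiny taster (my own toy example, not taken from the post): the same sentence looks quite different depending on the level you model it at.

```python
# Word-, character- and byte-level views of the same string.
text = "naïve café"

word_level = text.split()                # ['naïve', 'café']
char_level = list(text)                  # 10 characters
byte_level = list(text.encode("utf-8"))  # 12 bytes: ï and é each take two bytes in UTF-8

print(len(word_level), len(char_level), len(byte_level))  # 2 10 12
```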
The slides can be downloaded as one file over here, but are also available as separate slide decks per session from the NN4IR website.
Lastly, we also wrote this overview paper.
The slides are available as one file over here, or per session from the NN4IR website.
Additionally, here is the overview paper.
Friday December 15 2017 I successfully defended my thesis, Text Understanding for Computers, at the Agnietenkapel in Amsterdam.
Many thanks to my committee members: prof. dr. Krisztian Balog (University of Stavanger), prof. dr. Antal van den Bosch (Radboud University, Meertens Instituut), prof. dr. Franciska de Jong (Utrecht University), dr. Evangelos Kanoulas (University of Amsterdam), dr. Christof Monz (University of Amsterdam), prof. dr. Khalil Sima'an (University of Amsterdam), dr. Aleksandr Chuklin (Google Research) and dr. Claudia Hauff (Delft University of Technology). Also, many thanks to my co-promotor Joris van Eijnatten (Utrecht University), and most of all, to my supervisor Maarten de Rijke.
Here is a PDF of the book.
Stay tuned for the PDF...
Please read the excellent blogpost on the ACM website. And thanks everyone for tweeting.
The final slides are now available on nn4ir.com.
Here is the pre-print on arXiv.
Here is the link to the interview on the New Scientist website.
BTW, I also designed the logo... ;-)
Many, many thanks to all annotators who contributed their time and effort!!!
Camera-ready PDFs will follow shortly.
The algorithms I developed to monitor changes in vocabulary over time will be implemented in a tool that provides access to a corpus of digitized historical Dutch newspapers (covering the last four centuries) used by digital humanities scholars.