Mozilla layoff affecting voice projects

The recent Mozilla layoff unfortunately seems to affect some of their voice projects, Mozilla Voice STT (DeepSpeech) and Common Voice:

It remains to be seen whether and how these projects will continue.

I saw this, and it made me wonder whether building on top of MozillaTTS (or DeepSpeech/STT) is a good idea going forward.

Kaldi seems to be doing quite well, as shown by vosk. I’m working now on a Kaldi recipe generator so we can independently train TDNN-250 nnet3 models from public speech data.

For text to speech, I’ve heard good things about TensorflowTTS. If it’s anything like MozillaTTS, though, development is rapid and models are constantly out of date. For any kind of stability, I think we’ll need to train our own voices too.

I have been wondering about DeepSpeech and Common Voice anyway, irrespective of the layoffs, because of their direction and implementation?!? It has confused me at times, and it's like they ran out of steam far before the layoffs.

Also, I don't know if that is still on schedule for an end-of-2020 release?
So much has been hit by COVID that the state of play is confusing for many initiatives.

Vosk already has a lot of models, but PyTorch frameworks are starting to look really good. ASR/TTS models are big and complex to train, and it seems a strange target to head for when we don't even have a good, simple, cross-language open KWS model, which has much simpler requirements and processing needs than the big ASR and TTS models.

Working on splinter recipe generators specific to a smaller community, rather than sharing in wider community initiatives, will only end up with less success.

I have someone working on the KWS stuff already, so I’m focusing on ASR/TTS. There’s a huge overlap in what I need to learn to properly do the Kaldi recipes as well as train ASR/TTS models.

Despite their claims, the deep learning ASR/TTS models do not go directly from speech to text or text to speech. They make a stop off at phonemes, and (in my opinion) seem to do it poorly. MozillaTTS, for example, runs everything through phonemizer which ultimately runs espeak-ng to guess phonetic pronunciations. This is a hack for English, since our orthography is absolutely awful and pronunciation can change depending on part of speech (“I refuse to put the refuse in the bin”).
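To make the homograph problem concrete, here is a toy grapheme-to-phoneme lookup showing why English pronunciation can depend on part of speech. The tiny lexicon, the `to_phonemes` helper, and the ARPAbet-style phoneme strings are illustrative assumptions, not the actual output of phonemizer or espeak-ng:

```python
# Toy G2P lookup keyed on (word, part-of-speech): the same spelling
# maps to different phoneme strings, so a pure text-to-phoneme guesser
# has to disambiguate from context.
LEXICON = {
    ("refuse", "VERB"): "R IH0 F Y UW1 Z",   # "I refuse to..."
    ("refuse", "NOUN"): "R EH1 F Y UW2 S",   # "...the refuse in the bin"
    ("read", "PRESENT"): "R IY1 D",
    ("read", "PAST"): "R EH1 D",
}

def to_phonemes(word: str, pos: str) -> str:
    """Look up a pronunciation for a (word, part-of-speech) pair."""
    return LEXICON[(word.lower(), pos)]

print(to_phonemes("refuse", "VERB"))  # R IH0 F Y UW1 Z
print(to_phonemes("refuse", "NOUN"))  # R EH1 F Y UW2 S
```

A dictionary alone is not enough, of course: the caller still needs a part-of-speech tagger to pick the right key, which is exactly the context problem the deep models push onto espeak-ng.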

So I’m learning about phonologies of the various Rhasspy languages and starting to understand why I’ve seen such wildly different choices for phonemes between Pocketsphinx models, Kaldi recipes, espeak, and MaryTTS. This stuff doesn’t matter as much when you can just overwhelm a model with data, but we don’t have that luxury outside of English.

This looks cool; I hope they continue with their planned schedule.

Yeah, my understanding of ASR, in simple terms, is that it works on phones, with something on top that tries to build context out of the phones and the sentence.

What's interesting is the infrastructure: in open source, for some reason, we try to emulate Google/Amazon hardware functionality in the box, whilst they process most of it in the cloud.
The area you are looking at is big-tech cloud territory; IMO it's stupid to try and shoehorn it into Google/Amazon-style enclosures.

It would be great to see a home AI server that is interoperable with a range of satellite voice processors, and again, yeah, I am saying satellites should just be audio over RTP with a common, compatible protocol, and none of this stupid proprietary bloat.
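A minimal sketch of what a "just audio" satellite could look like: raw 16-bit, 16 kHz mono PCM chunks sent over UDP to the home server, with only a sequence number prefixed. This is an illustration of the idea, not a real protocol; an actual deployment would use proper RTP (RFC 3550) with timestamps and payload typing. The packet format, chunk size, and helper names here are all assumptions:

```python
# Loopback demo: a "satellite" sends one 20 ms PCM chunk to the "server".
import socket
import struct

SAMPLE_RATE = 16000
CHUNK_SAMPLES = 320  # 20 ms at 16 kHz, a common frame size for voice

def make_packet(seq: int, pcm: bytes) -> bytes:
    """Prefix a chunk with a 4-byte big-endian sequence number so the
    server can detect loss/reordering (a tiny stand-in for an RTP header)."""
    return struct.pack(">I", seq) + pcm

def parse_packet(data: bytes):
    """Split a packet back into (sequence number, PCM payload)."""
    seq = struct.unpack(">I", data[:4])[0]
    return seq, data[4:]

server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server.bind(("127.0.0.1", 0))  # any free port
server.settimeout(2.0)
port = server.getsockname()[1]

satellite = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
pcm_chunk = b"\x00\x00" * CHUNK_SAMPLES  # 320 samples of silence
satellite.sendto(make_packet(7, pcm_chunk), ("127.0.0.1", port))

data, _ = server.recvfrom(4096)
seq, pcm = parse_packet(data)
print(seq, len(pcm))  # 7 640
```

The point of keeping the satellite this dumb is that all the heavy lifting (KWS, ASR, TTS) stays on the one edge server, and any cheap board that can ship PCM over the network can be a satellite.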

ASR/TTS is very much part of that singular HomeAI server, so I guess KWS, and the lack of a good one, can be forgotten in this case.

What is needed for each item is a model-generator application that can save and exchange profiles, even down to the MFCC frame level, since frame size can be optimised considerably from some regional accents to others.
Asian and Hispanic phones are often far faster, but the engine in use is often the same as for much slower English.
That's even before trying to phonemize, but apparently there are only 39 …? Something like that.
But why would any application try to guess pronunciation? There are about 150k words in current use, and the pronunciation metadata is small. A typical text word is approximately 10 bytes, so a whole dictionary, on the assumption that pronunciation data is approximately word-sized, would be 150k × 10 bytes × 2?
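A quick back-of-envelope check of that arithmetic, using the post's own assumed figures (150k words, ~10 bytes per word, pronunciation data roughly word-sized):

```python
# Estimate the size of a full pronunciation dictionary from the
# figures quoted above: word text plus similarly sized phoneme string.
WORDS = 150_000
BYTES_PER_WORD = 10

total_bytes = WORDS * BYTES_PER_WORD * 2  # word + pronunciation
print(total_bytes)                               # 3000000
print(round(total_bytes / (1024 * 1024), 2))     # 2.86 (MiB)
```

So on those assumptions the whole lookup table is about 3 MB, which is tiny next to the models themselves.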
It's inflection that TTS tries to guess, not pronunciation, so that it conveys more sense than a robotic drawl.

How the frig that works is beyond me, but if you are looking at new ASR/TTS, please don't shoehorn it into the Tardis we don't have; look more at a single server on the edge servicing multiple off-the-shelf audio satellites.
Open source could beat commercial AI if it stopped just trying to copy, doing things on-device that even commercial AI doesn't do (they offload to the cloud), under the false assumption that it does and so open source should too.

Make the model generators that create models which can save a profile fingerprint, and leave quantising down to the Tardis to others.

I would look for a PyTorch framework, purely from the number of new projects that seem to have adopted it and spurned Keras and straight TensorFlow. Not that I know, but following the crowd is often a good idea.