Now it’s finished:)
The project got a new name, Scribosermo, and can now be found here:
The new models can be trained very fast (~3 days on 2x1080Ti to reach SOTA in German) and with comparatively small datasets (~280h for competitive results in Spanish). With a little more time and data, the following word error rates (WER) were achieved on the CommonVoice test set:
| 7.2 % | 3.7 % | 10.0 % | 11.7 % |
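For anyone who wants to check numbers like these on their own transcripts, here is a minimal sketch of how a word error rate is usually computed: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name `word_error_rate` is just for illustration and is not taken from the project. The project's own evaluation may normalize text differently, or sum the errors over the whole test set before dividing.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative helper, not taken from the project's code.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)
```

As a usage example, `word_error_rate("the cat sat on the mat", "the cat sit on mat")` counts one substitution and one deletion over six reference words.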
Training is even simpler than it was with DeepSpeech, and adding new languages is easy as well. After training, the models can be exported to TFLite format for easier inference. They are able to run faster than real-time on a Raspberry Pi 4.
The only downsides are that the models can't be directly integrated into the DeepSpeech bindings (technically possible, but I had no need for it) and that they don't support streaming anymore (at least until someone has the time to implement it). I don't think the missing streaming feature should be a problem, because our inputs are quite short and are usually processed in 1-2 seconds on a Raspberry Pi.
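If you want to check the "faster than real-time" claim on your own hardware, a rough sketch is to measure the real-time factor: processing time divided by audio duration, where a value below 1.0 means faster than real-time. The names `real_time_factor` and `transcribe` are placeholders for illustration. The callable stands in for whatever inference function you use and is not part of the project's API.

```python
import time
from typing import Any, Callable


def real_time_factor(
    transcribe: Callable[[Any], Any], audio: Any, audio_seconds: float
) -> float:
    """Return processing time divided by audio duration.

    `transcribe` is a placeholder for your own inference call.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")

    start = time.perf_counter()
    transcribe(audio)
    elapsed = time.perf_counter() - start

    return elapsed / audio_seconds
```

For a 3 second command that takes 1.5 seconds to process, this gives a factor of 0.5. Without streaming, the full processing time is added after the speaker has finished, which is why short inputs matter here.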