Rhasspy 3 Developer Preview

Yeah, I hadn't really given GStreamer Python much thought until then, but I had a look at GStreamer/gst-python on GitHub (the Python binding overrides, since merged into the main GStreamer repo) and GST sinks/sources are not that complex to make.
A websocket as a GST sink or source (still trying to work out the GST-specific terminology; I would call it a source, but maybe it's a sink as it will be in a pipeline...) would be really cool and would likely make many things easier.
GStreamer is great, as wherever possible you use an existing module rather than writing your own, and any module you do write, in whatever language, just slots into a GST pipeline.
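For the terminology: in GStreamer a source element produces data into a pipeline and a sink consumes it, so an element pushing mic audio out over a websocket would be a sink. Something like this is what I have in mind as a sketch (wssink is a made-up element name; the rest are standard GStreamer elements):

```sh
# Sketch only -- "wssink" is a hypothetical websocket sink element;
# autoaudiosrc/audioconvert/audioresample are standard GStreamer elements.
gst-launch-1.0 autoaudiosrc ! audioconvert ! audioresample \
    ! audio/x-raw,format=S16LE,rate=16000,channels=1 \
    ! wssink uri=ws://server:8080
```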

Does that Kaldi GST build still work?

two things about R3 with porcupine I’m testing in a Ubuntu VM:

  1. it seems the base config is not being overridden by the user config.
    Changing the wake file to grasshopper_linux.ppn only took effect when I put it in the base config, but not when I put it in the user config file. Anyone else see this?
  2. I created a custom wake word ( linux *.ppn ) file with the Picovoice console and can’t get it to work with R3.
    Is there something I need to do for compatibility like an access key or something?
    Error I got is shown below.
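(On point 1: the debug output below even logs "Skipping /home/sass/working/rhasspy3/config/configuration.yaml", so the user config seems to not be loaded at all.) For reference, the override I'm attempting looks roughly like this (the exact key layout is my guess reconstructed from the PipelineProgramConfig fields in the log, so treat it as a sketch, not the documented schema):

```yaml
# config/configuration.yaml (user config) -- hypothetical key layout
wake:
  name: porcupine1
  template_args:
    model: grasshopper_linux.ppn
```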

What can I use for a custom wake word if/until I can get porcupine working?

$ script/run bin/wake_detect.py --debug

DEBUG:rhasspy3.core:Loading config from /home/sass/working/rhasspy3/rhasspy3/configuration.yaml
DEBUG:rhasspy3.core:Skipping /home/sass/working/rhasspy3/config/configuration.yaml
DEBUG:wake_detect:mic program: PipelineProgramConfig(name='arecord', template_args=None, after=None)
DEBUG:wake_detect:wake program: PipelineProgramConfig(name='porcupine1', template_args={'model': 'alice_en_linux_v2_1_0.ppn'}, after=None)
DEBUG:rhasspy3.program:mic_adapter_raw.py ['--samples-per-chunk', '1024', '--rate', '16000', '--width', '2', '--channels', '1', 'arecord -q -D "default" -r 16000 -c 1 -f S16_LE -t raw -']
DEBUG:wake_detect:Detecting wake word
DEBUG:rhasspy3.program:.venv/bin/python3 ['bin/porcupine_stream.py', '--model', 'alice_en_linux_v2_1_0.ppn']
Traceback (most recent call last):
  File "/home/sass/working/rhasspy3/config/programs/wake/porcupine1/bin/porcupine_stream.py", line 110, in <module>
  File "/home/sass/working/rhasspy3/config/programs/wake/porcupine1/bin/porcupine_stream.py", line 61, in main
    porcupine = pvporcupine.create(
  File "/home/sass/working/rhasspy3/config/programs/wake/porcupine1/.venv/lib/python3.10/site-packages/pvporcupine/__init__.py", line 64, in create
Traceback (most recent call last):
  File "/home/sass/working/rhasspy3/bin/wake_detect.py", line 80, in <module>
  File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/lib/python3.10/asyncio/base_events.py", line 646, in run_until_complete
    return future.result()
  File "/home/sass/working/rhasspy3/bin/wake_detect.py", line 69, in main
    detection = await detect(rhasspy, wake_program, mic_proc.stdout)
  File "/home/sass/working/rhasspy3/rhasspy3/wake.py", line 109, in detect
    wake_event = wake_task.result()
  File "/home/sass/working/rhasspy3/rhasspy3/event.py", line 48, in async_read_event
    event_dict = json.loads(json_line)
  File "/usr/lib/python3.10/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.10/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.10/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
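Side note: the JSONDecodeError at the bottom looks like a secondary symptom. porcupine_stream.py died inside pvporcupine.create(), so wake_detect.py read an empty line where it expected a JSON event. Plain Python reproduces the exact message:

```python
import json

# An empty line where a JSON event was expected reproduces the error
# at the bottom of the traceback above.
try:
    json.loads("")
except json.JSONDecodeError as err:
    print(err)  # Expecting value: line 1 column 1 (char 0)
```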

Regarding Porcupine, I think your issue is that presently Rhasspy 3 is using Porcupine 1, which I don't think supports custom wake words generated from the present Picovoice Console, since those now require the access key to use. That functionality is in Porcupine 2. I know some users here (myself included) managed to patch Porcupine 2 into Rhasspy 2, but it doesn't appear to be in Rhasspy 3. Maybe a good enhancement request? :slight_smile:

Regarding your first point, I didn't play with the wake word yet; I was just testing using the satellite in the browser, but my overrides did appear to work (I replaced the larynx2 TTS with Mimic3, as there's a voice on there I like lol)


Thanks. So what are my options for a custom wake word then? Something besides Picovoice that doesn't require an API key?

I think the docs mention other wake word engines besides Picovoice that support this.

I believe Precise allows you to train your own wake model. I haven't played around with it personally, but I believe it was the engine used by Mycroft. You could also try installing Porcupine 2 and integrating it with Rhasspy 3, since the dev stack's modular nature should allow for that :slight_smile:


Wow, it's hard to find how to do this stuff (making a custom wake word with Mycroft Precise).

BTW, I also saw the openWakeWord project; that would require some code changes too.

This is an OVOS plugin for openWakeWord, an open-source wakeword or phrase detection system. It has competitive performance compared to Mycroft Precise or Picovoice Porcupine, can be trained on 100% synthetic data, and can run on a single Raspberry Pi 3 core.

I think the project page has a link to documentation on how to do it, with an included tool from what I remember: Training your own wake word · MycroftAI/mycroft-precise Wiki · GitHub


Thanks. Well, it's a build PITA; it requires 'tensorflow>=1.13,<1.14'.
Looks like maybe I got it installed.

Well, the TensorFlow build for Mycroft Precise using Python 3.10 went mostly OK until near the end, but it never completed. I'll have to roll my Python back to 3.7 and try again.

This is on my TODO list to fix, even though I don’t work for Mycroft anymore. Precise is such a simple model that it would be a tiny amount of PyTorch code these days.

Also, I have snowboy-seasalt as a Docker image if you want to train your own snowboy wake word.

Lastly, I plan to add snowman to the Rhasspy 3 wake word engine list.


After a couple of years using my custom-made voice assistant, I've come to the conclusion that there is a missing component in the system: the DSP that removes unwanted noise from the user input (be it kitchen noises or other people's voices, aka the cocktail party effect).

Without this component, any voice assistant pipeline will be erratic at best.

I firmly believe that it also has to be on the to-do list for any open source voice assistant to work well enough for wide user adoption.

There have been some impressive advancements in this area in the last 2 years with Google VoiceFilter-Lite and, more recently:

Using a generic KWS that feeds a speaker recognition system based on the captured keyword audio (or a personalized KWS tailored to specific voices), which then feeds a centralized voice separation model placed before the ASR component, will improve ASR and NLU confidence far more effectively than any other noise reduction, beamforming, or AEC system.

The best solution would be not only to remove/attenuate noises but to keep only the voice of the utterance. Recent ASR systems are today pretty tolerant of some amount of noise but often fall short with overlapping voices.

It also removes the need for multiple mics, which is a huge plus regarding hardware complexity (as demonstrated by recent Google Nest hardware changes).

My 2 cents :blush:


Snowboy is an iconic piece of KWS history as the first to employ a DNN system, but, rather like eSpeak, it is really pretty terrible in use now; the pace of technological change has left both of those early firsts far behind.
I guess because you can?

@fastjack Yeah, I am the same: often third-party noise is the dominant noise, and even with beamforming it just gets swamped by the likes of a TV or other voices.
The only code I had found was BUTSpeechFIT/speakerbeam on GitHub, but I've just been looking again.
etzinis/heterogeneous_separation (code and data recipes for the paper "Heterogeneous Target Speech Separation") seems new, and one I must have missed is Edresson/VoiceSplit (targeted voice separation by speaker-conditioned spectrogram). There's also a general BSS ML repo, fakufaku/torchiva (blind source separation with the independent vector analysis family of algorithms in torch).

But for a while I have thought that any BSS with a personalised VAD or KWS on each stream it splits into could detect a target (be it via personalised VAD or a KW), as that is the drawback with BSS algorithms: generally they split into nSignals dictated by nMics, finding distinct sources from the TDOA they detect, with no concept of content.

It seems that is what Espressif are doing, as it's a simple BSS splitting into 2/3 streams, where they simply put a KWS on each stream to select the target.
VoiceFilter-Lite is an ML-based BSS that steers a target into a single channel and can use the target speaker to further filter the required voice, all in a single model.
For humans to interject requires 2 voices (noise sources), and when you get 3 or more it quickly becomes a cacophony; Google's point is that with just 2 mics and a clever lightweight model they can get much better results in all the scenarios that are usable anyway.
Espressif have a 3-mic version, I guess because they have no VoiceFilter-Lite; as with beamforming, more mics means more resolution and separation, but it is also another channel to scan to find the one you require.
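The stream-selection idea is simple enough to sketch in Python (everything here is hypothetical glue code, not any real BSS or KWS library):

```python
from typing import Callable, List, Optional

def pick_target_stream(
    streams: List[bytes],
    kws_score: Callable[[bytes], float],
    threshold: float = 0.5,
) -> Optional[int]:
    """Run a keyword spotter over each BSS output stream and return the
    index of the stream most likely to contain the wake word, or None
    when no stream clears the detection threshold."""
    scores = [kws_score(audio) for audio in streams]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best if scores[best] >= threshold else None

# Example with a dummy scorer standing in for a real KWS model:
streams = [b"stream0", b"stream1", b"stream2"]
dummy_scores = {b"stream0": 0.1, b"stream1": 0.9, b"stream2": 0.3}
print(pick_target_stream(streams, dummy_scores.get))  # 1
```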

Thanks, I will try it out. If it works well enough for now, I'll use it until other options become available or easier to build.

So, my use case would be (in pseudo pipeline YAML): no audio on the server, no heavy STT/TTS processing on the client.

  remote: -> send vad wav to server
  sound: <- audio from server

  tts: -> audio to client

It doesn't seem like your current pipeline supports this? I could probably create a dummy wake word that always returns true... but it seems ASR is hard-coded with mic processing? Am I wrong about this?
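A fuller sketch of what I mean (the key names are invented, since I don't know the real Rhasspy 3 schema):

```yaml
# Hypothetical satellite pipeline -- key names are invented, not the
# actual Rhasspy 3 schema.
satellite:
  mic:
    name: arecord          # audio capture stays on the client
  vad:
    name: silero           # endpointing on the client
  remote:
    name: websocket        # ship the VAD-trimmed WAV to the server
  snd:
    name: aplay            # play the server's TTS audio locally
```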


Got the snowboy custom wake word working, thanks!
I trained it with 25 utterances of various TTS voices from different countries/accents.
It seems to work pretty well in my Ubuntu VM. @synesthesiam Will this and Rhasspy3 work on a Pi4 also?


@synesthesiam Have you given any thought to including speaker verification / voice authentication (i.e., biometrics) in Rhasspy 3? Does the new modular nature of V3 make it easier to integrate this feature into a custom assistant? What open source projects have some speaker verification/authentication functionality atm, DeepSpeech? Coqui?

Also, which of your to-dos are you working on now, or will be soon, if any? I'm particularly interested in the custom STT grammars and intent systems.

  • A user friendly web UI
  • An automated method for installing programs/services and downloading models
  • Support for custom speech to text grammars
  • Intent systems besides Home Assistant
  • The ability to accumulate context within a pipeline

Your product is incredible. Are there any estimated dates for a finished product via Docker? And if I use Rhasspy 2.4, will it be hard to migrate if I use remote HTTP?

These things are all on my TODO list :+1:

Accumulating context within a pipeline is going to be needed for speaker identification/verification. I don’t have a specific program in mind for this yet, though. Some projects I’m looking at are personalVAD and Personalized PercepNet.

I don’t think it will, unfortunately. Seasalt only contained code for x86_64 systems, so I don’t think it will create wake words for arm64 systems. It might be possible to extend Seasalt, but I’m not sure how they created it in the first place.

Snowman, on the other hand, should work just fine on a Pi 4 :slight_smile:

Thanks! I plan to keep the same sentences format, and I will add some backwards compatible endpoints to the HTTP API. So using 2.4 to start shouldn’t be a big problem for migration. But I don’t have any estimated dates yet, sorry :confused: