How to get Rhasspy not to ignore words that are not in the sentence when spoken into the microphone

I have Rhasspy working perfectly querying a remote API via Home Assistant when using the web console, but when actually saying it into the microphone it drops everything after the, um, ‘trigger phrase’.

So if I type “answer this how many moons does saturn have” into the web console, it finds the right intent and sends _intent.rawInput = “answer this how many moons does saturn have” to the correct Home Assistant intent, which does its thing and answers the question correctly.

However, when I say “answer this how many moons does saturn have” into the microphone, it understands the intent and calls the correct intent script in HA, but it sends only “answer this” as the _intent.rawInput value.

The Rhasspy intent sentence looks like:

[askOpenAiQuestion]
(answer this) {openaiquestion}

And the HA intent script:

askOpenAiQuestion:
  speech: 
    text: Let me check on that
  action:
    - service: input_text.set_value
      data_template:
        entity_id: input_text.last_openai_question
        value: "{{ _intent.rawInput | replace(\"answer this\",\"\") }}"

Can anyone point me in the right direction on how to get Rhasspy to send the whole phrase via _intent.rawInput when using the microphone like it does when sending via the web console?

Yeah, the sentence is “answer this” and that is what is going into the openaiquestion slot.

The way you want it does not work with Rhasspy. You could make it work with “answer this” and then use continueSession to keep the conversation going.
https://rhasspy.readthedocs.io/en/latest/reference/#dialogue-manager
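For example, in the askOpenAiQuestion handler you could publish something like this to hermes/dialogueManager/continueSession (the prompt text here is only an example):

{"sessionId": "<sessionId from the intent message>", "text": "What is your question?", "sendIntentNotRecognized": true}

Rhasspy then speaks the text and keeps the session open, so the next thing the user says comes back as a new intent (or as intentNotRecognized when it does not match anything trained), and you can pick it up from there.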

Maybe this topic will also help:

You can also search the forum for continueSession.

@romkabouter,

Thank you for the answer. You are right; after your suggestions and some more reading I don't think I will be able to accomplish this with Rhasspy.

I have it doing all the other functions it was designed to do, and I very much appreciate it and the work that's been done on the Rhasspy project.

Thanks again for taking the time to help me.

Not so fast! There are others here interested in the same thing.

As I understand it, Rhasspy was designed more as a toolbox than a particular product - though it has certainly been focused on working with home automation systems like HA almost out of the box.

Which means that extra work will be required - using continueSession - to build a conversation. Heck, I also don’t see Home Assistant answering your question anyway … unless you have named one of your devices “Saturn” ? So sooner or later you will need a separate piece of software to forward your question to a search engine and reformat the result. I do not know of such a piece of software, but maybe those who have discussed this topic previously do, or are developing their own custom integration.

It sounds like a very useful addition, so I wish you luck getting it running.

No, no devices named Saturn. The remote API sends the question to OpenAI, takes the reply from OpenAI, and sends it to TTS in Home Assistant. I actually already have it 100% fully working (when typing text into the Rhasspy UI console “recognize” feature, or from a different text entry interface I set up to test it) except for this last piece that I can't get figured out.

I don't think using continueSession will work, because those subsequent sentences would also need to be part of the training model.

At this point I am looking at two options. The first one is promising but limited, which is to use the Home Assistant Conversation integration, as that already does full text transcription, but I have yet to find a way to actually get that into HA. The second is to find a separate STT model that does do full transcription, install it on a separate rPi, send the raw audio from Rhasspy to it to be transcribed, and from there send it along to the remote API.

I will report back here if I have any success, or I guess if I fail completely. Thank you for your reply.


I’d love to see you sharing some implementation details!

Sure. There's really not much to it.

In Rhasspy (this only works when entering text into the console):

[askOpenAiQuestion]
(answer this) {openaiquestion}

The HA intent_script that updates a sensor that is watched by automations:

askOpenAiQuestion:
  speech:
    text: Let me check on that
  action:
    - service: input_text.set_value
      data_template:
        entity_id: input_text.last_openai_question
        value: "{{ _intent.rawInput | replace(\"answer this\",\"\") }}"

Then a couple of HA automations to watch for sensors changing:

last_openai_question updates → send value to rest_command
last_openai_answer updates → send value to TTS

- id: request_openai_question
  alias: Request OpenAI Question
  trigger:
  - entity_id: input_text.last_openai_question
    platform: state
  condition:
    condition: and
    conditions:
    - condition: template
      value_template: >
        {{ states("input_text.last_openai_question") != 'empty' }}
    - condition: template
      value_template: >
        {{ states("input_text.last_openai_question") != '' }}
  action:
  - service: rest_command.openai_raw_answer
    data:
      pKey: "{REDACTED}"
      pQuery: >
        {{ states("input_text.last_openai_question") }}
- id: answer_openai_question
  alias: Answer OpenAI Question
  trigger:
  - entity_id: input_text.last_openai_answer
    platform: state
  condition:
    condition: and
    conditions:
    - condition: template
      value_template: >
        {{ states("input_text.last_openai_answer") != 'empty' }}
    - condition: template
      value_template: >
        {{ states("input_text.last_openai_answer") != '' }}
  action:
  - service: tts.cloud_say
    data:
      entity_id: media_player.vlc_telnet
      message: >
        {{ states("input_text.last_openai_answer") }}
      options:
        gender: male
      language: en-GB

An HA rest_command to send the question to the remote API (my server with both local and internet access):

  openai_raw_answer:
    url: https://{MY_SERVER_DOMAIN}/queryOpenAI.php
    method: POST
    headers:
      user-agent: 'Home Assistant Rest Command'
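
Note that the snippet above omits the request body (presumably trimmed along with the auth bits); a minimal sketch of what it might look like, assuming the pKey/pQuery values passed by the automation should end up as the POST fields the PHP endpoint (shown next) reads:

  openai_raw_answer:
    url: https://{MY_SERVER_DOMAIN}/queryOpenAI.php
    method: POST
    content_type: 'application/x-www-form-urlencoded'
    payload: 'pKey={{ pKey }}&pQuery={{ pQuery | urlencode }}'
    headers:
      user-agent: 'Home Assistant Rest Command'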

Then the REST endpoint is just a few lines of PHP (removing auth and sanitizing stuff) that queries OpenAI and writes the answer to an HA sensor, which the above automation listens for and speaks via TTS:

  // The question forwarded by the Home Assistant rest_command
  $theQ = $_POST['pQuery'];

  // Ask OpenAI for a short completion
  exec("curl https://api.openai.com/v1/completions -H \"Content-Type: application/json\" -H \"Authorization: Bearer {OPEN_AI_AUTH_TOKEN}\" -d '{\"model\": \"text-davinci-003\", \"prompt\": \"$theQ\", \"temperature\": 0, \"max_tokens\": 40}'", $output , $retval);

  $replyData = json_decode($output[0]);
  $replyText = rtrim(ltrim($replyData->choices[0]->text));

  // Write the answer back into Home Assistant; the answer_openai_question automation picks it up and speaks it
  exec("curl -v -X POST http://homeassist.local:8123/api/states/input_text.last_openai_answer -d '{\"state\": \"$replyText\"}' -H \"Authorization: Bearer {HA_AUTH_TOKEN}\" -H \"Content-Type: application/json\"", $output , $retval);

I'm sure this could be accomplished better by someone who actually knows what they are doing, but to be honest I just spent an afternoon putting this together in an effort to try and impress my daughter.

I created a feature request for the Home Assistant companion app to enter the transcribed text into a sensor.


Very nice indeed, I’ll have a go at it as well 🙂

I might have found a way; I was checking the docs and have tried something.
Only manual stuff for now, but it is promising.

I checked the reference here:
https://rhasspy.readthedocs.io/en/latest/reference/#mqtt-api

To try something I have created a flow in Node-RED.
I have created two inject nodes.
The first one posts this:

{"sessionId": "voiceAi", "stopOnSilence": false, "sendAudioCaptured": true, "siteId": "matrixvoice"}

to hermes/asr/startListening

What this does is make Rhasspy start listening on the provided siteId. The stopOnSilence: false prevents Rhasspy from stopping on silence, and sendAudioCaptured: true will send the captured audio to

rhasspy/asr/<siteId>/<sessionId>/audioCaptured

In my case rhasspy/asr/matrixvoice/voiceAi/audioCaptured

When I want Rhasspy to stop listening, I have an inject node posting
{"sessionId":"voiceAi","siteId": "matrixvoice"} to hermes/asr/stopListening

When that message is posted, Rhasspy stops and the recorded WAV is posted to
rhasspy/asr/matrixvoice/voiceAi/audioCaptured
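
For reference, the same two messages can also be sent from a shell with mosquitto_pub instead of inject nodes (replace <broker> with your MQTT broker host):

mosquitto_pub -h <broker> -t 'hermes/asr/startListening' \
  -m '{"sessionId":"voiceAi","stopOnSilence":false,"sendAudioCaptured":true,"siteId":"matrixvoice"}'

mosquitto_pub -h <broker> -t 'hermes/asr/stopListening' \
  -m '{"sessionId":"voiceAi","siteId":"matrixvoice"}'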

Output log is here:

[DEBUG:2023-01-13 20:56:29,788] rhasspyasr_pocketsphinx_hermes: <- AsrStartListening(site_id='matrixvoice', session_id='voiceAi', lang=None, stop_on_silence=False, send_audio_captured=True, wakeword_id=None, intent_filter=None)
[DEBUG:2023-01-13 20:56:29,789] rhasspyasr_pocketsphinx_hermes: Starting listening (session_id=voiceAi)
[DEBUG:2023-01-13 20:56:29,793] rhasspyasr_pocketsphinx_hermes: Receiving audio
[DEBUG:2023-01-13 20:56:33,356] rhasspyasr_pocketsphinx_hermes: <- AsrStopListening(site_id='matrixvoice', session_id='voiceAi')
[DEBUG:2023-01-13 20:56:33,356] rhasspyasr_pocketsphinx_hermes: Received a total of 124544 byte(s) for WAV data for session voiceAi
[DEBUG:2023-01-13 20:56:33,357] rhasspyasr_pocketsphinx_hermes: -> AsrRecordingFinished(site_id='matrixvoice', session_id='voiceAi')
[DEBUG:2023-01-13 20:56:33,357] rhasspyasr_pocketsphinx_hermes: Publishing 49 bytes(s) to rhasspy/asr/recordingFinished
[DEBUG:2023-01-13 20:56:33,358] rhasspyasr_pocketsphinx_hermes: Transcribing 114732 byte(s) of audio data
INFO: cmn.c(133): CMN: 34.42 -1.70 18.51  7.78  5.07  5.94  4.41 -2.43 -1.23 -2.93  6.99 -0.37  7.62 
INFO: ngram_search_fwdtree.c(1550):     1372 words recognized (6/fr)
INFO: ngram_search_fwdtree.c(1552):    55794 senones evaluated (240/fr)
INFO: ngram_search_fwdtree.c(1556):    30003 channels searched (129/fr), 5698 1st, 16976 last
INFO: ngram_search_fwdtree.c(1559):     1719 words for which last channels evaluated (7/fr)
INFO: ngram_search_fwdtree.c(1561):     2026 candidate words for entering last phone (8/fr)
INFO: ngram_search_fwdtree.c(1564): fwdtree 0.18 CPU 0.080 xRT
INFO: ngram_search_fwdtree.c(1567): fwdtree 0.19 wall 0.081 xRT
INFO: ngram_search_fwdflat.c(302): Utterance vocabulary contains 18 words
INFO: ngram_search_fwdflat.c(948):     1502 words recognized (6/fr)
INFO: ngram_search_fwdflat.c(950):    45066 senones evaluated (194/fr)
INFO: ngram_search_fwdflat.c(952):    33472 channels searched (144/fr)
INFO: ngram_search_fwdflat.c(954):     2837 words searched (12/fr)
INFO: ngram_search_fwdflat.c(957):     1632 word transitions (7/fr)
INFO: ngram_search_fwdflat.c(960): fwdflat 0.10 CPU 0.041 xRT
INFO: ngram_search_fwdflat.c(963): fwdflat 0.10 wall 0.041 xRT
[DEBUG:2023-01-13 20:56:33,643] rhasspyasr_pocketsphinx.transcribe: Decoded audio in 0.28424000507220626 second(s)
INFO: ngram_search.c(1250): lattice start node <s>.0 end node </s>.186
INFO: ngram_search.c(1276): Eliminated 0 nodes before end node
INFO: ngram_search.c(1381): Lattice has 187 nodes, 397 links
INFO: ps_lattice.c(1374): Bestpath score: -6392
INFO: ps_lattice.c(1378): Normalizer P(O) = alpha(</s>:186:230) = -327747
INFO: ps_lattice.c(1435): Joint P(O,S) = -392548 P(S|O) = -64801
INFO: ngram_search.c(872): bestpath 0.00 CPU 0.000 xRT
INFO: ngram_search.c(875): bestpath 0.00 wall 0.001 xRT
INFO: ngram_search.c(1027): bestpath 0.00 CPU 0.000 xRT
INFO: ngram_search.c(1030): bestpath 0.00 wall 0.000 xRT
[DEBUG:2023-01-13 20:56:33,646] rhasspyasr_pocketsphinx_hermes: Transcription(text='de is de het', likelihood=0.001532505635969972, transcribe_seconds=0.28424000507220626, wav_seconds=3.584, tokens=[TranscriptionToken(token='<s>', start_time=0.0, end_time=0.09, likelihood=1.0), TranscriptionToken(token='de(2)', start_time=0.1, end_time=0.31, likelihood=0.49347559856683904), TranscriptionToken(token='<sil>', start_time=0.32, end_time=0.39, likelihood=0.4221477266507153), TranscriptionToken(token='is', start_time=0.4, end_time=0.65, likelihood=0.1650991984749785), TranscriptionToken(token='de(2)', start_time=0.66, end_time=0.96, likelihood=0.3323297126218573), TranscriptionToken(token='<sil>', start_time=0.97, end_time=1.21, likelihood=1.000100016593933), TranscriptionToken(token='het', start_time=1.22, end_time=1.67, likelihood=0.9933216234388383), TranscriptionToken(token='<sil>', start_time=1.68, end_time=1.85, likelihood=0.7483880924572297), TranscriptionToken(token='</s>', start_time=1.86, end_time=2.3, likelihood=1.0)])
[DEBUG:2023-01-13 20:56:33,651] rhasspyasr_pocketsphinx_hermes: -> AsrTextCaptured(text='de is de het', likelihood=0.001532505635969972, seconds=0.28424000507220626, site_id='matrixvoice', session_id='voiceAi', wakeword_id=None, asr_tokens=[[AsrToken(value='<s>', confidence=1.0, range_start=0, range_end=4, time=AsrTokenTime(start=0.0, end=0.09)), AsrToken(value='de(2)', confidence=0.49347559856683904, range_start=4, range_end=10, time=AsrTokenTime(start=0.1, end=0.31)), AsrToken(value='<sil>', confidence=0.4221477266507153, range_start=10, range_end=16, time=AsrTokenTime(start=0.32, end=0.39)), AsrToken(value='is', confidence=0.1650991984749785, range_start=16, range_end=19, time=AsrTokenTime(start=0.4, end=0.65)), AsrToken(value='de(2)', confidence=0.3323297126218573, range_start=19, range_end=25, time=AsrTokenTime(start=0.66, end=0.96)), AsrToken(value='<sil>', confidence=1.000100016593933, range_start=25, range_end=31, time=AsrTokenTime(start=0.97, end=1.21)), AsrToken(value='het', confidence=0.9933216234388383, range_start=31, range_end=35, time=AsrTokenTime(start=1.22, end=1.67)), AsrToken(value='<sil>', confidence=0.7483880924572297, range_start=35, range_end=41, time=AsrTokenTime(start=1.68, end=1.85)), AsrToken(value='</s>', confidence=1.0, range_start=41, range_end=46, time=AsrTokenTime(start=1.86, end=2.3))]], lang=None)
[DEBUG:2023-01-13 20:56:33,652] rhasspyasr_pocketsphinx_hermes: Publishing 1281 bytes(s) to hermes/asr/textCaptured
[DEBUG:2023-01-13 20:56:33,652] rhasspyasr_pocketsphinx_hermes: -> AsrAudioCaptured(114732 byte(s)) to rhasspy/asr/matrixvoice/voiceAi/audioCaptured
[DEBUG:2023-01-13 20:56:33,652] rhasspyasr_pocketsphinx_hermes: Stopping listening (session_id=voiceAi)
[DEBUG:2023-01-13 20:56:33,670] rhasspydialogue_hermes: <- AsrTextCaptured(text='de is de het', likelihood=0.001532505635969972, seconds=0.28424000507220626, site_id='matrixvoice', session_id='voiceAi', wakeword_id=None, asr_tokens=[[AsrToken(value='<s>', confidence=1.0, range_start=0, range_end=4, time=AsrTokenTime(start=0.0, end=0.09)), AsrToken(value='de(2)', confidence=0.49347559856683904, range_start=4, range_end=10, time=AsrTokenTime(start=0.1, end=0.31)), AsrToken(value='<sil>', confidence=0.4221477266507153, range_start=10, range_end=16, time=AsrTokenTime(start=0.32, end=0.39)), AsrToken(value='is', confidence=0.1650991984749785, range_start=16, range_end=19, time=AsrTokenTime(start=0.4, end=0.65)), AsrToken(value='de(2)', confidence=0.3323297126218573, range_start=19, range_end=25, time=AsrTokenTime(start=0.66, end=0.96)), AsrToken(value='<sil>', confidence=1.000100016593933, range_start=25, range_end=31, time=AsrTokenTime(start=0.97, end=1.21)), AsrToken(value='het', confidence=0.9933216234388383, range_start=31, range_end=35, time=AsrTokenTime(start=1.22, end=1.67)), AsrToken(value='<sil>', confidence=0.7483880924572297, range_start=35, range_end=41, time=AsrTokenTime(start=1.68, end=1.85)), AsrToken(value='</s>', confidence=1.0, range_start=41, range_end=46, time=AsrTokenTime(start=1.86, end=2.3))]], lang=None)
[WARNING:2023-01-13 20:56:33,671] rhasspydialogue_hermes: Ignoring unknown session voiceAi

There is a transcription in the AsrTextCaptured message, but that is not a problem; Rhasspy cannot and will not do anything with it.

So the audio can be sent off to be transcribed by some other service, like Whisper.
That result can then be used as well.

It is obviously a rough idea, but could work.
You can trigger the askOpenAiQuestion intent and in that handler do the logic above.
Post the audio to Whisper, post the result to OpenAI.
Post the answer from OpenAI to hermes/dialogueManager/endSession and Rhasspy will speak the answer.
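
To make the idea concrete outside Node-RED, here is a rough Python sketch of that whole handler. The transcription server URL, the MQTT broker address, the OpenAI key and the fixed capture length are all placeholders/assumptions on top of this thread; the OpenAI call just mirrors the PHP shown earlier.

import json
import threading

import requests
import paho.mqtt.client as mqtt

SITE_ID = "matrixvoice"
ASR_SESSION = "voiceAi"                       # our manual ASR-only session
STT_URL = "http://localhost:8000/transcribe"  # hypothetical Whisper-style HTTP server
OPENAI_KEY = "sk-..."                         # your OpenAI key

dialogue_session = None                       # session opened by the wake word

def on_connect(client, userdata, flags, rc):
    client.subscribe("hermes/intent/askOpenAiQuestion")
    client.subscribe(f"rhasspy/asr/{SITE_ID}/{ASR_SESSION}/audioCaptured")

def on_message(client, userdata, msg):
    global dialogue_session

    if msg.topic == "hermes/intent/askOpenAiQuestion":
        # Remember the dialogue session so we can end it (and speak) later,
        # then start an ASR-only session to capture the follow-up audio.
        dialogue_session = json.loads(msg.payload)["sessionId"]
        client.publish("hermes/asr/startListening", json.dumps({
            "sessionId": ASR_SESSION, "siteId": SITE_ID,
            "stopOnSilence": False, "sendAudioCaptured": True}))
        # With stopOnSilence false nothing stops the capture automatically,
        # so as a crude placeholder stop after a fixed six seconds.
        threading.Timer(6.0, lambda: client.publish(
            "hermes/asr/stopListening",
            json.dumps({"sessionId": ASR_SESSION, "siteId": SITE_ID}))).start()
        return

    # Otherwise this is the captured WAV, published after stopListening
    question = requests.post(STT_URL, data=msg.payload,
                             headers={"Content-Type": "audio/wav"}).text.strip()

    # Same completions call as the PHP endpoint earlier in the thread
    answer = requests.post(
        "https://api.openai.com/v1/completions",
        headers={"Authorization": f"Bearer {OPENAI_KEY}"},
        json={"model": "text-davinci-003", "prompt": question,
              "temperature": 0, "max_tokens": 40},
    ).json()["choices"][0]["text"].strip()

    # Ending the original dialogue session with text makes Rhasspy speak it on the satellite
    client.publish("hermes/dialogueManager/endSession",
                   json.dumps({"sessionId": dialogue_session, "text": answer}))

client = mqtt.Client()
client.on_connect = on_connect
client.on_message = on_message
client.connect("localhost", 1883)             # your MQTT broker
client.loop_forever()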

I've done some more work in Node-RED and it is promising indeed.

I will spend some more time and post results when I have them.

A lot to digest there; I'm not sure I fully understand everything even after reading it a few times. Let me see if my brain can process it and give it a try when I can. Thank you for taking the time to point me in the right direction.

Yes, I understand, but basically I have this almost working with Node-RED flows with my siteId matrixvoice:

  • WakeWord → “answer this” → askOpenAiQuestion recognized
  • NodeRed handles intent → sends message to startListening with sessionId “voiceai”
  • a message arrives at rhasspy/asr/recordingFinished with sessionId “voiceai”, then
  • send a message to hermes/asr/stopListening, resulting in a message on rhasspy/asr/matrixvoice/voiceAi/audioCaptured
  • that message is the audio recorded after the intent recognition

todo:

  • use that audio to post to whisper or some other SpeechToText engine
  • use the result to post to openai
  • use the result from OpenAI to post as text to endSession, which makes Rhasspy speak the result on the satellite (see the example payload below)
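
For that last step, the endSession message itself is small; something like this published to hermes/dialogueManager/endSession, using the sessionId of the dialogue session that recognized the intent rather than the manual ASR session:

{"sessionId": "<sessionId from the recognized intent>", "text": "<answer from OpenAI>"}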

I have been toying around with adding support to my client that allows you to send a command like so:

cmd: GetResponse

  • send a text string → client TTS → plays audio
  • record audio → send to whisper and get text string response
  • you can do anything you want with the resulting string

Trying to work on a better way to interact with the system.

Like requesting to play a random song or playlist, or fetch information from the net.

Something I have recently been toying with that was inspired by this thread:

https://community.rhasspy.org/t/proof-of-concept-working-for-open-response-intent-handling/4163/8

I already use Node-RED to handle my intents and have a trained server with slots for pre-defined things like my home automation, etc. For this I use Kaldi.

Then I installed a second instance of Rhasspy on the same machine on a different port, which uses Vosk for the STT in open transcription mode. The trick is that Rhasspy will automatically download the small Vosk model; after it does this, you manually download the large model from the Vosk site and put it in its place under the /profiles/en/vosk folder. At least for English this gives pretty good results for the STT.

Both servers use the same MQTT broker for receiving audio from the satellite.

In this case, if you keep your intent as “answer this” as done originally, then the trained instance will respond with an intent-not-recognized message, but the open transcription server should respond with the intent, and the raw text should be the complete sentence.
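
If anyone wants to try the same, the second instance is basically just another container on a different port with its own profile directory; something along these lines (the container name, paths and the 12102 host port are only examples):

docker run -d --name rhasspy-open \
    -p 12102:12101 \
    -v "$HOME/.config/rhasspy-open/profiles:/profiles" \
    rhasspy/rhasspy \
    --user-profiles /profiles \
    --profile en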

I took some time this weekend to try to start understanding the Rhasspy API. Simple things like telling Rhasspy to disable itself, re-enabling it from a button in Home Assistant, and triggering wakeup from a button in Home Assistant. I'm starting to grasp it a bit.
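
For anyone else poking at the same things, those map onto the HTTP API roughly like this (host and port assume a default install reachable at rhasspy.local:12101):

# disable / re-enable the wake word
curl -X POST -d 'off' http://rhasspy.local:12101/api/listen-for-wake
curl -X POST -d 'on'  http://rhasspy.local:12101/api/listen-for-wake

# trigger listening as if the wake word had been spoken
curl -X POST http://rhasspy.local:12101/api/listen-for-command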

My next steps are to try to do something along the lines of what romkabouter suggested above - slowly wrapping my head around all the steps.

At this point my approach, I think, is leaning toward installing DeepSpeech on a spare rPi4 and sending the audio there for transcription. I know others have suggested sending it to OpenAI Whisper, but I'm trying to do it as off-cloud as possible (yes, I know sending the text question to OpenAI is not off-cloud, but it feels a lot different sending an interpreted text string vs. an actual raw recording to a 3rd party).

I will update this thread if I make any progress on that. Thanks for the hints!

Offloading the speech to text and text to speech was one of the biggest improvements I made to my setup. In the end “offloading” may be the incorrect term. I am getting excellent results using Rhasspy’s Vosk implementation, but with the full English language model (downloaded from: VOSK Models). Once it was downloaded I extracted it to /profiles/en/vosk/model on my base Rhasspy (that runs on an Intel Xeon, so it handles the workload without issue). But yes, running it on hardware that can better handle it and putting a more powerful engine behind it makes it understand a whole lot more.
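
For anyone wanting to do the same, the swap is roughly this (at the time of writing the large US English model on the Vosk models page is vosk-model-en-us-0.22; the profile path is whatever your install uses):

cd <your profiles dir>/en/vosk
wget https://alphacephei.com/vosk/models/vosk-model-en-us-0.22.zip
unzip vosk-model-en-us-0.22.zip
rm -rf model && mv vosk-model-en-us-0.22 model

Then restart Rhasspy so it picks the model up.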

Using this setup last night I got my Rhasspy timer skill to take random words for the timer name. So “Start a new timer for the potatoes for 30 minutes” starts a timer called “potatoes”, and “Start a timer for the chicken for 45 minutes” starts a timer called “chicken”. And no, I don’t have “potatoes” and “chicken” in a words list anywhere. You can say any word and the skill recognizes it. And the key to the setup for me is the Vosk server.

Edit:
Lol… Silly me, I should have read further up where they told you about the Vosk server and linked to my post where I described the setup. It really is a game changer though. And yeah, my skill script uses the raw input of the intent, like the HA intent above, to get the “missing” text. Works like a charm if you know what the input sentence is. More complicated if you have more than one type of input sentence for the same intent.