I might have found a way. I was checking the docs and tried something.
It is all manual for now, but it looks promising.
I checked the reference here:
https://rhasspy.readthedocs.io/en/latest/reference/#mqtt-api
To try it out, I created a flow in Node-RED with two inject nodes.
The first one posts this:
{"sessionId": "voiceAi", "stopOnSilence": false, "sendAudioCaptured": true, "siteId": "matrixvoice"}
to hermes/asr/startListening
This makes Rhasspy start listening on the provided siteId. The stopOnSilence flag prevents Rhasspy from stopping on silence, and sendAudioCaptured makes it publish the recorded audio to
rhasspy/asr/<siteId>/<sessionId>/audioCaptured
In my case, that is rhasspy/asr/matrixvoice/voiceAi/audioCaptured
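For reference, the same message can be published from a script instead of an inject node. A minimal sketch using paho-mqtt, assuming a default broker on localhost:1883:

```python
import json
import paho.mqtt.client as mqtt

client = mqtt.Client()
client.connect("localhost", 1883)  # assumed: Rhasspy's MQTT broker

payload = {
    "sessionId": "voiceAi",
    "stopOnSilence": False,      # don't stop on silence; we stop manually
    "sendAudioCaptured": True,   # publish the recorded WAV when stopped
    "siteId": "matrixvoice",
}
client.publish("hermes/asr/startListening", json.dumps(payload))
```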
When I want Rhasspy to stop listening, I have a second inject node posting
{"sessionId":"voiceAi","siteId": "matrixvoice"}
to hermes/asr/stopListening
When that message is posted, Rhasspy stops listening and the recorded WAV is published to
rhasspy/asr/matrixvoice/voiceAi/audioCaptured
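And the stop side as a script, subscribing for the captured WAV before ending the session (same assumed broker and IDs as above):

```python
import json
import paho.mqtt.client as mqtt

AUDIO_TOPIC = "rhasspy/asr/matrixvoice/voiceAi/audioCaptured"

def on_message(client, userdata, msg):
    # The payload is the raw WAV bytes of the whole recording.
    with open("captured.wav", "wb") as f:
        f.write(msg.payload)
    print(f"Saved {len(msg.payload)} byte(s) to captured.wav")

client = mqtt.Client()
client.on_message = on_message
client.connect("localhost", 1883)  # assumed broker
client.subscribe(AUDIO_TOPIC)

# Stop the session; Rhasspy then publishes the WAV to AUDIO_TOPIC.
client.publish(
    "hermes/asr/stopListening",
    json.dumps({"sessionId": "voiceAi", "siteId": "matrixvoice"}),
)
client.loop_forever()
```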
Output log is here:
[DEBUG:2023-01-13 20:56:29,788] rhasspyasr_pocketsphinx_hermes: <- AsrStartListening(site_id='matrixvoice', session_id='voiceAi', lang=None, stop_on_silence=False, send_audio_captured=True, wakeword_id=None, intent_filter=None)
[DEBUG:2023-01-13 20:56:29,789] rhasspyasr_pocketsphinx_hermes: Starting listening (session_id=voiceAi)
[DEBUG:2023-01-13 20:56:29,793] rhasspyasr_pocketsphinx_hermes: Receiving audio
[DEBUG:2023-01-13 20:56:33,356] rhasspyasr_pocketsphinx_hermes: <- AsrStopListening(site_id='matrixvoice', session_id='voiceAi')
[DEBUG:2023-01-13 20:56:33,356] rhasspyasr_pocketsphinx_hermes: Received a total of 124544 byte(s) for WAV data for session voiceAi
[DEBUG:2023-01-13 20:56:33,357] rhasspyasr_pocketsphinx_hermes: -> AsrRecordingFinished(site_id='matrixvoice', session_id='voiceAi')
[DEBUG:2023-01-13 20:56:33,357] rhasspyasr_pocketsphinx_hermes: Publishing 49 bytes(s) to rhasspy/asr/recordingFinished
[DEBUG:2023-01-13 20:56:33,358] rhasspyasr_pocketsphinx_hermes: Transcribing 114732 byte(s) of audio data
INFO: cmn.c(133): CMN: 34.42 -1.70 18.51 7.78 5.07 5.94 4.41 -2.43 -1.23 -2.93 6.99 -0.37 7.62
INFO: ngram_search_fwdtree.c(1550): 1372 words recognized (6/fr)
INFO: ngram_search_fwdtree.c(1552): 55794 senones evaluated (240/fr)
INFO: ngram_search_fwdtree.c(1556): 30003 channels searched (129/fr), 5698 1st, 16976 last
INFO: ngram_search_fwdtree.c(1559): 1719 words for which last channels evaluated (7/fr)
INFO: ngram_search_fwdtree.c(1561): 2026 candidate words for entering last phone (8/fr)
INFO: ngram_search_fwdtree.c(1564): fwdtree 0.18 CPU 0.080 xRT
INFO: ngram_search_fwdtree.c(1567): fwdtree 0.19 wall 0.081 xRT
INFO: ngram_search_fwdflat.c(302): Utterance vocabulary contains 18 words
INFO: ngram_search_fwdflat.c(948): 1502 words recognized (6/fr)
INFO: ngram_search_fwdflat.c(950): 45066 senones evaluated (194/fr)
INFO: ngram_search_fwdflat.c(952): 33472 channels searched (144/fr)
INFO: ngram_search_fwdflat.c(954): 2837 words searched (12/fr)
INFO: ngram_search_fwdflat.c(957): 1632 word transitions (7/fr)
INFO: ngram_search_fwdflat.c(960): fwdflat 0.10 CPU 0.041 xRT
INFO: ngram_search_fwdflat.c(963): fwdflat 0.10 wall 0.041 xRT
[DEBUG:2023-01-13 20:56:33,643] rhasspyasr_pocketsphinx.transcribe: Decoded audio in 0.28424000507220626 second(s)
INFO: ngram_search.c(1250): lattice start node <s>.0 end node </s>.186
INFO: ngram_search.c(1276): Eliminated 0 nodes before end node
INFO: ngram_search.c(1381): Lattice has 187 nodes, 397 links
INFO: ps_lattice.c(1374): Bestpath score: -6392
INFO: ps_lattice.c(1378): Normalizer P(O) = alpha(</s>:186:230) = -327747
INFO: ps_lattice.c(1435): Joint P(O,S) = -392548 P(S|O) = -64801
INFO: ngram_search.c(872): bestpath 0.00 CPU 0.000 xRT
INFO: ngram_search.c(875): bestpath 0.00 wall 0.001 xRT
INFO: ngram_search.c(1027): bestpath 0.00 CPU 0.000 xRT
INFO: ngram_search.c(1030): bestpath 0.00 wall 0.000 xRT
[DEBUG:2023-01-13 20:56:33,646] rhasspyasr_pocketsphinx_hermes: Transcription(text='de is de het', likelihood=0.001532505635969972, transcribe_seconds=0.28424000507220626, wav_seconds=3.584, tokens=[TranscriptionToken(token='<s>', start_time=0.0, end_time=0.09, likelihood=1.0), TranscriptionToken(token='de(2)', start_time=0.1, end_time=0.31, likelihood=0.49347559856683904), TranscriptionToken(token='<sil>', start_time=0.32, end_time=0.39, likelihood=0.4221477266507153), TranscriptionToken(token='is', start_time=0.4, end_time=0.65, likelihood=0.1650991984749785), TranscriptionToken(token='de(2)', start_time=0.66, end_time=0.96, likelihood=0.3323297126218573), TranscriptionToken(token='<sil>', start_time=0.97, end_time=1.21, likelihood=1.000100016593933), TranscriptionToken(token='het', start_time=1.22, end_time=1.67, likelihood=0.9933216234388383), TranscriptionToken(token='<sil>', start_time=1.68, end_time=1.85, likelihood=0.7483880924572297), TranscriptionToken(token='</s>', start_time=1.86, end_time=2.3, likelihood=1.0)])
[DEBUG:2023-01-13 20:56:33,651] rhasspyasr_pocketsphinx_hermes: -> AsrTextCaptured(text='de is de het', likelihood=0.001532505635969972, seconds=0.28424000507220626, site_id='matrixvoice', session_id='voiceAi', wakeword_id=None, asr_tokens=[[AsrToken(value='<s>', confidence=1.0, range_start=0, range_end=4, time=AsrTokenTime(start=0.0, end=0.09)), AsrToken(value='de(2)', confidence=0.49347559856683904, range_start=4, range_end=10, time=AsrTokenTime(start=0.1, end=0.31)), AsrToken(value='<sil>', confidence=0.4221477266507153, range_start=10, range_end=16, time=AsrTokenTime(start=0.32, end=0.39)), AsrToken(value='is', confidence=0.1650991984749785, range_start=16, range_end=19, time=AsrTokenTime(start=0.4, end=0.65)), AsrToken(value='de(2)', confidence=0.3323297126218573, range_start=19, range_end=25, time=AsrTokenTime(start=0.66, end=0.96)), AsrToken(value='<sil>', confidence=1.000100016593933, range_start=25, range_end=31, time=AsrTokenTime(start=0.97, end=1.21)), AsrToken(value='het', confidence=0.9933216234388383, range_start=31, range_end=35, time=AsrTokenTime(start=1.22, end=1.67)), AsrToken(value='<sil>', confidence=0.7483880924572297, range_start=35, range_end=41, time=AsrTokenTime(start=1.68, end=1.85)), AsrToken(value='</s>', confidence=1.0, range_start=41, range_end=46, time=AsrTokenTime(start=1.86, end=2.3))]], lang=None)
[DEBUG:2023-01-13 20:56:33,652] rhasspyasr_pocketsphinx_hermes: Publishing 1281 bytes(s) to hermes/asr/textCaptured
[DEBUG:2023-01-13 20:56:33,652] rhasspyasr_pocketsphinx_hermes: -> AsrAudioCaptured(114732 byte(s)) to rhasspy/asr/matrixvoice/voiceAi/audioCaptured
[DEBUG:2023-01-13 20:56:33,652] rhasspyasr_pocketsphinx_hermes: Stopping listening (session_id=voiceAi)
[DEBUG:2023-01-13 20:56:33,670] rhasspydialogue_hermes: <- AsrTextCaptured(text='de is de het', likelihood=0.001532505635969972, seconds=0.28424000507220626, site_id='matrixvoice', session_id='voiceAi', wakeword_id=None, asr_tokens=[[AsrToken(value='<s>', confidence=1.0, range_start=0, range_end=4, time=AsrTokenTime(start=0.0, end=0.09)), AsrToken(value='de(2)', confidence=0.49347559856683904, range_start=4, range_end=10, time=AsrTokenTime(start=0.1, end=0.31)), AsrToken(value='<sil>', confidence=0.4221477266507153, range_start=10, range_end=16, time=AsrTokenTime(start=0.32, end=0.39)), AsrToken(value='is', confidence=0.1650991984749785, range_start=16, range_end=19, time=AsrTokenTime(start=0.4, end=0.65)), AsrToken(value='de(2)', confidence=0.3323297126218573, range_start=19, range_end=25, time=AsrTokenTime(start=0.66, end=0.96)), AsrToken(value='<sil>', confidence=1.000100016593933, range_start=25, range_end=31, time=AsrTokenTime(start=0.97, end=1.21)), AsrToken(value='het', confidence=0.9933216234388383, range_start=31, range_end=35, time=AsrTokenTime(start=1.22, end=1.67)), AsrToken(value='<sil>', confidence=0.7483880924572297, range_start=35, range_end=41, time=AsrTokenTime(start=1.68, end=1.85)), AsrToken(value='</s>', confidence=1.0, range_start=41, range_end=46, time=AsrTokenTime(start=1.86, end=2.3))]], lang=None)
[WARNING:2023-01-13 20:56:33,671] rhasspydialogue_hermes: Ignoring unknown session voiceAi
There is a warning after the AsrTextCaptured ("Ignoring unknown session voiceAi"), but that is not a problem: the dialogue manager simply will not do anything with that session.
So, the captured audio can be sent to some other service, like Whisper, to be transcribed, and that result can then be used as well.
It is obviously a rough idea, but it could work.
You can trigger the askOpenAiQuestion intent and do the logic above in its handler:
post the audio to Whisper, post the transcription to OpenAI,
then post the answer from OpenAI to hermes/dialogueManager/endSession
and Rhasspy will speak the answer.
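Something like this, roughly. The Whisper endpoint and its request shape are placeholders (I have not tested that part), and the OpenAI call is just the plain completions REST API:

```python
import json
import requests
import paho.mqtt.client as mqtt

WHISPER_URL = "http://localhost:9000/asr"  # hypothetical local Whisper server

def handle_audio(client: mqtt.Client, session_id: str, wav_bytes: bytes):
    # session_id should be the dialogue session from the intent,
    # not the manual "voiceAi" ASR session above.

    # 1. Transcribe the captured WAV with Whisper (request shape is assumed).
    text = requests.post(WHISPER_URL, files={"audio_file": wav_bytes}).json()["text"]

    # 2. Ask OpenAI (API key and model are placeholders).
    answer = requests.post(
        "https://api.openai.com/v1/completions",
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        json={"model": "text-davinci-003", "prompt": text, "max_tokens": 256},
    ).json()["choices"][0]["text"]

    # 3. End the dialogue session; Rhasspy speaks the supplied text.
    client.publish(
        "hermes/dialogueManager/endSession",
        json.dumps({"sessionId": session_id, "text": answer}),
    )
```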