I’ve been following Rhasspy for a few weeks now and I’m pretty impressed by how fast the community is growing. Because it has so much overlap with my own project, I’m trying to think of ways to combine them so they can benefit from each other.
speech-to-text (it even has the same Zamia Speech Kaldi support Rhasspy uses)
dialog-to-service (skills, actions)
It is also a framework that has some additional features like:
Java-based core server (lightweight, <100MB)
Customizable cross-platform clients for browser, Android and iOS (iOS not yet in Apple store)
User account management
SDK to build services in Java and upload them to SEPIA (kind of like a skill store, works globally or per user)
Python bridge to implement Python code for intent handling right into the SEPIA NLU chain
Since speech-to-text and text-to-intent are handled by Rhasspy as well, I was thinking of connecting SEPIA services to Rhasspy via SEPIA’s ‘answer’ REST endpoint, or of implementing Rhasspy’s intent handling in SEPIA via the Python bridge. It would also be great if we could make the STT modules compatible.
The benefits I see for both systems are:
Rhasspy could get access to dozens of SEPIA services (weather, navigation, to-do lists, reminders, music, timers, smart home control, news, etc.) and SEPIA’s services SDK
SEPIA could benefit from Rhasspy’s combination of intent extraction and speech recognition
Rhasspy could be accessed via SEPIA’s web-based clients
I believe MQTT can be the right choice for certain parts of the interaction between Rhasspy and SEPIA (STT maybe). For the NLU chain, though, it would mean the additional overhead of keeping a broker connection open, and the publish/subscribe model does not really fit a request/response task. The ‘Python-Bridge’ is more of a way to call a Python function (or program) from SEPIA, which ideally would be imported Python code or a (synchronous) HTTP REST GET/POST call.
Nevertheless, on the Rhasspy side I could think of a way to publish an intent via MQTT, and SEPIA would “answer” with the service result via MQTT as well.
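To make the trade-off above concrete, here is a rough sketch of the publish-intent / answer-via-MQTT idea. It does not use a real broker: `InMemoryBroker` is a hypothetical stand-in with synchronous delivery, and the topic names and payload fields are made up for illustration. The point it shows is that a request/response exchange over publish/subscribe needs its own correlation id, which a plain HTTP call gets for free.

```python
import json
import uuid
from collections import defaultdict

# Hypothetical topic names; neither project defines these.
INTENT_TOPIC = "demo/intent"
ANSWER_TOPIC = "demo/answer"


class InMemoryBroker:
    """Tiny stand-in for an MQTT broker: exact-topic match, synchronous delivery."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, payload):
        for callback in list(self._subscribers[topic]):
            callback(topic, payload)


def attach_service(broker, handler):
    """Service side: answer every published intent, echoing its request id."""

    def on_intent(topic, payload):
        message = json.loads(payload)
        reply = {
            "id": message["id"],
            "answer": handler(message["intent"], message["slots"]),
        }
        broker.publish(ANSWER_TOPIC, json.dumps(reply))

    broker.subscribe(INTENT_TOPIC, on_intent)


def request_answer(broker, intent, slots):
    """Recognizer side: publish an intent and pick out the matching answer."""
    request_id = str(uuid.uuid4())
    answers = []

    def on_answer(topic, payload):
        message = json.loads(payload)
        # Other clients' answers arrive on the same topic, so filter by id.
        if message["id"] == request_id:
            answers.append(message["answer"])

    broker.subscribe(ANSWER_TOPIC, on_answer)
    request = {"id": request_id, "intent": intent, "slots": slots}
    broker.publish(INTENT_TOPIC, json.dumps(request))
    return answers[0] if answers else None
```

With a real broker the delivery would be asynchronous, so the caller would also need a timeout while waiting for the matching answer.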
Hi, Florian! Great to hear back from you. As you can see, I’ve been very busy!
I would love to work with you to get SEPIA talking to Rhasspy (and vice-versa). Do your Android/iOS apps stream audio directly to a SEPIA server, or do they do ASR locally? If they stream audio, I’d be interested in supporting that so Rhasspy could gain some mobile support.
Do you have a link to documentation for SEPIA’s REST API? If it’s just an HTTP POST, it might work out of the box with Rhasspy right now.
With your Python bridge, it should be pretty easy to connect to Rhasspy using the rhasspy-client library. I need to add a REST endpoint for intent handling directly from JSON (there are endpoints that handle intents derived from speech/text already).
This sounds interesting too. Are those clients REST-based or WebSocket too?
I almost forgot, I’d also like to understand how you do online decoding in Kaldi. I have it mostly working (based on Zamia’s Python extension), but it occasionally errors out. Do you use endpointing to do silence detection, or something different?
Both. You can choose between native ASR and your own server. Audio is streamed to the STT server via a WebSocket connection.
Unfortunately there is no good documentation yet, but the SEPIA Control HUB has a test page for the APIs. Basically it’s an HTTP POST including an authentication token. Authentication is required because each user can have their own set of custom commands and SDK services.
I guess this (the client lib) is something I could start experimenting with after the coming SEPIA update … I was planning to do some Python-bridge demos anyway ^^. The REST endpoint would be nice if users want to run SEPIA and Rhasspy on separate machines.
The main client uses WebSocket, but the SEPIA Control HUB could be called a “client” too (although it’s only used for testing) and uses the HTTP POST interfaces.
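As a rough illustration of "an HTTP POST including an authentication token", the sketch below builds such a request with the Python standard library. Since the API is not documented in this thread, the URL, the form encoding and the field names (`text`, `user`, `token`) are all assumptions chosen for the example, not SEPIA's actual interface.

```python
import json
import urllib.parse
import urllib.request

# Placeholder URL: the real host, port and path are not given in this thread.
SEPIA_ANSWER_URL = "http://localhost:8080/answer"


def build_answer_request(text, user_id, token, url=SEPIA_ANSWER_URL):
    """Build (but do not send) an authenticated POST for one user utterance."""
    fields = {
        "text": text,     # hypothetical field name
        "user": user_id,  # hypothetical field name
        "token": token,   # hypothetical field name
    }
    body = urllib.parse.urlencode(fields).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def send(request, timeout=10.0):
    """Send the request and decode the reply, assuming it is JSON."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Because the token travels with every request, each call can be resolved against that user's own custom commands and SDK services.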
Are the errors only related to the VAD (voice activity detection)? I had some trouble with this too and decided to cut audio aggressively if it’s longer than 4s until I find a good VAD (server or client based).
I just remembered that I started building a SEPIA Python lib a while ago (~2 years) to authenticate the external wake-word tool and make an HTTP POST call to the ‘remote’ endpoint.
This should be useful for making calls to the ‘interpret/understand/answer’ endpoints as well.
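The "cut after 4s" fallback can be sketched as a simple cap on the buffered audio. This is only an illustration, not SEPIA's code: it assumes raw 16 kHz, 16-bit mono PCM, and the function name and parameters are hypothetical.

```python
def cap_pcm_duration(pcm, max_seconds=4.0, sample_rate=16000,
                     sample_width=2, channels=1):
    """Return at most max_seconds of raw PCM audio, cut on a frame boundary."""
    bytes_per_frame = sample_width * channels
    max_bytes = int(max_seconds * sample_rate) * bytes_per_frame
    # Drop a trailing partial frame so the decoder never sees half a sample.
    usable = len(pcm) - (len(pcm) % bytes_per_frame)
    return pcm[:min(usable, max_bytes)]
```

A hard cap like this truncates long utterances mid-word, which is why it only makes sense as a stopgap until a proper VAD decides where speech ends.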
Maybe we can build a little proof-of-concept with it like using Rhasspy for STT and then sending the text to SEPIA’s ‘answer’ endpoint?
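A minimal sketch of what such a proof-of-concept could look like: post a WAV recording to Rhasspy for transcription, then forward the text to SEPIA. The Rhasspy URL assumes its HTTP API on the default port with a speech-to-text endpoint that returns plain text. The SEPIA URL and the `text`/`token` fields are placeholders, since the real API is not documented in this thread.

```python
import json
import urllib.parse
import urllib.request

# Assumed Rhasspy endpoint (WAV in, plain-text transcription out).
RHASSPY_STT_URL = "http://localhost:12101/api/speech-to-text"
# Placeholder SEPIA endpoint; the real URL is not given in this thread.
SEPIA_ANSWER_URL = "http://localhost:8080/answer"


def http_post(url, body, content_type, timeout=10.0):
    """POST raw bytes and return the raw response body."""
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


def speech_to_answer(wav_bytes, token, post=http_post):
    """Transcribe audio with Rhasspy, then ask SEPIA to answer the text."""
    text = post(RHASSPY_STT_URL, wav_bytes, "audio/wav").decode("utf-8").strip()
    if not text:
        return None  # nothing recognized, so there is nothing to answer
    # Hypothetical form fields for the SEPIA call.
    form = urllib.parse.urlencode({"text": text, "token": token}).encode("utf-8")
    reply = post(SEPIA_ANSWER_URL, form, "application/x-www-form-urlencoded")
    return json.loads(reply.decode("utf-8"))
```

The `post` parameter is injectable so the chain can be tried with a fake transport before either server is running.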