Advanced media player control?

Yes, and I’m not just talking about play, pause, play next, and so on. I’m wondering whether it’s at all possible to call up by voice one of several hundred artists, several thousand songs, etc. If I loaded all the possibilities into a database and then auto-generated intents, would this be at all feasible?

Or could the new list-building support be used as a model for this: speech-to-text output fed into a database query to look up the desired selection? Has anyone done this? Does anyone think it would work?

I’m interested in this as well :)

You can generate some slots programmatically by putting a script in the profiles/en/slot_programs folder (e.g. called albums, artists, or songs) that returns a plaintext list of values, one per line.
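Roughly, the script just prints its values to stdout. An untested sketch, assuming a hypothetical music.db SQLite file with an artists table (both names are placeholders):

```python
#!/usr/bin/env python3
# Hypothetical slot program: profiles/en/slot_programs/artists
# Rhasspy runs this at training time; each printed line becomes a slot value.
import sqlite3

conn = sqlite3.connect("/path/to/music.db")  # placeholder path
for (name,) in conn.execute("SELECT DISTINCT name FROM artists ORDER BY name"):
    print(name)
```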

I’ve done this for TV shows and movies in my collection with good results. I run a little API server in front of the Kodi video database that generates slots for {movies} and {tv_shows} and is called by my scripts in profiles/en/slot_programs. Have some (very messy, prototype-y) code up here that might be semi-useful as a guide:

I use tag synonyms (Training - Rhasspy) where the left side is the spoken text to be recognized, and the right side is a media ID value that gets passed into my intent.
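So the slot program can emit substitution lines directly, using the (spoken text):(substituted value) syntax from the Training docs. A sketch of that idea, again against the hypothetical music.db, this time assuming a movies table with id and spoken_title columns:

```python
#!/usr/bin/env python3
# Hypothetical slot program: profiles/en/slot_programs/movies
# Each printed line maps what you say to what the intent receives.
import sqlite3

conn = sqlite3.connect("/path/to/music.db")  # placeholder path
for movie_id, spoken in conn.execute("SELECT id, spoken_title FROM movies"):
    print(f"({spoken}):{movie_id}_movie")
```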

The problem I’ve run into with music is that a big enough library just makes it hard to get good recognition results on your spoken input… and it’s painful to handle the logic around two songs with the same name by different artists, or the same name and artist but different albums, things like that.

So, long story short: it should be doable, depending on just how large your library is and how willing you are to deal with the logic around all the fun edge cases.

I appreciate your response, and even more so, your example code. But as I said in my original question, I’m not sure of the feasibility of using this against a database of tens of thousands of rows. And by “feasibility” I mean: can it perform with reasonable responsiveness?

But I had another question as I went through your code. Even though I don’t pretend to understand it all (mostly because I’m not familiar with the functions called), I was confused by the unfiltered load of the entire database into arrays, until I saw that you then ran a function against each row to generate the spoken-text version of the title. Couldn’t this value be persisted in each row instead of regenerated on every invocation? And if it were stored that way, could the database lookup somehow be keyed on it, based on Rhasspy’s generated speech-to-text?
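To sketch what I mean (untested, against a hypothetical movies table, with simple lowercasing standing in for whatever real normalization is needed):

```python
import sqlite3

conn = sqlite3.connect("/path/to/music.db")  # placeholder path

# one-time schema change: persist the spoken form and index it for lookups
conn.execute("ALTER TABLE movies ADD COLUMN spoken_title TEXT")
conn.execute("CREATE INDEX IF NOT EXISTS idx_movies_spoken ON movies(spoken_title)")

for movie_id, title in conn.execute("SELECT id, title FROM movies").fetchall():
    spoken = title.lower()  # stand-in: a real version would strip punctuation too
    conn.execute("UPDATE movies SET spoken_title = ? WHERE id = ?", (spoken, movie_id))
conn.commit()
```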

But to your point about similarities and ambiguities: yes, I definitely agree there would be some heavy lifting in this area, some sort of disambiguation processing, perhaps including follow-up prompts to narrow the result set.

Definitely some things I could make more efficient here, but I go through the entire database because these slot values are only generated when Rhasspy runs training (which is only needed when your slot values change, e.g. you added a new song to your library), not every time you invoke a command (e.g. “Play {movie}”). I run training on a nightly basis to pick up any changes to my catalog, so a few extra seconds isn’t a big deal for me.

If I understand the question, that’s what I’m doing. I say “Rhasspy, play Kill Bill 1”, which Rhasspy STT picks up as “kill bill one”; slot synonyms then translate that to “1234_movie”, and when the intent is sent to my media server it receives “1234_movie” as the value, which is the ID of the movie in the Kodi database.
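For completeness, the receiving side can be as simple as splitting that value apart and calling Kodi’s JSON-RPC API. A rough sketch; the host/port and handler function are assumptions, not my exact code:

```python
import requests  # third-party; pip install requests

KODI_URL = "http://kodi.local:8080/jsonrpc"  # assumed host/port

def handle_play(slot_value: str) -> None:
    # slot_value looks like "1234_movie": database ID plus media type
    media_id, media_type = slot_value.split("_", 1)
    payload = {
        "jsonrpc": "2.0",
        "id": 1,
        "method": "Player.Open",
        "params": {"item": {f"{media_type}id": int(media_id)}},
    }
    requests.post(KODI_URL, json=payload)

handle_play("1234_movie")  # starts playback of movie 1234 in Kodi
```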

If you’re asking why I don’t just query the database for “kill bill one” instead of sending back that ID value, the answer is pretty simple: the STT values Rhasspy sends back lack any punctuation or special characters, so I can’t just do a simple query against the title.
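To illustrate the mismatch, here’s a hedged sketch (not my actual code) using the third-party num2words package to produce the kind of spoken form STT returns:

```python
import re
from num2words import num2words  # third-party; pip install num2words

def to_spoken(title: str) -> str:
    # strip punctuation and spell out digits ("1" -> "one"), like STT output
    words = re.sub(r"[^A-Za-z0-9]+", " ", title).split()
    return " ".join(num2words(int(w)) if w.isdigit() else w.lower() for w in words)

print(to_spoken("Kill Bill: Vol. 1"))  # -> "kill bill vol one"
# A naive query for the raw title "Kill Bill: Vol. 1" would never match this.
```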

A database, even a simple SQLite database, can handle this kind of thing fine at the 10k–100k row level. Again, the database is only getting hit during training, and that shouldn’t need to be a super frequent event for you.

At some point I had over 10k possible values for a slot and I didn’t have any responsiveness issues with the actual STT model, though accuracy seemed more hit-and-miss. Consider how many song titles are going to sound phonetically similar when there are that many options…

I appreciate your patience answering my naive questions. That all made sense, especially after you explained that the code I was looking at was for training; I thought it was part of an actual command or request being processed. I’ve only done some light coding in Python for the past year, and I’m totally new to Rhasspy. So based on what you’ve described, I’m going to have to do a deep dive into this now. It’s looking really promising for my voice-controlled jukebox. (I’m actually controlling a pair of 300-CD changers: I’ve got browser-based remote control now and am trying to add voice. I was going to go with an Alexa skill but couldn’t make the artist-song-album variables work within that framework.) Thanks again.

Just a warning to consider. While having tons of artists and titles can make recognition worse, if any of them are in a different language, querying them by voice will be hard. If Rhasspy runs in English and there are a few French or German titles in there, chances are you’ll have to pronounce them as if they were English for it to understand you. This can be mitigated by specifying the correct pronunciation in the Words tab of Rhasspy, but the auto-guessed pronunciation will be wrong, so you have to manually adjust it for each word. I imagine this could take ages, depending on the music library in question.
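For reference, those manual pronunciations end up in custom_words.txt in the profile directory, one word per line followed by its phonemes. A couple of illustrative entries; the phonemes here are my guesses in the CMU-style set that English profiles use, so double-check them in the Words tab:

```
despacito  D EH S P AA S IY T OW
juanes     HH W AA N EH S
```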

Thanks for the tips. I do have a lot of Latin artists/songs – will have to see how that works out.
