HA Wakeword Collective

rolyan_trauts · December 6, 2024, 11:38am

It's another strange one from Home Assistant that seems to misunderstand the problems with room reverberation (RIR).

A Room Impulse Response (RIR) captures the acoustic characteristics of a room: how sound from a source reflects off the surfaces and decays before it reaches the mic.

The Wake Word Collective - Open Home Foundation page says: ‘the wake word while you walk around the room’

It's extremely hard to remove room impulse response (e.g. with beamforming), but very easy to add it, and there are many GitHub projects that do this. GitHub - LCAV/pyroomacoustics is one of many that can accurately add RIR. Its description: "Pyroomacoustics is a package for audio signal processing for indoor applications. It was developed as a fast prototyping platform for beamforming algorithms in indoor scenarios."
You only need recordings made at <0.3m, which have very low RIR. Tools such as pyroomacoustics can then accurately simulate many distances from them.
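To make the "easy to add" point concrete, here is a rough pure-Python sketch of the idea. It is not pyroomacoustics and not its API; it is a hand-rolled stand-in that only models the direct path plus the six first-order wall reflections of a box-shaped room. The room size, positions, `reflection` coefficient and function names are all assumptions picked for illustration.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, approximate value at room temperature


def first_order_rir(room, src, mic, fs=16000, reflection=0.7):
    """Direct path plus six first-order wall reflections of a box room.

    A simplified image-source sketch; real simulators go to much
    higher reflection orders and model frequency-dependent absorption.
    """
    images = [(tuple(src), 1.0)]
    for axis in range(3):
        for wall in (0.0, room[axis]):
            img = list(src)
            img[axis] = 2.0 * wall - src[axis]  # mirror source across wall
            images.append((tuple(img), reflection))

    paths = []
    for pos, gain in images:
        dist = math.dist(pos, mic)
        delay = int(round(dist / SPEED_OF_SOUND * fs))
        paths.append((delay, gain / max(dist, 0.1)))  # 1/r spreading loss

    rir = [0.0] * (max(d for d, _ in paths) + 1)
    for delay, amp in paths:
        rir[delay] += amp
    return rir


def convolve(signal, rir):
    """Plain time-domain convolution: renders a dry signal in the room."""
    out = [0.0] * (len(signal) + len(rir) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(rir):
            if h:
                out[i + j] += s * h
    return out


def direct_to_reflected(rir):
    """Ratio of the first (direct) tap to the sum of all later taps."""
    taps = [h for h in rir if h != 0.0]
    return taps[0] / sum(taps[1:])


fs = 16000
room = (5.0, 4.0, 2.5)   # hypothetical room, metres
src = (1.0, 2.0, 1.5)    # talker position
near_rir = first_order_rir(room, src, (1.3, 2.0, 1.5), fs)  # mic at 0.3 m
far_rir = first_order_rir(room, src, (4.0, 2.0, 1.5), fs)   # mic at 3.0 m

# Stand-in for a dry near-field recording: a 10 ms 440 Hz burst.
dry = [math.sin(2 * math.pi * 440 * n / fs) for n in range(160)]
far_wet = convolve(dry, far_rir)  # the same clip "re-recorded" at 3 m
```

In this toy room, `direct_to_reflected(near_rir)` comes out well above `direct_to_reflected(far_rir)`, which is the reason the near recording is the useful one to collect: the far version can be generated from it, but not the other way round.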

This allows you to provide datasets for specific devices. For a device with beamforming, recordings made walking around a room, including ones with large RIR (large rooms), will greatly increase dataset entropy, and the resulting models will be less accurate because the device's beamforming attenuates RIR, so what it hears no longer matches the training samples.
The same goes for any model created with samples containing RIR. By its nature, sound bounces off walls and arrives at the mic at different times because the path lengths differ.
These arrivals mix at the mic, and the more distant the mic, the more the recorded spectra will differ, as more reverberation and mixing happens, creating very different harmonics.
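A rough way to see the spectral effect is to take just the direct sound plus one reflection. The two add as H(f) = 1 + a·exp(-j2πfτ), a comb filter whose notches sit at f = (2k+1)/(2τ), where τ is the extra travel time of the reflection and a is its level relative to the direct sound. The sketch below assumes a single reflection, simple 1/r spreading and a made-up `reflection` coefficient of 0.7; the path lengths are hypothetical.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s, approximate


def reflection_notches(direct_m, reflected_m, max_hz=8000.0):
    """Frequencies where direct sound and one reflection cancel most."""
    tau = (reflected_m - direct_m) / SPEED_OF_SOUND
    if tau <= 0:
        raise ValueError("reflected path must be longer than direct path")
    notches = []
    k = 0
    while (2 * k + 1) / (2.0 * tau) <= max_hz:
        notches.append((2 * k + 1) / (2.0 * tau))
        k += 1
    return notches


def magnitude(freq_hz, direct_m, reflected_m, reflection=0.7):
    """|H(f)| of direct + one reflection, normalised to the direct level."""
    tau = (reflected_m - direct_m) / SPEED_OF_SOUND
    a = reflection * direct_m / reflected_m  # reflection level vs direct
    return abs(1 + a * cmath.exp(-2j * math.pi * freq_hz * tau))


# Hypothetical paths: mic at 0.3 m vs 3.0 m, same single reflection surface.
near = (0.3, 2.022)
far = (3.0, 3.606)
near_notches = reflection_notches(*near)
far_notches = reflection_notches(*far)
near_depth = magnitude(near_notches[0], *near)
far_depth = magnitude(far_notches[0], *far)
```

At 0.3m the reflection is much weaker than the direct sound, so `near_depth` stays close to 1 (a slight ripple). At 3m the reflection is nearly as loud as the direct sound, so `far_depth` drops much lower and the notches land at different frequencies. A real room has many such reflections stacked on top of each other.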

At the very least, if you're recording at a distance, supply metadata for that distance so a dataset can be filtered to only include near (<0.3m) recordings, and so you can be sure you have an even spread rather than creating dataset bias.

Metadata is hugely important. Recording device, recording distance, gender, age band, language and region (regional accents) are essential so that you can filter and create evenly spread datasets, or tailor a dataset to a type of metadata for more accuracy.
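As a sketch of what that metadata buys you, assuming each clip carries a record like the one below (the field names, file paths and values are invented for illustration, not any schema the Wake Word Collective has published):

```python
import random
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    path: str
    device: str
    distance_m: float
    gender: str
    age_band: str
    lang: str
    region: str


def near_field(samples, max_distance_m=0.3):
    """Keep only recordings made closer than max_distance_m."""
    return [s for s in samples if s.distance_m < max_distance_m]


def balance(samples, key, seed=0):
    """Downsample every group to the size of the smallest group."""
    groups = defaultdict(list)
    for s in samples:
        groups[key(s)].append(s)
    if not groups:
        return []
    n = min(len(g) for g in groups.values())
    rng = random.Random(seed)  # fixed seed so the subset is repeatable
    balanced = []
    for name in sorted(groups):
        balanced.extend(rng.sample(groups[name], n))
    return balanced


# Hypothetical records, for illustration only.
samples = [
    Sample("a.wav", "phone", 0.2, "f", "18-29", "en", "scotland"),
    Sample("b.wav", "phone", 0.25, "m", "30-44", "en", "london"),
    Sample("c.wav", "laptop", 0.2, "m", "45-59", "en", "london"),
    Sample("d.wav", "laptop", 0.1, "m", "18-29", "en", "wales"),
    Sample("e.wav", "phone", 2.5, "f", "30-44", "en", "wales"),
]

near = near_field(samples)                       # drops e.wav (2.5 m)
spread = Counter(s.gender for s in near)         # check spread before training
even = balance(near, key=lambda s: s.gender)     # one "f", one "m"
```

Without the `distance_m` field there is no way to separate `e.wav` from the rest, and without the other fields there is no way to spot that the near set is three-to-one male before it ends up baked into a model.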

From what I can see, Wake Word Collective - Open Home Foundation is going to create datasets only for devices without any form of RIR attenuation (beamforming and such), and create inaccurate models, as there is a limit to how much RIR you can include in samples: greater distances to the mic in big rooms give big differences in spectra and greatly increase dataset entropy.