French dataset audio files

tjiho · January 22, 2021, 11:34pm

Hi Rhasspy community!
I am recording my voice so that one day there will be a free and open-source French TTS voice.
I am recording all of Baudelaire's poems from his book “Les Fleurs du mal”. There are around 3500 verses, and I have already recorded ~1000 audio files.

If 3500 sentences are not enough, I also plan to record a dataset of random sentences created by @synesthesiam.

The dataset is available here: https://git.bksp.space/Tjiho/baudelaire-sentences

Do you think it will be helpful? Do the recordings seem good enough?

rolyan_trauts · January 23, 2021, 12:50am

It would help if you could find some “French phonetic pangrams”, similar to these English ones:
“With tenure, Suzie’d have all the more leisure for yachting, but her publications are no good.”
Or
“Are those shy Eurasian footwear, cowboy chaps, or jolly earthmoving headgear?”

I don't know any phonetic pangrams in French, but maybe you can supply some?

Read them several times with deliberately long pauses between words, as they make great KWS (keyword spotting) word datasets.

synesthesiam · January 23, 2021, 2:46pm

This is great! Let me know when you’re finished so I can train a new voice

I checked the phoneme coverage of the Baudelaire sentences you have, and it's > 95%. The only thing missing is the voiced velar nasal (the “ng” in the English “sing”). I don’t know how often this sound is actually used in French, but we may need to add a few sentences with it.
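
A minimal sketch of how a coverage check like this could work, not necessarily the method used above. The phoneme inventory and the tiny lexicon are hypothetical toy data; a real check would need a full French pronunciation lexicon and the complete phoneme set of the target voice.

```python
# Hypothetical toy data, for illustration only.
PHONEME_INVENTORY = {"a", "b", "l", "ɛ", "ʁ", "ŋ", "p", "k", "i"}

LEXICON = {
    "la": ["l", "a"],
    "belle": ["b", "ɛ", "l"],
    "barre": ["b", "a", "ʁ"],
    "parking": ["p", "a", "ʁ", "k", "i", "ŋ"],
}


def phoneme_coverage(sentences, lexicon, inventory):
    """Return (fraction of the inventory covered, set of missing phonemes)."""
    seen = set()
    for sentence in sentences:
        for word in sentence.lower().split():
            # Strip basic punctuation; words absent from the lexicon are skipped.
            seen.update(lexicon.get(word.strip(".,;:!?"), []))
    covered = seen & inventory
    missing = inventory - seen
    return len(covered) / len(inventory), missing


coverage, missing = phoneme_coverage(["La belle barre."], LEXICON, PHONEME_INVENTORY)
print(f"coverage: {coverage:.0%}, missing: {sorted(missing)}")
```

In this toy example, adding a sentence containing “parking” would fill the remaining gaps, including the velar nasal.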

fastjack · January 23, 2021, 3:03pm

Examples of French words with « ng »:
parking, dressing, piercing, ring, pressing, string.

There are surely many more. Most are English words pronounced « à la française ».

Cheers

synesthesiam · January 23, 2021, 3:31pm

Were these poems trying to avoid “ng” then, or was it just a coincidence?

urbatecte · January 23, 2021, 6:31pm

That is because all the words listed by @fastjack come from English and are “quite new” to the French language.
As Baudelaire was writing a long time ago, those words did not yet exist in everyday French.

Thanks a lot to @tjiho for his possibly tedious work.

fastjack · January 23, 2021, 7:32pm

Yep. Pretty sure there were not many parkings available between 1840 and 1867

rolyan_trauts · January 23, 2021, 9:04pm

@fastjack do you know of any French phonetic pangrams (all phones in one sentence)?
For KWS and for sharing a basic dataset across languages, each language is likely to have its own pangrams that only its speakers will know.

“With tenure, Suzie’d have all the more leisure for yachting, but her publications are no good.”

“Are those shy Eurasian footwear, cowboy chaps, or jolly earthmoving headgear?”

If you want to create the quickest and most concise dataset for KWS, then pangrams for each language are a good place to start.
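
When no ready-made pangram is known for a language, one possible approach is to pick a small set of sentences that together cover every phone. The sketch below uses a greedy set-cover heuristic, which is my own illustration and not something proposed in this thread. The sentence labels and phone sets are hypothetical placeholders.

```python
def pick_covering_sentences(sentence_phones, inventory):
    """Greedily choose sentences until every phone in `inventory` is covered.

    sentence_phones maps a sentence (or its id) to the set of phones it contains.
    Returns (chosen sentences in pick order, phones that no sentence covers).
    """
    remaining = set(inventory)
    chosen = []
    while remaining and sentence_phones:
        # Take the sentence that covers the most still-missing phones.
        best = max(sentence_phones, key=lambda s: len(sentence_phones[s] & remaining))
        gained = sentence_phones[best] & remaining
        if not gained:
            break  # nothing left in the pool covers the remaining phones
        chosen.append(best)
        remaining -= gained
    return chosen, remaining


# Hypothetical placeholder data: "s1".."s4" stand in for real sentences.
inventory = {"a", "e", "i", "o", "u"}
sentence_phones = {
    "s1": {"a", "e"},
    "s2": {"a", "e", "i"},
    "s3": {"o"},
    "s4": {"i", "o"},
}
chosen, uncovered = pick_covering_sentences(sentence_phones, inventory)
print(chosen, uncovered)
```

Any phones left in `uncovered` show which sounds still need a new sentence written for them.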

@tjiho picked a really good KW: “Master Yoda” has 4 syllables/phones, and that is much of the secret to a good KW. From “hey google” to “alexa”, the chance of rhyming and the phone similarity with other words are deliberately minimised. That is why ‘Sheila’ & ‘Marvin’ in the Google command set are not that great: with only 2 syllables/phones they are not very unique and have a high chance of rhyming.
Stick a ‘hey’ in front of them, though, and they become quite robust as KWs, in the same way that ‘Master’ helps ‘Yoda’, which would not work well on its own.

fastjack · January 23, 2021, 10:05pm

Found this on the web:

rolyan_trauts · January 23, 2021, 10:11pm

But yeah, language-specific sentences are something you can get multiple Rhasspy users to submit samples of, or links to, for distributed datasets.

A few choice words and a high count of different users matter, as getting as many different people as possible saying the same thing is a great addition for KW.

Steal a few of them and make them French Rhasspy defaults, as the constraints of the English Google command set must be an annoyance.
Even for English the command set is not that great phonetically. It also contains “no” & “go”, very simple, phonetically similar words that act as canaries for diagnosis.
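
One rough way to put a number on “phonetically similar” is an edit distance over phone sequences. The sketch below is a plain Levenshtein distance. The transcriptions are approximate ARPAbet-style guesses supplied for illustration, not taken from the Google command set documentation.

```python
def phone_edit_distance(a, b):
    """Levenshtein distance between two phone sequences (lists of symbols)."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        curr = [i]
        for j, pb in enumerate(b, start=1):
            cost = 0 if pa == pb else 1
            curr.append(min(prev[j] + 1,          # delete a phone from a
                            curr[j - 1] + 1,      # insert a phone from b
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


# Approximate transcriptions, assumed for illustration only.
NO = ["N", "OW"]
GO = ["G", "OW"]
MARVIN = ["M", "AA", "R", "V", "IH", "N"]

print(phone_edit_distance(NO, GO))      # a single substitution apart
print(phone_edit_distance(NO, MARVIN))  # far apart
```

A low distance between two keywords suggests they are easy to confuse, which is what makes pairs like “no” and “go” useful for diagnosing a model.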

@fastjack Any chance you could provide a French accent version of ‘raspberry’ & ‘hey raspberry’?

fastjack · January 24, 2021, 10:53pm

Would you need clean recordings (good microphone, no noise or background sounds) or more ambient ones (near, far, etc.)?

rolyan_trauts · January 24, 2021, 10:56pm

All is good, both or either.
Just use whatever you have, if that is all you have.
Clean, near recordings are always a good start, as you can always manipulate those with sox afterwards.
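
As a rough illustration of that kind of manipulation, the sketch below builds a few sox command lines that derive variants from one clean recording. The file names and effect values are arbitrary assumptions; `vol`, `reverb`, `tempo` and `pitch` are standard sox effects, but the right settings depend on the dataset.

```python
import shlex


def sox_variants(clean_wav, out_prefix):
    """Return sox commands (as argument lists) that derive variants of one clean file."""
    return {
        # Lower the amplitude to mimic a more distant speaker.
        "quiet": ["sox", clean_wav, f"{out_prefix}_quiet.wav", "vol", "0.3"],
        # Add room reverberation (value is the reverberance percentage).
        "reverb": ["sox", clean_wav, f"{out_prefix}_reverb.wav", "reverb", "50"],
        # Speak 10% faster without changing the pitch.
        "fast": ["sox", clean_wav, f"{out_prefix}_fast.wav", "tempo", "1.1"],
        # Shift the pitch up by 100 cents (one semitone).
        "pitch_up": ["sox", clean_wav, f"{out_prefix}_pitch_up.wav", "pitch", "100"],
    }


# Hypothetical file names, for illustration only.
variants = sox_variants("hey_raspberry_clean.wav", "augmented/hey_raspberry")
for name, cmd in variants.items():
    print(name, "->", shlex.join(cmd))
```

Each command can then be run with `subprocess.run(cmd, check=True)`, provided sox is installed and the output directory exists.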

fastjack · January 24, 2021, 11:00pm

I’ve got a good mic at work. I’ll see what I can do

Cheers

rolyan_trauts · January 24, 2021, 11:41pm

PS great URL for ASR datasets of a plethora of languages.