French dataset audio files

Hi Rhasspy community!
I am recording my voice so that one day there will be a free and open-source French TTS voice.
I am recording all of Baudelaire's poems from his book “Les Fleurs du mal”. There are around 3500 verses, and I have already recorded ~1000 audio files.

If 3500 sentences are not enough, I also plan to record a dataset of random sentences created by @synesthesiam.

The dataset is available here: https://git.bksp.space/Tjiho/baudelaire-sentences

Do you think it will be helpful? Do the recordings seem good enough?

See if you can find any “French phonetic pangrams” similar to the ones that exist in English, such as:
“With tenure, Suzie’d have all the more leisure for yachting, but her publications are no good.”
Or
“Are those shy Eurasian footwear, cowboy chaps, or jolly earthmoving headgear?”

I don’t know what a phonetic pangram would be in French, but maybe you can supply some?

Read them several times with deliberately long pauses between words, as they make great KWS (keyword spotting) word datasets.

This is great! Let me know when you’re finished so I can train a new voice :slight_smile:

I checked the phoneme coverage of the Baudelaire sentences you have, and it’s above 95%. The only thing missing is the voiced velar nasal (the “ng” in the English “sing”). I don’t know how often this sound is actually used in French, but we may need to add a few sentences with it.
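
For anyone who wants to run a similar check, here is a minimal sketch of a coverage calculation. It is not the script actually used above: the function name `phoneme_coverage`, the toy inventory and the toy transcriptions are all assumptions for illustration, and it expects the sentences to be already converted to phoneme lists by whatever phonemizer you use.

```python
def phoneme_coverage(inventory, phonemized_sentences):
    """Return (fraction of inventory seen, sorted list of missing phonemes).

    inventory: iterable of phoneme symbols the voice should cover.
    phonemized_sentences: iterable of phoneme lists, one list per sentence.
    """
    inventory = set(inventory)
    seen = set()
    for phonemes in phonemized_sentences:
        seen.update(phonemes)
    covered = inventory & seen          # ignore symbols outside the inventory
    missing = sorted(inventory - seen)
    return len(covered) / len(inventory), missing


if __name__ == "__main__":
    # Toy data, NOT the real French inventory or real transcriptions.
    toy_inventory = ["a", "b", "k", "ŋ"]
    toy_sentences = [["a", "b"], ["k", "a"]]
    ratio, missing = phoneme_coverage(toy_inventory, toy_sentences)
    print(f"coverage: {ratio:.0%}, missing: {missing}")
```

The missing list is the useful part: it tells you which sounds need extra sentences added to the recording script.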

Example of French words with « ng »:
parking, dressing, piercing, ring, pressing, string.

There are surely many more. Most are English words pronounced « à la française ».

:blush: Cheers

Were these poems trying to avoid “ng” then, or was it just a coincidence? :smiley:

It’s because all the words listed by @fastjack come from English and are “quite new” to the French language.
As Baudelaire is quite an old guy, those words didn’t exist in everyday French in his time.

Thanks a lot to @tjiho for his possibly tedious work :woozy_face:

Yep. Pretty sure there were not many parkings available between 1840 and 1867 :wink:

@fastjack do you know of any French phonetic pangrams (all phones in one sentence)?
For KWS and sharing a basic dataset across languages, each language is likely to have its own pangrams that only its speakers will know.

“With tenure, Suzie’d have all the more leisure for yachting, but her publications are no good.”

“Are those shy Eurasian footwear, cowboy chaps, or jolly earthmoving headgear?”

If you are going to create the quickest and most concise dataset for KWS, then pangrams for each language are a good place to start.

@tjiho picked a really good KW: “Master Yoda” has 4 syllables/phones, which is much of the secret to a good KW. From “hey google” to “alexa”, the chance of rhyming and the phone similarity to other words are deliberately minimised. That is why the ‘Sheila’ & ‘Marvin’ of the Google command set are not that great: with 2 syllables/phones they are not all that unique and have a high chance of rhyming.
Stick a ‘hey’ on them, though, and they become quite robust as KWs, in the same way that ‘Master’ helps ‘Yoda’, which would not be robust on its own.
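
One rough way to put a number on “phone similarity” is an edit distance over phoneme sequences. The sketch below is only an illustration of that idea, not how any real KWS engine picks its wake words; the function name `phoneme_distance` and the approximate transcriptions are my own assumptions.

```python
def phoneme_distance(a, b):
    """Levenshtein distance between two phoneme sequences (lists of symbols)."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, 1):
        cur = [i]
        for j, pb in enumerate(b, 1):
            cost = 0 if pa == pb else 1
            cur.append(min(prev[j] + 1,          # delete pa
                           cur[j - 1] + 1,       # insert pb
                           prev[j - 1] + cost))  # substitute or match
        prev = cur
    return prev[-1]


if __name__ == "__main__":
    # Approximate, hand-written transcriptions (assumptions for illustration).
    marvin = ["m", "ɑ", "r", "v", "ɪ", "n"]
    martin = ["m", "ɑ", "r", "t", "ɪ", "n"]
    hey_marvin = ["h", "eɪ"] + marvin
    print("marvin vs martin:", phoneme_distance(marvin, martin))
    print("hey marvin vs martin:", phoneme_distance(hey_marvin, martin))
```

The short KW sits closer to an ordinary word than the longer “hey” version does, which is the intuition behind adding a prefix.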

Found this on the web:

But yeah, language-specific sentences for which you can get multiple Rhasspy users to submit samples/links to distributed datasets.

A few choice words and a large number of different users both matter, as getting as many different people as possible saying the same KW is a great addition.

Steal a few of them and make them French Rhasspy defaults, as the constraints of the English Google command set must be an annoyance.
Even for English, the command set is not that great phonetically. It also contains “no” & “go”, very simple, phonetically similar words that act as canaries for diagnosis.

@fastjack Any chance you could provide a French accent version of ‘raspberry’ & ‘hey raspberry’?

Would you need clean recordings (good microphone, no noise or background sounds) or more ambient ones (near, far, etc)?

Anything is good, both or either.
Just use whatever you have, if that is all you have.
Clean, near recordings are always a good start, since you can always manipulate those with sox.
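
As a rough idea of what that manipulation can look like, here is a small sketch that only builds sox command lines for a few variants of one clean recording. It uses the standard sox effects `vol`, `tempo` and `pitch`; the helper name `sox_variants`, the file names and the particular effect values are assumptions, so tune them to taste.

```python
import shlex


def sox_variants(src, stem):
    """Build sox commands (argument lists) that derive variants of one clean file.

    Nothing is executed here; pass each list to subprocess.run() to apply it.
    """
    effects = {
        "quiet": ["vol", "0.5"],    # halve the amplitude
        "fast": ["tempo", "1.1"],   # 10% faster, pitch preserved
        "slow": ["tempo", "0.9"],   # 10% slower, pitch preserved
        "high": ["pitch", "100"],   # shift up by 100 cents
        "low": ["pitch", "-100"],   # shift down by 100 cents
    }
    return [["sox", src, f"{stem}_{name}.wav", *fx] for name, fx in effects.items()]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    for cmd in sox_variants("hey_raspberry_clean.wav", "hey_raspberry"):
        print(shlex.join(cmd))
```

Going the other way (cleaning up a noisy, far-field recording) is much harder, which is why clean near recordings are the better starting point.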

I’ve got a good mic at work. I’ll see what I can do :+1:

Cheers :grin:

PS great URL for ASR datasets of a plethora of languages.