Thank you very much for the quick answer. This is great.
It was my fault. I didn’t see that phonetisaurus-pypi
creates the required corpus:
$ phonetisaurus train --help
usage: phonetisaurus train [-h] [--corpus CORPUS] [--lexicon-word-separator LEXICON_WORD_SEPARATOR] [--lexicon-phoneme-separator LEXICON_PHONEME_SEPARATOR] --model MODEL
                           [--casing {lower,upper,ignore}] [--debug] [--machine {x86_64,armv6l,armv7l,armv8}]
                           lexicon [lexicon ...]

positional arguments:
  lexicon               Path(s) to read one or more phonetic dictionaries

optional arguments:
  -h, --help            show this help message and exit
  --corpus CORPUS       Path to write trained g2p corpus
  --lexicon-word-separator LEXICON_WORD_SEPARATOR
                        Separator regex between words in each lexicon entry (default: \s+)
  --lexicon-phoneme-separator LEXICON_PHONEME_SEPARATOR
                        Separator regex between phonemes in each lexicon entry (default: \s+)
  --model MODEL         Path to g2p model
  --casing {lower,upper,ignore}
                        Case transformation to apply to words
  --debug               Print DEBUG messages to the console
  --machine {x86_64,armv6l,armv7l,armv8}
                        Override detected platform machine type
Now my Luxembourgish corpus is ready (lb-corpus.dict: 14.1 MB):
....
a}ɑ v}v l}l
a|a}aː c|h}χ t}t c|h}ɕ e}ə n}n
a|a}aː c|h}χ t}t c|h}ɕ e}ə
a|a}aː c|h}χ t}t e|r}ɐ c|h}ɕ e|r}ɐ
a|a}aː c|h}χ t}t d}d e|e}eː l}l e|r}ɐ
a|a}aː c|h}χ t}t e}æ c|k}k
....
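For anyone reading along, here is a minimal sketch of how such an aligned line can be turned back into a word and its phoneme list. It assumes that "}" separates the grapheme side from the phoneme side, that "|" joins several symbols inside one cluster, and that "_" marks a skipped symbol. That matches the lines shown above, but the "_" convention and the function name parse_aligned_line are my assumptions, not something taken from the tool itself.

```python
def parse_aligned_line(line, pair_sep="}", cluster_sep="|", skip="_"):
    """Split one aligned corpus line into (word, phonemes).

    Assumed format: whitespace-separated tokens of the form
    graphemes}phonemes, with "|" joining symbols inside a cluster
    and "_" (assumption) marking an empty side.
    """
    graphemes, phonemes = [], []
    for token in line.split():
        left, _, right = token.partition(pair_sep)
        graphemes.extend(g for g in left.split(cluster_sep) if g != skip)
        phonemes.extend(p for p in right.split(cluster_sep) if p != skip)
    return "".join(graphemes), phonemes


# One of the corpus lines shown above
word, phones = parse_aligned_line("a|a}aː c|h}χ t}t c|h}ɕ e}ə n}n")
print(word, phones)
```

Running this over a few lines of lb-corpus.dict is a quick way to spot-check that the alignment kept the dictionary entries intact.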
The next step is training the CRF model. This takes some time. While waiting for the result, I visited the voice2json.org website.
Finally, the model-lb.crf file is saved: 474.1 KB.
Time to check whether it works. Let’s predict the first sentence of the fable De Nordwand an d’Sonn.
python3 gruut/g2p.py predict --model /home/mbarnig/myTTS-Project/g2p/model-lb.crf --debug "An der Zäit hunn sech den Nordwand an d’Sonn gestridden, wie vun hinnen zwee wuel méi staark wier, wéi e Wanderer, deen an ee waarme Mantel agepak war, iwwert de Wee koum."
>>>
ɑ n d ə ʀ e ts æːɪ t h u n z ə ɕ d ə n ɑ̃ː n ɔ ʀ t v aː n d aː n d z o n g ə ʃ t ʀ i d ə n yː v iə v u n t h i n ə n ts w eː v uə l m ɜɪ ə ʃ t aː ɐ k s v iː ɐ ɲ v ɜɪ aː ʀ v ɑ n d ə ʀ ə ʀ ɑ̃ː d eː n ɑ n eː v aː ʀ m eː m ɑ n t ə l ɑ g ə p aː k v aː ʀ d i v ɐ t d eː v eː k əʊ m
Let’s compare with the phonemes guessed by the old model g2p-lb.fst:
phonetisaurus predict --model /home/mbarnig/myTTS-Project/g2p/g2p-lb.fst "an der zäit hunn sech den nordwand an d’sonn gestridden, wie vun hinnen zwee wuel méi staark wier, wéi e wanderer, deen an ee waarme mantel agepak war, iwwert de wee koum."
>>>
ɑ n d ɐ ts æːɪ t h u n z ə ɕ d æ n ɔ ʀ d v ɑ n d ɑ n t z o n g ə ʃ t ʀ i d æ n v iə f u n h i n ə n ts w eː v uə l m ɜɪ ʃ t aː ʀ k v iə v ɜɪ ə v ɑ n d ə ʀ ɐ d eː n ɑ n eː v aː ʀ m ə m ɑ n t ə l aː ʁ ə p aː k v ɑ ʀ i v ɐ t d ə v eː k əʊ m
Here is the official Luxembourgish phonemization:
ɑn dɐ ‚ʦæ:ɪt / hun zeɕ dən ’noʀtvɑnt ɑn ‚dzon gə’ʃtʀidən / viə fun hinən ‚ʦve: vuəl ‚meɪ ʃta:ʀk viɐ / veɪ eː ‚vɑndəʀɐ / de:n ɑn eː ‚va:ʀmə ‚mɑntəl ‚a:ɡəpa:k va:ʀ / ivɐt də ‚veː kəʊm
Not quite the same, but I think both models will do a good job in combination with the pronunciation dictionary.
After a final check of my whole configuration, I will soon start the first training round with my small Luxembourgish dataset and the CRF model.
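To put a number on "not quite the same", one option is a token-level edit distance between the two space-separated phoneme strings. The sketch below is plain Levenshtein distance written by hand, not a function from gruut or phonetisaurus; the name edit_distance is mine. The example input is only the first few phonemes ("An der") from the two outputs above; the full strings can be pasted in the same way.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two lists of phoneme tokens."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


# First phonemes of each prediction ("An der"), copied from the outputs above
crf = "ɑ n d ə ʀ".split()  # model-lb.crf
fst = "ɑ n d ɐ".split()    # g2p-lb.fst

dist = edit_distance(fst, crf)
print(dist, dist / len(fst))
```

Dividing by the length of the reference sequence gives a rough phoneme error rate. The official transcription would first need its stress marks and phrase separators removed and its words split into single phonemes before it could be compared in the same way.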