UnicodeDecodeError when training my Piper model

Yahir · February 29, 2024, 3:19pm

When I start the preprocessing step I get a UnicodeDecodeError. I have already tried different distros and I get the same error… I am trying to train with a dataset that contains important Spanish characters such as the letter “ñ”. Is there a solution?
Thanks in advance for your responses.

(.venv) root@debian:~/piper/src/python# python3 -m piper_train.preprocess \
>   --language es-419 \
>   --input-dir ~/piper/my-dataset \
>   --output-dir ~/piper/my-training \
>   --dataset-format ljspeech \
>   --single-speaker \
>   --sample-rate 22050
Traceback (most recent call last):
  File "/usr/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/root/piper/src/python/piper_train/preprocess.py", line 502, in <module>
    main()
  File "/root/piper/src/python/piper_train/preprocess.py", line 143, in main
    for utt in make_dataset(args):
  File "/root/piper/src/python/piper_train/preprocess.py", line 422, in ljspeech_dataset
    for row in reader:
  File "/usr/lib/python3.9/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 47: invalid continuation byte

synesthesiam · February 29, 2024, 3:24pm

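The traceback shows the dataset file being read with the 'utf-8' codec, so you will need to make sure it is converted to UTF-8, which supports Spanish characters. The byte 0xe9 in the error is “é” in ISO-8859-1 and Windows-1252, so the file was probably saved in one of those encodings.
One way to do this is to first check the file’s encoding: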

file -bi dataset.csv
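
If `file` is not available, the same check can be roughly sketched in Python. This is only an illustration: the function name `first_invalid_utf8` is made up for this example and is not part of Piper. It reports the offset and value of the first byte that breaks UTF-8 decoding, which is the same information the traceback gives ("byte 0xe9 in position 47").

```python
from pathlib import Path
from typing import Optional, Tuple


def first_invalid_utf8(path: str) -> Optional[Tuple[int, int]]:
    """Return (offset, byte value) of the first byte that is not valid UTF-8.

    Returns None when the whole file decodes cleanly as UTF-8.
    Hypothetical helper for illustration; not part of piper_train.
    """
    data = Path(path).read_bytes()
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the offset of the offending byte within the file
        return err.start, data[err.start]
    return None
```

For example, `first_invalid_utf8("dataset.csv")` returning `None` would mean the file is already valid UTF-8 and the problem lies elsewhere.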

Then you can use iconv to convert it:

iconv -f <encoding> -t utf-8 dataset.csv -o dataset-utf8.csv

To see all of the available encodings, use iconv --list.
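
If iconv is not at hand, the conversion can also be done with a few lines of Python. The sketch below assumes the source encoding is ISO-8859-1, which is only a guess based on the 0xe9 byte in the error; pass whatever `file -bi` actually reports. The function name `convert_to_utf8` and the file names are placeholders for this example.

```python
from pathlib import Path


def convert_to_utf8(src: str, dst: str, source_encoding: str = "iso-8859-1") -> None:
    """Re-encode the file at src from source_encoding to UTF-8 and write it to dst.

    The default source_encoding is an assumption; check the real encoding first.
    """
    # Work on raw bytes so line endings are passed through unchanged
    text = Path(src).read_bytes().decode(source_encoding)
    Path(dst).write_bytes(text.encode("utf-8"))


if __name__ == "__main__":
    # Placeholder file names, matching the iconv example above
    convert_to_utf8("dataset.csv", "dataset-utf8.csv")
```

Writing to a new file rather than overwriting the original keeps a fallback copy in case the guessed source encoding was wrong and characters such as “ñ” come out garbled.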