Coqui STT and TTS

Today I found out about Coqui STT and TTS:

Coqui STT source code: https://github.com/coqui-ai/STT

Coqui TTS source code: https://github.com/coqui-ai/TTS

They claim it “can run in real time on anything from a Raspberry Pi 4 to a high-end GPU server.”

Something to investigate for integration with Rhasspy?

This looks like it’s built on DeepSpeech, and in its current form it looks basically identical :thinking:

Edit:
OK, this explains it:

Thanks, I hadn’t seen this Mozilla blog post yet, so it seems to be a “production-ready” fork of Mozilla DeepSpeech indeed. Let’s see how this evolves.

Especially interesting now that Mozilla has pretty much pulled all resources from future DeepSpeech development.

On the subject of ‘new’ toolkits, the SpeechBrain team on GitHub are doing final checks, and I presume release is days or weeks away now.


The website has been updated, I guess in preparation for the release.
They also have a new Discourse site.

ASR is through Kaldi / PyTorch, though it still confuses me where PyTorch stands with mobile and the Raspberry Pi.

With my tests with TensorFlow, I keep meaning to give this a go.

Again, not sure about the Pi.

PS: if the above is of any interest, hack the setup to use the non-GPU version of TensorFlow, as you are not going to want to train models on a Pi.
Also use AArch64, as TF & TFLite are heavily optimised for 64-bit, and the approx 2-3x improvement claim is true.
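If it helps, a quick sanity check that you ended up with a 64-bit userland and the CPU-only build (standard TensorFlow and platform calls, nothing specific to this project):

# Sanity check on the Pi: confirm a 64-bit OS and a CPU-only TensorFlow build.
import platform
import tensorflow as tf

print(tf.__version__)
print(platform.machine())                      # expect "aarch64" on 64-bit Raspberry Pi OS
print(tf.config.list_physical_devices("GPU"))  # expect [] for the non-GPU build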

There is an example for TFLite (https://github.com/TensorSpeech/TensorFlowTTS/tree/master/examples/cpptflite) and maybe it could be quite interesting for a Pi 3.

I made the mistake of building on the Pi 3 that was up and running; I should have popped that card into a Pi 400 and compiled the latest Lite. (It may take a while.) :slight_smile:
Or use the Bazel Docker cross-compile, as on my i5-3570 it’s quite quick (that was for the whl; I would have to check, but I presume the C++ API can be built the same way).

https://drive.google.com/file/d/1f-pmLtdzMt-Nm2GAK1XecZP_X2fN88xj/view?usp=sharing (the Pi is still at it, but the cross-compile only took 3 mins).

I am going to have to have a go at the Lite version, as on my PC the results are good.

https://drive.google.com/file/d/1hAK2NNRUYulYNYK5MeWHmsfKWL93D0Zp/view?usp=sharing

https://drive.google.com/file/d/10fEt2oSXDKKdJoGffBWw5kSIm5s2Byu1/view?usp=sharing

Not sure about the ‘before’ and ‘after’ outputs; I guess that is MelGAN, but the model seems to work OK.

I just grabbed the pretrained models mentioned in the examples for MelGAN & FastSpeech2.
You can not really tell the difference, but it takes approx 4 secs to get ‘before’ or 6 for ‘after’.
I have no idea if you can load up and pre-initialise the models, or stream the output.

For some reason it caused CUDA to look for an older 10.1, so I guess that was running on CPU on an i5-3570, but it would be interesting to see what the AArch64 TFLite version is like on a Pi 4, and maybe even a Pi 3.

Probably not, as on the Pi 400 @ 2.0 GHz the output takes 16 secs; even with the 2-3x TFLite speed improvement it is sadly going to be just over real time, which is a real shame considering the quality of the speech.

Also, on the subject of new speech tools, SpeechBrain has opened up and is no longer private.

PS

I had a play with https://github.com/TensorSpeech/TensorFlowTTS, using the TFLite version.

So it’s a Pi 400 running @ 2.0 GHz, but the results are faster than real time.

Of the 2 example text inputs, using audio_after:
run time 8.54567575454712, wave duration 11.110748, x 0.769135953
run time 6.87847638130188, wave duration 8.870023, x 0.77547447

Of the 2 example text inputs, using audio_before:
run time 9.178293704986572, wave duration 11.110748, x 0.82607343
run time 6.870104551315308, wave duration 8.87002, x 0.774530897
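The x figure is just run time divided by wave duration (the real-time factor). A quick sketch of how to compute it, using soundfile which the script below already depends on, and one of the output files as an example:

# Real-time factor: synthesis run time divided by the duration of the generated wav.
import soundfile as sf

def real_time_factor(run_time_s, wav_path):
    audio, sample_rate = sf.read(wav_path)
    return run_time_s / (len(audio) / sample_rate)

print(real_time_factor(8.54567575454712, "./audio_after-tfl1.wav"))  # ~0.77 on the Pi 400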

I had a problem on run with scikit-learn (“It seems that scikit-learn has not been built correctly.”), but export LD_PRELOAD=libgomp.so.1 before running python3 fixes that.

It’s always faster than real time, but giving a specific figure is hard, as the completion time varies.
This is a sequence of the same 2 text outputs on audio_before:

8.870368242263794
6.660700798034668
8.67034363746643
6.731372356414795
8.602596759796143
6.665104150772095
8.431090593338013
6.5711071491241455
8.590861797332764
6.671903133392334
8.377497673034668
6.607991933822632
8.403674602508545
6.556359767913818
8.454392433166504
6.628615856170654
8.48227572441101
6.701940536499023

Code used.

import numpy as np
import yaml
import tensorflow as tf
import soundfile as sf
import sys
import time

from tensorflow_tts.processor.ljspeech import valid_symbols

from tensorflow_tts.configs import FastSpeech2Config
from tensorflow_tts.configs import MelGANGeneratorConfig
from tensorflow_tts.inference import AutoProcessor

from tensorflow_tts.models import TFFastSpeech2
from tensorflow_tts.models import TFMelGANGenerator

print(tf.__version__) # check if >= 2.4.0

# initialize melgan model
with open('examples/melgan/conf/melgan.v1.yaml') as f:
    melgan_config = yaml.load(f, Loader=yaml.Loader)
melgan_config = MelGANGeneratorConfig(**melgan_config["melgan_generator_params"])
melgan = TFMelGANGenerator(config=melgan_config, name='melgan_generator')
melgan._build()
melgan.load_weights("examples/melgan/checkpoints/melgan-1M6.h5")

# initialize FastSpeech model.
with open('examples/fastspeech2/conf/fastspeech2.v1.yaml') as f:
    config = yaml.load(f, Loader=yaml.Loader)
config = FastSpeech2Config(**config["fastspeech2_params"])
fastspeech = TFFastSpeech2(config=config, name="fastspeech2",
                          enable_tflite_convertible=True)
fastspeech._build()
fastspeech.load_weights("examples/fastspeech2/checkpoints/fastspeech2-generator-1500000.h5")

starttime = time.time()
input_text = "Recent research at Harvard has shown meditating\
for as little as 8 weeks, can actually increase the grey matter in the \
parts of the brain responsible for emotional regulation, and learning."

processor = AutoProcessor.from_pretrained(pretrained_path="ljspeech_mapper.json")
input_ids = processor.text_to_sequence(input_text.lower())
input_ids = np.concatenate([input_ids, [len(valid_symbols) - 1]], -1)  # eos.

mel_before, mel_after, duration_outputs, _, _ = fastspeech.inference(
    input_ids=tf.expand_dims(tf.convert_to_tensor(input_ids, dtype=tf.int32), 0),
    speaker_ids=tf.convert_to_tensor([0], dtype=tf.int32),
    speed_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
    f0_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
    energy_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
)

audio_before = melgan(mel_before)[0, :, 0]
audio_after = melgan(mel_after)[0, :, 0]

# save to file
sf.write('./audio_before-tf.wav', audio_before, 22050, "PCM_16")
sf.write('./audio_after-tf.wav', audio_after, 22050, "PCM_16")
endtime = time.time()
print(endtime - starttime)

# Concrete Function
fastspeech_concrete_function = fastspeech.inference_tflite.get_concrete_function()

converter = tf.lite.TFLiteConverter.from_concrete_functions(
    [fastspeech_concrete_function]
)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS,
                                       tf.lite.OpsSet.SELECT_TF_OPS]
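# SELECT_TF_OPS lets the converter fall back to full TF ops for anything in
# FastSpeech2 that has no TFLite builtin (as far as I understand, that is why
# the example enables it).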
tflite_model = converter.convert()

# Save the TF Lite model.
with open('fastspeech_quant.tflite', 'wb') as f:
  f.write(tflite_model)

print('Model size is %f MBs.' % (len(tflite_model) / 1024 / 1024.0) )

# Load the TFLite model and allocate tensors.
interpreter = tf.lite.Interpreter(model_path='fastspeech_quant.tflite')

# Get input and output tensors.
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Prepare input data.
def prepare_input(input_ids):
  input_ids = tf.expand_dims(tf.convert_to_tensor(input_ids, dtype=tf.int32), 0)
  return (input_ids,
          tf.convert_to_tensor([0], tf.int32),
          tf.convert_to_tensor([1.0], dtype=tf.float32),
          tf.convert_to_tensor([1.0], dtype=tf.float32),
          tf.convert_to_tensor([1.0], dtype=tf.float32))

# Test the model on random input data.
def infer(input_text):
  processor = AutoProcessor.from_pretrained(pretrained_path="ljspeech_mapper.json")
  input_ids = processor.text_to_sequence(input_text.lower())
  interpreter.resize_tensor_input(input_details[0]['index'], 
                                  [1, len(input_ids)])
  interpreter.resize_tensor_input(input_details[1]['index'], 
                                  [1])
  interpreter.resize_tensor_input(input_details[2]['index'], 
                                  [1])
  interpreter.resize_tensor_input(input_details[3]['index'], 
                                  [1])
  interpreter.resize_tensor_input(input_details[4]['index'], 
                                  [1])
  interpreter.allocate_tensors()
  input_data = prepare_input(input_ids)
  for i, detail in enumerate(input_details):
    input_shape = detail['shape_signature']
    interpreter.set_tensor(detail['index'], input_data[i])

  interpreter.invoke()

  # The function `get_tensor()` returns a copy of the tensor data.
  # Use `tensor()` in order to get a pointer to the tensor.
  return (interpreter.get_tensor(output_details[0]['index']),
          interpreter.get_tensor(output_details[1]['index']))

starttime = time.time()
input_text = "Recent research at Harvard has shown meditating\
for as little as 8 weeks, can actually increase the grey matter in the \
parts of the brain responsible for emotional regulation, and learning."

decoder_output_tflite, mel_output_tflite = infer(input_text)
#audio_before_tflite = melgan(decoder_output_tflite)[0, :, 0]
audio_after_tflite = melgan(mel_output_tflite)[0, :, 0]

# save to file
#sf.write('./audio_before-tfl1.wav', audio_before_tflite, 22050, "PCM_16")
sf.write('./audio_after-tfl1.wav', audio_after_tflite, 22050, "PCM_16")
endtime = time.time()
print(endtime - starttime)
starttime = time.time()
input_text = "I love TensorFlow Lite converted FastSpeech with quantization. \
The converted model file is of 28.6 Mega bytes."

decoder_output_tflite, mel_output_tflite = infer(input_text)
#audio_before_tflite = melgan(decoder_output_tflite)[0, :, 0]
audio_after_tflite = melgan(mel_output_tflite)[0, :, 0]
#sf.write('./audio_before-tfl2.wav', audio_before_tflite, 22050, "PCM_16")
sf.write('./audio_after-tfl2.wav', audio_after_tflite, 22050, "PCM_16")
endtime = time.time()
print(endtime - starttime)

https://drive.google.com/file/d/10OUcfFzotpUz8160-z_LUy2CRleVzS_W/view?usp=sharing, https://drive.google.com/file/d/15fj6D11i0EEow25UDR28M0nwcCNphJ24/view?usp=sharing, https://drive.google.com/file/d/1OF4JEtyjRHjxMLjVIPmXZC4fuj4vdpBU/view?usp=sharing, https://drive.google.com/file/d/1_zKay8rjyemVPqlStDicgOiyQysb-62T/view?usp=sharing

@koan not bad for a Pi 4.
They are updating some code to break & chunk sentences. (The first part of the script can be commented out after you have converted the TFLite model.)
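In the meantime, a rough sketch of chunking sentences yourself; just a regex split, re-using infer() and melgan from the script above, with the short silence gap between chunks being my own guess:

# Rough sketch: split the input into sentences, synthesise each chunk with the
# infer() function from the script above, and join the audio with short silences.
import re
import numpy as np
import soundfile as sf

def synthesise_chunked(text, gap_s=0.15, sample_rate=22050):
    chunks = [c.strip() for c in re.split(r'(?<=[.!?])\s+', text) if c.strip()]
    gap = np.zeros(int(gap_s * sample_rate), dtype=np.float32)
    pieces = []
    for chunk in chunks:
        _, mel_output = infer(chunk)                          # TFLite FastSpeech2 from above
        pieces.append(melgan(mel_output)[0, :, 0].numpy())    # MelGAN vocoder from above
        pieces.append(gap)
    return np.concatenate(pieces[:-1])

sf.write("./audio_chunked.wav", synthesise_chunked(input_text), 22050, "PCM_16")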


Those are some nice results!

Yeah, apparently if you retrain with n_mels=40 rather than 80, that will reduce the load with very little loss of quality.
I don’t fancy that training run on a 1050 Ti though, so I will take their word for it.
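For reference, a minimal sketch of where that retrain would start, assuming the mel-band count is the num_mels key in the TensorFlowTTS config (the key name is an assumption, and the MelGAN vocoder config would need the matching change and retrain, since both have to agree on the number of mel bands):

# Hedged sketch: write a copy of the FastSpeech2 config with 40 mel bands instead of 80.
import yaml

with open("examples/fastspeech2/conf/fastspeech2.v1.yaml") as f:
    cfg = yaml.load(f, Loader=yaml.Loader)

cfg["fastspeech2_params"]["num_mels"] = 40  # down from 80; vocoder config must match

with open("examples/fastspeech2/conf/fastspeech2.40mel.yaml", "w") as f:
    yaml.dump(cfg, f)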