dbaelz/TalkToMe

Talk To Me

Talk To Me is a Spring Boot application that converts text into an audio file. It uses the ONNX Runtime with a text-to-speech (TTS) model and exposes a very simple REST API.

The project is a proof of concept in early development. Its main goal is to try out the ONNX Runtime and explore how to implement support for TTS engines. It is not performance-optimized; the focus is on getting a working prototype up and running for experiments.

Features

  • Text-to-speech conversion using ONNX Runtime with GPU (CUDA) and CPU support. GPU is only supported on Windows x64 and Linux x64. See documentation.
  • Base structure to support multiple TTS engines. Currently, Pocket TTS and Chatterbox are implemented.
  • Most application settings are configurable. See application.properties for details.
  • Generated audio files are saved to local storage (default path: storage/).

Getting started

Supported TTS engines: Pocket TTS and Chatterbox.

Required files

Pocket TTS:

  • ONNX files: the INT8 or FP32 versions of flow_lm_flow and flow_lm_main, plus mimi_decoder.onnx, mimi_encoder.onnx and text_conditioner.onnx
  • Tokenizer files: the tokenizer.model file for the SentencePiece tokenizer
  • WAV file for voice cloning: a mono, 16-bit WAV file with a sample rate of 24000 Hz and a duration of around 10 seconds. In my tests, the reference_sample.wav (16000 Hz, mono) provided in the HuggingFace repository wasn't ideal, so I recommend using a custom file that matches these specifications.
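To check a custom voice sample against the specification above before using it, a small standalone Java sketch like the following may help. It is not part of the project: the class name `WavFormatCheck` and its `problems` helper are hypothetical, and it relies only on the standard `javax.sound.sampled` API. The duration is printed rather than validated, since "around 10 seconds" is not a hard limit.

```java
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioInputStream;
import javax.sound.sampled.AudioSystem;
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of TalkToMe: compares a WAV file's format
// with the recommended 24000 Hz, mono, 16-bit voice sample specification.
class WavFormatCheck {

    // Returns one message per mismatch; an empty list means the format matches.
    static List<String> problems(AudioFormat format) {
        List<String> problems = new ArrayList<>();
        if (Math.round(format.getSampleRate()) != 24000) {
            problems.add("sample rate is " + format.getSampleRate() + " Hz, expected 24000 Hz");
        }
        if (format.getChannels() != 1) {
            problems.add("channel count is " + format.getChannels() + ", expected 1 (mono)");
        }
        if (format.getSampleSizeInBits() != 16) {
            problems.add("bit depth is " + format.getSampleSizeInBits() + ", expected 16");
        }
        return problems;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("Usage: java WavFormatCheck.java <file.wav>");
            return;
        }
        try (AudioInputStream in = AudioSystem.getAudioInputStream(new File(args[0]))) {
            AudioFormat format = in.getFormat();
            // Duration = number of frames / frames per second.
            double seconds = in.getFrameLength() / (double) format.getFrameRate();
            System.out.printf("Duration: %.1f s%n", seconds);
            List<String> problems = problems(format);
            if (problems.isEmpty()) {
                System.out.println("Format matches the recommended specification.");
            } else {
                problems.forEach(p -> System.out.println("Mismatch: " + p));
            }
        }
    }
}
```

With a recent JDK, the file can be run directly, for example `java WavFormatCheck.java voice.wav`.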

Chatterbox:

  • ONNX files: speech_encoder.onnx, speech_encoder.onnx_data, embed_tokens.onnx, embed_tokens.onnx_data, language_model.onnx, language_model.onnx_data, conditional_decoder.onnx, conditional_decoder.onnx_data
  • Tokenizer files: tokenizer.json (required) and tokenizer_config.json (recommended)
  • WAV file for voice cloning (voice.wav by default)
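Because Chatterbox needs many files, it can be useful to verify that none are missing before starting the application. The sketch below is a hypothetical standalone helper, not project code. It assumes that all required files sit in a single directory and uses `models` as a placeholder directory name; the actual location and filenames are whatever is configured in application.properties. The optional tokenizer_config.json is deliberately left out of the required list.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of TalkToMe: reports which of the files
// listed above are missing from a model directory.
class ModelFileCheck {

    // Required Chatterbox files as listed in this README (voice.wav is the default name).
    static final List<String> CHATTERBOX_FILES = List.of(
            "speech_encoder.onnx", "speech_encoder.onnx_data",
            "embed_tokens.onnx", "embed_tokens.onnx_data",
            "language_model.onnx", "language_model.onnx_data",
            "conditional_decoder.onnx", "conditional_decoder.onnx_data",
            "tokenizer.json", "voice.wav");

    // Returns the names of all required files that do not exist in modelDir.
    static List<String> missingFiles(Path modelDir, List<String> required) {
        List<String> missing = new ArrayList<>();
        for (String name : required) {
            if (!Files.isRegularFile(modelDir.resolve(name))) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // "models" is an assumed placeholder; pass the real directory as an argument.
        Path modelDir = Path.of(args.length > 0 ? args[0] : "models");
        List<String> missing = missingFiles(modelDir, CHATTERBOX_FILES);
        if (missing.isEmpty()) {
            System.out.println("All required Chatterbox files found in " + modelDir);
        } else {
            System.out.println("Missing in " + modelDir + ": " + String.join(", ", missing));
        }
    }
}
```

The same `missingFiles` method works for Pocket TTS when given a list of its filenames instead.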

Configuration and running

  • Settings such as address/port, basic auth, model path, model filenames and GPU support can be configured in application.properties.
  • Execute ./gradlew bootRun to start the application. The API will be available at http://127.0.0.1:8080/api/tts
  • The included Bruno collection contains example API calls. See the Bruno documentation for installation instructions.

License

The project is licensed under the Apache License 2.0.
