Text Generation to Audio

Due to hardware limitations on my development machine, I’m unable to load both Qwen and xtts_v2 at the same time.

GPU: NVIDIA GeForce RTX 3060 Ti Lite Hash Rate

So I use a couple of small Python scripts, which are then wrapped with a bash shell script.

#!/home/nessy/Git/qwen/venv/bin/python

import sys

from transformers import pipeline

def main():
    """
    """

    question = sys.argv[1]
    max_tokens = 100

    pipe = pipeline(
        "text-generation",
        model="Qwen/Qwen2.5-1.5B-Instruct",
        device="cuda"
    )

    messages = [
        {"role": "system", "content": "you are a helpful assistant"},
        {"role": "user", "content": question}
    ]

    res = pipe(messages, max_new_tokens=max_tokens)
    output = res[0]['generated_text'][-1]['content']

    print(output)


if __name__ == '__main__':
    main()

My Qwen script merely takes a question as a command-line argument and prints the generation back to stdout. This makes it easy to wrap with a bash shell script later on.

$ ./main.py "a one sentence introduction to a fantasy world"

The results look something like this:

In the realm of Eldoria, where magic flows through every fiber of existence and mythical creatures roam freely in harmony with nature, the tale begins.
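The chat-style pipeline returns the full conversation, with the assistant's reply appended as the last message — which is what the res[0]['generated_text'][-1]['content'] indexing in main.py unpacks. A mocked result (no model loaded) makes the shape clear:

```python
# mocked return value of a chat-style "text-generation" pipeline:
# a list with one dict whose "generated_text" holds the whole
# message list, ending with the assistant's reply
res = [{
    "generated_text": [
        {"role": "system", "content": "you are a helpful assistant"},
        {"role": "user", "content": "a one sentence introduction to a fantasy world"},
        {"role": "assistant", "content": "In the realm of Eldoria, the tale begins."},
    ]
}]

# same indexing as main.py: first result, last message, its content
output = res[0]["generated_text"][-1]["content"]
print(output)  # prints: In the realm of Eldoria, the tale begins.
```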

Next up is the text-to-speech Python script. This one is also short and easy to comprehend.

You can find pre-recorded voices on the open speech repository website, though later in this article I describe how to record your own voice.

#!/home/nessy/VirtualEnvs/tts/bin/python

from TTS.api import TTS

voice = 'voices/OSR_us_000_0010_8k.wav'

with open('content.txt') as f:
    content = f.read()

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2", gpu=True)

# generate speech by cloning a voice using default settings
tts.tts_to_file(text=content,
                file_path="outputs/output.wav",
                speaker_wav=voice,
                language="en")

This script reads the generation stored in content.txt and writes an audio file, spoken in our sample voice, to outputs/output.wav.
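A quick way to sanity-check the generated file without listening to it is to read its duration with Python's standard wave module (a small helper of my own, not part of the original scripts):

```python
import wave

def wav_duration(path):
    """Return the duration of a wav file in seconds."""
    with wave.open(path, 'rb') as w:
        return w.getnframes() / w.getframerate()
```

For example, wav_duration("outputs/output.wav") should return a few seconds for a one-sentence generation; a value near zero usually means something went wrong upstream.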

As mentioned, I have a little bash wrapper script, though you can run the two Python scripts separately since they have no dependency on each other.

#!/bin/bash

# join every argument into a single prompt string
question="$*"

echo "Generating"
~/Git/qwen/main.py "$question" > content.txt
cat content.txt

echo "Creating Text to speech"
./process.py

echo "Playing"
mpv outputs/output.wav

As seen above, the output of the Qwen generation is redirected to content.txt, then the audio generation script is run, and finally I play the result back with my media player, mpv, to hear the audio.
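The wrapper collects every shell argument into a single prompt string. The idiomatic way to do that in bash is "$*", which joins all positional parameters with the first character of IFS (a space by default); a quick demonstration:

```shell
#!/bin/bash

# "$*" joins every positional parameter into one word,
# separated by the first character of IFS (a space by default)
join_args() {
  question="$*"
  printf '%s\n' "$question"
}

join_args a one sentence introduction  # prints: a one sentence introduction
```

This means the wrapper can be called either with the prompt quoted as one argument or as loose words, and the same string reaches main.py either way.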

Now, to make this even more personalized, I pre-recorded myself speaking the following:

The quick brown fox jumps over the lazy dog.

She sells seashells by the seashore.

A gray mare walked before the colt.

With this saved as a .wav in my voices/ directory, I updated process.py to use it:

#!/home/nessy/VirtualEnvs/tts/bin/python

from TTS.api import TTS

voice = 'voices/nessy.wav'

with open('content.txt') as f:
    content = f.read()

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2", gpu=True)

# generate speech by cloning a voice using default settings
tts.tts_to_file(text=content,
                file_path="outputs/output.wav",
                speaker_wav=voice,
                language="en")