David Mezzetti for NeuML

Posted on Sep 27 • Originally published at neuml.hashnode.dev

Speech to Speech RAG

#ai #llm #rag #vectordatabase

txtai is an all-in-one embeddings database for semantic search, LLM orchestration and language model workflows.

There are many articles, notebooks and examples covering how to perform vector search and/or retrieval augmented generation (RAG) with txtai. A lesser-known feature of txtai is its built-in workflow component.

Workflows are a simple yet powerful construct that takes a callable and returns elements. Workflows enable efficient processing of pipeline data. Workflows are streaming by nature and work on data in batches. This allows large volumes of data to be processed efficiently.
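
To make the batching and streaming idea concrete, here is a minimal pure-Python sketch. It is not txtai's implementation: the `batches` and `run` helpers, the `strip` and `upper` tasks, and the batch size are all hypothetical names chosen for illustration. It only shows the general shape, where each task is a callable that takes a list of elements and returns a list, and results are yielded batch by batch rather than all at once.

```python
from itertools import islice


def batches(elements, size):
    """Yield lists of up to `size` items from any iterable."""
    iterator = iter(elements)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def run(tasks, elements, size=2):
    """Hypothetical runner: pass each batch through every task, then stream the results."""
    for batch in batches(elements, size):
        for task in tasks:
            # Each task is a callable that takes a list and returns a list
            batch = task(batch)
        yield from batch


def strip(texts):
    return [text.strip() for text in texts]


def upper(texts):
    return [text.upper() for text in texts]


# Results are produced lazily, one batch at a time
results = list(run([strip, upper], [" a ", "b ", " c"]))
print(results)
```

Because `run` is a generator, nothing is processed until the caller iterates over it, which is why the workflow calls later in this article are wrapped in `list(...)`.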

This article will demonstrate how to build a Speech to Speech (S2S) workflow with txtai.

Note: This process is intended to run on local machines due to its use of input and output audio devices.

Install dependencies

Install txtai and all dependencies.

pip install "txtai[pipeline-audio]" autoawq

Define the S2S RAG Workflow

The next section defines the Speech to Speech (S2S) RAG workflow. The objective of this workflow is to respond to a user request in near real-time.

txtai supports workflow definitions in Python and with YAML. We'll cover both methods.

The S2S workflow below starts with a microphone pipeline, which streams and processes input audio. The microphone pipeline has voice activity detection (VAD) built-in. When speech is detected, the pipeline returns the captured audio data. Next, the speech is transcribed to text and then passed to a RAG pipeline prompt. Finally, the RAG result is run through a text to speech (TTS) pipeline and streamed to an output audio device.

import logging

from txtai import Embeddings, RAG
from txtai.pipeline import AudioStream, Microphone, TextToSpeech, Transcription
from txtai.workflow import Workflow, StreamTask, Task

# Enable DEBUG logging
logging.basicConfig()
logging.getLogger().setLevel(logging.DEBUG)

# Microphone
microphone = Microphone()

# Transcription
transcribe = Transcription("distil-whisper/distil-large-v3")

# Embeddings database
embeddings = Embeddings()
embeddings.load(provider="huggingface-hub", container="neuml/txtai-wikipedia")

# Define prompt template
template = """
Answer the following question using only the context below. Only include information
specifically discussed. Answer the question without explaining how you found the answer.

question: {question}
context: {context}"""

# Create RAG pipeline
rag = RAG(
    embeddings,
    "hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4",
    system="You are a friendly assistant. You answer questions from users.",
    template=template,
    context=10
)

# Text to speech
tts = TextToSpeech("neuml/vctk-vits-onnx")

# Audio stream
audiostream = AudioStream()

# Define speech to speech workflow
workflow = Workflow(tasks=[
    Task(action=microphone),
    Task(action=transcribe, unpack=False),
    StreamTask(action=lambda x: rag(x, maxlength=4096, stream=True), batch=True),
    StreamTask(action=lambda x: tts(x, stream=True, speaker=15), batch=True),
    StreamTask(action=audiostream, batch=True)
])

while True:
    print("Waiting for input...")
    list(workflow([None]))
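
The prompt template in the listing above has two placeholders, `{question}` and `{context}`. As a rough illustration of what the LLM ends up seeing, the sketch below fills them with plain `str.format`. The `build_prompt` helper and the sample passages are hypothetical, and the way the retrieved passages are joined into a single context string is an assumption here, not a description of txtai's internals.

```python
# Same template as in the workflow above
template = """
Answer the following question using only the context below. Only include information
specifically discussed. Answer the question without explaining how you found the answer.

question: {question}
context: {context}"""


def build_prompt(question, passages):
    """Hypothetical helper: fill the template with a question and retrieved passages."""
    # Assumption: passages are joined with single spaces into one context string
    context = " ".join(passages)
    return template.format(question=question, context=context)


# Made-up passages standing in for the top search results
passages = [
    "The Roman Empire was the post-Republican state of ancient Rome.",
    "It was ruled by emperors following Octavian's rise to power.",
]

prompt = build_prompt("Tell me about the Roman Empire", passages)
print(prompt)
```

In the workflow, the `context=10` setting means up to ten search results are available to fill the context slot, so the real prompt is considerably longer than this two-passage example.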

Given that the inputs and outputs are audio, you'll have to use your imagination if you're reading this as an article.

Check out this video to see the workflow in action! The following examples are run:

Tell me about the Roman Empire
Explain how faster than light travel could work
Write a short poem about the Vikings
Tell me about the Roman Empire in French

S2S Workflow in YAML

A crucial feature of txtai workflows is that they can be defined with YAML. This enables building workflows in a low-code and/or no-code setting. These YAML workflows can then be "dockerized" and run.

Let's define the same workflow below.

# Microphone
microphone:

# Transcription
transcription:
  path: distil-whisper/distil-large-v3

# Embeddings database
cloud:
  provider: huggingface-hub
  container: neuml/txtai-wikipedia

embeddings:

# RAG
rag:
  path: "hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4"
  system: You are a friendly assistant. You answer questions from users.
  template: |
    Answer the following question using only the context below. Only include information
    specifically discussed. Answer the question without explaining how you found the answer.

    question: {question}
    context: {context}
  context: 10

# TTS
texttospeech:
  path: neuml/vctk-vits-onnx

# AudioStream
audiostream:

# Speech to Speech Chat workflow
workflow:
  s2s:
    tasks:
      - microphone
      - action: transcription
        unpack: False
      - task: stream
        action: rag
        args:
          maxlength: 4096
          stream: True
        batch: True
      - task: stream
        action: texttospeech
        args:
          stream: True
          speaker: 15
        batch: True
      - task: stream
        action: audiostream
        batch: True

Save the configuration above as s2s.yml, then load and run it with the code below.

from txtai import Application

app = Application("s2s.yml")
while True:
    print("Waiting for input...")
    list(app.workflow("s2s", [None]))

Once again, the same idea, just a different way to do it. In the video demo, the following queries were asked.

As a Patriots fan, who would you guess my favorite quarterback of all time is?
I'm tall and run fast, what do you think the best soccer position for me is?
I run slow, what do you think the best soccer position for me is?

With YAML workflows, it's possible to fully define the process outside of code such as with a web interface. Perhaps someday we'll see this with txtai.cloud 😀

Wrapping up

This article demonstrated how to build a Speech to Speech (S2S) workflow with txtai. While the workflow uses an off-the-shelf embeddings database, a custom embeddings database can easily be swapped in. From there, we have S2S with our own data!

Top comments (2)

Luis Vilca • Sep 28

Really amazing job! A local and free alternative to gpt assistant. It is really good and it seems that it runs in ~2s on average. I got some questions!
Is it possible to inject a different AI for the responses? (use chatgpt for example?)
Or a different speech to text engine (perhaps an api?)
Do you think there is a way to make this faster? (use a smaller/more optimized model)

David Mezzetti • Sep 28

Thank you for the kind words.

Each component can easily be switched out.

workflow = Workflow(tasks=[
    Task(action=microphone),
    Task(action=transcribe, unpack=False),
    StreamTask(action=lambda x: rag(x, maxlength=4096, stream=True), batch=True),
    StreamTask(action=lambda x: tts(x, stream=True, speaker=15), batch=True),
    StreamTask(action=audiostream, batch=True)
])

The action in each of the steps above just needs to be a callable or function that takes one argument, the list of elements. txtai's LLM pipeline can use many different LLMs, both local and remote. For example, the LLM can be an OpenAI model instead of a local one, or a smaller local model such as Llama 3.2 1B/3B.
