Amnish Singh Arora


OpenAI has Text to Speech Support now!

This week, as I continued diving deeper into ChatCraft, the developer-oriented ChatGPT, I found a few opportunities to meaningfully contribute to the project.

In this post, I'll be sharing those contributions, with the major one focusing on OpenAI's recently released Text-to-Speech API. I'll be referring to it as TTS from now on, so bear with me.

Table of Contents

 1. The Requirement 📋
 2. Implementation 🐱‍👤
       2.1. Experimenting with the SDK 🛠️
       2.2. Integration with App 🔗
             a. TTS Toggle Button
             b. Audio Queuing
             c. Avoiding duplicate announcements
             d. Buffering LLM Responses
             e. Optimizing Buffering Algorithm
 3. More Work 🫡
 4. Upcoming

The Requirement 📋

Earlier this week, I received a GitHub Notification from ChatCraft regarding a new issue that was filed by Taras - the project owner.

TTS Issue

For a long time, I had been looking for something exciting to work on, and this was it. Since ChatCraft already supported Speech-to-Text transcription using Whisper, another one of OpenAI's models with unique capabilities, integrating Text-to-Speech would essentially turn our application into something like an Amazon Alexa, but with a brain powered by the same LLM that ChatGPT uses.

And the fact that this feature was released not so long ago made this challenge even more exciting.

TTS at 42s ⏲️

Implementation 🐱‍👤

Without wasting any time, I started exploring the official documentation, where I found some samples for getting started with SDKs in various languages,

Getting started with TTS OpenAI

different models for audio quality,

Audio Quality Options

and, most exciting for me, different configurable voices with a preview for each.

different configurable voices

I also found the ability to stream real-time audio, but couldn't get it working in Node, a problem echoed in many discussions online.

Stream TTS

That's why I crafted my own buffering algorithm for better performance, which I'll discuss later in the post.

Experimenting with the SDK 🛠️

After going through the documentation, it was time to play around and actually get something working before signing up for the task.

And as always, it's NEVER a smooth ride 🥹

I got weird compile errors suggesting that OpenAI did not support any such feature.

TTS not supported

After banging my head against the wall for a few minutes, I found that the version of the openai package we were using did not support it.

Thanks to this guy

Helper

And so, I impulsively upgraded to the latest version of openai (I guess not anymore), without any fear of getting cut by the cutting edge 😝, and got it working for some random text:



export const textToSpeech = async (message: string) => {
  const { apiKey, apiUrl } = getSettings();
  if (!apiKey) {
    throw new Error("Missing API Key");
  }
  const { openai } = createClient(apiKey, apiUrl);

  const mp3 = await openai.audio.speech.create({
    model: "tts-1",
    voice: "onyx",
    input: message,
  });

  const blob = new Blob([await mp3.arrayBuffer()], { type: "audio/mpeg" });
  const objectUrl = URL.createObjectURL(blob);

  // Testing for now
  const audio = new Audio(objectUrl);
  audio.play();
};



and gathered enough confidence to sign up for the issue.

Integration with App 🔗

Getting it working for a random test was fairly easy, but the real deal would be integrating it into a complex application like ChatCraft.

This would mean implementing necessary UI and functionality.

I started thinking of a way to announce the response from the LLM as it was being generated, along with a button that would allow users to enable/disable this behaviour.

TTS Toggle Button

To begin with, I added the toggle control in the prompt send button component.



{isTtsSupported() && (
  <Tooltip label={settings.announceMessages ? "TTS Enabled" : "TTS Disabled"}>
    <IconButton
      type="button"
      size="lg"
      variant="solid"
      aria-label={settings.announceMessages ? "TTS Enabled" : "TTS Disabled"}
      icon={settings.announceMessages ? <AiFillSound /> : <AiOutlineSound />}
      onClick={() =>
        setSettings({ ...settings, announceMessages: !settings.announceMessages })
      }
    />
  </Tooltip>
)}



isTtsSupported simply checks if we're using OpenAI as the provider.



// TTS is currently only supported when using the official OpenAI provider
export function isTtsSupported() {
  return usingOfficialOpenAI();
}



However, this will need to change, as other providers like OpenRouter could also start supporting this feature in the future.
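If and when that happens, the check could be broadened from "are we using the official OpenAI provider" to "is this a provider known to support TTS". Here's a rough sketch of that direction, where the TTS_CAPABLE_PROVIDERS list is hypothetical and not part of the current codebase:

// Hypothetical list of API base URLs for providers known to support TTS
const TTS_CAPABLE_PROVIDERS = ["https://api.openai.com/v1"];

export function isTtsSupported() {
  // Compare the configured apiUrl against providers known to offer TTS,
  // instead of hardcoding the official OpenAI check
  return TTS_CAPABLE_PROVIDERS.includes(getSettings().apiUrl);
}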

To persist the user preference, I added an announceMessages option to our settings model



export type Settings = {
  apiKey?: string;
  model: ChatCraftModel;
  apiUrl: string;
  temperature: number;
  enterBehaviour: EnterBehaviour;
  countTokens: boolean;
  sidebarVisible: boolean;
  alwaysSendFunctionResult: boolean;
  customSystemPrompt?: string;
  announceMessages?: boolean;
};



which I would later leverage to determine if responses need to be announced or not!
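The idea was a guard roughly like this in the streaming code (a simplified preview of the check that appears in the snippets later in this post):

if (isTtsSupported() && getSettings().announceMessages) {
  // Generate speech for the newly streamed text and queue it for playback
}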

Audio Queuing

After that, I had to find the code 🔍 that was handling response streaming, which I eventually found in this file.

use-chat-openai.ts

Stream Handling

So I left a comment there, to continue after a short tea break

TeaBreak

Okay, I am back!!!

Now it was time to work on the actual logic.

Looking at the entire problem at once was too intimidating, which meant there was a need to break it into manageable pieces.

The first thing was to make sure that any audio clips I generated were played in order, and the best thing to use for such purposes is the good old queue data structure. I used ChatCraft to help me get started, and it gave me some code for what I wanted to do. That gave me an idea of how I could do it, but I was quite sure that audio operations and queue management belonged in their own separate file.
So I asked ChatCraft to generate a custom hook for me, essentially abstracting away all the implementation logic.

I called it useAudioPlayer.



import { useState, useEffect } from "react";

const useAudioPlayer = () => {
  const [queue, setQueue] = useState<Promise<string>[]>([]);
  const [isPlaying, setIsPlaying] = useState<boolean>(false);

  useEffect(() => {
    if (!isPlaying && queue.length > 0) {
      playAudio(queue[0]);
    }
  }, [queue, isPlaying]);

  const playAudio = async (audioClipUri: Promise<string>) => {
    setIsPlaying(true);
    const audio = new Audio(await audioClipUri);
    audio.onended = () => {
      setQueue((oldQueue) => oldQueue.slice(1));
      setIsPlaying(false);
    };
    audio.play();
  };

  const addToAudioQueue = (audioClipUri: Promise<string>) => {
    setQueue((oldQueue) => [...oldQueue, audioClipUri]);
  };

  return { addToAudioQueue };
};

export default useAudioPlayer;


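For context, a consuming component uses the hook roughly like this (a minimal usage sketch, not the exact ChatCraft wiring):

const { addToAudioQueue } = useAudioPlayer();

// Note that textToSpeech is NOT awaited here; the pending Promise is queued
// immediately, so playback order matches the order of the text
addToAudioQueue(textToSpeech("Hello from ChatCraft!"));
addToAudioQueue(textToSpeech("This clip only starts once the first one ends."));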

You'll notice that it's managing Promises returned by the textToSpeech function that you might remember from before.



/**
 *
 * @param message The text for which speech needs to be generated
 * @returns The URL of generated audio clip
 */
export const textToSpeech = async (message: string): Promise<string> => {
  const { apiKey, apiUrl } = getSettings();
  if (!apiKey) {
    throw new Error("Missing API Key");
  }
  const { openai } = createClient(apiKey, apiUrl);

  const mp3 = await openai.audio.speech.create({
    model: "tts-1",
    voice: "onyx",
    input: message,
  });

  const blob = new Blob([await mp3.arrayBuffer()], { type: "audio/mpeg" });
  const objectUrl = URL.createObjectURL(blob);

  return objectUrl;
};



Previously, I was awaiting this url here before pushing it into the queue

Demo

This defeated the whole purpose of queuing, as the order of audio URLs depended on which one finished awaiting first.

To get around this, I decided to pass in Promise<string>, i.e. raw promises like in the screenshot above, and await them when playAudio was called.
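Roughly, the difference between the two approaches looks like this (a simplified sketch of the relevant lines, not the exact ChatCraft diff):

// Before: awaiting here meant clips were queued in whatever order
// their network requests happened to resolve
addToAudioQueue(await textToSpeech(newWords));

// After: queue the raw Promise<string> immediately, so queue order matches
// text order, and await it later inside playAudio
addToAudioQueue(textToSpeech(newWords));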

Explanation

To summarize, any audio URL pushed into the queue triggers a side effect that checks if an audio clip is already playing. If not, it converts the URL into an Audio element and starts playing it. When any audio clip stops playing, the isPlaying state is set to false, triggering that side effect again, which plays the next audio clip in the queue, and so on...

Avoiding duplicate announcements

Okay, now I was confident that my audio clips would play in the order that I push them into the queue.

But I forgot to account for the fact that whenever the onData function was called, the entire currentText was passed to the TTS method

problem

leading to speech like so

"I"
"I am"
"I am ChatCraft"
and so on...

That sure was in order, but you get the idea of what's wrong.

To fix this, ChatCraft suggested keeping track of the last processed word and only generating audio for the newWords



let lastIndex = 0;

const chat = chatWithLLM(messages, {
  model,
  functions,
  functionToCall,
  onPause() {
    setPaused(true);
  },
  onResume() {
    setPaused(false);
  },
  async onData({ currentText }) {
    if (!pausedRef.current) {
      // TODO: Hook tts code here
      const newWords = currentText.split(" ").slice(lastIndex).join(" ");
      lastIndex = currentText.split(" ").length;

      if (newWords.length > 0) {
        const audioClipUri = textToSpeech(newWords);
        addToAudioQueue(audioClipUri);
      }

      setStreamingMessage(



And as you might guess, no more repeated words.

Buffering LLM Responses

Now there were no repeated words, and they played in order. But the problem was that the LLM response stream always had only one new word at a time. This meant every audio clip consisted of just one word, and there were as many calls to the TTS API as there were words in the response.

Such an extremely large number of requests in such a short amount of time is completely unnecessary and leads to this:

Rate Limiting

Even if there was no rate limiting, the speech sounded weird, since every audio clip takes time to load and play; imagine how that sounds when each clip is only one word long.

It sounded like:
"I ... am ... ChatCraft"

In order to fix that, I came up with the idea of buffering the LLM response up to a certain maximum number of words before calling the TTS API.

https://stackoverflow.com/questions/648309/what-does-it-mean-by-buffer

Here's the logic:



let lastTTSIndex = 0; // To calculate new words in the AI generated text stream

// Buffer the response stream before calling tts function
// This reduces latency and number of TTS api calls
const TTS_BUFFER_THRESHOLD = 50;
const ttsWordsBuffer: string[] = [];

const chat = chatWithLLM(messages, {
  model,
  functions,
  functionToCall,
  onPause() {
    setPaused(true);
  },
  onResume() {
    setPaused(false);
  },
  async onData({ currentText }) {
    if (!pausedRef.current) {
      // Hook tts code here
      const newWords = currentText.split(" ").slice(lastTTSIndex);
      const newWordsCount = currentText.split(" ").length;
      lastTTSIndex = newWordsCount;

      ttsWordsBuffer.push(...newWords);

      if (
        isTtsSupported() &&
        getSettings().announceMessages &&
        ttsWordsBuffer.length >= TTS_BUFFER_THRESHOLD
      ) {
        const audioClipUri = textToSpeech(ttsWordsBuffer.join(" "));
        addToAudioQueue(audioClipUri);

        // Clear the buffer
        ttsWordsBuffer.splice(0);
      }
      ...
      ...



The following commit has all the changes that went in
https://github.com/tarasglek/chatcraft.org/pull/357/commits/1f828ae5cfbe6ff2a07a647eada96b14023bde4f

And Voila! It was finally working as I expected.

So I opened a Pull Request

Pull Request

There have been many conversations since I opened the Pull Request, and there are many more things I have to work on in the future.

Optimizing Buffering Algorithm

The solution that I mentioned above was working FINE, but the time it took for speech to start was too long, since it took a while for at least 50 words to accumulate in the buffer.

The solution was sentence-based buffering. Instead of waiting for a certain number of words, I could start the speech as soon as there was one full sentence available in the buffer.

Here's the logic I came up with this time:

Optimized Algorithm

You can check the entire code in this commit. It took hours to make it work 👀



// Set a maximum number of words in a sentence that we are willing to wait for.
// This reduces latency and the number of TTS API calls
const TTS_BUFFER_THRESHOLD = 25;

// To calculate the current position in the AI generated text stream
let ttsCursor = 0;
let ttsWordsBuffer = "";
const sentenceEndRegex = new RegExp(/[.!?]+/g);

const chat = chatWithLLM(messages, {
  model,
  functions,
  functionToCall,
  onPause() {
    setPaused(true);
  },
  onResume() {
    setPaused(false);
  },
  async onData({ currentText }) {
    if (!pausedRef.current) {
      // Hook tts code here
      ttsWordsBuffer = currentText.slice(ttsCursor);

      if (
        isTtsSupported() &&
        getSettings().announceMessages &&
        sentenceEndRegex.test(ttsWordsBuffer) // Has full sentence
      ) {
        // Reset lastIndex before calling exec
        sentenceEndRegex.lastIndex = 0;
        const sentenceEndIndex = sentenceEndRegex.exec(ttsWordsBuffer)!.index;

        // Pass the sentence to tts api for processing
        const textToBeProcessed = ttsWordsBuffer.slice(0, sentenceEndIndex + 1);
        const audioClipUri = textToSpeech(textToBeProcessed);
        addToAudioQueue(audioClipUri);

        // Update the tts Cursor
        ttsCursor += sentenceEndIndex + 1;
      } else if (ttsWordsBuffer.split(" ").length >= TTS_BUFFER_THRESHOLD) {
        // Flush the entire buffer into tts api
        const audioClipUri = textToSpeech(ttsWordsBuffer);
        addToAudioQueue(audioClipUri);

        ttsCursor += ttsWordsBuffer.length;
      }

      setStreamingMessage(
        new ChatCraftAiMessage({
          id: message.id,
          date: message.date,
          model: message.model,
          text: currentText,
        })
      );
      incrementScrollProgress();
    }
  },
});



Here's the final result 🎉

More Work 🫡

Apart from this, I also worked on improving the Audio Recording UI this week.

The aim was to adopt the "press to start, press to stop" behaviour for the recording button.

recording button
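The core idea is toggling a MediaRecorder between start and stop on successive presses. A minimal sketch of that pattern (not ChatCraft's actual recording code) would look something like this:

let recorder: MediaRecorder | null = null;

async function toggleRecording() {
  if (recorder && recorder.state === "recording") {
    // Second press: stop the current recording
    recorder.stop();
    recorder = null;
    return;
  }

  // First press: request the microphone and start recording
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  recorder = new MediaRecorder(stream);
  recorder.start();
}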

I also helped a Pull Request from Yumei get merged by reviewing and suggesting some changes.

Even though I was supposed to get a Pull Request merged this week and couldn't, I technically got some code in using the suggestion feature.

I am there

Don't say that's cheating now 😉

Here's the Pull Request
https://github.com/tarasglek/chatcraft.org/pull/369

Upcoming

In this post, I discussed my various contributions to ChatCraft this week.

There's still a lot of work that needs to be done for TTS support in follow-ups, like the ability to choose between different voices, downloading the speech for a response, cancelling the currently playing audio, and so on...
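Choosing between voices, for instance, would mostly mean parameterizing the voice option that is currently hardcoded to "onyx". Here's a sketch of that direction (not the final design):

export const textToSpeech = async (
  message: string,
  // The voices OpenAI's TTS API currently offers
  voice: "alloy" | "echo" | "fable" | "onyx" | "nova" | "shimmer" = "onyx"
): Promise<string> => {
  const { apiKey, apiUrl } = getSettings();
  if (!apiKey) {
    throw new Error("Missing API Key");
  }
  const { openai } = createClient(apiKey, apiUrl);

  const mp3 = await openai.audio.speech.create({
    model: "tts-1",
    voice,
    input: message,
  });

  const blob = new Blob([await mp3.arrayBuffer()], { type: "audio/mpeg" });
  return URL.createObjectURL(blob);
};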

First, I'll have to redo the sentence tokenizing logic using a library suggested by my professor.

Suggestions

I'll soon post about the work that I do for TTS.

In the meantime, STAY TUNED!

Top comments (4)

Thomas Davis
Thanks for writing this up, saved me those hours =D

Amnish Singh Arora
Happy to know this helped :D

Scott Kitney
Fantastic article. Very detailed and helped me out a lot!!

Amnish Singh Arora
Thanks Scott, Glad it helped!