DEV Community

IvanDev
Audio to Text using Python and OpenAI

Introduction

In this tutorial, I will build a basic app that recognizes speech in audio files and converts it to text. With the OpenAI API, the Python library pydub for audio manipulation, and python-dotenv for managing environment variables, this is easy to do. The code is short, and the detailed explanation makes it easy to apply in your daily tasks.

Let's get our hands dirty:

Clone the repository:

git clone https://github.com/ivansing/audio-to-text-app.git
cd audio-to-text-app

The repository ships with sample audio files in its assets folder; copy them into your own project's assets folder.

Setting Up the Environment

Prerequisites:

  • Basic Python language
  • Code Editor
  • Basic command line

Step 1: Setting Up the Project

  • Install Python from python.org. It is straightforward; just accept the (recommended) prompts.
  • I will use VS Code, as it makes managing a project and development relatively easy.
  • In VS Code, open the Terminal menu in the top bar, select New Terminal, and type the following command:
mkdir audio-text-app

Then move into the directory we just created:

cd audio-text-app

Your projects would then be located in paths like:
/home/your-username/Projects/my_project (Linux)
/Users/your-username/Projects/my_project (Mac)

Inside the audio-text-app folder, create the following files:

touch audio-to-text.py .env

The file audio-to-text.py holds this small script app's main functionality and is its entry point.
.env is where I will store the API key from OpenAI. I will use it in the following steps.
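For reference, the .env file holds a single line. The variable name OPENAI_API_KEY shown here is an assumption on my part (it is the conventional name); just use the same name when you read it back in your code:

```text
OPENAI_API_KEY=your-secret-key-here
```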

For Windows, using the Windows Subsystem for Linux (WSL)

  • Open VS Code, press F1, and select "Connect to WSL."
  • Follow the previous steps for Linux/Mac.

Step 2: Install Required Libraries and Create Folders

  • In your terminal window, type the following instructions:
pip install openai pydub python-dotenv
  • Install FFmpeg:

    • On macOS (using Homebrew): brew install ffmpeg
    • On Ubuntu: sudo apt install ffmpeg
    • On Windows: Download and install from ffmpeg.org
  • Make another directory named assets inside the audio-text-app folder. I will use it to store the .wav audio files for testing:

mkdir assets
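At this point the project layout should look roughly like this (a sketch based on the steps above, not generated output):

```text
audio-text-app/
├── audio-to-text.py
├── .env
└── assets/        # sample .wav files go here
```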

Step 3: Setup your OpenAI API Key:

  • Go to OpenAI and sign up to generate your OpenAI API key.


  • Then, press "Create new secret key"; this generates the API key.


  • Follow the steps in the popup modal window, and press "Create secret key."


  • Finally, save the generated key in a notepad or other safe place, always hidden from the public, and press "Done."


Now that we have generated our precious hidden key, let's continue with the following:

Step 4: Write the code

  • Import the libraries:
import openai
from pydub import AudioSegment
import os
import uuid
from dotenv import load_dotenv

# Load the .env file and set the API key
# (assumes the variable in .env is named OPENAI_API_KEY)
load_dotenv()
openai.api_key = os.getenv("OPENAI_API_KEY")
  • openai: For interacting with the OpenAI Whisper API to transcribe audio.
  • pydub: For audio manipulation, such as converting to a single channel (mono) and resampling; this also makes the file smaller, so transcription puts less strain on the CPU.
  • os: For building file paths and removing temporary output files.
  • uuid: To generate unique file names for processed audio.
  • dotenv: To load environment variables from a .env file, which securely stores the API key.

Functions

convert_to_mono_16k

def convert_to_mono_16k(audio_file_path, output_dir="assets"):
    """Converts audio to mono and 16kHz, returns the path to the converted audio."""
    sound = AudioSegment.from_file(audio_file_path)
    sound = sound.set_channels(1)  # Mono
    sound = sound.set_frame_rate(16000)  # 16kHz

    # Generate a unique filename for the mono version
    converted_file_name = f"{uuid.uuid4()}.wav"
    converted_file_path = os.path.join(output_dir, converted_file_name)

    # Export the converted audio file
    sound.export(converted_file_path, format="wav")
    return converted_file_path

This function takes an audio file, converts it to mono (1 audio channel), and resamples it to 16kHz, which is required for optimal transcription with Whisper.

  • The audio file is loaded using AudioSegment.
  • It is converted to mono with set_channels(1).
  • The sample rate is set to 16kHz using set_frame_rate(16000).
  • A unique file name is generated using uuid to avoid filename conflicts.
  • The processed audio file is exported to the specified output directory (assets by default).
  • This function returns the file path of the converted audio, which will be used later for transcription.
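To see what "mono, 16 kHz" means concretely, here is a small sketch using only Python's standard-library wave module to write and inspect a WAV file (the file name sample_mono_16k.wav is arbitrary):

```python
import struct
import wave

def wav_info(path):
    """Return (channels, frame_rate) for a WAV file using only the stdlib."""
    with wave.open(path, "rb") as wf:
        return wf.getnchannels(), wf.getframerate()

# Write one second of silent mono 16 kHz audio to inspect.
with wave.open("sample_mono_16k.wav", "wb") as wf:
    wf.setnchannels(1)       # mono: a single audio channel
    wf.setsampwidth(2)       # 16-bit samples
    wf.setframerate(16000)   # 16 kHz sample rate
    wf.writeframes(struct.pack("<h", 0) * 16000)  # 16000 silent frames

print(wav_info("sample_mono_16k.wav"))  # (1, 16000)
```

pydub's set_channels(1) and set_frame_rate(16000) produce a file with exactly these two properties.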

transcribe_audio

def transcribe_audio(audio_file_path, clean_up=True):
    """Transcribes audio to text using OpenAI's Whisper."""
    # Convert audio to mono and 16kHz
    mono_audio_path = convert_to_mono_16k(audio_file_path)

    # Transcribe audio using OpenAI's Whisper
    with open(mono_audio_path, "rb") as audio_file:
        transcript = openai.Audio.transcribe("whisper-1", audio_file)

    # Clean up the converted file if needed
    if clean_up:
        os.remove(mono_audio_path)

    return transcript['text']

This function transcribes an audio file into text using the OpenAI Whisper API.

  • It calls the convert_to_mono_16k function to ensure the audio is in the correct format (mono, 16kHz).
  • The converted file is opened in binary mode "rb" and passed to the Whisper API transcription.
  • The function optionally cleans up (deletes) the temporary audio file after the transcription, controlled by the clean_up argument.
  • The function returns the transcription text extracted from the Whisper API's response.
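Because the transcription call goes over the network, it can fail transiently. A generic retry wrapper (not part of the original script; the attempt count and backoff delays are arbitrary choices) could look like this:

```python
import time

def with_retries(fn, attempts=3, base_delay=0.1):
    """Call fn(); on exception, wait and retry with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: re-raise the last error
            time.sleep(base_delay * (2 ** attempt))

# Example: a function that fails twice, then succeeds.
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient error")
    return "ok"

print(with_retries(flaky))  # ok
```

In real use you would pass something like lambda: transcribe_audio(audio_file) as fn.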

Test code

# Example usage
audio_file = "assets/jackhammer.wav"
transcription = transcribe_audio(audio_file)
print("Transcription:", transcription)

This section demonstrates how to use the transcribe_audio function.

Besides the samples stored in the assets folder, you can add more .wav files to test it.
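If you add several .wav files, a small helper (not in the original script) can run a transcription function over the whole folder. The function is injected as a parameter so the helper can be tried without calling the API; in real use you would pass transcribe_audio:

```python
import os

def transcribe_folder(folder, transcribe_fn):
    """Apply a transcription function to every .wav file in a folder.

    transcribe_fn takes a file path and returns text, e.g. transcribe_audio.
    Returns a dict mapping file name -> transcription.
    """
    results = {}
    for name in sorted(os.listdir(folder)):
        if name.lower().endswith(".wav"):
            results[name] = transcribe_fn(os.path.join(folder, name))
    return results

# Real use (requires the API key and the functions above):
# results = transcribe_folder("assets", transcribe_audio)
```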

Test the code with the following command:

python3 audio-to-text.py

Now check the output text from the audio file in the terminal:


  • The audio_file variable specifies the audio file to be transcribed.
  • The transcribe_audio function is called with the audio file path.
  • The transcription result is printed to the console.

Summary

This tutorial was ideal for learning the basics of using various Python libraries. We learned the OpenAI Whisper API, a neural-network model trained for speech recognition, and the use of pydub to manipulate the audio. I used the native Python libraries os for path handling and uuid to give the mono output file a unique name.

Conclusion

Python is a vast universe used across general software construction, and you can use this tool as part of a small software package. It is a minimal program that still needs many more things, like more test cases than you can imagine. Ideally, you would write the output to a text file, but for this short tutorial I didn't want to add more complexity. If you want to add more features, look at the Python docs; you will be amazed at what you can do, and with the help of APIs (outside programs to communicate with), there will be fantastic software builds.


About the Author

Ivan Duarte is a backend developer with freelance experience. He is passionate about web development and artificial intelligence and enjoys sharing his knowledge through tutorials and articles. Follow me on X, GitHub, and LinkedIn for more insights and updates.
