DEV Community

Prashant Iyer for LLMWare


🔉From Sound to Insights: Using AI🤖 for Audio File Transcription and Analysis!🚀

If we were given an audio file, is there any way we could identify the time stamps where specific words were said? Is there any way we could extract all the key words mentioned about a topic?

With AI 🤖, we can do all of this and much more! The key lies in being able to parse audio into text, allowing us to then harness the natural language processing capabilities of language models to perform sophisticated analyses and inferences on our data.

Regardless of who you are, such an approach to audio transcription and analysis will augment how you interact with and extract knowledge from audio files.

Let's see how we can do this with llmware.


AI Tools 🤖

We'll be using two models for this example.

The first is Whisper by OpenAI. This is the model that will allow us to parse the audio files, i.e. convert them from audio to text.

The second is the SLIM (Structured Language Instruction Model) Extract Tool by LLMWare, which we'll be using to ask questions about our audio. This is a GGUF quantized version of a much larger model called slim-extract. All this means is that our model, the SLIM Extract Tool, is a smaller and faster version of the original model. This allows us to run it locally on a CPU, without the need for powerful computational resources like GPUs!

With that out of the way, let's get started with the example.


Step 1: Loading in audio files 🔉🔉

If you have audio files that you want to run the example with, then feel free to use those by setting input_folder appropriately, but if not, the llmware library provides you with several sets of sample audio files!



import os
from llmware.setup import Setup
voice_sample_files = Setup().load_voice_sample_files(small_only=False)
input_folder = os.path.join(voice_sample_files, "greatest_speeches")



Here, we're loading in the greatest_speeches set of audio files.


Step 2: Parsing our audio files 📝

Now that we have our audio files, we can parse them into chunks of text. Recall that we need the WhisperCPP model to do this. Fortunately, you won't have to interact with the model directly, since the Parser class from the llmware library takes care of this for you!



from llmware.parsers import Parser
parser_output = Parser(chunk_size=400, max_chunk_size=600).parse_voice(input_folder, write_to_db=False, copy_to_library=False, remove_segment_markers=True, chunk_by_segment=True, real_time_progress=False)



Here, chunk_size sets the target size of each chunk of parsed text and max_chunk_size sets its upper limit. We're passing the folder containing our audio files to the parse_voice() function of the Parser class.

The function does accept many more optional arguments about how we'd like the audio to be parsed, but we can ignore them for this example.
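As a rough illustration of how a target size and a hard cap can interact, the sketch below groups timestamped transcript segments into chunks. It is not llmware's chunking algorithm: the function chunk_segments, the choice to measure size in characters, and the sample segments are all assumptions made for illustration. The "text" and "coords_x" keys mirror the fields used later in this article.

```python
def chunk_segments(segments, chunk_size=400, max_chunk_size=600):
    """Group (start_seconds, text) segments into chunks of roughly chunk_size characters.

    Illustrative only: sizes are measured in characters (an assumption), and a
    single segment longer than max_chunk_size is kept whole rather than split.
    """
    chunks = []
    current_text, current_start = "", None
    for start, text in segments:
        candidate = (current_text + " " + text).strip()
        if current_text and len(candidate) > max_chunk_size:
            # Adding this segment would exceed the cap: close the current chunk first.
            chunks.append({"text": current_text, "coords_x": current_start})
            current_text, current_start = text, start
        else:
            if current_start is None:
                current_start = start
            current_text = candidate
        if len(current_text) >= chunk_size:
            # Target size reached: close the chunk and start a fresh one.
            chunks.append({"text": current_text, "coords_x": current_start})
            current_text, current_start = "", None
    if current_text:
        chunks.append({"text": current_text, "coords_x": current_start})
    return chunks


# Hypothetical segments: (start time in seconds, transcribed text).
sample_segments = [(0.0, "a" * 300), (10.0, "b" * 300), (20.0, "c" * 50)]
sample_chunks = chunk_segments(sample_segments)
for chunk in sample_chunks:
    print(chunk["coords_x"], len(chunk["text"]))
```

Each chunk keeps the start time of its first segment, which is what makes it possible to point back to a position in the audio later on.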


Step 3: Text searching 🕵️

Let's now run a text search on our parsed audio. We can try searching for the word "president". This means we want to find every portion of the audio, and its corresponding text, that contains the word "president". We can do this using the fast_search_dicts() function in the Utilities class in the llmware library.



from llmware.util import Utilities
results = Utilities().fast_search_dicts("president", parser_output)
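To make the idea concrete, here is a minimal sketch of what a keyword search over parsed chunks amounts to. It is not llmware's implementation of fast_search_dicts(): the helper name search_chunks and the sample chunks are hypothetical, and the only assumption carried over from this article is that each parsed chunk is a dictionary with a "text" key.

```python
def search_chunks(query, chunks, text_key="text"):
    """Return every chunk whose text contains the query, ignoring case."""
    needle = query.lower()
    return [chunk for chunk in chunks if needle in chunk.get(text_key, "").lower()]


# Hypothetical chunks standing in for parser_output.
sample_chunks = [
    {"text": "The President spoke to the nation.", "coords_x": 12.5},
    {"text": "Thank you all for coming tonight.", "coords_x": 48.0},
]

matches = search_chunks("president", sample_chunks)
print(matches)
```

Because each match is the original chunk dictionary, its other fields (such as the start time) travel along with the matched text.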



Step 4: Making an AI call on text chunks 🤖

Now that we have a list of text blocks containing the word "president", let's use an AI model to identify which presidents are being mentioned in the selected text blocks.



from llmware.models import ModelCatalog
extract_model = ModelCatalog().load_model("slim-extract-tool", sample=False, temperature=0.0, max_output=200)



Here, we're using the ModelCatalog class to load in our SLIM Extract Tool. Let's now iterate over each text block containing "president".



final_list = []
for i, res in enumerate(results):
    response = extract_model.function_call(res["text"], params=["president name"])



We're making a function_call() for "president name". This is how we ask our Tool to identify the president name in the text block.


Step 5: Analyzing our output 🔍

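function_call() returns a dictionary containing a lot of data about the call. We specifically want the president_name key inside its llm_response entry (the space in our "president name" parameter becomes an underscore in the key). The snippets in this step use res and response, so they run inside the loop from Step 4.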



extracted_name = ""
if "president_name" in response["llm_response"]:
    if len(response["llm_response"]["president_name"]) > 0:
        extracted_name = response["llm_response"]["president_name"][0].lower()
    else:
        print("\nupdate: skipping result - no president name found - ", response["llm_response"], res["text"])



If the value of the president_name key is a non-empty list, then we store its first element, converted to lowercase, in extracted_name. Otherwise, no result was found and we print this out.
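The same check can be wrapped in a small helper that is easy to try without loading a model. This is only a sketch: first_extracted_value is a hypothetical name, and the two sample responses are made-up dictionaries that mimic the llm_response shape described in this step.

```python
def first_extracted_value(response, key):
    """Return the first extracted value for key, lowercased, or "" if none was found."""
    llm_response = response.get("llm_response")
    if not isinstance(llm_response, dict):
        # The model output could not be read as a dictionary of keys.
        return ""
    values = llm_response.get(key)
    if isinstance(values, list) and values:
        return str(values[0]).lower()
    return ""


# Hypothetical responses, not real model output.
found = {"llm_response": {"president_name": ["John F. Kennedy"]}}
empty = {"llm_response": {"president_name": []}}

print(first_extracted_value(found, "president_name"))
print(first_extracted_value(empty, "president_name"))
```

Returning an empty string for every "nothing found" case keeps the later matching step simple, since no known name is a substring of "".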

Now let's see if the president name matches any of the recent American presidents in this list:



various_american_presidents = ["kennedy", "carter", "nixon", "reagan", "clinton", "obama"]



To do this, we'll check if the extracted_name contains any of these American presidents. If we have a match, then we'll add it to our final_list as a dictionary containing some information about the location of the name in the audio as well as the text block it was in.



for president in various_american_presidents:
    if president in extracted_name:
        final_list.append({"key": president, "source": res["file_source"], "time_start": res["coords_x"], "text": res["text"]})
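Since the snippets from Steps 4 and 5 are shown separately, the sketch below puts them together in one loop so the control flow is easier to follow. It makes some assumptions: match_known_names is a hypothetical helper, fake_extract is a made-up stand-in for the model call so the example runs without llmware, and the sample results are invented. The dictionary keys match the ones used in this article.

```python
def match_known_names(results, extract_fn, known_names, key="president_name"):
    """Run extract_fn on each search result and keep results that name someone in known_names."""
    final_list = []
    for res in results:
        response = extract_fn(res["text"])
        llm_response = response.get("llm_response")
        values = llm_response.get(key, []) if isinstance(llm_response, dict) else []
        extracted_name = values[0].lower() if values else ""
        for name in known_names:
            if name in extracted_name:
                final_list.append({
                    "key": name,
                    "source": res["file_source"],
                    "time_start": res["coords_x"],
                    "text": res["text"],
                })
    return final_list


def fake_extract(text):
    """Hypothetical stand-in for the model: 'finds' a name only if the text mentions Kennedy."""
    names = ["President Kennedy"] if "Kennedy" in text else []
    return {"llm_response": {"president_name": names}}


# Invented search results with the fields used in this article.
sample_results = [
    {"text": "President Kennedy said ...", "file_source": "speech_a.wav", "coords_x": 10.0},
    {"text": "The president of the club ...", "file_source": "speech_b.wav", "coords_x": 25.5},
]
presidents = ["kennedy", "carter", "nixon", "reagan", "clinton", "obama"]
matched = match_known_names(sample_results, fake_extract, presidents)
print(matched)
```

With the real model, extract_fn would wrap the call from Step 4, for example lambda text: extract_model.function_call(text, params=["president name"]).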



Results! ✅

Let's now output the final_list.



for i, f in enumerate(final_list):
    print("final results: ", i, f)



This is what one search result in the output looks like after running the code.

Sample output

Here, we have a Python dictionary as output containing:

  • key: the name of the president identified, which here is "kennedy"
  • source: the audio file this was found in, which here is "ConcessionStand.wav"
  • time_start: the time stamp in seconds where the president was mentioned, which here is 339.9 seconds
  • text: which contains the text chunk the name was found in.
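Since time_start is given in seconds, it can be handy to convert it into a minutes-and-seconds timestamp when you want to jump to that spot in an audio player. The helper below is a small sketch and not part of llmware; the name format_timestamp and the MM:SS.s format are choices made for illustration.

```python
def format_timestamp(seconds):
    """Format a duration in seconds as MM:SS.s, rounded to the nearest tenth of a second."""
    total_tenths = round(seconds * 10)
    minutes, tenths = divmod(total_tenths, 600)
    return f"{minutes:02d}:{tenths / 10:04.1f}"


print(format_timestamp(339.9))  # 05:39.9
```

Rounding to whole tenths before splitting into minutes and seconds avoids awkward results such as "00:60.0" for values just under a minute.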

Conclusion

And we're done! To recap, we were able to parse our audio files into text, run a text search on them for the word "president", and then use our SLIM Extract Tool to identify the specific presidents named in our text chunks! And remember that we did all this on just a CPU! 💻

Be sure to check out our YouTube video on this example!

If you made it this far, thank you for taking the time to go through this topic with us ❤️! For more content like this, make sure to visit our dev.to page.

The source code for many more examples like this one is on our GitHub. Find this example here.

Our repository also contains a notebook for this example that you can run yourself using Google Colab, Jupyter or any other platform that supports .ipynb notebooks.

Join our Discord to interact with a growing community of AI enthusiasts of all levels of experience!

Please be sure to visit our website llmware.ai for more information and updates.

Top comments (2)

Aravind Putrevu

Interesting, is this whisper behind?

Prashant Iyer

Yes! LLMWare integrates Whisper into the Parser class to transcribe audio to text.