Gabriel Sena

Cloudflare AI Challenge: Audio Interaction with AudioInsight

This is a submission for the Cloudflare AI Challenge.

What I Built

I built AudioInsight, an app that processes audio, transcribes it, summarizes it, generates a title for the content, and allows users to ask questions about the related audio.

The chat and audio are stored remotely, so the user can come back later to ask new questions or listen to the audio again.

To build this app, I used several products from the Cloudflare catalog: Cloudflare D1, R2, Workers AI, Cloudflare Pages, and AI models for Automatic Speech Recognition, Summarization, and Text Generation. These are explained in the Journey section.

Demo

Demo Link
Original Cloudflare Pages Demo Link

My Code

gabrielsenadev / audioinsight

AudioInsight is a web application that processes audio, generates transcriptions, and allows users to ask questions about the related audio.

AudioInsight running example (screenshot)

AudioInsight

AudioInsight is a full-stack application that processes audio, generates transcriptions, and allows users to ask questions about the related audio.

Its creation was motivated by participation in a dev.to challenge.

How to Install

1. Start by cloning this repository:

   git clone git@github.com:gabrielsenadev/audioinsight.git

2. Install dependencies:

   npm ci

3. Configure your environment:

   See Environment Variables.

4. Run the application:

   npm run dev

Environment Variables

This application depends on a few external providers for AI and data storage. It was developed with minimal coupling to any specific provider, so if you prefer a different one, you can switch easily.
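
For illustration, the provider boundary could look something like the sketch below. These interfaces and method names are hypothetical, not the actual abstractions in this repository; they only show why swapping a provider stays cheap.

```ts
// Hypothetical provider interfaces (illustrative only, not the repository's actual code).
export interface StorageProvider {
  saveAudio(chatId: string, audio: ArrayBuffer): Promise<void>;
  getAudio(chatId: string): Promise<ArrayBuffer | null>;
}

export interface AIProvider {
  transcribe(audio: ArrayBuffer): Promise<string>;
  summarize(text: string): Promise<string>;
  generateText(prompt: string): Promise<string>;
}

// Switching providers then only means writing another implementation of these
// interfaces; the rest of the application keeps calling the same methods.
```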

Cloudflare AI:

This application integrates with the Cloudflare AI ecosystem to utilize AI Models.

  • CLOUDFLARE_ACCOUNT_ID
  • CLOUDFLARE_API_TOKEN
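
These two values are enough to call Workers AI over its REST API. Below is a minimal sketch, assuming Node 18+ (global fetch) and the @cf/openai/whisper model ID from the Workers AI catalog; it is not the app's exact code.

```ts
// Minimal sketch: transcribe audio through the Workers AI REST API.
// Assumes Node 18+ (global fetch) and the @cf/openai/whisper model ID.
const accountId = process.env.CLOUDFLARE_ACCOUNT_ID;
const apiToken = process.env.CLOUDFLARE_API_TOKEN;

export async function transcribe(audio: ArrayBuffer): Promise<string> {
  const response = await fetch(
    `https://api.cloudflare.com/client/v4/accounts/${accountId}/ai/run/@cf/openai/whisper`,
    {
      method: "POST",
      headers: { Authorization: `Bearer ${apiToken}` },
      body: audio,
    },
  );
  // Whisper responses carry the transcription in result.text
  const data = (await response.json()) as { result: { text: string } };
  return data.result.text;
}
```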

Netlify Blobs:

For storing audio data, this application relies on Netlify Blobs. You will need a Netlify Site and Account.

  • NETLIFY_SITE_ID
  • NETLIFY_TOKEN
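
With those two values, audio can be written and read through the official @netlify/blobs client. A rough sketch follows; the "audio" store name and function names are illustrative, not the app's exact code.

```ts
// Sketch: storing and retrieving audio with Netlify Blobs.
// Assumes the official @netlify/blobs package; the "audio" store name is illustrative.
import { getStore } from "@netlify/blobs";

const store = getStore({
  name: "audio",
  siteID: process.env.NETLIFY_SITE_ID!,
  token: process.env.NETLIFY_TOKEN!,
});

export async function saveAudio(chatId: string, audio: ArrayBuffer): Promise<void> {
  await store.set(chatId, audio);
}

export async function loadAudio(chatId: string): Promise<ArrayBuffer | null> {
  return store.get(chatId, { type: "arrayBuffer" });
}
```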

MongoDB:

MongoDB is used to store chats and chat messages.

  • …

Journey

Working with AI has always intrigued me. I had been thinking about building something with AI, and this challenge gave me the motivation to finally do it. One of the AI capabilities I find most impressive is turning voice into text, so I decided to follow that path.

After some time thinking about what to build, I decided to process the audio, transcribe its content, summarize it, and let the user ask questions about the uploaded audio.

I also wanted to explore more of the Cloudflare ecosystem, so one of my personal requirements was storing the chat and audio remotely and giving the user a way to come back to them later.

After defining my requirements and goals, I started learning how these AI models work and how Workers AI works. In that process, I decided to use these AI models: audio-to-text (whisper), content summarization (bart-large-cnn), and text generation for answering questions and generating the chat title (neural-chat-7b-v3-1-awq).

In the Multiple Models and/or Triple Task Types section, I explain how I use these models and show the application flow, which explains how I combine these AI models to participate in the Additional Prize Category.

After developing the main idea, I began to understand how Cloudflare's D1 database and R2 storage work. Then I implemented the ability to store users' chats and audio.

Multiple Models and/or Triple Task Types

To create this app, I utilized three different AI model types to generate its content.

  • whisper is responsible for converting audio to text.
  • bart-large-cnn is tasked with generating a summary of the related audio content.
  • neural-chat-7b-v3-1-awq handles generating the chat title and answering questions about the related content.

When the user uploads an audio file, the chat creation process starts. Here, I combine all three models, each generating one piece of content: the audio transcription, the summary, and the chat title.

When the user asks a question, only the text generation model is used to answer it.
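
A condensed sketch of these two flows is below. The model IDs and response fields are assumptions based on the Workers AI catalog (@cf/openai/whisper, @cf/facebook/bart-large-cnn, @hf/thebloke/neural-chat-7b-v3-1-awq), and the helper is a simplification of the REST call shown earlier, not the app's exact code.

```ts
// Condensed sketch of the two flows. Model IDs and response fields are
// assumptions based on the Workers AI catalog, not the app's exact code.
async function runModel(model: string, input: ArrayBuffer | object) {
  const response = await fetch(
    `https://api.cloudflare.com/client/v4/accounts/${process.env.CLOUDFLARE_ACCOUNT_ID}/ai/run/${model}`,
    {
      method: "POST",
      headers: {
        Authorization: `Bearer ${process.env.CLOUDFLARE_API_TOKEN}`,
        ...(input instanceof ArrayBuffer ? {} : { "Content-Type": "application/json" }),
      },
      body: input instanceof ArrayBuffer ? input : JSON.stringify(input),
    },
  );
  return ((await response.json()) as { result: any }).result;
}

// Upload flow: all three models run once to build the chat.
async function createChat(audio: ArrayBuffer) {
  const { text: transcription } = await runModel("@cf/openai/whisper", audio);
  const { summary } = await runModel("@cf/facebook/bart-large-cnn", {
    input_text: transcription,
  });
  const { response: title } = await runModel("@hf/thebloke/neural-chat-7b-v3-1-awq", {
    prompt: `Create a short title for this content:\n${summary}`,
  });
  return { transcription, summary, title };
}

// Question flow: only the text generation model is needed.
async function answerQuestion(transcription: string, question: string) {
  const { response } = await runModel("@hf/thebloke/neural-chat-7b-v3-1-awq", {
    prompt: `Context:\n${transcription}\n\nQuestion: ${question}`,
  });
  return response;
}
```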

Application Flow
Follow the flow below to understand how it works.

[Diagram: two flows showing how the application combines Cloudflare solutions, including three different AI models, to build the app]

This flow shows how the different AI models are used and how the Cloudflare storage solutions fit into the app.

Final words

Developing this entry helped me better understand how the AI ecosystem works and how I can use the Cloudflare ecosystem to turn my ideas into products.

Looking ahead, I'm considering incorporating private chats and additional chat features to enhance user interaction with audio.

Thank you for this challenge!
