
Wesley Chun (@wescpy)


Generate audio clips with Gemini 2.0 Flash

TL;DR:

Happy holidays! Google recently "gifted" us the new Gemini 2.0 Flash model, expanding on what's available in the original 1.x models. One of the new features is the ability to generate audio clips from text. Good ol' fashioned predictive AI's text-to-speech functionality is certainly useful, but this takes it to the next level, giving genAI users "idea-to-speech" capabilities. Learn how to access this new feature from Python today!

Build with Gemini

Introduction

Welcome to the blog focusing on using Google APIs from Python and sometimes Node.js. Today's post focuses on Gemini, but you'll find plenty of content beyond Gemini in the other posts in this series.

Today, we take a break from the flow of the previous posts in this series covering the Gemini API to explore one new feature. While some users may be content using ChatGPT or Gemini online or via an app, the Gemini API brings generative AI abilities to your own apps. If you're new or still exploring, check out the earlier posts to get started and to see the API's basic capabilities. This post looks at just one feature of the Gemini 2.0 Flash model: audio clip generation from text.

Prerequisites

New client library improves user experience (UX)

You need a client library to talk to Gemini from code. While several client libraries already exist for Gemini, Google has recently introduced a new one. The new library features an improved UX, so I have to give Google some credit. In the first Gemini post in the series, I lamented that making the API available from two different platforms confuses developers:

  1. Google AI
  2. GCP Vertex AI

Differing client libraries, numerous code samples, documentation in different locations under different web domains, etc., all add up to a less-than-optimal UX. A replacement client library that works across both platforms lets users get started and experiment on Google AI, then "upgrade" to Vertex AI when ready for production, without changing their code.
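For example, here's a minimal sketch of that switch; the API key is a placeholder, and the project/location values are hypothetical. The vertexai flag is what selects the platform:

from google import genai

# Google AI: an API key is all you need
client = genai.Client(api_key='YOUR_API_KEY')

# Vertex AI: same library, one flag plus GCP project & region
# (project and location values below are hypothetical placeholders)
client = genai.Client(vertexai=True, project='YOUR_GCP_PROJECT',
                      location='us-central1')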

💡 Yes, it's an "ifdef"
If you're like me and like to dig around in code, you may be curious about how the new client library works across both Google AI and Vertex AI. It's not magic, so you'll find if-else blocks where it matters, like a C/C++ ifdef. In the new client library, any time you see mldev, think Google AI, and as expected, vertex is Vertex AI.

One example is found in the Live API code while another is in the models code. (NOTE: these links will probably break when a new version is pushed, but I'll update them once the library has an official release.)
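As a purely schematic sketch (not the library's actual code), the dispatch boils down to something like this, using the real base URLs for each platform:

def base_url(vertexai: bool, location: str = 'us-central1') -> str:
    'pick the API endpoint by platform (mldev vs. vertex)'
    if vertexai:    # 'vertex': GCP Vertex AI
        return f'https://{location}-aiplatform.googleapis.com'
    return 'https://generativelanguage.googleapis.com'    # 'mldev': Google AI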

At the time of this writing, the new client library is only available in Python and Go. (Java and JS/Node.js are next. Keep checking the Gemini API SDKs page for the latest releases.) The sample app is only available in Python^, but I'm happy to explore a Golang PR if you get to an equivalent port before I do.

^ -- Python 3 only; Python 2 support is not available for the Gemini API

Installation and setup

As we explore this new feature, the app will run on Google AI; an upcoming post will show how to run the same app on Vertex AI. Follow these steps to install the client library and get set up:

  1. Install the new client library: pip install -U google-genai
  2. Create an API key (if you don't already have one)
  3. Save API key as a string to settings.py as API_KEY = 'YOUR_API_KEY' (and follow the suggestions in the sidebar below to protect it)
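For reference, here's everything settings.py needs to contain, with your real key in place of the placeholder:

# settings.py: keep this file out of version control, e.g., .gitignore it
API_KEY = 'YOUR_API_KEY'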

⚠️ WARNING: Keep API keys secure
Storing API keys in files (or hard-coding them in source) is for prototyping and learning purposes only. When going to production, put them in environment variables or, better yet, a secrets manager. Files like settings.py or .env containing API keys are susceptible to leaking. Under no circumstances should you upload files like those to any public or private repo, put sensitive data like that in Terraform config files, add such files to Docker layers, etc.: once your API key leaks, everyone in the world can use it.

If you're new to Google developer tools, API keys are one of the credentials types supported by Google APIs, and they're the only type supported by Maps APIs. Other credentials types include OAuth client IDs, mostly used by GWS APIs, and service accounts, mostly used by Google Cloud (GCP) APIs. While this post doesn't cover Google Maps, the Maps team put together a great guide on API key best practices, so check it out!

The app

The sample app gem20-audio.py sends a prompt of Describe a cat in a few sentences to Gemini and requests an audio clip in response, so the app's functionality is pretty brief: make the request, get the response, and save the audio file.

The code

import asyncio
import contextlib
import wave

from google import genai
from settings import API_KEY

CLIENT = genai.Client(api_key=API_KEY, http_options={'api_version': 'v1alpha'})
MODEL = 'gemini-2.0-flash-exp'
CONFIG = {'generation_config': {'response_modalities': ['AUDIO']}}
PROMPT = 'Describe a cat in a few sentences'
FILENAME = 'whatacatis.wav'

@contextlib.contextmanager
def wave_file(filename, channels=1, rate=24000, sample_width=2):
    'set up .wav file writer'
    with wave.open(filename, 'wb') as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(sample_width)
        wf.setframerate(rate)
        yield wf

async def request_audio(prompt=PROMPT, filename=FILENAME):
    'request LLM generate audio file given prompt'
    print(f'\n** LLM prompt: "{prompt}"')
    async with CLIENT.aio.live.connect(model=MODEL, config=CONFIG) as session:
        with wave_file(filename) as f:
            await session.send(prompt, end_of_turn=True)
            async for response in session.receive():
                if response.data:
                    f.writeframes(response.data)
    print(f'** Saved audio to "{filename}"')

asyncio.run(request_audio())
[CODE] gem20-audio.py: Audio "Hello World!" sample

 

App components

There are 4 major chunks to this script:

  1. Imports
  2. Constants
  3. Audio file writer
  4. Core functionality

Imports

From the Python standard library, asyncio is required because the Multimodal Live API (feature and usage) is only available asynchronously. The contextlib.contextmanager decorator is needed so we can wrap and use the audio file-writer with Python's with statement. The last "stdlib" package used is wave, which processes WAVE audio files. This is followed by importing Google's new "genAI" client library.

Like in previous code samples in this series, the API key is saved to settings.py. Alternatively, you can save your API key to the GOOGLE_API_KEY environment variable, or use the python-dotenv package, storing the API key in .env to more closely mirror working in a Node.js environment. There's also the GCP Secret Manager as yet another option.
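For instance, here's a minimal sketch of the environment-variable route, optionally pulling the key from a .env file via python-dotenv (pip install python-dotenv):

import os
from dotenv import load_dotenv   # only needed for the .env route

load_dotenv()   # reads .env if present; harmless no-op otherwise
API_KEY = os.getenv('GOOGLE_API_KEY')
if not API_KEY:
    raise RuntimeError('set GOOGLE_API_KEY (env var or .env file)')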

Constants and audio file writer

Constants for the API client, generative large language model (genAI LLM), and model configuration follow. The last pair of constants are the user's prompt and filename to save the generated audio to.

The WAV file writer (wave_file()) just sets up the basic parameters as a generator and wraps it in a context manager, allowing it to be used with the with statement. You'll find nearly-identical code in various samples and Notebooks in the Gemini 2.0 cookbook repo.
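The writer's defaults (24 kHz, 16-bit, mono) match the PCM audio the Live API streams back. Here's a hypothetical helper, using only the same stdlib wave module, to sanity-check a generated clip:

import wave

def clip_info(filename):
    'summarize a WAV file written by wave_file()'
    with wave.open(filename, 'rb') as wf:
        secs = wf.getnframes() / wf.getframerate()
        return f'{filename}: {secs:.1f}s, {wf.getframerate()} Hz, {wf.getnchannels()} channel(s)'

print(clip_info('whatacatis.wav'))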

Core functionality

All of the "real work" takes place in request_audio(). It runs a single session of the Live API: open the WAV file for writing, send the prompt to the LLM, then continuously await server responses, writing out each chunk of audio data received until the stream is exhausted and the session terminates.

This is the minimal code required to do the job. In other examples from Google, you'll find references to server_content, inline_data, and writing out parts (see the sketch below). Most of that relates to supporting multi-turn conversations, but for a single request-response "cycle," less code is less confusing.
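For the curious, that longer form looks roughly like this inside request_audio(); it's a sketch based on Google's samples, and attribute names may shift while the library is in alpha:

# drop-in replacement for the receive loop in request_audio()
async for response in session.receive():
    server_content = response.server_content
    if server_content and server_content.model_turn:
        for part in server_content.model_turn.parts:
            if part.inline_data:    # audio chunks arrive as inline data
                f.writeframes(part.inline_data.data)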

Running the script

Running the script produces an audio file along with the expected output:

$ python3 gem20-audio.py

** LLM prompt: "Describe a cat in a few sentences"
** Saved audio to "whatacatis.wav"

Your mileage may vary, but this is the audio track I got from Gemini:

[AUDIO] whatacatis.wav: generated audio clip

Summary

Developers are eager to jump into the world of AI/ML, especially GenAI & LLMs, and accessing Google's Gemini models via API is part of that picture. The previous posts in the series got your foot in the door, and today, we explore a new feature available from Gemini 2.0 Flash. In upcoming posts, we'll continue this journey by describing how to run it from Vertex AI.

If you want to see Gemini API code for both platforms, check out the intro (1st) post in this series. In another future post, we'll pick up the basic genAI web app from the previous (3rd) post (link also below) and show you how to deploy it to Google Cloud.

If you find errors or have suggestions for content you'd like to see in future posts, leave a comment below, and if your organization needs help integrating Google technologies via their APIs, reach out to me by submitting a request at https://cyberwebconsulting.com. Thanks for reading, and I hope to meet you if I come through your community; you'll find my travel calendar at the bottom of that page as well. Season's greetings, and see you next year!

PREV POST: Part 3: Gemini API 102a... Putting together basic GenAI web apps

References

Below are various links relevant to this post:

Code samples

Gemini API (Google AI)

Gemini 2.0 Flash

Other Generative AI and Gemini resources

Other Gemini API content by the author



WESLEY CHUN, MSCS, is a Google Developer Expert (GDE) in Google Cloud (GCP) & Google Workspace (GWS), author of Prentice Hall's bestselling "Core Python" series, co-author of "Python Web Development with Django", and has written for Linux Journal & CNET. He runs CyberWeb specializing in GCP & GWS APIs and serverless platforms, Python & App Engine migrations, and Python training & engineering. Wesley was one of the original Yahoo!Mail engineers and spent 13+ years on various Google product teams, speaking on behalf of their APIs, producing sample apps, codelabs, and videos for serverless migration and GWS developers. He holds degrees in Computer Science, Mathematics, and Music from the University of California, is a Fellow of the Python Software Foundation, and loves to travel to meet developers worldwide at conferences, user group events, and universities. Follow he/him @wescpy & his technical blog. Find this content useful? Contact CyberWeb for professional services or buy him a coffee (or tea)!
