This is a submission for the AssemblyAI Challenge: Sophisticated Speech-to-Text.
What I Built
A Speech-to-Text transcription web application that uses Flask for the backend and AssemblyAI's API for real-time audio transcription. The frontend, built with HTML, CSS, and jQuery, offers an interactive interface for controlling the transcription process and viewing the transcribed text in real time.
Demo
Here is the link to my app
Journey
Key Features
Real-Time Transcription:
- Utilizes AssemblyAI's real-time API to process live audio input from the user's microphone and convert it to text.
- Supports both partial and final transcripts.
Web Interface:
- Clean and intuitive design with buttons to start and stop transcription.
- Displays the transcribed text dynamically in a formatted <pre> block.
Flask Backend:
- Handles routes for starting (/start), stopping (/stop), and retrieving the transcript (/transcript).
- Runs transcription in a separate thread to ensure non-blocking operations.
Polling Mechanism:
- Implements a JavaScript-based polling system using jQuery to fetch the latest transcribed text every second.
Customizable Word Boost:
- Boosts recognition accuracy for specific words like "AWS," "Azure," and "Google Cloud."
Responsive Design:
- Ensures usability across devices with a centralized, easy-to-use layout.
Technology Stack
Backend:
- Python (Flask): Manages the web server and API interactions.
- AssemblyAI API: Handles speech-to-text transcription.
```python
import assemblyai as aai
from flask import Flask, render_template, jsonify
import os
from dotenv import load_dotenv
import threading

app = Flask(__name__)

load_dotenv()
aai.settings.api_key = os.getenv('API_KEY')

transcriber = None
transcribed_text = ""


def on_open(session_opened: aai.RealtimeSessionOpened):
    # The SDK passes the opened session to this callback
    print("Transcription started!")


def on_data(transcript: aai.RealtimeTranscript):
    global transcribed_text
    if not transcript.text:
        return
    if isinstance(transcript, aai.RealtimeFinalTranscript):
        # Only final transcripts are appended to the text served by /transcript
        transcribed_text += transcript.text + "\n"
        print("Transcribed:", transcript.text)  # Verify text here
    else:
        print("Received partial:", transcript.text)


def on_error(error):
    print("Error:", error)


def on_close():
    print("Transcription stopped!")


def start_transcription():
    global transcriber
    microphone_stream = aai.extras.MicrophoneStream(sample_rate=16_000)
    transcriber = aai.RealtimeTranscriber(
        # MicrophoneStream captures 16-bit PCM audio, so the encoding must match it
        encoding=aai.AudioEncoding.pcm_s16le,
        sample_rate=16_000,
        word_boost=["aws", "azure", "google cloud"],
        end_utterance_silence_threshold=500,
        on_open=on_open,
        on_data=on_data,
        on_error=on_error,
        on_close=on_close,
    )
    transcriber.connect()  # open the realtime session before streaming any audio
    for audio_data in microphone_stream:
        if transcriber is not None:
            transcriber.stream(audio_data)
        else:
            break


@app.route('/')
def index():
    return render_template('index.html')


@app.route('/start')
def start():
    global transcribed_text
    transcribed_text = ""  # Clear previous transcript
    threading.Thread(target=start_transcription).start()
    return jsonify({"message": "Transcription started!"})


@app.route('/stop')
def stop():
    global transcriber
    if transcriber is not None:
        transcriber.close()
        transcriber = None
        print("Transcriber closed")
    return jsonify({"message": "Transcription stopped!"})


@app.route('/transcript')
def transcript():
    global transcribed_text
    return jsonify({"transcript": transcribed_text})


if __name__ == "__main__":
    app.run(debug=True)
```
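To sanity-check the three routes without opening the browser UI, a quick script along these lines works. This is just an illustrative sketch: it assumes the app is running on Flask's default development address (http://127.0.0.1:5000) and that the requests package is installed.

```python
import time

import requests  # pip install requests

BASE = "http://127.0.0.1:5000"  # Flask's default dev-server address (assumption)

print(requests.get(f"{BASE}/start").json())       # {"message": "Transcription started!"}
time.sleep(5)                                     # speak into the microphone for a few seconds
print(requests.get(f"{BASE}/transcript").json())  # {"transcript": "...final text so far..."}
print(requests.get(f"{BASE}/stop").json())        # {"message": "Transcription stopped!"}
```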
Frontend:
- HTML & CSS: Provides structure and styling for the user interface.
- jQuery: Handles AJAX requests for starting, stopping, and polling the transcription.
```html
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Speech to Text App</title>
    <script src="https://code.jquery.com/jquery-3.5.1.min.js"></script>
    <style>
        body {
            margin: 0;
            display: flex;
            justify-content: center;
            align-items: center;
            height: 100vh; /* Full viewport height */
            font-family: Arial, sans-serif;
            background-color: #f4f4f4; /* Light background for better readability */
        }
        #container {
            text-align: center;
            background: #ffffff;
            padding: 20px;
            box-shadow: 0 4px 6px rgba(0, 0, 0, 0.1);
            border-radius: 8px;
        }
        button {
            margin: 10px;
            padding: 10px 20px;
            font-size: 16px;
            border: none;
            border-radius: 5px;
            background-color: #007bff;
            color: white;
            cursor: pointer;
        }
        button:hover {
            background-color: #0056b3;
        }
        pre {
            padding: 10px;
            background-color: #e9ecef;
            border-radius: 5px;
            overflow: auto;
        }
    </style>
</head>
<body>
    <div id="container">
        <h1>Speech-to-Text Transcription</h1>
        <button id="start">Start Transcription</button>
        <button id="stop">Stop Transcription</button>
        <h2>Transcribed Text:</h2>
        <pre id="transcript"></pre>
    </div>

    <script>
        $(document).ready(function() {
            let pollInterval; // Variable to hold the interval ID

            // Start transcription
            $('#start').click(function() {
                $.get('/start', function(data) {
                    console.log(data.message);
                    // Start polling for transcripts if not already polling
                    if (!pollInterval) {
                        pollInterval = setInterval(function() {
                            $.ajax({
                                type: 'GET',
                                url: '/transcript',
                                dataType: 'json',
                                success: function(data) {
                                    console.log(data);
                                    if (data && data.transcript) {
                                        $('#transcript').text(data.transcript);
                                    } else {
                                        $('#transcript').text('No transcription available yet.');
                                    }
                                },
                                error: function(err) {
                                    console.error('Error fetching transcript:', err);
                                }
                            });
                        }, 1000);
                    }
                });
            });

            // Stop transcription
            $('#stop').click(function() {
                $.get('/stop', function(data) {
                    console.log(data.message);
                    // Stop polling for transcripts
                    if (pollInterval) {
                        clearInterval(pollInterval);
                        pollInterval = null; // Reset the interval variable
                    }
                });
            });
        });
    </script>
</body>
</html>
```
Audio Input:
- AssemblyAI's MicrophoneStream: Streams audio data for real-time processing.
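MicrophoneStream comes from the SDK's extras (installed with pip install "assemblyai[extras]", which pulls in PyAudio) and is consumed chunk by chunk, just like the loop in start_transcription above. Here is a minimal sketch, purely for inspection, of what those chunks look like:

```python
# Peek at the raw audio chunks the microphone produces (a sketch, not part of the app).
# Assumes the extras are installed: pip install "assemblyai[extras]"
import assemblyai as aai

microphone_stream = aai.extras.MicrophoneStream(sample_rate=16_000)

for i, chunk in enumerate(microphone_stream):
    print(f"chunk {i}: {len(chunk)} bytes of raw audio")
    if i >= 4:  # stop after a handful of chunks
        break
```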
I also brought in a few additional tools to enhance the project: the #FlaskWebFramework for rendering templates and returning JSON responses, and the #dotenv library for loading environment variables from the .env file. On the frontend, I used CSS to style the user interface.
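For reference, the .env file that load_dotenv() reads only needs the single variable the backend looks up with os.getenv('API_KEY'); the value below is a placeholder:

```
# .env — keep this file out of version control
API_KEY=<your AssemblyAI API key>
```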
Lastly, I want to thank my team, @devnenyasha and @lindiwe09, for the UI idea. If not for them, my UI would have been a mess.