This is a submission for the AssemblyAI Challenge: Sophisticated Speech-to-Text.
What I Built
A web app that analyses a song from YouTube, then displays the lyrics in karaoke style.
Demo
Live demo: https://assemblyai-challenge-202411.manychois.site/
(Unfortunately, YouTube has blocked my server, so the real-time transcribing does not work. You can still pick one of the pre-built examples to see how it works. Alternatively, run the app on your local machine.)
Source code: https://github.com/manychois/assemblyai-challenge-202411
The Idea
When I inspected the API documentation of AssemblyAI, one of its features caught my eye: word-level timestamps. In what situation would I need a timestamp for each spoken word? Subtitles. Displaying plain subtitle text would be a bit dull, so I twisted the idea into transcribing a song and displaying the lyrics in a karaoke way. Let's get our hands dirty!
Implementation Journey
The app needs to fulfil three things:
- Download a YouTube video.
- Utilise AssemblyAI to convert the audio track into a transcript.
- Roll the lyrics along with the YouTube video.
The transcription part
The first two points are quite easy to implement, thanks to the cookbook provided by AssemblyAI. I had not come across yt-dlp before; it is a great command-line tool for downloading YouTube videos and converting them into various formats.
A nice tip from the cookbook:
"m4a" is the format with the best audio version.
Since I am not an active Python developer, I picked the TypeScript library instead and wrote a simple function to invoke `yt-dlp`:
import { exec } from 'node:child_process';

// Downloads the audio track of a YouTube video as an .m4a file and resolves
// with the path of the temporary file.
function downloadYouTube(videoId: string, url: string): Promise<string> {
  return new Promise((resolve, reject) => {
    const tempFilePath = `/tmp/youtube-${videoId}.m4a`;
    const command = `yt-dlp -o "${tempFilePath}" -x --audio-format m4a --audio-quality 8 "${url}"`;
    exec(command, (error) => {
      if (error) {
        console.error(`exec error: ${error}`);
        reject(error);
        return; // do not fall through to resolve()
      }
      resolve(tempFilePath);
    });
  });
}
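For illustration, a hypothetical call (the video id and URL are placeholders, not from the app):

```ts
// Download the audio track, then hand the temp file to the transcription step.
const audioPath = await downloadYouTube(
  'dQw4w9WgXcQ', // placeholder video id
  'https://www.youtube.com/watch?v=dQw4w9WgXcQ'
);
```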
And calling the AssemblyAI API is extremely easy. I don't even need to worry about uploading the local file; the library does it all! Here is the function that wraps the service:
import { AssemblyAI, type TranscribeParams, type Transcript } from 'assemblyai';

// Uploads the local audio file to AssemblyAI and waits until the transcript,
// including the word-level timestamps, is ready.
async function transcribe(language: string, file: string): Promise<Transcript> {
  const client = new AssemblyAI({ apiKey: ASSEMBLYAI_API_KEY });
  const apiParams: TranscribeParams = { audio: file, language_code: language };
  const transcript = await client.transcripts.transcribe(apiParams);
  return transcript;
}
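The returned transcript carries the word-level timestamps (in milliseconds) that the whole idea hinges on. Here is a sketch of how they might be grouped into display lines; the pause-based heuristic and the 800 ms threshold are my own assumptions, not necessarily what the app does:

```ts
import type { Transcript } from 'assemblyai';

interface LyricWord {
  text: string;
  start: number; // ms
  end: number; // ms
}

// Split the word-level timestamps into display lines whenever there is a
// pause longer than gapMs between two consecutive words.
function groupIntoLines(transcript: Transcript, gapMs = 800): LyricWord[][] {
  const lines: LyricWord[][] = [];
  let current: LyricWord[] = [];
  for (const w of transcript.words ?? []) {
    const prev = current[current.length - 1];
    if (prev && w.start - prev.end > gapMs) {
      lines.push(current);
      current = [];
    }
    current.push({ text: w.text, start: w.start, end: w.end });
  }
  if (current.length > 0) lines.push(current);
  return lines;
}
```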
OK, now I need a web app framework to piece things together. Did you know Svelte 5 went live recently? This can be my first exercise to try out the latest SvelteKit.
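As a rough sketch of how the two server-side functions could be glued together in SvelteKit (the route path and module layout below are my assumptions, not the app's actual structure):

```ts
// src/routes/api/transcribe/+server.ts -- hypothetical route, for illustration only
import { json } from '@sveltejs/kit';
import type { RequestHandler } from './$types';
// hypothetical module holding the two functions shown above
import { downloadYouTube, transcribe } from '$lib/server/transcription';

export const POST: RequestHandler = async ({ request }) => {
  const { videoId, url, language } = await request.json();
  const audioPath = await downloadYouTube(videoId, url);
  const transcript = await transcribe(language, audioPath);
  // The client only needs the word-level timestamps.
  return json({ words: transcript.words });
};
```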
The UI part
After some research, I was glad to find that the official library lets you interact with the player and tells you how far the video has played. Below is simplified code showing how I link the current playback time to the reactive state `currentTime`:
let currentTime = $state(0); // in milliseconds
let player = new window.YT.Player('player', { ... });
player.playVideo();
// Poll the player every 100ms and mirror its position into the reactive state.
let wordHighlighter = setInterval(() => {
  currentTime = player.getCurrentTime() * 1000; // getCurrentTime() returns seconds
}, 100);
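The actual options behind `{ ... }` are not shown in the post; purely as a hedged illustration, one plausible shape using the IFrame API's `events` option would be to start polling only once the player is ready:

```ts
// Hedged variant of the snippet above; the videoId is a placeholder.
let wordHighlighter: ReturnType<typeof setInterval>;
let player = new window.YT.Player('player', {
  videoId: 'dQw4w9WgXcQ', // placeholder id, not from the app
  events: {
    // Only start polling once the player can report its position.
    onReady: () => {
      player.playVideo();
      wordHighlighter = setInterval(() => {
        currentTime = player.getCurrentTime() * 1000;
      }, 100);
    }
  }
});
```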
Then I let Svelte do its magic. When it is time to highlight a word, the style class `start` is applied:
{#each line as { text, start, end }}
  {@const duration = Math.round((end - start) / 100) * 100}
  <!-- the word gets the `start` class once playback reaches its timestamp -->
  <span class="word" class:start={start <= currentTime}
    data-text={text}
    data-duration={duration}>{text}</span>
{/each}
The karaoke-style trick looks like this (in SCSS syntax):
.word {
  display: inline-block;
  position: relative;
  font-size: 1.5rem;
  color: #777; // the dim base colour of a not-yet-sung word
  white-space: nowrap;
  margin-right: 0.5em;

  &.start {
    // the ::after overlay repeats the word in the highlight colour
    &::after {
      content: attr(data-text);
      position: absolute;
      left: 0;
      top: 0;
      color: #00f;
      overflow: hidden;
      animation: run-text 2s 1 linear;
      width: 100%;
    }

    // one rule per duration, rounded to the nearest 100ms
    @for $i from 1 through 20 {
      &[data-duration='#{$i * 100}']::after {
        animation: run-text #{$i * 100}ms 1 linear;
      }
    }
  }
}

// reveal the overlay from left to right
@keyframes run-text {
  from { width: 0; }
  to { width: 100%; }
}
As you can see, the data attribute `data-text` is used to create the overlaid highlighted text. I also tried `animation: run-text attr(data-duration ms) 1 linear;` to assign the duration dynamically, but browsers do not support `attr()` there. So I had to round off the duration and generate a bunch of corresponding rules.
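An alternative I did not end up using, sketched here under my own assumptions (not code from the app): browsers do support `var()` in `animation-duration`, so the duration could travel through an inline CSS custom property instead, removing the need for the generated `@for` rules:

```svelte
<!-- hypothetical variant: drive the duration with a CSS custom property -->
<span class="word" class:start={start <= currentTime}
  style="--duration: {duration}ms"
  data-text={text}>{text}</span>

<style lang="scss">
  .word.start::after {
    /* var() works in the animation shorthand where attr() does not */
    animation: run-text var(--duration, 2s) 1 linear;
  }
</style>
```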
Finally, the scroll-along effect:
// A word that is not fully visible within the top 70% of the lyrics container
// triggers a smooth scroll to bring it back to the middle.
let lyricsObserver = new IntersectionObserver(
  (entries) => {
    entries.forEach((entry) => {
      if (!entry.isIntersecting) {
        const word = entry.target as HTMLElement;
        word.scrollIntoView({ behavior: 'smooth', block: 'center' });
      }
      // one check per word is enough; stop observing it right away
      lyricsObserver.unobserve(entry.target);
    });
  },
  {
    root: document.querySelector('.lyrics'),
    threshold: 1,
    rootMargin: '0px 0px -30% 0px' // shrink the root to its top 70%
  }
);

let lastHighlighted: null | Element = null;
// Every 100ms, find the word the lyrics are highlighted up to and hand it to
// the observer if it has changed.
let scrollChecker = setInterval(() => {
  const highlighted = document.querySelectorAll('.word.start');
  if (highlighted.length > 0) {
    const last = highlighted[highlighted.length - 1];
    if (lastHighlighted !== last) {
      lastHighlighted = last;
      lyricsObserver.observe(last);
    }
  }
}, 100);
That is quite a lot of code, but the general idea is:
- Every 100ms, find the last `.word.start` element. That is where the lyrics are highlighted up to.
- Push that element to our `IntersectionObserver` for inspection.
- If it is not within the top 70% of the visible region, scroll it up to the middle.
- Pop the element out of the `IntersectionObserver` to lower the performance cost.
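A housekeeping detail the snippets above leave out (my own addition, assuming all of this lives inside a Svelte component): the two intervals and the observer should be released when the component unmounts.

```ts
import { onDestroy } from 'svelte';

// Stop the 100ms pollers and release any words still being observed.
onDestroy(() => {
  clearInterval(wordHighlighter);
  clearInterval(scrollChecker);
  lyricsObserver.disconnect();
});
```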
The Result
I am happy with the result. Backed by Svelte's reactivity, the lyrics stay in sync even if you pause the video or jump to any point in the song.