INTRODUCTION
YouTube has become an unparalleled resource for information, entertainment, and educational content. However, extracting the spoken words from videos programmatically can be a challenge.
IMPLEMENTATION
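In this article, we'll use Python to fetch the transcripts of YouTube videos with the YouTube Transcript API library (youtube_transcript_api) and wrap them as documents using LlamaIndex (llama_index), a framework for connecting data to large language models. Note that the library retrieves transcripts that already exist for a video; it does not transcribe the audio itself.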
Step 01: Installing libraries and modules
Before diving into the code, make sure you have the required packages installed. You can do this by running the following command on the command line terminal.
pip install youtube_transcript_api llama_index
Step 02: Importing libraries and modules
Now, let's import the modules and components needed for the implementation. The code block below includes the necessary imports.
import re
from typing import Any, List, Optional
from llama_index.readers.base import BaseReader
from llama_index.readers.schema.base import Document
from importlib.util import find_spec
Step 03: Defining the expected YouTube video URLs
The YOUTUBE_URL_PATTERNS list contains regular expressions to match various YouTube URL formats. These patterns are crucial for extracting the video ID.
YOUTUBE_URL_PATTERNS = [
    r"^https?://(?:www\.)?youtube\.com/watch\?v=([\w-]+)",
    r"^https?://(?:www\.)?youtube\.com/embed/([\w-]+)",
    r"^https?://youtu\.be/([\w-]+)",  # youtu.be does not use www
]
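To see what these patterns capture, here is a small sketch that runs them against one sample URL per format. The ID `abc_123-XYZ` is a made-up placeholder, not a real video; it only illustrates that the parenthesised group matches letters, digits, underscores, and hyphens.

```python
import re

YOUTUBE_URL_PATTERNS = [
    r"^https?://(?:www\.)?youtube\.com/watch\?v=([\w-]+)",
    r"^https?://(?:www\.)?youtube\.com/embed/([\w-]+)",
    r"^https?://youtu\.be/([\w-]+)",
]

# "abc_123-XYZ" is a placeholder ID used only for illustration.
sample_urls = [
    "https://www.youtube.com/watch?v=abc_123-XYZ",
    "http://youtube.com/embed/abc_123-XYZ",
    "https://youtu.be/abc_123-XYZ",
]

captured = []
for url in sample_urls:
    for pattern in YOUTUBE_URL_PATTERNS:
        match = re.search(pattern, url)
        if match:
            # group(1) is the parenthesised part of the pattern: the video ID
            captured.append(match.group(1))
            break

print(captured)
```

Each URL is matched by exactly one pattern, and in every case the captured group is the video ID.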
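Step 04: Verifying a YouTube video URL
The is_youtube_video function determines whether a given URL is a supported YouTube video link by matching it against the defined patterns. This is handy for picking the YouTube links out of a mixed list of URLs. Keep in mind that it only checks the format of the URL, not whether the video actually exists.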
def is_youtube_video(url: str) -> bool:
    """
    Returns whether the passed in `url` matches the various YouTube URL formats
    """
    for pattern in YOUTUBE_URL_PATTERNS:
        if re.search(pattern, url):
            return True
    return False
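As a quick illustration, the sketch below filters a mixed list of links down to the ones in a supported YouTube format. The URLs are invented for the example, and the function body uses `any(...)` as a compact equivalent of the loop shown in this step.

```python
import re

YOUTUBE_URL_PATTERNS = [
    r"^https?://(?:www\.)?youtube\.com/watch\?v=([\w-]+)",
    r"^https?://(?:www\.)?youtube\.com/embed/([\w-]+)",
    r"^https?://youtu\.be/([\w-]+)",
]


def is_youtube_video(url: str) -> bool:
    """Return True if `url` matches one of the supported YouTube formats."""
    return any(re.search(pattern, url) for pattern in YOUTUBE_URL_PATTERNS)


# Invented sample links: two supported formats and two that should be rejected.
links = [
    "https://www.youtube.com/watch?v=abc_123-XYZ",
    "https://example.com/watch?v=abc_123-XYZ",  # wrong domain
    "https://youtu.be/abc_123-XYZ",
    "https://www.youtube.com/",  # home page, no video ID
]

video_links = [link for link in links if is_youtube_video(link)]
print(video_links)
```

Only the first and third links survive the filter, because the other two do not match any pattern.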
Step 05: Initializing the transcriber
The constructor of the YoutubeTranscriptReader class uses find_spec to check that the youtube_transcript_api package is installed and raises an ImportError if it is not found.
class YoutubeTranscriptReader(BaseReader):
    """Youtube Transcript reader."""

    def __init__(self) -> None:
        if find_spec("youtube_transcript_api") is None:
            raise ImportError(
                "Missing package: youtube_transcript_api.\n"
                "Please `pip install youtube_transcript_api` to use this Reader"
            )
        super().__init__()
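The check above relies on `importlib.util.find_spec`, which looks a package up without importing it. Here is a minimal sketch of that idea on its own; `is_installed` is a hypothetical helper name, and the second package name is deliberately nonsensical so that the lookup fails.

```python
from importlib.util import find_spec


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package can be located (hypothetical helper)."""
    # find_spec returns None when no module with this name can be found.
    return find_spec(package_name) is not None


print(is_installed("json"))  # standard-library module, always present
print(is_installed("surely_not_a_real_package_name_123"))
```

Doing the check in the constructor means a missing dependency is reported with a clear install hint as soon as the reader is created, rather than later in the middle of loading data.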
Step 06: Loading the video transcripts
The load_data method takes a list of YouTube links (ytlinks) and optional language parameters. It uses the YouTubeTranscriptApi to fetch and compile transcripts for each video.
    def load_data(
        self,
        ytlinks: List[str],
        languages: Optional[List[str]] = None,
        **load_kwargs: Any,
    ) -> List[Document]:
        """Load transcripts for the given YouTube links.

        Args:
            ytlinks (List[str]): List of YouTube links
                for which transcripts are to be read.
            languages (Optional[List[str]]): Preferred transcript
                languages, in priority order. Defaults to ["en"].
        """
        from youtube_transcript_api import YouTubeTranscriptApi

        # Avoid a mutable default argument; fall back to English.
        if languages is None:
            languages = ["en"]

        results = []
        for link in ytlinks:
            video_id = self._extract_video_id(link)
            if not video_id:
                raise ValueError(
                    f"Supplied url {link} is not a supported youtube URL.\n"
                    "Supported formats include:\n"
                    "  youtube.com/watch?v={video_id} "
                    "(with or without 'www.')\n"
                    "  youtube.com/embed/{video_id} "
                    "(with or without 'www.')\n"
                    "  youtu.be/{video_id} (never includes www subdomain)"
                )
            # Each chunk is a dict; only its "text" field is kept.
            transcript_chunks = YouTubeTranscriptApi.get_transcript(
                video_id, languages=languages
            )
            chunk_text = [chunk["text"] for chunk in transcript_chunks]
            transcript = "\n".join(chunk_text)
            results.append(
                Document(text=transcript, extra_info={"video_id": video_id})
            )
        return results
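The joining step can be tried without any network call. The sketch below uses hand-written chunks and assumes each chunk is a dictionary with `text`, `start`, and `duration` keys, which is the shape returned by `get_transcript` in the library versions that provide that function. The sample sentences and the `join_chunks` helper are invented for illustration.

```python
from typing import Dict, List, Union

Chunk = Dict[str, Union[str, float]]

# Invented sample data in the assumed chunk shape.
transcript_chunks: List[Chunk] = [
    {"text": "Hello and welcome.", "start": 0.0, "duration": 2.5},
    {"text": "Today we look at transcripts.", "start": 2.5, "duration": 3.0},
]


def join_chunks(chunks: List[Chunk]) -> str:
    """Concatenate the text of each chunk, one chunk per line."""
    return "\n".join(str(chunk["text"]) for chunk in chunks)


transcript = join_chunks(transcript_chunks)
print(transcript)
```

The timing fields are dropped here, just as in load_data. If you need timestamps later (for example, to link back to a moment in the video), you would have to keep them alongside the text.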
Step 07: Extracting the video ID from the URL
The _extract_video_id method extracts the video ID from a given YouTube link using the predefined URL patterns.
    @staticmethod
    def _extract_video_id(yt_link) -> Optional[str]:
        for pattern in YOUTUBE_URL_PATTERNS:
            match = re.search(pattern, yt_link)
            if match:
                return match.group(1)
        # return None if no match is found
        return None
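One limitation worth knowing: the watch pattern from Step 03 expects `v=` to come directly after `watch?`, so a link such as `watch?feature=share&v=...` yields None. If you run into such links, one possible alternative (not part of the reader above) is to parse the query string with the standard library. The helper name `extract_video_id_from_query` and the placeholder ID are assumptions made for this sketch.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse


def extract_video_id_from_query(url: str) -> Optional[str]:
    """Hypothetical helper: read the `v` parameter wherever it appears."""
    parsed = urlparse(url)
    # Only handle youtube.com/watch links; other formats are out of scope here.
    if parsed.netloc not in ("youtube.com", "www.youtube.com"):
        return None
    if parsed.path != "/watch":
        return None
    values = parse_qs(parsed.query).get("v")
    return values[0] if values else None


# "abc_123-XYZ" is a placeholder ID used only for illustration.
print(extract_video_id_from_query(
    "https://www.youtube.com/watch?feature=share&v=abc_123-XYZ"
))
print(extract_video_id_from_query("https://example.com/watch?v=abc_123-XYZ"))
```

This sketch covers only the watch format; the embed and youtu.be formats carry the ID in the path, so the regular expressions from Step 03 remain the simpler choice for them.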
CONCLUSION
By following these steps, you can implement a YouTube transcript reader in Python that turns a list of video links into LlamaIndex Document objects. This opens the door to a wide range of applications, from content analysis to language processing. Experiment with different videos and languages to see how far this simple tool can take you.
Happy coding!
Do you have a project that you want me to assist with? Email me: wilbertmisingo@gmail.com