Artur Chukhrai for SerpApi

Posted on Aug 26, 2022 • Edited on Feb 6, 2023 • Originally published at serpapi.com

Scrape YouTube autocomplete results with Python

#python #tutorial #programming #webscraping

What will be scraped
Full Code
Preparation
Code Explanation
Output

What will be scraped

Full Code

If you don't need an explanation, have a look at the full code example in the online IDE.

import re, json, time
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from parsel import Selector


def scrape_youtube_autocomplete():
    service = Service(ChromeDriverManager().install())

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    options.add_argument('--lang=en')
    options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36")

    queries = ["lord of the rings ", "fus ro dah ", "harry potter "]

    youtube_autocomplete = []

    for query in queries:
        driver = webdriver.Chrome(service=service, options=options)
        driver.get("https://www.youtube.com/")

        WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.TAG_NAME, 'body')))

        search_input = driver.find_element(By.XPATH, '//input[@id="search"]')
        search_input.click()

        search_input.send_keys(query)

        time.sleep(1)

        selector = Selector(driver.page_source)

        # https://regex101.com/r/zZb3X0/1
        autocomplete_results = [
            re.search(r'">(.*)</b>', result).group(1).replace("<b>", "")
            for result in selector.css('.sbqs_c').getall()
        ]

        youtube_autocomplete.append({
            "query": query.strip(),
            "autocomplete_results": autocomplete_results
        })

        driver.quit()

    print(json.dumps(youtube_autocomplete, indent=2, ensure_ascii=False))


scrape_youtube_autocomplete()

Preparation

Install libraries:

pip install parsel selenium webdriver-manager

Basic knowledge of scraping with CSS selectors

CSS selectors declare which part of the markup a style applies to, which makes them a convenient way to extract data from matching tags and attributes.

If you haven't scraped with CSS selectors before, I have a dedicated blog post on how to use CSS selectors when web scraping. It covers what they are, their pros and cons, and why they matter from a web-scraping perspective.

Reduce the chance of being blocked

Make sure a realistic user-agent is sent in the request headers so that the visit looks like it comes from a "real" user. For example, the default user-agent of the requests library is python-requests, which tells a website that the request most likely comes from a script. Check what's your user-agent.

There's a blog post on how to reduce the chance of being blocked while web scraping that can get you familiar with basic and more advanced approaches.

Scrape without code

You can also get the data without writing any code by opening the following URL, which returns the suggestions as a .txt file:

https://clients1.google.com/complete/search?client=youtube&hl=en&q=minecraft
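
If you would rather read that endpoint from Python, the following is a minimal sketch using only the standard library. It assumes the response is a JSONP-style payload, that is, a JavaScript callback wrapping a JSON array whose second element lists the suggestions. That shape is an observation about an undocumented endpoint and may change. The helper names (build_suggest_url, parse_suggestions, fetch_suggestions) are illustrative and not part of the original code.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

SUGGEST_ENDPOINT = "https://clients1.google.com/complete/search"


def build_suggest_url(query: str, lang: str = "en") -> str:
    """Build the suggestion URL shown above for an arbitrary query."""
    params = {"client": "youtube", "hl": lang, "q": query}
    return f"{SUGGEST_ENDPOINT}?{urlencode(params)}"


def parse_suggestions(payload: str) -> list:
    """Extract suggestion strings from the response text.

    Assumption: the payload looks like
    callback(["query", [["suggestion", ...], ...], {...}])
    """
    start = payload.index("(") + 1   # first char after the callback's "("
    end = payload.rindex(")")        # the callback's closing ")"
    data = json.loads(payload[start:end])
    return [item[0] for item in data[1]]


def fetch_suggestions(query: str) -> list:
    """Download and parse suggestions (performs a network request)."""
    with urlopen(build_suggest_url(query), timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        text = response.read().decode(charset, errors="replace")
    return parse_suggestions(text)
```

Calling fetch_suggestions("minecraft") would request the URL above. If the payload format differs from the assumption, parse_suggestions raises an error instead of returning wrong data.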

Code Explanation

Import libraries:

import re, json, time
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from parsel import Selector

Library	Purpose
`re`	to extract parts of the data via regular expression.
`json`	to convert extracted data to a JSON string.
`time`	to pause execution with `sleep()`.
`webdriver`	to drive a browser natively, as a user would, either locally or on a remote machine using the Selenium server.
`Service`	to manage the starting and stopping of the ChromeDriver.
`ChromeDriverManager`	to download a ChromeDriver binary that matches the installed Chrome.
`By`	a set of supported locator strategies (By.ID, By.TAG_NAME, By.XPATH etc).
`WebDriverWait`	to wait only as long as required.
`expected_conditions`	contains a set of predefined conditions to use with WebDriverWait.
`Selector`	XML/HTML parser that has full XPath and CSS selectors support.

The algorithm for getting autocomplete results is as follows:

1. Go to the main page of YouTube.
2. Click on the search field.
3. Enter a query there.
4. Scrape suggested autocomplete results.
5. Repeat from step 1 until all queries have been processed.
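
The steps above can be sketched as a small loop. In this hedged outline, fetch_suggestions is a hypothetical callable standing in for steps 1 to 4 (the Selenium work explained below), so that the looping logic can be read on its own:

```python
from typing import Callable, Dict, List


def collect_autocomplete(
    queries: List[str],
    fetch_suggestions: Callable[[str], List[str]],
) -> List[Dict[str, object]]:
    """Run every query through `fetch_suggestions` and collect the results.

    `fetch_suggestions` is a placeholder (an assumption of this sketch):
    it should open YouTube, type the query and return the suggestions.
    """
    results = []
    for query in queries:  # step 5: repeat for each query
        results.append({
            "query": query.strip(),
            "autocomplete_results": fetch_suggestions(query),
        })
    return results
```

Passing the browser work in as a function keeps the loop easy to try out with a fake fetcher before a real browser is involved.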

To simulate user actions in the browser, let's use the selenium library. It handles steps 1, 2 and 3. Selenium needs ChromeDriver to control Chrome, and the driver can be downloaded manually or from code. Here the second approach is used: ChromeDriverManager().install() downloads the driver binary and returns its path, and Service uses that path to start and stop ChromeDriver:

service = Service(ChromeDriverManager().install())

You should also add a few options so that the browser runs as expected:

options = webdriver.ChromeOptions()
options.add_argument("--headless")
options.add_argument('--lang=en')
options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36")

Chrome options	Explanation
`--headless`	to run Chrome in headless mode.
`--lang=en`	to set the browser language to English.
`user-agent`	to make requests look like they come from a "real" browser by setting the user-agent request header. Check what's your `user-agent`.

Create the queries list with a few queries and iterate over it in a for loop. Also, create the youtube_autocomplete list that will store the extracted data.

queries = ["lord of the rings ", "fus ro dah ", "harry potter "]

youtube_autocomplete = []

for query in queries:
    # the following code will be here

Now we can start webdriver and pass the url to the get() method.

driver = webdriver.Chrome(service=service, options=options)
driver.get("https://www.youtube.com/")

It is hard to predict how long a page will take to load: it depends on the connection speed, the power of the computer and other factors. An explicit wait is better than a fixed delay in seconds because it ends as soon as the condition is met, which here means once the body element becomes visible:

WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.TAG_NAME, 'body')))

📌Note: The second argument of WebDriverWait is a timeout in seconds. Here the page gets up to 10 seconds to load; if it loads earlier, the wait ends right away.
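
To illustrate what an explicit wait does, here is a simplified sketch of the idea using only the standard library: call a condition repeatedly until it returns something truthy or a timeout expires. It is not Selenium's implementation. wait_until and its parameters are hypothetical names, and the 0.5 second polling interval only mirrors WebDriverWait's documented default.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def wait_until(
    condition: Callable[[], Optional[T]],
    timeout: float = 10.0,
    poll: float = 0.5,
) -> T:
    """Poll `condition` until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        value = condition()
        if value:
            return value  # finish as soon as the condition holds
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)  # short pause before checking again
```

The point of the pattern is that a fast page costs almost no waiting time, while a slow page still gets the full timeout before the script gives up.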

When the page has loaded, we need to find the search field. Selenium can locate an element by XPath.

To enter a search query, call the click() method on the search_input element to focus the field, then type the text with the send_keys() method. After that, wait briefly for the autocomplete results to load using the sleep() method.

search_input = driver.find_element(By.XPATH, '//input[@id="search"]')
search_input.click()

search_input.send_keys(query)

time.sleep(1)

To extract the suggested queries, we can use the Parsel library: pass it the HTML of the page, which by now contains the autocomplete results.

Parsing with parsel is much faster than querying elements through Selenium, because there is no network component and no real-time interaction with the page and its elements anymore. Only HTML parsing is involved.

selector = Selector(driver.page_source)

# https://regex101.com/r/zZb3X0/1
autocomplete_results = [
    re.search(r'">(.*)</b>', result).group(1).replace("<b>", "")
    for result in selector.css('.sbqs_c').getall()
]

youtube_autocomplete.append({
    "query": query.strip(),
    "autocomplete_results": autocomplete_results
})

driver.quit()

Code	Explanation
`autocomplete_results`	a temporary `list` that holds the suggestions extracted for the current query.
`css()`	to access elements by the passed selector.
`getall()`	to return every matching element as a string (here, the HTML of each suggestion).
`search()`	to search for a pattern in a string and return the corresponding match object.
`group()`	to extract the found element from the match object.
`replace()`	to remove the leftover `<b>` tags by replacing them with an empty string.
`youtube_autocomplete.append({})`	to `append` extracted data to a `list` as a dictionary.
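
The regular expression above relies on every suggestion containing a closing </b> tag. If one does not, re.search() returns None and .group(1) raises an AttributeError. As a hedged alternative, the sketch below removes all tags from a fragment instead. The helper name suggestion_text and the sample markup in the comment are assumptions for illustration, not YouTube's guaranteed HTML.

```python
import html
import re

# Matches any HTML tag, e.g. <div class="sbqs_c"> or </b>
TAG_RE = re.compile(r"<[^>]+>")


def suggestion_text(fragment: str) -> str:
    """Return the visible text of one suggestion fragment.

    Example input (assumed markup):
    <div class="sbqs_c">harry potter <b>music</b></div>
    """
    without_tags = TAG_RE.sub("", fragment)     # drop every tag
    return html.unescape(without_tags).strip()  # decode &amp; and similar entities
```

In the list comprehension, suggestion_text(result) could take the place of the re.search(...).group(1).replace(...) chain, and it also copes with suggestions that contain no bold part.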

In the gif below, I demonstrate how this function works:

Output

[
  {
    "query": "lord of the rings",
    "autocomplete_results": [
      "lord of the rings amazon trailer",
      "lord of the rings soundtrack",
      "lord of the rings trailer",
      "lord of the rings amazon",
      "lord of the rings rings of power",
      "lord of the rings reaction",
      "lord of the rings music",
      "lord of the rings audiobook",
      "lord of the rings ambience",
      "lord of the rings theme",
      "lord of the rings ost",
      "lord of the rings full movie",
      "lord of the rings online",
      "lord of the rings rings of power trailer"
    ]
  },
  {
    "query": "fus ro dah",
    "autocomplete_results": [
      "fus ro dah sound effect",
      "fus ro dah song",
      "fus ro dah skyrim sound effect",
      "fus ro dah sound",
      "fus ro dah reaction",
      "fus ro dah meme",
      "fus ro dah all races",
      "fus ro dah lyrics",
      "fus ro dah misheard lyrics",
      "fus ro dah shout",
      "fus ro dah remix",
      "fus ro dah anime",
      "fus ro dah earrape",
      "fus ro dah trailer"
    ]
  },
  {
    "query": "harry potter",
    "autocomplete_results": [
      "harry potter music",
      "harry potter intro",
      "harry potter game",
      "harry potter audiobook",
      "harry potter react to harry as",
      "harry potter shittyflute",
      "harry potter leviosa",
      "harry potter full movie",
      "harry potter theme",
      "harry potter soundtrack",
      "harry potter and the cursed child",
      "harry potter kalimba",
      "harry potter piano",
      "harry potter ambience"
    ]
  }
]