Contents: intro, imports, what will be scraped, process, code, links, outro.
Intro
This blog post is a continuation of the DuckDuckGo web scraping series. Here you'll see how to scrape Inline Image results using Python with selenium
library. An alternative API solution will be shown.
Prerequisites: familiarity with selenium
library and regular expressions.
Imports
from selenium import webdriver
import re, urllib.parse
What will be scraped
Process
The process is very much like from other DuckDuckGo blog posts series.
Selecting container, title, link, thumbnail, image URL CSS selectors from which the .get_attribute()
method will be used to grab data-id
, src
, and href
attributes.
SelectorGadget Chrome extension was used in the GIF above to select CSS
selectors.
Code
from selenium import webdriver
import re, urllib.parse
driver = webdriver.Chrome(executable_path='path/to/chromedriver.exe')
driver.get('https://duckduckgo.com/?q=elon musk dogecoin&kl=us-en&ia=web')
for result in driver.find_elements_by_css_selector('.js-images-link'):
title = result.find_element_by_css_selector('.js-images-link a img').get_attribute('alt')
link = result.find_element_by_css_selector('.js-images-link a').get_attribute('href')
thumbnail_encoded = result.find_element_by_css_selector('.js-images-link a img').get_attribute('src')
# https://regex101.com/r/4pgG5m/1
match_thumbnail_urls = ''.join(re.findall(r'https\:\/\/external\-content\.duckduckgo\.com\/iu\/\?u\=(.*)&f=1', thumbnail_encoded))
# https://www.kite.com/python/answers/how-to-decode-a-utf-8-url-in-python
thumbnail = urllib.parse.unquote(match_thumbnail_urls).replace('&h=160', '')
image = result.get_attribute('data-id')
print(f'{title}\n{link}\n{thumbnail}\n{image}\n')
driver.quit()
--------------------------
'''
Dogecoin (DOGE) Price Crash Below Key Support and Even ...
https://duckduckgo.com/?q=elon%20musk%20dogecoin&iax=images&ia=images&iai=https://cdn.coingape.com/wp-content/uploads/2021/07/02195033/dogecoin-elon-musk-snl-memes.jpg&kl=us-en
https://tse1.mm.bing.net/th?id=OIF.UGa1KGFCz%2f5axclMfq0k4w&pid=Api
https://cdn.coingape.com/wp-content/uploads/2021/07/02195033/dogecoin-elon-musk-snl-memes.jpg
...
'''
Using DuckDuckGo Inline Images API
SerpApi is a paid API with a free plan.
The difference that you'll see immediately is that API provides 30 results, rather than ~8-10 results.
Alternatively, all you have to do is to iterate over structured JSON
string without thinking how to scrape data without rendering the page, or how to grab certain elements if they are the ones that hard to get.
import json
from serpapi import GoogleSearch
params = {
"api_key": "YOUR_API_KEY",
"engine": "duckduckgo",
"q": "elon musk dogecoin",
"kl": "us-en"
}
search = GoogleSearch(params)
results = search.get_dict()
print(json.dumps(results['inline_images'], indent=2, ensure_ascii=False))
----------------------
'''
[
{
"position": 1,
"title": "'Dogefather' Elon Musk Tweets in Support of the ...",
"link": "https://gadgets.ndtv.com/cryptocurrency/news/elon-musk-dogecoin-price-cryptocurrency-bitcoin-ethereum-ether-twitter-tweet-support-market-gain-2483505",
"thumbnail": "https://tse1.mm.bing.net/th?id=OIF.ryyLYCT1jVMZDADJDf1LVA&pid=Api",
"image": "https://i.gadgets360cdn.com/large/elon_musk_reuters_1610084738222.jpg"
}
...
{
"position": 20,
"title": "Beware! Your love for Elon Musk and Dogecoin may land you ...",
"link": "http://www.businesstelegraph.co.uk/beware-your-love-for-elon-musk-and-dogecoin-may-land-you-in-a-scam-economic-times/",
"thumbnail": "https://tse1.mm.bing.net/th?id=OIF.Y4geZY10AJX80AvM8EPCjQ&pid=Api",
"image": "http://www.businesstelegraph.co.uk/wp-content/uploads/2021/07/Beware-Your-love-for-Elon-Musk-and-Dogecoin-may-land.jpg"
}
]
'''
Links
GithHub Gist • DuckDuckGo Inline Images API
Outro
If you have any questions or something isn't working correctly or you want to write something else, feel free to drop a comment in the comment section or via Twitter at @serp_api.
Yours,
Dimitry, and the rest of SerpApi Team.
Top comments (0)