In this post of ScrapingTheFamous, I am going to write a scraper that will scrape data from Amazon. I do not need to tell you what Amazon is; you are here because you already know about it 🙂
So, we are going to write two different scripts: fetch.py, which will fetch the URLs of individual listings and save them in a text file, and parse.py, which will have a function that takes an individual listing URL, scrapes the data, and saves it in JSON format.
I will be using the Scraper API service for fetching pages, which frees me from all worries about IP blocking and rendering dynamic sites, since it takes care of everything.
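Before diving into the full scripts, here is the basic pattern: a request through Scraper API is just an ordinary GET request to their endpoint with your API key and the target URL passed as query parameters. A minimal sketch, where 'YOUR_API_KEY' is a placeholder for the key from your Scraper API dashboard:

```python
import requests

# Minimal sketch of routing a request through Scraper API.
# 'YOUR_API_KEY' is a placeholder; use your own key.
payload = {
    'api_key': 'YOUR_API_KEY',
    'url': 'https://www.amazon.com/',
    'render': 'false',  # 'true' would ask Scraper API to render JavaScript
}

r = requests.get('http://api.scraperapi.com', params=payload, timeout=60)
print(r.status_code)
```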
The first script fetches the listings of a category. So let's do it!
import requests
from bs4 import BeautifulSoup

if __name__ == '__main__':
    headers = {
        'authority': 'www.amazon.com',
        'pragma': 'no-cache',
        'cache-control': 'no-cache',
        'rtt': '600',
        'downlink': '1.5',
        'ect': '3g',
        'upgrade-insecure-requests': '1',
        'dnt': '1',
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.183 Safari/537.36',
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'sec-fetch-site': 'same-origin',
        'sec-fetch-mode': 'navigate',
        'sec-fetch-user': '?1',
        'sec-fetch-dest': 'document',
        'referer': 'https://google.com',
        'accept-language': 'en-US,en;q=0.9,ur;q=0.8,zh-CN;q=0.7,zh;q=0.6',
    }

    API_KEY = None
    links_file = 'links.txt'
    links = []

    # Read the Scraper API key from a local file
    with open('API_KEY.txt', encoding='utf8') as f:
        API_KEY = f.read().strip()

    # Amazon electronics category listing page
    URL_TO_SCRAPE = 'https://www.amazon.com/s?i=electronics&rh=n%3A172541%2Cp_n_feature_four_browse-bin%3A12097501011&lo=image'

    payload = {'api_key': API_KEY, 'url': URL_TO_SCRAPE, 'render': 'false'}
    r = requests.get('http://api.scraperapi.com', params=payload, timeout=60)

    if r.status_code == 200:
        text = r.text.strip()
        soup = BeautifulSoup(text, 'lxml')
        # Only the product title anchors, which sit directly inside an <h2>
        links_section = soup.select('h2 > .a-link-normal')

        for link in links_section:
            url = 'https://amazon.com' + link['href']
            links.append(url)

    if len(links) > 0:
        with open(links_file, 'a+', encoding='utf8') as f:
            f.write('\n'.join(links))
            print('Links stored successfully.')
So here is the script. I picked the electronics category; you may choose any one you want. I arranged the relevant headers. You can either pick them manually via the Chrome Inspector or use https://curl.trillworks.com/ to generate them for you from a copied cURL request.
I am using the Scraper API endpoint by passing both the target URL and the api_key in the payload. I am using the h2 > .a-link-normal selector because there are many .a-link-normal links on the page that are not required; restricting the match to anchors directly inside an h2 makes sure only the required product links are picked.
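To see why the h2 > restriction matters, here is a small, self-contained example with dummy markup (not real Amazon HTML, just an illustration) showing how the narrower selector skips the extra .a-link-normal anchors:

```python
from bs4 import BeautifulSoup

# Dummy markup for illustration only; real Amazon listing pages are far more complex.
html = '''
<div>
  <a class="a-link-normal" href="/gp/help">Help link we do not want</a>
  <h2><a class="a-link-normal" href="/Some-Product/dp/B000000001/">Some Product</a></h2>
  <h2><a class="a-link-normal" href="/Other-Product/dp/B000000002/">Other Product</a></h2>
</div>
'''

soup = BeautifulSoup(html, 'lxml')
print(len(soup.select('.a-link-normal')))       # 3 -> includes the unwanted link
print(len(soup.select('h2 > .a-link-normal')))  # 2 -> only the product title links
```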
Once we have the links, we save them in a text file.
The next part of the post is about parsing the product info:
import requests
from bs4 import BeautifulSoup


def parse(url):
    record = {}
    headers = {
        'authority': 'www.amazon.com',
        'pragma': 'no-cache',
        'cache-control': 'no-cache',
        'rtt': '1000',
        'downlink': '1.5',
        'ect': '3g',
        'dnt': '1',
        'upgrade-insecure-requests': '1',
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.183 Safari/537.36',
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'sec-fetch-site': 'none',
        'sec-fetch-mode': 'navigate',
        'sec-fetch-user': '?1',
        'sec-fetch-dest': 'document',
        'accept-language': 'en-US,en;q=0.9,ur;q=0.8,zh-CN;q=0.7,zh;q=0.6',
    }

    # Default values in case a section is missing on the page
    title = price = availability = features = asin = None

    payload = {'api_key': API_KEY, 'url': url, 'render': 'false'}
    r = requests.get('http://api.scraperapi.com', params=payload, timeout=60)

    if r.status_code == 200:
        data = r.text.strip()
        soup = BeautifulSoup(data, 'lxml')

        title_section = soup.select('#productTitle')
        price_section = soup.select('#priceblock_ourprice')
        availability_section = soup.select('#availability')
        features_section = soup.select('#feature-bullets')
        # The canonical link ends with the product's ASIN
        asin_section = soup.find('link', {'rel': 'canonical'})

        if title_section:
            title = title_section[0].text.strip()

        if price_section:
            price = price_section[0].text.strip()

        if availability_section:
            availability = availability_section[0].text.strip()

        if features_section:
            features = features_section[0].text.strip()

        if asin_section:
            asin_url = asin_section['href']
            asin_url_parts = asin_url.split('/')
            asin = asin_url_parts[-1]

        record = {'title': title, 'price': price, 'availability': availability, 'asin': asin, 'features': features}

    return record


if __name__ == '__main__':
    API_KEY = None

    with open('API_KEY.txt', encoding='utf8') as f:
        API_KEY = f.read().strip()

    result = parse('https://www.amazon.com/Bambino-Seconds-Stainless-Japanese-Automatic-Leather/dp/B07B49QG1H/')
    print(result)
Pretty straightforward. I fetched the title, scraped the Amazon ASIN, and a few other fields. You can scrape many other things, like Amazon reviews, as well. The parsed data is returned as a dict, which can then be stored in JSON format.
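The script above only prints the dict for a single product. A sketch of how you might tie the two scripts together, feeding the links.txt file produced by fetch.py into parse() and dumping every record to a JSON file (the output file name products.json is just an illustrative choice):

```python
import json

# Sketch: feed the URLs collected by fetch.py into parse() and store the results as JSON.
# Assumes parse() from parse.py is importable or defined in the same module.
records = []

with open('links.txt', encoding='utf8') as f:
    for line in f:
        url = line.strip()
        if url:
            records.append(parse(url))

with open('products.json', 'w', encoding='utf8') as f:
    json.dump(records, f, indent=2)
```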
Conclusion
In this post, you learned how you can easily scrape Amazon data by using Scraper API in Python. You can enhance this script as per your needs, for example by turning it into a price-monitoring script or an ASIN scraper.
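A price monitor, for instance, could boil down to calling parse() on a schedule and comparing the price field with the previous run. A rough sketch, assuming parse() from parse.py is available; the product URL, the check interval, and the alerting logic are all illustrative choices:

```python
import time

# Rough sketch of a price monitor built on top of parse().
PRODUCT_URL = 'https://www.amazon.com/Bambino-Seconds-Stainless-Japanese-Automatic-Leather/dp/B07B49QG1H/'
CHECK_EVERY_SECONDS = 6 * 60 * 60  # check every six hours (arbitrary interval)

last_price = None
while True:
    record = parse(PRODUCT_URL)
    price = record.get('price')
    if price and price != last_price:
        print(f'Price changed: {last_price} -> {price}')
        last_price = price
    time.sleep(CHECK_EVERY_SECONDS)
```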
Writing scrapers is an interesting journey, but you can hit a wall if the site blocks your IP. As an individual, you can't afford expensive proxies either. Scraper API provides you an affordable and easy-to-use API that will let you scrape websites without any hassle. You do not need to worry about getting blocked, because Scraper API uses proxies by default to access websites. On top of that, you do not need to worry about Selenium either, since Scraper API provides the facility of a headless browser too. I have also written a post about how to use it.
Click here to sign up with my referral link or enter the promo code adnan10 to get a 10% discount. In case you do not get the discount, just let me know via email on my site and I will surely help you out.
Originally published at http://blog.adnansiddiqi.me on November 16, 2020.