Web scraping is useful in many areas of IT. One good use case is a drop-shipping website whose database is populated with items from another shop. This works even better with digital items, since you don't have to worry about delivery. You just add a margin and give users a reason to use your site (create quality content, add a forum or chat, scrape some news about your products and show it in a news feed...).
I will create a Ruby script that scrapes data from Delije Butik as a sample shop. You can run it as a rake task on Heroku, scraping items every hour and updating the database when something changes.
First, install all the gems we will use:
gem install nokogiri rest-client sequel colorize
Now create a new file called scrap.rb. As always, start with a shebang and require all the gems:
#!/usr/bin/env ruby
require 'nokogiri'
require 'rest-client'
require 'sequel'
require 'colorize'
Now we can start. I will use Sequel to connect to the database and store the scraped data, rest-client to make HTTP requests to the website, and Nokogiri to actually parse and extract the data. Finally, the colorize gem makes the terminal output look nicer.
DB = Sequel.sqlite('db/development.sqlite3')
URL = 'https://delijebutik.com/shop'
def open_page(url)
  html = RestClient.get(url)
  @page = Nokogiri::HTML(html)
  @data = @page.search('div.thunk-product')
end
Now open the web page you want to scrape and find all the elements you need. Open the inspector (right click → Inspect Element) and locate the main container that holds all the others. We don't need the header or the whole body, just the item name, price, photo, description and so on. In this case the container is <div class="thunk-product">:
Now create a new method, scrap!, that actually extracts all of that data and populates the database:
def scrap!
  table = DB[:products]
  @data.each do |x|
    # "rescue nil" keeps going when a field is missing on a product card
    title  = x.search('div.thunk-product-content > h2').text rescue nil
    price  = x.search('div.thunk-product-content > span > span').text rescue nil
    photo  = x.search('div.thunk-product-image > img')[0]['src'] rescue nil
    photoH = x.search('div.thunk-product-image > img')[1]['src'] rescue nil
    link   = x.search('> a')[0]['href'] rescue nil
    # Skip products that are already in the database
    if table.where(title: title, price: price, link: link).count > 0
      puts "Product: #{title} has been skipped!"
    else
      table.insert(
        title: title,
        price: price,
        photo: photo,
        photoH: photoH,
        link: link,
        created_at: Time.now,
        updated_at: Time.now)
      puts 'Name: '  + title.yellow.bold
      puts 'Price: ' + price.green.bold
      puts 'Photo: ' + photo.red
      puts 'Link: '  + link.yellow
      60.times { print '='.white }; puts ''
    end
  end
end
So how did we find the other fields? Once we have the main container, div.thunk-product, we search for the other fields relative to it. Take a look at the code, or open the page source on the website:
That was easy, but what about pagination? We will find the total number of pages and set a counter to zero. Then we scrape one page, increase the counter by one, and scrape again until we reach the last page:
@c = 0
open_page(URL)
# On this site, the sixth pagination link holds the last page number
last_number = @page.search('a.page-numbers')[5].text
@last_page_number = last_number.to_i
@last_page_number.times do
  @c += 1
  puts "\n Scraping page [" + "#{@c}".red.bold + "]\n\n"
  open_page("#{URL}/page/#{@c}") and scrap!
end
puts "\nFinished scraping [".white + "#{@last_page_number}".red.bold + "] pages!\n\n"
And that's it! You have a web scraper for your new e-commerce app! I used sqlite3, but you can use whatever database fits your needs. The full code looks like this; in my case it lives in the lib folder of a Rails app, for manual execution:

ruby lib/scrap.rb
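As mentioned at the start, this can also run as a rake task (on Heroku you would point the Scheduler add-on at it). A minimal sketch, where the namespace, task name, and require path are my assumptions:

```ruby
require 'rake'
extend Rake::DSL # makes namespace/task/desc available outside a Rakefile

# Hypothetical task wrapping the scraper; on Heroku, Scheduler would run
# `rake scrape:shop`. The require path assumes the class lives in lib/scrap.rb.
namespace :scrape do
  desc 'Scrape shop items and update the products table'
  task :shop do
    require_relative 'lib/scrap'
    DelijeShop.new
  end
end
```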
Edit: added a check that the response code is 200, to verify that the web page is available:
#!/usr/bin/env ruby
require 'nokogiri'
require 'sequel'
require 'rest-client'
require 'colorize'

class DelijeShop
  DB = Sequel.sqlite('db/development.sqlite3')
  URL = 'https://delijebutik.com/shop'

  def initialize
    @c = 0
    open_page(URL)
    if @html.code == 200
      # On this site, the sixth pagination link holds the last page number
      last_number = @page.search('a.page-numbers')[5].text
      @last_page_number = last_number.to_i
      @last_page_number.times do
        @c += 1
        puts "\n Scraping page [" + "#{@c}".red.bold + "]\n\n"
        open_page("#{URL}/page/#{@c}") and scrap! and sleep(3)
      end
      puts "\nFinished scraping [".white + "#{@last_page_number}".red.bold + "] pages!\n\n"
    else
      raise "Connection error, code #{@html.code} returned"
    end
  end

  def open_page(url)
    @html = RestClient.get(url)
    @page = Nokogiri::HTML(@html)
    @data = @page.search('div.thunk-product')
  end

  def scrap!
    table = DB[:products]
    @data.each do |x|
      # "rescue nil" keeps going when a field is missing on a product card
      title  = x.search('div.thunk-product-content > h2').text rescue nil
      price  = x.search('div.thunk-product-content > span > span').text rescue nil
      photo  = x.search('div.thunk-product-image > img')[0]['src'] rescue nil
      photoH = x.search('div.thunk-product-image > img')[1]['src'] rescue nil
      link   = x.search('> a')[0]['href'] rescue nil
      # Skip products that are already in the database
      if table.where(title: title, price: price, link: link).count > 0
        puts "Product: #{title} has been skipped!"
      else
        table.insert(
          title: title,
          price: price,
          photo: photo,
          photoH: photoH,
          link: link,
          created_at: Time.now,
          updated_at: Time.now)
        puts 'Name: '  + title.yellow.bold
        puts 'Price: ' + price.green.bold
        puts 'Photo: ' + photo.red
        puts 'Link: '  + link.yellow
        60.times { print '='.white }; puts ''
      end
    end
  end
end # end_of_class

puts "\n" + 'Scraping Delije Shop Products List ...'.yellow
50.times { print '-'.yellow }
puts ''
DelijeShop.new