Suppose you want some data about a company's products, say the prices of all its commodities in a comma-separated values (CSV) file, or photos from a social media site. What would you do?
You could copy the information from the site and paste it into your own file. But what if you need a huge amount of information from the site as quickly as possible, such as large amounts of data to train a machine learning algorithm?
In that case, copy and paste will not work, and you will need web scraping. Web scraping uses automation to collect thousands or even millions of data points in far less time.
What is Web Scraping?
Web scraping is a means of extracting vast volumes of data from websites in an automated manner. The majority of this data is unstructured HTML data that is converted to structured data in a spreadsheet or database before being used in various applications.
Web scraping can be done in a variety of ways. The options include using online services, using specific APIs, and writing your own web scraping code from scratch. Many large websites, such as Google, Twitter, Facebook, and StackOverflow, provide APIs that let you access their data in a structured fashion.
Applications of web scraping
- Market research
- Price monitoring
- News monitoring
- Email marketing
- Sentiment Analysis
Prerequisites
- Python
Why Python? It is the most popular language for web scraping because it handles most of the process easily. It also has a variety of libraries created specifically for web scraping, such as Scrapy and Beautiful Soup.
So let's start.
1. Install Python.
Install Python 3 and the virtual environment package, then create a virtual environment.
Install Python 3 first by running the following command in a terminal:
$ sudo apt install python3
Then install the virtual environment package. In the terminal, type:
$ sudo apt install python3-venv
After installing Python and the virtual environment package, create a project folder, create a virtual environment inside it, and activate it.
- Create project folder:
mkdir web_scrap
Move into the web_scrap directory:
cd web_scrap
- Create virtualenv:
python3 -m venv env
- Activate virtualenv:
. env/bin/activate
These are the basic steps to set up our coding environment; check this out for more.
2. Create a Python file.
Create a Python file named scrap.py and open it in Visual Studio or your favorite text editor.
3. Import packages.
Install the packages in the virtual environment, then import them in scrap.py.
pip install requests
pip install bs4
pip install termcolor
The Python modules we will be using:
- re - regular expressions.
- requests - to scrape data directly from Instagram.
- BeautifulSoup - to pull a specific, filtered part out of all the data.
- urllib - to download a file from a URL.
- os - to store the downloaded file in our media folder.
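The snippets in the rest of this article use these names without showing their imports. One plausible import block is sketched below. The alias DFU for urllib.request is an assumption inferred from the later DFU.urlretrieve call, and colored is assumed to come from the termcolor package installed above.

```python
import os                      # run shell commands and work with folders
import re                      # regular expressions for pulling URLs out of the page source
import urllib.request as DFU   # assumed alias: later snippets call DFU.urlretrieve

import requests                # fetch the page
from bs4 import BeautifulSoup  # parse the HTML and read the <title>
from termcolor import colored  # color the terminal output
```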
4. Get website link.
Let's add a simple input prompt that reads a URL:
url = input("Enter your Instagram URL: ")
Copy any URL from Instagram, then fetch the page at that URL using requests.
data = requests.get(url)
You can print the data and check the results.
print(data)
The code
The output
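Printing the response object only shows the status, for example <Response [200]>; the page source itself is in data.text. The sketch below wraps that step in a hypothetical helper, fetch_page_source, which is not part of the original script. The timeout value and the get parameter (there only so the helper can be tried without network access) are assumptions.

```python
import requests


def fetch_page_source(url, get=requests.get, timeout=10):
    """Return the HTML of `url` as a string, or raise if the request failed."""
    response = get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
    return response.text         # the page source as text
```

The later snippets refer to this string as str, so the equivalent step there would be str = data.text. Note that this name shadows Python's built-in str.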
Now let's take the case of a video.
https://www.instagram.com/p/B_wH2aCnyEh/?utm_medium=copy_link
This is the page with the video.
And here is the source code.
If you search this page's source (Ctrl + F) for "mp4", you will find something like this:
The link that contains the mp4 is the main thing we need:
"https://instagram.fnbo9-1.fna.fbcdn.net/v/t50.2886-16/95332972_323221645317471_817729865566514230_n.mp4?efg=eyJ2ZW5jb2RlX3RhZyI6InZ0c192b2RfdXJsZ2VuLjQ4MC5mZWVkLmRlZmF1bHQiLCJxZV9ncm91cHMiOiJbXCJpZ193ZWJfZGVsaXZlcnlfdnRzX290ZlwiXSJ9\u0026_nc_ht=instagram.fnbo9-1.fna.fbcdn.net\u0026_nc_cat=103\u0026_nc_ohc=Q1fkDGBA2oEAX9xsGin\u0026edm=AABBvjUBAAAA\u0026vs=18035297806253182_2714272676\u0026_nc_vs=HBksFQAYJEdHeXFyZ1ZmVlZybjl5VUJBRGJzVWUtNktGa0xia1lMQUFBRhUAAsgBABUAGCRHSFlhdkFWNG9oRUFsSEFHQVAwaFlDdDdtOVl0YmtZTEFBQUYVAgLIAQAoABgAGwGIB3VzZV9vaWwBMBUAACb8yIvTv8CJQBUCKAJDMywXQCbul41P3zsYEmRhc2hfYmFzZWxpbmVfMV92MREAdeoHAA%3D%3D\u0026ccb=7-4\u0026oe=621DCC10\u0026oh=00_AT_7jbU74b8Fm9-U5y6GQhURJihmzKNI_AEvVNjI4e-Blw\u0026_nc_sid=83d603"
Because of Instagram's terms of use, use the link below for the video instead:
https://www.w3schools.com/html/movie.mp4
match = re.findall(r"url\W\W\W([-\W\w]+)\W\W\Wvideo_view_count", str)
The code above finds the video URL in the page source each time we run it.
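To see what the pattern captures, here is a small sketch run on a hand-written sample string rather than on real Instagram source. The sample and the name page_source are assumptions made for illustration; the sample only imitates the "video_url":"...","video_view_count" layout that the pattern expects.

```python
import re

# Hand-written stand-in for the page source (not real Instagram output).
page_source = '{"video_url":"https://www.w3schools.com/html/movie.mp4","video_view_count":5}'

# "url", three non-word characters (":"), the captured link,
# three more non-word characters (","), then "video_view_count".
pattern = r"url\W\W\W([-\W\w]+)\W\W\Wvideo_view_count"
match = re.findall(pattern, page_source)

print(match)
```

Because the captured group is greedy, it runs up to the last place where video_view_count follows, so the pattern depends on the exact layout of the source.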
To extract the video, we declare a variable named extraction and store the video file extension in it, as shown below.
extraction = ".mp4"
Do the same for an image, but search for profile_pic_url:
"https://instagram.fnbo9-1.fna.fbcdn.net/v/t51.2885-19/274607143_1204294113308064_418123174948225933_n.jpg?stp=dst-jpg_s150x150\u0026_nc_ht=instagram.fnbo9-1.fna.fbcdn.net\u0026_nc_cat=100\u0026_nc_ohc=L3oR46dvCW0AX-fS68k\u0026edm=AABBvjUBAAAA\u0026ccb=7-4\u0026oh=00_AT_7whkb_tXXNikAlnrI8yBifCb9zDwZK0Zt5q462q93Vw\u0026oe=6222855B\u0026_nc_sid=83d603"
as shown below.
Source code:
Search for profile_pic_url:
For the image link, use:
https://www.w3schools.com/html/pic_trulli.jpg
match = re.findall(r"profile_pic_url\W\W\W([\W\w]+)\W\W\Wdisplay_resources", str)
And now the value of our extraction variable is:
extraction = ".jpg"
re.findall returns a list, so the last line of this step takes its first item and stores the video or image URL of the post in a variable as a string. To do that:
res = match[0]
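Two details are worth guarding against here. match[0] raises an IndexError when the pattern finds nothing, and links copied from the page source carry JSON escapes such as \u0026 in place of &, as in the long URLs shown earlier. A minimal sketch follows, using a hypothetical helper first_url that is not in the original script; the example.com link is a made-up placeholder.

```python
def first_url(matches):
    """Return the first matched URL with \\u0026 turned back into &, or "" if none."""
    if not matches:
        return ""  # lets a later `if res != ""` check work as intended
    # The page source holds the six literal characters \u0026, not an ampersand.
    return matches[0].replace("\\u0026", "&")


# Placeholder link (assumption), written the way it would appear in page source.
res = first_url(["https://example.com/a.mp4?x=1\\u0026y=2"])
print(res)
```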
5. Data extraction.
Here we download the post and get its caption.
We will use BeautifulSoup to get the caption or title of the post. We pass all the data (str) through BS4 and filter it.
page = BeautifulSoup(str, "html.parser")
title = page.find("title")
title = title.get_text()
This code finds the title of the page and stores it in the title variable.
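The same three lines can be tried on a small hand-written HTML string, which makes it easy to see what get_text returns. The sample HTML below is an assumption for illustration, not real page source, and the check for a missing title tag is an addition to the original lines.

```python
from bs4 import BeautifulSoup

# Hand-written sample page (assumption), standing in for the fetched source.
sample_html = "<html><head><title>Sample post on Instagram</title></head><body></body></html>"

page = BeautifulSoup(sample_html, "html.parser")
title_tag = page.find("title")  # None when the page has no <title>
title = title_tag.get_text() if title_tag else ""

print(title)
```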
After this, we use a regular expression to turn the title into a safe file name, and we store the file in a media folder.
title = re.sub(r"\W+", "_", title)
title = "download/web_scrap"+title+"web_scrap"
print("\n"+title)
We use download/ because we want to store the downloaded file in a new folder called download/.
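One thing the snippet above leaves implicit is that the download folder has to exist before a file can be written into it. Below is a minimal sketch of creating the folder and building the path in one place. build_file_name is a hypothetical helper, and its sanitization mirrors the re.sub call above.

```python
import os
import re


def build_file_name(title, extraction, folder="download"):
    """Create `folder` if needed and return a safe path for the download."""
    os.makedirs(folder, exist_ok=True)       # no error if it already exists
    safe_title = re.sub(r"\W+", "_", title)  # runs of non-word characters -> "_"
    return os.path.join(folder, "web_scrap" + safe_title + "web_scrap" + extraction)
```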
if res != "":
    print('found \n \n'+'\033[1m'+colored(res, 'green')+'\033[0m'+'\n') #'found word:cat'
    download = input("Do you want to download(y/N) : ")
    if (download == "y" or download == "Y"):
        try:
            fileName = title
            print("Downloading.....")
            DFU.urlretrieve(res, fileName+extraction)
            print("Download Successfully!")
            os.system("tree download")
        except:
            print("Sorry! Download Unsuccessful")
else:
    print('did not find or post is from private account')
    exit()
So if the res variable is not empty, the script prints the actual link of the post. It then asks whether you want to download the file; answer with y or n. If the answer is Y or y, the download proceeds.
if (download == "y" or download == "Y"):
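The download step can also be sketched as a small function that reports success or failure instead of catching every exception. download_file is a hypothetical name. The sketch uses urllib.request.urlretrieve, the call the script reaches through the DFU alias, and assumes that failures surface as OSError (urllib.error.URLError is a subclass of it).

```python
import urllib.request


def download_file(url, destination):
    """Save `url` to `destination`; return True on success, False on failure."""
    try:
        urllib.request.urlretrieve(url, destination)
    except OSError as error:  # network and file-system errors
        print("Sorry! Download unsuccessful:", error)
        return False
    print("Downloaded to", destination)
    return True
```

For example, download_file(res, fileName + extraction) would take the place of the try/except block in the script.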
That's all on how to download an image and a video from Instagram.
Get the source code here
Thank you for taking the time to go through this article.