My two most recent posts have been light introductions to instascrape: the lightweight, open source Instagram web scraper written for Python!
Scrape data from Instagram with instascrape and Python
Chris Greening ・ Oct 20 '20
Visualizing Instagram engagement with instascrape and Python
Chris Greening ・ Oct 21 '20
In this post, I'm going to show some of the ways I've personally explored Instagram programmatically using Python and instascrape.
The Content
Working with static content...
On its own, instascrape is a purely static web scraper. That means it only scrapes the initial source HTML served back by Instagram and does not deal with dynamic content rendered by JavaScript.
...and dealing with the dynamic
Like many other modern websites, Instagram uses a combination of server-side and client-side technologies such as AJAX to dynamically load content as you scroll. This allows Instagram to respond to an HTTP request quickly and then load more content as it's needed. By doing this, the user is presented with a clean, seamless user experience (UX) with infinite scrolling and fast page refresh times.
While great for UX, this dynamically rendered content can become a bit of a pain for web scraping... but no worries! There are ways we can get around this and take it in stride. For the most part, I use selenium, which allows us to automate web browsers such as Google Chrome and Firefox using Python! With this tool at our disposal, we can render the JavaScript and grab the HTML as it's loaded, then pass it to instascrape for scraping.
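To make the scroll-and-grab idea concrete, here is a minimal sketch of how it could look. The function name `scroll_and_collect_html` and its parameters are my own for illustration and are not part of instascrape; the only browser calls it relies on are Selenium's `execute_script` and `page_source`. It assumes the page keeps growing in height while new content is still loading.

```python
import time


def scroll_and_collect_html(driver, max_scrolls=10, pause=2.0):
    """Scroll to the bottom of the page repeatedly, then return the rendered HTML.

    `driver` is assumed to be a Selenium WebDriver (or anything exposing
    `execute_script` and `page_source`). The function name is hypothetical.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_scrolls):
        # Jump to the bottom so the page requests its next batch of content
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the background requests time to finish
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # page stopped growing, so nothing new was loaded
        last_height = new_height
    return driver.page_source


# Example usage (requires selenium and a browser driver to be installed):
# from selenium import webdriver
# driver = webdriver.Chrome()
# driver.get("https://www.instagram.com/chris_greening/")
# html = scroll_and_collect_html(driver)
# driver.quit()
```

The returned HTML string is what would then be handed off for scraping. The pause length is a guess; a slow connection may need a longer one.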
The Tools
Data processing: the before and after
For the data analysis, I use a combination of:
- pandas: powerful tools and data structures for analyzing and exploring data
- numpy: support for multidimensional arrays
- scikit-learn: machine learning library that we will use for preprocessing data and building regression models
Data visualization and interaction
The library I use for data visualization is matplotlib. I use Jupyter Notebook or the IPython console for interactivity.
The Exploration
Takin' a peek at politicians
In early March 2020, I used one of the early iterations of instascrape to take a look at how various politicians' Instagram games stacked up against one another, specifically Bernie Sanders and Joe Biden:
Fascinating! Let's take a look at Bernie first. It appears he's enjoyed very steady growth on his Instagram since 2016, nearly quintupling his likes per post. Additionally, we can see when he's on the campaign trail based on the frequency of posts.
Now let's take a look at Joe. He has no posts prior to mid-2018 and it's clear he was getting fewer likes than Bernie at the time of this data collection. This certainly makes sense considering Bernie is so popular with younger voters, who make up a larger share of social media users!
The rise and fall of @chris_greening
Yes, that's a David Bowie reference; yes, I am Chris Greening and my insta is in fact falling... but that's okay! It made for a fun exercise to analyze. Let's check out the data:
Gasp! Shock! I know, it's tragic. But let's get down to it. We can see that my growth was quite stagnant between 2016 and 2020 until mid-March 2020, when my page suddenly blew up (quarantine was beginning and I decided to learn Photoshop). Let's zoom in a bit to just 2020:
I went from averaging <100 likes per post to almost 400 likes in a matter of months, with some posts netting over 800! We can also see that I was pretty steady with my frequency of posting until June, when I slipped up and missed an entire month! Whoops! And it's all been downhill from there. This type of data can be great, though, for seeing how a page is performing!
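For anyone who wants to check numbers like these rather than eyeball a plot, here is a rough sketch of one way to compute average likes per month and spot months with no posts. It is not the code behind the charts in this post; the function names and the sample data are made up for illustration, and it assumes each post has already been reduced to a `(datetime, likes)` pair.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def monthly_like_averages(posts):
    """Map (year, month) -> mean likes for posts given as (datetime, likes) pairs."""
    buckets = defaultdict(list)
    for posted_at, likes in posts:
        buckets[(posted_at.year, posted_at.month)].append(likes)
    return {month: mean(values) for month, values in sorted(buckets.items())}


def missing_months(averages):
    """Return the (year, month) keys with no posts between the first and last month seen."""
    if not averages:
        return []
    months = sorted(averages)
    (year, month), last = months[0], months[-1]
    gaps = []
    while (year, month) < last:
        # Step forward one calendar month, rolling over into the next year
        month += 1
        if month > 12:
            year, month = year + 1, 1
        if (year, month) not in averages:
            gaps.append((year, month))
    return gaps


# Made-up sample data, not real numbers from my profile
sample_posts = [
    (datetime(2020, 5, 1), 300),
    (datetime(2020, 5, 15), 500),
    (datetime(2020, 7, 2), 200),
]
sample_averages = monthly_like_averages(sample_posts)
sample_gaps = missing_months(sample_averages)
```

With pandas, the same grouping would probably be a one-liner; the plain-Python version just keeps the logic visible.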
Let's take a look at a popular Instagram page right now and see how they're doing:
Wow, honestly kind of incredible how linear @dudewithsign's growth has been since he started posting, nearly an exact straight line.
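One way to put a number on "nearly an exact straight line" is to fit a least-squares line and look at R², where a value close to 1 means the points sit almost exactly on the line. The sketch below is an illustration only, not the analysis behind the plot: `linear_fit` is a hypothetical helper, `xs` is assumed to be something like days since the first post, and `ys` the likes on each post.

```python
def linear_fit(xs, ys):
    """Ordinary least squares for y = slope * x + intercept.

    Returns (slope, intercept, r_squared). Assumes xs and ys have the same
    length, xs are not all identical, and ys are not all identical.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 = 1 - (residual sum of squares / total sum of squares)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1 - ss_res / ss_tot
    return slope, intercept, r_squared


# Made-up, perfectly linear sample data
slope, intercept, r_squared = linear_fit([0, 1, 2, 3], [10, 12, 14, 16])
```

In practice scikit-learn's `LinearRegression` or `numpy.polyfit` would do the same job; writing it out by hand just shows what the fit is measuring.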
Determining best time of day/day of week to post
In the same vein, using the same data as the previous exploration, we can also create a heatmap that shows the best time of day/day of week for @chris_greening (me) to post to net the highest average engagement:
It certainly seems that I get the most engagement when I post in the late morning/early afternoon. Additionally, we can see that some of my most-engaged posts went up in the middle of the week, on Wednesday and Thursday. This is great information to keep in mind the next time I go to post something.
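The data behind a heatmap like this boils down to a day-of-week by hour-of-day grid of average likes. Here is a minimal sketch of that aggregation step, with the caveat that it is a reconstruction for illustration rather than the exact code used for the figure. `engagement_grid` is a hypothetical name, the sample posts are invented, and the timestamps are assumed to already be in the timezone you care about.

```python
from collections import defaultdict
from datetime import datetime


def engagement_grid(posts):
    """Return a 7x24 grid (rows Monday..Sunday, columns hour 0..23) of mean likes.

    `posts` is an iterable of (datetime, likes) pairs. Cells with no posts hold None.
    """
    totals = defaultdict(lambda: [0, 0])  # (weekday, hour) -> [sum of likes, post count]
    for posted_at, likes in posts:
        cell = totals[(posted_at.weekday(), posted_at.hour)]
        cell[0] += likes
        cell[1] += 1
    grid = [[None] * 24 for _ in range(7)]
    for (weekday, hour), (total, count) in totals.items():
        grid[weekday][hour] = total / count
    return grid


# Invented sample posts: two on Wednesdays at 11am, one on a Sunday evening
sample_grid = engagement_grid([
    (datetime(2020, 10, 21, 11, 30), 400),
    (datetime(2020, 10, 14, 11, 5), 200),
    (datetime(2020, 10, 18, 20, 0), 50),
])
```

Once the empty cells are swapped for NaN, a grid like this can be handed to matplotlib's `imshow` or seaborn's `heatmap` to draw the actual figure.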
Scraping a post in real time
For the final bit of data exploration I'll show in this post, let's take a look at the output of a program I wrote that watches a post's engagement as it grows in real time. The program tracked a post by @dacre_montgomery and recorded how many likes/comments it got as a time series across a 30-minute window:
The red/left y-axis represents likes on the post while the blue/right y-axis represents comments on the post. Incredibly enough, Dacre was able to amass over 35,000 likes and almost 400 comments in that time period alone (and that was after the post had already been up for an hour or so). That's more likes/comments than I have probably ever gotten on all my posts combined.
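The core of a tracker like this is just a polling loop. Below is a hedged sketch of that loop, not the original program: `track_engagement` and `fetch_counts` are hypothetical names, and `fetch_counts` stands in for whatever scraping call returns the current `(likes, comments)` for the post. The `clock` and `sleep` parameters exist only so the loop can be exercised without actually waiting.

```python
import time


def track_engagement(fetch_counts, duration_s=1800, interval_s=60,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll `fetch_counts()` every `interval_s` seconds until `duration_s` has elapsed.

    `fetch_counts` is a hypothetical callable returning (likes, comments).
    Returns a list of (elapsed_seconds, likes, comments) samples.
    """
    start = clock()
    samples = []
    while True:
        elapsed = clock() - start
        if elapsed > duration_s:
            break
        likes, comments = fetch_counts()
        samples.append((elapsed, likes, comments))
        sleep(interval_s)  # wait before taking the next sample
    return samples
```

The default of 1800 seconds matches the 30-minute window above; the 60-second interval is an arbitrary choice, and polling too aggressively is a good way to get rate limited.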
A future idea could be to write a script that watches a user's page continuously and as soon as a new post is detected, the real-time engagement tracker is triggered and we could watch the post grow across a longer period, say 8 hours!
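As a rough sketch of how that detection step might work: keep a set of post identifiers already seen, compare it against the current list from the profile page, and fire a callback for anything new. Everything named here is hypothetical (`check_for_new_posts`, `fetch_shortcodes`, `on_new_post`), and identifying posts by their shortcode is an assumption about how the scraped data would be keyed.

```python
def check_for_new_posts(fetch_shortcodes, known, on_new_post):
    """Run one check of a profile and trigger `on_new_post` for unseen posts.

    `fetch_shortcodes` is a hypothetical callable returning the profile's
    current post identifiers; `known` is a set that is updated in place.
    Returns the list of newly detected identifiers, in page order.
    """
    current = fetch_shortcodes()
    new_posts = [code for code in current if code not in known]
    for code in new_posts:
        on_new_post(code)  # e.g. start the real-time engagement tracker
    known.update(new_posts)
    return new_posts
```

Calling this in a loop with a sleep in between would give the continuous watcher; the callback is where the real-time tracker would be kicked off.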
The Conclusion
And there you have it! Leveraging instascrape to gather data, I was able to perform some really great data exploration I wouldn't have been able to do otherwise. Not only was I able to explore my own profile, but I was able to look at the profiles of some public figures as well. These are just drops in a great ocean of possibilities you can accomplish using instascrape, and the data is just out there waiting for you!
Keep an eye out for a future post with more data exploration that will take a look at real-time hashtag growth, real-time post growth, and some more interesting examples to mess around with.
Let me know what you think in the comments below or, even better, contribute to the official repo.
chris-greening / instascrape
Powerful and flexible Instagram scraping library for Python, providing easy-to-use and expressive tools for accessing data programmatically
instascrape: powerful Instagram data scraping toolkit
Note: This module is no longer actively maintained.
DISCLAIMER:
Instagram has gotten increasingly strict with scraping and using this library can result in getting flagged for botting AND POSSIBLE DISABLING OF YOUR INSTAGRAM ACCOUNT. This is a research project and I am not responsible for how you use it. Independently, the library is designed to be responsible and respectful and it is up to you to decide what you do with it. I don't claim any responsibility if your Instagram account is affected by how you use this library.
What is it?
instascrape is a lightweight Python package that provides an expressive and flexible API for scraping Instagram data. It is geared towards being a high-level building block on the data scientist's toolchain and can be seamlessly integrated and extended with industry standard tools for web scraping, data science, and analysis.
Top comments (1)
Hey, I'm trying to do a heatmap with seaborn like yours for a school project with some data from my Instagram profile. I'm having some trouble making it look similar to the one above. Can you show me the code you used to create the heatmap? Also, what data did you use to create it?