In this blog post, I'm going to give a quick tutorial on how you can scrape every post on an Instagram profile page using instascrape with less tha...
Thanks so much for these tutorials and for sharing your work! Question: I ran this exact code on my own Instagram data, and it has been executing for ~30 minutes now. I have 552 Instagram posts. I'm hesitant to kill it, but I'm unsure if it's stuck. Any ideas?
Unfortunately it's an incredibly slow approach, Instagram starts blocking if you scrape too much too fast so I try to play the long game and let it run in the background.
In the `scrape_posts` function, you'll see `pause=10`, which refers to a 10-second pause between each post scrape. Considering you have 552 posts, that'll be (552 * 10) / 60 = 92 minutes 😬

In the future, passing `silent=False` as an argument will print what number the scrape is currently on. I'm actually gonna edit that in right now for anyone else reading the article in the future! Thanks for reaching out!

If it's any consolation though, that means it's working! You're just gonna have to wait an extra hour or so before you can get your data 😬
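For anyone estimating their own run, the arithmetic above is just posts times pause; a quick sketch (the function name here is mine, not part of instascrape):

```python
def estimated_minutes(n_posts: int, pause: int = 10) -> float:
    """Rough lower bound on total scrape time: scrape_posts sleeps
    `pause` seconds between posts (network time not included)."""
    return n_posts * pause / 60

print(estimated_minutes(552))  # 92.0 minutes for 552 posts at pause=10
```

Raising `pause` makes the run even slower but also less likely to get you rate-limited.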
Haha, thank you! So it did eventually finish without error, but then I appeared to have a list of `Post` objects, and I couldn't tell how to get the data out of them. From reading the GitHub documentation I tried various methods, but to no avail (this isn't a knock on you, more a knock on my learning curve).

So now, after a few hours of messing around, I tried to run the "joe biden code" for my own account, and even though I am setting `login_first=False` in the `get_posts` function, the Chrome driver brings me to a login page. I'm able to log into Instagram, but meanwhile my code says it has finished running without error, and my `posts` and `scraped_posts` objects are now just empty lists.

Oh, I guess I should also mention that my end goal is to collect data similar to the data you analyzed in your Donald Trump post. I saw you published a notebook of the analysis code (thank you!) but didn't see a line-by-line on how you got that data.
Scraped `Post` objects contain the scraped data as instance attributes! Try using the `to_dict` method on one of the `Post`s and it should return a dictionary with the data it scraped for that `Post`. The key/values of the returned `dict` will correspond one-to-one with the available instance attributes.

I'll take a look at the `login_first` bug right now and see if I can replicate it; it might be on the library's end! Instagram has been making a lot of changes the last month or so and has been making it increasingly harder to scrape.

Ahhh okay, so when you set `login_first=False`, Instagram is still redirecting to the login page automatically, but `instascrape` is trying to start scrolling immediately, which results in an empty list since there are no posts rendered on the page. To access dynamically rendered content like posts you're pretty much always gonna have to be logged in, so it's best to leave `login_first` as `True` unless you're chaining scrapes and your webdriver is already logged in manually.

Amazing, thank you! So I was able to get my first 10 posts no problem by specifying `amount=10`, but then I tried to do all ~500 pictures, and after 232 pictures I came across this error:
`ConnectionError: ('Connection aborted.', OSError("(54, 'ECONNRESET')"))`

I'm guessing this means Instagram blocked my request? Have you come across this issue?
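For what it's worth, an `ECONNRESET` mid-run usually means the server dropped the connection. instascrape has no built-in retry that I can point to, but a crude backoff wrapper around whatever scraping call you make can help long runs survive (`scrape_fn` below is a stand-in for your own call, not part of the library):

```python
import time

def scrape_with_retries(scrape_fn, posts, max_tries=3, base_wait=600):
    """Call scrape_fn(posts); on a ConnectionError, wait and retry.

    Waits 10 min, then 20 min, etc.; generous pauses, since Instagram
    tends to reset connections when it thinks you're scraping too fast.
    scrape_fn is a placeholder for your own scraping function.
    """
    for attempt in range(1, max_tries + 1):
        try:
            return scrape_fn(posts)
        except ConnectionError:
            if attempt == max_tries:
                raise  # out of retries, surface the error
            time.sleep(base_wait * attempt)
```

Note this restarts the whole batch on failure; resuming from the last scraped post would need bookkeeping on your side.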
Hi Chris,
Thank you for sharing. I tried to use your code, but I am getting this error:

`ImportError: cannot import name 'QUOTE_NONNUMERIC' from partially initialized module 'csv' (most likely due to a circular import) (/home/idil/Masaüstü/csv.py)`
Do you know what this is about?
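The path in that traceback (`/home/idil/Masaüstü/csv.py`) suggests the script itself is named `csv.py`, which shadows Python's standard-library `csv` module and produces exactly this partially-initialized-module error; renaming the file usually fixes it. You can confirm which `csv` Python is importing like so:

```python
import csv

# If this prints a path inside your own project folder rather than the
# Python installation, your script is shadowing the stdlib module;
# rename it (and delete any stale __pycache__ / csv.pyc files).
print(csv.__file__)
```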
Firstly, thank you for making such an awesome library/module. What if I want to scrape the first 12 posts, or the first page (containing the recent posts), of a public profile? How do I apply that? And if I want these recent posts returned as JSON, how do I do that? Here's the sample of an object I want returned from the JSON array:
Thank you so much for these tutorials and for publishing your work 🙏🙏
When I'm trying to save both `scraped_posts` and `unscraped_posts`, it says the function has no `to_csv` member. I do see the URLs and the upload date, but I can't see whether or not the individual posts are being scraped.

Maybe I'm doing it wrong; I've looked through your other blog posts and documentation and couldn't find any examples of how to save the scraped data or use the `to_csv`/`to_json` line (yes, I am a beginner in programming, apologies if this question sounds stupid).
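For other readers hitting the same wall: the `to_csv` method most likely lives on individual scraper objects, not on the lists that `scrape_posts` returns, so one way to save everything at once is to go through `to_dict` on each `Post`. The sketch below uses a stand-in class so it runs anywhere; substitute your real scraped list:

```python
import csv

# Stand-in for scraped instascrape Post objects; the real ones expose
# to_dict() the same way (this class exists only so the sketch runs).
class FakePost:
    def __init__(self, data):
        self._data = data

    def to_dict(self):
        return self._data

scraped = [
    FakePost({"upload_date": "2021-01-20", "likes": 100, "comments": 7}),
    FakePost({"upload_date": "2021-01-21", "likes": 250, "comments": 12}),
]

# One dict per post, then one CSV row per dict
rows = [post.to_dict() for post in scraped]
with open("posts.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)
```

The same `rows` list drops straight into `pandas.DataFrame(rows)` if you prefer working there.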
This is Cool Man
Hey thanks so much, I appreciate it! 😄
`InvalidArgumentException: invalid argument: 'url' must be a string`
Do you know why I might be getting this error?
Code is as follows:

```python
import pandas as pd
from selenium.webdriver import Chrome
from instascrape import Profile, scrape_posts
from webdriver_manager.chrome import ChromeDriverManager

# defining path for Google Chrome webdriver
driver = webdriver.Chrome(ChromeDriverManager().install())

# Scraping Joe Biden's profile
SESSIONID = 'session id'  # Actual session id excluded on purpose
headers = {"user-agent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Mobile Safari/537.36 Edg/87.0.664.57",
           "cookie": f"sessionid={SESSIONID};"}
prof = Profile('instagram.com/username/')  # username excluded as well
prof.scrape()

# Scraping the posts
posts = prof.get_posts(webdriver=driver, login_first=True)
scraped, unscraped = scrape_posts(posts, silent=False, headers=headers, pause=10)
posts_data = [post.to_dict() for post in posts]
posts_df = pd.DataFrame(posts_data)
print(posts_df[['upload_date', 'comments', 'likes']])
```
The issue seems to be stemming from my `get_posts` call.
Hi Chris,
I tried running the above code, but I keep getting the error below. I have put in my valid session id.
"Instagram is redirecting you to the login page instead of the page you are trying to scrape. This could be occuring because you made too many requests too quickly or are not logged into Instagram on your machine. Try passing a valid session ID to the scrape method as a cookie to bypass the login requirement"
Hi Chris,
```
joebiden.py:8: DeprecationWarning: executable_path has been deprecated, please pass in a Service object
  webdriver = Chrome("/home/pramod/Downloads/chromedriver/chromedriver")
Traceback (most recent call last):
  File "joebiden.py", line 19, in <module>
    scraped, unscraped = scrape_posts(posts, silent=False, headers=headers, pause=10)
  File "/home/pramod/.local/lib/python3.8/site-packages/instascrape/scrapers/scrape_tools.py", line 179, in scrape_posts
    post.scrape(session=session, webdriver=webdriver, headers=headers)
  File "/home/pramod/.local/lib/python3.8/site-packages/instascrape/scrapers/post.py", line 88, in scrape
    return_instance.upload_date = datetime.datetime.fromtimestamp(return_instance.timestamp)
```

I am facing an error. Please help me out.

Thanks
Always crisp and clear... thanks for sharing ...
Thanks for your hard work. I'm really lucky because I found out about this project just as I wanted to scrape my business IG profile. Keep up with the good work!
This is exactly why I released it, thanks so much for the feedback 😄 motivates me to keep working on it
Thank you very much for this precious tool!
I'm trying to run the code, but despite inserting my session id I still get `MissingCookiesWarning` and `InstagramRedirectLoginError`.
How do I fix this?
I am getting the error "ValueError: Invalid value NaN (not a number)" when the scrape_posts() method is called. Something about getting the time shows up right before it, if that's relevant. Thanks.
Awesome, thanks so much :) I'm pretty new to Python, but I could run your code!
Is there a way to save the posts (images + text) to a certain folder?
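There's no one-liner I can vouch for here, but assuming the dict from a scraped `Post`'s `to_dict()` includes `display_url` and `shortcode` keys (check your own output first; both key names are assumptions), a small download helper could look like this:

```python
from pathlib import Path
from urllib.request import urlretrieve

def image_path(shortcode, folder="ig_images"):
    """Build the output filename for a post, keyed by its shortcode."""
    return Path(folder) / f"{shortcode}.jpg"

def save_post_image(post_dict, folder="ig_images"):
    """Download one post's image into `folder`.

    Assumes post_dict (from Post.to_dict()) has 'display_url' and
    'shortcode' keys; verify those names against your own data.
    """
    Path(folder).mkdir(exist_ok=True)
    out = image_path(post_dict["shortcode"], folder)
    urlretrieve(post_dict["display_url"], out)  # the actual network download
    return out
```

Captions and other text fields come along in the same `to_dict()` output, so you can write those to a sidecar file next to each image however you like.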
Why am I getting this error? :(

`JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)`
Hello, amazing post! I wish there were a link to the second part on analyzing the data.
To do this scraping, do you need to change IP addresses or use multiple IG accounts?
Thanks so much.
Is it still working? It doesn't work for me.
Neither for me; there seems to be something wrong with `.get_posts()`, the list is empty after that.