In the past (before Elon Musk…), you could easily and freely apply for a developer account, get your own tokens, and start using the Twitter API without any struggle. One of the strengths of the developer account, besides building bots and tweeting via the API, was the search API: you could grab almost all the tweets you wanted. But after Elon Musk's takeover, unfortunately, you have to pay for it!
Tiers will start at $500,000 a year for access to 0.3 percent of the company's tweets. Researchers say that's too much for too little data. [source]
There is one solution that almost always works: Selenium! (It's also good to know that a great alternative to Selenium in JS is Puppeteer.)
It lets you scrape almost everything on the surface of the web; you just have to write a script for your use case with the Selenium library.
How
The algorithm for scraping tweets is simple. These are the steps (a minimal sketch follows the list):
- Open Twitter search with an advanced search query.
- Scrape specific tags to get the values.
- Scroll down.
- Repeat until you have scraped the number of tweets you need.
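Here is a minimal sketch of that loop with raw Selenium, assuming a Chrome driver. The search URL and the tweetText CSS selector are illustrative; Twitter's markup changes often, so verify them against the live page before relying on them.

import time
from selenium import webdriver
from selenium.webdriver.common.by import By

TARGET = 50  # how many tweets we want
tweets = set()

driver = webdriver.Chrome()
# Step 1: open Twitter search with an advanced search query
driver.get("https://twitter.com/search?q=from%3Aelonmusk&f=live")
time.sleep(5)  # crude wait for the first results to render

for _ in range(30):  # cap the scrolls so we never loop forever
    # Step 2: scrape the tweet-text nodes currently on screen
    # (the data-testid selector is an assumption about the current markup)
    for el in driver.find_elements(By.CSS_SELECTOR, '[data-testid="tweetText"]'):
        tweets.add(el.text)
    if len(tweets) >= TARGET:
        break
    # Step 3: scroll to trigger loading of the next batch
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # give the new tweets time to load

driver.quit()
print(f"Scraped {len(tweets)} tweets")

In practice you would swap the fixed sleeps for explicit waits, but the loop above is the whole idea: scrape, scroll, repeat.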
Code
You can write the script yourself or use a library such as twitter_scraper_selenium.
It's available on PyPI and GitHub.
pip install twitter_scraper_selenium
(Note: for saving as CSV and working with data frames, we must install pandas and its dependencies too.)
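For example:
pip install pandas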
Then you can write your own wrapper function like this:
from twitter_scraper_selenium import scrape_keyword
import json
import pandas as pd

def scrape_profile_tweets_since_2023(username: str):
    kword = "from:" + username
    path = './users/' + username  # the ./users directory must already exist
    file_path = path + '.csv'
    # With output_format="csv", the function saves the tweets to disk
    # rather than returning them
    scrape_keyword(
        headless=True,
        keyword=kword,
        browser="chrome",
        tweets_count=2,  # just the last 2 tweets
        filename=path,
        output_format="csv",
        since="2023-01-01",
        # until="2025-03-02",  # omitted: defaults to right now
    )
    # Read the CSV back and return the tweets as a list of records
    data = pd.read_csv(file_path)
    data = json.loads(data.to_json(orient='records'))
    return data
You can call this function for multiple accounts at the same time, like this:
from multiprocessing import Pool

# Just one account:
# scrape_profile_tweets_since_2023('elonmusk')

# Run in parallel: map the wrapper over a list of usernames
number_of_workers = 5

if __name__ == "__main__":
    with Pool(number_of_workers) as p:
        results = p.map(scrape_profile_tweets_since_2023,
                        ['elonmusk', 'BarackObama', 'cathiedwood'])
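Keep in mind that each worker process launches its own headless browser, so keep the pool size modest: five workers can mean up to five Chrome instances running at once.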
Result
Your result will be something like this:
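The exact columns depend on the twitter_scraper_selenium version, but each record is roughly of this shape (values below are placeholders):

[
    {
        "tweet_id": "...",
        "username": "elonmusk",
        "content": "...",
        "posted_time": "...",
        "tweet_url": "https://twitter.com/elonmusk/status/..."
    }
]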
In the next post, we are going to scrape mentions/replies as well.
If you liked the post, please clap or follow me on GitHub and LinkedIn!
Github.com/iw4p