"In God we trust. All others must bring data." - W. Edwards Deming
If you're starting in the incredible field of NLP, you'll want to get your hands dirty with real textual data that you can use to play around with the concepts you've learned. Twitter is an excellent source of such data. In this post, I'll be presenting a scraper that you can use to scrape the tweets of the topics that you're interested in and get all nerdy once you've obtained your dataset.
I've used this amazing library that you can find here. I'll go over how to install and use it, and also suggest some ways to speed up the entire process using parallelization. The complete notebook containing the code can be found here.
Installation
The library can be installed with pip3 using the following command:
pip3 install twitter_scraper
Creating a list of keywords
The next task is to create a list of keywords that you want to use for scraping Twitter.
# List of hashtags that we're interested in
keywords = ['machinelearning', 'ML', 'deeplearning',
'#artificialintelligence', '#NLP', 'computervision', 'AI',
'tensorflow', 'pytorch', "sklearn", "pandas", "plotly",
"spacy", "fastai", 'datascience', 'dataanalysis']
Scraping tweets for one keyword
Before we run our program on all the keywords, we'll run it with a single keyword and print out the fields we can extract from the returned object. In the code below, I've shown how to iterate over the returned object and print the fields that we want to extract. You can see that we extract the following fields:
- Tweet ID
- Is a retweet or not
- Time of the tweet
- Text of the tweet
- Replies to the tweet
- Total retweets
- Likes to the tweet
- Entries in the tweet
# Let's run one iteration to understand how to implement this library
import pandas as pd
from twitter_scraper import get_tweets

# get_tweets returns a generator; materialise it into a list so that
# we can iterate over it more than once
tweets = list(get_tweets("#machinelearning", pages=5))

# Let's print the keys obtained
for tweet in tweets:
    print('Keys:', list(tweet.keys()), '\n')
    break

# Running the code for one keyword and extracting the relevant data
rows = []
for tweet in tweets:
    rows.append({'text': tweet['text'],
                 'isRetweet': tweet['isRetweet'],
                 'replies': tweet['replies'],
                 'retweets': tweet['retweets'],
                 'likes': tweet['likes']})
tweets_df = pd.DataFrame(rows)
tweets_df.head()
Running the code sequentially for all keywords
Now that we've decided what kind of data we want to store from our object, we'll run our program sequentially to obtain the tweets of topics we're interested in. We'll do this using our familiar for loop to go over each keyword one by one and store the successful results.
%%time
# We'll measure the time it takes to complete this process sequentially
from tqdm import tqdm

rows = []
for word in tqdm(keywords):
    tweets = get_tweets(word, pages=100)
    try:
        for tweet in tweets:
            rows.append({'hashtag': word,
                         'text': tweet['text'],
                         'isRetweet': tweet['isRetweet'],
                         'replies': tweet['replies'],
                         'retweets': tweet['retweets'],
                         'likes': tweet['likes']})
    except Exception as e:
        print(word, ':', e)
        continue
all_tweets_df = pd.DataFrame(rows)
Running the code in parallel
From the documentation: multiprocessing is a package that supports spawning processes using an API similar to the threading module. The multiprocessing package offers both local and remote concurrency, effectively side-stepping the Global Interpreter Lock by using subprocesses instead of threads. Due to this, the multiprocessing module allows the programmer to fully leverage multiple processors on a given machine.
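To see what that means in practice, here's a toy example of `Pool.map` before we apply it to the scraper (the `square` function is just for illustration and has nothing to do with Twitter):

```python
from multiprocessing import Pool

# A pure function so it can be pickled and sent to worker processes
def square(x):
    return x * x

if __name__ == "__main__":
    with Pool(4) as p:
        # Each input is handed to one of the 4 worker processes
        results = p.map(square, [1, 2, 3, 4])
    print(results)  # → [1, 4, 9, 16]
```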
First, we'll implement a function to scrape the data.
# We'll create a function to fetch the tweets and store them for us
def fetch_tweets(word):
    rows = []
    tweets = get_tweets(word, pages=100)
    try:
        for tweet in tweets:
            rows.append({'hashtag': word,
                         'text': tweet['text'],
                         'isRetweet': tweet['isRetweet'],
                         'replies': tweet['replies'],
                         'retweets': tweet['retweets'],
                         'likes': tweet['likes']})
    except Exception as e:
        print(word, ':', e)
    return pd.DataFrame(rows)
Next, we'll create subprocesses to run our code in parallel.
%%time
# We'll run this in parallel with 4 subprocesses to compare the times
from multiprocessing import Pool

with Pool(4) as p:
    records = p.map(fetch_tweets, keywords)
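Note that `p.map` returns one DataFrame per keyword, so the last step is to combine them into a single frame, e.g. with `pd.concat`. A sketch with toy data standing in for the scraped results:

```python
import pandas as pd

# Toy stand-ins for the per-keyword DataFrames returned by p.map
records = [pd.DataFrame({'hashtag': ['ML'], 'likes': [3]}),
           pd.DataFrame({'hashtag': ['NLP'], 'likes': [7]})]

# Stack the per-keyword results into one DataFrame
all_tweets_df = pd.concat(records, ignore_index=True)
print(len(all_tweets_df))  # → 2
```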
Conclusion
As you can see, the parallel version cut the run time to roughly a quarter of the sequential one. You can use this approach for similar tasks to make your Python code much faster.
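Once you have the scraped tweets in a DataFrame, you'll probably want to persist them for your NLP experiments. A minimal sketch (the filename is my own choice):

```python
import pandas as pd

# A tiny stand-in for the scraped DataFrame
df = pd.DataFrame({'text': ['hello twitter'], 'likes': [5]})
df.to_csv('tweets.csv', index=False)

# Reload later for analysis
reloaded = pd.read_csv('tweets.csv')
print(reloaded.shape)  # → (1, 2)
```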
Good luck with the scraping!
Top comments (1)
Seems like twitter_scraper is not working anymore. Internally, it uses the following URL (which no longer exists) to fetch tweets:
url = f"twitter.com/i/search/timeline?f=tw..."