Using Twitter rest APIs in Python to search and download tweets in bulk
01.02.2019 - Jay M. Patel - Reading time ~5 Minutes
Getting Twitter data
Let’s use the Tweepy package in Python instead of handling the Twitter API directly. We will do two things with the package: authorize ourselves to use the API, and then use a cursor to access the Twitter search APIs.
Let’s go ahead and get our imports loaded.
import tweepy
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
sns.set()
%matplotlib inline
Twitter authorization
To use the Twitter API, you must first register an app to get API keys. To get Tweepy, just install it via pip install tweepy. The Tweepy documentation is best at explaining how to authenticate, but I’ll go over the basic steps.
Once you register your app you will receive API keys; next, use Tweepy to get an OAuthHandler. I have the keys stored in a separate config dict.
config = {"twitterConsumerKey":"XXXX", "twitterConsumerSecretKey" :"XXXX"}
auth = tweepy.OAuthHandler(config["twitterConsumerKey"], config["twitterConsumerSecretKey"])
redirect_url = auth.get_authorization_url()
redirect_url
Now that we’ve given Tweepy our keys and generated an OAuthHandler, we use the handler to get a redirect URL. Open that URL in a browser and allow the app to access your account so you can use the API.
Once you’ve authorized your account with the app, you’ll be given a PIN. Pass that PIN to Tweepy to let it know that you’ve authorized it with the API.
pin = "XXXX"
auth.get_access_token(pin)
# build the API object we will use for searching below
api = tweepy.API(auth)
Searching for tweets
After getting the authorization, we can use it to search for all the tweets containing the term “British Airways”; we have restricted the maximum number of results to 100.
query = 'British Airways'
max_tweets = 100
searched_tweets = [status for status in tweepy.Cursor(api.search, q=query, tweet_mode='extended').items(max_tweets)]
search_dict = {"text": [], "author": [], "created_date": []}
for item in searched_tweets:
    # skip retweets; a retweeted status carries a retweeted_status attribute
    if not hasattr(item, "retweeted_status"):
        search_dict["text"].append(item.full_text)
        search_dict["author"].append(item.author.name)
        search_dict["created_date"].append(item.created_at)
df = pd.DataFrame.from_dict(search_dict)
df.head()
Out:
text author created_date
0 @RwandAnFlyer @KenyanAviation @KenyaAirways @U... Bkoskey 2019-03-06 10:06:14
1 @PaulCol56316861 Hi Paul, I'm sorry we can't c... British Airways 2019-03-06 10:06:09
2 @AmericanAir @British_Airways do you agree wit... Hat 2019-03-06 10:05:38
3 @Hi_Im_AlexJ Hi Alex, I'm glad you've managed ... British Airways 2019-03-06 10:02:58
4 @ZRHworker @British_Airways @Schmidy_87 @zrh_a... Stefan Paetow 2019-03-06 10:02:33
Language detection
The tweets downloaded by the code above can be in any language, so before we use this data for further text mining, we should classify it by performing language detection.
In general, language detection is performed by a pretrained text classifier based on either the Naive Bayes algorithm or more modern neural networks. Google’s Compact Language Detector (CLD) library is an excellent choice for production-level workloads where you have to analyze hundreds of thousands of documents in a few minutes. However, it’s a bit tricky to set up, and as a result a lot of people rely on calling a language detection API from third-party providers such as Algorithmia, which is free for hundreds of calls a month (sign-up required).
Let’s keep things simple in this example and just use a Python library called langid.py. It is orders of magnitude slower than the options discussed above, but that’s fine here since we only need to analyze about a hundred tweets.
from langid.langid import LanguageIdentifier, model
# load the model once, rather than on every call
identifier = LanguageIdentifier.from_modelstring(model, norm_probs=True)

def get_lang(document):
    # classify() returns a (language_code, probability) tuple
    lang, prob = identifier.classify(document)
    return lang
df["language"] = df["text"].apply(get_lang)
We find that there are tweets in four unique languages present in the output, and only 45 out of the 100 tweets are in English; we keep just those, as shown below.
df["language"].unique()
Out:
array(['en', 'rw', 'nl', 'es'], dtype=object)
df_filtered = df[df["language"] == "en"].copy()  # .copy() avoids SettingWithCopyWarning when we add columns later
df_filtered.shape
Out:
(45, 4)
Getting sentiment scores for tweets
We can take the df_filtered created in the preceding section and run it through a pretrained sentiment analysis library. For illustration purposes we are using the one in the TextBlob library; however, I would highly recommend using a more accurate sentiment model such as those in CoreNLP, or training your own model using scikit-learn or Keras.
Alternately, if you choose to go the API route, there is a pretty good sentiment API at Algorithmia.
from textblob import TextBlob
def get_sentiments(text):
    # bucket the polarity score into three coarse labels
    blob = TextBlob(text)
    if blob.sentiment.polarity > 0.1:
        return 'positive'
    elif blob.sentiment.polarity < -0.1:
        return 'negative'
    else:
        return 'neutral'

def get_sentiments_score(text):
    blob = TextBlob(text)
    return blob.sentiment.polarity
df_filtered["sentiments"] = df_filtered["text"].apply(get_sentiments)
df_filtered["sentiments_score"] = df_filtered["text"].apply(get_sentiments_score)
df_filtered.head()
Out:
text author created_date language sentiments sentiments_score
0 @British_Airways Having some trouble with our ... Rosie Smith 2019-03-06 10:24:57 en neutral 0.025
1 @djban001 This doesn't sound good, Daniel. Hav... British Airways 2019-03-06 10:24:45 en positive 0.550
2 First #British Airways Flight to #Pakistan Wil... Developing Pakistan 2019-03-06 10:24:43 en positive 0.150
3 I don’t know why he’s not happy. I thought he ... Joyce Stevenson 2019-03-06 10:24:18 en negative -0.200
4 Fancy winning a global holiday for you and a f... Selective Travel Mgt 🌍 2019-03-06 10:23:40 en positive 0.360
Let us plot the sentiment counts to see how many negative, neutral, and positive tweets people are sending about “British Airways”. You can also save the DataFrame as a CSV file for further processing at a later time.
g = sns.countplot(df_filtered["sentiments"])
loc, labels = plt.xticks()
g.set_xticklabels(labels, rotation=90)
g.set_ylabel("Count")
g.set_xlabel("Sentiment")
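The CSV export mentioned above is a one-liner with pandas. A quick sketch with a toy DataFrame standing in for df_filtered; the filename is just an illustration:

```python
import pandas as pd

# toy stand-in for the df_filtered built earlier in the post
df_filtered = pd.DataFrame({
    "text": ["sample tweet"],
    "sentiments": ["neutral"],
    "sentiments_score": [0.0],
})

# write without the index column so the file round-trips cleanly
df_filtered.to_csv("ba_tweets.csv", index=False)

# read it back to confirm the columns survived
df_roundtrip = pd.read_csv("ba_tweets.csv")
print(df_roundtrip.shape)  # (1, 3)
```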