Scrape Twitter data with Python

A complete guide on collecting and storing Tweets.

A couple of weeks back, I was working on a project that required me to analyze data from Twitter.

After a quick Google search, I realized that the most popular way to do this was with the Twitter API.

The usual way to call this API from Python is a library called Tweepy, and there are various levels of API access you can get depending on what you want to use it for.

However, going through the API with Tweepy has its limitations. First, you will need to create a Twitter Developer Account and apply for API access. You'll need to answer a series of questions to do this, which is time-consuming.

Even once you get approved, there is a limit on the number of Tweets you can scrape.

To get around this, I started looking at alternatives to Tweepy.


Twint

Twint is an advanced Twitter scraping tool written in Python that allows for scraping Tweets from Twitter profiles without using Twitter's API.

While the Twitter API only returns the 3,200 most recent Tweets from a user's timeline, Twint imposes no such cap.

It is very quick to set up, and you don't need any kind of authentication or access permission.

Start scraping

First, install the Twint library:

pip install twint

Then, run the following lines of code to scrape Tweets related to a topic. In this case, I'm going to scrape up to 500 Tweets that mention Taylor Swift:

import twint

c = twint.Config()

c.Search = "Taylor Swift"               # search query, passed as a string
c.Limit = 500                           # maximum number of Tweets to scrape
c.Store_csv = True                      # store Tweets in a CSV file
c.Output = "taylor_swift_tweets.csv"    # path to the CSV file

twint.run.Search(c)

Finally, all you need to do is read the .csv file back into a data frame:

import pandas as pd
df = pd.read_csv('taylor_swift_tweets.csv')

Taking a look at the head of the data frame, we see one row per Tweet, with each attribute of the Tweet in its own column.

The contents of all Tweets are stored in the 'tweet' column:

df['tweet']

Running the above line of code displays the text of every scraped Tweet.
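If you would rather work with a plain Python list than a pandas column, the same text can be pulled out with the standard csv module. This is a minimal sketch, not part of Twint: read_tweet_texts is a hypothetical helper, and the sample rows are made up for illustration (they are not real Tweets). It assumes only that the file has a 'tweet' column, as described above.

```python
import csv
import io


def read_tweet_texts(csv_file, column="tweet"):
    """Return the values of one column from an open CSV file as a list of strings."""
    reader = csv.DictReader(csv_file)
    return [row[column] for row in reader]


# Made-up sample standing in for taylor_swift_tweets.csv (not real Tweets).
sample = io.StringIO(
    "id,username,tweet\n"
    "1,user_a,First example text\n"
    '2,user_b,"Second example, with a comma"\n'
)

texts = read_tweet_texts(sample)
print(len(texts), "Tweets loaded")
```

To use it on the scraped file, open it with open("taylor_swift_tweets.csv", newline="", encoding="utf-8") and pass the file object to read_tweet_texts.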


The code above just scratches the surface of what you can do with Twint.

You can tailor the output to your needs (for example, by filtering Tweets by time frame or language).
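As a rough sketch of what such filtering could look like: to my understanding, Twint's configuration object accepts Since and Until (date strings such as "2020-01-01") and Lang (a language code such as "en"), but check these attribute names against the documentation for your installed version. The apply_filters helper below is hypothetical, and a stand-in object is used so the example runs without Twint installed.

```python
from types import SimpleNamespace


def apply_filters(config, since=None, until=None, lang=None):
    """Set optional date-range and language filters on a Twint-style config object."""
    if since is not None:
        config.Since = since    # assumed attribute: start date, "YYYY-MM-DD"
    if until is not None:
        config.Until = until    # assumed attribute: end date, "YYYY-MM-DD"
    if lang is not None:
        config.Lang = lang      # assumed attribute: language code, e.g. "en"
    return config


# Stand-in for twint.Config() so the sketch runs on its own.
c = apply_filters(SimpleNamespace(), since="2020-01-01", until="2020-12-31", lang="en")
print(c.Since, c.Until, c.Lang)
```

With Twint installed, you would pass twint.Config() instead of the stand-in, set Search, Limit, and the output options as before, and then call twint.run.Search(c).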

Check out the Twint documentation to learn more about its features.