Introduction to Data Science - Twitter

I started taking an online MOOC course about Big Data. I plan on uploading my code and thoughts on each project as I proceed through the course. You can follow me on GitHub at @nikhilsaraf and Twitter at @nikhilsaraf9.

I am just about to begin the first assignment. It involves parsing Twitter data using Python and a supplied Tweet downloader. The assignment is due in two days; I am a Java developer and very new to Python. I've gone through a few lectures, but I need to start the assignment now or I won't be able to finish it. I hope this assignment isn't any harder than what I'm used to, but I'm about to find out...

The assignment recommends taking a Python tutorial from Codecademy or Google's Python Class, but it's 4:30 p.m. on Saturday and I don't have time for that. I think I'm just going to try and wing it. I've already downloaded Python version 2.7.1; here's the download link for any readers who want it.

As I am using GitHub for this project, I want to be able to push code to it as efficiently as possible. I have used Bitbucket more often in the past, so I did not have credentials set up in my git client for my GitHub account, and GitHub kept asking for my username and password on every push. To save time in the long run, I've decided to preauthenticate my computer for code checkins to GitHub. See how to set up GitHub so it doesn't constantly ask for your username and password here.

Now that I've organized this blog and set up my git credentials, I'm finally getting started with my first mini Python script:

import urllib
import json

# base URL of the old Twitter Search API (Python 2.7)
for page in range(1, 11):
    raw_response = urllib.urlopen("http://search.twitter.com"
                                  + "/search.json?q=Twitter&page=" + str(page))
    json_object = json.load(raw_response)
    print "Page: " + str(page)
    for tweet in json_object['results']:
        print tweet['text']
        print ""
$> python | less
This script fetches the first 10 pages of tweets that contain the word "twitter" and is really just a warm-up script.
Now to move on to creating my Twitter application. I've chosen to call it "Top 1% Tweet Feeder"; I think that's a good name for this project.
I need to copy my Twitter credentials into the file so that my application can be authenticated to access Twitter data. Having just done that, I am now running the script. This file will be used for the life of the assignment, and it's suggested to run it for 10 minutes; the first 20 lines of the output need to be submitted for the assignment. Let's whip up a quick and dirty bash command to help with the timing.
$> echo $(date); echo "starting twitterstream download..."; python > nikhilsaraf_output.txt; echo "...finished twitterstream download. see nikhilsaraf_output.txt"; echo $(date); echo "Now getting first 20 lines..."; head -n 20 nikhilsaraf_output.txt > output.txt; echo "...finished getting first 20 lines. see output.txt";
Let's see how many tweets we got in the output file after 15 minutes of downloading:
$> wc -l nikhilsaraf_output.txt
   57058 nikhilsaraf_output.txt


The first part of the assignment asks for the sentiment of each tweet based on a provided dictionary file. This was quite simple: primarily a for loop that computed the total sentiment for each tweet after the text was retrieved using the correct JSON key.
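That loop can be sketched as follows. The function and variable names here are mine, and the sample scores only mimic the style of the provided dictionary file:

```python
def sentiment_of_tweet(text, scores):
    """Sum the sentiment scores of every known word in the tweet;
    unknown words contribute 0."""
    return sum(scores.get(word, 0) for word in text.split())

# hypothetical entries in the style of the provided sentiment file
scores = {"good": 3, "bad": -3, "happy": 3}
print(sentiment_of_tweet("happy good day", scores))  # 6
```

The real assignment loads the scores from a tab-separated file and pulls the tweet text out of each JSON object first, but the scoring step is just this lookup-and-sum.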
The second part involved inferring the sentiment value of unknown words, i.e. words for which the sentiment value was not already defined in the dictionary file. A word's sentiment value was defined as the total sentiment of the tweets it was found in, averaged over all those tweets. This involved maintaining a dictionary of all the new "derived" words; once I had gone through all the tweets, a final step computed the sentiment value for each derived word. This took a little trial and error depending on how I chose to split up a word, since I had to make decisions like whether a user "@username" counts as a word or should be treated as "username". The course allows one to be flexible about this, and I chose to keep the "@" symbol. This allowed me to see the output for the most positively and negatively sentimented users. The same applied to hashtags.
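A minimal sketch of the derivation step, under the same assumptions as above (names are mine; real input would be tweet JSON, not bare strings):

```python
from collections import defaultdict

def derive_sentiments(tweets, scores):
    """For each word not already in `scores`, average the total
    sentiment of every tweet the word appears in."""
    totals = defaultdict(float)  # sum of tweet sentiments per new word
    counts = defaultdict(int)    # number of tweets the word appeared in
    for text in tweets:
        words = text.split()
        tweet_sentiment = sum(scores.get(w, 0) for w in words)
        for w in words:
            if w not in scores:
                totals[w] += tweet_sentiment
                counts[w] += 1
    return {w: totals[w] / counts[w] for w in totals}

scores = {"good": 3, "bad": -3}
tweets = ["good vibes", "bad vibes", "good day"]
derived = derive_sentiments(tweets, scores)
# "vibes" appears in a +3 tweet and a -3 tweet, so it averages to 0.0;
# "day" appears only in a +3 tweet, so it gets 3.0
```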
The third part was quite straightforward and built upon the logic of the second. It required computing the frequency of each term across all tweets. A detail not to be missed here is that the divisor is the total number of term occurrences encountered, not the number of distinct terms. Interestingly enough, the most frequently occurring word turned out to be RT.
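To make that divisor point concrete, here is a sketch (again with my own names and toy input):

```python
from collections import Counter

def term_frequencies(tweets):
    """Frequency of a term = its occurrences divided by the total
    number of term occurrences across ALL tweets."""
    counts = Counter(w for text in tweets for w in text.split())
    total = sum(counts.values())  # every occurrence, not distinct terms
    return {w: counts[w] / float(total) for w in counts}

freqs = term_frequencies(["RT hello", "RT world"])
print(freqs["RT"])  # 0.5 -- two occurrences out of four total
```

Had I divided by the number of distinct terms (three here) instead of total occurrences (four), "RT" would have come out at 0.667, which is the mistake the assignment warns about.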
The fourth part was a little more tedious, but the most interesting. It involved finding the happiest state based on the tweets. The course disallowed solutions from using the internet during this computation, so we were unable to use the coordinates supplied in the tweet metadata. Twitter offers a "place" JSON key that provides location data in the form of text. I parsed this information for state abbreviations in each tweet, discarding invalid entries, and combined it with each tweet's sentiment value, giving the states a total sentiment value.
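A rough sketch of that accumulation, assuming each tweet has already been scored (the `"sentiment"` field and the tiny state set are mine for illustration; the real code carries all 50 abbreviations):

```python
from collections import defaultdict

STATES = {"NV", "TX", "HI", "CA", "NY"}  # in practice, all 50 abbreviations

def state_sentiments(tweets):
    """Add each tweet's sentiment to the state abbreviation parsed from
    its place text, skipping tweets without a recognizable state."""
    totals = defaultdict(float)
    for tweet in tweets:
        place = tweet.get("place") or {}
        full_name = place.get("full_name", "")
        # Twitter's place full_name often looks like "Las Vegas, NV"
        abbrev = full_name.split(",")[-1].strip().upper()
        if abbrev in STATES:
            totals[abbrev] += tweet["sentiment"]
    return dict(totals)

tweets = [
    {"place": {"full_name": "Las Vegas, NV"}, "sentiment": 4},
    {"place": {"full_name": "Austin, TX"}, "sentiment": -2},
    {"place": None, "sentiment": 5},  # discarded: no usable location
]
print(state_sentiments(tweets))  # NV totals 4.0, TX totals -2.0
```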
The result turned out that Nevada was the happiest state, and Texas was the saddest state. I am not convinced by these results and am currently downloading a larger dataset (and will potentially modify the code to read coordinate data) in order to rerun the computation. A similar study was performed at a slightly larger scale by the University of Vermont, showing that Hawaii is the happiest state based on tweets.
The last part of the first assignment was to find the top ten hashtags used. This was really simple: I parsed out the "entities" key, followed by the hashtags, applied the same algorithm to track the sentiment for each hashtag, and finally computed the top ten from the dictionary of hashtags.
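The counting half of that can be sketched like this; the nested "entities" → "hashtags" → "text" structure matches Twitter's tweet JSON, while the function name and toy tweets are mine:

```python
from collections import Counter

def top_hashtags(tweets, n=10):
    """Count hashtag usage via each tweet's 'entities' key and
    return the n most-used hashtags with their counts."""
    counts = Counter()
    for tweet in tweets:
        for tag in tweet.get("entities", {}).get("hashtags", []):
            counts[tag["text"]] += 1
    return counts.most_common(n)

tweets = [
    {"entities": {"hashtags": [{"text": "python"}, {"text": "data"}]}},
    {"entities": {"hashtags": [{"text": "python"}]}},
]
print(top_hashtags(tweets, 2))  # [('python', 2), ('data', 1)]
```

Tracking sentiment per hashtag works the same way as the per-user tracking in part two, just keyed on the hashtag text instead.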
This pretty much wraps up the first assignment working with Twitter data. It was an interesting learning experience, and I'm currently in the process of downloading more Twitter data to see if I can get anything more useful out of it. Maybe trend on some hot topics being discussed, or just rerun these queries on more data. Please leave any comments or thoughts after reading!