Python and sentiment analysis

While looking for datasets to throw at sklearn, I came across the UCI Sentiment Labelled Sentences Data Set.

UCI provides positive/negative tagging on real-world data; the data comes from three sources (Amazon, Yelp, and IMDB).

The only problem is that the format is a little strange: there is a raw, unstructured .txt file for each source, and not every line is tagged with sentiment.

To make the data easier to interact with, I generated a JSON file containing only the entries tagged with sentiment. Go ahead and download it.

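If you would rather rebuild the file yourself, here is a rough sketch of how such a conversion might look. It assumes each raw .txt file holds one tab-separated sentence/score pair per line and skips anything untagged; the filenames and field names simply mirror the sample output below.

import gzip
import json

# the three raw files from the UCI archive (names assumed from the dataset)
sources = ['amazon_cells_labelled.txt', 'imdb_labelled.txt', 'yelp_labelled.txt']
labels = {'0': 'negative', '1': 'positive'}

results = []
for source in sources:
    with open(source) as handle:
        for line in handle:
            # expect "sentence<TAB>score"; skip any line without a 0/1 score
            parts = line.rstrip('\n').rsplit('\t', 1)
            if len(parts) != 2 or parts[1] not in labels:
                continue
            text, score = parts
            results.append(dict(
                result=int(score),
                source=source,
                label=labels[score],
                text=text.strip()))

with gzip.open('sentiment.json.gz', 'wt') as out:
    json.dump(results, out, indent=1)
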
$ zcat sentiment.json.gz | head -n 25
[
 {
  "result": 0,
  "source": "amazon_cells_labelled.txt",
  "label": "negative",
  "text": "So there is no way for me to plug it in here in the US unless I go by a converter."
 },
 {
  "result": 1,
  "source": "amazon_cells_labelled.txt",
  "label": "positive",
  "text": "Good case, Excellent value."
 },
 {
  "result": 1,
  "source": "amazon_cells_labelled.txt",
  "label": "positive",
  "text": "Great for the jawbone."
 },
 {
  "result": 0,
  "source": "amazon_cells_labelled.txt",
  "label": "negative",
  "text": "Tied to charger for conversations lasting more than 45 minutes.MAJOR PROBLEMS!!"
 },

Let's jump into an IPython interpreter and load the data:

In [1]: import json

In [2]: raw = open('sentiment.json').read()

In [3]: records = json.loads(raw)

Now that we have the data as a list of dictionaries, we can create a DataFrame in the proper format, using the get_data_frame helper defined at the end of this post:

In [1]: data = get_data_frame('sentiment.json')

In [2]: data.shape
Out[2]: (3000, 2)

Next, let's split our full dataset into training and testing sets:

In [1]: train, test = get_train_test_data(data, size=0.2)

In [2]: train.shape
Out[2]: (2400, 2)

In [3]: test.shape
Out[3]: (600, 2)

We are now set to run a bit of accuracy testing:

In [1]: test_predict(train, test)
Out[1]: {
  'test_score': 0.80000000000000004,
  'train_score': 0.98375000000000001
}

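The gap between the training score (~0.98) and the test score (~0.80) suggests a bit of overfitting. If you are curious what the model latched onto, here is a small sketch (not part of the original helpers) that peeks at the vocabulary terms with the highest log probability under each class; it reuses steps and train from this post and assumes a reasonably recent scikit-learn.

import numpy as np

from sklearn.pipeline import Pipeline

pipeline = Pipeline(steps)
pipeline.fit(train.text.values, train.classification.values)

vectorizer = pipeline.named_steps['count_vectorizer']
classifier = pipeline.named_steps['classifier']

# get_feature_names_out() on newer scikit-learn; get_feature_names() on older releases
features = np.asarray(vectorizer.get_feature_names_out())

# feature_log_prob_ has one row per class and one column per vocabulary term
for index, label in enumerate(classifier.classes_):
    top = np.argsort(classifier.feature_log_prob_[index])[-10:]
    print(label, features[top])
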
We can slice our full dataset a few more times, just to make sure our accuracy test is... accurate:

In [1]: train, test = get_train_test_data(data, size=0.2)

In [2]: test_predict(train, test)
Out[2]: {
  'test_score': 0.79000000000000004,
  'train_score': 0.98416666666666663
}

In [3]: train, test = get_train_test_data(data, size=0.2)

In [4]: test_predict(train, test)
Out[4]: {
  'test_score': 0.80666666666666664,
  'train_score': 0.98291666666666666
}

In [5]: train, test = get_train_test_data(data, size=0.5)

In [6]: test_predict(train, test)
Out[6]: {
  'test_score': 0.79466666666666663,
  'train_score': 0.98999999999999999
}

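Repeating manual splits like this is essentially hand-rolled cross-validation, and scikit-learn can automate it. As a rough sketch (using the modern sklearn.model_selection module; older releases shipped this under sklearn.cross_validation), something like the following reports a score per fold:

from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

# five rounds, each holding out a different fifth of the labelled sentences
scores = cross_val_score(
    Pipeline(steps), data.text.values, data.classification.values, cv=5)
print(scores.mean(), scores.std())
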
All that is left is to feed in the entire dataset and predict on new sentences:

In [1]: predict(data, 'This was the worst experience.')
Out[1]: [
  (u'positive', 0.17704535094140364),
  (u'negative', 0.82295464905859583)
]

In [2]: predict(data, 'The staff here was fabulous')
Out[2]: [
  (u'negative', 0.20651083543376234),
  (u'positive', 0.79348916456623764)
]

In [1]: predict(data, 'I hate you')
Out[1]: [
  (u'positive', 0.22509671479185445),
  (u'negative', 0.77490328520814555)
]

In [2]: predict(data, 'I love you')
Out[2]: [
  (u'negative', 0.10593166714256422),
  (u'positive', 0.89406833285743614)
]

Let's put it all together by looking at the Python functions:

import json
import random
import string

from pandas import DataFrame, concat

from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
# on scikit-learn releases older than 0.18, use sklearn.cross_validation instead
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfTransformer, \
    CountVectorizer


# tweak the pipeline steps to improve the test results!
steps = [
    ('count_vectorizer', CountVectorizer(
        stop_words='english', ngram_range=(1, 2))),
    ('tfidf_transformer', TfidfTransformer()),
    ('classifier', MultinomialNB())
]


def id_generator(size=6):
    """
    Return a random string of uppercase letters and digits.
    """
    chars = string.ascii_uppercase + string.digits
    return ''.join(random.choice(chars) for _ in range(size))


def get_data_frame(filename):
    """
    Read the sentiment JSON file and return a DataFrame.
    """

    raw = dict(ids=[], text=[])

    # open the file and parse it as json
    with open(filename) as handle:
        data = json.loads(handle.read())

    # loop over every labelled sentence in the json file
    for d in data:

        # store the text / label pair, keyed by a random id
        raw['text'].append(
            dict(text=d['text'], classification=d['label']))
        raw['ids'].append(
            id_generator(size=12))

    return DataFrame(raw['text'], index=raw['ids'])


def merge(*args):
    """
    Merge two or more DataFrames.
    """
    return concat(args)


def get_train_test_data(data, size=0.2):
    """
    Split the DataFrame and return a training and a testing set.
    """
    train, test = train_test_split(data, test_size=size)
    return train, test


def test_predict(train, test):
    """
    Fit the pipeline on the training data, then return
    accuracy scores for both the training and testing sets.
    """

    pipeline = Pipeline(steps)
    pipeline.fit(train.text.values, train.classification.values)

    train_score = pipeline.score(
        train.text.values, train.classification.values)
    test_score = pipeline.score(
        test.text.values, test.classification.values)

    return dict(train_score=train_score, test_score=test_score)


def predict(data, text):
    """
    Fit the pipeline on the full dataset, then return the class
    probabilities for a new sentence, sorted lowest to highest.
    """

    pipeline = Pipeline(steps)
    pipeline.fit(data.text.values, data.classification.values)

    res = zip(pipeline.classes_, pipeline.predict_proba([text])[0])
    return sorted(res, key=lambda x: x[1])
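
One closing note on the design: predict() re-fits the whole pipeline on every call, which is fine for a 3,000-row dataset but wasteful if you want to score many new sentences. A possible refinement, sketched below with the joblib package (older scikit-learn bundled it as sklearn.externals.joblib), is to fit once and persist the fitted pipeline:

from joblib import dump, load

from sklearn.pipeline import Pipeline

# fit once on the full dataset and save the fitted pipeline;
# 'sentiment.pipeline' is just an illustrative filename
pipeline = Pipeline(steps)
pipeline.fit(data.text.values, data.classification.values)
dump(pipeline, 'sentiment.pipeline')

# later, load it back and predict without re-fitting
pipeline = load('sentiment.pipeline')
print(pipeline.predict(['This was the worst experience.']))
print(pipeline.predict_proba(['The staff here was fabulous']))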