Python and sentiment analysis

While looking for datasets to throw at sklearn, I came across the UCI Sentiment Labelled Sentences Data Set.

UCI provides positive/negative tagging on real-world data drawn from three sources: Amazon, Yelp, and IMDB.

The only problem is that the format is a little strange. There is one .txt file per source, the contents are raw and unstructured, and not every line is tagged with sentiment.

To make the data easier to work with, I generated a JSON file containing only the entries that are tagged with sentiment. Go ahead and download it.
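
For anyone who wants to rebuild such a file from the raw .txt files, here is a minimal sketch of the conversion. It is not the script originally used. It assumes each tagged line is a sentence, a tab, and then 0 or 1, and it skips everything else. The name `parse_labelled_lines` is hypothetical; the output fields (result, source, label, text) mirror the records in the JSON file.

```python
import json

LABELS = {0: "negative", 1: "positive"}


def parse_labelled_lines(lines, source):
    """Turn 'sentence<TAB>0|1' lines into dicts; skip untagged lines."""
    records = []
    for line in lines:
        # Split on the last tab so tabs inside the sentence are preserved.
        text, sep, score = line.rstrip("\n").rpartition("\t")
        score = score.strip()
        # Keep only lines that carry a 0/1 sentiment score after a tab.
        if not sep or score not in ("0", "1"):
            continue
        result = int(score)
        records.append({
            "result": result,
            "source": source,
            "label": LABELS[result],
            "text": text.strip(),
        })
    return records


sample = [
    "Good case, Excellent value.\t1\n",
    "Great for the jawbone.\t1\n",
    "a line with no sentiment tag\n",
]
records = parse_labelled_lines(sample, "amazon_cells_labelled.txt")
print(json.dumps(records, indent=1))
```

Running the same function over the lines of each source file and dumping the combined list with `json.dump` would give one file covering all three sources.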

$ zcat sentiment.json.gz | head -n 25
[
 {
  "result": 0,
  "source": "amazon_cells_labelled.txt",
  "label": "negative",
  "text": "So there is no way for me to plug it in here in the US unless I go by a converter."
 },
 {
  "result": 1,
  "source": "amazon_cells_labelled.txt",
  "label": "positive",
  "text": "Good case, Excellent value."
 },
 {
  "result": 1,
  "source": "amazon_cells_labelled.txt",
  "label": "positive",
  "text": "Great for the jawbone."
 },
 {
  "result": 0,
  "source": "amazon_cells_labelled.txt",
  "label": "negative",
  "text": "Tied to charger for conversations lasting more than 45 minutes.MAJOR PROBLEMS!!"
 },

Let's jump into an IPython interpreter and load the data (after decompressing the download to sentiment.json):

In [1]: import json

In [2]: raw = open('sentiment.json').read()
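
Reading the file by hand like this leaves it unparsed and never closes the file handle. As a hedged alternative, the sketch below parses the file in one step and also accepts the compressed download directly. The helper name `load_records` is hypothetical and not part of the functions listed at the end of this post.

```python
import gzip
import json


def load_records(path):
    """Parse a JSON file, transparently handling a .gz suffix."""
    opener = gzip.open if path.endswith(".gz") else open
    # "rt" makes gzip.open return text, matching the built-in open().
    with opener(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)
```

With the download in the working directory, `load_records('sentiment.json.gz')` would return a list of dictionaries, one per labelled sentence.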

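Parsing that text with json.loads gives a Python list of dictionaries. The get_data_frame helper (listed at the end of this post) does the parsing for us and builds a DataFrame in the proper format: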

In [1]: data = get_data_frame('sentiment.json')

In [2]: data.shape
Out[2]: (3000, 2)

Next, let's split the full dataset into a training set and a testing set:

In [1]: train, test = get_train_test_data(data, size=0.2)

In [2]: train.shape
Out[2]: (2400, 2)

In [3]: test.shape
Out[3]: (600, 2)
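
To show where the 2400/600 row counts come from, here is a simplified pure-Python sketch of a shuffled split. It only illustrates the idea and is not how sklearn implements train_test_split. The name `split_indices` and the `seed` parameter are hypothetical.

```python
import random


def split_indices(n_rows, test_size=0.2, seed=None):
    """Shuffle row indices and cut them into (train, test) lists."""
    indices = list(range(n_rows))
    # A seeded generator makes the shuffle repeatable between runs.
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_rows * test_size))
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(3000, test_size=0.2, seed=0)
print(len(train_idx), len(test_idx))  # 2400 600
```

Because the rows are shuffled before the cut, every call without a fixed seed produces a different split, which is why the scores further down vary slightly from run to run.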

We are now set to run a bit of accuracy testing:

In [1]: test_predict(train, test)
Out[1]: {
  'test_score': 0.80000000000000004,
  'train_score': 0.98375000000000001
}
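
For a classifier, a pipeline's score method reports mean accuracy: the fraction of sentences whose predicted label matches the true one. The sketch below spells that out in plain Python 3. The `accuracy` helper and the made-up label lists are for illustration only.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# A test_score of 0.8 on 600 rows corresponds to 480 correct predictions.
y_true = ["positive"] * 600
y_pred = ["positive"] * 480 + ["negative"] * 120
print(accuracy(y_true, y_pred))  # 0.8
```

The gap between the training score (about 0.98) and the test score (0.80) is worth keeping in mind: the model fits the sentences it has seen much better than new ones.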

We can re-split the full dataset a few more times, just to make sure our accuracy test is... accurate. The last run holds out half of the data instead of 20%:

In [1]: train, test = get_train_test_data(data, size=0.2)

In [2]: test_predict(train, test)
Out[2]: {
  'test_score': 0.79000000000000004,
  'train_score': 0.98416666666666663
}

In [3]: train, test = get_train_test_data(data, size=0.2)

In [4]: test_predict(train, test)
Out[4]: {
  'test_score': 0.80666666666666664,
  'train_score': 0.98291666666666666
}

In [5]: train, test = get_train_test_data(data, size=0.5)

In [6]: test_predict(train, test)
Out[6]: {
  'test_score': 0.79466666666666663,
  'train_score': 0.98999999999999999
}
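
One way to summarize the repeated runs is to average them. The sketch below takes the test scores of the three 20% splits reported in this post and computes their mean and sample standard deviation with the standard library. The 50% split is left out because it uses a different amount of training data.

```python
from statistics import mean, stdev

# test_score values from the three 20% splits reported in the sessions
test_scores = [0.80000000000000004, 0.79000000000000004, 0.80666666666666664]

print(round(mean(test_scores), 4))   # 0.7989
print(round(stdev(test_scores), 4))  # 0.0084
```

Three runs are a small sample, so this is only a rough indication that the test accuracy sits at about 80% regardless of how the rows are shuffled.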

All that is left is to feed in the entire dataset and predict on new sentences:

In [1]: predict(data, 'This was the worst experience.')
Out[1]: [
  (u'positive', 0.17704535094140364),
  (u'negative', 0.82295464905859583)
]

In [2]: predict(data, 'The staff here was fabulous')
Out[2]: [
  (u'negative', 0.20651083543376234),
  (u'positive', 0.79348916456623764)
]

In [1]: predict(data, 'I hate you')
Out[1]: [
  (u'positive', 0.22509671479185445),
  (u'negative', 0.77490328520814555)
]

In [2]: predict(data, 'I love you')
Out[2]: [
  (u'negative', 0.10593166714256422),
  (u'positive', 0.89406833285743614)
]
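
Each result is a list of (label, probability) pairs sorted from least to most likely, and the two probabilities add up to 1. If only the winning label is needed, a small helper can pick it out. The name `top_label` is hypothetical, and the sample values are copied from the 'I love you' output.

```python
def top_label(result):
    """Return the (label, probability) pair with the highest probability."""
    return max(result, key=lambda pair: pair[1])


# Output of predict(data, 'I love you') from the session
result = [
    ("negative", 0.10593166714256422),
    ("positive", 0.89406833285743614),
]
label, probability = top_label(result)
print(label)  # positive
```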

Let's put it all together by looking at the Python functions:


import re
import json
import random
import string

from pandas import DataFrame, concat

from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import NearestNeighbors
# train_test_split lived in sklearn.cross_validation before scikit-learn 0.18
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfTransformer, \
    CountVectorizer

# update the pipeline to get best test results!
steps = [
    ('count_vectorizer', CountVectorizer(
        stop_words='english', ngram_range=(1, 2))),
    ('tfidf_transformer', TfidfTransformer()),
    ('classifier', MultinomialNB())
]


def id_generator(size=6):
    """
    Return random string.
    """
    chars = string.ascii_uppercase + string.digits
    return ''.join(random.choice(chars) for _ in range(size))


def get_data_frame(filename):
    """
    Read a JSON file of labelled sentences and return DataFrame.
    """
    raw = dict(ids=[], text=[])

    # open file and read as json
    with open(filename) as _data:
        data = json.loads(_data.read())

    # loop over all records in json file
    for d in data:

        # update raw lists with record values; giving each row a random
        # id from id_generator is an assumption about the intended index
        raw['text'].append(
            dict(text=d['text'], classification=d['label']))
        raw['ids'].append(id_generator())

    return DataFrame(raw['text'], index=raw['ids'])


def merge(*args):
    """
    Merge two or more DataFrames.
    """
    return concat(args)


def get_train_test_data(data, size=0.2):
    """
    Split DataFrame and return a training and testing set.
    """
    train, test = train_test_split(data, test_size=size)
    return train, test


def test_predict(train, test):
    """
    Run predictions on training and test data,
    then return scores.
    """
    # fit the pipeline on the training rows only
    pipeline = Pipeline(steps)
    pipeline.fit(train.text.values, train.classification.values)

    train_score = pipeline.score(
        train.text.values, train.classification.values)
    test_score = pipeline.score(
        test.text.values, test.classification.values)

    return dict(train_score=train_score, test_score=test_score)


def predict(data, text):
    """
    Train on the full DataFrame and return (label, probability)
    pairs for text, sorted from least to most likely.
    """
    pipeline = Pipeline(steps)
    pipeline.fit(data.text.values, data.classification.values)

    res = zip(pipeline.classes_, pipeline.predict_proba([text])[0])
    return sorted(res, key=lambda x: x[1])
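
To make the first pipeline step less of a black box, here is a rough pure-Python sketch of the features that ngram_range=(1, 2) produces: every single word plus every pair of adjacent words. It reuses CountVectorizer's default token pattern (lowercased words of two or more characters) but deliberately leaves out the stop_words='english' filtering, so it is an approximation and not the library's implementation. The function name `word_ngrams` is hypothetical.

```python
import re

# Same pattern as CountVectorizer's default token_pattern:
# words of two or more alphanumeric characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def word_ngrams(text, ngram_range=(1, 2)):
    """List the lowercase word n-grams of text for each n in ngram_range."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    low, high = ngram_range
    ngrams = []
    for n in range(low, high + 1):
        # Slide a window of n tokens across the sentence.
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams


print(word_ngrams("Good case, Excellent value."))
# ['good', 'case', 'excellent', 'value',
#  'good case', 'case excellent', 'excellent value']
```

The pipeline counts these features per sentence, reweights the counts with TF-IDF, and hands the result to the naive Bayes classifier.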



  1. Hi Nessy, thanks for the JSON file.
    Can u please describe how u converted the text file to the JSON file? I was stuck on the same thing. It would be very helpful if you would describe the method or mail me the code.

    1. Hello Aditya,

      I'm sorry, but I no longer have the script I used. If I recall correctly, all I did was use Python and the json library.

  2. Hi nessy,
    I am getting this error:
    TypeError: predict_proba() missing 1 required positional argument: 'X'
    Please help me with this issue.

