Python and Sentiment Analysis
While looking for datasets to throw at sklearn, I came across the UCI Sentiment Labelled Sentences Data Set.
UCI provides positive/negative tagging on real-world data, drawn from three sources (Amazon, Yelp, and IMDB).
The only problem is that the format is a little strange: there is one raw, unstructured .txt file per source, and not every line is tagged with a sentiment.
To make the data easier to interact with, I generated a JSON file containing only the lines that carry a sentiment label. Go ahead and download it.
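If you want to regenerate that file yourself, a minimal sketch along these lines should do it. This is my own conversion script, not part of the original post; it assumes the three UCI .txt files sit in the current directory and that labelled lines end with a tab followed by 0 or 1, while untagged lines are skipped.

import json

# the three raw files from the UCI archive
sources = ['amazon_cells_labelled.txt', 'yelp_labelled.txt', 'imdb_labelled.txt']

records = []
for source in sources:
    with open(source) as handle:
        for line in handle:
            line = line.strip()
            # labelled lines end with a tab and a 0/1 flag; skip the rest
            if not line or '\t' not in line or line[-1] not in '01':
                continue
            text, result = line.rsplit('\t', 1)
            records.append({
                'result': int(result),
                'source': source,
                'label': 'positive' if result == '1' else 'negative',
                'text': text.strip(),
            })

with open('sentiment.json', 'w') as out:
    json.dump(records, out, indent=4)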
$ zcat sentiment.json.gz | head -n 25
[
{
"result": 0,
"source": "amazon_cells_labelled.txt",
"label": "negative",
"text": "So there is no way for me to plug it in here in the US unless I go by a converter."
},
{
"result": 1,
"source": "amazon_cells_labelled.txt",
"label": "positive",
"text": "Good case, Excellent value."
},
{
"result": 1,
"source": "amazon_cells_labelled.txt",
"label": "positive",
"text": "Great for the jawbone."
},
{
"result": 0,
"source": "amazon_cells_labelled.txt",
"label": "negative",
"text": "Tied to charger for conversations lasting more than 45 minutes.MAJOR PROBLEMS!!"
},
Let's jump into an IPython interpreter and load the data:
In [1]: import json
In [2]: raw = open('sentiment.json').read()
In [3]: data = json.loads(raw)
Now that we have the data as a list of Python dictionaries, we can create a DataFrame in the proper format. The helper functions used below (get_data_frame, get_train_test_data, test_predict, and predict) are all defined at the end of the post:
In [1]: data = get_data_frame('sentiment.json')
In [2]: data.shape
Out[2]: (3000, 2)
Next, let's split the full dataset into training and testing sets:
In [1]: train, test = get_train_test_data(data, size=0.2)
In [2]: train.shape
Out[2]: (2400, 2)
In [3]: test.shape
Out[3]: (600, 2)
We are now set to run a bit of accuracy testing:
In [1]: test_predict(train, test)
Out[1]: {
'test_score': 0.80000000000000004,
'train_score': 0.98375000000000001
}
We can slice our full dataset a few more times, just to make sure our accuracy test is... accurate. Each call to get_train_test_data draws a fresh random split, so if the model generalizes, the test score should stay roughly stable across runs:
In [1]: train, test = get_train_test_data(data, size=0.2)
In [2]: test_predict(train, test)
Out[2]: {
'test_score': 0.79000000000000004,
'train_score': 0.98416666666666663
}
In [3]: train, test = get_train_test_data(data, size=0.2)
In [4]: test_predict(train, test)
Out[4]: {
'test_score': 0.80666666666666664,
'train_score': 0.98291666666666666
}
In [5]: train, test = get_train_test_data(data, size=0.5)
In [6]: test_predict(train, test)
Out[6]: {
'test_score': 0.79466666666666663,
'train_score': 0.98999999999999999
}
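Repeatedly re-splitting by hand like this is essentially manual cross-validation. If you would rather let sklearn do the bookkeeping, a sketch like the one below should work; it reuses the steps pipeline defined at the end of the post and assumes a recent sklearn where cross_val_score lives in sklearn.model_selection.

from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

# five folds over the full labelled DataFrame
scores = cross_val_score(
    Pipeline(steps),
    data.text.values,
    data.classification.values,
    cv=5)
print(scores.mean(), scores.std())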
All that is left is to feed in the entire dataset and predict on new sentences:
In [1]: predict(data, 'This was the worst experience.')
Out[1]: [
(u'positive', 0.17704535094140364),
(u'negative', 0.82295464905859583)
]
In [2]: predict(data, 'The staff here was fabulous')
Out[2]: [
(u'negative', 0.20651083543376234),
(u'positive', 0.79348916456623764)
]
In [1]: predict(data, 'I hate you')
Out[1]: [
(u'positive', 0.22509671479185445),
(u'negative', 0.77490328520814555)
]
In [2]: predict(data, 'I love you')
Out[2]: [
(u'negative', 0.10593166714256422),
(u'positive', 0.89406833285743614)
]
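predict returns the class probabilities sorted from least to most likely, so if you only care about the winning label you can peel it off the end of the list. predict_label is my own name for a tiny wrapper, not something from the post:

def predict_label(data, text):
    # the last tuple is the most likely class; return just its label
    return predict(data, text)[-1][0]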
Let's put it all together by looking at the Python functions:
import json
import random
import string
from pandas import DataFrame, concat
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfTransformer, \
    CountVectorizer


# update the pipeline to get best test results!
steps = [
    ('count_vectorizer', CountVectorizer(
        stop_words='english', ngram_range=(1, 2))),
    ('tfidf_transformer', TfidfTransformer()),
    ('classifier', MultinomialNB())
]


def id_generator(size=6):
    """
    Return a random string to use as a DataFrame index.
    """
    chars = string.ascii_uppercase + string.digits
    return ''.join(random.choice(chars) for _ in range(size))


def get_data_frame(filename):
    """
    Read the sentiment JSON file and return a DataFrame
    with text and classification columns.
    """
    raw = dict(ids=[], text=[])
    # open file and read as json
    _data = open(filename)
    data = json.loads(_data.read())
    # loop over every labelled sentence in the json file
    for d in data:
        # keep the sentence text and its positive/negative label
        raw['text'].append(
            dict(text=d['text'], classification=d['label']))
        raw['ids'].append(
            id_generator(size=12))
    return DataFrame(raw['text'], index=raw['ids'])


def merge(*args):
    """
    Merge two or more DataFrames.
    """
    return concat(args)


def get_train_test_data(data, size=0.2):
    """
    Split DataFrame and return a training and testing set.
    """
    train, test = train_test_split(data, test_size=size)
    return train, test


def test_predict(train, test):
    """
    Fit the pipeline on the training data, then return
    accuracy scores for both the training and testing sets.
    """
    pipeline = Pipeline(steps)
    pipeline.fit(train.text.values, train.classification.values)
    train_score = pipeline.score(
        train.text.values, train.classification.values)
    test_score = pipeline.score(
        test.text.values, test.classification.values)
    return dict(train_score=train_score, test_score=test_score)


def predict(data, text):
    """
    Fit the pipeline on the full dataset, then return the
    class probabilities for a new sentence, least likely first.
    """
    pipeline = Pipeline(steps)
    pipeline.fit(data.text.values, data.classification.values)
    res = zip(pipeline.classes_, pipeline.predict_proba([text])[0])
    return sorted(res, key=lambda x: x[1])
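The comment above the steps list invites you to experiment with the pipeline. As one example (my own, not from the original post), swapping MultinomialNB for a linear model is a small change, assuming sklearn's LogisticRegression:

from sklearn.linear_model import LogisticRegression

# same vectorizer and tf-idf stages, different classifier
steps = [
    ('count_vectorizer', CountVectorizer(
        stop_words='english', ngram_range=(1, 2))),
    ('tfidf_transformer', TfidfTransformer()),
    ('classifier', LogisticRegression())
]

Since every helper rebuilds Pipeline(steps), re-running test_predict after changing steps is all it takes to compare classifiers.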