Test Your Machine Learning
In my previous post " Python Machine Learning with Presidential Tweets “, I started messing around with sklearn and text classification.
Since then I've discovered a great tutorial from SciPy 2015. The video starts out slow enough for novices, and a recurring theme is testing your datasets.
After watching a good chunk of the video, I decided to go back to my code and implement a testing phase. Basically, I'll split my data into two pieces: a training set and a testing set. After training on the training set, I can feed in the testing set and score its predictions for accuracy.
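If that pattern is new to you, here's roughly its shape in scikit-learn. This is a toy sketch with made-up tweets and labels, not the real code from this post (that's at the bottom):
[python title="Train/Test Pattern (toy sketch)"]
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

# made-up tweets and labels, just to show the pattern
texts = ['make america great again', 'stronger together',
         'build the wall', 'love trumps hate',
         'crooked media', 'fighting for families']
labels = ['realDonaldTrump', 'HillaryClinton',
          'realDonaldTrump', 'HillaryClinton',
          'realDonaldTrump', 'HillaryClinton']

# hold out 25% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25)

model = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', MultinomialNB()),
])
model.fit(X_train, y_train)         # learn only from the training set
print(model.score(X_test, y_test))  # grade only on the held-out set
[/python]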
Let's look at this a couple of blocks at a time.
First, I'll load my Twitter data into pandas DataFrames:
In [1]: t = get_data_frame('realDonaldTrump')
In [2]: len(t)
Out[2]: 3100
In [3]: h = get_data_frame('HillaryClinton')
In [4]: len(h)
Out[4]: 2605
These DataFrames include the text (the tweet) and the classification (the Twitter username).
In [1]: t.text.values[0]
Out[1]: u'Thank you Council Bluffs, Iowa! Will be back soon. Remember- everything you need to know about Hillary -- just\u2026 https://t.co/45kIHxdX83'
In [2]: t.classification.values[0]
Out[2]: 'realDonaldTrump'
In [3]: h.text.values[0]
Out[3]: u'"If a candidate regularly and flippantly makes cruel and insulting comments about women...that\'s who that candidate\u2026 https://t.co/uOdGkhKWRY'
In [4]: h.classification.values[0]
Out[4]: 'HillaryClinton'
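If you're newer to pandas, the get_data_frame helper (full listing at the bottom of this post) builds each of these from a list of dicts plus an index of ids. Here's a stand-in sketch with fabricated rows:
[python title="DataFrame Structure (stand-in)"]
from pandas import DataFrame

# fabricated rows, in the same shape get_data_frame produces
rows = [
    dict(text='Thank you Council Bluffs, Iowa!',
         classification='realDonaldTrump'),
    dict(text='Fighting for families.',
         classification='HillaryClinton'),
]
# these ids mimic the '<time>-<random>' format used below
ids = ['1478012345-A1B2C3', '1478067890-D4E5F6']

df = DataFrame(rows, index=ids)
print(df.text.values[0])            # the tweet text
print(df.classification.values[0])  # the label we want to predict
[/python]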
I then merge both DataFrames into a single DataFrame:
In [1]: data = merge(t, h)
In [2]: len(data)
Out[2]: 5705
Keep in mind that at this stage the DataFrame is still ordered; all of one account's tweets come before the other's:
In [1]: data.classification.values[0:10]
Out[1]:
array(['realDonaldTrump', 'realDonaldTrump', 'realDonaldTrump',
'realDonaldTrump', 'realDonaldTrump', 'realDonaldTrump',
'realDonaldTrump', 'realDonaldTrump', 'realDonaldTrump',
'realDonaldTrump'], dtype=object)
In [2]: data.classification.values[-10:]
Out[2]:
array(['HillaryClinton', 'HillaryClinton', 'HillaryClinton',
'HillaryClinton', 'HillaryClinton', 'HillaryClinton',
'HillaryClinton', 'HillaryClinton', 'HillaryClinton',
'HillaryClinton'], dtype=object)
Next we need to randomize the DataFrame, then split it into training and testing sets.
In [1]: train, test = get_train_test_data(data, size=0.2)
In [2]: len(train)
Out[2]: 4564
In [3]: len(test)
Out[3]: 1141
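Under the hood, get_train_test_data is just scikit-learn's train_test_split, which shuffles the rows by default before cutting them, so the ordering problem above takes care of itself. If you want the same split on every run, you can pin it with random_state (the value below is arbitrary):
[python title="Shuffled Split (sketch)"]
from pandas import DataFrame
from sklearn.model_selection import train_test_split

# a tiny ordered frame standing in for the merged `data` above
data = DataFrame(dict(
    text=['t1', 't2', 't3', 'h1', 'h2'],
    classification=['realDonaldTrump'] * 3 + ['HillaryClinton'] * 2))

# shuffle=True is the default; random_state makes the shuffle repeatable
train, test = train_test_split(data, test_size=0.2, random_state=42)
print(train.classification.values)  # no longer in the original order
[/python]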
Let's make sure our data was randomized before the split:
In [1]: train.classification.values[0:20]
Out[1]:
array(['realDonaldTrump', 'realDonaldTrump', 'realDonaldTrump',
'HillaryClinton', 'realDonaldTrump', 'realDonaldTrump',
'HillaryClinton', 'realDonaldTrump', 'realDonaldTrump',
'realDonaldTrump', 'realDonaldTrump', 'realDonaldTrump',
'HillaryClinton', 'HillaryClinton', 'realDonaldTrump',
'realDonaldTrump', 'HillaryClinton', 'HillaryClinton',
'HillaryClinton', 'HillaryClinton'], dtype=object)
In [2]: test.classification.values[0:20]
Out[2]:
array(['realDonaldTrump', 'realDonaldTrump', 'HillaryClinton',
'HillaryClinton', 'HillaryClinton', 'realDonaldTrump',
'realDonaldTrump', 'HillaryClinton', 'HillaryClinton',
'realDonaldTrump', 'HillaryClinton', 'HillaryClinton',
'realDonaldTrump', 'realDonaldTrump', 'realDonaldTrump',
'HillaryClinton', 'HillaryClinton', 'HillaryClinton',
'HillaryClinton', 'HillaryClinton'], dtype=object)
Now comes the fun part: let's run predictions on both the training data and the testing data. We expect near 100% accuracy on the training data; after all, that's the dataset we trained on.
In [1]: test_predict(train, test)
Out[1]: (0.99649430324276955, 0.92813321647677471)
In the above we hit 99% on the training set and 92% on the testing set.
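That gap is exactly why we hold data out: the 99% covers tweets the model has already seen, while the 92% is the honest number. A single split can also get lucky or unlucky, so a natural next step is cross-validation, which averages the score over several different splits. Here's a sketch, assuming the steps and data from the listing below:
[python title="Cross-Validation (sketch)"]
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

# assumes `steps` and `data` as defined in the full listing below
pipeline = Pipeline(steps)
scores = cross_val_score(
    pipeline, data.text.values, data.classification.values, cv=5)
print(scores.mean(), scores.std())
[/python]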
And for fun, let's run a prediction after training on the full dataset:
In [1]: predict(data, 'Python programming is exciting.')
Out[1]:
[('HillaryClinton', 0.24396931839432631),
('realDonaldTrump', 0.75603068160567388)]
The full Python code is below; have fun hacking on it!
[python title="Classification Predictions"]
import re
import json
import random
import string
from pandas import DataFrame, concat
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfTransformer, \
CountVectorizer

# update the pipeline to get best test results!
steps = [
('count_vectorizer', CountVectorizer(
stop_words='english', ngram_range=(1, 2))),
('tfidf_transformer', TfidfTransformer()),
('classifier', MultinomialNB())
]

def id_generator(size=6):
"""
Return random string.
"""
chars = string.ascii_uppercase + string.digits
return ''.join(random.choice(chars) for _ in range(size))

def get_data_frame(username):
"""
Read tweets.json from directory and return DataFrame.
"""
raw = dict(ids=[], text=[])
    # open the file and parse it as json; `with` closes it for us
    with open('./%s/tweets.json' % username) as data:
        tweets = json.load(data)
# loop over all tweets in json file
for t in tweets:
# skip retweets.
if re.search('^RT ', t['tweet']):
continue
# update raw list with tweet values.
raw['text'].append(
dict(text=t['tweet'], classification=username))
raw['ids'].append(
'%s-%s' % (t['time'], id_generator()))
return DataFrame(raw['text'], index=raw['ids'])

def merge(*args):
"""
Merge two or more DataFrames.
"""
return concat(args)

def get_train_test_data(data, size=0.2):
"""
Split DataFrame and return a training and testing set.
"""
train, test = train_test_split(data, test_size=size)
return train, test

def test_predict(train, test):
"""
Run predictions on training and test data,
then return scores.
"""
pipeline = Pipeline(steps)
pipeline.fit(train.text.values, train.classification.values)
train_score = pipeline.score(
train.text.values, train.classification.values)
test_score = pipeline.score(
test.text.values, test.classification.values)
return train_score, test_score

def predict(data, text):
    """
    Train on the full dataset, then return the class
    probabilities for a single piece of text.
    """
    pipeline = Pipeline(steps)
    pipeline.fit(data.text.values, data.classification.values)
    res = zip(pipeline.classes_, pipeline.predict_proba([text])[0])
    return sorted(res, key=lambda x: x[1])
[/python]
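One easy place to start hacking: CountVectorizer followed by TfidfTransformer is equivalent to a single TfidfVectorizer, so the steps list can be collapsed, and its parameters (stop_words, ngram_range, and friends) are the knobs worth turning:
[python title="Alternative Pipeline Steps (sketch)"]
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer

# TfidfVectorizer == CountVectorizer + TfidfTransformer in one step
steps = [
    ('tfidf_vectorizer', TfidfVectorizer(
        stop_words='english', ngram_range=(1, 2))),
    ('classifier', MultinomialNB()),
]
[/python]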
For additional content, I might suggest some of the courses found on CourseDuck.