Test Your Machine Learning
In my previous post " Python Machine Learning with Presidential Tweets “, I started messing around with sklearn and text classification.
Since then I’ve discovered a great tutorial from SciPy 2015 . This video starts out slow enough for novices, and a reoccurring theme is testing your datasets.
After watching a good chunk of this video, I decided to go back to my code and implement a testing phase. Basically I’ll split my data into two pieces, a training set , and a testing set .
After training on the training set, I can feed in the testing set, then score predictions for accuracy.
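In scikit-learn terms, that workflow is only a few lines. Here's a minimal sketch on made-up toy sentences (the real helpers used below appear in the full listing at the end of this post):
[python title="Train/Test Split Sketch"]
# A minimal sketch of the split/train/score workflow on made-up
# toy sentences; the real helpers appear in the full listing below.
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ['make america great again', 'stronger together',
         'build the wall', 'love trumps hate',
         'crooked hillary', 'im with her']
labels = ['realDonaldTrump', 'HillaryClinton',
          'realDonaldTrump', 'HillaryClinton',
          'realDonaldTrump', 'HillaryClinton']

# hold out a third of the examples for testing
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.33)

pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
pipeline.fit(X_train, y_train)

# accuracy on tweets the model has never seen
print(pipeline.score(X_test, y_test))
[/python]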
Let's look at this a couple of blocks at a time.
First I'll load my Twitter data into pandas DataFrames:
In [1]: t = get_data_frame('realDonaldTrump')
In [2]: len(t)
Out[2]: 3100
In [3]: h = get_data_frame('HillaryClinton')
In [4]: len(h)
Out[4]: 2605
These DataFrames include the text (the tweet) and the classification (the Twitter username).
In [1]: t.text.values[0]
Out[1]: u'Thank you Council Bluffs, Iowa! Will be back soon. Remember- everything you need to know about Hillary -- just\u2026 https://t.co/45kIHxdX83'
In [2]: t.classification.values[0]
Out[2]: 'realDonaldTrump'
In [3]: h.text.values[0]
Out[3]: u'"If a candidate regularly and flippantly makes cruel and insulting comments about women...that\'s who that candidate\u2026 https://t.co/uOdGkhKWRY'
In [4]: h.classification.values[0]
Out[4]: 'HillaryClinton'
I then merge both DataFrames into a single DataFrame:
In [1]: data = merge(t, h)
In [2]: len(data)
Out[2]: 5705
Keep in mind, at this stage the DataFrame is ordered:
In [1]: data.classification.values[0:10]
Out[1]:
array(['realDonaldTrump', 'realDonaldTrump', 'realDonaldTrump',
 'realDonaldTrump', 'realDonaldTrump', 'realDonaldTrump',
 'realDonaldTrump', 'realDonaldTrump', 'realDonaldTrump',
 'realDonaldTrump'], dtype=object)
In [2]: data.classification.values[-10:]
Out[2]:
array(['HillaryClinton', 'HillaryClinton', 'HillaryClinton',
 'HillaryClinton', 'HillaryClinton', 'HillaryClinton',
 'HillaryClinton', 'HillaryClinton', 'HillaryClinton',
 'HillaryClinton'], dtype=object)
Next we need to randomize the DataFrame, then split it into training and testing sets.
In [1]: train, test = get_train_test_data(data, size=0.2)
In [2]: len(train)
Out[2]: 4564
In [3]: len(test)
Out[3]: 1141
Let's make sure our data was randomized before the split:
In [1]: train.classification.values[0:20]
Out[1]:
array(['realDonaldTrump', 'realDonaldTrump', 'realDonaldTrump',
 'HillaryClinton', 'realDonaldTrump', 'realDonaldTrump',
 'HillaryClinton', 'realDonaldTrump', 'realDonaldTrump',
 'realDonaldTrump', 'realDonaldTrump', 'realDonaldTrump',
 'HillaryClinton', 'HillaryClinton', 'realDonaldTrump',
 'realDonaldTrump', 'HillaryClinton', 'HillaryClinton',
 'HillaryClinton', 'HillaryClinton'], dtype=object)
In [2]: test.classification.values[0:20]
Out[2]:
array(['realDonaldTrump', 'realDonaldTrump', 'HillaryClinton',
 'HillaryClinton', 'HillaryClinton', 'realDonaldTrump',
 'realDonaldTrump', 'HillaryClinton', 'HillaryClinton',
 'realDonaldTrump', 'HillaryClinton', 'HillaryClinton',
 'realDonaldTrump', 'realDonaldTrump', 'realDonaldTrump',
 'HillaryClinton', 'HillaryClinton', 'HillaryClinton',
 'HillaryClinton', 'HillaryClinton'], dtype=object)
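The shuffling here actually comes from train_test_split itself; in current scikit-learn versions its shuffle parameter defaults to True, so rows are randomized before the slice. If you ever wanted to do it by hand, a hypothetical pandas version might look like this:
[python title="Manual Shuffle Sketch"]
# Hypothetical alternative to train_test_split, for illustration:
# DataFrame.sample(frac=1) returns every row in random order.
def manual_train_test_split(data, size=0.2):
    shuffled = data.sample(frac=1)
    cut = int(len(shuffled) * (1 - size))
    return shuffled[:cut], shuffled[cut:]
[/python]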
Now comes the fun part. Let's run predictions on both the training data and the testing data. We expect the training data to score near 100% accuracy; after all, this is the dataset we trained on.
In [1]: test_predict(train, test)
Out[1]: (0.99649430324276955, 0.92813321647677471)
In the above we hit about 99% accuracy on the training set and about 92% on the testing set.
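That gap between the training and testing scores gives a rough sense of overfitting. For a steadier estimate than a single random split, scikit-learn's cross_val_score averages accuracy over several splits. A minimal sketch, assuming the steps pipeline and merged data DataFrame from the full listing below:
[python title="Cross-Validation Sketch"]
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# 'steps' and 'data' are the pipeline definition and merged
# DataFrame from the full listing at the end of this post.
pipeline = Pipeline(steps)
scores = cross_val_score(
    pipeline, data.text.values, data.classification.values, cv=5)
print(scores.mean())  # average accuracy over 5 folds
[/python]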
And for fun, let's run a prediction using the full dataset:
In [1]: predict(data, 'Python programming is exciting.')
Out[1]:
[('HillaryClinton', 0.24396931839432631),
 ('realDonaldTrump', 0.75603068160567388)]
The full Python code is below; have fun hacking on it!
[python title="Classification Predictions"]
import re
import json
import random
import string
from pandas import DataFrame, concat
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfTransformer, \
    CountVectorizer
# update the pipeline to get best test results!
steps = [
    ('count_vectorizer',  CountVectorizer(
        stop_words='english', ngram_range=(1, 2))),
    ('tfidf_transformer', TfidfTransformer()),
    ('classifier',        MultinomialNB())
]
def id_generator(size=6):
    """
    Return random string.
    """
    chars = string.ascii_uppercase + string.digits
    return ''.join(random.choice(chars) for _ in range(size))
def get_data_frame(username):
    """
    Read tweets.json from directory and return DataFrame.
    """
    raw = dict(ids=[], text=[])
    # open file and read as json
    with open('./%s/tweets.json' % username) as data:
        tweets = json.loads(data.read())
    # loop over all tweets in json file
    for t in tweets:
        # skip retweets.
        if re.search('^RT ', t['tweet']):
            continue
        # update raw list with tweet values.
        raw['text'].append(
            dict(text=t['tweet'], classification=username))
        raw['ids'].append(
            '%s-%s' % (t['time'], id_generator()))
    return DataFrame(raw['text'], index=raw['ids'])
def merge(*args):
    """
    Merge two or more DataFrames.
    """
    return concat(args)
def get_train_test_data(data, size=0.2):
    """
    Split DataFrame and return a training and testing set.
    """
    train, test = train_test_split(data, test_size=size)
    return train, test
def test_predict(train, test):
    """
    Run predictions on training and test data,
    then return scores.
    """
    pipeline = Pipeline(steps)
    pipeline.fit(train.text.values, train.classification.values)
    train_score = pipeline.score(
        train.text.values, train.classification.values)
    test_score = pipeline.score(
        test.text.values, test.classification.values)
    return train_score, test_score
def predict(data, text):
    """
    Train on the full dataset, then return class
    probabilities for the given text.
    """
    pipeline = Pipeline(steps)
    pipeline.fit(data.text.values, data.classification.values)
    res = zip(pipeline.classes_, pipeline.predict_proba([text])[0])
    return sorted(res, key=lambda x: x[1])
[/python]
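The comment at the top of the steps list invites experimentation. One systematic way to do that is scikit-learn's GridSearchCV, which tries out parameter combinations named with the step__parameter convention. A rough sketch, reusing steps and data from above:
[python title="Pipeline Tuning Sketch"]
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# candidate settings, keyed by '<step name>__<parameter>';
# these particular values are just a starting point.
params = {
    'count_vectorizer__ngram_range': [(1, 1), (1, 2)],
    'classifier__alpha': [0.1, 1.0],
}
search = GridSearchCV(Pipeline(steps), params, cv=5)
search.fit(data.text.values, data.classification.values)
print(search.best_params_, search.best_score_)
[/python]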
For additional content, I might suggest some of the courses found on CourseDuck.