Test Your Machine Learning

In my previous post, "Python Machine Learning with Presidential Tweets", I started messing around with sklearn and text classification.

Since then I’ve discovered a great tutorial from SciPy 2015. The video starts out slow enough for novices, and a recurring theme is testing your model on data it was not trained on.

After watching a good chunk of this video, I decided to go back to my code and implement a testing phase. Basically, I’ll split my data into two pieces: a training set and a testing set.

After training on the training set, I can feed in the testing set, then score predictions for accuracy.

Let’s look at this a couple of blocks at a time.

First I’ll load my Twitter data into pandas DataFrames:

In [1]: t = get_data_frame('realDonaldTrump')

In [2]: len(t)
Out[2]: 3100

In [3]: h = get_data_frame('HillaryClinton')

In [4]: len(h)
Out[4]: 2605

These DataFrames include the tweet itself (the text column) and its label (the classification column), which is the Twitter username.

In [1]: t.text.values[0]
Out[1]: u'Thank you Council Bluffs, Iowa! Will be back soon. Remember- everything you need to know about Hillary -- just\u2026 https://t.co/45kIHxdX83'

In [2]: t.classification.values[0]
Out[2]: 'realDonaldTrump'

In [3]: h.text.values[0]
Out[3]: u'"If a candidate regularly and flippantly makes cruel and insulting comments about women...that\'s who that candidate\u2026 https://t.co/uOdGkhKWRY'

In [4]: h.classification.values[0]
Out[4]: 'HillaryClinton'
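
For reference, the rows behind these DataFrames can be sketched without pandas. This is only a minimal illustration: it assumes each entry in tweets.json has a tweet key and a time key, which is how the full code at the end of this post reads them. The sample tweets and the build_rows name are made up for the example.

```python
import re


def build_rows(tweets, username):
    """Return one dict per tweet, skipping retweets (text starting with 'RT ')."""
    rows = []
    for t in tweets:
        if re.search(r'^RT ', t['tweet']):
            continue
        rows.append({'text': t['tweet'], 'classification': username})
    return rows


# Hypothetical sample input, not real tweets.
sample = [
    {'time': '2016-10-01 12:00:00', 'tweet': 'Hello Iowa!'},
    {'time': '2016-10-01 12:05:00', 'tweet': 'RT @someone: a retweet'},
]
rows = build_rows(sample, 'exampleUser')
print(rows)
```

Each surviving row carries the two columns shown above, text and classification.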

I then merge both DataFrames into a single DataFrame:

In [1]: data = merge(t, h)

In [2]: len(data)
Out[2]: 5705

Keep in mind, at this stage the DataFrame is ordered:

In [1]: data.classification.values[0:10]
Out[1]:
array(['realDonaldTrump', 'realDonaldTrump', 'realDonaldTrump',
 'realDonaldTrump', 'realDonaldTrump', 'realDonaldTrump',
 'realDonaldTrump', 'realDonaldTrump', 'realDonaldTrump',
 'realDonaldTrump'], dtype=object)

In [2]: data.classification.values[-10:]
Out[2]:
array(['HillaryClinton', 'HillaryClinton', 'HillaryClinton',
 'HillaryClinton', 'HillaryClinton', 'HillaryClinton',
 'HillaryClinton', 'HillaryClinton', 'HillaryClinton',
 'HillaryClinton'], dtype=object)
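
The ordering matters because a naive split would put only one author in the testing set. Here is a rough, library-free illustration that uses the row counts from above and a plain list in place of the DataFrame:

```python
# Labels in the same order as the merged DataFrame:
# all realDonaldTrump rows first, then all HillaryClinton rows.
labels = ['realDonaldTrump'] * 3100 + ['HillaryClinton'] * 2605

# Take the last 20% as a "testing set" without shuffling first.
test_count = int(round(len(labels) * 0.2))
naive_test = labels[-test_count:]

print(test_count, set(naive_test))
```

Every label in naive_test is HillaryClinton, so a score on that slice would say nothing about how well the other author is recognized.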

Next we need to randomize the DataFrame, then split it into training and testing sets. Passing size=0.2 holds out 20% of the rows for testing.

In [1]: train, test = get_train_test_data(data, size=0.2)

In [2]: len(train)
Out[2]: 4564

In [3]: len(test)
Out[3]: 1141
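
The idea behind a shuffled split is simple. The following is a minimal sketch in plain Python, not sklearn’s actual implementation; shuffle_split and its seed parameter are hypothetical names used only for this example.

```python
import random


def shuffle_split(rows, size=0.2, seed=None):
    """Shuffle a copy of rows, then hold out a `size` fraction as the test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    test_count = int(round(len(shuffled) * size))
    # The first test_count shuffled rows become the test set, the rest train.
    return shuffled[test_count:], shuffled[:test_count]


rows = list(range(5705))
train_rows, test_rows = shuffle_split(rows, size=0.2, seed=42)
print(len(train_rows), len(test_rows))
```

With 5705 rows and size=0.2 this gives the same 4564 / 1141 split sizes as above. Fixing the seed makes the split repeatable, which helps when comparing pipeline changes.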

Let’s make sure our data was randomized before the split:

In [1]: train.classification.values[0:20]
Out[1]:
array(['realDonaldTrump', 'realDonaldTrump', 'realDonaldTrump',
 'HillaryClinton', 'realDonaldTrump', 'realDonaldTrump',
 'HillaryClinton', 'realDonaldTrump', 'realDonaldTrump',
 'realDonaldTrump', 'realDonaldTrump', 'realDonaldTrump',
 'HillaryClinton', 'HillaryClinton', 'realDonaldTrump',
 'realDonaldTrump', 'HillaryClinton', 'HillaryClinton',
 'HillaryClinton', 'HillaryClinton'], dtype=object)

In [2]: test.classification.values[0:20]
Out[2]:
array(['realDonaldTrump', 'realDonaldTrump', 'HillaryClinton',
 'HillaryClinton', 'HillaryClinton', 'realDonaldTrump',
 'realDonaldTrump', 'HillaryClinton', 'HillaryClinton',
 'realDonaldTrump', 'HillaryClinton', 'HillaryClinton',
 'realDonaldTrump', 'realDonaldTrump', 'realDonaldTrump',
 'HillaryClinton', 'HillaryClinton', 'HillaryClinton',
 'HillaryClinton', 'HillaryClinton'], dtype=object)

Now comes the fun part. Let’s run predictions on both the training data and the testing data. We expect the training data to score near 100% accuracy; after all, this is the dataset we trained on.

In [1]: test_predict(train, test)
Out[1]: (0.99649430324276955, 0.92813321647677471)

In the above we hit 99.6% on the training set and 92.8% on the testing set.
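
Each score is plain accuracy: the fraction of tweets whose predicted author matches the real one. A minimal sketch of that calculation, with a hypothetical accuracy helper and made-up labels:

```python
def accuracy(predicted, actual):
    """Return the fraction of predictions that match the true labels."""
    if len(predicted) != len(actual):
        raise ValueError('predicted and actual must be the same length')
    matches = sum(1 for p, a in zip(predicted, actual) if p == a)
    return matches / float(len(actual))


actual = ['realDonaldTrump', 'HillaryClinton',
          'HillaryClinton', 'realDonaldTrump']
predicted = ['realDonaldTrump', 'HillaryClinton',
             'realDonaldTrump', 'realDonaldTrump']
score = accuracy(predicted, actual)
print(score)
```

Read this way, the testing score means that about 1059 of the 1141 held-out tweets were attributed to the right author.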

And for fun, let’s run a prediction using the full dataset:

In [1]: predict(data, 'Python programming is exciting.')
Out[1]:
[('HillaryClinton', 0.24396931839432631),
 ('realDonaldTrump', 0.75603068160567388)]
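
Because predict sorts the pairs by probability in ascending order, the most likely author is the last element. A small sketch using the output above:

```python
# Output copied from the predict() call above.
res = [('HillaryClinton', 0.24396931839432631),
       ('realDonaldTrump', 0.75603068160567388)]

# The list is sorted by ascending probability, so the winner is last.
label, probability = res[-1]

# max() finds the same pair without relying on the sort order.
best = max(res, key=lambda pair: pair[1])

print(label, round(probability, 3))
```

The two probabilities add up to 1, so with only two classes the result is really a single number: how strongly the model leans one way.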

The full Python code is below. Have fun hacking on it!

[python title="Classification Predictions"]

import re
import json
import random
import string

from pandas import DataFrame, concat

from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
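# train_test_split moved to sklearn.model_selection in scikit-learn 0.18;
# the old sklearn.cross_validation module was removed in 0.20.
from sklearn.model_selection import train_test_split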
from sklearn.feature_extraction.text import TfidfTransformer, \
    CountVectorizer

# update the pipeline to get best test results!
steps = [
    ('count_vectorizer',  CountVectorizer(
        stop_words='english', ngram_range=(1,  2))),
    ('tfidf_transformer', TfidfTransformer()),
    ('classifier',        MultinomialNB())
]

def id_generator(size=6):
    """
    Return random string.
    """
    chars = string.ascii_uppercase + string.digits
    return ''.join(random.choice(chars) for _ in range(size))

def get_data_frame(username):
    """
    Read tweets.json from directory and return DataFrame.
    """

    raw = dict(ids=[], text=[])

    # open file and read as json; the with block closes the file for us
    with open('./%s/tweets.json' % username) as data:
        tweets = json.load(data)

    # loop over all tweets in json file
    for t in tweets:

        # skip retweets.
        if re.search('^RT ', t['tweet']):
            continue

        # update raw list with tweet values.
        raw['text'].append(
            dict(text=t['tweet'], classification=username))
        raw['ids'].append(
            '%s-%s' % (t['time'], id_generator()))

    return DataFrame(raw['text'], index=raw['ids'])

def merge(*args):
    """
    Merge two or more DataFrames.
    """
    return concat(args)

def get_train_test_data(data, size=0.2):
    """
    Split DataFrame and return a training and testing set.
    """
    train, test = train_test_split(data, test_size=size)
    return train, test

def test_predict(train, test):
    """
    Run predictions on training and test data,
    then return scores.
    """

    pipeline = Pipeline(steps)
    pipeline.fit(train.text.values, train.classification.values)

    train_score = pipeline.score(
        train.text.values, train.classification.values)
    test_score = pipeline.score(
        test.text.values, test.classification.values)

    return train_score, test_score

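def predict(data, text):
    """
    Train on the full DataFrame, then return (class, probability)
    pairs for text, sorted from least to most likely.
    """

    pipeline = Pipeline(steps)
    pipeline.fit(data.text.values, data.classification.values)

    res = zip(pipeline.classes_, pipeline.predict_proba([text])[0])
    return sorted(res, key=lambda x: x[1])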

[/python]
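
The ngram_range=(1, 2) setting in the pipeline asks CountVectorizer to count single words as well as adjacent word pairs. As a rough illustration only, the hypothetical ngrams helper below lowercases the text and splits it on whitespace. The real CountVectorizer applies its own token pattern and, with stop_words='english', also drops common English words, so its features will differ.

```python
def ngrams(text, low=1, high=2):
    """Return all word n-grams of text for n from low to high, inclusive."""
    words = text.lower().split()
    grams = []
    for n in range(low, high + 1):
        # Slide a window of n words across the text.
        for i in range(len(words) - n + 1):
            grams.append(' '.join(words[i:i + n]))
    return grams


grams = ngrams('Python programming is exciting')
print(grams)
```

Word pairs let the classifier pick up on short phrases, not just individual words, which is one knob worth experimenting with when tuning the pipeline.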

For additional content, I might suggest some of the courses found on CourseDuck.