Test your Machine Learning

In my previous post, “Python Machine Learning with Presidential Tweets”, I started messing around with sklearn and text classification.

Since then I’ve discovered a great tutorial from SciPy 2015. The video starts out slowly enough for novices, and a recurring theme is testing your datasets.

After watching a good chunk of the video, I decided to go back to my code and implement a testing phase. Basically I’ll split my data into two pieces: a training set and a testing set.

After training on the training set, I can feed in the testing set and score its predictions for accuracy.
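
Here is a minimal sketch of that idea, with toy data standing in for the real tweets (the names here are illustrative; my actual helpers appear in the full listing at the end of this post):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# toy data standing in for the real tweets
texts = ['make america great again', 'stronger together',
         'build the wall', 'love trumps hate'] * 25
labels = ['realDonaldTrump', 'HillaryClinton'] * 50

# hold out 20% of the rows for testing; shuffling happens by default
train_x, test_x, train_y, test_y = train_test_split(
    texts, labels, test_size=0.2)

model = Pipeline([('vec', CountVectorizer()), ('nb', MultinomialNB())])
model.fit(train_x, train_y)         # learn from the training set only
print(model.score(test_x, test_y))  # accuracy on tweets it never saw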

Let’s look at this a couple of blocks at a time.

First I’ll load my Twitter data into pandas DataFrames:

In [1]: t = get_data_frame('realDonaldTrump')

In [2]: len(t)
Out[2]: 3100

In [3]: h = get_data_frame('HillaryClinton')

In [4]: len(h)
Out[4]: 2605

These DataFrames include the text (the tweet) and a classification (the Twitter username).

In [1]: t.text.values[0]
Out[1]: u'Thank you Council Bluffs, Iowa! Will be back soon. Remember- everything you need to know about Hillary -- just\u2026 https://t.co/45kIHxdX83'

In [2]: t.classification.values[0]
Out[2]: 'realDonaldTrump'

In [3]: h.text.values[0]
Out[3]: u'"If a candidate regularly and flippantly makes cruel and insulting comments about women...that\'s who that candidate\u2026 https://t.co/uOdGkhKWRY'

In [4]: h.classification.values[0]
Out[4]: 'HillaryClinton'

I then merge both DataFrames into a single DataFrame:

In [1]: data = merge(t, h)

In [2]: len(data)
Out[2]: 5705

Keep in mind that at this stage the DataFrame is still ordered:

In [1]: data.classification.values[0:10]
Out[1]:
array(['realDonaldTrump', 'realDonaldTrump', 'realDonaldTrump',
 'realDonaldTrump', 'realDonaldTrump', 'realDonaldTrump',
 'realDonaldTrump', 'realDonaldTrump', 'realDonaldTrump',
 'realDonaldTrump'], dtype=object)

In [2]: data.classification.values[-10:]
Out[2]:
array(['HillaryClinton', 'HillaryClinton', 'HillaryClinton',
 'HillaryClinton', 'HillaryClinton', 'HillaryClinton',
 'HillaryClinton', 'HillaryClinton', 'HillaryClinton',
 'HillaryClinton'], dtype=object)

Next we need to randomize the DataFrame, then split it into training and testing sets.

In [1]: train, test = get_train_test_data(data, size=0.2)

In [2]: len(train)
Out[2]: 4564

In [3]: len(test)
Out[3]: 1141

Let’s make sure the data was randomized before the split:

In [1]: train.classification.values[0:20]
Out[1]:
array(['realDonaldTrump', 'realDonaldTrump', 'realDonaldTrump',
 'HillaryClinton', 'realDonaldTrump', 'realDonaldTrump',
 'HillaryClinton', 'realDonaldTrump', 'realDonaldTrump',
 'realDonaldTrump', 'realDonaldTrump', 'realDonaldTrump',
 'HillaryClinton', 'HillaryClinton', 'realDonaldTrump',
 'realDonaldTrump', 'HillaryClinton', 'HillaryClinton',
 'HillaryClinton', 'HillaryClinton'], dtype=object)

In [2]: test.classification.values[0:20]
Out[2]:
array(['realDonaldTrump', 'realDonaldTrump', 'HillaryClinton',
 'HillaryClinton', 'HillaryClinton', 'realDonaldTrump',
 'realDonaldTrump', 'HillaryClinton', 'HillaryClinton',
 'realDonaldTrump', 'HillaryClinton', 'HillaryClinton',
 'realDonaldTrump', 'realDonaldTrump', 'realDonaldTrump',
 'HillaryClinton', 'HillaryClinton', 'HillaryClinton',
 'HillaryClinton', 'HillaryClinton'], dtype=object)
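
As one more sanity check (an extra step, not part of my original flow), we can look at the class balance of each split using pandas' value_counts:

# count how many tweets from each account landed in each split
print(train.classification.value_counts())
print(test.classification.value_counts())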

Now comes the fun part. Let’s run predictions on both the training and testing data. We expect the training data to score near 100% accuracy; after all, this is the dataset we trained on.

In [1]: test_predict(train, test)
Out[1]: (0.99649430324276955, 0.92813321647677471)

In the above we hit 99.6% on the training set and 92.8% on the testing set. The gap between the two gives a rough sense of how well the model generalizes to tweets it has never seen.

And for fun, let’s run a prediction using the full dataset:

In [1]: predict(data, 'Python programming is exciting.')
Out[1]:
[('HillaryClinton', 0.24396931839432631),
 ('realDonaldTrump', 0.75603068160567388)]

The full Python code is below; have fun hacking on it!



import re
import json
import random
import string

from pandas import DataFrame, concat

from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfTransformer, \
    CountVectorizer


# update the pipeline to get best test results!
steps = [
    # raw tweets -> token counts (unigrams and bigrams,
    # English stop words removed)
    ('count_vectorizer',  CountVectorizer(
        stop_words='english', ngram_range=(1,  2))),
    # token counts -> tf-idf weighted features
    ('tfidf_transformer', TfidfTransformer()),
    # tf-idf features -> Naive Bayes classifier
    ('classifier',        MultinomialNB())
]

def id_generator(size=6):
    """
    Return random string.
    """
    chars = string.ascii_uppercase + string.digits
    return ''.join(random.choice(chars) for _ in range(size))

def get_data_frame(username):
    """
    Read tweets.json from directory and return DataFrame.
    """

    raw = dict(ids=[], text=[])

    # open file and read as json
    with open('./%s/tweets.json' % username) as data:
        tweets = json.loads(data.read())

    # loop over all tweets in json file
    for t in tweets:

        # skip retweets.
        if re.search('^RT ', t['tweet']):
            continue

        # update raw list with tweet values.
        raw['text'].append(
            dict(text=t['tweet'], classification=username))
        raw['ids'].append(
            '%s-%s' % (t['time'], id_generator()))

    return DataFrame(raw['text'], index=raw['ids'])

def merge(*args):
    """
    Merge two or more DataFrames.
    """
    return concat(args)

def get_train_test_data(data, size=0.2):
    """
    Split DataFrame and return a training and testing set.
    """
    train, test = train_test_split(data, test_size=size)
    return train, test

def test_predict(train, test):
    """
    Run predictions on training and test data,
    then return scores.
    """

    pipeline = Pipeline(steps)
    pipeline.fit(train.text.values, train.classification.values)

    train_score = pipeline.score(
        train.text.values, train.classification.values)
    test_score = pipeline.score(
        test.text.values, test.classification.values)

    return train_score, test_score

def predict(data, text):
    """
    Train on the full dataset, then return sorted
    class probabilities for the given text.
    """

    pipeline = Pipeline(steps)
    pipeline.fit(data.text.values, data.classification.values)

    res = zip(pipeline.classes_, pipeline.predict_proba([text])[0])
    return sorted(res, key=lambda x: x[1])



Python Machine Learning with Presidential Tweets

I’ve been spending a little bit of time researching Machine Learning, and was very happy to come across a Python library called sklearn.

While digging around Google, I came across a fantastic write-up on document classification by Zac Stewart. The article goes pretty deep into writing a spam filter using machine learning and sklearn. After reading it I wanted to try some of the concepts, but had no interest in writing a spam filter.

I decided instead to write a predictor using tweets as the learning source, and what better users than the US presidential candidates!

Let me forewarn: this merely uses term frequencies and n-grams on the tweets, and probably isn’t very useful or completely accurate, but hey, it could be fun, right? 🙂
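
For the curious, here is a quick illustration of what term frequencies and n-grams look like (a toy example, not the real pipeline):

from sklearn.feature_extraction.text import CountVectorizer

# unigrams and bigrams, the same ngram_range=(1, 2) my pipeline uses
vec = CountVectorizer(ngram_range=(1, 2))
vec.fit(['make america great again'])

print(sorted(vec.vocabulary_))
# ['again', 'america', 'america great', 'great',
#  'great again', 'make', 'make america']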

In [1]: tweets = get_tweets('HillaryClinton')
In [2]: tweets = get_tweets('realDonaldTrump')

In [3]: h = get_data_frame('HillaryClinton')
In [4]: t = get_data_frame('realDonaldTrump')

In [5]: data = merge_data_frames(t, h)

A good baseline might be to predict on an actual tweet the candidate has posted:


In [1]: predict(data, 'The question in this election: Who can put the plans into action that will make your life better?')
('realDonaldTrump', 0.15506298409438407)
('HillaryClinton', 0.84493701590561299)

Alright, that is an 84% to 15% prediction. Pretty good.


In [1]: predict(data, 'I won every poll from last nights Presidential Debate - except for the little watched @CNN poll.')
('HillaryClinton', 0.069884565641135613)
('realDonaldTrump', 0.93011543435886102)

This prediction gives 93% to 6%, even better.

Now let’s have a little fun by throwing in phrases we might expect, but that the candidates never actually posted:

In [1]: predict(data, 'I have really big hands')
('HillaryClinton', 0.39802148371499757)
('realDonaldTrump', 0.60197851628500265)

In [2]: predict(data, 'I am for woman rights')
('realDonaldTrump', 0.3698772371039914)
('HillaryClinton', 0.63012276289600766)

We could also feed in some famous quotes:


In [1]: predict(data, "Two things are infinite: the universe and human stupidity; and I'm not sure about the universe.")
('realDonaldTrump', 0.28321206465202214)
('HillaryClinton', 0.71678793534798135)


In [1]: predict(data, 'A room without books is like a body without a soul.')
('realDonaldTrump', 0.39169158094239315)
('HillaryClinton', 0.60830841905760524)

Alright, go have a look at the code; you can find it on my GitHub page.

Happy Hacking!

SensorTag data merged with Open Weather Maps

About a week ago I worked on SensorTag metrics with Grafana.

We had some interesting weather today here in Austin, and I wanted to visualize it as well. Luckily, Open Weather Maps offers a free API for gathering near real-time weather data based on location.


import requests


def __get_open_weather_data():

  url_path = 'http://api.openweathermap.org/data/2.5/weather'
  api_key = '??????????'
  url = '%s?zip=73301&APPID=%s'

  # try the request twice, in case the first attempt fails
  for _ in range(2):
    res = requests.get(url % (url_path, api_key))
    if res and res.json().get('main'):
      return res.json()

def get_open_weather():

  data = __get_open_weather_data()

  # format our json response; the API reports temperature in
  # Kelvin, so convert to Fahrenheit (F = K * 9/5 - 459.67)
  temp = round(data['main']['temp'] * 9/5 - 459.67, 2)
  pressure = round(data['main']['pressure'], 1)
  humidity = round(data['main']['humidity'], 2)
  # the 'rain' block is only present when it has rained recently
  rain = round(data.get('rain', {}).get('1h', 0.00), 2)
  clouds = data['clouds']['all']
  wind = data['wind']['speed']

  return dict(
      open_weather_temperature=temp, 
      open_weather_pressure=pressure,
      open_weather_humidity=humidity,
      open_weather_rain=rain,
      open_weather_clouds=clouds,
      open_weather_wind=wind
  )

Then I merge this with my SensorTag data, appending the new keys to my json file (a rough sketch of the merge step follows the output):

$ cat /sensor.json
{
 "open_weather_temperature": 71.65,
 "temperature": 74.19,
 "open_weather_pressure": 1018.5,
 "light": 0,
 "humidity": 55.16,
 "pressure": 989.8,
 "open_weather_humidity": 99,
 "open_weather_rain": 7.07,
 "open_weather_clouds": 48,
 "open_weather_wind": 2.81
}
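
The merge itself is nothing fancy; roughly something like this sketch (the file path and surrounding plumbing are assumptions on my part):

import json

# read the existing SensorTag readings
with open('/sensor.json') as f:
    sensor = json.load(f)

# append the open_weather_* keys, then rewrite the file
sensor.update(get_open_weather())

with open('/sensor.json', 'w') as f:
    json.dump(sensor, f, indent=1)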

Arduino meet Raspberry Pi

While at the electronics store the other day, I noticed they had motion detectors on sale for only $4. I decided, with my latest obsession of electronic tinkering, that picking up an OSEPP Passive Infrared Sensor (PIR) Module might be fun.

I guess I should have done a little more reading on the packaging; by the time I was home, I noticed this sensor reports an analog signal, not digital. That was an issue, as the Raspberry Pi only reads digital input.

Lucky for me, I also picked up an Arduino UNO Starter Kit a while back. I decided this would be a great time to learn more about converting analog signals to digital (one great thing about the UNO is that it has both digital and analog input/output pins).

As an extra, I learned that the Nexcon Solar Charger 5000mAh I bought for hiking and camping works great as a Raspberry Pi power source; in theory I can have a portable motion detector 😀


The wiring is rather basic; there is no need for resistors or capacitors, just direct connections.

* Connect the motion sensor to the Arduino's 5v power and ground.
* Connect the motion sensor's signal pin to the Arduino's Analog A0 pin.
* Connect the Arduino's Digital 2 pin to the Raspberry Pi's GPIO 18.
* Connect the Arduino's ground to the Raspberry Pi's ground.


Once we are wired up, we can compile and upload the Arduino UNO code using Arduino Studio.

Arduino

/*
OSEPP Motion detector analog to digital converter
http://nessy.info
*/

int analog = A0;
int digital = 2;

void setup(){

 // set our digital pin to OUTPUT
 pinMode(digital, OUTPUT);
}

void loop()
{

 // read value from analog pin
 int analog_value = analogRead(analog);

 // send digital signal when motion detected
 if (analog_value > 0) {
   digitalWrite(digital, HIGH);
 } else {
   digitalWrite(digital, LOW);
 }

 delay(100); // slow down the loop just a bit
}

This Arduino code reads analog input from our motion detector; any time more than 0V is detected, it sends a HIGH signal to digital pin 2.

Raspberry Pi (Python)

import time
from datetime import datetime

import RPi.GPIO as GPIO

GPIO.setmode(GPIO.BCM)
GPIO.setup(18, GPIO.IN)

def detect():
  while True:
    if GPIO.input(18):
      print('[%s] Movement Detected!' % datetime.now().ctime())
    time.sleep(1)


detect()  # run movement detection

On the Raspberry Pi side we listen for a signal on GPIO pin 18, then print a little message with a timestamp.
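
As an aside, RPi.GPIO can also fire a callback on a rising edge instead of polling in a loop. A sketch of that variant (untested on my part):

import time
from datetime import datetime

import RPi.GPIO as GPIO

GPIO.setmode(GPIO.BCM)
GPIO.setup(18, GPIO.IN)

def on_motion(channel):
  # called by RPi.GPIO whenever pin 18 goes HIGH
  print('[%s] Movement Detected!' % datetime.now().ctime())

# interrupt-style detection instead of polling
GPIO.add_event_detect(18, GPIO.RISING, callback=on_motion)

while True:
  time.sleep(60)  # keep the process alive; work happens in the callback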


From here we can do all sorts of things. Happy Hacking!