Test your Machine Learning

In my previous post “Python Machine Learning with Presidential Tweets“, I started messing around with sklearn and text classification.

Since then I’ve discovered a great tutorial from SciPy 2015. The video starts out slow enough for novices, and a recurring theme is testing your datasets.

After watching a good chunk of this video, I decided to go back to my code and implement a testing phase. Basically I’ll split my data into two pieces: a training set and a testing set.

After training on the training set, I can feed in the testing set, then score predictions for accuracy.
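The split itself is a single call in scikit-learn; here is a minimal sketch on toy data (not the actual tweet DataFrames — newer scikit-learn releases expose this under sklearn.model_selection):

```python
from sklearn.model_selection import train_test_split

# ten toy samples with two classes
texts = ['sample %d' % i for i in range(10)]
labels = ['a'] * 5 + ['b'] * 5

# hold out 20% of the data for testing
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.2, random_state=0)

print(len(train_texts), len(test_texts))  # 8 2
```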

Let’s look at this a couple of blocks at a time.

First I’ll load my Twitter data into pandas DataFrames:

In [1]: t = get_data_frame('realDonaldTrump')

In [2]: len(t)
Out[2]: 3100

In [3]: h = get_data_frame('HillaryClinton')

In [4]: len(h)
Out[4]: 2605

These DataFrames include the text results (Tweet), and classifications (Twitter username).

In [1]: t.text.values[0]
Out[1]: u'Thank you Council Bluffs, Iowa! Will be back soon. Remember- everything you need to know about Hillary -- just\u2026 https://t.co/45kIHxdX83'

In [2]: t.classification.values[0]
Out[2]: 'realDonaldTrump'

In [3]: h.text.values[0]
Out[3]: u'"If a candidate regularly and flippantly makes cruel and insulting comments about women...that\'s who that candidate\u2026 https://t.co/uOdGkhKWRY'

In [4]: h.classification.values[0]
Out[4]: 'HillaryClinton'

I then merge both DataFrames into a single DataFrame:

In [1]: data = merge(t, h)

In [2]: len(data)
Out[2]: 5705

Keep in mind, at this stage the DataFrame is ordered:

In [1]: data.classification.values[0:10]
array(['realDonaldTrump', 'realDonaldTrump', 'realDonaldTrump',
 'realDonaldTrump', 'realDonaldTrump', 'realDonaldTrump',
 'realDonaldTrump', 'realDonaldTrump', 'realDonaldTrump',
 'realDonaldTrump'], dtype=object)

In [2]: data.classification.values[-10:]
array(['HillaryClinton', 'HillaryClinton', 'HillaryClinton',
 'HillaryClinton', 'HillaryClinton', 'HillaryClinton',
 'HillaryClinton', 'HillaryClinton', 'HillaryClinton',
 'HillaryClinton'], dtype=object)

Next we need to randomize the DataFrame, then split it into training and testing sets.
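If you ever need to shuffle a DataFrame by hand, pandas can do it in one line (a toy sketch; note that train_test_split, used in the code below, already shuffles by default):

```python
import pandas as pd

df = pd.DataFrame({'classification': ['a'] * 5 + ['b'] * 5})

# sample the whole frame in random order, then rebuild the index
shuffled = df.sample(frac=1, random_state=0).reset_index(drop=True)

print(shuffled.classification.tolist())
```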

In [1]: train, test = get_train_test_data(data, size=0.2)

In [2]: len(train)
Out[2]: 4564

In [3]: len(test)
Out[3]: 1141

Let’s make sure our data was randomized before the split:

In [1]: train.classification.values[0:20]
array(['realDonaldTrump', 'realDonaldTrump', 'realDonaldTrump',
 'HillaryClinton', 'realDonaldTrump', 'realDonaldTrump',
 'HillaryClinton', 'realDonaldTrump', 'realDonaldTrump',
 'realDonaldTrump', 'realDonaldTrump', 'realDonaldTrump',
 'HillaryClinton', 'HillaryClinton', 'realDonaldTrump',
 'realDonaldTrump', 'HillaryClinton', 'HillaryClinton',
 'HillaryClinton', 'HillaryClinton'], dtype=object)

In [2]: test.classification.values[0:20]
array(['realDonaldTrump', 'realDonaldTrump', 'HillaryClinton',
 'HillaryClinton', 'HillaryClinton', 'realDonaldTrump',
 'realDonaldTrump', 'HillaryClinton', 'HillaryClinton',
 'realDonaldTrump', 'HillaryClinton', 'HillaryClinton',
 'realDonaldTrump', 'realDonaldTrump', 'realDonaldTrump',
 'HillaryClinton', 'HillaryClinton', 'HillaryClinton',
 'HillaryClinton', 'HillaryClinton'], dtype=object)

Now comes the fun part. Let’s run predictions on both the training data and the testing data. We expect near 100% accuracy on the training data; after all, this is the dataset we trained on.

In [1]: test_predict(train, test)
Out[1]: (0.99649430324276955, 0.92813321647677471)

In the above we hit roughly 99.6% on the training set and 92.8% on the testing set. The training score is near-perfect because the model has already seen those tweets; the testing score is the more honest measure of accuracy.

And for fun, let’s run a prediction using the full dataset:

In [1]: predict(data, 'Python programming is exciting.')
[('HillaryClinton', 0.24396931839432631),
 ('realDonaldTrump', 0.75603068160567388)]

The full Python code is below, have fun hacking on it!

import re
import json
import random
import string

from pandas import DataFrame, concat

from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
# note: newer scikit-learn releases move this to sklearn.model_selection
from sklearn.cross_validation import train_test_split
from sklearn.feature_extraction.text import TfidfTransformer, \
    CountVectorizer

# update the pipeline to get best test results!
steps = [
    ('count_vectorizer',  CountVectorizer(
        stop_words='english', ngram_range=(1, 2))),
    ('tfidf_transformer', TfidfTransformer()),
    ('classifier',        MultinomialNB())
]

def id_generator(size=6):
    """Return random string."""
    chars = string.ascii_uppercase + string.digits
    return ''.join(random.choice(chars) for _ in range(size))

def get_data_frame(username):
    """Read tweets.json from directory and return DataFrame."""

    raw = dict(ids=[], text=[])

    # open file and read as json
    data = open('./%s/tweets.json' % username)
    tweets = json.loads(data.read())

    # loop over all tweets in json file
    for t in tweets:

        # skip retweets.
        if re.search('^RT ', t['tweet']):
            continue

        # update raw list with tweet values.
        raw['text'].append(
            dict(text=t['tweet'], classification=username))
        raw['ids'].append(
            '%s-%s' % (t['time'], id_generator()))

    return DataFrame(raw['text'], index=raw['ids'])

def merge(*args):
    """Merge two or more DataFrames."""
    return concat(args)

def get_train_test_data(data, size=0.2):
    """Split DataFrame and return a training and testing set."""
    train, test = train_test_split(data, test_size=size)
    return train, test

def test_predict(train, test):
    """Run predictions on training and test data,
    then return scores.
    """
    pipeline = Pipeline(steps)
    pipeline.fit(train.text.values, train.classification.values)

    train_score = pipeline.score(
        train.text.values, train.classification.values)
    test_score = pipeline.score(
        test.text.values, test.classification.values)

    return train_score, test_score

def predict(data, text):
    """Train on the full dataset, then return class
    probabilities for the given text.
    """
    pipeline = Pipeline(steps)
    pipeline.fit(data.text.values, data.classification.values)

    res = zip(pipeline.classes_, pipeline.predict_proba([text])[0])
    return sorted(res, key=lambda x: x[1])

Python Machine Learning with Presidential Tweets

I’ve been spending a little bit of time researching Machine Learning, and was very happy to come across a Python library called sklearn.

While digging around Google, I came across a fantastic write-up on Document Classification by Zac Stewart. This article went pretty deep into writing a spam filter using machine learning and sklearn. After reading the article I wanted to try some of the concepts, but had no interest in writing a spam filter.

I decided instead to write a predictor using Tweets as the learning source, and what better users than the US Presidential candidates!

Let me forewarn: this merely uses term frequencies and n-grams on the tweets, and probably isn’t really useful or completely accurate, but hey, it could be fun, right? 🙂

In [1]: tweets = get_tweets('HillaryClinton')
In [2]: tweets = get_tweets('realDonaldTrump')

In [3]: h = get_data_frame('HillaryClinton')
In [4]: t = get_data_frame('realDonaldTrump')

In [5]: data = merge_data_frames(t, h)

A good baseline might be to predict on an actual tweet the candidate has posted:


In [1]: predict(data, 'The question in this election: Who can put the plans into action that will make your life better?')
('realDonaldTrump', 0.15506298409438407)
('HillaryClinton', 0.84493701590561299)

Alright, that is an 84% to 15% prediction; pretty good.


In [1]: predict(data, 'I won every poll from last nights Presidential Debate - except for the little watched @CNN poll.')
('HillaryClinton', 0.069884565641135613)
('realDonaldTrump', 0.93011543435886102)

This prediction gives 93% to 6%; even better.

Now let’s have a little fun by throwing in things we would assume, but the candidates did not post:

In [1]: predict(data, 'I have really big hands')
('HillaryClinton', 0.39802148371499757)
('realDonaldTrump', 0.60197851628500265)

In [2]: predict(data, 'I am for woman rights')
('realDonaldTrump', 0.3698772371039914)
('HillaryClinton', 0.63012276289600766)

We could also feed in some famous quotes:


In [1]: predict(data, "Two things are infinite: the universe and human stupidity; and I'm not sure about the universe.")
('realDonaldTrump', 0.28321206465202214)
('HillaryClinton', 0.71678793534798135)


In [1]: predict(data, 'A room without books is like a body without a soul.')
('realDonaldTrump', 0.39169158094239315)
('HillaryClinton', 0.60830841905760524)

Alright, so go have a look at the code; you can find it on my GitHub page.

Happy Hacking!

SensorTag data merged with Open Weather Maps

About a week ago I worked on SensorTag metrics with Grafana.

We had some interesting weather here in Austin today, and I wanted to visualize it as well. Luckily, Open Weather Maps offers a free API for gathering near real-time weather data based on city code.



import requests

def __get_open_weather_data():

  url_path = 'http://api.openweathermap.org/data/2.5/weather'
  api_key = '??????????'
  url = '%s?zip=73301&APPID=%s'

  # try the request twice; the API occasionally returns
  # an empty or partial response
  for _ in range(2):
    res = requests.get(url % (url_path, api_key))
    if res and res.json().get('main'):
      return res.json()

def get_open_weather():

  data = __get_open_weather_data()

  # format our json response
  temp = round(data['main']['temp'] * 9 / 5 - 459.67, 2)  # Kelvin to Fahrenheit
  pressure = round(data['main']['pressure'], 1)
  humidity = round(data['main']['humidity'], 2)
  rain = round(data.get('rain', {}).get('1h', 0.00), 2)  # no 'rain' key when dry
  clouds = data['clouds']['all']
  wind = data['wind']['speed']

  # keys get an open_weather_ prefix when merged into sensor.json
  return dict(
      temperature=temp, pressure=pressure, humidity=humidity,
      rain=rain, clouds=clouds, wind=wind)

Then I merge this with my SensorTag data, appending the new keys to my JSON file:

$ cat /sensor.json
{
 "open_weather_temperature": 71.65,
 "temperature": 74.19,
 "open_weather_pressure": 1018.5,
 "light": 0,
 "humidity": 55.16,
 "pressure": 989.8,
 "open_weather_humidity": 99,
 "open_weather_rain": 7.07,
 "open_weather_clouds": 48,
 "open_weather_wind": 2.81
}
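The merge step itself might look something like this (a sketch; the helper name and the sensor fields shown are assumptions, not my exact script):

```python
import json

def merge_weather(sensor_data, weather):
    # prefix each weather key so it doesn't collide with the SensorTag keys
    for key, value in weather.items():
        sensor_data['open_weather_%s' % key] = value
    return sensor_data

sensor = {'temperature': 74.19, 'humidity': 55.16}
weather = {'temperature': 71.65, 'humidity': 99}

merged = merge_weather(sensor, weather)
print(json.dumps(merged, sort_keys=True))
```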

Arduino meet Raspberry Pi

While at the electronics store the other day, I noticed they had motion detectors on sale for only $4. I decided, with my latest obsession of electronic tinkering, picking up an OSEPP Passive Infrared Sensor (PIR) Module might be fun.

I guess I should have done a little more reading on the packaging; by the time I was home, I noticed this sensor reports in analog, not digital. This was an issue, as the Raspberry Pi only reads digital input.

Lucky for me, I also picked up an Arduino UNO Starter Kit awhile back. I decided this would be a great time to learn more about converting analog signals to digital (one great thing about the UNO is that it has both digital and analog input/output pins).

As an extra, I learned that the Nexcon Solar Charger 5000mAh I bought for hiking and camping works great as a Raspberry Pi power source; in theory I can have a portable motion detector 😀



The wiring is rather basic; there is no need for resistors or capacitors, just direct connections.

* Connect the motion sensor to the Arduino’s 5v power and ground.
* Connect the motion sensor’s signal pin to the analog A0 pin on the Arduino.
* Connect the Arduino’s digital 2 pin to the Raspberry Pi’s GPIO 18.
* Connect the Arduino’s ground to the Raspberry Pi’s ground.


Once we are wired up, we can compile and upload the Arduino UNO code using Arduino Studio.


// OSEPP motion detector analog to digital converter

int analog = A0;
int digital = 2;

void setup() {

 // set our digital pin to OUTPUT
 pinMode(digital, OUTPUT);
}

void loop() {

 // read value from analog pin
 int analog_value = analogRead(analog);

 // send digital signal when motion detected
 if (analog_value > 0) {
   digitalWrite(digital, HIGH);
 } else {
   digitalWrite(digital, LOW);
 }

 delay(100); // slow down the loop just a bit
}

This Arduino code reads analog input from our motion detector, and any time more than 0V is detected it sends a signal on digital pin 2.

Raspberry Pi (Python)

import time
from datetime import datetime

import RPi.GPIO as GPIO

GPIO.setmode(GPIO.BCM)  # use BCM pin numbering
GPIO.setup(18, GPIO.IN)

def detect():
  while True:
    if GPIO.input(18):
      print '[%s] Movement Detected!' % datetime.now().ctime()
    time.sleep(0.1)  # don't peg the CPU

detect()  # run movement detection

On the Raspberry Pi side we listen for a signal on GPIO pin 18, then print out a little message with a timestamp.


From here we can do all sort of things, Happy Hacking!

Raspberry Pi and Official NFL Score Board API

Now that I’ve got my hands on a Raspberry Pi 3 Model B, I decided now’s a good time to play around with the 16×2 LCD Module Controller HD44780 I had laying around (I’ve had this thing since December 2015).

A live NFL (National Football League) score board seemed fitting as the season just started.

I found a really good write up on raspberrypi-spy.co.uk about wiring up the controller and Pi, here is the diagram I used:


The code to power this controller is simple Python (pushed to Github):

#!/usr/bin/env python

import time

import Adafruit_CharLCD as LCD
import requests

# Raspberry Pi pin configuration:
lcd_rs = 7
lcd_en = 8
lcd_d4 = 25
lcd_d5 = 24
lcd_d6 = 23
lcd_d7 = 18

# Define LCD column and row size for 16x2 LCD.
lcd_columns = 16
lcd_rows = 2

# Initialize the LCD using the pins above.
lcd = LCD.Adafruit_CharLCD(
  lcd_rs, lcd_en, lcd_d4, lcd_d5, lcd_d6, lcd_d7,
  lcd_columns, lcd_rows
)

# Start fetch / display loop
while True:

  # fetch current score from NFL api
  res = requests.get('http://www.nfl.com/liveupdate/scorestrip/ss.json')

  if res:

    for game in res.json().get('gms', []):

      # parse our teams and scores
      home_team = '%s %s' % (game['h'], game['hnn'])
      home_team = home_team[:13]
      home_team_score = game['hs']

      away_team = '%s %s' % (game['v'], game['vnn'])
      away_team = away_team[:13]
      away_team_score = game['vs']

      # clear lcd screen
      lcd.clear()

      # write results to lcd screen
      lcd.message('%s %s\n%s %s' % (
          home_team, home_team_score, away_team, away_team_score
      ))

      time.sleep(5)  # give each game some time on screen