Python says, Simon’s hipster brother

Many of you may remember playing with a Simon electronic memory game when you were younger, something that looks like this:

At its core the game is rather simple: the device lights up a random sequence of colors, and you need to repeat the pattern. Of course, it gets harder the longer you play.

I thought it would be fun to build a Simon game using a Raspberry Pi and a few electronic components:

I used the following components to assemble the project:

  • Raspberry Pi 3
  • 3x 330 Ohm resistor
  • 3x 1k Ohm resistor
  • White LED
  • Blue LED
  • Red LED
  • Breadboard
  • Assortment of wires

Here is a close-up of the breadboard and components:

The Raspberry Pi’s GPIO pins are then connected to the breadboard, and a small Python script powers the Simon game:


from RPi import GPIO

from sys import exit
from random import choice
from time import sleep

# define our pins for leds
white = 14
blue = 15
red = 18

# define our pins for buttons
white_button = 21
blue_button = 20
red_button = 16

# disable warnings
GPIO.setwarnings(False)

# set the board to use broadcom pin numbering
GPIO.setmode(GPIO.BCM)

# setup our LED pins as output
GPIO.setup(white, GPIO.OUT)
GPIO.setup(blue, GPIO.OUT)
GPIO.setup(red, GPIO.OUT)

# setup our buttons as input
GPIO.setup(white_button, GPIO.IN)
GPIO.setup(blue_button, GPIO.IN)
GPIO.setup(red_button, GPIO.IN)

# create empty pattern list for simon says game
pattern = []

# create a list of our choices for simon says game
choices = [white, blue, red]

# starting difficulty based on blink durations
duration = 0.75


def add_color():
    """
    Append a random color to our pattern list
    """

    color = choice(choices)
    pattern.append(color)


def get_button():
    """
    Gets the next button press and returns
    """

    while True:
        if GPIO.input(white_button):
            return white

        if GPIO.input(blue_button):
            return blue

        if GPIO.input(red_button):
            return red

def blink(led, duration):
    """
    Blink a led for duration
    """

    GPIO.output(led, GPIO.HIGH)
    sleep(duration)
    GPIO.output(led, GPIO.LOW)


def blink_pattern(duration):
    """
    Blinks our pattern using duration as waits
    """

    for led in pattern:
        sleep(duration)
        blink(led, duration)


def check_pattern():
    """
    Checks our button presses against pattern
    """

    for led in pattern:    
        if led != get_button():
            return False
        sleep(0.3)  # delay so button press doesn't overlap
    return True


def game_over():
    """
    Game over function
    """

    print 'Pattern Length: {}'.format(len(pattern))
    print '''
       _____          __  __ ______    ______      ________ _____  
      / ____|   /\   |  \/  |  ____|  / __ \ \    / /  ____|  __ \ 
     | |  __   /  \  | \  / | |__    | |  | \ \  / /| |__  | |__) |
     | | |_ | / /\ \ | |\/| |  __|   | |  | |\ \/ / |  __| |  _  / 
     | |__| |/ ____ \| |  | | |____  | |__| | \  /  | |____| | \ \ 
      \_____/_/    \_\_|  |_|______|  \____/   \/   |______|_|  \_\

    '''

    # blink all leds to show game over
    for _ in range(3):
        for c in choices:
            blink(c, duration=0.1)

    exit()


if __name__ == '__main__':

    # populate initial pattern
    add_color()
    add_color()

    while True:

        # blink back pattern
        blink_pattern(duration)

        # check if our inputs were correct, else end game
        if not check_pattern():
            game_over()

        # add a new color to pattern
        add_color()

        # decrease our duration to increase difficulty
        if duration > 0.05:
            duration -= 0.07


Happy Hacking!

Arduino values to Python over Serial

I’ve done a little bit of reading on the ReadAnalogVoltage tutorial on Arduino’s home page, and it gives a straightforward way to read voltage from an analog pin.

I wanted to take this one step further and send the value over serial, then read it in Python using pySerial.

My setup is very straightforward: an Arduino UNO, a breadboard, and a battery pack holding 4x AA batteries:

[Photo: voltage setup]

To start out I merely want to print the voltage value to the serial monitor in the Arduino IDE; my code looks something like this:

void setup() {
  // connect to serial
  Serial.begin(9600);
}

void loop() {

  // read value from analog pin
  int sensorValue = analogRead(A0);
 
  // convert to voltage and print to serial connection
  // https://www.arduino.cc/en/Tutorial/ReadAnalogVoltage
  float voltage = sensorValue * ( 5.0 / 1023.0 );
  Serial.println(voltage);

}

Now that we’ve verified this works, let’s make a couple of modifications to the Arduino code.

Since the value from analogRead may be over 255 (more than fits in a single byte), we will need to send two bytes: a high byte and a low byte. These are also known as the most significant byte and the least significant byte.
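Before touching the sketch, it may help to see the byte math on its own. Here is a quick Python example (not part of the project code) showing how a 10-bit reading such as 1013 splits into a high and a low byte, and how the two bytes recombine into the original value:

# split a 10-bit ADC reading (0-1023) into two bytes
value = 1013
high = value >> 8    # most significant byte: 3
low = value & 0xFF   # least significant byte: 245

# recombining the bytes gives back the original value
assert high * 256 + low == value   # 3 * 256 + 245 == 1013

With that in mind, here is the updated Arduino sketch: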

void setup() {
  // connect to serial
  Serial.begin(9600);
}

void loop() {

  // read value from analog pin
  int sensorValue = analogRead(A0);
 
  // get the high and low byte from value
  byte high = highByte(sensorValue);
  byte low = lowByte(sensorValue);

  // write the high and low byte to serial
  Serial.write(high);
  Serial.write(low);

}

Then on the Python side we can use pySerial to read two bytes, and convert using the formula Arduino gave us.

import serial

# open our serial port at 9600 baud
dev = '/dev/cu.usbmodem1411'
with serial.Serial(dev, 9600, timeout=1) as ser:

  while True:

    # read 2 bytes from our serial connection
    raw = ser.read(size=2)

    # only proceed when both bytes were read
    if len(raw) == 2:

      # read the high and low byte
      high, low = raw

      # combine the high and low bytes
      # back into the original 10-bit value
      val = ord(high) * 256 + ord(low)

      # print our voltage reading
      # https://www.arduino.cc/en/Tutorial/ReadAnalogVoltage
      print round(val * ( 5.0 / 1023.0), 2)

One thing to take into consideration: if no voltage is applied to the analog pin, the pin floats and the readings will be random and invalid. You will see this in the video before I connect the battery pack. Keep in mind my battery pack is producing about 5 volts:

Python and sentiment analysis

While looking for datasets to throw at sklearn, I came across the UCI Sentiment Labelled Sentences Data Set.

UCI provides positive/negative tagging on real-world data; the data comes from three sources (Amazon, Yelp, and IMDB).

The only problem is that the format is a little strange: there is a .txt file for each source, the formatting is raw and unstructured, and not every line is tagged with sentiment.

To make the data easier to interact with, I generated a JSON file containing only the results that have a sentiment tag (a rough sketch of that conversion follows the sample below). Go ahead and download it.

$ zcat sentiment.json.gz | head -n 25
[
 {
 "result": 0,
 "source": "amazon_cells_labelled.txt",
 "label": "negative",
 "text": "So there is no way for me to plug it in here in the US unless I go by a converter."
 },
 {
 "result": 1,
 "source": "amazon_cells_labelled.txt",
 "label": "positive",
 "text": "Good case, Excellent value."
 },
 {
 "result": 1,
 "source": "amazon_cells_labelled.txt",
 "label": "positive",
 "text": "Great for the jawbone."
 },
 {
 "result": 0,
 "source": "amazon_cells_labelled.txt",
 "label": "negative",
 "text": "Tied to charger for conversations lasting more than 45 minutes.MAJOR PROBLEMS!!"
 },
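And here is that rough sketch of how a file like this could be generated from the raw UCI .txt files. It is a minimal sketch, assuming each tagged line is tab-separated as sentence<TAB>score and that untagged lines are simply skipped; the filenames match the ones in the UCI archive:

import json

# the three source files from the UCI archive
sources = [
    'amazon_cells_labelled.txt',
    'imdb_labelled.txt',
    'yelp_labelled.txt',
]

results = []
for source in sources:
    with open(source) as f:
        for line in f:
            # each tagged line looks like "sentence<TAB>score"
            parts = line.strip().rsplit('\t', 1)
            if len(parts) != 2 or parts[1] not in ('0', '1'):
                # skip lines without a sentiment tag
                continue
            result = int(parts[1])
            results.append({
                'source': source,
                'text': parts[0].strip(),
                'result': result,
                'label': 'positive' if result else 'negative',
            })

with open('sentiment.json', 'w') as out:
    json.dump(results, out, indent=1)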

Let’s jump into an IPython interpreter and load the data:

In [1]: import json

In [2]: raw = open('sentiment.json').read()

In [3]: data = json.loads(raw)

Now that we can load the data as a list of dictionaries, let’s create a DataFrame in the proper format (using the get_data_frame helper shown at the end of the post):

In [1]: data = get_data_frame('sentiment.json')

In [2]: data.shape
Out[2]: (3000, 2)

Next lets split our full dataset into a training, and testing dataset:

In [1]: train, test = get_train_test_data(data, size=0.2)

In [2]: train.shape
Out[2]: (2400, 2)

In [3]: test.shape
Out[3]: (600, 2)

We are now set to run a bit of accuracy testing:

In [1]: test_predict(train, test)
Out[1]: {
  'test_score': 0.80000000000000004,
  'train_score': 0.98375000000000001
}

We can slice our full dataset a few more times, just to make sure our accuracy test is, well, accurate:

In [1]: train, test = get_train_test_data(data, size=0.2)

In [2]: test_predict(train, test)
Out[2]: {
  'test_score': 0.79000000000000004,
  'train_score': 0.98416666666666663
}

In [3]: train, test = get_train_test_data(data, size=0.2)

In [4]: test_predict(train, test)
Out[4]: {
  'test_score': 0.80666666666666664,
  'train_score': 0.98291666666666666
}

In [5]: train, test = get_train_test_data(data, size=0.5)

In [6]: test_predict(train, test)
Out[6]: {
  'test_score': 0.79466666666666663,
  'train_score': 0.98999999999999999
}
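As an aside, scikit-learn can automate this kind of repeated splitting with cross-validation. Here is a minimal sketch, reusing the pipeline steps and the data DataFrame from above (cross_val_score comes from the same sklearn.cross_validation module already imported in the full code at the end of this post):

from sklearn.pipeline import Pipeline
from sklearn.cross_validation import cross_val_score

# score the same pipeline on 5 different train/test folds
pipeline = Pipeline(steps)
scores = cross_val_score(
    pipeline, data.text.values, data.classification.values, cv=5)

# scores holds one test accuracy per fold;
# the mean is a more stable estimate than any single split
print scores.mean()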

All that is left is to feed in the entire dataset and predict on new sentences:

In [1]: predict(data, 'This was the worst experience.')
Out[1]: [
  (u'positive', 0.17704535094140364),
  (u'negative', 0.82295464905859583)
]

In [2]: predict(data, 'The staff here was fabulous')
Out[2]: [
  (u'negative', 0.20651083543376234),
  (u'positive', 0.79348916456623764)
]
In [1]: predict(data, 'I hate you')
Out[1]: [
  (u'positive', 0.22509671479185445),
  (u'negative', 0.77490328520814555)
]

In [2]: predict(data, 'I love you')
Out[2]: [
  (u'negative', 0.10593166714256422),
  (u'positive', 0.89406833285743614)
]

Let’s put it all together by looking at the Python functions:


import re
import json
import random
import string

from pandas import DataFrame, concat

from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.cross_validation import train_test_split
from sklearn.feature_extraction.text import TfidfTransformer, \
    CountVectorizer


# update the pipeline to get best test results!
steps = [
    ('count_vectorizer',  CountVectorizer(
        stop_words='english', ngram_range=(1,  2))),
    ('tfidf_transformer', TfidfTransformer()),
    ('classifier',        MultinomialNB())
]


def id_generator(size=6):
    """
    Return random string.
    """
    chars = string.ascii_uppercase + string.digits
    return ''.join(random.choice(chars) for _ in range(size))

def get_data_frame(filename):
    """
    Read the sentiment JSON file and return a DataFrame.
    """

    raw = dict(ids=[], text=[])

    # open file and read as json
    _data = open(filename)
    data = json.loads(_data.read())

    # loop over all entries in the json file
    for d in data:

        # update raw list with text and sentiment label.
        raw['text'].append(
            dict(text=d['text'], classification=d['label']))
        raw['ids'].append(
            id_generator(size=12))

    return DataFrame(raw['text'], index=raw['ids'])

def merge(*args):
    """
    Merge two or more DataFrames.
    """
    return concat(args)

def get_train_test_data(data, size=0.2):
    """
    Split DataFrame and return a training and testing set.
    """
    train, test = train_test_split(data, test_size=size)
    return train, test

def test_predict(train, test):
    """
    Run predictions on training and test data,
    then return scores.
    """

    pipeline = Pipeline(steps)
    pipeline.fit(train.text.values, train.classification.values)

    train_score = pipeline.score(
        train.text.values, train.classification.values)
    test_score = pipeline.score(
        test.text.values, test.classification.values)

    return dict(train_score=train_score, test_score=test_score)

def predict(data, text):
    """
    Fit the pipeline on the full dataset, then return
    class probabilities for a new piece of text.
    """

    pipeline = Pipeline(steps)
    pipeline.fit(data.text.values, data.classification.values)

    res = zip(pipeline.classes_, pipeline.predict_proba([text])[0])
    return sorted(res, key=lambda x: x[1])

Test your Machine Learning

In my previous post “Python Machine Learning with Presidential Tweets”, I started messing around with sklearn and text classification.

Since then I’ve discovered a great tutorial from SciPy 2015. This video starts out slow enough for novices, and a recurring theme is testing your datasets.

After watching a good chunk of this video, I decided to go back to my code and implement a testing phase. Basically, I’ll split my data into two pieces: a training set and a testing set.

After training on the training set, I can feed in the testing set, then score predictions for accuracy.
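In sketch form, the idea looks something like this. It is a toy example with made-up data and a bare-bones pipeline, just to show the split and the two scores; my actual tweet code (which splits the whole DataFrame instead) follows below:

from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.cross_validation import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

# toy stand-in data; the real post uses tweets
texts = ['good movie', 'bad movie', 'great food', 'awful food'] * 50
labels = ['pos', 'neg', 'pos', 'neg'] * 50

model = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', MultinomialNB()),
])

# hold out 20% of the examples for testing
train_X, test_X, train_y, test_y = train_test_split(
    texts, labels, test_size=0.2)

# fit only on the training portion
model.fit(train_X, train_y)

# the training score is usually high; the testing score
# tells us how well the model generalizes to unseen data
print model.score(train_X, train_y)
print model.score(test_X, test_y)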

Let’s look at my code a couple of blocks at a time.

First I’ll load my Twitter data into Pandas DataFrames:

In [1]: t = get_data_frame('realDonaldTrump')

In [2]: len(t)
Out[2]: 3100

In [3]: h = get_data_frame('HillaryClinton')

In [4]: len(h)
Out[4]: 2605

These DataFrames include the text results (the tweets) and classifications (the Twitter usernames).

In [1]: t.text.values[0]
Out[1]: u'Thank you Council Bluffs, Iowa! Will be back soon. Remember- everything you need to know about Hillary -- just\u2026 https://t.co/45kIHxdX83'

In [2]: t.classification.values[0]
Out[2]: 'realDonaldTrump'

In [3]: h.text.values[0]
Out[3]: u'"If a candidate regularly and flippantly makes cruel and insulting comments about women...that\'s who that candidate\u2026 https://t.co/uOdGkhKWRY'

In [4]: h.classification.values[0]
Out[4]: 'HillaryClinton'

I then merge both DataFrames into a single DataFrame:

In [1]: data = merge(t, h)

In [2]: len(data)
Out[2]: 5705

Keep in mind, at this stage the DataFrame is ordered:

In [1]: data.classification.values[0:10]
Out[1]:
array(['realDonaldTrump', 'realDonaldTrump', 'realDonaldTrump',
 'realDonaldTrump', 'realDonaldTrump', 'realDonaldTrump',
 'realDonaldTrump', 'realDonaldTrump', 'realDonaldTrump',
 'realDonaldTrump'], dtype=object)

In [2]: data.classification.values[-10:]
Out[2]:
array(['HillaryClinton', 'HillaryClinton', 'HillaryClinton',
 'HillaryClinton', 'HillaryClinton', 'HillaryClinton',
 'HillaryClinton', 'HillaryClinton', 'HillaryClinton',
 'HillaryClinton'], dtype=object)

Next we need to randomize the DataFrame, then split it into training and testing sets.

In [1]: train, test = get_train_test_data(data, size=0.2)

In [2]: len(train)
Out[2]: 4564

In [3]: len(test)
Out[3]: 1141

Let’s make sure our data was randomized before the split:

In [1]: train.classification.values[0:20]
Out[1]:
array(['realDonaldTrump', 'realDonaldTrump', 'realDonaldTrump',
 'HillaryClinton', 'realDonaldTrump', 'realDonaldTrump',
 'HillaryClinton', 'realDonaldTrump', 'realDonaldTrump',
 'realDonaldTrump', 'realDonaldTrump', 'realDonaldTrump',
 'HillaryClinton', 'HillaryClinton', 'realDonaldTrump',
 'realDonaldTrump', 'HillaryClinton', 'HillaryClinton',
 'HillaryClinton', 'HillaryClinton'], dtype=object)

In [2]: test.classification.values[0:20]
Out[2]:
array(['realDonaldTrump', 'realDonaldTrump', 'HillaryClinton',
 'HillaryClinton', 'HillaryClinton', 'realDonaldTrump',
 'realDonaldTrump', 'HillaryClinton', 'HillaryClinton',
 'realDonaldTrump', 'HillaryClinton', 'HillaryClinton',
 'realDonaldTrump', 'realDonaldTrump', 'realDonaldTrump',
 'HillaryClinton', 'HillaryClinton', 'HillaryClinton',
 'HillaryClinton', 'HillaryClinton'], dtype=object)

Now comes the fun part. Let’s run predictions on both the training data and the testing data. We expect the training data to score near 100% accuracy; after all, this is the dataset we trained on.

In [1]: test_predict(train, test)
Out[1]: (0.99649430324276955, 0.92813321647677471)

In the above we hit 99% on the training dataset, and 92% on the testing data.

And for fun, let’s run a prediction using the full dataset:

In [1]: predict(data, 'Python programming is exciting.')
Out[1]:
[('HillaryClinton', 0.24396931839432631),
 ('realDonaldTrump', 0.75603068160567388)]

The full Python code is below; have fun hacking on it!



import re
import json
import random
import string

from pandas import DataFrame, concat

from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.cross_validation import train_test_split
from sklearn.feature_extraction.text import TfidfTransformer, \
    CountVectorizer


# update the pipeline to get best test results!
steps = [
    ('count_vectorizer',  CountVectorizer(
        stop_words='english', ngram_range=(1,  2))),
    ('tfidf_transformer', TfidfTransformer()),
    ('classifier',        MultinomialNB())
]

def id_generator(size=6):
    """
    Return random string.
    """
    chars = string.ascii_uppercase + string.digits
    return ''.join(random.choice(chars) for _ in range(size))

def get_data_frame(username):
    """
    Read tweets.json from directory and return DataFrame.
    """

    raw = dict(ids=[], text=[])

    # open file and read as json
    data = open('./%s/tweets.json' % username)
    tweets = json.loads(data.read())

    # loop over all tweets in json file
    for t in tweets:

        # skip retweets.
        if re.search('^RT ', t['tweet']):
            continue

        # update raw list with tweet values.
        raw['text'].append(
            dict(text=t['tweet'], classification=username))
        raw['ids'].append(
            '%s-%s' % (t['time'], id_generator()))

    return DataFrame(raw['text'], index=raw['ids'])

def merge(*args):
    """
    Merge two or more DataFrames.
    """
    return concat(args)

def get_train_test_data(data, size=0.2):
    """
    Split DataFrame and return a training and testing set.
    """
    train, test = train_test_split(data, test_size=size)
    return train, test

def test_predict(train, test):
    """
    Run predictions on training and test data,
    then return scores.
    """

    pipeline = Pipeline(steps)
    pipeline.fit(train.text.values, train.classification.values)

    train_score = pipeline.score(
        train.text.values, train.classification.values)
    test_score = pipeline.score(
        test.text.values, test.classification.values)

    return train_score, test_score

def predict(data, text):
    """
    Fit the pipeline on the full dataset, then return
    class probabilities for a new piece of text.
    """

    pipeline = Pipeline(steps)
    pipeline.fit(data.text.values, data.classification.values)

    res = zip(pipeline.classes_, pipeline.predict_proba([text])[0])
    return sorted(res, key=lambda x: x[1])



Python Machine Learning with Presidential Tweets

I’ve been spending a little bit of time researching Machine Learning, and was very happy to come across a Python library called sklearn.

While digging around Google, I came across a fantastic write-up on Document Classification by Zac Stewart. The article went pretty deep into writing a spam filter using machine learning and sklearn. After reading it, I wanted to try some of the concepts, but had no interest in writing a spam filter.

I decided instead to write a predictor using Tweets as the learning source, and what better users than the US Presidential candidates!

Let me forewarn: this merely uses term frequencies and n-grams on the tweets, and probably isn’t really useful or completely accurate, but hey, it could be fun, right? 🙂
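If term frequencies and n-grams are new to you, here is a tiny standalone example of what scikit-learn’s CountVectorizer produces (the two sample sentences are made up purely for illustration):

from sklearn.feature_extraction.text import CountVectorizer

# two made-up sentences to vectorize
docs = ['make america great', 'stronger together america']

# count unigrams and bigrams (1- and 2-word n-grams)
vectorizer = CountVectorizer(ngram_range=(1, 2))
counts = vectorizer.fit_transform(docs)

print vectorizer.get_feature_names()
# ['america', 'america great', 'great', 'make', 'make america',
#  'stronger', 'stronger together', 'together', 'together america']

print counts.toarray()
# [[1 1 1 1 1 0 0 0 0]
#  [1 0 0 0 0 1 1 1 1]]

Each sentence becomes a row of counts over that vocabulary, and those counts are what the classifier actually learns from. Now, on to grabbing the tweets and building the DataFrames: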

In [1]: tweets = get_tweets('HillaryClinton')
In [2]: tweets = get_tweets('realDonaldTrump')

In [3]: h = get_data_frame('HillaryClinton')
In [4]: t = get_data_frame('realDonaldTrump')

In [5]: data = merge_data_frames(t, h)

A good baseline might be to predict on an actual tweet the candidate has posted:


In [1]: predict(data, 'The question in this election: Who can put the plans into action that will make your life better?')
('realDonaldTrump', 0.15506298409438407)
('HillaryClinton', 0.84493701590561299)

Alright that is an 84% to 15% prediction, pretty good.


In [1]: predict(data, 'I won every poll from last nights Presidential Debate - except for the little watched @CNN poll.')
('HillaryClinton', 0.069884565641135613)
('realDonaldTrump', 0.93011543435886102)

This prediction is giving a 93% to 6%, even better.

Now let’s have a little fun by throwing in things we might assume, but that the candidates did not post:

In [1]: predict(data, 'I have really big hands')
('HillaryClinton', 0.39802148371499757)
('realDonaldTrump', 0.60197851628500265)

In [2]: predict(data, 'I am for woman rights')
('realDonaldTrump', 0.3698772371039914)
('HillaryClinton', 0.63012276289600766)

We could also feed in some famous quotes:


In [1]: predict(data, "Two things are infinite: the universe and human stupidity; and I'm not sure about the universe.")
('realDonaldTrump', 0.28321206465202214)
('HillaryClinton', 0.71678793534798135)


In [1]: predict(data, 'A room without books is like a body without a soul.')
('realDonaldTrump', 0.39169158094239315)
('HillaryClinton', 0.60830841905760524)

Alright, so go have a look at the code; you can find it on my GitHub page.

Happy Hacking!