Arduino values to Python over Serial

I’ve done a bit of reading on the ReadAnalogVoltage tutorial from Arduino’s home page, which gives a straightforward way to read voltage from an analog pin.

I wanted to take this one step further and send the value over serial, then read it in Python using pySerial.

My setup is very straightforward: an Arduino UNO, a breadboard, and a battery pack holding 4x AA batteries:

[Image: voltage_setup]

To start out I merely want to print the voltage value to the serial console from the Arduino IDE; my code looks something like this:

void setup() {
  // connect to serial
  Serial.begin(9600);
}

void loop() {

  // read value from analog pin
  int sensorValue = analogRead(A0);
 
  // convert to voltage and print to serial connection
  // https://www.arduino.cc/en/Tutorial/ReadAnalogVoltage
  float voltage = sensorValue * ( 5.0 / 1023.0 );
  Serial.println(voltage);

}

Now that we’ve verified this works, let’s make a couple of modifications to the Arduino code.

Since the value returned by analogRead may be greater than 255 (more than can fit in a single byte), we will need to send two bytes: a high byte and a low byte. These are also known as the most significant byte and the least significant byte.
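
As a quick sanity check on the arithmetic, here is how a value like 700 (too big for one byte) splits into a high and low byte, and how the receiving side puts it back together. This is a small Python illustration, not part of the Arduino sketch:

# 700 does not fit in a single byte (a byte maxes out at 255)
value = 700

high = value >> 8      # shift right 8 bits  -> 2
low = value & 0xFF     # mask the low 8 bits -> 188

# the receiver reverses the split
print(high * 256 + low)  # prints 700

Here is the updated Arduino sketch: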

void setup() {
  // connect to serial
  Serial.begin(9600);
}

void loop() {

  // read value from analog pin
  int sensorValue = analogRead(A0);
 
  // get the high and low byte from value
  byte high = highByte(sensorValue);
  byte low = lowByte(sensorValue);

  // write the high and low byte to serial
  Serial.write(high);
  Serial.write(low);

}

Then on the Python side we can use pySerial to read the two bytes, recombine them, and convert to a voltage using the same formula from the Arduino tutorial.

import serial

# open our serial port at 9600 baud
dev = '/dev/cu.usbmodem1411'
with serial.Serial(dev, 9600, timeout=1) as ser:

  while True:

    # read 2 bytes from our serial connection
    raw = ser.read(size=2)

    # only proceed if we actually received both bytes
    if len(raw) == 2:

      # read the high and low byte
      high, low = raw

      # recombine the high and low byte
      # to get the original value
      val = ord(high) * 256 + ord(low)

      # print our voltage reading
      # https://www.arduino.cc/en/Tutorial/ReadAnalogVoltage
      print round(val * ( 5.0 / 1023.0), 2)

One thing to take into consideration: if no voltage is applied to the analog pin, the pin is left floating and the readings will be random and invalid. You will see this in the video below, before I connect the battery pack. Keep in mind my battery pack is producing about 5 volts.
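
A related caveat with this two-byte protocol: if the Python script happens to start reading in the middle of a pair, the high and low bytes get swapped and the reconstructed value is garbage. Since analogRead() can only return 0 through 1023, one simple safeguard (a sketch, not part of the original loop above) is to skip anything outside that range:

      # inside the while loop from above
      val = ord(high) * 256 + ord(low)

      # analogRead() only produces 0..1023; anything outside that
      # range means the two-byte frames are out of sync, so skip it
      if 0 <= val <= 1023:
        print round(val * (5.0 / 1023.0), 2)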

Python and sentiment analysis

While looking for datasets to throw at sklearn, I came across the UCI Sentiment Labelled Sentences Data Set.

UCI provides positive/negative tagging on real-world data; the data comes from three sources (Amazon, Yelp, and IMDB).

The only problem is that the format is a little strange: there is a raw, unstructured .txt file for each source, and not every line is tagged with sentiment.

To make the data easier to interact with, I generated a JSON file containing only the lines tagged with sentiment. Go ahead and download it.
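
For reference, here is a minimal sketch of how that conversion might look, assuming the three UCI files are tab-separated sentence/score pairs (1 for positive, 0 for negative) and that the other two files follow the same naming as the Amazon one shown below; gzip the result afterwards to get the .gz used here:

import json

sources = ['amazon_cells_labelled.txt',
           'imdb_labelled.txt',
           'yelp_labelled.txt']

results = []
for source in sources:
    for line in open(source):
        # tagged lines look like: "<sentence>\t<0 or 1>"
        parts = line.rsplit('\t', 1)
        if len(parts) != 2 or parts[1].strip() not in ('0', '1'):
            # no sentiment tag on this line, skip it
            continue
        score = int(parts[1])
        results.append({
            'result': score,
            'source': source,
            'label': 'positive' if score else 'negative',
            'text': parts[0].strip(),
        })

with open('sentiment.json', 'w') as fh:
    json.dump(results, fh, indent=1)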

$ zcat sentiment.json.gz | head -n 25
[
 {
 "result": 0,
 "source": "amazon_cells_labelled.txt",
 "label": "negative",
 "text": "So there is no way for me to plug it in here in the US unless I go by a converter."
 },
 {
 "result": 1,
 "source": "amazon_cells_labelled.txt",
 "label": "positive",
 "text": "Good case, Excellent value."
 },
 {
 "result": 1,
 "source": "amazon_cells_labelled.txt",
 "label": "positive",
 "text": "Great for the jawbone."
 },
 {
 "result": 0,
 "source": "amazon_cells_labelled.txt",
 "label": "negative",
 "text": "Tied to charger for conversations lasting more than 45 minutes.MAJOR PROBLEMS!!"
 },

Let’s jump into an IPython interpreter and load the data:

In [1]: import json

In [2]: raw = open('sentiment.json').read()

In [3]: sentiment = json.loads(raw)

Now that we’ve confirmed the data loads cleanly as a list of dictionaries, we can use the get_data_frame helper (defined in full at the end of this post) to build a DataFrame in the proper format:

In [1]: data = get_data_frame('sentiment.json')

In [2]: data.shape
Out[2]: (3000, 2)

Next let’s split our full dataset into training and testing sets:

In [1]: train, test = get_train_test_data(data, size=0.2)

In [2]: train.shape
Out[2]: (2400, 2)

In [3]: test.shape
Out[3]: (600, 2)

We are now set to run a bit of accuracy testing:

In [1]: test_predict(train, test)
Out[1]: {
  'test_score': 0.80000000000000004,
  'train_score': 0.98375000000000001
}

We can slice our full dataset a few more times, just to make sure our accuracy test is... accurate:

In [1]: train, test = get_train_test_data(data, size=0.2)

In [2]: test_predict(train, test)
Out[2]: {
  'test_score': 0.79000000000000004,
  'train_score': 0.98416666666666663
}

In [3]: train, test = get_train_test_data(data, size=0.2)

In [4]: test_predict(train, test)
Out[4]: {
  'test_score': 0.80666666666666664,
  'train_score': 0.98291666666666666
}

In [5]: train, test = get_train_test_data(data, size=0.5)

In [6]: test_predict(train, test)
Out[6]: {
  'test_score': 0.79466666666666663,
  'train_score': 0.98999999999999999
}
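
As an aside, scikit-learn can automate this kind of repeated slicing with cross_val_score. A minimal sketch, reusing the steps and data defined in this post (on newer scikit-learn releases the import comes from sklearn.model_selection instead):

from sklearn.pipeline import Pipeline
from sklearn.cross_validation import cross_val_score

# reuse the `steps` and `data` defined elsewhere in this post
pipeline = Pipeline(steps)
scores = cross_val_score(
    pipeline, data.text.values, data.classification.values, cv=5)

# mean accuracy across the five folds
print scores.mean()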

All that is left is to feed in the entire dataset and predict on new sentences:

In [1]: predict(data, 'This was the worst experience.')
Out[1]: [
  (u'positive', 0.17704535094140364),
  (u'negative', 0.82295464905859583)
]

In [2]: predict(data, 'The staff here was fabulous')
Out[2]: [
  (u'negative', 0.20651083543376234),
  (u'positive', 0.79348916456623764)
]

In [1]: predict(data, 'I hate you')
Out[1]: [
  (u'positive', 0.22509671479185445),
  (u'negative', 0.77490328520814555)
]

In [2]: predict(data, 'I love you')
Out[2]: [
  (u'negative', 0.10593166714256422),
  (u'positive', 0.89406833285743614)
]

Let’s put it all together by looking at the Python functions:


import json
import random
import string

from pandas import DataFrame, concat

from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
# note: newer scikit-learn releases moved train_test_split
# to sklearn.model_selection
from sklearn.cross_validation import train_test_split
from sklearn.feature_extraction.text import TfidfTransformer, \
    CountVectorizer


# update the pipeline to get best test results!
steps = [
    ('count_vectorizer',  CountVectorizer(
        stop_words='english', ngram_range=(1,  2))),
    ('tfidf_transformer', TfidfTransformer()),
    ('classifier',        MultinomialNB())
]


def id_generator(size=6):
    """
    Return random string.
    """
    chars = string.ascii_uppercase + string.digits
    return ''.join(random.choice(chars) for _ in range(size))

def get_data_frame(filename):
    """
    Read a sentiment JSON file and return a DataFrame.
    """

    raw = dict(ids=[], text=[])

    # open file and read as json
    _data = open(filename)
    data = json.loads(_data.read())

    # loop over all records in the json file
    for d in data:

        # update raw dict with the record's text and label.
        raw['text'].append(
            dict(text=d['text'], classification=d['label']))
        raw['ids'].append(
            id_generator(size=12))

    return DataFrame(raw['text'], index=raw['ids'])

def merge(*args):
    """
    Merge two or more DataFrames.
    """
    return concat(args)

def get_train_test_data(data, size=0.2):
    """
    Split DataFrame and return a training and testing set.
    """
    train, test = train_test_split(data, test_size=size)
    return train, test

def test_predict(train, test):
    """
    Run predictions on training and test data,
    then return scores.
    """

    pipeline = Pipeline(steps)
    pipeline.fit(train.text.values, train.classification.values)

    train_score = pipeline.score(
        train.text.values, train.classification.values)
    test_score = pipeline.score(
        test.text.values, test.classification.values)

    return dict(train_score=train_score, test_score=test_score)

def predict(data, text):
    """
    Fit on the full dataset, then return the class
    probabilities for a new piece of text.
    """

    pipeline = Pipeline(steps)
    pipeline.fit(data.text.values, data.classification.values)

    res = zip(pipeline.classes_, pipeline.predict_proba([text])[0])
    return sorted(res, key=lambda x: x[1])

Test your Machine Learning

In my previous post "Python Machine Learning with Presidential Tweets", I started messing around with sklearn and text classification.

Since then I’ve discovered a great tutorial from SciPy 2015. This video starts out slow enough for novices, and a recurring theme is testing your datasets.

After watching a good chunk of this video, I decided to go back to my code and implement a testing phase. Basically, I’ll split my data into two pieces: a training set and a testing set.

After training on the training set, I can feed in the testing set, then score predictions for accuracy.

Let’s look at this a couple of blocks at a time.

First I’ll load my Twitter data into pandas DataFrames:

In [1]: t = get_data_frame('realDonaldTrump')

In [2]: len(t)
Out[2]: 3100

In [3]: h = get_data_frame('HillaryClinton')

In [4]: len(h)
Out[4]: 2605

These DataFrames include the text results (Tweet), and classifications (Twitter username).

In [1]: t.text.values[0]
Out[1]: u'Thank you Council Bluffs, Iowa! Will be back soon. Remember- everything you need to know about Hillary -- just\u2026 https://t.co/45kIHxdX83'

In [2]: t.classification.values[0]
Out[2]: 'realDonaldTrump'

In [3]: h.text.values[0]
Out[3]: u'"If a candidate regularly and flippantly makes cruel and insulting comments about women...that\'s who that candidate\u2026 https://t.co/uOdGkhKWRY'

In [4]: h.classification.values[0]
Out[4]: 'HillaryClinton'

I then merge both DataFrames into a single DataFrame:

In [1]: data = merge(t, h)

In [2]: len(data)
Out[2]: 5705

Keep in mind, at this stage the DataFrame is ordered:

In [1]: data.classification.values[0:10]
Out[1]:
array(['realDonaldTrump', 'realDonaldTrump', 'realDonaldTrump',
 'realDonaldTrump', 'realDonaldTrump', 'realDonaldTrump',
 'realDonaldTrump', 'realDonaldTrump', 'realDonaldTrump',
 'realDonaldTrump'], dtype=object)

In [2]: data.classification.values[-10:]
Out[2]:
array(['HillaryClinton', 'HillaryClinton', 'HillaryClinton',
 'HillaryClinton', 'HillaryClinton', 'HillaryClinton',
 'HillaryClinton', 'HillaryClinton', 'HillaryClinton',
 'HillaryClinton'], dtype=object)

Next we need to randomize the DataFrame and split it into training and testing sets; conveniently, train_test_split shuffles the rows before splitting, so both happen in one call.

In [1]: train, test = get_train_test_data(data, size=0.2)

In [2]: len(train)
Out[2]: 4564

In [3]: len(test)
Out[3]: 1141

Let’s make sure our data was shuffled as part of the split:

In [1]: train.classification.values[0:20]
Out[1]:
array(['realDonaldTrump', 'realDonaldTrump', 'realDonaldTrump',
 'HillaryClinton', 'realDonaldTrump', 'realDonaldTrump',
 'HillaryClinton', 'realDonaldTrump', 'realDonaldTrump',
 'realDonaldTrump', 'realDonaldTrump', 'realDonaldTrump',
 'HillaryClinton', 'HillaryClinton', 'realDonaldTrump',
 'realDonaldTrump', 'HillaryClinton', 'HillaryClinton',
 'HillaryClinton', 'HillaryClinton'], dtype=object)

In [2]: test.classification.values[0:20]
Out[2]:
array(['realDonaldTrump', 'realDonaldTrump', 'HillaryClinton',
 'HillaryClinton', 'HillaryClinton', 'realDonaldTrump',
 'realDonaldTrump', 'HillaryClinton', 'HillaryClinton',
 'realDonaldTrump', 'HillaryClinton', 'HillaryClinton',
 'realDonaldTrump', 'realDonaldTrump', 'realDonaldTrump',
 'HillaryClinton', 'HillaryClinton', 'HillaryClinton',
 'HillaryClinton', 'HillaryClinton'], dtype=object)

Now comes the fun part. Let’s score predictions on both the training data and the testing data. We expect the training data to come in near 100% accuracy; after all, this is the dataset we trained on.

In [1]: test_predict(train, test)
Out[1]: (0.99649430324276955, 0.92813321647677471)

In the above we hit 99% on the training set and 92% on the testing data; the gap between the two is exactly the kind of overfitting signal a held-out test set is meant to expose.

And for fun, let’s run a prediction using the full dataset:

In [1]: predict(data, 'Python programming is exciting.')
Out[1]:
[('HillaryClinton', 0.24396931839432631),
 ('realDonaldTrump', 0.75603068160567388)]

The full Python code is below, have fun hacking on it!


import re
import json
import random
import string

from pandas import DataFrame, concat

from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
# note: newer scikit-learn releases moved train_test_split
# to sklearn.model_selection
from sklearn.cross_validation import train_test_split
from sklearn.feature_extraction.text import TfidfTransformer, \
    CountVectorizer

# update the pipeline to get best test results!
steps = [
    ('count_vectorizer',  CountVectorizer(
        stop_words='english', ngram_range=(1,  2))),
    ('tfidf_transformer', TfidfTransformer()),
    ('classifier',        MultinomialNB())
]

def id_generator(size=6):
    """
    Return random string.
    """
    chars = string.ascii_uppercase + string.digits
    return ''.join(random.choice(chars) for _ in range(size))

def get_data_frame(username):
    """
    Read tweets.json from directory and return DataFrame.
    """

    raw = dict(ids=[], text=[])

    # open file and read as json
    data = open('./%s/tweets.json' % username)
    tweets = json.loads(data.read())

    # loop over all tweets in json file
    for t in tweets:

        # skip retweets.
        if re.search('^RT ', t['tweet']):
            continue

        # update raw list with tweet values.
        raw['text'].append(
            dict(text=t['tweet'], classification=username))
        raw['ids'].append(
            '%s-%s' % (t['time'], id_generator()))

    return DataFrame(raw['text'], index=raw['ids'])

def merge(*args):
    """
    Merge two or more DataFrames.
    """
    return concat(args)

def get_train_test_data(data, size=0.2):
    """
    Split DataFrame and return a training and testing set.
    """
    train, test = train_test_split(data, test_size=size)
    return train, test

def test_predict(train, test):
    """
    Run predictions on training and test data,
    then return scores.
    """

    pipeline = Pipeline(steps)
    pipeline.fit(train.text.values, train.classification.values)

    train_score = pipeline.score(
        train.text.values, train.classification.values)
    test_score = pipeline.score(
        test.text.values, test.classification.values)

    return train_score, test_score

def predict(data, text):
    """
    Fit on the full dataset, then return the class
    probabilities for a new piece of text.
    """

    pipeline = Pipeline(steps)
    pipeline.fit(data.text.values, data.classification.values)

    res = zip(pipeline.classes_, pipeline.predict_proba([text])[0])
    return sorted(res, key=lambda x: x[1])

For additional content, I might suggest some of the courses found on CourseDuck.