Python Machine Learning with Presidential Tweets

I’ve been spending a little bit of time researching Machine Learning, and was very happy to come across a Python library called sklearn.

While digging around Google, I came across a fantastic write up on Document Classification by Zac Steward. This article went pretty deep into writing a spam filter using machine learning, and sklearn. After reading the article I wanted to try some of the concepts, but had no interest in writing a spam filter.

I decided instead to write a predictor using Tweets as the learning source, and what better users than the US Presidential candidates!

Let me forewarn, this is merely using term frequencies, and n-grams on the tweets, and probably isn’t really useful or completely accurate, but hey, it could be fun, right? 🙂

In [1]: tweets = get_tweets('HillaryClinton')
In [2]: tweets = get_tweets('realDonaldTrump')

In [3]: h = get_data_frame('HillaryClinton')
In [4]: t = get_data_frame('realDonaldTrump')

In [5]: data = merge_data_frames(t, h)

A good baseline might be to predict on an actual tweet the candidate has posted:

screen-shot-2016-09-29-at-6-41-08-pm

In [1]: predict(data, 'The question in this election: Who can put the plans into action that will make your life better?')
('realDonaldTrump', 0.15506298409438407)
('HillaryClinton', 0.84493701590561299)

Alright that is an 84% to 15% prediction, pretty good.

screen-shot-2016-09-29-at-6-44-35-pm

In [1]: predict(data, 'I won every poll from last nights Presidential Debate - except for the little watched @CNN poll.')
('HillaryClinton', 0.069884565641135613)
('realDonaldTrump', 0.93011543435886102)

This prediction is giving a 93% to 6%, even better.

Now lets have a little fun by throwing in things we would assume, but the candidates did not post:

In [1]: predict(data, 'I have really big hands')
('HillaryClinton', 0.39802148371499757)
('realDonaldTrump', 0.60197851628500265)

In [2]: predict(data, 'I am for woman rights')
('realDonaldTrump', 0.3698772371039914)
('HillaryClinton', 0.63012276289600766)

We could also feed in some famous quotes:

screen-shot-2016-09-29-at-6-48-17-pm

In [1]: predict(data, "Two things are infinite: the universe and human stupidity; and I'm not sure about the universe.")
('realDonaldTrump', 0.28321206465202214)
('HillaryClinton', 0.71678793534798135)

screen-shot-2016-09-29-at-6-49-35-pm

In [1]: predict(data, 'A room without books is like a body without a soul.')
('realDonaldTrump', 0.39169158094239315)
('HillaryClinton', 0.60830841905760524)

Alright, so go have a look at the code, you can find it on my Github page.

Happy Hacking!