I’ve been spending a little time researching machine learning, and was very happy to come across a Python library called sklearn.
While digging around Google, I came across a fantastic write-up on Document Classification by Zac Steward. The article goes pretty deep into writing a spam filter using machine learning and sklearn. After reading it I wanted to try some of the concepts, but had no interest in writing a spam filter.
I decided instead to write a predictor using Tweets as the learning source, and what better users than the US Presidential candidates!
A word of warning: this merely uses term frequencies and n-grams on the tweets, so it probably isn’t very useful or accurate, but hey, it could be fun, right? :)
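For anyone unfamiliar with the terms, here is a minimal sketch of what counting term frequencies over n-grams means, using only the standard library. The helper names (`tokenize`, `ngrams`, `term_frequencies`) and the tokenizing rule are my own illustrative assumptions, not the code from the repository, and sklearn's vectorizers do this work for you with their own defaults.

```python
from collections import Counter
import re


def tokenize(text):
    # Hypothetical tokenizer: lowercase, then keep runs of letters, digits, and apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())


def ngrams(tokens, n):
    # Slide a window of n tokens across the list and join each window into one term.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def term_frequencies(text, max_n=2):
    # Count every 1-gram up to max_n-gram that occurs in the text.
    tokens = tokenize(text)
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(tokens, n))
    return counts


tf = term_frequencies("I have really big hands, really big")
print(tf.most_common(3))
```

A classifier then compares counts like these between the two accounts: terms that one user repeats far more often than the other pull the prediction toward that user.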
    In : tweets = get_tweets('HillaryClinton')
    In : tweets = get_tweets('realDonaldTrump')
    In : h = get_data_frame('HillaryClinton')
    In : t = get_data_frame('realDonaldTrump')
    In : data = merge_data_frames(t, h)
A good baseline might be to predict on an actual tweet the candidate has posted:
    In : predict(data, 'The question in this election: Who can put the plans into action that will make your life better?')
    ('realDonaldTrump', 0.15506298409438407)
    ('HillaryClinton', 0.84493701590561299)

Alright, that is roughly an 84% to 16% prediction in favor of HillaryClinton, who did post this tweet. Pretty good.
    In : predict(data, 'I won every poll from last nights Presidential Debate - except for the little watched @CNN poll.')
    ('HillaryClinton', 0.069884565641135613)
    ('realDonaldTrump', 0.93011543435886102)

This prediction comes out at roughly 93% to 7% in favor of realDonaldTrump, even better.
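The real `predict` lives in the repository and isn't shown here. As a rough sketch of how per-user probabilities like these can be produced, the example below wires sklearn's `CountVectorizer` (unigrams and bigrams) into a `MultinomialNB` classifier. Using naive Bayes is my assumption, borrowed from the spam-filter approach, and the names `train` and `predict_sketch`, the labels `user_a` and `user_b`, and the four sample texts are invented placeholders rather than real tweets or the author's code.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


def train(texts, labels):
    # Count unigrams and bigrams, then fit a multinomial naive Bayes model on the counts.
    model = Pipeline([
        ("counts", CountVectorizer(ngram_range=(1, 2))),
        ("classifier", MultinomialNB()),
    ])
    model.fit(texts, labels)
    return model


def predict_sketch(model, text):
    # Pair each class label with the probability the model assigns to it.
    probabilities = model.predict_proba([text])[0]
    classes = model.named_steps["classifier"].classes_
    return list(zip(classes, probabilities))


# Invented placeholder texts, NOT real tweets.
texts = [
    "we will build great walls",
    "great great deal for everyone",
    "plans that make your life better",
    "better plans for families",
]
labels = ["user_a", "user_a", "user_b", "user_b"]

model = train(texts, labels)
result = predict_sketch(model, "better plans")
for label, probability in result:
    print((label, probability))
```

The two probabilities always sum to 1, which is why the outputs above read as a split between the candidates. With only a few thousand tweets per account, short or generic inputs tend to land closer to 50/50.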
Now let's have a little fun by throwing in things we might expect the candidates to say, but that they did not actually post:

    In : predict(data, 'I have really big hands')
    ('HillaryClinton', 0.39802148371499757)
    ('realDonaldTrump', 0.60197851628500265)

    In : predict(data, 'I am for woman rights')
    ('realDonaldTrump', 0.3698772371039914)
    ('HillaryClinton', 0.63012276289600766)
We could also feed in some famous quotes:
    In : predict(data, "Two things are infinite: the universe and human stupidity; and I'm not sure about the universe.")
    ('realDonaldTrump', 0.28321206465202214)
    ('HillaryClinton', 0.71678793534798135)

    In : predict(data, 'A room without books is like a body without a soul.')
    ('realDonaldTrump', 0.39169158094239315)
    ('HillaryClinton', 0.60830841905760524)
Alright, so go have a look at the code; you can find it on my GitHub page.