Python Machine Learning With Presidential Tweets
I’ve been spending a little time researching machine learning, and was very happy to come across a Python library called sklearn.
While digging around Google, I came across a fantastic write-up on Document Classification by Zac Stewart. The article goes pretty deep into writing a spam filter using machine learning and sklearn. After reading it I wanted to try some of the concepts, but had no interest in writing a spam filter.
I decided instead to write a predictor using Tweets as the learning source, and what better users than the US Presidential candidates!
Fair warning: this merely uses term frequencies and n-grams on the tweets, so it probably isn’t really useful or completely accurate, but hey, it could be fun, right? :)
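If term frequencies and n-grams are new to you, here is a rough, standard-library-only sketch of the idea: split a tweet into words, group neighbouring words into n-grams, and count how often each one shows up. The helper names (`tokenize`, `ngrams`, `term_frequencies`) are made up for this illustration and are not taken from my code; in sklearn, `CountVectorizer` with its `ngram_range` parameter does this kind of counting for you.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and pull out word tokens (keeping @ and # prefixes)."""
    return re.findall(r"[@#]?\w+", text.lower())


def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def term_frequencies(text, max_n=2):
    """Count all 1-grams up to max_n-grams found in the text."""
    tokens = tokenize(text)
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(tokens, n))
    return counts


# Made-up example sentence, not a real tweet.
counts = term_frequencies("Make America great again, make America safe again")
print(counts["america"], counts["make america"], counts["great again"])
```

Counts like these, collected per candidate, are the only signal the predictor has to work with, which is why short or generic sentences tend to produce less decisive results.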
In [1]: tweets = get_tweets('HillaryClinton')
In [2]: tweets = get_tweets('realDonaldTrump')
In [3]: h = get_data_frame('HillaryClinton')
In [4]: t = get_data_frame('realDonaldTrump')
In [5]: data = merge_data_frames(t, h)
A good baseline might be to predict on an actual tweet the candidate has posted:
In [1]: predict(data, 'The question in this election: Who can put the plans into action that will make your life better?')
('realDonaldTrump', 0.15506298409438407)
('HillaryClinton', 0.84493701590561299)
Alright, that is roughly an 84% to 16% prediction in favor of HillaryClinton, pretty good.
In [1]: predict(data, 'I won every poll from last nights Presidential Debate - except for the little watched @CNN poll.')
('HillaryClinton', 0.069884565641135613)
('realDonaldTrump', 0.93011543435886102)
This prediction comes out at roughly 93% to 7% in favor of realDonaldTrump, even better.
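For the curious, here is a minimal sketch of how a pair of probabilities like these can fall out of plain word counts. It assumes a multinomial naive Bayes classifier with add-one smoothing over unigrams and bigrams, which is one common choice for this kind of text classification; the real `predict` lives in the repository and may work differently. Everything here (`features`, `train`, `predict_proba`, the `candidate_a` and `candidate_b` labels, and the toy sentences) is hypothetical and written with the standard library only.

```python
import math
import re
from collections import Counter, defaultdict


def features(text):
    """Lowercased unigrams and bigrams for one piece of text."""
    words = re.findall(r"[@#]?\w+", text.lower())
    bigrams = [" ".join(pair) for pair in zip(words, words[1:])]
    return words + bigrams


def train(labelled_tweets):
    """Collect per-user feature counts and tweet counts from (user, text) pairs."""
    feature_counts = defaultdict(Counter)
    tweet_counts = Counter()
    for user, text in labelled_tweets:
        feature_counts[user].update(features(text))
        tweet_counts[user] += 1
    return feature_counts, tweet_counts


def predict_proba(model, text):
    """Return {user: probability} via multinomial naive Bayes (add-one smoothing)."""
    feature_counts, tweet_counts = model
    vocabulary = set().union(*feature_counts.values())
    total_tweets = sum(tweet_counts.values())
    log_scores = {}
    for user, counts in feature_counts.items():
        total = sum(counts.values())
        # Start from the log prior: this user's share of the training tweets.
        score = math.log(tweet_counts[user] / total_tweets)
        for feature in features(text):
            if feature not in vocabulary:
                continue  # skip features never seen during training
            score += math.log((counts[feature] + 1) / (total + len(vocabulary)))
        log_scores[user] = score
    # Turn the log scores into probabilities that sum to 1.
    top = max(log_scores.values())
    weights = {user: math.exp(s - top) for user, s in log_scores.items()}
    norm = sum(weights.values())
    return {user: w / norm for user, w in weights.items()}


# Made-up training sentences, only for illustration (not real tweets).
toy_tweets = [
    ("candidate_a", "we will build a great wall"),
    ("candidate_a", "a great rally with great people"),
    ("candidate_b", "families deserve affordable health care"),
    ("candidate_b", "health care is a right for families"),
]
model = train(toy_tweets)
probabilities = predict_proba(model, "great people deserve a great wall")
for user, probability in probabilities.items():
    print((user, probability))
```

The two numbers always sum to 1, so a sentence that shares few words with either candidate's tweets drifts toward an even split rather than a confident answer.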
Now let's have a little fun by throwing in things we might expect the candidates to say, but that they did not actually post:
In [1]: predict(data, 'I have really big hands')
('HillaryClinton', 0.39802148371499757)
('realDonaldTrump', 0.60197851628500265)
In [2]: predict(data, 'I am for woman rights')
('realDonaldTrump', 0.3698772371039914)
('HillaryClinton', 0.63012276289600766)
We could also feed in some famous quotes:
In [1]: predict(data, "Two things are infinite: the universe and human stupidity; and I'm not sure about the universe.")
('realDonaldTrump', 0.28321206465202214)
('HillaryClinton', 0.71678793534798135)
In [1]: predict(data, 'A room without books is like a body without a soul.')
('realDonaldTrump', 0.39169158094239315)
('HillaryClinton', 0.60830841905760524)
Alright, go have a look at the code; you can find it on my GitHub page.
Happy Hacking!