Bayesian classification is a technique for categorizing documents probabilistically. I recently started playing with Twitter data and realized there was no Ruby gem that would let me build a spam detector for tweets. The classifier gem only works on plain text, figuring out which words appear in each category, but a tweet is much more complicated than that. A tweet looks like this:

{:text=>"Firesale prices, too! RT @nirajc: Time to change your Facebook password. Hacker selling 1.5m accounts.",
:truncated=>false, :created_at=>"Fri Apr 23 18:26:51 +0000 2010", :coordinates=>nil, :geo=>nil, :favorited=>false,
:source=>"TweetDeck",  :place=>nil, :contributors=>nil,
:user=>{:verified=>false, :profile_text_color=>"666666", :friends_count=>226, :created_at=>"Wed Oct 08 07:15:23 +0000 2008",
:profile_link_color=>"2FC2EF", :favourites_count=>12, :description=>"All the news that's fit to tweet (and most that isn't)",
:lang=>"en", :profile_sidebar_fill_color=>"252429", :location=>"Brooklyn, NY", :following=>nil, :notifications=>nil,
:time_zone=>"Eastern Time (US & Canada)", :statuses_count=>981, :profile_sidebar_border_color=>"181A1E",
:profile_background_image_url=>"", :protected=>false,
:contributors_enabled=>false, :url=>"", :screen_name=>"carlfranzen", :name=>"Carl Franzen",
:profile_background_tile=>false, :profile_background_color=>"1A1B1F", :id=>16645918, :geo_enabled=>false,
:utc_offset=>-18000, :followers_count=>174}, :id=>12717456105}

As you can see, a tweet is just a hash of variables. So which variables are the best indicators of spam? I don’t know, and chances are you don’t either. But if we build a corpus of ham (legitimate) tweets and a corpus of spam tweets, we can train a Bayesian classifier on the two datasets, and it will figure out which variable values show up often in spam and which in ham.
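To make the idea concrete, here is a minimal sketch of a naive Bayes classifier that trains on hashes rather than words. It is not bayes_motel's actual code: the class name TinyHashBayes, its methods, and the add-one smoothing are my own assumptions for illustration.

```ruby
# Minimal naive Bayes over hash-shaped documents.
# Illustrative sketch only; TinyHashBayes is a hypothetical name, not bayes_motel's API.
class TinyHashBayes
  def initialize
    # @counts[category][[name, value]] => number of training docs containing that pair
    @counts = Hash.new { |h, k| h[k] = Hash.new(0) }
    @totals = Hash.new(0) # number of training docs per category
  end

  def train(doc, category)
    @totals[category] += 1
    flatten(doc).each { |pair| @counts[category][pair] += 1 }
  end

  # Returns category => log-probability score (higher means more likely).
  def score(doc)
    all = @totals.values.sum.to_f
    pairs = flatten(doc)
    @totals.each_with_object({}) do |(category, total), scores|
      log_prob = Math.log(total / all) # prior for this category
      pairs.each do |pair|
        # Add-one smoothing (an assumption) so an unseen pair never zeroes a category.
        log_prob += Math.log((@counts[category][pair] + 1.0) / (total + 2.0))
      end
      scores[category] = log_prob
    end
  end

  def classify(doc)
    score(doc).max_by { |_category, s| s }.first
  end

  private

  # Turns nested hashes into [name, value] pairs, e.g. ["user_lang", "en"].
  def flatten(doc, prefix = nil)
    doc.flat_map do |key, value|
      name = [prefix, key].compact.join("_")
      value.is_a?(Hash) ? flatten(value, name) : [[name, value]]
    end
  end
end

bayes = TinyHashBayes.new
bayes.train({ source: "TweetDeck", user: { lang: "en", verified: false } }, :ham)
bayes.train({ source: "web",       user: { lang: "en", verified: true  } }, :ham)
bayes.train({ source: "API",       user: { lang: "en", verified: false } }, :spam)
bayes.train({ source: "API",       user: { lang: "ru", verified: false } }, :spam)

puts bayes.classify({ source: "API", user: { lang: "en", verified: false } }) # prints "spam"
```

The made-up training data is tiny, but it shows the point: nobody told the classifier that :source matters, it just noticed that "API" turns up more often in the spam pile.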

Some variables don’t work, statistically speaking:

  • :id, :created_at – these values are unique for each tweet, which makes them useless for classification. BayesMotel trims any variable value that doesn’t appear in more than 3% of the corpus.
  • :followers_count – this is probably a pretty good spam/ham indicator in general, but not as a raw number. There are millions of possible values (@aplusk has 4.5 million followers) but we are only training on hundreds or thousands of tweets. It would work better to take the binary logarithm of followers_count, rounded down, to create discrete buckets: 32-63 followers = 5, 1024-2047 = 10 and so on. I’d bet any tweet with a value of 12 or more (i.e. 4096+ followers) is very likely to be ham.
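
Both ideas in the list above are easy to sketch in plain Ruby. The helper names log2_bucket and trim_rare are hypothetical, and this is not how bayes_motel implements its trimming; it only illustrates the arithmetic.

```ruby
# Bucket a count by the floor of its binary logarithm: 32..63 => 5, 1024..2047 => 10.
# (log2_bucket is a hypothetical helper name.)
def log2_bucket(count)
  return 0 if count < 1   # log2 is undefined for 0, so lump it in with bucket 0
  count.bit_length - 1    # exact floor(log2(count)) for positive integers
end

# Keep only the values seen in more than `min_share` of the documents.
# value_counts maps a variable's value to the number of documents containing it.
# (trim_rare is a hypothetical helper name.)
def trim_rare(value_counts, total_docs, min_share = 0.03)
  value_counts.select { |_value, seen| seen.to_f / total_docs > min_share }
end

p [174, 40, 4096, 4_500_000].map { |n| log2_bucket(n) } # prints [7, 5, 12, 22]

sources = { "TweetDeck" => 41, "web" => 57, "obscure-client" => 2 }
p trim_rare(sources, 100).keys # prints ["TweetDeck", "web"]
```

With bucketing, millions of possible follower counts collapse into a couple of dozen values, so each bucket has a fair chance of clearing the 3% cutoff.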

There are additional things we could do to improve our spam detector:

  • We aren’t inspecting the tweet text in any depth. It might be useful to add variables like “text_link_count” or “text_hashtag_count” to provide basic metrics for the tweet text content.
  • We aren’t performing any timeline checks or storing previous tweet state – spammers tend to tweet the same text over and over and their tweets all contain links. This is beyond the scope of a generic Bayesian system.
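
As a rough sketch of the first idea, the text metrics could be computed before training and merged into the tweet hash. The text_metrics helper and its regular expressions are my own assumptions, and they are deliberately crude (a URL fragment such as "/#top" would be counted as a hashtag, for example).

```ruby
# Derive simple counts from the tweet text so the classifier has something
# discrete to work with. (text_metrics is a hypothetical helper name.)
def text_metrics(tweet)
  text = tweet[:text].to_s
  {
    text_link_count:    text.scan(%r{https?://\S+}).size, # http:// and https:// links
    text_hashtag_count: text.scan(/(?<!\w)#\w+/).size     # "#" not preceded by a word character
  }
end

tweet = { text: "Win a free iPad http://example.com/a http://example.com/b #free #ipad #win" }
enriched = tweet.merge(text_metrics(tweet))

p enriched[:text_link_count]    # prints 2
p enriched[:text_hashtag_count] # prints 3
```

The enriched hash can then be passed to the classifier like any other tweet, since the new counts are just two more variables.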

I wrote bayes_motel based on my research over this past weekend. Give it a try and send a pull request if you make changes you’d like to see. The test suite gives more detail about the API and has a few thousand tweets to use as sample data. Happy coding!