November 21, 2004

Review of Spambayes 1.0

As mentioned in an earlier entry, I was not satisfied with the job Mozilla Thunderbird was doing at filtering spam, so I decided to have another program filter spam before the mail reached Thunderbird. I looked at two candidates: DSPAM and Spambayes. DSPAM intrigued me with its sophistication, its success rate at filtering, and its graphs (probably because of the statistician in me). I didn't go with DSPAM because it would require me to install the Apache web server (which I am not ready to do at this time) and, from what documentation I could find, the setup looks difficult. I have also worked with Spambayes before, on a Linux install that died in a hard drive crash; it only requires Python, which I already have installed, and is much easier to set up.

Spambayes was very easy to set up. The documentation on the Spambayes website is sufficient to guide a user through setting up a proxy for their mail program to connect to. Configuration is done through a web page served on localhost (the user's own computer), port 8880. Spambayes classifies mail into three categories: ham, spam, and unsure. I set up Spambayes to place the category name in the subject line of an e-mail when it is spam or unsure. This serves two purposes: I can set up a rule in Thunderbird to automatically move spam to the trash, and I know when I need to train Spambayes on mail it was unsure about or correct it when it misclassified ham or spam. Training Spambayes consists of forwarding the e-mail to spambayes_ham@localhost or spambayes_spam@localhost, depending on whether the user decides the e-mail is ham or spam.
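For the curious, here is a rough Python sketch of what the subject-line rule amounts to. It is only an illustration: the exact tag format (the category word at the very start of the subject) is my assumption, and `route` and the folder names are made up for this example, not anything Spambayes or Thunderbird provides.

```python
import re
from email import message_from_string

# Assumption: the category word appears at the very start of the subject,
# followed by punctuation or a space (e.g. "spam,Cheap pills").
TAG_PATTERN = re.compile(r"^(spam|unsure)\b", re.IGNORECASE)

def route(raw_message: str) -> str:
    """Return a hypothetical folder name for a raw e-mail message."""
    msg = message_from_string(raw_message)
    subject = (msg["Subject"] or "").strip()
    match = TAG_PATTERN.match(subject)
    if match is None:
        return "inbox"       # untagged mail is treated as ham
    if match.group(1).lower() == "spam":
        return "trash"       # tagged spam is discarded automatically
    return "review"          # tagged unsure: look at it, then train


if __name__ == "__main__":
    print(route("Subject: spam,Cheap pills\n\nbody"))
    print(route("Subject: Lunch on Friday?\n\nbody"))
```

Anything that lands in "review" is what I would then forward to one of the two training addresses.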

Since I started using Spambayes, I have trained it on 151 pieces of spam and 19 pieces of ham. Spambayes complains about the large imbalance between ham and spam training. I usually don't forward mail that it classifies correctly, but I have noticed it having a hard time classifying some e-mail, so I have started sending it ham even when it already guessed correctly. Spambayes is doing a better job overall than Thunderbird, though there have been a few instances where Thunderbird correctly flagged as spam an e-mail that Spambayes missed.
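To put a number on that imbalance, a quick back-of-the-envelope check in Python follows. The threshold of 5 is a value I picked for illustration; I have not checked what ratio Spambayes itself uses before it complains, and the function names are my own.

```python
def imbalance_ratio(spam_count: int, ham_count: int) -> float:
    """Ratio of the larger training count to the smaller one."""
    larger = max(spam_count, ham_count)
    smaller = min(spam_count, ham_count)
    if smaller == 0:
        # Nothing trained on one side: infinitely imbalanced unless both are zero.
        return float("inf") if larger > 0 else 1.0
    return larger / smaller

def is_imbalanced(spam_count: int, ham_count: int, threshold: float = 5.0) -> bool:
    """True when one category outnumbers the other by more than `threshold`.

    The default threshold is a hypothetical value, not the one Spambayes uses.
    """
    return imbalance_ratio(spam_count, ham_count) > threshold


if __name__ == "__main__":
    ratio = imbalance_ratio(151, 19)
    print(f"spam:ham training ratio is about {ratio:.1f} to 1")
    print("imbalanced" if is_imbalanced(151, 19) else "balanced enough")
```

With my counts that works out to roughly eight pieces of spam for every piece of ham, which is why feeding it more ham seems worth the effort.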

I'll continue to use Spambayes, training it on ham and spam, until the end of the year, then reevaluate its performance. If I'm not totally happy with the results, I may try out DSPAM.
