The Spam Eater Spam Blocker
A Programmer's Perspective
by Chris Fortune
Update 1/Aug/2004 - I have softened my stance a little, and now use a combination of techniques: E-mail is "scored" as -1% 'bogus', then is gradually given weighted points if it shows up on 1. RBLs (remote or local), 2. DCC, 3. local header blacklists. If it passes these inexpensive tests it is put through antivirus, then Bayesian filtering. If it looks sufficiently spammy it is rejected. If it looks very hammy it is optimistically delivered, with the option for users to report it. If the spam score is moderate and it is not a mailing list or system mail, it is put into the user's webmail quarantine and challenged. Challenges that bounce with a 550 response are cleaned out of quarantine after a day. After two days, Vipul's Razor and DCC are used to test the quarantined mail again. The remaining quarantined mail is deleted according to its spam score, with the highest scoring mail deleted first. The Bayesian filter is used as a repository of memory, so that the system gradually 'learns'. This strategy has proven to be extraordinarily effective at very low expense. It looks a bit similar to SpamAssassin, but does not use the massive regular expression matching heuristics that SpamAssassin does, so it is much faster, with fewer false positives.
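To make the flow above concrete, here is a minimal sketch in Python of the score-then-route idea. It is not my production code: the weights, the thresholds, the `Message` fields and the `route` function are all hypothetical values and names chosen for illustration, and the Bayesian filter is reduced to a number passed in by the caller. What happens to moderately scored mailing list or system mail is not spelled out above, so the sketch simply assumes it is delivered.

```python
from dataclasses import dataclass

# Hypothetical weights and thresholds; the article gives no actual numbers.
BASE_SCORE = -1            # the "-1% bogus" starting score mentioned above
WEIGHTS = {"rbl": 40, "dcc": 30, "header_blacklist": 30}
REJECT_AT = 80             # "sufficiently spammy"
DELIVER_BELOW = 20         # "very hammy"


@dataclass
class Message:
    sender: str
    on_rbl: bool = False              # listed on a remote or local RBL
    on_dcc: bool = False              # flagged as bulk by DCC
    header_blacklisted: bool = False  # matched a local header blacklist
    is_list_or_system: bool = False   # mailing list or system mail


def cheap_score(msg: Message) -> int:
    """Add weighted points for each inexpensive test the message fails."""
    score = BASE_SCORE
    if msg.on_rbl:
        score += WEIGHTS["rbl"]
    if msg.on_dcc:
        score += WEIGHTS["dcc"]
    if msg.header_blacklisted:
        score += WEIGHTS["header_blacklist"]
    return score


def route(msg: Message, bayes_points: int) -> str:
    """Decide what to do with a message.

    bayes_points stands in for the output of the antivirus and Bayesian
    stage, which is not implemented in this sketch.
    """
    score = cheap_score(msg)
    if score >= REJECT_AT:
        return "reject"               # never reaches the expensive tests
    score += bayes_points
    if score >= REJECT_AT:
        return "reject"
    if score < DELIVER_BELOW:
        return "deliver"
    if msg.is_list_or_system:
        return "deliver"              # assumption: lists are not challenged
    return "quarantine_and_challenge"
```

The point of the ordering is cost: the cheap lookups run first, so obviously bad mail never reaches the antivirus or Bayesian stage.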
Anti spam software analysis:
What is the best way to stop spam email? All methods have some merits ... and demerits, as I have learned after hundreds of hours of blood, sweat and code. No matter what technical method is used, spam is not a technical problem but a social problem, which requires a human-driven solution ...
This is not an objective research paper, but a sharing of my experiences and thoughts while "in the trenches" fighting the anti spam war. Read it and then drop a message into my web-mailer (no, I do not publish my home email address on the web!).
I have written three pieces of server-side anti spam software: 1. a heuristic spam filter (using SpamAssassin), 2. a spam IP blocker (using SpamCop, ORDB, Vipul's Razor, etc), and 3. a challenge / response whitelist blocker (spameater.com - spam blocker). The first two methods proved to be only about 70%-90% effective even when finely tuned, and were losing 1%-12% of good email. The user wants 100% of the spam removed and 0% of the good mail removed, so I rejected those methods and focused on the third method, using my own independent research. To my happy surprise, the third design worked!
The problem lies in the original design of the email system, which was originally used only by university researchers, the military, and computer programmers. In the 1970s, nobody was thinking about spam, and the creators of SMTP believed that it would be replaced soon, so there was no authentication built in. Authentication was inherently provided by limited access. Now, 30 years later, any teenage hacker, con artist, drug dealer, or wannabe porno king can send you an email, or thousands of emails! We have authentication systems for everything in our lives: You need a key for your car, a key for your house, a password for your bank card, a secretary for your office, but there is no authentication for your email! A solution to this problem is desperately needed if email is to survive, but the mail protocols are 'grandfathered' into the network, and a major overhaul of the Internet mail system is too costly, so it has been left to the private sector to come up with a 'fix'. A number of software companies have produced software for use by individuals and ISPs (Internet Service Providers) that tries to fix the problem by analyzing each spam email, but the email itself is not the problem. What is really needed is an authentication system for email users so that nobody can abuse the trust relationships of the Internet.
Here are some of the most popular methods:
I'm just a humble programmer, so for the purposes of this article I'll discard any discussion of methods 4 and 5 as merely political. I'll break the remaining methods into pros and cons with nice bold headlines for easy consumption.
Pros: User doesn't have to interact, spam simply 'disappears without a trace'. Can be effective when two or more filters are combined. Can be installed at the "front gate" of the mail server, like a security guard, thus stopping spam at the earliest source, the MTA level. Can be set up once for all users on the mail server. There is a large online community of professional programmers who collaborate on the development of the filtering software. If you are responsible for supporting an email filtering system, you can milk your company/customers for years with service contracts (sarcasm).
Cons: False positives! Users can (and do!) lose desired emails, and friends and clients get blocked, generally resulting in a lot of yelling, upset phone calls, and the occasional lawsuit. Reducing the sensitivity setting of the text filters results in fewer false positives but poor results: up to 30% of spam gets through. Expensive: it takes a lot of processing power to run. Spammers can just download heuristic software and analyze it for weaknesses. Even Bayesian filters must be 'trained' every week or two, requiring a substantial amount of man and machine resources, and the only way to train them worth a damn is by using your users' actual email, so there are privacy issues as well. Bayesian spam filters hold a lot of promise and have a very low false positive rate, but they are CPU intensive and burden the mail server. That makes them better suited to the desktop, but most users don't have the technical savvy to operate one.
Pros: What could be simpler for a sysadmin? Just identify the originating IP address in incoming email headers and compare it to your blacklist, then delete the bad guys without the users' knowledge. Sharing of blacklists is a democratic process, whereby IP addresses are hourly 'voted' onto or off of the blacklist. Fairly easy on the processor/network. User doesn't have to be involved.
Cons: The blacklist changes constantly, so there must be a central managed blacklist server (like SpamCop) that answers requests one at a time... who pays them, and what happens when they are sued into receivership, like ORBZ? Distributed blacklists (like SpamNet) suffer from intentional "blacklist pollution" by spammers. If a spammer sends junk email from an otherwise innocent IP address (like my home town dialup provider, for instance), many other ISPs will block all mail from innocent users who just happen to have accounts on the same server. Spammers do report millions of innocent IP addresses to blacklist servers just for the fun of it, or for revenge. Spammers have written custom email engines and networking software that obscure or falsify the originating IP address, thus making it difficult or impossible to determine which server the mail came from.
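For readers who have not seen one, remote blacklists of this kind are commonly queried over DNS: the octets of the sending IP address are reversed, the blacklist's zone is appended, and the resulting name is looked up. An answer in the 127.0.0.0/8 range conventionally means "listed", and no answer means "not listed". The sketch below shows that convention only. The zone name `dnsbl.example.org` is a placeholder, not a real service, and the function names are mine.

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name to look up for an IPv4 address in a blacklist zone."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on a malformed address
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip: str, zone: str) -> bool:
    """Return True if the blacklist zone answers for this address.

    Performs a real DNS lookup. A failed lookup is treated as "not listed",
    so an unreachable blacklist never blocks mail in this sketch.
    """
    try:
        answer = socket.gethostbyname(dnsbl_query_name(ip, zone))
    except socket.gaierror:
        return False
    return answer.startswith("127.")


# Placeholder zone for illustration; substitute the blacklist you actually use.
print(dnsbl_query_name("192.0.2.99", "dnsbl.example.org"))
```

Note that this checks only the address the mail server actually saw. It does nothing about the cons listed above: shared servers get blocked along with the spammer, and a polluted list gives wrong answers just as quickly as a clean one gives right ones.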
Pros: It works, with no false positives. The user has access to ALL of his email in quarantine, and can control who contacts him. Stops all spam mail immediately. It doesn't become obsolete. It relies on human interaction to control machines, instead of the other way around. Very effective 'hands free' operation. Spammers cannot use trickery to get around it. Easy on the server resources. Very little technical maintenance.
Cons: New senders might be offended by having to 'jump through a hoop' to reach their intended receiver. The user has to manage his inbox, requiring a little training. Increased network traffic with verification emails. Badly written auto-reply robots can create a mail-loop. Special care must be taken with mailing lists and newsgroups. It is possible to challenge a challenge email, so C/R servers must cooperate with each other.
[Methods 4 & 5 ignored in this article - see above]
Pros: Spammers can't find you.
Cons: Neither can anyone else. Why not just hide your email address and give it to no-one (except your mother, of course)?
Pros: User is in control of his Inbox. Already implemented in most modern desktop email clients.
Cons: Only works for one or two days at best, until the spammer changes his originating addresses.
A careful reading of the above material shows that the two most popular methods of spam fighting (server filtering and server blocking) are inadequate to the task, because they try to battle the spam on its own turf, the computer. However, computers are not (yet) smart enough to decipher human trickery, and the best automatic filters can be easily defeated by a moderately intelligent human. The spam problem is a distributed human problem, and as such deserves a distributed human solution. Until now, the accepted strategy has been to put the onus on the recipient's computer to prove that an email is spam, costing a lot of processing power and lost mail. I have found that it is much more effective to turn the problem on its head and put the onus on the sender to prove that his email isn't spam. This comes at the cost of more network traffic, and a different social protocol: you must knock at the door before entering.
Specifically, I favor the method of automatically quarantining every unrecognized sender until he or she proves not to be a spam robot. It's called "Challenge / response whitelisting", and in my opinion it is the answer to the rapidly escalating spam plague. The #1 insight driving this decision is that spam is delivered by unattended machinery (over 99% is), and the originating addresses are falsified, so if you quarantine and challenge them, they almost never answer. Further, if they do answer, spam computers are stymied when asked to do something human (like read a word hidden in artwork or identify a photograph of a baby). If a spammer white-lists himself, the user will simply respond by blacklisting him permanently. This takes only a little training. Challenge-response whitelisting is a bit extreme, but 70% of all email is spam (as of March, 2004), with more coming every month, so people are more receptive to changing the way they do email if they can do it without spam. Also, C/R is a very 'light' use of computer resources which will scale well to the enormous demands of increased spam email.
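The bookkeeping behind challenge / response whitelisting is small, which is part of why it is so light on the server. The sketch below is a toy model in Python, not the code behind spameater.com: the class name, method names and the in-memory storage are all hypothetical, and sending the challenge email and presenting the "do something human" test are left out. It shows only the three states a sender can be in (whitelisted, blacklisted, or unknown and quarantined) and how an answered challenge releases the held mail.

```python
import secrets


class ChallengeResponseGate:
    """Toy in-memory model of a challenge / response whitelist."""

    def __init__(self):
        self.whitelist = set()
        self.blacklist = set()
        self.quarantine = {}   # token -> (sender, list of held messages)
        self._pending = {}     # sender -> token of the outstanding challenge

    def receive(self, sender, message):
        """Return (action, token). token is set only when a NEW challenge
        should be mailed to the sender."""
        sender = sender.lower()
        if sender in self.blacklist:
            return ("reject", None)
        if sender in self.whitelist:
            return ("deliver", None)
        new_token = None
        token = self._pending.get(sender)
        if token is None:
            # First mail from an unknown sender: open a quarantine entry.
            token = new_token = secrets.token_hex(8)
            self._pending[sender] = token
            self.quarantine[token] = (sender, [])
        # Later mail from the same sender is held without a second challenge.
        self.quarantine[token][1].append(message)
        return ("quarantine", new_token)

    def answer(self, token):
        """A sender answered the challenge: whitelist him, release his mail."""
        entry = self.quarantine.pop(token, None)
        if entry is None:
            return []          # unknown or already used token
        sender, messages = entry
        del self._pending[sender]
        self.whitelist.add(sender)
        return messages

    def block(self, sender):
        """The user blacklists a sender permanently."""
        sender = sender.lower()
        self.whitelist.discard(sender)
        self.blacklist.add(sender)
```

Issuing only one challenge per unknown sender is a deliberate choice in this sketch: it keeps the extra network traffic down and reduces the chance of a loop with a badly written auto-reply robot. The `block` method is the user's answer to a spammer who white-lists himself.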
And it works! After writing two anti spam programs and ripping my hair out due to their ineffectiveness, I was forced by logic and evidence to write a third anti spam program that uses this method. It is now the only method that I use to filter spam out of my own email, and it is more effective than text-filtering and RBL blocking combined, with no false positives. The downside is that I have to check my quarantine area every few days, but that's a small price to pay for re-owning my email.
I hope you too use my program (spameater.com - spam blocker) to once again make your inbox your own, and thwart those low-life spammers. Email is a public resource, and we have to act intelligently and diligently (and soon) if we want to avoid the unfortunate future of having our government and police controlling it for us. By the way, never buy anything offered in a spam email.
Have a beautiful day on your free, publicly owned Internet,
Chris Fortune
Computer programmer
http://cfortune.kics.bc.ca
P.S. Please help support my work by linking to this website, or purchasing an inexpensive ($1.50!) monthly subscription. Thanks.
Some food for thought:
In the EFF Statement Regarding Anti Spam Measures it is stated that "The focus of efforts to stop spam should include protecting end users and should not only consider stopping spammers at all costs. Specifically, any measure for stopping spam must ensure that all non-spam messages reach their intended recipients. Proposed solutions that do not fulfill these minimal goals are themselves a form of Internet abuse and are a direct assault on the health, growth, openness, and liberty of the Internet".
http://www.eff.org/ Public Interest Position on Junk Email: Protect Innocent Users; EFF Statement Regarding Anti Spam Measures. EFFector, Vol. 14, No. 31; Oct 26, 2001.
1.2. CRI and Consent.
In [CHARTER] the spam problem is approached as one of consent:
"The definition of spam messages is not clear and is not
consistent across different individuals or organizations.
Therefore, we generalize the problem into "consent-based
communication". This means that an individual or organization
should be able to express consent or lack of consent for
certain communication and have the architecture support
those desires."
Challenge / Response Interworking (CRI) Framework
for Challenge / Response Email Systems
A working document of the Anti Spam Research Group (ASRG) of
the Internet Research Task Force (IRTF)
http://www.ietf.org/internet-drafts/draft-irtf-asrg-cri-00.txt