Spam filtering techniques
Summary and Resources
Given the testing methodology described earlier, let's look at the concrete results. I do not present any quantitative data on speed, but the table is arranged in order of speed, from fastest to slowest: trigrams are fast; Pyzor, which requires a network lookup, is slow. In evaluating techniques, as I stated, I consider false positives (good messages identified as spam) very bad, and false negatives (spam identified as good) only slightly bad. Each cell gives the number of correctly identified messages vs. the number of incorrectly identified messages for a technique, tested against each body of e-mail, good and spam.
Table 1. Quantitative accuracy of spam filtering techniques
Technique | Good corpus (correctly identified vs. incorrectly identified) | Spam corpus (correctly identified vs. incorrectly identified) |
"The Truth" | 1851 vs. 0 | 1916 vs. 0 |
Trigram model | 1849 vs. 2 | 1774 vs. 142 |
Word model | 1847 vs. 4 | 1819 vs. 97 |
SpamAssassin | 1846 vs. 5 | 1558 vs. 358 |
Pyzor | 1847 vs. 0 (4 err) | 943 vs. 971 (2 err) |
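To compare the rows more directly, the counts in Table 1 can be turned into false positive and false negative rates. The following is only a minimal sketch: the names `TABLE_1` and `error_rates` are made up for illustration, and it assumes that the entries Table 1 marks as "err" for Pyzor should simply be left out of the totals.

```python
# Counts copied from Table 1: (good_ok, good_wrong, spam_ok, spam_wrong).
# Assumption: the entries Table 1 marks as "err" for Pyzor (4 good, 2 spam)
# are excluded from the totals rather than counted as right or wrong.
TABLE_1 = {
    "Trigram model": (1849, 2, 1774, 142),
    "Word model": (1847, 4, 1819, 97),
    "SpamAssassin": (1846, 5, 1558, 358),
    "Pyzor": (1847, 0, 943, 971),
}


def error_rates(good_ok, good_wrong, spam_ok, spam_wrong):
    """Return (false_positive_rate, false_negative_rate) as fractions."""
    # False positive: a good message incorrectly identified as spam.
    fp_rate = good_wrong / (good_ok + good_wrong)
    # False negative: a spam message incorrectly identified as good.
    fn_rate = spam_wrong / (spam_ok + spam_wrong)
    return fp_rate, fn_rate


rates = {name: error_rates(*counts) for name, counts in TABLE_1.items()}

for name, (fp, fn) in rates.items():
    print(f"{name:14s} false positives {fp:6.2%}  false negatives {fn:6.2%}")
```

Keeping the two rates separate, rather than folding them into a single accuracy figure, matches the weighting above: a technique with a higher false negative rate can still be preferable if its false positive rate is lower.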
Resources
- The TMDA home page provides more information about the Tagged Message Delivery Agent.
- You can get more information about ChoiceMail from DigitalPortal Software.
- Pyzor is a Python-based distributed spam catalog/filter.
- Vipul's Razor is a very popular distributed spam catalog/filter. Razor is optionally called by a number of other filter tools, such as SpamAssassin.
- Read Paul Graham's essay "A Plan for Spam."
- Eric Raymond has created a fast implementation of Paul Graham's idea under the name "bogofilter." In addition to using some efficient data representation and storage strategies, bogofilter tries to be smart about identifying what makes a meaningful word.
- My own trigram-based categorization tools are still at an early alpha or prototype level. However, you are welcome to use them as a basis for development. They are public domain, like all the tools I write for developerWorks articles.
- Lawrence Lessig has written a number of books and articles that insightfully contrast what he metonymically calls "east-coast code" and "west-coast code," in other words, the laws passed in Washington D.C. (and elsewhere) versus the software written in Silicon Valley (and elsewhere). I've written a short review of Lessig's Code and Other Laws of Cyberspace. See Lessig's Web site for more to think about.
- Find more Linux articles in the developerWorks Linux zone.
First published by IBM developerWorks