Spam filtering techniques

By David Mertz, Ph.D. - 2004-04-06

1. Basic structured text filters

The e-mail client I use can sort incoming e-mail based on simple strings found in specific header fields, in the header in general, and/or in the body. This filtering is very simple and does not even include regular expression matching. Almost all e-mail clients have this much filtering capability.

Over the last few months, I have developed a fairly small number of text filters. These few simple filters correctly catch about 80% of the spam I receive. Unfortunately, they also have a relatively high false positive rate -- enough that I need to manually examine some of the spam folders from time to time. (I sort probable spam into several different folders, and I save them all to develop message corpora.) Although exact details will differ among users, a general pattern will be useful to most readers:

Set 1: A few people or mailing lists do funny things with their headers that get them flagged on other rules. I catch something in the header (usually the From:) and whitelist it (either to INBOX or somewhere else).
Set 2: In no particular order, I run the following spam filters:
- Identify a specific bad sender.
- Look for "<>" as the From: header.
- Look for "@<" in the header (lots of spam has this for some reason).
- Look for "Content-Type: audio". Nothing I want has this, only viruses (your mileage may vary).
- Look for "euc-kr" and "ks_c_5601-1987" in the headers. I can't read that language, but for some reason I get a huge volume of Korean spam (of course, for an actual Korean reader, this isn't a good rule).
Set 3: Store messages to known legitimate addresses. I have several such rules, but they all just match a literal To: field.
Set 4: Look for messages that have a legit address in the header, but that weren't caught by the previous To: filters. I find that when I am only in the Bcc: field, it's almost always an unsolicited mailing to a list of alphabetically sequential addresses (mertz1@..., mertz37@..., etc.).
Set 5: Anything left at this point is probably spam (it probably has forged headers to avoid identification of the sender).
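
The ordering of the five sets matters more than any single rule, so it may help to see the sequence written out. The following is only a minimal Python sketch of that ordering, not the configuration of any particular e-mail client. The addresses, the folder names ("inbox", "spam-rules", "spam-bcc", "spam-other"), and the helper functions are all hypothetical placeholders, and sending Set 4 matches to their own spam folder is an assumption based on the description above. Matching is by plain substring only, with no regular expressions.

```python
from email import message_from_string

# All addresses and folder names here are hypothetical placeholders.
WHITELIST_FROM = ("announce@lists.example.org",)
BAD_SENDERS = ("offers@spam.example.com",)
MY_ADDRESSES = ("me@example.net",)
# Strings from Set 2 that are searched for anywhere in the header.
HEADER_MARKERS = ("@<", "content-type: audio", "euc-kr", "ks_c_5601-1987")


def header_text(raw):
    """Return the raw header block (text before the first blank line), lowercased."""
    return raw.replace("\r\n", "\n").split("\n\n", 1)[0].lower()


def contains_any(text, needles):
    """Plain substring test against each needle."""
    return any(needle in text for needle in needles)


def classify(raw):
    """Return a folder name for one raw message, applying Sets 1-5 in order."""
    msg = message_from_string(raw)
    sender = str(msg.get("From", "")).strip().lower()
    to = str(msg.get("To", "")).lower()
    header = header_text(raw)

    # Set 1: whitelist senders whose headers would trip later rules.
    if contains_any(sender, WHITELIST_FROM):
        return "inbox"
    # Set 2: spam rules, in no particular order.
    if contains_any(sender, BAD_SENDERS) or sender == "<>":
        return "spam-rules"
    if contains_any(header, HEADER_MARKERS):
        return "spam-rules"
    # Set 3: literal match on a known legitimate To: address.
    if contains_any(to, MY_ADDRESSES):
        return "inbox"
    # Set 4: a legitimate address appears in the header, but not in To:.
    if contains_any(header, MY_ADDRESSES):
        return "spam-bcc"
    # Set 5: whatever remains is treated as probable spam.
    return "spam-other"
```

Because Set 1 runs first, a whitelisted sender reaches the inbox even when a Set 2 rule would otherwise have matched; keeping separate spam folders mirrors the practice of reviewing them by hand for false positives.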

First published by IBM developerWorks