Fighting SpamBots

Thomas R. Dean

Spambots are robots that crawl the web and newsgroups looking for email addresses. These addresses are harvested, compiled into lists and sold to mass marketers. The result is that posting your address on a web page is a quick way to ensure that you get many advertisements for Viagra, home loans, and Nigerian 419 solicitations. So how do you make your email address available on the web while limiting your exposure to spam? There are several solutions, but as with most security approaches, no solution is complete, and all involve tradeoffs between ease of use and the protection provided. It is also essentially an arms race, as each approach used to foil the bots is countered by more sophisticated bots.

Before we discuss approaches to fighting the spambot, we have to consider the larger picture. The first question is whether to bother at all. There are multiple ways in which your email address can end up on these lists, and mass marketers sell addresses to each other with wild abandon. While most online sites that require registration (for purchases, for posting on message boards, etc.) may provide a privacy policy, some of these policies are so full of weasel words that they don't have many teeth. More reputable companies may take appropriate precautions, but there have been several public examples of smaller companies (and not so small companies) accidentally leaving back doors open, allowing spammers to recover private subscriber information. There are other avenues by which your email address may end up in machine readable format, such as posting messages to newsgroups, or including your address in PDF documents that are publicly accessible (you didn't think spambots limited themselves to web pages, did you?). While taking some of the precautions listed in this document may help to reduce the amount of spam, if you have made your email address machine readable in almost any form, you will get spam and lots of it.

One of the simplest approaches is to remove the mailto: tags, turning the addresses into text. Most people using this approach replace the '@' character in each of the addresses with a text representation such as '[at]' or '<at>', and replace the '.' characters in the addresses with '[dot]' or '<dot>'. A similar approach is to modify the email address by inserting extra characters that, to a human, are obviously not part of the address. An example of such an address is joe@NOSPAMfoobar.com. To a human it is clear that the string 'NOSPAM' should be removed. This approach requires some imagination when adding the strings to the address. There is only one real problem with these approaches: they don't work. The strings '[at]' and '<at>' do not normally occur in web pages (other than in email addresses), and are no more difficult to find than the character '@'. Similarly, almost all strings that would be obvious to remove can be compiled into a dictionary for use by the spambots. As a result, these approaches do nothing to stop the harvesting of email addresses while inconveniencing the very people you want to send you messages. It always surprises me that otherwise intelligent individuals think that this is a reasonable way to cloak email addresses. It is like much of the visible anti-terrorist effort: a useless waste of time and money, but it makes the public think you are doing something.
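To see why these substitutions offer so little protection, consider how short a harvester's cleanup step could be. The following is a minimal sketch, not code taken from any real spambot; the function name `deobfuscate` and the list of filler strings are assumptions chosen for illustration.

```javascript
// Hypothetical filler strings that a harvester might keep in its dictionary.
var FILLERS = ["NOSPAM", "REMOVETHIS", "REMOVE"];

// Undo the common textual disguises and return the plain address.
function deobfuscate(text) {
  var result = text
    // '[at]', '<at>' and '(at)', with optional surrounding spaces, become '@'.
    .replace(/\s*[\[<(]\s*at\s*[\]>)]\s*/gi, "@")
    // '[dot]', '<dot>' and '(dot)' become '.'.
    .replace(/\s*[\[<(]\s*dot\s*[\]>)]\s*/gi, ".");
  // Strip every dictionary filler wherever it appears.
  FILLERS.forEach(function (filler) {
    result = result.split(filler).join("");
  });
  return result;
}

console.log(deobfuscate("joe [at] foobar [dot] com")); // joe@foobar.com
console.log(deobfuscate("joe@NOSPAMfoobar.com"));      // joe@foobar.com
```

Two regular expressions and a three-entry dictionary are enough to recover both example forms, which is roughly the same effort as searching for the '@' character itself.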

Text processing by the spambots is further simplified since many addresses occur in tabular form or near the string 'Email' in the body of the web page. The hardware available at home to high school teenagers is often sufficient to process the pages as they are retrieved from the web. A slightly better version is to omit the bracket characters, since 'at' and 'dot' are words that occur naturally in the English language, or to use special characters such as the bullet character (• = &bull;) or degree character (° = &deg;). It also helps to remove any other cues from nearby text in the page, such as 'email' or 'E-mail'. Usually, the page can be formatted in such a way as to make clear that the text is an email address (see my home page for an example).

Another, stronger approach renders the address into an image, so the spambot cannot detect the text (without using OCR). Unfortunately, image processing has become better and the hardware available more powerful. I am not aware of any current spambot that uses image processing, but I am sure it is only a matter of time. If you do use this approach, make sure that you don't give the bot any indication that the image contains an email address. This approach has several disadvantages. The first is that legitimate senders of email must manually type in the address. It also runs counter to accessibility guidelines, and putting the email address in the alt attribute of the image defeats the whole purpose of the exercise.

A third textual approach is to provide a prose description of your email address. For example, a friend of mine uses "my first name followed by domainname". The use of the @ character is inferred. One other person I know of simply uses the phrase "guess, it is obvious". Mail aliases are added to the mail server to redirect the most obvious guesses to the correct address.

There are more active approaches to dealing with spambots. Project Honey Pot, started by Unspam LLC, is a project that attempts to link the spambots with the businesses that use the harvested addresses. Some of these businesses have claimed that all of the addresses on their lists were opt-in. That is, the addresses were given by people who had explicitly requested to be added to the list. Web pages are seeded with generated addresses that forward back to Project Honey Pot. The addresses are dynamically generated by the server, which sends information about both the agent (bot software) and the originating IP address to Project Honey Pot when the pages are accessed. Any unsolicited email that arrives at a generated address can then be linked to the spambot that harvested the address.
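The bookkeeping behind this idea can be sketched in a few lines. This is only an illustration of the general technique and not Project Honey Pot's actual implementation; the function names, the `trap-` prefix, the hash used and the `example.invalid` domain are all assumptions.

```javascript
// Small non-cryptographic hash in the style of 32-bit FNV-1a, as 8 hex digits.
function fnv1a(text) {
  var hash = 0x811c9dc5;
  for (var i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return ("0000000" + hash.toString(16)).slice(-8);
}

// Record of every trap address handed out, keyed by address.
var issued = new Map();

// Generate a unique address for one page request and remember who received it.
function issueTrapAddress(ip, userAgent, timestamp) {
  var tag = fnv1a(ip + "|" + userAgent + "|" + timestamp);
  var address = "trap-" + tag + "@example.invalid"; // hypothetical domain
  issued.set(address, { ip: ip, userAgent: userAgent, timestamp: timestamp });
  return address;
}

// When spam arrives, look up which visitor harvested the recipient address.
function traceSpam(recipient) {
  return issued.has(recipient) ? issued.get(recipient) : null;
}
```

Because each address is handed out exactly once, any message sent to it identifies the request, and therefore the bot, that collected it.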

A second active approach is to use the agent string to dynamically change the URL of the page as it is accessed. Known spambots are then given pages that do not contain valid addresses. However, new spambots with a new agent string (or spoofing a known browser) must be added as they are discovered. A final, more active approach is to attempt to poison the spambot's list with many invalid email addresses, often hidden from normal view, but visible in the HTML. In some cases the fake addresses are generated dynamically with a recursive link in the page, trapping the spambot into harvesting more and more invalid email addresses. The idea is that a spambot may discard all of the addresses from a web site if enough of them are bogus. Even if the addresses are not discarded, at least the spammers end up with a lot of invalid email addresses. However, unless this approach becomes widely adopted, the effect is limited. Some studies have shown that significant portions of commercially available lists contain invalid addresses. With the low cost of sending email, invalid addresses do not provide much of a deterrent, and tools have become available to automatically remove invalid email addresses from lists.
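Both ideas in this paragraph, filtering on the agent string and feeding bogus addresses, can be combined in a small sketch. The agent names in the list below are made up for illustration, the function names are assumptions, and a real server would do this in whatever language generates its pages.

```javascript
// Hypothetical agent-string fragments of known harvesters.
var KNOWN_BOT_AGENTS = ["ExampleHarvester", "DemoSiphon"];

// True when the agent string contains any known harvester fragment.
function isKnownSpambot(userAgent) {
  var agent = String(userAgent || "").toLowerCase();
  return KNOWN_BOT_AGENTS.some(function (bot) {
    return agent.indexOf(bot.toLowerCase()) !== -1;
  });
}

// Build `count` random bogus addresses in a domain that can never resolve.
function decoyAddresses(count) {
  var letters = "abcdefghijklmnopqrstuvwxyz";
  var list = [];
  for (var i = 0; i < count; i++) {
    var local = "";
    for (var j = 0; j < 8; j++) {
      local += letters.charAt(Math.floor(Math.random() * letters.length));
    }
    list.push(local + "@example.invalid");
  }
  return list;
}

// Known bots receive decoys; everyone else receives the real address.
function addressesFor(userAgent, realAddress) {
  return isKnownSpambot(userAgent) ? decoyAddresses(5) : [realAddress];
}
```

The weakness described above is visible in the first line: a bot whose agent string is not in the list, or that copies a browser's string, receives the real address.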

Currently one of the best solutions is to use JavaScript to dynamically write the email address into the page. This has several advantages. The first is that the mailto: tag may be included in the result, so that legitimate senders of email regain the simple click-to-send interface. Meanwhile, the address is much more difficult to recover. There are many variations on this theme, many of which include some form of JavaScript encryption so that the address never appears directly in the page. The idea is that a spambot must interpret the JavaScript in order to find the address. However, for the simpler versions, it is still possible to statically detect the address without interpreting the JavaScript, particularly if cues exist in the surrounding text. At present, these approaches basically follow the burglar principle: you don't have to turn your house into a fortress, you need only make it harder to break into than the neighbours'.
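As one illustration of the "encryption" variations, the address can be stored ROT13-encoded and decoded only when the page is rendered. This is a minimal sketch using a made-up address; the function names are assumptions, and ROT13 is a trivial letter substitution rather than real encryption.

```javascript
// Rotate each ASCII letter 13 places; applying it twice restores the input.
function rot13(text) {
  return text.replace(/[a-zA-Z]/g, function (ch) {
    var base = ch <= "Z" ? 65 : 97; // 'A' for upper case, 'a' for lower case
    return String.fromCharCode((ch.charCodeAt(0) - base + 13) % 26 + base);
  });
}

// Decode the stored address and wrap it in a clickable mailto: link.
function decodedLink(encoded) {
  var address = rot13(encoded);
  return '<a href="mailto:' + address + '">' + address + "</a>";
}

// "wbr@sbbone.pbz" is the ROT13 form of "joe@foobar.com".
// In a page this would be used as:
//   document.write(decodedLink("wbr@sbbone.pbz"));
console.log(decodedLink("wbr@sbbone.pbz"));
```

Only the encoded string appears in the source, so a harvester that merely scans for address-shaped text would collect the encoded form rather than the real address. One that recognises ROT13 can of course decode it just as easily.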

The approach I use is threefold. As a basis I use one of the JavaScript approaches. I did not write the code but adapted it from several of the many other sources on the net. In researching this page, I have found a couple of better approaches that I may upgrade to in the near future. The following JavaScript code is placed in the file 'mssg.js'. The name of the script file does not give away the purpose of the file, but is close enough to 'messages' to retain some human readability.

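// Keep the address in two parts so it never appears whole in the source.
var name = "tom.dean";
var domain = "queensu.ca";
// Write the opening mailto: anchor, then the visible address and closing tag.
document.write('<a href="mailto:' + name + '@' + domain + '">');
document.write(name + '@' + domain + '</a>');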

This JavaScript constructs the email address and a mailto: link from component parts and writes them into the document. I then use a script tag to invoke the code from within any page that will contain my email address. This has the advantage that the key strings (e.g. mailto:) do not occur directly in the web page. The weakest part of the approach is the noscript alternative, which falls back on the descriptive text approach. It is only displayed to those with JavaScript turned off, but it is still visible to the spambot, which is a vulnerability. However, if I wish to be available to those without JavaScript, it is currently about the best that can be done. It does have the advantage that those with JavaScript enabled get a clickable link.

Thus adding the following two lines anywhere I need my address suffices:

<script type="text/javascript" language="JavaScript" src="URL of mssg.js"></script>
<noscript><B>Email:</B> My three initials followed by queensu &bull; ca</noscript>

As an extra defence, I have a small web page that contains an infinite loop in JavaScript. This page is linked to with one or more hidden links throughout my web site. If any spambot author does implement a JavaScript engine, traversing all of the links in the page will lead to the trap sending the spambot into an infinite loop. Since an innocent site visitor might actually stumble across the page, it contains an explanation and a confirmation dialog is invoked on each pass of the loop. The default action is to continue the loop, but a human readable message explains how to cancel the loop. Normal archival bots such as the GoogleBot only index the content, and thus are not affected by the infinite loop.
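The core of such a trap page might look like the following. This is a sketch of the idea and not the code on my actual trap page; the function name `runTrap`, the message wording and the decision to pass the dialog in as a parameter are assumptions made so the loop can be exercised outside a browser.

```javascript
// Keep asking until the visitor cancels; returns the number of passes made.
// `ask` is any function that shows a message and returns true to continue.
function runTrap(ask) {
  var passes = 0;
  var message = "This page is a trap for address-harvesting robots. " +
                "Press Cancel to leave the loop.";
  // A script engine that answers true forever never leaves this loop.
  while (ask(message)) {
    passes++;
  }
  return passes;
}

// In a browser the dialog would be the built-in confirm box, where OK
// (the default button) continues the loop:
//   runTrap(function (m) { return window.confirm(m); });
```

A human reads the message and presses Cancel on the first pass, while an automated engine that accepts the default answer stays in the loop.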

However, as any one approach to defeating spambots becomes commonly used, the spambots will adapt to it. So adopting any one of these approaches without modification will be self-defeating. For example, a spambot does not have to execute JavaScript in order to recognize the most common methods of cloaking email addresses. It can pattern match the code directly.
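To make the point concrete, here is a sketch of how a harvester could recover an address from a script written in the name-and-domain style, without executing it. The function name `harvestFromScript` is an assumption, and the pattern only covers scripts that use exactly these two variable names.

```javascript
// Pull the two string literals out of script source of the form
//   var name = "..."; var domain = "...";
// and join them, without running the script.
function harvestFromScript(source) {
  var name = /var\s+name\s*=\s*["']([^"']+)["']/.exec(source);
  var domain = /var\s+domain\s*=\s*["']([^"']+)["']/.exec(source);
  return name && domain ? name[1] + "@" + domain[1] : null;
}

var script = 'var name = "joe";\nvar domain = "foobar.com";';
console.log(harvestFromScript(script)); // joe@foobar.com
```

Renaming the variables or splitting the strings further defeats this particular pattern, which is exactly why a widely copied script needs modification before use.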

Some Resources:

  1. http://www.neilgunton.com/spambot_trap/
  2. http://en.wikipedia.org/wiki/Spambot
  3. http://www.webtechniques.com/archives/2001/08/champeon/
  4. http://blog.as2max.com/archives/2004/12/fighting_spam_t.php
  5. http://spamlinks.net/track-trace-honeypot.htm
  6. http://javascript.internet.com/page-details/spambot-countermeasure.html
  7. http://www.webmasterworld.com/forum91/4948.htm
  8. http://www.jracademy.com/~jtucek/email/index.php
  9. http://scott.yang.id.au/2003/06/obfuscate-email-address-with-javascript-rot13/

Note: this is the third version of this article.

Copyright © Thomas R. Dean, 2005-2008