Screen-scraping spambots are robots that crawl the web looking for email addresses. These addresses are harvested, compiled into lists, and sold to mass marketers. The result is that posting your address on a web page is a quick way to ensure that you get lots of advertisements for Viagra, home loans, and Nigerian 419 solicitations. So how do you make your email address available on the web while limiting your exposure to spam? There are several solutions, but as with most security approaches, no solution is complete, and all trade ease of use against the protection provided. It is also essentially an arms race, as each approach used to foil the bots is countered by more sophisticated bots.
One of the simplest approaches is to remove the mailto: tags, turning the addresses into plain text. In addition, the '@' character in each address is replaced with a text representation such as '[at]' or '<at>'. More ambitious changes include replacing the '.' characters in the addresses with '[dot]' or '<dot>'. One downside of this approach is that senders must edit the address before it can be used. The other is that it stops only the simplest of spambots. It may have been an effective technique five or more years ago, when few people used it, but the practice has become widespread and spambots have become more sophisticated. The strings '[at]' and '<at>' do not normally occur in web pages (other than in email addresses), and are no more difficult to find than the character '@'. As a result, this approach does little to stop the harvesting of email addresses, while inconveniencing the very people you want to send you messages.
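To make the point concrete, here is a minimal sketch of the kind of text processing a harvester could apply. The function names and the deliberately loose address pattern are my own assumptions for illustration, not code taken from any real spambot.

```javascript
// Hypothetical harvester-side step: undo bracketed substitutions
// such as "[at]", "<at>", "(at)", "[dot]" and "<dot>".
function deobfuscate(text) {
  return text
    .replace(/\s*[\[<(]\s*at\s*[\]>)]\s*/gi, '@')
    .replace(/\s*[\[<(]\s*dot\s*[\]>)]\s*/gi, '.');
}

// Deliberately loose address pattern, good enough for illustration only.
const ADDRESS = /[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+/g;

// Return every address-like string found after normalization.
function harvest(text) {
  return deobfuscate(text).match(ADDRESS) || [];
}
```

Two regular expressions are enough to recover the address, which is why the bracketed form offers so little protection.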
A second approach is to modify the email address by inserting extra characters that, to a human, are obviously not part of the address. An example of such an address is name@NOSPAMdomain.top. To a human it is clear that the string 'NOSPAM' should be removed. This approach requires some imagination when adding the strings to the address. The most common ones (such as NOSPAM) are easily detected and removed. This approach also has the disadvantage that the address must be edited by the sender.
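As a rough sketch of why the common filler strings are weak, the following shows how little code is needed to strip them. The list of fillers and the function name are assumptions chosen for illustration.

```javascript
// Filler tokens a harvester might try; this list is an illustrative assumption.
// Longer tokens come first so that 'REMOVETHIS' is not left as 'THIS'.
const FILLERS = ['NOSPAM', 'REMOVETHIS', 'REMOVE'];

// Remove every occurrence of each known filler from the address.
function stripFillers(address) {
  return FILLERS.reduce(
    (result, filler) => result.split(filler).join(''),
    address
  );
}
```

A less predictable filler is not in the list and survives, which is where the imagination mentioned above comes in.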
Both of these approaches have the basic disadvantage that they are easily defeated by text processing. The text processing is further simplified because addresses often occur in tabular form or near the string '[Ee]mail' in the body of the web page. The hardware available at home to high school teenagers is often sufficient to process the pages as they are retrieved from the web. A much better variant is to omit the bracket characters, since 'at' and 'dot' are words that occur naturally in English, or to use special characters such as the bullet character (&bull; = •) or the degree character (&deg; = °). It also helps to remove any other cues from nearby text, such as 'email' or 'E-mail'. Usually, the formatting of the page can be used to make clear that the text is an email address. Another, stronger approach renders the address as an image, so the spambot cannot detect the text (without spending significant horsepower on OCR). However, it has the disadvantage that legitimate senders must type in the address manually. With the text-based approaches, the address can at least be copied and pasted into the email client and then edited by the user.
A third textual approach is to provide a prose description of your email address. For example, a friend of mine uses "my first name followed by domainname". The use of the @ character is inferred. One other address I know of simply uses the phrase "guess, it is obvious". Mail aliases are added to the mail server to make all of the most obvious guesses work.
There are more active approaches to dealing with spambots. Project Honey Pot, started by Unspam LLC, is a project that attempts to link the spambots with the businesses that use the harvested addresses. Some of these businesses have claimed that all of the addresses on their lists were opt-in, that is, that the addresses were given by people who had explicitly requested to be added to the list. Web pages are seeded with generated addresses that forward back to Project Honey Pot. The addresses are generated dynamically, and information about both the agent (bot software) and the originating IP address is sent to Project Honey Pot when the pages are accessed. Any unsolicited email that arrives at a generated address can then be linked to the spambot that harvested it.
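The bookkeeping behind this idea can be sketched in a few lines. This is not Project Honey Pot's actual implementation; the domain, the function names, and the in-memory table are all assumptions made for illustration, and a real system would use unpredictable tokens and persistent storage.

```javascript
// In-memory record of which visitor was handed which trap address.
const issued = new Map();
let counter = 0;

// Placeholder domain, not a real honeypot domain.
const TRAP_DOMAIN = 'trap.example.org';

// Generate a unique address for this page view and remember who received it.
function issueTrapAddress(ip, userAgent, now = new Date()) {
  counter += 1;
  const local = 'hp' + counter.toString(36).padStart(6, '0');
  const address = local + '@' + TRAP_DOMAIN;
  issued.set(address, { ip, userAgent, issuedAt: now.toISOString() });
  return address;
}

// When spam arrives, look up the visit that harvested the recipient address.
function traceSpam(recipient) {
  return issued.get(recipient.toLowerCase()) || null;
}
```

Because each address is handed out exactly once, any mail sent to it identifies the page view, and therefore the agent and IP address, that collected it.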
A second active approach is to use the agent string to dynamically change the URL of the page as it is accessed. Known spambots are then given pages that do not contain valid addresses. However, new spambot applications with a new agent string (or ones spoofing a known browser string) must be added as they are discovered. A final, more active approach is to attempt to poison the spambot's list with many invalid email addresses, often hidden from normal view but visible in the HTML. In some cases the fake addresses are generated dynamically, with a recursive link in the page trapping the spambot into harvesting more and more invalid email addresses. The idea is that a spambot may discard all of the addresses from a web site if most of them are bogus. Even if the addresses are not discarded, at least the spammers end up with a lot of invalid email addresses. However, unless this approach becomes widely adopted, the effect is limited. Some studies have shown that significant portions of commercially available lists contain invalid addresses. With the low cost of sending email, invalid addresses do not provide much of a deterrent, and tools have become available to automatically remove invalid email addresses from lists.
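Both ideas, filtering on the agent string and feeding bogus addresses, can be combined in a short sketch. The agent patterns below are only examples of the kind of names that appear on block lists, and the function names and decoy format are assumptions made for illustration.

```javascript
// Illustrative patterns only; real harvesters change or spoof these strings.
const KNOWN_HARVESTERS = [/EmailSiphon/i, /EmailCollector/i, /ExtractorPro/i];

function isKnownHarvester(userAgent) {
  return KNOWN_HARVESTERS.some((pattern) => pattern.test(userAgent || ''));
}

// Build n syntactically valid but undeliverable addresses.
// The .invalid top-level domain is reserved by RFC 2606, so no real
// mailbox can be hit. The seed must be a positive integer.
function decoyAddresses(n, seed = 1) {
  const out = [];
  let state = seed;
  for (let i = 0; i < n; i += 1) {
    // Small deterministic generator; any source of variety would do.
    state = (state * 48271) % 2147483647;
    out.push('user' + state.toString(36) + '@mail' + i + '.invalid');
  }
  return out;
}

// Known harvesters get decoys, everyone else gets the real address.
function addressesFor(userAgent, realAddress) {
  return isKnownHarvester(userAgent) ? decoyAddresses(5) : [realAddress];
}
```

As noted above, a bot that spoofs a browser's agent string passes straight through this check, so the filter is only as good as its list.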
Currently one of the best solutions is to use JavaScript to dynamically write the email address into the page. This has several advantages. The first is that the mailto: tag may be included in the result, so that legitimate senders regain the simple click-to-send interface. Meanwhile, the address is much more difficult to recover. There are many variations on this theme, many of which include some form of JavaScript encryption so that the address never appears directly in the page. The idea is that a spambot must interpret the JavaScript in order to find the address. However, for the simpler versions it is still possible to detect the address statically without interpreting the JavaScript, particularly if cues exist in the surrounding text. For now, these approaches basically follow the burglar principle: you don't have to turn your house into a fortress, you need only make it harder to break into than the neighbors'.
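One of the simplest "encryption" variants can be sketched as follows. It is an assumed example, not any particular published script: each character is stored as a shifted character code, so neither the '@' nor the address appears literally in the page source. It is obfuscation rather than real cryptography.

```javascript
// Done once, offline: turn the address into a list of shifted character codes.
function encode(address, shift = 3) {
  return Array.from(address, (ch) => ch.charCodeAt(0) + shift).join(',');
}

// Done in the page: reverse the shift to recover the address.
function decode(encoded, shift = 3) {
  return encoded
    .split(',')
    .map((code) => String.fromCharCode(Number(code) - shift))
    .join('');
}

// Build the clickable link markup from the encoded form.
function mailtoLink(encoded) {
  const address = decode(encoded);
  return '<a href="mailto:' + address + '">' + address + '</a>';
}
```

In a page, only the encoded string and the decoder would be shipped, and the result of mailtoLink would be written into the document. A bot that does not run the script sees only a list of numbers.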
The approach I use is threefold. As a basis I use one of the JavaScript approaches. I did not write the code but adapted it from several of the many other sources on the net. In researching this page, I have found a couple of better approaches that I may upgrade to in the near future. The following JavaScript code is placed in the file 'mssg.js'. The name of the script file does not give away the purpose of the file, but is close enough to 'messages' to retain some human readability.
var name = "tom.dean";
var domain = "queensu.ca";
document.write('<a href=\"mailto:' + name + '@' + domain + '\">');
document.write(name + '@' + domain + '</a>');
This JavaScript constructs the email address and a mailto: link from component parts and writes them into the document. I then use a script tag to invoke the code from within any page that will contain my email address. This has the advantage that the key strings (e.g. mailto:) do not occur directly in the web page. The weakest part of the approach is the noscript alternative, which falls back on the descriptive text approach. It is displayed only to those with JavaScript turned off, but it is still visible to the spambot, which is a vulnerability. However, if I wish to be available to those without JavaScript, it is currently about the best that can be done. It does have the advantage that those with JavaScript get a clickable link.
Thus adding the following two lines anywhere I need my address suffices.
<script type="text/javascript" language="JavaScript" src="URL of mssg.js"></script>
<noscript><B>Email:</B> My three initials followed by post • queensu • ca</noscript>
As an extra defense, I have a small web page that contains an infinite loop in JavaScript. This page is linked to by one or more hidden links throughout my web site. If any spambot author does implement a JavaScript engine, traversing all of the links in the page will lead to the trap, sending the spambot into an infinite loop. Since an innocent site visitor might actually stumble across the page, it contains an explanation, and a confirmation dialog is invoked on each pass of the loop. The default action is to continue the loop, but a human-readable message explains how to cancel it. Normal archival bots such as the GoogleBot only index the content, and thus are not affected by the infinite loop.
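The loop on such a trap page might look roughly like the sketch below. It is a reconstruction from the description above, not the actual page; the function name, the message text, and the injectable confirmFn parameter (there so the loop can be exercised outside a browser) are assumptions.

```javascript
// confirmFn defaults to the browser's confirm(); it can be replaced for testing.
function runTrap(confirmFn) {
  const ask = confirmFn || ((msg) => window.confirm(msg));
  const message =
    'This page is a trap for address-harvesting robots.\n' +
    'Press Cancel to leave the loop.';
  let passes = 0;
  // OK (true) keeps looping; Cancel (false) lets a human visitor out.
  while (ask(message)) {
    passes += 1;
  }
  return passes;
}
```

A script engine that answers every dialog with the default never leaves the while loop, whereas a person can read the message and press Cancel.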