To solve these problems, you first need to detect them before you can respond. Note that spiders cannot read semantically; they only find links and follow them. If you present a link that only a spider would follow, then with luck the spider will follow it and your users won't. In this way, the trap is machine-oriented fly-paper.
There are various ways of presenting the link so that spiders will follow it but users won't click on it.
<!-- <a href="http://...">spider trap</a> -->
This won't display to an end user, but dumb spiders that simply search the markup for href= will still pick it up.
<a href="http://..."></a>
Here the link text is empty, so there is nothing visible for a user to click.
<a href="http://...">&nbsp;</a>
This makes the link text only a non-breaking space.
<a style="color:white" href="http://..."> </a>
This styles the link text white, so it is effectively invisible on a typical white page.
It should be noted that Google hates people who hide links deviously like this on their sites. The simplest and best solution may be - a la HidingInPlainSight? and OpenProcess - to write the link openly as <a href="http://...">Spider trap</a>, but then give curious cats a way out of the mess they have entered. On the next screen (or all the screens thereafter), offer opportunities for HumanVerification (e.g. a CAPTCHA).
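As a rough sketch of that approach (the challenge question, function names, and in-memory set here are purely illustrative, not code from any particular wiki engine), the trap page could flag the visitor but offer the verification step before any ban takes effect:

suspects = set()          # clients that followed the trap link

CHALLENGE = "What colour is an orange?"   # stand-in for a real CAPTCHA
ANSWER = "orange"

def spider_trap_page(client_ip):
    # Anyone who reaches this page followed the visible trap link.
    suspects.add(client_ip)
    return "You followed the spider trap. " + CHALLENGE

def verify_human(client_ip, answer):
    # A correct answer lets a curious human back out of the mess.
    if answer.strip().lower() == ANSWER:
        suspects.discard(client_ip)
        return "Thanks - you are no longer flagged."
    return "Wrong answer. " + CHALLENGE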
Alternatively, there are SpiderTraps designed for spam-bots, such as a SpamPoisoner, which need not be hidden from users at all.
You can make a SpiderTrap link a SelfBan as well, if you want to block all spiders. This is somewhat dangerous, as unwitting users will routinely click on the URL, and it exposes the URL to would-be attackers. If you do this, you may have to take some extra precautions. You could generate a nonce every time you display the SpiderTrap-SelfBan link, and expire it quickly. Another option is to allow people to unban themselves with HumanVerification.
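A minimal sketch of the nonce idea, assuming an in-memory store and an illustrative /spider-trap/ path (none of these names come from a real system):

import secrets, time

NONCE_TTL = 60            # seconds before a trap nonce expires
_nonces = {}              # nonce -> time it was issued
banned = set()            # clients that tripped a fresh trap link

def trap_link():
    # Call this while rendering each page; every view gets its own nonce.
    nonce = secrets.token_urlsafe(16)
    _nonces[nonce] = time.time()
    return "/spider-trap/" + nonce

def trip_trap(client_ip, nonce):
    # Only a known, unexpired nonce results in a ban.
    issued = _nonces.pop(nonce, None)
    if issued is not None and time.time() - issued < NONCE_TTL:
        banned.add(client_ip)

def unban(client_ip, passed_human_verification):
    # HumanVerification (e.g. a solved CAPTCHA) lifts the ban.
    if passed_human_verification:
        banned.discard(client_ip)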
You can combine the SpiderTrap with a SurgeProtector: if the spider trips the trap more than X times a minute, or in more than X percent of its total hits, then you block it. This can help detect misbehaving spiders (like wget) without banning GoogleBot? or Yahoo! Slurp. It also helps with the problem of users banning themselves; if someone is dumb enough to keep loading the SpiderTrap, perhaps they deserve to be banned.
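Here is one way the combination might look, with placeholder values standing in for X and the time window (a sketch, not a drop-in SurgeProtector):

import time
from collections import defaultdict, deque

MAX_TRIPS = 3                    # trap hits tolerated per window (your X)
WINDOW = 60.0                    # seconds

trap_hits = defaultdict(deque)   # client -> timestamps of recent trap hits
banned = set()

def trip_trap(client_ip):
    now = time.time()
    hits = trap_hits[client_ip]
    hits.append(now)
    # Discard hits that have fallen outside the window.
    while hits and now - hits[0] > WINDOW:
        hits.popleft()
    # A single accidental human click stays under the threshold.
    if len(hits) > MAX_TRIPS:
        banned.add(client_ip)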
Warning: SearchEngineCloak detection algorithms are secret, but they essentially use a second spider to check whether the pages it sees are similar to those served to the main crawler. If you SpiderTrap one and not the other, your site may be banned. The solution is to list the SpiderTrap link in RobotsDotTxt? - after all, you are looking for bots that do not adhere to the RobotsExclusionStandard.
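For example, if the trap lives under an (illustrative) /spider-trap/ path, the RobotsDotTxt? entry is simply:

User-agent: *
Disallow: /spider-trap/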