RobotsExclusionStandard


[The Robots Exclusion Standard (RES)] was created in 1994 to deal with the growth of WebCrawlers. Because crawlers are fairly bandwidth-intensive and CPU-intensive, they can effectively mount a DenialOfService attack on the sites they crawl, whether they intend to or not.

Moreover, with the advent of the CommonGatewayInterface (CGI), many portions of a website may not be appropriate for a spider, either because the CGI script takes a lot of CPU time (certainly the case for wikis) or because the script tracks user information, e.g. the ShoppingCart sections of amazon.com.

There are many other cases where it is important to block robots and indexers from your site.

The RES provides two mechanisms to exclude robots: robots.txt and the Robots meta tag. These are both detailed in the above link.
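For the second mechanism, the Robots meta tag, here is a minimal Python sketch of how a polite crawler might read the tag from a fetched page. The class name `RobotsMetaParser`, the helper `robots_directives`, and the sample page are all made up for illustration; the standard itself does not prescribe any implementation.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        # Directives are comma-separated and treated case-insensitively here.
        for directive in content.split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def robots_directives(html):
    """Return the set of robots meta directives found in an HTML string."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.directives


# Hypothetical page that asks robots not to index it or follow its links.
PAGE = (
    '<html><head><meta name="ROBOTS" content="NOINDEX, NOFOLLOW"></head>'
    "<body>shopping cart</body></html>"
)

if __name__ == "__main__":
    found = robots_directives(PAGE)
    print("index allowed:", "noindex" not in found)
    print("follow allowed:", "nofollow" not in found)
```

Unlike robots.txt, the tag is set per page, so it can be used by authors who have no control over the site's root.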

Robots.txt

robots.txt is the file at your website's "slash" (WhatIsaSlash), i.e. its root, e.g. http://www.example.com/robots.txt, as per the RobotsExclusionStandard.
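As a rough sketch of how a well-behaved robot consults this file, Python's standard `urllib.robotparser` module can parse the rules and answer "may I fetch this URL?". The rules, the paths, and the robot names "ExampleBot" and "BadBot" below are invented for illustration; a real robot would fetch the file from the site rather than use an inline string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: keep every robot out of the CGI scripts
# and the shopping cart, and keep one misbehaving robot out entirely.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /cart/

User-agent: BadBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(agent, url):
    """Return True if the parsed rules allow `agent` to fetch `url`."""
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("ExampleBot", "http://www.example.com/index.html"),
        ("ExampleBot", "http://www.example.com/cgi-bin/wiki.pl"),
        ("BadBot", "http://www.example.com/index.html"),
    ]:
        print(agent, url, may_fetch(agent, url))
```

Note that nothing enforces these rules: the file is a request, and only cooperative robots honour it.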

One problem with robots.txt, if used injudiciously, is that you end up telling everyone where all the good stuff is. Not good. Of course, enough people have been burned that hackers know to look there and webmasters mostly know to handle it tenderly. For example, don't put "disallow:/hidden_creditcard_numbers/"; instead, put the credit card numbers in /Data/something/creditcard_numbers/ and give "something" few permissions, so that it reveals no information and lists no subdirectories. Do the same with Data, and preferably give "something" a password-like name (non-obvious, not a real word).

: This is of course a form of SecurityThroughObscurity, which is considered to be no security at all. If you want to secure hidden_creditcard_numbers, you don't hide them, you lock them up so individuals can't access them at all. -- anon

Passwords are also SecurityThroughObscurity by this logic, yet oddly they're still widely used; I don't think your sentiment, "[this] is considered to be no security at all", is as authoritative as it sounds. Nevertheless, I agree that credit card numbers are best protected by hard security. In this case, that doesn't mean faffing around with access permissions: it means taking them out of public_html. Trivial. -- ChrisPurcell

CategoryWebTechnology
