Webmasters who want to tell search engines what they can or can’t download simply place a file called robots.txt at the root of their domain with instructions for search engine crawlers (explanation here).

Something interesting was reported today by Richard Smith on funsec.

Why is the White House using such a large robots.txt file to disallow so much?  You can see it here.

# robots.txt for http://www.whitehouse.gov/

User-agent: *
Disallow: /cgi-bin
Disallow: /search
Disallow: /query.html
Disallow: /help
Disallow: /360pics/iraq
Disallow: /360pics/text
Disallow: /911/911day/iraq
Disallow: /911/911day/text
Disallow: /911/heroes/iraq
Disallow: /911/heroes/text
Disallow: /911/iraq
Disallow: /911/messages/text
Disallow: /911/patriotism/iraq
etc….
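For anyone curious how a well-behaved crawler interprets rules like these: each `Disallow` line blocks any URL whose path starts with that prefix. Here's a minimal sketch using Python's standard `urllib.robotparser` against a small excerpt of the rules above (the exact URLs tested are just illustrative):

```python
from urllib import robotparser

# A small excerpt of the White House robots.txt rules
rules = """\
User-agent: *
Disallow: /cgi-bin
Disallow: /search
Disallow: /911/iraq
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# Any path beginning with a disallowed prefix is off-limits to all crawlers
print(rp.can_fetch("*", "http://www.whitehouse.gov/911/iraq/page.html"))  # False
# Paths not matching any Disallow prefix remain crawlable
print(rp.can_fetch("*", "http://www.whitehouse.gov/president/"))  # True
```

Note that robots.txt is purely advisory: it only keeps out crawlers that choose to honor it, which is why it's a poor tool for hiding content.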

Odd.

Alex Eckelberry 

Update: A comment from David helps to explain it: “This has been mentioned a couple of times previously on other sites. They’ve been burned by having search-accessible documents (with metadata) that can’t be found on the White House Web site. Thus the extensive block list. Now *only* those things that are actually on the web site are accessible (and only through the web site).”