Webmasters who want to tell search engines what they can or can’t download simply place a file called robots.txt at the root of their domain with instructions for the search engine (explanation here).
# robots.txt for http://www.whitehouse.gov/
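As a rough sketch of how these rules work: a robots.txt file is just plain-text `User-agent` and `Disallow` lines, and Python’s standard-library `urllib.robotparser` can evaluate them. The rules and `example.com` URLs below are hypothetical, not taken from the White House file:

```python
from urllib.robotparser import RobotFileParser

# Parse a tiny hypothetical robots.txt that blocks one directory
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# Anything under /private/ is off-limits to crawlers...
print(rp.can_fetch("*", "http://example.com/private/page.html"))  # False

# ...while everything else remains crawlable
print(rp.can_fetch("*", "http://example.com/public/page.html"))   # True
```

A real crawler would fetch the live file with `rp.set_url(...)` and `rp.read()` instead of passing lines directly, but the evaluation logic is the same.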
Update: A comment from David helps explain it: “This has been mentioned a couple of times previously on other sites. They’ve been burned by having search-accessible documents (with metadata) that can’t be found on the White House Web site. Thus the extensive block list. Now *only* those things that are actually on the web site are accessible (and only through the web site).”