Writing robots.txt files 101
It’s great when search engine crawlers first start creeping all over your
site. However, sometimes they index content that you don’t want them to.
For instance, they may be crawling admin/index.asp or other such files,
and believe me, people are searching for URLs such as that.
Robots.txt to the rescue
The solution is robots.txt, a small text file in the root of your site that
most search crawlers check to see whether they may crawl and index content.
It lets you define rules per folder and per crawler.
There are three parts to each set of instructions in a robots.txt file:
User-Agent: *
Allow: /
Disallow: /cgi-bin
The first part, User-Agent, allows you to define which crawler you are talking
to. Usually you will want the rules to apply to all of them, so here you use
the * wildcard.
The second part is what they are allowed to index. This will usually be /
unless you only have a specific folder that you want the crawler to index. A
narrower Allow is usually only used when you are talking to one specific
crawler.
The third part is the disallow section. This will allow you to stop certain
folders being indexed. Examples of folders you might want hidden are:
/admin
/cgi-bin
/webmail
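If you want to check rules like these before uploading them, here is a minimal sketch using Python's standard urllib.robotparser module. The helper name is_allowed, the crawler name AnyBot and the example.com URLs are placeholders made up for this illustration. Python's parser applies the first matching rule and does not understand wildcards, so the sketch sticks to plain Disallow lines.

```python
from urllib.robotparser import RobotFileParser

# The folders listed above, hidden from every crawler.
ROBOTS_TXT = """\
User-Agent: *
Disallow: /admin
Disallow: /cgi-bin
Disallow: /webmail
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the rules in robots_txt let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# "AnyBot" and example.com are placeholders, not real names from this article.
print(is_allowed(ROBOTS_TXT, "AnyBot", "http://example.com/admin/index.asp"))  # False
print(is_allowed(ROBOTS_TXT, "AnyBot", "http://example.com/index.html"))       # True
```

Rules are matched as path prefixes, so Disallow: /admin also covers everything underneath that folder.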
You can also disallow indexing of dynamically generated pages (URLs that
contain a query string) by using:
Disallow: /*?
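The * in that path is a wildcard, and not every crawler understands it. To show roughly how a wildcard-aware crawler might read such a rule, here is a small sketch that turns a robots.txt path pattern into a regular expression. The function name pattern_matches is made up for this example, and the handling of * (any run of characters) and a trailing $ (end of URL) is an assumption based on how the major search engines describe their matching, not any crawler's actual code.

```python
import re


def pattern_matches(pattern: str, url_path: str) -> bool:
    """Check a robots.txt path pattern against a URL path (query string included).

    Assumed semantics: '*' matches any run of characters, a trailing '$'
    anchors the pattern to the end of the URL, and otherwise the pattern
    only has to match the beginning of the URL.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with ".*" where a '*' stood.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    matcher = re.fullmatch if anchored else re.match
    return matcher(regex, url_path) is not None


print(pattern_matches("/*?", "/page.asp?id=3"))  # True: the URL has a query string
print(pattern_matches("/*?", "/page.asp"))       # False: no "?" in the URL
```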
Identifying crawlers
If you want to talk to specific crawl bots, you need to get their names. These
can usually be obtained from stats packages such as AWStats when the bots
arrive on your site. For instance, the Google crawler is identified by:
Googlebot
And the Alexa crawl bot is identified by:
ia_archiver
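Once you have the names, each crawler can get its own block of rules. The sketch below is only an illustration: the rules themselves, the helper name crawler_may_fetch, the crawler name OtherBot and the example.com URLs are all invented for the example. It uses Python's standard urllib.robotparser to show that a crawler with its own block follows that block and not the * block.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: one block per named crawler, plus a catch-all block.
PER_CRAWLER_RULES = """\
User-Agent: Googlebot
Disallow: /admin

User-Agent: ia_archiver
Disallow: /

User-Agent: *
Disallow: /cgi-bin
"""


def crawler_may_fetch(user_agent: str, url: str) -> bool:
    """Return True if PER_CRAWLER_RULES lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(PER_CRAWLER_RULES.splitlines())
    return parser.can_fetch(user_agent, url)


# Googlebot follows its own block, so the /cgi-bin rule under * does not apply to it.
print(crawler_may_fetch("Googlebot", "http://example.com/admin/index.asp"))  # False
print(crawler_may_fetch("Googlebot", "http://example.com/cgi-bin/test.cgi")) # True
# ia_archiver is shut out of the whole site.
print(crawler_may_fetch("ia_archiver", "http://example.com/"))               # False
# Any other crawler falls back to the * block.
print(crawler_may_fetch("OtherBot", "http://example.com/cgi-bin/test.cgi"))  # False
```

If you want a named crawler to obey the general rules as well, repeat them inside its own block.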