Writing robots.txt files 101

It’s great when search engine crawlers first start creeping all over your site. However, sometimes they index content that you don’t want them to. For instance, they may be crawling admin/index.asp or other such files, and believe me, people are searching for URLs such as that.

Robots.txt to the rescue
The solution is robots.txt, a small text file in the root of your site that most search crawlers check to see whether they may index your content. It lets you set rules for individual folders and for individual crawlers.

There are three parts to each instruction in a robots.txt file:

User-Agent: *
Allow: /
Disallow: /cgi-bin

The first part, User-Agent, defines which crawler you are talking to. Usually you will want the rules to apply to all of them, so here you use the * wildcard.

The second part is what they are allowed to index. This will usually be / unless there is only one specific folder that you want the crawler to index. A narrower Allow like that is usually only used when you are talking to one specific crawler.

The third part is the disallow section. This will allow you to stop certain folders being indexed. Examples of folders you might want hidden are:

/admin
/cgi-bin
/webmail
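
One way to check rules like these before relying on them is Python's standard urllib.robotparser module. The following is a minimal sketch rather than anything from the original article: the sample paths and the crawler name ExampleBot are made up for illustration. The Allow: / line is left out because this particular parser applies the first rule that matches, which would let Allow: / override the Disallow lines beneath it.

```python
from urllib.robotparser import RobotFileParser

# Rules built from the folders listed above (illustrative only).
ROBOTS_TXT = """\
User-Agent: *
Disallow: /admin
Disallow: /cgi-bin
Disallow: /webmail
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(path: str, agent: str = "ExampleBot") -> bool:
    """Return True if `agent` may fetch `path` under ROBOTS_TXT.

    "ExampleBot" is a hypothetical crawler name used only for this sketch.
    """
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    for path in ("/index.html", "/admin/index.asp", "/cgi-bin/form.pl"):
        print(path, "allowed" if is_allowed(path) else "blocked")
```

Rules are matched as path prefixes, so Disallow: /admin would also cover a folder such as /administrator.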

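You can also disallow indexing of dynamically generated pages, meaning URLs that contain a query string. Wildcards in paths are an extension that the major crawlers understand, though not every crawler does: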

Disallow: /*?
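
To see what a pattern like this catches, here is a rough sketch of how a wildcard-aware crawler might interpret it. It assumes the commonly documented semantics: * matches any run of characters, a trailing $ anchors the end of the URL, and matching starts at the beginning of the path. The function name rule_matches and the sample URLs are hypothetical.

```python
import re


def rule_matches(pattern: str, path: str) -> bool:
    """Return True if a robots.txt path pattern matches `path`.

    Assumed semantics: '*' matches any run of characters, a trailing '$'
    anchors the end of the URL, and matching starts at the beginning.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with '.*'.
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    if anchored:
        regex += "$"
    return re.match(regex, path) is not None


if __name__ == "__main__":
    for path in ("/index.html", "/page.asp?id=3", "/shop/list.php?cat=2"):
        print(path, "blocked" if rule_matches("/*?", path) else "allowed")
```

Under these assumptions, /*? blocks any URL with a question mark in it and leaves plain pages such as /index.html alone.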

Identifying crawlers
If you want to talk to specific crawl bots, you need their user-agent names. These usually show up in your site statistics, such as AWStats, once the crawlers have visited your site. For instance, the Google crawler is identified by:

Googlebot

And the Alexa crawl bot is identified by:

ia_archiver
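
Once you have the names, you can give a crawler its own group of rules. The sketch below uses Python's standard urllib.robotparser to show how such a file might be read. The policy itself (shutting out ia_archiver entirely while only hiding /admin from everyone else) is an assumption made up for illustration, not a recommendation from the original article.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: one group for a named crawler, one for all others.
ROBOTS_TXT = """\
User-Agent: ia_archiver
Disallow: /

User-Agent: *
Disallow: /admin
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_crawl(agent: str, path: str) -> bool:
    """Return True if the crawler named `agent` may fetch `path`."""
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    for agent in ("ia_archiver", "Googlebot"):
        for path in ("/index.html", "/admin/index.asp"):
            verdict = "allowed" if may_crawl(agent, path) else "blocked"
            print(agent, path, verdict)
```

A crawler that finds a group naming it follows that group; everything else falls back to the * group.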

Written by: Chris Worfolk - http://www.worfolk.biz
Posted: 1/31/2004