Overview
Bots, spiders, and other crawlers hitting your dynamic pages can cause heavy memory and CPU usage. This can lead to high load on the server and slow down your site(s).
One option to reduce server load from bots, spiders, and other crawlers is to create a robots.txt file at the root of your website. This tells search engines what content on your site they should and should not index. This can be helpful, for example, if you want to keep a portion of your site out of the Google search engine index.
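For example, a robots.txt file at the root of your site (reachable at https://example.com/robots.txt) might contain the following. The /private/ directory is only a placeholder for whichever part of your site you want to keep out of Google's index:

User-agent: Googlebot
Disallow: /private/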
If you prefer not to create this file yourself, you can have DreamHost create one for you automatically (on a per-domain basis) on the Block Spiders page.
While most of the major search engines respect robots.txt directives, this file only acts as a suggestion to compliant search engines and does not prevent search engines (or other similar tools, such as email/content scrapers) from accessing the content or making it available.
Blocking robots
The problem may be that Google, Yahoo, or another search engine bot is over-browsing your site. (This is the sort of problem that feeds on itself; if the bot is not able to complete its search because of a lack of resources, it may launch the same search over and over again.)
Blocking Googlebots
In the following example, the IP address 66.249.66.167 was found in your access.log. You can check which company this IP address belongs to by running the 'host' command via SSH:
[server]$ host 66.249.66.167
167.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-167.googlebot.com.
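If you are not sure which IP addresses are the busiest, one quick way to find out (assuming a standard combined-format Apache log, with access.log standing in for the actual path to your site's log file) is to count requests per client IP:

[server]$ awk '{print $1}' access.log | sort | uniq -c | sort -rn | head

Each output line shows a request count followed by an IP address; you can then run the 'host' command on the most active addresses, as shown above.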
To block this Googlebot, use the following in your robots.txt file:
# go away Googlebot
User-agent: Googlebot
Disallow: /
Explanation of the fields above:
- # go away Googlebot
- This is a comment, included only so you know why you created this rule.
- User-agent
- The name of the bot to which the next rule will apply.
- Disallow
- The path of the URL you wish to block. This forward slash means the entire site will be blocked.
Blocking Yahoo
Yahoo's crawling bots comply with the Crawl-delay directive in robots.txt, which limits their fetching activity. For example, to tell Yahoo not to fetch a page more than once every 10 seconds, you would add the following:
# slow down Yahoo
User-agent: Slurp
Crawl-delay: 10
Explanation of the fields above:
- # slow down Yahoo
- This is a comment, included only so you know why you created this rule.
- User-agent: Slurp
- Slurp is Yahoo's User-agent name. You must use this name for the rule to apply to Yahoo.
- Crawl-delay
- Tells the User-agent to wait 10 seconds between each request to the server.
Slowing good bots
Use the following to slow down good bots; note that not all of them honor the Crawl-delay directive:
User-agent: *
Crawl-delay: 10
Explanation of the fields above:
- User-agent: *
- Applies to all User-agents.
- Crawl-delay
- Tells the User-agent to wait 10 seconds between each request to the server.
Blocking all bots
To disallow all bots:
User-agent: *
Disallow: /
To disallow them from a specific folder only:
User-agent: *
Disallow: /yourfolder/
Bad bots may use this content as a list of targets.
Explanation of the fields above:
- User-agent: *
- Applies to all User-agents.
- Disallow: /
- Disallows the indexing of everything.
- Disallow: /yourfolder/
- Disallows the indexing of this single folder.
Use caution
Blocking all bots (User-agent: *) from your entire site (Disallow: /) will get your site de-indexed from legitimate search engines. Also, note that bad bots will likely ignore your robots.txt file, so you may want to block their user-agent with an .htaccess file.
Bad bots may use your robots.txt file as a target list, so you may want to skip listing directories in the robots.txt file. Bad bots may also use false or misleading User-agents, so blocking User-agents with .htaccess may not work as well as anticipated.
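As a rough sketch, an .htaccess rule such as the following uses Apache's mod_rewrite to return a 403 Forbidden response to requests whose User-agent matches a pattern. 'BadBot' and 'EvilScraper' are placeholder names, not actual bots; substitute the User-agent strings you see in your access.log:

# Deny requests from specific User-agents (placeholder names)
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (BadBot|EvilScraper) [NC]
RewriteRule .* - [F,L]
</IfModule>

Keep in mind that, as noted above, bots sending false User-agents will slip past a rule like this.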
If you don't want to block anyone, this is a good default robots.txt file:
User-agent: *
Disallow:
Since this file allows everything, you could instead simply remove the robots.txt file, if you don't mind 404 requests in your logs.
DreamHost recommends that you only block specific User-agents and files/directories, rather than *, unless you're 100% sure that's what you want.
Blocking bad referrers
For detailed instructions, please visit the article on how to block referrers.