Control bots, spiders, and crawlers

Overview

You can use a robots.txt file to control which search-engine bots crawl your site.

Background

Bots, spiders, and other crawlers hitting your website can potentially increase resource usage. This can lead to high load on the server and slow down your site(s).

One option to manage these bots is to create a robots.txt file at the root of your website. This tells search engines what content on your site they should and should not index. If you prefer not to create this file yourself, you can have DreamHost create one for you automatically (on a per-domain basis) on the Block Spiders page.

While most of the major search engines respect robots.txt directives, this file only acts as a suggestion to compliant search engines and does not prevent search engines or other similar tools from accessing the content or making it available.

What are the default robots.txt and agents.txt files?

DreamHost automatically includes default robots.txt and agents.txt files for websites hosted on Web Hosting plans. These files are provided by Apache and apply to your site if no custom versions are present in your site's web directory. To override either file, simply create your own robots.txt or agents.txt in the root of your site's web directory (e.g., /home/username/example.com/).

Once a custom file is in place, Apache will serve it instead of the default file, giving you full control over how bots and crawlers interact with your site.

What should I know before blocking bots?

Please be aware of the following before creating rules to block search engines.

Blocking all bots

Blocking all bots (User-agent: *) from your entire site (Disallow: /) will get your site de-indexed from legitimate search engines. DreamHost recommends that you block only specific User-agents and files/directories, rather than everything, unless you're absolutely sure that's what you want.

Bad bots

The way that 'Bad bots' operate must also be taken into account:

Bad bots will likely ignore your robots.txt file, so you may want to block their user-agent with an .htaccess file instead.
Bad bots may also use false or misleading User-agents, so blocking User-agents with .htaccess may not work as well as anticipated.
Bad bots may use your robots.txt file as a target list, so you may want to skip listing directories in the robots.txt file.
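Before deciding which User-agents to block, it can help to see who is actually hitting the site. The following is a minimal sketch, assuming your access.log uses the Apache "combined" log format; the function name tally_clients and the sample log lines are made up for illustration and are not part of any DreamHost tool.

```python
import re
from collections import Counter

# Assumption: Apache "combined" log format, where the User-agent is the
# last quoted field on each line.
LOG_LINE = re.compile(
    r'^(\S+) \S+ \S+ \[[^\]]*\] "[^"]*" \d{3} \S+ "[^"]*" "([^"]*)"'
)

GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


def tally_clients(lines):
    """Count requests per (IP, User-agent) pair, skipping lines that do not parse."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


# Made-up sample lines for illustration only.
sample = [
    '66.249.66.167 - - [10/Oct/2023:13:55:36 -0700] "GET / HTTP/1.1" 200 2326 "-" "' + GOOGLEBOT_UA + '"',
    '66.249.66.167 - - [10/Oct/2023:13:55:40 -0700] "GET /about HTTP/1.1" 200 512 "-" "' + GOOGLEBOT_UA + '"',
    '203.0.113.9 - - [10/Oct/2023:13:56:01 -0700] "GET /login HTTP/1.1" 404 128 "-" "ExampleBot/1.0"',
    "not a log line",
]

for (ip, agent), hits in tally_clients(sample).most_common():
    print(hits, ip, agent)
```

Because a User-agent string can be forged, treat the tally as a starting point and confirm suspicious IPs separately.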

How do I block various bots?

The following sections explain how to block specific bots from crawling your website.

How do I determine the company to block?

You can check which company an IP address belongs to by running the host command via SSH. For example, if the IP address 66.249.66.167 appears in your access.log, run the following:

[server]$ host 66.249.66.167
167.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-167.googlebot.com.

This confirms it's originating from Google, so you can use the instructions in the next section to block it.
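The same check can be scripted. The sketch below does a reverse lookup and then a forward lookup to confirm the hostname resolves back to the same IP, since a reverse record alone can be set by whoever controls the IP. The function name looks_like_googlebot and the accepted domain suffixes are assumptions for illustration; the lookup functions are passed in as parameters so the logic can be tried without network access.

```python
import socket


def looks_like_googlebot(ip,
                         reverse_lookup=socket.gethostbyaddr,
                         forward_lookup=socket.gethostbyname_ex):
    """Return True if ip reverse-resolves to a Google crawler hostname
    and that hostname resolves back to the same ip."""
    try:
        hostname = reverse_lookup(ip)[0].rstrip(".")
        # Assumption: genuine Googlebot hostnames end in one of these suffixes.
        if not hostname.endswith((".googlebot.com", ".google.com")):
            return False
        # Forward-confirm: the hostname must map back to the original IP.
        return ip in forward_lookup(hostname)[2]
    except OSError:
        # Covers failed lookups (socket.herror and socket.gaierror).
        return False


# Example call (performs real DNS lookups when run on a server):
# looks_like_googlebot("66.249.66.167")
```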

How do I block Googlebots?

To block this Googlebot, add the following in your robots.txt file:

# go away Googlebot
User-agent: Googlebot
Disallow: /

Explanation of the fields above:

# go away — This is a comment so you remember why you created this rule.
User-agent — The name of the bot to which the next rule will apply.
Disallow — The path of the URL you wish to block. This forward slash means the entire site will be blocked.
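To check how a compliant crawler would read this rule before you upload it, you could run it through Python's standard urllib.robotparser module. This is a minimal sketch; ExampleBot is a made-up User-agent used only for comparison.

```python
from urllib.robotparser import RobotFileParser

rules = """\
# go away Googlebot
User-agent: Googlebot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Googlebot matches the rule; the hypothetical ExampleBot does not.
googlebot_allowed = parser.can_fetch("Googlebot", "https://example.com/page.html")
otherbot_allowed = parser.can_fetch("ExampleBot", "https://example.com/page.html")

print("Googlebot allowed:", googlebot_allowed)
print("ExampleBot allowed:", otherbot_allowed)
```

The rule applies only to the named User-agent, so other crawlers are unaffected unless they have their own entry.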

See this page for further information about Google robots.

How do I block Yahoo?

Yahoo's crawling bots comply with the crawl-delay rule in robots.txt, which limits their fetching activity. For example, to tell Yahoo not to fetch a page more than once every 10 seconds, you would add the following:

# slow down Yahoo
User-agent: Slurp
Crawl-delay: 10

Explanation of the fields above:

# slow down Yahoo — This is a comment so you remember why you created this rule.
User-agent: Slurp — Slurp is the Yahoo User-agent name. You must use this to block Yahoo.
Crawl-delay — Tells the User-agent to wait 10 seconds between each request to the server.
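As a quick sanity check, the delay a compliant crawler would read from this entry can be inspected with Python's standard urllib.robotparser module. This is a sketch only; ExampleBot is a made-up User-agent with no entry in the file.

```python
from urllib.robotparser import RobotFileParser

rules = """\
# slow down Yahoo
User-agent: Slurp
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

slurp_delay = parser.crawl_delay("Slurp")       # delay in seconds for Slurp
other_delay = parser.crawl_delay("ExampleBot")  # None when no entry applies
slurp_can_fetch = parser.can_fetch("Slurp", "https://example.com/")

print(slurp_delay, other_delay, slurp_can_fetch)
```

Note that the entry slows Slurp down without blocking it: with no Disallow line, the crawler may still fetch every page.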

See this page for further information about Yahoo robots.

How do I block all bots?

Add this code to disallow all bots:

User-agent: *
Disallow: /

You can also specify a directory.

User-agent: *
Disallow: /your-directory/

Explanation of the fields above:

User-agent: * — Applies to all User-agents.
Disallow: / — Tells bots not to crawl anything on the site.
Disallow: /your-directory/ — Tells bots not to crawl this single directory.
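The difference between the two rule sets above can be seen by testing a couple of paths against each. This is a minimal sketch using Python's standard urllib.robotparser module; the helper name load, the User-agent AnyBot, and the file names are made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def load(text):
    """Parse robots.txt text and return the parser."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


block_everything = load("User-agent: *\nDisallow: /\n")
block_directory = load("User-agent: *\nDisallow: /your-directory/\n")

for label, parser in (("everything", block_everything),
                      ("directory", block_directory)):
    for path in ("/index.html", "/your-directory/file.html"):
        print(label, path, parser.can_fetch("AnyBot", path))
```

With the directory rule, only paths under /your-directory/ are refused; the rest of the site stays crawlable.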

How do I slow good bots?

Use the following to slow some, but not all, good bots:

User-agent: *
Crawl-delay: 10

Explanation of the fields above:

User-agent: * — Applies to all User-agents.
Crawl-delay — Tells the User-agent to wait 10 seconds between each request to the server.
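To illustrate what honoring this rule means on the crawler's side, here is a rough sketch of a fetch loop that reads the delay and pauses between requests. It is not how any particular search engine works; polite_fetch, AnyBot, and the fake fetch and sleep functions are all hypothetical and exist only so the example runs instantly.

```python
import time
from urllib.robotparser import RobotFileParser


def polite_fetch(paths, parser, agent, fetch, sleep=time.sleep):
    """Fetch allowed paths, pausing for the crawl delay between requests."""
    delay = parser.crawl_delay(agent) or 0
    allowed = [path for path in paths if parser.can_fetch(agent, path)]
    results = []
    for index, path in enumerate(allowed):
        if index > 0:
            sleep(delay)  # wait before every request after the first
        results.append(fetch(path))
    return results


parser = RobotFileParser()
parser.parse("User-agent: *\nCrawl-delay: 10\n".splitlines())

# Stand-ins so the example does not really sleep or touch the network.
pauses = []
pages = polite_fetch(["/a", "/b", "/c"], parser, "AnyBot",
                     fetch=str.upper, sleep=pauses.append)
print(pages, pauses)
```

A crawler that does not support Crawl-delay simply skips this step, which is why the rule slows only some bots.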

Googlebot

See the following pages for further assistance with Googlebot:
