Overview
This article explains how to use a robots.txt file to control how search-engine bots crawl your site.
Background
Bots, spiders, and other crawlers hitting your website can increase resource usage. This can lead to high load on the server and slow down your site(s).
One option to manage these bots is to create a robots.txt file at the root of your website. This tells search engines what content on your site they should and should not index. If you prefer not to create this file yourself, you can have DreamHost create one for you automatically (on a per-domain basis) on the Block Spiders page.
While most major search engines respect robots.txt directives, the file acts only as a suggestion to compliant crawlers; it does not prevent search engines or other tools from accessing the content or making it available.
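For example, a minimal robots.txt file placed at the root of your site (for example, https://example.com/robots.txt, using a placeholder domain) that allows all compliant crawlers to index everything looks like this:
# allow all compliant crawlers to index the entire site
User-agent: *
Disallow: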
Using caution when blocking
Please be aware of the following before creating rules to block search engines.
Blocking all bots
Blocking all bots (User-agent: *) from your entire site (Disallow: /) will get your site de-indexed from legitimate search engines. DreamHost recommends that you only block specific User-agents and files/directories, rather than everything, unless you're absolutely sure that's what you want.
Bad bots
The way that 'Bad bots' operate must also be taken into account:
- Bad bots will likely ignore your robots.txt file, so you may want to block their User-agent with an .htaccess file instead (a sketch follows this list).
- Bad bots may also use false or misleading User-agents, so blocking User-agents with .htaccess may not work as well as anticipated.
- Bad bots may use your robots.txt file as a target list, so you may want to skip listing directories in the robots.txt file.
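As an example, here is a minimal .htaccess sketch that returns a 403 Forbidden response to requests whose User-agent matches certain strings. The names BadBot and EvilScraper are placeholders; substitute the User-agents you actually see in your logs, and keep in mind the caveat above that spoofed User-agents can slip past this kind of rule.
# .htaccess sketch: deny requests from placeholder User-agents "BadBot" and "EvilScraper"
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (BadBot|EvilScraper) [NC]
RewriteRule .* - [F,L]
</IfModule>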
How to block various bots
The following sections explain how to block specific bots from crawling your website.
Determining the company to block
You can check which company an IP address belongs to by running the host command via SSH. For example, if the IP address 66.249.66.167 appears in your access.log, run the following:
[server]$ host 66.249.66.167
167.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-167.googlebot.com.
This confirms it's originating from Google, so you can use the instructions in the next section to block it.
Blocking Googlebots
To block this Googlebot, add the following to your robots.txt file:
# go away Googlebot
User-agent: Googlebot
Disallow: /
Explanation of the fields above:
- # go away Googlebot — This is a comment so you remember why you created this rule.
- User-agent — The name of the bot to which the next rule will apply.
- Disallow — The path of the URL you wish to block. Here, the forward slash (/) means the entire site will be blocked.
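If you would rather keep Googlebot out of part of your site instead of de-indexing everything, you can disallow a specific path instead. The directory name below is a placeholder:
# keep Googlebot out of one directory only
User-agent: Googlebot
Disallow: /example-directory/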
See this page for further information about Google robots.
Blocking Yahoo
Yahoo's crawling bots comply with the Crawl-delay rule in robots.txt, which limits their fetching activity. For example, to tell Yahoo not to fetch a page more than once every 10 seconds, add the following:
# slow down Yahoo
User-agent: Slurp
Crawl-delay: 10
Explanation of the fields above:
- # slow down Yahoo — This is a comment so you remember why you created this rule.
- User-agent: Slurp — Slurp is Yahoo's User-agent name. You must use this name for the rule to apply to Yahoo.
- Crawl-delay — Tells the User-agent to wait 10 seconds between each request to the server.
See this page for further information about Yahoo robots.
Blocking all bots
Add this code to disallow all bots:
User-agent: *
Disallow: /
You can also block just a specific directory:
User-agent: *
Disallow: /your-directory/
Explanation of the fields above:
- User-agent: * — Applies to all User-agents.
- Disallow: / — Disallows the indexing of everything.
- Disallow: /your-directory/ — Disallows the indexing of this single directory.
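Groups can also be combined in a single robots.txt file. As a sketch, the following blocks all bots except Googlebot; compliant crawlers follow the group whose User-agent line matches them most specifically, and an empty Disallow line means nothing is blocked for that group:
# allow Googlebot, block all other bots
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /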
Slowing good bots
Use the following to slow good bots that honor the Crawl-delay directive (not all of them do):
User-agent: *
Crawl-delay: 10
Explanation of the fields above:
- User-agent: * — Applies to all User-agents.
- Crawl-delay — Tells the User-agent to wait 10 seconds between each request to the server.
Googlebot
See the following pages for further assistance with Googlebot: