Overview
Bots, spiders, and other crawlers hitting your dynamic pages can cause heavy memory and CPU usage. This can lead to high load on the server and slow down your site(s).
One option to reduce server load from bots, spiders, and other crawlers is to create a robots.txt file at the root of your website. This tells search engines what content on your site they should and should not index. This can be helpful, for example, if you want to keep a portion of your site out of the Google search engine index.
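For example, a robots.txt file at the root of your site (reachable at https://example.com/robots.txt) might contain the following. The /private/ directory is only a placeholder for whichever part of your site you want to keep out of Google's index:

User-agent: Googlebot
Disallow: /private/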
If you prefer not to create this file yourself, you can have DreamHost create one for you automatically (on a per-domain basis) on the Block Spiders page.
While most of the major search engines respect robots.txt directives, this file only acts as a suggestion to compliant search engines and does not prevent search engines (or other similar tools, such as email/content scrapers) from accessing the content or making it available.
Blocking robots
The problem may be that Google, Yahoo, or another search engine bot is over-browsing your site. (This is the sort of problem that feeds on itself; if the bot is not able to complete its search because of a lack of resources, it may launch the same search over and over again.)
Blocking Googlebots
In the following example, the IP address 66.249.66.167 was found in your access.log. You can check which company this IP address belongs to by running the 'host' command via SSH:
[server]$ host 66.249.66.167
167.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-167.googlebot.com.
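If you are not sure which IP addresses are the busiest, one quick way to find out (assuming a standard combined-format Apache log, with access.log standing in for the actual path to your site's log file) is to count requests per client IP:

[server]$ awk '{print $1}' access.log | sort | uniq -c | sort -rn | head

Each output line shows a request count followed by an IP address; you can then run the 'host' command on the most active addresses, as shown above.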
To block this Googlebot, use the following in your robots.txt file:
# go away Googlebot
User-agent: Googlebot
Disallow: /
Explanation of the fields above:
- # go away Googlebot
- This is a comment, included only so you know why you created this rule.
- User-agent
- The name of the bot to which the next rule will apply.
- Disallow
- The path of the URL you wish to block. This forward slash means the entire site will be blocked.
Blocking Yahoo
Yahoo's crawling bots comply with the Crawl-delay directive in robots.txt, which limits their fetching activity. For example, to tell Yahoo not to fetch a page more than once every 10 seconds, you would add the following:
# slow down Yahoo
User-agent: Slurp
Crawl-delay: 10
Explanation of the fields above:
- # slow down Yahoo
- This is a comment, included only so you know why you created this rule.
- User-agent: Slurp
- Slurp is Yahoo's User-agent name. You must use this name for the rule to apply to Yahoo.
- Crawl-delay
- Tells the User-agent to wait 10 seconds between each request to the server.
Slowing good bots
Use the following to slow down good bots; note that not all of them honor the Crawl-delay directive:
User-agent: *
Crawl-delay: 10
Explanation of the fields above:
- User-agent: *
- Applies to all User-agents.
- Crawl-delay
- Tells the User-agent to wait 10 seconds between each request to the server.
Blocking all bots
To disallow all bots:
User-agent: *
Disallow: /
To disallow them from a specific folder only:
User-agent: *
Disallow: /yourfolder/
Bad bots may use this content as a list of targets.
Explanation of the fields above:
- User-agent: *
- Applies to all User-agents.
- Disallow: /
- Disallows the indexing of everything.
- Disallow: /yourfolder/
- Disallows the indexing of this single folder.
Use caution
Blocking all bots (User-agent: *) from your entire site (Disallow: /) will get your site de-indexed from legitimate search engines. Also, note that bad bots will likely ignore your robots.txt file, so you may want to block their user-agent with an .htaccess file.
Bad bots may use your robots.txt file as a target list, so you may want to skip listing directories in the robots.txt file. Bad bots may also use false or misleading User-agents, so blocking User-agents with .htaccess may not work as well as anticipated.
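As a rough sketch, an .htaccess rule such as the following uses Apache's mod_rewrite to return a 403 Forbidden response to requests whose User-agent matches a pattern. 'BadBot' and 'EvilScraper' are placeholder names, not actual bots; substitute the User-agent strings you see in your access.log:

# Deny requests from specific User-agents (placeholder names)
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (BadBot|EvilScraper) [NC]
RewriteRule .* - [F,L]
</IfModule>

Keep in mind that, as noted above, bots sending false User-agents will slip past a rule like this.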
If you don't want to block anyone, this is a good default robots.txt file:
User-agent: *
Disallow:
Since this file allows everything, you could instead simply remove the robots.txt file, if you don't mind 404 requests in your logs.
DreamHost recommends that you only block specific User-agents and files/directories, rather than *, unless you're 100% sure that's what you want.
Blocking bad referrers
For detailed instructions, please visit the article on how to block referrers.