Block Apache access from malicious user agents and spiders

Q I run my own small site and it seems most of my traffic comes from web crawlers. How can I control access to my Apache web server from potentially malicious user agents, crawlers, spiders et al?

A Web crawlers and spiders can be used to pirate content and to map out the structure of your website in ways you may prefer to keep hidden, and they have been known to bring sites down through the load they place on a server. The same kinds of agent are also commonly used by search engines to catalogue the content of websites. This is all well and good, but if you do not want your site searched in this manner it is a good idea to block the agents that do the searching, and take some load off your server at the same time.

Most of the time, well-behaved web crawlers will read the robots.txt file at the root of the website and respect what it says (a minimal example is shown at the end of this answer). If they don't, you have to adopt stronger tactics. One approach is to filter on the HTTP header information the client sends. Although there are ways around this type of filtering, it is a good first step and in most cases is all you need to block this kind of access. Replace webcopier in the example below with the User-Agent string actually sent by the spider you want to keep out. Try

# Set the variable "block" when the User-Agent header starts with "webcopier"
SetEnvIf User-Agent ^webcopier block
<Limit GET POST>
    Order Allow,Deny
    Allow from all
    # Refuse any request that carries the "block" variable
    Deny from env=block
</Limit>
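The Order/Allow/Deny directives above use the older Apache 2.2-style access control (mod_access_compat). If your server runs Apache 2.4 or later, a roughly equivalent sketch using the newer Require directives would look like this (assuming the default mod_authz_core module; place it in the same directory or <Limit> context):

SetEnvIf User-Agent ^webcopier block
<Limit GET POST>
    <RequireAll>
        # Allow everyone, except requests that carry the "block" variable
        Require all granted
        Require not env block
    </RequireAll>
</Limit>

The effect is the same: any request whose User-Agent sets the block variable is refused with a 403 Forbidden response.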

or, using mod_rewrite instead:

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^WebCopier [NC]
RewriteRule ^.* - [F,L]
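If you need to shut out several agents at once, RewriteCond lines can be chained with the OR flag. A sketch follows; the agent names other than WebCopier are only illustrative placeholders, so substitute the strings you actually see in your access logs:

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^WebCopier [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^HTTrack [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [NC]
# [F] returns 403 Forbidden, [L] stops further rewriting
RewriteRule ^.* - [F,L]

The NC flag makes the match case-insensitive, so WebCopier, webcopier and WEBCOPIER are all caught.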
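Finally, the promised robots.txt example. Well-behaved crawlers fetch this file from the root of your site and honour it voluntarily, so it is no defence against hostile agents, but it keeps the polite ones away and reduces load. A minimal sketch (the /private/ path is just an illustration):

# Refuse everything to WebCopier
User-agent: WebCopier
Disallow: /

# Ask all other crawlers to stay out of the private area
User-agent: *
Disallow: /private/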
