Have you ever had a swarm of misbehaving, hyper-aggressive spiders hit your servers at request rates of several thousand per second?
I got sick of seeing one bot in particular, MJ12bot, clogging up my site stats every day, so I blocked it. That was roughly ten hours ago, and in those ten hours MJ12bot has tried to hit this site over a hundred times. Yes, really: a staggering 129 attempted hits in 10 hours! That works out to roughly 3-6 MB of data transferred per day, which is no big deal, right? But add it up over a month or a year and it gets much bigger. I believe you can see why I was annoyed? Of course you can. Most of us webmasters know that good bots obey the robots.txt on your site; in my case, however, this one did not. I did some research online on this bot, just to be sure I was not blocking a good thing on the internet.
MJ12bot (http://www.majestic12.co.uk/projects/dsearch/mj12bot.php) claims to be a project to ‘spider the Web for the purpose of building a search engine’. The company behind it asks volunteers to install the indexing software on their own computers, using their own bandwidth and CPU resources instead of the company’s. The idea of a community-run search engine sounds really great. However, the MJ12bot authors have not operated a search engine for many years, and they have been in business since early 2007. Instead, they use the information that volunteers generate to sell SEO services on a different site: https://majestic.com. By their own account, they also sell your SEO and link data to your competitors.
They claim on many forums that their bot is a good ‘thing’ on the internet and that it will obey the robots.txt on your site, but that is not true. I can see this very clearly in my server logs. Basically, they will say anything to protect their business, which means money for them, not for you.
What I noticed in my server logs is that the MJ12bot software often requests malformed URLs that generate ‘404 not found’ errors, increasing CPU usage on the WordPress sites running on my server. Because of this, and because the MJ12bot software is often one of the primary causes of site slowdowns and CPU overage, today I’ve blocked it from all sites hosted on this server. At the end of the day, they make a business out of crawling sites for free, and trust me, you will not get any benefit from letting them do this: not a single user has ever come to me from their site.
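If you want to check your own logs for the same pattern, a minimal Python sketch along these lines can count the bot's requests per HTTP status code. It assumes the access log uses the common "combined" format; the function name summarize_bot and the sample lines (with documentation-range IPs) are made up for illustration and are not taken from my real logs.

```python
import re
from collections import Counter

# Assumed "combined" access log layout:
# IP identity user [time] "request" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def summarize_bot(lines, marker="MJ12bot"):
    """Count requests per status code for user agents containing `marker`."""
    statuses = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        # Lines that do not parse are skipped; matching is case-insensitive.
        if match and marker.lower() in match.group("agent").lower():
            statuses[match.group("status")] += 1
    return statuses


# Hypothetical sample lines, for illustration only.
BOT_UA = "Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)"
SAMPLE = [
    f'192.0.2.10 - - [01/Jan/2024:10:00:00 +0000] "GET /no-such-page HTTP/1.1" 404 512 "-" "{BOT_UA}"',
    f'192.0.2.11 - - [01/Jan/2024:10:00:05 +0000] "GET /also-missing HTTP/1.1" 404 512 "-" "{BOT_UA}"',
    f'192.0.2.10 - - [01/Jan/2024:10:00:09 +0000] "GET / HTTP/1.1" 200 2048 "-" "{BOT_UA}"',
    '198.51.100.7 - - [01/Jan/2024:10:00:12 +0000] "GET / HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
]

if __name__ == "__main__":
    for status, count in sorted(summarize_bot(SAMPLE).items()):
        print(status, count)
```

To run it against a real log, pass an open file instead of SAMPLE, for example summarize_bot(open(path)) with the path to your own access log.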
Here is what I’ve done on my server: a simple rule that requires no editing of individual sites, namely deploying a fail2ban filter. Basically, I let the bot visit any of the sites hosted on my server once (it seems to be interested in sites with a Google PageRank of 2 and higher). Since this bot is decentralized, so to speak, as soon as it hits any website hosted on my server, its IP gets banned for a week. In the roughly 10 hours since deploying this filter, it has caught 11 IP addresses per website, and with those bans in place the crawling from this parasite bot seems to be at an end. However, we all know that IPs change often, so an automatic weekly refresh of the filter is a good idea. And if the bot’s IP changes any earlier, the filter will catch it and block it from accessing any site hosted on this server. I strongly suggest you do the same with MJ12bot, and perhaps with any other bot you don’t want crawling your site(s). Fail2ban is really great and it does what you would expect!
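To make the ban logic concrete, here is a minimal Python sketch of the behaviour described above: the first request from a matching user agent is served, and the IP is then banned for a week. This is not fail2ban's code or configuration, only an illustration of the idea; the BanList class, its method names, and the sample user agent and IP are assumptions made up for the example.

```python
from datetime import datetime, timedelta

BAN_TIME = timedelta(weeks=1)  # ban length used in this post


class BanList:
    """Hypothetical in-memory ban list keyed by IP address."""

    def __init__(self, ban_time=BAN_TIME):
        self.ban_time = ban_time
        self.expires = {}  # ip -> datetime at which the ban ends

    def is_banned(self, ip, now):
        until = self.expires.get(ip)
        return until is not None and now < until

    def observe(self, ip, user_agent, now, marker="MJ12bot"):
        """Return True if the request is served, False if the IP is banned.

        A matching user agent is served once, then its IP is banned,
        so a later request from the same IP is refused until expiry.
        """
        if self.is_banned(ip, now):
            return False
        if marker.lower() in user_agent.lower():
            self.expires[ip] = now + self.ban_time
        return True


if __name__ == "__main__":
    bans = BanList()
    ua = "Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)"
    start = datetime(2024, 1, 1, 12, 0)
    print(bans.observe("192.0.2.10", ua, start))                       # first hit
    print(bans.observe("192.0.2.10", ua, start + timedelta(hours=1)))  # within the ban
    print(bans.observe("192.0.2.10", ua, start + timedelta(days=8)))   # ban expired
```

Because an expired ban is renewed on the next matching request, a bot that returns after a week gets one request through and is then banned again, which matches the weekly refresh described above.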