Originally Posted by smontanaro
The server logs should give you an idea of which clients are hitting the server the hardest. Also, the robots.txt file is advisory: clients agree to adhere to its dictates, but there's no enforcement. It's possible someone is ignoring the crawl delay or hitting one or more of the disallowed URLs; either would warrant a block. If you banned a single IP, might it have just been a poorly behaved homegrown crawler? A larger crawler would likely have hit you from multiple IP addresses, so you might have to block a larger range of addresses.
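As a rough illustration of the log-analysis idea, the sketch below tallies the busiest client IPs and user agents from an access log, which is usually enough to tell a single rogue machine apart from a large crawler spread across many addresses. The file name and the assumption of Combined Log Format are placeholders, not details from this thread.

[CODE]
# Minimal sketch: tally the busiest client IPs and user agents from an access
# log in the common Combined Log Format. The file name "access.log" and the
# field layout are assumptions, not details from this thread.
import re
from collections import Counter

LOG_PATH = "access.log"  # hypothetical path; adjust for your server

# Combined Log Format:
#   ip ident user [time] "request" status size "referer" "user-agent"
LINE_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d+ \S+ "[^"]*" "([^"]*)"')

ip_counts, agent_counts = Counter(), Counter()

with open(LOG_PATH, encoding="utf-8", errors="replace") as fh:
    for line in fh:
        m = LINE_RE.match(line)
        if m:
            ip_counts[m.group(1)] += 1
            agent_counts[m.group(2)] += 1

print("Top client IPs:")
for ip, n in ip_counts.most_common(10):
    print(f"{n:8d}  {ip}")

print("\nTop user agents:")
for agent, n in agent_counts.most_common(10):
    print(f"{n:8d}  {agent}")
[/CODE]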
Also, sticking a modern web server like nginx in front of your actual web server (if you haven't already) would give you more knobs to turn for blocking, throttling, blacklisting, and so on.
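To make the nginx suggestion concrete, here is a minimal sketch of a reverse-proxy front end with per-IP rate limiting and an outright deny rule. The backend address and the blocked range are placeholders for illustration, not anything from the poster's setup.

[CODE]
# Minimal sketch of an nginx front end, not the forum poster's actual setup.
# The backend address (127.0.0.1:8080) and the blocked range (203.0.113.0/24)
# are placeholders for illustration only.
events {}

http {
    # Throttle each client IP to roughly 2 requests/second.
    limit_req_zone $binary_remote_addr zone=perip:10m rate=2r/s;

    server {
        listen 80;

        # Outright block for a misbehaving address or range.
        deny 203.0.113.0/24;
        allow all;

        location / {
            # Allow short bursts, then reject with 503 until the rate drops.
            limit_req zone=perip burst=10 nodelay;

            proxy_pass http://127.0.0.1:8080;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
        }
    }
}
[/CODE]

The limit_req zone throttles clients that ignore a robots.txt crawl delay, while deny handles the case where a single address or a whole range needs a hard block.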
I am not technical, so I do not know exactly what they did. They could easily have done more than what I relayed, since the issue was passed to different teams. Regardless, whatever they did or did not do seems to have worked: there were no reports of database errors yesterday.