
Why it happens?
    Scraper implementation issues
        Ignoring robots.txt
        Bug in the scraper implementation
        Scraper is too aggressive.
    Your website has bottlenecks, or your own implementation is inefficient.
    Evil people want to bring down your website.
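
A minimal sketch, in Python with only the standard library, of what a well-behaved scraper looks like in contrast to the issues above: it checks robots.txt before fetching and throttles itself with a delay. The target site, UserAgent string, and delay are illustrative assumptions.

    import time
    import urllib.robotparser
    import urllib.request

    BASE = "https://example.com"        # hypothetical target site
    USER_AGENT = "polite-scraper/0.1"   # identify the scraper honestly
    DELAY_SECONDS = 2                   # assumed polite pause between requests

    # Respect robots.txt instead of ignoring it.
    robots = urllib.robotparser.RobotFileParser(BASE + "/robots.txt")
    robots.read()

    def fetch(path):
        url = BASE + path
        if not robots.can_fetch(USER_AGENT, url):
            return None                 # the site disallows this path; skip it
        req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        with urllib.request.urlopen(req) as resp:
            body = resp.read()
        time.sleep(DELAY_SECONDS)       # throttle so the scraper is not too aggressive
        return body

    for path in ["/", "/about"]:
        fetch(path)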

How they do it?
    Find a few expensive pages on a website and fetch them repeatedly.
    Call a few pages repeatedly.

Tools
    Simple scraper
        Curl/wget
        A script or a program (see the sketch below)
    Application level DDoS
        Slowloris http://ha.ckers.org/slowloris/
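
For illustration only, a minimal Python sketch of the kind of "script or a program" referred to above: it simply requests one page in a loop. The URL and request count are placeholders; the point is how little effort this traffic takes to generate, which is why the detection and mitigation steps below matter.

    import urllib.request

    URL = "https://example.com/expensive-report"   # hypothetical costly page
    REQUESTS = 1000                                # arbitrary example volume

    # Naive repeated fetching of a single expensive page.
    for i in range(REQUESTS):
        try:
            with urllib.request.urlopen(URL, timeout=10) as resp:
                resp.read()
        except OSError as exc:
            print("request", i, "failed:", exc)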

Detection mechanism
    DIY
        Cron job that analyzes log files and automatically applies actions (see the sketch below).
    Tools
        fail2ban - also a mitigation mechanism.
        denyhosts - also a mitigation mechanism.
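
A minimal sketch of the DIY detection idea, assuming Python and a common/combined-format access log: count requests per client IP and report the ones above a threshold. The log path, threshold, and the "apply actions" step are assumptions; a real cron job would feed the offending IPs into something like fail2ban, a deny list, or iptables.

    import collections

    LOG_PATH = "/var/log/apache2/access.log"   # assumed log location
    THRESHOLD = 1000                           # assumed request-count cutoff

    hits = collections.Counter()
    with open(LOG_PATH) as log:
        for line in log:
            # Common/combined log format starts with the client IP.
            hits[line.split(" ", 1)[0]] += 1

    for ip, count in hits.most_common():
        if count < THRESHOLD:
            break
        # A real job would apply an action here (block, rate limit, alert).
        print(ip, count)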

How to be safe?
    Non-technical solutions
        Find who the scraper is and send them an email asking them to play nice.
        If they don't listen, complain to their ISP.
        If the ISP doesn't listen, block them.
        ToS
            Have a ToS on your website.
        Make sure your robots.txt is correct.

    Technical solutions
        Architecture level
            Scale your architecture to handle more traffic. Get a beefier server.
            Cache data so the load on the backend is minimal.
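
A minimal sketch of the caching idea in Python: a small in-memory cache with a time-to-live in front of an expensive backend call. The TTL, the stand-in backend function, and the in-process dictionary are illustrative; in practice this role is usually played by a reverse-proxy cache or memcached/redis.

    import time

    CACHE_TTL = 60     # assumed freshness window, in seconds
    _cache = {}        # key -> (expiry time, cached value)

    def expensive_backend_query(key):
        # Stand-in for costly work (database query, report generation, ...).
        time.sleep(1)
        return "result for " + key

    def cached_query(key):
        now = time.time()
        entry = _cache.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]                        # cache hit, backend untouched
        value = expensive_backend_query(key)
        _cache[key] = (now + CACHE_TTL, value)
        return value

    # Repeated requests within the TTL never reach the backend.
    print(cached_query("popular-page"))
    print(cached_query("popular-page"))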

        Application level (HTTP server level)
            Figure out some attribute of the incoming request and use a server-level rule to block the traffic (see the sketch below).
                Attributes
                    UserAgent string.
                    CIDR range of IPs.
                Tools
                    mod_security
                        More efficient rules.
                        http://www.modsecurity.org/documentation/ModSecurity_The_Open_Source_Web_Application_Firewall_Nov2007.pdf
                    mod_rewrite
                        If the URLs can be black-holed, use it to deny requests.
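
The map points at mod_security and mod_rewrite rules for this. As a language-neutral illustration of the same idea, not the Apache rule syntax itself, here is a Python sketch that matches the two attributes named above, the UserAgent string and a CIDR range, against an incoming request. The block lists are made-up examples.

    import ipaddress

    # Assumed block lists; in mod_security/mod_rewrite these would be rule conditions.
    BLOCKED_USER_AGENTS = ("BadBot", "HeavyScraper")
    BLOCKED_NETWORKS = (ipaddress.ip_network("203.0.113.0/24"),)   # documentation range, placeholder

    def should_block(client_ip, user_agent):
        if any(ua in user_agent for ua in BLOCKED_USER_AGENTS):
            return True
        addr = ipaddress.ip_address(client_ip)
        return any(addr in net for net in BLOCKED_NETWORKS)

    # Example checks:
    print(should_block("203.0.113.42", "Mozilla/5.0"))   # True: blocked CIDR range
    print(should_block("198.51.100.7", "BadBot/1.0"))    # True: blocked UserAgent
    print(should_block("198.51.100.7", "Mozilla/5.0"))   # False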

        Network level
            Use iptables to block or rate limit traffic.
                Ex: No more than 128 connections from any one /24.
                    # connection limit for HTTP
                    -A INPUT -i eth0 -p tcp --syn --dport www -m connlimit ! --connlimit-above 128 --connlimit-mask 24 -j ACCEPT
                Dangerous, since you can block yourself!
            Call your network provider to block traffic.
