A crawler that doesn't play by the rules

In the Apache access log I noticed a lot of requests from the Yisou crawler (now apparently called "Shenma Search"):

[Image: yisou]

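It seems to ignore robots.txt completely. Nothing is disallowed here, so the only rule it could be breaking is the 60-second Crawl-delay: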

User-agent: *
Disallow: 
Crawl-delay: 60
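
One way to check whether a crawler honors the Crawl-delay above is to compare the gaps between its consecutive requests. The following is only a minimal sketch: the class name `CrawlDelayCheck`, the method `countViolations`, and the timestamps are all hypothetical, and it assumes the request times have already been extracted from the access log as sorted epoch seconds.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: counts how often a crawler came back sooner than the Crawl-delay allows.
class CrawlDelayCheck {
    // Assumes epochSeconds is sorted in ascending order.
    static int countViolations(List<Long> epochSeconds, long crawlDelaySeconds) {
        int violations = 0;
        for (int i = 1; i < epochSeconds.size(); i++) {
            long gap = epochSeconds.get(i) - epochSeconds.get(i - 1);
            if (gap < crawlDelaySeconds) {
                violations++;
            }
        }
        return violations;
    }

    public static void main(String[] args) {
        // Made-up request times (in seconds), for illustration only.
        List<Long> times = Arrays.asList(0L, 5L, 12L, 80L, 200L);
        System.out.println("violations: " + countViolations(times, 60));
    }
}
```

A well-behaved crawler should produce a count of zero for a delay of 60.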

So I went ahead and blocked the IP ranges in Tomcat's server.xml:

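<Context ...>
<!-- 220.181.108.*, 123.125.71.*: baidu spider, 106.11.15*.*: yisou -->
<!-- Dots are escaped so they match a literal "." only. The comma-separated list is the Tomcat 6 syntax; Tomcat 7+ expects a single regex with alternatives joined by "|". -->
<Valve className="org.apache.catalina.valves.RemoteAddrValve" allow="" deny="106\.11\.15\d\.\d+, 220\.181\.108\.\d+, 123\.125\.71\.\d+"/>
</Context>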
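
Before restarting Tomcat, it can help to try the deny patterns against a few sample addresses. The sketch below is not Tomcat code: `DenyPatternCheck` and `isDenied` are hypothetical names, the sample IPs are made up, and it assumes the valve compares the whole remote address against the pattern (a full match), with the three patterns joined by "|" into a single `java.util.regex` expression.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tomcat: tests addresses against the deny patterns.
class DenyPatternCheck {
    // The three deny patterns with escaped dots, joined into one regex.
    static final Pattern DENY = Pattern.compile(
            "106\\.11\\.15\\d\\.\\d+|220\\.181\\.108\\.\\d+|123\\.125\\.71\\.\\d+");

    // The same patterns with unescaped dots, where "." matches any character.
    static final Pattern DENY_UNESCAPED = Pattern.compile(
            "106.11.15\\d.\\d+|220.181.108.\\d+|123.125.71.\\d+");

    // Full-string match, which is the assumed comparison mode.
    static boolean isDenied(Pattern pattern, String remoteAddr) {
        return pattern.matcher(remoteAddr).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "106.11.152.7", "220.181.108.99", "123.125.71.1", "106.11.160.7", "192.168.1.10"
        };
        for (String ip : samples) {
            System.out.println(ip + " denied=" + isDenied(DENY, ip));
        }
    }
}
```

For well-formed IPv4 addresses the escaped and unescaped forms behave much the same; escaping mainly keeps the pattern from matching strings where some other character stands in for a dot.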

Here is the result:

[Image: yisou-forbid]

References:

Tomcat doc - Remote Address Filter