Reducing bad bots, spiders, spammers and their effect on ShopWindow operations quota
Hi
Mod: if this is the wrong place for this posting - apologies. Please re-direct it to the best place.
I'm using SW Client V2 and have had a lot of problems with my SW operations quota being swallowed up despite relatively few impressions (AW's rough guide is that one impression uses about six operations). I eventually realised that spammers and bad bots were draining the resources.
Two pages of this site were very useful for an overview of the problem and how to deal with unwanted visitors.
http://www.affiliatebeginnersguide.com/articles/page_content.html
http://www.affiliatebeginnersguide.com/articles/block_bots.html
I then trawled the web and found Project Honeypot (http://www.projecthoneypot.org). This site provides the means to trap bad bots for the greater good, and identify bad bots etc. Using their black list, it's possible to find out which visitors are search engines, or known comment spammers, e-mail harvesters etc and block them accordingly.
Analysing the web server logs was a revelation - I discovered a newish search engine (cuil with a user agent 'twiceler') that was also using up resources to no particular benefit to me - I have blocked this search engine, but if it ever becomes the number one search tool, I can easily unblock it.
My latest problem has been a malicious attack from one IP address visiting the SW search every few seconds and wiping out my quota. As it's only one address, it was easy to block once I'd spotted it, but the problem is detecting an attack before the day's quota has disappeared. That's fine if you monitor the site daily, but not so useful otherwise (there are other things to do in life, I believe, such as weekends off, holidays...). I'll be working out some code to detect such frequent visits and automatically put any badly behaved visitor on a blacklist.
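A minimal sketch of the kind of frequent-visit check described above, assuming PHP 7.1 or later. The function name record_hit, the 60-second window and the limit of 3 hits are all made up for illustration; nothing here comes from the SW client itself.

```php
<?php
// Hypothetical helper: record one request and report whether the visitor
// has gone over the allowed number of requests inside a sliding time window.
// $hits is the list of Unix timestamps of that visitor's earlier requests.
function record_hit(array $hits, int $now, int $windowSeconds, int $maxHits): array
{
    // Keep only the earlier hits that still fall inside the window.
    $recent = array_values(array_filter(
        $hits,
        function ($t) use ($now, $windowSeconds) {
            return $t > $now - $windowSeconds;
        }
    ));

    // Add the current request, then compare the count against the limit.
    $recent[] = $now;

    return [
        'hits'    => $recent,
        'blocked' => count($recent) > $maxHits,
    ];
}
```

In practice the list of timestamps would be stored per IP address (in a session, a text file or a MySQL table), and a blocked visitor would be sent a 403 before any call to the SW client is made, so the request costs no operations.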
I have already reported this to SW support, but would repeat my request to them here to put some detection code in place to monitor visits and protect its resources too.
Rgds
Val
SW can't distinguish between user_agents accessing pages on your server, so you'll need to kill them yourself on request.
The best way is using .htaccess (assuming a Linux box running Apache), as spambots tend not to follow the robots.txt protocol.
Add this lot to your .htaccess:
#######kill some bad bots
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^Balihoo [OR]
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [OR]
RewriteCond %{HTTP_USER_AGENT} ^Bot\ mailto:craftbot@yahoo.com [OR]
RewriteCond %{HTTP_USER_AGENT} ^ChinaClaw [OR]
RewriteCond %{HTTP_USER_AGENT} ^Custo [OR]
RewriteCond %{HTTP_USER_AGENT} ^DISCo [OR]
RewriteCond %{HTTP_USER_AGENT} ^Download\ Demon [OR]
RewriteCond %{HTTP_USER_AGENT} ^eCatch [OR]
RewriteCond %{HTTP_USER_AGENT} ^EirGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
RewriteCond %{HTTP_USER_AGENT} ^Express\ WebPictures [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
RewriteCond %{HTTP_USER_AGENT} ^EyeNetIE [OR]
RewriteCond %{HTTP_USER_AGENT} ^FlashGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^GetRight [OR]
RewriteCond %{HTTP_USER_AGENT} ^GetWeb! [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go!Zilla [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go-Ahead-Got-It [OR]
RewriteCond %{HTTP_USER_AGENT} ^GrabNet [OR]
RewriteCond %{HTTP_USER_AGENT} ^Grafula [OR]
RewriteCond %{HTTP_USER_AGENT} ^HMView [OR]
RewriteCond %{HTTP_USER_AGENT} HTTrack [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Stripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} Indy\ Library [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^InterGET [OR]
RewriteCond %{HTTP_USER_AGENT} ^Internet\ Ninja [OR]
RewriteCond %{HTTP_USER_AGENT} ^IRL/bot [OR]
RewriteCond %{HTTP_USER_AGENT} ^JetCar [OR]
RewriteCond %{HTTP_USER_AGENT} ^JOC\ Web\ Spider [OR]
RewriteCond %{HTTP_USER_AGENT} ^larbin [OR]
RewriteCond %{HTTP_USER_AGENT} ^LeechFTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mass\ Downloader [OR]
RewriteCond %{HTTP_USER_AGENT} ^MIDown\ tool [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mister\ PiX [OR]
RewriteCond %{HTTP_USER_AGENT} ^Navroad [OR]
RewriteCond %{HTTP_USER_AGENT} ^NearSite [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetAnts [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Net\ Vampire [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Octopus [OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Explorer [OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Navigator [OR]
RewriteCond %{HTTP_USER_AGENT} ^PageGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^Papa\ Foto [OR]
RewriteCond %{HTTP_USER_AGENT} ^pavuk [OR]
RewriteCond %{HTTP_USER_AGENT} ^pcBrowser [OR]
RewriteCond %{HTTP_USER_AGENT} ^RealDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ^ReGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^SiteSnagger [OR]
RewriteCond %{HTTP_USER_AGENT} ^SmartDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperHTTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Surfbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^tAkeOut [OR]
RewriteCond %{HTTP_USER_AGENT} ^Teleport\ Pro [OR]
RewriteCond %{HTTP_USER_AGENT} ^VoidEYE [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Image\ Collector [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebAuto [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebCopier [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebFetch [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebGo\ IS [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebLeacher [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebReaper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebSauger [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ eXtractor [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ Quester [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebStripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebWhacker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
RewriteCond %{HTTP_USER_AGENT} ^Widow [OR]
RewriteCond %{HTTP_USER_AGENT} ^WWWOFFLE [OR]
RewriteCond %{HTTP_USER_AGENT} ^Xaldon\ WebSpider [OR]
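RewriteCond %{HTTP_USER_AGENT} libwww-perl.* [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus
# Without a RewriteRule the conditions above do nothing; this returns 403 Forbidden on a match
RewriteRule .* - [F,L]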
I am not claiming it to be an exhaustive list, but it'll cut out a lot of your problems.
Next job: kill the good bots that are taking too much.
Add this to your robots.txt:
User-agent: IRLbot
Disallow: /
User-agent: twiceler
Disallow: /
Neither of those is an evil bot and both follow robots.txt. IRLbot is used by some university in the States as a teaching aid and, as you know, twiceler is Cuil.com's bot.
Hi Andy
Thanks for that lot - have been doing this, but thought others might like to know about the various useful websites I've come across.
I had solved the problem until the malicious attack from one IP address.
I was hoping that as these all attacked the SW search field, SW might be able to detect and protect itself from malicious attacks on its own server too.
Rgds
Val
Further to thwarting any possible repeats of this spambot swarm (i.e. an attack on the SW search numerous times per second), I've been trawling the web and discovered that this type of automated spamming targets obvious filenames, such as productlist, which SW uses, and search, which I was using.
Now I've changed the filenames to something much less obvious. It won't stop a human finding files to target, but should reduce the risk of automatic detection.
Hope this info is useful to fellow affiliates.
No idea what the point is of attacks like this - other than being annoying by swallowing up bandwidth and possibly attempting denial of service?
Rgds
Val
I've found a new "legitimate" bot that is taking a lot of bandwidth with absolutely no commercial benefit to us at all.
The user agent is:
SapphireWebCrawler/Nutch-1.0-dev (Sapphire Web Crawler using Nutch; http://boston.lti.cs.cmu.edu/crawler/; mhoy@cs.cmu.edu)
which is described here:
http://boston.lti.cs.cmu.edu/crawler/
The IPs I have are:
64.88.164.198
64.88.164.196
Two IP addresses, 5503 page views this year. Perhaps we could stop it at request level by blocking the range.
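If the crawler were to ignore robots.txt, the range could be refused in PHP before the shop code runs. This is only a sketch: ip_in_cidr is a hypothetical helper, the /30 range in the usage note is simply the smallest block that contains both addresses listed above (it is not a published range for this crawler), and it assumes IPv4 and 64-bit PHP 7.1 or later.

```php
<?php
// Hypothetical helper: is an IPv4 address inside a CIDR block such as "192.0.2.0/24"?
function ip_in_cidr(string $ip, string $cidr): bool
{
    [$subnet, $bits] = explode('/', $cidr);

    $ipLong     = ip2long($ip);
    $subnetLong = ip2long($subnet);
    if ($ipLong === false || $subnetLong === false) {
        return false; // not valid IPv4 input
    }

    // Build a 32-bit network mask with the top $bits bits set.
    $bits = (int) $bits;
    $mask = $bits === 0 ? 0 : (~0 << (32 - $bits)) & 0xFFFFFFFF;

    return ($ipLong & $mask) === ($subnetLong & $mask);
}
```

A page could then start with something like `if (ip_in_cidr($_SERVER['REMOTE_ADDR'], '64.88.164.196/30')) { header('HTTP/1.0 403 Forbidden'); exit(); }`, placed before any SW call is made.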
Sapphire follows the robots.txt protocol, so simply add the following:
User-agent: SapphireWebCrawler
Disallow: /
Hi Andy
Thanks for that info.
My quota usage is now much more in line with my impressions statistics, so all the blocking, and also the filename changing, are certainly worth doing.
As it would save resources at SW's end too, should this advice be included in the SW documentation? Prevention is better than cure, surely?
Rgds
Val
Hi again
Here's an update on how we've stopped my quota disappearing before my eyes:
As mentioned in an earlier posting, we changed certain filenames in Shop Window Client Software V2.0 because things like 'productlist' were probably being targeted by spammers and bots.
We stopped hardcoding URLs to our Shop Window Client Software V2.0 to stop harvesters following them and targeting pages. Instead we use PHP to generate the URLs that appear on the webpages.
Signed up with Project Honeypot (http://www.projecthoneypot.org/home.php) and contributed to it, thus gaining access to the realtime black list, and then wrote a program to block blacklisted visitors. This program (posted on http://fonant.mirrors.phpclasses.org/browse/package/5138.html) is for those with a limited understanding of PHP. It writes reports to text files or a MySQL database showing which visitors have been blocked and which allowed, by IP address. It also shows search engine visits - some of these are less than beneficial! You can also set threat levels to allow/deny access.
This piece of software runs on every page of my websites with no obvious overhead.
The combined effect has been successful.
Hope this helps other affiliates.
Cheers
Val
Hi
We received an email today saying that the blacklisting software Spam blocker (see link above) was ranked number 5 in the PHP Programming Innovation Award on the PHP Classes site for February 2009. The software is free to use. All you have to do is join PHP Classes.
Val
Got the same email.
It's nothing to do with spam bots; it's all about making sure the IP of your server doesn't get inadvertently blacklisted as an email spammer, by adding your domains to a whitelist.
There's such a thing as a whitelist? How would I go about getting my sites on them? I honestly haven't heard of this before. :o
Confuscius
05-04-09, 15:38
Certainly an interesting topic, especially to me as I have had various encounters with the scrapers and spammers over the last year that have, on occasion, left me tearing my hair out (still plenty left at my tender age!).
Anyways, I set up various routines to try to trap the 'bots' by passing IP addresses through to my clickrefs, the main reason being to identify 'real' visitors. The number of clicks reported through the affiliate management centre that are 'fake' is quite staggering and, to be brutally honest, if this behaviour repeats across all AW affiliates then the stats produced by AW are far off the mark compared to the reality I have discovered. It means my real conversion stats are a lot higher than one would have imagined (and if you do any sort of split testing then you need accurate underlying data to be able to draw conclusions, plus sufficient traffic to validate the tests!).
The bottom line is that it would be clearly in everyone's interest if everyone could implement some sort of 'spam/bot' control system. I have tried the php class approach and still find it interesting that bots to real visitors is something like 19 to 1 in my case, 1 in 20 visitors is NOT a bot!
Now, as we are all running PHP, I would have thought that most people have access to a MySQL database for logging purposes, and it would be nice to hook something into the SW software for all to use. The Project Honeypot idea works well, but as I also have a few blogs I am aware of an extension that goes a little further BUT also makes use of Project Honeypot data: Bad Behavior (http://www.bad-behavior.ioerror.us/documentation/porting-guide/). Although there is a WP plugin for this, it also comes with a general class but little guidance on how to hook it into a MySQL database. I am sure that ANDY (hint! hint!) could knock up the requisite files in a few minutes, and I am sure that quite a few of us would implement such functionality. With a few honeypots and links to honeypot traps we could collectively rid ourselves of those that target SW sites. My interest is twofold: better information in AW and less demand on my servers by stopping the larger scumbags in their tracks.
I am surprised that AW have not yet developed protection mechanisms for SW users, and this seems like a good time to try to do so. Given AW's response times, theirs would probably be ready for version 5.0, so a user-led initiative may be better and quicker.
Views?
*hint taken* Paul!
I had a look at the distro you pointed to and it's a little confusing. All the scripts used as demos are CMS specific, so without knowing those CMSs inside out, it's difficult to port it backwards.
But, fear not, I have a solution. Bad Behavior is now installed on one of my SW sites and appears to be running OK.
I'll nip out for a ciggy, make a brew, de-install it, then re-install with an easy "howto" for you and everyone else.
If this works as it's supposed to Paul, you've probably just saved SW millions of calls/day mate!
I'll post soon
Confuscius
06-04-09, 18:30
Hi Andy
Spot on! When I read the distro it was VERY confusing to me so I thought that it might be less confusing to you!
All those millions of saved calls could always be split to both your and my 'unlimited' calls allowance - we all know there is no such thing as unlimited!
I have now removed the 500 or so IP addresses that I was manually blocking and am letting honeypot have a go - I can read the IP from my click report if it gets past honeypot and then look it up in honeypot - what I have seen so far is lots of comment spam IPs that are fairly new, so we should be able to hone down the parameter tests further to customise the time frames to catch more spammers and hopefully not block valid visitors - needs a bit more research, but looks promising.
Not only are they a waste of resources but they create thousands of false clicks as they trigger the goto script which then fires off a request which ends up in the click report.
AW - please make all cheques payable to ....!
Andy - you may want to PM me the bits or call me at home/skype me so that I can do a test on Version 1 and we can investigate the parameter setting further PRIOR to release to others!
UPDATE: Andy has now sent me some bits and we are on the case! Watch this space!
Hi
Must have been a long tea break! Or were you distracted? Or is it somewhere else on AW, ready for us to use...
Spam blocker
And...not visited the forum for a long while, so missed Andy's response regarding the PHP class 'Spam blocker'. In fact this is not about keeping your own IP address off a black list - you must be mixing it up with another PHP class that was notified a bit later on.
The class I was talking about (http://fonant.mirrors.phpclasses.org/browse/package/5138.html) provides a solution to help PHP sites quickly block the IP addresses of spam robots, by using the Project Honey Pot black list.
This is what it does in techie talk:
This class can be used to check spammers' IP address in Project Honey Pot RBL (Realtime Black List).
It can perform a query to the Project Honey Pot RBL DNS server for a given IP address using your RBL API key.
The class analyzes the RBL response and sets a session variable if the current user's IP address is that of a spammer that should be blocked.
The blocked and allowed addresses are logged to a file or to tables of a MySQL database.
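For anyone curious what such a lookup involves, here is a rough sketch of the DNS query format documented by Project Honey Pot for http:BL. It is not the code of the class above; the function names are hypothetical, the access key shown in the usage note is a placeholder, and only IPv4 is handled. Assumes PHP 7.1 or later.

```php
<?php
// Build the DNS name to query: <access key>.<IP octets reversed>.dnsbl.httpbl.org
function httpbl_query_host(string $accessKey, string $ip): string
{
    $reversed = implode('.', array_reverse(explode('.', $ip)));
    return $accessKey . '.' . $reversed . '.dnsbl.httpbl.org';
}

// Decode an answer of the form 127.<days>.<threat>.<type>.
// Returns null if the answer does not look like an http:BL response.
function httpbl_parse(string $answer): ?array
{
    $octets = explode('.', $answer);
    if (count($octets) !== 4 || $octets[0] !== '127') {
        return null;
    }
    $type = (int) $octets[3];
    return [
        'days'            => (int) $octets[1], // days since last activity
        'threat'          => (int) $octets[2], // threat score (a search engine ID when type is 0)
        'search_engine'   => $type === 0,
        'suspicious'      => ($type & 1) !== 0,
        'harvester'       => ($type & 2) !== 0,
        'comment_spammer' => ($type & 4) !== 0,
    ];
}

// Perform the lookup. gethostbyname() returns its argument unchanged when
// there is no record, which here means the IP is not listed.
function httpbl_lookup(string $accessKey, string $ip): ?array
{
    $host   = httpbl_query_host($accessKey, $ip);
    $answer = gethostbyname($host);
    return $answer === $host ? null : httpbl_parse($answer);
}
```

A page might call `httpbl_lookup('youraccesskey', $_SERVER['REMOTE_ADDR'])` and refuse the request when the result is not null, is not a search engine, and has a threat score above whatever threshold suits the site. The threshold is a judgment call and needs testing against real visitors.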
I hope that between you, Andy and Confuscius, you can reduce the crawling of SW sites. We now seem to have one particular Googlebot attacking/searching only our SW (not the rest of our sites). It's using keywords, including ones that would only reveal SW sites (e.g. iListOffset=50), and peculiar search terms that are totally unrelated to the site specialism or just plain odd (e.g. Thankyou in the q string pair). What's all this about? Presumably not Google? But the IP address is a Googlebot one. Any idea how to screen this one out without blocking Google?
Cheers
Val
According to Wikipedia:
http://en.wikipedia.org/wiki/IP_address_spoofing
"IP spoofing is most frequently used in denial-of-service attacks. In such attacks, the goal is to flood the victim with overwhelming amounts of traffic,"
Sounds like someone does not like ShopWindow?
Unfortunately, if you kick out any IP that is also used by the genuine Googlebot, you end up number one in Google for the term 'this website is closed - please try again later', which is not what you want!
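One way to tell the genuine Googlebot from a masquerader is the reverse-then-forward DNS check that Google itself recommends: look up the host name for the IP, check that it ends in googlebot.com or google.com, then confirm that the host name resolves back to the same IP. A minimal sketch, assuming IPv4 and PHP 7 or later; the function names are hypothetical.

```php
<?php
// Does a reverse-DNS host name belong to one of Google's crawler domains?
function is_google_hostname(string $host): bool
{
    return (bool) preg_match('/\.(googlebot|google)\.com$/i', $host);
}

// Reverse lookup, domain check, then forward lookup to confirm the match.
function is_verified_googlebot(string $ip): bool
{
    // gethostbyaddr() returns false on malformed input and the
    // unmodified IP when no host name can be found.
    $host = gethostbyaddr($ip);
    if ($host === false || $host === $ip || !is_google_hostname($host)) {
        return false;
    }

    // gethostbynamel() returns a list of IPv4 addresses, or false on failure.
    $forward = gethostbynamel($host);
    return is_array($forward) && in_array($ip, $forward, true);
}
```

A visitor claiming to be Googlebot in its user agent but failing this check can be blocked without any risk to the real crawler. If the check passes, the traffic really is Google, and robots.txt is the safer tool for keeping it out of the shop pages. The DNS lookups are slow, so the result is worth caching per IP.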
By the way, does anyone know of a way to get the remaining quota from within a PHP program?
Hi again
As you say, blocking Googlebot IPs is not a good idea, even if it does block the dodgy masqueraders.
So I thought I could re-direct the crawlers to the opening page of my shop instead, which would stop the hits on SW and my quota, but should not affect genuine search engine results.
Any thoughts?
Cheers
Val
If you're only going to redirect the bad crawlers, you might as well block them altogether from the site. Blocking them doesn't just save your quota, but saves you precious (and often expensive) bandwidth.
If you're thinking of redirecting the good crawlers too (Google, Yahoo, MSN etc) then you would be defeating the object of them crawling your site and be de-indexed very quickly from all of them, especially if you used a 301 permanent redirect.
Confi and I are both testing a concept separately, trying to find the best and most reliable way of implementing it. You posted that we "Must be on a tea break" a full day after Confi posted that we were on with something. One day does not a reliable test make. Think more along the lines of 2-3 months before we'd be confident enough to present it to some others for beta testing. In the meantime, blocking the rogue IP addresses using .htaccess will suffice.
You can help with my side of the testing by going to SearchEngenie (http://searchengenie.co.uk) and having a look round. You don't have to do a great deal, just browse around so the system can get to know you. The more info we have, the quicker this thing could be rolled out.
Confuscius
11-06-09, 20:26
As Andy said this is a non trivial matter and you can guess my views on whether it is users or SW who should be directing resources to this issue!!!
Just a couple of quick examples - I have a 'trap' that passes through click refs of IP addresses where the referrer has NOT come via one of my pages - so it looks 'like' a bad bot you would think? Wrong!!!! It turns out that these click refs then get attached to real sales so what are they? Further investigation reveals that they are in fact people visiting via proxy servers who have presumably previously bookmarked a page so the referer is an IP address rather than an internal page.
I have one protection mechanism that links directly to Project Honeypot, and I have honeypot traps on some SW installs; on several occasions honeypot has told me that I was the first to highlight a new spammer! Most of these spammers are comment spammers, so of no use to anyone: they just scrape pages for links and forms to fill. They also traverse goto scripts and generate false clicks in SW reporting, to the extent that SW stats are basically very wrong and very misleading.
What Andy and I are trying to do is take a concept from WordPress, the Bad Behavior plugin, and hook this into SW AND also make use of Project Honeypot, which is remarkably effective - a sort of two-pronged protection system. BUT issues arise with internally generated SW links, and are made even worse IF you have mod-rewritten SW, so finding ONE solution that fits WHATEVER anyone has done to SW is non-trivial. May I suggest that ALL SW users go and join Project Honeypot and set up traps on their sites; then any bot that targets the SW footprint will soon be identified and blocked by what we are trying to do.
I am a bit busy until early next week but I will compare notes with Andy and we will see what we can suggest BUT if what we suggest screws up what you have done then do not blame us!!
Off for another cuppa!
Hi
The date format must be American then - or differ from posting to posting - as it looked like a 6th April posting by you (06-04-09, whereas mine show 07-06-09 and 08-06-09), so that made it a two-month interval to me....
Glad you are both on the case.
I am not going to block/redirect Google from my sites, but from the shop window area - surely there is no point in any search engine (legit or not) crawling it and coming up against search fields which must be a dead end from its point of view? Or have I got that wrong as well as the dates?
And I agree with Confuscius that as this benefits AW too, it would make sense for AW to come up with something we can all use asap.
Glad to say that project honeypot says our sites have highlighted and trapped various spammers too.
Enjoy that cuppa.
Cheers
Val
There's more on this topic in the 'Quota not working' thread in the Client Bug Reports forum.
After a few weeks, thought we'd finally worked out an automated system to keep out the spammers, but our quota has been hammered again overnight (13th/14th Sept 09).
Just checking the event logs now to identify the culprit(s), but has anyone else had any trouble?
Having checked our event logs, there's no obvious culprit as our filter seems to be working. We'd set up our own quota of shop visits per visitor and this worked well until yesterday. Now we've set up a quota for visits to our website pages too, as it looks like spiders were getting in through the side door.
Meanwhile, can I repeat the plea made by others on this forum? It would be much better if AW reset shop quotas in the morning, not the evening. The quotas tend to disappear when we're all asleep, and by the time we check it's too late, so the shop is out of quota for the whole working day.
Any reason why this could not be done?
Val
Have looked at the event logs and discovered that the search engines, spiders etc etc are following paths we hadn't expected/noticed that lead directly into the shop, where they are spidering everything in SW. Not surprising that we're running out of quota as the spiders follow every single link within SW.
Dare I say that we think we've blocked all the routes now? Until next time....
Routes? Side Door?
Spiders don't act like human users. They don't land on a page and click a link; they land on a page, EXTRACT ALL links and store them in a database alongside the page they've just indexed. Whether that page is on your website or not, if they find a URL anywhere they will extract it and use it, for as long as the URL resolves to a page without a 404 status code.
They then hit each new link they've stored by going directly to the URL. No HTTP_REFERER at all. It's as if they've typed the URL in the address bar, which they kind of have.
You could block all requests that don't have an HTTP_REFERER if you like; that would stop 'em all (and I mean ALL robots, not just bad ones):
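// Reject any request that arrives without a Referer header.
// PHP exposes it as HTTP_REFERER (single R, as the header is spelt in HTTP).
if (empty($_SERVER['HTTP_REFERER'])) {
    header("HTTP/1.0 404 Not Found");
    exit();
}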
That will block anyone who typed in the URL, or is Google, Yahoo etc, but it will also get the baddies. It'll drop the page out of every index, but still allow real people to view the 'shop' through the links on your 'website'.
It's a very archaic solution, but it sounds to me like you're only bothered about your existing visitor traffic rather than natural search engine traffic.