can you put my cgi back please?


esllou
12-07-2003, 01:07 PM
would appreciate it if you could re-activate my cgi-bin. Have e-mailed support for last three hours without luck.

I have deleted the script that you objected to. We can discuss that script afterwards but I would just like my cgi-bin back up and alive as at the moment, I have no:

- guestbook
- forum
- site search
- CNC and therefore newsletter sending ability
- newsletter subscription
- no amazon links working due to that script being pulled

so I now have a site which doesn't work to all intents and purposes and even when it is back up and working, no amazon means it can't pay its way. :waa:

I do understand where you guys are coming from on this one. It is just that I did all that was possible to prevent my cgi being hit by spiders, both through robots.txt and .htaccess. I even put a NOINDEX for seven days meta tag in the html pages produced by the script! :(

I hope we can work things out....just give me back my cgi please!! :)

Terra
12-07-2003, 01:13 PM
What timing...

I was working up a followup to the FAN sent last night, with a more formal explanation of what happened... I am still sifting through the rubble though and will take a bit of time...

One of the first things I had done when arriving at the office was to re-enable your CGI ability (completed at 12:51 pm EST) since you had agreed to remove the offending script...

--
Terra
--sleeping is evil--
FutureQuest

esllou
12-07-2003, 01:23 PM
ok, great.

now we can start a civilised discussion about the future of THAT script.

I wanted to add a mod_rewrite to it which I then realised, due mainly to your intervention, would have caused mayhem on the server once I was hit with Google spiders.

So I never went down that road. And further, I proactively updated both my .htaccess and robots.txt to try and keep both Google and other spiders out of my cgi-bin in general and that script in particular.

Is there any way we can proceed from here?? What was the particular problem last night? What spider was it?

My site without amazon is a non-site. Simple as that. I can't picture a situation beyond having to shut the site down. Would I be entitled to a refund from FQ for unused credit? Pains me to be in this situation really.

The script in question is not one of these scripts you can find around which generates pages on the fly....it is an Amazon web services tool which I use in conjunction with .shtml pages which were generated automatically. I am sure you are well aware of how the script has been used. Would it be possible to put the script back on site if I delete the .shtml pages it has been used together with??

Hoping for a solution....thanks for the cgi reactivation

Terra
12-07-2003, 01:49 PM
The script has no rate or limiting controls built into it, and is highly susceptible to vast amounts of abuse...

Its runtime averages 3 seconds, with a memory footprint averaging 6.7MB...

It does not take much to hose that script... External blocking, as you have already done, is simply not enough... The rate limiting controls must go into the script itself, since our SRC (Spider Rate Control) is too liberal to keep this script well behaved...

Your script is rare in that it falls just outside of our SRC zone and if it was configured to trap your script - SRC would unfairly penalize 95% of all other scripts executed on our servers...

We had issues from 3 different IPs last night:
203.76.197.150
203.76.199.235
203.76.200.34

We would firewall the IP, then the person would drop the IP and pick up a new one... Each time, our Server Guardian had to step in and shut down *everyone's* CGI when the server load hit 40.00+, all due to one script, 'amazon_products_feed.cgi'... It became a whack-a-mole situation...

Overall, we have been having off-and-on problems with the 'amazon' style scripts and this is not the first time this particular script has ended up in the line of fire...

Even if you were to move, this problem is only going to cause issues elsewhere - creating a vicious cycle... The ultimate solution is to add the rate limiting controls to your script, since that is the correct and responsible thing to do... Consider it an evolutionary change...

--
Terra
--it is like adding a safety to an uzi--
FutureQuest

esllou
12-07-2003, 02:26 PM
what numbers are we talking about as far as making it "safer"...the script, that is.

if I can make it friendly only to humans, I will. Even if it means blocking the great Google in the sky.

if you can give me some hard numbers, I can go to the script support forum and begin to get it sorted this evening.

One other thing....was it not possible to disable just THAT script last night?? Or is that an ignorant question? %)

mromero
12-07-2003, 02:33 PM
I am curious as to which script is being referred to. I recall reading on Webmaster World about someone who had used a script to corner (temporarily) the market on a particular Google search word. That is until he was found out and banned by Google.

Regards

esllou
12-07-2003, 02:41 PM
this is the amazon_products_feed.cgi which is very very VERY widespread on the net as a tool to implement Amazon Web Services.

It is apparently a tad too fast though and needs to be slowed down...

It is NOT the hack of the selfsame script which causes Google to index hundreds of thousands of mod_rewrite versions of the standard apf pages, i.e. pages without the parameters that make the URLs so SE unfriendly.

Terra
12-07-2003, 03:05 PM
what numbers are we talking about as far as making it "safer"...the script, that is.
Two posts to reference first:
SRC Nutshell
http://www.aota.net/forums/showthread.php?postid=93458#post93458

Rate Limiting Metrics:
http://www.aota.net/forums/showthread.php?postid=99212#post99212

General Goal for Rate Limiting:
No more than X 'search' instances per IP
No more than Y 'search' instances in operation total (all IPs inclusive)
Deny 'search' request if (1 min OR 5 min) server load is above 3.0 (grokked via /proc/loadavg)
**Using either 1min or 5min, both have their pros and cons due to differences in rise and decay rates
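As a rough illustration of the load check only: the sketch below is in Python rather than the script's own language, and the function names and the `use_five_minute` switch are made up for the example. It assumes a Linux host where `/proc/loadavg` begins with the 1, 5 and 15 minute load averages.

```python
def parse_loadavg(text):
    """Return the (1 min, 5 min, 15 min) load averages from /proc/loadavg text."""
    fields = text.split()
    return float(fields[0]), float(fields[1]), float(fields[2])


def load_too_high(text, limit=3.0, use_five_minute=False):
    """True if the chosen load average is strictly above the limit."""
    one_min, five_min, _ = parse_loadavg(text)
    chosen = five_min if use_five_minute else one_min
    return chosen > limit


def read_loadavg(path="/proc/loadavg"):
    """Read the raw text of /proc/loadavg (Linux only)."""
    with open(path) as handle:
        return handle.read()
```

A script would call `load_too_high(read_loadavg())` before doing any real work. The 1 minute figure reacts faster to a spike, the 5 minute figure decays more slowly afterwards.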

if I can make it friendly only to humans, I will. Even if it means blocking the great Google in the sky.
Extremely difficult to do, I can assure you... Most of the problems now are caused by 'ill willed' spider operators that try to cloak themselves as 'humans'... This forces server/script operators to look at the heuristics of their behavior instead of relying on specific items or pattern recognition...

e.g.:
"If it walks like a duck and talks like a duck, then it must be a duck"
no longer applies... :(

The main problem with Googlebot now is that it comes in via spider (wolf) packs... Each spider alone is server & script friendly, but have about 10 Googlebots (nicely) spidering your site, and you have a resource intensive situation happening...

I am not faulting Google over this, as the vast quantity of information they have to spider and index is astronomical... Therefore they do have to scale the task by issuing multiple bots to keep up with the web... I for one do not see an easy solution for Google, as what they are doing for the Internet is a GoodThing™...

if you can give me some hard numbers, I can go to the script support forum and begin to get it sorted this evening.
X = 3
Y = 10
SL <= 3.0

The above values will offer your script a bit of latitude to operate, without thrashing the server...

'Y' is high enough to cover bursting, however this does not mean it can be a 'sustained' load with 10 of your scripts in memory all the time... The 'SL' should help to ensure that the 'Y' is for bursts only.
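The three numbers could be combined into one admission check along these lines. This is a sketch in Python, not the script's own code: the `admit` function and the `active_by_ip` dictionary are hypothetical, and how the running instances get counted (lock files, a small database, etc.) is left open.

```python
MAX_PER_IP = 3    # X: 'search' instances allowed per IP
MAX_TOTAL = 10    # Y: 'search' instances allowed in total, all IPs inclusive
MAX_LOAD = 3.0    # SL: deny when the server load is above this


def admit(active_by_ip, client_ip, load):
    """Decide whether one more 'search' instance may start.

    active_by_ip maps an IP address to its number of running instances
    (assumed to be tracked elsewhere).
    """
    if load > MAX_LOAD:
        return False
    if sum(active_by_ip.values()) >= MAX_TOTAL:
        return False
    if active_by_ip.get(client_ip, 0) >= MAX_PER_IP:
        return False
    return True
```

The load test comes first because it is the cheapest to evaluate and protects the whole server, not just this one script.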

One other thing....was it not possible to disable just THAT script last night?? Or is that an ignorant question?
Disabling a single script is extremely dangerous, therefore it is now our policy to completely disable any/all CGI activity...

The main problem is that another one of your scripts may be dependent upon the script we just deactivated... If it cannot execute that deactivated script, then it may not handle that condition gracefully (lost $$$ Orders)... The risk of data corruption (or lost data) is much higher when individual scripts are deactivated... By freezing the CGI, we remove pretty much all risk (and liability) of a BadThing™ happening...

Think of a shopping cart where the main cart scripts are active, however we have just deactivated the 'credit card processor' script... This is just one simple scenario, of millions, where things can just plain go wrong when a single script has been deactivated...

In short, we play it safe for all parties involved...

--
Terra
--But your honor, FQ got their Peanut Butter in our Chocolate - we'd like 1 million please--
FutureQuest

<EDIT: libel != liable>

esllou
12-07-2003, 06:01 PM
is there anything external I can add to slow the script down and put a limit on it? Any general Unix command I can put in my .htaccess or at the top of the script itself????

dank
12-07-2003, 06:30 PM
If this is one of the typical Amazon XML scripts, I've got a few thoughts on the matter. First, even if the server allowed it, I wouldn't recommend running it as is for the simple reason that it can be incredibly tedious for the visitor. Sometimes it's snappy, but other times there are very lengthy delays that can keep an entire page from loading. Not good.

My solution was to set up a customized version of one of the scripts to run nightly via a cron job, grab the XML feed, copy the product images locally and update the database with product info (in case prices or anything changed), and serve up that product info when the pages are visited. The odds of a title or price changing more than once in a day are pretty slim, so it ought to be as good as a live feed as far as the visitor is concerned.

Now, I'm only doing this for half a dozen books, so the nightly update is pretty minimal. If you're trying to pull the feed for a large directory, that could be ugly. Maybe pull the most oft-visited products nightly, and the less popular ones on demand? That would lessen the server load due to regular visitors, but it still wouldn't address spider control...

Dan

esllou
12-07-2003, 07:44 PM
well, I just got word back from the script author that he has "no idea" how to put rate control into it! Not too inspiring.

So, how can I limit this script externally?

Any ideas?

Terra, I suspect the answer lies here:

grokked via /proc/loadavg

care to enlighten me??

Terra
12-07-2003, 08:17 PM
So, how can I limit this script externally?
You can't, unless you write a wrapper around it...

I'd suspect that if the author cannot add this to the script, you may either:
a) do it yourself
b) hire someone to do it for you

In either case, I'd recommend resubmitting the work back to the author so that the whole world can benefit... ;)

I suspect the answer lies here:
Only part of the answer... It is only a component of the full solution...

--
Terra
--operating a 3rd party script carries a world of responsibility and usually no warranty--
FutureQuest

Terra
12-07-2003, 08:27 PM
Also, for a trip down memory lane:
http://www.aota.net/forums/showthread.php?postid=95523#post95523

These Amazon style scripts are really causing our servers various loading problems, mostly because so many of these types of scripts are just plain poorly written...

It was not too long ago that we had to literally outright ban one certain type of 'amazon.pl' that was causing a ton of heavy spiking...

Overall, we are very liberal with scripts that are operated on our servers until they prove themselves otherwise... Hopefully, when we do have to ban them, the site owner will know that it is for a very good reason and that we are dealing with a serious and excessive condition...

--
Terra
--existence is defined from ones ability to recall prior events--
FutureQuest

Wassercrats
12-07-2003, 08:34 PM
General Goal for Rate Limiting:
No more than X 'search' instances per IP
No more than Y 'search' instances in operation total (all IPs inclusive)
Deny 'search' request if (1 min OR 5 min) server load is above 3.0 (grokked via /proc/loadavg)
**Using either 1min or 5min, both have their pros and cons due to differences in rise and decay rates
The code to do that would be pretty portable and easy to integrate into a script, as long as the language is the same. Isn't there a stand-alone script somewhere that esllou could cut and paste into the Amazon script? There should be. The internet seems too young sometimes.

esllou
12-07-2003, 08:48 PM
yeah, that's what I hoped!!

anyone???

Wassercrats
12-07-2003, 08:59 PM
If you don't have a lot of traffic to that script, maybe an easier alternative would be to lock it for a few seconds after it's executed, so only one instance of the script could run at a time. To lessen the amount of memory used per second, some sleeps could probably be inserted somewhere. You still might need someone to do it for you, but those things might make it easier for them and cheaper for you.
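One way the "only one instance at a time" idea could look, sketched in Python with a non-blocking `flock`. The `try_lock` helper and the lock file path are invented for the example, and it assumes a Unix host. Unlike the suggestion above, it holds the lock for the duration of the run rather than for a few seconds afterwards.

```python
import fcntl


def try_lock(lock_path):
    """Return an open, exclusively locked file, or None if someone else holds the lock."""
    handle = open(lock_path, "w")
    try:
        # LOCK_NB makes flock fail immediately instead of waiting for the lock.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        handle.close()
        return None
    return handle
```

A script would call `try_lock` at the very top, bail out at once if it gets `None`, and otherwise keep the returned file open until it finishes. Closing the file (or the process exiting) releases the lock.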

esllou
12-08-2003, 12:14 PM
ok Terra,

this is what I have heard back from the script authors on the issue:

that's the question we're trying to figure out. you can tell an instance of the script to sleep with the wait command but that doesn't stop it from sucking up memory or stop a bot from making another request and opening another instance.

once the script is open the damage is done. limiting the number of concurrent instances must be handled before the script is run. a web server mod could do this. i just don't see how this could be handled from within the script itself.


I would really like two things to happen.

a) for this script to get back working
b) for it to get back working in a slow, easy way that is of no disturbance to anyone else on my server. I am not a selfish pig! :D

Any room for manoeuvre?

By the way Terra, just so you know...as you probably already do. I have the same script running on my other FQ site but the whole site is a smaller operation. I know this wouldn't stop it being attacked like my esl-lounge.com script was the other night.

But one request - if you want me to pull it, can you give me some time to sort out this current problem first?

One other thing. If I do have to pull my site off, there not being any revenue coming in now, can you tell me what the position is as far as unused credit is concerned. I am paid up to Sept. 2004!!! You can sticky mail me or e-mail me if you would prefer not to discuss individual account details here on the BB.

esllou
12-08-2003, 12:23 PM
another thought Terra.

how would it be if I renamed the script.

I suspect this script is pretty well known on the net now and while it certainly isn't as well known and vulnerable as wwwboard or form mail, renaming it would help with a lot of bots that trawl sites looking for apf in the cgi bin.

sorry....pulling my hair out here :\

Stephen
12-08-2003, 01:56 PM
for the kind of script in question to go from being a resource hog, to one that's quite manageable (if it works the way i think it does), all the author needs to do is add file caching to their program. if each remote URL request can be mapped to a unique local file path then the page can be retrieved, converted to HTML if it isn't already, and cached for local serving for the next hour (or however long the caching period is set to). No matter how many visitors you get in the cache period, another network retrieval is unnecessary.

so dank's approach is basically right, but the author should implement it in the script itself. i've used this approach for parsing remote XML feeds. the only complication that i can think of is converting the URL request into a unique filepath for caching the results. most of those requests have lots of (essentially) query string parameters, which have to be converted to a file path. the author would have to figure out a way to do that. it just takes a little thought. they could even hash the request parameter list and chop it into little pieces to construct a unique file path. that's not hard. i do that too for file uploads on one project.
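The "hash the request parameter list and chop it into little pieces" step might look roughly like this in Python. The `cache_path` name, the `cache` directory and the choice of SHA-1 are assumptions for the sketch, not details from any of the scripts discussed.

```python
import hashlib
import os
from urllib.parse import urlencode


def cache_path(params, cache_root="cache"):
    """Map a dict of query parameters to a stable file path under cache_root."""
    # Sort the parameters so the same request always hashes the same way,
    # whatever order the parameters arrived in.
    canonical = urlencode(sorted(params.items()))
    digest = hashlib.sha1(canonical.encode("utf-8")).hexdigest()
    # Use the first hex characters as subdirectories to avoid one huge directory.
    return os.path.join(cache_root, digest[:2], digest[2:4], digest + ".html")
```

The subdirectories would still have to be created (for example with `os.makedirs(..., exist_ok=True)`) before the first write.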

rate limiting doesn't offer the prospect of severe reduction in server resources, while local caching of HTML pages certainly does. i recommend you point this out to the author. he won't be able to implement an overnight solution, but a week's worth of programming/testing of a caching system might make his script far more usable to everyone.

dank
12-08-2003, 02:15 PM
so dank's approach is basically right
Thanks, I was starting to wonder if I had actually posted it...

Dan

esllou
12-08-2003, 02:52 PM
dank, stephen...thanks for those suggestions. dank, I hadn't even seen your post. There must have been two posts I hadn't seen and when I clicked the time of the last post on the forum index, it took me straight past yours. Sorry for that!

I will get back to the author although I am not too hopeful. There is already some sort of caching device on the script but it clearly isn't powerful enough. I don't know how the caching works but when you consider the hundreds of thousands of items that amazon has, the chances of caching the same product that someone else wants a bit later are quite small.

Terra
12-08-2003, 03:22 PM
once the script is open the damage is done. limiting the number of concurrent instances must be handled before the script is run. a web server mod could do this. i just don't see how this could be handled from within the script itself.
*sigh* - figures they would take that (easy-way-out) stance... :mad:

I am now pretty much concerned about the aptitude/skill level of the program's author - and whether such a program should even be publicly available to begin with... Seems that the author is following the stick-your-head-in-the-sand-avoidance-plan and pray-it-just-goes-away...

No - the damage has not already been done yet - only the cost of spawning the CGI script...

If there is rate limiting built into the script, it will execute these functions *first*... If the spider has exceeded the limits imposed, then kick back a '403' status code... If the spider gets caught in a loop, then redirect them to microsoft.com or something and let them spin their wheels over there...
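The "check first, then kick back a 403" ordering can be sketched as follows. This is Python for illustration only. `respond` and `run_search` are hypothetical names, and the `allowed` flag is assumed to come from whatever rate limit check the script uses.

```python
def respond(allowed, run_search):
    """Return the full CGI response text; run_search is only called when allowed."""
    if not allowed:
        # Cheap exit: the expensive search never starts for a denied request.
        return (
            "Status: 403 Forbidden\r\n"
            "Content-Type: text/plain\r\n"
            "\r\n"
            "Too many requests.\r\n"
        )
    return "Content-Type: text/html\r\n\r\n" + run_search()
```

The point of the design is simply that the slow part sits behind the check, so a denied request costs only the spawn of the CGI process.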

One of my favorite quips is:
"No Bond, I expect you to die!"

In short, if your script has ~3 seconds (sometimes more) of runtime, then by running the rate limit code first, the total runtime of a denied request would be well under 1 second, instead of a guaranteed 3 seconds for every invocation... It is not so much the memory that is hurting, but rather the runtime that is causing the log jam...

If the spider is really brain damaged, then when it reaches a point - kick out an email to yourself... Most likely by this time, it has already hit my radar screen and will most likely earn a slot in my firewall...

the chances of caching the same product that someone else wants a bit later are quite small.
Caching is not going to help in an appreciable way, since the behavior of spiders is to walk through a site consecutively... The only way caching would help here is if two spiders walked the same path back-to-back, where spider #2 would catch the items cached via spider #1's romp...

Overall though, it is a good idea to pull from a local cache, if Amazon provides a way to check to see if your cached item is current or stale... If stale, toss the item and re-retrieve from Amazon, else avoid the network interaction and provide the cached item...
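A sketch of the "current or stale" decision in Python. Since the thread does not establish whether Amazon offers a freshness check, this version simply assumes a fixed maximum age (one hour, borrowed from the caching period suggested earlier in the thread). `is_fresh`, `get_page` and `fetch` are invented names.

```python
import os
import time


def is_fresh(path, max_age_seconds=3600, now=None):
    """True if path exists and was modified within max_age_seconds."""
    try:
        modified = os.path.getmtime(path)
    except OSError:
        return False  # no cached copy at all
    current = time.time() if now is None else now
    return (current - modified) <= max_age_seconds


def get_page(path, fetch, max_age_seconds=3600):
    """Serve the cached copy if fresh, otherwise call fetch() and store the result."""
    if is_fresh(path, max_age_seconds):
        with open(path, encoding="utf-8") as handle:
            return handle.read()
    page = fetch()  # the slow network retrieval, only when stale or missing
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(page)
    return page
```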

--
Terra
--silicone - breakfast for the timid--
FutureQuest

dank
12-08-2003, 03:28 PM
if Amazon provides a way to check to see if your cached item is current or stale...
I'm not sure even that would be a big enough improvement. My experience with the Amazon XML feeds is there can be very lengthy delays communicating with the Amazon server from time to time. That would likely tie up the cache check, unless the script were written intelligently enough to give up after a set period of time (similar to what I've been trying to find a way to do in the PHP forum).
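For the "give up after a set period of time" part, a minimal Python sketch might be the following. The function name, the 5 second default and the injectable `opener` argument are assumptions for the example. Note that the `timeout` passed to `urlopen` applies to each blocking network operation, so it bounds a stalled server rather than the total transfer time.

```python
import socket
import urllib.error
import urllib.request


def fetch_or_give_up(url, timeout_seconds=5.0, opener=urllib.request.urlopen):
    """Return the response body as bytes, or None if the server is too slow or unreachable."""
    try:
        with opener(url, timeout=timeout_seconds) as response:
            return response.read()
    except (socket.timeout, urllib.error.URLError):
        return None
```

On `None`, the caller could fall back to whatever cached copy it already has instead of making the visitor wait.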

Dan

Stephen
12-08-2003, 03:50 PM
true enough, you can't cache the pages of millions of items. if you can't predict approximately (i.e. mostly) what the user is retrieving then caching loses its value. on the other hand, if you are specializing in a genre, such as "music of the sixties" the set of possible pages to cache goes down dramatically and the likelihood that two users are interested in the same item goes up.

personally, i think that sites that mirror most of what Amazon.com has to offer are somewhat pointless. a well-focused interest site that adds content that Amazon cannot, on top of the offered items makes much more sense, and is more likely to be useful/successful in the long run. these sites would probably focus on less than a few thousand items, where caching makes sense. and, of course, you can't cache everything. the results of a "search" are usually entirely unpredictable, though detail pages for the items listed might well be cached.

esllou
12-10-2003, 10:53 AM
if anyone knows this script, amazon_products_feed.cgi and would like to earn a few bucks making it "safe" to avoid the problems discussed above, get in touch with me through my site, esl-lounge.com, or through this forum. This script is vital for the financial survival of the site.

Thanks

Wassercrats
12-10-2003, 04:45 PM
Here's (http://www.webmasterworld.com/forum88/119.htm) something in PHP. Maybe you could get someone to integrate it with Amazon's script. There's a post in that thread that gives a tip on how to convert part of it to Perl. I haven't read most of the thread though.

You'll have to click a Google link to get it. http://www.google.com/search?sourceid=navclient&ie=UTF-8&oe=UTF-8&q=%22detects+rapid+multiple+accesses%22

Even better--takes you to the first page of the thread: http://www.google.com/search?hl=en&lr=&ie=UTF-8&oe=UTF-8&safe=off&q=%22webcrawlers+which+don%27t+honour%22

esllou
12-10-2003, 09:03 PM
thanks Wassercrats. That looks the sort of thing I am after.

Terra, is the script mentioned in these WebMasterWorld threads that Wassercrats linked to sufficient to safeguard the server????

esc
12-12-2003, 04:01 AM
You could perhaps embed the link to your script into a small Flash movie which will prevent bots from triggering the script while imposing no problem for most (> 98%) human users. I doubt that bots are so sophisticated that they can parse SWF files.

You might even secure the cgi-bin with password and call your script with ‘user:password@mysite.com/cgi-bin/myscript’ so that it cannot be used directly.

Another, more elaborate option would be to recode the whole thing in Flash ActionScript, which is quite good at XML parsing. So the load would be shifted from the server to the client side. Perhaps some people are already working on this, as you can find free Flash newsfeed readers that parse RSS, which is a similar technology.

Erich

esllou
12-12-2003, 07:41 AM
thanks for that esc.

will be a looooooong weekend thinking about all of this.

have a good one yourself....

mromero
02-15-2004, 02:44 PM
Following up on this post, anyone using the Anaconda or the Cusimano scripts for Amazon? As these are commercial scripts as opposed to free, presumably they are more server friendly?

Regards

mromero
02-29-2004, 06:15 PM
Well in case anyone is interested I installed the Cusimano script and after a couple of days (I'm no Linux geek but if I plug away long enough I can usually get complex things done), I managed a simple setup that is very kind to Taz.

Basically I run the script three times a day via cron to retrieve 20 items. I then cache this html, images etc. locally. So visitors get a static page that loads fast. Eventually I will maybe retrieve like 100 items and reduce the updates from 3 times to once a day.
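The "cron job writes a static page" setup could end with a step like the one below, sketched in Python. The `publish` helper, the paths and the crontab line are all hypothetical. The idea is only that the finished page is swapped in atomically, so a visitor never loads a half-written file.

```python
import os
import tempfile

# Hypothetical crontab entry, three runs a day:
#   0 6,14,22 * * * /usr/bin/python3 /path/to/refresh.py


def publish(page_html, target_path):
    """Write page_html to target_path atomically."""
    directory = os.path.dirname(os.path.abspath(target_path))
    # Write to a temp file in the same directory, then rename over the target.
    fd, temp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(page_html)
        os.chmod(temp_path, 0o644)  # mkstemp creates 0600; the web server must be able to read it
        os.replace(temp_path, target_path)
    except BaseException:
        if os.path.exists(temp_path):
            os.remove(temp_path)
        raise
```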

It was not as difficult as I had feared - what took me so long is an apparent bug with the old version of Red Hat that Fquest uses and Lynx which is used to retrieve the Amazon pages. Got around that by using GET.

Regards