FQuest Alert: TAZ


Terra
08-04-2000, 11:01 AM
At 9:55am EST, TAZ suffered from a NIC driver memory overrun which caused a cascade effect eventually locking up the server... This requires a full power down and reboot to reset the network interface and its driver... Currently there is no known workaround or fix to this problem other than a full reset...

TAZ finally came back online around 10:52am after its lengthy startup disk checks...

Our apologies for any inconvenience this may have caused...

--
Andrew Gillespie
Systems Administrator
FutureQuest, Inc.

Terra
08-04-2000, 02:46 PM
Looks like TAZ is getting slammed with multiple issues today ranging from a NIC driver going AWOL, to users overrunning the server with abnormally high process counts...

I currently have the server under a microscope... Both lockups today followed the same pattern, which I am actively watching for to prevent any further problems...

A special thanks goes out to Justin for being on top of both situations and handling the issues in an expedient manner...

TAZ is now back online and running at normal levels, but as before the tide could turn instantly...

If TAZ continues to have the same NIC problem, then the existing server hardware will be taken offline and replaced with a new server that is waiting in the wings...

--
Terra
sysAdmin
FutureQuest

Terra
08-04-2000, 03:17 PM
The offending domain on TAZ has been isolated and their CGI bin deactivated...

You will be receiving a TOS email shortly...

--
Terra
sysAdmin
FutureQuest

DB
08-06-2000, 12:07 AM
I always get a little chill run down my spine when I read things like that. It's not that I don't fully support your actions, I do trust your judgment. But as a developer, particularly one who has been writing more and more code for increasingly complex sites, I always try to learn from other people's mistakes so as not to repeat them. Are there any lessons to be learned from this instance, or was it strictly a case of server abuse?

------------------
--Tom aka DiamondBack
  http://diamond-back.com
  http://smartasses.org

Terra
08-06-2000, 12:56 AM
It's more of a privacy issue (and respect) for the site owner... There are very rare cases where it will be brought to the public forums and usually the person initiates the conversation there instead of visiting our support desk...

I just sent out an email concerning this, and it should be general purpose enough that I can repost my reply here...

-----
The problem in this instance was that 56 requests were made, but the script was waiting for an outside connection (dmoz.org) to pull further information... At that time, dmoz.org was offline and the script did not test for this fact, thereby continuing to wait for information... This tied up the Apache process, and there is a maximum of 255 children that can be spawned at any time... An Apache child should at most run for 20 seconds before being freed up for another connection, but sometimes poorly written CGI can circumvent that, creating an innocent DoS (Denial of Service)...

The moral of the story is:
If you depend on content from another website, make sure you have a timeout in place for non-responsive sites and respond accordingly... To the surfer, it just appears to be hung up and they continue on - while the child is still waiting for the defunct connection... Apache does not have a mechanism to kill a script while it's in this state as the CGI script is not consuming CPU time, but rather consuming 1 connection slot of 255... With an active Apache pool, this can add up rather quickly and cause problems for others...
-----
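
For anyone wanting to see what "a timeout in place" might look like, here is a minimal sketch in Python (the offending script's language was not stated, so this is only an illustration). The names fetch_remote and render_listing, the 5 second limit, and the fallback text are all assumptions made for the example, not anything taken from the actual script...

```python
import http.client
import urllib.request


def fetch_remote(url, timeout_seconds=5.0, fallback=None):
    """Return the response body as bytes, or `fallback` if the remote
    site is down, refuses the connection, or stalls.

    Note: the timeout applies to each blocking socket operation
    (connect, read), not to the request as a whole.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
            return response.read()
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return fallback


def render_listing(url):
    """Hypothetical CGI helper: give the surfer an answer either way,
    so the Apache child is released instead of waiting indefinitely."""
    body = fetch_remote(url, timeout_seconds=5.0)
    if body is None:
        return "The remote directory is temporarily unavailable."
    return body.decode("utf-8", errors="replace")
```

The point of the design is simply that the script gives up on its own and sends something back, which frees the connection slot even when the outside site never answers...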

CGI deactivation is a last ditch reaction to stop problems in progress...

Steps I follow:
1) Deactivate the CGI bin
2) stop the Apache daemon
3) check for any hung/zombied CGI processes and clear them out
4) start the Apache daemon
*Time for 2 to 4 is anywhere from 5 to 30 seconds, but never more than 30 as I will start it anyways and work around the situation...
5) Reactivate the CGI bin
6) Monitor closely for any further bad behavior
7) If behavior continues, then terminate CGI bin

I feel that we are **extremely** lenient with CGI developers on our servers and go the long extra mile to accommodate them instead of running analistic reaper style programs...

Our actions are primarily self-preservation (Guardian) based as it's a double-edged sword... If it was not your CGI bin, then we are scolded for allowing them to run or continue... If it was your CGI bin - then we are scolded for being too restrictive... *sigh*
**Not 'yours' per se, just the motion of words**

In all fairness, I personally take the time to evaluate the situation and take appropriate action... Occasionally we encounter site owners with chronic CGI problems and those must be handled differently... They somehow belong to the 'Broken script of the month club from CGI resources'...

In conclusion, CGI is exactly that: 'Dynamic'... One minute everything is peachy keen, next moment it can all go to... That is just the nature, and I've spent many of my waking hours designing servers that accommodate the diverse and demanding needs of CGI execution... For the record, over 94% of all sites we host run CGI in one form or fashion, with about 25% of them being beached whales...

--
Terra
--Why this thread to write a novel in--
FutureQuest

frankc
08-08-2000, 10:29 AM
Terra, thanks for the education on how things work "behind the scenes".

Would a list of known "bad boy scripts" be useful--that is, an alpha list of stock scripts *and* the version number that are known to be boinked? That'd be a handy resource to monitor before we "just do it". ;) You certainly don't want us to ask before using every single cgi script.

  Frank
==thanks for the thorough, well-thought-out behind-the-scenes work!==