32 min Alert
Andrew is down in the server room now...
For some reason the guardian did not catch the load in time and we went down. Time to reboot and bring it all back up was about 32 minutes....
Andrew is re-writing the guardian script and seeing if he can find out why it didn't alert to the load....
All is back up and running as it should be at this point.
Our apologies for this one....
Deb
Justin
01-18-1999, 12:48 AM
Load - meaning like a huge rush of hits or whatever? Just curious.
Justin
I would have to wait for Terra to return to say for sure... he'll have a better idea of what happened.
From my end -- I was sitting here working on a script for personal use and watching the server work... everything was running fine and no 'red lights' indicated a problem for me... then it just went down, which surprised me because the guardian should have caught it, but it didn't :(
I know that the move of the log files helped quite a bit and as we watched it run the logs we were quite pleased with the results... I also know that the next server that is being built now is needed.. but our current load shouldn't have brought us down.... this is just the long way around saying "I really don't know what happened, but we needed to reboot".
What I am learning quickly is that if I were searching for a new host, I wouldn't put too much stock in a 99% uptime guarantee, because when we go down it devastates me every time (as it should!) and we have still held well above 99% uptime every month. I couldn't imagine what it would be like if we really were only up 99% of the time :/ (noting that I'm talking about the server itself and not outages outside our control)
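To put rough numbers on that, here is a back-of-the-envelope sketch in Python. It assumes a 31-day month and counts only this one 32-minute outage; the function names are made up for illustration.

```python
MINUTES_PER_DAY = 24 * 60


def uptime_percent(downtime_minutes, days_in_month=31):
    """Percentage of the month the server was up, given total downtime."""
    total = days_in_month * MINUTES_PER_DAY
    return 100.0 * (total - downtime_minutes) / total


def allowed_downtime_minutes(uptime_pct, days_in_month=31):
    """Downtime (in minutes) that a given uptime guarantee still permits."""
    total = days_in_month * MINUTES_PER_DAY
    return total * (100.0 - uptime_pct) / 100.0


if __name__ == "__main__":
    # One 32-minute outage in a 31-day month (assumption: no other downtime).
    print(round(uptime_percent(32), 3))
    # How much downtime a "99% uptime" guarantee would actually allow.
    print(round(allowed_downtime_minutes(99.0) / 60, 2), "hours")
```

Under those assumptions a single 32-minute outage still leaves the month at roughly 99.93% uptime, while a bare 99% guarantee would allow a bit over seven hours of downtime in the same month.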
At any rate -- there will be another server online soon and that will mean twice the power and twice the resources of what we have now... we are doing everything in our power to improve, improve, improve... hate these glitches/setbacks!
Deb
Terra
01-18-1999, 01:29 AM
Load is difficult to quantify...
What I monitor are several key components of the server operation...
Memory/CPU/Disk/zombies/sleeping/non-interruptible/run-queue, etc...etc...etc...
Many factors revolve around one another with the operation of a multi-user server...
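As a rough illustration of the process-state part of that list (this is not the FQguardian code, just a minimal Python sketch), the tally below assumes Linux-style `ps` state codes: R for runnable, S for sleeping, D for non-interruptible sleep, Z for zombie. The `summarize_states` name and the sample input are hypothetical.

```python
from collections import Counter

# Assumed mapping from the first letter of a ps-style state code
# to the categories mentioned above.
STATE_NAMES = {
    "R": "run-queue",
    "S": "sleeping",
    "D": "non-interruptible",
    "Z": "zombie",
}


def summarize_states(stat_codes):
    """Count processes per category from an iterable of state codes.

    Only the first character of each code is used, so modifiers such as
    the 's' in 'Ss' or the '+' in 'R+' are ignored.
    """
    counts = Counter()
    for code in stat_codes:
        if not code:
            continue  # skip blank entries
        counts[STATE_NAMES.get(code[0], "other")] += 1
    return dict(counts)


if __name__ == "__main__":
    sample = ["S", "Ss", "R+", "Z", "D", "T"]  # made-up sample data
    print(summarize_states(sample))
```

A growing count of zombie or non-interruptible processes alongside a long run-queue is the kind of combination a monitor like this would want to watch together rather than one number at a time.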
I have written a monitoring sub-system that keeps an eye on everything and changes operating parameters in real-time... The FQguardian was recently completed and I'm getting all the fine-tuning done... Tonight was a surprise, as it should have re-prioritized the CGI/PHP3 sub-system for the increased load, which it did not do...
Further investigation will need to go into this as I review the operating parameter logs and try to deduce what was not adjusted for...
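For anyone curious what "re-prioritizing for load" can look like in the simplest case, here is a hedged Python sketch. It is not how the FQguardian works; the thresholds, the function name, and the idea of driving everything from a single load-average number are all assumptions made for illustration. The only real-world detail relied on is that Unix nice values run up to 19, with higher meaning lower priority.

```python
def cgi_nice_level(load_avg, soft_limit=4.0, hard_limit=8.0):
    """Pick a nice value for CGI processes and decide whether to alert.

    Returns a (nice_value, alert) tuple. The limits are hypothetical
    placeholders, not values from any real configuration.
    """
    if load_avg >= hard_limit:
        # Heavy load: push CGI work to the lowest priority and raise an alert.
        return 19, True
    if load_avg >= soft_limit:
        # Elevated load: deprioritize CGI work, no alert yet.
        return 10, False
    # Normal load: leave the default priority alone.
    return 0, False


if __name__ == "__main__":
    for load in (1.0, 4.0, 9.5):
        print(load, cgi_nice_level(load))
```

A real monitor has to weigh many signals at once, which is exactly why a single missed adjustment like tonight's takes log review to track down.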
All in all - I have to track over 94 different knobs and gauges - and do it in such a way that the FQguardian program does not become a problem in itself... ;)
Oh the life of a sysAdmin... I do love it so...
--
Andrew Gillespie
Systems Administrator
FutureQuest.net