FQuest Alert: SIX
Terra
03-27-1999, 08:50 PM
SIX has locked up again for no reason and with no warning... I will be working on the core tonight in an effort to track down this long-standing instability problem...
I finally did get a small glimpse of the problem.. (Fork: Resource temporarily unavailable) I hope this leads me to the root of the error...
--
Andrew Gillespie
Systems Administrator
FutureQuest.net
Terra
03-27-1999, 10:51 PM
SIX is back online... It is with great dismay that I report that I will be decommissioning the SIX server... A new server will be ordered on Monday with all new hardware... We have had too much downtime on this server for it to continue...
I believe that something deep in the hardware is causing the random freezes, and instead of constantly chasing our tail on this - it would be wise to start with a clean slate...
--
Andrew Gillespie
Systems Administrator
FutureQuest.net
hearts
03-27-1999, 11:03 PM
Terra, when ya decommission SIX, what is gonna happen to us that are there now? How long does it take to put together the new server, and will we all (that are on SIX) be transferred to this new server?
Lots of questions I am sure you don't wanna answer right now, so will stop here.
Thanks Terra
Thanks for getting it back up at warp speed... sorry to hear about the problems with it!
I'm gonna miss old "6"...
http://www.pumpkindriver.com/aTributeto6.mid
:)
---------------------------
Paul
We're home now ;) Lil' bit easier to post from here than it is from the servers themselves. (I had to drive TeRRa up there this time as he has an injured knee right now.)
Downtime for sites on the SIX server has gone over what the 99.5% uptime promise allows this month. So far they have four days of credits coming to them. These credits will appear on your next billing cycle automatically.
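For a sense of scale, the 99.5% figure can be turned into a monthly downtime budget. This is only a rough sketch: the function name is made up for illustration, and it assumes the promise is measured over a calendar month, which the post does not state. It says nothing about how the credits themselves are calculated.

```python
def allowed_downtime_hours(uptime_promise, days_in_month):
    """Hours of downtime a month can contain before the promise is broken."""
    return (1.0 - uptime_promise) * days_in_month * 24


# March has 31 days; 0.995 is the promise quoted in the post.
budget = allowed_downtime_hours(0.995, 31)
print(f"Downtime budget for March: {budget:.2f} hours")
```

Under that assumption the budget for March works out to a little under four hours.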
when ya decommission SIX, what is gonna happen to us that are there now? How long does it take to put together the new server, and will we all (that are on SIX) be transferred to this new server?
What will happen once the new server is built is we will quickly move everyone that is currently on the SIX server to the new one. Estimated down time for the move would be about a 1/2 hour. All of your IP addys etc will remain the same.
We'll order the new parts Monday... from there it just depends how quickly we get them. Plan on the new server going on-line in about two weeks.
aTributeto6.mid
That about sums it up.
Deb
--
Was that a cutting edge server or a bleeding edge?
hearts
03-28-1999, 08:39 PM
i know poor ole six is dying on us, mail ain't working currently and neither is the cgi. :(
maybe it was a hiccup.. it is fine now..
[This message has been edited by hearts (edited 03-28-99).]
alexandra
03-28-1999, 08:42 PM
Glad it's not just me. I could get into my site, but not the UBB. AND I get a message I haven't seen before, that I've put at the end of this post.
(I've mailed Support, but thought I should post, too, in case someone else is similarly afflicted.) And we've been doing so well, too. Nearly a whole month without a problem (or at least, not one that either I or my posters noticed.)
Alexandra
Message you get when you try to get into Ballet Talk:
Internal Server Error
The server encountered an internal error or misconfiguration and was unable to complete your request.
Please contact the server administrator, webmaster@balletalert.com and inform them of the time the error occurred, and anything you might have done that may have caused the error.
More information about this error may be available in the server error log.
--------------------------------------------------------------------------------
Apache/1.3.4 Server at www.balletalert.com Port 80
Whatever it was that was just felt... is not happening now. I've tested both balletalert.com and Heartsweb and both are working at this point.
We do know that SIX is having 'strange' problems... this is why the server is being replaced. The problem that is occurring is unidentified at this point :( We will need to rip the system apart and test each and every piece of hardware in it... we cannot do this as long as it is in production use. First thing Monday morning we'll be placing the orders for the new parts... ergo doing everything we can as fast as we can to cure this issue.
We are watching this system VERY CAREFULLY right now -- it seems to be bogging down with complaints about resource usage... yet the resources available are not being overused in any way that we can find at all :( It is an unknown error :/
Deb
alexandra
03-28-1999, 09:04 PM
Yes, Deb, I just checked and it was fine. Odd. (Before, when the server has been ill, I haven't been able to get in at all.)
My favorite part of the message it gave was the "tell them anything you may have done to cause the error." Talk about user friendly! ;) I'm expecting signed confessions in my mail box any minute. My users are, for the most part, technologically innocent.
Alexandra
Jacob Stetser
03-28-1999, 09:11 PM
Hmm.
I was logged on during the seismic tremor of Six, and got a lot of bash: fork: resource not available problems.
SQL started giving me errors as well.
And then it seemed to clear up, the way the clouds might seem to miraculously blow away just as you reach the top of a mountain and are looking down at the.. oh, never mind.. I'm dreaming of being out in nature again..
durned computers. I haven't been hiking for over a year now.
Anyhow, all seems clear at the moment.
meikel
03-28-1999, 10:05 PM
A bit off topic, but...
"tell them anything you may have done to cause the error."
So what is the typical email one gets:
a) "I did nothing"
b) "I didn't change anything"
c) "I always did this, always worked"
d) "Everything is right but you"
e) etc...
I had a bug in FreeMem Pro V4.1 which was extremely hard to find. It was very tough to get some good information from someone to reproduce the problem on my own machines. It took around 4 days to replicate the problem. After this replication, the fix took 10 minutes.
Greetings from Bonn, Germany
Meikel Weber
http://www.meikel.com
Terra
03-29-1999, 12:06 AM
fork: resource not available problems
This is the error that I'm chasing down... It's a very *annoying* and extremely elusive bug... I have searched the Linux newsgroups and the Internet at large... It appears that we are not the only ones affected by it - and yet no one has been successful in isolating the culprit... In almost all instances, the server is not heavily loaded, with all indicators showing green lights... I do not know if this is causing the random freezes as well - I'm hoping that it is, which would mean 2 bugs with one stomp... ;)
I'm still installing traps and triggers into the server to try and snare this wombat... This is happening without rhyme or reason, with no warning nor *any* type of indication as to which resource *it thinks* is getting low... I've got hundreds of resources in the haystack to choose from...
--
Terra
--Can someone loan me a needle?--
FutureQuest
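As an illustration of the kind of "trap" described in the post above, here is a minimal sketch that watches one candidate resource, the number of running processes. It is not Terra's actual monitoring: the function names, the 80% warning ratio, and the limit passed in are all hypothetical. It relies only on the fact that Linux exposes each running process as a numeric directory under /proc.

```python
import os


def count_processes(proc_dir="/proc"):
    """Count running processes: on Linux each one is a numeric entry in /proc."""
    return sum(1 for name in os.listdir(proc_dir) if name.isdigit())


def check_headroom(current, limit, warn_ratio=0.8):
    """Classify how close the process count is to a given limit.

    `limit` and `warn_ratio` are illustrative values chosen by the caller.
    """
    if current >= limit:
        return "FULL"
    if current >= limit * warn_ratio:
        return "WARN"
    return "OK"


if __name__ == "__main__" and os.path.isdir("/proc"):
    running = count_processes()
    # 512 is only an example limit here, not a value read from the kernel.
    print(running, check_headroom(running, 512))
```

Run from cron or a loop, a check like this would leave a log line before forks start failing, which is exactly the indication the kernel itself was not giving.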
Justin
03-29-1999, 12:20 AM
I sure hope this bug can be squashed - I'd hate to see SIX decommissioned :( She was just so fast and responsive at the beginning there... I'm gonna miss her :(
So what's the new server's name gonna be?
------------------
Justin Nelson
FutureQuest Support
hearts
03-29-1999, 12:42 AM
hmmmm.. seems SQL is the prob?
Just me and my untechnical self thinking, but this all didn't seem to be happening until the SSL was made available. Prior to that, everything seemed to be going so perfectly smooth.
thanks for keeping us up to date on everything... as i am sure we are all watching this thread carefully, and my heart is hoping, you don't have to decommission SIX.
[This message has been edited by hearts (edited 03-28-99).]
Terra
03-29-1999, 02:33 AM
So what's the new server's name gonna be?
The name 'SIX' will stay the same, just the hardware switched out... TAZ has gone through the same scenario, it has been through 3 separate hardware configurations...
So many of my configurations depend on the hostnames 'SIX' and 'TAZ' that changing them would make my life very difficult... It would require 60+ configuration files to be updated, just for that one simple host renaming...
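To get a feel for how far a hostname reaches into a set of configuration files, a scan like the following could list every file that mentions it. This is a hypothetical sketch, not a tool from the thread: the function name and the whole-word matching rule are assumptions made for illustration.

```python
import os
import re


def files_mentioning(root, hostname):
    """Return sorted paths under `root` whose text contains `hostname` as a whole word."""
    pattern = re.compile(r"\b" + re.escape(hostname) + r"\b")
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            try:
                with open(path, "r", errors="ignore") as handle:
                    if pattern.search(handle.read()):
                        hits.append(path)
            except OSError:
                continue  # unreadable file: skip it rather than abort the scan
    return sorted(hits)
```

Calling files_mentioning on a configuration directory with "SIX" as the hostname would give the list of files a rename would have to touch. The whole-word match keeps a name like "SIXTEEN" from counting as a hit.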
Hearts, I have looked into the SSL potential, and the problem is not there... Many things have had to be done to get the server running properly with the Linux 2.2.x kernel series as RedHat does not officially support it... This has forced me to rely on updating (manually) many of the core packages to utilize the new .h header files and structures contained within... The problems are definitely at the kernel level, as user space is much more controlled now... No longer will Linux overcommit memory (selectable feature), or allow rampant programs to take the server down without some serious bashing and thrashing of the swap memory... It's my job to make sure it doesn't get to that point...
The problem is being systematically researched and steps are being taken to isolate the cause of the problem... I am now monitoring a ton more of the server's real-time operational parameters and ripping into the Linux kernel source code to see if I can find the problem myself... It's a long road, as downgrading SIX to Linux 2.1.x is d*mn near impossible, so I am having to make do with what is currently built and online in production...
SIX runs the same hardware as TAZ except for a difference in memory manufacturer and a different 6 gig UDMA secondary logging hard drive... I think that the memory is what is causing the random system freezes as the Tyan boards are *very* sensitive to the type of memory installed... We were shipped the wrong memory, but due to time constraints and downtimes with TAZ we had to get the SIX server online or perish (new account freeze was in effect for over a week already)... The memory was Micron memory (HIGH Quality), but nevertheless the Tyan Tiger board may think otherwise...
I am also looking at the possibility of malicious acts being brought against the server... I have been researching all the DoS attacks that are relevant to Linux systems, especially the 2.2.x series... So far there has been one that affected 2.2.3 and earlier... That hole was closed when we upgraded to 2.2.4, yet the instability problem reared its ugly head the very next day...
Sometimes I feel that SIX was a mistake, that I was perhaps pushing the envelope of technology/capability/power... The new configuration is breathtaking, and quite possibly my absolute best work to date, but it's only as strong as its weakest link... Now that I've made my bed, I have to deal with the ramifications of a High-Powered problematic server... I can only hope that the problems are isolated and dealt with...
Well, I'm done rambling now...
--
Terra
--Speaker, I yield the rest of my time to the men in black--
FutureQuest
hearts
03-29-1999, 02:51 AM
malicious acts against our server??? Lemme at 'em so i can cuss 'em out. That is an unsettling feeling, however all part of being online.. and we know we got the best in you.. and deb gets the best of you! *evil grin*
And ya know Terra, I don't think SIX was a mistake in the slightest, you took a gamble for us. What more could we ask of ya. I applaud every single effort you have put forth for us. You were in a time crunch, you had to deal with so much at one time and you selected your priorities and handled it with logic, knowledge, and a hunch.
Sometimes I feel that SIX was a mistake, that I was perhaps pushing the envelope of technology/capability/power...
well to this I can only say, that you are ahead of your own time. You got the knowledge, SIX just didn't seem to have the strength.
Disappointments become an awesome learning tool, and yeah, sometimes disappointments come with a price, however, my heart says that the good far outweighs the bad.
You got us for moral support, and as long as you continue to be honest with us.. you will always have that.
Thanks FQ.........
hearts
Justin
03-29-1999, 05:24 AM
Well said, Hearts :)
hearts
03-29-1999, 12:10 PM
TERRA.. SIX is getting sick again.. no cgi no ssi and email is off/on. *just letting ya know*
teach1st
03-29-1999, 12:17 PM
No FTP...
fred
http://www.pb5th.com
Terra
03-29-1999, 12:52 PM
Eureka!!!!
Today is the *first* time I got to see the sequence of events causing this failure... This is my first true break...
More news to follow... ;)
--
Terra
--Watching dominos fall--
FutureQuest
Melprophet
03-29-1999, 12:57 PM
Yep, SIX is definitely sick... My UBB is inaccessible at the moment and that peculiar message pasted here the other day is popping up...
Funny though, I got an e-mail in reply to the error message from one of my posters I thought I'd share:
"Hey Mel, I don't know what I did wrong to cause this problem. But if I'm using your forum and somebody did something wrong it must be me. You know Murphy's Law? Well, I'm Murphy!"
Jacob Stetser
03-29-1999, 02:26 PM
Hm. Well, email checks out, PHP and cgi check out, web checks out...
but SQL isn't connecting :(
Jake
hearts
03-29-1999, 02:38 PM
well... since we know Terra is working on this, let us give him some time.. Give our tech a break ...
[This message has been edited by hearts (edited 03-29-99).]
Terra
03-29-1999, 02:42 PM
Woohooo!
Finally tracked this bug down...
Fork: Resource temporarily unavailable
Sequence of events:
1) x21 Apache: PHP3 script started going bonkers, running a high load
2) This slowed down the other x?? Apache daemons, causing excessive children to be spawned
3) We slammed head first into the kernel's limit of 512 simultaneous processes...
*** This is the glitch: when I compiled the kernel, I overrode this setting (1024) to increase the limit, but it appears that deep in another (Virtual Memory) Makefile a piece of legacy remained and nullified my change, forcing it back to the default 512...
4) Apache fighting to serve pages, QMail trying to deliver mail, etc... etc... yet no one could spawn children to serve the requests, creating gridlock...
5) SysAdmin grimly killing all non-essential children and subsequent parent spawn... :D
6) SysAdmin successful at getting server back under control and shutting down all frontend and backend engines... ;)
This describes a classic ForkBomb attack, but was accidental and not malicious...
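The sequence above can be sketched as a toy model. This is not kernel code and not a reproduction of what ran on SIX: the class, the function, and the counts of 300 existing processes and 250 wanted children are invented for illustration. Only the cap of 512 and the EAGAIN error ("Resource temporarily unavailable") come from the post.

```python
import errno


class ProcessTable:
    """Toy model of a kernel with a fixed cap on simultaneous processes."""

    def __init__(self, max_processes, baseline=0):
        self.max_processes = max_processes
        self.running = baseline

    def fork(self):
        # Once the table is full, every fork fails the same way for everyone.
        if self.running >= self.max_processes:
            raise OSError(errno.EAGAIN, "Resource temporarily unavailable")
        self.running += 1


def spawn_until_refused(table, wanted):
    """Try to start `wanted` children; return how many actually started."""
    started = 0
    for _ in range(wanted):
        try:
            table.fork()
        except OSError as exc:
            if exc.errno != errno.EAGAIN:
                raise
            break
        started += 1
    return started


table = ProcessTable(max_processes=512, baseline=300)
apache = spawn_until_refused(table, 250)  # slowed pools asking for extra children
mail = spawn_until_refused(table, 5)      # mail delivery arrives afterwards
print(apache, mail, table.running)        # prints: 212 0 512
```

The point of the model is step 4: once the Apache pools have eaten the remaining slots, unrelated services such as mail cannot start a single child, even though CPU and memory may look fine.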
Questions unanswered:
1) Why did PHP3 not invoke its protective measures with regard to CPU time... (Again, I keep fixing it - and each new revision breaks my fixes) :(
2) Why did this legacy issue with Max Processes remain in the mainstream kernel?
3) Why did the kernel not report responsible errors to the system logs?
4) Why is this setting hardwired and not capable of being dynamically set during runtime with all the proc enhancements?
5) Why is the max limit on x86 systems only 4092?
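On question 1, one OS-level backstop for a runaway script is a per-process CPU time limit enforced by the kernel rather than by the interpreter. The sketch below is not PHP3's own mechanism and not what was configured on SIX. It is a generic Unix illustration using Python's resource module, and the helper name is hypothetical.

```python
import resource
import signal
import subprocess
import sys


def run_with_cpu_limit(argv, cpu_seconds):
    """Run argv as a child that the kernel stops after cpu_seconds of CPU time."""

    def apply_limit():
        # Runs in the child just before exec. Reaching the soft limit sends
        # SIGXCPU; the hard limit, one second later, sends SIGKILL.
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds + 1))

    return subprocess.run(argv, preexec_fn=apply_limit, timeout=60).returncode


if __name__ == "__main__":
    # A deliberately runaway child: a busy loop that never finishes on its own.
    code = run_with_cpu_limit([sys.executable, "-c", "while True: pass"], 1)
    print("killed by SIGXCPU" if code == -signal.SIGXCPU else f"exit status {code}")
```

A limit like this only caps CPU time per process. It would not by itself stop a pool from spawning too many children, so it complements a process limit rather than replacing it.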
Steps of Resolution:
1) Patch the errant kernel Makefile
2) recompile the kernel and double check to make sure this does not happen again
3) reboot the server bringing the patched kernel back online
This may or may not solve the random freezes that SIX has been experiencing, time will tell on this issue as it is believed to be hardware/memory related...
I will hold out as long as possible, but may have to reboot the server at any time in case this dire situation happens again to bring the updated kernel online... Since it will be an orderly reboot, it would only be down for 6 - 8 minutes max...
--
Terra
--I found the needle by sitting on it, right place at right time--
FutureQuest
[This message has been edited by ccTech (edited 03-29-99).]
Jacob Stetser
03-29-1999, 02:43 PM
[ooer]
[This message has been edited by Jacob Stetser (edited 03-29-99).]
SneakyDave
03-29-1999, 02:58 PM
Oh yeah Andrew, I KNEW those were the problems, I was just going to tell you too! :)
Great job, and keep up the good work!
Sneaky
(slamming head first into wall)
Hi Terra.
Is SIX still sick?
FTP & POP are ok, But HTTP disconnects after receiving the GET request, resulting in a "network error occurred" in Netscape
- Stan
Terra
03-29-1999, 05:25 PM
We run pooled x?? Apache daemons now, and I have been going through them one by one making some necessary resource updates...
x25 was just wrapped up and is back online now. It took quite a bit more work than the others as it has specific configurations to handle the really large domains...
x20 - x25 are our primary Apache pools...
Sorry about the inconvenience today, please believe that I'm working as fast as I can to sort this whole situation out... Everything should be stabilized, with the Real Audio system being next on my list...
SIX is now being rebooted to bring the new kernel online... After bringing x25 back up it drove our limits close to the red again... These new kernel modifications should fix the resource issues...
--
Terra
sysAdmin
FutureQuest
Terra
03-30-1999, 01:07 AM
Woohoo!
After keeping a close watchful eye and constant vigil on SIX it is responding *much* better now...
It has twice exceeded the old brick wall and happily kept chugging right along during our peak hours...
Time will tell though, but after monitoring the new updates/patches it's looking very promising...
--
Terra
--There is just something about SIX that just feels right--
FutureQuest
muwah ;)
--
You see things; and you say, "Why?"
But I dream things that never were;
and I say "Why not?"
-- George Bernard Shaw
Jacob Stetser
03-30-1999, 01:21 AM
He ALWAYS leaves us in suspense like that :P
It doesn't seem to have locked up, at least, just giving that strange bash:fork:Resource temporarily unavailable message a lot.
(I've been logged in throughout the crisis, furiously trying to avert dis- oh wait, I've just been sitting on, doing nothing important :) )
I hope the techman's epiphany gets him to the 'root' of the problem.
Yuk yuk, I kill me.
Jake
jenili
03-30-1999, 01:27 AM
You go, Terra! :D
Is there anything we should avoid or test in PHP coding, anything that seemed to be setting it off before?
jeni
-- fork: resource temporarily unavailable, eat with your fingers and don't whine with your mouth full --
hearts
03-30-1999, 01:53 AM
Terra, the suspense is killing me! I hope this means you can fix it and won't need to order a new server!!!! :)