[FQuest Alert] PT02, CASPER, ROCKO, SCOOTER, NIBBLER, Servers and Network Status


Bob
01-12-2009, 05:08 PM
The PT02 server is currently down and technicians are working on restoring services.

This was related to a power glitch in the FutureQuest Network which momentarily caused full interruptions.

Edit: The following servers were all taken offline as well:
CASPER ROCKO SCOOTER NIBBLER

More will be posted shortly.

Our apologies,
Bob

hobbes
01-12-2009, 05:10 PM
I'm still unable to connect to several sites. Is the issue resolved, or still ongoing?

And what does a power glitch on a network mean :-?

Bob
01-12-2009, 05:16 PM
I just updated my original post to reflect additional outages, as I did not have full information when posting.

The glitch appears to have been a power source that became overloaded, but Kevin will address that once all services have been restored.

-Bob

teach1st
01-12-2009, 05:20 PM
MySQL servers too? A forum is down and I'm getting MySQL errors. Also, from CNC, server info:

DBI connect(':MYSQL02.futurequest.net','FQmysql_version',...) failed: Can't connect to MySQL server on 'MYSQL02.futurequest.net' (113) at (eval 18) line 1974
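For readers decoding the message above: the number in parentheses after the host name is the operating system's errno as reported by the MySQL client. A minimal sketch of reading it follows, assuming the client ran on Linux, where 113 is EHOSTUNREACH ("no route to host"). The function name and the small lookup table are hypothetical and exist only for illustration.

```python
import re

# Linux errno values commonly seen in MySQL connect failures.
# Assumption: the client host is Linux; other platforms number these differently.
LINUX_ERRNO_NAMES = {
    110: "ETIMEDOUT (connection timed out)",
    111: "ECONNREFUSED (connection refused)",
    113: "EHOSTUNREACH (no route to host)",
}

def explain_connect_error(message):
    """Extract (host, errno, meaning) from a MySQL 'Can't connect' message."""
    match = re.search(
        r"Can't connect to MySQL server on '([^']+)' \((\d+)\)", message
    )
    if match is None:
        return None
    host = match.group(1)
    code = int(match.group(2))
    return host, code, LINUX_ERRNO_NAMES.get(code, "unknown errno")

msg = (
    "DBI connect(':MYSQL02.futurequest.net','FQmysql_version',...) failed: "
    "Can't connect to MySQL server on 'MYSQL02.futurequest.net' (113) "
    "at (eval 18) line 1974"
)
info = explain_connect_error(msg)
print(info)
```

Under that assumption, "no route to host" points to the database host being unreachable on the network rather than to MySQL refusing the connection.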

SteveYoung
01-12-2009, 05:22 PM
Thank you for the blog support. Just as I was getting beat up for problems with our site, the RSS feed on Google came in with your status report.

Bob
01-12-2009, 05:22 PM
The CASPER and SCOOTER servers have been returned to service.
-Bob

Bob
01-12-2009, 05:23 PM
The ROCKO server has been restored to service.

-Bob

Bob
01-12-2009, 05:25 PM
Fred,

Yes, some of the MySQL servers were taken down as well, and they are being brought back one by one.

We had two technicians at the Data Center when this occurred and two more are working remotely.

-Bob

Bob
01-12-2009, 05:27 PM
The NIBBLER server has been returned to service.

-Bob

Andilinks
01-12-2009, 05:29 PM
Thank you for the Blog support...

Yes, and it is to FQ's credit that only four such events have been recorded in the past year. Thanks for the great service.

Terra
01-12-2009, 05:35 PM
Thanks Andi, much appreciated during these stressful, scrambling moments of trying to recover from a multi-machine power outage! :)

Bruce
01-12-2009, 05:54 PM
PT02 has now been returned to service.

Its startup was delayed by a forced file system check on boot. Due to the atypically large number of files and directories on PT02 (each mail message is an individual file -- almost 8,000,000 of them), this check takes a long time. The check corrected a few minor errors in some directories, but no major problems were uncovered.

DogAndPony
01-12-2009, 06:03 PM
Interesting... Sounds like some big old power supply box hiccuped or somebody took an axe to a cable bundle. :shocked:

Whahoppen?

Have fun fixin'! :ytphead:

Kevin
01-12-2009, 06:04 PM
What happened here was that one of the managed power controllers was overloaded by a new server, and its breaker tripped, causing all connected servers to simply power off.

Unfortunately, I forgot that the first of our power pole controllers was accidentally purchased as a 15A version instead of a 20A or 30A version back in ~2002. I didn't think twice about adding a new server when it read ~13.5A in use, but that new server drove it up to 15.7A and tripped the breaker.
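The arithmetic behind that trip can be sketched as a simple headroom check. This is only an illustration, not FutureQuest's tooling: the function name is hypothetical, the ~2.2A draw of the new server is inferred from the two readings above (15.7A minus 13.5A), and the 80% planning margin is a common rule of thumb assumed here, not something stated in this thread.

```python
def check_headroom(rating_amps, current_amps, added_amps, derate=0.8):
    """Project the load on a power controller after adding a device.

    derate is an assumed planning margin (80% of the breaker rating),
    not a figure taken from the outage report.
    """
    projected = current_amps + added_amps
    return {
        "projected_amps": projected,
        "trips_breaker": projected > rating_amps,
        "over_planning_limit": projected > rating_amps * derate,
    }

# Figures from the post: ~13.5A in use on a 15A controller, 15.7A afterwards.
new_server_amps = 15.7 - 13.5  # inferred draw of the new server, about 2.2A

on_15a = check_headroom(15.0, 13.5, new_server_amps)
on_20a = check_headroom(20.0, 13.5, new_server_amps)

print(on_15a["trips_breaker"])  # the 15A controller is overloaded
print(on_20a["trips_breaker"])  # a 20A controller would have had room
```

Under the assumed 80% margin, the 15A controller was already past its planning limit of 12A at ~13.5A, before the new server was added.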

One of the systems connected to that power controller was a 32-port network switch, which made other servers unavailable as well for the first few minutes.

The MYSQL03 server took a long time to get back up because, apparently, its video card died either during this crisis or sometime since the last reboot, and the server refused to boot without one. We had to pull the server and replace the video card with a spare in order to return it to service.

Our apologies for this major outage and any inconvenience it has caused.

DogAndPony
01-12-2009, 06:13 PM
Hey, Kevin!

Wow... That's crazy about MYSQL03. Glad you have spare video cards lying about!

And I have to say that as always, I tremendously appreciate the accountability.

If you lived and worked in LA, chances are you'd never take personal responsibility. You would have said it was an unknown failure, or somebody else's fault. Lack of accountability is a pandemic disease out here. :hrmm:

Thank you, thank you, thank you for your honesty and diligence!

Kevin
01-12-2009, 06:15 PM
That was actually the only spare video card. MYSQL01-05 are about the only systems we have with video cards instead of integrated chipsets. There will be another spare soon, though, just in case.

If anyone cares, the bad video card is an old ATI Rage XL and the replacement is an older Matrox. Those servers only have PCI slots since they weren't made for high-end video.

Randall
01-12-2009, 07:45 PM
I think I still have an ancient (Pentium II era) Matrox card for emergencies. The other spare cards are AGP, and we still have a bunch of machines that are PCI-only or have integrated graphics.

Sometimes I forget, :rolleyes: which is why I have a shiny new nVIDIA dual-head AGP card sitting around doing nothing. MYSQL01-05 are about the only systems we have with video cards instead of integrated chipsets. Beware — I had an integrated chipset/port die on me not too long ago. Unless you want to replace the motherboard, I'd keep more than one spare PCI card on hand.

Randall

Kevin
01-12-2009, 07:48 PM
Beware — I had an integrated chipset/port die on me not too long ago. Unless you want to replace the motherboard, I'd keep more than one spare PCI card on hand.
We have spare boxes on hand for that reason. I could have swapped the hard drives from MYSQL03 into a spare box; however, that would have taken longer, and I was pretty sure the problem was the video card.

Matt
01-13-2009, 01:08 PM
Are there any lingering issues here, either with SCOOTER or MYSQL13? I have been seeing anomalous PHP script behavior, correlating with the server/power problems, that is still occurring sporadically. -Matt

Kevin
01-13-2009, 01:40 PM
Matt, there should not be any remaining issues on SCOOTER, and MYSQL13 wasn't affected at all.

Please send a message to the service desk with the details of your issue and we will look into it.

Matt
01-13-2009, 02:46 PM
Problem turned out to be user error. Sorry.