FQuest Alert: RASMUS emergency repairs
Terra
01-30-2001, 04:04 AM
RASMUS is currently offline for Emergency Repairs...
It is the *exact* same condition as what QBERT suffered on January 9th, 2001... I cannot prevent the problem, but only deal with it when it happens... :(
You can read the full details at:
http://www.aota.net/ubb/Forum4/HTML/000446-1.html
--
Terra
sysAdmin
FutureQuest
[This message has been edited by ccTech (edited 01-30-01@03:24 am)]
Terra
01-30-2001, 05:40 AM
The first half of repairs are completed and RASMUS is now back online...
Due to this problem, STATS processing is suspended until tonight when I have better indication of stability...
All disk/file checks came out with promising results, so I am hoping that it will remain this way...
I will schedule the manual RAID rebuilding within the next 48 hours and search for ways to prevent this in the future, as this is the second time I've been bitten by this bug...
Ironic though on the following 2 points:
1) do_try_to_free_pages -- Fixed in Linux 2.2.19
2) ReiserFS + RAID Resync -- Fixed in Linux 2.4.1
I cannot use either one yet, because the first is not released yet and the second needs time to mature before enterprise acceptance...
#1 was the first domino to tip over that caused this whole mess to begin with... :(
It was _supposed_ to be fixed in 2.2.18, but at the last minute it was not included until 2.2.19pre3... I tried to merge that fix with .18 and ended up breaking some vital semantics much deeper in the kernel...
Our sincerest apologies for the inconvenience this downtime has caused...
--
Terra
--Lives at work--
FutureQuest
janderk
01-30-2001, 06:29 AM
Wow, got a NetWhistle message telling me that my site was down. At my previous provider I always received similar messages about once a week. Since my arrival at FutureQuest I almost forgot about my Netwhistle account.
Although I, just like everyone else, hate being down, I fully accept it happening from time to time, as long as the total uptime is acceptable. Your guys' feedback is just amazing. Never seen or heard of a hosting company providing feedback on an open forum while working on the problem.
The problem sounds very complicated to me so I started looking up what a journaling file system is at whatis.com.
http://whatis.techtarget.com/WhatIs_Definition_Page/0,4152,284007,00.html
Sounds like a real cool feature every server should have. It's nice to see that FutureQuest is trying as hard as they can to find a work around.
Be careful not to get bored by creating servers with year+ average uptimes ;)
Jan Derk
[This message has been edited by janderk (edited 01-30-01@8:08 pm)]
Terra
01-30-2001, 06:49 AM
Thank You for the conveyance of your thoughts... :)
XFS is next on my list as it will most likely provide better resiliency and more depth of checking/recovery tools... We are nearing the threshold of Enterprise Class architecture with mission critical availability...
My dream is to obtain the golden 99.9999999999999999% uptime, barring scheduled downtime for either upgrades or unforeseen events...
--
Terra
sysAdmin
FutureQuest
janderk
01-30-2001, 08:57 AM
3am repairing a server
4/5am answering newsgroup posts
Do you ever sleep?
Shhhhh.. don't say "server" so loud.. :P
Paul
Terra
01-30-2001, 09:51 AM
Huh, did someone say server???
:P
--
Terra
--Still milling around enhancing this'n'that--
FutureQuest
DestinyBWL
01-30-2001, 12:30 PM
Although I, just like everyone else, hate being down, I fully accept it happening from time to time, as long as the total uptime is acceptable. Your guys' feedback is just amazing. Never seen or heard of a hosting company providing feedback on an open forum while working on the problem.
One of the things that clinched my signing up with FutureQuest. After I narrowed my choice down to just a few hosts (FutureQuest being the most expensive), I found a couple of messages on the net that made me go with them. One of the messages went something like... "Futurequest claims to be a good, honest host, and that is the beauty of them. They ARE an *honest* host."
jimbo
01-30-2001, 12:33 PM
"Futurequest claims to be a good, honest host, and that is the beauty of them. They ARE an *honest* host."
With the exception of that whole "gift" fiasco, they are :P.
;)
-jim
[This message has been edited by jimbo (edited 01-30-01@11:33 am)]
YFS200
01-31-2001, 05:04 AM
Just wondering about the inner workings of FQ here.
I was wondering if the FQ servers use a hardware-based RAID-1 system, or a software-based one.
I just upgraded an NT system, using what I have learned reading here.
It now has two 18 GB 10k Quantum SCSI U160 HDDs in a 19" StorCase rackmount rack, along with an Adaptec RAID card dedicated just to the drives. All set up in a RAID-1 config.
Now as far as the OS can tell, it's a single drive. The SCSI BIOS does all the RAID stuff, including rebuilding a drive on the fly.
I have played with ReiserFS on my laptop. The little OmniBook tends to be hard on drives. (Or was it the recompiling of the kernel while driving down a nasty freeway that did the ext2 in? :) ) Not much, as I broke the PC card support in the last kernel rebuild. Still working on it. I have not tried ReiserFS on a RAID-1 system yet.
Guess the question is why would a hot rebuild of the RAID mess up the ReiserFS?
YFS200
Hmmm....
Terra
01-31-2001, 11:16 AM
YFS200:
We use the updated software RAID patches written by Mingo... Going with a hardware-based RAID system was cost prohibitive, and the gains realized were offset by my own tuning of the server for low overhead RAID operations plus some advanced read balancing techniques that maximize the use of RAID-1 mirroring... Also, the increase of RAM to 1 GB on the servers gave a tremendous performance boost as I could now fine tune the dirty cache buffers and delay certain writes during slower periods on the server... Believe it or not, our servers' Read/Write ratio is heavy on the write side...
For example, current disk stats for the SIX server (uptime: 27 days)
READ:  10,026,943
WRITE: 54,825,762
*cough*massive gigantic bulky log file writes*cough*
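To put a number on how write-heavy those counters are, here is a minimal Python sketch of the arithmetic. The function names are hypothetical, and it assumes both figures count the same kind of unit (operations or blocks; the post does not say which), so treat the result as a rough ratio only.

```python
# Disk counters quoted above for the SIX server (uptime: 27 days).
# Assumption: both figures are in the same unit (the post does not specify).
READS = 10_026_943
WRITES = 54_825_762


def writes_per_read(reads: int, writes: int) -> float:
    """How many writes occur for every read."""
    return writes / reads


def write_share(reads: int, writes: int) -> float:
    """Fraction of all disk activity that is writes."""
    return writes / (reads + writes)


if __name__ == "__main__":
    print(f"writes per read: {writes_per_read(READS, WRITES):.2f}")
    print(f"write share:     {write_share(READS, WRITES):.1%}")
```

With the figures above this works out to roughly 5.5 writes for every read, or about 85% of all disk activity being writes.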
Instead of reinventing the explanation wheel, I will redirect you to a post made on the ReiserFS mailing list that explains the issue quite well...
http://marc.theaimsgroup.com/?l=reiserfs&m=94226961021515&w=2
The main issue we get nailed on is 'pinned buffers', and when that happens there is some custom added logic to deny all writes to the hard drive... This helps to prevent massive file corruption that could otherwise take place as the 'pinned' condition is seen and locks fall down into place...
There is still no solution yet for the Linux 2.2.x series kernels or the existing RAID-0.90 updated patches...
--
Terra
--Linux 2.4.x is sweet, shame that I learned my lesson with 2.2.5 on SIX--
FutureQuest
YFS200
02-04-2001, 05:53 AM
So it's a RAID software setup with a few "tweaks" of your own. I would have figured on a hardware RAID card.
I think I understand what's going on now. Cool how it's set up. Thanks for the info.
YFS200
I wish laptops came with a RAID-1 system. Seems like it would be a good idea, given the way they get bounced around and the lack of backup on the road.