
[FQuest Alert] Network Outage


Deb
08-04-2003, 05:26 PM
Though I do not have all of the details (very few at this moment, actually), I did want to send a note out to let you know that the technicians are working on the issue that has caused our network to be extremely sluggish and, for many of you, unavailable.

We obviously hope to have the issue resolved shortly and of course will continue working until it is!

We thank you ahead of time for your anticipated understanding of such a frustrating situation.

Deb
FutureQuest.net

JoeLeBlanc
08-04-2003, 05:36 PM
We feel your frustration, and we do understand :P well, at least I do.

Terra
08-04-2003, 06:03 PM
Though I don't have all the details yet, it appears that the Internap side of our network pipes suffered serious packet loss...

This in turn caused a domino effect with BGP4 getting into a deadlock with the Qwest BGP announcements... This led to packets being rapidly switched back and forth between the good network and bad network side... When I had damped the BGP4 oscillation, connectivity started to return, however the rapid toggling was felt as sluggishness...
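For readers unfamiliar with the idea, here is a rough sketch of how route flap damping works in general, in the style of RFC 2439. The post does not say how the oscillation was actually damped, so this is only an illustration: the class name, the method names, and the threshold values (a penalty of 1000 per flap, suppress at 2000, reuse at 750, 15-minute half-life, which are commonly cited defaults) are all assumptions, not the configuration used here.

```python
class FlapDamper:
    """Illustrative flap-damping state for a single route (hypothetical helper)."""

    def __init__(self, penalty_per_flap=1000.0, suppress_at=2000.0,
                 reuse_at=750.0, half_life_s=900.0):
        # All values are assumed, commonly cited defaults, not FutureQuest's settings.
        self.penalty_per_flap = penalty_per_flap
        self.suppress_at = suppress_at
        self.reuse_at = reuse_at
        self.half_life_s = half_life_s
        self.penalty = 0.0
        self.last_update = 0.0
        self.suppressed = False

    def _decay(self, now):
        # The penalty halves every half_life_s seconds of quiet time.
        elapsed = now - self.last_update
        if elapsed > 0:
            self.penalty *= 0.5 ** (elapsed / self.half_life_s)
            self.last_update = now

    def record_flap(self, now):
        # Each flap (withdraw/re-announce) adds a fixed penalty.
        self._decay(now)
        self.penalty += self.penalty_per_flap
        if self.penalty >= self.suppress_at:
            self.suppressed = True

    def is_suppressed(self, now):
        # A suppressed route is reused only once the penalty decays below reuse_at.
        self._decay(now)
        if self.suppressed and self.penalty < self.reuse_at:
            self.suppressed = False
        return self.suppressed
```

The point of the two thresholds is hysteresis: a route that flaps rapidly is held down until it has been quiet for a while, rather than toggling traffic back and forth with every announcement.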

The most difficult aspect was drilling and working through the sluggishness, as every step of the way took 10 times longer than it should have due to non-responsiveness... :(

As it stands, I have severed and isolated the Internap uplinks and diverted all traffic over to the Qwest side which seems to have completely stabilized...

Our sincerest apologies for the frustration this has caused... Please be assured we are diligently sifting through the wreckage looking for viable clues to reassemble the particulars prior to the event...

--
Terra
sysAdmin
FutureQuest

BOF
08-05-2003, 05:20 AM
Thank you Terra and the FutureQuest team for getting on top of the problem and for keeping us advised.

hobbes
08-07-2003, 10:18 AM
Terra -

Were you able to get all the details yet? Is there a fix to keep this from happening again?

Thx.

Terra
08-07-2003, 10:45 AM
Yes and no...

A timing problem was found in the NIC driver on the Internap side; the fix will be in the next router software upgrade, which should help prevent this...

Zebra is in need of a few patches as well, as it had got itself jammed up in a loop from the BGP4 flapping...

We are also going to be embedding more sFlow taps into the core routers and pumping their output through a dedicated channel to the IDS processing engines...

There is more work that needs to be done with the QoS system to help clamp down on excessive packet streams in the case of a BGP4 event... We may also be looking at going with a realtime hardware insertion solution within our core and chain links to help facilitate this...
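As a rough illustration of what "clamping down on excessive packet streams" can mean, here is a minimal token-bucket rate limiter. The post does not describe how the QoS system works, so this is a sketch of one common technique and not the mechanism in use; the class name, parameters, and numbers are hypothetical.

```python
class TokenBucket:
    """Illustrative packet rate limiter (hypothetical, not the actual QoS system)."""

    def __init__(self, rate_pps, burst):
        self.rate = float(rate_pps)    # tokens (packets) added per second
        self.capacity = float(burst)   # maximum burst size in packets
        self.tokens = float(burst)     # start with a full bucket
        self.last = 0.0                # timestamp of the last refill

    def allow(self, now, packets=1):
        # Refill in proportion to elapsed time, capped at the burst size.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = max(self.last, now)
        # Forward only if enough tokens remain; otherwise drop or queue.
        if self.tokens >= packets:
            self.tokens -= packets
            return True
        return False


# Example: 10 packets/second sustained, bursts of up to 5 packets.
bucket = TokenBucket(rate_pps=10, burst=5)
burst_results = [bucket.allow(0.0) for _ in range(7)]
```

In this example the first five packets of the burst pass and the remaining two are refused until the bucket refills, which is the behaviour you want when a routing event suddenly floods a link.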

Obviously the above will take some time to sort out, and to figure out the best way to interlock the pieces...

Consider the lessons learned from that event to pave the way for the next stage of our network evolution - though I'm sure there will be more as time goes on... The Internet is simply way too dynamic to pin down the millions of permutations for what can go wrong... Best we can do is endure, learn, and adapt - instead of driving ourselves into a monolithic corner...

--
Terra
--Darwin at work--
FutureQuest

hobbes
08-07-2003, 12:24 PM
Thanks for the update!