[FQuest Alert] Internap DS-3 Network Issues
Terra
03-03-2003, 02:12 PM
FutureQuest <==> Internap DS-3 failure.
Rerouted all traffic through Qwest.
--more to be posted shortly--
--
Terra
sysAdmin
FutureQuest
Dr Mirth
03-03-2003, 02:24 PM
The traffic was indeed being re-routed through Qwest but I still couldn't access my web site up until a few minutes ago.
Performing a trace route it would get all the way through Qwest's network and then stick at the IP: 63.144.0.106.
But no doubt we will get the full explanation soon. Many thanks for the quick response (as always!). :)
Terra
03-03-2003, 03:17 PM
After a more thorough investigation, the problem was found to be caused by a fast carrier transition on the DS-3 interface, which triggered a bug in the HDLC protocol driver... When the fast up-down-up state happened, the HDLC driver was busy working on something else and did not see the line transition... This in turn did not allow HDLC protocol to resynchronize to the DS-3 reinitiated sequencing...
In short, the DS-3 line circuit was up, however the protocol that moves packets between FutureQuest and Internap was out of step... This caused the protocol driver to keep trying to send the packets into an effective black hole...
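For readers who want to see the shape of the race, here is a minimal Python sketch. It is an illustration only, not ImageStream's driver code: it assumes the driver notices carrier changes by sampling the line state at certain moments, and the names `polled_driver_sees` and `needs_resync` and all the timings are made up for the example.

```python
def polled_driver_sees(events, poll_times):
    """Return the carrier state a driver observes at each poll time.

    events: list of (time, state) tuples sorted by time, state is 'up' or 'down'.
    poll_times: the moments at which the (hypothetical) driver looks at the line.
    """
    observed = []
    for t in poll_times:
        state = None
        for when, s in events:
            if when <= t:
                state = s  # latest transition at or before this poll
        observed.append(state)
    return observed


def needs_resync(observed):
    """The driver only resynchronizes if it actually saw the line go down."""
    return 'down' in observed


# Fast up-down-up: carrier drops at t=10 and is back at t=12.
events = [(0, 'up'), (10, 'down'), (12, 'up')]

# A driver busy with other work polls at t=5 and next at t=15: it misses the drop.
busy = polled_driver_sees(events, [0, 5, 15, 20])

# A driver that happens to poll at t=11 sees the drop and would resynchronize.
attentive = polled_driver_sees(events, [0, 5, 11, 15])
```

In the busy case every sample reads 'up', so nothing prompts a resync even though the far end restarted its sequencing, which matches the "line up, protocol out of step" state described above.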
This was enough to keep the existing BGP sessions in operation, with HUGGIN pushing out the packets to the NOVA (Internap) router... When I saw this happen, I jumped into the Qwest side (IKONOS) and shut down the BGP session, trying to prevent any potential flapping that might occur... Once access was gained to the trio of routers, I was able to manually reroute everything back through the Qwest link...
Unfortunately, this was one of those rare occasions when our routers' routing tables needed manual intervention, as NOVA was still announcing that it was there even though its egress was a black hole...
I will be studying this scenario in more detail, and will see if I can devise a fail-safe method to watch for this situation and punt all BGP sessions and drop back to full static routing (outbound) with proper AS route advertising (inbound)...
The router vendor (ImageStream) is aware of the situation and will be working on a maintenance release to fix this (struck-by-lightning) race in the HDLC protocol handler...
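One possible shape for such a watchdog, sketched in Python under stated assumptions: it is not the method that was eventually deployed, and the `Sample` fields, the `is_black_hole` name, and the threshold of three intervals are all hypothetical. The idea is to flag a link that reports itself up and keeps transmitting while its receive counter stays flat.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One reading of a (hypothetical) interface's state and packet counters."""
    link_up: bool
    tx_packets: int
    rx_packets: int


def is_black_hole(samples, min_stalled_intervals=3):
    """True if the link stayed 'up' and kept sending while receiving nothing
    for at least min_stalled_intervals consecutive sampling intervals."""
    stalled = 0
    for prev, cur in zip(samples, samples[1:]):
        sending = cur.tx_packets > prev.tx_packets
        receiving = cur.rx_packets > prev.rx_packets
        if cur.link_up and sending and not receiving:
            stalled += 1
            if stalled >= min_stalled_intervals:
                return True
        else:
            stalled = 0  # any sign of life (or an honest 'down') resets the count
    return False


# Healthy link: both counters keep climbing.
healthy = [Sample(True, 100 * i, 90 * i) for i in range(5)]

# Black hole: link claims to be up, packets go out, nothing comes back.
stuck = [Sample(True, 100 * i, 500) for i in range(5)]
```

The reaction to a positive result (shutting down the BGP neighbors and installing static routes) is deliberately left out, since it depends entirely on the router's own configuration interface.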
Our sincerest apologies to all those affected by this unfortunate and unforeseen situation...
--
Terra
sysAdmin
FutureQuest
tedloh
03-03-2003, 03:32 PM
I love Terra's answers... of which about 0.39567% of us can understand the technical explanation, and 99.60433% of us marvel at how detailed the explanation is, unlike "Internap had a brownout".
As always, thanks for all you do and the speed with which you do it :)
Strice
03-03-2003, 04:37 PM
People around my workplace are constantly asking me why things break or the network doesn't work, etc., when I know darn well the explanation is way over their heads anyway, and then they get upset with me if my answer doesn't make sense to them.
My co-worker asked why our website was down, and I showed him Terra's explanation. He just sort of got quiet and said, "Oh. I hate it when that happens."
That'll show him.
I love my job. :google:
Monty
03-03-2003, 05:05 PM
I love Terra's answers
Me too! The ones in email are even better. It's nice to have such talent taking care of us.
Mont
hobbes
03-03-2003, 05:59 PM
Terra -
Would a device like a packet shaper in front of the routers have caught this and provided a switchover? Or simply monitoring for packet traffic and failing over if there's nothing/minimal traffic going across a link?
- Hobbes
-- Perhaps just renaming the boxes might do it, after all, what did you expect a (super)NOVA to do, or for the Spanish alternative: no va --
Terra
03-03-2003, 06:26 PM
Would a device like a packet shaper in front of the routers
Well, in not so many words, the routers also handle packet/traffic shaping via Linux QoS abilities... The issue at hand was that BGP caused a blackhole, and one that had to be manually delatched...
One of the projects I'm working on for the routers is setting up VRRP, to handle downed interior interfaces... However, that would not have helped in this case because the interface was the exterior DS-3 line card, and the event did not get picked up by the protocol, allowing it to look like the interface was functional and conveying packets... I guess what disturbs me most is that the NLRI facet did not pick up on it fast enough to cause the internal routes to deconverge in a timely fashion... This created several tarpits to my efforts in tunneling into the network and gaining full access to all routers... I am already investigating the timer loops in the NLRI handling, to see if they can be modified to intercept this type of event and force a shutdown of all BGP neighbor connections...
I have been assured by ImageStream that the HDLC issue will be fixed in the next maintenance release, somewhere within the next couple weeks... Until then, I have to be on guard to make sure we don't get hit by lightning twice...
Overall, today's event was just another reminder that no matter how well you think you have all the angles covered, there is always going to be a corner case somewhere - especially when dealing with dynamic routing protocols that are in a constant state of motion...
--
Terra
--the light at the end of the tunnel ended up being a train--
FutureQuest
Randall
03-03-2003, 10:17 PM
Originally posted by tedloh:
I love Terra's answers... of which about 0.39567% of us can understand the technical explanation, and 99.60433% of us marvel at how detailed the explanation is
I'm firmly in the latter category, but it's more than just the level of detail -- engineers love detail but most of them can't write to save their lives. Terra crams enough action and suspense into these explanations that you just can't put them down, even if it's all G(r)eek.
The movie version will rock. ;)
Randall
# Two thumbs up!
Reading Terra's technical explanations is sort of a morbid fascination, really. You know you won't understand any of it and nothing good can come of the experience, yet you just can't look away... Sort of like staring at the sun too long. Yes Terra, you really do illuminate us mere earthlings! :)
Dan
I really enjoy Terra's posts. If you have to suffer some downtime, at least here you get a peek behind the scenes.
Other places that don't tell you anything regarding maintenance, problems, and outages and just expect you to live with it really frustrate me. The full reports given by FutureQuest are great!
hobbes
03-04-2003, 11:01 AM
Although he would never admit it, Terra is in fact using the Techno-Babble-Generator(tm). A new gizmo that spits out what appears to be English, but in fact is a random combination of words and phrases that will confound, amuse, and enlighten the non-digerati. :)
JoeRT
03-04-2003, 01:28 PM
As a tech-type myself, I really appreciate that Terra digs into these problems, finds the root causes and develops solutions. He could just say "I unplugged it, plugged it back in and it worked" or "it just needed to be rebooted." But it's that desire to investigate and solve even the most infrequent problem that marks a real technician. You will not find it everywhere, and we are very fortunate to have Terra and the FQ team. This is what I require of all the technicians that work for me, and the results are great.
-------------------------
Joe Torsitano
www.weatherforyou.com