[FQuest Alert] Network Instability


Terra
08-14-2006, 07:01 PM
We have shut down the Level3 side of our network to determine if the problem is with Level3 or if there is a problem with one of the router's interface cards...

In the meantime, traffic will continue to flow over our Internap and TWTC uplinks...

Our apologies to anyone who got caught in a routing loop earlier...

--
Terra
sysAdmin
FutureQuest

cindik
08-14-2006, 07:11 PM
"We have shut down the Level3 side of our network to determine if the problem is with Level3 or if there is a problem with one of the router's interface cards...

"In the meantime, traffic will continue to flow over our Internap and TWTC uplinks...

"Our apologies to anyone who got caught in a routing loop earlier..."

The weird part was when I was getting temporary GoDaddy pages for futurequest.net and aota.net. :eek: :ytyikes: :ytthud: :shocked: :eeww:

Terra
08-14-2006, 07:15 PM
Seeing GoDaddy pages wouldn't have anything to do with us... I would recommend sweeping your computer for spyware and various other ilk that would have caused the redirection...

--
Terra
sysAdmin
FutureQuest

Andilinks
08-14-2006, 07:20 PM
My hourly numbers look normal, if there was any loss here it is imperceptible. :)

Andi

cindik
08-14-2006, 07:26 PM
"Seeing GoDaddy pages wouldn't have anything to do with us... I would recommend sweeping your computer for spyware and various other ilk that would have caused the redirection..."

::shrug::

Some of my friends had the same problem too - everything cleared up about the same time.

Matt
08-15-2006, 02:55 AM
First I got a call Monday morning from a client affected by the SAMSON disk problem (any additional information as to the cause?). Then, later that afternoon, I was affected by this network problem, which lasted, I believe, for maybe 30 minutes (not sure, as I finally gave up and called it a day). I would also be interested to see additional details as to what the problem was. Maybe FQ's just overwhelmed, but at one time there would have been a detailed, somewhat cryptic follow-up as to what the problem(s) was/were.

In both cases I do not see any follow-up explaining what the problem was and what might have been done to prevent it from recurring. I also wish that I didn't have to repeat this, but FQ needs an alternate channel of communication. Since ALL FQ communication goes through the FQ network, a network outage means ZERO communication. Lesser hosts have addressed this and FQ needs to swallow its pride and use another hosting company in a different location as a means of providing back-up communication.

Finally, it is my novice understanding that the purpose of having several different network providers is that if one goes down, the FQ network stays on-line, without any noticeable hiccups. That is in theory. I know from my experience that the hardware I have used for this purpose never works quite as seamlessly as advertised. I have, however, attributed this to using routers costing hundreds, not thousands of dollars. Assuming that FQ has routers costing thousands of dollars, why is manual intervention required when a network provider is having problems?

-Matt

Andilinks
08-15-2006, 03:15 AM
"I also wish that I didn't have to repeat this, but FQ needs an alternate channel of communication."

Though I wasn't affected this time I'll have to concur. It wouldn't have to be anything more complicated than a single low-end web site hosted somewhere else. This tiny expenditure would provide a huge reassurance to all your clients. I would feel better knowing it is there even if I never used it.

Andi

hobbes
08-15-2006, 09:02 AM
Concur w/Matt. Perhaps something could be set up in the Colorado Data Center:-? Of course one of us could set up a forum somewhere (else) and let others know about it ahead of time, but it would be best for it to be an FQ sanctioned & maintained feature.

-- Convinced (http://www.aota.net/forums/showthread.php?t=20236&highlight=samson) Samson (http://www.aota.net/forums/showthread.php?t=20396&highlight=samson) should (http://www.aota.net/forums/showthread.php?t=20429&highlight=samson) have (http://www.aota.net/forums/showthread.php?t=21150&highlight=samson) had (http://www.aota.net/forums/showthread.php?t=21169&highlight=samson) its (http://www.aota.net/forums/showthread.php?t=21175&highlight=samson) hair (http://www.aota.net/forums/showthread.php?t=21180&highlight=samson) trimmed (http://www.aota.net/forums/showthread.php?t=21142&highlight=samson) long (http://www.aota.net/forums/showthread.php?t=20294&highlight=samson) ago (http://www.aota.net/forums/showthread.php?t=21752&highlight=samson) --

Randall
08-15-2006, 10:47 AM
"FQ needs an alternate channel of communication."

Agreed. I just happened to catch the tail-end of it (which isn't a bad thing -- when these things happen and I'm not around I really feel left out), so it was already over before I could finish my troubleshooting. But when the whole network goes *poof* for more than a few minutes, you don't know if it's Level3 acting up again, or an ISP problem, or alligators in the data center...

Could be a simple announcement page at, say, fqalert.com: "Please stand by while we chase the alligators out of the data center." That sort of thing. :rasberry:

Randall

magr
08-15-2006, 11:36 AM
Is there still a problem? My site is on and off. Right now it is very sluggish.
Thank you.

Terra
08-15-2006, 11:44 AM
There is currently instability with our BGP4 sessions, and I am scrambling to patch things up between our Internap and TWTC uplinks... I have just finished rerouting all egress packets out our TWTC uplink, which should give me some breathing room to stabilize the Internap and Level3 uplinks...

I have discovered that the problem with Level3 is not Level3 itself, but rather that the Quagga BGP4 daemon is feeding out bad updates to our other routing daemons regarding where it believes our internal IP space is... This caused a domino effect that propagated to our other routing daemons, and I'm now trying to purge the ill effects...

In short, our Quagga routing daemons have gone schizophrenic and they are currently being medicated, put to sleep (initiating manual routing control), and will be reawoken once the drugs have kicked in... :)

--
Terra
sysAdmin
FutureQuest
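
As a rough illustration of the failure described above, where a routing daemon hands out wrong information about where internal IP space lives, the sketch below flags any update that places an internal prefix behind an external uplink. It is not Quagga configuration and not FutureQuest's actual fix; the address ranges (RFC 5737 documentation prefixes), `INTERNAL_SPACE`, and `is_suspect` are all hypothetical names chosen for the example.

```python
import ipaddress

# Hypothetical internal address space (documentation prefixes, not real FQ ranges).
INTERNAL_SPACE = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_suspect(prefix: str, via_external_uplink: bool) -> bool:
    """Return True when an update points internal address space at an external uplink."""
    net = ipaddress.ip_network(prefix)
    inside_internal = any(net.subnet_of(internal) for internal in INTERNAL_SPACE)
    return via_external_uplink and inside_internal


# Illustrative updates: (announced prefix, learned via an external uplink?)
updates = [
    ("192.0.2.128/25", True),   # internal space claimed to be outside: suspect
    ("203.0.113.0/24", True),   # ordinary external route
    ("192.0.2.0/24", False),    # internal space learned internally
]
suspect = [prefix for prefix, external in updates if is_suspect(prefix, external)]
```

Using `subnet_of` rather than a plain overlap test keeps a default route (0.0.0.0/0) from being flagged just because it covers everything.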

magr
08-15-2006, 11:48 AM
Thanks, Terra. The magic fingers will take care of it then.

Wassercrats
08-15-2006, 01:01 PM
"FQ needs to swallow its pride and use another hosting company in a different location as a means of providing back-up communication."

Do any major web hosts do that? If I were Futurequest, I'd do something internally to allow backup communication. If it's only a matter of minutes before emails or whatever get out, I think I'd settle for that.

Andilinks
08-15-2006, 01:13 PM
"If I were Futurequest, I'd do something internally to allow backup communication."
That defeats the off-site advantage, unless it's done in Colorado which seems to be a natural thing.

But two hosts trading an emergency site also seems to be a natural symbiosis. With FQ personnel spread around the world, even if the datacenter in Orlando were totally cut off from the world the emergency site could be updated to reflect the situation. It would buy some peace of mind for all of FQ's clients. When I lose touch with my site it is very upsetting for me, and when I can't reach aota.net or my email it reaches panic proportions. I think it's the same for most of FQ's clients.

No matter how serious or trivial a problem may be, it is much worse if I'm in the dark about it.

Wassercrats
08-15-2006, 01:39 PM
If I owned McDonald's and someone wanted to start a relationship with Burger King for backup buns because it would be faster or cheaper than the corner grocery, I'd fire them.

I don't think there's much of an off-site advantage, but maybe Futurequest needs a backup network and backup machines running different equipment. I'm still not sure what the complaint is though. Was there too long a wait for email notification? How long?

Andilinks
08-15-2006, 01:53 PM
"If I owned McDonald's and someone wanted to start a relationship with Burger King for backup buns because it would be faster or cheaper than the corner grocery, I'd fire them."

I don't think there are any large Internet companies that don't have cooperative arrangements with many companies that are their rivals in other arenas.

frankc
08-15-2006, 06:17 PM
"In short, our Quagga routing daemons have gone schizophrenic and they are currently being medicated, put to sleep (initiating manual routing control), and will be reawoken once the drugs have kicked in..."
Dang, I love this place! :vday2:

Keep up the good work, Terra :yeah: !

Terra
08-16-2006, 02:17 PM
We believe the Level3 routing issue has been solved, and that uplink has been turned back up and is now operating properly...

--
Terra
sysAdmin
FutureQuest

hobbes
08-16-2006, 02:31 PM
Terra - can we get a comment on the potential for a notice board hosted outside the existing FQ network? Though not a big deal for very short outages, I recall at least 1-2 occasions where the lack of contact was frustrating. Thx.

Jeff
08-16-2006, 03:44 PM
"If I owned McDonald's and someone wanted to start a relationship with Burger King for backup buns because it would be faster or cheaper than the corner grocery, I'd fire them."

They could simply get a server in a datacenter that doesn't directly offer virtual hosting since that is FutureQuest's focus. I don't think it would be a bad idea for McDonald's to have a backup bun supplier that only makes buns and not hamburgers; even if the corner grocery sells hamburger buns and hamburger patties and catsup it's not a threat to McDonald's. FutureQuest is defined by the combination of superior support and network and servers. It doesn't bother me if they have a non-staff HVAC technician or generator technician come in, nor do I see a problem with having an emergency server or monitoring server or offsite-backup servers in buildings or racks they don't directly own as long as the main primaries are under their direct control. I don't think the need for such is critical, but it's not a bad idea either if there ever is a case where downtime is over 5 minutes.

Terra
08-16-2006, 03:54 PM
We will be looking at placing a brief status page at another URL in the case of a major network outage... There are no plans of hosting forums elsewhere at this time...

In the case of this network issue, we would not have posted any status because by the time we were alerted by external monitoring that Level3 was looping, the problem was corrected within ~5 minutes by shutting down that side of our network...

--
Terra
sysAdmin
FutureQuest

Matt
08-16-2006, 05:16 PM
"In the case of this network issue, we would not have posted any status because by the time we were alerted by external monitoring that Level3 was looping, the problem was corrected within ~5 minutes by shutting down that side of our network..."

The downtime for ME was more like 30 minutes. It'd be great if whatever monitoring software you're using would send a simultaneous alert to the backup site, thus eliminating the need to manually intervene.

I get the feeling that post-Y2K, post-9/11, post-Katrina, the importance of having contingency plans still isn't fully appreciated here. Irrespective of natural or man-made disasters, it's also possible that the government could come knocking and confiscate enough of the infrastructure to take FQ off the net. I love FQ, but as good as it is, it still can't control nature or people. With that in mind, I think that having a remotely hosted site with forum (Emergency notices only) and real-time status report would be a reasonable goal. A point that may have been missed is that clients cannot communicate w/ FQ if its network is down; simply having a status page will not address this circumstance (communication is a two-way street).

P.S. Here's a backup site that I found randomly which encompasses these ideas:
http://ineosolutions.info/

-Matt

Kevin
08-16-2006, 05:22 PM
The problem with automatic updates is that if one site can talk to the other, you can probably get to the main site, so the backup wouldn't be needed. If we do a backup emergency notification site it would only be for updates on situations that make the main site unreachable.

Wassercrats
08-16-2006, 05:24 PM
"clients cannot communicate w/ FQ if its network is down"

As long as the status page would be updated enough with whatever information is available (and matters), that should be about as good as a forum or email. What would there be to discuss that's not obvious?

Maybe the current /Status page should be moved to the other domain and there should be a time and date for each server to indicate when a special notice was last posted about that server. Then you can click the date and read the notice.
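
A minimal sketch of the per-server "last notice" idea suggested above, assuming nothing about how the real /Status page is built. The server names, the notice text, the timestamp, and the helpers `post_notice` and `status_lines` are all hypothetical and exist only for illustration.

```python
from datetime import datetime

# server name -> (time the notice was posted, notice text)
notices = {}


def post_notice(server, text, posted_at):
    """Record the most recent special notice for a server."""
    notices[server] = (posted_at, text)


def status_lines(servers):
    """Build one status line per server showing when its last notice was posted."""
    lines = []
    for server in servers:
        if server in notices:
            posted_at, _text = notices[server]
            lines.append(f"{server}: last notice {posted_at:%m-%d-%Y %I:%M %p}")
        else:
            lines.append(f"{server}: no notices")
    return lines


# Illustrative data only.
post_notice("server-a", "example notice text", datetime(2006, 8, 14, 19, 1))
lines = status_lines(["server-a", "server-b"])
```

On a real page each date would link to the stored notice text, which is kept alongside the timestamp here.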

Matt
08-16-2006, 05:41 PM
"What would there be to discuss that's not obvious?"

Instances where FQ isn't aware that there is a problem (perhaps one which affects only a small subset of clients).

"The problem with automatic updates is that if one site can talk to the other you can probably get to the main site so the backup wouldn't be needed."

I assume that FQ has off-site monitoring capabilities. If the off-site monitoring software detects a problem communicating w/ FQ network, it updates the emergency site. If off-site monitoring isn't in place, that's another piece that needs to be put in place.

"If we do a backup emergency notification site it would only be for updates on situations that make the main site unreachable."

Sounds reasonable.
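
The off-site monitoring idea above could be sketched roughly as follows: a monitor running outside the main network probes it, and only after several consecutive failures does it change the notice shown on the emergency site. This is a hedged illustration, not FutureQuest's monitoring setup; `OutageMonitor`, the threshold of three, and the notice wording are assumptions, and the probe is passed in as a function so the logic can be shown without real network calls.

```python
from dataclasses import dataclass
from typing import Callable

OK_NOTICE = "All systems reachable."
OUTAGE_NOTICE = "Main network unreachable from off-site monitor; staff alerted."


@dataclass
class OutageMonitor:
    probe: Callable[[], bool]  # returns True when the main site answers
    threshold: int = 3         # consecutive failures before posting an outage notice
    failures: int = 0
    notice: str = OK_NOTICE

    def tick(self) -> str:
        """Run one probe and return the notice the emergency site should show."""
        if self.probe():
            self.failures = 0
            self.notice = OK_NOTICE
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.notice = OUTAGE_NOTICE
        return self.notice


# Simulated probe results: up, three failures in a row, then up again.
results = iter([True, False, False, False, True])
monitor = OutageMonitor(probe=lambda: next(results))
history = [monitor.tick() for _ in range(5)]
```

Requiring several failures in a row keeps a single dropped probe from posting a false alarm, at the cost of a short delay before the notice appears.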

Jeff
08-16-2006, 06:05 PM
"Instances where FQ isn't aware that there is a problem (perhaps one which affects only a small subset of clients)."
They could put a simple contact form on the emergency site - then that server will relay to the futurequest server in rare cases where there is a network problem only between the user and futurequest and the user doesn't have another network connection to use.

Kevin
08-16-2006, 06:16 PM
If you can't get to FutureQuest and it is a problem that we aren't aware of because it isn't affecting a large enough portion of the internet then pick up the phone. 877-QUESTLINE

Jeff
08-16-2006, 06:19 PM
Well, sure there is a simple solution :P

(sorry, edited my reply to be clever despite being short on sleep before your reply Kevin.)

Kevin
08-16-2006, 06:22 PM
I am not sure if "phone support" is really the right term for it. :P

The number goes straight to an answering machine where you can leave a message. The messages go to the same ticketing system as the service desk emails and are listened to by whoever is working at the time (usually Bob). There is no way to call that number and get a human but it is designed for situations where we can't be contacted through the internet.

hobbes
08-16-2006, 07:37 PM
Kevin - there have been a couple of occasions where I have left a message on the QuestLine when I was unable to reach FQ.net. If memory serves, response times were less than desirable as I'm guessing many others were also trying to figure out what was going on, and of course the staff's efforts had to be spent on troubleshooting/fixing the problem. An off-network notice board would be a more expedient means of getting the word out. My $0.02.

Kevin
08-16-2006, 07:40 PM
Right, I was only saying that the phone number was good for cases where you can't reach us but most everyone else can.

Javier Mosqueda
08-17-2006, 05:59 PM
I logged in through telnet 2 hours ago, here on my machine and then at my client's. While I was at my client's, the connection was closed by FQ, and since then I haven't been able to log in.

Is this "Network Instability" the reason for that?

Terra
08-17-2006, 08:23 PM
No, most likely your telnet session timed out from inactivity...

There has not been any further instability, and the patches made to the Level3 side have fully stabilized and appear to be working rather well...

--
Terra
sysAdmin
FutureQuest