FutureQuest, Inc.
Old 02-12-2008, 10:57 AM   Postid: 165150
 Bob
Service Rep
 
 
Join Date: Dec 1999
Location: Jacksonville, Fl
Posts: 4,887
[FQuest Alert] Mail Services

A short while ago the MX servers all became overloaded. Technicians are working on the issue at this time; however, it may result in some delayed messages or errors when attempting to send messages.

We will post more as soon as additional information is available.

Our apologies for any inconvenience,
Bob
Old 02-12-2008, 11:32 AM   Postid: 165154
 Bob
Service Rep
 
 
Join Date: Dec 1999
Location: Jacksonville, Fl
Posts: 4,887
Re: [FQuest Alert] Mail Services

It appears that an automatic signature update for ClamAV caused issues with scanning on the MX servers.

Bruce has disabled scanning at this time to allow services to get back to more normal levels and also to investigate what the exact problem is.

We will post more as it becomes available.

Again our sincere apologies,
Bob
Old 02-12-2008, 12:42 PM   Postid: 165156
hobbes
Have you hugged a tiger today?
 

 
Join Date: Mar 2000
Location: Third Sol Planet Posts: Far too many. Oh ok -
Posts: 2,705
Re: [FQuest Alert] Mail Services

Are queues still backed up? Any other issues still going on?
Old 02-12-2008, 01:26 PM   Postid: 165160
 Bob
Service Rep
 
 
Join Date: Dec 1999
Location: Jacksonville, Fl
Posts: 4,887
Re: [FQuest Alert] Mail Services

Quote:
Originally Posted by hobbes
Are queues still backed up? Any other issues still going on?
Yes, the MX servers have large backlogs of queued mail as a result of the ClamAV issues, and these will take some time to clear out once everything has settled down.

At this time multiple technicians are working on the issues trying to get everything back to more normal service levels.

As soon as we know anything additional we will post again.

Our apologies and also our thanks for the understanding and cooperation everyone has shown during this event,
Bob
Old 02-12-2008, 01:14 PM   Postid: 165159
Bradley
Site Owner
 

 
Join Date: Aug 1999
Location: Kingsport,TN
Posts: 794
Re: [FQuest Alert] Mail Services

Oddly enough I was doing an install of clamav on the colo last night. I had to stop for the night and may put it on hold until I can see what happened with your install.
__________________
Bradley
Nothing in this world that's worth having comes easy.
My blog
Old 02-12-2008, 01:42 PM   Postid: 165161
 Bob
Service Rep
 
 
Join Date: Dec 1999
Location: Jacksonville, Fl
Posts: 4,887
Re: [FQuest Alert] Mail Services

The MX servers are being rebooted one by one to bring them back into normal service without scanning enabled.

As they are being rebooted, mail delays will occur; once back in service, they will begin delivering mail and start moving the queued messages to inboxes.

Once all are back in service we will post again.

Thanks,
Bob
Old 02-12-2008, 01:52 PM   Postid: 165162
kitchin
Site Owner

 
Join Date: Jan 2001
Location: Virginia
Posts: 2,992
Re: [FQuest Alert] Mail Services

Thanks for dealing with the horror that is email. Do you think DKIM is going to fix the internet, like that Slashdot article said it might?
Old 02-12-2008, 02:36 PM   Postid: 165165
Randall
Fuzzier than thou
 

 
Join Date: Nov 2002
Posts: 9,640
Re: [FQuest Alert] Mail Services

Huh. My outgoing mail was moving again as of 11:38 EST -- maybe they fixed mine first to shut me up.

Randall
__________________
Where's Randall?
Old 02-12-2008, 02:37 PM   Postid: 165166
 Bob
Service Rep
 
 
Join Date: Dec 1999
Location: Jacksonville, Fl
Posts: 4,887
Re: [FQuest Alert] Mail Services

All the MX servers have been returned to service (one is in limited service due to a potential disk failure) at this time.

Most are showing a slow but definite drop in queued messages, which indicates they are handling new messages as well as starting to clear out the queues. We expect the remaining servers to show a reduction in queued messages within the hour.

I would expect that it will take a number of hours to completely clear all the queued messages as they built up rather dramatically.

The MX server with the potential disk failure will be attended to this afternoon and Bruce will be by to provide additional information regarding the problems with ClamAV later as well...

Once again our sincere apologies and again our thanks for everyone's patience during this event.

Kitchin, I really don't think DKIM will "fix" the Internet, BUT it should be a start towards possibly getting email back as a trusted and reliable communication solution. I hope...

Thanks everyone,
Bob
Old 02-12-2008, 07:39 PM   Postid: 165180
 Bruce
Developer
 
 
Join Date: Apr 2001
Location: Saskatoon, SK, Canada
Posts: 1,182
Re: [FQuest Alert] Mail Services

As far as I can determine, this is what happened.

Around 10:20 AM, our systems did an automated update on the ClamAV virus signature databases. After downloading the new signatures, the master system signals the actual scanning systems to reload their databases. Upon loading the incremental tables, data corruption was introduced into many of the running scanners. When they tried to scan messages, the corruption was discovered, and the scan was aborted.
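
To illustrate the reload step described above, here is a minimal sketch of a staged reload: the new database is loaded into a throwaway engine and checked against a few known samples before any running scanner is told to reload. This is not the code used on the MXs and it does not use any ClamAV API; `staged_reload`, `load_db`, `canary_samples`, and `signal_reload` are all hypothetical names. Whether a check like this would have caught this particular corruption is not known.

```python
from typing import Callable, Iterable, Tuple

# A "scan function" takes message bytes and returns a verdict string.
ScanFn = Callable[[bytes], str]


def staged_reload(
    new_db_path: str,
    load_db: Callable[[str], ScanFn],           # hypothetical: loads a DB, returns a scan function
    canary_samples: Iterable[Tuple[bytes, str]],  # (sample, expected verdict) pairs
    signal_reload: Callable[[str], None],       # hypothetical: tells the live scanners to reload
) -> bool:
    """Signal a reload only if the new database passes a canary check.

    Returns True if the reload was signaled, False if the new database
    failed to load or produced an unexpected verdict.
    """
    try:
        scan = load_db(new_db_path)
        for sample, expected in canary_samples:
            if scan(sample) != expected:
                return False  # wrong verdict: keep the old database in service
    except Exception:
        return False  # load or scan blew up: keep the old database in service
    signal_reload(new_db_path)
    return True
```

The design choice here is simply that the running scanners keep their old, working database unless the new one has been shown to load and scan in isolation.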

However, instead of exiting, these scanners continued to accept more connections and then drop them. This effectively stopped all scanning on those servers. Once all the scanners with corrupted data had effectively dropped out of service, there were no longer enough scanning engines online to handle the volume of incoming email. This prevented incoming mail from being accepted consistently, but some was still getting in.
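
The failure mode above (a scanner that knows its data is bad but keeps accepting work) can be contrasted with a fail-fast loop. The following is a minimal sketch under stated assumptions, not the scanner's real code: `CorruptDatabaseError`, `run_scanner`, and the exit code value are hypothetical, and `scan` stands in for whatever actually inspects a message.

```python
from typing import Callable, Iterable, List, Tuple

EXIT_OK = 0
EXIT_CORRUPT_DB = 70  # hypothetical exit code for "internal data is corrupt"


class CorruptDatabaseError(Exception):
    """Raised by the (hypothetical) scan function when it detects bad data."""


def run_scanner(
    messages: Iterable[bytes],
    scan: Callable[[bytes], str],
) -> Tuple[List[str], int]:
    """Scan messages until done or until corruption is detected.

    On corruption the loop stops taking new work immediately and reports
    a non-zero exit code, instead of accepting and dropping connections.
    """
    verdicts: List[str] = []
    for message in messages:
        try:
            verdicts.append(scan(message))
        except CorruptDatabaseError:
            return verdicts, EXIT_CORRUPT_DB
    return verdicts, EXIT_OK
```

A caller would pass the returned code to `sys.exit()`, so that a process supervisor sees a dead scanner and can restart it or take it out of rotation, rather than seeing a live process that silently drops everything.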

Unfortunately, it got worse. Many of those scanning engines corrupted enough internal data that they stopped even trying to accept new connections. Instead, they just printed out a bogus error message ("ERROR: accept() failed: Loaded 362916 signatures.") repeatedly in a tight loop. Eventually, this bogus log data filled up all the disk space on the partition holding the logs on the MXs. The MXs serve double duty, handling both (re)delivery (which uses very little CPU) and scanning services (which use very little disk resources). When those partitions filled, the mail services were unable to write out their logs, causing them to stall and leaving the mail that was already on the server undeliverable.
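
One general defense against a tight error loop filling a log partition is to throttle repeated messages. The sketch below is an illustration only, not something ClamAV or the MXs are known to do: `ThrottledLog`, its parameters, and the 60 second default are hypothetical. It writes a given message at most once per interval and reports how many repeats were suppressed in between.

```python
import time
from typing import Callable, Dict, Tuple


class ThrottledLog:
    """Emit each distinct message at most once per `min_interval` seconds."""

    def __init__(
        self,
        emit: Callable[[str], None],
        min_interval: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._emit = emit
        self._min_interval = min_interval
        self._clock = clock
        # message -> (time of last emitted line, repeats suppressed since then)
        self._state: Dict[str, Tuple[float, int]] = {}

    def error(self, message: str) -> bool:
        """Log `message`; return True if a line was actually written."""
        now = self._clock()
        last = self._state.get(message)
        if last is not None and now - last[0] < self._min_interval:
            # Too soon: count the repeat but write nothing.
            self._state[message] = (last[0], last[1] + 1)
            return False
        suppressed = last[1] if last is not None else 0
        suffix = f" (repeated {suppressed} more times)" if suppressed else ""
        self._emit(message + suffix)
        self._state[message] = (now, 0)
        return True
```

With something like this in front of the log file, a loop that hits the same failure thousands of times per second produces one line per interval instead of unbounded output. It does not fix the loop itself, which would still burn CPU; pairing it with a sleep or an exit on repeated `accept()` failure would address that.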

An update to ClamAV had been made available yesterday, so I figured that the broken data files would likely work in the newer version. However, even that proved to be a problem. After upgrading, the newer version crashed when it tried to read two leftover data files from the older version. This, combined with a typo I made when updating the config files, loaded the CPUs and filled all RAM, causing half of the MXs to become inaccessible. This is when the previously mentioned reboots had to be executed.

After rebooting the inaccessible servers and fixing the broken config and data files, all the virus scanners were brought back online one by one. After they were back and fully operational, mail delivery returned to normal. However, there were thousands of messages queued up on each server, so it took some time for all previously held messages to be delivered.

The redelivery process completed over an hour ago, and all the servers appear stable at this point. Kevin is at the data center, and will be replacing a failing drive on one of the MXs shortly. This will not significantly impact service and no further interruptions are anticipated.
__________________
Bruce Guenter, FutureQuest http://www.FutureQuest.net/ http://untroubled.org/



All times are GMT -4.


Hosted & Administrated by FutureQuest, Inc.
Images & content copyright © 1998-2013 FutureQuest, Inc.