[FQuest Alert] Mail Services


Bob
02-12-2008, 10:57 AM
A short while ago the MX servers all became overloaded. Technicians are working on the issue at this time; however, it may result in some delayed messages or errors when attempting to send messages.

We will post more as soon as additional information is available.

Our apologies for any inconvenience,
Bob

Bob
02-12-2008, 11:32 AM
It appears that an automatic signature update for ClamAV caused issues with scanning on the MX servers.

Bruce has disabled scanning at this time to allow services to get back to more normal levels and also to investigate what the exact problem is.

We will post more as it becomes available.

Again our sincere apologies,
Bob

hobbes
02-12-2008, 12:42 PM
Are queues still backed up? Any other issues still going on?

Bradley
02-12-2008, 01:14 PM
Oddly enough, I was doing an install of ClamAV on the colo last night. I had to stop for the night and may put it on hold until I can see what happened with your install.

Bob
02-12-2008, 01:26 PM
hobbes wrote: Are queues still backed up? Any other issues still going on?

Yes the MX servers have large backlogs of queued mail as a result of the ClamAV issues and will take some time to clear out once everything has settled down.

At this time multiple technicians are working on the issues trying to get everything back to more normal service levels.

As soon as we know anything additional we will post again.

Our apologies and also our thanks for the understanding and cooperation everyone has shown during this event,
Bob

Bob
02-12-2008, 01:42 PM
The MX servers are being rebooted one by one to bring them back into normal service without scanning enabled.

While each server is rebooting, mail delays will occur. Once back in service, it will begin delivering mail and start moving the queued messages to inboxes.

Once all are back in service we will post again.

Thanks,
Bob

kitchin
02-12-2008, 01:52 PM
Thanks for dealing with the horror that is email. Do you think DKIM is going to fix the internet, like that Slashdot article said it might?

Randall
02-12-2008, 02:36 PM
Huh. My outgoing mail was moving again as of 11:38 EST -- maybe they fixed mine first to shut me up. :rasberry:

Randall

Bob
02-12-2008, 02:37 PM
All the MX servers have been returned to service (one is in limited service due to a potential disk failure) at this time.

Most are showing a slow but definite drop in queued messages, which indicates they are handling new messages as well as starting to clear out the queues. For the remaining servers, we expect to see a reduction in queued messages within the hour.

I would expect that it will take a number of hours to completely clear all the queued messages as they built up rather dramatically.

The MX server with the potential disk failure will be attended to this afternoon and Bruce will be by to provide additional information regarding the problems with ClamAV later as well...

Once again our sincere apologies, and again our thanks for everyone's patience during this event.

kitchin, I really don't think DKIM will "Fix" the Internet ;) BUT it should be a start towards possibly getting email back as a trusted and reliable communication solution. I hope...

Thanks everyone,
Bob

Bruce
02-12-2008, 07:39 PM
As far as I can determine, this is what happened.

Around 10:20 AM, our systems did an automated update on the ClamAV virus signature databases. After downloading the new signatures, the master system signals the actual scanning systems to reload their databases. Upon loading the incremental tables, data corruption was introduced into many of the running scanners. When they tried to scan messages, the corruption was discovered, and the scan was aborted.
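For readers wondering what a safer "download, then signal reload" step could look like, here is a minimal Python sketch. It is not how our update system or ClamAV works: `install_if_valid`, its parameters, and the `smoke_test` callable are all hypothetical names used for illustration. The idea is to stage the new file, check it, and only then swap it into place. A check like this would not necessarily have caught this incident, since the corruption appeared inside already-running scanners during the reload.

```python
import hashlib
import os
import tempfile


def install_if_valid(new_bytes, expected_sha256, live_path, smoke_test):
    """Install new_bytes at live_path only if every check passes.

    smoke_test is a hypothetical callable: it receives a file path and
    returns True when a throwaway scanner could load that file and scan
    a known sample. Returns True if the live file was replaced.
    """
    if hashlib.sha256(new_bytes).hexdigest() != expected_sha256:
        return False  # corrupted download: keep the old database

    # Stage next to the live file so the final rename stays on one filesystem.
    directory = os.path.dirname(os.path.abspath(live_path))
    fd, staged_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as staged:
            staged.write(new_bytes)
        if not smoke_test(staged_path):
            return False  # candidate failed to load: keep the old database
        os.replace(staged_path, live_path)  # atomic swap
        return True
    finally:
        # Clean up the staged copy if it was never swapped in.
        if os.path.exists(staged_path):
            os.remove(staged_path)
```

Only after this returns True would the master signal the scanners to reload, so a rejected update leaves the previous signatures untouched.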

However, instead of exiting, these scanners continued to accept more connections and then drop them. This effectively stopped all scanning on those servers. Once all the scanners with corrupted data had effectively dropped out of service, there were no longer enough scanning engines online to handle the volume of incoming email. This prevented incoming mail from being accepted consistently, but some was still getting in.

Unfortunately, it got worse. Many of those scanning engines corrupted enough internal data that they stopped even trying to accept new connections. Instead, they just printed out a bogus error message ("ERROR: accept() failed: Loaded 362916 signatures.") repeatedly in a tight loop. Eventually, this bogus log data filled up all the disk space on the partition holding the logs on the MXs. The MXs serve double duty, handling both (re)delivery (which uses very little CPU) and scanning services (which use very little disk resources). When those partitions filled, the mail services were unable to write out their logs, causing them to stall and leaving the mail that was already on the server undeliverable.
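The tight-loop failure described above is a general hazard for any daemon with an accept loop. As an illustration only (this is not ClamAV's code, and `serve`, `TooManyAcceptFailures`, and the parameter names are made up for this sketch), a loop can back off between failed accepts and give up after a bounded number of consecutive errors, so a supervisor can restart the process instead of the logs growing without limit:

```python
import time


class TooManyAcceptFailures(RuntimeError):
    """Raised when accept() keeps failing; the process should exit."""


def serve(listener, handle, max_failures=5, base_delay=0.1,
          sleep=time.sleep, log=print):
    """Accept connections until accept() fails max_failures times in a row.

    listener is any object with an accept() method, such as a listening
    socket.socket. handle is called with each accepted connection.
    """
    failures = 0
    while True:
        try:
            conn, _addr = listener.accept()
        except OSError as exc:
            failures += 1
            # One log line per failure, and at most max_failures lines per run.
            log(f"accept() failed ({failures}/{max_failures}): {exc}")
            if failures >= max_failures:
                raise TooManyAcceptFailures(str(exc)) from exc
            # Exponential backoff: base_delay, 2x, 4x, ...
            sleep(base_delay * 2 ** (failures - 1))
            continue
        failures = 0  # a successful accept resets the counter
        with conn:
            handle(conn)
```

With the defaults, a permanently broken listener produces five log lines and then an exception, rather than an unbounded stream of messages. Capping log size separately (for example with size-limited rotation) would guard the disk even when a process does misbehave.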

An update to ClamAV had been made available yesterday, so I figured that the broken data files would likely work in the newer version. However, even that proved to be a problem. After upgrading, the newer version crashed when it tried to read two leftover data files from the older version. This, combined with a typo I made when updating the config files, loaded the CPUs and filled all RAM, causing half of the MXs to become inaccessible. This is when the previously mentioned reboots had to be executed.

After rebooting the inaccessible servers and fixing the broken config and data files, all the virus scanners were brought back online one by one. After they were back and fully operational, mail delivery returned to normal. However, there were thousands of messages queued up on each server, so it took some time for all previously held messages to be delivered.

The redelivery process completed over an hour ago, and all the servers appear stable at this point. Kevin is at the data center, and will be replacing a failing drive on one of the MXs shortly. This will not significantly impact service and no further interruptions are anticipated.

hobbes
02-12-2008, 07:53 PM
Thanks Bruce. Any way of preventing these occurrences again in the future?

Terra
02-13-2008, 06:02 AM
The whole point of automated updates is to get new signatures online as fast as possible...

It would be nice if there was some sort of versioning in the update files that would tell the running ClamAV, "Hey, you are not capable of handling this new signature format - so don't, and just notify"...
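To make the idea concrete, here is a rough Python sketch of that kind of check. The `FORMAT <n>` header line, `SUPPORTED_FORMAT`, and `should_load` are invented for this example and are not ClamAV's actual file layout or API. The loader reads a version from the update file and refuses anything it does not understand, keeping the signatures it already has:

```python
SUPPORTED_FORMAT = 3  # hypothetical: newest format this scanner understands


def should_load(header_line, supported=SUPPORTED_FORMAT):
    """Return (ok, message) for a made-up header line such as 'FORMAT 3'."""
    parts = header_line.split()
    if len(parts) != 2 or parts[0] != "FORMAT" or not parts[1].isdigit():
        return False, "unrecognized header; keeping current signatures"
    version = int(parts[1])
    if version > supported:
        return False, (f"format {version} is newer than supported "
                       f"{supported}; keeping current signatures")
    return True, f"format {version} accepted"
```

The message returned on refusal is what would go to the admins as the "just notify" part, while scanning carries on with the old database.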

Overall, this should be fixed by the upstream - and not us... We don't want to deviate from the core ClamAV with patches that will just get broken in the next release...