Extending spamassassin with a custom filter that scores for words in email body?


m23
05-14-2005, 01:04 PM
I would like to write a custom filter that can adjust the spam assassin score for an email, based on keywords in the email body. The spam I do get (that gets through) has certain keywords, like (I've shortened some to account for variations like "qualify" and "qualified"):

rate
qualif
quote
approv
mortga
refinan
forkesparrked44anew
http://
folosko
credit
guarantee

From my understanding of how custom filters work, I can write a script to search for these words and then take action, like deleting the email. I can already do that, as suggested in this thread: http://www.aota.net/forums/showthread.php?t=19362.

BUT what I'd really like to do is have the script work with spamassassin to adjust the score. Then if the score is high enough, then delete the email.

I don't want the script to take an all or nothing approach and just delete the email because it finds a single offending word. That could work for a word like "forkesparrked44anew" but it could not work for a word like "credit."

And while I could create a more sophisticated script that implements its own scoring system, I'd rather leverage spamassassin and just extend it by adding my own keywords and scores for email body text.
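A minimal sketch of the additive scoring idea described above, written in Python (the language m23 later says the script will be ported to). The weights and the `KEYWORD_WEIGHTS` name are made up for illustration; only the keyword stems come from the post. It does not talk to SpamAssassin at all; it only computes the extra points a filter could add to a score it already has.

```python
# Hypothetical weights: a nonsense token is near-certain spam,
# while a common word like "credit" only nudges the score.
KEYWORD_WEIGHTS = {
    "forkesparrked44anew": 5.0,
    "mortga": 0.5,
    "refinan": 0.5,
    "credit": 0.1,
}


def keyword_score(body, weights=KEYWORD_WEIGHTS):
    """Return (extra_score, matched_stems) for a message body.

    Matching is a case-insensitive substring test, so the stem
    "mortga" hits both "mortgage" and "Mortgages".
    """
    text = body.lower()
    matched = [stem for stem in weights if stem.lower() in text]
    return sum(weights[stem] for stem in matched), matched
```

Because each stem only contributes its own weight, a single "credit" cannot push a message over a delete threshold by itself, which is the behaviour the post asks for.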

Randall
05-14-2005, 06:11 PM
BUT what I'd really like to do is have the script work with spamassassin to adjust the score. Then if the score is high enough, then delete the email.

I'm no filter expert, so I could be talking through my hat here. :wink: But I imagine that a script (python, PHP, whatever you like best) could easily extract the SA score, parse the message for your custom hit list, adjust the score and then delete or pass the message as required.

Whether that would put a strain on the servers is something that only Bruce or Terra would know...

Randall

m23
05-14-2005, 06:33 PM
So spam assassin processes any email before a custom script runs? If so, that would work.

Thanks for the suggestion. I'll try it and let you know. I'm still trying to visualize exactly in which order each type of filter is applied.

I'll give it a try. If it works, I'll post my script when done. It seems like an example like this would be useful to others. I saw another post (which I provided a link to above) for a similar need, but it appeared to me to not use any kind of scoring system.

Melissa
05-14-2005, 06:36 PM
So spam assassin processes any email before a custom script runs?

The Order in Which Email Filters Run (http://service.FutureQuest.net/index.php?_a=knowledgebase&_j=questiondetails&_i=100) :)

Tom E.
05-15-2005, 12:59 PM
At the bottom of the above link The order in which email filters run (http://service.futurequest.net/index.php?_a=knowledgebase&_j=questiondetails&_i=100) it says
Note that the FutureQuest® mail system will not run SpamAssassin processing on the same email message more than once, even if it is forwarded to another mailbox in the FutureQuest® network that also has SpamAssassin enabled. The first SpamAssassin filter to score the message will insert the appropriate email headers and any subsequent SpamAssassin filters will operate on these original results.
However, in the current thread about the spam assassin upgrade (http://www.aota.net/forums/showthread.php?postid=133008#post133008) (and last time I tried it in the CNC a few months ago) you're not allowed to redirect email labelled as spam to an account on the same domain because of potential looping.

So (deep breath)........ is the knowledge base article about filter order wrong, or will the configuration of CNC and the new spam assassin version allow you to redirect to an account on the same domain if global filtering is enabled? Or...... do we still have to disable global filtering and use the convenience checkboxes at the bottom of the filter settings page to apply the settings to all accounts except the one used as a spam trap?

Or (humility clause)...... do I just not understand things correctly?

Melissa
05-15-2005, 01:19 PM
Hi Tom,

...you're not allowed to redirect email labelled as spam to an account on the same domain because of potential looping.

The important distinction to make is Global...

When enabling SpamAssassin as a global filter, you can NOT redirect to an email address on the same account.

Note that the FutureQuest® mail system will not run SpamAssassin processing on the same email message more than once, even if it is forwarded to another mailbox in the FutureQuest® network that also has SpamAssassin enabled. The first SpamAssassin filter to score the message will insert the appropriate email headers and any subsequent SpamAssassin filters will operate on these original results.
In the case of redirecting tagged email, it will not be scored again...however, the action (redirect) would still occur (and occur and occur and occur)...;)
...do we still have to disable global filtering and use the convenience checkboxes at the bottom of the filter settings page to apply the settings to all accounts except the one used as a spam trap?
Correct. To redirect tagged email to a mailbox on that account (domain/IRs), you would need to set up SA on the individual email accounts.

m23
05-30-2005, 12:50 PM
Thanks, Melissa. I see now that everything is listed in the KnowledgeBase (http://service.futurequest.net/?_a=knowledgebase).

m23
05-30-2005, 01:04 PM
I have written a Perl script that is almost done. The problem is that it is behaving differently from when I test it at the command line vs. when I test it as a filter. It works on my local windows machine from the command line. And it works on my futurequest account from the command line. But it doesn't work when tested as an email filter.

The scoring is off. For example, when the script runs it adds to the spam assassin score if it finds a word listed in a file I name spam_words.txt. If the spam assassin score was 0.8 and the script finds two words in spam_words.txt then the score could be 0.8 + 0.2 and should equal 1.0. When I test the script everything works fine. When I set it up as a filter in my command and control, the scoring doesn't work. I get a bogus score, which is always 0.5.

I don't know why I always get a score of 0.5. At the end of my script, I used undef to undefine all my variables -- I thought maybe the variables weren't getting reset for some reason. But that didn't help.

I think I will post this somewhere else for Perl to see if someone can help. If anyone has experience with email filters who can help here, that would be great.

The main thing I need help with is how to make sure that I am testing this thing correctly from the command line.

When using an email filter (a processor type), the email is STDIN. So, on the command line, I should use < email.txt to test, if I save one of my emails as email.txt? The name of the script is spam_cooker2.pl, so I test it like this, using Perl's debugger:

perl -d spam_cooker2.pl < email.txt

By the way, here is the script. Note, I have commented out a section at the end so the script does not bounce any email. I will uncomment that part later after the script works.


#!/usr/bin/perl

my $file;
my $header;
my $body;
my %words;
my @matches;
my $subject;
my $score = 0;

# Read in the email and save to scalar
while(<STDIN>){$file .= $_};

# Read in spam words file, saving words and their scores to hash
open(F,"<spam_words.txt");
while(<F>)
{   chomp $_;
    (my $score, my $word) = split("\t",$_);
    $words{$word} = $score;
}
close(F);

# Separate email header and body to process each individually later
# Also, save original spam assassin score and email subject line
($header,$body) = $file =~ /(.*\n?)^$(.*\n?)/ms;
$score = $2 if $header =~ /X-Spam-Status: (Yes|No), score=(\d+\.\d+)/;
$subject = $1 if $header =~ /Subject: (.*)/;

# Look for each word in spam_words.txt in email body
# If found, increase spam assassin score and add to matches array
foreach(%words)
{   if($body =~ quotemeta $_)
    {   $score += $words{$_};
        push @matches, $_;
    }
}

# If score is higher than 0.0, rewrite subject line in email header
if($score > 0.0)
{   $header =~ s/Subject.*/Subject: *{SPAM COOKER: $score}* $subject/;
}

# All done, now print out new email
print $header;
print "\n";
print "MATCHES: ";
foreach(@matches)
{   print $_ . ", ";
}
print "\n";
print $body;

# Undef all my variables, just for good measure
undef $file;
undef $header;
undef $body;
undef %words;
undef @matches;
undef $subject;
undef $score;

# Comment this out for now, so email does not get bounced during testing
#if($score > 1.0) # It's spam
#{ exit 100; # bounce it
#}
#else # It's not spam,
#{ exit 0; # deliver it
#}

# Exit code of 0 should deliver email and not bounce it when using email filter
0;
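Two details of the script above are easy to miss. The `split("\t", ...)` call expects each line of spam_words.txt to hold the score first and the word second. And `$score` is undefined just before the commented-out exit check, so once that check is re-enabled it would compare an empty value against 1.0 and never bounce anything. A rough Python sketch of both steps follows; the function names and the default threshold are illustrative, and the exit codes 100 (bounce) and 0 (deliver) are taken from the script's own comments.

```python
def load_weights(lines):
    """Parse "score<TAB>word" lines into a {word: score} dict."""
    weights = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        score, word = line.split("\t", 1)
        weights[word] = float(score)
    return weights


def exit_status(score, threshold=1.0):
    """Return 100 (bounce) above the threshold, else 0 (deliver)."""
    return 100 if score > threshold else 0
```

In a filter, `sys.exit(exit_status(score))` would be the last statement, after the message has been printed and while the score is still available.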

Bruce
05-30-2005, 06:17 PM
When using an email filter (a processor type), the email is STDIN. So, on the command line, I should use < email.txt to test, if I save one of my emails as email.txt?

Correct.

I test it like this, using Perl's debugger:

perl -d spam_cooker2.pl < email.txt

One thing you need to be careful of is paths. Filters are run in the xdom directory, that is, /big/dom/xDOMAIN. You will need to make sure all paths in your script are either absolute paths (including the /big/dom part) or are properly relative to that directory. For best testing, change directory into your xdom directory before starting the test.
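One way to sidestep the working-directory question is to build the data file's path from the script's own location rather than from wherever the filter happens to be started. A small Python sketch of that idea; `data_path` is a hypothetical helper name, and it assumes spam_words.txt sits in the same directory as the script.

```python
from pathlib import Path


def data_path(script_file, name="spam_words.txt"):
    """Return an absolute path to `name` next to the given script file."""
    return Path(script_file).resolve().parent / name
```

Inside the filter this would typically be called as `data_path(__file__)`, so the same file is opened whether the script is run by hand or by the mail system.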

m23
05-30-2005, 07:08 PM
Thanks, Bruce. I'm one step closer now. I know that using < email.txt is the same for testing as when the script is used to read email from STDIN.

I think all my paths are correct, because a) everything is in the same directory (in my domain directory, as you put it /big/dom/xDomain) and b) the script is running, it's just not getting the correct score.

The issue is that the score I get is different when the script is used as an email filter vs when I run the script from a command line. For example, when run from the command line, I might get a score of 0.2, but when run as a filter I get a score of 0.5. In fact, every time I run the script as a filter I get a score of 0.5.

While I can test the script from the command line using the Perl debugger, I can't debug the script running as a filter. Since I get different results when run from the command line and when run as a filter, I am not sure what to do.

Another possibility is that the file I am using to test with, email.txt, is different than an actual email. To create email.txt, I copied the email headers from an email in Questmail, then typed a blank line, then copied in the email body.

Any ideas?

kitchin
05-30-2005, 08:25 PM
Actual email is more likely to use \r\n\r\n to split the header and body, instead of \n\n, even though it is on a unix system.

Try
($header,$body) = split(/\r\n\r\n|\n\n|\r\r/, $file, 2);
or something like that or wait for someone more informed to correct my posting!!!

Also, you can write a log file to see what's going on.

sheila
05-31-2005, 02:53 AM
This script isn't in your HOME directory, I hope? The mail system does not have permissions to get into that folder.

It needs to be either in the xdomain directory, or else if you create a new folder, it needs to have 775 permissions on it, since the mail system runs the scripts under your GroupID, and not under your UserID.

Other than that, definitely add lines into the script for debugging purposes that print debugging output to a log file so that you can see the values of the variables at different places in the script as it runs.
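Since a filter cannot be stepped through in a debugger, a log file is the main window into what it sees. A minimal Python sketch using the standard `logging` module; the logger name and the `make_logger` helper are illustrative, and the log path has to be somewhere the mail system is allowed to write (see the permission note above).

```python
import logging


def make_logger(log_path):
    """Return a logger that appends timestamped lines to log_path."""
    logger = logging.getLogger("spam_filter_debug")
    logger.setLevel(logging.DEBUG)
    logger.handlers.clear()  # avoid duplicate lines if called twice
    handler = logging.FileHandler(log_path)
    handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
    logger.addHandler(handler)
    return logger
```

Logging the lengths of the header and body right after the split, for example `log.debug("header=%d body=%d", len(header), len(body))`, is usually enough to tell whether the split itself went wrong.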

m23
06-01-2005, 12:08 AM
kitchin,

I added a bunch of logging code to see what's happening. The trouble is the regular expression line ($header,$body) = $file =~ /(.*)^$(.*)/ms;

The $header and $body variables are empty. For some reason, this works fine from the command line. Well, actually, ($header,$body) = $file =~ /(.*\n?)^$(.*\n?)/ms; is what worked from my account command line (unix), but didn't work when the script was used as a filter. I think you're on to something. I've read that Perl actually adjusts the line ending characters depending on the operating system, so if you use \n Perl would know what to actually use on Unix vs Windows vs Mac. I could be wrong on that or misapplying it somehow. But clearly my regular expression isn't working. I'll get back to you after looking into it more.

m23
06-01-2005, 12:11 AM
sheila,

Nope, I don't think so. It's in /big/dom/xmydomain. Is that correct? The script is running, just not giving the correct results. I end up getting the email processed by the script. It's just that the newly created spam score is always 0.5, which is wrong.

Actually, I found out why it's always 0.5. That is the original spam assassin score. It's always 0.5 because every test email is from an email address of mine that ends in 1969, eg, my-REMOVE THIS BIT HERE-name1969@hotmail.com. Spam assassin gives a 0.5 score to any email address that ends in numbers. So every test email gets a 0.5 from spam assassin.
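Since the 0.5 turned out to be SpamAssassin's own score read back from the header, it may help to isolate that lookup. The sketch below, in Python, assumes the header format implied by the regex in the script (`X-Spam-Status: Yes|No, score=...`); the function name is made up. Unlike the pattern in the script, it also accepts a minus sign, because SpamAssassin scores can be negative.

```python
import re

# Accepts an optional minus sign; SpamAssassin scores can be negative.
_STATUS_RE = re.compile(
    r"^X-Spam-Status:\s*(Yes|No),\s*score=(-?\d+(?:\.\d+)?)",
    re.MULTILINE | re.IGNORECASE,
)


def spamassassin_score(header, default=0.0):
    """Return the score from an X-Spam-Status header line, or default."""
    match = _STATUS_RE.search(header)
    return float(match.group(2)) if match else default
```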

kitchin
06-01-2005, 12:16 AM
If you don't want to use my "split" code, then this =~//ms code might work:

($header,$body) = $file =~ /(.*\n?)^\r?$(.*\n?)/ms;

Most email, if I recall, uses \r\n line endings, even in unix.
??

m23
06-01-2005, 12:45 AM
I used this and it works:

($header,$body) = $file =~ /(.*\n?)^$(.*\n?)/ms;

Yours works too, which was

($header,$body) = $file =~ /(.*\n?)^\r?$(.*\n?)/ms;

I'm not sure why \n is needed at all with the /s modifier, which is supposed to make the . character match newlines. But oh well, it's working. Maybe I'll find out why someday. I don't have time to think anymore about that now. I have to verify the next part that isn't working.

m23
06-01-2005, 01:23 AM
Actually, those regular expressions aren't working. They are storing the entire email, header and body, in the $header variable. The $body variable is empty.
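A likely explanation for this symptom is greediness: with dot-matches-newline switched on, the first `(.*)` consumes as much of the message as it can, so the split lands at the last place `^$` can match instead of at the first blank line. The sketch below reproduces that in Python's `re` module and contrasts it with a lazy first group. Treat it as an illustration rather than a diagnosis of the Perl script: Python's `^` also matches at the very end of a string after a trailing newline, which Perl's does not, so the exact split point can differ between the two languages. The function names are made up.

```python
import re

FLAGS = re.MULTILINE | re.DOTALL  # rough counterpart of Perl's /ms


def greedy_split(message):
    """The thread's pattern: the greedy first group grabs too much."""
    match = re.match(r"(.*\n?)^$(.*\n?)", message, FLAGS)
    return match.groups() if match else None


def lazy_split(message):
    """Lazy first group: stops at the first blank line."""
    match = re.match(r"(.*?\n)\r?\n(.*)", message, FLAGS)
    return match.groups() if match else None
```

With a message such as `"Subject: hi\n\nbody\n"`, `greedy_split` puts everything in the first group and leaves the second empty, while `lazy_split` separates the header line from the body.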

sheila
06-01-2005, 01:45 AM
I really don't think you should need to match on \r\n

The line separator \r\n is used during the SMTP conversation between the sending and receiving SMTP servers. But, as far as I know (?) once the message is accepted by our system, the line endings are all stored as the usual Unix line endings.

I would recommend not using regex at all.

Doesn't Perl have some command like "find" or something for strings, that can find the first occurrence of "\n\n" and split the message into the part before and the part after?

In Python, I would do this:

import sys
raw_email = sys.stdin.read()
headers, body = raw_email.split("\n\n", 1)

Yeah, here is a string function for Perl that does the same thing?
http://www.cs.cf.ac.uk/Dave/PERL/node56.html

split(PATTERN, STRING, LIMIT) -- Breaks up a string based on some delimiter. In an array context, it returns a list of the things that were found. In a scalar context, it returns the number of things found.

Looks like this would work for you?

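Let the email be in the string
$rawemail

where you read it in from stdinput

Then
($headers, $body) = split(/\n\n/, $rawemail, 2);

(Perl's LIMIT is the maximum number of pieces, so it is 2 here where Python's maxsplit is 1.)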

Well, I know it works in Python. I would think it would work in Perl, too?
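The two-line Python version above raises a ValueError when the message contains no blank line, because the split then returns a single piece that cannot be unpacked into two names. A slightly more defensive sketch, which also tolerates \r\n line endings in case they do reach the filter; the `split_message` name is just for illustration.

```python
import re

# First blank line, accepting either LF or CRLF line endings.
_BLANK_LINE = re.compile(r"\r?\n\r?\n")


def split_message(raw_email):
    """Return (headers, body); body is "" when there is no blank line."""
    parts = _BLANK_LINE.split(raw_email, maxsplit=1)
    headers = parts[0]
    body = parts[1] if len(parts) > 1 else ""
    return headers, body
```

Only the first blank line is used as the separator, so blank lines inside the body stay in the body.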

kitchin
06-01-2005, 08:55 AM
Sheila is right! I think I've made this mistake before. You just need to look for \n\n. This will work, without regex (split uses a regex):

$i= 2 + index($file, "\n\n");
$head= substr($file, 0, $i);
$body= substr($file, $i);

kitchin
06-01-2005, 09:06 AM
Also, this line
foreach(%words)
should be
foreach(keys %words)

And the regexes matching on header lines would be better if they matched the line start, so
/Subject: (.*)/
would be
/^Subject: (.*)/m

and especially
$header =~ s/Subject.*/Subject: *{SPAM COOKER: $score}* $subject/;
would be
$header =~ s/^Subject: .*/Subject: *{SPAM COOKER: $score}* $subject/mi;

Probably!
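The reason for anchoring is that an unanchored `Subject.*` also matches any header whose name merely contains "Subject", and rewrites whichever one comes first. A Python sketch of the anchored rewrite, for comparison; `tag_subject` is a hypothetical name and the one-decimal score format is an arbitrary choice. A replacement function is used instead of a replacement string so that backslashes in the original subject are not interpreted as escapes.

```python
import re


def tag_subject(header, score):
    """Prefix the Subject line with a score tag, leaving other lines alone."""
    pattern = re.compile(r"^Subject:[ \t]*(.*)$", re.MULTILINE | re.IGNORECASE)

    def replace(match):
        return "Subject: *{SPAM COOKER: %.1f}* %s" % (score, match.group(1))

    # count=1 so only the first Subject line is rewritten.
    return pattern.sub(replace, header, count=1)
```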

m23
06-03-2005, 09:24 PM
Sheila,

Yes, I had used Perl's split function earlier, but then when I wanted to match multiple lines I switched to using regular expressions. I thought regular expressions would be easier to grab multiline chunks of text, but maybe I was wrong. Also, I use regular expressions a lot (but not in Perl).

Maybe I could do something like this instead using Perl's split function:

@sections = split(/$^/,$email);

Unfortunately I haven't had a lot of time to work on this script lately. And I'm using it to learn Perl better. So it's pretty slow going right now. I'm almost done though.

m23
06-03-2005, 09:27 PM
kitchin,

Thanks, I'll try those regexes. I fixed my syntax error you pointed out with iterating over the hash using foreach.

I wish I knew why the same regex (that I am currently using, posted before) that matches from the command line does not match when the script is used as a filter. That is really interesting, even if irritating.

m23
06-08-2005, 11:51 PM
kitchin,

Thanks for your help. The script is basically working. Little issues keep coming up, and I kind of want a break from them for the moment. For example, the script now appears to run only if the email is not identified as spam by spam assassin, whereas before it seemed to run whether the email had been tagged as spam or not. I think I need some time away from the email filter issues.

In any case, I'm going to rewrite the script in Python. After that, in Ruby. I'm using this script as a way to compare Perl, Python, and Ruby. I imagine the final script will be in Python, because it appears to have more support than Ruby. But, I'll see how it turns out.