The whole MSIE Crawler/403/stats thing, again...
sheila
05-17-2001, 09:40 AM
OK, as referenced here:
http://www.aota.net/ubb/Forum3/HTML/001618-1.html
I'm still dealing with this issue of someone using MSIE Crawler to access my site. Here is a snip from the first of several thousand log entries from yesterday, with IP address 128.158.104.168 and user agent MSIECrawler:
128.158.104.168 - - [16/May/2001:15:42:39 -0400] "GET /robots.txt HTTP/1.1" 302 299 "-" "Mozilla/4.0 (compatible; MSIE 5.01; Windows 98; MSIECrawler)"
128.158.104.168 - - [16/May/2001:15:42:40 -0400] "GET /403.html HTTP/1.1" 302 299 "-" "Mozilla/4.0 (compatible; MSIE 5.01; Windows 98; MSIECrawler)"
It continues to be IP address 128.158.104.168, in all cases (even the previous ones from Feb, March and April). Even though I wrote to someone at nasa.gov, and got a response, and was told that they would see to it that it stopped, it still hasn't stopped.
Weird thing is, you'll notice above that MSIE Crawler is not generating a code 403. It is getting a 302 code.
I've finally decided to write a script to redirect this guy to some other website. (Nasa.gov? Microsoft.com?) However, it would rely on his requests generating a 403 error. (And why isn't it generating a 403 error, as shown in my logs above? Is this related to having IRMs? I think maybe so.)
So, how can I ensure that this IP address generates a 403 error? And why didn't it generate one yesterday?
tedloh
05-17-2001, 09:51 AM
Why not just forbid him through your .htaccess? Too much trouble to redirect him...
------------------
Ted (Chief Do-It-All)
Got2Bet.com - The Net's Winner's Circle
http://www.got2bet.com
ted@tygresystems.com
sheila
05-17-2001, 11:34 AM
The problem is, because of my IRM, the user does not get a 403 error.
For example, I just tried doing this to myself, using my own IP address and a .htaccess file.
First I tried in my root directory. In my .htaccess file I added the following:
order allow,deny
allow from all
deny from 128.158.104.168
deny from xxx.xxx.xxx.xxx
(Where my IP addy is xxx.xxx.xxx.xxx)
Now when I try to go to my own website, or any page in my site excepting those below my IRM subdirectory, I get a 403-Forbidden message.
But, when I go to my IRM, I get a webpage that says: "Found. The document has moved here" and the word "here" is a link to my 403 document for my IRM subdirectory.
Here is the source for the page that is returned:
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<HTML><HEAD>
<TITLE>302 Found</TITLE>
</HEAD><BODY>
<H1>Found</H1>
The document has moved <A HREF="http://www.k12groups.org/403.html">here</A>.
<HR>
<ADDRESS>Apache/1.3.17 Server at www.k12groups.org Port 80</ADDRESS>
</BODY></HTML>
So, FutureQuest is causing a re-direct on my 403 document, by the way the IRMs are set up. This is causing a loop, since (I guess) the MSIECrawler agent then follows the "here" link, and requests my 403 document repeatedly. The MSIECrawler agent is not receiving a 403 error code. It is receiving a 302 "moved temporarily" code.
You know, if we could make the IRM generate a 403 error, as it should, then this problem would go away.
I wonder if I will need to delve into the mysteries of URL re-write/re-direction via Apache modules to solve this one myself. Really, I think that this is a result of the way the IRMs are set up. I'm sure that FutureQuest is using the Apache modules to re-write my IRM subdirectory, so that it looks like its own domain. I'm guessing they're using the mod_alias (http://httpd.apache.org/docs/mod/mod_alias.html) Apache module. Maybe I could work with some of the other modules to override this for the case of this one IP address. It is going to take some more studying on my part.
Anyone who has a clue, please chip in and tell me how to do it. I won't be insulted.
sheila
05-17-2001, 11:37 AM
By the way, this also happened, even when I put the order for allow,deny in the same subdirectory as my IRM. Even then, it gave the 302-Found page with a link to my 403.html document.
tedloh
05-17-2001, 05:11 PM
That's weird.
My IRM has a similar .htaccess for the deny... but it works perfectly when the .htaccess is placed in the same directory.
In theory, with the deny in place, I guess you are supposed to get NOTHING but the 403 - so if you are showing 302 the deny must not be working.
My .htaccess also specifies where the 403 is, though:
ErrorDocument 403 /cgi-bin/xxxxxx/error.cgi?403
I wonder if that could be the problem...
[This message has been edited by tedloh (edited 05-17-01@5:11 pm)]
sheila
05-17-2001, 05:22 PM
Ah, Ted, I've been playing with this further.
If you place the 403 document in the cgi-bin, it will work fine. That is because, from your IRM subdirectory, the path to the cgi-bin is the same, whether you view it as
http://www.maindomain.com/irmdomain/
or
http://www.irmdomain.com
So, you don't have this problem with the 302-Found redirect thingy.
HOWEVER, try putting a 403 document somewhere within your IRM subdirectory, and you will generate the problem I was having. I guarantee it!
So, it seems that the only way to handle error documents correctly with IRMs on FutureQuest, without having a re-direct problem, is to serve them from the cgi-bin?
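A possible explanation, offered as an assumption rather than anything confirmed in this thread: stock Apache treats ErrorDocument differently depending on whether it is given a local URL-path or a full URL. With a full URL, Apache sends the client a redirect to that address, and the original 403 status is lost. How FutureQuest's IRM setup actually points at the error document is not stated here, so the fragment below is only a sketch of the two forms, and the paths in it are illustrative.

```apache
# Local URL-path: Apache serves the document internally, and the client
# still receives the original 403 status code.
ErrorDocument 403 /403.html

# Full URL (shown commented out): Apache answers with a redirect to this
# address instead, so the client sees a 302 rather than a 403.
# ErrorDocument 403 http://www.k12groups.org/403.html
```

If the IRM mapping turns the error document into an absolute URL behind the scenes, that would fit the "302 Found" page shown earlier, but that part is a guess.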
sheila
05-17-2001, 05:23 PM
Can one serve HTML documents from the cgi-bin?
[This message has been edited by sheila (edited 05-17-01@5:29 pm)]
sheila
05-17-2001, 05:47 PM
Okay, it seems that the answer to the last question is: No, you can't display an html page from the cgi-bin.
Therefore, in order to serve my plain, 403-html document correctly for my IRM, I have had to write a cgi script that basically opens an html file for me (the one I would have served), and prints it out to the requesting user agent.
Slightly inefficient, but at least I avoid the 302-redirect problem.
Terra
05-17-2001, 05:55 PM
I've said it a hundred times, MSIECrawler is a stupid, idiotic, brain dead, drag it out behind the barn - type of bot...
The conversation:
access /robots.txt (or any other page)
This is where the fun begins as the bot gets itself caught up in an infinite loop:
Verbose:
get /robots.txt - denied - custom 403 - get /403.html - denied
now I'll shorten it:
/robots.txt - denied - /403.html - denied - /403.html - denied - /403.html - denied - /403.html - denied - /403.html - denied - /403.html - denied - /403.html - denied - /403.html - denied - /403.html - denied - /403.html - denied - /403.html - denied - /403.html - denied - /403.html - denied - /403.html - denied - /403.html - denied - /403.html - denied - /403.html - denied - /403.html - denied - /403.html - denied - /403.html - denied - /403.html - denied - /403.html - denied - /403.html - denied - /403.html - denied - /403.html - denied - /403.html - denied - /403.html - denied - /403.html - denied
Well - you get the picture...
The IRM code does not know that the forbidden bot is requesting a custom ErrorDocument, so therefore the IRM just keeps telling the bot to go away...
If everyone named their ErrorDocument '403.html' then I can allow that to pass through - however this is a haphazard solution path at best...
I will evaluate this condition and see if I can create yet another workaround for Microsoft's way of doing things... With other bots, it's not much of a problem as they usually were designed correctly with an error counter which expires and stops the traversal...
Stupid bot.
--
Terra
--Another prime example of Microsoft prying open the door and forcing us to do it their way--
FutureQuest
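For comparison, here is how the "let the error document pass through" idea is usually written on a stock Apache setup. This is not the IRM code Terra describes (that is not shown in this thread), and whether it has any effect inside an IRM subdirectory is untested. The sketch assumes the error document is named 403.html and that the host permits these directives in .htaccess.

```apache
# Ban the crawler's IP address for everything in this directory tree.
Order Allow,Deny
Allow from all
Deny from 128.158.104.168

# Exempt the error document itself, so that a banned client can still be
# served the custom 403 page instead of being denied a second time.
<Files 403.html>
    Order Allow,Deny
    Allow from all
</Files>
```

The exemption works because a Files section is applied after the directory-level rules, so the Allow inside it takes precedence for that one file.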
sheila
05-18-2001, 12:11 AM
OK, so I'm putting a .htaccess in my subdirectory
/www/k12groups/
which redirects to k12groups.org
So, I thought I could use the Redirect command from the mod_alias Apache module (http://httpd.apache.org/docs-2.0/mod/mod_alias.html#redirectperm). It has the following syntax:
Redirect [status] url-path url
And at the bottom of that section, the docs state:
Other status codes can be returned by giving the numeric status code as the value of status. If the status is between 300 and 399, the url argument must be present, otherwise it must be omitted. Note that the status must be known to the Apache code (see the function send_error_response in http_protocol.c).
Therefore, it seemed to me I could put the following line in my .htaccess file in the /www/k12groups/ subdirectory:
Redirect 403 /k12groups/403.html
But it doesn't work. I still get the same 302-Found status returned, with a link to my 403.html document. :( :(
[This message has been edited by sheila (edited 05-17-01@12:12 pm)]
tedloh
05-18-2001, 01:01 AM
Down bot! Heel!
Perhaps I'm just lucky things are working right, but I bet it's because Terra has things in order, except for the stupid bot.
Now I can ban every idiot who tries to use EmailSiphon.
DianeDuane
07-01-2001, 06:32 AM
Originally posted by tedloh:
Down bot! Heel!
Perhaps I'm just lucky things are working right, but I bet it's because Terra has things in order, except for the stupid bot.
Now I can ban every idiot who tries to use EmailSiphon.
Ted, could you possibly take a moment to explain, in terms suitable for an utterly clueless newbie, exactly what you did? The whole business of .htaccess files, etc, is so far still a mystery to me -- I do want to learn more -- but right now I've just noticed EmailSiphon sniffing around my site, and I want to stomp on the problem soonest. Help help, oh help.
Best -- Diane
tedloh
07-02-2001, 02:52 AM
Hi Diane - welcome!
Actually, I think there's a way to ban by checking the browser agent - but instead I just watch my logs, and every entry that contains EmailSiphon has the IP banned in my .htaccess.
A line like
Deny from 111.222.255.12
in your .htaccess file would ban that particular IP.
I suspect you and I are in the same boat :) I am still learning about restricting access... maybe one of the other experts can help?
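On banning by browser agent: Apache's mod_setenvif can flag a request by its User-Agent header, and Deny can then act on that flag, so no script is needed. The fragment below is a minimal sketch, assuming mod_setenvif is loaded and the host allows these directives in .htaccess. The variable name bad_bot is arbitrary and chosen here only for illustration.

```apache
# Set the bad_bot variable for any request whose User-Agent header
# contains "EmailSiphon" (the match is case-insensitive).
SetEnvIfNoCase User-Agent "EmailSiphon" bad_bot

# Let everyone in, except requests that were flagged above.
Order Allow,Deny
Allow from all
Deny from env=bad_bot
```

Unlike a per-IP ban, this keeps working when the harvester turns up from a new address, though it only catches bots that identify themselves honestly.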
DianeDuane
07-03-2001, 07:39 AM
Hi, Ted! Many thanks for the welcome.
Ideally, I would like to ban by browser agent, but it seems likely to involve a perl script or other exotica, which will only increase the cluelessness level around here. I didn't even know I had an .htaccess file until the other day... %)
I note your suggestion to ban by IP...I'll give that a try. Is that statement the only one that needs to appear in the file (for that IP), or will it need some other expression or statement to help it along?
Best! -- Diane
tedloh
07-07-2001, 01:14 PM
Yes, that's all it takes.
Make sure to leave a blank line at the end of the file - I understand that .htaccess may not function correctly without it.