Finding broken links: 404s
Stecyk
07-24-2006, 06:18 PM
Hi,
I used Google's Sitemap site, even though I don't have a sitemap, to review several statistics. One of the stats is "Not found".
I have one link /blog/weblog/ that is a broken link. I have no idea where to find this broken link to fix it. Does anyone have any suggestions to find this broken link quickly?
I also used FutureQuest's included stats package and verified that this broken link has shown up 5 times. Not a big deal, but I'd like to correct it.
Any ideas, hints, or instructions are most appreciated.
Best regards,
Kevin
sheila
07-24-2006, 10:08 PM
Google offers a search option to find pages that contain a certain link. Perhaps you might try that type of search on Google?
Stecyk
07-25-2006, 12:23 AM
Hi Sheila,
I tried Google's link feature and came up empty, which is odd in a way because Google is the one that identified the error.
Oh well. :)
Regards,
Kevin
Stecyk
07-25-2006, 03:57 AM
Hi Sheila,
I tried Link Sleuth: http://home.snafu.de/tilman/xenulink.html
But I get "status: forbidden request".
It works on other websites, but not mine. Is FQ somehow blocking it?
Best regards,
Kevin
sheila
07-25-2006, 04:14 AM
I'm afraid Xenu Link Sleuth is being blocked...
See:
http://www.aota.net/forums/showthread.php?t=21146
Stecyk
07-25-2006, 11:46 AM
Okay, so I need to find some other way to find my 404 errors. Hmmmm....
Andilinks
07-25-2006, 03:49 PM
I've been avoiding this question because I don't understand it.
Where is this broken link and why is it so hard to find? Can't you just search the daily activity log for 404 referrers? If you search on "404" there will be a few that occur in other strings, but most will be actual "file not found" errors.
My stats package gives a summary but unless you get huge traffic searching the daily logs should be easy.
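A minimal sketch of that log search, assuming the daily log uses the common Apache "combined" format (host, timestamp, request, status, bytes, referrer, user agent). The function name find_404s is made up for illustration. Matching on the status field rather than on the bare string "404" avoids the false hits mentioned above, such as a byte count or a URL that happens to contain 404.

```python
import re
from collections import Counter

# Assumed layout (Apache combined log format):
# host ident user [time] "METHOD path protocol" status bytes "referrer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def find_404s(lines):
    """Count (requested path, referrer) pairs for lines whose status is 404."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group("status") == "404":
            hits[(match.group("path"), match.group("referrer"))] += 1
    return hits


# Usage (the log file name is hypothetical):
#   with open("access.log", encoding="utf-8", errors="replace") as log:
#       for (path, referrer), count in find_404s(log).most_common():
#           print(count, path, "<-", referrer or "-")
```

The referrer column is the useful part: it names the page the visitor (or bot) claims to have come from, which is where to start looking for the bad link.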
Andi
Stecyk
07-25-2006, 07:33 PM
Hi Andi,
I don't understand it either.
From my logs:
00.000.00.000 - - [18/Jul/2006:23:25:45 -0400] "GET /blog/weblog/ HTTP/1.1" 404 1468 "http://www.speciousargument.com/blog/" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.4) Gecko/20060508 Firefox/1.5.0.4"
I put in a fake IP address. Okay, so I think that "/blog/weblog/" ought to occur somewhere on my main index:
http://www.speciousargument.com/blog/index.php
Yet, when I view the code and search for "blog/weblog", I get nothing.
So that's where my confusion is.
Any ideas?
Andilinks
07-25-2006, 07:53 PM
Ok, here's a current list of 404 requests to my site just picked at random.
http://www.andifashion.com/404.gif
Most do not exist and never did exist, these are simply requests generated by bots or browser errors. Of those that did once exist but are now removed most have not existed for at least a year, yet they are still requested--presumably by bots that never forget.
Stecyk
07-25-2006, 10:46 PM
Andi,
I understood your random selection. However, Google's Sitemaps site also reports the same error, though it doesn't tell me where the error originated. That is the reason why I am trying to track it down. I don't think Google made it up.
Kevin
sheila
07-25-2006, 10:55 PM
I just wanted to point out a similar line of discussion that was posted recently in the forums...related to Google indexing pages that didn't exist...
http://www.aota.net/forums/showthread.php?t=21650
Andilinks
07-25-2006, 11:15 PM
Did it come from this page?
http://www.andifashion.com/u1.jpg
As you can see there are no errors reported today, but just recently there has been a list of pages that never existed or haven't existed in ages. I don't know where these come from, some were deleted in 2002 and 2003...
Maybe just time will cure this.
Stecyk
07-25-2006, 11:21 PM
Hi,
Sheila, thank you for your message. Because this same error is reported in both my server logs and by Google Sitemaps, I think there might be something to it. If it had only occurred in one and not the other, I wouldn't fuss over it.
Andi, I too have 0 unreachable URLs. My site's problem is "Not Found" urls.
http://img225.imageshack.us/img225/9656/blahfm6.th.jpg (http://img225.imageshack.us/my.php?image=blahfm6.jpg)
Click on the image if you want to see a larger version.
Kevin
Andilinks
07-25-2006, 11:49 PM
http://www.andifashion.com/u2.gif
This is a very interesting list, well for me it is...
I'd let time do its work, though like I said some of these pages were deleted in 2002. Robots are especially peculiar.
edit: reload if the image doesn't appear, I had to make a change
Stecyk
07-25-2006, 11:57 PM
Andi,
My interpretation, right or wrong, is that on or about mid July, Google found various links on your site that terminated with a 404. If the page with the offending link was deleted back in 2002, I don't think Google would be chasing that link down now because that page no longer exists.
But I could be wrong.
Best regards,
Kevin
Andilinks
07-26-2006, 12:07 AM
Google found various links on your site that terminated with a 404.
Nahhh... I did a global search for these URLs with UltraEdit on all the pages, these links aren't on my site. I keep a local version of the entire site and upload it whenever I make a global change, which is often... There are no htm, html, or shtml files in my www folder older than May.
Maybe on somebody's site, but not on mine. Could be someone's old referrer logs online that get indexed.
Stecyk
07-26-2006, 12:51 AM
Andi,
In any event, all this conversation didn't help answer something that was supposedly basic. I showed my server log with the referrer being an internal page. And I showed Google. Two independent sites/methods saying the same thing.
Kevin
Andilinks
07-26-2006, 01:09 AM
So, download your entire site to a single directory, then do a global search of the entire local version and if the link isn't there then the two independent sources agree with each other but not with reality...
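A rough sketch of that global search, for anyone without an editor that does find-in-files. The function name search_tree, the extension list, and the directory name in the usage comment are assumptions for illustration; it only sees what is actually in the downloaded files.

```python
import os


def search_tree(root, needle, extensions=(".htm", ".html", ".shtml", ".php")):
    """Yield (file path, line number, line text) for each line containing needle."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.lower().endswith(extensions):
                continue  # skip images, archives, and other non-page files
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as handle:
                for number, line in enumerate(handle, start=1):
                    if needle in line:
                        yield path, number, line.strip()


# Usage (the directory name is hypothetical):
#   for path, number, text in search_tree("site_copy", "weblog"):
#       print(f"{path}:{number}: {text}")
```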
Andilinks
07-26-2006, 01:19 AM
So anyway if that works out, check out my apparel stocks for mobile page.
http://www.andilinks.com/apparel-stock-mobile.htm
If you don't have a PDA open it with Opera > View > Small Screen...
Stecyk
07-26-2006, 01:21 AM
So, download your entire site to a single directory, then do a global search of the entire local version and if the link isn't there then the two independent sources agree with each other but not with reality...
Tough to do with a dynamic site. Stuff gets created on the fly. But I have checked the supposed offending page with no success. And that is why I came here to ask others for their advice.
sheila
07-26-2006, 01:23 AM
I was thinking, since the log entry you showed was dated the 18th and today is the 25th, that maybe that "bad" link was no longer on your index page. Maybe you changed the site since then?
Or, maybe it used to be on the index page, but has since scrolled off, as time has passed. So I looked at your entire archives for July, but I still do not find a link /blog/weblog
nor indeed, any link with "weblog" in it.
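Searching for the bare word "weblog" rather than the full path matters here. A relative href such as weblog/ on the /blog/ page would be requested as /blog/weblog/, yet the page source would never contain the string "blog/weblog". Nothing in the logs confirms that this is what happened; it is just one way the two can disagree. The sketch below resolves each href against the page URL before comparing. The sample HTML is invented for illustration, the helper names are hypothetical, and a <base href> tag is not handled.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class HrefCollector(HTMLParser):
    """Collect the raw href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def hrefs_resolving_to(html, page_url, target_path):
    """Return the raw hrefs whose resolved URL has exactly the path target_path."""
    parser = HrefCollector()
    parser.feed(html)
    return [
        href
        for href in parser.hrefs
        if urlsplit(urljoin(page_url, href)).path == target_path
    ]


# Invented page fragment, not taken from the real site.
sample = '<a href="weblog/">Archive</a> <a href="/blog/about/">About</a>'
matches = hrefs_resolving_to(
    sample, "http://www.speciousargument.com/blog/", "/blog/weblog/"
)
print(matches)  # ['weblog/']
```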
Andilinks
07-26-2006, 01:32 AM
...dynamic site.
Can't you search the database that feeds the site? Even created on the fly the URL still had to come from a record in that database.
sheila
07-26-2006, 02:38 AM
I just ran a link checker on your site.
From a command line window on my desktop:
$ python webchecker.py -x http://www.speciousargument.com/blog/ > linkresults.txt
The linkresults.txt file had all the 404 results in it...there was no string "weblog" to be found anywhere in the results. If you want the linkresults.txt file I can provide it to you one way or another, I'm sure...
The link checker program I used is a Python script that can be downloaded from the Python.org web site. It is in the source code for the Python package. If you are interested and need more details to find it, let me know...
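For anyone who cannot locate that script (it dates from older Python distributions and may not be in a current source tree), the general idea of a link checker can be sketched in a few lines. This is not webchecker.py and does not reproduce its options or output; check_site, fetch, and LinkParser are names made up for this illustration. It only follows links under the start URL and only looks at <a href> links, not images or scripts.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.error import HTTPError, URLError
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collect href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)


def fetch(url):
    """Return (status, html); status is None if the request failed outright."""
    try:
        with urlopen(url, timeout=10) as response:
            return response.status, response.read().decode("utf-8", errors="replace")
    except HTTPError as err:
        return err.code, ""
    except URLError:
        return None, ""


def check_site(start_url, fetch=fetch, max_pages=200):
    """Crawl pages under start_url; return {broken_url: [pages that link to it]}."""
    queue = deque([start_url])
    seen = {start_url}
    referrers = {}  # url -> list of pages on which a link to it was found
    broken = {}
    pages = 0
    while queue and pages < max_pages:
        url = queue.popleft()
        pages += 1
        status, html = fetch(url)
        if status == 404:
            broken[url] = referrers.get(url, [])
            continue
        if status != 200:
            continue  # other failures are ignored in this sketch
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            target = urldefrag(urljoin(url, href)).url  # resolve and drop #fragment
            if not target.startswith(start_url):
                continue  # stay within the site, skip external links
            referrers.setdefault(target, []).append(url)
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return broken


# Usage (makes real HTTP requests, so it is left commented out):
#   for url, pages in check_site("http://www.speciousargument.com/blog/").items():
#       print(url, "linked from", pages)
```

The point of keeping the referring pages alongside each 404 is that the report then says which page to edit, not just which URL is dead.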
sheila
07-26-2006, 02:49 AM
Actually, on reviewing the linkresults.txt file, I'm not sure that it is especially useful. Most of the "404s" that are reported have the zgi_url parameter in them, which appears to have something to do with your Flickr links.
Maybe this webchecker script is too out-of-date to handle modern code on dynamic web pages? Dunno...I had used it on some sites a while back and had pretty good success with it...but it seems to be stumbling over your page...not sure why.
Andilinks
07-26-2006, 04:30 AM
...modern code on dynamic web pages?
Some technologies do get ahead of themselves. I don't think a single bad link is a serious problem, especially since it may not even exist.
But the fact that this can't be easily tracked down is a very serious flaw and would cause me to avoid the software in question.
Stecyk
07-26-2006, 11:40 AM
Hi Sheila,
Yes, I would very much like to see the file you created. If you're unable to send it through internal forum messaging, please use my gmail account. The stuff in front of the "at" symbol is simply my last name, which I also use as my identity in this forum. I am being slightly cryptic for spam bot purposes. Thank you!
I'll keep watching for that error to see if it pops up again. I managed to find and correct most of the other errors that Google indicated. But perhaps it was only there for a short period of time, or it was never there.
Again, thank you.
Best regards,
Kevin
Stecyk
07-26-2006, 11:42 AM
Hi Andi,
Can't you search the database that feeds the site? Even created on the fly the URL still had to come from a record in that database.
Yes, I did check my feeds and got nothing. But the templates themselves, which I don't have a good handle on, also embed links in them too. So I wasn't sure if perhaps a template was causing the grief. And yes, I searched the templates for the pattern as well and came up empty.
Thank you for your help.
Best regards,
Kevin
sheila
07-28-2006, 12:21 AM
Yes, I would very much like to see the file you created.
Just sent it to your gmail addy.
Stecyk
07-28-2006, 01:36 AM
Hi Sheila,
Your file was immensely helpful. I found bugs in a few posts and comments. And I discovered that three of my templates had incorrect links too. So I think with your file I managed to eliminate a whole slew of 404s.
Thank you very much for your help!!
Best regards,
Kevin
sheila
07-28-2006, 01:38 AM
Cool. Glad it turned out to be helpful. :yeah:
Stecyk
07-28-2006, 02:09 AM
Hi Sheila,
If you could please provide more detail on the program, that would be helpful.
Where can it be obtained, how/where is it installed, and how do you run it?
Thank you!!
Best regards,
Kevin
sheila
07-28-2006, 02:17 AM
Well, the file itself that I used is called webchecker.py
It is a Python script and I got it from the Python distribution source code.
You can download a tarball of the source code from the Python.org download page:
http://www.python.org/download/
When I unzipped the .tar.bz2 archive that I downloaded, there is a directory within the archive called Tools and inside that directory is a directory called webchecker. And that's where the script is located.
You need to have Python installed in order to run the script.
The last time I installed Python on Windows (it's been a few years, but...) it seems to me that the Tools directory was automatically installed for me with the webchecker tools inside of it. The same was not true on my Mac. I had to go out and look for the source code for the web checker script, as it wasn't installed on my Mac.
Anyhow, it is a handy little script.
To run it, you just open a command window in the same directory where the script is located and do like I did above (example in my first post mentioning the script).
Stecyk
07-28-2006, 12:33 PM
Hi Sheila,
Thank you for your help!
Best regards,
Kevin