Can I unblock Xenu?
axleramon
08-15-2007, 05:33 PM
Hi, I can't seem to use the Xenu link checker on my FutureQuest-hosted domain. It says the domain is blocked. I understand this is a security feature of FutureQuest, but I would like to unblock it (even temporarily) so that I can check my site for broken links. Any ideas?
Hello,
As I replied to your service ticket submitted with the same question...
"As noted in the forum post linked below, if the actual agent is Xenu Link Sleuth, it is blocked globally and that cannot be modified.
http://www.aota.net/forums/showthread.php?postid=146625#post146625 "
-Bob
axleramon
08-15-2007, 06:44 PM
Thanks Bob... my apologies for the duplicate post, I didn't understand how your support system works, my bad. Many thanks for the prompt response!
Hi again Alex,
No apology required and we appreciate your cooperation in support requests.
-Bob
johnfl68
08-15-2007, 07:04 PM
You may want to try the W3C Link Checker - I have used it a few times on FutureQuest sites without problems.
http://validator.w3.org/checklink
John
axleramon
08-15-2007, 07:13 PM
Thanks John,
I'm trying the W3C checker right now. Are you aware of any way to prevent the W3C checker from following certain URLs? Obviously I wouldn't want it to check URLs like Google ads, etc...
Your help is appreciated, thanks again!
johnfl68
08-15-2007, 07:15 PM
Sure - in the robots.txt file on your site - add the following:
User-Agent: W3C-checklink
Disallow: /(name of the folder or file you do not want scanned)
This should do the trick.
John
axleramon
08-15-2007, 07:26 PM
Hmmm... I added this code to my .htaccess
User-Agent: W3C-checklink
Disallow: http://pagead2.googlesyndication.com/
My site vanishes, I get a "500 Internal Server Error" on all pages. Did I miss something?
Further, can I add multiple URLs, like this:
User-Agent: W3C-checklink
Disallow: http://www.google.com/
Disallow: http://pagead2.googlesyndication.com/
Or do they have to have the W3C-checklink every time like this:
User-Agent: W3C-checklink
Disallow: http://www.google.com/
User-Agent: W3C-checklink
Disallow: http://pagead2.googlesyndication.com/
Really appreciate the help, thank you!
Kevin
08-15-2007, 07:29 PM
You can only put paths on your site into the .htaccess file. It is up to the bot to know not to go out onto other sites.
Kevin
08-15-2007, 07:29 PM
Oh, same goes for the robots.txt file which is the one you should be editing. Putting that text into .htaccess makes it invalid which is why your site isn't working.
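To illustrate the point about paths, here is a minimal sketch using Python's standard-library robots.txt parser. It only shows how that one parser reads the rules; the W3C checker's own parsing may differ, and example.com, /ads/ and /drafts/ are made-up names for illustration. It also shows that one User-Agent line can be followed by several Disallow lines.

```python
from urllib.robotparser import RobotFileParser

def parse_robots(text):
    """Parse robots.txt text with Python's standard-library parser."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser

# One User-Agent line followed by several Disallow lines.
# /ads/ and /drafts/ are hypothetical folders on your own site.
PATH_RULES = """\
User-Agent: W3C-checklink
Disallow: /ads/
Disallow: /drafts/
"""

# The form tried earlier in the thread: a full URL on another domain.
URL_RULES = """\
User-Agent: W3C-checklink
Disallow: http://www.google.com/
"""

AGENT = "W3C-checklink"
by_path = parse_robots(PATH_RULES)
by_url = parse_robots(URL_RULES)

# A path rule blocks matching pages on the site that serves this robots.txt.
print(by_path.can_fetch(AGENT, "http://example.com/ads/banner.html"))
# Pages outside the listed folders stay fetchable.
print(by_path.can_fetch(AGENT, "http://example.com/index.html"))
# A full-URL rule is compared against the path only, so it matches nothing.
print(by_url.can_fetch(AGENT, "http://www.google.com/"))
```

With this parser the first check is refused and the other two are allowed, which is consistent with the advice to list only paths on your own site.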
axleramon
08-15-2007, 07:46 PM
Oops, I had that mixed up. I tried this in robots.txt:
User-Agent: W3C-checklink
Disallow: http://www.google.com/
Disallow: http://pagead2.googlesyndication.com/
The W3C checker is still following these links. So I understand there is nothing I can do to prevent this, right? Sorry for my ignorance.
You may want to try the W3C Link Checker - I have used it a few times on FutureQuest sites without problems.
http://validator.w3.org/checklink
John
Is there a way to check a whole site, or is it one page at a time with the W3C validator?
johnfl68
08-15-2007, 09:11 PM
You can set how deep it looks by checking "Check linked documents recursively" and setting the "recursion depth" to however deep you need it to go for your site. I think I typically use 5 in this box.
John
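For anyone wondering what a recursion depth means in practice, here is a rough sketch of a depth-limited walk over a site's links. It is not the W3C checker's code, and its exact depth semantics are an assumption; the SITE dictionary is a made-up link map standing in for real pages.

```python
from collections import deque

# Hypothetical link map: each page lists the pages it links to.
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/a/deep"],
    "/a/deep": ["/a/deep/er"],
}

def pages_within_depth(links, start, max_depth):
    """Return pages reachable from start in at most max_depth link hops."""
    seen = {start}
    order = [start]
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not follow links any further from this page
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                order.append(target)
                queue.append((target, depth + 1))
    return order

print(pages_within_depth(SITE, "/", 1))
print(pages_within_depth(SITE, "/", 2))
```

A larger depth reaches pages that sit more clicks away from the start page, at the cost of a longer run.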
Thanks John - don't know how I didn't see that there. Many thanks!
Andilinks
08-15-2007, 10:30 PM
You can download your site and check it with Xenu locally. I've done it dozens, maybe hundreds of times. I no longer use Xenu, though. I capture images with a program called HTML2JPG, which allows me to catch parked domains and similar items that Xenu will miss.
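If you just want a quick look at internal links in a downloaded copy, a small script can do a rough version of the same idea. This is only a sketch and not a replacement for Xenu: it assumes the site is plain .html/.htm files in a local folder, checks only href and src targets that point inside that folder, and skips external links entirely. The function and class names are made up for this example.

```python
import os
from html.parser import HTMLParser
from urllib.parse import unquote, urlparse

class LinkCollector(HTMLParser):
    """Collect href and src attribute values from one HTML page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)

def broken_local_links(site_root):
    """Return (page, link) pairs whose local target file does not exist."""
    broken = []
    for folder, _dirs, files in os.walk(site_root):
        for filename in files:
            if not filename.lower().endswith((".html", ".htm")):
                continue
            page = os.path.join(folder, filename)
            collector = LinkCollector()
            with open(page, encoding="utf-8", errors="replace") as handle:
                collector.feed(handle.read())
            for link in collector.links:
                parts = urlparse(link)
                # Skip external URLs, mailto: links and fragment-only links.
                if parts.scheme or parts.netloc or not parts.path:
                    continue
                path = unquote(parts.path)
                if path.startswith("/"):
                    target = os.path.join(site_root, path.lstrip("/"))
                else:
                    target = os.path.join(folder, path)
                if not os.path.exists(target):
                    broken.append((page, link))
    return broken
```

Calling broken_local_links on the folder holding the downloaded copy returns a list you can print or save. External links would still need an online checker.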
Jarrod
08-16-2007, 02:29 AM
The W3C link checker does obey robots.txt. Whilst it starts to follow Google ad links, it doesn't complete them because the robots.txt file on the Google domain stops it.
Here's some sample output from one of my page checks.
http://pagead2.googlesyndication.com/pagead/show_ads.js
What to do: The link was not checked due to robots exclusion rules. Check the link manually.
Response status code: (N/A)
Response message: Forbidden by robots.txt
Lines: 46, 66
http://www.google-analytics.com/urchin.js
What to do: The link was not checked due to robots exclusion rules. Check the link manually.
Response status code: (N/A)
Response message: Forbidden by robots.txt
Line: 104
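The behaviour in that output can be sketched as follows: for each link, a well-behaved checker consults the robots.txt of the host the link points to, not the robots.txt of the site being checked. This is a simplified illustration using Python's standard-library parser, not the W3C checker's code. The hosts and robots.txt bodies below are invented stand-ins; a real checker would download robots.txt from each host.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt bodies keyed by host (invented for illustration).
ROBOTS_BY_HOST = {
    "ads.example.net": "User-agent: *\nDisallow: /\n",
    "www.example.com": "User-agent: *\nDisallow:\n",
}

def link_status(agent, url):
    """Decide whether agent may check url, using the target host's rules."""
    host = urlparse(url).netloc
    text = ROBOTS_BY_HOST.get(host)
    if text is None:
        return "no robots.txt known, link would be checked"
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    if parser.can_fetch(agent, url):
        return "link would be checked"
    return "Forbidden by robots.txt"

print(link_status("W3C-checklink", "http://ads.example.net/show_ads.js"))
print(link_status("W3C-checklink", "http://www.example.com/index.html"))
```

So the exclusion comes from the remote host's own rules, which is why nothing in your robots.txt is needed to keep the checker off those links.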