robots.txt?
Patrick
08-23-1999, 04:09 PM
What does the "robots.txt" file do?
------------------
Patrick
www.foreverkate.com (http://www.foreverkate.com)
jokesplus
08-23-1999, 04:53 PM
robots.txt gives you some control over which pages search engine spiders will come and visit. My web-site has the following:
User-agent: *
Disallow: /stats/
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /jokes/jscript/
which asks well-behaved spiders to stay out of each of the directories listed. You can get more info from
http://info.webcrawler.com/mak/projects/robots/exclusion.html
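To see how a well-behaved spider would read the rules above, here is a minimal sketch using Python's standard urllib.robotparser module. The host www.example.com, the agent name "ExampleSpider" and the sample paths are placeholders for illustration, not anything from the real site.

```python
from urllib.robotparser import RobotFileParser

# The rules quoted in the post above
RULES = """\
User-agent: *
Disallow: /stats/
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /jokes/jscript/
"""


def allowed(path, agent="ExampleSpider"):
    """Return True if a polite spider may fetch the given path."""
    parser = RobotFileParser()
    parser.parse(RULES.splitlines())
    # www.example.com is a placeholder host
    return parser.can_fetch(agent, "http://www.example.com" + path)


if __name__ == "__main__":
    for path in ("/jokes/index.html", "/stats/report.html", "/jokes/jscript/menu.js"):
        print(path, allowed(path))
```

Nothing in the file is enforced by the server. A spider that ignores robots.txt can still request those directories.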
btw - Nice web-site.
HTH
Jarrod
------------------
For humor on-line check out Jokesplus
http://www.jokesplus.com
[This message has been edited by jokesplus (edited 08-23-99)]
Armand
08-23-1999, 06:03 PM
While we are on the subject of robots.txt, I have a question. I know you can set it to disallow directories, and I have mine set to do so. But I am wondering if you can specify individual pages not to be spidered in a directory that is otherwise set to allow indexing? Yeah, picky, picky, huh?
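On the single-page question: a Disallow value is matched as a path prefix, so a full file path should work the same way a directory does. A minimal sketch, again with Python's standard urllib.robotparser. The file /jokes/private.html, the host and the agent name are made-up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: block one page, leave the rest of /jokes/ open
PAGE_RULES = """\
User-agent: *
Disallow: /jokes/private.html
"""


def page_allowed(path, agent="ExampleSpider"):
    """Return True if the path is not covered by a Disallow prefix."""
    parser = RobotFileParser()
    parser.parse(PAGE_RULES.splitlines())
    # www.example.com is a placeholder host
    return parser.can_fetch(agent, "http://www.example.com" + path)


if __name__ == "__main__":
    print(page_allowed("/jokes/index.html"))
    print(page_allowed("/jokes/private.html"))
```

Because the match is by prefix, the same line would also cover any other path that begins with /jokes/private.html.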
Justin
08-23-1999, 07:49 PM
Be careful listing directories that you don't want other people to know about... because anyone can view your robots.txt file. It's a common way for people to find out what you may have on your site that you don't want people to see.
Also keep in mind that a search engine can't index pages that aren't linked to - they have no more or less permissions than any standard browser would have. They can't index (for example) your /stats directory unless you link to it somewhere. They also can't get into a password protected area.
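To illustrate the first point, this is roughly all a curious visitor has to do once they have fetched your robots.txt in a browser. A minimal sketch; the function name disallowed_paths and the sample text are hypothetical.

```python
def disallowed_paths(robots_text):
    """Collect every non-empty Disallow value from robots.txt text."""
    paths = []
    for line in robots_text.splitlines():
        # Drop comments and surrounding whitespace
        line = line.split("#", 1)[0].strip()
        field, _, value = line.partition(":")
        if field.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths


if __name__ == "__main__":
    sample = "User-agent: *\nDisallow: /stats/\nDisallow: /cgi-bin/\n"
    print(disallowed_paths(sample))
```

Anything that really needs to stay private is better kept behind password protection than listed here.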
HTH
------------------
Justin Nelson
FutureQuest Support
elite
08-23-1999, 08:54 PM
Couldn't you use a simple program that checks the referrer on requests for your robots.txt file against a list of all the search engines? When a search engine came to it, it would be given the file; anyone else would receive nothing. Let me try to see what I can come up with..
elite
08-23-1999, 09:06 PM
I have made a file that will only show your robots.txt file to search engines, and not to people who type in that URL to see what you don't want seen.
OK, in theory this should work, but no guarantees! You might want to check with more knowledgeable people about it before you try it..
Step 1:
Save the script at the end of this post to a file called "robotsref.cgi". Then upload it in ASCII mode and chmod it to 755.
Step 2:
Go to your robots.txt file, and add the following line: <!--#INCLUDE virtual="/robotsref.cgi"-->
Step 3:
Create a file named robotstxt.txt and add all of the paths you do not want search engines to spider, e.g.
User-agent: *
Disallow: /stats/
Step 4:
Add all of the spiders to the code below. I added a few, but you might want to add some more!
#!/usr/bin/perl
use strict;
use warnings;

# Get the referrer info. Note: spiders usually identify themselves in the
# User-Agent header (HTTP_USER_AGENT) rather than in the referrer.
my $referrer = $ENV{'HTTP_REFERER'} || '';

# Read the rules that search engines should see
my $rules = '';
if (open(my $fh, '<', 'robotstxt.txt')) {
    local $/;    # slurp the whole file so line breaks are preserved
    $rules = <$fh>;
    $rules = '' unless defined $rules;
    close($fh);
}

# Default message if the referrer doesn't match any search engine below
my $message = '';

# Match the referrer against known search engines; add more as needed
my @engines = ('infoseek', 'go\.com', 'cnet', 'altavista\.com', 'zip2\.com', 'scooter');
foreach my $engine (@engines) {
    $message = $rules if $referrer =~ /$engine/;
}

# Output the message; robots.txt is plain text
print "Content-type: text/plain\n\n";
print $message;
[This message has been edited by elite (edited 08-23-99)]
elite
08-23-1999, 10:11 PM
Come to think of it, you would have to get your .txt file parsed for SSI by adding a line to an .htaccess file. Another thing is that HTTP_REFERER, I believe, is turned off, so that wouldn't work right now on FutureQuest...
I think there is another way to do the same thing, so I will look into that tomorrow.
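One possible alternative, offered only as a sketch and not as anything tested on FutureQuest: spiders normally announce themselves in the User-Agent header, so the same idea can key on that instead of the referrer. The function name robots_body and the KNOWN_SPIDERS list are hypothetical.

```python
# Hypothetical list of lowercase tokens found in spider User-Agent strings
KNOWN_SPIDERS = ("scooter", "infoseek", "slurp")


def robots_body(user_agent, rules, spiders=KNOWN_SPIDERS):
    """Return the rules for recognised spiders, an empty body otherwise."""
    agent = (user_agent or "").lower()
    if any(token in agent for token in spiders):
        return rules
    return ""


if __name__ == "__main__":
    rules = "User-agent: *\nDisallow: /stats/\n"
    print(repr(robots_body("Scooter/2.0", rules)))
    print(repr(robots_body("Mozilla/4.0 (compatible; MSIE 5.0)", rules)))
```

Two caveats: anyone can send a spider's User-Agent string by hand, and a spider missing from the list receives an empty file, which it will treat as permission to crawl everything.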