Search engine indexing dynamic (PHP/Perl) sites?


stan
06-01-1999, 09:31 PM
I understand that most search engines will not follow links on dynamic pages such as those generated by Perl or PHP.

The main reason for spiders/crawlers to stop at dynamic links is that, well, they're dynamic, so
(1) the content could change every time the spider looks at the page, and
(2) the spider could get lost on a very long trip, as with the following PHP script "lost.php":


<HTML>
<HEAD><TITLE>Lost in space</TITLE></HEAD>
<BODY>
<?
  $n++;
  echo "That's $n. Now click <A HREF=\"lost.php?n=$n\">here</A>";
?>
</BODY>
</HTML>


Often, however, a PHP/MySQL site is not as dynamic as the code fragment above. Only changes in the MySQL database will result in a change of content.

Questions
(1) Is it true that search engine spiders will skip PHP/MySQL pages or not follow links on them?
(2) Is there a way to tell these spiders to follow them anyway?
(3) Is it possible to tell Apache to interpret http://www.domain.com/path/script.php/some/argument/list as
http://www.domain.com/path/script.php?some/argument/list? That would be an answer to (2).


- Stan (eager not to lose potential visitors)

Jacob Stetser
06-01-1999, 09:44 PM
It sure is...

// Break up the URL path, using '/' as the delimiter
$url_array = explode("/", $PATH_INFO);

if ($url_array[1]) {
    $variable = urldecode($url_array[1]);
}

if ($url_array[2]) {
    $variable2 = urldecode($url_array[2]);
}

... etc.

I use it in quite a few places, not only to make sites search engine accessible but also to give a nicer URL. I don't like lots of ?s, &s, and =s! :)
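
For anyone reading this on a newer PHP, the same idea could be sketched roughly as below. It assumes the path is available as $_SERVER['PATH_INFO'] rather than as a global $PATH_INFO, and path_segments() is a made-up helper name, not something PHP provides. It keeps the urldecode() call from the snippet above; whether the value arrives already decoded depends on the server setup.

```php
<?php
// Hypothetical helper: split a PATH_INFO string such as "/news/1999/june"
// into decoded segments. The name path_segments() is invented for this sketch.
function path_segments(string $pathInfo): array
{
    $segments = [];
    foreach (explode('/', $pathInfo) as $part) {
        if ($part === '') {
            continue; // skip the empty piece produced by the leading slash
        }
        $segments[] = urldecode($part);
    }
    return $segments;
}

// In a real request the value would come from $_SERVER['PATH_INFO'];
// a literal is used here so the sketch also runs from the command line.
$args = path_segments('/news/1999/june%20issue');
print_r($args);
```

Unlike the original, the segments come back zero-indexed, because the empty piece in front of the leading slash is dropped.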

Jake
------------------
icongarden.com/?fq (http://icongarden.com/?fq)
icongarden: making good ideas grow.

Terra
06-01-1999, 10:52 PM
Hmmm, I never quite thought about that...

Of course, having a spider nail all your dynamic pages does tend to load up the server. I have had a few nasty SkyCache engines (caching proxies) hit our servers, and they are not nice at all... One day I will have to put the brakes on them if they don't mellow out their pull rate...

Question:
1) What URLs will search engines index (.htm, .html)?
2) What URLs are on the NO-index list (.pl, .cgi, .php)?

I'm wondering if there is a way around this in the server core, e.g. naming your files like 'filenameP.html' and having the server recognize the 'P.html' part as PHP3... I know for a fact it will not do this now, but maybe the handler code could parse the identifier.extension combination...

Don't mind me - just thinking out loud... :) It could possibly break things for other people who just so happen to use such a name without meaning to... I'm not even sure how difficult it would be to implement, as the handlers are pretty solid on parsing the extension only...

If there is a rarely used 'static' extension that answers Question #1, then I might map PHP3 to it instead of making very ugly hacks to the Apache handlers (**worst case scenario**)...

--
Terra
--Inspiration, often leads to perspiration, which leads to fireworks from your keyboard...
FutureQuest

stan
06-01-1999, 11:01 PM
Maybe it is possible to tell Apache that

http://host/php/file/argument.html

should be read as: run 'file' as a PHP script?

But maybe Jacob's suggestion is enough: just use

http://host/path/file.php/arguments.html

- Stan

Jacob Stetser
06-01-1999, 11:41 PM
Well, there's more: you can put something like what I've got below in your .htaccess file to force the server to treat a certain file as PHP (for our example, we'll rename jake.php to jake.html)...


<Files jake.html>
ForceType application/x-httpd-php3
</Files>


That tells the server that despite the .html extension, it's really a PHP script and should be run as such. I use this extensively as well. From the outside, you wouldn't know one of my sites was script-run at all, and I haven't had to go to the length of asking Apache to parse _every_ file for PHP. I just pick and choose the PHP files, especially those I use with the PATH_INFO trick: they don't have any extension at all, so in a URL they look like part of the folder hierarchy :)

I love PHP! (Did I mention PHP will automagically create arrays for you from form input if you end a field name with brackets? E.g. file[] will become $file[] in your script and can be accessed like an array. Woohoo!)

Jake
------------------
icongarden.com/?fq (http://icongarden.com/?fq)
icongarden: making good ideas grow.

Ron
06-02-1999, 12:43 AM
Question:
    1) What URLs will search engines index (.htm, .html)?
    2) What URLs are on the NO-index list (.pl, .cgi, .php)?

In the "old" days, all of the SEs would follow all of these links (and many others), whenever and wherever they found them. But even AltaVista (AV) no longer follows all of them; it only samples a web site (250-500 pages seems to be the max). Excite, at the other end of the spectrum, rarely samples more than 25 pages, and everything except root URLs is purged every month or two.

However, insofar as the SEs follow any links, every one with which I'm familiar will still follow all of the above extensions. But none of the SEs will follow any link with a "?" embedded in it...
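
Given that, one option is to emit links in the path form discussed earlier in the thread instead of with a query string. A rough sketch follows; path_style_url() is a hypothetical helper name, and the script path is just the placeholder used in Stan's question.

```php
<?php
// Hypothetical helper: build "script.php/some/argument/list" style links
// so that no "?" appears in the URL a spider sees.
function path_style_url(string $script, array $args): string
{
    $url = $script;
    foreach ($args as $value) {
        // rawurlencode() keeps each argument safe inside a path segment
        $url .= '/' . rawurlencode((string) $value);
    }
    return $url;
}

echo path_style_url('/path/script.php', ['some', 'argument', 'list']), "\n";
```

The receiving script would then read the arguments back out of PATH_INFO, as in Jacob's post.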


------------------
Ron Carnell
Passions in Poetry
http://netpoets.com/

stan
06-05-1999, 08:01 PM
Thanx Jacob,

This <Files> directive *really* works, also for files
without extensions, making it look even more like a folder.

What did you use "file[] will become $file[] in PHP" for? For check boxes in forms?

- Stan

Justin
06-05-1999, 09:50 PM
I find the $file[] handy for a lot of things. Say you have a list of files (for example). That list can be any length. Now if each one has a checkbox, how does the script that is called know how many checkboxes there were on the calling page? And how does the original page know what to name these checkboxes?

So you could have something like this:

for ($i = 0; $i < $numberOfFiles; $i++) {
    print "<input type=checkbox name=file[$i] value=1>\n";
}

So you get file[0], file[1], and so on :)
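
On the receiving side, a minimal sketch of reading those checkboxes back might look like this. It assumes a newer PHP where the form data arrives in an array such as $_POST instead of as a global $file, and checked_indexes() is a made-up name for illustration.

```php
<?php
// Hypothetical receiver: return the indexes of the boxes that were ticked.
// Browsers only submit checked checkboxes, so unchecked ones are simply absent.
function checked_indexes(array $form): array
{
    $files = $form['file'] ?? [];
    if (!is_array($files)) {
        return []; // a single non-array "file" field is not what we expect
    }
    return array_keys($files);
}

// Simulated submission (stands in for $_POST): boxes 0 and 2 were ticked.
$form = ['file' => [0 => '1', 2 => '1']];
print_r(checked_indexes($form));
```

The called script never needs to know how many checkboxes the page had; it just walks whatever keys show up.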

------------------
Justin Nelson
FutureQuest Support