.mod_rewrite to hide query strings
TDarlington
08-30-2001, 10:53 PM
This issue was addressed in part some time ago but my search of the forums didn't quite find a resolution. So here goes.
I have a little online shop as part of a larger site. To view an individual product, you use a URL something like this:
product.php?sku=1234
The problem with that is that if I want product pages to appear in site searches, I'll need to set my search engine to follow query strings. Which wouldn't be so bad, but since I'm using PHP sessions, that means the search engine robot will start getting URLs with PHPSESSIDs tacked on the end of the URL:
product.php?sku=1234&PHPSESSID=abcdefghi...
Worse, the spider will sometimes hit the same product with 3 different PHPSESSIDs and record it as 3 different pages in the search index.
It seems like the solution is to use Apache's mod_rewrite to create URLs that don't contain query strings so that I can turn off the "follow query strings" option in the search engine.
That means that I'd set up a rewrite rule to accept URLs something like this:
product/1234
and pass them along to the server as
product.php?sku=1234
This creates another problem, though. From what I understand, other search engines will see some rewritten URLs and spider through them at high speed, not realizing that they aren't static pages, which would clobber the server.
So I'm kind of in a bind. Is there a way I can hide query strings without creating a spider hazard for the FQ server? Or does someone see another solution entirely for solving my problem?
An alternative would be, if you have the products in a database (i.e. MySQL, which I would guess is the case since you are accessing products by ID in a PHP page), to incorporate a database search into the existing site search so that it doesn't even need to search product pages. I believe that would make the query string a moot point.
Dan
Unfortunately, mod_rewrite is an over-abused method used to overcome poor application design.
The solution you're seeking should be resolved by modifying the search algorithms.
Originally posted by TDarlington:
That means that I'd set up a rewrite rule to accept URLs something like this:
product/1234
and pass them along to the server as
product.php?sku=1234
Where would the 'product/1234' URLs come from? If you rewrite all your links so they are in this form, then where does the session ID go? If the session ID is not needed in this form, then it shouldn't be needed in the traditional form, either.
Rich
TDarlington
08-31-2001, 11:55 AM
Although I may in fact be a poor application designer, I don't think you understand what I'm trying to achieve here, Rich.
Originally posted by Rich:
The solution you're seeking should be resolved by modifying the search algorithms.
I think it's unrealistic to expect me to write my own site search engine! Remember that this needs to be a consolidated full-text search plus a search of the product pages. Clearly the way to do this is to have a single index that treats all pages the same way, so that if you do a search for an artist's name, you'll get the pages about that artist's exhibition along with the online store page for his exhibition catalog.
Originally posted by Rich:
Where would the 'product/1234' URLs come from? If you rewrite all your links so they are in this form, then where does the session ID go? If the session ID is not needed in this form, then it shouldn't be needed in the traditional form, either.
The whole point is that the session ID can remain part of the query string -- in fact it WILL, automatically, because that's where PHP puts it when unable to store a cookie. (I am using and referring to PHP's built-in session functions.)
So if I set up a link to an individual product page like this...
<a href="product/1234">productname</a>
...and PHP decides to stick a session ID onto it, PHP will automatically rewrite the link to this:
<a href="product/1234?PHPSESSID=abcdefhijk...">productname</a>
Which would be fine. A rewrite application could easily extract both the product number and the session ID from a REQUEST_URI like that, so it would work for users. And the search engine could safely be directed to ignore query strings, so the site would be indexed cleanly.
But again, that leaves the problem of external spiders hitting the site too hard.
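The REQUEST_URI handling described in this post could be sketched roughly as follows. This is only an illustration, not the poster's actual code: the function name parse_product_uri is hypothetical, the /product/<digits> path shape is assumed from the example URLs, and the syntax assumes PHP 7.1 or later rather than the PHP 4 of the time.

```php
<?php
// Hypothetical helper: split a request URI such as
// "/product/1234?PHPSESSID=abc" into the SKU and the session ID.
function parse_product_uri(string $requestUri): ?array
{
    $path  = parse_url($requestUri, PHP_URL_PATH);
    $query = parse_url($requestUri, PHP_URL_QUERY);

    // Assumption: product URLs look like /product/<digits>, optionally
    // with a trailing slash. Anything else is not a product page.
    if (!is_string($path) || !preg_match('#^/product/(\d+)/?$#', $path, $m)) {
        return null;
    }

    // Decode the query string (if any) into an array of parameters.
    $params = [];
    if (is_string($query)) {
        parse_str($query, $params);
    }

    $sessionId = $params['PHPSESSID'] ?? null;

    return [
        'sku'        => $m[1],
        'session_id' => is_string($sessionId) ? $sessionId : null,
    ];
}
```

A front script could call parse_product_uri($_SERVER['REQUEST_URI']) and hand the resulting sku to the existing product.php logic. A search engine told to ignore query strings would only ever see the /product/1234 part.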
PaulKroll
08-31-2001, 02:34 PM
What spider is doing this? PHPSESSID is the default for PHP IIRC, and this means a tremendous waste of that search engine's resources if they're not ignoring it as part of URLs.
You might want to drop them a note to tell them this, but this is ultimately something that search engines have to deal with.
PaulKroll
08-31-2001, 02:50 PM
And I thought I'd already had enough caffeine today... OK, so this is a search engine YOU'RE running, something you've bought or some free thing you've installed.
Modding it to tear out the PHPSESSID should not be all that hard... assuming you have source and license to alter it as you please. If it's free, you can probably just mention on the relevant forum how lame it is that the search engine doesn't already do this, and watch someone trip over themselves adding that feature for bragging rights.
Otherwise, you'll probably have to pay a developer to do it.
It's easier, trust me, to do that mod to a Perl or PHP search engine than to make mod_rewrite work properly and efficiently. :)
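For a PHP-based search engine, "tearing out the PHPSESSID" before a URL is recorded might look something like the sketch below. The function name strip_session_id is made up for illustration and is not part of any particular search product. It assumes PHP 7.1 or later and URLs without a #fragment.

```php
<?php
// Hypothetical helper: remove the session parameter from a URL so the
// same page is not indexed once per session ID.
// Assumption: the URL carries no #fragment part.
function strip_session_id(string $url, string $param = 'PHPSESSID'): string
{
    $parts = explode('?', $url, 2);
    if (count($parts) < 2) {
        return $url; // no query string, nothing to strip
    }
    [$base, $query] = $parts;

    // Keep every name=value pair except the session parameter.
    $kept = [];
    foreach (explode('&', $query) as $pair) {
        $name = explode('=', $pair, 2)[0];
        if ($pair !== '' && $name !== $param) {
            $kept[] = $pair;
        }
    }

    return $base . ($kept ? '?' . implode('&', $kept) : '');
}
```

Running every spidered URL through a function like this before it reaches the index would collapse the "3 different PHPSESSIDs" case from the first post into a single entry.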
sheila
08-31-2001, 03:04 PM
Here's a thought:
I've currently got the FDSE (Fluid Dynamics Search Engine) (http://www.xav.com/scripts/search/) installed on my site. It has a feature where you put a meta tag in the head of each page to specify the URL you want that page indexed under.
So, for example, my homepage could be indexed as either:
http://www.thinkspot.net/sheila/
or
http://www.thinkspot.net/sheila/index.html
For a number of reasons, I would prefer it be indexed as the former. So, in that page I have the following Meta-Tag:
<META NAME="fdse-index-as" CONTENT="http://www.thinkspot.net/sheila/">
Therefore, any time the FDSE site search engine runs across this page, regardless of the link it followed to get there, it will index it only as the first of the two options I listed.
That means it also won't be listed as:
http://thinkspot.net/sheila
or
http://thinkspot.net/sheila/index.html
Even if you don't want to use that particular search engine, maybe this at least gives you an idea how you could modify your site search engine and/or your pages so that it doesn't index them with the session ID embedded in the URL.
HTH,
daledude
09-03-2001, 12:50 AM
Maybe this article would be helpful.
http://spider-food.net/dynamic-page-optimization.html
Good luck,
Dale
TDarlington
09-03-2001, 01:08 AM
Actually, I've solved the problem, and the solution is terribly obvious in retrospect. I just have my template look at HTTP_USER_AGENT, and if it sees the search engine spider, it disables PHP sessions for that pageview. Poof, no more PHPSESSID, end of problem!
(FYI, Sheila, I'm using FDSE too. I found that the fdse-index-as meta wasn't reliable as a means of dealing with query strings, unfortunately.)
TD
Originally posted by TDarlington:
Actually, I've solved the problem, and the solution is terribly obvious in retrospect. I just have my template look at HTTP_USER_AGENT, and if it sees the search engine spider, it disables PHP sessions for that pageview. Poof, no more PHPSESSID, end of problem!
TD
An excellent (and eloquent) solution, I might add! :)
Rich
TDarlington
09-03-2001, 12:20 PM
Originally posted by Rich:
An excellent (and eloquent) solution, I might add! :)
Thanks! I had to smack my forehead when I realized that a simple
if (strpos($HTTP_USER_AGENT, "FDSE") === false) { ... } around my session functions would solve the whole problem.
I'm like Canada Post: I might be really slow, but eventually I get there.
TD
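A slightly fuller version of that user-agent check might look like the sketch below. The function names are hypothetical, the "FDSE" marker is taken from the snippet above (the spider's full user-agent string isn't given in this thread), and the syntax assumes PHP 7.1 or later. One detail worth spelling out: strpos() returns 0 when the match is at the very start of the string and false when there is no match, so the result has to be compared with === or !== rather than negated with a plain "!".

```php
<?php
// Hypothetical helper: true when the user agent looks like the
// site-search spider. The default marker "FDSE" is an assumption
// about what the spider's user-agent string contains.
function is_site_spider(?string $userAgent, string $marker = 'FDSE'): bool
{
    // strpos() returns 0 for a match at the start of the string, so
    // compare strictly against false instead of using "!".
    return $userAgent !== null && strpos($userAgent, $marker) !== false;
}

// Hypothetical wrapper: start a session for everyone except the spider.
// Returns true if a session was started.
function start_session_unless_spider(?string $userAgent): bool
{
    if (is_site_spider($userAgent)) {
        return false; // no session, so no PHPSESSID in rewritten links
    }
    return session_start();
}
```

A template would call start_session_unless_spider($_SERVER['HTTP_USER_AGENT'] ?? null) near the top, before any output. Pages served to the spider then carry no PHPSESSID, while ordinary visitors get sessions as before.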