FutureQuest, Inc.
Old 02-23-2009, 11:15 AM   Postid: 172957
donjuevo
Registered User

 
Join Date: Feb 2009
Posts: 9
Is Regex the Answer?

I promise to make this as brief as possible...

I'm building a site that requires members to place a little of my code on their websites and to keep that code visible. I built a crawler that uses preg_match while scanning member sites to check that this code is present. It works well for the most part. The problem is that the crawler tells me the code is there even if it's been commented out. I found a great introduction to regex here: http://www.phpro.org/tutorials/Intro...PHP-Regex.html and it leads me to believe regex could be the solution. I just can't seem to see my way through it.

Is there a way to build a regex that will...
1. Only look between the opening and closing body tags
2. Not look inside comments

Or am I barking up the wrong PHP function?

Any help or advice would be greatly appreciated!

Thanks,
Don
Old 02-23-2009, 11:26 AM   Postid: 172958
 Kevin
Systems Administrator
 
 
Join Date: Aug 2001
Location: Orlando, FL
Posts: 2,481
Re: Is Regex the Answer?

Regular Expressions are very powerful and certainly worth learning. However, without seeing the code it would be impossible to tell if they will help you.

Also, whatever you use I suspect it will be difficult to determine if the code is encapsulated within a comment. You may end up having to use an interpreter/parser for the language in question to determine if the code is there and valid.
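As a rough illustration of the parser route, here is a minimal sketch using PHP's DOM extension. It assumes the member code shows up as an attribute value (an href or src, for example) on some element inside the body; the thread never says so, and the function name linkPresentInBody is hypothetical.

```php
<?php
// Hypothetical helper (not from the thread): returns true when any attribute
// of an element inside <body> contains the tracked URL fragment.
function linkPresentInBody(string $markup, string $needle): bool
{
    $dom = new DOMDocument();
    // Member pages are often not valid HTML; keep libxml warnings quiet.
    $previous = libxml_use_internal_errors(true);
    $loaded = $dom->loadHTML($markup);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$loaded) {
        return false;
    }

    $xpath = new DOMXPath($dom);
    // Comments are comment nodes, not elements, so their contents never
    // appear as attributes and are skipped by this query.
    foreach ($xpath->query('//body//@*') as $attr) {
        if (stripos($attr->value, $needle) !== false) {
            return true;
        }
    }
    return false;
}
```

Because the parser builds a tree first, commented-out markup is never examined, which is the case the plain preg_match check could not tell apart.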
__________________
Kevin
Old 02-23-2009, 11:50 AM   Postid: 172961
donjuevo
Registered User

 
Join Date: Feb 2009
Posts: 9
Re: Is Regex the Answer?

Kevin,

Thanks for the quick response!

I'm using file_get_contents to grab the contents of their index page then scanning for my bit within it...

preg_match('/www\.mydomain\.com\/mypage\.php\?cid/i', $this->markup)

Because I'm such a firm believer in the phrase 'Work Smarter Not Harder', I'd much rather automate this instead of regularly visiting these sites.

Can this be done in stages? i.e.:
Stage 1: return only what's between the body tags, then
Stage 2: remove everything between comment tags, then
Stage 3: scan what's left for my bit?

Thanks again,

Don
Old 02-24-2009, 02:14 PM   Postid: 172990
donjuevo
Registered User

 
Join Date: Feb 2009
Posts: 9
Re: Is Regex the Answer?

This is where my two-stage filtering stands now...

preg_match("/<body(.|\s)*?>(.|\s)*?<\/body>/i", $the_markup, $filter1);
$filter2 = preg_replace("/<!--(.|\s)*?-->/","",$filter1[0]);

Something in the first line is causing an internal server error. Anyone have any thoughts???
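One plausible culprit, which nothing in this thread confirms, is the (.|\s)*? construct: in PCRE, an alternation group repeated once per character can use a great deal of stack on a large page and may crash the PHP process. Below is a minimal sketch of the same two stages using the s modifier, so that . matches newlines directly. The function name visibleBody is hypothetical.

```php
<?php
// Hypothetical helper: returns the body markup with HTML comments removed,
// or null when no <body>...</body> pair is found.
function visibleBody(string $markup): ?string
{
    // [^>]* covers attributes on the body tag; the s modifier lets . match
    // newlines, so no (.|\s) alternation is needed.
    if (preg_match('~<body\b[^>]*>(.*?)</body>~is', $markup, $m) !== 1) {
        return null;
    }
    // Stage 2: drop comments, including multi-line ones.
    return preg_replace('~<!--.*?-->~s', '', $m[1]);
}
```

The result can then be scanned with the existing preg_match check for the member code.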

Again, any help at all would be appreciated!!!

Thanks,
Don
Old 02-24-2009, 07:15 PM   Postid: 172993
Tom E.
Site Owner
 
 
Join Date: Feb 2005
Location: Connecticut
Posts: 717
Re: Is Regex the Answer?

Hi Don,

The strip_tags() function will do it all for you.

This will strip all comments and tags from the first parameter, except for the tags in the second parameter.

PHP Code:
$whats_left = strip_tags($the_markup, '<a>');
Then check $whats_left for the URL you're looking for.
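To make the suggestion concrete, here is a small sketch with made-up sample markup. The cid values and the sample page are assumptions for illustration only. It relies on two documented behaviours of strip_tags(): HTML comments are always removed, and tags named in the second parameter are kept along with their attributes.

```php
<?php
// Sample markup (hypothetical): one commented-out link and one live link.
$the_markup = '<p>Intro</p>'
    . '<!-- <a href="http://www.mydomain.com/mypage.php?cid=1">hidden</a> -->'
    . '<a href="http://www.mydomain.com/mypage.php?cid=2">shown</a>';

// Comments are always stripped; <a> survives because it is allow-listed.
$whats_left = strip_tags($the_markup, '<a>');

// Only the live link should still be detectable.
$found = preg_match('~www\.mydomain\.com/mypage\.php\?cid=2~i', $whats_left) === 1;
```

Note that this keeps anchors from anywhere in the page, not only from between the body tags.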

-- Tom
Old 02-24-2009, 08:12 PM   Postid: 172994
donjuevo
Registered User

 
Join Date: Feb 2009
Posts: 9
Re: Is Regex the Answer?

Tom,

Thanks for your reply. I've just read up on strip_tags(). It is a nice side effect that all comments are automatically removed, but beyond that, I don't believe it's a viable solution. My two-stage regex solution in function form as it looks now:

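function codePresent($the_url){
    // Pattern for the snippet members must keep on their pages
    $myCode = "~www\.mydomain\.com/mypage\.php\?cid~i";
    $the_markup = file_get_contents($the_url);
    // Stage 1: capture everything between the body tags
    preg_match("~<body(.|\s)*?>((.|\s)*?)</body>~i", $the_markup, $body_content);
    // Stage 2: remove HTML comments from the body content
    $visible_content = preg_replace("~<!--(.|\s)*?-->~","",$body_content[2]);
    // Stage 3: 1 if the snippet is still present, 0 otherwise
    $myCodeCheck = preg_match($myCode, $visible_content);
    return $myCodeCheck;
}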
$body_content[2] holds everything between the body tags.
$visible_content is the body content w/ comments removed. These work like a charm.
A query is run returning the URLs of members, and in a do/while loop each URL is sent to the function above.

Here's the problem now: If I run this as is, I get an HTTP 500 - Internal server error. If I only submit a single URL, it behaves as it should. If I eliminate the step with the body tags and pass the markup straight to the preg_replace line, it iterates through all the URLs as it should.

So, I'm sorry to say I no longer have a regex problem. Now I have an unexplainable PHP problem. If I had an extra computer, I'd have thrown this one hours ago.

Thanks for listening,

Don
Old 02-24-2009, 08:34 PM   Postid: 172995
Tom E.
Site Owner
 
 
Join Date: Feb 2005
Location: Connecticut
Posts: 717
Re: Is Regex the Answer?

I had assumed that you were checking to see if your URL was inside an anchor tag that has not been commented out.

Have you checked the PHP error log in CNC?

You can also turn on error reporting in the .htaccess file to display errors in the browser while you're debugging.
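A script-level alternative to the .htaccess route, assuming the host permits ini_set() for these settings (an assumption; nothing in the thread confirms it). These lines would go at the top of the crawler script. They only surface errors that PHP itself reports, so they cannot show anything if the PHP process crashes outright.

```php
<?php
// Report every error level and echo errors into the response while debugging.
error_reporting(E_ALL);
ini_set('display_errors', '1');
// Also keep writing to the error log so nothing is lost once display is off.
ini_set('log_errors', '1');
```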

Tom
Old 02-24-2009, 09:49 PM   Postid: 172996
donjuevo
Registered User

 
Join Date: Feb 2009
Posts: 9
Re: Is Regex the Answer?

Found the error in the log...

Premature end of script headers: php5.cgi

I had no luck getting additional error information w/ .htaccess.

Don
Old 02-25-2009, 11:58 AM   Postid: 173004
donjuevo
Registered User

 
Join Date: Feb 2009
Posts: 9
Re: Is Regex the Answer?

Additional discoveries...

When the error occurs, a 6.27 MB core dump file shows up in the same directory. Also, the first call to the function is successful, since a subsequent insert query within the do/while loop posts the crawler results to a table. My best guess is that the problem lies with the 2nd visit to the function and the $body_content array created by preg_match. I've tried unsetting this variable prior to the function closing but it makes no difference. I'm thinking it may be something on Dreamhost's end but I can't be certain.
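Since the crash goes away when the body-tag step is skipped, one hedged workaround is to isolate the body without a regular expression at all. Below is a minimal sketch using plain string functions. The name bodyContent is hypothetical, and the approach assumes a single body element with no ">" inside the attributes of its opening tag.

```php
<?php
// Hypothetical helper: slice out everything between <body ...> and </body>
// with string functions only, so no regex backtracking is involved.
function bodyContent(string $markup): ?string
{
    $open = stripos($markup, '<body');
    if ($open === false) {
        return null;
    }
    // End of the opening tag (assumes no ">" inside its attribute values).
    $start = strpos($markup, '>', $open);
    if ($start === false) {
        return null;
    }
    $start++;
    // Last closing tag, searched case-insensitively.
    $end = strripos($markup, '</body>');
    if ($end === false || $end < $start) {
        return null;
    }
    return substr($markup, $start, $end - $start);
}
```

The returned string could be passed to the existing preg_replace comment-stripping line in place of $body_content[2].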

Don
Old 02-25-2009, 03:52 PM   Postid: 173005
 Terra
CTO FutureQuest, Inc.
 
 
Join Date: Jun 1998
Location: Z'ha'dum
Posts: 7,672
Re: Is Regex the Answer?

Quote:
When the error occurs, a 6.27Mb core dump file shows up in the same directory.


-AND-

(at first, I thought this was on our servers, and it sent chills down my spine until I read)
Quote:
I'm thinking it may be something on Dreamhost's end


Wow, they are allowing Apache/PHP core dump files to be dropped into a client's web space?

Scary!

If you are sharing that Apache/PHP engine with others and it cores, then run it through 'strings' to see all sorts of interesting stuff from your neighbors like passwords, credit cards, and such...



I put in a lot of effort to modify our servers and Linux kernel to prevent that from happening... I think that's one of the reasons that PHPPete likes us so much...

Hi PHPPete...
__________________
--
Terra
sysAdmin
FutureQuest, Inc.
http://www.FutureQuest.net