02-23-2009, 11:15 AM
|
Postid: 172957
|
|
Registered User
Join Date: Feb 2009
Posts: 9
|
Is Regex the Answer?
I promise to make this as brief as possible...
I'm building a site that requires members to place a little of my code on their websites and that that code be visible. I built a crawler that uses preg_match while scanning member sites to check that this code is present. It works well for the most part. The problem is that the crawler tells me the code is there even if it's been commented out. I found a great introduction to regex here: http://www.phpro.org/tutorials/Intro...PHP-Regex.html and it leads me to believe regex could be the solution. I just can't seem to see my way through it.
Is there a way to build a regex that will...
1. Only look between the opening and closing body tags
2. Not look inside comments
Or am I barking up the wrong PHP function?
Any help or advice would be greatly appreciated!
Thanks,
Don
|
|
|
02-23-2009, 11:26 AM
|
Postid: 172958
|
|
Systems Administrator
Join Date: Aug 2001
Location: Orlando, FL
Posts: 2,481
|
Re: Is Regex the Answer?
Regular Expressions are very powerful and certainly worth learning. However, without seeing the code it would be impossible to tell if they will help you.
Also, whatever you use I suspect it will be difficult to determine if the code is encapsulated within a comment. You may end up having to use an interpreter/parser for the language in question to determine if the code is there and valid.
__________________
Kevin
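A minimal sketch of the parser route suggested above, assuming PHP's DOM extension is available. The helper name codeInLiveMarkup is hypothetical, and checking attribute values is only a guess at where the member code lives. Commented-out markup becomes a comment node, so the attribute search never sees it.

```php
<?php
// Hypothetical helper: true when $needle appears in any attribute value
// inside <body>. Comment nodes have no attributes, so commented-out code
// is not matched.
function codeInLiveMarkup($html, $needle)
{
    $doc = new DOMDocument();
    // Real-world pages are often malformed; collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Assumption: $needle contains no double quote, because it is embedded
    // in the XPath string literal. contains() is case-sensitive.
    $hits = $xpath->query('//body//@*[contains(., "' . $needle . '")]');
    return $hits !== false && $hits->length > 0;
}
```

Calling codeInLiveMarkup($markup, 'www.mydomain.com/mypage.php?cid') on the fetched page would then replace the single preg_match test.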
|
|
|
02-23-2009, 11:50 AM
|
Postid: 172961
|
|
Registered User
Join Date: Feb 2009
Posts: 9
|
Re: Is Regex the Answer?
Kevin,
Thanks for the quick response!
I'm using file_get_contents to grab the contents of their index page then scanning for my bit within it...
preg_match('/www\.mydomain\.com\/mypage\.php\?cid/i', $this->markup)
Because I'm such a firm believer in the phrase 'Work Smarter Not Harder', I'd much rather automate this instead of regularly visiting these sites.
Can this be done in stages? i.e.:
Stage 1: return only what's between body tags, then
Stage 2: remove everything between comment tags then
Stage 3: scan what's left for my bit?
Thanks again,
Don
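The three stages above can be sketched roughly as follows. The function names visibleBodyMarkup and codePresentIn are hypothetical, and the patterns assume reasonably well-formed HTML (no ">" inside the body tag's attributes, no nested comments). The "s" modifier lets "." match newlines, so no alternation group is needed.

```php
<?php
// Stages 1 and 2: isolate the body, then drop HTML comments.
function visibleBodyMarkup($markup)
{
    // Stage 1: capture everything between <body ...> and </body>.
    if (preg_match('~<body\b[^>]*>(.*?)</body>~is', $markup, $m) !== 1) {
        return '';
    }
    // Stage 2: remove <!-- ... --> blocks, including multi-line ones.
    $clean = preg_replace('~<!--.*?-->~s', '', $m[1]);
    // preg_replace() returns null on a PCRE error; treat that as "nothing visible".
    return $clean === null ? '' : $clean;
}

// Stage 3: scan what is left for the member code.
function codePresentIn($markup)
{
    $pattern = '~www\.mydomain\.com/mypage\.php\?cid~i';
    return preg_match($pattern, visibleBodyMarkup($markup)) === 1;
}
```

Keeping the stages in separate functions makes it easy to test each one against a saved copy of a member page before pointing the crawler at live sites.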
|
|
|
02-24-2009, 02:14 PM
|
Postid: 172990
|
|
Registered User
Join Date: Feb 2009
Posts: 9
|
Re: Is Regex the Answer?
This is where my two-stage filtering stands now...
preg_match("/<body(.|\s)*?>(.|\s)*?<\/body>/i", $the_markup, $filter1);
$filter2 = preg_replace("/<!--(.|\s)*?-->/","",$filter1[0]);
Something in the first line is causing an internal server error. Anyone have any thoughts???
Again, any help at all would be appreciated!!!
Thanks,
Don
|
|
|
02-24-2009, 07:15 PM
|
Postid: 172993
|
|
Site Owner
Join Date: Feb 2005
Location: Connecticut
Posts: 717
|
Re: Is Regex the Answer?
Hi Don,
The strip_tags() function will do it all for you.
This will strip all comments and tags from the first parameter, except for the tags in the second parameter.
PHP Code:
$whats_left = strip_tags($the_markup, '<a>');
Then check $whats_left for the URL you're looking for.
-- Tom
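As a small illustration of that suggestion, here is a sketch with made-up sample markup. The cid values are invented for the example, and it assumes the member code sits in an anchor tag; a different tag would have to be named in the second parameter instead.

```php
<?php
// Sample markup: one commented-out link (cid=1) and one live link (cid=2).
$markup = '<p>Hi</p>'
        . '<!-- <a href="http://www.mydomain.com/mypage.php?cid=1">old</a> -->'
        . '<a href="http://www.mydomain.com/mypage.php?cid=2">live</a>';

// Strip comments and every tag except <a>.
$whats_left = strip_tags($markup, '<a>');

// The commented-out link should be gone, the live one should remain.
$commentedGone = strpos($whats_left, 'cid=1') === false;
$liveKept = strpos($whats_left, 'cid=2') !== false;
```

Note that strip_tags() works on the whole document, so it does not by itself restrict the check to the body.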
|
|
|
02-24-2009, 08:12 PM
|
Postid: 172994
|
|
Registered User
Join Date: Feb 2009
Posts: 9
|
Re: Is Regex the Answer?
Tom,
Thanks for your reply. I've just read up on strip_tags(). It is a nice side effect that all comments are automatically removed, but beyond that, I don't believe it's a viable solution. My two-stage regex solution in function form as it looks now:
function codePresent($the_url) {
    $myCode = "~www\.mydomain\.com/mypage\.php\?cid~i";
    $the_markup = file_get_contents($the_url);
    preg_match("~<body(.|\s)*?>((.|\s)*?)</body>~i", $the_markup, $body_content);
    $visible_content = preg_replace("~<!--(.|\s)*?-->~", "", $body_content[2]);
    $myCodeCheck = preg_match($myCode, $visible_content);
    return $myCodeCheck;
}
$body_content[2] holds everything between body tags.
$visible_content is the body content w/ comments removed. These work like a charm.
A query is run returning the URLs of members and in a do/while loop, each URL is sent to the function above.
Here's the problem now: If I run this as is, I get an HTTP 500 - Internal server error. If I only submit a single URL, it behaves as it should. If I eliminate the step with the body tags and pass the markup straight to the preg_replace line, it iterates through all the URLs as it should.
So, I'm sorry to say I no longer have a regex problem. Now I have an unexplainable PHP problem. If I had an extra computer, I'd have thrown this one hours ago.
Thanks for listening,
Don
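One possible explanation, which the thread does not confirm, is that a repeated group such as (.|\s)*? makes the regex engine recurse once per character, so a large page could exhaust the stack in the body-matching step. A sketch that sidesteps the question by isolating the body with plain string functions is below. The name bodyMarkup is hypothetical, and it assumes the first "<body" in the page is the real tag and that its attributes contain no ">".

```php
<?php
// Hypothetical regex-free replacement for the body-matching step.
function bodyMarkup($markup)
{
    // Find the opening tag, case-insensitively.
    $open = stripos($markup, '<body');
    if ($open === false) {
        return '';
    }
    // Skip to the end of the opening tag.
    $start = strpos($markup, '>', $open);
    if ($start === false) {
        return '';
    }
    $start++;
    // Stop at the closing tag, or at end of input if it is missing.
    $end = stripos($markup, '</body', $start);
    if ($end === false) {
        $end = strlen($markup);
    }
    return substr($markup, $start, $end - $start);
}
```

The result could be passed to the existing preg_replace line in place of $body_content[2].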
|
|
|
02-24-2009, 08:34 PM
|
Postid: 172995
|
|
Site Owner
Join Date: Feb 2005
Location: Connecticut
Posts: 717
|
Re: Is Regex the Answer?
I had assumed that you were checking to see if your URL was inside an anchor tag that has not been commented out.
Have you checked the PHP error log in CNC?
You can also turn on error reporting in the .htaccess file to display errors in the browser while you're debugging.
Tom
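If the .htaccess route is not available, the same settings can usually be switched on from the top of the script itself. This is a sketch that assumes the host permits ini_set(); it only surfaces errors PHP raises itself, not a crash of the PHP process.

```php
<?php
// Report every class of error while debugging.
error_reporting(E_ALL);
// Print errors in the response instead of only logging them.
ini_set('display_errors', '1');
```

Both lines should be removed or reversed once the crawler is working, so that error details are not shown to visitors.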
|
|
|
02-24-2009, 09:49 PM
|
Postid: 172996
|
|
Registered User
Join Date: Feb 2009
Posts: 9
|
Re: Is Regex the Answer?
Found the error in the log...
Premature end of script headers: php5.cgi
I had no luck getting additional error information w/ .htaccess.
Don
|
|
|
02-25-2009, 11:58 AM
|
Postid: 173004
|
|
Registered User
Join Date: Feb 2009
Posts: 9
|
Re: Is Regex the Answer?
Additional discoveries...
When the error occurs, a 6.27Mb core dump file shows up in the same directory. Also, the first call to the function is successful as a subsequent insert query within the do/while loop posts the crawler results to a table. My best guess is the problem lies with the 2nd visit to the function and the $body_content array created by preg_match. I've tried unsetting this variable prior to the function closing but it makes no difference. I'm thinking it may be something on Dreamhost's end but I can't be certain.
Don
|
|
|
02-25-2009, 03:52 PM
|
Postid: 173005
|
|
CTO FutureQuest, Inc.
Join Date: Jun 1998
Location: Z'ha'dum
Posts: 7,678
|
Re: Is Regex the Answer?
Quote:
|
When the error occurs, a 6.27Mb core dump file shows up in the same directory.
|
-AND-
(at first, I thought this was on our servers, and it sent chills down my spine until I read)
Quote:
|
I'm thinking it may be something on Dreamhost's end
|
Wow, they are allowing Apache/PHP core dump files to be dropped into a client's web space?
Scary!
If you are sharing that Apache/PHP engine with others and it cores, then run it through 'strings' to see all sorts of interesting stuff from your neighbors like passwords, credit cards, and such...
I put in a lot of effort to modify our servers and Linux kernel to prevent that from happening... I think that's one of the reasons that PHPPete likes us so much...
Hi PHPPete... 
All times are GMT -4.