Is Regex the Answer?


donjuevo
02-23-2009, 11:15 AM
I promise to make this as brief as possible...

I'm building a site that requires members to place a bit of my code on their websites, and that code must be visible. I built a crawler that uses preg_match while scanning member sites to check that the code is present. It works well for the most part. The problem is that the crawler reports the code as present even if it's been commented out. I found a great introduction to regex here: http://www.phpro.org/tutorials/Introduction-to-PHP-Regex.html and it leads me to believe regex could be the solution. I just can't seem to see my way through it.

Is there a way to build a regex that will...
1. Only look between the opening and closing body tags
2. Not look inside comments

Or am I barking up the wrong PHP function?

Any help or advice would be greatly appreciated!

Thanks,
Don

Kevin
02-23-2009, 11:26 AM
Regular Expressions are very powerful and certainly worth learning. However, without seeing the code it would be impossible to tell if they will help you.

Also, whatever you use I suspect it will be difficult to determine if the code is encapsulated within a comment. You may end up having to use an interpreter/parser for the language in question to determine if the code is there and valid.
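
A parser-based check along the lines suggested above might look like the sketch below. It assumes PHP 7 or later with the DOM extension available; the function name linkIsVisible and the sample markup are hypothetical, and the URL is the placeholder used later in this thread. Because a parser turns commented-out markup into comment nodes rather than elements, a search over element attributes never sees it.

```php
<?php
// Hypothetical helper: returns true if $needle appears in a src or href
// attribute of a real element inside <body>. Commented-out markup is parsed
// as comment nodes, not elements, so it is never matched.
function linkIsVisible(string $html, string $needle): bool
{
    $previous = libxml_use_internal_errors(true); // collect warnings from sloppy HTML quietly
    $doc = new DOMDocument();
    $loaded = $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$loaded) {
        return false;
    }

    $xpath = new DOMXPath($doc);
    // Every src or href attribute on any element below <body>.
    foreach ($xpath->query('//body//@src | //body//@href') as $attr) {
        if (stripos($attr->value, $needle) !== false) {
            return true;
        }
    }
    return false;
}

$needle = 'www.mydomain.com/mypage.php?cid';
$live   = '<html><body><iframe src="http://www.mydomain.com/mypage.php?cid=7"></iframe></body></html>';
$hidden = '<html><body><!-- <iframe src="http://www.mydomain.com/mypage.php?cid=7"></iframe> --></body></html>';

var_dump(linkIsVisible($live, $needle));   // bool(true)
var_dump(linkIsVisible($hidden, $needle)); // bool(false)
```

This trades the regex for a dependency on the DOM extension, so it only applies where that extension is installed.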

donjuevo
02-23-2009, 11:50 AM
Kevin,

Thanks for the quick response!

I'm using file_get_contents to grab the contents of their index page then scanning for my bit within it...

preg_match('/www\.mydomain\.com\/mypage\.php\?cid/i', $this->markup)

Because I'm such a firm believer in the phrase 'Work Smarter Not Harder', I'd much rather automate this instead of regularly visiting these sites.

Can this be done in stages? i.e.:
Stage 1: return only what's between body tags, then
Stage 2: remove everything between comment tags then
Stage 3: scan what's left for my bit?

Thanks again,

Don

donjuevo
02-24-2009, 02:14 PM
This is where my two-stage filtering stands now...

preg_match("/<body(.|\s)*?>(.|\s)*?<\/body>/i", $the_markup, $filter1);
$filter2 = preg_replace("/<!--(.|\s)*?-->/","",$filter1[0]);

Something in the first line is causing an internal server error. Anyone have any thoughts???

Again, any help at all would be appreciated!!!

Thanks,
Don
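
One way to write the same two stages without the (.|\s) alternation is the s (PCRE_DOTALL) modifier, which lets . match newlines. Whether the alternation is what triggers the server error is not established anywhere in this thread; treating the simpler pattern as easier on the regex engine is only an assumption. The sample markup below is made up.

```php
<?php
// Sample markup (hypothetical) standing in for a fetched member page.
$the_markup = "<html><head><title>t</title></head>\n"
    . "<body class=\"home\">\n<!-- hidden\nnote -->\n<p>visible</p>\n</body></html>";

// Stage 1: capture what sits between the body tags.
// [^>]* skips the attributes of <body>; the s modifier lets . span newlines.
if (preg_match('~<body[^>]*>(.*?)</body>~is', $the_markup, $filter1) === 1) {
    // Stage 2: drop HTML comments from the captured body.
    $filter2 = preg_replace('~<!--.*?-->~s', '', $filter1[1]);
    echo trim($filter2), "\n"; // <p>visible</p>
}
```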

Tom E.
02-24-2009, 07:15 PM
Hi Don,

The strip_tags() (http://us2.php.net/manual/en/function.strip-tags.php) function will do it all for you.

This will strip all comments and tags from the first parameter, except for the tags in the second parameter.

$whats_left = strip_tags($the_markup, '<a>');
Then check $whats_left for the URL you're looking for.

-- Tom
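
As a rough illustration of the strip_tags() suggestion, the sketch below runs it over a made-up page containing one commented-out link and one live link. The markup and the cid values are hypothetical; the URL is the placeholder from earlier in the thread.

```php
<?php
// Hypothetical page: one commented-out link and one live link.
$the_markup = '<body><!-- <a href="http://www.mydomain.com/mypage.php?cid=1">old</a> -->'
    . '<p>Hello <a href="http://www.mydomain.com/mypage.php?cid=2">partner</a></p></body>';

// Keep <a> tags; every other tag and all HTML comments are stripped.
$whats_left = strip_tags($the_markup, '<a>');

// The commented-out link is gone, the live one survives.
var_dump(strpos($whats_left, 'cid=1') === false); // bool(true)

// Case-insensitive search for the URL in what remains.
var_dump(stripos($whats_left, 'www.mydomain.com/mypage.php?cid') !== false); // bool(true)
```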

donjuevo
02-24-2009, 08:12 PM
Tom,

Thanks for your reply. I've just read up on strip_tags(). It is a nice side effect that all comments are automatically removed, but beyond that, I don't believe it's a viable solution. My two-stage regex solution in function form as it looks now:

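function codePresent($the_url){
    // Pattern for the code members must display (case-insensitive).
    $myCode = "~www\.mydomain\.com/mypage\.php\?cid~i";
    // Fetch the member's page.
    $the_markup = file_get_contents($the_url);
    // Stage 1: capture everything between the body tags into $body_content[2].
    preg_match("~<body(.|\s)*?>((.|\s)*?)</body>~i", $the_markup, $body_content);
    // Stage 2: remove HTML comments from the body content.
    $visible_content = preg_replace("~<!--(.|\s)*?-->~","",$body_content[2]);
    // Stage 3: 1 if the code is present in what's left, 0 if not.
    $myCodeCheck = preg_match($myCode, $visible_content);
    return $myCodeCheck;
}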
$body_content[2] holds everything between body tags.
$visible_content is the body content w/ comments removed. These work like a charm.
A query returns the members' URLs and, in a do/while loop, each URL is passed to the function above.

Here's the problem now: If I run this as is, I get an HTTP 500 - Internal server error. If I only submit a single URL, it behaves as it should. If I eliminate the step with the body tags and pass the markup straight to the preg_replace line, it iterates through all the URLs as it should.

So, I'm sorry to say I no longer have a regex problem. Now I have an unexplainable PHP problem. If I had an extra computer, I'd have thrown this one hours ago.

Thanks for listening,

Don

Tom E.
02-24-2009, 08:34 PM
I had assumed that you were checking to see if your URL was inside an anchor tag that has not been commented out.

Have you checked the PHP error log in CNC?

You can also turn on error reporting in the .htaccess file to display errors in the browser while you're debugging.

Tom
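
For reference, error display can also be switched on from the top of the script itself rather than from .htaccess. This is a minimal sketch that assumes the host allows ini_set() to change display_errors. It will not show anything if the PHP process dies before it can report an error.

```php
<?php
// Debug-only settings: report every error level and print errors to the browser.
error_reporting(E_ALL);
ini_set('display_errors', '1');
```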

donjuevo
02-24-2009, 09:49 PM
Found the error in the log...

Premature end of script headers: php5.cgi

I had no luck getting additional error information w/ .htaccess.

Don

donjuevo
02-25-2009, 11:58 AM
Additional discoveries...

When the error occurs, a 6.27Mb core dump file shows up in the same directory. Also, the first call to the function is successful as a subsequent insert query within the do/while loop posts the crawler results to a table. My best guess is the problem lies with the 2nd visit to the function and the $body_content array created by preg_match. I've tried unsetting this variable prior to the function closing but it makes no difference. I'm thinking it may be something on Dreamhost's end but I can't be certain.

Don

Terra
02-25-2009, 03:52 PM
"When the error occurs, a 6.27Mb core dump file shows up in the same directory."

-AND-

(at first, I thought this was on our servers, and it sent chills down my spine until I read)
"I'm thinking it may be something on Dreamhost's end"

Wow, they are allowing Apache/PHP core dump files to be dropped into a client's web space?

Scary!

If you are sharing that Apache/PHP engine with others and it cores, then run it through 'strings' to see all sorts of interesting stuff from your neighbors like passwords, credit cards, and such...

:shocked:

I put in a lot of effort to modify our servers and Linux kernel to prevent that from happening... I think that's one of the reasons that PHPPete likes us so much... ;)

Hi PHPPete... :ythiya:

donjuevo
02-26-2009, 05:56 PM
The Final Chapter...

As the array in the preg_match statement seemed to be the cause of all my explosions, I opted for a different route...

preg_replace to remove comments
preg_replace to remove the top of the document down to the opening body tag
preg_replace to remove the bottom of the document up to the closing body tag

No arrays no problems.

I still don't know why it kept exploding, but after 4 days of anguish, I can't seem to care anymore.

Thank you all for your advice along the way and apologies for not realizing this forum belonged to a hosting company.

Thanks again,

Don

Tom E.
02-26-2009, 07:23 PM
Hi Don,

I took a closer look at your code and noticed a few things:

preg_match("~<body(.|\s)*?>((.|\s)*?)</body>~i", $the_markup, $body_content);
$visible_content = preg_replace("~<!--(.|\s)*?-->~","",$body_content[2]);

1. Both your regular expressions have "*?" together, which doesn't make sense.
2. The return value of preg_match() isn't checked, so preg_replace() gets called even if the match fails and $body_content[2] is undefined.


Also, I'm curious where the URL you're searching for could be that would make strip_tags() inappropriate to use. :dunno:

-- Tom
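
The return-value check described above could be sketched as follows. The function name visibleBody is hypothetical and the nullable return type assumes PHP 7.1 or later; the two patterns are the ones from the posted code, unchanged.

```php
<?php
// Hypothetical helper: returns the comment-free body, or null when the page
// has no recognizable <body>...</body> pair or PCRE reports an error.
function visibleBody(string $the_markup): ?string
{
    $found = preg_match('~<body(.|\s)*?>((.|\s)*?)</body>~i', $the_markup, $body_content);
    if ($found !== 1) {
        // 0 means no match; false means PCRE failed (see preg_last_error()).
        return null;
    }
    // Only reached when $body_content[2] is known to exist.
    return preg_replace('~<!--(.|\s)*?-->~', '', $body_content[2]);
}

var_dump(visibleBody('<p>no body tags here</p>'));         // NULL
var_dump(visibleBody('<body><!-- x --><p>hi</p></body>')); // string(9) "<p>hi</p>"
```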

donjuevo
02-26-2009, 07:37 PM
Tom,

You make a good point about the return value of preg_match not being checked. Unfortunately, that wasn't the cause of the error. The test URLs I was passing to the function are all mine and all have content.

I understood the question mark to make the grouping just prior to it ungreedy. My concern, at least for comments, was that if there were multiple comments throughout the page a greedy search may start at the opening tag in the first comment and end with the closing tag of the last comment.

As for strip_tags()... my bit of code sits in an iframe and I hadn't decided at that point whether or not I needed to include the opening iframe tag in my search or not. That and this is my first foray into regex and, just like a kid with a new toy, I really wanted to make it work.

Just out of curiosity, do you know whether or not strip_tags() also removes javascript comments?

Thanks again,

Don

Tom E.
02-26-2009, 09:00 PM
I stand corrected :shocked:

I was about to reply that you use the 'U' modifier to make the pattern ungreedy, but I looked it up first (because I can never remember if it's an upper or lowercase U) and found this at http://us3.php.net/manual/en/reference.pcre.pattern.modifiers.php
U (PCRE_UNGREEDY)
This modifier inverts the "greediness" of the quantifiers so that they are not greedy by default, but become greedy if followed by "?". It is not compatible with Perl. It can also be set by a (?U) modifier setting within the pattern or by a question mark behind a quantifier (e.g. .*?).
I never realized you could do it on a case by case basis with "*?".
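
The difference between the greedy default, the per-quantifier "*?" form, and the U modifier can be seen on a small made-up string with two comments:

```php
<?php
$html = '<!-- a -->keep<!-- b -->';

// Greedy: .* runs from the first <!-- to the last -->, swallowing "keep".
var_dump(preg_replace('~<!--.*-->~', '', $html));  // string(0) ""

// Lazy: .*? stops at the nearest -->, so each comment is removed separately.
var_dump(preg_replace('~<!--.*?-->~', '', $html)); // string(4) "keep"

// U modifier: quantifiers are lazy by default, so plain .* behaves like .*? here.
var_dump(preg_replace('~<!--.*-->~U', '', $html)); // string(4) "keep"
```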

"Just out of curiosity, do you know whether or not strip_tags() also removes javascript comments?"

It doesn't. But the entire script might be within HTML comments:

<script language="JavaScript" type="text/JavaScript">
<!--
// javascript code
-->
</script>
Then all javascript code would be removed when you strip the HTML comments.

Tom
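
A small sketch of the two cases, using made-up script blocks: strip_tags() removes the script tags in both, but the code text only disappears when it sits inside an HTML comment.

```php
<?php
// Script body wrapped in an HTML comment: stripped along with the comment.
$wrapped = "<script type=\"text/javascript\">\n<!--\n// javascript code\n-->\n</script>";
var_dump(strpos(strip_tags($wrapped), 'javascript code')); // bool(false)

// Script body without the wrapper: the tags go, the code text stays.
$bare = "<script type=\"text/javascript\">\n// javascript code\n</script>";
var_dump(strpos(strip_tags($bare), 'javascript code') !== false); // bool(true)
```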

donjuevo
02-26-2009, 09:18 PM
After my last post, I started thinking more about JavaScript comments and what could be hidden there maliciously, i.e. body tags, etc. So this has become my final solution for arriving at the code that's visible on any given web page...

$the_markup = file_get_contents($the_url);
$the_markup = preg_replace("~<script(.|\s)*?</script>~i","",$the_markup); # remove scripts
$the_markup = preg_replace("~<!--(.|\s)*?-->~","",$the_markup); # remove comments
$the_markup = preg_replace("~^(.|\s)*?<body(.|\s)*?>~i","",$the_markup); # remove top
$the_markup = preg_replace("~</body(.|\s)*?>(.|\s)*?\z~i","",$the_markup); # remove bottom

Don't know if it's bulletproof or not. I try to write clean HTML and that's all it's been tested against so far. I'm just still totally tickled that my core dump explosions are behind me now.

Thanks again for everything,

Don
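
To try the four replacements against a few hiding places, they can be wrapped in a function and followed by the URL check from earlier in the thread. This is only a sketch: the function name codeIsVisible and the three sample pages are hypothetical, and the patterns are the ones posted above, unchanged apart from single quotes.

```php
<?php
// Hypothetical wrapper: the four replacements, then the URL check.
function codeIsVisible(string $the_markup): bool
{
    $the_markup = preg_replace('~<script(.|\s)*?</script>~i', '', $the_markup);  // remove scripts
    $the_markup = preg_replace('~<!--(.|\s)*?-->~', '', $the_markup);            // remove comments
    $the_markup = preg_replace('~^(.|\s)*?<body(.|\s)*?>~i', '', $the_markup);   // remove top
    $the_markup = preg_replace('~</body(.|\s)*?>(.|\s)*?\z~i', '', $the_markup); // remove bottom
    return preg_match('~www\.mydomain\.com/mypage\.php\?cid~i', $the_markup) === 1;
}

// Made-up pages: a live iframe, a commented-out iframe, and the URL inside a script.
$live      = "<html><head></head><body>\n<iframe src=\"http://www.mydomain.com/mypage.php?cid=5\"></iframe>\n</body></html>";
$commented = "<html><body>\n<!-- <iframe src=\"http://www.mydomain.com/mypage.php?cid=5\"></iframe> -->\n</body></html>";
$scripted  = "<html><body>\n<script>// http://www.mydomain.com/mypage.php?cid=5\n</script>\n</body></html>";

var_dump(codeIsVisible($live));      // bool(true)
var_dump(codeIsVisible($commented)); // bool(false)
var_dump(codeIsVisible($scripted));  // bool(false)
```

These three pages are small and well formed, so they say nothing about how the patterns behave on large or malformed markup.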