Regular expression to remove HTML-style comments
I would like to use PHP's output buffering capability to remove HTML-style comments from web pages as they are served to the client. This would be useful for removing comments inserted by HTML editing software such as Dreamweaver, for example. I have output buffering working, but the regular expression that I'm using is causing me some headache:
return (ereg_replace("<!--.*-->", "", $buffer));

This works really well... a little TOO well. If I have a string like this:

<!--Comment #1-->Text<!--Comment#2-->

The whole string is replaced, including Text. What I would like to happen, is that Text is not replaced, just the comments on either side. I'd appreciate any assistance with accomplishing this.

Thanks,
Matt
This is because the PHP ereg family of routines are 'greedy'... :(
There is no easy way to get around it short of using a negated character class, like so:

(ereg_replace("<!--[^\-]+-->", "", $buffer))

Your best bet is to use the much more civilized PCRE library, which does support non-greedy quantifiers...

/<!--.*?-->/

-- Terra
--ereg and friends need therapy--
FutureQuest
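A minimal sketch of the difference described above, run against the sample string from the question. It uses preg_replace() for all three patterns, since the ereg functions were removed in PHP 7; the variable names and the hyphenated test string are illustrative assumptions, not from the thread. It also shows the limit of the negated character class: it stops working as soon as a comment body contains a hyphen.

```php
<?php
// The sample string from the question.
$html = '<!--Comment #1-->Text<!--Comment#2-->';

// Greedy: .* runs to the LAST "-->", so "Text" is swallowed too.
$greedy = preg_replace('/<!--.*-->/', '', $html);

// Non-greedy: .*? stops at the FIRST "-->", so each comment goes separately.
$lazy = preg_replace('/<!--.*?-->/', '', $html);

// Negated class: fails to match when the comment body contains a hyphen.
// (Hypothetical input, chosen to show the limitation.)
$negated = preg_replace('/<!--[^-]+-->/', '', '<!--a-b-->Text');

var_dump($greedy);  // string(0) ""
var_dump($lazy);    // string(4) "Text"
var_dump($negated); // string(14) "<!--a-b-->Text"
```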
One of the problems with PHP is that it's too large, so it's easy to miss built-in functions like, say, strip_tags(), which is specifically meant to strip HTML tags. Haven't tried it on something with comment tags, but I'd certainly hope they'd be wiped out with all the rest, and there's a good chance this function is faster than a general regex. (It's also impossible to strip some valid HTML tags using a regex, but usually it's not a big problem).
Thanks for the quick feedback :)

Terra, I'll take a look at the PCRE library.

Paul, that is an interesting idea, but in this case, I want to keep most of the HTML code, just not the comment tags. It looks like with strip_tags() I would have to define all the tags that I didn't want stripped (which would be everything except comments). If this is correct, then I'm not sure that strip_tags() would be faster, due to the large array of valid tags that would have to be evaluated/passed over. Were you thinking of something a little different?

-Matt
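For reference, a small sketch of how strip_tags() behaves here, using a made-up sample string. It does remove comments, but it also removes every tag that is not named in its second argument, which matches Matt's concern: keeping the page intact would mean listing every tag the page uses.

```php
<?php
// Hypothetical sample markup with one comment in the middle.
$html = '<p>Test paragraph.</p><!-- Comment --> <a href="#fragment">Other text</a>';

// No allow-list: every tag is removed along with the comment.
$plain = strip_tags($html);

// Allow-list of <p> and <a>: those tags survive, the comment is still removed.
$kept = strip_tags($html, '<p><a>');

echo $plain, "\n"; // Test paragraph. Other text
echo $kept, "\n";  // <p>Test paragraph.</p> <a href="#fragment">Other text</a>
```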
Something else to watch out for, if you don't have control of the HTML doc, are nested comment tags. For example:
<!-- Begin Cmt <!-- Nested Cmt --> Hi There -->

Using a typical regex looking for the open tag and close tag will result in:

Hi There -->
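The nested case can be reproduced with the sketch below. The first pattern is the usual open/close match; the second is one possible workaround using PCRE's recursive (?R) construct, offered as an assumption rather than something proposed in the thread. Note that HTML itself does not nest comments (a browser also ends the comment at the first "-->"), so the recursive version only makes sense if the nested markers really are meant to be one comment.

```php
<?php
// The nested example from the post above.
$html = '<!-- Begin Cmt <!-- Nested Cmt --> Hi There -->';

// Open/close matching stops at the first "-->", leaving the tail behind.
$simple = preg_replace('/<!--.*?-->/s', '', $html);

// Recursive pattern: (?R) re-enters the whole pattern at a nested "<!--",
// so the balanced outer comment is removed as one unit.
$nested = preg_replace('/<!--(?:(?!<!--|-->).|(?R))*-->/s', '', $html);

var_dump($simple); // string(13) " Hi There -->"
var_dump($nested); // string(0) ""
```

On very large comments a pattern like this may run into PCRE's backtracking or recursion limits, in which case preg_replace() returns null, so the result is worth checking before use.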
Use /<!--(.|\s)*?-->/ instead, as in:

$x = preg_replace('/<!--(.|\s)*?-->/', '', $x);

Because <!-- comments can be
several lines long -->.
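A short sketch of why the alternation matters, using a made-up two-line comment. By default "." does not match a newline, so the plain non-greedy pattern leaves a multi-line comment untouched, while (.|\s) can step over the line break.

```php
<?php
// Hypothetical input: one comment spanning two lines.
$html = "Before<!-- line one\nline two -->After";

// Plain ".": cannot cross the newline, so nothing matches.
$dotOnly = preg_replace('/<!--.*?-->/', '', $html);

// (.|\s): each step consumes a non-newline character or any whitespace.
$alternation = preg_replace('/<!--(.|\s)*?-->/', '', $html);

var_dump($dotOnly === $html); // bool(true)
var_dump($alternation);       // string(11) "BeforeAfter"
```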
Look for PCRE_DOTALL in http://www.zend.com/manual/pcre.pattern.modifiers.php
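PCRE_DOTALL corresponds to the "s" pattern modifier, which makes "." match newlines and so removes the need for the (.|\s) alternation. Below is a minimal sketch of how that could be wired into output buffering as described in the original question. The function name strip_html_comments and the sample markup are assumptions for illustration, and the type declarations assume PHP 7 or later; this is not the code the original poster ended up with.

```php
<?php
// Output-buffer callback: receives the buffered page, returns what is sent.
// (Hypothetical function name.)
function strip_html_comments(string $buffer): string
{
    // "s" is PCRE_DOTALL: "." also matches newlines, so multi-line
    // comments are matched without the (.|\s) alternation.
    $result = preg_replace('/<!--.*?-->/s', '', $buffer);

    // preg_replace() returns null on error; fall back to the original page.
    return $result ?? $buffer;
}

// Everything echoed after ob_start() passes through the callback.
ob_start('strip_html_comments');
echo "<p>Text</p><!-- editor\ncomment -->\n<p>More</p>\n";
ob_end_flush(); // sends "<p>Text</p>\n<p>More</p>\n"
```

A pattern like this removes every comment it sees, including Internet Explorer conditional comments and the "<!--" wrappers sometimes placed inside script or style blocks, so it is worth checking that the pages do not rely on those.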
Thanks everyone, it now works perfectly (I think). For anyone interested in stripping comments from HTML pages, here is the code I'm using:
Code:
<?php