FutureQuest Community (http://www.aota.net/forums/index.php)
-   PHP, Perl, Python and/or MySQL (http://www.aota.net/forums/forumdisplay.php?f=15)
-   -   Regular expression to remove HTML-style comments (http://www.aota.net/forums/showthread.php?t=12016)

Matt 08-17-2002 02:51 AM

Regular expression to remove HTML-style comments
 
I would like to use PHP's output buffering capability to remove HTML-style comments from web pages as they are served to the client. This would be useful for removing comments inserted by HTML editing software such as Dreamweaver, for example. I have output buffering working, but the regular expression that I'm using is causing me some headache:
return (ereg_replace("<!--.*-->", "", $buffer));

This works really well... a little TOO well. If I have a string like this:
<!--Comment #1-->Text<!--Comment#2-->
The whole string is replaced, including Text. What I would like is for Text to be left in place, with only the comments on either side removed.

I'd appreciate any assistance with accomplishing this.
Thanks, Matt
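
Note: a minimal sketch of the behaviour described above. ereg_replace() was removed in PHP 7, so preg_replace() stands in for it here; that substitution is an assumption made for illustration and is not the code from the post. The greedy `.*` behaves the same way in both.

```php
<?php
// The sample string from the post.
$input = '<!--Comment #1-->Text<!--Comment#2-->';

// Greedy: .* runs on to the LAST "-->" in the string,
// so everything between the first "<!--" and the final "-->" is matched.
$greedy = preg_replace('/<!--.*-->/', '', $input);

var_dump($greedy); // string(0) ""
```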

Terra 08-17-2002 03:50 AM

This is because the PHP ereg family of routines are 'greedy'... :(

There is no easy way to get around it short of using a negated character class, like so (note that this only works while the comment text itself contains no hyphen):
(ereg_replace("<!--[^-]+-->", "", $buffer))

Your best bet is to use the much more civilized PCRE library which does support non-greedy qualifiers...
/<!--.*?-->/

--
Terra
--ereg and friends need therapy--
FutureQuest
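
Note: a small sketch comparing the two approaches from this post. Both patterns are written in PCRE syntax, since the ereg functions no longer exist in current PHP; the negated class is a PCRE rendering of the idea, not the original ereg call, and the hyphenated test string is made up for illustration.

```php
<?php
$input = '<!--Comment #1-->Text<!--Comment#2-->';

// Lazy quantifier: .*? stops at the first "-->" after each "<!--".
$lazy = preg_replace('/<!--.*?-->/', '', $input);

// Negated class: fine as long as the comment body has no hyphen.
$negated = preg_replace('/<!--[^-]+-->/', '', $input);

// A hyphen inside the comment body defeats the negated class.
$hyphenated = preg_replace('/<!--[^-]+-->/', '', 'A<!-- well-known -->B');

var_dump($lazy);       // string(4) "Text"
var_dump($negated);    // string(4) "Text"
var_dump($hyphenated); // input comes back unchanged
```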

PaulKroll 08-17-2002 04:52 AM

One of the problems with PHP is that it's too large, so it's easy to miss built-in functions, like, say, strip_tags() which is specifically meant to strip HTML tags. Haven't tried it on something with comment tags, but I'd certainly hope they'd be wiped out with all the rest, and there's a good chance this function is faster than a general regex. (It's also impossible to strip some valid HTML tags using a regex, but usually it's not a big problem).

Matt 08-17-2002 05:59 AM

Thanks for the quick feedback :)

Terra, I'll take a look at the PCRE library.

Paul, that is an interesting idea, but in this case I want to keep most of the HTML code, just not the comment tags. It looks like with strip_tags() I would have to list every tag that I didn't want stripped (which would be everything except comments). If that is correct, then I'm not sure strip_tags() would be faster, given the long list of valid tags that would have to be checked. Were you thinking of something a little different?

-Matt
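
Note: a short sketch of what strip_tags() does with comments, for anyone weighing the same trade-off. The sample markup is made up for illustration. Comments are always removed, but so is every tag that is not named in the allow-list, which matches the concern above.

```php
<?php
$html = '<p>Keep <b>this</b><!-- editor note --></p>';

// No allow-list: every tag goes, and so does the comment.
$plain = strip_tags($html);

// With an allow-list, the listed tags survive; comments are still removed.
$kept = strip_tags($html, '<p><b>');

var_dump($plain); // string(9) "Keep this"
var_dump($kept);  // the markup with only the comment gone
```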

hobbes 08-17-2002 07:43 AM

Something else to watch out for, if you don't have control of the HTML doc, are nested comment tags. For example:

<!-- Begin Cmt <!-- Nested Cmt --> Hi There -->

Using a typical regex looking for the open tag and close tag will result in:

Hi There -->
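
Note: the nested case can be reproduced with a non-greedy PCRE pattern. This is only a sketch; the /s modifier is included so the pattern would also cope with line breaks, which is an assumption beyond what this post shows.

```php
<?php
$nested = '<!-- Begin Cmt <!-- Nested Cmt --> Hi There -->';

// The match starts at the first "<!--" and ends at the first "-->",
// so the tail of the outer comment is left behind.
$result = preg_replace('/<!--.*?-->/s', '', $nested);

var_dump($result); // string(13) " Hi There -->"
```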

kitchin 08-17-2002 07:45 AM

Quote:

Originally posted by Terra:

/<!--.*?-->/

Strangely enough, you will have to use a bit more than that. Since "." does not match line breaks, use:

/<!--(.|\s)*?-->/

instead, as in:

$x= preg_replace('/<!--(.|\s)*?-->/', '', $x);

Because <!-- comments
can be several lines
long -->.
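
Note: a minimal sketch of the difference, using the three-line comment above as test data.

```php
<?php
$multiline = "Because <!-- comments\ncan be several lines\nlong -->.";

// Without a modifier, "." stops at "\n", so the pattern cannot span lines.
$dotOnly = preg_replace('/<!--.*?-->/', '', $multiline);

// The alternation: "\s" covers the newline that "." skips.
$alternation = preg_replace('/<!--(.|\s)*?-->/', '', $multiline);

var_dump($dotOnly === $multiline); // bool(true): nothing was removed
var_dump($alternation);            // string(9) "Because ."
```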

kitchin 08-17-2002 07:54 AM

Quote:

Originally posted by hobbes:
Something else to watch out for, if you don't have control of the HTML doc, are nested comment tags. For example:

<!-- Begin Cmt <!-- Nested Cmt --> Hi There -->

Using a typical regex looking for the open tag and close tag will result in:

Hi There -->

Netscape and IE parse nested comments that way anyway. So it would look yucky even before you applied the regex to it. I thought of the same thing and tested it. But there probably is some oddball case we're missing.

hobbes 08-17-2002 12:04 PM

Quote:

Strangely enough, you will have to use a bit more than that. Since "." does not match line breaks, use:
That's why you should use the /s modifier, which treats the string as a single line; the period '.' will then match newlines as well.

Look for PCRE_DOTALL in http://www.zend.com/manual/pcre.pattern.modifiers.php
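
Note: a small sketch of the /s (PCRE_DOTALL) variant, run against a multi-line comment. The test string is illustrative only.

```php
<?php
$multiline = "Because <!-- comments\ncan be several lines\nlong -->.";

// With /s, "." also matches "\n", so no extra alternation group is needed.
$dotall = preg_replace('/<!--.*?-->/s', '', $multiline);

var_dump($dotall); // string(9) "Because ."
```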

Matt 08-17-2002 02:34 PM

Thanks everyone, it now works perfectly (I think). For anyone interested in stripping comments from HTML pages, here is the code I'm using:
Code:

<?php
  function callback($buffer)
  {
        return preg_replace('/<!--(.|\s)*?-->/', '', $buffer);
  }

  ob_start("callback");
?>
... HTML source goes here ...
<?php ob_end_flush(); ?>
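
Note: a hedged variant of the callback above, folding in the /s suggestion from earlier in the thread. The function name strip_html_comments() is hypothetical, and the null check is an added precaution rather than part of the posted code: preg_replace() returns null when the regex engine fails, and the sketch then sends the page out unmodified.

```php
<?php
// Hypothetical helper name; the version posted above is called callback().
function strip_html_comments($buffer)
{
    // /s (PCRE_DOTALL) lets "." match newlines, so no (.|\s) group is needed.
    $stripped = preg_replace('/<!--.*?-->/s', '', $buffer);

    // preg_replace() returns null on failure; fall back to the untouched buffer.
    return $stripped === null ? $buffer : $stripped;
}

ob_start('strip_html_comments');
echo "<p>Hello</p><!-- generated by an editor -->\n";
ob_end_flush();
```

As with the original, this removes anything that looks like a comment, so pages that rely on comment syntax for other purposes (the "oddball cases" mentioned earlier) would need to be checked by hand.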


PaulKroll 08-17-2002 04:46 PM

Quote:

I want to keep most of the HTML code, just not the comment tags
One of the problems with being up at 4:00 in the morning is that phrases like "HTML-style comments" are easily misread... :ididthat:


All times are GMT -4.