Christian Neukirchen wrote:
> "Andreas S." <f / andreas-s.net> writes:
>
> > Daniel Baird wrote:
> >> On 7/23/06, Stefan Scholl <stesch / no-spoon.de> wrote:
> >>>
> >>> Max Benjamin <moore.joseph / gmail.com> wrote:
> >>> > Is there an easy way to strip html tags from strings?
> >>>
> >>> A regex isn't always the _best_ way to deal with markup
> >>> languages, but for an _easy_ way it's good enough.
> >>
> >>
> >> the problem is, it's not always the _correct_ way.
> >>
> >> <div id="weird>id"></div>
> >
> > This is not correct HTML; < and > have to be encoded as entities.
>
> It's valid XHTML:
>
> $ echo '<bar quux="foo>bar" />' | xmllint -
> <?xml version="1.0"?>
> <bar quux="foo&gt;bar"/>
>
> However, '<' needs to be escaped:
>
> $ echo '<bar quux="foo<bar" />' | xmllint -
> -:1: parser error : Unescaped '<' not allowed in attributes values
> <bar quux="foo<bar" />
>
> --
> Christian Neukirchen  <chneukirchen / gmail.com>  http://chneukirchen.org
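
To make Daniel's objection concrete, here is a small sketch of what the
obvious one-line pattern does with his example. The names NAIVE_TAG, html
and naive_result are mine, not from the thread, and the pattern is only my
guess at what an "easy" regex would look like:

```ruby
# Naive pattern (an assumption about the "easy" approach):
# everything from "<" up to the next ">".
NAIVE_TAG = /<[^>]*>/

html = '<div id="weird>id"></div>'

# The first match stops at the ">" inside the attribute value,
# so the rest of the attribute is left behind as text.
naive_result = html.gsub(NAIVE_TAG, '')
puts naive_result   # prints: id">
```

So the tag is cut in the middle of its attribute. The pattern below tries
to avoid that by treating double-quoted strings as a unit.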

re = %r{
    <
      (?:
        # Any run of characters other than > or " .
        [^>"] +
        |
        # A double-quoted string, which may contain > and
        # backslash-escaped quotes.
        "
          (?:
              # Accept any escaped character; /m lets the dot
              # match a newline too.
              \\.
              |
              [^"\\] +
          ) *
        "
      ) *
    >
}xm

print DATA.read.gsub( re, '' )

__END__
Some<><"">
<bar quux="\"foo>bar" /> text
to <?xml version="1.0"?>
<bar quux="foo&gt;bar"/> save
for <bar quux="foo<bar" />
<bar quux="\"foo><bar>" />reading.
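
Run as is, the script above should print the sample text with every tag
removed, including the tags whose attribute values contain > or an escaped
quote. For use outside a DATA-driven script, the same pattern can be
wrapped in a method. This is only a sketch: TAG_RE and strip_tags are names
I made up, and the atomic groups (?> ... ) are my addition, meant to keep
the nested quantifiers from backtracking heavily on an unterminated tag.
It still handles only double-quoted attributes, and it does not know about
comments, CDATA or <script> bodies.

```ruby
# Same idea as the pattern above. The (?> ... ) atomic groups are an
# addition of mine, not part of the original pattern.
TAG_RE = %r{
  <
  (?:
    (?> [^>"]+ )        # run of characters other than > or "
    |
    "                   # opening quote
    (?:
      \\.               # any backslash-escaped character
      |
      (?> [^"\\]+ )     # run of ordinary characters
    )*
    "                   # closing quote
  )*
  >
}xm

# Hypothetical helper: returns a copy of text with tags removed.
def strip_tags(text)
  text.gsub(TAG_RE, '')
end

puts strip_tags('for <bar quux="foo>bar" />reading.')   # prints: for reading.
```

Anything that never reaches a closing > (an opening < with no match, or an
unbalanced quote) is left in the output unchanged.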