Christian Neukirchen wrote:
> "Andreas S." <f / andreas-s.net> writes:
>
> > Daniel Baird wrote:
> >> On 7/23/06, Stefan Scholl <stesch / no-spoon.de> wrote:
> >>>
> >>> Max Benjamin <moore.joseph / gmail.com> wrote:
> >>> > Is there an easy way to strip html tags from strings?
> >>>
> >>> A regex isn't always the _best_ way to deal with markup
> >>> languages, but for an _easy_ way it's good enough.
> >>
> >>
> >> the problem is, it's not always the _correct_ way.
> >>
> >> <div id="weird>id"></div>
> >
> > This is no correct HTML, < and > have to be encoded as entities.
>
> It's valid XHTML:
>
> $ echo '<bar quux="foo>bar" />' | xmllint -
> <?xml version="1.0"?>
> <bar quux="foo>bar"/>
>
> However, '<' needs to be escaped:
>
> $ echo '<bar quux="foo<bar" />' | xmllint -
> -:1: parser error : Unescaped '<' not allowed in attributes values
> <bar quux="foo<bar" />
>
> --
> Christian Neukirchen <chneukirchen / gmail.com> http://chneukirchen.org

re = %r{
  <
  (?:
    # Any characters but > or " .
    [^>"] +
    |
    # Characters within quotes.
    # Allow escaped quotes.
    "
    (?:
      # Accept any escaped character.
      \\.
      |
      [^"\\] +
    ) *
    "
  ) *
  >
}xm

print DATA.read.gsub( re, '' )

__END__
Some<><""> <bar quux="\"foo>bar" />
text to <?xml version="1.0"?>
<bar quux="foo>bar"/>
save for <bar quux="foo<bar" />
<bar quux="\"foo><bar>" />reading.
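
For trying the pattern on individual strings rather than on DATA, here is a minimal sketch. The names TAG_RE and strip_tags are made up for illustration and are not part of the original script; the pattern is meant to be the same one as above, only written more compactly.

```ruby
# Hypothetical helper names (TAG_RE, strip_tags), used here only for illustration.
TAG_RE = %r{
  <
  (?:
    # a run of characters other than > or "
    [^>"]+
    |
    # a double-quoted value, skipping backslash-escaped characters
    " (?: \\. | [^"\\]+ )* "
  )*
  >
}xm

# Remove everything the pattern considers a tag.
def strip_tags(text)
  text.gsub(TAG_RE, '')
end

puts strip_tags('<div id="weird>id"></div>text')   # prints: text
puts strip_tags('a <b>bold</b> c')                 # prints: a bold c
```

This is still only the "easy" way, not the correct one. The pattern knows nothing about single-quoted attribute values, so <a title='x>y'> is cut at the first >, and a bare comparison such as "1 < 2 and 3 > 2" is treated as a tag. The backslash-escape handling is a choice made by the pattern, not something HTML or XML defines.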