sutch wrote:
> William James wrote:
> > dsutch / gmail.com wrote:
> > > I'm using HTML Tools 1.09 to parse HTML that contains tags that are to
> > > be processed by the web server. For example, here's an image tag:
> > >
> > > <img src="<$DCGallery$>Separators/gtabseps.gif" alt="this is a
> > > seperator">
> >
> > Is this valid html?
>
> Thank you for this information. I did a bit more research and now
> believe that this is not valid HTML. Read on...
>
> > From another thread:
> >
> > > $ echo '<bar quux="foo>bar" />' | xmllint -
> > > <?xml version="1.0"?>
> > > <bar quux="foo>bar"/>
> > >
> > > However, '<' needs to be escaped:
> > >
> > > $ echo '<bar quux="foo<bar" />' | xmllint -
> > > -:1: parser error : Unescaped '<' not allowed in attributes values
> > > <bar quux="foo<bar" />
>
> Unfortunately, escaping is not an option since the HTML files that are
> being parsed are being output from another closed system.
>
> The question is: can HTML Tools be told to ignore "<" and ">" inside of
> attribute values? Or is there another HTML parser for Ruby that would
> handle this?
>
> Alternatively, is there some method for finding these characters within
> attribute values and escaping them before parsing by Ruby and then
> un-escaping them after parsing (so that the server can perform the
> required processing of these PHP-like tags).

Perhaps this will work.

str = <<HERE
<html>
<!-- A comment can contain <, I think. -->
<img src="<$DCGallery$>Separators/gtabseps.gif" alt="this is a separator">
</html>
HERE

# We will split the html string into an array of strings.
# Each member of the array will be an html comment, an
# html tag, or plain text.
re = %r{
  (
    <!--.*?-->
    |
    <
      (?:
        [^<>"] +
        |
        " (?: \\. | [^\\"]+ ) * "
      ) *
    >
  )
}xm

str.split( re ).each { |x|
  # Only touch tags; leave comments and plain text alone.
  if "<" == x[0,1] && "<!" != x[0,2]
    # Since > is o.k., change only <. Skip the tag's own
    # opening < and closing > by working on x[1..-2].
    x[1..-2] = x[1..-2].gsub( /</, "&lt;" )
  end
  print x
}
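
For the "un-escape after parsing" half of the question, here is a rough sketch of the round trip, reusing the same regex with gsub instead of split. The names TAG_RE, escape_lt_in_tags and unescape_lt_in_tags are made up for this example and have nothing to do with HTML Tools. It assumes the parser hands back the tags with the &lt; still in place. The reverse step also cannot tell an &lt; it inserted from one that was already inside an attribute value, so treat it as a starting point only.

```ruby
# Matches an HTML comment, or a tag whose double-quoted attribute
# values may themselves contain < and >.
TAG_RE = %r{
  <!--.*?-->
  |
  < (?: [^<>"]+ | " (?: \\. | [^\\"]+ )* " )* >
}xm

# Escape every < inside a tag, except the tag's own opening <.
# Comments (and anything else starting with <!) are returned unchanged.
def escape_lt_in_tags(html)
  html.gsub(TAG_RE) do |tag|
    next tag if tag.start_with?("<!")
    "<" + tag[1..-2].gsub("<", "&lt;") + ">"
  end
end

# Reverse step: turn &lt; back into <, but only inside tags, so that
# escaped text between tags is left alone.
def unescape_lt_in_tags(html)
  html.gsub(TAG_RE) do |tag|
    next tag if tag.start_with?("<!")
    "<" + tag[1..-2].gsub("&lt;", "<") + ">"
  end
end

html     = %q{<p>1 &lt; 2</p><img src="<$DCGallery$>a.gif" alt="x">}
escaped  = escape_lt_in_tags(html)      # hand this to the parser
restored = unescape_lt_in_tags(escaped) # hand this back to the server
puts escaped
puts restored
```

The &lt; in the paragraph text survives both passes because the regex only ever matches whole tags, so the un-escaping never reaches it.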