sutch wrote:
> William James wrote:
> > dsutch / gmail.com wrote:
> > > I'm using HTML Tools 1.09 to parse HTML that contains tags that are to
> > > be processed by the web server.  For example, here's an image tag:
> > >
> > > <img src="<$DCGallery$>Separators/gtabseps.gif" alt="this is a
> > > separator">
> >
> > Is this valid html?
>
> Thank you for this information.  I did a bit more research and now
> believe that this is not valid HTML.  Read on...
>
> >  From another thread:
> >
> > > $ echo '<bar quux="foo>bar" />' | xmllint -
> > > <?xml version="1.0"?>
> > > <bar quux="foo&gt;bar"/>
> > >
> > > However, '<' needs to be escaped:
> > >
> > > $ echo '<bar quux="foo<bar" />' | xmllint -
> > > -:1: parser error : Unescaped '<' not allowed in attributes values
> > > <bar quux="foo<bar" />
>
> Unfortunately, escaping is not an option since the HTML files that are
> being parsed are being output from another closed system.
>
> The question is: can HTML Tools be told to ignore "<" and ">" inside of
> attribute values?  Or is there another HTML parser for Ruby that would
> handle this?
>
> Alternatively, is there some way to find these characters within
> attribute values, escape them before Ruby parses the file, and then
> un-escape them after parsing (so that the server can still perform
> the required processing of these PHP-like tags)?

Perhaps this will work.

str = <<HERE
<html>
<!--
  A comment can contain <,
  I think.
-->
<img src="<$DCGallery$>Separators/gtabseps.gif"
alt="this is a separator">
</html>
HERE

# We will split the html string into an array of strings.
# Each member of the array will be an html comment, an
# html tag, or plain text.

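re = %r{ ( <!--.*?--> |                        # an html comment, or
           <  (?:                              # a tag: < followed by any mix of
              [^<>"] +                         #   unquoted text
              |
              "  (?: \\.  |  [^\\"]+  ) *  "   #   double-quoted values, which may hold < and >
              ) *
           >                                   # up to the closing >
         )  }xm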


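str.split( re ).each { |x|
  # Only pieces that start with < but not <! (comments, doctype) are touched.
  if "<" == x[0,1]  &&  "<!" != x[0,2]
    # A literal > is o.k. inside an attribute value, so change only <.
    # The slice 1..-2 leaves the tag's own < and > alone.
    x[1..-2] = x[1..-2].gsub( /</, "&lt;" )
  end

  print x
}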
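
To cover the second half of the question (un-escaping after the parse),
here is a rough sketch of the round trip.  It reuses the regular
expression above but swaps split for String#gsub with a block, so only
real matches are rewritten.  The names TAG_RE, escape_inner_lt and
unescape_inner_lt are made up for this example, and the parsing step in
the middle is left out.  One caveat: an &lt; that was already present
inside a tag in the original file would also be turned back into <, so a
more distinctive placeholder may be safer.

```ruby
# Same pattern as above: an html comment, or a tag whose double-quoted
# attribute values may contain < and >.
TAG_RE = %r{ ( <!--.*?--> |
               < (?: [^<>"]+ | " (?: \\. | [^\\"]+ )* " )* >
             ) }xm

# Escape every < found inside a tag, leaving the tag's own angle
# brackets, comments, and anything else starting with <! alone.
def escape_inner_lt(html)
  html.gsub(TAG_RE) do |tag|
    if tag.start_with?("<!")
      tag
    else
      "<" + tag[1..-2].gsub("<", "&lt;") + ">"
    end
  end
end

# Undo the escaping, again only inside tags.  Assumption: no tag in the
# original input contained a literal &lt; of its own.
def unescape_inner_lt(html)
  html.gsub(TAG_RE) do |tag|
    tag.start_with?("<!") ? tag : tag.gsub("&lt;", "<")
  end
end

original = %q{<img src="<$DCGallery$>Separators/gtabseps.gif" alt="this is a separator">}

escaped = escape_inner_lt(original)
puts escaped
# prints: <img src="&lt;$DCGallery$>Separators/gtabseps.gif" alt="this is a separator">

# ... hand `escaped` to the HTML parser here ...

puts unescape_inner_lt(escaped) == original
# prints: true
```

The un-escape step has to run on the parser's serialized output, so it
only works if the parser writes the attribute values back unchanged.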