Bill Kelly wrote:

> From: "Paul Lutus" <nospam / nosite.zzz>
>>
>> def parse_html(data,tag)
>>   return data.scan(%r{<#{tag}\s*.*?>(.*?)</#{tag}>}im).flatten
>> end
> 
> Are `<' and `>' characters legal inside quoted attribute values?

I don't think so. I think they have to be escaped, like most things in HTML
syntax.

> 
> E.g.  <img alt="a>b" src="inequality.gif">
> 
> Also, is the closing tag allowed to have whitespace between the
> tag name and the ending bracket?
> 
> E.g.  </body >

Not syntactically correct, but the question might be "will it happen?" In
which case the answer is "probably".


> The latter would be trivial to accommodate with a \s* obviously;

Yep.
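
For illustration, a minimal sketch of that change, assuming the parse_html
method quoted at the top of this message. The name parse_html_lenient is made
up here, and the only difference is a \s* before the final ">" of the closing
tag:

```ruby
# Variant of the quoted parse_html; the only change is \s* before the
# closing ">" so that an end tag written as "</body >" also matches.
def parse_html_lenient(data, tag)
  data.scan(%r{<#{tag}\s*.*?>(.*?)</#{tag}\s*>}im).flatten
end

p parse_html_lenient("<body>hello</body >", "body")   # prints ["hello"]
```

This does nothing about ">" inside quoted attribute values; it only tolerates
whitespace in the end tag.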

> but the former would be a shade trickier (though certainly still
> possible with a regexp.)

I don't think that one needs to be addressed. Besides being strange, it isn't
syntactically correct. When I create relatively free-form attributes like the
content of "title," I always escape the HTML tag delimiters. I am reasonably
sure it is a requirement.

If we allowed bare "<" and ">" between quotes in attributes, we would have
to scan the tags character by character to be sure to have a valid parse.
In nearly all cases involving delimiters like quotes and any relaxed,
permissive syntax, you end up scanning with a state machine.
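
A minimal sketch of such a scanner, assuming a modern Ruby (1.9 or later,
where indexing a String with an integer returns a one-character string). The
name find_tag_end is made up for this illustration. It tracks only two states,
inside or outside a quoted attribute value:

```ruby
# Return the index of the ">" that closes the tag starting at `start`
# (which should point at the tag's "<"), ignoring any ">" found inside a
# quoted attribute value. Returns nil if the tag is never closed.
def find_tag_end(data, start)
  quote = nil                      # active quote character, nil when outside quotes
  i = start + 1
  while i < data.length
    ch = data[i]
    if quote
      quote = nil if ch == quote   # matching quote leaves the quoted state
    elsif ch == '"' || ch == "'"
      quote = ch                   # opening quote enters the quoted state
    elsif ch == ">"
      return i                     # an unquoted ">" ends the tag
    end
    i += 1
  end
  nil
end

tag = '<img alt="a>b" src="inequality.gif">'
p find_tag_end(tag, 0) == tag.length - 1   # prints true
```

This covers only the quoting problem; comments, scripts, and unbalanced quotes
would each need more states.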

> There's a lot of foul, cruel, and bad-tempered HTML out there
> in the wild.

Yeah, and I wrote some of it personally, or it was written with my editor
Arachnophilia.

> Depending on the needs of the Original Poster, 
> death could await a simplistic HTML lexer with nasty big pointy
> teeth.

Yes, as I have said. :)

> TIM: I warned you! But did you listen to me? Oh, no, you knew it
> all, didn't you? Oh, it's just a harmless little markup language,
> isn't it? Well, it's always the same, I always--
> ARTHUR: Oh, shut up!
> TIM: --But do they listen to me?--
> ARTHUR: Right!
> TIM: --Oh, no--
> KNIGHTS: Charge!

Not at all fair to a helpless attack-rabbit. :)

-- 
Paul Lutus
http://www.arachnoid.com