Phrogz wrote:
> On May 15, 10:18 pm, Jörg W Mittag <Joerg.Mit... / Web.De> wrote:
>> For example, the
>> following snippet is a perfectly well-formed and valid HTML document,
>> but none of the regexps posted in this thread so far are able to
>> correctly parse it:
>>
>> <HTML/
>> <HEAD/
>> <TITLE/>/
>> <P/>
> Wow. I was all fired up to call you out on this, and ask you what
> insane cocaine you were smoking when you made this claim.

Well, keep in mind that this is a very contrived, extreme, exaggerated example that you will never find in the wild, simply because not only the regexps in this thread but also the browsers cannot parse it -- although I heard rumors that Emacs/w3 actually supports some of the features used by that snippet. I just wanted to demonstrate that there are a lot of weird things in HTML that are much better left to the people who write HTML parsers than handled by writing the same incomplete HTML regexps over and over and over and over again.

> I was a web developer for many many years and standards were very,
> very important to me. I thought I knew the specs.

The above example mainly draws upon one simple fact: the HTML designers decided to make HTML an application of SGML without actually having a *beep*ing clue about SGML, thus creating some "interesting" interactions with SGML's parsing rules. And who can blame them? The reason they created HTML in the first place was that SGML is so mind-bogglingly complex that *nobody* has a *beep*ing clue! So, you can read all the W3C specs you want, but what makes HTML so weird isn't actually in there; it's buried somewhere in the thousands of pages of ISO SGML specs.

> And then I ran that by validator.w3.org along with an HTML 4.01 strict
> DTD, and - to my utter shock and surprise and horror - it turns out
> you were correct.

Well, let's see what actually happens.
We start out with this:

  <html>
    <head>
      <title>&gt;</title>
    </head>
    <body>
      <p>&gt;</p>
    </body>
  </html>

First, SGML is case-insensitive and HTML inherits that property. This already fools about 99% of all HTML regexps that you can find on the web:

  <HTML>
    <HEAD>
      <TITLE>&gt;</TITLE>
    </HEAD>
    <BODY>
      <P>&gt;</P>
    </BODY>
  </HTML>

We don't need to escape closing/right angle brackets (>), only opening/left ones (<):

  <HTML>
    <HEAD>
      <TITLE>></TITLE>
    </HEAD>
    <BODY>
      <P>></P>
    </BODY>
  </HTML>

Next, we use a feature that HTML inherited from SGML (without anybody noticing), called Null End Tags (NET), which basically allows you to DRY out (in Rails speak) the end tags. If you close the start tag with a slash instead of an angle bracket, you can replace the end tag with another slash, so

  <tag>some content</tag>

becomes

  <tag/some content/

That looks like this:

  <HTML/
    <HEAD/
      <TITLE/>/
    /
    <BODY/
      <P/>/
    /
  /

Quite weird, huh? But we are not done yet! End tags are optional if they can be inferred from the context (and if the DTD specifically allows this). So, for example, since BODY cannot occur inside of HEAD, the opening BODY tag implies a closing HEAD tag:

  <HTML/
    <HEAD/
      <TITLE/>/
    <BODY/
      <P/>

And one last step: not only are end tags optional, you can even drop a start tag entirely if it can be inferred. P can only occur inside a BODY, so the BODY can be inferred from P and we can get rid of it:

  <HTML/
  <HEAD/
  <TITLE/>/
  <P/>

> Thanks for sharing.

My pleasure.

BTW: this is not as useless as it might first seem. It's actually quite important to know that the W3C Validator uses an SGML parser to validate your documents, because that means it's worthless for a) XHTML, because XHTML is an application of XML, not SGML, and b) HTML, too, because browsers don't parse HTML as SGML; they parse it as Tag Soup. (To be more precise: if the validator tells you your HTML is invalid, then you know it's broken; however, if it tells you it's valid, that doesn't necessarily mean it'll actually work in a browser.)
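
As an aside on the walkthrough above, here is a minimal Python sketch of why the case-insensitivity and Null End Tag steps trip up a typical regexp. The patterns NAIVE_TITLE and LENIENT_TITLE and the helper find_title are hypothetical stand-ins for the kind of regexp under discussion, not quoted from any post in this thread, and the three strings are compact one-line spellings of the same document.

```python
import re

# Hypothetical "typical" title-extracting patterns; illustrative only.
NAIVE_TITLE = re.compile(r"<title>(.*?)</title>", re.DOTALL)
LENIENT_TITLE = re.compile(r"<title>(.*?)</title>", re.DOTALL | re.IGNORECASE)

# Three spellings of the same document, following the steps in the post.
DOCUMENTS = {
    "lowercase": "<html><head><title>></title></head><body><p>></p></body></html>",
    "uppercase": "<HTML><HEAD><TITLE>></TITLE></HEAD><BODY><P>></P></BODY></HTML>",
    "null end tags": "<HTML/ <HEAD/ <TITLE/>/ <P/>",
}


def find_title(pattern, document):
    """Return the captured title text, or None if the pattern finds nothing."""
    match = pattern.search(document)
    return match.group(1) if match else None


for label, document in DOCUMENTS.items():
    print(label,
          find_title(NAIVE_TITLE, document),
          find_title(LENIENT_TITLE, document))
```

Adding re.IGNORECASE rescues the uppercase spelling, but neither pattern finds a title in the Null End Tag spelling, because there is no literal </TITLE> left to match.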
XHTML is much better validated with an XML Schema Validator such as Christoph Schneegans' Schema Validator at <http://Schneegans.de/sv/> or the Validome validator at <http://Validome.org/>.

It's crucial to remember that the W3C Validator and the browser parse HTML quite differently and that neither of those necessarily has anything to do with how *you* might actually parse it (-;

I once found a cute little snippet on a website that I unfortunately can no longer locate, which demonstrated this quite nicely. That snippet had a little typo in it that fooled the human reader, the W3C Validator and the browser into reading that exact same snippet in three radically different ways, although what was *really* meant was actually a *fourth* thing.

Just one quick example: HTML allows you to leave out the quotation marks around attribute values. So,

  <A HREF=search.html>Search</A>

is perfectly fine; however,

  <A HREF=http://google.com/>Search</A>

isn't, because as we now know, the first slash closes the start tag and the second one gets interpreted as a Null End Tag, so the above snippet would actually be parsed as something like the following:

  <A HREF="http:"></A>google.com/>Search</A>

And the validator will complain about an extra closing </A> tag, while the browser will quietly fix that up to mean

  <A HREF="http://google.com/">Search</A>

which is obviously what was intended. However, if you don't know about Null End Tags, you can stare at the Validator's error message

  Line X, Column Y: end tag for element "A" which is not open

for hours and still not realize that your problem has nothing to do with an extra end tag, Line X or Column Y, but that you are actually missing some quotation marks somewhere else in your document.

BTW: the W3C gave up on SGML long ago and developed XML as a much simpler subset of SGML, and XHTML as an application of XML. Now the WHAT-WG has followed by basically giving up any pretense that HTML5 is an application of SGML; rather, it is a language in its own right, totally separate from both XML and SGML. And now we know why!
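
Going back to the unquoted-attribute example above, the Null End Tag reading can be sketched in a few lines of Python. This is a toy model under loud assumptions, not an SGML parser: the names START_TAG, ATTR and expand_net are made up for illustration, only a single leading start tag is handled, and an unquoted attribute value is simply assumed to stop at the first slash, angle bracket or whitespace.

```python
import re

# Toy model only (assumption): one leading start tag, unquoted attribute
# values that end at the first "/", ">" or whitespace character.
START_TAG = re.compile(
    r"<(?P<name>[A-Za-z][A-Za-z0-9]*)"
    r"(?P<attrs>(?:\s+[A-Za-z]+=[^\s/>]+)*)"
    r"\s*(?P<close>[/>])"
)
ATTR = re.compile(r"([A-Za-z]+)=([^\s/>]+)")


def expand_net(markup):
    """Rewrite a leading NET-style start tag into explicit start and end tags."""
    match = START_TAG.match(markup)
    if match is None:
        return markup
    name = match.group("name")
    attrs = "".join(f' {key}="{value}"'
                    for key, value in ATTR.findall(match.group("attrs")))
    start = f"<{name}{attrs}>"
    rest = markup[match.end():]
    if match.group("close") == ">":
        # Ordinary start tag: nothing to expand, just quote the attributes.
        return start + rest
    # The start tag was closed with "/", so the next "/" is the null end tag.
    content, slash, tail = rest.partition("/")
    if not slash:
        return start + rest
    return f"{start}{content}</{name}>{tail}"


print(expand_net("<A HREF=search.html>Search</A>"))
print(expand_net("<A HREF=http://google.com/>Search</A>"))
print(expand_net("<tag/some content/"))
```

Under these assumptions the google.com link comes out with HREF="http:" and an A element that is already closed before "google.com/>Search</A>", which is why the stray </A> at the end is what gets reported.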

One last goodie: you can actually specify an alternate root element in the DOCTYPE declaration:

  <!DOCTYPE p PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
  <P/

Although I have no friggin' clue how a browser is actually supposed to display this.

Anyway, that concludes today's off-topic SGML rant; let's now get back to our regularly scheduled Smalltalk and Lisp threads, please (-;

jwm