David Vallner wrote:
> James Britt wrote:
>> (Offhand, I don't see how static or explicit typing would help track 
>> these sorts of issues.  Unit tests might.)
> 
> Hrm. Mechanize or htmltools optionally passing HTML input through tidy 
> perhaps? I've no idea what the scope of htmltools markup error recovery 
> capabilities is, that just might help.
Minimal, in my experience.  There are some very, very broken pages out 
there.

My current method for doing this sort of thing involves sniffing the 
character set, normalising to utf-8, chucking the output through tidy to 
get xml, ripping off the xml declaration (the leading <?xml ... ?> line) 
and passing what's left through REXML.  You have to take the declaration 
off because if the page actually includes text in more than one 
character set (you'd be surprised how often this happens), the 
normalising won't be complete and tidy will get it wrong half the time, 
which makes REXML barf.  I can show code if you want.
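
For anyone who wants the shape of it, the steps above could be sketched 
roughly as follows.  This is a minimal sketch under assumptions, not 
production code: the method names, the ISO-8859-1 fallback and the 
choice of tidy flags are all illustrative, and it assumes a tidy binary 
is on your PATH.

```ruby
require 'open3'
require 'rexml/document'

# Guess the charset from a <meta> tag. The default is an assumption;
# real sniffing would also look at HTTP headers and byte-order marks.
def sniff_charset(html, default = 'ISO-8859-1')
  head = html.b[0, 4096] # binary view, so the regex never trips on bad bytes
  head[/charset\s*=\s*["']?([A-Za-z0-9_\-]+)/i, 1] || default
end

# Reinterpret the bytes in the sniffed charset and transcode to UTF-8,
# replacing anything that does not convert.
def to_utf8(html, charset)
  html.dup.force_encoding(charset)
      .encode('UTF-8', invalid: :replace, undef: :replace)
      .scrub
rescue ArgumentError, Encoding::ConverterNotFoundError
  # Unknown charset label: treat the bytes as UTF-8 and scrub the bad ones.
  html.dup.force_encoding('UTF-8').scrub
end

# Pipe the page through the external tidy program to get well-formed XML.
# -numeric asks for numeric entities, which REXML can read without a DTD.
def tidy_to_xml(utf8_html)
  out, _err, _status = Open3.capture3(
    'tidy', '-quiet', '-asxml', '-utf8', '-numeric', '--force-output', 'yes',
    stdin_data: utf8_html
  )
  out
end

# Drop the leading <?xml ... ?> line so REXML does not trust its encoding.
def strip_xml_declaration(xml)
  xml.sub(/\A\s*<\?xml\b.*?\?>\s*/m, '')
end

# Hypothetical glue method tying the steps together.
def parse_page(raw_html)
  utf8 = to_utf8(raw_html, sniff_charset(raw_html))
  REXML::Document.new(strip_xml_declaration(tidy_to_xml(utf8)))
end
```

Something like parse_page(File.binread('page.html')) would then hand 
back a REXML::Document, or raise a REXML parse error if tidy could not 
rescue the page.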

In any case, this is tangential - the fundamental issue is that static 
and explicit typing can't catch semantic errors.  The original paper on 
Hungarian notation (which Joel Spolsky goes on about at 
http://www.joelonsoftware.com/articles/Wrong.html) nails this problem, 
but I've never seen a language whose interpreter/compiler enforces 
matched variable and function naming conventions.
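
To make the point concrete, here is a small sketch using the "us" 
(unsafe) and "s" (safe) prefixes from the article linked above.  The 
method names html_encode and write_page and the sample input are made 
up for illustration.  Both values are plain Strings, so a type checker 
has nothing to complain about; only the naming convention shows that 
the last call is wrong.

```ruby
HTML_ESCAPES = { '&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;' }.freeze

# Takes an unsafe string, returns a safe one.
def html_encode(us_text)
  us_text.gsub(/[&<>"]/, HTML_ESCAPES)
end

# Expects a safe string, but nothing in the language enforces that.
def write_page(s_body)
  "<p>#{s_body}</p>"
end

us_name = '<script>alert(1)</script>' # pretend this came from a request
s_name  = html_encode(us_name)

ok  = write_page(s_name)  # s into an s parameter: prefixes match
bad = write_page(us_name) # us into an s parameter: still "type correct"
```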

-- 
Alex