OK, lots of good responses, thanks!  A few comments:

Felix:  while URI.parse() is behaving according to the two cited RFCs, I
think it is missing an important use case.  In "http://3beers-wrk",
"3beers-wrk" isn't a domain name, is it?  It is an unqualified host name
(I assume we'd pick the domain up from context).  Now, the RFC also
says that a host name must follow these rules (starting with a letter,
etc.), and furthermore, that every label of a domain name must follow
the same convention, which suggests that the regexp in common.rb is
also incorrect. :)
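
For reference, here is the failure in irb (Ruby 1.8 here; the exact
error text may vary by version):

  require 'uri'
  URI.parse("http://3beers-wrk")
  # => URI::InvalidURIError: the scheme http does not accept registry
  #    part: 3beers-wrk (or bad hostname?)
  URI.parse("http://beers-wrk")   # same name minus the leading digit
  # => #<URI::HTTP:0x... URL:http://beers-wrk>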

Also, the solution of "rename the host" is a non-solution when dealing
with customers who are using an otherwise perfectly acceptable hostname
(I have yet to find a tool that balks at a hostname beginning with a
number).

Now, I'm not sure whether those RFCs have been superseded by newer
versions - that would take some digging.

So, John, I'd say that this is a bug in URI.parse, since it follows
neither the published RFCs nor today's practical implementation of them
(as Coey points out).  And if it follows neither, it's really not a
very good general-purpose function for the Ruby library, and so it
should be fixed.

Andrew


-----Original Message-----
From: RubyTalk / gmail.com [mailto:rubytalk / gmail.com] 
Sent: Wednesday, August 29, 2007 1:17 PM
To: ruby-talk ML
Subject: Re: Bug in URI.parse?

If it is a bug, change TOPLABEL in common.rb to this:

TOPLABEL = "(?:[#{ALNUM}](?:[-#{ALNUM}]*[#{ALNUM}])?)"
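
For context: the stock definition in Ruby 1.8's uri/common.rb starts
the label with ALPHA instead of ALNUM, which is exactly what rejects a
leading digit (RFC 1123, section 2.1, relaxed RFC 952's letter-first
rule to allow one).  A quick sanity check of the two patterns, reusing
the PATTERN constants that uri/common.rb defines:

  require 'uri'
  include URI::REGEXP::PATTERN

  # The stock pattern starts with ALPHA; the patched one with ALNUM.
  stock   = /\A(?:[#{ALPHA}](?:[-#{ALNUM}]*[#{ALNUM}])?)\z/
  patched = /\A(?:[#{ALNUM}](?:[-#{ALNUM}]*[#{ALNUM}])?)\z/

  "3beers-wrk" =~ stock    # => nil (rejected: leading digit)
  "3beers-wrk" =~ patched  # => 0   (accepted)

TOPLABEL gets interpolated into the larger URI regexps when the uri
library is loaded, so the change has to be made in common.rb itself
(or the whole regexp set rebuilt) for URI.parse to pick it up.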

Thanks to my friendly MySQL admin.

Stephen Becker IV

On 8/29/07, John Joyce <dangerwillrobinsondanger / gmail.com> wrote:
> I wouldn't call it a bug exactly; it does do what it is written to do.
> Instead, let's just say that URI.parse isn't very robust.
> It doesn't handle lots of real-world situations in ways you would
> expect.
> You would expect some sort of message saying the TLD (top-level
> domain) is missing or bad, but you would also not expect this to end
> your program abruptly.
>
> A good URI parser will also accept IP addresses, since those are also
> valid - at least in the sense that they are real and are likely to be
> entered by users.
> Another problem is the way it handles URLs missing the www, or the
> http:// or https:// scheme.
> While strictly speaking the scheme should be required, that is clearly
> not the reality of URLs in the world, or of how humans use them.
> People have become accustomed to using what are officially partial or
> bad URLs.
>
> Most web browsers will accept a simple string and attempt to find it,
> even if it means adding a TLD.
>
> ARPANET is pretty pointless now.
>
> I've begun my own script to check whether a URL is correct, but only
> if it is the human-readable variety.
> One of the biggest problems is the transitory nature of URLs.
> They can change or disappear without notice.
> Another problem is the path after the TLD.  The path can be nearly
> anything, and it can only be assumed to begin at the first / after
> the apparent TLD.
>
>
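
For anyone who wants the lenient, human-friendly behavior John
describes, here is a rough sketch of a pre-processing step (the helper
name, the scheme regexp, and the nil fallback are all my own choices,
not anything in the stdlib):

  require 'uri'

  # Normalize a human-entered URL: add a scheme when one is missing,
  # then validate, accepting bare hosts and IP addresses.
  def normalize_url(str)
    str = str.strip
    # Prepend http:// unless the string already carries a scheme.
    str = "http://#{str}" unless str =~ %r{\A[a-zA-Z][\w+.-]*://}
    uri = URI.parse(str)
    raise URI::InvalidURIError, "no host in #{str}" if uri.host.nil?
    uri
  rescue URI::InvalidURIError
    # URI.parse still rejects hosts like "3beers-wrk"; a looser
    # hand-rolled split would be needed to accept those as well.
    nil
  end

  normalize_url("www.example.com/some/path")  # => #<URI::HTTP ...>
  normalize_url("192.168.0.1")                # IP addresses parse fine
  normalize_url("3beers-wrk")                 # => nil (still rejected)

Adding the scheme first matters: URI.parse("www.example.com") succeeds
but returns a URI::Generic whose host is nil and whose path is the
whole string, which is rarely what the user meant.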