Felix Windt writes:
 > > -----Original Message-----
 > > From: Daniel Berger [mailto:djberg96 / gmail.com] 
 > > Sent: Wednesday, August 29, 2007 10:24 AM
 > > To: ruby-talk ML
 > > Subject: Re: Bug in URI.parse?
 > > 
 > > It looks like URI.parse doesn't like the leading number:
 > > 
 > > irb(main):001:0> require 'uri'
 > > irb(main):003:0> URI.parse("http://xshare")
 > > => #<URI::HTTP:0x16fd906 URL:http://xshare>
 > > irb(main):004:0> URI.parse("http://xshare-foo")
 > > => #<URI::HTTP:0x16fc498 URL:http://xshare-foo>
 > > irb(main):006:0> URI.parse("http://3qshare")
 > > URI::InvalidURIError: the scheme http does not accept registry part:
 > > 3qshare (or bad hostname?)
 > >         from C:/ruby/lib/ruby/1.8/uri/generic.rb:195:in `initialize'
 > >         from C:/ruby/lib/ruby/1.8/uri/http.rb:78:in `initialize'
 > >         from C:/ruby/lib/ruby/1.8/uri/common.rb:488:in `new'
 > >         from C:/ruby/lib/ruby/1.8/uri/common.rb:488:in `parse'
 > >         from (irb):6
 > > 
 > > I couldn't tell you what the proper behavior is.
 > > 
 > > Regards,
 > > 
 > > Dan
 > > 
 > > 
 > 
 > That is true, and due to the following regular expressions from
 > uri/common.rb:
 > 
 > # domainlabel   = alphanum | alphanum *( alphanum | "-" ) alphanum
 > DOMLABEL = "(?:[#{ALNUM}](?:[-#{ALNUM}]*[#{ALNUM}])?)"
 > # toplabel      = alpha | alpha *( alphanum | "-" ) alphanum
 > TOPLABEL = "(?:[#{ALPHA}](?:[-#{ALNUM}]*[#{ALNUM}])?)"
 > # hostname      = *( domainlabel "." ) toplabel [ "." ]
 > HOSTNAME = "(?:#{DOMLABEL}\\.)*#{TOPLABEL}\\.?"
 > 
 > So a valid hostname will consist of optional DOMLABELs in front of a
 > TOPLABEL. The TOPLABEL must start with a letter and end in a letter or
 > digit, with only letters, digits and hyphens in between.
 > 
 > That is consistent with RFC 1035 (DOMAIN NAMES - IMPLEMENTATION AND
 > SPECIFICATION) [http://www.ietf.org/rfc/rfc1035.txt]:
 > The labels must follow the rules for ARPANET host names.  They must
 > start with a letter, end with a letter or digit, and have as interior
 > characters only letters, digits, and hyphen.  There are also some
 > restrictions on the length.  Labels must be 63 characters or less.
 > 
 > The error thrown by URI.parse is a little odd in this context, but explained
 > as follows:
 > 
 > In the URI.parse chain, the URI is checked against a longer regular
 > expression that only partly matches the hostname, but also other URI parts
 > (such as userinfo, the scheme etc.). The hostname part doesn't match here
 > because it's dealing with an invalid hostname. The URI registry part _does_
 > match your invalid hostname, so this information is passed on in the array
 > of matched URI parts for the registry.
 > This array is then checked in Generic.new. That constructor finds the string
 > passed for the registry, but the class is hard coded to not use registries:
 > 
 > USE_REGISTRY = false
 > #
 > # DOC: FIXME!
 > #
 > def self.use_registry
 >   self::USE_REGISTRY
 > end
 > 
 > And in the constructor:
 > 
 > if @registry && !self.class.use_registry
 >   raise InvalidURIError,
 >   "the scheme #{@scheme} does not accept registry part: #{@registry} (or bad hostname?)"
 > end
 > 
 > 
 > 
 > To sum up: a hostname of 3beers-wrk is invalid as an ARPANET host according
 > to the RFC, so the correct solution would be to rename the host.
 > 
 > 
 > Hope that helps,
 > 
 > Felix

While I believe that Felix's analysis is valid, the problem is that
there are valid, real domains that start with numbers, and URI should
parse those, and in fact, it generally does.

irb(main):002:0> require 'uri'
=> true
irb(main):003:0> URI.parse('http://slashdot.org')
=> #<URI::HTTP:0x2fee3c URL:http://slashdot.org>
irb(main):004:0> URI.parse('http://401k.com')
=> #<URI::HTTP:0x2fca24 URL:http://401k.com>
irb(main):006:0> URI.parse('http://www.3com.com')
=> #<URI::HTTP:0x2f7b64 URL:http://www.3com.com>
irb(main):007:0> URI.parse('https://401k.fidelity.com')
=> #<URI::HTTPS:0x2f5364 URL:https://401k.fidelity.com>

All of these are real domains for real websites, and thus, the
suggestion of "rename the host" would not work very well.  

The problem is probably better illustrated by this example:

irb(main):005:0> URI.parse('http://www.example.4bad')
URI::InvalidURIError: the scheme http does not accept registry part: www.example.4bad (or bad hostname?)
        from /usr/local/lib/ruby/1.8/uri/generic.rb:195:in `initialize'
        from /usr/local/lib/ruby/1.8/uri/http.rb:78:in `initialize'
        from /usr/local/lib/ruby/1.8/uri/common.rb:488:in `new'
        from /usr/local/lib/ruby/1.8/uri/common.rb:488:in `parse'
        from (irb):5

Here, the top-level domain starts with a digit, and _that_ is not
allowed.  And we will most likely never see such a beast out in the
world.  So the work-around for Dan's original problem would be to
specify the domain name with the hostname: 3qshare.<your-domain>
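
To make the distinction concrete, here is a minimal sketch that applies
the three patterns Felix quoted to a handful of hostnames.  The ALPHA and
ALNUM definitions are my assumption (plain ASCII letters and digits), the
\A and \z anchors are added so the whole string has to match, and
hostname_ok? and example.com are hypothetical names used only for
illustration.

```ruby
# Assumed character classes: ASCII letters, and letters plus digits.
ALPHA = "a-zA-Z"
ALNUM = "#{ALPHA}\\d"

# The three patterns as quoted from uri/common.rb above.
DOMLABEL = "(?:[#{ALNUM}](?:[-#{ALNUM}]*[#{ALNUM}])?)"
TOPLABEL = "(?:[#{ALPHA}](?:[-#{ALNUM}]*[#{ALNUM}])?)"
HOSTNAME = "(?:#{DOMLABEL}\\.)*#{TOPLABEL}\\.?"

# Anchors added for this sketch so partial matches do not count.
HOSTNAME_RE = Regexp.new("\\A#{HOSTNAME}\\z")

# Hypothetical helper: true if the whole string matches HOSTNAME.
def hostname_ok?(host)
  !HOSTNAME_RE.match(host).nil?
end

%w[401k.com www.3com.com 3qshare www.example.4bad 3qshare.example.com].each do |host|
  puts "#{host}: #{hostname_ok?(host)}"
end
```

Only the last label has to satisfy TOPLABEL, so 401k.com and
3qshare.example.com match, while the bare 3qshare and www.example.4bad
do not.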

But, I would contend that this _is_ a bug in URI.  My suggestion would
be that the regex for HOSTNAME be:
  HOSTNAME = "#{DOMLABEL}(?:(?:\\.#{DOMLABEL})*\\.#{TOPLABEL}\\.?)?"
(I'll admit I'm not that familiar with this regex notation, so I'm
winging it; apologies for any mistakes.)  The point is that a hostname
may be given without a domain, and in that case it must still parse.
If the hostname is fully qualified, or is just a domain name, then the
top-level label must still be checked and enforced, with optional
(sub)domains in between.
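
A rough sketch of that idea, to be read as an untested-against-uri
illustration rather than a patch: a single label may stand alone, and
anything with dots must end in a label that starts with a letter.  It
uses TOPLABEL (the constant name in Felix's quote) and makes the whole
domain group optional.  The ALPHA and ALNUM definitions are assumed to
be plain ASCII, the anchors are added for the sketch, and
PROPOSED_HOSTNAME and proposed_ok? are hypothetical names.

```ruby
# Assumed character classes: ASCII letters, and letters plus digits.
ALPHA = "a-zA-Z"
ALNUM = "#{ALPHA}\\d"

# Label patterns as quoted from uri/common.rb above.
DOMLABEL = "(?:[#{ALNUM}](?:[-#{ALNUM}]*[#{ALNUM}])?)"
TOPLABEL = "(?:[#{ALPHA}](?:[-#{ALNUM}]*[#{ALNUM}])?)"

# One label on its own, or one or more labels followed by a top-level label.
PROPOSED_HOSTNAME = "#{DOMLABEL}(?:(?:\\.#{DOMLABEL})*\\.#{TOPLABEL}\\.?)?"

# Anchors added for this sketch so the whole string has to match.
PROPOSED_RE = Regexp.new("\\A#{PROPOSED_HOSTNAME}\\z")

# Hypothetical helper: true if the whole string matches the proposal.
def proposed_ok?(host)
  !PROPOSED_RE.match(host).nil?
end

%w[3qshare 3qshare.example.com 401k.com www.example.4bad].each do |host|
  puts "#{host}: #{proposed_ok?(host)}"
end
```

With this pattern the bare 3qshare is accepted, while www.example.4bad
is still rejected because its last label starts with a digit.  How the
rest of uri/common.rb would react to such a change is not covered here.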

Of course, I'm working from what Felix gave above; I haven't gone
through uri/common.rb to any significant extent, so there may be other
things that this suggestion would cause to break.

Coey


-- 
Coey Minear
Senior Test Engineer
(651) 628-2831
coey_minear / securecomputing.com


