Hi Matz,

On 5/14/02 2:41 AM, "Yukihiro Matsumoto" <matz / ruby-lang.org> wrote:

> Hi,
> 
> In message "UTF8 and Regexp"
>   on 02/05/14, Bob Hutchison <hutch / recursive.ca> writes:
> 

[snip]

> |
> |Is it possible to specify character codes > 0x7F in the patterns of Ruby's
> |regular expressions? Any suggestions are more than welcome.
> 
> Just embed them directly in the pattern, or
> 
> r = Regexp.compile("ab\304\243cd", 0, "UTF-8")
> 
> or even
> 
> r = Regexp.compile("ab#{[0x123].pack('U')}cd", 0, "UTF-8")

OK, thanks, things are becoming clearer. I was doing the equivalent of this,
except with the arguments nil and 'u'. However, I seem to have been trying
to figure two things out at once and didn't know it. This helps. Thanks.

However, have a look at the following little program:

  def check(s, pattern)
    spot = s.index(pattern)

    if nil != spot then
      printf("found in '%s' at %d\n", s, spot)
    else
      printf("not found in '%s'\n", s)
    end

    return spot
  end

    ##############
    # A pattern that will match anything not in ?a, ?b, ?c, or ?d.
    p = Regexp.compile("[^abcd]", 0, "UTF-8")

    check("abcd", p)
    check("abcde", p)
    check("ab\x02de", p)

    ##############
    # add 0x123 to the list
    p = Regexp.compile("[^ab#{[0x123].pack('U')}cd]", 0, "UTF-8")

    check("abcd", p)
    check("abcde", p)
    check("ab\x02de", p)

    ##############
    # add 0x1234 to the list
    p = Regexp.compile("[^ab#{[0x1234].pack('U')}cd]", 0, "UTF-8")

    check("abcd", p)
    check("abcde", p)
    check("ab\x02de", p)

## END

With the results...

not found in 'abcd'
found in 'abcde' at 4
found in 'abde' at 2
not found in 'abcd'
not found in 'abcde'
not found in 'abde'
not found in 'abcd'
found in 'abcde' at 4
found in 'abde' at 2


I'm not quite following why this is. What's special about 0x123? I just
picked this out of the air.
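
One way to narrow this down is to look at the raw UTF-8 bytes of the code
points involved. This is only a minimal sketch: utf8_bytes is a made-up
helper name, and it shows the encoding only, not why the match fails.

```ruby
# Hypothetical helper: return the UTF-8 bytes of a code point as hex strings.
def utf8_bytes(code_point)
  # pack('U') encodes the code point as UTF-8; unpack('C*') lists its bytes.
  [code_point].pack('U').unpack('C*').map { |byte| format('%02X', byte) }
end

# Dump the bytes for the code points discussed in this message.
[0x123, 0x180, 0x1234].each do |cp|
  printf("U+%04X -> %s\n", cp, utf8_bytes(cp).join(' '))
end
```

0x123 comes out as C4 A3, which is the same pair of bytes as the "\304\243"
in Matz's first example, so both spellings put identical bytes into the
pattern.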

So anyway, I wrote this little program:

for i in 0..0x200 do
  if nil == "abcde".index(Regexp.compile("[^ab#{[i].pack('U')}cd]",
                                         0, "UTF-8")) then
    printf("%d %x\n", i, i)
  end
end

There seem to be a bunch of characters like 0x123. 0x180 is a particularly
nasty one since it appears to put the program into an infinite loop.
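
For comparison, the same scan could be rerun on a later Ruby where strings
carry their own encoding and no encoding argument is passed to the Regexp
constructor. That is an assumption (Ruby 2.0 or newer, not the 1.7.2 build
used here). A rough sketch: suspicious_code_points is a made-up name, and
Regexp.escape is added so that characters such as ']' or '-' don't change
the meaning of the class.

```ruby
# Hypothetical helper: list the code points in `range` that, once added to
# the negated class [^abcd], make the pattern fail to match in "abcde".
def suspicious_code_points(range)
  range.select do |code_point|
    # Escape the character so metacharacters are taken literally.
    char = Regexp.escape([code_point].pack('U'))
    pattern = Regexp.new("[^ab#{char}cd]")
    "abcde".index(pattern).nil?
  end
end

# Same output format as the loop above: decimal, then hex.
suspicious_code_points(0x100..0x200).each do |code_point|
  printf("%d %x\n", code_point, code_point)
end
```

Since 'e' stays outside the class for every code point in 0x100..0x200, a
correct engine should print nothing for that range.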

Preceding the UTF-8 character with a \\ in the pattern doesn't help either
(Ruby doesn't seem to like /[^ab\Ccd]/, and it doesn't do what I expected
anyway).

Now, I admit I'm using a version of Ruby 1.7.2 from Dec 27 2001, so maybe an
upgrade is in order. I'm also using Mac OS X. If this is the reason, I
apologise.

> 
> Sorry for inconvenience.  It will be far better in the M17N
> enhancement process.  Expression like \x{123} in the regular
> expression will be allowed.

I'm looking forward to the M17N (can you imagine someone looking forward to
something like that !? :-)

> 
> matz.