Hi Matz,

On 5/14/02 2:41 AM, "Yukihiro Matsumoto" <matz / ruby-lang.org> wrote:

> Hi,
>
> In message "UTF8 and Regexp"
>     on 02/05/14, Bob Hutchison <hutch / recursive.ca> writes:
> [snip]
> |
> |Is it possible to specify character codes > 0x7F in the patterns of Ruby's
> |regular expressions? Any suggestions are more than welcome.
>
> Just embed them directly in the pattern, or
>
>   r = Regexp.compile("ab\304\243cd", 0, "UTF-8")
>
> or even
>
>   r = Regexp.compile("ab#{[0x123].pack('U')}cd", 0, "UTF-8")

OK, thanks, things are becoming clearer. I was doing the equivalent of this,
except with the arguments nil and 'u'. However, I seem to have been trying to
figure out two things at once without knowing it. This helps. Thanks.

However, have a look at the following little program:

  def check(s, pattern)
    spot = s.index(pattern)
    if nil != spot then
      printf("found in '%s' at %d\n", s, spot)
    else
      printf("not found in '%s'\n", s)
    end
    return spot
  end

  ##############
  # A pattern that will match anything not in ?a, ?b, ?c, or ?d.
  p = Regexp.compile("[^abcd]", 0, "UTF-8")
  check("abcd", p)
  check("abcde", p)
  check("ab\x02de", p)

  ##############
  # add 0x123 to the list
  p = Regexp.compile("[^ab#{[0x123].pack('U')}cd]", 0, "UTF-8")
  check("abcd", p)
  check("abcde", p)
  check("ab\x02de", p)

  ##############
  # add 0x1234 to the list
  p = Regexp.compile("[^ab#{[0x1234].pack('U')}cd]", 0, "UTF-8")
  check("abcd", p)
  check("abcde", p)
  check("ab\x02de", p)
  ## END

With the results...

  not found in 'abcd'
  found in 'abcde' at 4
  found in 'abde' at 2
  not found in 'abcd'
  not found in 'abcde'
  not found in 'abde'
  not found in 'abcd'
  found in 'abcde' at 4
  found in 'abde' at 2

I'm not quite following why this is. What's special about 0x123? I just picked
it out of the air. So anyway, I wrote this little program:

  for i in 0..0x200 do
    if nil == "abcde".index(Regexp.compile("[^ab#{[i].pack('U')}cd]", 0, "UTF-8")) then
      printf("%d %x\n", i, i)
    end
  end

There seem to be a bunch of characters like 0x123.
0x180 is a particularly nasty one, since it appears to put the program into an
infinite loop. Preceding the UTF-8 character with a \\ in the pattern doesn't
help either (Ruby doesn't seem to like /[^ab\Ccd]/, and it doesn't do what I
expected anyway).

Now, I admit I'm using a version of Ruby 1.7.2 from Dec 27 2001, so maybe an
upgrade is in order. I'm also using Mac OS X. If this is the reason, I
apologise.

> Sorry for inconvenience. It will be far better in the M17N
> enhancement process. Expression like \x{123} in the regular
> expression will be allowed.

I'm looking forward to the M17N (can you imagine someone looking forward to
something like that!? :-)

>
> matz.
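
For readers retrying this on a current interpreter, here is a minimal sketch of the same two experiments. It assumes Ruby 1.9 or later, which postdates this thread: strings carry their own encoding there, so the third argument to Regexp.compile is dropped, and a code point inside a pattern is written as \u{123}. The names `pattern` and `suspects` are illustrative and do not appear in the original program. The scan is restricted to codes above 0x7F, the range the question asks about, and Regexp.escape is added only as a precaution.

```ruby
# Negated character class containing U+0123, written with a code point escape.
pattern = /[^ab\u{123}cd]/

"abcd".index(pattern)         # => nil (every character is in the class)
"abcde".index(pattern)        # => 4   ('e' is not in the class)
"ab\x02de".index(pattern)     # => 2   (the control character is not in the class)
"ab\u{123}cd".index(pattern)  # => nil (U+0123 itself is in the class)

# Repeat the scan from the message for code points 0x80..0x200.
# A code point lands in `suspects` only if adding it to the class
# stops 'e' from matching, which is the misbehaviour reported above.
suspects = (0x80..0x200).select do |i|
  char = [i].pack('U')  # UTF-8 string holding the single code point i
  "abcde".index(Regexp.new("[^ab#{Regexp.escape(char)}cd]")).nil?
end

suspects.each { |i| printf("%d %x\n", i, i) }
```

If the character class is handled correctly, `suspects` comes out empty and nothing is printed.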