On 15.02.2007 16:19, Ian Macdonald wrote:
> On Thu 15 Feb 2007 at 12:39:21 +0900, Rob Biedenharn wrote:
>
>> Yes, the LANG is affecting the result in irb, but not ruby.
>>
>> $ irb -v
>> irb 0.9.5(05/04/13)
>>
>> Whether the irb behavior is "correct" or anomalous is probably a
>> question for the maintainers to debate. The man page for ctype(3)
>> (on my Mac OS X 10.4.8) indicates that the macros are supposed to be
>> based on the locale and my copy of the pickaxe (p.71) says that the
>> character classes are based on the ctype macros of the same name.
>> However, a quick C program shows effectively the same behavior as
>> ruby (i.e., only the [0-9A-Za-z] satisfy isalnum() even for nl_NL).
>> I'm now more curious as to how irb is finding the character classes.
>
> It turns out that the poster who mentioned possible interference from
> the readline(3) library was right.

That was me. :-)

> Look at this:
>
> $ irb
> irb(main):001:0> foo = "préférées"
> => "pr\351f\351r\351es"
> irb(main):002:0> foo =~ /[^[:alnum:]]/
> => nil
>
> $ irb --noreadline
> irb(main):001:0> foo = "préférées"
> => "pr\351f\351r\351es"
> irb(main):002:0> foo =~ /[^[:alnum:]]/
> => 2
>
> This is _very_ unexpected and undesirable behaviour and, as such,
> probably qualifies as a bug.

Yeah, seems so. Unless it's documented behavior. :-)

> Interestingly, adding "require 'readline'" to the stand-alone script
> does _not_ introduce this behaviour, so it must be something to do with
> the initialisation that irb does.

It's really strange as both print the same output. How about doing this
- just to be sure that both strings contain the same sequence of bytes:

  require 'enumerator'
  foo.to_enum(:each_byte).to_a.join(", ")

Kind regards

	robert
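
PS: here is a slightly fuller sketch of the byte check suggested above, in case it helps. It is only an illustration: `byte_dump` is a made-up helper name (not part of irb or Ruby), and the sample string is my assumption of what "préférées" looks like in ISO-8859-1, based on the \351 escapes irb printed. It uses String#unpack("C*") instead of the enumerator approach, which should give the same list of byte values.

```ruby
# Hypothetical helper (not part of irb or Ruby): list a string's raw
# byte values so the output of two sessions can be compared by eye.
def byte_dump(str)
  str.unpack("C*").join(", ")
end

# Assumed sample: "préférées" in ISO-8859-1, where each é is the single
# byte 233 (octal 351, matching the "\351" that irb echoed).
latin1 = [112, 114, 233, 102, 233, 114, 233, 101, 115].pack("C*")

puts byte_dump(latin1)
# prints: 112, 114, 233, 102, 233, 114, 233, 101, 115
```

If `byte_dump(foo)` gives the same list in plain irb and under irb --noreadline, the strings really are identical and the difference would have to lie in how the character classes are evaluated, not in what readline hands to irb.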