Daniel DeLorme wrote:
> Usually the complaint about the lack of unicode support is that 
> something like "日本語".length returns 9 instead of 3, or that 
> "日本語".index("語") returns 6 instead of 2. It's nice that people want to 
> completely redefine the API to return character positions and all that, 
> but please don't complain that it's broken just because you happen to be 
> using it incorrectly. Use the right tool for the job. SQL for database 
> queries, non-home-brewed crypto libraries for security, regular 
> expressions for string manipulation.
> 
> I'm terribly sorry for the rant but I had to get it off my chest.

Regular expressions for all character work would be a *terribly* slow 
way to get things done. If you want to get the nth character, should you 
do a match for n-1 characters and a group to grab the nth? Or would it 
be better if you could just index into the string and have it do the 
right thing? How about if you want to iterate over all characters in a 
string? Should the iterating code have to know about the encoding? 
Should you use a regex to peel off one character at a time? Absurd.
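
To make the comparison concrete, here is a rough sketch of the two approaches side by side. It assumes Ruby 1.9 or later and a UTF-8 source file; the variable names are only for illustration.

```ruby
# Assumes Ruby 1.9+ and UTF-8 source; names here are illustrative only.
s = "日本語"
n = 2 # zero-based position of the character we want

# Regex approach: match n characters, then capture the next one in a group.
via_regex = s[/\A.{#{n}}(.)/m, 1]

# Direct indexing: the string knows its encoding and does the right thing.
via_index = s[n]

# Iteration: the calling code never has to know about the encoding.
chars = []
s.each_char { |c| chars << c }
```

Both lookups return the same character, but the regex version has to build and run a pattern just to do what s[n] expresses directly.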

Regex for string access goes a long way, but it's just about the heaviest 
way to do it. Strings should be aware of their encoding and should be 
able to provide you access to characters as easily as bytes. That's what 
1.9 (and upcoming changes in JRuby) fixes.
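
As a small sketch of what encoding-aware strings look like in practice (again assuming Ruby 1.9 or later with UTF-8 source, and using the example string from the quoted message), characters and bytes are both available, each through its own call:

```ruby
# Assumes Ruby 1.9+ and UTF-8 source; variable names are illustrative.
s = "日本語"

char_count = s.length     # 3: counted in characters
byte_count = s.bytesize   # 9: each of these characters is 3 bytes in UTF-8
char_pos   = s.index("語") # 2: a character offset

# The byte-oriented view is still there if you ask for it explicitly,
# here by retagging copies of the strings as binary.
raw      = s.dup.force_encoding(Encoding::BINARY)
needle   = "語".dup.force_encoding(Encoding::BINARY)
byte_pos = raw.index(needle) # 6: a byte offset
```

The 9 and 6 from the original complaint only show up when you explicitly treat the data as bytes.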

- Charlie