On Thu, 20 Jan 2005, Yukihiro Matsumoto wrote:

| String explode (name might be changed) returns an array of fixnums,
| which means s.explode.length == s.length (String#size now returns the
| byte length of the string under the current M17N prototype, but I
| consider it's a wrong decision, and will be fixed in the 1.9).

Will there be a way of getting the byte length?

Also, if s.explode.length == s.length, what does explode return for the
same character in different encodings?

1. "\x{30b9}".explode (encoding = utf16) => [ 0x30b9 ]?
2. "\x{b930}".explode (encoding = utf16le) => [ 0x30b9 ]?
3. "\x{e382b9}".explode (encoding = utf8) => [ 0xe3, 0x82, 0xb9 ] or [ 0xe382b9 ]?

For 3., s.explode.length == s.length implies the single-element answer,
[ 0xe382b9 ], assuming s.length counts characters rather than bytes.
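
To make the three cases concrete, here is a rough sketch. It assumes a
Ruby with encoding-aware strings (1.9 or later), and the explode helper
below is only a hypothetical stand-in built on String#codepoints; it is
not the method proposed above, whose name and behaviour are still open.

```ruby
# Hypothetical stand-in for the proposed String#explode:
# one Integer per character, whatever the string's encoding.
def explode(str)
  str.codepoints
end

su_utf8    = "\u30b9"                       # KATAKANA LETTER SU as UTF-8
su_utf16be = su_utf8.encode("UTF-16BE")     # same character, big-endian UTF-16
su_utf16le = su_utf8.encode("UTF-16LE")     # same character, little-endian UTF-16

[su_utf8, su_utf16be, su_utf16le].each do |s|
  integers = explode(s)
  hex = integers.map { |i| format("0x%x", i) }.join(", ")
  # s.length counts characters here, so it matches integers.length.
  puts "#{s.encoding}: [#{hex}] length=#{s.length}"
end
```

Under these assumptions all three strings yield the same single integer,
the Unicode code point, and not a number built by packing the encoded
bytes together.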

IIRC a UTF-8 character is not considered a single number, but a sequence
of bytes, which can then be decoded to give you the Unicode character
(i.e. the code point in the Unicode character set, rather than its
encoded form).  Other encodings may be similar.

Also, if 2. is true, is there a way of getting the truly raw bytes?
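
For comparison, this is how the byte-level view looks in a Ruby with
encoding-aware strings (1.9 or later). It is only a sketch of one
possible answer; String#bytesize, String#bytes and force_encoding are
not part of the proposal quoted above.

```ruby
su_utf16le = "\u30b9".encode("UTF-16LE")    # one character, two bytes

char_length  = su_utf16le.length            # counts characters
byte_length  = su_utf16le.bytesize          # counts bytes in the buffer
raw_bytes    = su_utf16le.bytes             # bytes in storage order
raw_unpacked = su_utf16le.unpack("C*")      # the same bytes via unpack

# Reinterpreting a copy as binary makes every byte its own "character".
binary_view = su_utf16le.dup.force_encoding(Encoding::BINARY)

puts "chars=#{char_length} bytes=#{byte_length}"
puts raw_bytes.map { |b| format("0x%02x", b) }.join(" ")
puts "binary length=#{binary_view.length}"
```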

| I think it will be
| 
|   Integer#chr(encoding=script's_default)
| 
| to get a string corresponding a codepoint.  The is the place I haven't
| made design decision.  But you will have something like this.

This implies

"\x{e382b9}".explode (encoding = utf8) => [ 0xe382b9 ]

since you need a single value (Integer) for chr, but 0xe382b9.chr("utf8")
doesn't feel right to me.
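
A small sketch of the difference, assuming an Integer#chr that takes an
encoding name as it does in Ruby 1.9 and later (the design quoted above
was not yet settled, so treat this as illustration only):

```ruby
# The code point U+30B9 converts directly to a one-character string.
from_codepoint = 0x30b9.chr("UTF-8")

# The UTF-8 bytes have to be assembled as bytes, then tagged as UTF-8.
from_bytes = [0xe3, 0x82, 0xb9].pack("C*").force_encoding("UTF-8")

# The bytes packed into one Integer are not a code point at all:
# 0xe382b9 lies beyond U+10FFFF, the top of the Unicode range.
begin
  0xe382b9.chr("UTF-8")
  packed_bytes_accepted = true
rescue RangeError
  packed_bytes_accepted = false
end

puts from_codepoint == from_bytes
puts packed_bytes_accepted
```

Here chr takes the code point, and the packed byte value is rejected,
which matches the feeling that 0xe382b9.chr("utf8") is the wrong shape.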

Wes