On 26.11.2010 01:42, David Masover wrote:
> On Wednesday, November 24, 2010 08:40:22 pm Jörg W Mittag wrote:
>> David Masover wrote:
>>> Java at least did this sanely -- UTF16 is at least a fixed width. If
>>> you're going to force a single encoding, why wouldn't you use
>>> fixed-width strings?
>>
>> Actually, it's not.
>
> Whoops, my mistake. I guess now I'm confused as to why they went with UTF-16
> -- I always assumed it simply truncated things which can't be represented in
> 16 bits.

The JLS is a bit difficult to read IMHO.  A char is 16 bits wide, so a
single char covers the range of code points 0000 to FFFF.

http://java.sun.com/docs/books/jls/third_edition/html/typesValues.html#4.2.1

Characters with code points greater than FFFF are called "supplementary
characters" and while UTF-16 provides encodings for them as well, these
need two code units (four bytes).  They write "The Java programming
language represents text in sequences of 16-bit code units, using the
UTF-16 encoding.":

http://java.sun.com/docs/books/jls/third_edition/html/lexical.html#95413

IMHO this is not very precise: a single char cannot directly represent
a supplementary character, so all calculations based on char
effectively work with just a subset of UTF-16.  If you want to work
with supplementary characters, things get really awful.  Then you need
methods like this one:

http://download.oracle.com/javase/6/docs/api/java/lang/Character.html#toChars(int)
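
To make that concrete, here is a minimal sketch of what
Character.toChars() hands back.  The class and method names
(ToCharsDemo, encode) are made up for illustration; 013553 is the
supplementary code point used further down in this mail.

```java
final class ToCharsDemo {
    // Encodes one code point as UTF-16 code units:
    // one char inside the BMP, two chars (a surrogate pair) above FFFF.
    static char[] encode(int codePoint) {
        return Character.toChars(codePoint);
    }

    public static void main(String[] args) {
        char[] bmp = encode(0x00E9);            // fits into a single char
        char[] supplementary = encode(0x13553); // needs a surrogate pair

        System.out.printf("00e9   -> %d code unit(s)%n", bmp.length);
        System.out.printf("013553 -> %d code unit(s): %04x %04x%n",
                supplementary.length,
                (int) supplementary[0],
                (int) supplementary[1]);
    }
}
```

The second line of output should show the two code units d80d dd53.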

And if you stuff this sequence into a String, all of a sudden
String.length() no longer returns the length in characters, which is
in line with what the JavaDoc states (it counts 16-bit code units):

http://download.oracle.com/javase/6/docs/api/java/lang/String.html#length()
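
A small sketch of the mismatch, assuming a test string made of "a",
the supplementary code point 013553 and "b".  The names LengthDemo,
sample, codeUnits and codePoints are only for illustration;
codePointCount() is the standard method that counts code points
instead of code units.

```java
final class LengthDemo {
    // Builds "a" + U+13553 + "b": three code points, four UTF-16 code units.
    static String sample() {
        return new StringBuilder()
                .append('a')
                .appendCodePoint(0x13553)
                .append('b')
                .toString();
    }

    // What String.length() really measures: 16-bit code units.
    static int codeUnits(String s) {
        return s.length();
    }

    // The "length in characters" most code actually wants.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String s = sample();
        System.out.println("length()         = " + codeUnits(s));
        System.out.println("codePointCount() = " + codePoints(s));
    }
}
```

Here length() reports 4 while codePointCount() reports 3.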

Unfortunately the majority of programs I have seen never take this into
account and use String.length() as "length in characters".  This awful
mixture becomes apparent in the JavaDoc of class Character, which
explicitly states that there are two ways to deal with characters:

1. type char (no supplementary characters supported)
2. type int (with supplementary characters)

http://download.oracle.com/javase/6/docs/api/java/lang/Character.html#unicode
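
For the second way (type int), walking a String by code point could
look roughly like the sketch below.  CodePointWalk and its codePoints
method are hypothetical names; codePointAt() and Character.charCount()
are the standard calls that step over both halves of a surrogate pair
at once.

```java
import java.util.ArrayList;
import java.util.List;

final class CodePointWalk {
    // Collects the code points of s as ints instead of iterating over chars.
    static List<Integer> codePoints(String s) {
        List<Integer> result = new ArrayList<Integer>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);    // reads one or two chars
            result.add(cp);
            i += Character.charCount(cp); // advance by 1 or 2 code units
        }
        return result;
    }

    public static void main(String[] args) {
        String s = new String(Character.toChars(0x13553)) + "x";
        for (int cp : codePoints(s)) {
            System.out.println(Integer.toHexString(cp));
        }
    }
}
```

A plain charAt() loop over the same string would see three chars, two
of them surrogates, where this loop sees two code points.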

>> You can produce corrupt strings and slice into a half-character in
>> Java just as you can in Ruby 1.8.
>
> Wait, how?

You can convert a code point above FFFF via Character.toChars() (which
returns a char[] of length 2) and truncate it to 1.  But: Java does not
treat the resulting sequence as invalid, since every value in the range
0000 to FFFF is accepted as a char.  This isn't really robust.  Even
though the docs say that the longest matching sequence is to be
considered during decoding, there is no reliable way to determine
whether d80d dd53 represents a single character (code point 013553) or
two separate characters (code points d80d and dd53).
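
A minimal sketch of the truncation, with made-up names (TruncateDemo,
truncated).  It only shows that the String class accepts the half pair
without complaint; whether you call the result "valid" is exactly the
question discussed here.

```java
import java.util.Arrays;

final class TruncateDemo {
    // Builds the surrogate pair for 013553 and keeps only its first half.
    static String truncated() {
        char[] pair = Character.toChars(0x13553); // d80d dd53
        return new String(Arrays.copyOf(pair, 1)); // only d80d survives
    }

    public static void main(String[] args) {
        String half = truncated();
        int cp = half.codePointAt(0);

        System.out.println("length()           = " + half.length());
        System.out.println("codePointAt(0)     = " + Integer.toHexString(cp));
        System.out.println("isValidCodePoint() = " + Character.isValidCodePoint(cp));
        System.out.println("isHighSurrogate()  = " + Character.isHighSurrogate(half.charAt(0)));
    }
}
```

No exception is thrown anywhere: the string has length 1 and
codePointAt(0) simply returns d80d.  Character.isHighSurrogate() at
least lets you notice that this char was meant to start a pair.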

If you like you can play around a bit with this:
https://gist.github.com/719100

> I mean, yes, you can deliberately build strings out of corrupt data, but if
> you actually work with complete strings and string concatenation, and you
> aren't doing crazy JNI stuff, and you aren't digging into the actual bits of
> the string, I don't see how you can create a truncated string.

Well, you can (see above) but unfortunately it is still valid.  It just
happens to represent a different sequence.

Kind regards

	robert

-- 
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/