On 3/14/06, Bill Kelly <billk / cts.com> wrote:
> From: "Austin Ziegler" <halostatue / gmail.com>
> >
> > On 3/13/06, Anthony DeRobertis <aderobertis / metrics.net> wrote:
> >>
> >>         UTF-8 can take more than one octet to represent a
> >>         character; UTF-16 can take more than two; UTF-32
> >>         more than four; etc.
> >
> > No. UTF-32 does not have surrogates. Unicode is perfectly
> > representable in either 20 or 21 bits. A single character is *always*
> > representable in a uint32_t sized space with UTF-32.
>
> Hi, I have zero background in non-ASCII character representations,
> but the following post has been echoing in my head as a data point
> for... can't believe it's been three-and-a-half years:
>
> http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/46284
>
> Does that have any relation to your current context?  Curt seems to
> be talking not of surrogates, but saying "combining characters"
> mean variable-length issues still exist with UTF-32 ?
>

Well, in some languages you get characters like "LATIN CAPITAL LETTER
A WITH ACUTE" (U+00C1).
In a string you can get either that single precomposed character or
"LATIN CAPITAL LETTER A" followed by "COMBINING ACUTE ACCENT" (U+0301).
The latter form is decomposed: one visible character, but two code
points, so it takes two units even in UTF-32.
And there are libraries for normalizing/composing/decomposing Unicode strings.
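
To make the composed/decomposed distinction concrete, here is a minimal sketch. It assumes a modern Ruby (2.2 or later), where String#unicode_normalize is built in; that method is not something the Ruby of this thread's date offers, and the constant names are only for illustration.

```ruby
# Two spellings of the same visible character.
# (Constant names are illustrative, not from any library.)
COMPOSED   = "\u00C1"    # LATIN CAPITAL LETTER A WITH ACUTE
DECOMPOSED = "A\u0301"   # LATIN CAPITAL LETTER A + COMBINING ACUTE ACCENT

# The code point counts differ although both render as one character.
puts COMPOSED.length      # 1
puts DECOMPOSED.length    # 2

# UTF-32 is fixed-width per code point, so the decomposed form
# still needs two 32-bit units (8 bytes).
puts DECOMPOSED.encode("UTF-32BE").bytesize   # 8

# Without normalization the two strings compare as different.
puts COMPOSED == DECOMPOSED                   # false

# Normalizing to a common form (NFC composes, NFD decomposes)
# makes them comparable.
puts DECOMPOSED.unicode_normalize(:nfc) == COMPOSED    # true
puts COMPOSED.unicode_normalize(:nfd) == DECOMPOSED    # true
```

So the usual advice is to normalize both strings to the same form before comparing or counting characters.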
Thanks
Michal