On 29/12/2007, Gonzalo Garramuño <ggarra / advancedsl.com.ar> wrote:
> Austin Ziegler wrote:
> >
> > If you treat it as UTF-16, IConv can handle it. The problem will be
> > when it's *not* actually UTF-16.
> >
> > -austin
>
> Thanks, Austin and Tanaka.
>
> I have no experience in multi-byte character handling, so bear with me.
>
> Could you give me a simple example of a wchar_t not being utf-16 (or
> some function returning a wchar_t that's not utf16) or the composite
> breaking?  Can this be detected so that if the decoding fails, the
> string is treated just as an array of bytes?

wchar_t is typically UTF-16 or UTF-32. However, as far as I know it is
only defined as an implementation-defined wide character
representation (that is, the encoding is not specified).
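
As a quick check of what your own platform does, here is a minimal
sketch in Python (chosen just for brevity) that asks ctypes for the
size of the C wchar_t type. The helper name wchar_width_bits is made up
for this example. Note that the width is only a hint: 16 bits suggests
UTF-16 and 32 bits suggests UTF-32, but it does not prove either.

```python
import ctypes

# Size of the platform's C wchar_t, as seen by ctypes.
WCHAR_BYTES = ctypes.sizeof(ctypes.c_wchar)


def wchar_width_bits():
    """Return the width of wchar_t in bits on this platform."""
    return WCHAR_BYTES * 8


if __name__ == "__main__":
    # The width says nothing certain about the encoding stored in it.
    print("wchar_t is %d bits wide here" % wchar_width_bits())
```
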
If you happen to be on a system where it is UTF-16, you will not find
a counterexample of a wchar_t that is not UTF-16. But I could write a
libc implementation where the data in wchar_t is stored as the bit
complement of the UTF-32 encoded character rotated three bits to the
right (or whatever else strikes my fancy), and it would still be
conforming. You would no longer be able to iconv the wchar_t strings,
though, and there is probably no way to find out which encoding is in
use.
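
To make the detection question concrete, here is a rough sketch in
Python (again only for brevity; the same idea applies with iconv). The
names decode_wide and complement_utf32 are invented for this example,
and the "complement" encoding is a simplified version of the made-up
one above, without the rotation. It shows that falling back to raw
bytes works when the data is invalid as UTF-16, but that a successful
decode does not prove the data really was UTF-16.

```python
import struct


def decode_wide(raw):
    """Try to decode raw bytes as UTF-16-LE; fall back to the bytes."""
    try:
        return raw.decode("utf-16-le")
    except UnicodeDecodeError:
        # Not valid UTF-16: treat it as an opaque array of bytes.
        return raw


def complement_utf32(text):
    """Hypothetical encoding: bit complement of each UTF-32 code unit."""
    return b"".join(struct.pack("<I", ord(ch) ^ 0xFFFFFFFF) for ch in text)


genuine = "hello".encode("utf-16-le")
lone_surrogate = b"\x00\xd8"        # unpaired high surrogate, invalid
disguised = complement_utf32("A")   # not UTF-16, but decodes anyway

if __name__ == "__main__":
    print(repr(decode_wide(genuine)))         # decoded text
    print(repr(decode_wide(lone_surrogate)))  # raw bytes (fallback)
    print(repr(decode_wide(disguised)))       # garbage text, no error
```

The last case is the problematic one: the decoder raises no error, so
the fallback never triggers and you silently get the wrong characters.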

I suspect there is not even a requirement to provide conversion
functions that turn wide character strings into anything sane, or the
other way around.

There are the mbs <=> wcs conversion routines (mbstowcs and wcstombs),
but these depend on the locale. So you need
a) a known locale with a known sane encoding, and
b) a separate process in which you set this locale to convert between
   the known sane encoding and wide strings (so that you do not break
   the locale settings of your application).

Thanks

Michal