Issue #7742 has been updated by duerst (Martin Dürst).


thegcat (Felix Schäfer) wrote:
> We (Planio, https://plan.io) are also in need of Windows-1258 to UTF-8 conversion, is there anything we can do to help?

As explained above, the problem is with normalization. If you are okay with a version that just does one-to-one conversion, then that can be produced rather quickly (maybe even over the weekend). But most Vietnamese content, e.g. on the Web, is normalized (NFC), and I guess you'd want to have that, too. But then you also have to be careful with respect to round-tripping, because windows-1258->UTF-8 will be .encode('UTF-8', 'windows-1258').to_nfc or so, but backwards conversion would need special code because neither NFC nor NFD can directly be converted to windows-1258.
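To illustrate the normalization issue, here is a minimal sketch that works on Unicode strings only and does not touch a windows-1258 converter. It assumes that a one-to-one conversion would produce the letter "ệ" as precomposed ê (U+00EA) followed by COMBINING DOT BELOW (U+0323), and it uses String#unicode_normalize, which is available in Ruby 2.2 and later (not in the 1.9.3 mentioned in the report). The variable names are only for illustration.

```ruby
# Assumed result of a one-to-one windows-1258 -> UTF-8 conversion for "ệ":
# precomposed ê (U+00EA) followed by COMBINING DOT BELOW (U+0323).
one_to_one = "\u00EA\u0323"

# Normalizing afterwards (requires Ruby >= 2.2).
nfc = one_to_one.unicode_normalize(:nfc)  # single code point U+1EC7
nfd = one_to_one.unicode_normalize(:nfd)  # e + U+0323 + U+0302

# The one-to-one form matches neither normalization form, which is why
# the backwards conversion cannot simply start from NFC or NFD.
puts one_to_one == nfc  # false
puts one_to_one == nfd  # false
puts nfc.codepoints.map { |cp| format("U+%04X", cp) }.join(" ")
puts nfd.codepoints.map { |cp| format("U+%04X", cp) }.join(" ")
```
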

A slightly more elaborate version would do one-to-one conversion from windows-1258 to UTF-8, but would convert both that kind of data and data in NFC (but not arbitrarily non-normalized data) back to windows-1258. Such a converter might be relatively easy to produce, or it might be more difficult; I can't say which off the top of my head.
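The kind of special code needed for the backwards direction could look roughly like the sketch below. It is a simplified illustration, not a proposed implementation: the helper name to_cp1258_form and the constant TONE_MARKS are hypothetical, and the sketch assumes that the target form always keeps the five Vietnamese tone marks as combining characters while composing everything else. A real converter would need the actual windows-1258 mapping table, which also contains some precomposed letters with tone marks. It again relies on String#unicode_normalize (Ruby 2.2 and later).

```ruby
# Hypothetical: tone marks assumed to stay as combining characters
# (grave, acute, tilde, hook above, dot below).
TONE_MARKS = "\u0300\u0301\u0303\u0309\u0323"

# Hypothetical helper: bring NFC (or already mixed) text into a form where
# tone marks are combining and all other marks are composed with the base.
def to_cp1258_form(str)
  # Work cluster by cluster: one non-mark character plus its combining marks.
  str.unicode_normalize(:nfd).gsub(/\P{M}\p{M}*/) do |cluster|
    tones, rest = cluster.each_char.partition { |c| TONE_MARKS.include?(c) }
    # Recompose base + non-tone marks (e.g. e + circumflex -> ê),
    # then append the tone marks uncomposed.
    rest.join.unicode_normalize(:nfc) + tones.join
  end
end

mixed = to_cp1258_form("Vi\u1EC7t")  # input in NFC
puts mixed.codepoints.map { |cp| format("U+%04X", cp) }.join(" ")
```

Under these assumptions, both NFC input and input that is already in the mixed form end up as the same code point sequence, which is the property the round trip would depend on.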

So if you use a normalization library after conversion, that might work out, but it would be somewhat of a special case. Also, when we later introduce a different (more normalizing) converter, that may be seen as a non-backwards-compatible change.

One solution to backwards-compatibility would be to use different encoding labels to differentiate versions of conversion. But this has the problem that in the current state of affairs, it introduces additional "encodings" that are not really different, but just variants produced by different conversions. That's the problem e.g. with the current UTF8-MAC, and I don't want to create more of these.

A more long-term solution would be to introduce a difference between encodings and conversions, where e.g. one could use .encode('windows-1258--non-normalized', 'utf-8') or so to indicate a non-normalized version of conversion. But that would need some more general discussion among the Ruby experts in this field.

So Felix, if you tell me what you need, and we can make sure that it doesn't affect later backwards-compatibility, I might be able to work on something.
----------------------------------------
Bug #7742: System encoding (Windows-1258) is not recognized by Ruby to  convert back to UTF-8 
https://bugs.ruby-lang.org/issues/7742#change-44183

Author: Mars (Hong Ha Dang)
Status: Open
Priority: Normal
Assignee: duerst (Martin Dürst)
Category: 
Target version: next minor
ruby -v: 1.9.3
Backport: 


I installed RailsInstaller on Windows 8. After the install completed, the screen switched to 
> the RailsInstaller configuration in cmd (step 2). I entered user name: DHH Mars and 
> email: dhhma... / gmail.com. It ran and printed the following message: 
> 
> C:/RailsInstaller/scripts/config_check.rb:64:in 'exist?': code converter not 
> found <Windows-1258 to UTF-8> <Encoding::ConverterNotFoundError> from 
> C:/RailsInstaller/scripts/config_check.rb:64:in 'main' 
> 
> C:\Sites> 


-- 
http://bugs.ruby-lang.org/