At 19:44 08/01/12, Vincent Isambart wrote:
>
>On Jan 12, 2008, at 10:53 AM, Martin Duerst wrote:
>
>> This might slightly change once we introduce a third argument to
>> String#encode. This third argument, as I currently plan it, should
>> be able to express things such as "convert non-convertibles to
>> a replacement character" or "simply drop non-convertible data"
>> or so.
>
>
>Instead of (or in addition to) this third argument, what about having  
>the String#encode function take a block?

That's also planned. But a third argument can be much faster for
the simple cases, which is why I don't want to exclude it.
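
For illustration only: in Ruby as it was later released (1.9 and
after), the extra argument ended up as an options hash rather than
the single positional argument discussed here. A minimal sketch of
the two simple cases, using the released :undef and :replace options
(the sample string is made up):

```ruby
# "café €5" in UTF-8; neither é nor € exists in US-ASCII.
s = "caf\u00e9 \u20ac5"

# Convert non-convertible characters to a replacement character.
replaced = s.encode("US-ASCII", undef: :replace, replace: "?")

# Simply drop non-convertible characters (empty replacement string).
dropped = s.encode("US-ASCII", undef: :replace, replace: "")

puts replaced  # caf? ?5
puts dropped   # caf 5
```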

>I haven't thought about it a lot, but something like this could be useful:
>- to replace unknown characters with '?'
>str.encode('UTF-8') { '?' }
>- to strip the unknown characters
>str.encode('UTF-8') { '' }
>- to transcode from a mix of UTF-8 and ISO-8859-1 to ISO-8859-1 (yes  
>that may sound strange but I've seen cases when it may appear with  
>badly managed data)
>str.encode('ISO-8859-1', 'UTF-8') { |s| s }
>
>(to make everything simpler I did not take into account the encoding of  
>the string returned by the block, checking it may be a good thing, I do  
>not know)
>
>You may even want to be able to control if the block is called with  
>either each unknown byte or each sequence of unknown bytes. Giving  
>the position in the start string to the block may also be a good idea.
>
>I'm not sure this idea could have any use except in the case of  
>data in mixed encodings (and I'm not even sure if this is common or  
>not), and it probably needs some more thought, but it was just an idea  
>that crossed my mind and seemed more 'Ruby-like' than just an  
>additional parameter. The encode function may of course be made to  
>support both the additional parameter and the block.
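
For illustration only: the mixed-encoding case can be approximated in
later Ruby releases (2.1 and after) without a block on String#encode,
by using String#scrub, whose block receives each invalid byte
sequence. This is a sketch under the assumption that every byte that
is not valid UTF-8 is meant as ISO-8859-1; the helper name
mixed_to_latin1 and the sample data are hypothetical.

```ruby
# Hypothetical helper: treat str as UTF-8, reinterpret every invalid
# byte sequence as ISO-8859-1, then transcode the result to ISO-8859-1.
def mixed_to_latin1(str)
  utf8 = str.dup.force_encoding("UTF-8").scrub do |bad|
    # bad holds the raw invalid bytes; read them as Latin-1 instead.
    bad.dup.force_encoding("ISO-8859-1").encode("UTF-8")
  end
  # Raises Encoding::UndefinedConversionError for characters that
  # have no ISO-8859-1 equivalent (for example the euro sign).
  utf8.encode("ISO-8859-1")
end

# "café" with é as UTF-8 (C3 A9), "naïve" with ï as a raw Latin-1 byte (EF).
mixed = "caf\xC3\xA9 na\xEFve".b
result = mixed_to_latin1(mixed)

p result.bytes.map { |b| format("%02X", b) }.join(" ")
# "63 61 66 E9 20 6E 61 EF 76 65"
```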

It's definitely Ruby-like, and there are quite a few use cases.
The one I'm thinking about most is converting non-convertible
characters to escapes of various kinds. I have thought about
quite a few of the cases you mention above, but I have to
think through your 'convert from mixed encoding' case a bit
more.
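
For illustration only: converting non-convertible characters to
escapes can be sketched in later Ruby releases with the :fallback
option of String#encode, which accepts a Proc that is called with
each character that has no equivalent in the target encoding. The
\uXXXX escape format and the sample string are just examples, not
a planned default.

```ruby
# Turn each non-convertible character into a \uXXXX escape.
escape = ->(ch) { format("\\u%04X", ch.ord) }

s = "caf\u00e9 \u20ac5"  # "café €5" in UTF-8
escaped = s.encode("US-ASCII", fallback: escape)

puts escaped  # caf\u00E9 \u20AC5
```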

Regards,   Martin.



#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst / it.aoyama.ac.jp