James Gray wrote:
> I'm pretty sure you are in the minority with this opinion.

Quite possibly :-)

> You really like this?
> 
> $ ruby -e 'p "Résumé"[0..1]'
> "R\303"
> 
> How often is that going to be the desired result?

Well, if I were extracting the first two bytes from a JPEG header, that 
would be exactly what I'd expect. I've very rarely wanted to extract the 
first two *characters* from a string. I can think of one example: a 
string truncation helper in a web page.

    def trunc(string, maxlen=50)
      if string.length > maxlen
        string = string[0,maxlen-3] + "..."
      end
      string
    end

I'll certainly agree that's something you'd want to do, and /.{,50}/u is 
an ugly way of doing it. In any case, I'm not saying there shouldn't be 
any m17n support, or even that tagging strings with encodings is in 
itself wrong, as long as the semantic implications are made clear.
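To make the byte/character distinction concrete, here's a small sketch of the 1.9-style semantics (assuming a UTF-8 source file; the string is just an example):

```ruby
# encoding: UTF-8
# Sketch of 1.9 semantics: String#[] indexes by character,
# while the underlying bytes are still reachable explicitly.
s = "Résumé"

s.length          # 6 characters
s.bytesize        # 8 bytes ("é" is two bytes in UTF-8)
s[0..1]           # "Ré" -- the first two *characters*
s.bytes.first(2)  # [82, 195] -- the first two *bytes* (1.8's old answer)
```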

The number one bugbear I have is that, unless you take a number of 
specific steps to avoid it, program behaviour is inconsistent. You can 
run the *same* program with exactly the *same* input data on two 
different machines, and they will process it differently, possibly even 
crashing in one case. If someone has a problem running your app, it's 
no longer sufficient to ask which O/S and ruby version they are running 
in order to replicate the problem.

Consider an app which is bundled with HTML templates, which the app 
reads using File.read(). The templates happen to be written using, say, 
UTF-8. It all works fine on my machine, and passes all tests. However it 
barfs when run on someone else's machine, because their environment 
variables are different.

I think that LC_ALL is a very poor predictor of what encoding a specific 
file is in. Ruby doesn't trust it for source files (it uses #encoding 
tags instead), so why trust it for data?
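For illustration, this is the locale-derived state in question; the value printed is whatever the current environment dictates, which is exactly the problem:

```ruby
# Sketch: 1.9 derives the default external encoding from the locale
# (LANG / LC_ALL), so this same line prints different things on
# different machines.
p Encoding.default_external   # e.g. #<Encoding:UTF-8> under en_GB.UTF-8
```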

Now, if the default external encoding were fixed as (say) UTF-8, that 
would be more sane. The default behaviour would then be the same on any 
machine where ruby is installed:

- File#gets returns a string with encoding='UTF-8'
- File#read returns a string with encoding='BINARY'

unless explicitly overridden, e.g. when the file is opened. So if these 
hypothetical HTML templates are written in ISO-8859-15, you would be 
forced to declare this in your program.
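Under that scheme, declaring the template's encoding might look like this (the file name is hypothetical, but the mode-string and :encoding APIs do exist in 1.9):

```ruby
require "tmpdir"

# Hypothetical template written in ISO-8859-15 ("café" as Latin-1 bytes).
path = File.join(Dir.mktmpdir, "template.html")
File.open(path, "wb") { |f| f.write "caf\xE9".force_encoding("ASCII-8BIT") }

# Declare the encoding at the point of reading, instead of trusting LC_ALL:
text = File.read(path, encoding: "ISO-8859-15")
text.encoding           # #<Encoding:ISO-8859-15>
text.encode("UTF-8")    # "café"
```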

In any case, I'm used to having my data treated as binary unless I 
explicitly ask otherwise. e.g.

$ echo "αβγ" | wc
      1       1       7
$ echo "αβγ" | wc -m
4

[Ubuntu Hardy, default setup with LANG=en_GB.UTF-8]

> Can you list what's not yet covered in my blog series?

I've posted a bunch of lists before. Every time I try out some feature, 
because it's undocumented, the test turns up more questions than it 
answers. Maybe I really should go ahead and document it all, but that 
would be a very large project.

Trying things out in irb used to be a good way to explore ruby, but 
that's no good in ruby 1.9 because irb's behaviour is not consistent 
with script behaviour. For example:

$ irb19
irb(main):001:0> "foo".encoding
=> #<Encoding:US-ASCII>
irb(main):002:0> /foo/.encoding
=> #<Encoding:US-ASCII>
irb(main):003:0> "fooé".encoding
=> #<Encoding:UTF-8>
irb(main):004:0> /fooé/.encoding
=> #<Encoding:UTF-8>

Now try running this program:

p "foo".encoding
p /foo/.encoding
p "fooé".encoding
p /fooé/.encoding

It barfs on the multi-byte chars. That's reasonable in the absence of 
knowledge about the source file, so now add an #encoding line:

#encoding: UTF-8
p "foo".encoding
p /foo/.encoding
p "fooé".encoding
p /fooé/.encoding

and you still get a different answer to IRB. The first string gets an 
encoding of UTF-8 instead of US-ASCII; and yet the /foo/ regexp gets an 
encoding of US-ASCII in both cases.

This is compounded by the hidden state which remembers whether a 
particular string is all 7-bit characters or not. That is, although 
"foo" and "fooé" are both marked as having the identical encoding UTF-8, 
they are actually treated *differently* by the encoding rules. You have 
to test using the #ascii_only? method. And yet a regexp literal 
apparently follows a different rule. Except when you are in IRB.
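The hidden flag can be inspected directly; a minimal sketch (again assuming a UTF-8 source file):

```ruby
# encoding: UTF-8
a = "foo"     # tagged UTF-8, but every byte is 7-bit
b = "fooé"    # tagged UTF-8, contains a multi-byte character

a.encoding == b.encoding   # true -- identical encoding tags
a.ascii_only?              # true
b.ascii_only?              # false

# The flag changes behaviour: mixing encodings is only allowed while
# one side is ascii_only?.
latin = "caf\xE9".force_encoding("ISO-8859-15")
a + latin   # fine: result is ISO-8859-15
begin
  b + latin # raises, because b is not ascii_only?
rescue Encoding::CompatibilityError
  # incompatible character encodings: UTF-8 and ISO-8859-15
end
```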

> It means that I think your comments are doing harm to the 1.9
> migration and I can't find the good you are doing to balance that.

I don't think what I'm saying would stop any library author from 
modifying their library to work with 1.9 if they so wish. They have to 
make up their own minds.

I believe the worst long-term problems are likely to be C extensions. I 
have seen no hints at all for C extension writers on how to handle 
strings properly (especially the hidden ascii_only? state) so I believe 
these are likely to have obscure bugs for some time.

Regards,

Brian.
-- 
Posted via http://www.ruby-forum.com/.