> Sorry, but "reading" CGI params is a red herring. You may get it as one
> thing and then convert it to something else.

Exactly.

> > Likewise, when you read from a file/socket/whatever you might not be
> > getting a real string, you might be getting a byte array. They are
> > fundamentally different things, a byte array may happen to contain
> > text at some point, but some time later it may be just a stream of
> > data. Conversely a String _always_ contains human-readable text in
> > whatever encoding you want.
>
> Okay. What class should I get here?
>
>   data = File.open("file.txt", "rb") { |f| f.read }

A byte vector. Unknown input, so you just get a stream of bytes.

> Under the people who want separate ByteVector and String class, I'll
> need *two* APIs:
>
>   st = File.open("file.txt", "rb") { |f| f.read_string }
>   bv = File.open("file.txt", "rb") { |f| f.read_bytes }

Why? This looks needlessly complex.

  string = File.open('file.txt', 'r') { |f| f.read.to_s(:utf8) }

Or possibly

  string = File.open('file.txt', 'r') { |f| f.read(:utf8) }
  bytes = File.open('file.txt', 'r') { |f| f.read(:bytearray) }

with no argument meaning a default encoding is assumed. But with this
approach the same class could be used for both, which takes us full
circle ;)

> > As someone who has to work with Unicode in PHP, I'd say it's important
> > to separate the types. If you want to display something to a user you
> > have to know what it is, but when you're reading a file you don't
> > care, unless you know what's in it.
>
> The problem here is not unification. The problem here is that PHP is
> stupid. It is generally recognised that Ruby's API decisions are much
> smarter than most other languages, and this is a good example of where
> this would happen.

Which is why I'm using Ruby, even though I'm paid for PHP. Ruby is by far
the nicer language.
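
For what it's worth, here is a minimal sketch of the "same class for both" idea from the read(:utf8) / read(:bytearray) example above. It assumes a Ruby with per-string encodings (1.9 or later). `read_as` is a hypothetical helper made up for illustration, not an existing API, and the temp file is only there so the example runs on its own.

```ruby
require 'tempfile'

# Hypothetical helper: one read method, one String class.
# With no encoding given the input is treated as unknown and the
# result stays a raw byte string (tagged ASCII-8BIT).
def read_as(path, encoding = nil)
  data = File.binread(path)             # raw bytes, no transcoding
  return data if encoding.nil?

  text = data.force_encoding(encoding)  # relabel the bytes, do not convert them
  raise ArgumentError, "not valid #{encoding}" unless text.valid_encoding?
  text
end

# Sample file containing "café" as UTF-8 (5 bytes, 4 characters).
file = Tempfile.new('sample')
file.binmode
file.write("caf\u00e9")
file.close

bytes  = read_as(file.path)                   # unknown input: just bytes
string = read_as(file.path, Encoding::UTF_8)  # caller states what it is

puts bytes.encoding   # ASCII-8BIT
puts bytes.length     # 5
puts string.encoding  # UTF-8
puts string.length    # 4
```

Both results are the same class with the same bytes; only the encoding label differs, which is the "full circle" point above.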

To my untrained eye, the best approach would be some sort of global
encoding setting that all libraries operate on, with the developer
responsible for ensuring that all data is read in that encoding.
Hopefully that would make dealing with legacy data easier. The ideal
situation would be for everything to be in one encoding, but that just
doesn't happen.

-- 
Phillip Hutchings
http://www.sitharus.com/