Sorry for not responding earlier; it seems I'm not being notified by
email.
The thing is, I do not control the input: it is the browscap file found
here, and it is in ISO-8859-1:
http://browsers.garykeith.com/downloads.asp
\xdf (decimal 223) is a valid ISO-8859-1 code point
(http://en.wikipedia.org/wiki/ISO/IEC_8859-1).
It appears as '?' because my terminal is UTF-8, but the bytes are there:
$ cat test.rb
a = "Der gro\xdfe BilderSauger"
a.each_byte { |b| puts b }
$ ruby test.rb
68
101
114
32
103
114
111
223 <- Here I am
101
32
66
105
108
100
101
114
83
97
117
103
101
114
You can also see that the length is 22, not 25.
Also, if I run
puts a.encode('UTF-8', 'ISO-8859-1')
I see the proper character in my terminal.
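For reference, here is a minimal sketch of that transcoding step with the source encoding tagged explicitly. It assumes the string literal holds the raw byte 0xDF, as in test.rb above; the variable names are only for illustration.

```ruby
# The literal carries the raw byte 0xDF; dup keeps this working
# even when string literals are frozen.
latin1 = "Der gro\xdfe BilderSauger".dup.force_encoding('ISO-8859-1')

# Transcode: the single byte 0xDF becomes the two UTF-8 bytes 0xC3 0x9F.
utf8 = latin1.encode('UTF-8')

puts latin1.length     # 22 characters
puts latin1.bytesize   # 22 bytes in ISO-8859-1
puts utf8.bytesize     # 23 bytes in UTF-8
puts utf8
```

force_encoding only relabels the bytes, while encode actually converts them.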
But when the string is read from a file:
$ cat test.rb
File.open('test.in', 'r:ISO-8859-1').each_line do |l|
  puts l
  puts '***'
  puts l.length
  puts '***'
  l.each_byte { |b| puts b }
end
$ ruby test.rb
Der gro\xdfe BilderSauger
***
25
***
68
101
114
32
103
114
111
92 <- Here
120 <- we
100 <- are
102 <- as 4 ASCII chars '\xdf'
101
32
66
105
108
100
101
114
83
97
117
103
101
114
I also tried putting UTF-8 code points in the file and reading it as
UTF-8, without luck. It seems escape sequences are not interpreted when
reading from a stream, which I can understand.
What I can't figure out is how to interpret these escape sequences when
reading them from a file.
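One approach that might work, as a minimal sketch: treat the line as raw bytes, replace each literal \xHH sequence with the byte it names, then tag the result as ISO-8859-1. The helper name unescape_hex is hypothetical (it is not part of Ruby or of browscap), and the sketch assumes \xHH is the only escape form that appears in the file.

```ruby
# Hypothetical helper: turn the four literal characters "\xHH" into the
# single byte they describe, then label the result with its real encoding.
def unescape_hex(line, source_encoding = 'ISO-8859-1')
  # Work on a binary copy so arbitrary byte values can be inserted safely.
  raw = line.dup.force_encoding(Encoding::BINARY)
  # \h matches one hex digit; $1.hex.chr builds the one-byte replacement.
  raw = raw.gsub(/\\x(\h{2})/) { $1.hex.chr }
  raw.force_encoding(source_encoding)
end

# Single quotes keep the backslash literal, like a line read from the file.
escaped = 'Der gro\xdfe BilderSauger'
fixed = unescape_hex(escaped)

puts escaped.length          # 25
puts fixed.length            # 22
puts fixed.encode('UTF-8')
```

Inside the each_line loop above, this would be called as unescape_hex(l) before printing or transcoding the line.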
--Gilles
--
Posted via http://www.ruby-forum.com/.