Andrew S. Townley wrote:
> Thanks for the replies. Actually, as I was doing something else,
> another option occurred to me which seems both to a) work properly and
> b) be safe(-ish):
>
> irb(main):001:0> $KCODE = 'u'
> => "u"
> irb(main):002:0> s = "€"
> => "€"
> irb(main):003:0> x = s.dump
> => "\"\\342\\202\\254\""
> irb(main):004:0> t = ""
> => ""
> irb(main):005:0> t.instance_eval x
> => "€"

irb(main):001:0> t = ""
=> ""
irb(main):002:0> t.instance_eval "`ls`"
=> "tmp.txt\ntmp.rb\n"

> Since all I ever want is to have the data back in the string, and string
> doesn't have any methods that are likely to cause problems, this might
> be a reasonable short-to-medium term solution. It still makes me a bit
> uncomfortable though, because I don't really want anything other than
> the encoded characters handled.

Be sure your string does not include something like `rm -rf` ...

> I can't use Marshal, because I need to have the data available as plain
> text (hence the quoted strings) which isn't necessarily guaranteed to be
> always processed by Ruby. I chose String#dump because it seemed like it
> would always generate a "safe" string that would be parsed using normal
> quote literal recognition. I hadn't tested it until recently with lots
> of Unicode data, because I simply hadn't gotten there yet. I was just
> lucky...

How about Array#pack? It can escape strings as MIME quoted-printable:

irb(main):001:0> s = "abcd€fghi"
=> "abcd€fghi"
irb(main):002:0> t = [s].pack("M")
=> "abcd=E2=82=ACfghi=\n"
irb(main):003:0> t.unpack("M")[0].force_encoding("UTF-8")
=> "abcd€fghi"
# The force_encoding call is required on Ruby 1.9.

> Even the Unicode handling is straightforward enough, and since I posed
> the question, I found this blog:
> http://dilettantes.code4lib.org/2009/04/parsing-escaped-unicode-in-ruby/
> which talks about modifying the JSON parser approach.
> I might be able to do that, or, I might need to end up writing my own
> serializer/deserializer, since at this stage (over a year), I've a lot
> of legacy data lying around that was created with this approach.

JSON is part of Ruby's standard library these days (1.9 and above). Using
it might be easier than you think.

irb(main):001:0> require 'json'
=> true
irb(main):002:0> "€".to_json
=> "\"\\u20ac\""

> I guess, I could write a one-off clean-up utility for the data that I
> have now and then use the JSON library just to encode/decode the
> strings, but that seems like overkill.
>
> My goals here are interoperability, reuse, ease of adapting to my
> existing code (in that order). Until I ran across the site, I hadn't
> thought about the JSON approach, but it might make the most sense for
> interoperable data. Mind you, I only care about safe string
> serialization/deserialization, and I've no use in the application for
> the rest of the JSON spec.

Generally speaking, you cannot be safe as long as eval or eval-type
methods are involved. So you have to either (1) write your own
deserializer without evals, or (2) use an existing one like JSON. I think
using an existing library is a good idea for interoperability, so JSON
might not be that much overkill. Quoted-printable is defined in an RFC
(RFC 2045), so it might also be a good alternative.

> Changing the question a little: does anyone know of the best way to
> serialize and parse strings containing Unicode and other non-printing
> characters? Ideally, I'd like to have something that works like
> String#dump except that it used escaped Unicode code point references,
> e.g. \uxxxx and \Uxxxxxxxx, and handles all of the "usual suspects" like
> \", \\, etc.

If you want \uxxxx-style escapes, the JSON library is the best bet, I
think. Another choice is the YAML stdlib, but it generates backslashed
escapes, so you would need to convert them anyway.
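
To make the JSON suggestion concrete, here is a minimal sketch of string-only encode/decode helpers. The names encode_ascii and decode_ascii are hypothetical, not part of any library. Two assumptions: the string is wrapped in an array because older json versions refuse a bare string as a top-level document, and ascii_only: true is passed because newer json versions may leave non-ASCII characters unescaped unless asked.

```ruby
require 'json'

# Hypothetical helper: turn a UTF-8 string into a double-quoted,
# ASCII-only literal with \uXXXX escapes. The array wrapper is only
# there for older json versions; the brackets are stripped again.
def encode_ascii(str)
  JSON.generate([str], ascii_only: true)[1..-2]
end

# Hypothetical helper: the reverse direction. Only the JSON parser
# runs here, no eval.
def decode_ascii(text)
  JSON.parse("[#{text}]").first
end

quoted = encode_ascii("abcd\u20ACfghi")
puts quoted
puts decode_ascii(quoted)
```

The output is plain ASCII text that any JSON parser in another language should be able to read back, which seems to match the interoperability goal.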
> Doing some more googling, I also came across this, but I'm not sure what
> the status of it is, and I'm not sure that it addresses my issue either.
> It seems to be more about processing Unicode rather than serialization
> of Unicode to ASCII. (http://snippets.dzone.com/posts/show/4527).
>
> [much time passes...including lunch]
>
> After arsing around for a long time with various stupid stuff, I finally
> came up with this. I don't really like it, but it seems to do the job.
> Comments welcome:
>
> irb(main):026:0> euro = "€"
> => "€"
> irb(main):027:0> x = euro.dump
> => "\"\\342\\202\\254\""
> irb(main):028:0> x.gsub(/\\(\d\d\d)/) { [ $1.oct ].pack("c") }[1..-2]
> => "€"
>
> However, this doesn't get me in/out of the "standard" Unicode escapes.
>
> Thanks in advance for any ideas or suggestions.
>
> Cheers,
>
> ast
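
For the legacy data, the gsub idea quoted above could be tightened into a small eval-free decoder. This is only a sketch under stated assumptions: undump_legacy and SIMPLE_ESCAPES are hypothetical names, the input is assumed to be a double-quoted String#dump result using octal or hex byte escapes, and the decoded bytes are assumed to be UTF-8. It handles the escape in a single pass, so an escaped backslash followed by digits is not mistaken for an octal escape.

```ruby
# Hypothetical table of the single-character escapes handled by this sketch.
SIMPLE_ESCAPES = {
  "n" => "\n", "t" => "\t", "r" => "\r", "f" => "\f", "v" => "\v",
  "a" => "\a", "b" => "\b", "e" => "\e",
  "\\" => "\\", '"' => '"', "#" => "#"
}.freeze

# Hypothetical helper: decode a dumped string without eval.
# Raises ArgumentError for input it does not understand
# (including \u escapes, which this sketch does not cover).
def undump_legacy(dumped)
  body = dumped[/\A"(.*)"\z/m, 1]
  raise ArgumentError, "not a dumped string" if body.nil?

  # Work on raw bytes; each backslash sequence is consumed exactly once.
  decoded = body.b.gsub(/\\(?:([0-7]{1,3})|x([0-9a-fA-F]{1,2})|(.))/m) do
    if $1
      [$1.oct & 0xFF].pack("C")   # octal byte escape, e.g. \342
    elsif $2
      [$2.hex].pack("C")          # hex byte escape, e.g. \xE2
    else
      SIMPLE_ESCAPES.fetch($3) { raise ArgumentError, "unknown escape" }
    end
  end
  decoded.force_encoding("UTF-8") # assumption: the data was UTF-8
end

puts undump_legacy('"\342\202\254"')
```

A one-off clean-up utility could run old records through something like this once and then write them back out with the JSON library.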