Andrew S. Townley wrote:

> Thanks for the replies.  Actually, as I was doing something else,
> another option occurred to me which seems both to a) work properly and
> b) be safe(-ish):
> 
> irb(main):001:0> $KCODE = 'u'
> => "u"
> irb(main):002:0> s = "€"
> => "€"
> irb(main):003:0> x = s.dump
> => "\"\\342\\202\\254\""
> irb(main):004:0> t = ""
> => ""
> irb(main):005:0> t.instance_eval x
> => "€"

irb(main):001:0> t = ""
=> ""
irb(main):002:0> t.instance_eval "`ls`"
=> "tmp.txt\ntmp.rb\n"

> Since all I ever want is to have the data back in the string, and string
> doesn't have any methods that are likely to cause problems, this might
> be a reasonable short-to-medium term solution.  It still makes me a bit
> uncomfortable though, because I don't really want anything other than
> the encoded characters handled.

Be sure your string does not include something like `rm -rf` ...
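
For what it's worth, newer Rubies (2.5 and later, so this is an assumption
about your interpreter, not something available on 1.8/1.9) ship
String#undump, which reverses String#dump without evaluating anything.  A
minimal sketch; it does not check whether undump accepts the octal escapes
that older Rubies produced:

```ruby
# Assumes Ruby 2.5 or later, where String#undump exists.
original = "\u20AC and `ls`"   # euro sign plus something that looks like a shell command

dumped   = original.dump       # ASCII-only, double-quoted literal
restored = dumped.undump       # parsed as data; the backticks are never executed
```

The backticks come back as ordinary characters, which is the behaviour
instance_eval cannot give you.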

> I can't use Marshal, because I need to have the data available as plain
> text (hence the quoted strings) which isn't necessarily guaranteed to be
> always processed by Ruby.  I chose String#dump because it seemed like it
> would always generate a "safe" string that would be parsed using normal
> quote literal recognition.  I hadn't tested it until recently with lots
> of Unicode data, because I simply hadn't gotten there yet.  I was just
> lucky...

How about Array#pack?  It can escape strings as MIME quoted-printable:

irb(main):001:0> s = "abcd€fghi"
=> "abcd€fghi"
irb(main):002:0> t = [s].pack("M")
=> "abcd=E2=82=ACfghi=\n"
irb(main):003:0> t.unpack("M")[0].force_encoding("UTF-8")
=> "abcd€fghi"

# The force_encoding call is required on Ruby 1.9, where unpack("M")
# returns a binary (ASCII-8BIT) string.
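
The same round trip as a pair of small helpers, in case that is easier to
drop into existing code.  The names qp_encode and qp_decode are made up for
this sketch; only pack("M") and unpack("M") come from Ruby itself, and the
decoder assumes the original data was UTF-8:

```ruby
# Hypothetical helper names, not part of any library.
def qp_encode(str)
  [str].pack("M")                    # quoted-printable, ASCII-only output
end

def qp_decode(text)
  # unpack("M") yields raw bytes, so relabel them as UTF-8 (an assumption
  # about what was encoded in the first place).
  text.unpack("M").first.force_encoding("UTF-8")
end

original = "abcd\u20ACfghi"
encoded  = qp_encode(original)
decoded  = qp_decode(encoded)
```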


> Even the Unicode handling is straightforward enough, and since I posed
> the question, I found this blog:
> http://dilettantes.code4lib.org/2009/04/parsing-escaped-unicode-in-ruby/
> which talks about modifying the JSON parser approach.  I might be able
> to do that, or, I might need to end up writing my own
> serializer/deserializer, since at this stage (over a year), I've a lot
> of legacy data lying around that was created with this approach.

JSON is part of Ruby's stdlib these days (1.9 and above).  Using it might be
easier than you think.

irb(main):001:0> require 'json'
=> true
irb(main):002:0> "€".to_json
=> "\"\\u20ac\""
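
If all you need is string-in, string-out, something like the following might
do.  This is only a sketch: the helper names are invented, wrapping the string
in a one-element array is my assumption (so that it parses on json versions
that reject a bare top-level string), and ascii_only: true asks the generator
for \uXXXX escapes explicitly because newer json versions do not escape
non-ASCII characters by default:

```ruby
require 'json'

# Hypothetical helper names, not part of the json library.
def encode_string(str)
  # ascii_only: true forces \uXXXX escapes for non-ASCII characters.
  JSON.generate([str], ascii_only: true)
end

def decode_string(text)
  # The parser only builds data; nothing in the text is executed.
  JSON.parse(text).first
end

original = "price: \u20AC \"quoted\" \\ tab\there"
wire     = encode_string(original)
restored = decode_string(wire)
```

The one-off clean-up of legacy data would still be needed, since the old
String#dump output is not valid JSON.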


> I guess, I could write a one-off clean-up utility for the data that I
> have now and then use the JSON library just to encode/decode the
> strings, but that seems like overkill.
> 
> My goals here are interoperability, reuse, ease of adapting to my
> existing code (in that order).  Until I ran across the site, I hadn't
> thought about the JSON approach, but it might make the most sense for
> interoperable data.  Mind you, I only care about safe string
> serialization/deserialization, and I've no use in the application for
> the rest of the JSON spec.

Generally speaking, you cannot be safe when eval or eval-type methods are used.
So you have to either (1) write your own deserializer without evals, or (2) use
an existing one like JSON.  I think using existing libraries is a good idea for
interoperability, so JSON might not be that much overkill.  Quoted-printable is
defined in an RFC (RFC 2045), so it might also be a good alternative.

> Changing the question a little:  does anyone know of the best way to
> serialize and parse strings containing Unicode and other non-printing
> characters?  Ideally, I'd like to have something that works like
> String#dump except that it used escaped Unicode code point references,
> e.g. \uxxxx and \Uxxxxxxxx, and handles all of the "usual suspects" like
> \", \\, etc.

If you want \uxxxx-style escapes, the JSON library is your best bet, I think.
Another choice is the YAML stdlib, but it generates backslash escapes, so you
would need to convert them anyway.

> Doing some more googling, I also came across this, but I'm not sure what
> the status of it is, and I'm not sure that it addresses my issue either.
> It seems to be more about processing Unicode rather than serialization
> of Unicode to ASCII. (http://snippets.dzone.com/posts/show/4527).
> 
> [much time passes...including lunch]
> 
> After arsing around for a long time with various stupid stuff, I finally
> came up with this.  I don't really like it, but it seems to do the job.
> Comments welcome:
> 
> irb(main):026:0> euro = "€"
> => "€"
> irb(main):027:0> x = euro.dump
> => "\"\\342\\202\\254\""
> irb(main):028:0> x.gsub(/\\(\d\d\d)/) { [ $1.oct ].pack("c") }[1..-2]
> => "€"
> 
> However, this doesn't get me in/out of the "standard" Unicode escapes.
> 
> Thanks in advance for any ideas or suggestions.
> 
> Cheers,
> 
> ast