Parsing String#dump data?

Andrew_SSTownley · April 23, 2009, 12:04pm

Hi,

I was wondering if there was a better/safer way to parse string data
that has been dumped using the String#dump method. At the moment, I’ve
been using regular expressions to do it, but that doesn’t seem to work
with unicode characters, since they get dumped as follows:

irb(main):007:0> s = "€"
=> "€"
irb(main):008:0> s.dump
=> "\"\\342\\202\\254\""

On a lark, I figured that Ruby would be able to get them back, so I
tried this:

irb(main):009:0> x = eval s.dump
=> "€"

Which, of course, works. However, I’m a bit leery of doing this from a
safety perspective, because I really don’t have any control over these
strings, and I’d prefer not to allow the execution of arbitrary Ruby
code every time I’m trying to restore strings (I need them serialized as
appropriately escaped quoted literals).
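
To make the worry concrete, here is a small sketch (not from the original post; the variable names are made up for illustration) of why eval is only as trustworthy as the text it is handed. Genuine String#dump output escapes interpolation, but a string that merely looks like a dumped literal need not:

```ruby
# A genuine String#dump result: the "#{" is escaped, so eval only rebuilds the text.
original = 'price: #{1 + 1}'
dumped   = original.dump
restored = eval(dumped)        # acceptable only because we produced `dumped` ourselves

# A hand-written string that merely looks like dump output.
# The interpolation is live, so eval runs the embedded expression.
forged        = '"price: #{1 + 1}"'
forged_result = eval(forged)

puts dumped
puts restored
puts forged_result
```

The forged literal comes back with the expression already evaluated, so any other expression could have been run in its place.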

Has anyone else ever needed to do this and, if so, how did you solve
the problem? I guess I could do another pass on the string looking for
‘\[\n]+’ values and try to combine them in some way, but I’m not really
sure how to do that either.

Any ideas?

Cheers,

ast

Andrew_SSTownley · April 23, 2009, 1:11pm

On Thu, Apr 23, 2009 at 12:03 PM, Andrew S. Townley wrote:

=> "\"\\342\\202\\254\""

On a lark, I figured that Ruby would be able to get them back, so I
tried this:

irb(main):009:0> x = eval s.dump
=> "€"

Maybe

eval s.dump if /\A".*"\Z/ === s.dump && !(/#[{@]/ === s.dump) # not tested

is safe, but I am not 100% sure.

Cheers
Robert
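
A quick sketch of why such a guard is hard to get right, assuming the first pattern is meant to be `\A".*"\Z` (the forum text appears to have lost a character). The helper name `passes_guard?` is hypothetical. A string can start and end with a quote, contain no `#{` or `#@`, and still be more than one literal:

```ruby
# The guard from the post, reconstructed under the assumption stated above.
def passes_guard?(text)
  quoted        = /\A".*"\Z/.match?(text)
  interpolation = /#[{@]/.match?(text)
  quoted && !interpolation
end

# Quoted at both ends, no interpolation markers, yet it is three string
# literals joined by method calls.
sneaky = '"a" + "b".upcase + "c"'

guard_ok = passes_guard?(sneaky)
result   = eval(sneaky)   # harmless here; imagine something worse than upcase

puts guard_ok
puts result
```

The guard accepts the input and eval still executes a method call, which supports the "not 100% sure" above.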


Andrew_SSTownley · April 23, 2009, 1:43pm

Andrew S. Townley wrote:

Which, of course, works. However, I’m a bit leery of doing this from a
safety perspective, because I really don’t have any control over these
strings, and I’d prefer not to allow the execution of arbitrary Ruby
code every time I’m trying to restore strings (I need them serialized as
appropriately escaped quoted literals).

IMHO it is a bad idea to use String#dump when you cannot control those
strings.

My recommendation is to use Marshal.dump, which also generates a string.
Adding quotes to those Marshal-generated strings should be easier than
safely evaluating a dumped string.
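
A minimal sketch of that suggestion, assuming Base64 (via Array#pack with "m0") is an acceptable way to make the binary Marshal output quotable. The Base64 step is an illustration, not something the post specifies. Note that the Ruby documentation warns against calling Marshal.load on untrusted data, so this only suits data you produced yourself:

```ruby
text    = "\u20AC and a \"quote\""
binary  = Marshal.dump(text)        # binary, Ruby-specific format
armored = [binary].pack("m0")       # Base64 without line breaks: plain ASCII
quoted  = %("#{armored}")           # Base64 contains no quotes or backslashes

# Reading it back: strip the quotes, undo Base64, then unmarshal.
decoded = Marshal.load(quoted[1..-2].unpack("m0").first)

puts quoted
puts decoded
```

The result is plain text, but only Ruby can read it, which matters if other tools need to process the data.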

Andrew_SSTownley · April 23, 2009, 9:19pm

Andrew S. Townley wrote:

irb(main):004:0> t = ""
=> ""
irb(main):005:0> t.instance_eval x
=> "€"

irb(main):001:0> t = ""
=> ""
irb(main):002:0> t.instance_eval "`ls`"
=> "tmp.txt\ntmp.rb\n"

Since all I ever want is to have the data back in the string, and string
doesn’t have any methods that are likely to cause problems, this might
be a reasonable short-to-medium term solution. It still makes me a bit
uncomfortable though, because I don’t really want anything other than
the encoded characters handled.

Be sure your strings do not include something like `rm -rf …`.

I can’t use Marshal, because I need to have the data available as plain
text (hence the quoted strings) which isn’t necessarily guaranteed to be
always processed by Ruby. I chose String#dump because it seemed like it
would always generate a “safe” string that would be parsed using normal
quote literal recognition. I hadn’t tested it until recently with lots
of Unicode data, because I simply hadn’t gotten there yet. I was just
lucky…

How about Array#pack? It can escape strings as MIME
quoted-printable:

irb(main):001:0> s = "abcd€fghi"
=> "abcd€fghi"
irb(main):002:0> t = [s].pack("M")
=> "abcd=E2=82=ACfghi=\n"
irb(main):003:0> t.unpack("M")[0].force_encoding("UTF-8")
=> "abcd€fghi"

The force_encoding call is required on Ruby 1.9.

Even the Unicode handling is straightforward enough, and since I posed
the question, I found this blog:
http://dilettantes.code4lib.org/2009/04/parsing-escaped-unicode-in-ruby/
which talks about modifying the JSON parser approach. I might be able
to do that, or, I might need to end up writing my own
serializer/deserializer, since at this stage (over a year), I’ve a lot
of legacy data lying around that was created with this approach.

JSON is part of Ruby’s stdlib these days (1.9 and above). Using it
might be easier than you think at first.

irb(main):001:0> require 'json'
=> true
irb(main):002:0> "€".to_json
=> "\"\\u20ac\""
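
A sketch of using the JSON library for strings only. The helper names `encode_string` and `decode_string` are hypothetical. It assumes a json version whose generator accepts the `ascii_only` option, and it wraps the string in a one-element array so that it does not matter whether the parser accepts a bare string as a top-level document:

```ruby
require "json"

# Produce a quoted, ASCII-only literal with \uXXXX escapes.
def encode_string(str)
  JSON.generate([str], ascii_only: true)[1..-2]   # drop the surrounding [ ]
end

# Parse such a literal back without evaluating any Ruby code.
def decode_string(literal)
  values = JSON.parse("[#{literal}]")
  unless values.size == 1 && values[0].is_a?(String)
    raise ArgumentError, "expected exactly one JSON string"
  end
  values[0]
end

literal  = encode_string("\u20AC \"quoted\"\n")
restored = decode_string(literal)

puts literal
puts restored
```

Because the output is an ordinary JSON string, tools written in other languages can read it too.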

I guess, I could write a one-off clean-up utility for the data that I
have now and then use the JSON library just to encode/decode the
strings, but that seems like overkill.

My goals here are interoperability, reuse, ease of adapting to my
existing code (in that order). Until I ran across the site, I hadn’t
thought about the JSON approach, but it might make the most sense for
interoperable data. Mind you, I only care about safe string
serialization/deserialization, and I’ve no use in the application for
the rest of the JSON spec.

Generally speaking, you cannot be safe when eval and eval-type methods
are used. So you have to either (1) write your own deserializer without
evals, or (2) use an existing one like JSON. I guess using existing
libraries is not a bad idea for interoperability, so JSON might not be
that much overkill. Quoted-printable is defined in an RFC, so it might
also be a good alternative.

Changing the question a little: does anyone know of the best way to
serialize and parse strings containing Unicode and other non-printing
characters? Ideally, I’d like to have something that works like
String#dump except that it used escaped Unicode code point references,
e.g. \uxxxx and \Uxxxxxxxx, and handles all of the “usual suspects” like
", \, etc.

If you want \uxxxx-style escapes, the JSON library is the best bet, I
think. Another choice is to use the YAML stdlib, but it generates
backslashed escapes, so you need to convert them anyway.
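
If pulling in a library is not wanted, the \uxxxx and \Uxxxxxxxx format asked about above can also be hand-rolled without eval. This is only a sketch under assumptions: the names `escape_unicode` and `unescape_unicode` are hypothetical, the set of short escapes is limited to the five listed, and surrogate code points are not checked.

```ruby
SIMPLE_ESCAPES = {
  "\"" => "\\\"", "\\" => "\\\\", "\n" => "\\n", "\r" => "\\r", "\t" => "\\t"
}.freeze
SIMPLE_UNESCAPES = SIMPLE_ESCAPES.invert.freeze

# A whole literal: quotes around plain characters and known escapes only.
UNICODE_LITERAL = /\A"((?:[^"\\]|\\(?:u\h{4}|U\h{8}|["\\nrt]))*)"\z/
UNICODE_ESCAPE  = /\\(?:u(\h{4})|U(\h{8})|(["\\nrt]))/

def escape_unicode(str)
  body = str.encode("UTF-8").each_char.map do |ch|
    cp = ch.ord
    if SIMPLE_ESCAPES.key?(ch)
      SIMPLE_ESCAPES[ch]
    elsif cp.between?(0x20, 0x7E)
      ch                              # printable ASCII stays as it is
    elsif cp <= 0xFFFF
      format("\\u%04x", cp)
    else
      format("\\U%08x", cp)
    end
  end.join
  %("#{body}")
end

def unescape_unicode(literal)
  match = UNICODE_LITERAL.match(literal) or
    raise ArgumentError, "not a well-formed literal"
  match[1].gsub(UNICODE_ESCAPE) do
    hex = $1 || $2
    hex ? [hex.hex].pack("U") : SIMPLE_UNESCAPES.fetch("\\" + $3)
  end
end

sample  = "\u20AC \u{1F600} \"q\" \\ \n\tend"
escaped = escape_unicode(sample)
puts escaped
puts unescape_unicode(escaped) == sample
```

Validating the whole literal first means an input with stray quotes or unknown escapes is rejected rather than partly decoded.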

Andrew_SSTownley · April 23, 2009, 5:13pm

On Thu, 2009-04-23 at 20:41 +0900, Urabe S. wrote:

Adding quotes to those Marshal-generated strings should be easier than safely
evaluating a dumped string.

Thanks for the replies. Actually, as I was doing something else,
another option occurred to me which seems both to a) work properly and
b) be safe(-ish):

irb(main):001:0> $KCODE = 'u'
=> "u"
irb(main):002:0> s = "€"
=> "€"
irb(main):003:0> x = s.dump
=> "\"\\342\\202\\254\""
irb(main):004:0> t = ""
=> ""
irb(main):005:0> t.instance_eval x
=> "€"

Since all I ever want is to have the data back in the string, and string
doesn’t have any methods that are likely to cause problems, this might
be a reasonable short-to-medium term solution. It still makes me a bit
uncomfortable though, because I don’t really want anything other than
the encoded characters handled.

I can’t use Marshal, because I need to have the data available as plain
text (hence the quoted strings) which isn’t necessarily guaranteed to be
always processed by Ruby. I chose String#dump because it seemed like it
would always generate a “safe” string that would be parsed using normal
quote literal recognition. I hadn’t tested it until recently with lots
of Unicode data, because I simply hadn’t gotten there yet. I was just
lucky…

Even the Unicode handling is straightforward enough, and since I posed
the question, I found this blog:
http://dilettantes.code4lib.org/2009/04/parsing-escaped-unicode-in-ruby/
which talks about modifying the JSON parser approach. I might be able
to do that, or, I might need to end up writing my own
serializer/deserializer, since at this stage (over a year), I’ve a lot
of legacy data lying around that was created with this approach.

I guess, I could write a one-off clean-up utility for the data that I
have now and then use the JSON library just to encode/decode the
strings, but that seems like overkill.

My goals here are interoperability, reuse, ease of adapting to my
existing code (in that order). Until I ran across the site, I hadn’t
thought about the JSON approach, but it might make the most sense for
interoperable data. Mind you, I only care about safe string
serialization/deserialization, and I’ve no use in the application for
the rest of the JSON spec.

Changing the question a little: does anyone know of the best way to
serialize and parse strings containing Unicode and other non-printing
characters? Ideally, I’d like to have something that works like
String#dump except that it used escaped Unicode code point references,
e.g. \uxxxx and \Uxxxxxxxx, and handles all of the “usual suspects” like
", \, etc.

Doing some more googling, I also came across this, but I’m not sure what
the status of it is, and I’m not sure that it addresses my issue either.
It seems to be more about processing Unicode rather than serialization
of Unicode to ASCII. (http://snippets.dzone.com/posts/show/4527).

[much time passes…including lunch]

After arsing around for a long time with various stupid stuff, I finally
came up with this. I don’t really like it, but it seems to do the job.
Comments welcome:

irb(main):026:0> euro = "€"
=> "€"
irb(main):027:0> x = euro.dump
=> "\"\\342\\202\\254\""
irb(main):028:0> x.gsub(/\\(\d\d\d)/) { [ $1.oct ].pack("c") }[1..-2]
=> "€"

However, this doesn’t get me in/out of the “standard” Unicode escapes.
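
One caveat with the gsub in the session above: it decodes any backslash followed by three digits, even when that backslash was itself escaped, and it leaves the other escapes (\n, \", \\ and so on) untouched. A single-pass variant avoids both problems. This is a sketch, not a complete String#dump parser: the name `undump_legacy` is hypothetical, and the list of escapes is an assumption about what the octal-style dump output contains.

```ruby
NAMED_ESCAPES = {
  "n" => "\n", "r" => "\r", "t" => "\t", "f" => "\f", "v" => "\v",
  "a" => "\a", "b" => "\b", "e" => "\e",
  "\"" => "\"", "\\" => "\\", "#" => "#"
}.freeze

# A whole literal: quotes around plain bytes and known escapes only.
DUMP_LITERAL = /\A"((?:[^"\\]|\\(?:[0-3][0-7]{2}|[nrtfvabe"\\#]))*)"\z/
DUMP_ESCAPE  = /\\(?:([0-3][0-7]{2})|([nrtfvabe"\\#]))/

def undump_legacy(dumped, encoding = Encoding::UTF_8)
  # Work on raw bytes; the octal escapes describe bytes, not characters.
  match = DUMP_LITERAL.match(dumped.b) or
    raise ArgumentError, "not a well-formed dump literal"
  decoded = match[1].gsub(DUMP_ESCAPE) do
    $1 ? $1.oct.chr : NAMED_ESCAPES.fetch($2)
  end
  decoded.force_encoding(encoding)
end

puts undump_legacy('"\342\202\254"')   # three octal escapes forming one character
puts undump_legacy('"\\\\342"')        # escaped backslash followed by digits
```

Because each escape is consumed exactly once from left to right, an escaped backslash followed by digits stays as text, and nothing is ever handed to eval.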

Thanks in advance for any ideas or suggestions.

Cheers,

ast