Suggestion for string parsing

melmoth · September 18, 2008, 10:55am

Hi all,
I would like to know if there’s a better way to parse a string and
assign values to variables.

Ex:

Client=MPEG-4,390000,700000,24000

I can do

line =~ /(\w*)=([0-9A-Za-z -.:]*),([0-9]*),([0-9]*),([0-9]*)/

and

var1 = $1
var2 = $2
var3 = $3
var4 = $4
var5 = $5

But I’m sure there’s a better way, even considering that the number of
parameters can increase and I don’t want to write a long regular
expression rule, that is hard to read.

Thanks a lot for any tips

melmoth · September 18, 2008, 11:15am

Me Me wrote:

var1 = $1

var2 = $2

var3 = $3

var4 = $4

var5 = $5

hint: array

eg,

line
=> "Client=MPEG-4,390000,700000,24000"

re
=> /(\w*?)=([0-9A-Za-z -.:]*?),(\d*?),(\d*?),(\d*)/

line.match(re).captures
=> ["Client", "MPEG-4", "390000", "700000", "24000"]

also,

x,y,z=[1,2,3]
=> [1, 2, 3]

x
=> 1

z
=> 3
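
Putting the two hints together, here is a minimal sketch. The helper name `parse_client_line` and the variable names are made up for illustration, and the pattern is a simplified one that assumes exactly three numeric fields:

```ruby
# Hypothetical helper, not from the thread: returns the captured fields
# as an array, or nil when the line does not match.
def parse_client_line(line)
  m = /\A(\w+)=([^,]*),(\d+),(\d+),(\d+)\z/.match(line)
  m && m.captures
end

# Multiple assignment spreads the array over the variables in one step.
name, format, first, second, third =
  parse_client_line("Client=MPEG-4,390000,700000,24000")

puts name    # prints Client
puts third   # prints 24000 (still a String; call to_i if you need a number)
```

Checking for nil before the assignment is probably wise in real code, since a non-matching line would otherwise leave every variable nil.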

melmoth · September 18, 2008, 11:27am

But I’m sure there’s a better way, even considering that the number of
parameters can increase and I don’t want to write a long regular
expression rule, that is hard to read.

Are the parameters always delimited by commas ? In which case you could
modify the regular expression

line =~/(\w*)=(.*)/

Then

$2 #=> "MPEG-4,390000,700000,24000"
$2.split(",") #=> ["MPEG-4", "390000", "700000", "24000"]

Returns you the values after the ‘=’ sign in line as an array. For more
power you could pass this sub-string to a CSV parsing library such as
FasterCSV.
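
As a rough sketch of the same idea wrapped in a method (the name `key_and_values` is only an example), the code stays the same no matter how many values follow the ‘=’ sign:

```ruby
# Hypothetical helper: splits "key=v1,v2,..." into the key and an array
# of values. Returns nil when the line has no '=' at all.
def key_and_values(line)
  m = /(\w*)=(.*)/.match(line)
  return nil unless m
  [m[1], m[2].split(",")]
end

key, values = key_and_values("Client=MPEG-4,390000,700000,24000")
puts key           # prints Client
puts values.size   # prints 4
```

Note this only splits on commas; it does no validation of the individual values.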

Chris

melmoth · September 18, 2008, 11:35am

Thanks for answering.
I was wondering whether there is something like C’s sscanf,
so that I could parse and assign to variables at the same time

so if I have

line = "Client=MPEG-4,390000,700000,24000"

something like:
sscanf(line, %s=%s %s %d %d %d, val1, val2, val3, val4, val5, val6)

I don’t know if there’s a similar string function for this in Ruby

thanks

melmoth · September 18, 2008, 11:52am

Me Me wrote:

I was wondering whether there is something like C’s sscanf,

so that I could parse and assign to variables at the same time

so if I have

line = "Client=MPEG-4,390000,700000,24000"

something like:

sscanf(line, %s=%s %s %d %d %d, val1, val2, val3, val4, val5, val6)

I don’t know if there’s a similar string function for this in Ruby

you are right on scanf.
there is one in ruby, and it’s a lot simpler than you think

you’ll have to require it though before using,

eg,

require 'scanf'
=> false

line.scanf("%6s=%6s,%d,%d,%d,%d")
=> ["Client", "MPEG-4", 390000, 700000, 24000]

melmoth · September 18, 2008, 2:53pm

is there a way to use the scanf to parse a string not knowing how many
chars?
thanks

melmoth · September 18, 2008, 3:02pm

is there a way to use the scanf to parse a string not knowing how many
chars?

I’d still use Regexp.

line = "Client=MPEG-4,390000,700000,24000"
val1, val2, val3, val4, val5 =
  /^(\w*)=([^,]*),(\d*),(\d*),(\d*)/.match(line).captures

Another way:

def handle_line(v1, v2, v3, v4, v5)
  puts "I got it! #{v1} etc"
end
…
if /^(\w*)=([^,]*),(\d*),(\d*),(\d*)/ =~ line
  handle_line(*$~.captures)
end

melmoth · September 18, 2008, 12:13pm

line.scanf("%6s=%6s,%d,%d,%d,%d")
=> [“Client”, “MPEG-4”, 390000, 700000, 24000]

Thanks
the problem I have now is that the size of the string is not fixed to 6
chars.
And if I try to parse like:
line.scanf("%s=%s,%d,%d,%d,%d")
It doesn’t parse the string.

Is there a way to parse any string?
thanks again

melmoth · September 18, 2008, 3:10pm

Brian C. wrote:

is there a way to use the scanf to parse a string not knowing how many
chars?

I’d still use Regexp.

line = "Client=MPEG-4,390000,700000,24000"
val1, val2, val3, val4, val5 =
  /^(\w*)=([^,]*),(\d*),(\d*),(\d*)/.match(line).captures

Another way:

def handle_line(v1, v2, v3, v4, v5)
  puts "I got it! #{v1} etc"
end
…
if /^(\w*)=([^,]*),(\d*),(\d*),(\d*)/ =~ line
  handle_line(*$~.captures)
end

thanks,
but what I would like is to avoid regexp; it seems strange to me that
there’s no way to parse a string by providing its structure.
scanf would be great, but if I put %s it doesn’t get the string unless I
put the number of chars.

melmoth · September 18, 2008, 3:16pm

thanks,
but what I would like is to avoid regexp; it seems strange to me that
there’s no way to parse a string by providing its structure.
scanf would be great, but if I put %s it doesn’t get the string unless I
put the number of chars.

%s is terminated by whitespace. You have no way of telling scanf that
you want to treat “=” (after the first field) and “,” (after the second
field) as separators, rather than characters to be consumed by %s.

Well, as long as your data doesn’t contain spaces, you could do

line = "Client=MPEG-4,390000,700000,24000"
line.gsub(/[=,]/, ' ').scanf("%s %s %d %d %d")
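
Since Ruby 2.7 scanf is no longer part of the standard library (it lives on as a separate gem), so here is a sketch of the same "turn the separators into whitespace" trick using only core methods. The helper name `fields_via_whitespace` is invented for this example, and it assumes the first two fields are strings and everything after them is a decimal integer:

```ruby
# Hypothetical helper: replaces '=' and ',' with spaces, splits on
# whitespace, and converts every field after the second to an Integer.
def fields_via_whitespace(line)
  name, format, *numbers = line.gsub(/[=,]/, " ").split
  # Integer(n, 10) raises ArgumentError on non-numeric input,
  # unlike to_i, which would silently return 0.
  [name, format, *numbers.map { |n| Integer(n, 10) }]
end

p fields_via_whitespace("Client=MPEG-4,390000,700000,24000")
# prints ["Client", "MPEG-4", 390000, 700000, 24000]
```

As with the scanf version, this breaks as soon as the data itself contains spaces.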

melmoth · September 18, 2008, 3:20pm

Me Me wrote:

Brian C. wrote:

is there a way to use the scanf to parse a string not knowing how many
chars?

I’d still use Regexp.

thanks,
but what I would like is to avoid regexp; it seems strange to me that
there’s no way to parse a string by providing its structure.

Well, you can always write a BreakApart() algorithm but I must agree
with Brian that RegEx is the way to go. After all, that is what RegEx
does. I was tempted to add BreakApart() code here but I am neither sure
that it is what you really want nor that it is the best solution for the
problem at hand.

What is the actual problem? If it is what you said (“I would like to
know if there’s a better way to parse a string and assign values to
variables;”) then RegEx is a fine solution. If you reject a good
solution and seek something else, then it can only be that you are
actually seeking a solution to a different problem. So, what are you
really looking for?

melmoth · September 18, 2008, 3:54pm

What is the actual problem? If it is what you said (“I would like to
know if there’s a better way to parse a string and assign values to
variables;”) then RegEx is a fine solution. If you reject a good
solution and seek something else, then it can only be that you are
actually seeking a solution to a different problem. So, what are you
really looking for?

I’m quite new to Ruby and I can understand that there are better ways to
do things. What I would like to avoid is writing something like this
(which works):

line =~ /(\w*)=([0-9A-Za-z -.:]*),([0-9A-Za-z -.:]*),([0-9A-Za-z -.:]*),([0-9]*),([0-9]*),([0-9]*),([0-9]*),([0-9]*),([0-9]*),([0-9]*),([0-9]*),([0-9]*),([0-9]*)/

I wanted to do this in one line but in a more maintainable way; otherwise I
could always parse the string char by char.

melmoth · September 18, 2008, 4:12pm

On Sep 18, 2008, at 8:46 AM, Me Me wrote:

do things. What I would like to avoid is writing something like this
(which works):

line =~ /(\w*)=([0-9A-Za-z -.:]*),([0-9A-Za-z -.:]*),([0-9A-Za-z -.:]*),([0-9]*),([0-9]*),([0-9]*),([0-9]*),([0-9]*),([0-9]*),([0-9]*),([0-9]*),([0-9]*),([0-9]*)/

I believe there’s a bug in your regex. I assume you don’t really mean
all characters between space and period in the second character class,
especially since that includes a comma.

I wanted to do this in one line but in a more maintainable way; otherwise I
could always parse the string char by char.

I would probably do it in two steps. Match the bit before and after
the equal sign in one, then split() the after bit on commas:

#!/usr/bin/env ruby -wKU

if "Client=MPEG-4,390000,700000,24000" =~ /\A([^=]+)=([^=]+)\z/
  p [$1, *$2.split(",")]
end

__END__

Here’s another idea using StringScanner:

#!/usr/bin/env ruby -wKU

require "strscan"

class SimpleParser
  def initialize(data)
    s = StringScanner.new(data)
    @values = [ ]

    @values << s.matched        if    s.scan(/\w+/)
    @values << s.matched[1..-1] if    s.scan(/=[0-9A-Za-z \-.:]+/)
    @values << s.matched[1..-1] while s.scan(/,[0-9]+/)
  end

  attr_reader :values
end

p SimpleParser.new("Client=MPEG-4,390000,700000,24000").values

__END__
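
A possible variation on the StringScanner parser above, sketched under a few assumptions: the method name `scan_fields` is hypothetical, the numeric fields are converted with to_i, and leftover input is treated as an error rather than silently ignored:

```ruby
require "strscan"

# Hypothetical helper: scans "key=word,n,n,..." and returns
# [key, word, n, n, ...] with the numbers converted to Integers.
def scan_fields(data)
  s = StringScanner.new(data)
  values = []

  values << s.matched   if    s.scan(/\w+/)       # the key
  values << s[1]        if    s.scan(/=([^,]+)/)  # first value, '=' dropped
  values << s[1].to_i   while s.scan(/,(\d+)/)    # any number of integers

  # Anything the scanner could not consume is reported with its offset.
  raise ArgumentError, "unparsed input at offset #{s.pos}" unless s.eos?
  values
end

p scan_fields("Client=MPEG-4,390000,700000,24000")
# prints ["Client", "MPEG-4", 390000, 700000, 24000]
```

Using capture groups with `s[1]` avoids slicing the separator off `s.matched` by hand.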

Hope that gives you some fresh ideas.

James Edward G. II

melmoth · September 18, 2008, 8:23pm

Me Me wrote:

What is the actual problem? If it is what you said (“I would like to
know if there’s a better way to parse a string and assign values to
variables;”) then RegEx is a fine solution. If you reject a good
solution and seek something else, then it can only be that you are
actually seeking a solution to a different problem. So, what are you
really looking for?

I’m quite new to Ruby and I can understand that there are better ways to
do things. What I would like to avoid is writing something like this
(which works):

line =~ /(\w*)=([0-9A-Za-z -.:]*),([0-9A-Za-z -.:]*),([0-9A-Za-z -.:]*),([0-9]*),([0-9]*),([0-9]*),([0-9]*),([0-9]*),([0-9]*),([0-9]*),([0-9]*),([0-9]*),([0-9]*)/

I wanted to do this in one line but in a more maintainable way; otherwise I
could always parse the string char by char.

AHA! I understand, or I at least flatter myself that I do. How about
this:

require 'scanf'

s = "Client=MPEG-4,390000,700000,24000,9452349,234583475,2452345"
val = s.scanf("%6s=%s")
vals = val[1].split(",")
p vals

=> ["MPEG-4", "390000", "700000", "24000", "9452349", "234583475",
"2452345"]

melmoth · September 18, 2008, 9:56pm

On Sep 18, 3:48 am, Me Me wrote:

line =~ /(\w*)=([0-9A-Za-z -.:]*),([0-9]*),([0-9]*),([0-9]*)/
parameters can increase and I don’t want to write a long regular
expression rule, that is hard to read.

s = "Client=MPEG-4,390000,700000,24000"
==>"Client=MPEG-4,390000,700000,24000"
if s =~ /^\w+=\S+(,\d+)+$/
  vars = s.split( /[=,]/ )
end
==>["Client", "MPEG-4", "390000", "700000", "24000"]
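
The same validate-then-split idea as a small method, as a sketch only. The name `split_if_valid` is made up, and I have swapped ^ and $ for \A and \z, because in Ruby ^ and $ match at every line boundary, so a multi-line string could slip through:

```ruby
# Hypothetical helper: returns the fields when the whole string looks
# like "word=something,digits[,digits...]", otherwise nil.
def split_if_valid(s)
  return nil unless s =~ /\A\w+=\S+(?:,\d+)+\z/
  s.split(/[=,]/)
end

p split_if_valid("Client=MPEG-4,390000,700000,24000")
# prints ["Client", "MPEG-4", "390000", "700000", "24000"]
p split_if_valid("Client=MPEG-4")
# prints nil (no numeric field follows)
```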

melmoth · September 18, 2008, 4:19pm

what I would like to avoid is to write something like this
(that works)

line =~ /(\w*)=([0-9A-Za-z -.:]*),([0-9A-Za-z -.:]*),([0-9A-Za-z -.:]*),([0-9]*),([0-9]*),([0-9]*),([0-9]*),([0-9]*),([0-9]*),([0-9]*),([0-9]*),([0-9]*),([0-9]*)/

I wanted to do this in one line but in a more maintainable way; otherwise I
could always parse the string char by char.

If you don’t actually need to match the data against a pattern, then
just use

line.split(',')

If you only want to proceed if the line is “valid”, then write a
suitable regexp pattern to validate it. There are plenty of shortcuts.
For example, \d is the same as [0-9]. {n} means repeat the preceding
element exactly n times. So:

case line
when /^(\w*)=([^,]*),(\d+(,\d+){9})$/
  key1 = $1
  key2 = $2
  numbers = $3.split(/,/).collect { |n| n.to_i }
  # or: numbers = $3.scanf("%d,%d,%d,%d,%d,%d,%d,%d,%d,%d") if you prefer
else
  puts "Invalid line!"
end

That matches word=string,n,n,n,n,n,n,n,n,n,n

Furthermore you can substitute patterns you use repeatedly:

WORD = "[0-9A-Za-z -.:]*"
…
when /^(#{WORD})=(#{WORD}),(#{WORD}),(\d+(,\d+){9})$/o

(//o means that the regexp is built only once, the substitutions aren’t
done every time round)
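
Taking the substitution idea one step further, the pattern can be generated from field counts instead of typed out. This is only a sketch: `line_regexp` is a hypothetical helper, and the word class puts the hyphen last and escaped so that it does not form a range that swallows the comma:

```ruby
# Hypothetical helper: builds a pattern for
#   key=word,...,word,number,...,number
# with the given number of word and number fields, one capture each.
def line_regexp(words, numbers)
  word   = "([0-9A-Za-z .:\\-]*)"   # escaped '-' is a literal hyphen
  number = "(\\d+)"
  fields = [word] * words + [number] * numbers
  Regexp.new("\\A(\\w*)=" + fields.join(",") + "\\z")
end

re = line_regexp(1, 3)
p re.match("Client=MPEG-4,390000,700000,24000").captures
# prints ["Client", "MPEG-4", "390000", "700000", "24000"]
```

Building the Regexp once and keeping it in a variable or constant serves the same purpose as //o: the pattern is not rebuilt on every use. For the long pattern quoted above, `line_regexp(3, 10)` would be the equivalent call.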

You can also use extended syntax to make the RE more maintainable:

VALID_LINE = %r{ ^
  (\w*) =       # key ($1)
  (#{WORD}),    # format ($2)
  (\d+),        # size ($3)
  (\d+)         # sample rate ($4)
$ }x

if VALID_LINE =~ line
…
end

You can also do groupings which don’t capture data using (?: … )
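
To illustrate both points together, here is a sketch combining extended syntax with a non-capturing group. It also uses named groups, which need Ruby 1.9 or later; the method name `parse_media_line` and the hash keys are my own choices, not anything fixed by the data format:

```ruby
# Hypothetical helper: parses "key=format,n,n,..." into a Hash,
# or returns nil when the line does not match.
def parse_media_line(line)
  pattern = %r{
    \A
    (?<key>     \w+   ) =          # text before the equals sign
    (?<format>  [^,]+ ) ,          # first value, anything but a comma
    (?<numbers> \d+ (?:,\d+)* )    # integers; (?: ) groups without capturing
    \z
  }x
  m = pattern.match(line)
  return nil unless m
  { key: m[:key],
    format: m[:format],
    numbers: m[:numbers].split(",").map(&:to_i) }
end

p parse_media_line("Client=MPEG-4,390000,700000,24000")
```

Because the repeated `,\d+` part sits in a non-capturing group, the number of numeric fields can grow without adding captures.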

Compact enough?

melmoth · September 19, 2008, 3:21am

Me Me wrote:

line.scanf("%6s=%6s,%d,%d,%d,%d")

=> ["Client", "MPEG-4", 390000, 700000, 24000]

the problem I have now is that the size of the string is not fixed to 6 chars.

And if I try to parse like:

line.scanf("%s=%s,%d,%d,%d,%d")

It doesn’t parse the string.

Is there a way to parse any string?

thanks again

oops, sorry, i thought it was good enough.

in that case, you’ll have to use char classes,

line.scanf("%[A-Za-z]=%[A-Z1-9-],%d,%d,%d,%d")
=> ["Client", "MPEG-4", 390000, 700000, 24000]

is that ok?
kind regards -botp

melmoth · September 19, 2008, 2:57pm

William J. wrote:

s = "Client=MPEG-4,390000,700000,24000"
==>"Client=MPEG-4,390000,700000,24000"
if s =~ /^\w+=\S+(,\d+)+$/
  vars = s.split( /[=,]/ )
end
==>["Client", "MPEG-4", "390000", "700000", "24000"]

You are right, William. That is cleaner. nice!