Help with a regexp

dasch · July 12, 2006, 4:12pm

I’m trying to write a regular expression that matches bencoded strings,
i.e. strings on the form x:y, where x is the numeric length of y.

This is valid:

6:foobar

while this is not:

4:foo

I’ve tried using #{$1} inside the regexp, but it seems $1 is still nil
at that point. Here’s what I’ve got so far:

/^([1-9]+\d*):(\w){#{$1}}$/

(it ain’t working)

I’m matching with `string =~ regexp’; would it be better if I did it
another way?

Cheers,
Daniel

dasch · July 12, 2006, 6:39pm

Daniel S. schrieb:

I’ve tried using #{$1} inside the regexp, but it seems $1 is still nil
at that point. Here’s what I’ve got so far:

/^([1-9]+\d*):(\w){#{$1}}$/

(it ain’t working)

I’m matching with `string =~ regexp’; would it be better if I did it
another way?

Daniel, I’m pretty sure you can’t do it with a single regexp test. Why
not split the test in two parts, as in

str =~ /^([1-9]\d*):(\w+)$/ && $2.size == $1.to_i

Regards,
Pit

dasch · July 12, 2006, 7:05pm

Daniel,
When you grab the data it will be in a string format, so you need
to convert it to a number (most likely integer). Then you can compare
it with the size of the second value you grabbed. I would write it
like this (modify as needed):

print “enter data”
a = gets
valid = /^(\d*):(\w*)/

check = valid.match(a)

if(check[1].to_i == check[2].size)

end

I hope this helps.

_Steve

dasch · July 12, 2006, 7:24pm

On Jul 12, 2006, at 10:10 AM, Daniel S. wrote:

4:foo

Cheers,
Daniel

I believe that this is not in the set of languages a Regexp can
match. As others have already suggested you’ll have to do it in 2 steps.

dasch · July 12, 2006, 8:51pm

[email protected] wrote:

check = valid.match(a)

if(check[1].to_i == check[2].size)

end

Thank you for your reply.

My problem is that I have a string like this: “3:foo6:monkey5:sheep”,
which I need to separate into [“foo”, “monkey”, “sheep”]. The values can
contain numeric values, so splitting at \d won’t work. This is what
makes it difficult:

“3:ab23:cat5:sheep” => [“ab2”, “cat”, “sheep”]

I need to grab the number, then read that many characters, then read the
next number, etc.

Cheers,
Daniel

dasch · July 12, 2006, 8:58pm

Hi –

On Thu, 13 Jul 2006, Daniel S. wrote:

My problem is that I have a string like this: “3:foo6:monkey5:sheep”, which I
need to separate into [“foo”, “monkey”, “sheep”]. The values can contain
numeric values, so splitting at \d won’t work. This is what makes it
difficult:

“3:ab23:cat5:sheep” => [“ab2”, “cat”, “sheep”]

I need to grab the number, then read that many characters, then read the next
number, etc.

How do you know, in the case of:

2:ab23:cat

which of the two is invalid?

David

dasch · July 12, 2006, 9:16pm

On Jul 12, 2006, at 2:54 PM, [email protected] wrote:

David

I don’t believe you do. OTOH I’m not sure if you should care. If
one’s invalid, the whole thing is invalid.

Anyway, here’s my solution:

% cat n_colon_s.rb
require ‘strscan’
def parse_n_colon_s(s)
scanner = StringScanner.new(s)
results = []
until scanner.empty?
if scanner.scan(/(\d+):/)
n = scanner[1].to_i
raise ‘Malformed String’ unless (s = scanner.scan(/\w{#{n}}/))
results << s
else
raise ‘Malformed string’
end
end
results
end

p parse_n_colon_s(“2:ab3:cat”)
p parse_n_colon_s(“2:ab23:cat”)

% ruby n_colon_s.rb
[“ab”, “cat”]
-:18:in `parse_n_colon_s’: Malformed String (RuntimeError)
from -:28

dasch · July 12, 2006, 9:23pm

Daniel S. wrote:

My problem is that I have a string like this: “3:foo6:monkey5:sheep”,
Cheers,
Daniel

require ‘strscan’

array = []
ss = StringScanner.new(‘3:ab23:cat5:sheep’)

while !ss.eos?
len = ss.scan(/\d+:/).chop.to_i
array << ss.peek(len)
ss.pos += len
end

p array

=> [“ab2”, “cat”, “sheep”]

Tom

dasch · July 12, 2006, 9:29pm

Hi –

On Thu, 13 Jul 2006, Logan C. wrote:

it difficult:
which of the two is invalid?

David

I don’t believe you do. OTOH I’m not sure if you should care. If one’s
invalid, the whole thing is invalid.

OK, but how do you know whether or not this is valid?

3:ab23:cat

David

dasch · July 12, 2006, 9:12pm

Daniel S. wrote:

My problem is that I have a string like this: “3:foo6:monkey5:sheep”,
which I need to separate into [“foo”, “monkey”, “sheep”]. The values can
contain numeric values, so splitting at \d won’t work. This is what
makes it difficult:

“3:ab23:cat5:sheep” => [“ab2”, “cat”, “sheep”]

I need to grab the number, then read that many characters, then read the
next number, etc.

“3:foo6:monkey5:sheep”.scan(/(\d+):([^\d]+)/){|(num,str)|
if num.to_i == str.length
# correct
else
# not correct
end
}

lopex

dasch · July 12, 2006, 9:32pm

require ‘strscan’
s = StringScanner.new( “3:ab23:cat5:sheep” )
words = []
until s.eos?
if digits = s.scan( /\d+/ )
digits = digits.to_i
s.pos += 1
words << s.peek( digits )
s.pos += digits
else
p words
abort “Couldn’t find digits for: #{s.rest}”
end
end
p words
#=> [“ab2”, “cat”, “sheep”]

dasch · July 12, 2006, 9:36pm

On Thu, 13 Jul 2006 [email protected] wrote:

OK, but how do you know whether or not this is valid?

3:ab23:cat

it is valid.

harp:~ > cat a.rb
require ‘strscan’

class List < ::Array
def initialize s
parse s
end
class ParseError < ::StandardError; end
def parse s
clear
ss = StringScanner.new s.to_s
until ss.eos?
n = ss.scan %r/^\d+:/
raise ParseError, “syntax error beginning @ pos <#{ ss.pos }>”
unless n
n = n.to_i
raise RangeError, “bad length <#{ n }>” if n < 0
word = ss.scan(%r/.{0,#{ n }}/).to_s
raise ParseError, “syntax error beginning @ pos <#{ ss.pos }>”
unless
word.size == n
self << word
end
end
end

list = List.new “3:ab23:cat5:sheep”
p list

list = List.new “3:ab23:cat”
p list

harp:~ > ruby a.rb
[“ab2”, “cat”, “sheep”]
[“ab2”, “cat”]

regards.

-a

dasch · July 12, 2006, 9:46pm

Hi –

On Thu, 13 Jul 2006, [email protected] wrote:

the second. it’s invalid here

2:ab23:cat
^
^

virtual position 10.

How do you know? The first one could be wrong and the second one
right.

David

dasch · July 12, 2006, 9:33pm

On Thu, 13 Jul 2006 [email protected] wrote:

I need to grab the number, then read that many characters, then read the
next number, etc.

How do you know, in the case of:

2:ab23:cat

which of the two is invalid?

the second. it’s invalid here

2:ab23:cat
^
^

virtual position 10.

 harp:~ > cat a.rb
 require 'strscan'

 class List < ::Array
   def initialize s
     parse s
   end
   class ParseError < ::StandardError; end
   def parse s
     clear
     ss = StringScanner.new s.to_s
     until ss.eos?
       n = ss.scan %r/^\d+:/
       raise ParseError, "syntax error beginning @ pos <#{ ss.pos

}>" unless n

       n = n.to_i
       raise RangeError, "bad length <#{ n }>" if n < 0

       word = ss.scan(%r/.{0,#{ n }}/).to_s

       raise ParseError, "syntax error beginning @ pos <#{ ss.pos

}>" unless
word.size == n

       self << word
     end
   end
 end

 list = List.new "3:ab23:cat5:sheep"
 p list

 list = List.new "2:ab23:cat"
 p list

 harp:~ > ruby a.rb
 ["ab2", "cat", "sheep"]
 a.rb:20:in `parse': syntax error beginning @ pos <10>

(List::ParseError)
from a.rb:5:in `initialize’
from a.rb:31

regards.

-a

dasch · July 12, 2006, 9:46pm

Here’s what I consider a slightly cleaner, more robust version:

require ‘strscan’
inp = “3:ab23:cat5:sheep”
s = StringScanner.new( inp )
words = []
until s.eos?
begin
unless digits = s.scan( /\d+(?=:)/ )
raise “I can’t find an integer followed by a colon”
end
s.pos += 1 # skip the colon we know is there
digits = digits.to_i
unless s.rest_size >= digits
raise “I ran out of characters; looking for #{digits} characters,
#{s.rest_size} left”
end
words << s.peek( digits )
s.pos += digits
rescue RuntimeError => e
warn “Looking at #{s.rest.inspect},”
warn e.message
abort “Words found so far: #{words.inspect}”
end
end
puts "Words found: ", words

dasch · July 12, 2006, 9:59pm

On Thu, 13 Jul 2006 [email protected] wrote:

2:ab23:cat
^
^

virtual position 10.

How do you know? The first one could be wrong and the second one
right.

because that’s the specification:

I need to grab the number, then read that many characters, then read the
next number, etc.

so, in this case, the parse follows these steps:

read ‘2:’
read 2 chars
read ‘23:’
read 23 chars -> ERROR

i think you are confusing the syntax of the structure for the the logic
of it.
syntax wise ‘2:ab’ is totally valid and so that token pair is consumed.
of
course it’s the case that the user meant

3:ab23:cat

it’s not up to parsers to detect logic errors, only syntax errors. in
this
case the left to right quality (no backtracking specified) makes it
unambiguous as to which token contains the error - it’s the second.

yaml has similar issues, for instance this yaml document

k : v

0
2
3

loads as

{“k”=>[0, 2, 3]}

which is obviously not what is meant. but is it

k : v
0
2
3

or

k :

0
2
3

both are one char errors. nevertheless it’s too much to expect yaml to
determine which was meant or suspect that an error was made.

regards.

-a

dasch · July 12, 2006, 10:02pm

On Thu, 13 Jul 2006 [email protected] wrote:

it is valid.

I guess what I mean is: how do you know whether it’s valid by
coincidence but actually wrong?

the same way ruby knows that this

a = false and must_call_this_function_but_do_not!

is valid but actually wrong – it doesn’t! that’s why logic errors are
the
worst kind to debug…

fortunately for us too - otherwise we programmers would be out of work!

-a

dasch · July 12, 2006, 10:09pm

Hi –

On Thu, 13 Jul 2006, [email protected] wrote:

3:ab23:cat
is valid but actually wrong – it doesn’t! that’s why logic errors are the
worst kind to debug…

fortunately for us too - otherwise we programmers would be out of work!

I think what I’m groping toward (or not) is a case where backtracking
is necessary before the conclusion is drawn that it’s invalid. I know
backtracking wasn’t specified… but I wonder whether there are cases
that can’t be resolved otherwise. (Just thinking out loud, so to
speak.)

David

dasch · July 12, 2006, 9:53pm

Hi –

On Thu, 13 Jul 2006, [email protected] wrote:

On Thu, 13 Jul 2006 [email protected] wrote:

OK, but how do you know whether or not this is valid?

3:ab23:cat

it is valid.

I guess what I mean is: how do you know whether it’s valid by
coincidence but actually wrong?

David

dasch · July 12, 2006, 10:43pm

Phrogz wrote:

end
warn e.message
abort "Words found so far: #{words.inspect}"
end
end
puts "Words found: ", words

Thanks for the help!

I ended up doing it slightly different (after a lot of pain, and a sneek
peek at a Python script.):

This is an early library that handles bencoding. Note that the error
handling is pretty much non-existent at this point, and it’ll break if
you fart on it.

class String
def bencode
“#{length}:#{self}”
end
end

class Numeric
def bencode
“i#{floor}e”
end
end

class Array
def bencode
“l#{map{|obj| obj.bencode}}e”
end
end

class Hash
def bencode
“d#{map{|key, val| [key.bencode, val.bencode]}}e”
end
end

module Bencode
@filter = {:integer => /(0|-?[1-9]+\d*)e/,
:string => /(0|[1-9][0-9]*):/}

 class << self
   attr_reader :map

   def dump(obj)
     obj.bencode
   end

   def load(str)
     decode(str, 0).first
   end

   def decode_integer(data, i)
     match = @filter[:integer].match(data[i..-1])
     raise ArgumentError if match.nil?
     return Integer(match[1]), match.end(0) + i
   end

   def decode_string(data, i)
     match = @filter[:string].match(data[i..-1])
     raise ArgumentError if match.nil?
     length = Integer(match[1])
     offset = match.end(0) + length + i
     return data[(match.end(0) + i)...offset], offset
   end

   def decode_list(data, i)
     ary = []
     while data[i..i] != "e"
       val, i = decode(data, i)
       ary << val
     end
     return ary, i + 1
   end

   def decode_dict(data, i)
     hsh = {}
     while data[i..i] != "e"
       key, i = decode_string(data, i)
       val, i = decode(data, i)
       hsh[key] = val
     end
     return hsh, i + 1
   end

   def decode(data, i)
     case data[i..i]
     when "i"
       decode_integer(data, i + 1)
     when "l"
       decode_list(data, i + 1)
     when "d"
       decode_dict(data, i + 1)
     else
       decode_string(data, i)
     end
   end
 end

end

Eventually I’ll upload it to RubyForge. If you’ve got comments, bring
'em on, but remember that I only just started this today.

Cheers,
Daniel