Help with a regexp

dasch · July 12, 2006, 11:05pm

Daniel S. wrote:

If you’ve got comments, bring
'em on, but remember that I only just started this today.

One comment: you seem to have intermixed the places for decoding
(module methods) and encoding (instance methods of classes). It would
seem cleaner, to me, to either add class methods to the classes
(Array.from_bencode instead of Bencode.decode_list) or only use the
module (Bencode.from_array instead of Array#bencode).

dasch · July 13, 2006, 7:26am

Phrogz wrote:

Daniel S. wrote:

If you’ve got comments, bring
'em on, but remember that I only just started this today.

One comment: you seem to have intermixed the places for decoding
(module methods) and encoding (instance methods of classes). It would
seem cleaner, to me, to either add class methods to the classes
(Array.from_bencode instead of Bencode.decode_list) or only use the
module (Bencode.from_array instead of Array#bencode).

Actually, I’m imitating the behavior of YAML. I think it’s very
intuitive that an object creates a bencoded copy of itself, while the
parser methods are gathered at one place. Maybe make the /decode(_.+)?/
methods private?
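
For readers following along, here is a minimal sketch of the encoding side of that design, assuming each core class gets a `bencode` instance method in the way YAML adds `to_yaml`. The method bodies are illustrative guesses, not the code from the original post; only the names `bencode` and `Bencode.dump` come from the thread.

```ruby
# Hypothetical sketch: every object knows how to bencode itself,
# while the Bencode module stays the single entry point.
class Integer
  def bencode
    "i#{self}e"
  end
end

class String
  def bencode
    # The length prefix counts bytes, not characters.
    "#{bytesize}:#{self}"
  end
end

class Array
  def bencode
    "l#{map(&:bencode).join}e"
  end
end

class Hash
  def bencode
    # Bencoded dictionaries use string keys in sorted order.
    pairs = sort_by { |key, _| key.to_s }
    body = pairs.map { |key, value| key.to_s.bencode + value.bencode }.join
    "d#{body}e"
  end
end

module Bencode
  def self.dump(obj)
    obj.bencode
  end
end

puts Bencode.dump([1, "foo"])
puts Bencode.dump({ "b" => 1, "a" => "x" })
```

Reopening core classes keeps the call site short (`obj.bencode`), at the cost of touching global classes; the module-only alternative suggested above avoids that.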

Cheers,
Daniel

dasch · July 13, 2006, 3:20pm

Hi Daniel,

Daniel S. wrote:

I’ve tried using #{$1} inside the regexp, but it seems $1 is still nil
at that point.

I think you can do what you want there, but if you’re using captures
within the regex they are captured in, you denote them as \1, \2, etc.

%r{<(foo)></\1>} # should match a pair of empty “foo” tags

Peas,

Seth Thomas R.
http://sethrasmussen.com

dasch · July 13, 2006, 3:30pm

Hi –

On Thu, 13 Jul 2006, Seth Thomas R. wrote:

while this is not:

4:foo

I’ve tried using #{$1} inside the regexp, but it seems $1 is still nil
at that point.

I think you can do what you want there, but if you’re using captures
within the regex they are captured in, you denote them as \1, \2, etc.

%r{<(foo)></\1>} # should match a pair of empty “foo” tags

It does, but the issue would be getting it to interpolate and be
pre-processed as a quantifier:

/(\d):\w{#{\1}}/ or something

which doesn’t seem to be possible, at least as far as I can tell.
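
A small sketch of the difference, for illustration only (the variable names are made up here): a backreference repeats captured text, while `#{...}` is evaluated once, when the regexp literal is built, so it cannot see anything captured during the match.

```ruby
# A backreference repeats whatever text the group captured.
empty_tag = %r{<(\w+)></\1>}
puts empty_tag.match?("<foo></foo>")
puts empty_tag.match?("<foo></bar>")

# Interpolation happens before any matching takes place, so the
# quantifier below is fixed at 3 no matter what (\d+) captures.
count = 3
fixed_length = /\A(\d+):\w{#{count}}\z/
puts fixed_length.source
puts fixed_length.match?("3:foo")
puts fixed_length.match?("9:foo") # still matches: the 9 is never consulted
```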

David

dasch · July 13, 2006, 4:22pm

Daniel S. writes:

4:foo
I don’t think that what you want to do is possible with a mere regular
expression.

It might be possible using perl’s special
evaluate-code-while-in-regexp (??{ code }) feature, but not with any
language that doesn’t allow regular expression evaluations to escape
back into the host language.

The problem is that you want to leave crucial portions of the regexp
uncompiled until the moment that half of the regular expression has
matched, and this is not possible.

But matching bencoded data isn’t that hard; here’s something I just
whipped up that should handle bencoded data:

require 'strscan'

class BencodeScanner
  def BencodeScanner.scan(str)
    scan = StringScanner.new(str)
    r = BencodeScanner.doscan_internal(scan, false)
    raise "Malformed Bencoded String" unless scan.eos?
    r
  end

  private

  @@string_regexps = Hash.new {|h,k| h[k] = /:.{#{k}}/m}

  def BencodeScanner.doscan_internal(scanner, allow_e=true)
    tok = scanner.scan(/\d+|[ilde]/)
    case tok
    when nil
      raise "Malformed Bencoded String"
    when 'e'
      raise "Malformed Bencoded String" unless allow_e
      return nil
    when 'l'
      retval = []
      while arritem = BencodeScanner.doscan_internal(scanner)
        retval << arritem
      end
      return retval
    when 'd'
      retval = {}
      while key = BencodeScanner.doscan_internal(scanner)
        val = BencodeScanner.doscan_internal(scanner, false)
        retval[key] = val
      end
      return retval
    when 'i'
      raise "Malformed Bencoded String" unless scanner.scan(/-?\d+e/)
      return scanner.matched[0, scanner.matched.length-1].to_i
    else
      raise "Malformed Bencoded String" unless
        scanner.scan(@@string_regexps[tok])
      return scanner.matched[1, tok.to_i]
    end
  end
end
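
The string branch is the part that answers the original question: the length is scanned first, and only then is a pattern for that length built (and cached). A stripped-down sketch of just that idea, assuming a flat sequence of bencoded strings; `STRING_PATTERNS` and `read_strings` are names invented for this example:

```ruby
require 'strscan'

# One compiled pattern per length, built on first use and then reused.
STRING_PATTERNS = Hash.new { |cache, len| cache[len] = /:.{#{len}}/m }

def read_strings(data)
  scanner = StringScanner.new(data)
  words = []
  until scanner.eos?
    len = scanner.scan(/\d+/)
    raise ArgumentError, "expected a length prefix" unless len
    body = scanner.scan(STRING_PATTERNS[len])
    raise ArgumentError, "fewer than #{len} characters after the colon" unless body
    words << body[1, len.to_i] # drop the leading colon
  end
  words
end

p read_strings("3:foo5:hello")
```

The `/m` flag matters because a bencoded string may contain newlines, which `.` would otherwise refuse to match.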

dasch · July 13, 2006, 5:16pm

“Phrogz” wrote in news:1152733325.895320.247460@b28g2000cwb.googlegroups.com:

end
warn e.message
abort "Words found so far: #{words.inspect}"
end
end
puts "Words found: ", words

I’ve been experimenting with Ruby since Tuesday, and I’d like to thank
you all for sharing code with us here - it really speeds us forwards in
picking up the spirit of Ruby coding.

I believe I’ve simplified your code by using the incredibly full set of
built-in methods in the string object, rather than depending on
“require ‘strscan’”:
-----------------------8<----------------------------
s = "3:a:23:cat5:sheep"
words = []
until s.empty?
  begin
    unless digits = s.slice!(/\d+(?=:)/)
      raise "I can't find an integer followed by a colon"
    end
    words << s.slice!(0..digits.to_i)
    unless words.last.size >= digits.to_i
      raise "I ran out of characters; looking for #{digits} characters, #{s.size} left"
    end
  rescue RuntimeError => e
    warn "Looking at #{s.inspect},"
    warn e.message
    abort "Words found so far: #{words.inspect}"
  end
end
puts "Words found: ", words
-----------------------8<----------------------------

dasch · July 14, 2006, 8:11pm

Daniel M. wrote:

while this is not:

def BencodeScanner.scan(str)
def BencodeScanner.doscan_internal(scanner, allow_e=true)
retval << arritem
raise "Malformed Bencoded String" unless scanner.scan(/-?\d+e/)
return scanner.matched[0,scanner.matched.length-1].to_i
else
raise "Malformed Bencoded String" unless scanner.scan(@@string_regexps[tok])
return scanner.matched[1,tok.to_i]
end
end
end

Thank you all for your responses!

I’ve been away for the last two days, so I’ve only just got an
opportunity to reply.

Daniel, I’ve further developed your solution:

module Bencode
 class BencodingError < StandardError; end

 class << self
   def dump(obj)
     obj.bencode
   end

   def parse(benc)
     require 'strscan'

     scanner = StringScanner.new(benc)
     obj = scan(scanner)
     raise BencodingError unless scanner.eos?
     return obj
   end

   alias_method :load, :parse

   private

   def scan(scanner)
     case token = scanner.scan(/[ild]|\d+:/)
     when nil
       raise BencodingError
     when "i"
       number = scanner.scan(/0|(-?[1-9][0-9]*)/)
       raise BencodingError unless number
       raise BencodingError unless scanner.scan(/e/)
       return number
     when "l"
       ary = []
       until scanner.peek(1) == "e"
         ary.push(scan(scanner))
       end
       scanner.pos += 1
       return ary
     when "d"
       hsh = {}
       until scanner.peek(1) == "e"
         hsh.store(scan(scanner), scan(scanner))
       end
       scanner.pos += 1
       return hsh
     when /\d+:/
       length = token.chop.to_i
       str = scanner.peek(length)
       scanner.pos += length
       return str
     else
       raise BencodingError
     end
   end
 end

end
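
One thing worth checking in the string branch: `peek(length)` returns fewer bytes when the input is short, and the following `pos +=` then moves past the end of the input, which I believe surfaces as a RangeError rather than a BencodingError. A hedged sketch of that branch with an explicit bounds check; `scan_bencoded_string` is a hypothetical helper, and it raises ArgumentError only so the example stays self-contained:

```ruby
require 'strscan'

# Reads one length-prefixed string and refuses truncated input.
def scan_bencoded_string(scanner)
  prefix = scanner.scan(/\d+:/)
  raise ArgumentError, "expected a length prefix" unless prefix

  length = prefix.chop.to_i
  # rest_size, peek and pos all count bytes, so they stay consistent.
  raise ArgumentError, "string is truncated" if scanner.rest_size < length

  str = scanner.peek(length)
  scanner.pos += length
  str
end

scanner = StringScanner.new("3:foo5:hello")
p scan_bencoded_string(scanner)
p scan_bencoded_string(scanner)
```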

Cheers, and thank you all for helping me out!
Daniel S.

dasch · July 13, 2006, 5:47pm

I’ve been experimenting with Ruby … Tuesday, …

I never even noticed! talk about showing your age

Sorry, I was catching the colon at the start of each word. It should have been:

s = "3:a:23:cat5:sheep"
words = []
until s.empty?
  begin
    unless digits = s.slice!(/\d+(?=:)/)
      raise "I can't find an integer followed by a colon"
    end
    s.slice!(0)
    words << s.slice!(0..digits.to_i-1)
    unless words.last.size >= digits.to_i
      raise "I ran out of characters; looking for #{digits} characters, #{s.size} left"
    end
  rescue RuntimeError => e
    warn "Looking at #{s.inspect},"
    warn e.message
    abort "Words found so far: #{words.inspect}"
  end
end
puts "Words found: ", words
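
The same `slice!` idea can be wrapped in a method that leaves the caller's string alone. This is only a sketch: the method name, the `\A` anchor on the length prefix, and the use of ArgumentError are choices made for this example, not part of the script above.

```ruby
# Splits a run of length-prefixed words such as "3:cat5:sheep".
def split_length_prefixed(input)
  s = input.dup # work on a copy so the caller's string is not consumed
  words = []
  until s.empty?
    digits = s.slice!(/\A\d+(?=:)/)
    raise ArgumentError, "no integer followed by a colon at #{s.inspect}" unless digits

    s.slice!(0) # drop the colon
    word = s.slice!(0, digits.to_i)
    if word.size < digits.to_i
      raise ArgumentError, "wanted #{digits} characters, got #{word.size}"
    end
    words << word
  end
  words
end

p split_length_prefixed("3:a:23:cat5:sheep")
```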

Goodbye,

dasch · July 14, 2006, 8:21pm

Hi –

On Thu, 13 Jul 2006, Daniel M. wrote:

while this is not:

4:foo

I don’t think that what you want to do is possible with a mere regular
expression.

It might be possible using perl’s special
evaluate-code-while-in-regexp (??{ code }) feature, but not with any
language that doesn’t allow regular expression evaluations to escape
back into the host language.

Is ??{ code } in Perl different from #{…} in Ruby? (Not that I was
able to solve Daniel’s problem with #{…}, but I’m just curious about
the comparison.)

David

dasch · July 14, 2006, 9:00pm

On Jul 14, 2006, at 2:16 PM, [email protected] wrote:

This is valid:

It might be possible using perl’s special
evaluate-code-while-in-regexp (??{ code }) feature, but not with any
language that doesn’t allow regular expression evaluations to escape
back into the host language.

Is ??{ code } in Perl different from #{…} in Ruby? (Not that I was
able to solve Daniel’s problem with #{…}, but I’m just curious about
the comparison.)

According to Programming Perl yes indeedy it is. ??{ } is “Match
Time Pattern Interpolation”, and it lets you do all sorts of evil
(like matching nested parens with a regexp).

So in perl his code would be something like:

% cat mtpi.pl
$s1 = "3:abc";
$s2 = "24:abc";

print "Good\n" if ( $s1 =~ /(\d+):(??{'\w{' . $1 . '}'})/);
print "Bad\n" if ( $s2 =~ /(\d+):(??{'\w{' . $1 . '}'})/);

% perl mtpi.pl
Good

I apologize to any perlers if this isn’t idiomatic (or clean) perl, I
never had to use this kind of magic in my perl days and I had
difficulty getting it to work when I stored the regexp in a variable.
But the point is that it does work. Which is kind of scary.

dasch · July 14, 2006, 8:18pm

Daniel S. wrote:

    when "i"
      number = scanner.scan(/0|(-?[1-9][0-9]*)/)
      raise BencodingError unless number
      raise BencodingError unless scanner.scan(/e/)
      return number

That last line should of course read

return number.to_i
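
Pulled out on its own, the corrected integer branch might look like the sketch below. `scan_bencoded_integer` is a hypothetical helper, and it raises ArgumentError instead of BencodingError only so the example stays self-contained.

```ruby
require 'strscan'

# Reads one bencoded integer such as "i-42e" and returns it as an Integer.
def scan_bencoded_integer(scanner)
  raise ArgumentError, "expected 'i'" unless scanner.scan(/i/)

  # Either a lone zero or a non-zero number without leading zeros.
  digits = scanner.scan(/0|-?[1-9][0-9]*/)
  raise ArgumentError, "malformed integer" unless digits
  raise ArgumentError, "expected 'e'" unless scanner.scan(/e/)

  digits.to_i
end

p scan_bencoded_integer(StringScanner.new("i-42e"))
```

Because the pattern accepts either `0` or a number starting with a non-zero digit, inputs like `i03e` and `i-0e` are rejected.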

Daniel

dasch · July 15, 2006, 2:46pm

Logan C. writes:

On Jul 14, 2006, at 2:16 PM, [email protected] wrote:

Is ??{ code } in Perl different from #{…} in Ruby? (Not that I was
able to solve Daniel’s problem with #{…}, but I’m just curious about
the comparison.)

I apologize to any perlers if this isn’t idiomatic (or clean) perl, I
never had to use this kind of magic in my perl days and I had
difficulty getting it to work when I stored the regexp in a variable.
But the point is, is that it does work. Which is kind of scary.

I think you probably had trouble with the \ when you tried storing it
in a variable because of quoting issues. So use qr, the perl
equivalent of ruby’s %r:

$s1 = "3:abc";
$s2 = "24:abc";

$regexp = qr/(\d+):(??{'\w{' . $1 . '}'})/;

print "Good\n" if ( $s1 =~ $regexp);
print "Bad\n" if ( $s2 =~ $regexp);

Although, since bencoded strings can contain any character, and not
just word characters, what you really want is:

$regexp = qr/(\d+):(??{".{$1}"})/;

Perl allows bunches of special constructs in regular expressions that
sane languages don’t; those languages like to keep regular expression
matching from jumping back into the host language. (Note that
perl combines this feature with extra language-level security support,
since most programmers would assume that a user-supplied regexp
couldn’t execute arbitrary code)

For more examples, google “perlre”.

Incidentally, I’ve just been able to reproduce as much of bencoding as
I implemented in ruby earlier in a pair of nasty perl regular
expressions.

I won’t post it, since this is ruby-talk and not
perl-regex-nastiness-talk, but people who really want to see it can
look at Paste number 22637: perl nasty regexness

It doesn’t technically decode every possible bencoded string, because
limitations in perl’s regexp engine don’t let me say .{n} where “n” is
larger than about 32000 while a bencoded string can in theory have a
length up to 65535. But other than that, it should implement the
entire bencode spec.

dasch · July 15, 2006, 6:07pm

Hi –

On Sat, 15 Jul 2006, Daniel M. wrote:

I apologize to any perlers if this isn’t idiomatic (or clean) perl, I

Perl allows bunches of special constructs in regular expressions that
sane languages don’t; those languages like to keep regular expression
matching from jumping back into the host language. (Note that
perl combines this feature with extra language-level security support,
since most programmers would assume that a user-supplied regexp
couldn’t execute arbitrary code)

Ruby does let you jump back, though, with #{…}. But it looks like
Perl does an extra level of compilation. (Unless I’ve got that
backwards.)

David

dasch · July 15, 2006, 11:25pm

[email protected] writes:

backwards.)
Ruby lets you do string interpolation with an easy syntax, and lets
you use that syntax even when writing a string that is being compiled
into a regular expression because it’s surrounded by //. Perl has
that, too. This, however, is something different: the execution of
code in the host language at match time, long after the regular
expression has been compiled, in the middle of doing a match.

As I said, most languages don’t allow this. The usual pattern is:
1. make string in language-specific way
2. hand string to regexp engine and get back some compiled structure
3. store handle to compiled structure in host language
4. get string to match
5. hand string and compiled structure to regexp engine
6. regexp engine walks the string and compiled structure to determine
   if there’s a match.

Now, for speed, step 6 is generally done in C. Sometimes, so is step
2. (I haven’t looked at the ruby code, but python does steps 1-5 in
python, and only 6 in C. Java’s regexp engine of course does all
those steps in java)

Perl, however, lets you interrupt step 6 and evaluate some perl code
in the midst of the C-based matching code.
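
In Ruby terms, the six steps above can be spelled out roughly as follows. This is a sketch with made-up variable names; the point is that the length has to be known at step 1, before anything is compiled.

```ruby
length = 3 # must already be known before the pattern is compiled

# Step 1: build the pattern source as an ordinary string.
source = "\\A(\\d+):.{#{length}}\\z"

# Steps 2-3: compile it and keep a handle to the compiled object.
pattern = Regexp.new(source, Regexp::MULTILINE)

# Step 4: get a string to match.
input = "3:foo"

# Steps 5-6: hand both to the engine, which walks the string on its own
# and never calls back into Ruby while it does so.
match = pattern.match(input)
puts match ? "matched, length prefix #{match[1]}" : "no match"
```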

dasch · July 16, 2006, 1:00am

Hi –

On Sun, 16 Jul 2006, Daniel M. wrote:

Ruby does let you jump back, though, with #{…}. But it looks like
Perl does an extra level of compilation. (Unless I’ve got that
backwards.)

Ruby lets you do string interpolation with an easy syntax, and lets
you use that syntax even when writing a string that is being compiled
into a regular expression because it’s surrounded by //. Perl has
that, too. This, however, is something different: the execution of
code in the host language at match time, long after the regular
expression has been compiled, in the middle of doing a match.

Right, I see what you mean. I think with “extra level of compilation”
that’s what I was groping toward – really a post-compilation
evaluation in a later context.

Thanks –

David