Need for speed -> a C extension?

Hello all,

The below code is too slow for practical use. I need it to run at least
20 times faster. Perhaps that is possible with some C code? I have no
experience with writing Ruby extensions. What are the pitfalls? Which
part of the code should be ported? Any pointers to get me started?

Cheers,

Martin

#!/usr/bin/env ruby

IUPAC nucleotide pair ambiguity equivalents are saved in an

array of bit fields.

BIT_A = 1 << 0
BIT_T = 1 << 1
BIT_C = 1 << 2
BIT_G = 1 << 3

EQUAL = Array.new(256, 0)
EQUAL[‘A’.ord] = BIT_A
EQUAL[‘T’.ord] = BIT_T
EQUAL[‘U’.ord] = BIT_T
EQUAL[‘C’.ord] = BIT_C
EQUAL[‘G’.ord] = BIT_G
EQUAL[‘M’.ord] = (BIT_A|BIT_C)
EQUAL[‘R’.ord] = (BIT_A|BIT_G)
EQUAL[‘W’.ord] = (BIT_A|BIT_T)
EQUAL[‘S’.ord] = (BIT_C|BIT_G)
EQUAL[‘Y’.ord] = (BIT_C|BIT_T)
EQUAL[‘K’.ord] = (BIT_G|BIT_T)
EQUAL[‘B’.ord] = (BIT_C|BIT_G|BIT_T)
EQUAL[‘D’.ord] = (BIT_A|BIT_G|BIT_T)
EQUAL[‘H’.ord] = (BIT_A|BIT_C|BIT_T)
EQUAL[‘V’.ord] = (BIT_A|BIT_C|BIT_G)
EQUAL[‘N’.ord] = (BIT_A|BIT_C|BIT_G|BIT_T)

Module containing code to locate nucleotide patterns in sequences

allowing for

ambiguity codes and a given maximum edit distance.

Insertions are nucleotides found in the pattern but not in the

sequence.

Deletions are nucleotides found in the sequence but not in the

pattern.

Inspired by the paper by Bruno Woltzenlogel Paleo (page 197):

http://www.logic.at/people/bruno/Papers/2007-GATE-ESSLLI.pdf

module PatternMatcher


str.match(pattern[, pos[, max_edit_distance]])

→ Match or nil


Method to locate the next pattern match starting from a given

position. A match

is allowed to contain a given maximum edit distance. If a match is

located a

Match object will be returned otherwise nil.

def match(pattern, pos = 0, max_edit_distance = 0)
@pattern = pattern
@pos = pos
@max_edit_distance = max_edit_distance
@vector = vector_init

while @pos < @seq.length
  vector_update

  return match_new if match_found?

  @pos += 1
end

end


str.scan(pattern[, pos[, max_edit_distance]])

→ Array

str.scan(pattern[, pos[, max_edit_distance]]) { |match|

block

}

→ Match


Method to iterate through a sequence to locate pattern matches

starting

from a given position and allowing for a maximum edit distance.

Matches found in block context return the Match object. Otherwise

matches are

returned in an Array.

def scan(pattern, pos = 0, max_edit_distance = 0)
matches = []

while match = match(pattern, pos, max_edit_distance)
  if block_given?
    yield match
  else
    matches << match
  end

  pos = match.pos + 1
end

return matches unless block_given?

end

private

Method to initailize the score vector and return this.

def vector_init
vector = []

(0 ... @pattern.length + 1).each do |i|
  vector[i] = Score.new(matches = 0, mismatches = 0, insertions = i)
end

vector

end

Method to update the score vector.

def vector_update
new_vector = @vector.dup

(0 ... @pattern.length).each do |i|
  if match?(@seq[@pos], @pattern[i])
    new_vector[i + 1] = @vector[i].dup
    new_vector[i + 1].matches += 1
  else
    mismatch  = @vector[i].dup
    insertion = new_vector[i].dup
    deletion  = @vector[i + 1].dup

    if deletion?(mismatch, insertion, deletion)
      deletion.deletions += 1
      new_vector[i + 1] = deletion
    elsif mismatch?(mismatch, insertion, deletion)
      mismatch.mismatches += 1
      new_vector[i + 1] = mismatch
    elsif insertion?(mismatch, insertion, deletion)
      insertion.insertions += 1
      new_vector[i + 1] = insertion
    end
  end
end

@vector = new_vector

end

Method to determine if a match occurred.

def match?(char1, char2)
(EQUAL[char1.upcase.ord] & EQUAL[char2.upcase.ord]) != 0
end

Method to determine if a mismatch occured.

def mismatch?(mismatch, insertion, deletion)
if mismatch.edit_distance <= insertion.edit_distance and
mismatch.edit_distance <= deletion.edit_distance
true
end
end

Method to determine if an insertion occured.

def insertion?(mismatch, insertion, deletion)
if insertion.edit_distance <= mismatch.edit_distance and
insertion.edit_distance <= deletion.edit_distance
true
end
end

Method to determine if a deletion occured.

def deletion?(mismatch, insertion, deletion)
if deletion.edit_distance <= mismatch.edit_distance and
deletion.edit_distance <= insertion.edit_distance
true
end
end

Method to print the score vector.

def vector_print
@vector.each do |s|
puts s
end

puts

end

Method that returns a Match object initialized with

information from the score vector.

def match_new
matches = @vector.last.matches
mismatches = @vector.last.mismatches
insertions = @vector.last.insertions
deletions = @vector.last.deletions
length = @pattern.length - insertions + deletions
pos = @pos - length + 1
match = @seq[pos … pos + length]

Match.new(pos, match, matches, mismatches, insertions, deletions,

length)
end

Method that determines if a match was found by analyzing the score

vector.
def match_found?
if @vector.last.edit_distance <= @max_edit_distance
true
end
end

Class to instantiate Score objects that holds score information.

class Score
attr_accessor :matches, :mismatches, :insertions, :deletions

def initialize(matches = 0, mismatches = 0, insertions = 0,

deletions = 0)
@matches = matches
@mismatches = mismatches
@insertions = insertions
@deletions = deletions
end

# Method to calculate and return the edit distance.
def edit_distance
  self.mismatches + self.insertions + self.deletions
end

private

def to_s
  "(#{[self.matches, self.mismatches, self.insertions,

self.deletions].join(‘,’)})"
end
end

Class for creating Match objects which contain the description of a

match between a nucleotide sequence and a pattern.

class Match
attr_reader :pos, :match, :matches, :mismatches, :insertions,
:deletions, :edit_distance, :length

def initialize(pos, match, matches, mismatches, insertions,

deletions, length)
@pos = pos
@match = match
@matches = matches
@mismatches = mismatches
@insertions = insertions
@deletions = deletions
@edit_distance = mismatches + insertions + deletions
@length = length
end
end
end

On Apr 18, 2011, at 10:15 AM, Martin H. wrote:

Hello all,

The below code is too slow for practical use. I need it to run at least
20 times faster. Perhaps that is possible with some C code? I have no
experience with writing Ruby extensions. What are the pitfalls? Which
part of the code should be ported? Any pointers to get me started?

I pasted the code into a local file and ran it. It completed instantly.

It would be super helpful if you provided all of the code to make this
run and do some useful work. By doing so, people who want to help you
can run it through a profiler and identify hot spots.

cr

On Mon, Apr 18, 2011 at 5:15 PM, Martin H. [email protected] wrote:

The below code is too slow for practical use. I need it to run at least
20 times faster. Perhaps that is possible with some C code? I have no
experience with writing Ruby extensions. What are the pitfalls? Which
part of the code should be ported? Any pointers to get me started?

Can you be a bit more specific about the slow part or even extract a
simple use case which demonstrates the issue you are having? Also,
copying it from your email and then fixing all the line breaks is
tedious. Please use a more appropriate means to distribute the code
in such cases (e.g. pastie, gist).

Kind regards

robert

Martin H. wrote:

The below code is too slow for practical use. I need it to run at least
20 times faster. Perhaps that is possible with some C code? I have no
experience with writing Ruby extensions. What are the pitfalls? Which
part of the code should be ported? Any pointers to get me started?

Please give a clear description of the algorithm, and then
give some sample input and output.

WJ wrote in post #993576:

Martin H. wrote:

The below code is too slow for practical use. I need it to run at least
20 times faster. Perhaps that is possible with some C code? I have no
experience with writing Ruby extensions. What are the pitfalls? Which
part of the code should be ported? Any pointers to get me started?

Please give a clear description of the algorithm, and then
give some sample input and output.

Here is a working version of the code that can be profiled (though it
will take forever with 20M iterations):

http://pastie.org/1808127

The slow part according to profiler is:

% cumulative self self total
time seconds seconds calls ms/call ms/call name
29.39 2.66 2.66 1521 1.75 11.78 Range#each
15.80 4.09 1.43 33000 0.04 0.06 Seq#match?
10.72 5.06 0.97 78940 0.01 0.03 Kernel.dup
9.28 5.90 0.84 78940 0.01 0.01
Kernel.initialize_dup
6.63 6.50 0.60 142380 0.00 0.00
Seq::Score#edit_distance
5.30 6.98 0.48 22220 0.02 0.03 Seq#deletion?
3.54 7.30 0.32 66016 0.00 0.00 String#ord
3.43 7.61 0.31 14680 0.02 0.04 Seq#mismatch?
3.31 7.91 0.30 8300 0.04 0.05 Seq#insertion?

The input is DNA sequences. Basically strings of ATCG and Ns of length
50-100. These comes in files with 20M-30M sequences per file. I’ve got
~50 of these files and more incoming. The output will be truncated
sequences based on the match position located with this bit of code.

The algorithm is of the dynamic programming flavor and was inspired by
the paper by Bruno Woltzenlogel Paleo (page 197):

http://www.logic.at/people/bruno/Papers/2007-GATE-ESSLLI.pdf

Locating variable length matches is tricky!

Cheers,

Martin

def match?(char1, char2)
(EQUAL[char1.upcase.ord] & EQUAL[char2.upcase.ord]) != 0
end

That would be really easy to write tests and a benchmark against and
then rewrite in C using RubyInline… without tests and a benchmark tho,
you won’t know that you’ve done it correctly and provides a measurable
benefit.

They bottlenecks are the iteration over the sequence (while loop at line
68) and the vector_update (line 120). I am a bit surprised that match?
is that slow - I expect it to be almost instantaneous in C. I would like
to test that with a benchmark and Inline C.

And I just discovered RubyInline! Inline code is so much easier to deal
with that an external library IMHO. I have been messing a bit around
with “inline”, but that turns out to be some old crappy gem!!! Its not
the first time I have missed the right gem :o( …

However, I get a peculiar error when testing RubyInline with the first
of the examples:

http://pastie.org/1811057

I am on Ruby 1.9.2.

Cheers,

Martin

On Apr 18, 2011, at 11:30 , Martin H. wrote:

Here is a working version of the code that can be profiled (though it
will take forever with 20M iterations):

http://pastie.org/1808127

The slow part according to profiler is:

Profiling is great, but you also need to write a benchmark in order to
objectively measure the work you’re doing. You might also want to write
microbenchmarks against the bottlenecks you identify in your profiling
work. Eg:

def match?(char1, char2)
(EQUAL[char1.upcase.ord] & EQUAL[char2.upcase.ord]) != 0
end

That would be really easy to write tests and a benchmark against and
then rewrite in C using RubyInline… without tests and a benchmark tho,
you won’t know that you’ve done it correctly and provides a measurable
benefit.

On Tue, Apr 19, 2011 at 12:30 PM, Martin H. [email protected] wrote:

  1. and the vector_update (line 120). I am a bit surprised that match?
    is that slow - I expect it to be almost instantaneous in C. I would like
    to test that with a benchmark and Inline C.

Frankly, I find your code has a design issue: it seems you mix data
and iteration in a single class. This is visible from how #match
works

def match(pattern, pos = 0, max_edit_distance = 0)
@pattern = pattern
@pos = pos
@max_edit_distance = max_edit_distance
@vector = vector_init

IMHO it would be better to separate representation of the sequence and
the matching process. The matcher then would only carry a reference
to the sequence and all the data it needs to do matching.

Also #vector_update creates a lot of objects and does so for each
position in the sequence. That’s likely where you can improve things.

I am not sure what the matching algorithm is exactly. Can you summarize
it?

Ah, and one thing: if you add lowercase entries

EQUAL[‘a’.ord] = BIT_A
EQUAL[‘t’.ord] = BIT_T
EQUAL[‘u’.ord] = BIT_T
EQUAL[‘c’.ord] = BIT_C
EQUAL[‘g’.ord] = BIT_G
EQUAL[‘m’.ord] = (BIT_A|BIT_C)
EQUAL[‘r’.ord] = (BIT_A|BIT_G)
EQUAL[‘w’.ord] = (BIT_A|BIT_T)
EQUAL[‘s’.ord] = (BIT_C|BIT_G)
EQUAL[‘y’.ord] = (BIT_C|BIT_T)
EQUAL[‘k’.ord] = (BIT_G|BIT_T)
EQUAL[‘b’.ord] = (BIT_C|BIT_G|BIT_T)
EQUAL[‘d’.ord] = (BIT_A|BIT_G|BIT_T)
EQUAL[‘h’.ord] = (BIT_A|BIT_C|BIT_T)
EQUAL[‘v’.ord] = (BIT_A|BIT_C|BIT_G)
EQUAL[‘n’.ord] = (BIT_A|BIT_C|BIT_G|BIT_T)

you can make matching simpler

def match?(char1, char2)
(EQUAL[char1.ord] & EQUAL[char2.ord]) != 0
end

You might as well consider changing the Array into a Hash. Then you
can even get rid of the #ord call.

A different approach would just use regular expressions, e.g.

MAP = {
‘a’ => 1,
‘t’ => 2,
‘w’ => ‘[12]’,

}

seq = Array.new(20) { %w{1 2 4 8}.sample }
pat = “acgw”
rx = Regexp.new(pat.chars.map {|c| MAP[c.downcase]}.join)

seq.scan pat do |m|
p m
end

Cheers

robert

IMHO it would be better to separate representation of the sequence and
the matching process. The matcher then would only carry a reference
to the sequence and all the data it needs to do matching.

I am not sure if I understand this. I have tried to copy the behavior of
String#match.

Also #vector_update creates a lot of objects and does so for each
position in the sequence. That’s likely where you can improve things.

Yes, that is quite possible. I might be able to skip .dup on line 128
and 130. That will require some thinking and testing on my side.

I am not sure what the matching algorithm is exactly. Can you summarize
it?

Well, it is a dynamic programming algorithm to do fuzzy searches of
patterns in strings - allowing for custom matching rules (A==N, etc) and
a maximum edit distance. Inspired by the paper by Bruno Woltzenlogel
Paleo (page 197):

http://www.logic.at/people/bruno/Papers/2007-GATE-ESSLLI.pdf

A short example:

http://pastie.org/1811496

you can make matching simpler

def match?(char1, char2)
(EQUAL[char1.ord] & EQUAL[char2.ord]) != 0
end

Yes, but that should not give any significant speed increase?

You might as well consider changing the Array into a Hash. Then you
can even get rid of the #ord call.

Actually, I started with a hash for this - and it was slightly faster.
However, I think this bit field is very elegant - and since I was
preparing for porting to C - I think this is the way to go!

Cheers

Martin

unknown wrote in post #993757:

On Tue, Apr 19, 2011 at 6:30 AM, Martin H. [email protected] wrote:

I am on Ruby 1.9.2.
What exact version? what platform? how was it installed?

$ uname -a
Darwin mel.local 10.7.0 Darwin Kernel Version 10.7.0: Sat Jan 29
15:17:16 PST 2011; root:xnu-1504.9.37~1/RELEASE_I386 i386 i386

$ ruby -v j.rb
ruby 1.9.2p180 (2011-02-18 revision 30909) [x86_64-darwin10.6.0]
internal:lib/rubygems/custom_require:29:in require': undefined class/module Digest::Base (ArgumentError) from <internal:lib/rubygems/custom_require>:29:in require’
from
/Users/maasha/maasha_install/lib/ruby/gems/1.9.1/gems/RubyInline-3.9.0/lib/inline.rb:51:in
<top (required)>' from <internal:lib/rubygems/custom_require>:33:in require’
from internal:lib/rubygems/custom_require:33:in rescue in require' from <internal:lib/rubygems/custom_require>:29:in require’
from j.rb:1:in `’

Ruby I compiled and installed myself. Inline was installed: gem install
RubyInline

:o(

Martin

On Tue, Apr 19, 2011 at 3:13 PM, Martin H. [email protected] wrote:

IMHO it would be better to separate representation of the sequence and
the matching process. The matcher then would only carry a reference
to the sequence and all the data it needs to do matching.

I am not sure if I understand this. I have tried to copy the behavior of
String#match.

Maybe on the interface, but you create side effects on the String (Seq

patterns in strings - allowing for custom matching rules (A==N, etc) and

you can make matching simpler
Actually, I started with a hash for this - and it was slightly faster.
However, I think this bit field is very elegant - and since I was
preparing for porting to C - I think this is the way to go!

I did not say you should get rid of the bit field! Plus, if you are
in need for speed and if it was slow then you should get rid of it
regardless of elegance.

Cheers

robert

Sorry for the half message, thick fingers hit the wrong key
accidentally.

On Tue, Apr 19, 2011 at 3:13 PM, Martin H. [email protected] wrote:

IMHO it would be better to separate representation of the sequence and
the matching process. The matcher then would only carry a reference
to the sequence and all the data it needs to do matching.

I am not sure if I understand this. I have tried to copy the behavior of
String#match.

Maybe on the interface, but you create side effects on the String (Seq
in your case). This is neither a clean separation (makes classes
bigger and harder to understand) nor is it thread safe (e.g. if you
want to try to concurrently match several patterns against the same
sequence).

Btw, in your case I would probably implement #scan and derive
implement match like

def match(*args)
scan(*args) do |match_data|
return match_data
end

nil # no match
end

That’s pretty easy and you can still replace this implementation with
something else if it should be necessary.

Also #vector_update creates a lot of objects and does so for each
position in the sequence. That’s likely where you can improve things.

Yes, that is quite possible. I might be able to skip .dup on line 128
and 130. That will require some thinking and testing on my side.

You could at least switch to only two arrays which you use alternating
with relatively moderate effort. You might even get away with a
single array.

A short example:

http://pastie.org/1811496

Ah, interesting. Thanks!

you can make matching simpler

def match?(char1, char2)
(EQUAL[char1.ord] & EQUAL[char2.ord]) != 0
end

Yes, but that should not give any significant speed increase?

Why not? You get at least rid of a Ruby method call. Did you measure
differences?

You might as well consider changing the Array into a Hash. Then you
can even get rid of the #ord call.

Actually, I started with a hash for this - and it was slightly faster.
However, I think this bit field is very elegant - and since I was
preparing for porting to C - I think this is the way to go!

I did not say you should get rid of the bit field! Plus, if you are
in need for speed and if it was slow then you should get rid of it
regardless of elegance.

Kind regards

robert

On Tue, Apr 19, 2011 at 6:30 AM, Martin H. [email protected] wrote:

I am on Ruby 1.9.2.
What exact version? what platform? how was it installed?

$ uname -a
Linux linux116.ctc.com 2.6.32-71.24.1.el6.x86_64 #1 SMP Fri Apr 8
01:07:23 CDT 2011 x86_64 x86_64 x86_64 GNU/Linux

$ ruby -v j.rb
ruby 1.9.2p180 (2011-02-18 revision 30909) [x86_64-linux]
120

$ cat j.rb
require “inline”

class MyTest
inline do |builder|
builder.c "
long factorial(int max) {
int i=max, result=1;
while (i >= 2) { result *= i–; }
return result;
}"
end
end

t = MyTest.new()

puts factorial_5 = t.factorial(5)

On Tue, Apr 19, 2011 at 10:12 AM, Martin H. [email protected] wrote:

ruby 1.9.2p180 (2011-02-18 revision 30909) [x86_64-darwin10.6.0]
internal:lib/rubygems/custom_require:29:in `require’: undefined
class/module Digest::Base (ArgumentError)

$ uname -a
Darwin Mac-mini.local 10.6.0 Darwin Kernel Version 10.6.0: Wed Nov 10
18:13:17 PST 2010; root:xnu-1504.9.26~3/RELEASE_I386 i386

$ ruby -v j.rb
ruby 1.9.2p180 (2011-02-18 revision 30909) [x86_64-darwin10.6.0]
120

ruby installed via homebrew

Ruby I compiled and installed myself.

Your Ruby was compiled under 10.7 or 10.6?

On Tue, Apr 19, 2011 at 10:12 AM, Martin H. [email protected] wrote:

ruby 1.9.2p180 (2011-02-18 revision 30909) [x86_64-darwin10.6.0]
from j.rb:1:in `’

Ruby I compiled and installed myself. Inline was installed: gem install
RubyInline

Thanks; I’ll try to give it a try to replicate this from home later.

Martin H. [email protected] wrote:

Your Ruby was compiled under 10.7 or 10.6?

I keep a log of the ruby compile. It looks like it was build under
10.6.0. The Mac OS here is 10.6.7 (there is no 10.7 out yet).

Oh, okay. I wanted to check since your uname above said “10.7.0”.

Is there a better place for this “bug report” than this thread?

Ruby Issue Tracking System: http://redmine.ruby-lang.org/

Can you run your test script with “ruby -d”?

Ruby Issue Tracking System: http://redmine.ruby-lang.org/
I’m not sure about RubyInline

Can you run your test script with “ruby -d”?

Also, can you just try “require ‘digest/md5’” in irb?

Your Ruby was compiled under 10.7 or 10.6?

I keep a log of the ruby compile. It looks like it was build under
10.6.0. The Mac OS here is 10.6.7 (there is no 10.7 out yet).

Is there a better place for this “bug report” than this thread?

Cheers

Martin

unknown wrote in post #994006:

Ruby Issue Tracking System: http://redmine.ruby-lang.org/
I’m not sure about RubyInline

Can you run your test script with “ruby -d”?

$ ruby -d j.rb →

http://pastie.org/1815275

Also, can you just try “require ‘digest/md5’” in irb?

irb(main):001:0> require ‘digest/md5’
ArgumentError: undefined class/module Digest::Base
from internal:lib/rubygems/custom_require:29:in require' from <internal:lib/rubygems/custom_require>:29:in require’
from (irb):1
from /Users/maasha/maasha_install/bin/irb:12:in `’

M

Okay, so the issue is in your ruby not RubyInline.

I updated my machine to 10.6.7 (uname: 10.7.0 :slight_smile: and my previously
compiled 1.9.2 still works correctly. Can you try to install a fresh
build in another prefix (by hand, RVM, or homebrew) and see if the
same behavior is still present?

Thanks.

I reverted to ruby 1.9.1p129

gem install RubyInline
ERROR: While executing gem … (ArgumentError)
undefined class/module Digest::Base

This happens on both Mac and Linux

Darwin mel.local 10.7.0 Darwin Kernel Version 10.7.0: Sat Jan 29
15:17:16 PST 2011; root:xnu-1504.9.37~1/RELEASE_I386 i386 i386

Linux bixeonws 2.6.32-30-server #59-Ubuntu SMP Tue Mar 1 22:46:09 UTC
2011 x86_64 GNU/Linux

What is this Digest::Base anyway? Can we have it shot and killed?

Martin