Speed issues iterating over chars

himdel · August 30, 2010, 9:58am

Hello all,

This snip of code needs more speed. Ideas? What is the state of Ruby and
Inline C?

Cheers,

Martin

class Hash
BASE_SOLEXA = 64

Soft masks sequence residues where the corresponding quality score

is below a given cutoff.

def mask!(cutoff)
if self.has_key? :SEQ and self.has_key? :SCORES
seq = self[:SEQ]
scores = self[:SCORES]
i = 0

  scores.each_char do |score|
    seq[i] = seq[i].downcase if score.ord - BASE_SOLEXA <= cutoff
    i += 1
  end

  self[:SEQ] = seq
end

self

end
end

himdel · August 30, 2010, 11:11am

2010/8/30 Martin H. [email protected]:

 scores = self[:SCORES]

self
end
end

What about using String#gsub!?

untested

seq.gsub! /(?:#{(0…(BASE_SOLEXA+cutoff)).map{|ch|“\u00%02x” %
ch}.join(“|”)})+/ do |m|
m.downcase!
end

You could speed up calculation of the regexp by placing this in the
default block of a Hash, e.g.

RXS = Hash.new {|h,cutoff| h[cutoff] =
/(?:#{(0…(BASE_SOLEXA+cutoff)).map{|ch|“\u00%02x” %
ch}.join(“|”)})+/

and then

seq.gsub! RXS[cutoff] do |m|
m.downcase!
end

Kind regards

robert

himdel · August 30, 2010, 3:27pm

You could speed up calculation of the regexp by placing this in the
default block of a Hash, e.g.

RXS = Hash.new {|h,cutoff| h[cutoff] =
/(?:#{(0…(BASE_SOLEXA+cutoff)).map{|ch|"\u00%02x" %
ch}.join("|")})+/

and then

seq.gsub! RXS[cutoff] do |m|
m.downcase!
end

This could be a good idea, but I am not entirely sure if it will work -
at least the current suggestion gives an empty RXS hash. And I am unsure
about what this do? map{|ch|"\u00%02x" % ch}

Thinking about this problem I wonder if it would be an idea to create a
mask with transliterate and then use a bitwise operation to downcase.
IIRC you change the case with a bitwise | operation with ’ '. It doesn’t
look like Ruby have any methods bitwise operations on strings.

Cheers,

Martin

himdel · August 30, 2010, 3:46pm

2010/8/30 Martin H. [email protected]:

m.downcase!
end

This could be a good idea, but I am not entirely sure if it will work -
at least the current suggestion gives an empty RXS hash. And I am unsure
about what this do? map{|ch|“\u00%02x” % ch}

Just try it out in IRB. This creates a regexp with unicode names.
You could however use Fixnum#chr instead, e.g.

#{ (0…(BASE_SOLEXA + cutoff)).map {|ch|
Regexp.escape(ch.chr)}.join(“|”) }

This gives shorter and more readable strings.

Thinking about this problem I wonder if it would be an idea to create a
mask with transliterate and then use a bitwise operation to downcase.
IIRC you change the case with a bitwise | operation with ’ '. It doesn’t
look like Ruby have any methods bitwise operations on strings.

AFAIK there are none. But you can do

irb(main):011:0> s = “aaa”
=> “aaa”
irb(main):012:0> s[0] = (s[0].ord | 0x0F).chr
=> “o”
irb(main):013:0> s
=> “oaa”

Kind regards

robert

himdel · August 30, 2010, 4:29pm

2010/8/30 Martin H. [email protected]:

?
It doesn’t. Somehow I must have missed an important detail…

You could do this though

seq.gsub! /./ do |m|
scores[$`.length].ord - BASE_SOLEXA < cutoff ? m.downcase! : m
end

Not too nice though. You could however do some preparation, e.g.
store scores as an Array of Fixnum instead of using #ord.

AFAIK there are none.

That is actually strange. In Perl you can do this:

perl -le ‘print “ABCDEFG” | " \0\0 "’
=> abCDefg

Which I believe is very efficient. And the mask with " " and “\0” can be
constructed with transliterate efficiently too.

Kind regards

robert

himdel · August 30, 2010, 4:50pm

You could do this though

seq.gsub! /./ do |m|
scores[$`.length].ord - BASE_SOLEXA < cutoff ? m.downcase! : m
end

Not too nice though. You could however do some preparation, e.g.
store scores as an Array of Fixnum instead of using #ord.

Yes, converting scores to arrays is bad since the scores are parsed from
files as strings (millions of them). And I am unsure if substr
substitutions are very efficient …

We need to trick this into the regex engine somehow.

How about transforming scores to a mask string like this: 000111 where 1
indicates that the corresponding sequence char should be lowercased
(that can be done with tr). Then we plug this onto the sequence string:

seq = “ATCGAT000111”

And then we construct a regex with a forward looking identifier that
reads the mask and manipulates the ATCG chars?

Martin

himdel · August 30, 2010, 4:04pm

#{ (0…(BASE_SOLEXA + cutoff)).map {|ch|
Regexp.escape(ch.chr)}.join("|") }

This gives shorter and more readable strings.

OK, so for the right range that is equal to:

regex = Regexp.union((-5 … cutoff).collect { |n| (n + BASE_SOLEXA).chr
} )

Which returns:

(?-mix:;|<|=|>|?|@|A|B|C|D|E|F|G|H|I|J|K|L|M|N|O|P|Q|R|S|T)

But I dont get how the substitution will scan one string (scores) an
make changes to the other (seq):

seq = ‘TTGGTCGCTCGCTCCGCGACCTCAGATCAGACGTGGGCGAT’
scores = ‘@ABCDEFGHIJK?MNOPQRSTUVWhgfedcba`_^][ZYX’

?

AFAIK there are none.

That is actually strange. In Perl you can do this:

perl -le ‘print “ABCDEFG” | " \0\0 "’
=> abCDefg

Which I believe is very efficient. And the mask with " " and “\0” can be
constructed with transliterate efficiently too.

Cheers,

Martin

himdel · September 1, 2010, 10:26am

2010/8/30 Martin H. [email protected]:

files as strings (millions of them). And I am unsure if substr
And then we construct a regex with a forward looking identifier that
reads the mask and manipulates the ATCG chars?

Here’s what I’d probably do.

Create a custom class (and not use a Hash) for this, e.g.

Score = Struct.new :seq, :score

Create another structure for caching scores and a bit representation
for downcase dependent on cutoff valiue, e.g.

ScoreCache = Struct.new :score do
def mask(cutoff)
cache[cutoff]
end

def mask_sequence(cutoff, seq)
mask(cutoff).each_bit do |idx|
seq[idx] = seq[idx].downcase!
end

seq

end

private
def cache
@cache ||= Hash.new do |h,cutoff|
c = 0

  self.score.each_with_index |ch,idx|
    c |= (1 << idx) if ch.ord - BASE_SOLEXA < cutoff
  end

  h[cutoff] = c
end

end
end

Store score string → ScoreCache

global_score_cache = Hash.new do |h,score|
h[score] = ScoreCache.new score
end

class Integer
def each_bit
raise “Currently only positive implemented” if self < 0

if block_given?
  idx = 0
  x = self

  while x != 0
    yield idx if x[0] == 1
    idx += 1
    x >>= 1
  end

  self
else
  Enumerator.new self, :each_bit
end

end
end

And then use it and profile.

Kind regards

robert

himdel · September 1, 2010, 1:35pm

Thanks Robert,

The gsub solution seems to be reasonably efficient.

seq.gsub! /./ do |m|
scores[$`.length].ord - BASE_SOLEXA < cutoff ? m.downcase! : m
end

But my original proposed naive loop is twice as fast:

scores.each_char do |score|
seq[i] = seq[i].downcase if score.ord - BASE_SOLEXA <= cutoff
i += 1
end

I dont really know how gsub and tr compares to the Perl equivalents
speed wise - in Perl tr is precompiling a lookup table that is evil fast
and the regex engine is also primed at compile time and runs extremely
fast.

I suspect that you need some C extension to go faster than this, but I
don’t really want to spend the time on that. I was exploring Inline C
but that appears very fragile - I cannot even get the example from the
cookbook up and running under Ruby 191/192.

Looking at your last proposal I see three iterations where two are
running narrow if loops. I have not testet it, but it looks suspicious.

It is a shame that Ruby does not have build in bitwise operators for the
String class allowing C speed masking. I have tried to get on the core
mailing list to post this as a feature suggestion, but the ML form is
stuffed.

Cheers,

Martin

himdel · September 1, 2010, 4:27pm

On Wed, Sep 1, 2010 at 1:35 PM, Martin H. [email protected] wrote:

scores.each_char do |score|
seq[i] = seq[i].downcase if score.ord - BASE_SOLEXA <= cutoff
i += 1
end

I dont really know how gsub and tr compares to the Perl equivalents
speed wise - in Perl tr is precompiling a lookup table that is evil fast
and the regex engine is also primed at compile time and runs extremely
fast.

Regexp is precompiled but I suspect that tr works at runtime only.
For definitive answer you’ll have to look at the source.

I suspect that you need some C extension to go faster than this, but I
don’t really want to spend the time on that. I was exploring Inline C
but that appears very fragile - I cannot even get the example from the
cookbook up and running under Ruby 191/192.

Looking at your last proposal I see three iterations where two are
running narrow if loops. I have not testet it, but it looks suspicious.

Well, but the caching should avoid that too many loops are executed.
I do not know however, how often you reuse values. If you need this
in several processes you could save the current state in a file via
Marshal which is quite fast.

Kind regards

robert