Forum: Ruby Speed issues iterating over chars

Posted by Martin Hansen (maasha)
on 2010-08-30 09:58
Hello all,


This snip of code needs more speed. Ideas? What is the state of Ruby and
Inline C?


Cheers,

Martin



class Hash
  BASE_SOLEXA = 64

  # Soft masks sequence residues where the corresponding quality score
  # is below a given cutoff.
  def mask!(cutoff)
    if self.has_key? :SEQ and self.has_key? :SCORES
      seq    = self[:SEQ]
      scores = self[:SCORES]
      i      = 0

      scores.each_char do |score|
        seq[i] = seq[i].downcase if score.ord - BASE_SOLEXA <= cutoff
        i += 1
      end

      self[:SEQ] = seq
    end

    self
  end
end
Posted by Robert Klemme (Guest)
on 2010-08-30 11:11
(Received via mailing list)
2010/8/30 Martin Hansen <mail@maasha.dk>:
>
>      scores = self[:SCORES]
>    self
>  end
> end

What about using String#gsub!?

# untested
seq.gsub! /(?:#{(0..(BASE_SOLEXA+cutoff)).map{|ch|"\\u00%02x" %
ch}.join("|")})+/ do |m|
  m.downcase!
end

You could speed up calculation of the regexp by placing this in the
default block of a Hash, e.g.

RXS = Hash.new {|h,cutoff| h[cutoff] =
/(?:#{(0..(BASE_SOLEXA+cutoff)).map{|ch|"\\u00%02x" %
ch}.join("|")})+/

and then

seq.gsub! RXS[cutoff] do |m|
  m.downcase!
end

Kind regards

robert
Posted by Martin Hansen (maasha)
on 2010-08-30 15:27
> You could speed up calculation of the regexp by placing this in the
> default block of a Hash, e.g.
> 
> RXS = Hash.new {|h,cutoff| h[cutoff] =
> /(?:#{(0..(BASE_SOLEXA+cutoff)).map{|ch|"\\u00%02x" %
> ch}.join("|")})+/
> 
> and then
> 
> seq.gsub! RXS[cutoff] do |m|
>   m.downcase!
> end


This could be a good idea, but I am not entirely sure if it will work - 
at least the current suggestion gives an empty RXS hash. And I am unsure 
about what this do? map{|ch|"\\u00%02x" % ch}

Thinking about this problem I wonder if it would be an idea to create a 
mask with transliterate and then use a bitwise operation to downcase. 
IIRC you change the case with a bitwise | operation with ' '. It doesn't 
look like Ruby have any methods bitwise operations on strings.


Cheers,


Martin
Posted by Robert Klemme (Guest)
on 2010-08-30 15:46
(Received via mailing list)
2010/8/30 Martin Hansen <mail@maasha.dk>:
>>   m.downcase!
>> end
>
>
> This could be a good idea, but I am not entirely sure if it will work -
> at least the current suggestion gives an empty RXS hash. And I am unsure
> about what this do? map{|ch|"\\u00%02x" % ch}

Just try it out in IRB.  This creates a regexp with unicode names.
You could however use Fixnum#chr instead, e.g.

#{ (0..(BASE_SOLEXA + cutoff)).map {|ch| 
Regexp.escape(ch.chr)}.join("|") }

This gives shorter and more readable strings.

> Thinking about this problem I wonder if it would be an idea to create a
> mask with transliterate and then use a bitwise operation to downcase.
> IIRC you change the case with a bitwise | operation with ' '. It doesn't
> look like Ruby have any methods bitwise operations on strings.

AFAIK there are none.  But you can do

irb(main):011:0> s = "aaa"
=> "aaa"
irb(main):012:0> s[0] = (s[0].ord | 0x0F).chr
=> "o"
irb(main):013:0> s
=> "oaa"

Kind regards

robert
Posted by Martin Hansen (maasha)
on 2010-08-30 16:04
> #{ (0..(BASE_SOLEXA + cutoff)).map {|ch| 
> Regexp.escape(ch.chr)}.join("|") }
> 
> This gives shorter and more readable strings.

OK, so for the right range that is equal to:

regex = Regexp.union((-5 .. cutoff).collect { |n| (n + BASE_SOLEXA).chr 
} )

Which returns:

(?-mix:;|<|=|>|\?|@|A|B|C|D|E|F|G|H|I|J|K|L|M|N|O|P|Q|R|S|T)

But I dont get how the substitution will scan one string (scores) an 
make changes to the other (seq):

seq    = 'TTGGTCGCTCGCTCCGCGACCTCAGATCAGACGTGGGCGAT'
scores = '@ABCDEFGHIJK?MNOPQRSTUVWhgfedcba`_^]\[ZYX'

?


> AFAIK there are none.  

That is actually strange. In Perl you can do this:

perl -le 'print "ABCDEFG" | "  \0\0    "'
=> abCDefg

Which I believe is very efficient. And the mask with " " and "\0" can be 
constructed with transliterate efficiently too.

Cheers,


Martin


Posted by Robert Klemme (Guest)
on 2010-08-30 16:29
(Received via mailing list)
2010/8/30 Martin Hansen <mail@maasha.dk>:
>
> ?
It doesn't. Somehow I must have missed an important detail... :-)

You could do this though

seq.gsub! /./ do |m|
  scores[$`.length].ord - BASE_SOLEXA < cutoff ? m.downcase! : m
end

Not too nice though.  You could however do some preparation, e.g.
store scores as an Array of Fixnum instead of using #ord.

>> AFAIK there are none.
>
> That is actually strange. In Perl you can do this:
>
> perl -le 'print "ABCDEFG" | "  \0\0    "'
> => abCDefg
>
> Which I believe is very efficient. And the mask with " " and "\0" can be
> constructed with transliterate efficiently too.

Kind regards

robert
Posted by Martin Hansen (maasha)
on 2010-08-30 16:50
> You could do this though
> 
> seq.gsub! /./ do |m|
>   scores[$`.length].ord - BASE_SOLEXA < cutoff ? m.downcase! : m
> end
> 
> Not too nice though.  You could however do some preparation, e.g.
> store scores as an Array of Fixnum instead of using #ord.

Yes, converting scores to arrays is bad since the scores are parsed from 
files as strings (millions of them). And I am unsure if substr 
substitutions are very efficient ...

We need to trick this into the regex engine somehow.

How about transforming scores to a mask string like this: 000111 where 1 
indicates that the corresponding sequence char should be lowercased 
(that can be done with tr). Then we plug this onto the sequence string:

seq = "ATCGAT000111"

And then we construct a regex with a forward looking identifier that 
reads the mask and manipulates the ATCG chars?


Martin

Posted by Robert Klemme (Guest)
on 2010-09-01 10:26
(Received via mailing list)
2010/8/30 Martin Hansen <mail@maasha.dk>:
> files as strings (millions of them). And I am unsure if substr
> And then we construct a regex with a forward looking identifier that
> reads the mask and manipulates the ATCG chars?

Here's what I'd probably do.

Create a custom class (and not use a Hash) for this, e.g.

Score = Struct.new :seq, :score

Create another structure for caching scores and a bit representation
for downcase dependent on cutoff valiue, e.g.

ScoreCache = Struct.new :score do
  def mask(cutoff)
    cache[cutoff]
  end

  def mask_sequence(cutoff, seq)
    mask(cutoff).each_bit do |idx|
      seq[idx] = seq[idx].downcase!
    end

    seq
  end

private
  def cache
    @cache ||= Hash.new do |h,cutoff|
      c = 0

      self.score.each_with_index |ch,idx|
        c |= (1 << idx) if ch.ord - BASE_SOLEXA < cutoff
      end

      h[cutoff] = c
    end
  end
end

# Store score string -> ScoreCache
global_score_cache = Hash.new do |h,score|
  h[score] = ScoreCache.new score
end

class Integer
  def each_bit
    raise "Currently only positive implemented" if self < 0

    if block_given?
      idx = 0
      x = self

      while x != 0
        yield idx if x[0] == 1
        idx += 1
        x >>= 1
      end

      self
    else
      Enumerator.new self, :each_bit
    end
  end
end


And then use it and profile.

Kind regards

robert
Posted by Martin Hansen (maasha)
on 2010-09-01 13:35
Thanks Robert,

The gsub solution seems to be reasonably efficient.

>  seq.gsub! /./ do |m|
>   scores[$`.length].ord - BASE_SOLEXA < cutoff ? m.downcase! : m
> end

But my original proposed naive loop is twice as fast:

> scores.each_char do |score|
>   seq[i] = seq[i].downcase if score.ord - BASE_SOLEXA <= cutoff
>   i += 1
> end

I dont really know how gsub and tr compares to the Perl equivalents 
speed wise - in Perl tr is precompiling a lookup table that is evil fast 
and the regex engine is also primed at compile time and runs extremely 
fast.

I suspect that you need some C extension to go faster than this, but I 
don't really want to spend the time on that. I was exploring Inline C 
but that appears very fragile - I cannot even get the example from the 
cookbook up and running under Ruby 191/192.

Looking at your last proposal I see three iterations where two are 
running narrow if loops. I have not testet it, but it looks suspicious.

It is a shame that Ruby does not have build in bitwise operators for the 
String class allowing C speed masking. I have tried to get on the core 
mailing list to post this as a feature suggestion, but the ML form is 
stuffed.


Cheers,


Martin
Posted by Robert Klemme (Guest)
on 2010-09-01 16:27
(Received via mailing list)
On Wed, Sep 1, 2010 at 1:35 PM, Martin Hansen <mail@maasha.dk> wrote:
>> scores.each_char do |score|
>>   seq[i] = seq[i].downcase if score.ord - BASE_SOLEXA <= cutoff
>>   i += 1
>> end
>
> I dont really know how gsub and tr compares to the Perl equivalents
> speed wise - in Perl tr is precompiling a lookup table that is evil fast
> and the regex engine is also primed at compile time and runs extremely
> fast.

Regexp is precompiled but I suspect that tr works at runtime only.
For definitive answer you'll have to look at the source.

> I suspect that you need some C extension to go faster than this, but I
> don't really want to spend the time on that. I was exploring Inline C
> but that appears very fragile - I cannot even get the example from the
> cookbook up and running under Ruby 191/192.
>
> Looking at your last proposal I see three iterations where two are
> running narrow if loops. I have not testet it, but it looks suspicious.

Well, but the caching should avoid that too many loops are executed.
I do not know however, how often you reuse values.  If you need this
in several processes you could save the current state in a file via
Marshal which is quite fast.

Kind regards

robert
Please log in before posting. Registration is free and takes only a minute.
Existing account (Switch to SSL-encrypted connection)
NEW: Do you have a Google/GoogleMail or Yahoo account? No registration required!
Log in with Google account | Log in with Yahoo account
No account? Register here.