Hello all,
This snip of code needs more speed. Ideas? What is the state of Ruby and
Inline C?
Cheers,
Martin
class Hash
BASE_SOLEXA = 64
# Soft masks sequence residues where the corresponding quality score
# is below a given cutoff.
def mask!(cutoff)
if self.has_key? :SEQ and self.has_key? :SCORES
seq = self[:SEQ]
scores = self[:SCORES]
i = 0
scores.each_char do |score|
seq[i] = seq[i].downcase if score.ord - BASE_SOLEXA <= cutoff
i += 1
end
self[:SEQ] = seq
end
self
end
end
on 2010-08-30 09:58
on 2010-08-30 11:11
2010/8/30 Martin Hansen <mail@maasha.dk>: > > scores = self[:SCORES] > self > end > end What about using String#gsub!? # untested seq.gsub! /(?:#{(0..(BASE_SOLEXA+cutoff)).map{|ch|"\\u00%02x" % ch}.join("|")})+/ do |m| m.downcase! end You could speed up calculation of the regexp by placing this in the default block of a Hash, e.g. RXS = Hash.new {|h,cutoff| h[cutoff] = /(?:#{(0..(BASE_SOLEXA+cutoff)).map{|ch|"\\u00%02x" % ch}.join("|")})+/ and then seq.gsub! RXS[cutoff] do |m| m.downcase! end Kind regards robert
on 2010-08-30 15:27
> You could speed up calculation of the regexp by placing this in the > default block of a Hash, e.g. > > RXS = Hash.new {|h,cutoff| h[cutoff] = > /(?:#{(0..(BASE_SOLEXA+cutoff)).map{|ch|"\\u00%02x" % > ch}.join("|")})+/ > > and then > > seq.gsub! RXS[cutoff] do |m| > m.downcase! > end This could be a good idea, but I am not entirely sure if it will work - at least the current suggestion gives an empty RXS hash. And I am unsure about what this do? map{|ch|"\\u00%02x" % ch} Thinking about this problem I wonder if it would be an idea to create a mask with transliterate and then use a bitwise operation to downcase. IIRC you change the case with a bitwise | operation with ' '. It doesn't look like Ruby have any methods bitwise operations on strings. Cheers, Martin
on 2010-08-30 15:46
2010/8/30 Martin Hansen <mail@maasha.dk>: >> m.downcase! >> end > > > This could be a good idea, but I am not entirely sure if it will work - > at least the current suggestion gives an empty RXS hash. And I am unsure > about what this do? map{|ch|"\\u00%02x" % ch} Just try it out in IRB. This creates a regexp with unicode names. You could however use Fixnum#chr instead, e.g. #{ (0..(BASE_SOLEXA + cutoff)).map {|ch| Regexp.escape(ch.chr)}.join("|") } This gives shorter and more readable strings. > Thinking about this problem I wonder if it would be an idea to create a > mask with transliterate and then use a bitwise operation to downcase. > IIRC you change the case with a bitwise | operation with ' '. It doesn't > look like Ruby have any methods bitwise operations on strings. AFAIK there are none. But you can do irb(main):011:0> s = "aaa" => "aaa" irb(main):012:0> s[0] = (s[0].ord | 0x0F).chr => "o" irb(main):013:0> s => "oaa" Kind regards robert
on 2010-08-30 16:04
> #{ (0..(BASE_SOLEXA + cutoff)).map {|ch| > Regexp.escape(ch.chr)}.join("|") } > > This gives shorter and more readable strings. OK, so for the right range that is equal to: regex = Regexp.union((-5 .. cutoff).collect { |n| (n + BASE_SOLEXA).chr } ) Which returns: (?-mix:;|<|=|>|\?|@|A|B|C|D|E|F|G|H|I|J|K|L|M|N|O|P|Q|R|S|T) But I dont get how the substitution will scan one string (scores) an make changes to the other (seq): seq = 'TTGGTCGCTCGCTCCGCGACCTCAGATCAGACGTGGGCGAT' scores = '@ABCDEFGHIJK?MNOPQRSTUVWhgfedcba`_^]\[ZYX' ? > AFAIK there are none. That is actually strange. In Perl you can do this: perl -le 'print "ABCDEFG" | " \0\0 "' => abCDefg Which I believe is very efficient. And the mask with " " and "\0" can be constructed with transliterate efficiently too. Cheers, Martin
on 2010-08-30 16:29
2010/8/30 Martin Hansen <mail@maasha.dk>: > > ? It doesn't. Somehow I must have missed an important detail... :-) You could do this though seq.gsub! /./ do |m| scores[$`.length].ord - BASE_SOLEXA < cutoff ? m.downcase! : m end Not too nice though. You could however do some preparation, e.g. store scores as an Array of Fixnum instead of using #ord. >> AFAIK there are none. > > That is actually strange. In Perl you can do this: > > perl -le 'print "ABCDEFG" | " \0\0 "' > => abCDefg > > Which I believe is very efficient. And the mask with " " and "\0" can be > constructed with transliterate efficiently too. Kind regards robert
on 2010-08-30 16:50
> You could do this though > > seq.gsub! /./ do |m| > scores[$`.length].ord - BASE_SOLEXA < cutoff ? m.downcase! : m > end > > Not too nice though. You could however do some preparation, e.g. > store scores as an Array of Fixnum instead of using #ord. Yes, converting scores to arrays is bad since the scores are parsed from files as strings (millions of them). And I am unsure if substr substitutions are very efficient ... We need to trick this into the regex engine somehow. How about transforming scores to a mask string like this: 000111 where 1 indicates that the corresponding sequence char should be lowercased (that can be done with tr). Then we plug this onto the sequence string: seq = "ATCGAT000111" And then we construct a regex with a forward looking identifier that reads the mask and manipulates the ATCG chars? Martin
on 2010-09-01 10:26
2010/8/30 Martin Hansen <mail@maasha.dk>: > files as strings (millions of them). And I am unsure if substr > And then we construct a regex with a forward looking identifier that > reads the mask and manipulates the ATCG chars? Here's what I'd probably do. Create a custom class (and not use a Hash) for this, e.g. Score = Struct.new :seq, :score Create another structure for caching scores and a bit representation for downcase dependent on cutoff valiue, e.g. ScoreCache = Struct.new :score do def mask(cutoff) cache[cutoff] end def mask_sequence(cutoff, seq) mask(cutoff).each_bit do |idx| seq[idx] = seq[idx].downcase! end seq end private def cache @cache ||= Hash.new do |h,cutoff| c = 0 self.score.each_with_index |ch,idx| c |= (1 << idx) if ch.ord - BASE_SOLEXA < cutoff end h[cutoff] = c end end end # Store score string -> ScoreCache global_score_cache = Hash.new do |h,score| h[score] = ScoreCache.new score end class Integer def each_bit raise "Currently only positive implemented" if self < 0 if block_given? idx = 0 x = self while x != 0 yield idx if x[0] == 1 idx += 1 x >>= 1 end self else Enumerator.new self, :each_bit end end end And then use it and profile. Kind regards robert
on 2010-09-01 13:35
Thanks Robert, The gsub solution seems to be reasonably efficient. > seq.gsub! /./ do |m| > scores[$`.length].ord - BASE_SOLEXA < cutoff ? m.downcase! : m > end But my original proposed naive loop is twice as fast: > scores.each_char do |score| > seq[i] = seq[i].downcase if score.ord - BASE_SOLEXA <= cutoff > i += 1 > end I dont really know how gsub and tr compares to the Perl equivalents speed wise - in Perl tr is precompiling a lookup table that is evil fast and the regex engine is also primed at compile time and runs extremely fast. I suspect that you need some C extension to go faster than this, but I don't really want to spend the time on that. I was exploring Inline C but that appears very fragile - I cannot even get the example from the cookbook up and running under Ruby 191/192. Looking at your last proposal I see three iterations where two are running narrow if loops. I have not testet it, but it looks suspicious. It is a shame that Ruby does not have build in bitwise operators for the String class allowing C speed masking. I have tried to get on the core mailing list to post this as a feature suggestion, but the ML form is stuffed. Cheers, Martin
on 2010-09-01 16:27
On Wed, Sep 1, 2010 at 1:35 PM, Martin Hansen <mail@maasha.dk> wrote: >> scores.each_char do |score| >> seq[i] = seq[i].downcase if score.ord - BASE_SOLEXA <= cutoff >> i += 1 >> end > > I dont really know how gsub and tr compares to the Perl equivalents > speed wise - in Perl tr is precompiling a lookup table that is evil fast > and the regex engine is also primed at compile time and runs extremely > fast. Regexp is precompiled but I suspect that tr works at runtime only. For definitive answer you'll have to look at the source. > I suspect that you need some C extension to go faster than this, but I > don't really want to spend the time on that. I was exploring Inline C > but that appears very fragile - I cannot even get the example from the > cookbook up and running under Ruby 191/192. > > Looking at your last proposal I see three iterations where two are > running narrow if loops. I have not testet it, but it looks suspicious. Well, but the caching should avoid that too many loops are executed. I do not know however, how often you reuse values. If you need this in several processes you could save the current state in a file via Marshal which is quite fast. Kind regards robert
Please log in before posting. Registration is free and takes only a minute.
Existing account
(Switch to SSL-encrypted connection)
NEW: Do you have a Google/GoogleMail or Yahoo account? No registration required!
Log in with Google account | Log in with Yahoo account
Log in with Google account | Log in with Yahoo account
No account? Register here.