This could be a good idea, but I am not entirely sure if it will work -
at least the current suggestion gives an empty RXS hash. And I am unsure
about what this do? map{|ch|"\u00%02x" % ch}
Thinking about this problem I wonder if it would be an idea to create a
mask with transliterate and then use a bitwise operation to downcase.
IIRC you change the case with a bitwise | operation with ’ '. It doesn’t
look like Ruby have any methods bitwise operations on strings.
This could be a good idea, but I am not entirely sure if it will work -
at least the current suggestion gives an empty RXS hash. And I am unsure
about what this do? map{|ch|“\u00%02x” % ch}
Just try it out in IRB. This creates a regexp with unicode names.
You could however use Fixnum#chr instead, e.g.
Thinking about this problem I wonder if it would be an idea to create a
mask with transliterate and then use a bitwise operation to downcase.
IIRC you change the case with a bitwise | operation with ’ '. It doesn’t
look like Ruby have any methods bitwise operations on strings.
AFAIK there are none. But you can do
irb(main):011:0> s = “aaa”
=> “aaa”
irb(main):012:0> s[0] = (s[0].ord | 0x0F).chr
=> “o”
irb(main):013:0> s
=> “oaa”
seq.gsub! /./ do |m|
scores[$`.length].ord - BASE_SOLEXA < cutoff ? m.downcase! : m
end
Not too nice though. You could however do some preparation, e.g.
store scores as an Array of Fixnum instead of using #ord.
Yes, converting scores to arrays is bad since the scores are parsed from
files as strings (millions of them). And I am unsure if substr
substitutions are very efficient …
We need to trick this into the regex engine somehow.
How about transforming scores to a mask string like this: 000111 where 1
indicates that the corresponding sequence char should be lowercased
(that can be done with tr). Then we plug this onto the sequence string:
seq = “ATCGAT000111”
And then we construct a regex with a forward looking identifier that
reads the mask and manipulates the ATCG chars?
files as strings (millions of them). And I am unsure if substr
And then we construct a regex with a forward looking identifier that
reads the mask and manipulates the ATCG chars?
Here’s what I’d probably do.
Create a custom class (and not use a Hash) for this, e.g.
Score = Struct.new :seq, :score
Create another structure for caching scores and a bit representation
for downcase dependent on cutoff valiue, e.g.
ScoreCache = Struct.new :score do
def mask(cutoff)
cache[cutoff]
end
def mask_sequence(cutoff, seq)
mask(cutoff).each_bit do |idx|
seq[idx] = seq[idx].downcase!
end
seq
end
private
def cache @cache ||= Hash.new do |h,cutoff|
c = 0
self.score.each_with_index |ch,idx|
c |= (1 << idx) if ch.ord - BASE_SOLEXA < cutoff
end
h[cutoff] = c
end
end
end
Store score string → ScoreCache
global_score_cache = Hash.new do |h,score|
h[score] = ScoreCache.new score
end
class Integer
def each_bit
raise “Currently only positive implemented” if self < 0
if block_given?
idx = 0
x = self
while x != 0
yield idx if x[0] == 1
idx += 1
x >>= 1
end
self
else
Enumerator.new self, :each_bit
end
The gsub solution seems to be reasonably efficient.
seq.gsub! /./ do |m|
scores[$`.length].ord - BASE_SOLEXA < cutoff ? m.downcase! : m
end
But my original proposed naive loop is twice as fast:
scores.each_char do |score|
seq[i] = seq[i].downcase if score.ord - BASE_SOLEXA <= cutoff
i += 1
end
I dont really know how gsub and tr compares to the Perl equivalents
speed wise - in Perl tr is precompiling a lookup table that is evil fast
and the regex engine is also primed at compile time and runs extremely
fast.
I suspect that you need some C extension to go faster than this, but I
don’t really want to spend the time on that. I was exploring Inline C
but that appears very fragile - I cannot even get the example from the
cookbook up and running under Ruby 191/192.
Looking at your last proposal I see three iterations where two are
running narrow if loops. I have not testet it, but it looks suspicious.
It is a shame that Ruby does not have build in bitwise operators for the
String class allowing C speed masking. I have tried to get on the core
mailing list to post this as a feature suggestion, but the ML form is
stuffed.
scores.each_char do |score|
seq[i] = seq[i].downcase if score.ord - BASE_SOLEXA <= cutoff
i += 1
end
I dont really know how gsub and tr compares to the Perl equivalents
speed wise - in Perl tr is precompiling a lookup table that is evil fast
and the regex engine is also primed at compile time and runs extremely
fast.
Regexp is precompiled but I suspect that tr works at runtime only.
For definitive answer you’ll have to look at the source.
I suspect that you need some C extension to go faster than this, but I
don’t really want to spend the time on that. I was exploring Inline C
but that appears very fragile - I cannot even get the example from the
cookbook up and running under Ruby 191/192.
Looking at your last proposal I see three iterations where two are
running narrow if loops. I have not testet it, but it looks suspicious.
Well, but the caching should avoid that too many loops are executed.
I do not know however, how often you reuse values. If you need this
in several processes you could save the current state in a file via
Marshal which is quite fast.
Kind regards
robert
This forum is not affiliated to the Ruby language, Ruby on Rails framework, nor any Ruby applications discussed here.