Implementing a simple and efficient index system

Hello everyone,

I’m pretty new to Ruby and programming in general. Here’s my problem:

I’m writing a program that will automatically download protein sequences
from a server and write them into the corresponding file. Every single
sequence has a unique id and I have to eliminate duplicates. However, as
the number of sequences might exceed 50 000, I can’t simply save all
sequences in a hash (with their id as key) and then write them to hd
after downloading has finished. So my idea is to write every sequence to
the corresponding file immediately, but first I have to check if it has
been processed already.

I could save all processed ids in an array and then check if the array
includes my current id:

sequences = []
# some kind of loop magic
  unless sequences.include?(id)
    process file
    sequences << id
  end
end

But I suspect that sequences.include?(id) would iterate over the whole
array until it finds a match. As this array might have up to 50 000
entries and I will have to do this check for every sequence, this
would probably be very inefficient.

I could also save all processed ids as keys of a hash, however I don’t
have any use for a value:

sequences = {}
# some kind of loop magic
  unless sequences[id]
    process file
    sequences[id] = true
  end
end

Would this method be more efficient? Is there a more elegant way? Also,
can Ruby handle arrays/hashes of this size?
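One way to answer the efficiency question empirically is a quick timing with the Benchmark standard library (a sketch with synthetic ids, not the real data; numbers will vary by machine):

```ruby
require 'benchmark'

ids  = (1..50_000).map { |n| "seq_#{n}" }   # synthetic ids
arr  = ids.dup
hash = ids.each_with_object({}) { |id, h| h[id] = true }

Benchmark.bm(6) do |b|
  # worst case for the array: the probed id sits at the very end
  b.report('array') { 100.times { arr.include?('seq_50000') } }
  b.report('hash')  { 100.times { hash.key?('seq_50000') } }
end
```

The array scan is O(n) per lookup while the hash lookup is O(1) on average, so the gap grows with the number of ids.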

Thanks in advance!

Janus B. wrote:

I’m pretty new to Ruby and programming in general. Here’s my problem:

I’m writing a program that will automatically download protein sequences
from a server and write them into the corresponding file. Every single
sequence has a unique id and I have to eliminate duplicates. However, as
the number of sequences might exceed 50 000, I can’t simply save all
sequences in a hash (with their id as key)

How do you know that? Did you try it, as an experiment?

Janus B. wrote:

Would this method be more efficient? Is there a more elegant way? Also,
can Ruby handle arrays/hashes of this size?

It’s not so bad to use true as a hash value. But if it bothers you,
there is the Set class, which is really a hash underneath, but the
interface is set-membership rather than associative lookup:

require 'set'

s = Set.new

s << 123
s << 456

p s.include?(456) # ==> true
p s.include?(789) # ==> false

Janus B. wrote:

the corresponding file immediately, but first I have to check if it has
been processed already.

Can you simply use the id as a filename, and check for file existence
before writing? If your file system doesn’t handle huge dirs well, then
split the id into several terms. But I’d try the hash or set approach
first, to avoid all the system calls.
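A minimal sketch of that idea (the directory layout and the `.seq` extension are invented for illustration):

```ruby
require 'fileutils'

# Write a sequence to <dir>/<id>.seq, skipping ids already on disk.
# Returns true if the file was written, false if it already existed.
def write_sequence(dir, id, sequence)
  FileUtils.mkdir_p(dir)
  path = File.join(dir, "#{id}.seq")
  return false if File.exist?(path)
  File.open(path, 'w') { |f| f.write(sequence) }
  true
end
```

Splitting the id into several terms would just mean building `path` from substrings of the id, e.g. `File.join(dir, id[0, 2], "#{id}.seq")`.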

phlip wrote:

Janus B. wrote:

I’m pretty new to Ruby and programming in general. Here’s my problem:

I’m writing a program that will automatically download protein sequences
from a server and write them into the corresponding file. Every single
sequence has a unique id and I have to eliminate duplicates. However, as
the number of sequences might exceed 50 000, I can’t simply save all
sequences in a hash (with their id as key)

How do you know that? Did you try it, as an experiment?

No, I didn’t try it and it might actually work: Every sequence has a
size of ~1 KB, so 50 000 sequences would probably be around 50 MB. But
getting all this data will take hours, so I need to implement a system
that will not lose all data if the program is terminated abnormally.
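One simple way to get that crash tolerance (a sketch; the log filename is invented): append every finished id to a plain-text log, and rebuild the in-memory set from the log on startup, so an abnormal exit loses at most the sequence in flight.

```ruby
require 'set'

ID_LOG = 'processed_ids.txt'  # hypothetical log file

# Rebuild the set of already-processed ids from the log, if present.
def load_seen(log = ID_LOG)
  File.exist?(log) ? Set.new(File.readlines(log).map(&:chomp)) : Set.new
end

# Record an id durably on disk, then remember it in memory.
def mark_processed(id, seen, log = ID_LOG)
  File.open(log, 'a') { |f| f.puts(id) }
  seen << id
end
```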

Joel VanderWerf wrote:

Janus B. wrote:

Would this method be more efficient? Is there a more elegant way? Also,
can Ruby handle arrays/hashes of this size?

It’s not so bad to use true as a hash value. But if it bothers you,
there is the Set class, which is really a hash underneath, but the
interface is set-membership rather than associative lookup:

require 'set'

s = Set.new

s << 123
s << 456

p s.include?(456) # ==> true
p s.include?(789) # ==> false

Thanks, that’s exactly what I was looking for! I didn’t know set
basically works like a hash without a key…

No, I didn’t try it and it might actually work: Every sequence has a
size of ~1 KB, so 50 000 sequences would probably be around 50 MB. But
getting all this data will take hours, so I need to implement a system
that will not lose all data if the program is terminated abnormally.

Try it with random data first. That way, you know the behavior under
load without paying the acquisition time.
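Such a dry run could look like this (a sketch: 50 000 random ~1 KB strings standing in for the real sequences):

```ruby
# Simulate the full dataset before writing the downloader.
letters = %w[A C G T]
sequences = {}
50_000.times do |i|
  sequences["id_#{i}"] = Array.new(1024) { letters.sample }.join
end

puts sequences.size             # number of stored sequences
puts sequences['id_0'].bytesize # ~1 KB each
```

Watching the process in top (or Task Manager) while this runs shows the real memory footprint.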

  • Robert

Janus B. wrote:

No, I didn’t try it and it might actually work: Every sequence has a
size of ~1 KB, so 50 000 sequences would probably be around 50 MB. But
getting all this data will take hours, so I need to implement a system
that will not lose all data if the program is terminated abnormally.

Here are some simple alternatives for persisting and retrieving your
data in the order I’d recommend them based on what you’ve described so
far:

  1. PStore standard library: Put your objects into a magical hash, that’s
    automatically persisted to a file. Probably the quickest and easiest
    solution. See
    http://www.ruby-doc.org/stdlib/libdoc/pstore/rdoc/classes/PStore.html

  2. Lightweight SQL database: Maybe store sequences in SQLite as BLOBs.
    Probably the best long-term solution, but will require you to work
    harder to transform data to and from storage. See
    http://sqlite-ruby.rubyforge.org/

  3. Marshal core class: Dump objects to and from strings, and then
    files. Useful if you need something more than PStore, but still want to
    persist objects directly. See
    module Marshal - RDoc Documentation
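Option 1 might look like this (a sketch; the store filename and the `:seen` key are invented):

```ruby
require 'pstore'

store = PStore.new('sequences.pstore')  # hypothetical file name

# PStore gives a transactional, hash-like store persisted to one file.
store.transaction do
  store[:seen] ||= {}
  unless store[:seen]['P12345']
    store[:seen]['P12345'] = true
    # ... write the sequence out here ...
  end
end

# A read-only transaction to inspect what has been recorded.
store.transaction(true) { puts store[:seen].keys.inspect }
```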

Best of luck.

-igal

Joel VanderWerf wrote:

Igal K. wrote:

  1. PStore standard library: Put your objects into a magical hash,
    that’s automatically persisted to a file. Probably the quickest and
    easiest solution. See
    http://www.ruby-doc.org/stdlib/libdoc/pstore/rdoc/classes/PStore.html

PStore writes the whole file at once, not incrementally. Not really
what OP is looking for, IMO.
It takes ~2s for my machine to read or write the 50MB PStore file. This
isn’t a big deal if the original poster (OP) doesn’t mind keeping the
program running to process multiple sequences at once.

  2. Lightweight SQL database: Maybe store sequences in SQLite as
    BLOBs. Probably the best long-term solution, but will require you to
    work harder to transform data to and from storage. See
    http://sqlite-ruby.rubyforge.org/

Not clear that would be better than files. Maybe so, if the individual
strings are short. Would be interesting to get some benchmarks on this
question.
Files would probably be faster, but with such a small dataset, we’re
probably talking about less than a second of difference for processing
the full dataset. I like using SQLite for stuff like this because it
provides a standard, out-of-the-box solution for working with
persistence, incremental processing, structured data, queries, and the
ability to easily add more fields to a record.

  3. Marshal core class: Dump objects to and from strings, and then
    files. Useful if you need something more than PStore, but still want
    to persist objects directly. See
    module Marshal - RDoc Documentation

PStore uses Marshal, so it’s odd to say that Marshal is more than PStore.
Working directly with Marshal allows greater flexibility than using the
PStore wrapper, for example if they decided to write a filesystem
database class. :slight_smile:

If you’re looking for a way to manage marshalled (or string or
yaml…) data in multiple files, using file paths as db keys, look no
further than: http://raa.ruby-lang.org/project/fsdb/
Cool project, thanks for writing it. Sounds useful.

-igal

Igal K. wrote:

isn’t a big deal if the original poster (OP) doesn’t mind keeping the
program running to process multiple sequences at once.

I got the impression that Mr. O. P. was trying to avoid waiting until
the end of the download to write the file (maybe in case the network
went down halfway through).

  3. Marshal core class: Dump objects to and from strings, and then
    files. Useful if you need something more than PStore, but still want
    to persist objects directly. See
    module Marshal - RDoc Documentation

PStore uses Marshal, so it’s odd to say that Marshal is more than PStore.
Working directly with Marshal allows greater flexibility than using the
PStore wrapper, for example if they decided to write a filesystem
database class. :slight_smile:

Less is more :wink:

Igal K. wrote:

automatically persisted to a file. Probably the quickest and easiest
solution. See
http://www.ruby-doc.org/stdlib/libdoc/pstore/rdoc/classes/PStore.html

PStore writes the whole file at once, not incrementally. Not really what
OP is looking for, IMO.

  2. Lightweight SQL database: Maybe store sequences in SQLite as BLOBs.
    Probably the best long-term solution, but will require you to work
    harder to transform data to and from storage. See
    http://sqlite-ruby.rubyforge.org/

Not clear that would be better than files. Maybe so, if the individual
strings are short. Would be interesting to get some benchmarks on this
question.

  3. Marshal core class: Dump objects to and from strings, and then
    files. Useful if you need something more than PStore, but still want to
    persist objects directly. See module Marshal - RDoc Documentation

PStore uses Marshal, so it’s odd to say that Marshal is more than
PStore.

If you’re looking for a way to manage marshalled (or string or yaml…)
data in multiple files, using file paths as db keys, look no further
than:

http://raa.ruby-lang.org/project/fsdb/

I think the Set/Hash + many files option is best here, though.

Thanks, that’s exactly what I was looking for! I didn’t know set
basically works like a hash without a key…

make that “without a value”.

Robert


http://ruby-smalltalk.blogspot.com/


AALST (n.) One who changes his name to be further to the front
D.Adams; The Meaning of LIFF

Robert D. wrote:

Thanks, that’s exactly what I was looking for! I didn’t know set
basically works like a hash without a key…

make that “without a value”.

For sets in Perl I’ve used hashes with an arbitrary value of 1, or
undef. In Ruby I guess that would be values of true or nil. Any better
suggestions, apart from using Set of course?
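For what it’s worth, `true` is the safer sentinel: with `nil` as the value, a plain lookup can’t distinguish a stored key from a missing one (a small illustration):

```ruby
seen = {}
seen['a'] = true
seen['b'] = nil

seen['a']        # => true
seen['b']        # => nil, indistinguishable from a missing key
seen.key?('b')   # => true  -- Hash#key? tells them apart
seen.key?('c')   # => false
```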

On Jul 6, 2008, at 12:22 PM, Janus B. wrote:


Would this method be more efficient? Is there a more elegant way? Also,
can Ruby handle arrays/hashes of this size?

Thanks in advance!

Posted via http://www.ruby-forum.com/.

the simplest and most robust method is probably going to be to use
sqlite to store the id of each sequence. this will help you in the
case of a program crash and as you develop. for example:

cfp:~ > ruby a.rb

cfp:~ > sqlite3 .proteins.db 'select * from proteins'
42|ABC123

cfp:~ > ruby a.rb
a.rb:27:in `[]=': 42 (IndexError)
    from /opt/local/lib/ruby/gems/1.8/gems/amalgalite-0.2.1/lib/amalgalite/database.rb:477:in `transaction'
    from a.rb:24:in `[]='
    from a.rb:6

cfp:~ > sqlite3 .proteins.db 'select * from proteins'
42|ABC123

cfp:~ > cat a.rb

db = ProteinDatabase.new

id, sequence = 42, 'ABC123'

db[id] = sequence

BEGIN {

  require 'rubygems'
  require 'amalgalite'

  class ProteinDatabase
    SCHEMA = <<-SQL
      create table proteins(
        id integer primary key,
        sequence blob
      );
    SQL

    def []= id, sequence
      @db.transaction {
        query = 'select id from proteins where id=$id'
        rows = @db.execute(query, '$id' => id)
        raise IndexError, id.to_s if rows and rows[0] and rows[0][0]
        blob = blob_for(sequence)
        insert = 'insert into proteins values ($id, $sequence)'
        @db.execute(insert, '$id' => id, '$sequence' => blob)
      }
    end

    private

    def initialize path = default_path
      @path = path
      setup!
    end

    def setup!
      @db = Amalgalite::Database.new @path
      unless @db.schema.tables['proteins']
        @db.execute SCHEMA
        @db = Amalgalite::Database.new @path
      end
      @sequence_column = @db.schema.tables['proteins'].columns['sequence']
    end

    def blob_for string
      Amalgalite::Blob.new(
        :string => string,
        :column => @sequence_column
      )
    end

    def default_path
      File.join(home, '.proteins.db')
    end

    def home
      home =
        catch :home do
          ["HOME", "USERPROFILE"].each do |key|
            throw(:home, ENV[key]) if ENV[key]
          end

          if ENV["HOMEDRIVE"] and ENV["HOMEPATH"]
            throw(:home, "#{ ENV['HOMEDRIVE'] }:#{ ENV['HOMEPATH'] }")
          end

          File.expand_path("~") rescue(File::ALT_SEPARATOR ? "C:/" : "/")
        end

      File.expand_path home
    end
  end

}

a @ http://codeforpeople.com/

Robert D. wrote:

{}[42] --> nil

R.

Tsk. Don’t you know that “true or nil” evaluates to “true”? :stuck_out_tongue:

(Srsly, I think he meant true for membership and nil otherwise.)

On Mon, Jul 7, 2008 at 5:55 PM, Dave B. wrote:

Robert D. wrote:

Thanks, that’s exactly what I was looking for! I didn’t know set
basically works like a hash without a key…

make that “without a value”.

For sets in Perl I’ve used hashes with an arbitrary value of 1, or
undef. In Ruby I guess that would be values of true or nil. Any better
suggestions, apart from using Set of course?

true might be a better choice than nil :wink:

{}[42] → nil

R.

On Mon, Jul 7, 2008 at 3:12 AM, Robert D. wrote:

Thanks, that’s exactly what I was looking for! I didn’t know set
basically works like a hash without a key…

make that “without a value”.

Which makes an interesting contrast between Ruby and Smalltalk.

In Smalltalk-80, Set is the more “fundamental” class: the
implementation uses hashing to ensure that duplicates are eliminated
and to speed up the test of whether or not a Set contains a given
element.

Smalltalk’s equivalent to Hash, the Dictionary class, is implemented
(via inheritance) as a Set of association objects, where an association
represents a key-value pair, two associations are equal if their keys
are equal, and the hash of an association is the hash of its key.

Ruby, on the other hand, implements Set as a Hash whose values are
unimportant, and does this by delegating to a hash rather than via
inheritance.
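The delegation is visible in MRI’s set.rb, where the members live as the keys of an internal Hash (this pokes at an instance variable, so it is an implementation detail, not public API):

```ruby
require 'set'

s = Set.new([1, 2, 3])
h = s.instance_variable_get(:@hash)  # the Hash that Set delegates to
p h.class      # Hash
p h.keys.sort  # the set's members
```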


Rick DeNatale

My blog on Ruby
http://talklikeaduck.denhaven2.com/

On 6 Jul, 20:22, Janus B. wrote:

the corresponding file immediately, but first I have to check if it has
been processed already.

You can use BioRuby+BioSQL, fetching data from a remote server and
storing it in the db.

On Jul 6, 8:22 pm, Janus B. wrote:

the corresponding file immediately, but first I have to check if it has
been processed already.

BioRuby+BioSQL?
You can fetch a sequence from servers and dump it directly into the
database. You can choose MySQL, PostgreSQL, or SQLite.

ok it’s not well coded but works:

server = Bio::Fetch.new('http://www.ebi.ac.uk/cgi-bin/dbfetch')
ARGV.flags.accession.split.each do |accession|
  puts accession
  if Bio::SQL.exists_accession(accession)
    puts "Entry #{accession} already exists!"
  else
    entry_str = server.fetch('embl', accession, 'raw', 'embl')

    if entry_str == "No entries found. \n"
      $stderr.puts "Error: no entry #{accession} found. #{entry_str}"
    else
      puts "Downloaded!"
      puts "Loading..."
      puts "Converting EMBL obj..."
      entry = Bio::EMBL.new(entry_str)
      puts "Converting Biosequence obj..."
      biosequence = entry.to_biosequence
      puts "Saving Biosequence into Bio::SQL::Sequence database"
      result = Bio::SQL::Sequence.new(
        :biosequence => biosequence,
        :biodatabase_id => db.id
      ) unless Bio::SQL.exists_accession(biosequence.primary_accession)
      puts entry.entry_id
      if result.nil?
        pp "The sequence is already present into the biosql database"
      else
        pp "Stored."
      end
    end # not found on web
  end # bioentry exists
end # list accession

PS: I need to write docs about BioSQL and Ruby, sorry my fault.