Building ruby for speed: wise or otherwise?

On Tue, 29 Nov 2005, Robert K. wrote:

By eye! :-) The code doesn’t access the database at all until the
last part, and it doesn’t get there till about 45 mins. But to be
honest, this is so slow it isn’t worth benchmarking to get the
milliseconds.

Wow! In that case it certainly seems to make sense to optimize that. Did
you keep an eye on memory consumption and disk IO? Could well be that the

Not really. The machine is 5.5 years old. Rough Moore’s law calc
shows it would take about 5 mins on a modern machine.

sheer amount of data (and thus memory) slows your script down.

 555    1676   17179 /home/hgs/csestore_meta/populate_tables2.rb

I could post the script if you like. I’ve not profiled it to find
out where the slow bits are because it would take about 5 hours
going by previous slowdowns when profiling.

Unfortunately we’re close to release and I don’t really have much time to
look into this deeper. If anyone else volunteers…

I’m profiling it now, while I wait for GCC-4.0.2 to cook. Already
been snookered by needing gmp and mpfr for the fortran, then I found
I need autogen to build without the fortran, which needs guile, so
I’ve got those last two. [cf “The Gas Man Cometh”]

MySQL. Part of the problem is that this script is also for
updating, based on new data. If the db is empty it just inserts,
else it updates. Easy enough in ActiveRecord.

Ok, bad for bulk loading.

I think so.

Kind regards

robert
    Thank you,
    Hugh

Hugh S. wrote:

IO and this might have several reasons, from sub optimal execution plans
to slow disks / controllers.)

At the moment my script to populate the tables is taking about an
hour. Anyway it’s mostly ruby I think, because it spends most of
the time setting up the arrays before it populates the db with them.

I had similar problems with ActiveRecord and large datasets. It’s slow. I
wrote an ActiveRecord extension (I haven’t released it yet; I am trying to
figure out how best to release it: as a plugin, or as a patch to the
Ruby on Rails dev team). It makes large dataset entry 10 times faster. If
you like, I can email you the code privately and see if it helps you.

Zach

On Tue, 29 Nov 2005, Robert K. wrote:

bottleneck. (Often it’s IO and this might have several reasons,
from sub optimal execution plans to slow disks / controllers.)

At the moment my script to populate the tables is taking about an
hour. Anyway it’s mostly ruby I think, because it spends most of
the time setting up the arrays before it populates the db with them.

How did you measure that?

By eye! :-) The code doesn’t access the database at all until the last
part, and it doesn’t get there till about 45 mins. But to be
honest, this is so slow it isn’t worth benchmarking to get the
milliseconds.
555 1676 17179 /home/hgs/csestore_meta/populate_tables2.rb
I could post the script if you like. I’ve not profiled it to find
out where the slow bits are because it would take about 5 hours
going by previous slowdowns when profiling.

Besides that, I’m fairly new to database work, so I’m trying to
optimize what I know about before I start fiddling with the db.

Um, although I can understand your wariness with regard to the unknown -
you may completely waste your time. IMHO you should first determine the
cause of the slowness and then find a solution. If you optimize something
that just takes 10% of the whole running time you’ll never see an
improvement of more than 10%…
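
A rough way to put numbers on that point is Amdahl's law. The sketch below is only an illustration: the function names are made up, and the 75% case is an assumption based on the "about 45 minutes out of an hour" estimate given above for the part of the script that runs before the database is touched.

```ruby
# Overall speedup when a part taking `fraction` of the run time
# is made `factor` times faster (Amdahl's law).
def overall_speedup(fraction, factor)
  1.0 / ((1.0 - fraction) + fraction / factor)
end

# Share of the original run time that the optimization saves.
def time_saved(fraction, factor)
  1.0 - 1.0 / overall_speedup(fraction, factor)
end

# Compare optimizing a 10% part with optimizing a 75% part.
[10, 75].each do |percent|
  [2, 10, 1000].each do |factor|
    saved = time_saved(percent / 100.0, factor) * 100
    printf("part = %2d%% of the run, made %4dx faster: %5.1f%% saved\n",
           percent, factor, saved)
  end
end
```

However large the factor, the saving for the 10% part stays below 10%, which is why measuring where the time goes comes first.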

This is true. This script isn’t run very often though, so if it
takes an hour I can live with it. Taking 5 hours to profile it once
would be too long.

Another option to get masses of data into a database is to use some form
of bulk insert / bulk load. Depending on your database there are probably
several options.

MySQL. Part of the problem is that this script is also for
updating, based on new data. If the db is empty it just inserts,
else it updates. Easy enough in ActiveRecord.
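
For the insert-or-update case on MySQL, one possible form of bulk load is a multi-row INSERT with an ON DUPLICATE KEY UPDATE clause, which sends many rows in one statement. The sketch below only builds the SQL text. It assumes a unique index on the key column, and the helper names, the `students` table name and the sample rows are all hypothetical. The quoting helper is for illustration only; real code should use the database driver's own escaping.

```ruby
# Quote a value for SQL. Illustration only: prefer the driver's quoting.
def sql_quote(value)
  return "NULL" if value.nil?
  "'" + value.to_s.gsub("\\") { "\\\\" }.gsub("'", "''") + "'"
end

# Build one multi-row "insert, or update if the key exists" statement.
# rows is an array of hashes keyed by column name.
def bulk_upsert_sql(table, columns, key_columns, rows)
  update_columns = columns - key_columns
  values = rows.map do |row|
    "(" + columns.map { |col| sql_quote(row[col]) }.join(", ") + ")"
  end
  updates = update_columns.map { |col| "#{col} = VALUES(#{col})" }
  "INSERT INTO #{table} (#{columns.join(', ')})\n" +
    "VALUES\n  " + values.join(",\n  ") + "\n" +
    "ON DUPLICATE KEY UPDATE " + updates.join(", ")
end

# Made-up sample rows, just to show the shape of the output.
rows = [
  { :pnumber => "P0001", :surname => "O'Neil", :coll_status => "C" },
  { :pnumber => "P0002", :surname => "Smith",  :coll_status => nil }
]
puts bulk_upsert_sql("students", [:pnumber, :surname, :coll_status],
                     [:pnumber], rows)
```

Batching a few hundred rows per statement is a common compromise between statement size and the number of round trips.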

Slow disks/controllers (+ lots of users) could be a factor, the
machine is 5.5 years old.

Could be. If possible, at least give it more memory.

Thanks. I’m trying to get something done about that.

Kind regards

robert
    Thank you,
    Hugh

On Wed, 30 Nov 2005, zdennis wrote:

privately the code and see if it helps you.
How big is it? I’m wondering how much there is to learn since I’m
still getting to grips with all this Rails stuff :-). I am
interested though. Thank you.

Zach

    Hugh

vjoel wrote:

I’m embarrassed to say it was 6.0. I have 8.0 (express) but can’t get
past the “MSVCR80.DLL missing” problem, at least with the
mkmf.rb-generated Makefile. (For regular projects in MSVC 8.0, you can
get around this problem by deleting the foobar.exe.embed.manifest.res
file from the Debug dir of project foobar.) Anyone have any ideas?

I’m using the single-click installer ruby, which IIRC is compiled with
7.1. Maybe it’s not a fair comparison with gcc-built ruby, since that
will take advantage of i686 vs. i386. So, not a very scientific
comparison at all–it would be best to use the latest MS compiler, build
ruby from scratch, and make sure to use the same arch settings as for
gcc.

I’m just glad to see that gcc is so much better than it was.

I’d be very interested in seeing it compiled with the just-released VC
using LTCG (link-time code generation). It can make inter-module
optimizations and adjust calling conventions on a case-by-case basis.

Eric C. wrote:

will take advantage of i686 vs. i386. So, not a very scientific

I’d love to try it. Any idea how to hack the Makefiles generated by
mkmf.rb to fix the “MSVCR80.DLL missing” problem? Can you compile any
ruby extension successfully with 8.0?

Do you know that LTCG is included in the Express version? ISTR that some
optimization features (maybe profile guided optimization) were not in
the express edition. However, the Property Page for Linker/Optimization
seems to allow setting LTCG and PGO.

On Tue, 29 Nov 2005, Robert K. wrote:

Hugh S. wrote:

 555    1676   17179 /home/hgs/csestore_meta/populate_tables2.rb

I could post the script if you like. I’ve not profiled it to find
out where the slow bits are because it would take about 5 hours
going by previous slowdowns when profiling.

Well that was an interesting estimate. So far it has been 29 hours
to profile it…

    Hugh

vjoel wrote:

I’d love to try it. Any idea how to hack the Makefiles generated by
mkmf.rb to fix the “MSVCR80.DLL missing” problem? Can you compile any
ruby extension successfully with 8.0?

Do you know that LTCG is included in the Express version? ISTR that some
optimization features (maybe profile guided optimization) were not in
the express edition. However, the Property Page for Linker/Optimization
seems to allow setting LTCG and PGO.

I’m very new to Ruby, so I haven’t really tried anything yet. But I
just got a copy of VS 2005 Pro, so I’m itchin’ to try. Now if I could
just find some spare time…

Do you know if there is a benchmark that would give an idea of the
relative speed of the interpreter over a representative workload, or
could perhaps be used as the scenario for the profile-guided
optimization?

One more question: is there a 64-bit Ruby?

(In response to news:[email protected] by Robert K.)

Unfortunately we’re close to release and I don’t really have much time to
look into this deeper. If anyone else volunteers…

Please Hugh, post the script or at least parts of it. I have been doing
lots of database filling recently and might be able to give you a few
pointers. Here’s the general ones:

Profiling is often no use; benchmarking might just be enough. Roughly
knowing where time goes to is a big help in guiding optimization.

Be sure to try running it with a small data set, to speed up the test-
change-test cycle (or whatever the cycle is called in languages that you
don’t compile). Maybe even profile that way.

Do you have a generation stage followed by a fill stage? Or is the
computation intermingled with database accesses?
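
The small-data-set and benchmarking pointers above could be combined along these lines. This is a sketch only: `write_sample` and `timed` are made-up helper names, and the file names in the usage comment assume the script and input file discussed in this thread.

```ruby
require 'benchmark'

# Copy at most max_lines lines of input into output and return the
# number of lines written. (Hypothetical helper, not from the script.)
def write_sample(input, output, max_lines)
  written = 0
  File.open(output, "w") do |out|
    File.foreach(input) do |line|
      break if written >= max_lines
      out.write(line)
      written += 1
    end
  end
  written
end

# Run the block, print the elapsed wall-clock time, return its result.
def timed(label)
  result = nil
  seconds = Benchmark.realtime { result = yield }
  puts "#{label}: #{'%.3f' % seconds}s"
  result
end

# Usage, assuming the TableMaker class from the script is loaded:
#   write_sample("hugh.csv", "hugh_sample.csv", 500)
#   table = timed("generate") { TableMaker.new("hugh_sample.csv") }
#   timed("fill") { table.update_database }
```

Timing the generation stage and the fill stage separately on a few hundred records should already show which of the two dominates.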

hope to be of help
k

Hello Hugh,

I’d propose modifying your main logic as follows:

require 'benchmark'
include Benchmark

# Declare the tables first: a variable first assigned inside a block
# would be local to that block.
new_table = old_table = nil

puts measure { new_table = TableMaker.new("hugh.csv", "update_tables.sql") }
puts measure { new_table.update_database() }
puts measure { old_table = TableMaker.new("hugh.csv.old") }

puts measure { new_table.make_cards("cards.out") }
puts measure { new_table.make_cards("new_cards.out",
                                    new_table.diff_students(old_table)) }

and then running it with a reduced test set. That should give you a hint
as to where time is spent. I have read the code you posted, but cannot
find a performance hog in it. Perhaps you meant to say ‘huge.csv’ instead
of ‘hugh.csv’? How many students are there? How many courses? How many
courses per student on average?

Also, I assume you know that fetching the image files from http can
potentially be very slow. To speed that up, you could parallelize the
process by using a queue, a few workers and a stub image that you can
return.
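
A minimal sketch of the queue-and-workers idea, using Ruby's standard Thread and Queue. The name `fetch_all` and its parameters are hypothetical. The fetcher is passed in as any callable so that the example runs without a network; in the real script it would wrap the Net::HTTP calls from get_picture.

```ruby
require 'thread'

# Run fetcher (any callable taking a URL) over urls with a fixed pool
# of worker threads. A fetch that raises yields stub instead.
# Returns a hash mapping each URL to its result.
def fetch_all(urls, worker_count, stub, fetcher)
  queue = Queue.new
  urls.each { |url| queue << url }
  worker_count.times { queue << nil }   # one stop marker per worker

  results = {}
  lock = Mutex.new

  workers = (1..worker_count).map do
    Thread.new do
      while (url = queue.pop)           # nil marker ends the loop
        body = begin
          fetcher.call(url)
        rescue StandardError
          stub                          # fall back to the stub image
        end
        lock.synchronize { results[url] = body }
      end
    end
  end
  workers.each { |worker| worker.join }
  results
end

# Demonstration with a fake fetcher; no network access involved.
fake = lambda { |url| url =~ /missing/ ? raise("not found") : "data for #{url}" }
p fetch_all(%w[a.jpg missing.jpg b.jpg], 2, "STUB", fake)
```

Because the workers spend most of their time waiting on the network, a handful of threads is usually enough to hide the latency.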

Or you can of course just wait for the machine ;-) … Too bad Moore’s law
doesn’t say that you actually get a new machine every 18 months, only
that it is available.

best greetings,
kaspar

Eric C. wrote:

I’m very new to Ruby, so I haven’t really tried anything yet. But I
just got a copy of VS 2005 Pro, so I’m itchin’ to try. Now if I could
just find some spare time…

I finally got miniruby built with VC 2005 & /GL: it seems to run about
10% faster than the 1.8.3 Windows drop.

Eric C. wrote:

Eric C. wrote:

I’m very new to Ruby, so I haven’t really tried anything yet. But I
just got a copy of VS 2005 Pro, so I’m itchin’ to try. Now if I could
just find some spare time…

I finally got miniruby built with VC 2005 & /GL: it seems to run about
10% faster than the 1.8.3 Windows drop.

What did you do to get miniruby to build?

What’s preventing a full ruby build?

On Fri, 2 Dec 2005, Kaspar S. wrote:

(In response to news:[email protected] by Robert K.)

Unfortunately we’re close to release and I don’t really have much time to
look into this deeper. If anyone else volunteers…

Please Hugh, post the script or at least parts of it. I have been doing
lots of database filling recently and might be able to give you a few
pointers. Here’s the general ones:

I don’t think it is particularly pretty. TableMaker used to
generate SQL directly, now it uses AR instead, so the output file is
unused. I’ve tried to clear out the other unused stuff that is no
use to you but I may still need.

    Thank you
    Hugh

#!/usr/local/bin/ruby -w

$: << '/home/hgs/aeg_intranet/csestore/app/models'

require 'csv'

require 'set'
require 'open-uri'
require 'net/http'
require 'date'
require 'time' # for Time.parse
require 'md5'

require 'hashattr'

require 'fasthashequals'

require "rubygems"
require_gem "activerecord" # for the ORM.

# Makes no sense to include these before active_record
# These are just (almost empty) models from rails. There are some
# relationship definitions (has_a, etc) but that's about it.
require 'student'
require 'cse_module'
require 'device'

$debug = false

# Class for creating the database tables from the supplied input
class TableMaker
  attr_accessor :students, :cse_modules
  INPUT = "hugh.csv"
  OUTPUT = "populate_tables.sql"

  ACCEPTED_MODULES =
    /^"TECH(100[1-7]|200\d|201[01]|300\d|301[0-2])|MUST100[28]/

  STRFTIME_FORMAT = "%a, %d %b %Y %H:%M:%S GMT"

  PATH_TO_IMAGES = 'Z:\new\jpegs\\'

  # Read in the database and populate the tables.
  def initialize(input=INPUT, output=OUTPUT)
    begin
      puts "TableMaker.initialize (input=#{input.inspect}, " +
           "output=#{output.inspect})"
      # check these agree
      # Struct.new("Student", :forename, :surname, :birth_dt,
      #            :picture, :coll_status)
      # Struct.new("Ident", :student, :pnumber)

      # Struct.new("CourseModule", :aos_code, :dept_code,
      #            :aos_type, :full_desc)

      # Struct.new("StudentModule", :student_id, :course_module)

      @students = Set.new()
      @cse_modules = Set.new()
      # The block stores a new Set under each key on first access.
      @student_modules = Hash.new { |hash, key| hash[key] = Set.new() }
      # Most images will be written in bulk so cache them
      @web_timestamps = Hash.new()
      # Initialize variables
      forename, surname, birth_dt, pnumber, aos_code,
        acad_period, stage_ind, dept_code, stage_code, aos_type,
        picture, coll_status, full_desc = [nil] * 13

      student, cse_module, ident = nil, nil, nil
      record = nil

      last_pnumber, last_aos_code = nil, nil
      last_student, last_cse_module = nil, nil

      open(input, 'r') do |infp|
        while record = infp.gets
          # record.strip!
          puts "record is #{record}" if $debug
          # Don't split off the rest till we need it.
          # Hopefully splitting on strings is faster.
          forename, surname, birth_dt,
            pnumber, aos_code, the_rest =  record.split(/\s*\|\s*/,6)

          next unless aos_code =~ ACCEPTED_MODULES

          forename, surname, birth_dt, pnumber, aos_code,
            acad_period, stage_ind, dept_code, stage_code, aos_type,
            picture, coll_status, full_desc = record.split(/\s*\|\s*/)

          puts "from record, picture is [#{picture.inspect}]." if $debug

          if pnumber == last_pnumber
            student = last_student
            puts "pnumber set to last_pnumber" if $debug
          else
            # Structures for student
            student = Student.new(
              :forename => forename,
              :surname => surname,
              :birth_dt => birth_dt,
              :pnumber => pnumber,
              :picture => picture,
              :coll_status => coll_status
            )

            # Avoid duplicates
            # unless @students.include? student
            @students.add student
            # else
            # puts "Already seen #{student}" if $debug
            # end
            last_pnumber = pnumber
            last_student = student
          end

          # Structures for module data.
          if aos_code == last_aos_code
            this_cse_module = last_cse_module
          else
            this_cse_module = CseModule.new(
              :aos_code => aos_code,
              :dept_code => dept_code,
              :aos_type => aos_type,
              :full_desc => full_desc
            )
          end

          # Avoid duplicates
          @cse_modules.add this_cse_module
          last_cse_module = this_cse_module
          # Remember the code so the comparison above can match.
          last_aos_code = aos_code
          @student_modules[student].add this_cse_module

          puts "cse_module is #{this_cse_module}" if $debug

        end
      end
    rescue
      puts "\n"
      puts $!
      puts $!.backtrace.join("\n")
    end
  end

  def has_student?(given_student)
    result = @students.member?(given_student)
    puts "has_student?: @students.size is #{@students.size}, result is #{result}"
    return result
  end

  def diff_students(other_table)
    diff_students = @students - other_table.students
    return Set.new(diff_students)
  end

  # The pnumber is a barcode that uniquely identifies a student.
  def has_pnumber?(apnumber)
    return @students.any? do |pn|
      pn == apnumber
    end
  end

  def new_pnumber(old_table)
    new_pnumbers = @pnumbers.reject do |pn|
      old_table.has_pnumber?(pn)
    end
    return Set.new(new_pnumbers)
  end

  # Convert the picture to a URI and get it, if necessary.
  # moved out of make_cards to shorten that function.
  def get_picture(pic_name)
    pic = "#{pic_name}"
    pic.gsub!(/"/,'')
    pic.gsub!(/ /, "%20")
    url = pic.dup
    puts "pic is #{pic.inspect}\nurl is #{url.inspect}" # if $debug
    pic.sub!(/^.*\//,'')
    puts "pic is now #{pic.inspect}" # if $debug
    if pic.empty?
      puts "No such picture " if $debug
    elsif pic =~ /^Z:\\/i
      puts "Already got this " if $debug
    else
      Dir.chdir("./images") do
        begin
          grab = true
          url =~ /^http:\/\/([^:\/]+):?([^\/]*?)(.*)/
          host, port, path = $1, $2, $3
          port = 80 if port.nil? or port.empty?
          puts "pic #{pic}:- host #{host} port #{port} path #{path} " #if $debug
          Net::HTTP.start(host, port) do |http|
            header = http.head(path)
            lastmod = header['last-modified']
            # timestamp = DateTime.strptime(lastmod, STRFTIME_FORMAT)
            # timestamp = Time.new(DateTime.strptime(lastmod,
            #                                        STRFTIME_FORMAT))
            lastmod ||=Time.now.to_s
            timestamp = (@web_timestamps[lastmod] ||=
                         Time.parse(lastmod))

            if File.exist?(pic)
              mtime = File.mtime(pic)
              puts "mtime #{mtime} timestamp #{timestamp}" if $debug
              if mtime > timestamp
                puts "file is newer, skip." if $debug
                grab = false
              end
            end
            if grab
              open(pic, "wb") do |image|
                image.print http.get(path).body
              end
            end
          end
        rescue => e
          puts e.inspect
          puts "\n"
          puts "#{$!}, #{e}"
          puts $!.backtrace().join("\n")
        end
      end
    end
    return  PATH_TO_IMAGES + pic + "\r\n"
  end

  # Output all the data necessary to create the id cards.
  def make_cards(output,the_students = @students)
    personal_fields = [:forename, :surname, :birth_dt, :pnumber]
    open(output, "w") do |outf|
      the_students.each do |student|
        puts "student:- #{student} :" if $debug
        outstring = personal_fields.collect do |message|
          # Remove unwanted quotation marks
          "#{student.send(message)}, ".gsub(/"/,'')
        end.join('')
        # We need to iterate in case a student has two ids
        # Not any more -- we know they will look like two students.
        # It doesn't matter.

        outstring += get_picture(student.picture)
        outf.print outstring
      end
    end
  end

  # Cannot update the database til the comparison is complete, so
  # this code must be moved into here
  def update_database
    @students.each do |student|
      puts "update_database(): pnumber is #{student.pnumber}"
      begin
        orig_student = Student.find(:first,
                                    :conditions => ["pnumber = ?", student.pnumber])
        puts "update_database(): orig_student.pnumber is #{orig_student.pnumber}"
      rescue Exception => e
        puts "update_database(): exception is #{e}"
        puts "\n"
        puts $!
        puts $!.backtrace.join("\n")
        puts "\n"
        orig_student = nil
      end
      if orig_student.nil? # i.e. nothing found
        student.save!
      else
        orig_student.update_attributes(
          :surname => student.surname,
          :birth_dt => student.birth_dt,
          :picture => student.picture,
          :coll_status => student.coll_status
        )
      end
    end
    @cse_modules.each do |cse_module|
      orig_cse_module = CseModule.find(:first,
                                       :conditions => ['aos_code = ?', cse_module.aos_code]) rescue nil
      if orig_cse_module.nil?
        cse_module.save!
      else
        orig_cse_module.update_attributes(
          :dept_code => cse_module.dept_code,
          :aos_type => cse_module.aos_type,
          :full_desc => cse_module.full_desc
        )
      end
    end
    # This next line should sort out the join table.
    @student_modules.each do |student, modules|
      the_student = Student.find(:first,
                                 :conditions => ['pnumber = ?', student.pnumber])
      modules.each do |cse_module|
        the_cse_module = CseModule.find(:first,
                                        :conditions => ['aos_code = ?', cse_module.aos_code])
        puts "update_database(): updating #{the_cse_module} with #{the_student}"
        the_cse_module.students << the_student
      end
    end
  end
end

class KitTableMaker

  def initialize(input)
    # create outside the block for speed.
    name, serialno, barcode = [nil]*3
    @kit = Set.new()
    barcodes = Set.new()

    open(input, 'r') do |infp|
      while record = infp.gets
        name, serialno, barcode = record.split(/\s*,\s*/,3)
        if barcodes.member?(barcode)
          puts "Duplicate barcode #{barcode}"
        else
          device = Device.new(:description => name,
                              :serialno => serialno,
                              :barcode => barcode)
          barcodes.add(barcode)
          @kit.add device
        end
      end
    end
  end

  def update_database
    @kit.each do |device|
      begin
        orig_kit = Device.find(:first,
                               :conditions => ["barcode = ?", device.barcode])
      rescue Exception => e
        puts "Device::update_database: exception is #{e}"
        puts "\n", $!, $!.backtrace.join("\n"), "\n"
      end
      if orig_kit.nil?
        device.save!
      else
        begin
          orig_kit.update_attributes(:description => device.name,
                                     :serialno => device.serialno,
                                     :barcode => device.barcode)
        rescue Exception => e
          puts "Device::update_database: exception is #{e}"
          puts "\n", $!, $!.backtrace.join("\n"), "\n"
        end
      end
    end
  end
end

if __FILE__ == $0
  begin
    ActiveRecord::Base.establish_connection(
      :adapter => 'mysql',
      :host => 'localhost',
      :port => 3608,
      :database => 'csestore_development',
      :username => 'hgs',
      :password => 'post-it-to-ruby-talk?'
    )
    new_table = TableMaker.new("hugh.csv", "update_tables.sql")
    new_table.update_database()
    old_table = TableMaker.new("hugh.csv.old")

    new_table.make_cards("cards.out")
    new_table.make_cards("new_cards.out",
                         new_table.diff_students(old_table))
  rescue Exception => e
    puts "\n"
    puts "#{$!}, #{e}"
    puts $!.backtrace().join("\n")
  end

  device_table = KitTableMaker.new("stock1.csv")
  device_table.update_database()
end