A Real World example for Ruby to "compiled" version discussion

People,

The “how can we make a ruby compiler” thread has been very interesting -
I really like hearing from serious and competent programmers about the
theoretical problems involved with this issue.

William R. asked for an expansion on the details of my C/C++
population genetics simulation program as a specific example of how one
might proceed depending on a particular situation. I am happy to
elaborate - not least because it would be good to get input from
experienced Ruby programmers before I just try to replicate the same
program in Ruby - I’m sure there will be more sensible/efficient ways of
doing things than what I would attempt first off and so comparisons
between my C/C++ version and a dodgy Ruby program might be even more
unfair . .

I will summarise the C/C++ program as it exists now (it has gone through
a number of versions and has a lot of code that does not need
replicating for the present comparison) with a general overview (I can
add more detail later if people are interested):

  • A population is represented by a number of sub-populations which
    occupy cells of a two dimensional array

  • each cell has pointers to three ordered lists - representing the
    parental population, the offspring population and a temporary population
    and each element of a list represents an individual

  • at each generation (of potentially hundreds), the whole array is
    iterated through, new offspring are produced, migrants move to adjacent
    cells, parents die off etc

As well as this main simulation program, I have already replaced all the
original shell scripts and some of the statistical processing with Ruby
scripts but the main simulation program is where I couldn’t afford an
order or two increase in running times by rewriting in Ruby.

The main problem I see with some of the Ruby conversions that I have
looked at (eg RubyInline) is that the performance problem comes in in
repeating the WHOLE simulation with different starting parameters which
is done thousands of times - so it is not like you have a single
recursive algorithm which is a bottleneck that can be optimised or
rewritten in C or something. There are lots of little steps that happen
millions of times . .

I hope that is sort of clear?

Regards,

Phil.

Philip R.

GPO Box 3411
Sydney NSW 2001
Australia
E-mail: [email protected]

On Oct 6, 2010, at 09:55 , Philip R. wrote:

The main problem I see with some of the Ruby conversions that I have looked at (eg RubyInline) is that the performance problem comes in in repeating the WHOLE simulation with different starting parameters which is done thousands of times - so it is not like you have a single recursive algorithm which is a bottleneck that can be optimised or rewritten in C or something. There are lots of little steps that happen millions of times . .

Could you expand on this??

On Thu, Oct 7, 2010 at 2:57 AM, Ryan D. [email protected]
wrote:

On Oct 6, 2010, at 09:55 , Philip R. wrote:

The main problem I see with some of the Ruby conversions that I have looked at (eg RubyInline) is that the performance problem comes in in repeating the WHOLE simulation with different starting parameters which is done thousands of times - so it is not like you have a single recursive algorithm which is a bottleneck that can be optimised or rewritten in C or something. There are lots of little steps that happen millions of times . .

Could you expand on this??

Specifically

  • What properties (conceptually) does an individual / a population have?

  • What calculations are done for each individual / population?

  • What kind of iteration is applied?

Kind regards

robert

Philip R. wrote:

experienced Ruby programmers before I just try to replicate the same

  • A population is represented by a number of sub-populations which
    As well as this main simulation program, I have already replaced all
    happen millions of times . .

I hope that is sort of clear?

Regards,

Phil.
Some questions with my guesses at possible answers:

How much data is carried for each individual? a few alleles? a lot? the
whole human genome?

What is done in the innermost loop? I imagine you select a pair of
individuals, cross them, etc to produce more individuals. Or perhaps an
individual dies or migrated to another population.

What do you do to run a new experiment? generate the initial population,
specify the rules for combination, depth, migration, etc? gather data
and process it at the end?

How much code do you need to write or rewrite for an experiment? How
many parameters are involved?

Can you give some examples of aspects of the C++ program that you hope
to improve by moving the sim to Ruby?

Can you say something about the context of your modeling? How is your
simulator different from others used in the field?

– Bill

People,

I have accumulated the questions from different people into this
response:

On 2010-10-07 23:12, Robert K. wrote:

bottleneck that can be optimised or rewritten in C or something.
There are lots of little steps that happen millions of times . .

Could you expand on this??

Specifically

  • What properties (conceptually) does an individual / a population
    have?

This is an Individual Based Model (IBM) as opposed to populations that
can be modelled as a whole with equations etc - so the simulation
reflects this. In the C/C++ version an individual organism is an object
with a number of attributes: ID, parent ID(s), (microsatellite) allele
repeat length, randomizer int etc

  • What calculations are done for each individual / population?

The genetics of the population changes as the individuals change because
of mutations which affect repeat length of microsatellites ie they are
not selected against/for (neutral mutations).

  • What kind of iteration is applied?

For this particular discussion:

for k in 1…K # generations
for x in 1…X # array width
for y in 1…Y # array height
destroy parent ordered list
destroy tmp ordered list
offspring list becomes parent list
for i in 1…I # organism in parent ordered list
asexually reproduce offspring with possible mutation
possibly migrate offspring to neighbouring array cell
end
end
end
end

data is written to disk after each generation and genetics statistics
are done later.

How much data is carried for each individual? a few alleles? a lot?
the whole human genome?

For this exercise, one loci and allele (asexual reproduction) - multiple
loci can be modelled by multiple runs.

What is done in the innermost loop? I imagine you select a pair of
individuals, cross them, etc to produce more individuals. Or perhaps
an individual dies or migrated to another population.

Asexual reproduction in this case and see above.

What do you do to run a new experiment? generate the initial
population, specify the rules for combination, depth, migration, etc?
gather data and process it at the end?

A script calls the simulation program hundreds of times with different
starting populations. These starting populations are the output of the
same simulation program which has previously been run, starting with
different random number seeds, until they reach “quasi-equilibrium”
(typically 500 generations). Also see above.

How much code do you need to write or rewrite for an experiment? How
many parameters are involved?

The code doesn’t change - a configuration file and the starting
(initial) generation can be changed.

Can you give some examples of aspects of the C++ program that you
hope to improve by moving the sim to Ruby?

Not really - it was a while ago now that I did the actual coding (I
don’t need to touch the code much these days) and I just remember it was
a major pain trying to learn about the eccentricities of OOP with C++ -
I’m sure it would have been a much more pleasant experience doing it in
Ruby . .

I actually don’t need to change to Ruby now - I just need to write a
thesis . . What would be nice would be to further develop the program,
make it more generalised, more friendly for other users, be able to
respond to requests for enhancements/changes more quickly etc.

Can you say something about the context of your modeling? How is your
simulator different from others used in the field?

There are a couple of particular issues we wanted to investigate and my
supervisor thought that the existing stuff that was around then (about
nine years ago now) didn’t conveniently allow us to do what we wanted -
maybe it is has changed now but I haven’t had much exposure to other
programs. Also, a lot of them then were written in some sort of
graphical environment for Windows or Macs and I wanted to have a CLI
interface on Linux and have more direct control over what was happening
. .

Thanks,

Phil.

Philip R.

GPO Box 3411
Sydney NSW 2001
Australia
E-mail: [email protected]

Phillip,

On 2010-10-09 02:57, Phillip G. wrote:

  destroy parent ordered list

is has changed now but I haven’t had much exposure to other programs. Also,
a lot of them then were written in some sort of graphical environment for
Windows or Macs and I wanted to have a CLI interface on Linux and have more
direct control over what was happening . .

This sounds like something JRuby would be good at it, since it can JIT
the loops, and the Java VM reaches nearly-native speeds these days.

Native to what? Ruby, Java, C++ ?

So, yes, improved technology could make this faster these days.

I would prefer to be able to produce stand-alone executables if possible
. .

Thanks,

Phil.

Philip R.

GPO Box 3411
Sydney NSW 2001
Australia
E-mail: [email protected]

On 10/08/2010 06:37 AM, Philip R. wrote:

This is an Individual Based Model (IBM) as opposed to populations that
can be modelled as a whole with equations etc - so the simulation
reflects this.

Traffic simulation has an analogous distinction: microscopic (traffic as
aggregate behavior of individuals) vs. macroscopic (traffic as wave).

I keep trying to see how to use narray to express the state of your
model, but the variable length organism list probably prevents that.
Narray would give you so much for free, especially if part of the
timestep transformation is linear (matrix multiplication). But that’s
probably simplistic.

Joel.

On 2010-10-09 09:28, Joel VanderWerf wrote:

On 10/08/2010 06:37 AM, Philip R. wrote:

This is an Individual Based Model (IBM) as opposed to populations that
can be modelled as a whole with equations etc - so the simulation
reflects this.

Traffic simulation has an analogous distinction: microscopic (traffic as
aggregate behavior of individuals) vs. macroscopic (traffic as wave).

Yep, IMBs get used for lots of things.

I keep trying to see how to use narray to express the state of your
model, but the variable length organism list probably prevents that.
Narray would give you so much for free, especially if part of the
timestep transformation is linear (matrix multiplication). But that’s
probably simplistic.

I was expecting to have to change my ordered lists to one-dimensional
ordered arrays anyway . . I will check out narrays . .

Thanks,

Phil.

Philip R.

GPO Box 3411
Sydney NSW 2001
Australia
E-mail: [email protected]

On Fri, Oct 8, 2010 at 3:37 PM, Philip R. [email protected]
wrote:

People,

I have accumulated the questions from different people into this response:
[snip /]
possibly migrate offspring to neighbouring array cell
end
end
end
end

[snip /]

The code doesn’t change - a configuration file and the starting (initial)
generation can be changed.

[snip /]

There are a couple of particular issues we wanted to investigate and my
supervisor thought that the existing stuff that was around then (about nine
years ago now) didn’t conveniently allow us to do what we wanted - maybe it
is has changed now but I haven’t had much exposure to other programs. Also,
a lot of them then were written in some sort of graphical environment for
Windows or Macs and I wanted to have a CLI interface on Linux and have more
direct control over what was happening . .

This sounds like something JRuby would be good at it, since it can JIT
the loops, and the Java VM reaches nearly-native speeds these days.

So, yes, improved technology could make this faster these days.


Phillip G.

Though the folk I have met,
(Ah, how soon!) they forget
When I’ve moved on to some other place,
There may be one or two,
When I’ve played and passed through,
Who’ll remember my song or my face.

On Fri, Oct 8, 2010 at 10:57 AM, Phillip G.
[email protected] wrote:

This sounds like something JRuby would be good at it, since it can JIT
the loops, and the Java VM reaches nearly-native speeds these days.

So, yes, improved technology could make this faster these days.

Re JRuby: Eventually, yes…the current compiler is largely unchanged
from JRuby 1.1, however, and so we’re not taking as good advantage of
the JVM as you might hope. Lots of opportunity to improve it, and
obviously we’d like these loops to be as fast as Java loops if
possible…

Soonish, perhaps! Help us fix bugs so I can spend more time on perf :slight_smile:

  • Charlie

This forum is not affiliated to the Ruby language, Ruby on Rails framework, nor any Ruby applications discussed here.

| Privacy Policy | Terms of Service | Remote Ruby Jobs