Re-organize original text data then write in files

junhuiliao · July 24, 2010, 12:24pm

Dear all,

I need to reorganize some original text data and write the result to
files.
The original data looks like this (TSV format):

First line: time_1.1 signal_1.1 time_2.1 signal_2.1 …
time_4096.1 signal_4096.1 (4096 pairs in total).
Second line: time_1.2 signal_1.2 time_2.2 signal_2.2 …
time_4096.2 signal_4096.2 (4096 pairs in total).
…
Last line (2048 lines in total): time_1.2048 signal_1.2048 time_2.2048
signal_2.2048 … time_4096.2048 signal_4096.2048 (4096 pairs in total).

What I need to do is:

Step 0: subtract time_n.1 from every time_n.*. That is to say,
time_1.1 should be subtracted from time_1.1, time_1.2, … time_1.2048.
time_2.1 should be subtracted from time_2.1, time_2.2, … time_2.2048.
…
time_4096.1 should be subtracted from time_4096.1, time_4096.2, …
time_4096.2048.

Step 1: collect the time_k.* and signal_k.* values from every line and
save them together in one file, say file_k.tsv.
Namely, time_1.1, signal_1.1, time_1.2, signal_1.2 … time_1.2048,
signal_1.2048 should be saved in file_1.tsv. The first line is
time_1.1 signal_1.1; the second line is time_1.2 signal_1.2 … the last
line is time_1.2048 signal_1.2048.

Likewise, time_2.1, signal_2.1, time_2.2, signal_2.2 … time_2.2048,
signal_2.2048 should be saved in file_2.tsv. The first line is
time_2.1 signal_2.1; the second line is time_2.2 signal_2.2 … the last
line is time_2.2048 signal_2.2048.
…
Finally, time_4096.1, signal_4096.1, time_4096.2, signal_4096.2 …
time_4096.2048, signal_4096.2048 should be saved in file_4096.tsv.
The first line is time_4096.1 signal_4096.1; the second line is
time_4096.2 signal_4096.2 … the last line is time_4096.2048
signal_4096.2048.

I have already written a program in C++, but it takes around 3 hours to
do this job.
I am completely new to Ruby and Perl, and know only a little Python.
So, my questions are:

1. How long would this job take in Ruby?
If it takes less than one and a half hours, then Ruby is worth studying
for me. I am already attracted by the beauty of Ruby : ) .
2. Is there a similar example?

Best regards !
Junhui

junhuiliao · July 25, 2010, 11:30pm

On Sat, Jul 24, 2010 at 11:24 AM, Junhui L. wrote:

…
Already, I developed a script in C++, but it cost around 3 hours to deal
with this job.
And I am totally new guy to ruby, perl, a little on Python.
So, my question is,

1, how many time it will be cost to do this job under ruby?
If the time less than one and a half hours, then it worth to study for
me. I was attracted by the beautiful ruby, already : ) .

I’m neither a Ruby expert nor an expert programmer, but I have been using
Ruby (for my own purposes) for over 8 years, and as a thought exercise I
tried this (not actually running anything), and it took me about 20 to 30
minutes, provided the computer memory is big enough to hold all the data.
(I couldn’t think of an easy way to do what I think you want without
reading in all the data first, modifying it, then writing it out. That, or
opening 4096 files at the same time: neither way seems elegant.)
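For what it is worth, the read-everything-first approach could be sketched
in Ruby roughly as below. This is only a sketch under assumptions: the
helper name `split_channels` and the `out_dir` argument are made up for
illustration, the input is assumed to be whitespace-separated numbers with
the same number of pairs on every line, and the output names follow the
file_k.tsv pattern from the original post. It has not been timed on data
of the real size.

```ruby
# Hypothetical helper: reads the whole input into memory, then writes one
# output file per (time, signal) column pair.
def split_channels(input_path, out_dir)
  # rows[i] holds line i of the input as an array of floats.
  rows = File.foreach(input_path)
             .map { |line| line.split.map(&:to_f) }
             .reject(&:empty?)          # skip blank lines
  return 0 if rows.empty?

  first = rows.first
  pairs = first.size / 2

  pairs.times do |k|
    t0 = first[2 * k]                   # time_k.1, the reference for step 0
    File.open(File.join(out_dir, "file_#{k + 1}.tsv"), "w") do |out|
      rows.each do |row|
        # Step 0 (subtract the first time) and step 1 (regroup) together.
        out.puts "#{row[2 * k] - t0}\t#{row[2 * k + 1]}"
      end
    end
  end
  pairs
end

# Usage (paths are placeholders):
#   split_channels("original.tsv", "out")
```

Only one output file is open at any time, at the price of holding every
value in memory.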

And if you can do that in C++ then I’m sure you can probably do it in
Ruby, Perl, Python, etc, etc. If you can program in C++ then I see no
reason why you wouldn’t be able to program in Ruby, Perl, Python, etc. (It
might look like C++ rewritten in R, P, P, etc, but so what if you’re
trying things out.)

Personally, if I didn’t have much time, and I wanted to try something out
in another computer language, I’d go with a language that I knew a little
about, so in my case that would be Ruby, Pascal, Qbasic (!!!), and - in
your case - maybe try something quick in Python. (But I’d also encourage
you to look at Ruby sometime and try it.)

Maybe it partly depends on what standard methods/functions are available:
for example, in Ruby you can read a line from a file into a String, and
then use a builtin method on the String to split it into an array of
values using a specified delimiter, so in your case a space character? But
I’d be very surprised if there weren’t similar builtins in Perl and
Python.
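As a small illustration of that builtin: `String#split` takes the
delimiter, and `each_slice(2)` then groups the values into (time, signal)
pairs. The helper name `parse_pairs` is made up for this sketch, and a tab
delimiter is assumed because the original post calls the data TSV.

```ruby
# Hypothetical helper: turn one input line into an array of [time, signal]
# pairs. The delimiter defaults to a tab (assumption: the data is TSV).
def parse_pairs(line, delimiter = "\t")
  line.chomp                 # drop the trailing newline
      .split(delimiter)      # array of strings
      .map(&:to_f)           # array of floats (to_f yields 0.0 for non-numeric text)
      .each_slice(2)         # group as [time, signal]
      .to_a
end

pairs = parse_pairs("0.10\t1.5\t0.20\t2.5\n")
# pairs is [[0.1, 1.5], [0.2, 2.5]]
```

If bad input should raise an error instead of silently becoming 0.0,
`Float(str)` can be used in place of `to_f`.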

junhuiliao · July 26, 2010, 12:06am

(I couldn’t think of an easy way to what I think you want to do without
reading in all the data first, modifying it, then writing it out. That, or
open 4096 files at the same time: neither way seems elegant.)

Actually, I wrote two versions of the C++ program. One opens 4096 files at
the same time; this takes 3 hours. The other saves all of the data in a
big vector, then scans the vector to pick the right items to write to the
files; this takes 2 hours and 45 minutes. :-)
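For comparison, the first variant (all output files open at once, input
read one line at a time) might look like this in Ruby. It is a sketch only:
the name `split_streaming` and the `out_dir` argument are hypothetical, the
input is assumed to be whitespace-separated with the same number of pairs
on every line, and nothing here says how it would compare in speed with
the C++ versions. Many systems also cap the number of files one process
may have open, so 4096 simultaneous files may require raising that limit.

```ruby
# Hypothetical helper: keeps one open output file per (time, signal) pair
# and streams the input line by line, so memory use stays small.
def split_streaming(input_path, out_dir)
  outs = nil   # output File objects, opened when the first line is seen
  t0 = nil     # first time value of each pair (time_k.1)

  File.foreach(input_path) do |line|
    values = line.split.map(&:to_f)
    next if values.empty?                       # skip blank lines

    if outs.nil?
      t0 = values.each_slice(2).map(&:first)
      outs = Array.new(t0.size) do |k|
        File.open(File.join(out_dir, "file_#{k + 1}.tsv"), "w")
      end
    end

    values.each_slice(2).with_index do |(t, s), k|
      outs[k].puts "#{t - t0[k]}\t#{s}"         # subtract time_k.1, then write
    end
  end

  outs ? outs.size : 0
ensure
  outs&.each(&:close)                           # always close the files
end
```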

Personally, if I didn’t have much time, and I wanted to try something out
in another computer language, I’d go with a language that I knew a little
about, so in my case that would be Ruby, Pascal, Qbasic (!!!), and - in
your case - maybe try something quick in Python. (But I’d also encourage
you to look at Ruby sometime and try it.)

Thanks a lot for your encouragement. I have already started reading about
Ruby, since the language is very simple and beautiful, whether or not it
works for my case (though I hope it will).

Maybe it partly depends on what standard methods/functions are available:
for example, in Ruby you can read a line from a file into a String, and
then use a builtin method on the String to split it into an array of
values using a specified delimiter, so in your case a space character?

This is exactly the kind of comment I need: what knowledge is necessary
and sufficient to do my job, and whether there are any special, powerful
methods or techniques for this kind of task.
Better still would be an example close to my case. I can get the details
by reading books or googling.

Anyway, thanks a lot for your reply!
Best !
Junhui

junhuiliao · July 28, 2010, 12:14am

I’m putting this at the top of my post because I think the basic problem
here may be intensive numeric calculations, and - even more so - disk
(input and) output of about 16 Mi values x N bytes of data, where N is 8
bytes (? for floating point numbers), so about 128 MiB in total, and other
people will have a better knowledge of some possibly useful links.
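The arithmetic behind that estimate, spelled out (the helper name
`estimated_mib` is made up, and 8 bytes per value is the assumption stated
above, i.e. a double-precision float; the text files themselves may be
larger or smaller depending on how many digits each number is written
with):

```ruby
# Hypothetical helper: size of the raw numeric data in MiB.
def estimated_mib(pairs_per_line, lines, bytes_per_value = 8)
  values = pairs_per_line * lines * 2   # each pair is a time and a signal
  values * bytes_per_value / (1024.0 * 1024.0)
end

# 4096 pairs x 2048 lines x 2 values = 16,777,216 values (16 Mi),
# and 16 Mi values x 8 bytes = 128 MiB.
puts estimated_mib(4096, 2048)   # prints 128.0
```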

On Sun, Jul 25, 2010 at 11:15 PM, Junhui L. wrote:

Actually, I developed two versions of C++ script.
One is opening 4096 files at the same time. This cost 3 hours.
Another version is saving all of the data in a big vector,
then scanning the vector to pick the right items to write
in files. This cost 2 hours and 45 minutes. :-).

Sorry - in my post I misunderstood what you meant by “cost”. I think it is
(very?) unlikely that any Ruby (or Perl or Python, etc?) program will run
faster than your C++ programs. Where Ruby (or Python - I’m not so sure
about Perl, I haven’t used it) does have an advantage is that I think
development may be quicker. So there are trade-offs. (Incidentally, I’m
not an expert, but those timings suggest to me that the major processing
cost may be in writing the results out to disk, so changing the language
for all or part of the processing is unlikely to make a large difference?)
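One way to check that guess rather than rely on it would be to time the
parsing and the writing separately, for example with Ruby’s standard
`Benchmark` module. The sketch below is illustrative only: the helper name
`time_parse_and_write` is made up, it writes to a temporary file, and
results on a small sample will not necessarily scale to the full data.

```ruby
require "benchmark"
require "tempfile"

# Hypothetical helper: returns the seconds spent parsing the given lines
# into floats and the seconds spent writing them back out as TSV.
def time_parse_and_write(lines)
  parsed = nil
  parse_seconds = Benchmark.realtime do
    parsed = lines.map { |line| line.split.map(&:to_f) }
  end

  write_seconds = Benchmark.realtime do
    Tempfile.create("timing") do |file|
      parsed.each { |row| file.puts row.join("\t") }
      file.flush                       # push buffered output to the OS
    end
  end

  { parse: parse_seconds, write: write_seconds }
end

# Usage with a small made-up sample:
#   p time_parse_and_write(["0.1 1.5 0.2 2.5"] * 1000)
```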

But I’m open to correction: there are people who have used Ruby for fairly
intensive processing of large data sets, but my understanding is that they
use a mixture of Ruby as “glue” with any intensive calculations in C, etc.
For example, from some limited experience I have, the speed of Ruby
reading strings of bytes in from files is similar to the speed of Java or
compiled Pascal, but for calculating CRCs of files the speed of pure Ruby
calculating the CRCs once the bytes had been read in was much slower than
Java or compiled Pascal: so I used Ruby (or rather JRuby) to read in the
strings of bytes from the files, and then called Java code from Ruby to
calculate the CRC from the bytes. Overall the speed of this was similar to
a pure Java or pure compiled Pascal program.
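To illustrate the same “Ruby as glue” idea without Java (this is not what
I did, just an analogous sketch): Ruby’s standard `zlib` library exposes
`Zlib.crc32`, so Ruby reads the bytes and the checksum arithmetic is left
to the library. The helper name `file_crc32` and the 1 MiB chunk size are
arbitrary choices for this example.

```ruby
require "zlib"

# Hypothetical helper: CRC-32 of a file, read in chunks so that large
# files do not need to fit in memory.
def file_crc32(path, chunk_size = 1 << 20)
  crc = 0
  File.open(path, "rb") do |file|
    # read(n) returns nil at end of file, which ends the loop.
    while (chunk = file.read(chunk_size))
      crc = Zlib.crc32(chunk, crc)   # continue the running checksum
    end
  end
  crc
end
```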

Piet Hut and Jun Makino have been using Ruby to model dense star clusters.
(Note that this is something I know nothing about! I’m just intrigued by
the underlying principle of using Ruby for intensive numerical
calculations by developing in Ruby without worrying about speed by using
smaller unrealistic models, and then using more realistic models by
translating part (or all!) of the Ruby code to a faster language.)

http://www.kira.org/index.php?option=com_content&task=view&id=124&Itemid=154
…MODEST is the new name for the Stellar Dynamics workshop. It stands for:
MOdeling DEnse STellar systems
…
The basic idea is to start a kind of N-body wikipedia, as a group’s
process. It should be self contained, a place to gather all the basic
information that is currently missing from the literature. Up till now, if
you want to write a decent N-body code from scratch, you have to somehow
catch the oral knowledge that is floating around in the community, most of
which has never been written down. We want to change that.

One place to start is the open-source introductory text written over the
last several years by Piet Hut and Jun Makino, “Moving Stars Around”; see
http://www.artcompsci.org/#msa where we give pointers to two versions. The
older one is written using C++ and the newer one, with even more detailed
background, uses Ruby. We propose to use the basic text, and to translate
the (rather short) pieces of code into other languages, starting with LSL,
the Linden Scripting Language, and then to move on to OpenSim. …

http://www.manybody.org/wiki/index.php/Moving_Stars_Around
http://www.artcompsci.org
http://www.artcompsci.org/kali/pub/msa/title.html
http://www.artcompsci.org/kali/pub/msa/ch16.html#rdocsect123
16.7. Conclusion

Dan: This confirms our earlier conclusions. At least on this particular
computer, that we are now using to do some speed tests, the unoptimized C
version takes 50% more time than the optimized version, the simplest Ruby
version takes about 50 times more time, the Ruby array version about 100
times more, and finally the Ruby vector version takes more than 250 times
more time than the optimized C version.

Carol: But even so, for short calculations, who cares if a run takes ten
millisecond or a few seconds? I certainly like the power of Ruby in giving
us vector classes, and a lot more goodies. We have barely scratched the
surface of all the power that Ruby can give us. You should see what we can
do when we really start to pass blocks to methods and . . .

Dan: . . . and then we will start drinking a lot of coffee, while waiting
for results when we begin to run 100-body experiments! Is there no way to
speed up Ruby calculations?

Carol: There is. By the time we use 100 particles, we are talking about
10^2 x 10^2 = 10^4 force calculations for every time step. This means that
the calculation of the mutual accelerations will take up almost all of the
computer time. What we can do is write a short C code for computing the
accelerations. It is possible to invoke such a C code from within a Ruby
code. In that way, we can leave most of the Ruby code unchanged, while
gaining most of the C speed.

Erica: I certainly like the flexibility of a high-level language like
Ruby, at least for writing a few trial versions of a new code. In order to
play around, Ruby is a lot more fun and a lot easier to use than C or C++
or Fortran. After we have constructed a general N-body code that we are
really happy with, we can always translate part of it into C, as Carol
just suggested. Or, if really needed to gain speed, we could even
translate the whole code into C. Translating a code will always take far
less time than developing a code in the first place. And it seems pretty
clear to me that development will be faster in Ruby.

Dan: I’m not so sure about all that. In any case, we got started now with
Ruby, so let us see how far we get. But if and when we really get bogged
down by the lack of speed of Ruby, we should not hesitate to switch to a
more efficient language.

…

Thanks a lot for your encourage,

I tried to read something on ruby already.
Since this language is very simple and beautiful,
no matter it works for my case or not.
(But I hope it could be.)

To quote something Damon Runyon quoted/adapted: The race is not always to
the swift nor the battle to the strong - but that’s the way to bet.
http://www.barrypopik.com/index.php/new_york_city/entry/the_race_is_not_always_to_the_swift_no_the_battle_to_the_strong_but_thats_t/
Or to continue that analogy, “horses for courses”!
(Translation: for parts of a task that are heavy numerical processing a
“lower level” language is likely to be faster. That said, sometimes the
slower higher level language is fast enough: I was doing some actuarial
programming in JRuby, and intended to have some intensive numerical
calculations in Java. It turned out that JRuby was actually quite fast
enough for those particular calculations, although (for portability
reasons) I will be translating that numerical code from Ruby to Java.)

Maybe it partly depends on what standard methods/functions are
available: for example, in Ruby you can read a line from a file
into a String, and then use a builtin method on the String
to split it into an array of values using a specified delimiter,
so in your case a space character?

I need this kind of comment seriously, saying, what are the knowledge
which is necessary and enough to do my job. If there are some special
and powerful methods or stances to do this kind of stuff.
Or be better, give a example just very close my case.
I can get the detailed by reading book(s) or googling.

For your particular problem you’ve already written the C++ code to do the
splitting and the processing, so (ignoring maintainability issues) I don’t
think Ruby will speed up the development, and - for the reasons above - I
think it is unlikely to result in faster processing. But you might find it
worth looking at integrating Ruby (or Python or Perl) with C, C++ or Java
for other processing problems. I don’t have any experience with C or C++,
but I have found integrating Ruby with Java fairly easy. (Once I found my
way round some initial misunderstandings I had about how to package Java
to be used by Ruby.)

Anyway, good luck with your task(s)!