On 09/11/13 12:52, felix chang wrote:
I have a file containing a lot of strings.
I have to slice them if they match certain criteria.
That's not hard in Ruby.
The file I have to process is very, very huge (>100G).
In order to speed up my program, I use GNU parallel to split the file
and pipe the pieces to my script.
Is it possible to drop parallel and use pure Ruby instead?
You can definitely use Ruby to parallelise tasks. Just yesterday I
slapped together a simple task manager in Ruby and converted a couple of
scripts that did a bunch of mostly-independent tasks serially to use it,
resulting in some massive performance improvements when run on a
multi-core machine. On an almost daily basis I run a Ruby-based tool
that manages build tasks run in external programs and runs as much as it
can in parallel. Ruby is well suited to this sort of thing, particularly
when the task management is complex.
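As a rough sketch of what such a task manager might look like (the name `run_parallel` and the details below are mine, not from the tool mentioned above): a fixed pool of worker threads pulls jobs off a shared queue, which is often enough when each task shells out to an external program or does I/O.

```ruby
# Minimal thread-pool task runner: workers drain a shared Queue of
# [index, callable] pairs and store each result at its original index.
def run_parallel(tasks, workers: 4)
  queue = Queue.new
  tasks.each_with_index { |t, i| queue << [i, t] }
  results = Array.new(tasks.size)

  threads = workers.times.map do
    Thread.new do
      loop do
        begin
          i, task = queue.pop(true)  # non-blocking pop; raises ThreadError when empty
        rescue ThreadError
          break                      # queue drained, worker exits
        end
        results[i] = task.call
      end
    end
  end
  threads.each(&:join)
  results
end

squares = run_parallel((1..8).map { |n| -> { n * n } }, workers: 3)
# squares == [1, 4, 9, 16, 25, 36, 49, 64]
```

Note that on MRI the GIL means pure-Ruby CPU work won't actually overlap; threads pay off when tasks block on I/O or spawn external processes (`system`, `IO.popen`), and for CPU-bound work you would reach for `Process.fork` instead.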
A couple of questions do come up from your post though.
Do you really want to use Ruby end-to-end for what you are doing? You’re
processing a lot of data (>100G). You might want to consider using the
best tool for the job for each stage of your processing, some of which
might involve using Ruby, and some of which might not.
If you’re starting with one huge file, are there any splitting or
filtering tasks you could perform on the data beforehand, before
throwing the rest back into Ruby for processing? Depending on what you
are doing, you might be better off writing something that splits the
files first (perhaps even sending them off to multiple machines!) and
then running your script on each independent file. Or you could seek
through the file initially using Ruby to determine good points to begin,
then set up a bunch of tasks that launch an external program to filter
the data into the most convenient form before processing it, with each
task controlled by some sort of job manager to ensure you are using
each machine to its full capability.
Of course, it all depends on what you are trying to do, how much
filtering can be done, the complexity of each task, the practicality of
pre-processing the data, whether it's a once-off task or an ongoing one,
whether you are processing faster than disk I/O, available time and the
cost/benefit of implementing it, and so on. More detail might net
better suggestions.