Forum: Ruby read file in parallel

felix chang (felix125)
on 2013-11-09 03:22
Dear all:

I have a file containing a lot of strings.
e.g.

ABADSFASVASDF
ASDFASFASVASDF
ASDFASFDASDF
VASDFASVAS
ASVASDFASDFASDF
ASVASDFASDFASDFA
ASDFASVASDFAF
ASDFASFDAF


I have to slice them if they match some criteria.
That is not hard in Ruby.

The file I have to process is very, very huge (>100 GB).
In order to speed up my program, I use gun_parallel to split the file
and pipe it to my script.

Is it possible to drop parallel and instead use pure Ruby?

Thanks

Felix
Robert Klemme (robert_k78)
on 2013-11-09 13:01
(Received via mailing list)
On Sat, Nov 9, 2013 at 3:22 AM, felix chang <lists@ruby-forum.com>
wrote:
> ASVASDFASDFASDFA
> ASDFASVASDFAF
> ASDFASFDAF
>
>
> I have to slice them if they match some criteria.

What do you mean by "slice" here?  What do you need to do to those
lines?

> The file I have to process is very, very huge (>100 GB).
> In order to speed up my program, I use gun_parallel to split the file
> and pipe it to my script.

Do you mean "GNU Parallel"?
http://www.gnu.org/software/parallel/

> Is it possible to drop parallel and instead use pure Ruby?

Yes, of course you can do that.  Keep in mind that, depending on how
you do it, there might be a lot of seeking on the file, which may
actually hurt performance.
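To make the seeking concern concrete, here is a minimal pure-Ruby sketch of one way to do it, using nothing beyond the standard library. All names are invented for illustration, and the criterion (`PATTERN`, lines starting with "A") is just a stand-in for whatever the real test is: partition the file into byte ranges and let forked worker processes scan their own range, aligning each range to a line boundary so no line is split or counted twice.

```ruby
PATTERN = /\AA/  # hypothetical criterion: keep lines starting with "A"

# Rough byte ranges, one per worker.
def chunk_bounds(path, workers)
  size = File.size(path)
  step = size / workers
  (0...workers).map { |i| [i * step, i == workers - 1 ? size : (i + 1) * step] }
end

# Yield every full line whose first byte lies in [from, to).
def each_line_in_range(path, from, to)
  File.open(path) do |f|
    unless from.zero?
      f.seek(from - 1)
      f.gets unless f.read(1) == "\n"  # finish the line straddling the boundary
    end
    while f.pos < to && (line = f.gets)
      yield line
    end
  end
end

# Fork one worker per range; each reports its match count over a pipe.
def parallel_count(path, workers: 4)
  pipes = chunk_bounds(path, workers).map do |from, to|
    r, w = IO.pipe
    fork do
      r.close
      n = 0
      each_line_in_range(path, from, to) { |l| n += 1 if l.match?(PATTERN) }
      w.puts n
      w.close
      exit!
    end
    w.close
    r
  end
  total = pipes.sum { |r| n = r.read.to_i; r.close; n }
  Process.waitall
  total
end
```

Each worker seeks once to the start of its range and then reads sequentially, so the seek overhead is one seek per worker rather than per line.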

There are also Gems around, e.g.
https://rubygems.org/gems/parallel

See https://rubygems.org/search?utf8=%E2%9C%93&query=p... for more.

We can certainly come up with more concrete advice if you say what you
have to do to the input.

Kind regards

robert
Garthy D (Guest)
on 2013-11-11 02:52
(Received via mailing list)
Hi Felix,

On 09/11/13 12:52, felix chang wrote:
 > Dear all:
 >
 > I have a file containing a lot of strings.
 > e.g.
 >
 > ABADSFASVASDF
 > ASDFASFASVASDF
 > ASDFASFDASDF
 > VASDFASVAS
 > ASVASDFASDFASDF
 > ASVASDFASDFASDFA
 > ASDFASVASDFAF
 > ASDFASFDAF
 >
 >
 > I have to slice them if they match some criteria.
 > That is not hard in Ruby.
 >
 > The file I have to process is very, very huge (>100 GB).
 > In order to speed up my program, I use gun_parallel to split the file
 > and pipe it to my script.
 >
 > Is it possible to drop parallel and instead use pure Ruby?
 >
 > Thanks
 >
 > Felix

You can definitely use Ruby to parallelise tasks. Just yesterday I
slapped together a simple task manager in Ruby and converted a couple of
scripts that did a bunch of mostly-independent tasks serially to use it,
resulting in some massive performance improvements when run on a
multi-core machine. On an almost daily basis I run a Ruby-based tool
that manages build tasks run in external programs and runs as much as it
can in parallel. Ruby is well suited to this sort of thing, particularly
when the task management is complex. :)
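The skeleton of such a task manager can be tiny. The sketch below is not the actual tool described above, just one hedged way to build it: a Queue of jobs drained by a fixed pool of worker threads. Note that MRI's GVL means plain threads only pay off when the tasks are I/O-bound or shell out to external programs; for CPU-bound pure-Ruby work you would want processes instead.

```ruby
# Minimal thread-pool task manager: workers drain a shared queue of callables.
def run_tasks(tasks, workers: 4)
  queue   = Queue.new
  tasks.each { |t| queue << t }
  results = Queue.new

  pool = Array.new(workers) do
    Thread.new do
      loop do
        task = begin
          queue.pop(true)   # non-blocking: every job was queued up front
        rescue ThreadError
          break             # queue drained, worker retires
        end
        results << task.call
      end
    end
  end

  pool.each(&:join)
  Array.new(results.size) { results.pop }
end
```

Each task is just a lambda, so the same manager can wrap `system(...)` calls to external build tools or data filters.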

A couple of questions do come up from your post though.

Do you really want to use Ruby end-to-end for what you are doing? You're
processing a lot of data (>100G). You might want to consider using the
best tool for the job for each stage of your processing, some of which
might involve using Ruby, and some of which might not.

If you're starting with one huge file, are there any splitting or
filtering tasks you could perform on the data beforehand, before
throwing the rest back into Ruby for processing? Depending on what you
are doing, you might be better off writing something that splits the
files first (perhaps even sending them off to multiple machines!) and
then running your script on each independent file. Or possibly seeking
through the file initially using Ruby to determine good points to begin,
setting up a bunch of tasks that launch an external program to filter
the data into the most convenient form, and then processing it, with
each task controlled by some sort of job manager to ensure that you are
using each machine to its full capability.
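As a concrete illustration of the "seek through the file to find good points to begin" idea (the names are invented for this sketch): pick approximate byte offsets, then advance each one to the next newline so every chunk starts on a line boundary. The resulting offsets could then be handed to external programs, other machines, or local worker processes.

```ruby
# Return byte offsets [0, ..., filesize] adjusted so that every interior
# offset sits exactly at the start of a line.
def split_points(path, parts)
  size  = File.size(path)
  rough = (1...parts).map { |i| i * size / parts }
  File.open(path) do |f|
    aligned = rough.map do |off|
      f.seek(off)
      f.gets   # consume the partial line; the previous chunk owns it
      f.pos
    end
    ([0] + aligned + [size]).uniq  # uniq guards against tiny files
  end
end
```

Scanning for the boundary costs at most one line read per split point, so even on a 100 GB file computing the split points is nearly instant.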

Of course, it all depends on what you are trying to do, how much
filtering can be done, the complexity of each task, the practicality of
pre-processing the data, if it's a once-off task versus ongoing, whether
you are processing faster than disk I/O, available time and cost/benefit
to implement, etc etc etc. More detail might net better suggestions. :)

Cheers,
Garth