On 09/11/13 12:52, felix chang wrote:
I have a file containing a lot of strings.
I have to slice them if they match certain criteria.
That's not hard in Ruby.
The file I have to process is very, very huge (>100G).
In order to speed up my program, I use GNU parallel to split the file
and pipe the pieces to my script.
Is it possible to drop parallel and use pure Ruby instead?
You can definitely use Ruby to parallelise tasks. Just yesterday I
slapped together a simple task manager in Ruby and converted a couple of
scripts that did a bunch of mostly-independent tasks serially to use it,
resulting in some massive performance improvements when run on a
multi-core machine. On an almost daily basis I run a Ruby-based tool
that manages build tasks run in external programs and runs as much as it
can in parallel. Ruby is well suited to this sort of thing, particularly
when the task management is complex.
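As a rough sketch of what such a task manager might look like (the name `run_parallel` and the details below are mine, not from the tool mentioned above): a fixed pool of worker threads pulls jobs off a shared queue, which is often enough when each task shells out to an external program or does I/O.

```ruby
# Minimal thread-pool task runner: workers drain a shared Queue of
# [index, callable] pairs and store each result at its original index.
def run_parallel(tasks, workers: 4)
  queue = Queue.new
  tasks.each_with_index { |t, i| queue << [i, t] }
  results = Array.new(tasks.size)

  threads = workers.times.map do
    Thread.new do
      loop do
        begin
          i, task = queue.pop(true)  # non-blocking pop; raises ThreadError when empty
        rescue ThreadError
          break                      # queue drained, worker exits
        end
        results[i] = task.call
      end
    end
  end
  threads.each(&:join)
  results
end

squares = run_parallel((1..8).map { |n| -> { n * n } }, workers: 3)
# squares == [1, 4, 9, 16, 25, 36, 49, 64]
```

Note that on MRI the GIL means pure-Ruby CPU work won't actually overlap; threads pay off when tasks block on I/O or spawn external processes (`system`, `IO.popen`), and for CPU-bound work you would reach for `Process.fork` instead.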
A couple of questions do come up from your post though.
Do you really want to use Ruby end-to-end for what you are doing? You’re
processing a lot of data (>100G). You might want to consider using the
best tool for the job for each stage of your processing, some of which
might involve using Ruby, and some of which might not.
If you’re starting with one huge file, are there any splitting or
filtering tasks you could perform on the data beforehand, before
throwing the rest back into Ruby for processing? Depending on what you
are doing, you might be better off writing something that splits the
files first (perhaps even sending them off to multiple machines!) and
then running your script on each independent file. Or you could seek
through the file initially using Ruby to determine good points to begin,
then set up a bunch of tasks that launch an external program to filter
the data into the most convenient form before processing it, with each
task controlled by some sort of job manager to ensure you are using
each machine to its full capability.
Of course, it all depends on what you are trying to do, how much
filtering can be done, the complexity of each task, the practicality of
pre-processing the data, whether it's a once-off task or an ongoing one,
whether you are processing faster than disk I/O, available time and the
cost/benefit of implementing it, and so on. More detail might net
better suggestions.