Hi Felix,

On 09/11/13 12:52, felix chang wrote:
 > Dear all:
 >
 > I have a file containing a lot of strings,
 > e.g.
 >
 > ABADSFASVASDF
 > ASDFASFASVASDF
 > ASDFASFDASDF
 > VASDFASVAS
 > ASVASDFASDFASDF
 > ASVASDFASDFASDFA
 > ASDFASVASDFAF
 > ASDFASFDAF
 >
 >
 > I have to slice them if they match certain criteria.
 > That's not hard in Ruby.
 >
 > The file I have to process is very, very huge (>100G).
 > In order to speed up my program, I use GNU parallel to split the file
 > and pipe it to my script.
 >
 > Is it possible to drop parallel and use pure Ruby instead?
 >
 > Thanks
 >
 > Felix

You can definitely use Ruby to parallelise tasks. Just yesterday I
slapped together a simple task manager in Ruby and converted a couple
of scripts that had been doing a bunch of mostly-independent tasks
serially to use it, which resulted in some massive performance
improvements on a multi-core machine. On an almost daily basis I run a
Ruby-based tool that manages build tasks run in external programs and
runs as much as it can in parallel. Ruby is well suited to this sort
of thing, particularly when the task management is complex. :)
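
To give a feel for what I mean, here's a very rough sketch of a
thread-pool style task runner (the class name and worker count are
made up for illustration; this is not the actual tool I use):

require "thread"

# Minimal task runner: queue up blocks of work, run them on a fixed
# number of worker threads, then wait for everything to finish.
class TaskRunner
  def initialize(workers: 4)
    @queue   = Queue.new
    @workers = workers
  end

  def add(&task)
    @queue << task
  end

  def run
    threads = Array.new(@workers) do
      Thread.new do
        # :stop is a sentinel meaning "no more work for this worker".
        while (task = @queue.pop) != :stop
          task.call
        end
      end
    end
    @workers.times { @queue << :stop }
    threads.each(&:join)
  end
end

runner = TaskRunner.new(workers: 4)
10.times { |i| runner.add { system("echo processing chunk #{i}") } }
runner.run

One caveat: on MRI, threads only give you real parallelism when the
tasks spend their time in I/O or in external programs (as mine do);
for CPU-heavy pure-Ruby work you'd want separate processes instead.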

A couple of questions do come up from your post though.

Do you really want to use Ruby end-to-end for what you are doing? You're 
processing a lot of data (>100G). You might want to consider using the 
best tool for the job for each stage of your processing, some of which 
might involve using Ruby, and some of which might not.

If you're starting with one huge file, are there any splitting or
filtering passes you could run over the data first, before throwing
the rest back into Ruby for processing? Depending on what you are
doing, you might be better off writing something that splits the file
first (perhaps even sending the pieces off to multiple machines!) and
then running your script on each independent chunk. Alternatively, you
could seek through the file in Ruby to find good starting points, set
up a bunch of tasks that each launch an external program to filter the
data into the most convenient form, and then process the result, with
each task controlled by some sort of job manager so that every machine
is used to its full capability.
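
For instance, one way to get the chunking without GNU parallel is to
seek to evenly spaced byte offsets, advance each offset to the next
newline so that no line is split, and then fork one worker process per
chunk. A rough, untested sketch (the file name and worker count are
just placeholders):

FILENAME = "huge_input.txt"
WORKERS  = 8

size = File.size(FILENAME)
step = size / WORKERS

# Work out chunk boundaries aligned to line breaks: seek to each
# offset, then skip forward to the start of the next line.
offsets = [0]
File.open(FILENAME, "rb") do |f|
  (1...WORKERS).each do |i|
    f.seek(i * step)
    f.gets                 # discard the partial line at the seek point
    offsets << f.pos
  end
end
offsets << size

# Fork a worker per chunk; each reads only its own byte range.
pids = offsets.each_cons(2).map do |start, stop|
  fork do
    File.open(FILENAME, "rb") do |f|
      f.seek(start)
      while f.pos < stop && (line = f.gets)
        # ... apply your slicing/matching criteria to `line` here ...
      end
    end
  end
end
pids.each { |pid| Process.wait(pid) }

(fork isn't available on Windows; on MRI, separate processes are also
the easiest way to keep every core busy with CPU-heavy work.)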

Of course, it all depends on what you are trying to do: how much
filtering can be done, the complexity of each task, the practicality
of pre-processing the data, whether it's a one-off task or an ongoing
one, whether you are limited by processing speed or by disk I/O, the
time available and the cost/benefit of implementing it, etc etc etc.
More detail might net better suggestions. :)

Cheers,
Garth