On Jun 30, 2008, at 12:58 AM, Greg Willits wrote:

> I have a pure Ruby project (no Rails) where I would like multiple
> "tasks" (ruby processes more or less) to run in parallel (collectively
> taking advantage of multiple CPU cores) while accessing a shared  
> memory
> space of data structures.
>
> OK, that's a mouthful.
>
> - single machine, multiple cores (4 or 8)
>
> - step one: pre-load a number of arrays and hashes (could be a  
> couple GB
> worth in total) into memory
>
> - step two: launch several independent Ruby scripts to search and read
> from the data pool in order to aggregate data in new sets to be  
> written
> to text files.
>
> Ruby 1.8's threading would seem poorly suited to this. Can 1.9 run
> multiple threads each accessing the same RAM-space while using all  
> cores
> of the machine?
>
> I've looked at memcache, but it seems like it could store and retrieve
> one of my pool's arrays, but it cannot look inside that array and
> retrieve just a single row of it? It would want to return the whole
> array, yes? (not good if that array is 100MB).
>
> -- gw
> -- 
> Posted via http://www.ruby-forum.com/.
>

tim is right i think, mmap is a great approach.  i've used the  
following paradigm many times for processing large datasets:

. mmap in the file
. decide the chunk size
. fork n processes working on each chunk

because mmap is carried across the fork you don't do any data  
copying.  actually the memory won't even be paged in until the  
children read it.

this is really ideal if the children can write the output themselves -  
in other words if the children don't have to return data to the parent,  
since returning a huge chunk of data can be expensive.
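
something like this rough sketch - it assumes the mmap gem (Mmap class)  
is installed, and the filenames and aggregation step are just placeholders:

   # mmap a big file, fork one worker per chunk, each worker writes its
   # own output file.  assumes the 'mmap' gem; adapt names to taste.
   require 'mmap'

   data   = Mmap.new('input.dat', 'r')   # mapped, nothing paged in yet
   nprocs = 4
   chunk  = data.size / nprocs

   pids = (0...nprocs).map do |i|
     fork do
       offset = i * chunk
       length = (i == nprocs - 1) ? data.size - offset : chunk
       slice  = data[offset, length]     # pages fault in here, in the child
       File.open("out.#{i}.txt", 'w') do |f|
         # aggregate whatever you need from slice and write it out
         f.puts slice.length
       end
     end
   end

   pids.each { |pid| Process.wait pid }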

you might easily end up being IO bound and not CPU bound - in  
similar processing i've done i've often found that the work scales  
best with the number of disk controllers, not the number of cpus -  
something worth considering.

another approach to consider is to put all the input (or pathnames to  
it) into an sqlite database and then launch processes to work on it.   
this may not seem sexy but it has a huge advantage: you'll be able to  
maintain state across runs, so you can make programming errors and  
still keep your forward progress.  this isn't glamorous but it's very  
powerful as it allows incremental development and even coordination of  
ruby with other languages - like c.
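
for example, a tiny sketch using the sqlite3 gem - the table layout and  
the process method are just placeholders:

   require 'sqlite3'

   db = SQLite3::Database.new('work.db')
   db.execute <<-SQL
     create table if not exists jobs (
       path text primary key,
       done integer default 0
     )
   SQL

   # load the inputs once - re-running this is harmless
   Dir['data/**/*.dat'].each do |path|
     db.execute('insert or ignore into jobs (path) values (?)', [path])
   end

   # a worker grabs undone rows, processes them, marks them done.  a
   # crashed or buggy run just picks up where it left off.  (concurrent
   # workers would also want to claim rows so they don't duplicate work.)
   db.execute('select path from jobs where done = 0').each do |(path)|
     process path   # your aggregation code goes here
     db.execute('update jobs set done = 1 where path = ?', [path])
   end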

one last suggestion, if you have a stack of linux machines available:

   . install rq
   . submit a bunch of jobs that process a chunk of data

go home for the day ;-)

with rq you should be able to set up a linux cluster in a few minutes  
and just submit a slow ruby script to 10 machines running 4 jobs each,  
no problem.  you could also use rq on an 8-core machine to manage the  
jobs for you.

food for thought.

ref:

   http://www.linuxjournal.com/article/7922
   http://codeforpeople.com/lib/ruby/rq/rq-3.1.0/README
   (rq 3.4.0 has a bug in it so use 3.1 if you decide to try that route)

a @ http://codeforpeople.com/
--
we can deny everything, except that we have the possibility of being  
better. simply reflect on that.
h.h. the 14th dalai lama