On Thu, 20 Aug 2009, Ben Christensen wrote:

>
> This is not a test of "file reading". The test is related to the
> performance of iterating over large lists of data and performing
> processing on them - such as indexing for searching, cleansing,
> normalizing etc.
>
> This is a very small representation of the level of complexity and size
> of data I would in reality be dealing with.
>
> It seems however that the answer is that this is not what Ruby is well
> suited for. Am I correct in that determination?
>

Ben -- I've been working with Java since '96, and taught Java for Sun for
a while, so I think I can understand where you may be coming from.  At
this point, I prefer to write Ruby -- it's much more readable and far
less *crufty* than Java, but Java still pays the bills.

I do have the following questions and/or things to consider --

1. How *often* are you going to be processing these files?  If they are
batch-style jobs, does absolute speed matter more than maintainability?

2. Is there any reason not to keep the data in a database and run
queries against it there?



If you want to do things such as indexing, Ruby's string handling far
outshines Java's, imho.  Ruby's "collections" and enumerables are far
more robust as well.  As a result, I can spend 5 minutes writing
something that would take me 30 or even 60 minutes in Java.  Yes, Ruby
may not be faster in execution time -- although, as the results show,
that depends on how you write it (in one instance it was faster than
Java).  But even if a Ruby run takes, say, 1 second longer, the 25
minutes saved in development is 1500 seconds, so the job would have to
run 1500 times before Java's faster runtime made up for its extra
development time.  And that's not including maintenance time.  Then
factor in that developer time is usually far more expensive than CPU
time, and Ruby tends to come out in the lead.
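
To make that arithmetic concrete, here is a minimal sketch using the
figures above (5 minutes vs. 30 minutes of development, 1 extra second
per Ruby run).  The method name break_even_runs is just something I made
up for illustration, and the numbers are my rough guesses, not
measurements:

```ruby
# Rough guesses from the paragraph above, converted to seconds.
ruby_dev_seconds      = 5 * 60    # time to write the Ruby version
java_dev_seconds      = 30 * 60   # time to write the Java version
extra_seconds_per_run = 1         # how much slower each Ruby run is

# Hypothetical helper: number of runs after which the slower runtime has
# used up the head start gained in development time.
def break_even_runs(fast_dev, slow_dev, extra_per_run)
  (slow_dev - fast_dev) / extra_per_run
end

runs = break_even_runs(ruby_dev_seconds, java_dev_seconds, extra_seconds_per_run)
puts "Break-even after #{runs} runs"
```

With the 60-minute estimate for Java instead of 30, the same calculation
gives 3300 runs.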

A far fairer assessment would factor in the time it takes to write each
test, as well as the number of lines of code, since code size tends to
increase both complexity and maintenance costs.  Then run the two and
see which is better.
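
As a rough sketch of the Ruby half of such a comparison, something like
the following builds a small word index and times it with Benchmark from
the standard library.  The sample lines and the build_index method are
made up for illustration; this is not the test from the original post:

```ruby
require 'benchmark'

# Made-up sample data standing in for a much larger list of records.
lines = [
  "The quick brown fox",
  "jumps over the lazy dog",
  "the fox sleeps"
]

# Hypothetical helper: build a simple inverted index mapping each
# lower-cased word to the numbers of the lines it appears on.
def build_index(lines)
  index = Hash.new { |hash, word| hash[word] = [] }
  lines.each_with_index do |line, number|
    line.downcase.scan(/\w+/).uniq.each { |word| index[word] << number }
  end
  index
end

# Benchmark.realtime returns the elapsed wall-clock time in seconds.
index = nil
elapsed = Benchmark.realtime { index = build_index(lines) }

puts "indexed #{index.size} distinct words in #{elapsed} seconds"
```

The Java side would need the equivalent program and timing, and then
you'd set development time and line counts next to the runtimes.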

If you're processing these files in real time to extract data, etc., then
perhaps you'd be better off loading them into a database.  However, if
they're batched, as I expect, then by simply comparing "speed of
execution" you're looking at only one facet of the problem.

Matt