On May 23, 2012, at 03:22 , George Dupre wrote:

> I have to:
> 1) generate a database of a couple dozen million Fixnums
> (ranging from 0 to 2^32 - 1), while avoiding redundancy
> 2) iterate through them
> 3) quickly search for the presence of a given Fixnum in the database
>
> The Set class fulfills the speed conditions and conveniently handles
> redundancy itself, but it uses up too much memory. It looks like each
> entry uses up around 100 bytes, even if I only put 4 bytes in there.
> Array#include? is too slow and doesn't solve the memory problem either.
> Representing each Fixnum as 4 bytes in a huge String doesn't use up much
> memory at all and String#include? is fast enough, but I can't tell it to
> only search in 4-byte increments.
>
> Could someone help me with a solution for this problem? Thank you in
> advance.

This seems like one of those %w[small fast good].pick(2) problems.
Also... smells like homework.

But if not, it reminds me of this article:

http://highscalability.com/blog/2012/4/5/big-data-counting-how-to-count-a-billion-distinct-objects-us.html

and their implementation:

https://github.com/clearspring/stream-lib

It's Java, so you could use it directly via JRuby... or you can use Unix
IO via the included shell scripts.
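
If the set has to be exact and stay in plain Ruby, the packed-String idea
from the question can be made to work: keep the values sorted and
binary-search on 4-byte boundaries yourself instead of calling
String#include?. A minimal sketch, assuming all values are known up front
so they can be sorted once; PackedSet and its methods are made-up names,
not an existing library:

```ruby
# Sketch only: PackedSet is a hypothetical class, not part of Ruby or any gem.
class PackedSet
  include Enumerable

  BYTES = 4 # each value is stored as a 32-bit big-endian unsigned integer

  attr_reader :size

  def initialize(values)
    # Sort and de-duplicate once, then pack everything into one binary String.
    @data = values.sort.uniq.pack("N*")
    @size = @data.bytesize / BYTES
  end

  # Value stored in slot i (0-based).
  def [](i)
    @data.byteslice(i * BYTES, BYTES).unpack("N").first
  end

  # Binary search over the 4-byte slots.
  def include?(n)
    lo = 0
    hi = @size - 1
    while lo <= hi
      mid = (lo + hi) / 2
      v = self[mid]
      return true if v == n
      if v < n
        lo = mid + 1
      else
        hi = mid - 1
      end
    end
    false
  end

  # Iterate over the values in ascending order.
  def each
    return enum_for(:each) unless block_given?
    @size.times { |i| yield self[i] }
  end
end

set = PackedSet.new([5, 1, 4_294_967_295, 5, 0, 1000])
puts set.include?(1000)
puts set.include?(2)
```

The String itself costs 4 bytes per value (roughly 80 MB for 20 million
values), and lookups take O(log n) slot reads. The catch is that adding
values after the initial build is not cheap, since the String would have
to be re-sorted or merged.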