"Mark Alexander Friedgan" <hubrix / gmail.com> writes: > I've been playing with Bloom Filters in Perl and found that there is a ruby > module for them as well (perl inspired). I am trying to come up with a good > way to serialize the filter post construction that would be compatible with > perl, ruby, php and asp.net (don't ask). I am not sure what the best > representation would be. Any ideas? Well, what do you need to encode? As I see it, you need to encode k (number of hashfunctions), m (number of bits in the filter), some knowledge of the k hashfunctions themselves, and the bitstring. Now, both the perl and ruby implementation use hashfunctions that are done by: hash_N(arg) = sha1(seed_N + arg) This has possibilities since the definition of sha1 is rigorously specified (i.e. two implementations fed the same byte string produce the same hash), sha1 libraries are available in all the potential target platforms, and the hash functions can be easily specified by giving the k seed strings. So, you need a cross-language way to serialize two integers, a list of byte strings, and a huge bytestring. This looks like a job for bencoding! (the simple data serialization developed for bittorrent) I'd serialize it as a bencoded dictionary with the keys "m", "k", "seeds" and "filter". You should specify that the long bytestring "filter" is to be interpreted as a bit string, with the least signficant bit of byte 0 as bit 0. (This is what you get from BitSet.to_bytes, and it's what perl's Bloom::Filter will give you for $self->{filter}) Actually, thinking about it a bit more, there's one more thing you'll want to add to the bencoded dictionary: "hashmethod". For the current implementations of Bloom::Filter and pbloomfilter, I'd suggest the string "SHA1 32-bit XOR", since if I were presented with the challenge of getting a 32-bit hash value out of an SHA1 hash, my inclination would be to just choose the first 4 bytes. SHA1 already mixes things so well that I don't see any advantage in splitting the 20 bytes into five 4-byte chunks and XOR-ing them. -- s=%q( Daniel Martin -- martin / snowplow.org puts "s=%q(#{s})",s.to_a.last ) puts "s=%q(#{s})",s.to_a.last