“Mark Alexander F.” [email protected] writes:
I’ve been playing with Bloom Filters in Perl and found that there is a ruby
module for them as well (perl inspired). I am trying to come up with a good
way to serialize the filter post construction that would be compatible with
perl, ruby, php and asp.net (don’t ask). I am not sure what the best
representation would be. Any ideas?
Well, what do you need to encode? As I see it, you need to encode k
(number of hashfunctions), m (number of bits in the filter), some
knowledge of the k hashfunctions themselves, and the bitstring.
Now, both the perl and ruby implementation use hashfunctions that are
hash_N(arg) = sha1(seed_N + arg)
This has possibilities since the definition of sha1 is rigorously
specified (i.e. two implementations fed the same byte string produce
the same hash), sha1 libraries are available in all the potential
target platforms, and the hash functions can be easily specified by
giving the k seed strings.
So, you need a cross-language way to serialize two integers, a list of
byte strings, and a huge bytestring. This looks like a job for
bencoding! (the simple data serialization developed for bittorrent)
I’d serialize it as a bencoded dictionary with the keys “m”, “k”,
“seeds” and “filter”. You should specify that the long bytestring
“filter” is to be interpreted as a bit string, with the least
signficant bit of byte 0 as bit 0. (This is what you get from
BitSet.to_bytes, and it’s what perl’s Bloom::Filter will give you for
Actually, thinking about it a bit more, there’s one more thing you’ll
want to add to the bencoded dictionary: “hashmethod”. For the current
implementations of Bloom::Filter and pbloomfilter, I’d suggest the
string “SHA1 32-bit XOR”, since if I were presented with the challenge
of getting a 32-bit hash value out of an SHA1 hash, my inclination
would be to just choose the first 4 bytes. SHA1 already mixes things
so well that I don’t see any advantage in splitting the 20 bytes into
five 4-byte chunks and XOR-ing them.