Using ferret as a base64-encoded numerical db

Kelly_J · July 6, 2009, 4:16pm

I’m using ferret to store random base64 strings of length 72 (courtesy
of “dd if=/dev/random … | mmencode”), with the long-term goal of
storing floating-point and integer numbers (converted to
base64). Problems:

% Ferret regards the base64 characters “+” and “/” as word
separators, so a search for “content:[xji xjj]” yields things like
“FqWu9uXM99HXZEJMl0Ux/jdOSP0+XJiL9v1ZDK24D0LMp60PUMPdhkbnFQykVMfilxecQFU6”
where “xji” appears after a plus sign. How to avoid this? I could
change “+” to “_”, but I’m not sure changing “/” to “.” or “:” or “-”
or “!” would work.

% Ferret’s default search is case-insensitive, so I get things like
“xJiQf0PEagWJME9Tf5pFu6dk4UGGFw5Lc0PIfa9N70Mb2IG2IWO36VCsC0y7Q1zOrLjk2Lz4”,
which match “xJi” but not “xji”. How to fix?

% When I do a range query, does ferret return all documents
matching the query or only the 10 highest-scoring ones? For my
purposes, I need all documents matching a query, not just the first few.
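
The first two problems can be reproduced outside ferret, which may help
pin down what to ask of the analyzer. The sketch below is plain Ruby and
only imitates two tokenizing strategies: the helper names (loose_tokens,
exact_tokens, range_hit?) are made up for illustration and are not
ferret's API. It assumes the range [xji xjj] is compared as plain strings
against each token.

```ruby
# Illustration only: plain-Ruby stand-ins for two tokenizing strategies.
# These are hypothetical helpers, not ferret's actual analyzers.

# Split on every non-alphanumeric character and lowercase each token.
def loose_tokens(text)
  text.split(/[^A-Za-z0-9]+/).reject(&:empty?).map(&:downcase)
end

# Split on whitespace only and leave case untouched.
def exact_tokens(text)
  text.split
end

# True if any token falls in the inclusive range [low, high],
# compared as plain strings.
def range_hit?(tokens, low, high)
  tokens.any? { |t| t >= low && t <= high }
end

plus_case  = "FqWu9uXM99HXZEJMl0Ux/jdOSP0+XJiL9v1ZDK24D0LMp60PUMPdhkbnFQykVMfilxecQFU6"
upper_case = "xJiQf0PEagWJME9Tf5pFu6dk4UGGFw5Lc0PIfa9N70Mb2IG2IWO36VCsC0y7Q1zOrLjk2Lz4"

[plus_case, upper_case].each do |doc|
  puts "loose: #{range_hit?(loose_tokens(doc), 'xji', 'xjj')}  " \
       "exact: #{range_hit?(exact_tokens(doc), 'xji', 'xjj')}"
end
```

With the loose strategy both strings land in the range (one through the
token after the plus sign, the other through lowercasing); with the exact
strategy neither does. So what I seem to want is an analyzer that treats
the whole string as a single case-sensitive token.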

Is anyone else using ferret as a db? Since it’s hash-based, it’s much
faster at indexing large numbers of strings than sqlite3.

I realize I could just 0-pad my numbers (e.g., “000005” for 5), but I’ve
got a LOT of data (400M pairs of floating point numbers), so I prefer
compactness.
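
To make the padding trade-off concrete, here is a rough sketch comparing
zero-padded decimal keys with zero-padded base-36 keys. Base 36 is a
swapped-in alternative, not base64: standard base64 assigns values in the
order A-Z, a-z, 0-9, +, /, which differs from ASCII order, so base64 text
does not sort the way the underlying numbers do. The helper names and
widths are assumptions for illustration, and the sketch handles only
non-negative integers; floats would first need an order-preserving
mapping to integers.

```ruby
# Hypothetical helpers, non-negative integers only.

# Zero-padded decimal key; width 10 is an assumed bound (values < 10**10).
def pad_decimal(n, width = 10)
  n.to_s.rjust(width, "0")
end

# Zero-padded base-36 key using only 0-9 and a-z, so there are no "+" or
# "/" characters and no mixed case. Width 7 covers values below 36**7.
def pad_base36(n, width = 7)
  n.to_s(36).rjust(width, "0")
end

values = [1_000, 5, 999_999_999, 40]
keys   = values.map { |n| pad_base36(n) }

# Sorting the keys as strings should give the same order as sorting
# the numbers, because every key has the same width and the digits
# 0-9 sort before a-z in ASCII.
puts keys.sort.inspect
puts keys.sort == values.sort.map { |n| pad_base36(n) }
```

This is less compact than base64 (7 characters instead of 10 decimal
digits for the same range), but fixed-width keys over a lowercase-only
alphabet should make string range queries line up with numeric ranges.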
