On May 16, 2006, at 7:53 AM, David B. wrote:

>> [...] higher order characters things go askew.
>
> What Erik said is exactly correct. Marvin H. (author of
> KinoSearch, a Perl port of Lucene) has submitted a patch to Lucene so
> that non-Java ports of Lucene will be able to read Lucene indexes. It
> currently slows Lucene down by about 25% at the moment (I think??)
Around 20% for indexing according to my benchmarker. I don’t have a
benchmark for searching.
Modified UTF-8 is not so much the problem for performance of my
patch, nor is it actually causing the index incompatibility in this
case. Modified UTF-8 is problematic for a couple other reasons.
When text contains either null bytes or Unicode code points above the
Basic Multilingual Plane (values 2^16 and up, such as U+1D160
“MUSICAL SYMBOL EIGHTH NOTE”), KinoSearch and Ferret, if they write
legal UTF-8, would write indexes which would cause Lucene to crash
from time to time with a baffling “read past EOF” error. Therefore,
to be Lucene-compatible they’d have to pre-scan all text to detect
those conditions, which would impose a performance burden and require
some crufty auxiliary code to turn the legal UTF-8 into Modified UTF-8.
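For concreteness, here's a minimal Java sketch of the size difference for exactly the text described above -- U+1D160 plus a null byte. `DataOutputStream.writeUTF` emits Modified UTF-8 (with a 2-byte length prefix), so the two problem cases encode differently than legal UTF-8 does:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class ModifiedUtf8Demo {
    public static void main(String[] args) throws IOException {
        // U+1D160 MUSICAL SYMBOL EIGHTH NOTE (above the BMP, stored in Java
        // as a surrogate pair) followed by a null character.
        String s = "\uD834\uDD60\u0000";

        // Legal UTF-8: 4 bytes for U+1D160 + 1 byte for U+0000 = 5 bytes.
        byte[] standard = s.getBytes(StandardCharsets.UTF_8);

        // Modified UTF-8 (what writeUTF emits): each surrogate encoded
        // separately as 3 bytes (6 total) + 2 bytes (0xC0 0x80) for the
        // null = 8 bytes, plus a 2-byte length prefix.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        new DataOutputStream(bos).writeUTF(s);
        byte[] modified = bos.toByteArray();

        System.out.println("legal UTF-8:    " + standard.length + " bytes");      // 5
        System.out.println("Modified UTF-8: " + (modified.length - 2) + " bytes"); // 8
    }
}
```

Same three code points, different byte sequences -- which is why a port writing legal UTF-8 can't be read back correctly by a Modified-UTF-8 reader.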
Also, non-shortest-form UTF-8 presents a theoretical security risk,
and Perl is set up to issue a warning whenever a scalar marked as
UTF-8 isn’t shortest-form. That condition would occur whenever
Modified UTF-8 containing null bytes or code points above the BMP was
read in, thus requiring that all incoming text be pre-scanned as well.
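The security risk comes from overlong (non-shortest-form) encodings: for instance, the byte pair 0xC0 0xAF is an overlong encoding of "/" that naive decoders have historically accepted, letting suspect input slip past byte-level filters. Modified UTF-8's 0xC0 0x80 null is the same kind of sequence. As a quick illustration (not from the patch itself), a strict Java decoder rejects such bytes outright:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

public class OverlongDemo {
    public static void main(String[] args) {
        // 0xC0 0xAF: overlong (non-shortest-form) encoding of '/' (U+002F).
        // 0xC0 0x80, Modified UTF-8's null, is malformed for the same reason.
        byte[] overlong = { (byte) 0xC0, (byte) 0xAF };
        try {
            // A decoder obtained this way REPORTs malformed input by default,
            // rather than silently substituting a replacement character.
            StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(overlong));
            System.out.println("accepted");
        } catch (CharacterCodingException e) {
            System.out.println("rejected as malformed");
        }
    }
}
```

Perl's warning serves the same purpose as the strict decoder here: flag non-shortest-form bytes instead of trusting them.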
Those are rare conditions, but it isn’t realistic to just say
“KinoSearch|Ferret doesn’t support null bytes or characters above the
BMP”, because a lot of times the source text that goes into an index
isn’t under the full control of the indexing/search app’s author.
To be fair to Java and Lucene, they are paying a price for early
commitment to the Unicode standard. Lucene’s UTF-8 encoding/decoding
hasn’t been touched since Doug Cutting wrote it in 1998, when non-
shortest-form UTF-8 was still legal and Unicode was still 16-bit.
You could argue that the Unicode consortium pulled the rug out from
under its early champions by changing the spec so that existing
implementations were no longer compliant.
The performance problems of my patch and the crashing are actually
tied to the Lucene File Format’s definition of a String. A String in
Lucene is the length of the string in Java chars, followed by the
character data translated to Modified UTF-8. A String in KinoSearch,
and if I am not mistaken in Ferret as well, is the length of the
character data in bytes, followed by the character data.
Those two definitions of String result in identical indexes so long
as your text is pure ASCII, but as Erik noted, when you add higher
order characters to the mix, problems arise. You end up reading
either too few bytes or too many, the stream gets out of sync, and
whammo: ‘Read past EOF’.
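A tiny example of why the two prefixes agree only on ASCII -- one non-ASCII character is enough to make the counts diverge:

```java
import java.nio.charset.StandardCharsets;

public class CountMismatchDemo {
    public static void main(String[] args) {
        String s = "caf\u00E9";  // "café": four characters, one of them non-ASCII

        int javaChars = s.length();                                 // 4 -- Lucene's prefix
        int utf8Bytes = s.getBytes(StandardCharsets.UTF_8).length;  // 5 -- a byte-count prefix

        System.out.println(javaChars + " Java chars, " + utf8Bytes + " UTF-8 bytes");
        // A reader expecting one count but handed the other consumes the wrong
        // number of bytes; each such string pushes the stream further out of
        // sync until a read finally runs off the end of the file.
    }
}
```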
My patch modifies Lucene to use byte counts as the prefix to its
Strings. Unfortunately, there are encoding/decoding inefficiencies
associated with the new way of doing things. Under Lucene’s current
definition of a string you allocate an array of Java char then read
characters into it one by one. With the new patch, you don’t know
how many chars you need, so you might have to re-allocate several
times. There are ways to address that inefficiency, but they’d take
a while to explain.
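A rough sketch of the byte-count side, to show where the uncertainty lives. (This is illustrative only -- the method name is made up, it uses a fixed-width `int` prefix where Lucene's format uses a VInt, and it leans on `String`'s decoder rather than reading chars one by one as Lucene does:)

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class ByteCountedString {
    // Hypothetical reader for a byte-count-prefixed string.
    static String read(DataInput in) throws IOException {
        int numBytes = in.readInt();   // the byte length is known up front...
        byte[] buf = new byte[numBytes];
        in.readFully(buf);             // ...so one bulk read suffices.
        // But how many Java chars those bytes decode to is unknown until
        // decoding finishes -- anywhere from numBytes/3 to numBytes chars --
        // so a decoder working into char[] may have to reallocate and copy.
        return new String(buf, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        byte[] bytes = "caf\u00E9".getBytes(StandardCharsets.UTF_8);
        out.writeInt(bytes.length);    // 5 bytes for a 4-char string
        out.write(bytes);

        DataInput in = new DataInputStream(new ByteArrayInputStream(bos.toByteArray()));
        System.out.println(read(in));  // round-trips the string
    }
}
```

Under the old char-count prefix, by contrast, you can allocate `char[count]` once and fill it -- which is exactly the efficiency the patch gives up.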
> Don’t hold your breath though. It’s going to take us a while to get
> it in there.
Yeah. Modifying Lucene so that it can read both the old index format
and the new without suffering a performance degradation in either
case is going to be non-trivial. I’m sympathetic to the notion that
it may not be worth it and that Lucene should declare its file format
private. There are a lot of issues in play.
No KinoSearch user has yet complained about Lucene/KinoSearch file-
format compatibility. The only thing I miss is Luke – which is
significant, because Luke is really handy.
How many users here care about Lucene compatibility, and why?