Indexing mostly-binary documents (.ppt)

Here’s an interesting problem: In my app, we are indexing various
types of documents, including Microsoft PowerPoint. PowerPoint
documents are mostly binary, but contain a fair amount of text (all
of the text in the document?) as well.

My thinking is that the binary will never get searched for, and the
proper text will be indexed and queried as expected, so the indexed
binary will never affect results. Is this correct?

Then my colleague mentioned that maybe the indexed garbage would
affect the weighting of certain searches. I figure that weighting is
computed per search, so, as above, only the proper terms will factor
into the calculation.

What do you folks think?

John

On Apr 1, 2007, at 3:09 AM, John B. wrote:

Here’s an interesting problem: In my app, we are indexing various
types of documents, including Microsoft PowerPoint. PowerPoint
documents are mostly binary, but contain a fair amount of text (all
of the text in the document?) as well.

Are you serious? You’re adding raw, unprocessed PPT files to your index?

Now this is just wrong. PPT files may contain all sorts of binary
data, such as images and videos. I just had a look at the sample
presentation that came with my Office installation. This file is
3.5MB in size with a (plain text) payload of less than 1KB.

I’m sure there’s some tool available which converts PPT to plain text
and I strongly recommend you go out and find it.

Cheers,
Andy

On Apr 1, 2007, at 5:37 AM, Andreas K. wrote:

Are you serious? You’re adding raw, unprocessed PPT files to your
index?

Now this is just wrong. PPT files may contain all sorts of binary
data, such as images and videos. I just had a look at the sample
presentation that came with my Office installation. This file is
3.5MB in size with a (plain text) payload of less than 1KB.

As I stated in my previous email, I am conjecturing that indexing
these documents will not affect search performance. Do you disagree?

I’m sure there’s some tool available which converts PPT to plain text
and I strongly recommend you go out and find it.

I’ve searched far and wide and have found none.

john


Hi,

there are some:

Catdoc and antiword, for example. They are simple shell commands that
extract the text from files.

http://www.45.free.net/~vitus/software/catdoc/
http://www.winfield.demon.nl/

Antiword has better Windows support, but as far as I know it doesn’t
support .ppt as well as catdoc does. I’m no expert though; I’ve just
used them once or twice at university. If you use them, I would be
interested in feedback on how well they work.
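
Calling them from Ruby is just a shell-out, so trying one out would
look something like this (an untested sketch; the file names are
placeholders, and both tools print to stdout):

# Untested sketch: catdoc and catppt write the extracted text to stdout,
# so a backtick shell-out is all that is needed. File names are placeholders.
doc_text = `catdoc 'report.doc'`
ppt_text = `catppt 'slides.ppt'`
puts ppt_text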

Thanks in advance and good luck
Florian

P.S.: There is an article about even more of them at
http://www.linux.com/article.pl?sid=06/02/22/201247 .



John Joseph B. wrote:

On Apr 1, 2007, at 5:37 AM, Andreas K. wrote:

I’m sure there’s some tool available which converts PPT to plain text
and I strongly recommend you go out and find it.

I’ve searched far and wide and have found none.

On Apr 1, 2007, at 12:47 PM, Florian G. wrote:

…I’ve just used them once or twice at university. If you use them, I
would be interested in feedback on how well they work.

Wow, I had never come across catdoc and its siblings, and believe me,
I’ve searched far and wide. THANK YOU.

BTW, I’m pretty sure antiword does not work on PowerPoint:

$ antiword powerpoint.ppt
This OLE file does not contain a Word document

John

On Apr 1, 2007, at 6:11 PM, John Joseph B. wrote:

Now this is just wrong. PPT files may contain all sorts of binary
data, such as images and videos. I just had a look at the sample
presentation that came with my Office installation. This file is
3.5MB in size with a (plain text) payload of less than 1KB.

As I stated in my previous email, I am conjecturing that indexing
these documents will not affect search performance. Do you disagree?

I couldn’t disagree more. The question is to what extent it affects
performance.

I’m sure there’s some tool available which converts PPT to plain text
and I strongly recommend you go out and find it.

I’ve searched far and wide and have found none.

Seems like you found one now :)

Good Luck!

– Andy

On Apr 1, 2007, at 12:47 PM, Florian G. wrote:

Catdoc and antiword, for example. They are simple shell commands that
extract the text from files.

http://www.45.free.net/~vitus/software/catdoc/
http://www.winfield.demon.nl/

If you use them, I would be interested in feedback on how well they
work.

I am now using catdoc, catppt, and xls2csv to index all of my
documents, and it is working well.

The content out of catppt seems to be rather incomplete, but it is
Good Enough for our purposes.
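
In case it is useful to anyone, the core of the setup looks roughly
like this (a simplified sketch, not the actual app code; the paths and
the extractor table are just illustrative, and it assumes the basic
Ferret Index API):

require 'rubygems'
require 'ferret'

# Simplified sketch: pick a text extractor by file extension, shell out
# to it, and feed the extracted text into a Ferret index. The paths and
# the example query below are placeholders.
EXTRACTORS = {
  '.doc' => 'catdoc',
  '.ppt' => 'catppt',
  '.xls' => 'xls2csv'
}

def extract_text(path)
  tool = EXTRACTORS[File.extname(path).downcase]
  tool ? `#{tool} '#{path}'` : File.read(path)
end

index = Ferret::Index::Index.new(:path => '/tmp/doc_index')

Dir['docs/**/*.{doc,ppt,xls,txt}'].each do |path|
  index << { :file => path, :content => extract_text(path) }
end

# Example query: print matching file names and scores.
index.search_each('content:"project plan"') do |doc_id, score|
  puts "#{index[doc_id][:file]}  (score #{score})"
end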

John

John B. wrote:

I am now using catdoc, catppt, and xls2csv to index all of my
documents, and it is working well.

The content out of catppt seems to be rather incomplete, but is Good
Enough for our purposes.

If you were going to be happy with just the plain contents being
indexed, I’d suggest running the PowerPoint document through strings
before indexing it. I don’t know whether catppt does more or less than
that, but it’d be useful to compare.
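
For instance, something as crude as this would give a rough feel for
the difference (a throwaway sketch; the file name is a placeholder):

# Throwaway comparison: how much output does each tool produce for the
# same file, and what do the first few lines look like? The file name
# is a placeholder.
file = 'presentation.ppt'

{ 'strings' => `strings '#{file}'`, 'catppt' => `catppt '#{file}'` }.each do |tool, out|
  puts "#{tool}: #{out.length} bytes"
  puts out.split("\n")[0, 3].join("\n")
  puts '---'
end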

On Apr 2, 2007, at 7:02 AM, Alex Y. wrote:

If you were going to be happy with just the plain contents being
indexed, I’d suggest running the PowerPoint document through strings
before indexing it. I don’t know whether catppt does more or less than
that, but it’d be useful to compare.

I tried that; the result is a LOT of binary garbage along with the
plain text.

On Apr 2, 2007, Andreas K. wrote:

The question is to what extent it affects performance.
Andy is right. Indexing binary data like this can really blow out the
size of an index. When you index natural language you get a lot of
repeated terms, so even in an index with millions of documents you may
have only tens of thousands of unique terms. This has a natural
compression effect, so the index ends up a lot smaller than the
collection of data being indexed.

That doesn’t work with binary data, so your index will be much larger
and will contain far more terms. It will definitely have an effect on
search performance, though perhaps not as much as you’d expect.
Nevertheless, you’d be much better off extracting the text, as others
have already said.
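
To put rough numbers on that, a quick check of unique-term counts
shows the difference (an illustrative sketch only; the file names are
placeholders and the tokenization is deliberately crude):

# Illustrative only: compare unique whitespace-separated "terms" in a
# plain-text file versus a raw .ppt. The file names are placeholders.
def unique_terms(path)
  data = File.open(path, 'rb') { |f| f.read }
  data.split(/\s+/).uniq.size
end

puts "plain text: #{unique_terms('notes.txt')} unique terms"
puts "raw .ppt:   #{unique_terms('presentation.ppt')} unique terms"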

Cheers,
Dave

On Apr 6, 2007, at 4:02 AM, David B. wrote:

Andy is right. Indexing binary data like this can really blow out the
size of an index. When you index natural language you get a lot of
repeated terms, so even in an index with millions of documents you may
have only tens of thousands of unique terms. This has a natural
compression effect, so the index ends up a lot smaller than the
collection of data being indexed.

That doesn’t work with binary data, so your index will be much larger
and will contain far more terms. It will definitely have an effect on
search performance, though perhaps not as much as you’d expect.

For the record, by performance I meant the quality of the search
(i.e., the results of a search query), and not the speed. I now
realize that there is no way for anyone to have known that :)

Thanks again for all the ideas, I’m happy as a clam with catdoc/catppt.

John