Forum: Ruby - Just for fun...

Matthew M. (Guest)
on 2007-03-15 19:44
(Received via mailing list)
I posted this to another forum as part of a challenge...  thought to
post it here just for fun.  +5 pts if you know what algorithm is being
used here...

Anyone care to golf?  Note that the breakdown of the string TXT was
just to prevent it from line-wrapping...

TXT = ".,~,,;\ntee,yeeewgeendhoeheddrhdfteeenn" +
      "\n\n\n\n\nr cl u enneiii   ddd hhhlhhhh" +
      "elllphsvHfo ousctTTttTttgdfddwl  tddd  " +
      "uaooAat doopmp cmsse o i   ari    t isa" +
      "joo e\n"

dat = [""] * TXT.size
TXT.size.times do
   dat = TXT.split(//).zip(dat).map{ |a| a.join }.sort
end
puts dat.find{ |x| x[-1] == ?~ }[0...-1]
Stephen L. (Guest)
on 2007-03-15 20:25
(Received via mailing list)
Matthew M. wrote:
> +5 pts if you know what algorithm is being used here...

I'm not going to try to golf it right now, but it looks like an inverse
Burrows-Wheeler Transform.

Stephen
Matthew M. (Guest)
on 2007-03-15 20:35
(Received via mailing list)
On 3/15/07, Stephen L. <removed_email_address@domain.invalid> wrote:
> Matthew M. wrote:
> > +5 pts if you know what algorithm is being used here...
>
> I'm not going to try to golf it right now, but it looks like an inverse
> Burrows-Wheeler Transform.

+5 for Stephen.  =)
Christian N. (Guest)
on 2007-03-19 21:20
(Received via mailing list)
"Matthew M." <removed_email_address@domain.invalid> writes:

> On 3/15/07, Stephen L. <removed_email_address@domain.invalid> wrote:
>> Matthew M. wrote:
>> > +5 pts if you know what algorithm is being used here...
>>
>> I'm not going to try to golf it right now, but it looks like an inverse
>> Burrows-Wheeler Transform.
>
> +5 for Stephen.  =)

Whoa, I just thought of that, without even reading the question...
magic.
Trans (Guest)
on 2007-03-19 21:52
(Received via mailing list)
On Mar 15, 2:33 pm, "Matthew M." <removed_email_address@domain.invalid>
wrote:
> On 3/15/07, Stephen L. <removed_email_address@domain.invalid> wrote:
>
> > Matthew M. wrote:
> > > +5 pts if you know what algorithm is being used here...
>
> > I'm not going to try to golf it right now, but it looks like an inverse
> > Burrows-Wheeler Transform.
>
> +5 for Stephen.  =)

Looks interesting. Care to elaborate on "inverse Burrows-Wheeler
Transform"?


Thanks,
T.
Jason R. (Guest)
on 2007-03-19 22:10
(Received via mailing list)
On 3/19/07, Trans <removed_email_address@domain.invalid> wrote:
> Looks interesting. Care to elaborate on "inverse Burrows-Wheeler
> Transform"?

Google and Wikipedia are your friends:

http://www.google.com/search?q=Burrows-Wheeler

First result:
http://en.wikipedia.org/wiki/Burrows-Wheeler_transform

Jason
David K. (Guest)
on 2007-03-19 22:20
(Received via mailing list)
"Trans" <removed_email_address@domain.invalid> writes:

>> +5 for Stephen.  =)
>
> Looks interesting. Care to elaborate on "inverse Burrows-Wheeler
> Transform"?

The Burrows-Wheeler transform takes some arbitrary string, say,
gigantomania, and sorts all of its rotations: we start with

gigantomania
igantomaniag
gantomaniagi
antomaniagig
ntomaniagiga
tomaniagigan
omaniagigant
maniagiganto
aniagigantom
niagigantoma
iagigantoman
agigantomani

and then sort the lines:

agigantomani
aniagigantom
antomaniagig
gantomaniagi
gigantomania
iagigantoman
igantomaniag
maniagiganto
niagigantoma
ntomaniagiga
omaniagigant
tomaniagigan

Then we take the last column and assign numbers to it:

imgiangoaatn
0123456789AB

Now we sort the letters with the numbers alongside.

aaaggiimnnot
489260315B7A

And now we walk off a linked list starting at 4 (we need to remember
that starting position since all cyclic strings have the same BWT
transform otherwise).

That puts out the original sequence.
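
In current Ruby the steps above come out roughly like this (just a
sketch of the idea, not code from any of the posts; the method names
are mine):

# bwt sorts all rotations and keeps the last column plus the row where
# the original string landed; ibwt pairs each character of that column
# with its position, sorts the pairs, and walks the resulting linked
# list from that row.

def bwt(s)
  rotations = (0...s.size).map { |i| s[i..-1] + s[0...i] }.sort
  [rotations.map { |r| r[-1, 1] }.join, rotations.index(s)]
end

def ibwt(last, start)
  # links[j] = position in the last column of the j-th smallest
  # (character, position) pair -- the "numbers alongside" above
  links = (0...last.size).map { |i| [last[i, 1], i] }.sort.map { |_, i| i }
  pos, out = start, ""
  last.size.times do
    pos = links[pos]
    out << last[pos, 1]
  end
  out
end

col, row = bwt("gigantomania")   # => ["imgiangoaatn", 4]
ibwt(col, row)                   # => "gigantomania"

The golfed snippet at the top of the thread takes the other textbook
route to the inverse: it rebuilds the entire sorted rotation table by
repeatedly prepending the BWT column and re-sorting, and uses the "~"
character as an end-of-string marker instead of carrying the row number
around.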
Trans (Guest)
on 2007-03-20 05:13
(Received via mailing list)
On Mar 19, 4:20 pm, David K. <removed_email_address@domain.invalid> wrote:
>
> aaaggiimnnot
> 489260315B7A
>
> And now we walk off a linked list starting at 4 (we need to remember
> that starting position since all cyclic strings have the same BWT
> transform otherwise).
>
> That puts out the original sequence.

That's nuts.

T.
Clifford H. (Guest)
on 2007-03-20 05:20
(Received via mailing list)
Trans wrote:
> On Mar 19, 4:20 pm, David K. <removed_email_address@domain.invalid> wrote:
>> aaaggiimnnot
>> 489260315B7A
>> And now we walk off a linked list starting at 4 (we need to remember
>> that starting position since all cyclic strings have the same BWT
>> transform otherwise).
>> That puts out the original sequence.
> That's nuts.

Nuts it may be, but it's the reason why "bzip" compresses tighter
than deflate (zip, gzip, etc). It does a BWT first, then compresses
that.

David, I followed your explanation right up to the quoted text about
how to read the original string back out. Can you elaborate a bit
please?

Clifford H..
Trans (Guest)
on 2007-03-20 05:55
(Received via mailing list)
On Mar 19, 11:15 pm, Clifford H. <removed_email_address@domain.invalid> wrote:

> Nuts it may be, but it's the reason why "bzip" compresses tighter
> than deflate (zip, gzip, etc). It does a BWT first, then compresses
> that.

Really? I find that surprising actually. I would not think a sorting
algorithm could lead to any significant compression. It just doesn't
make sense under conservation of information. I assume it's due to
deficiencies in deflate then.

T.
Raj S. (Guest)
on 2007-03-20 07:08
(Received via mailing list)
Clifford H. wrote:
> David, I followed your explanation right up to the quoted text about
> how to read the original string back out. Can you elaborate a bit
> please?
>
> Clifford H..
Just read the Wikipedia article.  It has tables representing the loops
the algorithm goes through.  It's very easy to understand.
Clifford H. (Guest)
on 2007-03-20 08:55
(Received via mailing list)
Trans wrote:
> Really? I find that surprising actually. I would not think a sorting
> algorithm could lead to any significant compression. It just doesn't
> make sense under conservation of information. I assume it's due to
> deficiencies in deflate then.

deflate works by substituting short codes for repeated text, then
Huffman packing the codes and the first instance of each text.
BWT creates more repetitions by grouping similar text, so the
compression has more to chew on.
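
You can see deflate's trick from Ruby's zlib binding (the strings below
are made-up samples, nothing more):

require 'zlib'

repeats = "to be or not to be, that is the question. " * 20
random  = Array.new(repeats.bytesize) { rand(256).chr }.join

Zlib::Deflate.deflate(repeats).bytesize  # tiny: LZ77 codes the repeats,
                                         # Huffman packs what is left
Zlib::Deflate.deflate(random).bytesize   # about as big as the input

A BWT pass adds no information of its own; it just rearranges the text
so that the repeats deflate looks for end up sitting next to each other.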
Rick D. (Guest)
on 2007-03-21 14:48
(Received via mailing list)
On 3/20/07, Clifford H. <removed_email_address@domain.invalid> wrote:
> Trans wrote:
> > Really? I find that surprising actually. I would not think a sorting
> > algorithm could lead to any significant compression. It just doesn't
> > make sense under conservation of information. I assume it's due to
> > deficiencies in deflate then.
>
> deflate works by substituting short codes for repeated text, then
> Huffman packing the codes and the first instance of each text.
> BWT creates more repetitions by grouping similar text, so the
> compression has more to chew on.

And the "gigantomania" example given isn't a particularly good
demonstration of how the transform tends to cluster repetitions.

Quoting from the wikipedia article:

"To understand why this creates more-easily-compressible data, let's
consider transforming a long English text frequently containing the
word "the". Sorting the rotations of this text will often group
rotations starting with "he " together, and the last character of that
rotation (which is also the character before the "he ") will usually
be "t", so the result of the transform would contain a number of "t"
characters along with the perhaps less-common exceptions (such as if
it contains "Brahe ") mixed in. So it can be seen that the success of
this transform depends upon one value having a high probability of
occurring before a sequence, so that in general it needs fairly long
samples (a few kilobytes at least) of appropriate data (such as
text)."

The article uses the word banana, which is a better example.
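
You can watch the clustering happen on a toy sentence (the text below
is made up, not taken from the article):

def bwt_last_column(s)
  (0...s.size).map { |i| s[i..-1] + s[0...i] }.sort.map { |r| r[-1, 1] }.join
end

text = "the cat saw the dog chase the bird up the tree near the house"
# every "he ..." rotation sorts next to the others, and each of them is
# preceded by a "t", so the last column picks up a run of "t"s
bwt_last_column(text).scan(/t+/).max_by(&:size)   # => "ttttt"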
--
Rick DeNatale

My blog on Ruby
http://talklikeaduck.denhaven2.com/
Trans (Guest)
on 2007-03-21 15:11
(Received via mailing list)
On Mar 21, 8:47 am, "Rick DeNatale" <removed_email_address@domain.invalid> 
wrote:
> > BWT creates more repetitions by grouping similar text, so the
> rotations starting with "he " together, and the last character of that
> rotation (which is also the character before the "he ") will usually
> be "t", so the result of the transform would contain a number of "t"
> characters along with the perhaps less-common exceptions (such as if
> it contains "Brahe ") mixed in. So it can be seen that the success of
> this transform depends upon one value having a high probability of
> occurring before a sequence, so that in general it needs fairly long
> samples (a few kilobytes at least) of appropriate data (such as
> text)."
>
> The article uses the word bananna which is a better example.

I think one needs to be careful in understanding this though. The
sorting only provides a benefit in contrast to less efficient
algorithms. No amount of reordering will ever make a difference to the
lower bound of compressibility. That's the crux of my point. And
hence, for a better algorithm, this kind of sorting should be a moot
point. (Of course, one could argue that this sorting Is the better
algorithm -- fair enough.)

T.
Clifford H. (Guest)
on 2007-03-22 12:01
(Received via mailing list)
Trans wrote:
> I think one needs to be careful in understanding this though. The
> sorting only provides a benefit in contrast to less efficient
> algorithms. No amount of reordering will ever make a difference to the
> lower bound of compressibility. That's the crux of my point. And
> hence, for a better algorithm, this kind of sorting should be a moot
> point. (Of course, one could argue that this sorting Is the better
> algorithm -- fair enough.)

Your analysis is well-meaning, but flawed. Compression relies on
prediction. If you know I'm going to send you a message telling
you how many bananas to buy, and you already know that's the subject
of the message, then I never need to say "bananas", I can just say
"12".

If you know you're receiving a message that contains large amounts
of English text, or C language text, or i386 binary opcodes, or you
can recognise and adapt when you find that you are, you'll be able
to compress better than any algorithm that only has a single trick,
repetition, as deflate does. Otherwise we'd all be using deflate for
downloading pirated video streams and no-one would need H.264 :-)

BWT improves deflate's predictor for the kinds of text we tend to
use. Period. Nothing really interesting there, except how it does it.

Clifford H..
Trans (Guest)
on 2007-03-22 16:19
(Received via mailing list)
On Mar 22, 6:00 am, Clifford H. <removed_email_address@domain.invalid> wrote:
> prediction. If you know I'm going to send you a message telling
> you how many bananas to buy, and you already know that's the subject
> of the message, then I never need to say "bananas", I can just say
> "12".

That only works for limited data sets. The larger the set, the less
helpful it is. And for general compression it doesn't help at all.
General compression is all about removing repetition. Try assigning a
number to every word in the dictionary and "compress" a piece of text
with it. It won't help.

> If you know you're receiving a message that contains large amounts
> of English text, or C language text, or i386 binary opcodes, or you
> can recognise and adapt when you find that you are, you'll be able
> to compress better than any algorithm that only has a single trick,
> repetition, as deflate does. Otherwise we'd all be using deflate for
> downloading pirated video streams and no-one would need H.264 :-)

Those things are not limited data sets. They have some limiting rules
(if the files are valid) but those are pretty much useless to boost
compression.

> BWT improves deflate's predictor for the kinds of text we tend to
> use. Period. Nothing really interesting there, except how it does it.

"How it does it" is the interesting part. How does BWT improve
deflate? Ie. What's that tell us about deflate? What does it tell us
about compression in general?

T.
Clifford H. (Guest)
on 2007-03-23 05:45
(Received via mailing list)
Trans wrote:
> On Mar 22, 6:00 am, Clifford H. <removed_email_address@domain.invalid> wrote:
>> Your analysis is well-meaning, but flawed.
> General compression is all about removing repetition.

This is a very concise statement about what's flawed in your analysis.
It's not just about repetition, it's about removing *whatever* is
predictable. All's fair when it comes to improving predictability.
If I cared enough to write a compressor that could recognise
Somerset Maugham's writing style and vocabulary, it could adapt
its general compression algorithm to that style and compress it
better. Sure the decompressor has to know his style as well, but
that's allowed - it can be a general compressor with specific modes
(like video, i386 opcodes, etc, as I mentioned).

> Try assigning a
> number to every word in the dictionary and "compress" a piece of text
> with it. It won't help.

Funny you should say that because I did that once, building a dictionary
of the King James Bible (10.5Kwords) and Huffman coding the words. It
worked very well, thanks very much. It was better than deflate for the
purpose because you can start decompressing anywhere (though that's not
relevant to overall compression goodness).
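
A word-level Huffman coder of that sort fits in a screenful of Ruby
(this is only a sketch of the idea, certainly not the 1983 code, and
the corpus is a stand-in):

def huffman_codes(freq)
  # one leaf per word; repeatedly merge the two lightest nodes
  nodes = freq.map { |word, n| [n, word] }
  until nodes.size == 1
    nodes.sort_by! { |weight, _| weight }
    (w1, a), (w2, b) = nodes.shift(2)
    nodes << [w1 + w2, [a, b]]
  end
  codes = {}
  walk = lambda do |tree, prefix|
    if tree.is_a?(Array)                  # internal node: [left, right]
      walk.call(tree[0], prefix + "0")
      walk.call(tree[1], prefix + "1")
    else                                  # leaf: the word itself
      codes[tree] = prefix
    end
  end
  walk.call(nodes[0][1], "")
  codes
end

words = "in the beginning was the word and the word was with god".split
freq  = Hash.new(0)
words.each { |w| freq[w] += 1 }

codes   = huffman_codes(freq)             # frequent words get short codes
encoded = words.map { |w| codes[w] }.join # the text as one bit string

Decoding only needs the same dictionary, and because the code is
prefix-free you can pick up decoding at any word boundary an index
points to, which is presumably what made the random access work.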

> "How it does it" is the interesting part. How does BWT improve
> deflate? Ie. What's that tell us about deflate?

It tells us that there's more to compressing typical data sets
than simply identifying repetition - there is data that encodes
higher order patterns that are not directly visible as repetition,
but are visible to a better predictor.
Trans (Guest)
on 2007-03-23 10:21
(Received via mailing list)
On Mar 22, 11:45 pm, Clifford H. <removed_email_address@domain.invalid> wrote:
> Trans wrote:
> > On Mar 22, 6:00 am, Clifford H. <removed_email_address@domain.invalid> wrote:
> >> Your analysis is well-meaning, but flawed.
> > General compression is all about removing repetition.
>
> This is a very concise statement about what's flawed in your analysis.
> It's not just about repetition, it's about removing *whatever* is
> predictable. All's fair when it comes to improving predictability.

Of course. When I used the term "repetition" I didn't mean to be so
narrow as to suggest nothing more than "a a a a a". A better term
would have been "patterns".

>
> Funny you should say that because I did that once, building a dictionary
> of the King James Bible (10.5Kwords) and Huffman coding the words. It
> worked very well, thanks very much. It was better than deflate for the
> purpose because you can start decompressing anywhere (though that's not
> relevant to overall compression goodness).

Right. Size is of first relevance. So was your file smaller than
deflate's or bzip2's?

> > "How it does it" is the interesting part. How does BWT improve
> > deflate? Ie. What's that tell us about deflate?
>
> It tells us that there's more to compressing typical data sets
> than simply identifying repetition - there is data that encodes
> higher order patterns that are not directly visible as repetition,
> but are visible to a better predictor.

That's a little belittling, don't you think? I mean, you've taken one
connotation of one word I used and seemingly made it the centerpiece
of all my thoughts on the matter, while missing the larger picture of
my point. I'm not saying your analysis is completely wrong. I'm just
saying I think you are missing a subtle point about what makes
compression work. Predictability is not the whole centerpiece of
compression either. If it were, you couldn't get much better a
compression of King James than a file with this in it:

  "http://patriot.net/~bmcgin/kjv12.txt"

Pretty predictable. That may seem silly, but there's a point. This is
not general compression. It doesn't really matter that the process of
decompression involves a remote machine -- locality is relative. But
even better, if I sent you a new algorithm for compressing King James,
and along with it the file I sent you to uncompress had only one bit
of data in it, "1", then how can it work? Of course, the algorithm
itself had the entire content of King James within it. Pretty
predictable again. But not general compression. When looking at
general compression it's important to take into account the size of all
the elements involved as well, including the compression algorithm itself.

I find this a very fascinating subject. I think ultimately it will
turn out to be very important, not just for compressing data, but for
understanding physics too. I've done some deep meditation on the
topic. It is interesting to consider that all conceivable data can be
found somewhere in any transcendental number. You might think, that
being the case, an excellent compression algorithm would be to find
the starting digit, say in Pi, for the data one is seeking. But
curiously it would do you no good. On average the number for the
starting digit would be just as long as the data itself.

But I digress, if you think you are so sure about what you are saying,
and that I am completely off-base, I think you should try to solve
this puzzle:

  http://prize.hutter1.net/

I believe what they are trying to do is essentially impossible. At
best they can shave a few bytes off here and there by removing some
"higher-order" patterns, as you say. Their premise is flawed.
Knowledge is no more compressible b/c it's knowledge, than anything
else. They should realize either the human brain is just that vast or
that human knowledge uses lossy compression. They'll gain nothing of
great significance until they do.

T.
Mauricio F. (Guest)
on 2007-03-23 12:01
(Received via mailing list)
On Fri, Mar 23, 2007 at 05:20:36PM +0900, Trans wrote:
> On Mar 22, 11:45 pm, Clifford H. <removed_email_address@domain.invalid> wrote:
> > This is a very concise statement about what's flawed in your analysis.
> > It's not just about repetition, it's about removing *whatever* is
> > predictable. All's fair when it comes to improving predictability.
>
> Of course. When I used the term "repetition" I didn't mean to be so
> narrow as to suggest nothing more than "a a a a a". A better term
> would have been "patterns".

(redundancy)

> and along with it the file I sent you to uncompress had only one bit
> of data in it, "1", then how can it work? Of course, the algorithm
> itself had the entire content of King James within it. Pretty
> predictable again. But not general compression. When looking at
> general compression it's important to take into account the size of all
> the elements involved as well, including the compression algorithm itself.

Given that Kolmogorov-Chaitin complexity is not computable, what's wrong
with finding practical algorithms that approximate it?

I don't see the intention behind your remarks: surely, any practical
compression method we come up with is going to perform worse than the
optimum one (whose performance we can only estimate... using inferior
compression methods!), so what? The existence of an uncomputable lower
bound on the length of the description of a message is no more of a
deterrent to research on better (practical) compression methods than
e.g. Shannon's theorem is to channel coding.

Regarding your remark on the choice of the universal machine and some
choices not being "general compression", you could add that even if
descriptive complexity relative to two different universal machines
differs at most by a constant term, the latter does matter when we're
approximating the algorithmic complexity in practical scenarios...
Clifford H. (Guest)
on 2007-03-23 12:16
(Received via mailing list)
Trans wrote:
> Of course. When I used the term "repetition" I didn't mean to be so
> narrow as to suggest nothing more than "a a a a a". A better term
> would have been "patterns".

Ok, sorry, that wasn't clear. I agree with you.

> Right. Size is of first relevance. So was your file smaller than
> deflate's or bzip2's?

I don't think either existed in 1983 :-). But I expect that
the answer is "no". In any case, neither would suit, as the
compression existed to serve the purpose of full-text search
that had a separate compressed index for every word, and to
use the index, it was necessary to be able efficiently to
open the file at any word. Nowadays I'd open the file at any
disk block boundary, or bigger, since the CPU cost is
negligible next to the I/O cost. You'd probably use some
form of arithmetic coding these days anyhow.

>>> "How it does it" is the interesting part. How does BWT improve
>>> deflate? Ie. What's that tell us about deflate?
>> It tells us that there's more to compressing typical data sets
> That's a little belittling, don't you think? I mean, you've taken one
> connotation of one word I used and seemingly made it the centerpiece

Ok, I'm sorry, I thought you meant literal repetition, I didn't
mean to unfairly latch on to that, I just needed to clarify that
Shannon's theorem doesn't imply such repetition.

> Predictability is not the whole center piece of compression either.

Hmm. I think perhaps we still disagree then... at least about the
definition of predictability...

> Pretty predictable. That may seem silly, but there's a point. This is
> not general compression. It doesn't really matter that the process of
> decompression involves a remote machine -- locality is relative.

Information is relative. Our language is the prime example; if my
words don't evoke symbols in your thoughts that map to a similar
symbolic morphology in you, then we can't communicate. So English
isn't just a set of words, it's a symbolic net whose parts connect
our shared experiences.

In other words, you, as the "decompressor" of my sequence of words,
must have related context, experience, that forms a morphologically
similar mesh, the nodes of which are associated with words that we
share.

Taking that as my context, I maintain that your "general" data
stream only has theoretical existence. All meaningful data streams
have structure. There are standard forms of mathematical pattern
search like finding repetition, using Fourier analysis, even
fractal analysis, but these cannot hope to find all possible
patterns - they can only find the kinds of structure they look
for. At some point, those structure types that are coded into the
de/compressors are encodings of human interpretations, not intrinsic
realities. The encoding of the human interpretation is "fair", even
the extreme one in your URL example.

> I find this a very fascinating subject. I think ultimately it will
> turn out to be very important, not just for compressing data, but for
> understanding physics too.

I agree, though I hadn't thought of the physics linkup. I think
that the very structure and nature of knowledge itself is hidden
here somewhere. So's most of our non-conscious learning too, like
learning to walk... though the compression of patterns of movement
in our cerebellum requires an approach to using the *time* domain
in a way I haven't seen a machine implement.

> It is interesting to consider that all conceivable data can be
> found somewhere in any transcendental number.

Since a transcendental number is just an infinite stream of digits,
there exist an infinite number of them that can encode any given
infinite stream of information. I don't know how deep that insight
really is though.

> You might think, that
> being the case, an excellent compression algorithm would be to find
> the starting digit, say in Pi, for the data one is seeking. But
> curiously it would do you no good. On average the number for the
> starting digit would be just as long as the data itself.

:-) Good one! I'm sure there's some deep theoretical studies behind
that.

> But I digress, if you think you are so sure about what you are saying,
> and that I am completely off-base,

Now that you've explained yourself more clearly, I don't
think you're off-base at all.

> I believe what they are trying to do is essentially impossible.

I don't know. I wish I knew where I read that the human brain can
"learn" about the contents of 2 CDs - 1200MB, so even Einstein's
knowledge could be encoded into that, if we knew how. The sum of
all human knowledge would doubtless be bigger than that, but I
found it an interesting thought. Some recent neurological research
I read indicated that, contrary to previous thought, our brains
actually do have a limited capacity. The previous assumption of
almost unlimited was based on thoughts about the likely failure
modes when "full" - and those were based on incorrect theories
about how memory works. I could perhaps dig up those papers
somewhere.

> They should realize either the human brain is just that vast or
> that human knowledge uses lossy compression.

I don't agree at all. There surely is a limit to the achievable
compression, but we're nowhere near it, and the search will tell
us much about the nature of knowledge.

Anyhow, interesting stuff, but not about Ruby... continue offline
if you wish...

Clifford H..
Trans (Guest)
on 2007-03-23 12:40
(Received via mailing list)
On Mar 23, 5:59 am, Mauricio F. <removed_email_address@domain.invalid> wrote:
> Given that Kolmogorov-Chaitin complexity is not computable, what's wrong
> with finding practical algorithms that approximate it?
>
> I don't see the intention behind your remarks: surely, any practical
> compression method we come up with is going to perform worse than the optimum
> one (whose performance we can only estimate... using inferior compression
> methods!), so what? The existence of an uncomputable lower bound on the length
> of the description of a message is no more of a deterrent to research on
> better (practical) compression methods than e.g. Shannon's theorem is to
> channel coding.

When did I say otherwise? I believe my original point was simply one
of surprise that BWT improved deflate, which I believe says more about
deflate's deficiencies than BWT's use as a means of compression. I've
only been trying to explain my reasoning for that thought ever since.

> Regarding your remark on the choice of the universal machine and some choices
> not being "general compression", you could add that even if descriptive
> complexity relative to two different universal machines differs at most by a
> constant term, the latter does matter when we're approximating the algorithmic
> complexity in practical scenarios...

I'm sorry. Would you care to put that in layman's terms? I did not
have the good fortune of an academic education in the field (oh, if
only!), so I cannot readily address your statement (assuming of course
you are using academic terms in good faith, and not just being
intentionally obscure).

T.
Mauricio F. (Guest)
on 2007-03-23 14:18
(Received via mailing list)
On Fri, Mar 23, 2007 at 07:38:28PM +0900, Trans wrote:
> of surprise that BWT improved deflate, which I believe says more about
> deflate's deficiencies than BWT's use as a means of compression. I've
> only been trying to explain my reasoning for that thought ever since.

Ah I see, you were surprised at deflate being so bad... I was surprised
at your surprise :)

> intentionally obscure).
Sorry, I assumed you were familiar with Kolmogorov complexity and tried
to remove some redundancy [1] from my message based on that premise ;-)
(you said you had been thinking deeply about this so it wasn't too
unreasonable :) [although this is all redundant ultimately and could be
replaced by a pointer into e.g. Cover&Thomas' Elements of information
theory; BTW IIRC there's a new edition in the works?]

What you hinted at in your prev. message is known as the
Kolmogorov(-Chaitin)/descriptive complexity, corresponding to the length
of the minimal description of a string using a program in a given Turing
machine/programming language. It is therefore relative to the universal
machine/language you choose, but if you pick two different ones the
corresponding complexities for a given string can differ at most by a
constant amount, the length of the string that emulates one machine on
the other.

In theory, you often ignore such constant factors (just choose a string
long enough :), but in the example you gave, the machine included a full
copy of the KJV bible, allowing it to replicate it with a single
instruction... and as you said the description of the machine itself
made all the difference and it only worked for that particular string,
making it of little interest. There's a sort of trade-off between the
complexity of the machine and that of the programs you feed into it and
both parts matter in practice: a machine able to compress any English
text to 1/1000th of its original size wouldn't help residential users if
it included/required a (mildly compressed?) copy of Google's archive.

I probably wasn't 100% intellectually honest in my prev. message, I
apologize for that. I wasn't as much giving new information as trying
to drag this thread into a more information-theoretic OT, hoping some
terminology would trigger interesting replies.

[1] I'm sensitive to redundancy (as an EE, I've also learned a bit about
it...;). This is why I rarely post to ruby-talk anymore; there's very
little surprise in the questions/issues being discussed, and the
responses to new-but-actually-old questions I could give would hardly
convey new information. If everybody killed his messages when they don't
pass the "entropy criterion" the way I do, this ML would have a small
fraction of its current traffic; I cannot ask people to only post things
of interest to me (why would anybody consent to that, and how would they
know anyway), but I can try to only post messages that would have
interested me. For instance, I was going to reply "inverse BWT!" to the
OP when nobody had responded yet, but I anticipated somebody would do
eventually, so I killed my msg. I don't respond to questions which are
likely to be answered by somebody else (you can call it proactive global
redundancy control by local self-censoring ;).

A nice consequence of this is that it also filters out most unreasonable
content (would I be interested in some ad hominem argumentation? No, so
I should not post any either. The temptation to do so is strong, as some
people, including well-known/knowledgeable posters, use emotionally
loaded language and sometimes defend their positions vehemently.)

sorry again
Chad P. (Guest)
on 2007-03-23 16:41
(Received via mailing list)
On Fri, Mar 23, 2007 at 07:15:05PM +0900, Clifford H. wrote:
>
> Anyhow, interesting stuff, but not about Ruby... continue offline
> if you wish...

Bummer.  I was having fun spectating.
Trans (Guest)
on 2007-03-26 04:19
(Received via mailing list)
On Mar 23, 6:15 am, Clifford H. <removed_email_address@domain.invalid> wrote:

> I don't think either existed in 1983 :-). But I expect that
> the answer is "no". In any case, neither would suit, as the
> compression existed to serve the purpose of full-text search
> that had a separate compressed index for every word, and to
> use the index, it was necessary to be able efficiently to
> open the file at any word. Nowadays I'd open the file at any
> disk block boundary, or bigger, since the CPU cost is
> negligible next to the I/O cost. You'd probably use some
> form of arithmetic coding these days anyhow.

Wow. You did that back in 83? I was just learning to code in C64
BASIC :)

> >>> "How it does it" is the interesting part. How does BWT improve
> >>> deflate? Ie. What's that tell us about deflate?
> >> It tells us that there's more to compressing typical data sets
> > That's a little belittling, don't you think? I mean, you've taken one
> > connotation of one word I used and seemingly made it the centerpiece
>
> Ok, I'm sorry, I thought you meant literal repetition, I didn't
> mean to unfairly latch on to that, I just needed to clarify that
> Shannon's theorem doesn't imply such repetition.

Thanks. I understand your take too. I should have been a little more
thoughtful in my wording.

> words don't evoke symbols in your thoughts that map to a similar
> stream only has theoretical existence. All meaningful data streams
> have structure. There are standard forms of mathematical pattern
> search like finding repetition, using Fourier analysis, even
> fractal analysis, but these cannot hope to find all possible
> patterns - they can only find the kinds of structure they look
> for. At some point, those structure types that are coded into the
> de/compressors are encodings of human interpretations, not intrinsic
> realities. The encoding of the human interpretation is "fair", even
> the extreme one in your URL example.

Fair enough. I guess I'm limiting my definition of predictability too
much. My point was really just that we have to take into account the
means of prediction along with the compressed data to measure
effectiveness.

> > I find this a very fascinating subject. I think ultimately it will
> > turn out to be very important, not just for compressing data, but for
> > understanding physics too.
>
> I agree, though I hadn't thought of the physics linkup. I think
> that the very structure and nature of knowledge itself is hidden
> here somewhere. So's most of our non-conscious learning too, like
> learning to walk... though the compression of patterns of movement
> in our cerebellum requires an approach to using the *time* domain
> in a way I haven't seen a machine implement.

Hmm... interesting point. Perhaps OT, but I recall reading some time
ago about the discovery that our thoughts are not encoded by brain
signals in themselves, but in the timing of those signals. That really
threw me for a loop.

> > It is interesting to consider that all conceivable data can be
> > found somewhere in any transcendental number.
>
> Since a transcendental number is just an infinite stream of digits,
> there exist an infinite number of them that can encode any given
> infinite stream of information. I don't know how deep that insight
> really is though.

I can see that they can encode any "finite" stream. I don't know if
they can encode any infinite stream though. In any case, I agree,
whether that's right or not, it's probably a minor point.

> > You might think, that
> > being the case, an excellent compression algorithm would be to find
> > the starting digit, say in Pi, for the data one is seeking. But
> > curiously it would do you no good. On average the number for the
> > starting digit would be just as long as the data itself.
>
> :-) Good one! I'm sure there's some deep theoretical studies behind
> that.

Actually it just occurs to me that it is probably pretty simple to prove,
if you assume that the sequence of digits of a transcendental is
effectively random (which I think is a safe bet). Don't know why that
never occurred to me before.

> knowledge could be encoded into that, if we knew how. The sum of
> all human knowledge would doubtless be bigger than that, but I
> found it an interesting thought. Some recent neurological research
> I read indicated that, contrary to previous thought, our brains
> actually do have a limited capacity. The previous assumption of
> almost unlimited was based on thoughts about the likely failure
> modes when "full" - and those were based on incorrect theories
> about how memory works. I could perhaps dig up those papers
> somewhere.

That's all? Wow. I would have expected at least 100 times that. Of
course, that doesn't mean it's readily accessible to our conscious
mind. Then again, it's hard to judge such large sizes, so maybe 2 CDs
is about right.

> > They should realize either the human brain is just that vast or
> > that human knowledge uses lossy compression.
>
> I don't agree at all. There surely is a limit to the achievable
> compression, but we're nowhere near it, and the search will tell
> us much about the nature of knowledge.

Maybe, but I think the idea that human knowledge is more (lossless)
compressible just b/c it is human knowledge is flawed. I'm surprised
to hear we are nowhere near the limit though. Not sure why, but I
thought we were fairly close, say within 10-20% or so.

> Anyhow, interesting stuff, but not about Ruby... continue offline
> if you wish...

Cool. Thanks for the discussion. Unfortunately I'm pretty busy and
need to focus on work, but maybe down the road...

T.
Trans (Guest)
on 2007-03-29 04:46
(Received via mailing list)
Sorry about the delayed response. I just have too much on my mind...

On Mar 23, 8:18 am, Mauricio F. <removed_email_address@domain.invalid> wrote:

> Ah I see, you were surprised at deflate being so bad... I was surprised at
> your surprise :)

Truly surprised. Makes me wonder why other formats like 7z aren't more
widely used. Is decompression speed really that much more important
than size?

> > intentionally obscure).
>
> Sorry, I assumed you were familiar with Kolmogorov complexity and tried to
> remove some redundancy [1] from my message based on that premise ;-) (you said
> you had been thinking deeply about this so it wasn't too unreasonable :)
> [although this is all redundant ultimately and could be replaced by a pointer
> into e.g. Cover&Thomas' Elements of information theory; BTW IIRC there's a new
> edition in the works?]

I will have to look for that book. Thanks.

> the KJV bible, allowing it to replicate it with a single instruction... and as
> you said the description of the machine itself made all the difference and it
> only worked for that particular string, making it of little interest. There's
> a sort of trade-off between the complexity of the machine and that of the
> programs you feed into it and both parts matter in practice: a machine able to
> compress any English text to 1/1000th of its original size wouldn't help
> residential users if it included/required a (mildly compressed?) copy of
> Google's archive.

Thanks. Makes perfect sense and now I know what to call it. Which is
always good. :)

> I probably wasn't 100% intellectually honest in my prev. message, I apologize
> for that. I wasn't as much giving new information as trying to drag this
> thread into a more information-theoretic OT, hoping some terminology would
> trigger interesting replies.

Sounds like you know more about it than anyone else here. So I'm not
sure if I can hold your interest at all. But I might entertain you a
bit with a related story.

When I first learned about the concept of compression, I did some
experimentation. Now, at this point, I had no idea how it was done, or
anything about Huffman coding, or anything like that. But I generally
learn best by experimenting, and that usually means thought
experiments, which, as an aside, also accounts a bit for my lack of
"book smarts". But in any case, I thought about how one might compress
data by algebraic means. I imagined the data as one giant number, and
then considered formulas to rapidly approach that number. Exponents of
course grew rapidly, but I figured I needed to grow even faster, so I
"invented" the operation subsequent to "power". I called it, for no
apparent reason: "milk". So

  3 milk 3 = 3 ** (3 ** 3)

And I then postulated the next operation as well, which I called
"cheese".

  3 cheese 3 = 3 milk (3 milk 3)

Needless to say, 3 cheese 3 is a huge number. I took the final step and
abstracted the set of all such operations starting with addition (O0),
to multiplication (O1), to power (O2), milk (O3), etc. And so I
figured any large number could be represented as a vector Xi, such
that:

  Number = Sum(i..0) Oi(Xi+1, Xi)

At that point the problem became trying to solve this equation for a
given number. I must have filled up every chalkboard at my school
working on that. My fellow students thought I had lost my mind! And I
did too, as I found myself trying to perform integration on this beast!
That in fact turned out to be impossible.
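
For what it's worth, the ladder of operations itself is easy to sketch
in Ruby. The level numbering below follows the story (O0 is addition),
so level 3 is "milk", which is usually called tetration; this is only
an illustration, not anything from the original chalkboards:

# each level folds the one below it: a On b = a O(n-1) (a On (b - 1))
def hyper(level, a, b)
  case level
  when 0 then a + b
  when 1 then a * b
  when 2 then a ** b
  else b.zero? ? 1 : hyper(level - 1, a, hyper(level, a, b - 1))
  end
end

hyper(3, 3, 3)    # => 7625597484987, i.e. 3 ** (3 ** 3) -- "3 milk 3"
hyper(3, 2, 4)    # => 65536, a power tower of four 2s
# hyper(4, 3, 3)  # "3 cheese 3": a tower of 7625597484987 threes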

Pretty nuts, eh? Of course, I later realized, after solving the
equation manually a number of times, the solutions would be no more
compressed than the original figure. My original idea of "rapid
growth" was flawed. And so it was that I got my first hint of what I
now learn is called "Kolmogorov(-Chaitin)/descriptive complexity".


> anticipated somebody would do eventually, so I killed my msg. I don't respond
> to questions which are likely to be answered by somebody else (you can call it
> proactive global redundancy control by local self-censoring ;).
>
> A nice consequence of this is that it also filters out most unreasonable
> content (would I be interested in some ad hominem argumentation? No, so I
> should not post any either. The temptation to do so is strong, as some people,
> including well-known/knowledgeable posters, use emotionally loaded language
> and sometimes defend their positions vehemently.)
>
> sorry again

No need for apologies. No doubt you know that I am far from exemplary
in these affairs, but I believe (or at least hope) that I have gotten
better with time. But with you, well, it almost sounds like you are
playing a game of Prisoner's Dilemma with your posts. And in that
case, I wonder who's winning? ;)


Thanks for the details, btw!
T.
Phillip G. (Guest)
on 2007-03-29 04:56
(Received via mailing list)
Trans wrote:
> Sorry about the delayed response. I just have too much on my mind...
>
> On Mar 23, 8:18 am, Mauricio F. <removed_email_address@domain.invalid> wrote:
>
>> Ah I see, you were surprised at deflate being so bad... I was surprised at
>> your surprise :)
>
> Truly surprised. Makes me wonder why other formats like 7z aren't more
> widely used. Is decompression speed really that much more important
> than size?

Taking at todays cheap processing time and even cheaper mass storage,
I'd say, all else being equal, size isn't that important to the end
user.

in other environments (transferring large files across a thin pipe),
size is more important than speed.

It's good to have choice in different scenarios.

--
Phillip "CynicalRyan" Gawlowski
http://cynicalryan.110mb.com/

Rule of Open-Source Programming #11:

When a developer says he will work on something, he or she means
"maybe".
Phillip G. (Guest)
on 2007-03-29 04:59
(Received via mailing list)
Phillip G. wrote:

> Taking at todays cheap processing time and even cheaper mass storage,
> I'd say, all else being equal, size isn't that important to the end user.

That sentence should read "Taking a look at..."

D'oh.


--
Phillip "CynicalRyan" Gawlowski
http://cynicalryan.110mb.com/

Rule of Open-Source Programming #5:

A project is never finished.
Chad P. (Guest)
on 2007-03-29 18:28
(Received via mailing list)
On Thu, Mar 29, 2007 at 09:56:17AM +0900, Phillip G. wrote:
> >than size?
>
> Taking at todays cheap processing time and even cheaper mass storage,
> I'd say, all else being equal, size isn't that important to the end user.
>
> in other environments (transferring large files across a thin pipe),
> size is more important than speed.

Actually, I suspect that there are a couple reasons for the greater
popularity of some compression schemes, despite their poorer compression
performance:

1. When you wish to use compression "on the fly", as it were, speed is
*very* important.  You want opening, closing, and otherwise using files
that are kept in a compressed format, to be about as quick and
responsive as using uncompressed files.  This is probably not anywhere
near as big a reason as the other one, though. . . .

2. Certain compression programs are very well known, and "everyone" has
them (for some definition of "everyone", depending on OS platform, et
cetera).  Thus, "everyone" uses them.  Short of producing a hugely
popular program that handles both old and new compression algorithms (or
both old and new file formats, in other examples of this phenomenon in
action), adoption of something new is going to be very slow and prone to
failure despite any technical advantages to the new algorithm/format.
This is illustrated by the commercial end-user market failure of
the Betamax -- VHS won that little skirmish simply
because it was more widely available, quickly became a household word,
and prevented migration to Betamax simply by way of market inertia.
Alex Y. (Guest)
on 2007-03-29 18:37
(Received via mailing list)
Chad P. wrote:
> and prevented migration to Betamax simply by way of market inertia.
This needn't be the end of the story.  For example, WinZip 5 (if I
remember correctly) didn't handle tar.gz files.  Modern WinZip does.  I
think it also handles tar.bz2 files.  I wouldn't be at all surprised if
the next version handled .7zs.
Kristoffer L. (Guest)
on 2007-03-30 06:48
(Received via mailing list)
On 3/29/07, Chad P. <removed_email_address@domain.invalid> wrote:
> and prevented migration to Betamax simply by way of market inertia.
>

Or just witness the huge popularity out there for the RAR format - a
format that is popular only because it once got popular - and the only
reason for that was that it made it easy to split archives into many
small parts. This was important (especially for the pirate market, of
course) in the days when a huge download could be corrupted. RAR
allowed people to redownload only one small part when that happened.

Today, with torrents and hashes in all major protocols, this is
essentially a moot issue, and the slightly better compression than
some other formats is also negligible as in no user would actually
notice any difference except for byte count in the file manager...
Today it's just one big bad hassle, because people use it a LOT and
this means installing an extra program on every computer you use if
you want to participate - a non-free program at that, for those of us
who care.

Using RAR amounts to Cargo Cult Compression in my opinion, as it's
only invoking the old ritual of times past because "we have always
done it this way" with no understanding on why and no actual benefits.

Not that it's really the best solution for many reasons, but if you
want to be really practical, people should use zip for all their
compression needs - because that's the only format all modern OS:es
can open with no extra software, at least that I know of.

Just goes to show...

-- Kristoffer
Chad P. (Guest)
on 2007-03-30 07:44
(Received via mailing list)
On Fri, Mar 30, 2007 at 11:47:33AM +0900, Kristoffer Lundén wrote:
> some other formats is also negligible as in no user would actually
> Not that it's really the best solution for many reasons, but if you
> want to be really practical, people should use zip for all their
> compression needs - because that's the only format all modern OS:es
> can open with no extra software, at least that I know of.

In general, I agree with your assessment of the situation, but I'm not
entirely sure about one thing you said.

Is zip format really available with all OSes without extra software?
I'm pretty sure MS Windows requires one to install additional software
still, and I don't seem to recall zip being standard on all major Linux
distributions.  I haven't bothered to check whether it's available on
FreeBSD, mostly just because I haven't needed an unzipper in quite a
while.
Chris S. (Guest)
on 2007-03-30 07:55
(Received via mailing list)
On Mar 29, 9:44 pm, Chad P. <removed_email_address@domain.invalid> wrote:
> > essentially a moot issue, and the slightly better compression than
>
> still, and I don't seem to recall zip being standard on all major Linux
> distributions.  I haven't bothered to check whether it's available on
> FreeBSD, mostly just because I haven't needed an unzipper in quite a
> while.
>
> --
> CCD CopyWrite Chad P. [http://ccd.apotheon.org]
> Leon Festinger: "A man with a conviction is a hard man to change. Tell
> him you disagree and he turns away. Show him facts and figures and he
> questions your sources. Appeal to logic and he fails to see your point."

Zip has been handled by the base Windows install since the first
release of XP.  In fact, I think since about that same time, I've
never had the need to install anything for handling zip files across
Windows, Linux, and Mac.

Chris
Kristoffer L. (Guest)
on 2007-03-30 08:51
(Received via mailing list)
On 3/30/07, Chad P. <removed_email_address@domain.invalid> wrote:
> Is zip format really available with all OSes without extra software?
> I'm pretty sure MS Windows requires one to install additional software
> still, and I don't seem to recall zip being standard on all major Linux
> distributions.  I haven't bothered to check whether it's available on
> FreeBSD, mostly just because I haven't needed an unzipper in quite a
> while.
>

It works on XP, or at least it did a few years ago when I used it the
last time, and as far as I've seen it works OOTB on any Linux/BSD/*nix
distro that ships GNOME or KDE. To be honest, I don't know about OS/X,
but I just assumed they handle this too.

From this point on, it's all arguing what a "major" OS or distribution
is. :) I'd personally draw the line at Windows, Mac, and the 3-6
biggest Linuxes, but I know a lot of people would include a lot more,
just as a lot of people would only include Windows.

I do think that I'm mostly right, though.

-- Kristoffer
Chad P. (Guest)
on 2007-03-30 09:19
(Received via mailing list)
On Fri, Mar 30, 2007 at 01:51:27PM +0900, Kristoffer Lundén wrote:
> last time, and as far as I've seen it works OOTB on any Linux/BSD/*nix
> distro that ships GNOME or KDE. To be honest, I don't know about OS/X,
> but I just assumed they handle this
>
> From this point on, it's all arguing what a "major" OS or distribution
> is. :) I'd personally draw the line at Windows, Mac, and the 3-6
> biggest Linuxes, but I know a lot of people would include a lot more,
> just as a lot of people would only include Windows.
>
> I do think that I'm mostly right, though.

With the above in mind, yeah, you're probably "mostly right" at least.
I just got so used to installing a zip/unzip utility on MS Windows that
I always installed it as one of the first things I did with my own
systems whenever I had need of XP (increasingly rare now), and since I
tend to do minimal installs with Debian or FreeBSD (for instance), no
KDE, GNOME, or zip utilities need apply in my "default" installs.  I
guess the culprit in this case is the simple fact that I'm pretty
atypical, as computer users go.
Trans (Guest)
on 2007-03-30 16:40
(Received via mailing list)
On Mar 29, 10:47 pm, "Kristoffer Lundén" <removed_email_address@domain.invalid>
wrote:

> Not that it's really the best solution for many reasons, but if you
> want to be really practical, people should use zip for all their
> compression needs - because that's the only format all modern OS:es
> can open with no extra software, at least that I know of.

Yet, this is practical only in the short-run, and because too many
people follow this kind of tenet, the long-run never gets here. In
other words, by not collectively working together to adopt better
technologies we are ultimately being less practical.

T.
Kristoffer L. (Guest)
on 2007-03-30 17:36
(Received via mailing list)
On 3/30/07, Trans <removed_email_address@domain.invalid> wrote:
> Yet, this is practical only in the short-run, and because too many
> people follow this kind of tenet, the long-run never gets here. In
> other words, by not collectively working together to adopt better
> technologies we are ultimately being less practical.
>

Like I said, not ideal. And I have nothing against going forward, and
would applaud any *open standard* (free to implement by anyone)
becoming widespread across all platforms.

Meanwhile, however, compression becomes steadily less and less of any
real value for the average desktop user, for almost any scenario -
today, archiving is mainly done to group files, and any compression is
mostly a nice side effect. A certain group of people who believe in
doing things as efficiently and correctly as possible (they call us
nerds ;-)) will not agree, but it's true nonetheless and easy to
observe in almost any computer user - it's when a directory, or even
more than 2-3 files, is to be mailed that the archivers are used,
almost never because of size issues. Talking general computer use
here.

Of course, if I release some code only meant to be used on Linux, I
usually tar.gz or .bz2 it, and Mac users probably still .sit things in
general?

The point is, by defaulting to zip, you may hold back the revolution a
bit, but it also ensures that your mom, your boss, your customers, and
just about everyone except Perrin ;-) can open it. This is convenient
and pragmatic, but it's also important: there's a lot of busy people
out there who will not even bother to reply with a complaint if they
can't open the files directly; they will just junk the message and
move on. A *lot* of people out there aren't interested in installing
anything, no matter how good it is. A lot of these people may pay your
salary or sponsor your next project.

I have about 40 students I work with, computer literate and aged 18-30,
and I am writing this because this had shown itself to be a real
problem before I got them all to only send stuff in formats as
standard as possible - this includes not sending MS files unless
requested. Students have, in a very real sense, lost internship
opportunities, and maybe even job offers (who knows?) because of this,
and only a long time afterwards did it come out that it was because it
was a RAR archive or, funnily, an Open Document, or something like
that.

Some will say it's a matter of education, to which I wish you the best
of luck. It hasn't worked so far, because most people are not
interested. However, I sincerely support any movement to get, say,
7zip built into the next Windows service pack... that's what it is
gonna take at this moment.

-- Kristoffer
Robert D. (Guest)
on 2007-03-30 18:17
(Received via mailing list)
On 3/23/07, Trans <removed_email_address@domain.invalid> wrote:
<snip>
> When did I say otherwise? I believe my original point was simply one
> of surprise that BWT improved deflate, which I believe says more about
> deflate's deficiencies than BWT's use as a means of compression. I've
> only been trying to explain my reasoning for that thought ever since.

I'd like to ask a question, Tom, as this is becoming a little bit
obscure, as you put it.

BWT is used by bzip2 too; frankly, I am surprised that you are
surprised, Tom. As BWT will create repetitions, which is quite clear,
and allows for an easy undoing of its transformation, would it not help
a whole class of compression algorithms?

When you talked about a deficiency, would that not mean - in your setup
- that a compression algorithm should implicitly find "clever"
transformations? If this was your thought, it is an incredibly advanced
and frankly optimistic idea.

And that is my question BTW (not BWT ;) - do you indeed look at this
issue from this angle?

I can accept it as a 100% true statement though :)

But I guess putting them together (transformation and compression) by
hand is quite clever already and pretty much state of the art.

Cheers
Robert
<snip>
Chad P. (Guest)
on 2007-03-30 19:57
(Received via mailing list)
On Fri, Mar 30, 2007 at 10:35:38PM +0900, Kristoffer Lundén wrote:
>
> The point is, by defaulting to zip, you may hold back the revolution a
> bit, but it also ensures that your mom, your boss, your customers, and
> just about everyone except Perrin ;-) can open it. This is convenient
> and pragmatic, but it's also important: there's a lot of busy people
> out there who will not even bother to reply with a complaint if they
> can't open the files directly; they will just junk the message and
> move on. A *lot* of people out there aren't interested in installing
> anything, no matter how good it is. A lot of these people may pay your
> salary or sponsor your next project.

Well . . . to be fair, I *can* open it, but I usually need to type
something like one of the following first:

  apt-get install unzip

  cd /usr/ports/archivers/unzip; make install clean

Thanks for thinking of me, though.


>
> I have about 40 students I work with, computer literate and aged 18-30,
> and I am writing this because this had shown itself to be a real
> problem before I got them all to only send stuff in formats as
> standard as possible - this includes not sending MS files unless
> requested. Students have, in a very real sense, lost internship
> opportunities, and maybe even job offers (who knows?) because of this,
> and only a long time afterwards did it come out that it was because it
> was a RAR archive or, funnily, an Open Document, or something like
> that.

It always pays to find out what the person on the other end wants, when
that person has something you need.
Christian N. (Guest)
on 2007-03-30 21:01
(Received via mailing list)
"Trans" <removed_email_address@domain.invalid> writes:

> Pretty nuts, eh? Of course, I later realized, after solving the
> equation manually a number of times, the solutions would be no more
> compressed than the original figure. My original idea of "rapid
> growth" was flawed. And so it was that I got my first hint of what I
> now learn is called "Kolmogorov(-Chaitin)/descriptive complexity".

Reminds me a lot of my idea of just "indexing" all data by their
offset in Pi. :-)
Martin DeMello (Guest)
on 2007-04-01 00:20
(Received via mailing list)
On 3/30/07, Christian N. <removed_email_address@domain.invalid> wrote:
>
> Reminds me a lot of my idea of just "indexing" all data by their
> offset in Pi. :-)

I think most geeks have come up with that one at least once in their
lives :)

martin