Forum: Ruby - Just for fun...

Matthew M. (Guest)
on 2007-03-15 19:44
(Received via mailing list)
I posted this to another forum as part of a challenge...  thought to
post it here just for fun.  +5 pts if you know what algorithm is being
used here...

Anyone care to golf?  Note that the breakdown of the string TXT was
just to prevent it from line-wrapping...

TXT = ".,~,,;\ntee,yeeewgeendhoeheddrhdfteeenn" +
      "\n\n\n\n\nr cl u enneiii   ddd hhhlhhhh" +
      "elllphsvHfo ousctTTttTttgdfddwl  tddd  " +
      "uaooAat doopmp cmsse o i   ari    t isa" +
      "joo e\n"

dat = [""] * TXT.size
TXT.size.times do
   dat = TXT.split(//).zip(dat).map{ |a| a.join }.sort
end
puts dat.find{ |x| x[-1] == ?~ }[0...-1]
Stephen L. (Guest)
on 2007-03-15 20:25
(Received via mailing list)
Matthew M. wrote:
> +5 pts if you know what algorithm is being used here...

I'm not going to try to golf it right now, but it looks like an inverse
Burrows-Wheeler Transform.

Stephen
Matthew M. (Guest)
on 2007-03-15 20:35
(Received via mailing list)
On 3/15/07, Stephen L. <removed_email_address@domain.invalid> wrote:
> Matthew M. wrote:
> > +5 pts if you know what algorithm is being used here...
>
> I'm not going to try to golf it right now, but it looks like an inverse
> Burrows-Wheeler Transform.

+5 for Stephen.  =)
Christian N. (Guest)
on 2007-03-19 21:20
(Received via mailing list)
"Matthew M." <removed_email_address@domain.invalid> writes:

> On 3/15/07, Stephen L. <removed_email_address@domain.invalid> wrote:
>> Matthew M. wrote:
>> > +5 pts if you know what algorithm is being used here...
>>
>> I'm not going to try to golf it right now, but it looks like an inverse
>> Burrows-Wheeler Transform.
>
> +5 for Stephen.  =)

Whoa, I just thought of that, without even reading the question...
magic.
Trans (Guest)
on 2007-03-19 21:52
(Received via mailing list)
On Mar 15, 2:33 pm, "Matthew M." <removed_email_address@domain.invalid>
wrote:
> On 3/15/07, Stephen L. <removed_email_address@domain.invalid> wrote:
>
> > Matthew M. wrote:
> > > +5 pts if you know what algorithm is being used here...
>
> > I'm not going to try to golf it right now, but it looks like an inverse
> > Burrows-Wheeler Transform.
>
> +5 for Stephen.  =)

Looks interesting. Care to elaborate on "inverse Burrows-Wheeler
Transform"?


Thanks,
T.
Jason R. (Guest)
on 2007-03-19 22:10
(Received via mailing list)
On 3/19/07, Trans <removed_email_address@domain.invalid> wrote:
> Looks interesting. Care to elaborate on "inverse Burrows-Wheeler
> Transform"?

Google and Wikipedia are your friends:

http://www.google.com/search?q=Burrows-Wheeler

First result:
http://en.wikipedia.org/wiki/Burrows-Wheeler_transform

Jason
David K. (Guest)
on 2007-03-19 22:20
(Received via mailing list)
"Trans" <removed_email_address@domain.invalid> writes:

>> +5 for Stephen.  =)
>
> Looks interesting. Care to elaborate on "inverse Burrows-Wheeler
> Transform"?

The Burrows-Wheeler transform takes some arbitrary string, say,
gigantomania, and sorts all of its rotations: we start with

gigantomania
igantomaniag
gantomaniagi
antomaniagig
ntomaniagiga
tomaniagigan
omaniagigant
maniagiganto
aniagigantom
niagigantoma
iagigantoman
agigantomani

and then sort the lines:

agigantomani
aniagigantom
antomaniagig
gantomaniagi
gigantomania
iagigantoman
igantomaniag
maniagiganto
niagigantoma
ntomaniagiga
omaniagigant
tomaniagigan

Then we take the last column and assign numbers to it:

imgiangoaatn
0123456789AB

Now we sort the letters with the numbers alongside.

aaaggiimnnot
489260315B7A

And now we walk off a linked list starting at 4 (we need to remember
that starting position since all cyclic strings have the same BWT
transform otherwise).

That puts out the original sequence.
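
In current Ruby the steps above come out roughly like this (just a
sketch of the idea, not code from any of the posts; the method names
are mine):

# bwt sorts all rotations and keeps the last column plus the row where
# the original string landed; ibwt pairs each character of that column
# with its position, sorts the pairs, and walks the resulting linked
# list from that row.

def bwt(s)
  rotations = (0...s.size).map { |i| s[i..-1] + s[0...i] }.sort
  [rotations.map { |r| r[-1, 1] }.join, rotations.index(s)]
end

def ibwt(last, start)
  # links[j] = position in the last column of the j-th smallest
  # (character, position) pair -- the "numbers alongside" above
  links = (0...last.size).map { |i| [last[i, 1], i] }.sort.map { |_, i| i }
  pos, out = start, ""
  last.size.times do
    pos = links[pos]
    out << last[pos, 1]
  end
  out
end

col, row = bwt("gigantomania")   # => ["imgiangoaatn", 4]
ibwt(col, row)                   # => "gigantomania"

The golfed snippet at the top of the thread takes the other textbook
route to the inverse: it rebuilds the entire sorted rotation table by
repeatedly prepending the BWT column and re-sorting, and uses the "~"
character as an end-of-string marker instead of carrying the row number
around.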
Trans (Guest)
on 2007-03-20 05:13
(Received via mailing list)
On Mar 19, 4:20 pm, David K. <removed_email_address@domain.invalid> wrote:
>
> aaaggiimnnot
> 489260315B7A
>
> And now we walk off a linked list starting at 4 (we need to remember
> that starting position since all cyclic strings have the same BWT
> transform otherwise).
>
> That puts out the original sequence.

That's nuts.

T.
Clifford H. (Guest)
on 2007-03-20 05:20
(Received via mailing list)
Trans wrote:
> On Mar 19, 4:20 pm, David K. <removed_email_address@domain.invalid> wrote:
>> aaaggiimnnot
>> 489260315B7A
>> And now we walk off a linked list starting at 4 (we need to remember
>> that starting position since all cyclic strings have the same BWT
>> transform otherwise).
>> That puts out the original sequence.
> That's nuts.

Nuts it may be, but it's the reason why "bzip" compresses tighter
than deflate (zip, gzip, etc). It does a BWT first, then compresses
that.

David, I followed your explanation right up to the quoted text about
how to read the original string back out. Can you elaborate a bit
please?

Clifford H..
Trans (Guest)
on 2007-03-20 05:55
(Received via mailing list)
On Mar 19, 11:15 pm, Clifford H. <removed_email_address@domain.invalid> wrote:

> Nuts it may be, but it's the reason why "bzip" compresses tighter
> than deflate (zip, gzip, etc). It does a BWT first, then compresses
> that.

Really? I find that surprising actually. I would not think a sorting
algorithm could lead to any significant compression. It just doesn't
make sense under conservation of information. I assume it's due to
deficiencies in deflate then.

T.
Raj S. (Guest)
on 2007-03-20 07:08
(Received via mailing list)
Clifford H. wrote:
> David, I followed your explanation right up to the quoted text about
> how to read the original string back out. Can you elaborate a bit
> please?
>
> Clifford H..
Just read the Wikipedia article.  It has tables representing the loops
the algorithm goes through.  It's very easy to understand.
Clifford H. (Guest)
on 2007-03-20 08:55
(Received via mailing list)
Trans wrote:
> Really? I find that surprising actually. I would not think a sorting
> algorithm could lead to any significant compression. It just doesn't
> make sense under conservation of information. I assume it's due to
> deficiencies in deflate then.

deflate works by substituting short codes for repeated text, then
Huffman packing the codes and the first instance of each text.
BWT creates more repetitions by grouping similar text, so the
compression has more to chew on.
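
You can see deflate's trick from Ruby's zlib binding (the strings below
are made-up samples, nothing more):

require 'zlib'

repeats = "to be or not to be, that is the question. " * 20
random  = Array.new(repeats.bytesize) { rand(256).chr }.join

Zlib::Deflate.deflate(repeats).bytesize  # tiny: LZ77 codes the repeats,
                                         # Huffman packs what is left
Zlib::Deflate.deflate(random).bytesize   # about as big as the input

A BWT pass adds no information of its own; it just rearranges the text
so that the repeats deflate looks for end up sitting next to each other.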
Rick D. (Guest)
on 2007-03-21 14:48
(Received via mailing list)
On 3/20/07, Clifford H. <removed_email_address@domain.invalid> wrote:
> Trans wrote:
> > Really? I find that surprising actually. I would not think a sorting
> > algorithm could lead to any significant compression. It just doesn't
> > make sense under conservation of information. I assume it's due to
> > deficiencies in deflate then.
>
> deflate works by substituting short codes for repeated text, then
> Huffman packing the codes and the first instance of each text.
> BWT creates more repetitions by grouping similar text, so the
> compression has more to chew on.

And the "gigantomania" example given isn't a particularly good
demonstration of how the transform tends to cluster repetitions.

Quoting from the wikipedia article:

"To understand why this creates more-easily-compressible data, let's
consider transforming a long English text frequently containing the
word "the". Sorting the rotations of this text will often group
rotations starting with "he " together, and the last character of that
rotation (which is also the character before the "he ") will usually
be "t", so the result of the transform would contain a number of "t"
characters along with the perhaps less-common exceptions (such as if
it contains "Brahe ") mixed in. So it can be seen that the success of
this transform depends upon one value having a high probability of
occurring before a sequence, so that in general it needs fairly long
samples (a few kilobytes at least) of appropriate data (such as
text)."

The article uses the word banana, which is a better example.
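
You can watch the clustering happen on a toy sentence (the text below
is made up, not taken from the article):

def bwt_last_column(s)
  (0...s.size).map { |i| s[i..-1] + s[0...i] }.sort.map { |r| r[-1, 1] }.join
end

text = "the cat saw the dog chase the bird up the tree near the house"
# every "he ..." rotation sorts next to the others, and each of them is
# preceded by a "t", so the last column picks up a run of "t"s
bwt_last_column(text).scan(/t+/).max_by(&:size)   # => "ttttt"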
--
Rick DeNatale

My blog on Ruby
http://talklikeaduck.denhaven2.com/
Trans (Guest)
on 2007-03-21 15:11
(Received via mailing list)
On Mar 21, 8:47 am, "Rick DeNatale" <removed_email_address@domain.invalid> 
wrote:
> > BWT creates more repetitions by grouping similar text, so the
> rotations starting with "he " together, and the last character of that
> rotation (which is also the character before the "he ") will usually
> be "t", so the result of the transform would contain a number of "t"
> characters along with the perhaps less-common exceptions (such as if
> it contains "Brahe ") mixed in. So it can be seen that the success of
> this transform depends upon one value having a high probability of
> occurring before a sequence, so that in general it needs fairly long
> samples (a few kilobytes at least) of appropriate data (such as
> text)."
>
> The article uses the word bananna which is a better example.

I think one needs to be careful in understanding this though. The
sorting only provides a benefit in contrast to less efficient
algorithms. No amount of reordering will ever make a difference to the
lower bound of compressibility. That's the crux of my point. And
hence, for a better algorithm, this kind of sorting should be a moot
point. (Of course, one could argue that this sorting Is the better
algorithm -- fair enough.)

T.
Clifford H. (Guest)
on 2007-03-22 12:01
(Received via mailing list)
Trans wrote:
> I think one needs to be careful in understanding this though. The
> sorting only provides a benefit in contrast to less efficient
> algorithms. No amount of reordering will ever make a difference to the
> lower bound of compressibility. That's the crux of my point. And
> hence, for a better algorithm, this kind of sorting should be a moot
> point. (Of course, one could argue that this sorting Is the better
> algorithm -- fair enough.)

Your analysis is well-meaning, but flawed. Compression relies on
prediction. If you know I'm going to send you a message telling
you how many bananas to buy, and you already know that's the subject
of the message, then I never need to say "bananas", I can just say
"12".

If you know you're receiving a message that contains large amounts
of English text, or C language text, or i386 binary opcodes, or you
can recognise and adapt when you find that you are, you'll be able
to compress better than any algorithm that only has a single trick,
repetition, as deflate does. Otherwise we'd all be using deflate for
downloading pirated video streams and no-one would need H.264 :-)

BWT improves deflate's predictor for the kinds of text we tend to
use. Period. Nothing really interesting there, except how it does it.

Clifford H..
Trans (Guest)
on 2007-03-22 16:19
(Received via mailing list)
On Mar 22, 6:00 am, Clifford H. <removed_email_address@domain.invalid> wrote:
> prediction. If you know I'm going to send you a message telling
> you how many bananas to buy, and you already know that's the subject
> of the message, then I never need to say "bananas", I can just say
> "12".

That only works for limited data sets. The larger the set, the less
helpful it is. And for general compression it doesn't help at all.
General compression is all about removing repetition. Try assigning a
number to every word in the dictionary and "compress" a piece of text
with it. It won't help.

> If you know you're receiving a message that contains large amounts
> of English text, or C language text, or i386 binary opcodes, or you
> can recognise and adapt when you find that you are, you'll be able
> to compress better than any algorithm that only has a single trick,
> repetition, as deflate does. Otherwise we'd all be using deflate for
> downloading pirated video streams and no-one would need H.264 :-)

Those things are not limited data sets. They have some limiting rules
(if the files are valid) but those are pretty much useless to boost
compression.

> BWT improves deflate's predictor for the kinds of text we tend to
> use. Period. Nothing really interesting there, except how it does it.

"How it does it" is the interesting part. How does BWT improve
deflate? Ie. What's that tell us about deflate? What does it tell us
about compression in general?

T.
Clifford H. (Guest)
on 2007-03-23 05:45
(Received via mailing list)
Trans wrote:
> On Mar 22, 6:00 am, Clifford H. <removed_email_address@domain.invalid> wrote:
>> Your analysis is well-meaning, but flawed.
> General compression is all about removing repetition.

This is a very concise statement about what's flawed in your analysis.
It's not just about repetition, it's about removing *whatever* is
predictable. All's fair when it comes to improving predictability.
If I cared enough to write a compressor that could recognise
Somerset Maugham's writing style and vocabulary, it could adapt
its general compression algorithm to that style and compress it
better. Sure the decompressor has to know his style as well, but
that's allowed - it can be a general compressor with specific modes
(like video, i386 opcodes, etc, as I mentioned).

> Try assigning a
> number to every word in the dictionary and "compress" a piece of text
> with it. It won't help.

Funny you should say that because I did that once, building a dictionary
of the King James Bible (10.5Kwords) and Huffman coding the words. It
worked very well, thanks very much. It was better than deflate for the
purpose because you can start decompressing anywhere (though that's not
relevant to overall compression goodness).
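
A word-level Huffman coder of that sort fits in a screenful of Ruby
(this is only a sketch of the idea, certainly not the 1983 code, and
the corpus is a stand-in):

def huffman_codes(freq)
  # one leaf per word; repeatedly merge the two lightest nodes
  nodes = freq.map { |word, n| [n, word] }
  until nodes.size == 1
    nodes.sort_by! { |weight, _| weight }
    (w1, a), (w2, b) = nodes.shift(2)
    nodes << [w1 + w2, [a, b]]
  end
  codes = {}
  walk = lambda do |tree, prefix|
    if tree.is_a?(Array)                  # internal node: [left, right]
      walk.call(tree[0], prefix + "0")
      walk.call(tree[1], prefix + "1")
    else                                  # leaf: the word itself
      codes[tree] = prefix
    end
  end
  walk.call(nodes[0][1], "")
  codes
end

words = "in the beginning was the word and the word was with god".split
freq  = Hash.new(0)
words.each { |w| freq[w] += 1 }

codes   = huffman_codes(freq)             # frequent words get short codes
encoded = words.map { |w| codes[w] }.join # the text as one bit string

Decoding only needs the same dictionary, and because the code is
prefix-free you can pick up decoding at any word boundary an index
points to, which is presumably what made the random access work.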

> "How it does it" is the interesting part. How does BWT improve
> deflate? Ie. What's that tell us about deflate?

It tells us that there's more to compressing typical data sets
than simply identifying repetition - there is data that encodes
higher order patterns that are not directly visible as repetition,
but are visible to a better predictor.
Trans (Guest)
on 2007-03-23 10:21
(Received via mailing list)
On Mar 22, 11:45 pm, Clifford H. <removed_email_address@domain.invalid> wrote:
> Trans wrote:
> > On Mar 22, 6:00 am, Clifford H. <removed_email_address@domain.invalid> wrote:
> >> Your analysis is well-meaning, but flawed.
> > General compression is all about removing repetition.
>
> This is a very concise statement about what's flawed in your analysis.
> It's not just about repetition, it's about removing *whatever* is
> predictable. All's fair when it comes to improving predictability.

Of course. When I used the term "repetition" I didn't mean to be so
narrow as to suggest nothing more than "a a a a a". A better term
would have been "patterns".

>
> Funny you should say that because I did that once, building a dictionary
> of the King James Bible (10.5Kwords) and Huffman coding the words. It
> worked very well, thanks very much. It was better than deflate for the
> purpose because you can start decompressing anywhere (though that's not
> relevant to overall compression goodness).

Right. Size is of first relevance. So was your file smaller than
deflate's or bzip2's?

> > "How it does it" is the interesting part. How does BWT improve
> > deflate? Ie. What's that tell us about deflate?
>
> It tells us that there's more to compressing typical data sets
> than simply identifying repetition - there is data that encodes
> higher order patterns that are not directly visible as repetition,
> but are visible to a better predictor.

That's a little belittling, don't you think? I mean, you've taken one
connotation of one word I used and seemingly made it the centerpiece
of all my thoughts on the matter, while missing the larger picture of
my point. I'm not saying your analysis is completely wrong. I'm just
saying I think you are missing a subtle point about what makes
compression work. Predictability is not the whole centerpiece of
compression either. If it were, you couldn't get much better a
compression of King James than a file with this in it:

  "http://patriot.net/~bmcgin/kjv12.txt"

Pretty predictable. That may seem silly, but there's a point. This is
not general compression. It doesn't really matter that the process of
decompression involves a remote machine -- locality is relative. But
even better, if I sent you a new algorithm for compressing King James,
and along with it the file I sent you to uncompress had only one bit
of data in it, "1", then how can it work? Of course, the algorithm
itself had the entire content of King James within it. Pretty
predictable again. But not general compression. When looking at
general compression it's important to take into account the size of all
the elements involved as well, including the compression algorithm itself.

I find this a very fascinating subject. I think ultimately it will
turn out to be very important, not just for compressing data, but for
understanding physics too. I've done some deep meditation on the
topic. It is interesting to consider that all conceivable data can be
found somewhere in any transcendental number. You might think, that
being the case, an excellent compression algorithm would be to find
the starting digit, say in Pi, for the data one is seeking. But
curiously it would do you no good. On average the number for the
starting digit would be just as long as the data itself.

But I digress, if you think you are so sure about what you are saying,
and that I am completely off-base, I think you should try to solve
this puzzle:

  http://prize.hutter1.net/

I believe what they are trying to do is essentially impossible. At
best they can shave a few bytes off here and there by removing some
"higher-order" patterns, as you say. Their premise is flawed.
Knowledge is no more compressible b/c it's knowledge, than anything
else. They should realize either the human brain is just that vast or
that human knowledge uses lossy compression. They'll gain nothing of
great significance until they do.

T.
Mauricio F. (Guest)
on 2007-03-23 12:01
(Received via mailing list)
On Fri, Mar 23, 2007 at 05:20:36PM +0900, Trans wrote:
> On Mar 22, 11:45 pm, Clifford H. <removed_email_address@domain.invalid> wrote:
> > This is a very concise statement about what's flawed in your analysis.
> > It's not just about repetition, it's about removing *whatever* is
> > predictable. All's fair when it comes to improving predictability.
>
> Of course. When I used the term "repetition" I didn't mean to be so
> narrow as to suggest nothing more than "a a a a a". A better term
> would have been "patterns".

(redundancy)

> and along with it the file I sent you to uncompress had only one bit
> of data in it, "1", then how can it work? Of course, the algorithm
> itself had the entire content of King James within it. Pretty
> predictable again. But not general compression. When looking at
> general compression it's important to take into account the size of all
> the elements involved as well, including the compression algorithm itself.

Given that Kolmogorov-Chaitin complexity is not computable, what's wrong
with finding practical algorithms that approximate it?

I don't see the intention behind your remarks: surely, any practical
compression method we come up with is going to perform worse than the
optimum one (whose performance we can only estimate... using inferior
compression methods!), so what? The existence of an uncomputable lower
bound on the length of the description of a message is no more of a
deterrent to research on better (practical) compression methods than
e.g. Shannon's theorem is to channel coding.

Regarding your remark on the choice of the universal machine and some
choices not being "general compression", you could add that even if
descriptive complexity relative to two different universal machines
differs at most by a constant term, the latter does matter when we're
approximating the algorithmic complexity in practical scenarios...
Clifford H. (Guest)
on 2007-03-23 12:16
(Received via mailing list)
Trans wrote:
> Of course. When I used the term "repetition" I didn't mean to be so
> narrow as to suggest nothing more than "a a a a a". A better term
> would have been "patterns".

Ok, sorry, that wasn't clear. I agree with you.

> Right. Size is of first relevance. So was your file smaller than
> deflate's or bzip2's?

I don't think either existed in 1983 :-). But I expect that
the answer is "no". In any case, neither would suit, as the
compression existed to serve the purpose of full-text search
that had a separate compressed index for every word, and to
use the index, it was necessary to be able efficiently to
open the file at any word. Nowadays I'd open the file at any
disk block boundary, or bigger, since the CPU cost is
negligible next to the I/O cost. You'd probably use some
form of arithmetic coding these days anyhow.

>>> "How it does it" is the interesting part. How does BWT improve
>>> deflate? Ie. What's that tell us about deflate?
>> It tells us that there's more to compressing typical data sets
> That's a little belittling, don't you think? I mean, you've taken one
> connotation of one word I used and seemingly made it the centerpiece

Ok, I'm sorry, I thought you meant literal repetition, I didn't
mean to unfairly latch on to that, I just needed to clarify that
Shannon's theorem doesn't imply such repetition.

> Predictability is not the whole center piece of compression either.

Hmm. I think perhaps we still disagree then... at least about the
definition of predictability...

> Pretty predictable. That may seem silly, but there's a point. This is
> not general compression. It doesn't really matter that the process of
> decompression involves a remote machine -- locality is relative.

Information is relative. Our language is the prime example; if my
words don't evoke symbols in your thoughts that map to a similar
symbolic morphology in you, then we can't communicate. So English
isn't just a set of words, it's a symbolic net whose parts connect
our shared experiences.

In other words, you, as the "decompressor" of my sequence of words,
must have related context, experience, that forms a morphologically
similar mesh, the nodes of which are associated with words that we
share.

Taking that as my context, I maintain that your "general" data
stream only has theoretical existence. All meaningful data streams
have structure. There are standard forms of mathematical pattern
search like finding repetition, using Fourier analysis, even
fractal analysis, but these cannot hope to find all possible
patterns - they can only find the kinds of structure they look
for. At some point, those structure types that are coded into the
de/compressors are encodings of human interpretations, not intrinsic
realities. The encoding of the human interpretation is "fair", even
the extreme one in your URL example.

> I find this a very fascinating subject. I think ultimately it will
> turn out to be very important, not just for compressing data, but for
> understanding physics too.

I agree, though I hadn't thought of the physics linkup. I think
that the very structure and nature of knowledge itself is hidden
here somewhere. So's most of our non-conscious learning too, like
learning to walk... though the compression of patterns of movement
in our cerebellum requires an approach to using the *time* domain
in a way I haven't seen a machine implement.

> It is interesting to consider that all conceivable data can be
> found somewhere in any transcendental number.

Since a transcendental number is just an infinite stream of digits,
there exist an infinite number of them that can encode any given
infinite stream of information. I don't know how deep that insight
really is though.

> You might think, that
> being the case, an excellent compression algorithm would be to find
> the starting digit, say in Pi, for the data one is seeking. But
> curiously it would do you no good. On average the number for the
> starting digit would be just as long as the data itself.

:-) Good one! I'm sure there's some deep theoretical studies behind
that.

> But I digress, if you think you are so sure about what you are saying,
> and that I am completely off-base,

Now that you've explained yourself more clearly, I don't
think you're off-base at all.

> I believe what they are trying to do is essentially impossible.

I don't know. I wish I knew where I read that the human brain can
"learn" about the contents of 2 CDs - 1200MB, so even Einstein's
knowledge could be encoded into that, if we knew how. The sum of
all human knowledge would doubtless be bigger than that, but I
found it an interesting thought. Some recent neurological research
I read indicated that, contrary to previous thought, our brains
actually do have a limited capacity. The previous assumption of
almost unlimited was based on thoughts about the likely failure
modes when "full" - and those were based on incorrect theories
about how memory works. I could perhaps dig up those papers
somewhere.

> They should realize either the human brain is just that vast or
> that human knowledge uses lossy compression.

I don't agree at all. There surely is a limit to the achievable
compression, but we're nowhere near it, and the search will tell
us much about the nature of knowledge.

Anyhow, interesting stuff, but not about Ruby... continue offline
if you wish...

Clifford H..
Trans (Guest)
on 2007-03-23 12:40
(Received via mailing list)
On Mar 23, 5:59 am, Mauricio F. <removed_email_address@domain.invalid> wrote:
> Given that Kolmogorov-Chaitin complexity is not computable, what's wrong
> with finding practical algorithms that approximate it?
>
> I don't see the intention behind your remarks: surely, any practical
> compression method we come up with is going to perform worse than the optimum
> one (whose performance we can only estimate... using inferior compression
> methods!), so what? The existence of an uncomputable lower bound on the length
> of the description of a message is no more of a deterrent to research on
> better (practical) compression methods than e.g. Shannon's theorem is to
> channel coding.

When did I say otherwise? I believe my original point was simply one
of surprise that BWT improved deflate, which I believe says more about
deflate's deficiencies than BWT's use as a means of compression. I've
only been trying to explain my reasoning for that thought ever since.

> Regarding your remark on the choice of the universal machine and some choices
> not being "general compression", you could add that even if descriptive
> complexity relative to two different universal machines differs at most by a
> constant term, the latter does matter when we're approximating the algorithmic
> complexity in practical scenarios...

I'm sorry. Would you care to put that in layman's terms? I did not
have the good fortune of an academic education in the field (oh, if
only!), so I cannot readily address your statement (assuming of course
you are using academic terms in good faith, and not just being
intentionally obscure).

T.
Mauricio F. (Guest)
on 2007-03-23 14:18
(Received via mailing list)
On Fri, Mar 23, 2007 at 07:38:28PM +0900, Trans wrote:
> of surprise that BWT improved deflate, which I believe says more about
> deflate's deficiencies than BWT's use as a means of compression. I've
> only been trying to explain my reasoning for that thought ever since.

Ah I see, you were surprised at deflate being so bad... I was surprised
at your surprise :)

> intentionally obscure).
Sorry, I assumed you were familiar with Kolmogorov complexity and tried
to remove some redundancy [1] from my message based on that premise ;-)
(you said you had been thinking deeply about this so it wasn't too
unreasonable :) [although this is all redundant ultimately and could be
replaced by a pointer into e.g. Cover&Thomas' Elements of information
theory; BTW IIRC there's a new edition in the works?]

What you hinted at in your prev. message is known as the
Kolmogorov(-Chaitin)/descriptive complexity, corresponding to the length
of the minimal description of a string using a program in a given Turing
machine/programming language. It is therefore relative to the universal
machine/language you choose, but if you pick two different ones the
corresponding complexities for a given string can differ at most by a
constant amount, the length of the string that emulates one machine on
the other.

In theory, you often ignore such constant factors (just choose a string
long enough :), but in the example you gave, the machine included a full
copy of the KJV bible, allowing it to replicate it with a single
instruction... and as you said the description of the machine itself
made all the difference and it only worked for that particular string,
making it of little interest. There's a sort of trade-off between the
complexity of the machine and that of the programs you feed into it and
both parts matter in practice: a machine able to compress any English
text to 1/1000th of its original size wouldn't help residential users if
it included/required a (mildly compressed?) copy of Google's archive.

I probably wasn't 100% intellectually honest in my prev. message, I
apologize for that. I wasn't as much giving new information as trying
to drag this thread into a more information-theoretic OT, hoping some
terminology would trigger interesting replies.

[1] I'm sensitive to redundancy (as an EE, I've also learned a bit about
it...;). This is why I rarely post to ruby-talk anymore; there's very
little surprise in the questions/issues being discussed, and the
responses to new-but-actually-old questions I could give would hardly
convey new information. If everybody killed his messages when they don't
pass the "entropy criterion" the way I do, this ML would have a small
fraction of its current traffic; I cannot ask people to only post things
of interest to me (why would anybody consent to that, and how would they
know anyway), but I can try to only post messages that would have
interested me. For instance, I was going to reply "inverse BWT!" to the
OP when nobody had responded yet, but I anticipated somebody would do
eventually, so I killed my msg. I don't respond to questions which are
likely to be answered by somebody else (you can call it proactive global
redundancy control by local self-censoring ;).

A nice consequence of this is that it also filters out most unreasonable
content (would I be interested in some ad hominem argumentation? No, so
I should not post any either. The temptation to do so is strong, as some
people, including well-known/knowledgeable posters, use emotionally
loaded language and sometimes defend their positions vehemently.)

sorry again
Chad P. (Guest)
on 2007-03-23 16:41
(Received via mailing list)
On Fri, Mar 23, 2007 at 07:15:05PM +0900, Clifford H. wrote:
>
> Anyhow, interesting stuff, but not about Ruby... continue offline
> if you wish...

Bummer.  I was having fun spectating.
Trans (Guest)
on 2007-03-26 04:19
(Received via mailing list)
On Mar 23, 6:15 am, Clifford H. <removed_email_address@domain.invalid> wrote:

> I don't think either existed in 1983 :-). But I expect that
> the answer is "no". In any case, neither would suit, as the
> compression existed to serve the purpose of full-text search
> that had a separate compressed index for every word, and to
> use the index, it was necessary to be able efficiently to
> open the file at any word. Nowadays I'd open the file at any
> disk block boundary, or bigger, since the CPU cost is
> negligible next to the I/O cost. You'd probably use some
> form of arithmetic coding these days anyhow.

Wow. You did that back in 83? I was just learning to code in C64
BASIC :)

> >>> "How it does it" is the interesting part. How does BWT improve
> >>> deflate? Ie. What's that tell us about deflate?
> >> It tells us that there's more to compressing typical data sets
> > That's a little belittling, don't you think? I mean, you've taken one
> > connotation of one word I used and seemingly made it the centerpiece
>
> Ok, I'm sorry, I thought you meant literal repetition, I didn't
> mean to unfairly latch on to that, I just needed to clarify that
> Shannon's theorem doesn't imply such repetition.

Thanks. I understand your take too. I should have been a little more
thoughtful in my wording.

> words don't evoke symbols in your thoughts that map to a similar
> stream only has theoretical existence. All meaningful data streams
> have structure. There are standard forms of mathematical pattern
> search like finding repetition, using Fourier analysis, even
> fractal analysis, but these cannot hope to find all possible
> patterns - they can only find the kinds of structure they look
> for. At some point, those structure types that are coded into the
> de/compressors are encodings of human interpretations, not intrinsic
> realities. The encoding of the human interpretation is "fair", even
> the extreme one in your URL example.

Fair enough. I guess I'm limiting my definition of predictability too
much. My point was really just that we have to take into account the
means of prediction along with the compressed data to measure
effectiveness.

> > I find this a very fascinating subject. I think ultimately it will
> > turn out to be very important, not just for compressing data, but for
> > understanding physics too.
>
> I agree, though I hadn't thought of the physics linkup. I think
> that the very structure and nature of knowledge itself is hidden
> here somewhere. So's most of our non-conscious learning too, like
> learning to walk... though the compression of patterns of movement
> in our cerebellum requires an approach to using the *time* domain
> in a way I haven't seen a machine implement.

Hmm... interesting point. Perhaps OT, but I recall reading some time
ago about the discovery that our thoughts are not encoded by brain
signals in themselves, but in the timing of those signals. That really
threw me for a loop.

> > It is interesting to consider that all conceivable data can be
> > found somewhere in any transcendental number.
>
> Since a transcendental number is just an infinite stream of digits,
> there exist an infinite number of them that can encode any given
> infinite stream of information. I don't know how deep that insight
> really is though.

I can see that they can encode any "finite" stream. I don't know if
they can encode any infinite stream though. In any case, I agree,
whether that's right or not, it's probably a minor point.

> > You might think, that
> > being the case, an excellent compression algorithm would be to find
> > the starting digit, say in Pi, for the data one is seeking. But
> > curiously it would do you no good. On average the number for the
> > starting digit would be just as long as the data itself.
>
> :-) Good one! I'm sure there's some deep theoretical studies behind
> that.

Actually it just occurs to me that it is probably pretty simple to prove,
if you assume that the sequence of digits of a transcendental is
effectively random (which I think is a safe bet). Don't know why that
never occurred to me before.

> knowledge could be encoded into that, if we knew how. The sum of
> all human knowledge would doubtless be bigger than that, but I
> found it an interesting thought. Some recent neurological research
> I read indicated that, contrary to previous thought, our brains
> actually do have a limited capacity. The previous assumption of
> almost unlimited was based on thoughts about the likely failure
> modes when "full" - and those were based on incorrect theories
> about how memory works. I could perhaps dig up those papers
> somewhere.

That's all? Wow. I would have expected at least 100 times that. Of
course, that doesn't mean it's readily accessible to our conscious
mind. Then again, it's hard to judge such large sizes, so maybe 2 CDs
is about right.

> > They should realize either the human brain is just that vast or
> > that human knowledge uses lossy compression.
>
> I don't agree at all. There surely is a limit to the achievable
> compression, but we're nowhere near it, and the search will tell
> us much about the nature of knowledge.

Maybe, but I think the idea that human knowledge is more (lossless)
compressible just b/c it is human knowledge is flawed. I'm surprised
to hear we are nowhere near the limit though. Not sure why, but I
thought we were fairly close, say within 10-20% or so.

> Anyhow, interesting stuff, but not about Ruby... continue offline
> if you wish...

Cool. Thanks for the discussion. Unfortunately I'm pretty busy and
need to focus on work, but maybe down the road...

T.
Trans (Guest)
on 2007-03-29 04:46
(Received via mailing list)
Sorry about the delayed response. I just have too much on my mind...

On Mar 23, 8:18 am, Mauricio F. <removed_email_address@domain.invalid> wrote:

> Ah I see, you were surprised at deflate being so bad... I was surprised at
> your surprise :)

Truly surprised. Makes me wonder why other formats like 7z aren't more
widely used. Is decompression speed really that much more important
than size?

> > intentionally obscure).
>
> Sorry, I assumed you were familiar with Kolmogorov complexity and tried to
> remove some redundancy [1] from my message based on that premise ;-) (you said
> you had been thinking deeply about this so it wasn't too unreasonable :)
> [although this is all redundant ultimately and could be replaced by a pointer
> into e.g. Cover&Thomas' Elements of information theory; BTW IIRC there's a new
> edition in the works?]

I will have to look for that book. Thanks.

> the KJV bible, allowing it to replicate it with a single instruction... and as
> you said the description of the machine itself made all the difference and it
> only worked for that particular string, making it of little interest. There's
> a sort of trade-off between the complexity of the machine and that of the
> programs you feed into it and both parts matter in practice: a machine able to
> compress any English text to 1/1000th of its original size wouldn't help
> residential users if it included/required a (mildly compressed?) copy of
> Google's archive.

Thanks. Makes perfect sense and now I know what to call it. Which is
always good. :)

> I probably wasn't 100% intellectually honest in my prev. message, I apologize
> for that. I wasn't as much giving new information as trying to drag this
> thread into a more information-theoretic OT, hoping some terminology would
> trigger interesting replies.

Sounds like you know more about it than anyone else here. So I'm not
sure if I can hold your interest at all. But I might entertain you a
bit with a related story.

When I first learned about the concept of compression, I did some
experimentation. Now, at this point, I had no idea how it was done, or
anything about Huffman coding, or anything like that. But I generally
learn best by experimenting, and that usually means thought
experiments, which, as an aside, also accounts a bit for my lack of
"book smarts". But in any case, I thought about how one might compress
data by algebraic means. I imagined the data as one giant number, and
then considered formulas to rapidly approach that number. Exponents of
course grew rapidly, but I figured I needed to grow even faster, so I
"invented" the operation subsequent to "power". I called it, for no
apparent reason: "milk". So

  3 milk 3 = 3 ** (3 ** 3)

And I then postulated the next operation as well, which I called
"cheese".

  3 cheese 3 = 3 milk (3 milk 3)

Needless to say, 3 cheese 3 is a huge number. I took the final step and
abstracted the set of all such operations starting with addition (O0),
to multiplication (O1), to power (O2), milk (O3), etc. And so I
figured any large number could be represented as a vector Xi, such
that:

  Number = Sum(i..0) Oi(Xi+1, Xi)

At that point the problem became trying to solve this equation for a
given number. I must have filled up every chalkboard at my school
working on that. My fellow students thought I had lost my mind! And I
did too, as I found myself trying to perform integration on this beast!
That in fact turned out to be impossible.
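
For what it's worth, the ladder of operations itself is easy to sketch
in Ruby. The level numbering below follows the story (O0 is addition),
so level 3 is "milk", which is usually called tetration; this is only
an illustration, not anything from the original chalkboards:

# each level folds the one below it: a On b = a O(n-1) (a On (b - 1))
def hyper(level, a, b)
  case level
  when 0 then a + b
  when 1 then a * b
  when 2 then a ** b
  else b.zero? ? 1 : hyper(level - 1, a, hyper(level, a, b - 1))
  end
end

hyper(3, 3, 3)    # => 7625597484987, i.e. 3 ** (3 ** 3) -- "3 milk 3"
hyper(3, 2, 4)    # => 65536, a power tower of four 2s
# hyper(4, 3, 3)  # "3 cheese 3": a tower of 7625597484987 threes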

Pretty nuts, eh? Of course, I later realized, after solving the
equation manually a number of times, the solutions would be no more
compressed than the original figure. My original idea of "rapid
growth" was flawed. And so it was that I got my first hint of what I
now learn is called "Kolmogorov(-Chaitin)/descriptive complexity".


> anticipated somebody would do eventually, so I killed my msg. I don't respond
> to questions which are likely to be answered by somebody else (you can call it
> proactive global redundancy control by local self-censoring ;).
>
> A nice consequence of this is that it also filters out most unreasonable
> content (would I be interested in some ad hominem argumentation? No, so I
> should not post any either. The temptation to do so is strong, as some people,
> including well-known/knowledgeable posters, use emotionally loaded language
> and sometimes defend their positions vehemently.)
>
> sorry again

No need for apologies. No doubt you know that I am far from exemplary
in these affairs, but I believe (or at least hope) that I have gotten
better with time. But with you, well, it almost sounds like you are
playing a game of Prisoner's Dilemma with your posts. And in that
case, I wonder who's winning? ;)


Thanks for the details, btw!
T.
Phillip G. (Guest)
on 2007-03-29 04:56
(Received via mailing list)
Trans wrote:
> Sorry about the delayed response. I just have too much on my mind...
>
> On Mar 23, 8:18 am, Mauricio F. <removed_email_address@domain.invalid> wrote:
>
>> Ah I see, you were surprised at deflate being so bad... I was surprised at
>> your surprise :)
>
> Truly surprised. Makes me wonder why other formats like 7z aren't more
> widely used. Is decompression speed really that much more important
> than size?

Taking at todays cheap processing time and even cheaper mass storage,
I'd say, all else being equal, size isn't that important to the end
user.

in other environments (transferring large files across a thin pipe),
size is more important than speed.

It's good to have choice in different scenarios.

--
Phillip "CynicalRyan" Gawlowski
http://cynicalryan.110mb.com/

Rule of Open-Source Programming #11:

When a developer says he will work on something, he or she means
"maybe".
Phillip G. (Guest)
on 2007-03-29 04:59
(Received via mailing list)
Phillip G. wrote:

> Taking at todays cheap processing time and even cheaper mass storage,
> I'd say, all else being equal, size isn't that important to the end user.

That sentence should read "Taking a look at..."

D'oh.


--
Phillip "CynicalRyan" Gawlowski
http://cynicalryan.110mb.com/

Rule of Open-Source Programming #5:

A project is never finished.
Chad P. (Guest)
on 2007-03-29 18:28
(Received via mailing list)
On Thu, Mar 29, 2007 at 09:56:17AM +0900, Phillip G. wrote:
> >than size?
>
> Taking at todays cheap processing time and even cheaper mass storage,
> I'd say, all else being equal, size isn't that important to the end user.
>
> in other environments (transferring large files across a thin pipe),
> size is more important than speed.

Actually, I suspect that there are a couple reasons for the greater
popularity of some compression schemes, despite their poorer compression
performance:

1. When you wish to use compression "on the fly", as it were, speed is
*very* important.  You want opening, closing, and otherwise using files
that are kept in a compressed format, to be about as quick and
responsive as using uncompressed files.  This is probably not anywhere
near as big a reason as the other one, though. . . .

2. Certain compression programs are very well known, and "everyone" has
them (for some definition of "everyone", depending on OS platform, et
cetera).  Thus, "everyone" uses them.  Short of producing a hugely
popular program that handles both old and new compression algorithms (or
both old and new file formats, in other examples of this phenomenon in
action), adoption of something new is going to be very slow and prone to
failure despite any technical advantages to the new algorithm/format.
This is illustrated by the commercial end-user market failure of
the Betamax -- VHS won that little skirmish simply
because it was more widely available, quickly became a household word,
and prevented migration to Betamax simply by way of market inertia.
Alex Y. (Guest)
on 2007-03-29 18:37
(Received via mailing list)
Chad P. wrote:
> and prevented migration to Betamax simply by way of market inertia.
This needn't be the end of the story.  For example, WinZip 5 (if I
remember correctly) didn't handle tar.gz files.  Modern WinZip does.  I
think it also handles tar.bz2 files.  I wouldn't be at all surprised if
the next version handled .7zs.
Kristoffer L. (Guest)
on 2007-03-30 06:48
(Received via mailing list)
On 3/29/07, Chad P. <removed_email_address@domain.invalid> wrote:
> and prevented migration to Betamax simply by way of market inertia.
>

Or just witness the huge popularity out there for the RAR format - a
format that is popular only because it once got popular - and the only
reason for that was that it made it easy to split archives into many
small parts. This was important (especially for the pirate market, of
course) in the days when a huge download could be corrupted. RAR
allowed people to redownload only one small part when that happened.

Today, with torrents and hashes in all major protocols, this is
essentially a moot issue, and the slightly better compression than
some other formats is also negligible as in no user would actually
notice any difference except for byte count in the file manager...
Today it's just one big bad hassle, because people use it a LOT and
this means installing an extra program on every computer you use if
you want to participate - a non-free program at that, for those of us
who care.

Using RAR amounts to Cargo Cult Compression in my opinion, as it's
only invoking the old ritual of times past because "we have always
done it this way" with no understanding on why and no actual benefits.

Not that it's really the best solution for many reasons, but if you
want to be really practical, people should use zip for all their
compression needs - because that's the only format all modern OS:es
can open with no extra software, at least that I know of.

Just goes to show...

-- Kristoffer
Chad P. (Guest)
on 2007-03-30 07:44
(Received via mailing list)
On Fri, Mar 30, 2007 at 11:47:33AM +0900, Kristoffer Lundén wrote:
> some other formats is also negligible as in no user would actually
> Not that it's really the best solution for many reasons, but if you
> want to be really practical, people should use zip for all their
> compression needs - because that's the only format all modern OS:es
> can open with no extra software, at least that I know of.

In general, I agree with your assessment of the situation, but I'm not
entirely sure about one thing you said.

Is zip format really available with all OSes without extra software?
I'm pretty sure MS Windows requires one to install additional software
still, and I don't seem to recall zip being standard on all major Linux
distributions.  I haven't bothered to check whether it's available on
FreeBSD, mostly just because I haven't needed an unzipper in quite a
while.
Chris S. (Guest)
on 2007-03-30 07:55
(Received via mailing list)
On Mar 29, 9:44 pm, Chad P. <removed_email_address@domain.invalid> wrote:
> > essentially a moot issue, and the slightly better compression than
>
> still, and I don't seem to recall zip being standard on all major Linux
> distributions.  I haven't bothered to check whether it's available on
> FreeBSD, mostly just because I haven't needed an unzipper in quite a
> while.
>
> --
> CCD CopyWrite Chad P. [http://ccd.apotheon.org]
> Leon Festinger: "A man with a conviction is a hard man to change. Tell
> him you disagree and he turns away. Show him facts and figures and he
> questions your sources. Appeal to logic and he fails to see your point."

Zip has been handled by the base Windows install since the first
release of XP.  In fact, I think since about that same time, I've
never had the need to install anything for handling zip files across
Windows, Linux, and Mac.

Chris
Kristoffer L. (Guest)
on 2007-03-30 08:51
(Received via mailing list)
On 3/30/07, Chad P. <removed_email_address@domain.invalid> wrote:
> Is zip format really available with all OSes without extra software?
> I'm pretty sure MS Windows requires one to install additional software
> still, and I don't seem to recall zip being standard on all major Linux
> distributions.  I haven't bothered to check whether it's available on
> FreeBSD, mostly just because I haven't needed an unzipper in quite a
> while.
>

It works on XP, or at least it did a few years ago when I used it the
last time, and as far as I've seen it works OOTB on any Linux/BSD/*nix
distro that ships GNOME or KDE. To be honest, I don't know about OS/X,
but I just assumed they handle this too.

From this point on, it's all arguing what a "major" OS or distribution
is. :) I'd personally draw the line at Windows, Mac, and the 3-6
biggest Linuxes, but I know a lot of people would include a lot more,
just as a lot of people would only include Windows.

I do think that I'm mostly right, though.

-- Kristoffer
Chad P. (Guest)
on 2007-03-30 09:19
(Received via mailing list)
On Fri, Mar 30, 2007 at 01:51:27PM +0900, Kristoffer Lundén wrote:
> last time, and as far as I've seen it works OOTB on any Linux/BSD/*nix
> distro that ships GNOME or KDE. To be honest, I don't know about OS/X,
> but I just assumed they handle this
>
> From this point on, it's all arguing what a "major" OS or distribution
> is. :) I'd personally draw the line at Windows, Mac, and the 3-6
> biggest Linuxes, but I know a lot of people would include a lot more,
> just as a lot of people would only include Windows.
>
> I do think that I'm mostly right, though.

With the above in mind, yeah, you're probably "mostly right" at least.
I just got so used to installing a zip/unzip utility on MS Windows that
I always installed it as one of the first things I did with my own
systems whenever I had need of XP (increasingly rare now), and since I
tend to do minimal installs with Debian or FreeBSD (for instance), no
KDE, GNOME, or zip utilities need apply in my "default" installs.  I
guess the culprit in this case is the simple fact that I'm pretty
atypical, as computer users go.
Trans (Guest)
on 2007-03-30 16:40
(Received via mailing list)
On Mar 29, 10:47 pm, "Kristoffer Lundén" <removed_email_address@domain.invalid>
wrote:

> Not that it's really the best solution for many reasons, but if you
> want to be really practical, people should use zip for all their
> compression needs - because that's the only format all modern OS:es
> can open with no extra software, at least that I know of.

Yet, this is practical only in the short-run, and because too many
people follow this kind of tenet, the long-run never gets here. In
other words, by not collectively working together to adopt better
technologies we are ultimately being less practical.

T.
Kristoffer L. (Guest)
on 2007-03-30 17:36
(Received via mailing list)
On 3/30/07, Trans <removed_email_address@domain.invalid> wrote:
> Yet, this is practical only in the short-run, and because too many
> people follow this kind of tenet, the long-run never gets here. In
> other words, by not collectively working together to adopt better
> technologies we are ultimately being less practical.
>

Like I said, not ideal. And I have nothing against going forward, and
would applaud any *open standard* (free to implement by anyone)
becoming widespread across all platforms.

Meanwhile, however, compression becomes steadily less and less of any
real value for the average desktop user, for almost any scenario -
today, archiving is mainly done to group files, and any compression is
mostly a nice side effect. A certain group of people who believe in
doing things as efficiently and correctly as possible (they call us
nerds ;-)) will not agree, but it's true nonetheless and easy to
observe in almost any computer user - it's when a directory, or even
more than 2-3 files, is to be mailed that the archivers are used,
almost never because of size issues. Talking general computer use
here.

Of course, if I release some code only meant to be used on Linux, I
usually tar.gz or .bz2 it, and Mac users probably still .sit things in
general?

The point is, by defaulting to zip, you may hold back the revolution a
bit, but it also ensures that your mom, your boss, your customers, and
just about everyone except Perrin ;-) can open it. This is convenient
and pragmatic, but it's also important: there's a lot of busy people
out there who will not even bother to reply with a complaint if they
can't open the files directly; they will just junk the message and
move on. A *lot* of people out there aren't interested in installing
anything, no matter how good it is. A lot of these people may pay your
salary or sponsor your next project.

I have about 40 students I work with, computer literate and aged 18-30,
and I am writing this because this had shown itself to be a real
problem before I got them all to only send stuff in formats as
standard as possible - this includes not sending MS files unless
requested. Students have, in a very real sense, lost internship
opportunities, and maybe even job offers (who knows?) because of this,
and only a long time afterwards did it come out that it was because it
was a RAR archive or, funnily, an Open Document, or something like
that.

Some will say it's a matter of education, to which I wish you the best
of luck. It hasn't worked so far, because most people are not
interested. However, I sincerely support any movement to get, say,
7zip built into the next Windows service pack... that's what it is
gonna take at this moment.

-- Kristoffer
Robert D. (Guest)
on 2007-03-30 18:17
(Received via mailing list)
On 3/23/07, Trans <removed_email_address@domain.invalid> wrote:
<snip>
> When did I say otherwise? I believe my original point was simply one
> of surprise that BWT improved deflate, which I believe says more about
> deflate's deficiencies than BWT's use as a means of compression. I've
> only been trying to explain my reasoning for that thought ever since.

I'd like to ask a question, Tom, as this is becoming a little bit
obscure, as you put it.

BWT is used by bzip2 too; frankly, I am surprised that you are
surprised, Tom. As BWT will create repetitions, which is quite clear,
and allows for an easy undoing of its transformation, would it not help
a whole class of compression algorithms?

When you talked about a deficiency, would that not mean - in your setup
- that a compression algorithm should implicitly find "clever"
transformations? If this was your thought, it is an incredibly advanced
and frankly optimistic idea.

And that is my question BTW (not BWT ;) - do you indeed look at this
issue from this angle?

I can accept it as a 100% true statement though :)

But I guess putting them together (transformation and compression) by
hand is quite clever already and pretty much state of the art.

Cheers
Robert
<snip>
Chad P. (Guest)
on 2007-03-30 19:57
(Received via mailing list)
On Fri, Mar 30, 2007 at 10:35:38PM +0900, Kristoffer Lundén wrote:
>
> The point is, by defaulting to zip, you may hold back the revolution a
> bit, but it also ensures that your mom, your boss, your customers, and
> just about everyone except Perrin ;-) can open it. This is convenient
> and pragmatic, but it's also important: there's a lot of busy people
> out there who will not even bother to reply with a complaint if they
> can't open the files directly; they will just junk the message and
> move on. A *lot* of people out there aren't interested in installing
> anything, no matter how good it is. A lot of these people may pay your
> salary or sponsor your next project.

Well . . . to be fair, I *can* open it, but I usually need to type
something like one of the following first:

  apt-get install unzip

  cd /usr/ports/archivers/unzip; make install clean

Thanks for thinking of me, though.


>
> I have about 40 students I work with, computer literate and aged 18-30,
> and I am writing this because this had shown itself to be a real
> problem before I got them all to only send stuff in formats as
> standard as possible - this includes not sending MS files unless
> requested. Students have, in a very real sense, lost internship
> opportunities, and maybe even job offers (who knows?) because of this,
> and only a long time afterwards did it come out that it was because it
> was a RAR archive or, funnily, an Open Document, or something like
> that.

It always pays to find out what the person on the other end wants, when
that person has something you need.
Christian N. (Guest)
on 2007-03-30 21:01
(Received via mailing list)
"Trans" <removed_email_address@domain.invalid> writes:

> Pretty nuts, eh? Of course, I later realized, after solving the
> equation manually a number of times, the solutions would be no more
> compressed than the original figure. My original idea of "rapid
> growth" was flawed. And so it was that I got my first hint of what I
> now learn is called "Kolmogorov(-Chaitin)/descriptive complexity".

Reminds me a lot of my idea of just "indexing" all data by their
offset in Pi. :-)
Martin DeMello (Guest)
on 2007-04-01 00:20
(Received via mailing list)
On 3/30/07, Christian N. <removed_email_address@domain.invalid> wrote:
>
> Reminds me a lot of my idea of just "indexing" all data by their
> offset in Pi. :-)

I think most geeks have come up with that one at least once in their
lives :)

martin