Trans wrote:
Of course. When I used the term “repetition” I didn’t mean to be so
narrow as to suggest nothing more than “a a a a a”. A better term
would have been “patterns”.
Ok, sorry, that wasn’t clear. I agree with you.
Right. Size is of first relevance. So was your file smaller than
deflate’s or bzip2’s?
I don’t think either existed in 1983 :-). But I expect that
the answer is “no”. In any case, neither would suit, as the
compression existed to serve the purpose of full-text search
that had a separate compressed index for every word, and to
use the index, it was necessary to be able to open the file
efficiently at any word. Nowadays I’d open the file at any
disk block boundary, or bigger, since the CPU cost is
negligible next to the I/O cost. You’d probably use some
form of arithmetic coding these days anyhow.
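
The "open the file at any block boundary" idea can be sketched roughly as follows. This is only an illustration, not the 1983 scheme: it uses Python's `zlib` (deflate) as a stand-in for whatever coder you prefer, and the names `compress_blocks`, `read_block`, `read_at` and the 4096-byte `BLOCK_SIZE` are all made up for the example. Each block is compressed independently and an offset table records where each compressed block starts, so a reader can jump to any block without decompressing what came before it.

```python
import zlib

BLOCK_SIZE = 4096  # assumed block size; a real system would match the disk block size


def compress_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Compress each block independently; return (blob, offsets)."""
    chunks = []
    offsets = [0]  # offsets[i] is where compressed block i starts in the blob
    for start in range(0, len(data), block_size):
        chunk = zlib.compress(data[start:start + block_size])
        chunks.append(chunk)
        offsets.append(offsets[-1] + len(chunk))
    return b"".join(chunks), offsets


def read_block(blob: bytes, offsets, index: int) -> bytes:
    """Decompress only block `index`, leaving the others untouched."""
    return zlib.decompress(blob[offsets[index]:offsets[index + 1]])


def read_at(blob, offsets, position, length, block_size=BLOCK_SIZE):
    """Read `length` bytes starting at uncompressed byte `position`."""
    out = bytearray()
    block = position // block_size
    skip = position % block_size  # bytes to drop from the first block
    while len(out) < length and block < len(offsets) - 1:
        out += read_block(blob, offsets, block)[skip:]
        skip = 0
        block += 1
    return bytes(out[:length])


text = b"the quick brown fox jumps over the lazy dog. " * 1000
blob, offsets = compress_blocks(text)
print(read_at(blob, offsets, 10_000, 20))
```

The trade-off is that independent blocks cannot share context, so the overall ratio is somewhat worse than compressing the whole file in one go; bigger blocks narrow that gap at the cost of more work per seek.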
“How it does it” is the interesting part. How does BWT improve
deflate? I.e., what does that tell us about deflate?
It tells us that there’s more to compressing typical data sets
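
For anyone who hasn't seen the transform, here is a minimal sketch of a Burrows-Wheeler transform and its inverse. It is the naive sorted-rotations version, far too slow for real use, and the function names and the `$` sentinel are assumptions for the example. As far as I know bzip2 applies a BWT ahead of its own coding stages rather than feeding the result to deflate, so treat this purely as an illustration of how the transform groups symbols that share a following context.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Naive Burrows-Wheeler transform: last column of the sorted rotations."""
    assert sentinel not in text, "sentinel must not occur in the input"
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last_column: str, sentinel: str = "$") -> str:
    """Invert the transform by repeatedly prepending the column and re-sorting."""
    n = len(last_column)
    table = [""] * n
    for _ in range(n):
        table = sorted(last_column[i] + table[i] for i in range(n))
    # The original string is the row that ends with the unique sentinel.
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]


print(bwt("banana"))               # annb$aa
print(inverse_bwt(bwt("banana")))  # banana
```

Nothing is compressed here; the output is the same length plus the sentinel. The point is only that the transform is reversible and tends to produce runs of identical characters that a later stage finds easier to code.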
That’s a little belittling, don’t you think? I mean, you’ve taken one
connotation of one word I used and seemingly made it the centerpiece
Ok, I’m sorry, I thought you meant literal repetition, I didn’t
mean to unfairly latch on to that, I just needed to clarify that
Shannon’s theorem doesn’t imply such repetition.
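
To make that concrete, here is a small sketch that estimates entropy from symbol frequencies alone. The function name and sample strings are invented for the example, and an order-0 estimate like this ignores all context between symbols, so it is a crude measure. It does show that the bound is driven by how skewed the symbol probabilities are, not by literal runs: the first string is plainly repetitive yet scores the maximum for four symbols, while the second scores lower simply because one symbol dominates.

```python
import math
from collections import Counter


def entropy_bits_per_symbol(symbols) -> float:
    """Order-0 Shannon entropy estimate from the observed symbol frequencies."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


uniform = "abcd" * 4            # four symbols, equally frequent
skewed = "aaabaaacaaadaaaa"     # same four symbols, 'a' dominates

print(entropy_bits_per_symbol(uniform))
print(entropy_bits_per_symbol(skewed))
```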
Predictability is not the whole centerpiece of compression either.
Hmm. I think perhaps we still disagree then… at least about the
definition of predictability…
Pretty predictable. That may seem silly, but there’s a point. This is
not general compression. It doesn’t really matter that the process of
decompression involves a remote machine – locality is relative.
Information is relative. Our language is the prime example; if my
words don’t evoke symbols in your thoughts that map to a similar
symbolic morphology in you, then we can’t communicate. So English
isn’t just a set of words, it’s a symbolic net whose parts connect
our shared experiences.
In other words, you, as the “decompressor” of my sequence of words,
must have related context, experience, that forms a morphologically
similar mesh, the nodes of which are associated with words that we
share.
Taking that as my context, I maintain that your “general” data
stream only has theoretical existence. All meaningful data streams
have structure. There are standard forms of mathematical pattern
search like finding repetition, using Fourier analysis, even
fractal analysis, but these cannot hope to find all possible
patterns - they can only find the kinds of structure they look
for. At some point, those structure types that are coded into the
de/compressors are encodings of human interpretations, not intrinsic
realities. The encoding of the human interpretation is “fair”, even
the extreme one in your URL example.
I find this a very fascinating subject. I think ultimately it will
turn out to be very important, not just for compressing data, but for
understanding physics too.
I agree, though I hadn’t thought of the physics linkup. I think
that the very structure and nature of knowledge itself is hidden
here somewhere. So’s most of our non-conscious learning too, like
learning to walk… though the compression of patterns of movement
in our cerebellum requires an approach to using the time domain
in a way I haven’t seen a machine implement.
It is interesting to consider that all conceivable data can be
found somewhere in any transcendental number.
Since a transcendental number is just an infinite stream of digits,
there exist an infinite number of them that can encode any given
infinite stream of information. I don’t know how deep that insight
really is though.
You might think, that
being the case, an excellent compression algorithm would be to find
the starting digit, say in Pi, for the data one is seeking. But
curiously it would do you no good. On average the number for the
starting digit would be just as long as the data itself.
Good one! I’m sure there are some deep theoretical studies behind
that.
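
The counting argument is easy to poke at. There are 10^k different k-digit strings, so if the digits of Pi behave like a random stream (widely conjectured, not proven, as far as I know), the first occurrence of a given k-digit string will typically sit somewhere around position 10^k, and writing that position down takes about k digits. The sketch below is just an experiment along those lines: `pi_digits` uses Machin's formula with integer arithmetic, and both function names and the choice of 10,000 digits are assumptions for the example.

```python
def pi_digits(n: int) -> str:
    """First n decimal digits of pi after the decimal point (Machin's formula)."""
    guard = 10  # extra digits to absorb integer truncation error
    scale = 10 ** (n + guard)

    def arctan_inv(x: int) -> int:
        # scale * arctan(1/x), summed as an alternating Taylor series in integers
        total = term = scale // x
        x2 = x * x
        k = 1
        sign = -1
        while term:
            term //= x2
            k += 2
            total += sign * (term // k)
            sign = -sign
        return total

    pi_scaled = 4 * (4 * arctan_inv(5) - arctan_inv(239))
    return str(pi_scaled)[1:n + 1]  # drop the leading "3"


def average_index_length(digits: str, k: int):
    """Mean decimal length of the first-occurrence index over all k-digit strings found."""
    lengths = []
    for value in range(10 ** k):
        position = digits.find(str(value).zfill(k))
        if position >= 0:  # strings not present in this prefix are skipped
            lengths.append(len(str(position)))
    return sum(lengths) / len(lengths), len(lengths)


digits = pi_digits(10_000)
for k in (1, 2, 3):
    average, found = average_index_length(digits, k)
    print(k, found, round(average, 2))
```

Each line prints k, how many of the k-digit strings turned up in the prefix, and the average number of digits needed to write down where they start. The interesting comparison is that last figure against k itself.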
But I digress, if you think you are so sure about what you are saying,
and that I am completely off-base,
Now that you’ve explained yourself more clearly, I don’t
think you’re off-base at all.
I believe what they are trying to do is essentially impossible.
I don’t know. I wish I knew where I read that the human brain can
“learn” about the contents of 2 CDs - 1200MB, so even Einstein’s
knowledge could be encoded into that, if we knew how. The sum of
all human knowledge would doubtless be bigger than that, but I
found it an interesting thought. Some recent neurological research
I read indicated that, contrary to previous thought, our brains
actually do have a limited capacity. The previous assumption of
almost unlimited was based on thoughts about the likely failure
modes when “full” - and those were based on incorrect theories
about how memory works. I could perhaps dig up those papers
somewhere.
They should realize either the human brain is just that vast or
that human knowledge uses lossy compression.
I don’t agree at all. There surely is a limit to the achievable
compression, but we’re nowhere near it, and the search will tell
us much about the nature of knowledge.
Anyhow, interesting stuff, but not about Ruby… continue offline
if you wish…
Clifford H…