The difficulty you'll run into is your need for the new, shorter value
to be unique. Hashes are not, and cannot be, designed to be unique. It's
all in the numbers. If you have a 100-character string of 8-bit
characters (assuming ASCII, not Unicode), then you have 800 bits of
information. You could take advantage of the fact that not all 256
values of a byte are valid for your string to reduce its size somewhat.
If you limit yourself to 7-bit ASCII, then there's 1 bit per byte that
could be reclaimed. All of these factors are taken into account in
compression algorithms, so compression is the direction you need to
look. Be careful, though, because many compression algorithms give
results longer than their input if the input is particularly short (I
seem to recall that some have fall-back approaches to account for
this).
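
To see the short-input caveat for yourself, here is a small sketch,
assuming Python and its standard zlib module (my choice for
illustration, nothing in your question requires it). The sample strings
are made up.

```python
import zlib

# Made-up sample inputs: one very short, one long and repetitive.
short = b"hello"
long_text = b"the quick brown fox jumps over the lazy dog. " * 20

short_packed = zlib.compress(short)
long_packed = zlib.compress(long_text)

# Print (original length, compressed length) for each input.
print(len(short), len(short_packed))
print(len(long_text), len(long_packed))

# Compression is lossless, so the round trip gives back the input.
assert zlib.decompress(short_packed) == short
assert zlib.decompress(long_packed) == long_text
```

The short input comes out longer because of the fixed header and
checksum that zlib adds, while the long repetitive one shrinks a lot.
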
Hashes (either the built-in hash() method you've already discovered the
issues with, or cryptographic hashes like MD5 or SHA-1) are designed to
statistically minimize the number of hash collisions. Two different
inputs producing the same hash is referred to as a hash collision. You
can take reasonably long inputs and the odds of any two different
strings producing the same hash are VERY low, but it's not guaranteed.
Since hashes are of a fixed size, there are only so many possible
result values, and that won't be enough to guarantee unique results for
arbitrary strings. If your value space (i.e. the number of strings
you're trying to ensure uniqueness for) is not in the millions or
billions, and you can live with the result of your compare basically
being a statement that, if you get the same value, there's a 1 in
XXXXXXXXX chance of them actually being different strings, then a
cryptographic hash might be sufficient for you. Just be aware that two
strings with the same hash are VERY, VERY likely to be the same string,
but that it's at least remotely possible that they are two different
strings producing the same hash.
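
A rough sketch of why a fixed-size hash can't be unique, assuming Python
and hashlib. The names tiny_hash, find_collision and
collision_probability are hypothetical helpers I made up. The first two
cut SHA-256 down to a single byte so a collision is forced within 257
inputs; the third uses the usual birthday approximation to put a number
on the "1 in XXXXXXXXX" figure for a full-size hash.

```python
import hashlib
import math


def tiny_hash(data: bytes) -> bytes:
    # Deliberately crippled: keep only the first byte of SHA-256,
    # so there are just 256 possible hash values.
    return hashlib.sha256(data).digest()[:1]


def find_collision(n_inputs: int = 257):
    # With 257 distinct inputs and 256 possible outputs, at least two
    # inputs must share a hash (pigeonhole principle).
    seen = {}
    for i in range(n_inputs):
        data = f"string-{i}".encode("ascii")
        h = tiny_hash(data)
        if h in seen:
            return seen[h], data
        seen[h] = data
    return None


def collision_probability(n_items: int, hash_bits: int) -> float:
    # Birthday approximation: p ~= 1 - exp(-n(n-1) / (2 * 2^bits)).
    # It assumes the hash output is uniformly distributed.
    x = n_items * (n_items - 1) / 2 / 2**hash_bits
    return -math.expm1(-x)


first, second = find_collision()
print(first, second, tiny_hash(first) == tiny_hash(second))

# Estimated chance of any collision among a million strings
# with a 128-bit hash (the size of an MD5 digest).
print(collision_probability(10**6, 128))
```

The estimate only covers accidental collisions between ordinary
inputs; it says nothing about someone deliberately constructing them.
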
Look into compression methods first. Compression is what you've
described. If your strings are sufficiently long, then off-the-shelf
compression could easily be your answer. If they're short and you have
special knowledge of the allowed input values (e.g. you're using ASCII
and only allow a-z, A-Z, space, comma, period, and so on), you may find
that there are only, say, 100 valid values per character (or anything
less than 128), in which case you could compress them to 7/8ths of
their original size (using very simplistic compression). Take a look at
simple zip compression and others like it. Their purpose is to do what
you're asking: provide a shorter value, which must be unique for every
unique input value (since it must be possible to decompress it).
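
The "very simplistic compression" to 7/8ths could look something like
this in Python. It's only a sketch: pack_7bit and unpack_7bit are names
I made up, and it assumes you store the original character count
somewhere, since the padding bits make the length ambiguous otherwise.

```python
def pack_7bit(text: str) -> bytes:
    # Treat the string as one big number, 7 bits per character.
    value = 0
    for ch in text:
        code = ord(ch)
        if code > 127:
            raise ValueError("only 7-bit ASCII is supported")
        value = (value << 7) | code
    # Round the bit count up to a whole number of bytes.
    n_bytes = (7 * len(text) + 7) // 8
    return value.to_bytes(n_bytes, "big")


def unpack_7bit(data: bytes, length: int) -> str:
    # Reverse of pack_7bit; 'length' is the original character count.
    value = int.from_bytes(data, "big")
    chars = []
    for _ in range(length):
        chars.append(chr(value & 0x7F))
        value >>= 7
    return "".join(reversed(chars))


sample = "Hello, world. OK"          # 16 characters
packed = pack_7bit(sample)
print(len(sample), len(packed))      # 16 characters -> 14 bytes
assert unpack_7bit(packed, len(sample)) == sample
```

Because every distinct input maps to a distinct packed value, the
result is unique by construction, which is exactly what a hash can't
promise.
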
Using the theoretical 100-values-per-character scenario I just gave,
the number of possible values of a string is 100^n (where n is the
number of characters). So, for 20 characters:

  possible values = 100^20 => 1e40
  number of bits = log2(possible values) => 132.8771
  bytes = number of bits / 8 => 16.6096

So, in theory, you can get 20-character strings down to 17 bytes. If
you go up to 200 characters, that's 167 bytes.
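
Here is the same arithmetic as a Python sketch, plus one way you might
actually pack a string when you know its alphabet. Everything named
here (bytes_needed, encode, decode, ALPHABET) is hypothetical, and the
alphabet is just an example set of 55 symbols, not the 100 from the
calculation.

```python
import string

# Example alphabet only: a-z, A-Z, space, comma, period (55 symbols).
ALPHABET = string.ascii_letters + " ,."


def bytes_needed(alphabet_size: int, n_chars: int) -> int:
    # Bits needed to hold the largest value, alphabet_size**n_chars - 1,
    # rounded up to whole bytes. Integer math avoids float rounding.
    bits = (alphabet_size**n_chars - 1).bit_length()
    return (bits + 7) // 8


def encode(text: str, alphabet: str = ALPHABET) -> bytes:
    # Read the string as a number written in base len(alphabet).
    base = len(alphabet)
    index = {ch: i for i, ch in enumerate(alphabet)}
    value = 0
    for ch in text:
        value = value * base + index[ch]  # KeyError if ch not allowed
    return value.to_bytes(bytes_needed(base, len(text)), "big")


def decode(data: bytes, length: int, alphabet: str = ALPHABET) -> str:
    # 'length' is the original character count, stored separately.
    base = len(alphabet)
    value = int.from_bytes(data, "big")
    chars = []
    for _ in range(length):
        value, digit = divmod(value, base)
        chars.append(alphabet[digit])
    return "".join(reversed(chars))


print(bytes_needed(100, 20))    # 17
print(bytes_needed(100, 200))   # 167

sample = "Hello, world."
packed = encode(sample)
assert decode(packed, len(sample)) == sample
```

The smaller the allowed alphabet, the fewer bits each character costs,
so a tighter alphabet packs better than the 100-value case.
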
Encryption, as you've seen, has no goal of producing shorter output than
the input, so it's not going to provide your solution.
(OK, I've started rambling… probably more detail than you needed. Look
for compression routines.)