The difficulty you'll run into is your need for the new, shorter value
to be unique. Hashes are not, and cannot be, designed to be unique.
It's all in the numbers. If you have a 100-character string of 8-bit
characters (assuming ASCII, not Unicode), then you have 800 bits of
information. You could take advantage of the fact that not all 256
values of a byte are valid for your string to reduce its size some. If
you limit yourself to 7-bit ASCII, then there's 1 bit per byte that
could be reclaimed. All of these factors are taken into account in
compression algorithms, so compression is the direction you need to
look. Be careful, though, because many compression algorithms give
results longer than their input if the input is particularly short (I
seem to recall that some have fall-back approaches to account for
this).
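
To see the short-input problem for yourself, here is a minimal sketch using the standard library's zlib module. The helper name compressed_size and the sample strings are made up for illustration; your real data will behave differently.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Return the length of the zlib-compressed form of data."""
    return len(zlib.compress(data, 9))


# Hypothetical sample inputs: one very short, one long and repetitive.
short = b"hello"
long_text = b"the quick brown fox jumps over the lazy dog. " * 20

for label, data in (("short", short), ("long", long_text)):
    print(label, len(data), "->", compressed_size(data))

# Compression is lossless: decompressing gives back exactly the input,
# so two different inputs can never share a compressed form.
assert zlib.decompress(zlib.compress(long_text)) == long_text
```

The short input comes out longer than it went in because of the fixed header and checksum overhead, while the long one shrinks considerably.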

Hashes (either the built-in hash() method you've already discovered the
issues with, or cryptographic hashes like MD5 or SHA1) are designed to
statistically minimize the number of hash collisions. You can take a
reasonably long input and the odds of any two different strings
producing the same hash are VERY low, but it's not guaranteed. Two
inputs producing the same hash is referred to as a hash collision.
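
A collision is easy to demonstrate if you shrink the hash enough. The sketch below is only an illustration: it truncates SHA1 to 2 bytes (my choice, to make collisions turn up quickly) and the names short_hash and find_collision are invented for this example. With only 65536 possible outputs, two different inputs must share a value sooner or later.

```python
import hashlib


def short_hash(text: str, nbytes: int = 2) -> bytes:
    """SHA1 digest of text, truncated to nbytes (illustration only)."""
    return hashlib.sha1(text.encode("utf-8")).digest()[:nbytes]


def find_collision(nbytes: int = 2):
    """Try generated strings until two of them share a truncated hash."""
    seen = {}  # truncated hash -> first input that produced it
    i = 0
    while True:
        text = f"string-{i}"
        h = short_hash(text, nbytes)
        if h in seen:
            return seen[h], text
        seen[h] = text
        i += 1


a, b = find_collision()
print(a, b, short_hash(a).hex())
```

A full-size MD5 or SHA1 value has the same property; the space is just so much larger that you are very unlikely to hit it by accident.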

Cryptographic hashes are designed to minimize collisions, but, since
they are of a fixed size, there are only so many possible result
values, and that won't be enough to guarantee unique results for
arbitrary strings. If your value space (i.e. the number of strings
you're trying to ensure uniqueness for) is not in the millions or
billions, and you can live with the result of your compare basically
being a statement that, if you get the same value, there's a 1 in
XXXXXXXXX chance of them actually being different strings, then a
cryptographic hash might be sufficient for you. Just be aware that two
strings with the same hash are VERY, VERY likely to be the same string,
but that it's at least remotely possible that they are two different
strings producing the same hash.
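
If you want a rough feel for how the odds change with the number of strings, the usual "birthday" approximation can be sketched as below. It assumes the hash spreads its outputs uniformly, which is an idealization, and collision_probability is a name I made up for the example.

```python
import math


def collision_probability(num_strings: int, hash_bits: int) -> float:
    """Approximate chance that at least two of num_strings distinct
    inputs share a hash value, for an ideal hash of hash_bits bits."""
    pairs = num_strings * (num_strings - 1) / 2
    # 1 - exp(-x), written with expm1 so tiny probabilities keep precision.
    return -math.expm1(-pairs / 2.0 ** hash_bits)


# MD5 produces 128-bit values and SHA1 produces 160-bit values.
for bits in (32, 128, 160):
    print(bits, collision_probability(1_000_000, bits))
```

With a million strings, a 32-bit hash is all but certain to collide somewhere, while at 128 bits the chance is vanishingly small (but still not zero).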

Look into compression methods first. Compression is what you've
described. If your strings are sufficiently long, then off-the-shelf
compression could easily be your answer. If they're short and you have
special knowledge of the allowed input values (ex: you're using ASCII,
and only allow a-z, A-Z, space, comma, period, etc.), you may find that
there are only, say, 100 valid values per character (or anything less
than 128); then you could compress them to 7/8ths of their original
size (using very simplistic compression). Take a look at simple zip
compression and others like it. Their purpose is to do what you're
asking: provide a shorter value which must be unique for every unique
input value (since it must be possible to decompress it).
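
The 7/8ths trick is simple enough to sketch by hand. This is just one possible way to do it, assuming pure 7-bit ASCII input; pack7 and unpack7 are hypothetical helper names, and the character count has to be kept alongside the packed bytes because the padding bits alone don't tell you how many characters there were.

```python
def pack7(text: str) -> bytes:
    """Pack 7-bit ASCII text into ceil(7 * n / 8) bytes."""
    data = text.encode("ascii")  # raises UnicodeEncodeError on non-ASCII
    value = 0
    for b in data:
        value = (value << 7) | b  # append 7 bits per character
    nbytes = (7 * len(data) + 7) // 8
    return value.to_bytes(nbytes, "big")


def unpack7(packed: bytes, n: int) -> str:
    """Reverse pack7; n is the original number of characters."""
    value = int.from_bytes(packed, "big")
    chars = []
    for _ in range(n):
        chars.append(value & 0x7F)  # peel off the last 7 bits
        value >>= 7
    return bytes(reversed(chars)).decode("ascii")


s = "Hello, world."
packed = pack7(s)
print(len(s), "chars ->", len(packed), "bytes")
assert unpack7(packed, len(s)) == s
```

Because the packing can be reversed, every distinct input is guaranteed a distinct output, which is the uniqueness a hash cannot give you.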

Using the theoretical 100-values-per-character scenario I just gave,
the number of possible values of a string is 100^n (where n is the
number of characters). So, for 20 characters:

possible values = 100^20 => 1e40

number of bits = log2(possible values) => 132.8771

bytes = number of bits / 8 => 16.6096

So, in theory, you can get 20-character strings down to 17 bytes. If
you go up to 200 characters: 167 bytes.
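
The same arithmetic can be checked with a few lines of code. This is a small sketch using exact integer math rather than log2; min_bytes is a name invented for the example.

```python
def min_bytes(alphabet_size: int, length: int) -> int:
    """Smallest whole number of bytes that can give every string of
    the given length over the given alphabet its own distinct value."""
    combos = alphabet_size ** length   # number of possible strings
    bits = (combos - 1).bit_length()   # bits needed to number them all
    return (bits + 7) // 8             # round up to whole bytes


print(min_bytes(100, 20))   # 17
print(min_bytes(100, 200))  # 167
```

No scheme that keeps every string unique can do better than this for fixed-length strings over that alphabet, which is why it makes a useful lower bound to compare real compressors against.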

Encryption, as you've seen, has no goal of producing shorter output
than the input, so it's not going to provide your solution.

(OK, I've started rambling… probably more detail than you needed. Look
for compression routines.)