Shorten SHA1-hash

In one of my Rails applications I have a system for uploading and
downloading files. Since I don’t want people to browse all the files,
but anyone should be able to link to his/her own files, each file has
its own unique URL - all pretty standard.

Right now the URL is just /download/sha-sum-of-the-file (since I have
the SHA anyway), but the length of the URL is bugging me.

Now - the problem: The SHA1-sum is, as usual, just represented in
base16, which makes it 40 chars long. However, in a URL we have at
least a-zA-Z0-9 (and also some special characters) which gives us the
opportunity to represent the SHA-sum in at least base62 which should
make it roughly a third shorter (27 characters).
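
To put a number on the base62 idea, here is a minimal sketch that re-encodes a hex digest with the alphabet 0-9A-Za-z. The helper name hex_to_base62 and the sample input are made up for illustration, and Integer#digits needs Ruby 2.4 or later. A 160-bit value needs at most 27 base62 characters, so the saving is closer to a third than a half.

```ruby
require "digest"

# 62-symbol alphabet: digits, then upper case, then lower case.
BASE62 = [*"0".."9", *"A".."Z", *"a".."z"].freeze

# Re-encode a hexadecimal string in base 62 (hypothetical helper).
def hex_to_base62(hex)
  n = hex.to_i(16)
  return BASE62[0] if n.zero?
  # Integer#digits yields the least significant digit first, hence the reverse.
  n.digits(62).reverse.map { |d| BASE62[d] }.join
end

sha   = Digest::SHA1.hexdigest("hello world")  # 40 hex characters
short = hex_to_base62(sha)
puts sha.length
puts short.length
```

Leading zeros are dropped by the integer conversion, so a decoder would have to left-pad the hex string back to 40 characters.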

I know that in this case I could just store an extra little random
string in my database and link to /download/little-string, but that’s
not the point here :) I would just like to hear from you: How few
(printable) characters can you shorten a SHA-sum down to?
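
As a rough answer to "how few": the lower bound depends only on the alphabet size. The sketch below (min_chars is a made-up helper) finds the smallest length k such that alphabet_size**k covers all 2**160 possible SHA1 values. The figure of 94 assumes every printable ASCII character except space is allowed, which is not realistic for a URL path.

```ruby
# Smallest number of symbols from an alphabet of the given size that can
# represent every value of `bits` bits. Uses exact integer arithmetic.
def min_chars(bits, alphabet_size)
  k = 0
  k += 1 while alphabet_size**k < 2**bits
  k
end

[16, 32, 62, 64, 94].each do |size|
  puts "base#{size}: #{min_chars(160, size)} characters"
end
# base16: 40, base32: 32, base62: 27, base64: 27, base94: 25
```

So going from 62 to 64 symbols does not save anything for a 160-bit hash, and even the full printable ASCII range only gets down to 25.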

You could use String#hash instead. For strings more than 6 characters
(or something), it’ll return a negative number, so it’s a little hacky,
but:

file_contents.hash.to_s.tr('-', '0')

I’m not saying this is a nice approach, but it is pretty short ;)

Anyone with more knowledge of String#hash care to chime in?

Also, I doubt this is very cross-platform friendly, just take a look at
all the ifdefs in the C hash implementation.
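
On the portability worry: in Ruby 1.9 and later, String#hash is additionally seeded per process, so the same string hashes differently on every run and the value cannot serve as a permanent URL. A small sketch to check this on your own setup (hash_in_new_process is a made-up helper; it assumes the current Ruby executable can be spawned as a child process):

```ruby
require "rbconfig"

# Compute str.hash inside a freshly started Ruby process and return it.
def hash_in_new_process(str)
  IO.popen([RbConfig.ruby, "-e", "print ARGV[0].hash", str], &:read).to_i
end

first  = hash_in_new_process("hello world")
second = hash_in_new_process("hello world")
puts first
puts second
puts first == second ? "same value" : "different values"
```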

Peter Skovgaard wrote:

> least a-zA-Z0-9 (and also some special characters) which gives us the
> opportunity to represent the SHA-sum in at least base62 which should
> make it roughly a third shorter (27 characters).
>
> I know that in this case I could just store an extra little random
> string in my database and link to /download/little-string, but that’s
> not the point here :) I would just like to hear from you: How few
> (printable) characters can you shorten a SHA-sum down to?

Modified Base64 for URLs is probably your best and easiest route.

Wikipedia’s Base64 article explains it well:

Base64 encoding can be helpful when fairly lengthy identifying
information is used in an HTTP environment. Hibernate
<http://en.wikipedia.org/wiki/Hibernate_(Java)>, a database
persistence framework for Java
<http://en.wikipedia.org/wiki/Java_(programming_language)> objects,
uses Base64 encoding to encode a relatively large unique id (generally
128-bit UUIDs <http://en.wikipedia.org/wiki/UUID>) into a string for use
as an HTTP parameter in HTTP forms or HTTP GET URLs
<http://en.wikipedia.org/wiki/URL>. Also, many applications need to
encode binary data in a way that is convenient for inclusion in URLs,
including in hidden web form fields, and Base64 is a convenient encoding
to render them in not only a compact way, but in a relatively unreadable
one when trying to obscure the nature of data from a casual human
observer.

Using a URL-encoder on standard Base64, however, is inconvenient as it
will translate the ‘+’ and ‘/’ characters into special ‘%XX’ hexadecimal
sequences (‘+’ = ‘%2B’ and ‘/’ = ‘%2F’). When this is later used with
database storage or across heterogeneous systems, they will themselves
choke on the ‘%’ character generated by URL-encoders (because the ‘%’
character is also used in ANSI SQL as a wildcard).
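
The escaping described above is easy to reproduce in Ruby. A small sketch using the standard library’s URI.encode_www_form_component on a made-up string that contains the three characters standard Base64 can produce outside a-zA-Z0-9:

```ruby
require "uri"

standard = "a+b/c="  # made-up sample, not a real digest
escaped  = URI.encode_www_form_component(standard)

puts escaped          # a%2Bb%2Fc%3D
puts standard.length  # 6
puts escaped.length   # 12
```

Each escaped character costs three characters instead of one, which defeats the purpose of shortening the URL.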

For this reason, a modified Base64 for URL variant exists, where /no/
padding ‘=’ will be used, and the ‘+’ and ‘/’ characters of standard
Base64 are respectively replaced by ‘*’ and ‘-’, so that using URL
encoders/decoders is no longer necessary and has no impact on the length
of the encoded value, leaving the same encoded form intact for use in
relational databases, web forms, and object identifiers in general.
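
A minimal sketch of that approach in Ruby, applied to a SHA1 digest. One caveat: the passage above names ‘*’ and ‘-’ as replacements, while RFC 4648’s "base64url" alphabet and Ruby’s own Base64.urlsafe_encode64 use ‘-’ and ‘_’; the sketch assumes the latter. The helper names urlsafe_b64 and urlsafe_b64_decode are made up, and String#unpack1 needs Ruby 2.4 or later.

```ruby
require "digest"

# Encode raw bytes as URL-safe Base64: "-" and "_" replace "+" and "/",
# and the trailing "=" padding is removed.
def urlsafe_b64(bytes)
  [bytes].pack("m0").tr("+/", "-_").delete("=")
end

# Reverse of urlsafe_b64: restore the standard alphabet and the padding,
# then decode back to raw bytes.
def urlsafe_b64_decode(str)
  padded = str.tr("-_", "+/") + "=" * ((4 - str.length % 4) % 4)
  padded.unpack1("m0")
end

digest = Digest::SHA1.digest("hello world")  # 20 raw bytes, not the hex form
token  = urlsafe_b64(digest)
puts token.length                            # 27
puts urlsafe_b64_decode(token) == digest     # true
```

The important detail is to encode the 20 raw bytes from Digest::SHA1.digest rather than the 40-character hex string; encoding the hex string would make the result longer, not shorter.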

Tom