Best approach to convert string to unique value

I am trying to generate a unique number for a string and would like some
suggestions.

So far, I’ve researched the hash method, but it does not seem
consistent in the values it generates; it is not reliable, since it
does not return the same value every time.

My next idea was to use some simple encryption method, but that would
generate a large string and would be resource-expensive.

My next idea is to convert the string to hex; ASP.NET uses this
approach a lot. Is this a good idea? What are the drawbacks?

The reason I need this unique “hashcode” is that I want to save the
value in the database and compare it against new inserts in order to
avoid duplicate values. There are hundreds of JSON objects that
correspond to fields in the db.

So instead of comparing each field’s value, I thought the easier method
would be to find a way to get a unique value/code for the JSON string.
With that hashcode I can check whether the user has tried to submit the
same data to the db within the last 15 minutes and prevent it from
being submitted again.

I also think that this method uses very little memory compared to using
a session, but I don’t want to discard the session idea yet. If there
is a good approach, I may use it.

Well, I hope I made it clear enough to get some feedback.

Thanks

Rod

On 01/01/2014 07:56 PM, Rodrigo L. wrote:

I am trying to generate a unique number for a string and would like some
suggestions.

So far, I’ve researched the hash method, but it does not seem
consistent in the values it generates; it is not reliable, since it
does not return the same value every time.

Maybe because of key ordering differences? String#hash will not detect
that, but Hash#hash will:

{a: 1, b: 2}.hash
=> -755489698742161519

{b: 2, a: 1}.hash
=> -755489698742161519

"{a: 1, b: 2}".hash
=> -3350349176911638030

"{b: 2, a: 1}".hash
=> -4008483756911189513

You can fix that by parsing the JSON and calling #hash on the parsed
object instead.
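
For instance, a minimal sketch using the standard json library (the
example strings are just illustrative):

require 'json'

a = '{"a": 1, "b": 2}'
b = '{"b": 2, "a": 1}'

a.hash == b.hash                          # => false, string hashes differ
JSON.parse(a).hash == JSON.parse(b).hash  # => true, parsed hashes agree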

Then compare with hash values stored in the db. If found, then do a full
object compare to see if it is just a hash collision or a data-entry
dup. Both should be rare, so you won’t spend much time comparing
objects.

OTOH, if data-entry dup is common, you probably need a different
solution…

On 02 Jan 2014, at 06:12, Joel VanderWerf [email protected]
wrote:

{a: 1, b: 2}.hash

Then compare with hash values stored in the db. If found, then do a full object
compare to see if it is just a hash collision or a data-entry dup. Both should be
rare, so you won’t spend much time comparing objects.

OTOH, if data-entry dup is common, you probably need a different solution…

#hash yields different results after a runtime restart since Ruby
1.9.something to avoid hash-collision DOS attacks:

$ irb
[1] pry(main)> "foo".hash
=> -3624421491175122911
[2] pry(main)> "foo".hash
=> -3624421491175122911
[3] pry(main)> ^D
$ irb
[1] pry(main)> "foo".hash
=> -1929116373959977848

Use one of the hash functions provided by e.g. OpenSSL. Be aware that
these are cryptographic hashes, so they do need space and time on
purpose. Otherwise search for fast hash functions in Ruby, but they are
sometimes more prone to collisions!
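
For example, a minimal sketch with Ruby's standard Digest library (the
input string is just illustrative). Unlike #hash, the result is stable
across processes and restarts:

require 'digest'

json = '{"field1":"bla bla","field2":"ble ble"}'
Digest::SHA256.hexdigest(json)  # => 64 hex characters, the same in every run
Digest::MD5.hexdigest(json)     # => 32 hex characters, shorter but weaker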

As Joel mentioned, be aware that hashing always leads to collisions, so
the check should be something like (pseudocode):

candidate = lookup(hash(user_action))
if candidate && candidate == user_action
  raise DoublePost
end
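
Fleshed out a little in Ruby (store.lookup is a hypothetical helper
around your database, and DoublePost is just an illustrative exception
name):

require 'digest'
require 'json'

DoublePost = Class.new(StandardError)

def check_duplicate!(json_string, store)
  digest = Digest::SHA256.hexdigest(json_string)
  stored = store.lookup(digest)  # hypothetical: returns the stored JSON or nil
  # a matching digest alone could be a collision, so compare parsed objects too
  raise DoublePost if stored && JSON.parse(stored) == JSON.parse(json_string)
end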

Even then, only fix a problem that you actually have: for proper
duplicate detection, you have to save the value to the database in any
case.

Given that databases are good at storing and building efficient indices
over string values, I’d actually expect that simply storing the
original value and querying for it on a properly indexed database field
should be faster than hashing on the client side and then querying the
database for a stored hash value.

Do measure first and if you cannot point to an obvious problem, don’t
fix.

Regards,
Florian

Thanks for the comments. Just answering Robert’s post.

What does “convert to hex” mean in this context? (Referring to: “My
next idea is to convert the string to hex; ASP.NET uses this approach a
lot. Is this a good idea? What are the drawbacks?”)

RE: Well, you just made me think, and it sounds like a dumb idea to use
hex. I thought there was a way to generate a unique hex value, but it
would just be a big string, the same as encrypting, with not much use
for this purpose.

What type of DB is this? What are all these strings that you insert? I
mean, JSON is just a format. What do these represent? What do you mean
by “hundreds of Json objects that corresponds to fields in the db”?

RE: the db used is MySQL; the strings are JSON such as
{"field1": "bla bla", "field2": "ble ble"}, etc., where field1
corresponds to the MySQL field1, and so on. There are also nested
objects which correspond to “files”, and there can be hundreds of
files. The JSON is not a pure JavaScript object; I use JSON.stringify
to turn it into a string. It is this JSON string that I want to find a
hashcode for.

What kind of session are we talking about here?

RE: Web session (either stored in db, memory or cookie). The session
here would be something like a session token for a specific period of
time; for instance, let’s say the user can be logged in for a max of 20
min with no activity. After this period has passed, he/she can reinsert
values. This may be the easiest approach if there is a way to grab this
unique token. But it has another disadvantage: what if I want to
restrict the same purchases over a longer period, let’s say a 24-hour
period? The only way to control that would be via a cookie, and if the
user clears the browser cache he/she can then reinsert the same order
again and again. There has got to be a better way than this.

Can you describe the nature of your application and the data that you
want to protect against duplicate submission?

RE: The nature is similar to a shopping cart, but instead of products,
print services are being sold and their corresponding files are sent
together with each order. There are three tables: one is order_header,
another is order_details, and the last is order_assets, which contains
the files.

Ps: Using an index on order_header to validate uniqueness is not a good
idea because the user can indeed have two, three, four… orders with the
same content (as long as they are not within the same period of time).
Print services are not like buying TVs, where most big stores such as
Amazon won’t allow the same order multiple times. In my scenario,
nothing prohibits the user from submitting multiple service orders
using the same files, as long as it is not within the same period of
time. I haven’t decided on this period yet: 20 min, 60 min, or 24
hours.

Hope I answered your questions

On Thu, Jan 2, 2014 at 4:56 AM, Rodrigo L. [email protected]
wrote:

I am trying to generate a unique number for a string and would like some
suggestions.

So far, I’ve researched the hash method, but it does not seem
consistent in the values it generates; it is not reliable, since it
does not return the same value every time.

You could use MD5 or another cryptographically robust hash function.

My next idea was to use some simple encryption method, but that would
generate a large string and would be resource-expensive.

Encryption does not reduce the size so it won’t help with saving memory.

My next idea is to convert the string to hex; ASP.NET uses this
approach a lot. Is this a good idea? What are the drawbacks?

What does “convert to hex” mean in this context?

The reason I need this unique “hashcode” is that I want to save the
value in the database and compare it against new inserts in order to
avoid duplicate values. There are hundreds of JSON objects that
correspond to fields in the db.

What type of DB is this? What are all these strings that you insert? I
mean, JSON is just a format. What do these represent? What do you mean
by “hundreds of Json objects that corresponds to fields in the db”?

So instead of comparing each field’s value, I thought the easier method
would be to find a way to get a unique value/code for the JSON string.
With that hashcode I can check whether the user has tried to submit the
same data to the db within the last 15 minutes and prevent it from
being submitted again.

I also think that this method uses very little memory compared to using
a session, but I don’t want to discard the session idea yet. If there
is a good approach, I may use it.

What kind of session are we talking about here?

Well, I hope I made it clear enough to get some feedback.

Not entirely - at least not for me. Can you describe the nature of
your application and the data that you want to protect against
duplicate submission?

Kind regards

robert

On Thu, Jan 2, 2014 at 12:53 PM, Rodrigo L.
[email protected] wrote:

Thanks for the comments. Just answering Robert’s post.

Can you please stick with the usual quoting style? That will make it
much easier for everyone to follow discussions.

What type of DB is this? What are all these strings that you insert? I
mean, JSON is just a format. What do these represent? What do you mean
by “hundreds of Json objects that corresponds to fields in the db”?

RE: the db used is MySQL; the strings are JSON such as
{"field1": "bla bla", "field2": "ble ble"}, etc., where field1
corresponds to the MySQL field1, and so on. The JSON is not a pure
JavaScript object; I use JSON.stringify to turn it into a string. It is
this JSON string that I want to find a hashcode for.

So you want to avoid the exact same order (potentially containing
multiple documents to print) being repeated? That would mean that if a
user submits an order for Doc1 and Doc2 and later one for Doc1 only,
the second one will still succeed. Also, submitting an order with Doc2
and Doc1 (order reversed) will also succeed. Is this what you want?

What kind of session are we talking about here?

RE: Web session (either stored in db, memory or cookie). The session

Can you describe the nature of your application and the data that you
want to protect against duplicate submission?

RE: The nature is similar to a shopping cart, but instead of products,
print services are being sold and their corresponding files are sent
together with each order. There are three tables: one is order_header,
another is order_details, and the last is order_assets, which contains
the files.

In case you want to avoid duplicate printing of the same file you
could store a SHA1 hash in order_assets along with the blob that
contains the file data to avoid duplicate submission of a file.
However, that would be a different mechanism from the one described
above.
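
For example, a minimal sketch (the file path and column handling are
only illustrative):

require 'digest'

file_data = File.binread('upload.pdf')         # hypothetical uploaded file
file_sha1 = Digest::SHA1.hexdigest(file_data)
# store file_sha1 next to the blob in order_assets and check it on new uploads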

Are you trying to guard against browser resubmitting forms or is there
another reason for what you are trying to do?

Ps: Using an index on order_header to validate uniqueness is not a good
idea because the user can have two, three, four… orders with the same
content (as long as they are not within the same period of time).

You could define time periods (e.g. every five minutes), calculate the
order’s period and store it in the database. Then create a unique index
covering the period and the header (whatever that is).
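
A quick sketch of the period calculation in Ruby (the five-minute
bucket is just the example value from above):

PERIOD = 5 * 60   # five-minute buckets
period_start = Time.at((Time.now.to_i / PERIOD) * PERIOD)
# store period_start on order_header and add a unique index covering
# (period_start, <digest or other identifier of the order content>)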

Hope I answered your questions

Partly yes. Thank you.

Kind regards

robert

The difficulty that you’ll run into is in your need for the new,
shorter value to be unique. Hashes are not, and cannot be, designed to
be unique. It’s all in the numbers. If you have a 100-character string
of 8-bit characters (assuming ASCII, not Unicode), then you have 800
bits of information. You could take advantage of the fact that not all
256 values of a byte are valid for your string to reduce its size some.
If you limit yourself to 7-bit ASCII, then there’s 1 bit per byte that
could be reclaimed. All of these factors are taken into account in
compression algorithms, so compression is the direction you need to
look. Be careful, because many compression algorithms give longer
results than their input if the input is particularly short (I seem to
recall that some have fall-back approaches to account for this).

Hashes (either the built-in #hash method you’ve already discovered the
issues with, or cryptographic hashes like MD5 or SHA1) are designed to
statistically minimize the number of hash collisions. You can take a
reasonably long input and the odds of any two different strings
producing the same value are VERY low, but it’s not guaranteed. Two
inputs producing the same hash is referred to as a hash collision.
Cryptographic hashes are designed to minimize collisions, but, since
they are of a fixed size, there are only so many possible result
values, and that won’t be enough to guarantee unique results for
strings. If your value space (i.e. the number of strings you’re trying
to ensure uniqueness for) is not in the millions or billions, and you
can live with your compare basically being a statement that, if you get
the same value, there’s a 1 in XXXXXXXXX chance of them actually being
different strings, then a cryptographic hash might be sufficient for
you. Just be aware that two strings with the same hash might be VERY,
VERY likely to be the same string, but it is, at least remotely,
possible that they are two different strings producing the same hash.

Look into compression methods first. Compression is what you’ve
described. If your strings are sufficiently long, then off-the-shelf
compression could easily be your answer. If they’re short and you have
special knowledge of the allowed input values (e.g. you’re using ASCII
and only allow a-z, A-Z, space, comma, period, …), you may find that
there are only, say, 100 valid values per character (or anything less
than 128); then you could compress them to 7/8ths of their original
size (using very simplistic compression). Take a look at simple zip
compression and others like it. Their purpose is to do what you’re
asking: provide a shorter value which must be unique for every unique
input value (since it must be able to decompress).
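
A minimal sketch of off-the-shelf compression using Ruby's standard
zlib binding (the example string is just illustrative). Note that it is
lossless, and that very short inputs may actually grow:

require 'zlib'

json = '{"field1":"bla bla","field2":"ble ble"}'
compressed = Zlib::Deflate.deflate(json)
compressed.bytesize < json.bytesize          # may be false for short inputs
Zlib::Inflate.inflate(compressed) == json    # => true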

Using the theoretical 100-values-per-character scenario I just gave:
the number of possible values of a string is 100^n (where n is the
number of characters). So, for 20 characters:

possible values = 100^20                 => 1e40
number of bits  = log2(possible values)  => 132.8771
bytes           = number of bits / 8     => 16.6096

So, in theory, you can get 20-character strings down to 17 bytes.

If you go up to 200 characters, it’s 167 bytes.
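
The same arithmetic in Ruby, just to double-check:

# bytes needed to represent every possible n-character string drawn
# from an alphabet of 100 symbols
def min_bytes(n)
  (Math.log2(100) * n / 8).ceil
end

min_bytes(20)   # => 17
min_bytes(200)  # => 167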

Encryption, as you’ve seen, has no goal of producing shorter output
than the input, so it’s not going to provide your solution.

(OK, I’ve started rambling… probably more detail than you needed; look
for compression routines.)

On Thu, Jan 2, 2014 at 3:11 PM, Rodrigo L. [email protected]
wrote:

service, etc. The order in which these files are submitted does not
user resubmits, all I need is to generate the SHA1 hash again and
compare it to the SHA1 hash stored in the db. I mean, I don’t need to
check each file. Why would you do it this way?

That was just an option depending on what you want to achieve.
Apparently this does not meet your needs.

Are you trying to guard against browser resubmitting forms or is there
another reason for what you are trying to do?
RE: browser and malicious user.

For browser resubmit issues you could use a different approach: place
a hidden field in the HTML form with a generated value (e.g. UUID).
Check that value on submission and reject if it has been seen already.
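
A minimal sketch of the token side of that, using SecureRandom from the
Ruby standard library (the form handling itself is only illustrative):

require 'securerandom'

token = SecureRandom.uuid
# Render the token into a hidden field, e.g.
#   <input type="hidden" name="token" value="...">
# remember it server-side, and on submission reject any token that has
# already been seen.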

Not sure what a malicious user would be up to as it seems you know the
user already and the printing jobs likely will be billed against him.
If you need to avoid fake submissions you could store the generated
value (UUID or other) and only accept exactly one submission. You
could use a table with

value (PK)
count (initial 0, constraint to be 0 or 1)
timestamp (not null)

have an index on timestamp (for cleanup)

On submission of the form you do

update the_table
set count = count + 1
where value = ?

Bind parameter is set to what you got from the form. There are two
error situations you need to cover

  1. count exceeds valid range => this is a resubmission
  2. no rows are updated => forged request

In regular intervals you delete all rows with a timestamp older than x
time units.
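
Putting the pieces together, a rough Ruby sketch of that mechanism
using the mysql2 gem (table name, column names and connection settings
are all illustrative, and the CHECK constraint on count is only
enforced by newer MySQL versions):

require 'mysql2'
require 'securerandom'

client = Mysql2::Client.new(host: 'localhost', username: 'app',
                            database: 'shop')   # hypothetical settings

# one-time setup: the token table described above
client.query(<<~SQL)
  CREATE TABLE IF NOT EXISTS form_tokens (
    value   CHAR(36)  PRIMARY KEY,
    count   TINYINT   NOT NULL DEFAULT 0 CHECK (count BETWEEN 0 AND 1),
    created TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    INDEX idx_created (created)
  )
SQL

# when rendering the form: issue a token and store it
token = SecureRandom.uuid
client.query("INSERT INTO form_tokens (value) VALUES ('#{client.escape(token)}')")

# on submission: try to consume the token exactly once
client.query("UPDATE form_tokens SET count = count + 1 " \
             "WHERE value = '#{client.escape(token)}'")
raise 'forged request' if client.affected_rows.zero?   # unknown token
# a constraint violation (count leaving 0..1) means a resubmission

# in regular intervals: clean up old tokens
client.query("DELETE FROM form_tokens WHERE created < NOW() - INTERVAL 1 DAY")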

You may also want to look into cross-site request forgery (CSRF)
defense. IIRC a similar token mechanism is used there as well.

You could define time periods (e.g. every five minutes), calculate the
order’s period and store it in the database. Then create a unique index
covering the period and the header (whatever that is).
RE: to me this sounds like the best idea ever.

Nice to hear that this will work for you.

Thank you very much

You’re welcome.

Cheers

robert

thanks Robert,

You just keep coming up with better ideas. This last input about the
UUID was very good. I heard somewhere that UUIDs are not “atomic” when
generated on different load-balancing servers, but I don’t mind a small
risk, and I just have one server.

Yeah, a malicious user in my case means fake or accidental submissions.
But you know what? With files stored in Amazon S3, I must be careful
not to be billed for storage usage.

thanks
Rod

Can you please stick with the usual quoting style? That will make it
much easier for everyone to follow discussions.
RE: sorry, my bad.
So you want to avoid the exact same order (potentially containing
multiple documents to print) being repeated? That would mean that if a
user submits an order for Doc1 and Doc2 and later one for Doc1 only,
the second one will still succeed. Also, submitting an order with Doc2
and Doc1 (order reversed) will also succeed. Is this what you want?
RE: more or less. To be a dupe, the duplicated order needs to contain
exactly the same files, the same quantities, the same price for the
service, etc. The order in which these files are submitted does not
matter.

In case you want to avoid duplicate printing of the same file you
could store a SHA1 hash in order_assets along with the blob that
contains the file data to avoid duplicate submission of a file.
However, that would be a different mechanism from the one described
above.
RE: this may be the solution, except I don’t need to store it in
order_assets, but only in order_header. Let’s say I have a SHA1 hash
“#$lkdsflda” and this string corresponds to the overall JSON string. If
the user resubmits, all I need is to generate the SHA1 hash again and
compare it to the SHA1 hash stored in the db. I mean, I don’t need to
check each file. Why would you do it this way?

Are you trying to guard against browser resubmitting forms or is there
another reason for what you are trying to do?
RE: browser and malicious user.

You could define time periods (e.g. every five minutes), calculate the
order’s period and store it in the database. Then create a unique index
covering the period and the header (whatever that is).
RE: to me this sounds like the best idea ever.

Thank you very much