Convert text string i.e 'Peter' into integer ID


#1

Hello,

is there any method to quickly convert a text string such as ‘Peter’
into an integer? (And vice versa?)

I am not asking about just changing the format (to_i). I’m looking for a
method thus can do the computation.

Thanks,
Justus


#2

code=0
“Peter”.each_byte{|b| code += b} @ will not generate unique tho


#3

On Wed, 12 Nov 2008, Justus O. wrote:

Hello,

is there any method to quickly convert a text string such as ‘Peter’
into an integer? (And vice versa?)

I am not asking about just changing the format (to_i). I’m looking for a
method thus can do the computation.

What properties do you want the string:integer relationship to have?
The problem is too unbounded at the moment.

[“Peter”, “James”, “John”, “Andrew”].index(“Peter”)

satisfies the constraints given so far.

Thanks,
Justus

    Hugh

#4

On 2008.11.12., at 14:16, Justus O. wrote:

Hello,

is there any method to quickly convert a text string such as ‘Peter’
into an integer? (And vice versa?)

There are a lot of easy ways to do the string -> integer part - e.g.

  1. def x(s); return 42; end # now you can call x on your string and it
    will convert it to an integer
  2. “Peter”[0]
  3. “Peter”.object_id
  4. “Peter”.hash

The integer -> string part is the real problem; it’s impossible to do
it with methods 1) - 3), with some extra effort (storing the result in
a lookup table) it should be possible to accomplish it with 4) unless
you have some extra requirements.

The question is, what do you really want to achieve? What properties
should the mapping have?

Cheers
Peter


#5

Hugh S. wrote:

On Wed, 12 Nov 2008, Justus O. wrote:

Hello,

is there any method to quickly convert a text string such as ‘Peter’
into an integer? (And vice versa?)

I am not asking about just changing the format (to_i). I’m looking for a
method thus can do the computation.

What properties do you want the string:integer relationship to have?
The problem is too unbounded at the moment.

[“Peter”, “James”, “John”, “Andrew”].index(“Peter”)

satisfies the constraints given so far.

Thanks,
Justus

    Hugh

Thanks for your help. No I need to turn the string which would usually
be a headline (‘Man lands on the moon’) into a unique number. The
purpose is to speed up my database queries when checking whether an
entry with the same headline already exists in the db.
Justus


#6

On Wed, Nov 12, 2008 at 8:39 AM, Justus O. removed_email_address@domain.invalid
wrote:

    Hugh

Thanks for your help. No I need to turn the string which would usually
be a headline (‘Man lands on the moon’) into a unique number. The
purpose is to speed up my database queries when checking whether an
entry with the same headline already exists in the db.
Justus

Why does it have to be a number? Your db should already index.

Todd


#7

Justus O. removed_email_address@domain.invalid wrote:

Hugh S. wrote:
Thanks for your help. No I need to turn the string which would usually
be a headline (‘Man lands on the moon’) into a unique number. The
purpose is to speed up my database queries when checking whether an
entry with the same headline already exists in the db. Justus
I would do something like this with a md5 hash:

require ‘digest/md5’
Digest::MD5.hexdigest(‘Peter’)

=> “6fa95b1427af77b3d769ae9cb853382f”

Regards
Jan


#8

Thanks for your help. No I need to turn the string which would usually
be a headline (‘Man lands on the moon’) into a unique number. The
purpose is to speed up my database queries when checking whether an
entry with the same headline already exists in the db.

If you just want to check for duplicates, just store a SHA1 hash of
the string:
(AR example follows, obviously you can use any ORM or plain SQL):

require “sha1”;

headline = “Man lands on the moon”
Article.create(:headline => headline, uniq_hash =>
SHA1.digest(headline))


later

you want to decide whether new_headline already exists:

Article.create( … ) unless
Article.find_by_uniq_hash(SHA1.digest(new_headline))

HTH,
Peter


#9

Peter S. wrote:

  1. “Peter”.hash

The integer -> string part is the real problem; it’s impossible to do
it with methods 1) - 3), with some extra effort (storing the result in
a lookup table) it should be possible to accomplish it with 4) unless
you have some extra requirements.

Hash values are not unique. Two different strings can have the same hash
value, so you can’t get the string back from the hash alone.

HTH,
Sebastian


#10

You need to generate a SHA1 or MD5 hash from your string.

require ‘digest/md5’
d = Digest::MD5.new
d.update(“if you believe it, they have put a man on the moon”)
uniq = d.hexdigest

-----Original Message-----
From: removed_email_address@domain.invalid [mailto:removed_email_address@domain.invalid]


#11

On Wed, Nov 12, 2008 at 8:47 AM, Jan F.
removed_email_address@domain.invalid wrote:

Digest::MD5.hexdigest(‘Peter’)

=> “6fa95b1427af77b3d769ae9cb853382f”

Regards
Jan

Unless I’m missing something here, strings are just numbers in order.
Why encode/encrypt?

Most db’s should handle natural keys.

If absolutely necessary to store as number strings (I can’t see why),
look at #pack and #unpack.

Todd


#12

Todd B. wrote:

On Wed, Nov 12, 2008 at 8:47 AM, Jan F.
removed_email_address@domain.invalid wrote:

Digest::MD5.hexdigest(‘Peter’)

=> “6fa95b1427af77b3d769ae9cb853382f”

Regards
Jan

Unless I’m missing something here, strings are just numbers in order.
Why encode/encrypt?

Most db’s should handle natural keys.

If absolutely necessary to store as number strings (I can’t see why),
look at #pack and #unpack.

Todd

Thanks a lot everybody. I’m really impressed by the quick and useful
feedback.

Todd, I was told that searching by integers instead of strings would
speed up performance when using large mysql tables. Is that not so?

Justus


#13

Todd B. removed_email_address@domain.invalid wrote:

Unless I’m missing something here, strings are just numbers in order.
Why encode/encrypt?
You’re right:
‘Peter’.to_i(36) # => 42681699

But you will become problems with larger strings if your dbms doesn’t
have integers with arbitrary length.

Most db’s should handle natural keys.
This is right. I don’t adress the db part in my post.

If absolutely necessary to store as number strings (I can’t see why),
look at #pack and #unpack.
This would also not work for larger strings (see above).

HTH,
Jan


#14

On Wed, 12 Nov 2008 08:16:36 -0500, Justus O. wrote:

Hello,

is there any method to quickly convert a text string such as ‘Peter’
into an integer? (And vice versa?)

I am not asking about just changing the format (to_i). I’m looking for a
method thus can do the computation.

Thanks,
Justus

Does “Peter”.hash do what you want?

–Ken


#15

Ken B. wrote:

On Wed, 12 Nov 2008 08:16:36 -0500, Justus O. wrote:

Hello,

is there any method to quickly convert a text string such as ‘Peter’
into an integer? (And vice versa?)

I am not asking about just changing the format (to_i). I’m looking for a
method thus can do the computation.

Thanks,
Justus

Does “Peter”.hash do what you want?

–Ken

Yes, thanks it seems to do what I need, except for two possible
limitations:

1.Accordinng to Sebastian (above) ‘Hash values are not unique. Two
different strings can have the same hash value’

2.It may not serve my original purpose, which is speeding up database
queries.

Thanks again,
Justus


#16

On Nov 12, 2008, at 12:06 PM, Justus O. wrote:

Thanks a lot everybody. I’m really impressed by the quick and useful
feedback.

Todd, I was told that searching by integers instead of strings would
speed up performance when using large mysql tables. Is that not so?

You should just index the column you need, and if you want that column
to be unique, create a unique constraint:

create index <index_name> on <table_name> (<name_of_the_column>)

Something like that should work. To create the unique constraint, look
on your database documentation.
Adding a column and getting a hash from a string is going to be slower
than having an index at the database level.

Justus

Regards,


#17

On Wed, Nov 12, 2008 at 9:06 AM, Justus O. removed_email_address@domain.invalid
wrote:

Why encode/encrypt?
Todd, I was told that searching by integers instead of strings would
speed up performance when using large mysql tables. Is that not so?

To be honest, I know almost nothing about mysql. I will say, however,
that you should try natural keys and see how the performance works
(testing). PostgreSQL, for example, claims you gain no more
performance on any natural key (be it integer, character, otherwise).
The true bottleneck is almost always in the application. But, I don’t
know your exact situation.

From what you have said, it seems like you are looking for a primary
key that’s unique and fast. Most db’s that are set up correctly do a
“behind-the-scenes” lookup for your key; which means that there is an
ID (number) assigned to your element. The search is definitely what
people are concerned about, but having a string turned into a number
won’t help you there, unless it’s like a password or something.

If you want the string compacted, then follow some of the other
suggestions.

hth,
Todd


#18

Justus O. wrote:

I was told that searching by integers instead of strings would
speed up performance when using large mysql tables. Is that not so?

This is what’s known as “premature optimisation”.

The general rule is: build your application first. If it has performance
problems, profile it. Only after profiling to determine where it
really is slow, then modify it. Even the most experienced of
architects and programmers often get it wrong, when they rely on their
intuition as to where optimisation actually benefits you.

I’d say that if you have a few hundred thousand rows in a table, and the
column you are searching on is indexed, then I doubt you will get any
noticeable speed improvement searching on an integer rather than a
string column.

However, note that it is common practice in database applications to
have an integer primary key assigned from a sequence.

1 : “Peter”
2 : “James”
3 : “John”
4 : “Andrew”
… etc

In that way, if you happen to know the ID of the row you want already,
you can jump to it by ID. But if you want to search for it by name -
which might return 0, 1 or more results - you can do that efficiently
too as long as that column is indexed.

What you seem to be asking for is to allocate the IDs in such a way that
given the string, you can calculate the ID off-line without performing a
database search. But to avoid the possibility of two strings giving the
same integer, then you would have to use a strong cryptographic hash
like SHA1. This will give you an integer of size 2^160, which is very
large; so large that actually just storing the string (and searching on
it) will likely be more efficient anyway. Furthermore, the integers
themselves will effectively be randomly distributed, rather than a
linear sequence, so the same sort of tree index and lookup will be
required.


#19

Thanks for your help. No I need to turn the string which would usually
be a headline (‘Man lands on the moon’) into a unique number. The
purpose is to speed up my database queries when checking whether an
entry with the same headline already exists in the db.
Justus

Viewing your problem from a theoretical point of view, something similar
exists in compression theory, namely arithmetic compression. Of course
that is not convenient for your porpuses.


#20

On Wed, 12 Nov 2008 10:22:05 -0500, Justus O. wrote:

1.Accordinng to Sebastian (above) ‘Hash values are not unique. Two
different strings can have the same hash value’

2.It may not serve my original purpose, which is speeding up database
queries.

I’m going to suggest what Todd B. and Rolando A. suggested,
which
is to just work with strings in the database. Don’t bother with
computing
some kind of (possibly unique) hash. Use a CREATE INDEX statement to
index the headline field, and you’ll probably never notice a speed
difference between your roundabout method and feeding in the string
directly to the database.

–Ken