Forum: Ruby Convert text string i.e 'Peter' into integer ID

Announcement (2017-05-07): www.ruby-forum.com is now read-only since I unfortunately do not have the time to support and maintain the forum any more. Please see rubyonrails.org/community and ruby-lang.org/en/community for other Rails- und Ruby-related community platforms.
Justus O. (Guest)
on 2008-11-12 15:19
Hello,

is there any method to quickly convert a text string such as 'Peter'
into an integer? (And vice versa?)

I am not asking about just changing the format (to_i). I'm looking for a
method thus can do the computation.

Thanks,
Justus
Sarcar, Shourya C (GE Healthcare) (Guest)
on 2008-11-12 15:29
(Received via mailing list)
code=0
"Peter".each_byte{|b| code += b} @ will not generate unique tho
Hugh S. (Guest)
on 2008-11-12 15:45
(Received via mailing list)
On Wed, 12 Nov 2008, Justus O. wrote:

> Hello,
>
> is there any method to quickly convert a text string such as 'Peter'
> into an integer? (And vice versa?)
>
> I am not asking about just changing the format (to_i). I'm looking for a
> method thus can do the computation.

What properties do you want the string:integer relationship to have?
The problem is too unbounded at the moment.

["Peter", "James", "John", "Andrew"].index("Peter")

satisfies the constraints given so far.
>
> Thanks,
> Justus

        Hugh
Peter S. (Guest)
on 2008-11-12 16:41
(Received via mailing list)
On 2008.11.12., at 14:16, Justus O. wrote:

> Hello,
>
> is there any method to quickly convert a text string such as 'Peter'
> into an integer? (And vice versa?)


There are a lot of easy ways to do the string -> integer part - e.g.

1) def x(s); return 42; end # now you can call x on your string and it
will convert it to an integer
2) "Peter"[0]
3) "Peter".object_id
4) "Peter".hash

The integer -> string part is the real problem; it's impossible to do
it with methods 1) - 3), with some extra effort (storing the result in
a lookup table)  it should be possible to accomplish it with 4) unless
you have some extra requirements.

The question is, what do you really want to achieve? What properties
should the mapping have?

Cheers
Peter
Justus O. (Guest)
on 2008-11-12 16:42
Hugh S. wrote:
> On Wed, 12 Nov 2008, Justus O. wrote:
>
>> Hello,
>>
>> is there any method to quickly convert a text string such as 'Peter'
>> into an integer? (And vice versa?)
>>
>> I am not asking about just changing the format (to_i). I'm looking for a
>> method thus can do the computation.
>
> What properties do you want the string:integer relationship to have?
> The problem is too unbounded at the moment.
>
> ["Peter", "James", "John", "Andrew"].index("Peter")
>
> satisfies the constraints given so far.
>>
>> Thanks,
>> Justus
>
>         Hugh

Thanks for your help. No I need to turn the string which would usually
be a headline ('Man lands on the moon') into a unique number. The
purpose is to speed up my database queries when checking whether an
entry with the same headline already exists in the db.
Justus
Todd B. (Guest)
on 2008-11-12 16:50
(Received via mailing list)
On Wed, Nov 12, 2008 at 8:39 AM, Justus O. 
<removed_email_address@domain.invalid>
wrote:
>>
>>         Hugh
>
> Thanks for your help. No I need to turn the string which would usually
> be a headline ('Man lands on the moon') into a unique number. The
> purpose is to speed up my database queries when checking whether an
> entry with the same headline already exists in the db.
> Justus

Why does it have to be a number?  Your db should already index.

Todd
Jan F. (Guest)
on 2008-11-12 16:50
(Received via mailing list)
Justus O. <removed_email_address@domain.invalid> wrote:

> Hugh S. wrote:
> Thanks for your help. No I need to turn the string which would usually
> be a headline ('Man lands on the moon') into a unique number. The
> purpose is to speed up my database queries when checking whether an
> entry with the same headline already exists in the db. Justus
I would do something like this with a md5 hash:

 require 'digest/md5'
 Digest::MD5.hexdigest('Peter')
 # => "6fa95b1427af77b3d769ae9cb853382f"

Regards
Jan
Peter S. (Guest)
on 2008-11-12 16:51
(Received via mailing list)
> Thanks for your help. No I need to turn the string which would usually
> be a headline ('Man lands on the moon') into a unique number. The
> purpose is to speed up my database queries when checking whether an
> entry with the same headline already exists in the db.

If you just want to check for duplicates, just store a SHA1 hash of
the string:
(AR example follows, obviously you can use any ORM or plain SQL):

require "sha1";

headline = "Man lands on the moon"
Article.create(:headline => headline, uniq_hash =>
SHA1.digest(headline))

...
later
....

you want to decide whether new_headline already exists:

Article.create( ... ) unless
Article.find_by_uniq_hash(SHA1.digest(new_headline))

HTH,
Peter
Sebastian H. (Guest)
on 2008-11-12 16:53
(Received via mailing list)
Peter S. wrote:
> 4) "Peter".hash
>
> The integer -> string part is the real problem; it's impossible to do  
> it with methods 1) - 3), with some extra effort (storing the result in  
> a lookup table)  it should be possible to accomplish it with 4) unless  
> you have some extra requirements.

Hash values are not unique. Two different strings can have the same hash
value, so you can't get the string back from the hash alone.

HTH,
Sebastian
Sarcar, Shourya C (GE Healthcare) (Guest)
on 2008-11-12 16:54
(Received via mailing list)
You need to generate a SHA1 or MD5 hash from your string.

require 'digest/md5'
d = Digest::MD5.new
d.update("if you believe it, they have put a man on the moon")
uniq = d.hexdigest


> -----Original Message-----
> From: removed_email_address@domain.invalid [mailto:removed_email_address@domain.invalid]
Todd B. (Guest)
on 2008-11-12 16:59
(Received via mailing list)
On Wed, Nov 12, 2008 at 8:47 AM, Jan F.
<removed_email_address@domain.invalid> wrote:
>  Digest::MD5.hexdigest('Peter')
>  # => "6fa95b1427af77b3d769ae9cb853382f"
>
> Regards
> Jan

Unless I'm missing something here, strings are just numbers in order.
Why encode/encrypt?

Most db's should handle natural keys.

If absolutely necessary to store as number strings (I can't see why),
look at #pack and #unpack.

Todd
Justus O. (Guest)
on 2008-11-12 17:08
Todd B. wrote:
> On Wed, Nov 12, 2008 at 8:47 AM, Jan F.
> <removed_email_address@domain.invalid> wrote:
>>  Digest::MD5.hexdigest('Peter')
>>  # => "6fa95b1427af77b3d769ae9cb853382f"
>>
>> Regards
>> Jan
>
> Unless I'm missing something here, strings are just numbers in order.
> Why encode/encrypt?
>
> Most db's should handle natural keys.
>
> If absolutely necessary to store as number strings (I can't see why),
> look at #pack and #unpack.
>
> Todd

Thanks a lot everybody.  I'm really impressed by the quick and useful
feedback.

Todd, I was told that searching by integers instead of strings would
speed up performance when using large mysql tables. Is that not so?

Justus
Jan F. (Guest)
on 2008-11-12 17:10
(Received via mailing list)
Todd B. <removed_email_address@domain.invalid> wrote:
> Unless I'm missing something here, strings are just numbers in order.
> Why encode/encrypt?
You're right:
 'Peter'.to_i(36) # => 42681699

But you will become problems with larger strings if your dbms doesn't
have integers with arbitrary length.

> Most db's should handle natural keys.
This is right. I don't adress the db part in my post.

> If absolutely necessary to store as number strings (I can't see why),
> look at #pack and #unpack.
This would also not work for larger strings (see above).

HTH,
Jan
Ken B. (Guest)
on 2008-11-12 17:13
(Received via mailing list)
On Wed, 12 Nov 2008 08:16:36 -0500, Justus O. wrote:

> Hello,
>
> is there any method to quickly convert a text string such as 'Peter'
> into an integer? (And vice versa?)
>
> I am not asking about just changing the format (to_i). I'm looking for a
> method thus can do the computation.
>
> Thanks,
> Justus

Does "Peter".hash do what you want?

--Ken
Justus O. (Guest)
on 2008-11-12 17:24
Ken B. wrote:
> On Wed, 12 Nov 2008 08:16:36 -0500, Justus O. wrote:
>
>> Hello,
>>
>> is there any method to quickly convert a text string such as 'Peter'
>> into an integer? (And vice versa?)
>>
>> I am not asking about just changing the format (to_i). I'm looking for a
>> method thus can do the computation.
>>
>> Thanks,
>> Justus
>
> Does "Peter".hash do what you want?
>
> --Ken

Yes, thanks it seems to do what I need, except for two possible
limitations:

1.Accordinng to Sebastian (above) 'Hash values are not unique. Two
different strings can have the same hash value'

2.It may not serve my original purpose, which is speeding up database
queries.


Thanks again,
Justus
Todd B. (Guest)
on 2008-11-12 17:31
(Received via mailing list)
On Wed, Nov 12, 2008 at 9:06 AM, Justus O. 
<removed_email_address@domain.invalid>
wrote:
>> Why encode/encrypt?
> Todd, I was told that searching by integers instead of strings would
> speed up performance when using large mysql tables. Is that not so?

To be honest, I know almost nothing about mysql.  I will say, however,
that you should try natural keys and see how the performance works
(testing).  PostgreSQL, for example, claims you gain no more
performance on any natural key (be it integer, character, otherwise).
The true bottleneck is almost always in the application.  But, I don't
know your exact situation.

From what you have said, it seems like you are looking for a primary
key that's unique and fast.  Most db's that are set up correctly do a
"behind-the-scenes" lookup for your key; which means that there is an
ID (number) assigned to your element.  The search is definitely what
people are concerned about, but having a string turned into a number
won't help you there, unless it's like a password or something.

If you want the string compacted, then follow some of the other
suggestions.

hth,
Todd
Rolando A. (Guest)
on 2008-11-12 17:41
(Received via mailing list)
On Nov 12, 2008, at 12:06 PM, Justus O. wrote:

> Thanks a lot everybody.  I'm really impressed by the quick and useful
> feedback.
>
> Todd, I was told that searching by integers instead of strings would
> speed up performance when using large mysql tables. Is that not so?

You should just index the column you need, and if you want that column
to be unique, create a unique constraint:

create index <index_name> on <table_name> (<name_of_the_column>)

Something like that should work. To create the unique constraint, look
on your database documentation.
Adding a column and getting a hash from a string is going to be slower
than having an index at the database level.

> Justus

Regards,
Giampiero Z. (Guest)
on 2008-11-12 18:43
> Thanks for your help. No I need to turn the string which would usually
> be a headline ('Man lands on the moon') into a unique number. The
> purpose is to speed up my database queries when checking whether an
> entry with the same headline already exists in the db.
> Justus

Viewing your problem from a theoretical point of view, something similar
exists in compression theory, namely arithmetic compression. Of course
that is not convenient for your porpuses.
Brian C. (Guest)
on 2008-11-12 19:08
Justus O. wrote:
> I was told that searching by integers instead of strings would
> speed up performance when using large mysql tables. Is that not so?

This is what's known as "premature optimisation".

The general rule is: build your application first. If it has performance
problems, profile it. Only after profiling to determine where it
*really* is slow, then modify it. Even the most experienced of
architects and programmers often get it wrong, when they rely on their
intuition as to where optimisation actually benefits you.

I'd say that if you have a few hundred thousand rows in a table, and the
column you are searching on is indexed, then I doubt you will get any
noticeable speed improvement searching on an integer rather than a
string column.

However, note that it is common practice in database applications to
have an integer primary key assigned from a sequence.

 1 : "Peter"
 2 : "James"
 3 : "John"
 4 : "Andrew"
 ... etc

In that way, if you happen to know the ID of the row you want already,
you can jump to it by ID. But if you want to search for it by name -
which might return 0, 1 or more results - you can do that efficiently
too as long as that column is indexed.

What you seem to be asking for is to allocate the IDs in such a way that
given the string, you can calculate the ID off-line without performing a
database search. But to avoid the possibility of two strings giving the
same integer, then you would have to use a strong cryptographic hash
like SHA1. This will give you an integer of size 2^160, which is very
large; so large that actually just storing the string (and searching on
it) will likely be more efficient anyway. Furthermore, the integers
themselves will effectively be randomly distributed, rather than a
linear sequence, so the same sort of tree index and lookup will be
required.
Ken B. (Guest)
on 2008-11-12 20:10
(Received via mailing list)
On Wed, 12 Nov 2008 10:22:05 -0500, Justus O. wrote:

>>>
> 1.Accordinng to Sebastian (above) 'Hash values are not unique. Two
> different strings can have the same hash value'
>
> 2.It may not serve my original purpose, which is speeding up database
> queries.

I'm going to suggest what Todd B. and Rolando A. suggested,
which
is to just work with strings in the database. Don't bother with
computing
some kind of (possibly unique) hash. Use a CREATE INDEX statement to
index the headline field, and you'll probably never notice a speed
difference between your roundabout method and feeding in the string
directly to the database.

--Ken
Justus O. (Guest)
on 2008-11-12 20:37
Ken B. wrote:
> On Wed, 12 Nov 2008 10:22:05 -0500, Justus O. wrote:
>
>>>>
>> 1.Accordinng to Sebastian (above) 'Hash values are not unique. Two
>> different strings can have the same hash value'
>>
>> 2.It may not serve my original purpose, which is speeding up database
>> queries.
>
> I'm going to suggest what Todd B. and Rolando A. suggested,
> which
> is to just work with strings in the database. Don't bother with
> computing
> some kind of (possibly unique) hash. Use a CREATE INDEX statement to
> index the headline field, and you'll probably never notice a speed
> difference between your roundabout method and feeding in the string
> directly to the database.
>
> --Ken

Thanks again for all you help everyone!

I have made one small test already using an additional integer column
instead of the original headline string. To convert the headline string
into an integer value I used the .hash method. The db I'm using is
mysql. Using a very small sample of entries (about 1000) I found
virtually no difference at all in the time it took to check the entire
table for existing entries when comparing using the string column vs.
using the integer column for all searches. If there is any difference in
time it takes it would be less than 1%. Considering that there is an
additional computation (.hash method) being performed when using the
integer column one could maybe assume that the latter - the integer
column - by itself must slightly faster for the database to check. In
any case I am going to stick with the original string column for the
headline field for now.

I will try to optimize the table indexing the headline field as
suggested. One question regarding this: Can this be done from rails or
are these mysql commands ('CREATE INDEX' etc.)?

Thanks again for all the help!
Justus







 and found virtually no difference in the time it took to compare about
100,000 rows in MySQL.
 using an integer value which was derived using 'headline'.hash

Again, thanks everybody
Peter S. (Guest)
on 2008-11-12 20:45
(Received via mailing list)
> I will try to optimize the table indexing the headline field as
> suggested. One question regarding this: Can this be done from rails or
> are these mysql commands ('CREATE INDEX' etc.)?


Yeah, you can do it from Rails migrations:

http://apidock.com/rails/ActiveRecord/Migration, section 'Available
transformations':

add_index(table_name, column_names, options):
Adds a new index with the name of the column. Other options
include :name and :unique (e.g. { :name => "users_name_index", :unique
=> true }).

Cheers,
Peter
Lloyd L. (Guest)
on 2008-11-14 17:14
Justus O. wrote:
> Todd, I was told that searching by integers instead of strings would
> speed up performance when using large mysql tables. Is that not so?

Well, it depends.

Searching for numbers can be faster only if the numbers are sorted.  For
this you would usually have an index.  This way, you halfsplit your way
to the number you want.

i.e.  Go the middle: is it bigger or smaller?  That narrows your search
by half with one check.  Then go to the middle of the half it will be in
and repeat.

The problem with headlines is that you are not necessarily searching for
a whole headline.

e.g. searching for "man on the moon" might not find "There is a man on
the moon again."  Then, there is case sensitivity which will affect
conversions to numbers often enough.

You could have what many websites have:  a key word search.  You could
have a list of key words, "man", "moon", etc., in another file and have
that column indexed.  Then, you search for keywords and yield all
articles that have all of them.  That would be fast and cover a lot of
bases.

IHTH
This topic is locked and can not be replied to.