URL normalization

luislavena · November 23, 2011, 8:14pm

Hi,

I have a set of URLs that I want to normalize, but I can’t find a regex
to do that. Here is a sample URL:
http://www.example.com/index.php?/topic/something/page__st__20__s__99590dc581fe8e7386051d6dfgdfg4eca4c/
In a web browser, this URL turns out to be equivalent to the following:
http://www.example.com/index.php?/topic/something/page__st__20
The last part is clearly a checksum, but how can I detect that
automatically?

Best regards

rubix01 · November 23, 2011, 8:30pm

Can you work with something like this?

url_re = %r{^http://.*((?<=__s__)[a-g0-9]{32,64})/$}

I’ve assumed your checksum is between 32 and 64 characters which may or
may not be correct.

Sam
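
To make the regex suggestion concrete, here is a minimal sketch of stripping the trailing token. It assumes the token always follows a literal `__s__` marker and is 32 to 64 characters from `[a-g0-9]`, as in the sample URL; the names `TRAILING_TOKEN` and `strip_trailing_token` are made up for this illustration and may need adjusting for other sites.

```ruby
# Assumption: the token follows a literal "__s__" marker and is 32 to 64
# characters from [a-g0-9], optionally followed by a final slash.
TRAILING_TOKEN = %r{__s__[a-g0-9]{32,64}/?\z}

# Hypothetical helper: remove the trailing token; URLs without one are
# returned unchanged.
def strip_trailing_token(url)
  url.sub(TRAILING_TOKEN, "")
end

url = "http://www.example.com/index.php?/topic/something/page__st__20__s__99590dc581fe8e7386051d6dfgdfg4eca4c/"
puts strip_trailing_token(url)
# prints http://www.example.com/index.php?/topic/something/page__st__20
```

Anchoring on the `__s__` marker rather than on the token alone keeps the pattern from eating other long alphanumeric path segments.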

rubix01 · November 23, 2011, 8:58pm

Thank you for your answer.
Are you saying that a checksum contains only the characters a to g?
If so, I think that will help a lot.
Regards,

rubix01 · November 24, 2011, 5:42pm

There are no generic semantics defined for the path and query parameters
of a URL. Semantics are defined only for the leading parts (protocol,
host, port, etc.). How do you expect any mechanism to know that the
last part is a checksum (of what, by the way)? I mean, completely
independently of the technical questions of parsing: how would a piece of
software detect the checksum just by looking at the URL?

For specifically formatted URLs it’s a different story (see Sam’s
suggestion).

Kind regards

robert
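
As a rough illustration of that point, this sketch shows what Ruby's standard `URI` parser can tell you about such a URL. The helper name `url_parts` is hypothetical; the takeaway is that everything after the `?` comes back as one opaque query string, with no notion of a checksum.

```ruby
require "uri"

# Hypothetical helper: return only the parts a generic parser can name.
def url_parts(url)
  uri = URI.parse(url)
  { scheme: uri.scheme, host: uri.host, port: uri.port,
    path: uri.path, query: uri.query }
end

# The query is returned as a single string; interpreting the "__s__..."
# suffix is application-specific knowledge the parser does not have.
p url_parts("http://www.example.com/index.php?/topic/something/page__st__20")
```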

rubix01 · December 29, 2011, 8:50am

-----Original message-----
From: Ryan D. [mailto:[email protected]]
Sent: Wednesday, November 23, 2011 23:21
To: ruby-talk ML
Subject: Re: url normalization

I think if you pay close attention you’ll see that your browser goes to
the first url and then gets redirected by the server to the second url.
The proper thing to do would be to actually do the redirection, not
munge the url directly.

rubix01 · November 23, 2011, 11:22pm

I think if you pay close attention you’ll see that your browser goes to
the first URL and then gets redirected by the server to the second URL.
The proper thing to do would be to actually follow the redirection, not
munge the URL directly.
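
One way to follow the redirection from Ruby is sketched below, assuming the server really does answer with a 3xx status and a `Location` header (that is a guess about this particular site, not something confirmed in the thread). The names `redirect_target` and `canonical_url` are hypothetical, and the `lookup:` parameter exists only so the loop can be tried without a network.

```ruby
require "net/http"
require "uri"

# Hypothetical helper: send a HEAD request and return the Location header
# if the server answers with a redirect, or nil otherwise.
def redirect_target(uri)
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    http.head(uri.request_uri)
  end
  response.is_a?(Net::HTTPRedirection) ? response["location"] : nil
end

# Follow redirects for at most `limit` hops and return the final URL.
# `lookup` is injectable so a stub can stand in for the real HTTP call.
def canonical_url(url, limit: 5, lookup: method(:redirect_target))
  uri = URI.parse(url)
  limit.times do
    location = lookup.call(uri)
    return uri.to_s if location.nil?
    uri = URI.join(uri.to_s, location) # Location may be a relative reference
  end
  raise "too many redirects for #{url}"
end
```

Calling `canonical_url(some_url)` with the default `lookup` performs real requests, so it costs one round trip per hop; caching the results may be worthwhile for a large set of URLs.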