Regular expressions (extracting urls)

ntk · February 5, 2007, 9:37pm

Hi!

I have to extract a URL from a piece of text and turn it into a link (a href…).
The trick is that I have to be careful not to replace URLs that are
already part of a link.

so:

link = "Go here: http://www.something.com!"
link.gsub!(/https?:\/\/[a-z0-9.\-_=&+\/?]+/i, '<a href="\0">\0</a>')

link becomes:
link = 'Go here: <a href="http://www.something.com">http://www.something.com</a>!'

Now…

When the link is this:
link = 'Go here: <a href="http://www.something.com">http://www.something.com</a>!'

The regular expression must not replace it!

I know that for example if I want to exclude the links that start with
xhttp, I can write:
link.gsub!(/([^x]https?:\/\/[a-z0-9.\-_=&+\/?]+)/i, '<a href="\1">\1</a>')

but how can I exclude links that start with href=" and href=' ?
The problem is that I don't know how to specify that HREF cannot
precede the link (I cannot write [^href], and [^(href)] doesn't seem to work
either, and it also screws up \n … )
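
A minimal sketch of why [^href] does not do what is wanted here: square brackets form a character class, so the expression tests a single character rather than the word "href". The sample strings are made up for illustration.

```ruby
# [^href] is a character class: it matches exactly ONE character
# that is not "h", "r", "e" or "f". It does not mean "not the word href".
pattern = /[^href]http/

# "x" is outside the set, so the pattern matches at index 0.
x_match = ("xhttp://www.something.com" =~ pattern)

# "f" is inside the set, so a URL preceded only by "f" is rejected.
f_match = ("fhttp://www.something.com" =~ pattern)

# Inside a real link, the character before "http" is the quote,
# which is also outside the set, so the URL is still matched.
href_match = ("href='http://www.something.com'" =~ pattern)

p x_match     # 0
p f_match     # nil
p href_match  # 5
```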

Please help !

David

ntk · February 5, 2007, 11:11pm

On 05.02.2007 21:37, David K. wrote:

Maybe it’s enough to do

s.gsub(/([^"'])\b(http[^\s"']+)\b([^"'])/, '\1<a href="\2">\2</a>\3')

That depends on your input text. This piece has some weaknesses, e.g. it
won't substitute URLs at the beginning and end of the string (you could
pad with whitespace).
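
The padding idea could be sketched roughly as follows. The helper name linkify_padded is hypothetical, and the anchor markup in the replacement string is an assumption about the intended output. Note that this only skips URLs directly preceded by a quote; a URL that appears as the visible text of an existing link is not protected.

```ruby
# Hypothetical helper: pad with one space on each side so the pattern can
# match a URL at the very start or end, then drop the padding again.
def linkify_padded(text)
  # A non-quote character, the URL, then another non-quote character.
  url_re = /([^"'])\b(http[^\s"']+)\b([^"'])/
  padded = " #{text} "
  linked = padded.gsub(url_re, '\1<a href="\2">\2</a>\3')
  linked[1..-2] # remove the two padding spaces
end

puts linkify_padded("http://www.something.com is the place")
puts linkify_padded("Go here: http://www.something.com")
# The URL follows a quote here, so the string is left alone.
puts linkify_padded('<a href="http://www.something.com">here</a>')
```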

Kind regards

robert

ntk · February 7, 2007, 5:20pm

Thank you… but yes… the problem is that it has to detect links at the
beginning and the end. It also has to check for HREF specifically, because
sometimes a quotation mark can come before the link even when it's
not preceded by href.

Is there a better solution? Thank you

ntk · February 7, 2007, 6:46pm

On 5 feb, 17:37, David K. wrote:

The regular expression must not replace it!

I know that for example if I want to exclude the links that start with
xhttp, I can write:
link.gsub!(/([^x]https?:\/\/[a-z0-9.\-_=&+\/?]+)/i, '<a href="\1">\1</a>')

but how can I exclude links that start with href=" and href=' ?

You can exclude full words if you write regexes such as:

/(?!href=['"])/

Be careful about greediness, though. If you are doing any web
scraping, you should also look into something like WWW::Mechanize
instead of reinventing the wheel.

ntk · February 7, 2007, 8:05pm

On 07.02.2007 18:40, gga wrote:

but how can I exclude links that start with href=" and href=' ?

You can exclude full words if you write regexes such as:

/(?!href=['"])/

I'm afraid this won't work: this is negative lookahead, but what you
really need here is negative lookbehind. That's only possible with
the new regex engine in Ruby 1.9.
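
As a rough sketch of the difference, assuming a Ruby version whose engine supports lookbehind (1.9 or later): the sample strings and the anchor markup are illustrative only. The lookbehind covers only the href= case from the question; a URL shown as the visible text of an existing link would still need a separate check.

```ruby
bare   = "Go here: http://www.something.com!"
linked = '<a href="http://www.something.com">here</a>'
anchor = '<a href="\0">\0</a>'

# Negative lookahead checks what FOLLOWS the current position. At the "h" of
# "http" the upcoming text is never href=, so the assertion always passes.
lookahead = /(?!href=["'])https?:\/\/[a-z0-9.\-_=&+\/?]+/i

# Negative lookbehind checks what PRECEDES the current position, so a URL
# sitting right after href=" or href=' is skipped.
lookbehind = /(?<!href=["'])https?:\/\/[a-z0-9.\-_=&+\/?]+/i

puts bare.gsub(lookbehind, anchor)    # bare URL becomes a link
puts linked.gsub(lookbehind, anchor)  # existing link is left unchanged
puts linked.gsub(lookahead, anchor)   # existing link is wrapped a second time
```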

Kind regards

robert