Problem with trivial regular expression

dvancoevorden · December 14, 2009, 11:05am

Hi, sorry for my English.

I am trying to refresh my memory of regular expressions and I have a
problem with this:

I have a text containing several strings, for example URLs, like this:

fajljsfjaosfohttp://www.marca.comjafosjodfahttp://www.as.comjfoaasjofja

I want to extract the separate URLs, but when I try
/(http://.+com)/ it returns one long match:

http://www.marca.comjafosjodfahttp://www.as.com

How can I split this into 2 separate matches? For example:
1- http://www.marca.com
2- http://www.as.com

thanks

dvancoevorden · December 14, 2009, 11:17am

What you are missing is the non-greedy modifier (?) for the +:

irb(main):001:0> s = "fajljsfjaosfohttp://www.marca.comjafosjodfahttp://www.as.comjfoaasjofja"
=> "fajljsfjaosfohttp://www.marca.comjafosjodfahttp://www.as.comjfoaasjofja"
irb(main):003:0> s.scan(/http:\/\/.+?\.com/)
=> ["http://www.marca.com", "http://www.as.com"]

(I also added an extra \. before com, to match “.com” and not just
“com”.) scan then walks through the whole string and returns every
match.
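
To make the difference concrete, here is a minimal sketch comparing the greedy and non-greedy quantifiers on the string from the question. The constant names are only illustrative, and the slashes are escaped because the pattern is written between / delimiters.

```ruby
# Sample string from the original question.
SAMPLE = "fajljsfjaosfohttp://www.marca.comjafosjodfahttp://www.as.comjfoaasjofja"

# Greedy: .+ grabs as much as possible, so the single match runs to the last ".com".
GREEDY = /http:\/\/.+\.com/

# Non-greedy: .+? stops at the first ".com" it can reach.
LAZY = /http:\/\/.+?\.com/

p SAMPLE.scan(GREEDY) # => ["http://www.marca.comjafosjodfahttp://www.as.com"]
p SAMPLE.scan(LAZY)   # => ["http://www.marca.com", "http://www.as.com"]
```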

Hope this helps,

Jesus.

dvancoevorden · December 14, 2009, 11:17am

irb> s = "fajljsfjaosfohttp://www.marca.comjafosjodfahttp://www.as.comjfoaasjofja"
=> "fajljsfjaosfohttp://www.marca.comjafosjodfahttp://www.as.comjfoaasjofja"
irb> s.scan( %r{http://.+?\.com} )
=> ["http://www.marca.com", "http://www.as.com"]

I use scan because we want multiple results.

For the Regexp:
the ? in “.+?” makes the quantifier ungreedy;
%r{} lets you write / without escaping it;
I also added a “.” (escaped as \.) to ensure there is a literal dot before “com”.
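
As a small sketch of the %r{} point: the two literals below are meant to describe the same pattern, one with escaped slashes and one without. The constant names are made up for the example, and the check compares what each pattern finds rather than the Regexp objects themselves.

```ruby
TEXT = "fajljsfjaosfohttp://www.marca.comjafosjodfahttp://www.as.comjfoaasjofja"

# Between / delimiters every slash in the pattern has to be escaped.
WITH_SLASHES = /http:\/\/.+?\.com/

# With %r{} the slashes can be written as they are.
WITH_PERCENT_R = %r{http://.+?\.com}

p TEXT.scan(WITH_SLASHES)      # => ["http://www.marca.com", "http://www.as.com"]
p TEXT.scan(WITH_PERCENT_R)    # => ["http://www.marca.com", "http://www.as.com"]

# scan returns an Array, so a single match can be picked by index.
p TEXT.scan(WITH_PERCENT_R)[1] # => "http://www.as.com"
```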

Enjoy

2009/12/14 David V. [email protected]

dvancoevorden · December 14, 2009, 11:20am

Our two posts agree almost exactly.
This should be the Ruby way, very clear this time!

2009/12/14 Jesús Gabriel y Galán [email protected]

dvancoevorden · December 14, 2009, 11:30am

On Monday 14 December 2009 04:06:08 am David V. wrote:

fajljsfjaosfohttp://www.marca.comjafosjodfahttp://www.as.comjfoaasjofja

Any particular context? Or is it actually that random?

i want to extract the diferents url but i try with :
/(http://.+com)/ it returns a long match:

http://www.marca.comjafosjodfahttp://www.as.com

If you think about it, that is still a valid URL. You’re trying to limit
it not to URLs, but only to http:// followed by a domain, and then only
a domain ending in .com – there are MANY urls that this will break.

If you’re OK with that, the basic problem is that .+ is going to match
as much as it possibly can (greedy), and . matches any character. The
simple solution is to make it match as few characters as it can
(miserly). You do that by putting a question mark after the + or *:

/(http:\/\/.+?com)/

But again, that’s not matching .com, that’s matching anything ending in
com. For example, on this URL:

http://www.broadcom.com/

it will only capture http://www.broadcom. So there’s an easy solution –
add an escaped dot:

/(http:\/\/.+?\.com)/
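
A quick sketch of the broadcom case, with and without the escaped dot. The constant names are illustrative only. The last lines also show a side effect of keeping the parentheses: with a capture group, scan returns one array per match.

```ruby
BROADCOM_URL = "http://www.broadcom.com/"

# Without the escaped dot the match stops at the first "com" it finds.
UNESCAPED = /(http:\/\/.+?com)/
# With \. the pattern needs a literal dot before "com".
ESCAPED = /(http:\/\/.+?\.com)/

p BROADCOM_URL[UNESCAPED, 1] # => "http://www.broadcom"
p BROADCOM_URL[ESCAPED, 1]   # => "http://www.broadcom.com"

# With a capture group, scan yields an array of groups for each match.
PAIR = "xhttp://www.marca.comyhttp://www.as.comz"
p PAIR.scan(ESCAPED)         # => [["http://www.marca.com"], ["http://www.as.com"]]
p PAIR.scan(ESCAPED).flatten # => ["http://www.marca.com", "http://www.as.com"]
```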

That’s as much as I want to do with it. I’m guessing what you’re trying
to do is auto-linkify URLs in forum posts, or something like that – some
problem that’s been solved a million times before, and better, so you
should look for those solutions. But I won’t assume that applies to you…

By the way, if you don’t already know:

http://rubular.com/

dvancoevorden · December 14, 2009, 11:42am

David M. wrote:

http://www.marca.comjafosjodfahttp://www.as.com

If you think about it, that is still a valid URL.

That’s arguable, because of the colon. RFC 1738:

URL schemes that involve the direct use
of an IP-based protocol to a specified host on the Internet use a
common syntax for the scheme-specific data:

    //<user>:<password>@<host>:<port>/<url-path>

…
port
The port number to connect to. Most schemes designate
protocols that have a default port number. Another port number
may optionally be supplied, in decimal, separated from the
host by a colon. If the port is omitted, the colon is as well.

However, it says “is” rather than “MUST BE”.

dvancoevorden · December 14, 2009, 11:38am

On Mon, Dec 14, 2009 at 11:20 AM, Benoit D. [email protected]
wrote:

Our two posts agree almost exactly.
This should be the Ruby way, very clear this time!

Yep, but I forgot the %r. Escaping / is ugly :-).
So, thanks for that !

Jesus.

dvancoevorden · December 14, 2009, 1:01pm

Thanks for all the answers.

I already use rubular.com for testing, but thanks anyway.

I posted a random string, and I understand the problem better now, but
the real string and question are these:

""18%7Chttp%3A%2F%2Fv14.lscache3.c.youtube.com%2Fvideoplayback%3Fip%3D0.0.0.0%26sparams%3Did%252Cexpire%252Cip%252Cipbits%252Citag%252Calgorithm%252Cburst%252Cfactor%26fexp%3D900034%252C902305%26algorithm%3Dthrottle-factor%26itag%3D18%26ipbits%3D0%26burst%3D40%26sver%3D3%26expire%3D1260813600%26key%3Dyt1%26signature%3D4C33C8CB5787DBA4D111A2E66BFFACCC2E95E0A7.D302F87BF11A4BD47631704E858C762325913921%26factor%3D1.25%26id%3Dcf9829e68818de48%2C34%7Chttp%3A%2F%2Fv7.lscache8.c.youtube.com%2Fvideoplayback%3Fip%3D0.0.0.0%26sparams%3Did%252Cexpire%252Cip%252Cipbits%252Citag%252Calgorithm%252Cburst%252Cfactor%26fexp%3D900034%252C902305%26algorithm%3Dthrottle-factor%26itag%3D34%26ipbits%3D0%26burst%3D40%26sver%3D3%26expire%3D1260813600%26key%3Dyt1%26signature%3D158633D23CDC66CC40D171EC8CB48A84B4BCB223.86F6779A36E14684B367BCCF91588556A871C421%26factor%3D1.25%26id%3Dcf9829e68818de48%2C5%7Chttp%3A%2F%2Fv16.lscache8.c.youtube.com%2Fvideoplayback%3Fip%3D0.0.0.0%26sparams%3Did%252Cexpire%252Cip%252Cipbits%252Citag%252Calgorithm%252Cburst%252Cfactor%26fexp%3D900034%252C902305%26algorithm%3Dthrottle-factor%26itag%3D5%26ipbits%3D0%26burst%3D40%26sver%3D3%26expire%3D1260813600%26key%3Dyt1%26signature%3D91FAC179A7C02DD662942CCEB71D8BE05BD7B5D3.BC4BFC97F8BA02519CC817D2541E5D0548BC2C7C%26factor%3D1.25%26id%3Dcf9829e68818de48"

It contains several URLs. One starts with http… and ends with %2C3:

http%3A%2F%2Fv14.lscache3.c.youtube.com%2Fvideoplayback%3Fip%3D0.0.0.0%26sparams%3Did%252Cexpire%252Cip%252Cipbits%252Citag%252Calgorithm%252Cburst%252Cfactor%26fexp%3D900034%252C902305%26algorithm%3Dthrottle-factor%26itag%3D18%26ipbits%3D0%26burst%3D40%26sver%3D3%26expire%3D1260813600%26key%3Dyt1%26signature%3D4C33C8CB5787DBA4D111A2E66BFFACCC2E95E0A7.D302F87BF11A4BD47631704E858C762325913921%26factor%3D1.25%26id%3Dcf9829e68818de48%2C3

and another, right after it, starts with http… and ends with %2C5:

http%3A%2F%2Fv7.lscache8.c.youtube.com%2Fvideoplayback%3Fip%3D0.0.0.0%26sparams%3Did%252Cexpire%252Cip%252Cipbits%252Citag%252Calgorithm%252Cburst%252Cfactor%26fexp%3D900034%252C902305%26algorithm%3Dthrottle-factor%26itag%3D34%26ipbits%3D0%26burst%3D40%26sver%3D3%26expire%3D1260813600%26key%3Dyt1%26signature%3D158633D23CDC66CC40D171EC8CB48A84B4BCB223.86F6779A36E14684B367BCCF91588556A871C421%26factor%3D1.25%26id%3Dcf9829e68818de48%2C5

So when I match the first one there is no problem, but when I try to
match the second, the match contains both the first and the second:

Match captures:

http%3A%2F%2Fv14.lscache3.c.youtube.com%2Fvideoplayback%3Fip%3D0.0.0.0%26sparams%3Did%252Cexpire%252Cip%252Cipbits%252Citag%252Calgorithm%252Cburst%252Cfactor%26fexp%3D900034%252C902305%26algorithm%3Dthrottle-factor%26itag%3D18%26ipbits%3D0%26burst%3D40%26sver%3D3%26expire%3D1260813600%26key%3Dyt1%26signature%3D4C33C8CB5787DBA4D111A2E66BFFACCC2E95E0A7.D302F87BF11A4BD47631704E858C762325913921%26factor%3D1.25%26id%3Dcf9829e68818de48%2C34%7Chttp%3A%2F%2Fv7.lscache8.c.youtube.com%2Fvideoplayback%3Fip%3D0.0.0.0%26sparams%3Did%252Cexpire%252Cip%252Cipbits%252Citag%252Calgorithm%252Cburst%252Cfactor%26fexp%3D900034%252C902305%26algorithm%3Dthrottle-factor%26itag%3D34%26ipbits%3D0%26burst%3D40%26sver%3D3%26expire%3D1260813600%26key%3Dyt1%26signature%3D158633D23CDC66CC40D171EC8CB48A84B4BCB223.86F6779A36E14684B367BCCF91588556A871C421%26factor%3D1.25%26id%3Dcf9829e68818de48

How can I match only the second one?

dvancoevorden · December 14, 2009, 1:46pm

David V. wrote:

I posted a random string and i understand it better, but the real
question and string is :

But where does this string actually come from? It looks a bit like URLs
but with an extra layer of URL-encoding, in that = appears as %3D, for
example, and with some extra numeric prefixes like 18|.

So it would be much more helpful to know what the real structure of this
string is; then you don’t need to guess about how to decode it.
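
The extra layer is easy to see on a short piece of the string. This is only a sketch: EXCERPT is a truncated prefix of the posted string, and the constant names are invented for the example.

```ruby
require 'cgi'

# Truncated prefix of the string posted above (part of the first field only).
EXCERPT = "18%7Chttp%3A%2F%2Fv14.lscache3.c.youtube.com%2Fvideoplayback%3Fip%3D0.0.0.0%26sparams%3Did%252Cexpire"

# First pass: %7C, %3A, %2F, %3D ... are decoded, and %252C becomes %2C.
ONCE = CGI.unescape(EXCERPT)
# Second pass: the remaining %2C becomes a comma.
TWICE = CGI.unescape(ONCE)

puts ONCE  # 18|http://v14.lscache3.c.youtube.com/videoplayback?ip=0.0.0.0&sparams=id%2Cexpire
puts TWICE # 18|http://v14.lscache3.c.youtube.com/videoplayback?ip=0.0.0.0&sparams=id,expire
```

After one pass the only literal commas left are the ones that separate fields, which is why a single unescape is enough before splitting.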

Removing the first level of escaping:

irb(main):007:0> CGI.unescape(s)
=> "\"18|http://v14.lscache3.c.youtube.com/videoplayback?ip=0.0.0.0&sparams=id%2Cexpire%2Cip%2Cipbits%2Citag%2Calgorithm%2Cburst%2Cfactor&fexp=900034%2C902305&algorithm=throttle-factor&itag=18&ipbits=0&burst=40&sver=3&expire=1260813600&key=yt1&signature=4C33C8CB5787DBA4D111A2E66BFFACCC2E95E0A7.D302F87BF11A4BD47631704E858C762325913921&factor=1.25&id=cf9829e68818de48,34|http://v7.lscache8.c.youtube.com/videoplayback?ip=0.0.0.0&sparams=id%2Cexpire%2Cip%2Cipbits%2Citag%2Calgorithm%2Cburst%2Cfactor&fexp=900034%2C902305&algorithm=throttle-factor&itag=34&ipbits=0&burst=40&sver=3&expire=1260813600&key=yt1&signature=158633D23CDC66CC40D171EC8CB48A84B4BCB223.86F6779A36E14684B367BCCF91588556A871C421&factor=1.25&id=cf9829e68818de48,5|http://v16.lscache8.c.youtube.com/videoplayback?ip=0.0.0.0&sparams=id%2Cexpire%2Cip%2Cipbits%2Citag%2Calgorithm%2Cburst%2Cfactor&fexp=900034%2C902305&algorithm=throttle-factor&itag=5&ipbits=0&burst=40&sver=3&expire=1260813600&key=yt1&signature=91FAC179A7C02DD662942CCEB71D8BE05BD7B5D3.BC4BFC97F8BA02519CC817D2541E5D0548BC2C7C&factor=1.25&id=cf9829e68818de48\""

So my total guess is that this is a double-quoted string, which
contains comma-separated fields, and each field is of the form nn|URL.
In which case you can unwrap it in stages:

irb(main):011:0> s.sub!(/\A"(.*)"\z/) { $1 }
irb(main):012:0> fields = CGI.unescape(s).split(',')
irb(main):013:0> fields.each { |f| num,url = f.split('|',2); puts "**",url }; nil

http://v14.lscache3.c.youtube.com/videoplayback?ip=0.0.0.0&sparams=id%2Cexpire%2Cip%2Cipbits%2Citag%2Calgorithm%2Cburst%2Cfactor&fexp=900034%2C902305&algorithm=throttle-factor&itag=18&ipbits=0&burst=40&sver=3&expire=1260813600&key=yt1&signature=4C33C8CB5787DBA4D111A2E66BFFACCC2E95E0A7.D302F87BF11A4BD47631704E858C762325913921&factor=1.25&id=cf9829e68818de48

http://v7.lscache8.c.youtube.com/videoplayback?ip=0.0.0.0&sparams=id%2Cexpire%2Cip%2Cipbits%2Citag%2Calgorithm%2Cburst%2Cfactor&fexp=900034%2C902305&algorithm=throttle-factor&itag=34&ipbits=0&burst=40&sver=3&expire=1260813600&key=yt1&signature=158633D23CDC66CC40D171EC8CB48A84B4BCB223.86F6779A36E14684B367BCCF91588556A871C421&factor=1.25&id=cf9829e68818de48

http://v16.lscache8.c.youtube.com/videoplayback?ip=0.0.0.0&sparams=id%2Cexpire%2Cip%2Cipbits%2Citag%2Calgorithm%2Cburst%2Cfactor&fexp=900034%2C902305&algorithm=throttle-factor&itag=5&ipbits=0&burst=40&sver=3&expire=1260813600&key=yt1&signature=91FAC179A7C02DD662942CCEB71D8BE05BD7B5D3.BC4BFC97F8BA02519CC817D2541E5D0548BC2C7C&factor=1.25&id=cf9829e68818de48
=> nil

IMO it’s far better to use the structure of the input to delimit the
data you’re looking for, rather than guessing where the start and end of
each datum is based on what you expect the datum to look like.
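
Put together as a script, the staged approach might look roughly like this. It is a sketch under the same guess about the format (a double-quoted string of comma-separated nn|URL fields). RAW is a shortened, hypothetical stand-in for the real string using example.com hosts, and extract_urls is a made-up helper name.

```ruby
require 'cgi'

# Hypothetical, shortened stand-in for the real string: a double-quoted,
# URL-encoded list of comma-separated "nn|URL" fields.
RAW = '"18%7Chttp%3A%2F%2Fv14.example.com%2Fvideoplayback%3Fitag%3D18%26sparams%3Did%252Cexpire' +
      '%2C34%7Chttp%3A%2F%2Fv7.example.com%2Fvideoplayback%3Fitag%3D34%26sparams%3Did%252Cexpire"'

# Returns a Hash mapping each numeric prefix to its URL.
def extract_urls(raw)
  unquoted = raw.sub(/\A"(.*)"\z/m) { $1 }   # strip the surrounding quotes
  fields = CGI.unescape(unquoted).split(',') # one decoding pass, then split on commas
  fields.map { |field| field.split('|', 2) }.to_h
end

URLS = extract_urls(RAW)
URLS.each { |num, url| puts "#{num}: #{url}" }

# Picking out only the second URL needs no regular expression at all.
puts URLS.values[1]
```

Each URL keeps its inner %2C sequences, so it would still need to be handled as an encoded URL by whatever consumes it.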

dvancoevorden · December 23, 2009, 7:39pm

Sorry, I forgot about this thread.

Thanks to everyone.

I think I have finally finished my script to download from YouTube and
extract the sound. Thanks again.

The script is very simple, but very useful for me, and it was a good way
to relearn the regular expressions I had forgotten.