Cutting a piece of text

Zdebel · February 12, 2006, 5:17pm

Hello!
I’ve started to learn Ruby and I’m amazed by it. Now I have a problem
that I can’t solve. If I have a string like this:
"<lyrics artist=XXX album=XXX title=XXX> Lalalalala </lyrics>" how can I
cut the " artist=XXX album=XXX title=XXX" part, so it would look like:
"<lyrics> Lalalalala </lyrics>"? Could you please help me?

Zdebel · February 12, 2006, 5:58pm

On Feb 12, 2006, at 10:18 AM, Zdebel wrote:

Hello!
I’ve started to learn Ruby and I’m amazed by it. Now I have a problem
that I can’t solve. If I have a string like this:
"<lyrics artist=XXX album=XXX title=XXX> Lalalalala </lyrics>" how can I
cut the " artist=XXX album=XXX title=XXX" part, so it would look like:
"<lyrics> Lalalalala </lyrics>"? Could you please help me?

You can do it with a regular expression like the following, but I
must stress that this isn’t very robust:

"<lyrics artist=XXX album=XXX title=XXX> Lalalalala </lyrics>".sub(/<(\w+)[^>]+>/, "<\\1>")
=> "<lyrics> Lalalalala </lyrics>"

Hope that helps.

James Edward G. II

Zdebel · February 12, 2006, 6:05pm

James G. wrote:

You can do it with a regular expression like the following, but I
must stress that this isn’t very robust:

"<lyrics artist=XXX album=XXX title=XXX> Lalalalala </lyrics>".sub(/<(\w+)[^>]+>/, "<\\1>")
=> "<lyrics> Lalalalala </lyrics>"

Hope that helps.

James Edward G. II

:O, wow, it works. I wish I knew how this (/<(\w+)[^>]+>/, "<\\1>")
regular expression works :). Anyway, thank you, you helped me very much.

Zdebel · February 12, 2006, 6:05pm

On Sunday 12 February 2006 17:18, Zdebel wrote:

Hello!
I’ve started to learn Ruby and I’m amazed by it. Now I have a problem
that I can’t solve. If I have a string like this:
"<lyrics artist=XXX album=XXX title=XXX> Lalalalala </lyrics>" how can I
cut the " artist=XXX album=XXX title=XXX" part, so it would look like:
"<lyrics> Lalalalala </lyrics>"? Could you please help me?

The very geeky, and most probably least error-prone way would be
whacking the string with a DOM parser, clearing the attributes, and then
printing it out again. Unfortunately, I haven’t been doing any DOM
manipulation in Ruby, so I can’t provide code.

David V.

Zdebel · February 12, 2006, 6:18pm

Learn regular expressions. Here’s a not great example:

a = "<lyrics artist=XXX album=XXX title=XXX> Lalalalala </lyrics>"
b = a.gsub(/\w*=\w*/, "")
c = b.gsub(/\s/, "")
print c, "\n"

<lyrics>Lalalalala</lyrics>

A slightly (yes very slightly) more realistic example:

a = '<lyrics artist="XXX" album="XXX" title="XXX"> Lalalalala </lyrics>'
b = a.gsub(/\w*="\w*"/, "")
c = b.gsub(/\s/, "")
print c, "\n"

<lyrics>Lalalalala</lyrics>

And what if there are spaces in a tag:

a = '
Lalalalala '
b = a.gsub(/\w*=".*"/, "")
c = b.gsub(/\s/, "")

Zdebel · February 12, 2006, 6:14pm

On Feb 12, 2006, at 11:05 AM, David V. wrote:

The very geeky, and most probably least error-prone way would be
whacking the string with a DOM parser, clearing the attributes, and then
printing it out again. Unfortunately, I haven’t been doing any DOM
manipulation in Ruby, so I can’t provide code.

The following is how you do it for valid XML, but the posted example
wasn’t quite:

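#!/usr/local/bin/ruby -w

require "rexml/document"

# Attribute values must be quoted for the string to be valid XML.
doc = "<lyrics artist='XXX' album='XXX' title='XXX'> Lalalalala </lyrics>"
xml = REXML::Document.new(doc)
xml.root.attributes.clear  # drop every attribute on the root element
xml.write                  # print the document to standard output
puts

__END__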

James Edward G. II

Zdebel · February 12, 2006, 6:33pm

A big thank you to all of you for such a response. This helped me
a lot and my script is working, but I will practice more using your
advice.

Zdebel · February 12, 2006, 6:20pm

On Feb 12, 2006, at 11:05 AM, Zdebel wrote:

I wish I knew how this (/<(\w+)[^>]+>/, "<\\1>")
regular expression works :).

It reads:

/
  <        # find a < character
  (        # capture this next part into $1 (\1 in the replacement string)
    \w+    # followed by one or more word characters
  )        # end capture
  [^>]+    # followed by one or more non > characters
  >        # and finally a > character
/x

The replacement just restores the <\w+> and leaves out the [^>]+ part
(the space and attributes).

Hope that helps.

James Edward G. II
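
The explanation above can be tried out with a minimal sketch like the following. The input string is reconstructed from the question in this thread, and the helper name strip_first_tag_attributes is made up for illustration. The second call shows one reason the pattern is "not very robust": when a tag has no attributes, [^>]+ still needs at least one character, so it takes the last letter of the tag name.

```ruby
# Input reconstructed from the question in this thread (an assumption).
INPUT = "<lyrics artist=XXX album=XXX title=XXX> Lalalalala </lyrics>"

# Hypothetical helper name, used only for this illustration.
# Single quotes keep \1 as a backreference; in double quotes it must be "\\1".
def strip_first_tag_attributes(text)
  text.sub(/<(\w+)[^>]+>/, '<\1>')
end

puts strip_first_tag_attributes(INPUT)       # prints <lyrics> Lalalalala </lyrics>

# Caveat: with no attributes, \w+ gives up its last letter to [^>]+.
puts strip_first_tag_attributes("<lyrics>")  # prints <lyric>
```

A pattern such as /<(\w+)\s[^>]*>/ would avoid that particular quirk, but a real parser is still the safer choice for anything beyond simple input.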

Zdebel · February 12, 2006, 7:08pm

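James Edward G. II wrote:

"<lyrics artist=XXX album=XXX title=XXX> Lalalalala </lyrics>".sub(/<(\w+)[^>]+>/, "<\\1>")
=> "<lyrics> Lalalalala </lyrics>"

A reluctant quantifier would be a bit faster:

p "<lyrics artist=XXX album=XXX title=XXX> Lalalalala </lyrics>".gsub(/<(\w+).*?>/, "<\\1>")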

lopex

Zdebel · February 12, 2006, 9:21pm

On Sunday 12 February 2006 19:30, James Edward G. II wrote:

lyrics>".gsub(/<(\w+).*?>/, "<\\1>")
/<(w+)[^>]+>/ 7.170000 0.030000 7.200000 ( 7.227075)
x.report("/<(\w+)[^>]+>/") do

James Edward G. II

The nongreedy match has to “back up” and retry on every character after
the tag name, whereas James’ [^>] doesn’t ever have to back up. In fact,
even a greedy .* would probably be faster than a nongreedy one in this
case.

Gotta love the black art that is optimizing regexps.

David V.

Zdebel · February 12, 2006, 9:38pm

David V. wrote:

The nongreedy match has to “back up” and retry on every character after the
tag name, whereas James’ [^>] doesn’t ever have to back up. In fact, even a
greedy .* would probably be faster than a nongreedy one in this case.

Gotta love the black art that is optimizing regexps.

Ooops… You are right!

But as I read, greedy quantifiers do backtrack as well (though not in the
case above).

/a+aa/ =~ "aaaaa"
will backtrack two characters.

Only a possessive quantifier (in Oniguruma, e.g.) consumes in the truly
greedy way, without giving anything back.

So
/a++aa/ =~ "aaaaa"
won’t match.

lopex
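
The difference described above can be checked with a short sketch. It assumes Ruby 1.9 or later, where the built-in regex engine accepts possessive quantifiers; the variable names are only illustrative.

```ruby
text = "aaaaa"

# Greedy: a+ first takes all five a's, then gives two back so "aa" can match.
greedy = /a+aa/.match(text)

# Possessive: a++ takes all the a's it can and never gives any back,
# so the trailing "aa" has nothing left to match.
possessive = /a++aa/.match(text)

# An atomic group behaves the same way as the possessive quantifier here.
atomic = /(?>a+)aa/.match(text)

p greedy && greedy[0]  # prints "aaaaa"
p possessive           # prints nil
p atomic               # prints nil
```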

Zdebel · February 12, 2006, 7:30pm

On Feb 12, 2006, at 12:08 PM, Marcin Mielżyński wrote:

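James Edward G. II wrote:

"<lyrics artist=XXX album=XXX title=XXX> Lalalalala </lyrics>".sub(/<(\w+)[^>]+>/, "<\\1>")
=> "<lyrics> Lalalalala </lyrics>"

A reluctant quantifier would be a bit faster:

p "<lyrics artist=XXX album=XXX title=XXX> Lalalalala </lyrics>".gsub(/<(\w+).*?>/, "<\\1>")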

Are you sure?

$ ruby regexp_time.rb
Rehearsal -------------------------------------------------
/<(w+)[^>]+>/   7.210000   0.030000   7.240000 (  7.266166)
/<(w+).*?>/     7.710000   0.020000   7.730000 (  7.757304)
--------------------------------------- total: 14.970000sec

                    user     system      total        real
/<(w+)[^>]+>/   7.170000   0.030000   7.200000 (  7.227075)
/<(w+).*?>/     7.730000   0.020000   7.750000 (  7.777196)
$ cat regexp_time.rb
#!/usr/local/bin/ruby -w

require "benchmark"

tests = 1000000
data = "<lyrics artist=XXX album=XXX title=XXX> Lalalalala </lyrics>"

Benchmark.bmbm do |x|
  x.report("/<(\w+)[^>]+>/") do
    tests.times { data.sub(/<(\w+)[^>]+>/, "<\\1>") }
  end
  x.report("/<(\w+).*?>/") do
    tests.times { data.sub(/<(\w+).*?>/, "<\\1>") }
  end
end

__END__

James Edward G. II

Zdebel · February 12, 2006, 10:35pm

On Sunday 12 February 2006 21:38, Marcin Mielżyński wrote:

But as I read greedy quantifiers do backtrack as well (but not in the
won’t match.

lopex

Yes, they do backtrack. The point is in using the one that you expect to
backtrack less.

Since in this case we very well knew there’s going to be quite a few
characters after the first word, the nongreedy quantifier was slower.

The REAL black magic is whether a greedy or possessive quantification
of the [^>] variant would be faster. *snicker* Anyone running 1.9 able
to BM this?

David V.
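
One way to try the comparison asked about above is a small timing sketch like the one below. It assumes Ruby 1.9 or later (for the possessive [^>]++), uses an input string reconstructed from the question, and the names SAMPLE, PATTERNS, ITERATIONS and time_sub are made up for this example. Timings will vary by machine and Ruby version, so no results are claimed here.

```ruby
# Input reconstructed from the question in this thread (an assumption).
SAMPLE = "<lyrics artist=XXX album=XXX title=XXX> Lalalalala </lyrics>"

# The three variants discussed in the thread.
PATTERNS = {
  "greedy [^>]+"      => /<(\w+)[^>]+>/,
  "possessive [^>]++" => /<(\w+)[^>]++>/,
  "lazy .*?"          => /<(\w+).*?>/
}

# Kept small so the sketch runs quickly; raise it for steadier numbers.
ITERATIONS = 100_000

# Hypothetical helper: returns the elapsed seconds for repeated substitutions.
def time_sub(pattern, text, iterations)
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  iterations.times { text.sub(pattern, '<\1>') }
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
end

PATTERNS.each do |label, pattern|
  printf("%-20s %.3fs\n", label, time_sub(pattern, SAMPLE, ITERATIONS))
end
```

All three patterns produce the same result on this input, so the comparison is purely about how much backtracking each one does.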

Zdebel · February 13, 2006, 8:58am

Zdebel wrote:

Hello!
I’ve started to learn Ruby and I’m amazed by it. Now I have a problem
that I can’t solve. If I have a string like this:
"<lyrics artist=XXX album=XXX title=XXX> Lalalalala </lyrics>" how can I
cut the " artist=XXX album=XXX title=XXX" part, so it would look like:
"<lyrics> Lalalalala </lyrics>"? Could you please help me?


p "<lyrics artist=XXX album=XXX title=XXX> Lalalalala </lyrics>".
  sub(/\s+[^<>]*(?=>)/, '')

p "<lyrics artist=XXX album=XXX title=XXX> Lalalalala </lyrics>".
  scan( /\G ( [^<]+ ) | \G ( < \S* ) [^>]* ( > ) /x ).
  flatten.compact.join