Cannot remove multiple spaces

tbcox23 · February 7, 2009, 2:35pm

I’m baffled by this strange outcome - I cannot reduce multiple spaces
from a text file. This isn’t just a regex problem, somehow. I’m failing
to grasp something essential, but don’t know what it is. All help
appreciated, as usual!

Here is a demo of my problem, in which I try two different ways, and
both fail:

=== code ===

# h2t.rb

def main

  # conversion table spec

  conv = [
    [ '<h1>', 'h1. ' ], [ '<h2>', 'h2. ' ], [ '<h3>', 'h3. ' ],
    [ '<h4>', 'h4. ' ], [ '<h5>', 'h5. ' ], [ '<h6>', 'h6. ' ],
    [ /<\/h\d>/, '' ],
    [ " +", ' ' ]] # <= this last array element should do the trick, but doesn't

  data = open( 'h2t-in2.txt', 'r' ) { |f| ( f.readlines( data )).to_s }

  conv.each do |i|
    data.gsub!( i[0], i[1] )
  end
  data.squeeze(' ') # <= putting this here was sheer desperation, but even THIS fails

  open( "h2t-out.txt", "w" ) { |f| f.write( data ) }

end

%w(rubygems ruby-debug readline strscan logger fileutils).each{ |lib| require lib }

main

=== input file ===

<h1>Library catalog listing     x</h1>

<h3>Library catalog listing     x</h3>

<h2>Library catalog listing     x</h2>

p(subtitle). A complete listing of all material in the Library

=== output file ===

h1. Library catalog listing     x

h3. Library catalog listing     x

h2. Library catalog listing     x

p(subtitle). A complete listing of all material in the Library

==============

The "x"s in the input file are to show that while the end tags are being
removed the space before them is NOT.

t.

–

Tom C., MS MA, LMHC - Private practice Psychotherapist
Bellingham, Washington, U.S.A: (360) 920-1226
<< [email protected] >> (email)
<< TomCloyd.com >> (website)
<< sleightmind.wordpress.com >> (mental health weblog)

tbcox23 · February 7, 2009, 2:43pm

Tom C. wrote:

[ " +", ' ' ]] # <= this last array element should do the trick, but
doesn't

Ouch. THIS - [ / +/, ' ' ], substituted for [ " +", ' ' ] above fixes
it. I'm going blind, obviously.

t.


tbcox23 · February 7, 2009, 2:50pm

Hi –

On Sat, 7 Feb 2009, Tom C. wrote:

I'm going blind, obviously.

Just for fun, here's another way to write the method:

def main
  data = File.read("tom.txt")
  data.gsub!(/<(h[1-6])>/, "\\1. ")
  data.gsub!(/<\/h\d>/, "")
  data.squeeze!(' ')

  open("tom.out", "w") {|f| f.write(data) }
end

I think that does the same thing. Tweak to taste.

David

–
David A. Black / Ruby Power and Light, LLC
Ruby/Rails consulting & training: http://www.rubypal.com
Coming in 2009: The Well-Grounded Rubyist (The Well-Grounded Rubyist)

http://www.wishsight.com => Independent, social wishlist management!

tbcox23 · February 7, 2009, 9:51pm

David A. Black wrote:

data.gsub!(/<\/h\d>/, "")

That’s beautifully economical, and reveals a far better grasp of regex
than I was able to attain last night. However, I’m having trouble with
this line:

data.gsub!(/<(h[1-6])>/, "\\1. ")

It certainly works, but I don’t grasp the "\\1. " part. I haven’t yet
found anything that might shed light on this magic. How does it retain
the ‘h’ and whatever digit follows it? It looks somehow like “\\” ==
retain matched alpha, and the “1” does the same for matched digits, but
I really haven’t a clue. Can you elucidate just a bit?

Thanks!

Tom


tbcox23 · February 7, 2009, 3:23pm

See comments below.

On Sat, Feb 7, 2009 at 8:31 AM, Tom C. [email protected] wrote:

def main

  # conversion table spec

  conv = [
    [ '<h1>', 'h1. ' ], [ '<h2>', 'h2. ' ], [ '<h3>', 'h3. ' ],
    [ '<h4>', 'h4. ' ], [ '<h5>', 'h5. ' ], [ '<h6>', 'h6. ' ],
    [ /<\/h\d>/, '' ],
    [ " +", ' ' ]] # <= this last array element should do the trick, but doesn't

The last element means: replace every occurrence of a literal space
followed by a plus sign with a single space, because gsub treats a String
pattern as literal text. I assume that you were trying to write a regular
expression, which would make your last element [/ +/, ' '].
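
A minimal sketch of that difference, for anyone following along. The sample text and the method names (collapse_with_string, collapse_with_regexp) are made up for illustration and are not from the original script:

```ruby
# Hypothetical sample text; not taken from the original input file.
SAMPLE = "Library catalog listing     x"

# String pattern: searches for a literal space followed by a literal "+".
def collapse_with_string(text)
  text.gsub(" +", " ")
end

# Regexp pattern: " +" means "one or more spaces".
def collapse_with_regexp(text)
  text.gsub(/ +/, " ")
end

puts collapse_with_string(SAMPLE)   # unchanged: the text contains no " +"
puts collapse_with_regexp(SAMPLE)   # each run of spaces becomes one space
puts collapse_with_string("a +b")   # here the literal " +" is found and replaced
```

The string version only ever fires on text that really contains a space and a plus sign, which is why the original table entry appeared to do nothing.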

  data = open( 'h2t-in2.txt', 'r' ) { |f| ( f.readlines( data )).to_s }

  conv.each do |i|
    data.gsub!( i[0], i[1] )
  end
  data.squeeze(' ') # <= putting this here was sheer desperation, but even THIS fails

This does nothing because String#squeeze returns a new string that you
don't capture. Instead of the last array element above, you could do

data = data.squeeze(' ')

or call data.squeeze!(' ') to change the string in place.
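
Roughly, the two behaviours look like this. This is only a sketch; the sample string and the method names squeeze_demo and squeeze_in_place are invented for the example:

```ruby
# Shows that squeeze leaves the receiver alone unless the result is captured.
def squeeze_demo
  data = "h1. Library   catalog    x"   # hypothetical sample text

  data.squeeze(' ')          # returned string is discarded; data is untouched
  unchanged = data.dup

  data = data.squeeze(' ')   # capture the returned string
  [unchanged, data]
end

# Shows the in-place variant on a copy of the argument.
def squeeze_in_place(text)
  copy = text.dup
  copy.squeeze!(' ')         # modifies copy; returns nil if nothing changed
  copy
end

before, after = squeeze_demo
puts before
puts after
puts squeeze_in_place("a    b")
```

One thing to watch with squeeze! is its nil return value when there was nothing to squeeze, so it is safer not to chain further calls onto it.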

  open( "h2t-out.txt", "w" ) { |f| f.write( data ) }

end

%w(rubygems ruby-debug readline strscan logger fileutils).each{ |lib| require lib }

main

Hope that helps.

Regards,
Craig

tbcox23 · February 7, 2009, 9:55pm

Hi –

On Sun, 8 Feb 2009, Tom C. wrote:

data.gsub!(/<(h[1-6])>/, "\\1. ")

I really haven't a clue. Can you elucidate just a bit?

The \\1, \\2, etc. in the replacement string are pegged to the
parenthetical captures. "\\1. " means: the first capture (which is h
plus a digit), a period, and a space.

They work in single-quoted strings too, but there they're just \1, \2,
etc. There's some explanation in the ri docs for String#gsub.
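
A small sketch of the quoting rules, assuming a one-line input made up for the example (convert_heading is a hypothetical helper, not part of the original script):

```ruby
# Turns "<h2>Title</h2>" into "h2. Title" using a capture and a backreference.
def convert_heading(line)
  # "\\1" in double quotes and '\1' in single quotes are the same two
  # characters: a backslash followed by the digit 1.
  line.gsub(/<(h[1-6])>/, "\\1. ").gsub(/<\/h\d>/, "")
end

puts convert_heading("<h2>Library catalog listing</h2>")

# Both spellings produce the same replacement string.
puts "\\1" == '\1'

# With a single backslash inside double quotes, Ruby reads "\1" as an octal
# escape (the character with code 1), so no capture would be inserted.
puts "\1" == "\u0001"
```

So the doubled backslash is only there to get one real backslash through the double-quoted string literal.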

David


tbcox23 · February 7, 2009, 9:56pm

Tom C. wrote:

data.gsub!(/<(h[1-6])>/, "\\1. ")

It looks somehow like “\\” == retain matched alpha, and the “1” does the
same for matched digits, but I really haven’t a clue. Can you elucidate
just a bit?

Ah... regex! It's easy once you know it =D
The (...) in the regex defines a group. Here the group contains the 'h'
followed by one of the digits 1 to 6. In the second parameter, \\1
(double backslash because of double-quote escaping) stands for whatever
the group /h[1-6]/ matched.
That's it, nothing magic anymore.

tbcox23 · February 7, 2009, 10:01pm

On 07.02.2009 21:50, Tom C. wrote:

I really haven’t a clue. Can you elucidate just a bit?

The keyword is “capturing groups”. Parentheses in the regexp denote groups
of characters which can be referenced later via their numeric index, as
you have seen. You can even use them inside the pattern itself to match
repetitions:

/(fo+)\1/ =~ s # will match "fofo", "foofoo", "fooofooo" etc.
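
A quick way to try that out. The constant REPEATED_FO and the helper doubled_fo? are names invented for this sketch:

```ruby
# Matches "fo", "foo", "fooo", ... immediately followed by the same text again.
REPEATED_FO = /(fo+)\1/

# Returns true when the string contains such a doubled run.
def doubled_fo?(s)
  !(REPEATED_FO =~ s).nil?
end

%w[fofo foofoo fooofooo foofo].each do |s|
  puts "#{s}: #{doubled_fo?(s)}"
end
```

The last sample, "foofo", does not match because the second half has to repeat exactly what the group captured.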

Cheers

robert

tbcox23 · February 7, 2009, 10:10pm

William J. wrote:


puts IO.read("data2").gsub( /<(h\d)>/, '\1. ' ).gsub( /<\/h\d>/, "" ).
  squeeze(" ")

tbcox23 · February 7, 2009, 10:49pm

Robert K. wrote:

“\” == retain matched alpha, and the “1” does the same for matched
robert

David, badboy, Robert - thanks to you all for the very clear
explanations. I really couldn’t find info about this (yet). It IS
clear, once the explanation’s in hand. I have to say that regex is
becoming rather fun, now that I’m getting a little control of it.

I continue to be amazed at the generosity of this list in helping the
real amateurs here move things along. We get that AND we get to listen
in on all sorts of amazing and mysterious discussions of higher order
magic. Pretty cool.

t.


tbcox23 · February 7, 2009, 10:56pm

Robert K. wrote:

“\” == retain matched alpha, and the “1” does the same for matched
robert

I should have added this - as it was puzzling me and I just now “got it”
(and no one mentioned it) - for those who might be following along or
will come after: the “\\1” business in the replacement string isn’t
regex. That’s why I could find nothing about it in my regex sources!
It’s a String#gsub() convention, and is documented there.

OK…all darkness is vanquished. For now. (!) ~t.


tbcox23 · February 7, 2009, 10:06pm

Tom C. wrote:

The "x"s in the input file are to show that while the end tags are
being removed the space before them is NOT.

puts IO.readlines("data2").map{|line|
  line.sub( /<(h\d)>/, '\1. ' ).sub( /<\/h\d>/, "" ).
    squeeze(" ") }

--- output ---

h1. Library catalog listing x

h3. Library catalog listing x

h2. Library catalog listing x

p(subtitle). A complete listing of all material in the Library

tbcox23 · February 8, 2009, 10:22am

how to unsubscribe to this email list ?

From: Tom C. [email protected]
To: ruby-talk ML [email protected]
Sent: Saturday, 7 February 2009, 22:48:08
Subject: Re: cannot remove multiple space> nuts!


tbcox23 · February 8, 2009, 1:45pm

On 07.02.2009 22:48, Tom C. wrote:

I have to say that regex’s
becoming rather fun, now that I’m getting a little control of it.

If you like to dig deeper into the matter, I can recommend

It covers functionality, different regexp dialects and performance
considerations thoroughly. It might be a bit difficult to read if you do
not have a full CS background, but IMHO Jeff Friedl manages to keep
language theory at a minimum without sacrificing precision.

And one more particular Ruby hint: method String#[] is capable of
working with regular expression arguments, so you can do

# fetch the whole match
ip = input[/\d{1,3}(\.\d{1,3}){3}/]

# fetch group 1
name = input[/name=(\S+)/, 1]
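
For a runnable version of those two lines, here is a sketch with a made-up input line. INPUT, whole_match and first_group are illustrative names only:

```ruby
# Hypothetical log-style line for illustration.
INPUT = "name=alice ip=192.168.0.12 status=ok"

# Returns the text of the whole match, or nil if the pattern does not match.
def whole_match(input)
  input[/\d{1,3}(\.\d{1,3}){3}/]
end

# Returns only what capture group 1 matched, or nil if there is no match.
def first_group(input)
  input[/name=(\S+)/, 1]
end

puts whole_match(INPUT)
puts first_group(INPUT)
p whole_match("no address here")
```

Both forms return nil rather than raising when nothing matches, so they are handy for quick extraction without a separate match check.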

Cheers

robert

tbcox23 · February 9, 2009, 7:34am

Jesús Gabriel y Galán wrote:

Jesus.

Robert, Jesus,
Thanks to you both. Great stuff. I’ll be digging into it tonight.

t.


tbcox23 · February 8, 2009, 7:06pm

On Sun, Feb 8, 2009 at 1:44 PM, Robert K.
[email protected] wrote:

have a full CS background but IMHO Jeff Friedl manages to keep language
theory at a minimum without sacrificing precision.

If you want an introductory tutorial on regular expressions, you
can check here:

Jesus.
