Capitalizing words

bodikp · April 8, 2008, 8:53pm

Hi,
I need to capitalize the words in a string I find in XML files.

The string that’s in (.*) below is what I need to change. I just want to
capitalize the first letter of each word in the string.

I’m trying this, in a test:

Dir.chdir("C:/users/pb4072/documents")
file = File.read("test1.txt")
file.gsub(/^(.*)</emph>/) do |match|
  array = $1.split
  array.each do |word|
    word.capitalize!
  end
  newfile = File.open("c:/users/pb4072/documents/test1.txt", "w") { |f|
    f.print array }
end

And, I’m getting this:

#(.*)</emph>theQuickBrownFoxJumpedOverTheLazyDog.

I want this:

The Quick Brown Fox Jumped Over The
Lazy Dog.</emph>/

Thanks,
Peter

bodikp · April 8, 2008, 9:29pm

I don’t know what the original text looks like in test1.txt, but this
might point you in the right direction…

irb(main):001:0> s = "the quick brown fox"
=> "the quick brown fox"
irb(main):002:0> s.split.map {|w| w.capitalize}.join ' '
=> "The Quick Brown Fox"
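
One thing to watch with this approach, sketched here on an assumed two-line sample rather than the real file: split with no arguments breaks on any whitespace, so joining with a single space folds the line break into a space.

```ruby
# Assumed sample text; the real test1.txt also carries XML tags.
s = "THE QUICK BROWN FOX JUMPED OVER THE\nLAZY DOG."

# split (no argument) cuts on spaces and newlines alike.
joined = s.split.map { |w| w.capitalize }.join(' ')
puts joined  # The Quick Brown Fox Jumped Over The Lazy Dog.
```

If the original line breaks matter, a substitution that touches only the words would be the safer route.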

Todd

bodikp · April 9, 2008, 1:09am

Dir.chdir("C:/users/pb4072/documents") do |d|
  file = File.read("test1.txt")
  output = file.gsub(%r{^()(.*)(</emph>)}m) do |match|
    "#{$1}#{$2.gsub(%r{\b\w+\b}){|w|w.capitalize}}#{$3}"
  end
  File.open("test1.txt", "w") { |f| f.write output }
end

Note the use of three capture groups to get the unchanged initial and
final parts as well as the middle part that is altered. The
%r{\b\w+\b} is a Regexp that matches words: \b is a word boundary and
\w is a word character (short for [a-zA-Z0-9_]). Also note that the
String#capitalize! you used returns nil if no change is made.
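
A small sketch of those two points, using a made-up sample string rather than the real file: the word regexp capitalizes each word where it stands and leaves the whitespace alone, and capitalize! hands back nil when the word was already capitalized.

```ruby
# Sample text is made up for illustration.
text = "the quick brown fox jumped over the\nlazy dog."

# Only the words are replaced, so spaces and the newline survive.
titled = text.gsub(/\b\w+\b/) { |w| w.capitalize }
puts titled

# capitalize! modifies the receiver and returns nil if nothing changed.
already = "Fox"
result = already.capitalize!
p result  # nil
```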

-Rob

Rob B. http://agileconsultingllc.com

bodikp · April 9, 2008, 2:22pm

Thanks, Todd.
The original text is just:
THE QUICK BROWN FOX JUMPED OVER THE
LAZY DOG.</emph>/

Should I just make your “s” equal to $1 from my original gsub?

-Peter

bodikp · April 9, 2008, 2:24pm

Thanks, Rob. This works beautifully, except that I need that last
</emph> in my output. It’s being stripped with your code. I don’t see
why, because it’s just your $3, isn’t it?

bodikp · April 9, 2008, 4:13pm

You said to Todd:
The original text is just:
THE QUICK BROWN FOX JUMPED OVER THE
LAZY DOG.</emph>/

I assumed that the “</emph>/” part was a cut-n-paste of a regexp for
the email (which is one reason that I changed from // to the %r{}
construction of the Regexp, so the / wouldn’t have to be escaped). You
may have to change the second group to (.*?) [reluctant match rather
than greedy match] or adjust the third group to exactly match your
input.

-Rob

Rob B. http://agileconsultingllc.com

bodikp · April 9, 2008, 8:14pm

hi peter!

Peter B. [2008-04-09 20:04]:

Dir.chdir("C:/users/pb4072/documents") do |d|
  file = File.read("test1.txt")
  output = file.gsub(%r{^()(.*)(</emph>)}m) do |match|
    "#{$1}#{$2.gsub(%r{\b\w+\b}){|w|w.capitalize}}#{$3}"
  end
  File.open("test1.txt", "w") { |f| f.write output }
end

Here’s what I get. It works great, but, I don’t understand why
the $3 text is simply blown away.

because it’s reset when you’re doing that gsub on $2. the capture
variables only refer to the last match. so you have to capture
them into local variables first (can’t think of a better way right now).
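
The reset can be seen directly with a rough sketch like the one below. The <emph> tag and the sample line are assumptions for the demo; the point is only that $3 reads differently before and after the inner gsub.

```ruby
# Assumed sample line; the real file's opening tag may differ.
line = "<emph>THE LAZY DOG</emph>"
seen = nil

line.gsub(%r{(<emph>)(.*)(</emph>)}) do
  before = $3                                # still the outer match
  $2.gsub(/\b\w+\b/) { |w| w.capitalize }    # inner match replaces $~
  after = $3                                 # word matches have no group 3
  seen = [before, after]
  ""
end

p seen  # ["</emph>", nil]
```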

cheers
jens

bodikp · April 9, 2008, 8:04pm

Rob,
So, here’s my original file:
THE QUICK BROWN FOX JUMPED OVER THE
LAZY DOG.

Here’s my code, from you:
Dir.chdir("C:/users/pb4072/documents") do |d|
  file = File.read("test1.txt")
  output = file.gsub(%r{^()(.*)(</emph>)}m) do |match|
    "#{$1}#{$2.gsub(%r{\b\w+\b}){|w|w.capitalize}}#{$3}"
  end
  File.open("test1.txt", "w") { |f| f.write output }
end

Here’s what I get. It works great, but, I don’t understand why the $3
text is simply blown away.
The Quick Brown Fox Jumped Over The
Lazy Dog.

Thanks,
Peter

bodikp · April 9, 2008, 8:47pm

On Apr 9, 2008, at 2:13 PM, Jens W. wrote:

Here’s what I get. It works great, but, I don’t understand why
Jens W., Dipl.-Bibl. (FH)
prometheus - Das verteilte digitale Bildarchiv für Forschung & Lehre
Kunsthistorisches Institut der Universität zu Köln
Albertus-Magnus-Platz, D-50923 Köln
Tel.: +49 (0)221 470-6668, E-Mail: [email protected]
http://www.prometheus-bildarchiv.de/

Ah yes! Good catch, Jens.

Peter, you only need to capture $3, but it would make sense to get
them all:

head, content, tail = $1, $2, $3
"#{head}#{content.gsub(%r{\b\w+\b}){|w|w.capitalize}}#{tail}"
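
Put together as something runnable, it might look like the sketch below. The helper name capitalize_between is hypothetical, and the tags are passed in as arguments because the exact opening tag is an assumption here.

```ruby
# Hypothetical helper: capitalizes the words between open_tag and close_tag.
def capitalize_between(text, open_tag, close_tag)
  re = /^(#{Regexp.escape(open_tag)})(.*?)(#{Regexp.escape(close_tag)})/m
  text.gsub(re) do
    # Copy the captures first; the inner gsub below overwrites $1..$3.
    head, content, tail = $1, $2, $3
    "#{head}#{content.gsub(/\b\w+\b/) { |w| w.capitalize }}#{tail}"
  end
end

# Assumed sample input with a guessed tag name.
sample = "<emph>THE QUICK BROWN FOX JUMPED OVER THE\nLAZY DOG.</emph>"
puts capitalize_between(sample, "<emph>", "</emph>")
```

Reading the file and writing the result back would wrap around this the same way as in the earlier script.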

-Rob

Rob B. http://agileconsultingllc.com

bodikp · April 9, 2008, 8:24pm

On Apr 9, 2008, at 2:04 PM, Peter B. wrote:

So, here’s my original file:
THE QUICK BROWN FOX JUMPED OVER THE
LAZY DOG.
OK, change this to a regexp:

surround with the regexp literal bits:
%r{THE QUICK BROWN FOX JUMPED OVER THE
LAZY DOG.}m

add the grouping ()'s:
%r{()(THE QUICK BROWN FOX JUMPED OVER THE
LAZY DOG.)()}m

replace text with wildcards .* or .*?:
%r{()(.*?)()}m

(optional?) add anchor ^:
%r{^()(.*?)()}m

I’m assuming that is not the WHOLE file since the
tags are not closed. It is quite likely that .* is slurping a lot
more than you think, so that’s why I’ve changed this to .*?, which
matches as little as possible while continuing to succeed.
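
To make the difference concrete, here is a sketch on invented input with two tagged spans (the <emph> tag name is an assumption): the greedy form runs to the last closing tag, while the reluctant form stops at the first one it meets.

```ruby
# Invented input with two tagged spans and plain text between them.
text = "<emph>ONE</emph>\nplain text\n<emph>TWO</emph>"

greedy = text.scan(%r{<emph>(.*)</emph>}m).flatten
lazy   = text.scan(%r{<emph>(.*?)</emph>}m).flatten

p greedy  # ["ONE</emph>\nplain text\n<emph>TWO"]
p lazy    # ["ONE", "TWO"]
```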

-Rob

Rob B. http://agileconsultingllc.com

bodikp · April 10, 2008, 12:07am

Peter B. [2008-04-09 20:04]:

output = file.gsub(%r{^()(.*)(</emph>)}m) do |match|
  "#{$1}#{$2.gsub(%r{\b\w+\b}){|w|w.capitalize}}#{$3}"
end

oh, and for the fun of it, here’s what you can do with oniguruma:

Oniguruma::ORegexp.new(
  '(?<=^).+(?=)', 'm'
).gsub(file) { |md|
  md[0].gsub(%r{\b\w+\b}) { |w| w.capitalize }
}

(note that i needed to change ‘.*’ to ‘.+’)
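
On Ruby 1.9 or later the built-in Regexp accepts look-behind too, so the same idea can be sketched without the gem. The <emph> tag below is an assumption, and this version uses a reluctant .+? where the snippet above uses .+.

```ruby
# Assumed sample; only the text between the tags is matched.
text = "<emph>THE LAZY DOG.</emph>"

# Look-behind and look-ahead keep the tags out of the match itself,
# so the block only ever sees the inner text.
result = text.gsub(%r{(?<=<emph>).+?(?=</emph>)}m) do |inner|
  inner.gsub(/\b\w+\b/) { |w| w.capitalize }
end

puts result  # <emph>The Lazy Dog.</emph>
```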

cheers
jens

bodikp · April 9, 2008, 9:57pm

Rob B. [2008-04-09 20:46]:

head, content, tail = $1, $2, $3
"#{head}#{content.gsub(%r{\b\w+\b}){|w|w.capitalize}}#{tail}"

now here’s a quick implementation that passes the MatchData object
into the block:

http://prometheus.khi.uni-koeln.de/svn/scratch/ruby-nuggets/lib/nuggets/string/sub_with_md.rb

so that code effectively becomes:

str.gsub_with_md(re) { |md|
  "#{md[1]}#{md[2].gsub(%r{\b\w+\b}){|w|w.capitalize}}#{md[3]}"
}
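
For readers without the linked file, a rough stand-in could look like this. It is an assumption about how such a helper might be written, not the linked source, and it is a plain method rather than a String extension. The tag in the sample is also assumed.

```ruby
# Assumed stand-in: yields the MatchData for each match instead of the string.
def gsub_with_md(str, re)
  str.gsub(re) { yield Regexp.last_match }
end

out = gsub_with_md("<emph>THE LAZY DOG.</emph>", %r{(<emph>)(.*?)(</emph>)}m) do |md|
  # md is a separate object, so the inner gsub cannot clobber its groups.
  "#{md[1]}#{md[2].gsub(/\b\w+\b/) { |w| w.capitalize }}#{md[3]}"
end

puts out  # <emph>The Lazy Dog.</emph>
```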

cheers
jens

bodikp · April 10, 2008, 2:26pm

Sorry, Jens, but, I have no idea what you’re referring to here. I
googled oniguruma. I see what it is. I installed it, but, it didn’t seem
to install successfully. Do I do a “require oniguruma” at the top of my
script?

bodikp · April 10, 2008, 4:20pm

Thanks. But, again, do I need to do a “require” for oniguruma at the
top?
Cheers,
Peter

bodikp · April 10, 2008, 2:48pm

Peter B. [2008-04-10 14:26]:

Do I do a “require oniguruma” at the top of my script?

sure. but you really don’t need it to solve your task at hand.

it’s just the new regexp engine for ruby 1.9 and sometimes i like to
do some stuff with it that the default engine of 1.8 can’t do
(zero-width look-behind in this case).

you can still simplify your substitution by using the look-ahead
(which 1.8 does understand), so you get rid of the third capture:

file.gsub(%r{^(
face="b">)(.*)(?=</emph>)}m) {
  "#{$1}#{$2.gsub(%r{\b\w+\b}) { |w| w.capitalize }}"
}
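
A self-contained sketch of the look-ahead variant follows. The opening tag <emph face="b"> is a guess pieced together from the fragments in this thread, and .*? is used instead of .* so the match stops at the first closing tag.

```ruby
# Assumed sample line; the opening tag is a guess.
text = "<emph face=\"b\">THE LAZY DOG.</emph> tail"

out = text.gsub(%r{^(<emph face="b">)(.*?)(?=</emph>)}m) do
  # $1 is interpolated before the inner gsub runs, and nothing is read
  # from the outer match afterwards, so no local copies are needed.
  "#{$1}#{$2.gsub(/\b\w+\b/) { |w| w.capitalize }}"
end

puts out  # <emph face="b">The Lazy Dog.</emph> tail
```

The closing tag is never part of the match, so it stays in the output untouched.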

cheers
jens

bodikp · April 10, 2008, 5:56pm

OK. Thanks!

bodikp · April 10, 2008, 5:24pm

Peter B. [2008-04-10 16:20]:

Thanks. But, again, do I need to do a “require” for oniguruma at
the top?

if you want to use oniguruma, then yes, you have to require it first.
