Forum: Ruby Simple substitutions

Peter Bailey (peterbailey)
on 2006-04-12 13:58
Hello,
Some of you all helped this newbie a couple of weeks ago. I'm getting
there!

I have a need to parse PostScript print files on a daily basis. Some of
them are over 10 MB. One thing I need to find out is the number of pages
in the files. There's a simple one-liner at the top of each .ps file
that states the number of pages. It reads:
                             %%Pages: xx
It's a line of its own.

I simply want to find that line in my source file and then write a
corresponding line in my output file:
                             Number of pages: xx

I tried the following in IRB, but I get the syntax error shown below.

     File.open("psout.txt", "w") do |output|
       File.foreach("test1.ps") do |line|
         line.sub!(/\%\%Pages: ([0-9]{1,5})/"Pages: \1"/)
         output << line
       end
     end

    SyntaxError: compile error
    (irb):26: parse error, unexpected tSTRING_BEG, expecting ')'
    line.sub!(/\%\%Pages: ([0-9]{1,5})/"Pages: \1"/)
                                    ^
    (irb):26: parse error, unexpected ')'
           from (irb):29

Is there a significance to the ^ under the "P" in my syntax error?
What's it telling me? My sub requires parentheses. Is there a conflict
with my grouping parentheses inside my regular expression?

Another request for help--my translation above would basically move the
entire contents of my source file to the target file, with the exception
of the one line I want changed. That's overkill for me. I just want a
few lines of information in my target file. Would you suggest I just do
a simple search, using matching (=~), and then write about that find in
my target file, or, would it be better to do mass deletions of
everything I don't want from my source to my target files? In other
words, should I say "in this file, find this, then say this . . ." or,
"convert this entire file, but just change this?"

Thank you.
Ross Bamford (Guest)
on 2006-04-12 14:22
(Received via mailing list)
Hi,

On Wed, 2006-04-12 at 20:58 +0900, Peter Bailey wrote:

>
>     (irb):26: parse error, unexpected tSTRING_BEG, expecting ')'
>     line.sub!(/\%\%Pages: ([0-9]{1,5})/"Pages: \1"/)
>                                     ^
>     (irb):26: parse error, unexpected ')'
>            from (irb):29
>
> Is there a significance to the ^ under the "P" in my syntax error?
> What's it telling me? My sub requires parentheses. Is there a conflict
> with my grouping parentheses inside my regular expression?
>

ruby is just complaining about a Perl-style s/// regexp. :) The correct
syntax here would be something like:

	line.sub!(/\%\%Pages: ([0-9]{1,5})/) { "Pages: #$1" }

(You can also use the \1 escapes in a replacement string, but only if the
backslash survives string quoting, which is probably why it has always
come out hit-and-miss for me, so I stick to the block form).
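
For reference, here is a minimal sketch of the three replacement forms side by side. The sample line is made up for illustration. The backreference form does work as long as the backslash reaches `sub` intact, which in practice means single quotes or a doubled backslash.

```ruby
line = "%%Pages: 42\n"   # made-up sample line

# Block form: $1 holds the first capture group of the match.
block_form = line.sub(/%%Pages: ([0-9]{1,5})/) { "Pages: #{$1}" }

# String form, single quotes: \1 is passed through to sub untouched.
single_quoted = line.sub(/%%Pages: ([0-9]{1,5})/, 'Pages: \1')

# String form, double quotes: the backslash must be doubled, because
# "\1" on its own is an octal escape for the character with code 1.
double_quoted = line.sub(/%%Pages: ([0-9]{1,5})/, "Pages: \\1")

puts block_form
```

All three variables end up holding the same string. Note that `%` has no special meaning in a Ruby regexp, so it does not need escaping.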

> Another request for help--my translation above would basically move the
> entire contents of my source file to the target file, with the exception
> of the one line I want changed. That's overkill for me. I just want a
> few lines of information in my target file. Would you suggest I just do
> a simple search, using matching (=~), and then write about that find in
> my target file, or, would it be better to do mass deletions of
> everything I don't want from my source to my target files? In other
> words, should I say "in this file, find this, then say this . . ." or,
> "convert this entire file, but just change this?"

I would probably go with extracting just what I needed from the file,
though if you can safely exclude portions of it before running matches
(for example a large data section) that may improve performance.

Also, I don't know if there are any Ruby postscript libraries, but it
might be worth a search if you haven't already...

Hope that helps,
Peter Szinek (Guest)
on 2006-04-12 14:25
(Received via mailing list)
Hello,

>          line.sub!(/\%\%Pages: ([0-9]{1,5})/"Pages: \1"/)
How about this:

line.sub!(/%%Pages: (\d{1,5})/){"Pages: " + $1}

bw,
Peter
Peter Bailey (peterbailey)
on 2006-04-12 15:04
Ross Bamford wrote:
> Hi,
>
> On Wed, 2006-04-12 at 20:58 +0900, Peter Bailey wrote:
>
>>
>>     (irb):26: parse error, unexpected tSTRING_BEG, expecting ')'
>>     line.sub!(/\%\%Pages: ([0-9]{1,5})/"Pages: \1"/)
>>                                     ^
>>     (irb):26: parse error, unexpected ')'
>>            from (irb):29
>>
>> Is there a significance to the ^ under the "P" in my syntax error?
>> What's it telling me? My sub requires parentheses. Is there a conflict
>> with my grouping parentheses inside my regular expression?
>>
>
> ruby is just complaining about a Perl-style s/// regexp. :) The correct
> syntax here would be something like:
>
> 	line.sub!(/\%\%Pages: ([0-9]{1,5})/) { "Pages: #$1" }
>
> (You can sometimes use the \1 escapes in the replacement but it's always
> come out hit-and-miss for me, so I stick to the block form).
>
>> Another request for help--my translation above would basically move the
>> entire contents of my source file to the target file, with the exception
>> of the one line I want changed. That's overkill for me. I just want a
>> few lines of information in my target file. Would you suggest I just do
>> a simple search, using matching (=~), and then write about that find in
>> my target file, or, would it be better to do mass deletions of
>> everything I don't want from my source to my target files? In other
>> words, should I say "in this file, find this, then say this . . ." or,
>> "convert this entire file, but just change this?"
>
> I would probably go with extracting just what I needed from the file,
> though if you can safely exclude portions of it before running matches
> (for example a large data section) that may improve performance.
>
> Also, I don't know if there are any Ruby postscript libraries, but it
> might be worth a search if you haven't already...
>
> Hope that helps,


Thank you, Ross.
Yes, my regex experience is indeed with Perl, and Perl only. I haven't
seen anything else in my docs, though, about Ruby being different with
regexes, except for the stuff about true objects with Regexp#match. Both
your response and Peter's below, though, show me something entirely new.
It looks like you're putting the replacement phrase into a block,
correct? It's kind of ugly at first, I must say, but, I see some sense
in it. Your suggestion worked for me, by the way. Now, I'd like to
streamline my conversion and aim for the much smaller target file I
mentioned above. Right now, my target file is still 10 MB, with just
that one pages line replaced with what I want. So, instead of working in
line mode, do you think I should just read in the entire file and use
".*" before and after my sub expression to get rid of everything else?
Peter Bailey (peterbailey)
on 2006-04-12 15:05
Peter Szinek wrote:
> Hello,
>
>>          line.sub!(/\%\%Pages: ([0-9]{1,5})/"Pages: \1"/)
> How about this:
>
> line.sub!(/%%Pages: (\d{1,5})/){"Pages: " + $1}
>
> bw,
> Peter

Yes, that worked for me. Thank you, Peter. Like Ross' response above, I
hadn't yet seen a regex expressed this way, where a block is used for
the replacement part of a regex. Different, but powerful. Thanks again.
Ross Bamford (Guest)
on 2006-04-12 15:24
(Received via mailing list)
On Wed, 2006-04-12 at 22:04 +0900, Peter Bailey wrote:
>  Now, I'd like to
> streamline my conversion and aim for the much smaller target file I
> mentioned above. Right now, my target file is still 10 MB, with just
> that one pages line replaced with what I want. So, instead of working in
> line mode, do you think I should just read in the entire file and use
> ".*" before and after my sub expression to get rid of everything else?
>

Well, memory is cheap these days and I have to admit I'm much less
careful about how I use it (unless there's a specific reason to conserve
it) but in this case, if there's only one pages line in your file, then
how about doing something like (N.B. untested code):

	pages = nil
	File.foreach("test1.ps") do |line|
	  if line =~ /\%\%Pages: ([0-9]{1,5})/
	    pages = $1
	    break
	  end
	end

	if pages
	  File.open("psout.txt", "w") { |out| out << "Pages: #{pages}" }
	end

If you do choose to read the whole thing in at once, one way to achieve
that might be:

	pages = nil
	File.read('test1.ps').scan(/\%\%Pages: ([0-9]{1,5})/) do
	  pages = $1
	  break
	end

	if pages
	  File.open("psout.txt", "w") { |out| out << "Pages: #{pages}" }
	end
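
To make the effect of the early exit visible, here is a small self-contained sketch of the line-by-line variant. It reads from a `StringIO` with invented contents instead of a real .ps file, and the `lines_read` counter is only there for illustration. `File.foreach` should behave the same way with `break`.

```ruby
require "stringio"

# Stand-in for a PostScript file; the contents are invented for the example.
source = StringIO.new("%!PS-Adobe-3.0\n%%Pages: 12\n" + "filler\n" * 1000)

pages = nil
lines_read = 0
source.each_line do |line|
  lines_read += 1
  if line =~ /%%Pages: ([0-9]{1,5})/
    pages = $1
    break   # leave the each_line loop; the remaining lines are never read
  end
end

puts "Pages: #{pages} (found after #{lines_read} lines)"
```

Here the loop stops on the second line and never touches the thousand filler lines, which is the point of `break` when the header sits near the top of a large file.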
Peter Bailey (peterbailey)
on 2006-04-12 15:35
Ross Bamford wrote:
> On Wed, 2006-04-12 at 22:04 +0900, Peter Bailey wrote:
>>  Now, I'd like to
>> streamline my conversion and aim for the much smaller target file I
>> mentioned above. Right now, my target file is still 10 MB, with just
>> that one pages line replaced with what I want. So, instead of working in
>> line mode, do you think I should just read in the entire file and use
>> ".*" before and after my sub expression to get rid of everything else?
>>
>
> Well, memory is cheap these days and I have to admit I'm much less
> careful about how I use it (unless there's a specific reason to conserve
> it) but in this case, if there's only one pages line in your file, then
> how about doing something like (N.B. untested code):
>
> 	pages = nil
> 	File.foreach("test1.ps") do |line|
> 	  if line =~ /\%\%Pages: ([0-9]{1,5})/
> 	    pages = $1
> 	    break
> 	  end
> 	end
>
> 	if pages
> 	  File.open("psout.txt", "w") { |out| out << "Pages: #{pages}" }
> 	end
>
> If you do choose to read the whole thing in at once, one way to achieve
> that might be:
>
> 	pages = nil
> 	File.read('test1.ps').scan(/\%\%Pages: ([0-9]{1,5})/) do
> 	  pages = $1
> 	  break
> 	end
>
> 	if pages
> 	  File.open("psout.txt", "w") { |out| out << "Pages: #{pages}" }
> 	end


Aha. I didn't know about "scan." Just the word alone makes sense for
what I want. And, yes, I found it in the Big Book, on page 617.

What does the "break" do, Ross? Is it just telling Ruby to stop if it's
found what it was scanning for?

This worked! I'm more prone to just read the whole file in. Thanks a
lot, Ross! I feel I can move mountains now.
Ross Bamford (Guest)
on 2006-04-12 16:59
(Received via mailing list)
On Wed, 2006-04-12 at 22:35 +0900, Peter Bailey wrote:
> > 	if pages
> > 	  File.open("psout.txt", "w") { |out| out << "Pages: #{pages}" }
> > 	end
>
>
> Aha. I didn't know about "scan." Just the word alone makes sense for
> what I want. And, yes, I found it in the Big Book, on page 617.
>
> What does the "break" do, Ross? Is it just telling Ruby to stop if it's
> found what it was scanning for?

Yes - you can find a bit about it near the end of this page:

	http://www.rubycentral.com/book/tut_expressions.html

Looking at that again, though, I can't imagine why I didn't just write:

	if File.read('test1.ps') =~ /\%\%Pages: ([0-9]{1,5})/
	  pages = $1
	end

For just a single match.
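
A slightly shorter spelling of the same single-match idea, as a sketch: `String#[]` accepts a regexp plus a group number and returns that capture, or nil when nothing matches. The sample text is invented, and a real script would use `File.read('test1.ps')` in its place.

```ruby
text = "%!PS-Adobe-3.0\n%%Title: demo\n%%Pages: 87\n"   # invented sample

# String#[] with a regexp and a group number returns that capture,
# or nil when the pattern does not match at all.
pages   = text[/%%Pages: ([0-9]{1,5})/, 1]
missing = "no header here"[/%%Pages: ([0-9]{1,5})/, 1]

puts "Pages: #{pages}" if pages
```

This avoids relying on `$1` at all, which some people find easier to read. The capture comes back as a String, so call `to_i` on it if you need a number.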
Peter Bailey (peterbailey)
on 2006-04-12 17:24
Ross Bamford wrote:
> On Wed, 2006-04-12 at 22:35 +0900, Peter Bailey wrote:
>> > 	if pages
>> > 	  File.open("psout.txt", "w") { |out| out << "Pages: #{pages}" }
>> > 	end
>>
>>
>> Aha. I didn't know about "scan." Just the word alone makes sense for
>> what I want. And, yes, I found it in the Big Book, on page 617.
>>
>> What does the "break" do, Ross? Is it just telling Ruby to stop if it's
>> found what it was scanning for?
>
> Yes - you can find a bit about it near the end of this page:
>
> 	http://www.rubycentral.com/book/tut_expressions.html
>
> Looking at that again, though, I can't imagine why I didn't just write:
>
> 	if File.read('test1.ps') =~ /\%\%Pages: ([0-9]{1,5})/
> 	  pages = $1
> 	end
>
> For just a single match.


Yes, the =~ notation certainly seems simpler. Now, I'm pursuing this
further and I need to look for blank pages in my PostScript. These
are found with a regex, too, a more complex one, but, I've created one
that works. And, with this search, because there may be more than just
one, I've created an array and I'm pushing every page number I find
blank into that array. Cool. I love this stuff. I'm having a bit of
trouble though printing a simple array.length statement. Please check
this out:

************************************************
Dir.chdir('c:/scripts/ruby/temp')
blanks = []
File.read("test2.ps").scan(/\%\%Page: [\d()]+ (\d{1,5})\n\%\%PageBoundingBox: \d{1,5} \d{1,5} \d{1,5} \d{1,5}\n\%\%PageOrientation:/) do
  blanks.push($1)
  number = blanks.length
  blanks.push(" ")
  #break
end

if blanks
  File.open("psout2.txt", "w") { |out| out << "Blank Pages in This PDF: #{blanks}\nNo. of Blanks: #number\n" }
end
************************************************


I'm getting this, which I'm extremely proud of, but, it doesn't quite
make it:

+++++++++++++++++++++++++++++++++++++++++
Blank Pages in This PDF: 46 68 72 80 83
No. of Blanks: #number
+++++++++++++++++++++++++++++++++++++++++

Thanks. If I can just get this, then, I'll streamline my script to take
advantage of the simple match, like you suggested.
Ross Bamford (Guest)
on 2006-04-12 18:44
(Received via mailing list)
On Thu, 2006-04-13 at 00:24 +0900, Peter Bailey wrote:
>   blanks.push($1)
>
>
> I'm getting this, which I'm extremely proud of, but, it doesn't quite
> make it:
>
> +++++++++++++++++++++++++++++++++++++++++
> Blank Pages in This PDF: 46 68 72 80 83
> No. of Blanks: #number
> +++++++++++++++++++++++++++++++++++++++++

You're almost there - I would make just a few small changes:

blanks = []
File.read("test2.ps").scan(/ ... your regexp ... /) do
  blanks.push($1)
end

unless blanks.empty?
  File.open("psout2.txt", "w") do |out|
    out << "Blank Pages in This PDF: #{blanks.join(' ')}\n" <<
	   "No. of blanks: #{blanks.length}\n"
  end
end

The key points I changed being:

	* Rather than having an extra " " in the array for each
	  element (which would throw the length off, too), I use
	  Array#join to format for output.

	* I surround all expressions within strings with #{} -
	  prefixing with # alone works only for global, instance
	  and class variables and is probably bad form - stick to
	  #{var} as much as possible.

	* There's no need to have extra locals, e.g. 'number' in
	  your code - you can make method calls in string expressions
	  and it's often more obvious what's going on.

	* I changed 'if blanks' to 'unless blanks.empty?'. In Ruby,
	  everything apart from nil and false evaluates as true. This
	  includes empty arrays and zero.
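
The points above can be sketched in a few lines. The page numbers are the ones from the output shown earlier in the thread, and the variable names `report`, `empty_is_truthy` and `zero_is_truthy` are only there for the demonstration.

```ruby
blanks = %w[46 68 72 80 83]   # page numbers from the sample output

# Array#join formats the list; no padding elements are needed in the array,
# so blanks.length stays equal to the real number of blank pages.
report = "Blank Pages in This PDF: #{blanks.join(' ')}\n" \
         "No. of blanks: #{blanks.length}\n"

# Only nil and false are falsy. An empty array and zero both count as true,
# which is why `if blanks` cannot detect "no blank pages found".
empty = []
zero  = 0
empty_is_truthy = empty ? true : false
zero_is_truthy  = zero ? true : false

puts report
```

With the original `blanks.push(" ")` inside the loop, the array would have held ten elements for five pages, so the reported count would have been doubled.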
Peter Bailey (peterbailey)
on 2006-04-12 20:06
Ross Bamford wrote:
> On Thu, 2006-04-13 at 00:24 +0900, Peter Bailey wrote:
>>   blanks.push($1)
>>
>>
>> I'm getting this, which I'm extremely proud of, but, it doesn't quite
>> make it:
>>
>> +++++++++++++++++++++++++++++++++++++++++
>> Blank Pages in This PDF: 46 68 72 80 83
>> No. of Blanks: #number
>> +++++++++++++++++++++++++++++++++++++++++
>
> You're almost there - I would make just a few small changes:
>
> blanks = []
> File.read("test2.ps").scan(/ ... your regexp ... /) do
>   blanks.push($1)
> end
>
> unless blanks.empty?
>   File.open("psout2.txt", "w") do |out|
>     out << "Blank Pages in This PDF: #{blanks.join(' ')}\n" <<
> 	   "No. of blanks: #{blanks.length}\n"
>   end
> end
>
> The key points I changed being:
>
> 	* Rather than having an extra " " in the array for each
> 	  element (which would throw the length off, too), I use
> 	  Array#join to format for output.
>
> 	* I surround all expressions within strings with #{} -
> 	  prefixing with # alone works only for global, instance
> 	  and class variables and is probably bad form - stick to
> 	  #{var} as much as possible.
>
> 	* There's no need to have extra locals, e.g. 'number' in
> 	  your code - you can make method calls in string expressions
> 	  and it's often more obvious what's going on.
>
> 	* I changed 'if blanks' to 'unless blanks.empty?'. In Ruby,
> 	  everything apart from nil and false evaluates as true. This
> 	  includes empty arrays and zero.

This works great now! I'll try to remember the #{} syntax. I guess I'm
just learning how many ways things can be done in Ruby, and, how wrong
some ways can be. I've always been a bit confused by the "<<" syntax.
Somehow, I would think it should be ">>" instead, like, you're putting
something to something. I guess it is saying that, just from the other
way around. The blanks.join statement is brilliant! It's so simple.

Thanks again for all your help, Ross. I'll close this issue now and
leave you to your real work. Perhaps we'll chat again. Cheers !
Sam Roberts (Guest)
on 2006-04-13 00:15
(Received via mailing list)
On Wed, Apr 12, 2006 at 08:58:17PM +0900, Peter Bailey wrote:
>
> I tried the following in IRB, but I get the syntax error shown below.
>
>      File.open("psout.txt", "w") do |output|
>        File.foreach("test1.ps") do |line|
>          line.sub!(/\%\%Pages: ([0-9]{1,5})/"Pages: \1"/)

Other people commented on the regex

>          output << line

For efficiency, you can do:

   if(line =~ /%%Pages: (\d+)/)
	  output << $1
	  break
   end

YMMV, but stopping processing the file as soon as you've found what you
want is faster.

Sam
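
A rough way to check the speed difference on your own machine is sketched below. It is only an illustration: the data is a synthetic in-memory string rather than a real .ps file, the helper names `pages_by_line` and `pages_by_scan` are made up, and disk I/O is not measured at all. Timings will vary, so no particular numbers are claimed here.

```ruby
require "benchmark"
require "stringio"

# Synthetic stand-in for a large .ps file: header near the top, then filler.
data = "%!PS-Adobe-3.0\n%%Pages: 250\n" +
       ("0 0 moveto 10 10 lineto stroke\n" * 200_000)

# Line-by-line search that stops at the first hit.
def pages_by_line(io)
  io.each_line do |line|
    return $1 if line =~ /%%Pages: (\d+)/
  end
  nil
end

# Whole-string search; scan keeps going to the end of the text.
def pages_by_scan(text)
  text.scan(/%%Pages: (\d+)/).flatten.first
end

by_line = by_scan = nil
t_line = Benchmark.realtime { by_line = pages_by_line(StringIO.new(data)) }
t_scan = Benchmark.realtime { by_scan = pages_by_scan(data) }

puts format("line-by-line: %.4fs  scan: %.4fs", t_line, t_scan)
```

Both helpers return the same page count. The difference is that the line-by-line version can stop after the second line, while `scan` has to walk the whole string looking for further matches.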
Peter Bailey (peterbailey)
on 2006-04-13 13:49
Sam Roberts wrote:
> On Wed, Apr 12, 2006 at 08:58:17PM +0900, Peter Bailey wrote:
>>
>> I tried the following in IRB, but I get the syntax error shown below.
>>
>>      File.open("psout.txt", "w") do |output|
>>        File.foreach("test1.ps") do |line|
>>          line.sub!(/\%\%Pages: ([0-9]{1,5})/"Pages: \1"/)
>
> Other people commented on the regex
>
>>          output << line
>
> For efficiency, you can do:
>
>    if(line =~ /%%Pages: (\d+)/)
> 	  output << $1
> 	  break
>    end
>
> YMMV, but stopping processing the file as soon as you've found what you
> want is faster.
>
> Sam


Thanks, Sam. So, is doing your "if(line . . ." suggestion quicker, do
you think, than doing a .scan? Because I'm dealing with sometimes huge
files, whatever's quickest is good. By the way, what do you mean by
"YMMV"?