Simple substitutions

bodikp · April 12, 2006, 1:58pm

Hello,
Some of you all helped this newbie a couple of weeks ago. I’m getting
there!

I have a need to parse postscript print files on a daily basis. Some of
them are over 10 MB. One thing I need to find out is the number of pages
in the files. There’s a simple one-liner at the top of each .ps file
that states the number of pages. It reads:
%%Pages: xx
It’s a line of its own.

I simply want to find that line in my source file and then write a
corresponding line in my output file:
Number of pages: xx

I tried the following in IRB, but I get the syntax error shown below.

 File.open("psout.txt", "w") do |output|
   File.foreach("test1.ps") do |line|
     line.sub!(/\%\%Pages: ([0-9]{1,5})/"Pages: \1"/)
     output << line
   end
 end

SyntaxError: compile error
(irb):26: parse error, unexpected tSTRING_BEG, expecting ')'
line.sub!(/\%\%Pages: ([0-9]{1,5})/"Pages: \1"/)
                                ^
(irb):26: parse error, unexpected ')'
       from (irb):29

Is there a significance to the ^ under the “P” in my syntax error?
What’s it telling me? My sub requires parentheses. Is there a conflict
with my grouping parentheses inside my regular expression?

Another request for help–my translation above would basically move the
entire contents of my source file to the target file, with the exception
of the one line I want changed. That’s overkill for me. I just want a
few lines of information in my target file. Would you suggest I just do
a simple search, using matching (=~), and then write about that find in
my target file, or, would it be better to do mass deletions of
everything I don’t want from my source to my target files? In other
words, should I say “in this file, find this, then say this . . .” or,
“convert this entire file, but just change this?”

Thank you.

bodikp · April 12, 2006, 2:22pm

Hi,

On Wed, 2006-04-12 at 20:58 +0900, Peter B. wrote:

(irb):26: parse error, unexpected tSTRING_BEG, expecting ')'
line.sub!(/\%\%Pages: ([0-9]{1,5})/"Pages: \1"/)
                                ^
(irb):26: parse error, unexpected ')'
       from (irb):29
Is there a significance to the ^ under the “P” in my syntax error?
What’s it telling me? My sub requires parentheses. Is there a conflict
with my grouping parentheses inside my regular expression?

Ruby is just complaining about the Perl-style s/// syntax. The correct
syntax here would be something like:

line.sub!(/\%\%Pages: ([0-9]{1,5})/) { "Pages: #$1" }

(You can sometimes use the \1 escapes in the replacement but it’s always
come out hit-and-miss for me, so I stick to the block form).
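
For reference, the string form of the replacement does work as long as the backslash actually reaches sub. The likely cause of the hit-and-miss behaviour is quoting: inside double quotes, "\1" is an escape for the character with code 1, not a backreference. A minimal sketch, using a made-up sample line:

```ruby
# Sample header line, made up for illustration.
line = "%%Pages: 42\n"

# Single quotes keep the backslash, so sub sees the backreference \1.
single = line.sub(/%%Pages: ([0-9]{1,5})/, 'Pages: \1')

# In double quotes the backslash must be doubled to survive string parsing.
doubled = line.sub(/%%Pages: ([0-9]{1,5})/, "Pages: \\1")

# Unescaped "\1" in double quotes is an octal escape (the character \x01),
# so no group is substituted.
broken = line.sub(/%%Pages: ([0-9]{1,5})/, "Pages: \1")

puts single                        # Pages: 42
puts doubled                       # Pages: 42
p broken == "Pages: \x01\n"        # true
```

The block form avoids the quoting question entirely, which is one reason to prefer it.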

Another request for help–my translation above would basically move the
entire contents of my source file to the target file, with the exception
of the one line I want changed. That’s overkill for me. I just want a
few lines of information in my target file. Would you suggest I just do
a simple search, using matching (=~), and then write about that find in
my target file, or, would it be better to do mass deletions of
everything I don’t want from my source to my target files? In other
words, should I say “in this file, find this, then say this . . .” or,
“convert this entire file, but just change this?”

I would probably go with extracting just what I needed from the file,
though if you can safely exclude portions of it before running matches
(for example a large data section) that may improve performance.

Also, I don’t know if there are any Ruby postscript libraries, but it
might be worth a search if you haven’t already…

Hope that helps,

bodikp · April 12, 2006, 2:25pm

Hello,

     line.sub!(/\%\%Pages: ([0-9]{1,5})/"Pages: \1"/)

How about this:

line.sub!(/%%Pages: (\d{1,5})/){"Pages: " + $1}

bw,
Peter

bodikp · April 12, 2006, 3:04pm

Ross B. wrote:

Hi,

On Wed, 2006-04-12 at 20:58 +0900, Peter B. wrote:
(irb):26: parse error, unexpected tSTRING_BEG, expecting ')'
line.sub!(/\%\%Pages: ([0-9]{1,5})/"Pages: \1"/)
                                ^
(irb):26: parse error, unexpected ')'
       from (irb):29
Is there a significance to the ^ under the “P” in my syntax error?
What’s it telling me? My sub requires parentheses. Is there a conflict
with my grouping parentheses inside my regular expression?
Ruby is just complaining about the Perl-style s/// syntax. The correct
syntax here would be something like:

line.sub!(/%%Pages: ([0-9]{1,5})/) { "Pages: #$1" }

(You can sometimes use the \1 escapes in the replacement but it’s always
come out hit-and-miss for me, so I stick to the block form).

Another request for help–my translation above would basically move the
entire contents of my source file to the target file, with the exception
of the one line I want changed. That’s overkill for me. I just want a
few lines of information in my target file. Would you suggest I just do
a simple search, using matching (=~), and then write about that find in
my target file, or, would it be better to do mass deletions of
everything I don’t want from my source to my target files? In other
words, should I say “in this file, find this, then say this . . .” or,
“convert this entire file, but just change this?”

I would probably go with extracting just what I needed from the file,
though if you can safely exclude portions of it before running matches
(for example a large data section) that may improve performance.

Also, I don’t know if there are any Ruby postscript libraries, but it
might be worth a search if you haven’t already…

Hope that helps,

Thank you, Ross.
Yes, my regex experience is indeed with Perl, and Perl only. I haven’t
seen anything else in my docs, though, about Ruby being different with
regexes, except for the stuff about true objects with Regexp#match. Both
your response and Peter’s below, though, show me something entirely new.
It looks like you’re putting the replacement phrase into a block,
correct? It’s kind of ugly at first, I must say, but, I see some sense
in it. Your suggestion worked for me, by the way. Now, I’d like to
streamline my conversion and aim for the much smaller target file I
mentioned above. Right now, my target file is still 10 MB, with just
that one pages line replaced with what I want. So, instead of working in
line mode, do you think I should just read in the entire file and use
“.*” before and after my sub expression to get rid of everything else?

bodikp · April 12, 2006, 3:24pm

On Wed, 2006-04-12 at 22:04 +0900, Peter B. wrote:

Now, I’d like to
streamline my conversion and aim for the much smaller target file I
mentioned above. Right now, my target file is still 10 MB, with just
that one pages line replaced with what I want. So, instead of working in
line mode, do you think I should just read in the entire file and use
“.*” before and after my sub expression to get rid of everything else?

Well, memory is cheap these days and I have to admit I’m much less
careful about how I use it (unless there’s a specific reason to conserve
it) but in this case, if there’s only one pages line in your file, then
how about doing something like (N.B. untested code):

pages = nil
File.foreach("test1.ps") do |line|
  if line =~ /\%\%Pages: ([0-9]{1,5})/
    pages = $1
    break
  end
end

if pages
  File.open("psout.txt", "w") { |out| out << "Pages: #{pages}" }
end

If you do choose to read the whole thing in at once, one way to achieve
that might be:

pages = nil
File.read('test1.ps').scan(/\%\%Pages: ([0-9]{1,5})/) do
  pages = $1
  break
end

if pages
  File.open("psout.txt", "w") { |out| out << "Pages: #{pages}" }
end

bodikp · April 12, 2006, 3:05pm

Peter S. wrote:

Hello,
     line.sub!(/\%\%Pages: ([0-9]{1,5})/"Pages: \1"/)
How about this:

line.sub!(/%%Pages: (\d{1,5})/){"Pages: " + $1}

bw,
Peter

Yes, that worked for me. Thank you, Peter. Like Ross’ response above, I
hadn’t yet seen a regex expressed this way, where a block is used for
the replacement part of a regex. Different, but powerful. Thanks again.
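
A small sketch of why the block form is handy: the block's return value becomes the replacement, so it can compute something from the match instead of just echoing it. The sample strings and output labels below are made up for illustration.

```ruby
# Sample header line, made up for illustration.
header = "%%Pages: 007\n"

# The block's return value becomes the replacement; $1 holds the first group.
# Here the captured digits are converted to an Integer to drop leading zeros.
result = header.sub(/%%Pages: (\d{1,5})/) { "Number of pages: #{$1.to_i}" }
puts result   # Number of pages: 7

# gsub calls the block once per match, so each replacement can differ.
text = "%%Page: 1 1\n%%Page: 2 2\n"
relabeled = text.gsub(/%%Page: (\d+) (\d+)/) { "page #{$2.to_i * 10}" }
print relabeled
```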

bodikp · April 12, 2006, 3:35pm

Ross B. wrote:

On Wed, 2006-04-12 at 22:04 +0900, Peter B. wrote:

Now, I’d like to
streamline my conversion and aim for the much smaller target file I
mentioned above. Right now, my target file is still 10 MB, with just
that one pages line replaced with what I want. So, instead of working in
line mode, do you think I should just read in the entire file and use
“.*” before and after my sub expression to get rid of everything else?

Well, memory is cheap these days and I have to admit I’m much less
careful about how I use it (unless there’s a specific reason to conserve
it) but in this case, if there’s only one pages line in your file, then
how about doing something like (N.B. untested code):

pages = nil
File.foreach("test1.ps") do |line|
  if line =~ /%%Pages: ([0-9]{1,5})/
    pages = $1
    break
  end
end

if pages
  File.open("psout.txt", "w") { |out| out << "Pages: #{pages}" }
end

If you do choose to read the whole thing in at once, one way to achieve
that might be:

pages = nil
File.read('test1.ps').scan(/%%Pages: ([0-9]{1,5})/) do
  pages = $1
  break
end

if pages
  File.open("psout.txt", "w") { |out| out << "Pages: #{pages}" }
end

Aha. I didn’t know about “scan.” Just the word alone makes sense for
what I want. And, yes, I found it in the Big Book, on page 617.

What does the “break” do, Ross? Is it just telling Ruby to stop if it’s
found what it was scanning for?

This worked! I’m more prone to just read the whole file in. Thanks a
lot, Ross! I feel I can move mountains now.

bodikp · April 12, 2006, 4:59pm

On Wed, 2006-04-12 at 22:35 +0900, Peter B. wrote:

if pages
  File.open("psout.txt", "w") { |out| out << "Pages: #{pages}" }
end
Aha. I didn’t know about “scan.” Just the word alone makes sense for
what I want. And, yes, I found it in the Big Book, on page 617.

What does the “break” do, Ross? Is it just telling Ruby to stop if it’s
found what it was scanning for?

Yes - you can find a bit about it near the end of this page:

http://www.rubycentral.com/book/tut_expressions.html

Looking at that again, though, I can’t imagine why I didn’t just write:

if File.read('test1.ps') =~ /\%\%Pages: ([0-9]{1,5})/
  pages = $1
end

For just a single match.
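
To make the effect of break concrete, here is a minimal sketch on a made-up string that contains two matching lines. With break, the scan block runs once and scan stops; without it, the block runs for every match. The =~ form only ever finds the first match, so it needs no break.

```ruby
# Made-up text with two matching lines, for illustration only.
text = "%%Pages: 12\nsome data\n%%Pages: 99\n"

# With break: the block runs for the first match, then scan stops.
first = nil
text.scan(/%%Pages: ([0-9]{1,5})/) do
  first = $1
  break
end

# Without break: the block runs for every match, so the last one wins.
last = nil
text.scan(/%%Pages: ([0-9]{1,5})/) { last = $1 }

# =~ only looks for the first match, so no break is needed.
via_match = $1 if text =~ /%%Pages: ([0-9]{1,5})/

p first      # "12"
p last       # "99"
p via_match  # "12"
```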

bodikp · April 12, 2006, 5:24pm

Ross B. wrote:

On Wed, 2006-04-12 at 22:35 +0900, Peter B. wrote:
if pages
  File.open("psout.txt", "w") { |out| out << "Pages: #{pages}" }
end
Aha. I didn’t know about “scan.” Just the word alone makes sense for
what I want. And, yes, I found it in the Big Book, on page 617.

What does the “break” do, Ross? Is it just telling Ruby to stop if it’s
found what it was scanning for?
Yes - you can find a bit about it near the end of this page:

http://www.rubycentral.com/book/tut_expressions.html

Looking at that again, though, I can’t imagine why I didn’t just write:

if File.read('test1.ps') =~ /%%Pages: ([0-9]{1,5})/
  pages = $1
end

For just a single match.

Yes, the =~ notation certainly seems simpler. Now I’m pursuing this
further, and I need to look for blank pages in my postscript. These
are found with a regex, too, a more complex one, but, I’ve created one
that works. And, with this search, because there may be more than just
one, I’ve created an array and I’m pushing every page number I find
blank into that array. Cool. I love this stuff. I’m having a bit of
trouble though printing a simple array.length statement. Please check
this out:

Dir.chdir('c:/scripts/ruby/temp')
blanks = []
File.read("test2.ps").scan(/%%Page: [\d()]+ (\d{1,5})\n%%PageBoundingBox: \d{1,5} \d{1,5} \d{1,5} \d{1,5}\n%%PageOrientation:/) do
  blanks.push($1)
  number = blanks.length
  blanks.push(" ")
  #break
end

if blanks
  File.open("psout2.txt", "w") { |out| out << "Blank Pages in This PDF: #{blanks}\nNo. of Blanks: #number\n" }
end

I’m getting this, which I’m extremely proud of, but, it doesn’t quite
make it:

+++++++++++++++++++++++++++++++++++++++++
Blank Pages in This PDF: 46 68 72 80 83
No. of Blanks: #number
+++++++++++++++++++++++++++++++++++++++++

Thanks. If I can just get this, then, I’ll streamline my script to take
advantage of the simple match, like you suggested.

bodikp · April 12, 2006, 6:44pm

On Thu, 2006-04-13 at 00:24 +0900, Peter B. wrote:

blanks.push($1)

I’m getting this, which I’m extremely proud of, but, it doesn’t quite
make it:

+++++++++++++++++++++++++++++++++++++++++
Blank Pages in This PDF: 46 68 72 80 83
No. of Blanks: #number
+++++++++++++++++++++++++++++++++++++++++

You’re almost there - I would make just a few small changes:

blanks = []
File.read("test2.ps").scan(/ … your regexp … /) do
  blanks.push($1)
end

unless blanks.empty?
  File.open("psout2.txt", "w") do |out|
    out << "Blank Pages in This PDF: #{blanks.join(' ')}\n" <<
           "No. of blanks: #{blanks.length}\n"
  end
end

The key points I changed being:

* Rather than having an extra " " in the array for each
  element (which would throw the length off, too), I use
  Array#join to format for output.

* I surround all expressions within strings with #{} -
  prefixing with # alone works only for global, instance and
  class variables and is probably bad form - stick to #{var}
  as much as possible.

* There's no need to have extra locals, e.g. 'number' in
  your code - you can make method calls in string expressions
  and it's often more obvious what's going on.

* I change 'if blanks' for 'unless blanks.empty?'. In Ruby,
  everything apart from nil and false evaluates as true. This
  includes empty arrays and zero.
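
The points above can be checked with a short sketch. The truthy? helper and the sample page numbers are made up for illustration; they are not part of the original script.

```ruby
# Helper for illustration: reports how a value behaves in a condition.
def truthy?(value)
  value ? true : false
end

p truthy?([])     # true  (an empty array still counts as true)
p truthy?(0)      # true
p truthy?("")     # true
p truthy?(nil)    # false
p truthy?(false)  # false

blanks = %w[46 68 72]
number = blanks.length

# #{...} evaluates any expression; a bare # before a local is plain text.
puts "No. of blanks: #{blanks.length}"   # No. of blanks: 3
puts "No. of blanks: #number"            # No. of blanks: #number

# join formats the array without storing separator elements in it.
puts blanks.join(' ')                    # 46 68 72
```

This is why 'if blanks' never skips the write: the array exists even when nothing was pushed into it.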

bodikp · April 13, 2006, 12:15am

On Wed, Apr 12, 2006 at 08:58:17PM +0900, Peter B. wrote:

I tried the following in IRB, but I get the syntax error shown below.
 File.open("psout.txt", "w") do |output|
   File.foreach("test1.ps") do |line|
     line.sub!(/\%\%Pages: ([0-9]{1,5})/"Pages: \1"/)

Other people commented on the regex

     output << line

For efficiency, you can do:

if(line =~ /%%Pages: (\d+)/)
  output << $1
  break
end

YMMV, but stopping processing the file as soon as you’ve found what you
want is faster.
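
A rough sketch of what the early exit buys, assuming the %%Pages line sits near the top of the file as described at the start of the thread. The find_pages helper and the temporary stand-in file are made up for illustration; the sketch counts lines examined rather than measuring time, so it makes no claim about actual speed on real files.

```ruby
require 'tempfile'

# Returns [page_count, lines_examined]; stops reading at the first match.
def find_pages(path)
  examined = 0
  pages = nil
  File.foreach(path) do |line|
    examined += 1
    if line =~ /%%Pages: (\d+)/
      pages = $1
      break
    end
  end
  [pages, examined]
end

# Build a small stand-in file: header comments first, then filler lines.
Tempfile.create('sample.ps') do |f|
  f.puts '%!PS-Adobe-3.0'
  f.puts '%%Pages: 83'
  1000.times { |i| f.puts "% filler line #{i}" }
  f.flush
  pages, examined = find_pages(f.path)
  puts "pages=#{pages}, lines examined=#{examined}"
end
```

By contrast, File.read(...).scan loads the entire file into memory before any matching starts, however early the match occurs.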

Sam

bodikp · April 12, 2006, 8:06pm

Ross B. wrote:

On Thu, 2006-04-13 at 00:24 +0900, Peter B. wrote:

blanks.push($1)

I’m getting this, which I’m extremely proud of, but, it doesn’t quite
make it:

+++++++++++++++++++++++++++++++++++++++++
Blank Pages in This PDF: 46 68 72 80 83
No. of Blanks: #number
+++++++++++++++++++++++++++++++++++++++++

You’re almost there - I would make just a few small changes:

blanks = []
File.read("test2.ps").scan(/ … your regexp … /) do
  blanks.push($1)
end

unless blanks.empty?
  File.open("psout2.txt", "w") do |out|
    out << "Blank Pages in This PDF: #{blanks.join(' ')}\n" <<
           "No. of blanks: #{blanks.length}\n"
  end
end

The key points I changed being:

Rather than having an extra " " in the array for each
element (which would throw the length off, too), I use
Array#join to format for output.

I surround all expressions within strings with #{} -
prefixing with # alone works only for global, instance and
class variables and is probably bad form - stick to #{var}
as much as possible.

There’s no need to have extra locals, e.g. ‘number’ in
your code - you can make method calls in string expressions
and it’s often more obvious what’s going on.

I change ‘if blanks’ for ‘unless blanks.empty?’. In Ruby,
everything apart from nil and false evaluates as true. This
includes empty arrays and zero.

This works great now! I’ll try to remember the #{} syntax. I guess I’m
just learning how many ways things can be done in Ruby, and, how wrong
some ways can be. I’ve always been a bit confused by the “<<” syntax.
Somehow, I would think it should be “>>” instead, like, you’re putting
something to something. I guess it is saying that, just from the other
way around. The blanks.join statement is brilliant! It’s so simple.
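
One way to read “<<” is as “append the thing on the right to the receiver on the left”, with the arrows pointing at the destination. A minimal sketch; StringIO stands in here for the File object used in the thread, and the values are made up.

```ruby
require 'stringio'

# << appends the right-hand value to the receiver on the left.
list = []
list << "46" << "68"            # Array#<< returns the array, so calls chain

text = String.new("Pages: ")
text << "83"                    # String#<< appends in place

out = StringIO.new              # stands in for the File object in the thread
out << "Blank pages: " << list.join(' ') << "\n"

p list         # ["46", "68"]
p text         # "Pages: 83"
p out.string   # "Blank pages: 46 68\n"
```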

Thanks again for all your help, Ross. I’ll close this issue now and
leave you to your real work. Perhaps we’ll chat again. Cheers !

bodikp · April 13, 2006, 1:49pm

Sam R. wrote:

On Wed, Apr 12, 2006 at 08:58:17PM +0900, Peter B. wrote:
I tried the following in IRB, but I get the syntax error shown below.
 File.open("psout.txt", "w") do |output|
   File.foreach("test1.ps") do |line|
     line.sub!(/\%\%Pages: ([0-9]{1,5})/"Pages: \1"/)
Other people commented on the regex
     output << line
For efficiency, you can do:

if(line =~ /%%Pages: (\d+)/)
  output << $1
  break
end

YMMV, but stopping processing the file as soon as you’ve found what you
want is faster.

Sam

Thanks, Sam. So, is doing your “if(line . . .” suggestion quicker, do
you think, than doing a .scan? Because I’m dealing with sometimes huge
files, whatever’s quickest is good. By the way, what do you mean by
“YMMV”?