Suggestions for improving a trivial tag parser

naething · July 30, 2008, 7:31pm

Hi folks,

I need to take strings that have embedded <i> and <b> tags and split
them out into an array.
Here are some examples:

No need to ensure valid pairs, just need to break out tags from the
text
#----------------------------------------------------------------------------------------------

  describe "Inline style parsing" do
    it "should return an identical string if inline styles are not detected" do
      create_pdf
      @pdf.parse_inline_styles("Hello World").should == "Hello World"
    end

    it "should return an array of segments when a style is detected" do
      create_pdf
      @pdf.parse_inline_styles("Hello <i>Fine</i> World").should ==
        ["Hello ", "<i>", "Fine", "</i>", " World"]
    end

    it "should create an array of segments when multiple styles are detected" do
      create_pdf
      @pdf.parse_inline_styles("Hello <i>Fine <b>World</b></i>").should ==
        ["Hello ", "<i>", "Fine ", "<b>", "World", "</b>", "</i>"]
    end
  end

I fear my implementation (below) is showing one of my weakest areas in
Ruby, and probably even has some unforeseen problems. I’m sure it can
be done more accurately in less code. Any kind RubyTalk folks want to
school me?

  def parse_inline_styles(text) #:nodoc:
    require "strscan"

    sc = StringScanner.new(text)
    output = []
    last_pos = 0

    loop do
     if sc.scan_until(/<\/?[ib]>/)
       pre = sc.pre_match[last_pos..-1]
       output << pre unless pre.empty?
       output << sc.matched
       last_pos = sc.pos
     else
       output << sc.rest if sc.rest?
       break output
     end
    end

    output.length == 1 ? output.first : output
  end

Thanks,
-greg

naething · July 30, 2008, 8:08pm

On Wed, Jul 30, 2008 at 1:53 PM, Robert D. wrote:

> What about
>
> eles = split( %r{(</?.*?>)} ).delete_if{|x| x.empty? }
> eles.size.one? ? eles.first : eles

I needed something a little more restricted than that, but you gave me
almost exactly what I need:

  def parse_inline_styles(text) #:nodoc:
    segments = text.split( %r{(</?[ib].*?>)} ).delete_if{|x| x.empty? }
    segments.size == 1 ? segments.first : segments
  end

This passes my specs, and so long as people don’t see any major issues
with it, it looks great.

I knew there had to be a way to do this with split. Thanks Robert.

-greg

naething · July 30, 2008, 7:54pm

What about

eles = split( %r{(</?.*?>)} ).delete_if{|x| x.empty? }
eles.size.one? ? eles.first : eles

HTH
Robert

–
http://ruby-smalltalk.blogspot.com/

There’s no one thing that’s true. It’s all true.

naething · July 30, 2008, 8:11pm

On Wed, Jul 30, 2008 at 2:07 PM, Gregory B. wrote:

>    segments = text.split( %r{(</?[ib].*?>)} ).delete_if{|x| x.empty? }
>    segments.size == 1 ? segments.first : segments
>  end

Whoops, make that:

  def parse_inline_styles(text) #:nodoc:
    segments = text.split( %r{(</?[ib]>)} ).delete_if{|x| x.empty? }
    segments.size == 1 ? segments.first : segments
  end

Slaps head. I totally get what I was missing before: when
you use groupings, split includes the matched segments:

  "kitten robot snake robot tree robot".split(/(robot)/)
  => ["kitten ", "robot", " snake ", "robot", " tree ", "robot"]
  "kitten robot snake robot tree robot".split(/robot/)
  => ["kitten ", " snake ", " tree "]
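As a quick check, here is a minimal sketch that runs the final version against the three example strings from the first post. It assumes the inputs used <i> and <b> tags in the positions shown; the method body is the one posted above.

```ruby
# Final version from this thread: the capture group in the pattern makes
# split keep the matched tags alongside the surrounding text.
def parse_inline_styles(text)
  segments = text.split(%r{(</?[ib]>)}).delete_if { |x| x.empty? }
  segments.size == 1 ? segments.first : segments
end

# Assumed example inputs (tag placement is an assumption).
p parse_inline_styles("Hello World")
p parse_inline_styles("Hello <i>Fine</i> World")
p parse_inline_styles("Hello <i>Fine <b>World</b></i>")
```

Note that a string consisting of a single tag, such as "<b>", also comes back as a plain string, because only one segment remains.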

naething · July 30, 2008, 8:14pm

On 30-07-2008, at 13:53, Robert D. wrote:

> What about
>
> eles = split( %r{(</?.*?>)} ).delete_if{|x| x.empty? }
> eles.size.one? ? eles.first : eles

I think your regexp is wrong, since it (incorrectly) parses empty tags:

  "Hello <i>Fine<> <b>World</b></i>".split( %r{(</?.*?>)} ).delete_if{|x| x.empty? }
  => ["Hello ", "<i>", "Fine", "<>", " ", "<b>", "World", "</b>", "</i>"]

I would try something like:

  "Hello <i>Fine <b>World</b></i>".split( %r{(</?[bi]>)} ).delete_if{|x| x.empty? }
  => ["Hello ", "<i>", "Fine ", "<b>", "World", "</b>", "</i>"]

  "Hello <i>Fine<> <b>World</b></i>".split( %r{(</?[bi]>)} ).delete_if{|x| x.empty? }
  => ["Hello ", "<i>", "Fine<> ", "<b>", "World", "</b>", "</i>"]

Or:

  "Hello <i>Fine <b>World</b></i>".split( %r{(</?[^>]+>)} ).delete_if{|x| x.empty? }
  => ["Hello ", "<i>", "Fine ", "<b>", "World", "</b>", "</i>"]

  "Hello <i>Fine<> <b>World</b></i>".split( %r{(</?[^>]+>)} ).delete_if{|x| x.empty? }
  => ["Hello ", "<i>", "Fine<> ", "<b>", "World", "</b>", "</i>"]
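To see the three patterns side by side, here is a small sketch that applies each one to the same input containing an empty "<>". PATTERNS and split_tags are hypothetical names used only for illustration, and the tag placement in the input is an assumption.

```ruby
# Sketch: compare the three patterns discussed above on one input.
# PATTERNS and split_tags are hypothetical names, not from the thread.
PATTERNS = {
  "any tag (lazy)"   => %r{(</?.*?>)},
  "only <b> and <i>" => %r{(</?[bi]>)},
  "non-empty tag"    => %r{(</?[^>]+>)}
}

def split_tags(text, pattern)
  # split keeps the captured tags; drop the empty strings it leaves behind.
  text.split(pattern).reject { |segment| segment.empty? }
end

input = "Hello <i>Fine<> <b>World</b></i>" # assumed tag placement
PATTERNS.each do |label, pattern|
  puts "#{label}: #{split_tags(input, pattern).inspect}"
end
```

Only the lazy pattern breaks "<>" out as a tag of its own; the other two leave it inside the text segment. The third pattern still accepts any tag name (for example "<u>"), so it is looser than the [bi] version.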

> HTH
> Robert

Now, I would bet that this might be a little too expensive with large
strings.

regards,

naething · July 30, 2008, 10:37pm

On Wed, Jul 30, 2008 at 8:12 PM, Rolando A. wrote:

> On 30-07-2008, at 13:53, Robert D. wrote:
>
>> What about
>>
>> eles = split( %r{(</?.*?>)} ).delete_if{|x| x.empty? }
>> eles.size.one? ? eles.first : eles
>
> I think your regexp is wrong, since it (incorrectly) parses empty tags:

It passed the specs, did it not? I did not know what Gregory wanted
exactly; it turns out he wanted %r{(</?[ib]>)},
but he got the message ;).

> Now, I would bet that this might be a little too expensive with large
> strings.

Hmm, why?
It would be different if we were treating files, but as the string is
already here we use the memory required by the
specification and nothing more.

Robert


naething · July 30, 2008, 8:16pm

On Jul 30, 2008, at 11:28 AM, Gregory B. wrote:

> describe "Inline style parsing" do
> end
>     else
> Thanks,
> -greg

my take:

cfp:~ > cat a.rb
require 'yaml'

strings =
  "Hello World",
  "Hello <i>Fine</i> World",
  "Hello <i>Fine <b>World</b></i>"

def parse_inline_styles string, tags = %w( <i> </i> <b> </b> )
  re = Regexp.new tags.flatten.map{|tag| "(#{ Regexp.escape tag })"}.join('|')
  tokens = string.split(re)
  tokens.delete_if{|token| token.empty?}
  ((tokens.size == 1 and tokens.first == string) ? string : tokens)
end

strings.each do |string|
  y string => parse_inline_styles(string)
end

cfp:~ > ruby a.rb

Hello World: Hello World

Hello <i>Fine</i> World:
- "Hello "
- <i>
- Fine
- </i>
- " World"

Hello <i>Fine <b>World</b></i>:
- "Hello "
- <i>
- "Fine "
- <b>
- World
- </b>
- </i>
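The same idea, building the pattern from a list of tags, can also be sketched with Regexp.union, which escapes each string and joins them with "|". The TAGS constant and its contents are assumptions for illustration; this is one possible variation, not the code posted above.

```ruby
# Hypothetical default tag list, assumed for this sketch.
TAGS = %w( <i> </i> <b> </b> )

def parse_inline_styles(string, tags = TAGS)
  # Regexp.union escapes each literal tag; wrapping the result in a
  # capture group makes split keep the tags in its output.
  re = /(#{Regexp.union(tags)})/
  tokens = string.split(re).reject { |token| token.empty? }
  tokens.size == 1 && tokens.first == string ? string : tokens
end

p parse_inline_styles("Hello World")
p parse_inline_styles("Hello <i>Fine <b>World</b></i>")
```

Passing a different list, for example `%w( <u> </u> )`, changes which tags are broken out without touching the method.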

a @ http://codeforpeople.com/

naething · July 31, 2008, 12:14am

On Wed, Jul 30, 2008 at 10:43 PM, Gregory B. wrote:

> It does a double pass through the segments rather than a single pass,

Yes, but these two passes are quite fast, see below.

> However, I think it’ll be okay for my purposes (PDF inline styling),
> unless I missed some other concern Rolando had.

Well, trying to be useful, I checked with some larger texts. I omitted
the conditional #first at the end of the parsing method for clarity.

It turns out that, for a string of over one megabyte, split is still
much faster than scan, but it is not really fast either:
539/39 > cat split.rb && ruby split.rb

require 'benchmark'

def split_select txt
  txt.split(%r{(</?[bi]>)}).select{|x| ! x.empty? }
end

def split_delete txt
  txt.split(%r{(</?[bi]>)}).delete_if{|x| x.empty? }
end

def use_scan txt
  r = []
  txt.scan(%r{(.*?)(</?[ib]>)}) do | pr, po |
    r << pr unless pr.empty?
    r << po
  end
  r << $' unless $'.empty?
  r  # always return the array, even when nothing follows the last tag
end

N = 400_000
text = "bolditalicboldboldnormal" * N

Benchmark.bmbm do | bm |
  bm.report("split_select") do
    split_select text
  end
  bm.report("split_delete") do
    split_delete text
  end
  bm.report("use_scan") do
    use_scan text
  end
end
Rehearsal ------------------------------------------------
split_select   6.844000   0.094000   6.938000 (  7.063000)
split_delete   7.687000   0.109000   7.796000 (  7.953000)
use_scan      20.063000   0.203000  20.266000 ( 20.109000)
-------------------------------------- total: 35.000000sec

                   user     system      total        real
split_select   6.344000   0.031000   6.375000 (  6.485000)
split_delete   6.500000   0.109000   6.609000 (  6.359000)
use_scan      16.625000   0.265000  16.890000 ( 16.906000)
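The numbers above compare split with String#scan; the StringScanner approach from the first post is not part of that run. A rough sketch of how one might time it as well follows. The method names, the sample text, and the much smaller repeat count are all assumptions, and no timings are claimed here.

```ruby
require "benchmark"
require "strscan"

TAG = %r{</?[ib]>}

def split_reject(text)
  text.split(%r{(</?[ib]>)}).reject { |x| x.empty? }
end

# Hypothetical single-pass variant built on StringScanner.
def use_scanner(text)
  sc = StringScanner.new(text)
  output = []
  until sc.eos?
    if (chunk = sc.scan_until(TAG))
      # chunk runs from the old scan position through the end of the tag.
      pre = chunk[0, chunk.length - sc.matched.length]
      output << pre unless pre.empty?
      output << sc.matched
    else
      output << sc.rest # no more tags: keep the remaining text
      sc.terminate
    end
  end
  output
end

n = 2_000 # assumed size, far smaller than the run above, so it finishes quickly
text = "<b>bold</b> <i>italic</i> normal " * n # assumed sample text

Benchmark.bm(12) do |bm|
  bm.report("split_reject") { split_reject(text) }
  bm.report("use_scanner")  { use_scanner(text) }
end
```

Both methods should return the same segments for the same input, so any difference in the report is down to the approach rather than the result.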

HTH
Robert


naething · July 30, 2008, 10:44pm

On Wed, Jul 30, 2008 at 4:35 PM, Robert D. wrote:

> exactly, turns out he wanted %r{(</?[ib]>)}
> but he got the message ;).

My implementation was tighter than my specs, but I added an extra one
to catch this.

> It would be different if we were treating files, but as the string is
> already here we use the memory required by the
> specification and nothing more.

It does a double pass through the segments rather than a single pass,
and I guess that if I had a giant string with a ton of tags I needed
to parse, that’d make it less efficient.

However, I think it’ll be okay for my purposes (PDF inline styling),
unless I missed some other concern Rolando had.
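If the second pass ever did matter, one way to avoid it would be to have a single scan produce the tokens directly, so that no empty strings need to be filtered out afterwards. The sketch below uses hypothetical names (TOKEN, tokenize_inline_styles), has not been benchmarked, and always returns an array rather than a bare string.

```ruby
# Each token is either a tag, or a run of characters in which no
# position starts a tag (checked with a negative lookahead).
TOKEN = %r{</?[ib]>|(?:(?!</?[ib]>).)+}m

def tokenize_inline_styles(text)
  # With no capture groups, scan returns the matched substrings in order.
  text.scan(TOKEN)
end

p tokenize_inline_styles("Hello <i>Fine <b>World</b></i>")
```

The lookahead is evaluated at every character, so this trades the second pass for more work per character; whether that is a gain would need measuring.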

naething · July 31, 2008, 12:19am

On Wed, Jul 30, 2008 at 6:12 PM, Robert D. wrote:

> Turns out that split is still much better than scanning for a string
> of a size over one megabyte, but it is not
> really fast either:
> 539/39 > cat split.rb && ruby split.rb

Interesting that it’s faster to do a double pass with split than a
single pass with StringScanner.
I didn’t implement it that way for efficiency, just out of total
forgetfulness about how to get split() to work the way I want.

-greg

naething · July 31, 2008, 1:08am

> Interesting that it’s faster to do a double pass with split than a
> single pass with StringScanner.
> I didn’t implement it that way for efficiency, just out of total
> forgetfulness on how to get split() to work the way I want

Anyway, for small strings even scan will take only a split second;
sorry, could not resist the pun ;).
Robert