Hi folks,
I need to take strings that have embedded <i> and <b> tags and split
them out into an array.
There is no need to ensure valid pairs; I just need to break out the
tags from the text.
Here are some examples:
#----------------------------------------------------------------------------------------------
describe "Inline style parsing" do
  it "should return an identical string if inline styles are not detected" do
    create_pdf
    @pdf.parse_inline_styles("Hello World").should == "Hello World"
  end
  it "should return an array of segments when a style is detected" do
    create_pdf
    @pdf.parse_inline_styles("Hello <i>Fine</i> World").should ==
      ["Hello ", "<i>", "Fine", "</i>", " World"]
  end
  it "should create an array of segments when multiple styles are detected" do
    create_pdf
    @pdf.parse_inline_styles("Hello <i>Fine <b>World</b></i>").should ==
      ["Hello ", "<i>", "Fine ", "<b>", "World", "</b>", "</i>"]
  end
end
I fear my implementation (below) is showing one of my weakest areas in
Ruby, and probably even has some unforeseen problems. I’m sure it can
be done more accurately in less code. Any kind RubyTalk folks want to
school me?
def parse_inline_styles(text) #:nodoc:
  require "strscan"
  sc = StringScanner.new(text)
  output = []
  last_pos = 0
  loop do
    if sc.scan_until(/<\/?[ib]>/)
      pre = sc.pre_match[last_pos..-1]
      output << pre unless pre.empty?
      output << sc.matched
      last_pos = sc.pos
    else
      output << sc.rest if sc.rest?
      break output
    end
  end
  output.length == 1 ? output.first : output
end
Thanks,
-greg
On Wed, Jul 30, 2008 at 1:53 PM, Robert D. wrote:
What about
eles = split( %r{(</?.*?>)} ).delete_if{|x| x.empty? }
eles.size.one? ? eles.first : eles
I needed something a little more restricted than that, but you gave me
almost exactly what I need:
def parse_inline_styles(text) #:nodoc:
  segments = text.split( %r{(</?[ib].*?>)} ).delete_if{|x| x.empty? }
  segments.size == 1 ? segments.first : segments
end
This passes my specs, and so long as people don’t see any major issues
with it, it looks great.
I knew there had to be a way to do this with split. Thanks Robert.
-greg
What about
eles = split( %r{(</?.*?>)} ).delete_if{|x| x.empty? }
eles.size.one? ? eles.first : eles
HTH
Robert
–
http://ruby-smalltalk.blogspot.com/
There’s no one thing that’s true. It’s all true.
On Wed, Jul 30, 2008 at 2:07 PM, Gregory B. wrote:
segments = text.split( %r{(</?[ib].*?>)} ).delete_if{|x| x.empty? }
segments.size == 1 ? segments.first : segments
end
Whoops, make that:
def parse_inline_styles(text) #:nodoc:
  segments = text.split( %r{(</?[ib]>)} ).delete_if{|x| x.empty? }
  segments.size == 1 ? segments.first : segments
end
Slaps head. I totally get what I was missing before: when the pattern
contains a capture group, split includes the matched separators in the
result:
"kitten robot snake robot tree robot".split(/(robot)/)
=> ["kitten ", "robot", " snake ", "robot", " tree ", "robot"]
"kitten robot snake robot tree robot".split(/robot/)
=> ["kitten ", " snake ", " tree "]
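Applied to the inline-style case, the same capture-group behaviour can be sketched in plain Ruby as below. The helper name `split_tags` is made up for illustration, and the inputs assume the `<i>`/`<b>` markup that the `[ib]` pattern targets.

```ruby
# Hypothetical helper: split on <i>, </i>, <b> and </b>. The tags stay in
# the result because the whole pattern sits inside a capture group.
def split_tags(text)
  text.split(%r{(</?[ib]>)}).reject { |segment| segment.empty? }
end

p split_tags("Hello World")
# => ["Hello World"]
p split_tags("Hello <i>Fine</i> World")
# => ["Hello ", "<i>", "Fine", "</i>", " World"]
p split_tags("Hello <i>Fine <b>World</b></i>")
# => ["Hello ", "<i>", "Fine ", "<b>", "World", "</b>", "</i>"]
```

The `reject` is needed because two adjacent tags (such as `</b></i>`) leave an empty string between them.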
On 30-07-2008, at 13:53, Robert D. wrote:
What about
eles = split( %r{(</?.*?>)} ).delete_if{|x| x.empty? }
eles.size.one? ? eles.first : eles
I think your regexp is wrong, since it (incorrectly) parses empty tags:
"Hello <i>Fine<> <b>World</b></i>".split( %r{(</?.*?>)} ).delete_if{|x| x.empty? }
=> ["Hello ", "<i>", "Fine", "<>", " ", "<b>", "World", "</b>", "</i>"]
I would try something like:
"Hello <i>Fine <b>World</b></i>".split( %r{(</?[bi]>)} ).delete_if{|x| x.empty? }
=> ["Hello ", "<i>", "Fine ", "<b>", "World", "</b>", "</i>"]
"Hello <i>Fine<> <b>World</b></i>".split( %r{(</?[bi]>)} ).delete_if{|x| x.empty? }
=> ["Hello ", "<i>", "Fine<> ", "<b>", "World", "</b>", "</i>"]
Or:
"Hello <i>Fine <b>World</b></i>".split( %r{(</?[^>]+>)} ).delete_if{|x| x.empty? }
=> ["Hello ", "<i>", "Fine ", "<b>", "World", "</b>", "</i>"]
"Hello <i>Fine<> <b>World</b></i>".split( %r{(</?[^>]+>)} ).delete_if{|x| x.empty? }
=> ["Hello ", "<i>", "Fine<> ", "<b>", "World", "</b>", "</i>"]
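To see the three patterns side by side, here is a small comparison script. It is only a sketch: the helper `segments_for`, the labels, and the sample input (with a stray `<>` in the middle) are assumptions chosen for illustration.

```ruby
# Hypothetical helper: split on a pattern and drop empty segments.
def segments_for(text, pattern)
  text.split(pattern).reject { |segment| segment.empty? }
end

# Sample input with a stray "<>" (assumed markup, for illustration only).
text = "Hello <i>Fine<> <b>World</b></i>"

patterns = {
  "anything in angle brackets" => %r{(</?.*?>)},   # also matches "<>"
  "only <i> and <b>"           => %r{(</?[bi]>)},  # most restrictive
  "any non-empty tag"          => %r{(</?[^>]+>)}, # skips "<>", accepts other tags
}

patterns.each do |label, pattern|
  puts "#{label}: #{segments_for(text, pattern).inspect}"
end
```

Only the first pattern breaks `<>` out as its own segment; the other two leave it inside the surrounding text. The third would still split on tags other than `<i>` and `<b>`, so which one fits depends on how strict the parser should be.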
HTH
Robert
Now, I would bet that this might be a little too expensive with large
strings.
regards,
On Wed, Jul 30, 2008 at 8:12 PM, Rolando A. wrote:
On 30-07-2008, at 13:53, Robert D. wrote:
What about
eles = split( %r{(</?.*?>)} ).delete_if{|x| x.empty? }
eles.size.one? ? eles.first : eles
I think your regexp is wrong, since it (incorrectly) parses empty tags:
It passed the specs did it not? I did not know what Gregory wanted
exactly, turns out he wanted %r{(</?[ib]>)}
but he got the message ;).
Now, I would bet that this might be a little too expensive with large
strings.
Hmm why?
It would be different if we were treating files, but as the string is
already here we use the memory required by the
specification and nothing more.
Robert
On Jul 30, 2008, at 11:28 AM, Gregory B. wrote:
my take:
cfp:~ > cat a.rb
require 'yaml'
strings =
  "Hello World",
  "Hello <i>Fine</i> World",
  "Hello <i>Fine <b>World</b></i>"
def parse_inline_styles string, tags = %w( <i> </i> <b> </b> )
  re = Regexp.new tags.flatten.map{|tag| "(#{ Regexp.escape tag })"}.join('|')
  tokens = string.split(re)
  tokens.delete_if{|token| token.empty?}
  ((tokens.size == 1 and tokens.first == string) ? string : tokens)
end
strings.each do |string|
  y string => parse_inline_styles(string)
end
cfp:~ > ruby a.rb
Hello World: Hello World
Hello Fine World:
Hello Fine World:
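A variation on the configurable tag list is to let `Regexp.union` do the escaping and alternation. This is only a sketch: the method name `split_on_tags` and the default tag list are assumptions for illustration, not code from the thread.

```ruby
# Assumed default: the <i>/<b> markup discussed in this thread.
DEFAULT_TAGS = %w( <i> </i> <b> </b> )

# Hypothetical variant of a splitter with a configurable tag list.
def split_on_tags(string, tags = DEFAULT_TAGS)
  # Regexp.union escapes each literal tag and joins them with "|";
  # the outer capture group keeps the matched tags in the split result.
  pattern = /(#{Regexp.union(tags)})/
  tokens = string.split(pattern).reject { |token| token.empty? }
  tokens.size == 1 && tokens.first == string ? string : tokens
end

p split_on_tags("Hello World")
p split_on_tags("Hello <i>Fine</i> World")
p split_on_tags("a<u>b</u>", %w( <u> </u> ))
```

Using a single outer group instead of one group per tag gives the same segments here and keeps the pattern a little easier to read.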
a @ http://codeforpeople.com/
On Wed, Jul 30, 2008 at 10:43 PM, Gregory B. wrote:
It does a double pass through the segments rather than a single pass,
Yes but these two passes are quite fast, see below.
However, I think it’ll be okay for my purposes (PDF inline styling),
unless I missed some other concern Rolando had.
Well trying to be useful I checked for some larger texts, I omitted
the conditional #first at the end of the parsing method for clarity.
Turns out that split is still much better than scanning for a string
of a size over one megabyte, but it is not
really fast either:
539/39 > cat split.rb && ruby split.rb
require 'benchmark'

def split_select txt
  txt.split(%r{(</?[bi]>)}).select{|x| ! x.empty? }
end

def split_delete txt
  txt.split(%r{(</?[bi]>)}).delete_if{|x| x.empty? }
end

def use_scan txt
  r=[]
  txt.scan(%r{(.*?)(</?[ib]>)}) do | pr,po |
    r << pr unless pr.empty?
    r << po
  end
  r << $' unless $'.empty?
end

N = 400_000;
text = "bolditalicboldboldnormal" * N;

Benchmark.bmbm do | bm |
  bm.report("split_select") do
    split_select text
  end
  bm.report("split_delete") do
    split_delete text
  end
  bm.report("use_scan") do
    use_scan text
  end
end
Rehearsal ------------------------------------------------
split_select 6.844000 0.094000 6.938000 ( 7.063000)
split_delete 7.687000 0.109000 7.796000 ( 7.953000)
use_scan 20.063000 0.203000 20.266000 ( 20.109000)
-------------------------------------- total: 35.000000sec
user system total real
split_select 6.344000 0.031000 6.375000 ( 6.485000)
split_delete 6.500000 0.109000 6.609000 ( 6.359000)
use_scan 16.625000 0.265000 16.890000 ( 16.906000)
HTH
Robert
On Wed, Jul 30, 2008 at 4:35 PM, Robert D. wrote:
exactly, turns out he wanted %r{(</?[ib]>)}
but he got the message ;).
My implementation was tighter than my specs, but I added an extra one
to catch this. 
It would be different if we were treating files, but as the string is
already here we use the memory required by the
specification and nothing more.
It does a double pass through the segments rather than a single pass,
and I guess that if I had a giant string with a ton of tags I needed
to parse, that’d make it less efficient.
However, I think it’ll be okay for my purposes (PDF inline styling),
unless I missed some other concern Rolando had.
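If the second pass over the segments ever did matter, one way to get the same result from a single call is `String#scan` with an alternation, so that no empty strings are produced in the first place. This is a sketch under assumptions: the names `SEGMENT` and `scan_segments` are invented here, and it has not been benchmarked against the split version.

```ruby
# Either a tag, or the longest run of characters in which no tag starts.
# The /m flag lets "." match newlines, so multi-line text stays in one segment.
SEGMENT = %r{</?[ib]>|(?:(?!</?[ib]>).)+}m

# Hypothetical single-call alternative to split + delete_if.
def scan_segments(text)
  text.scan(SEGMENT)
end

p scan_segments("Hello World")
p scan_segments("Hello <i>Fine <b>World</b></i>")
```

The negative lookahead is checked at every character, so this trades the extra array pass for more regexp work and may well be slower in practice.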
On Wed, Jul 30, 2008 at 6:12 PM, Robert D. wrote:
Turns out that split is still much better than scanning for a string
of a size over one megabyte, but it is not
really fast either:
539/39 > cat split.rb && ruby split.rb
Interesting that it’s faster to do a double pass with split than a
single pass with StringScanner.
I didn’t implement it that way for efficiency, just out of total
forgetfulness on how to get split() to work the way I want 
-greg
Interesting that it’s faster to do a double pass with split than a
single pass with StringScanner.
I didn’t implement it that way for efficiency, just out of total
forgetfulness on how to get split() to work the way I want 
Anyway for small strings even scan will take only one split second,
sorry could not resist the pun ;).
Robert