String replacing help


#1

I’m working with Mechanize doing some screen scraping. Because of the
project, I have to use an older version of Mechanize for now (0.8.4).

The goal is to take a string and insert a pipe '|' before each word
that is not inside an <a> tag.

I have:

template_body.class
=> Hpricot::Elements

template_body.to_html
=> "

<a href="javascript:document.f6.SLID.value='F36';
document.f6.submit();" onMouseOut="window.status='';"
title="Select" onMouseOver="window.status='Select'; return
true;">HEAD \n <a
href="javascript:document.f6.SLID.value='F37'; document.f6.submit();"
onMouseOut="window.status='';" title="Select"
onMouseOver="window.status='Select'; return true;">TITLE <a
href="javascript:document.f6.SLID.value='F38'; document.f6.submit();"
onMouseOut="window.status='';" title="Select"
onMouseOver="window.status='Select'; return true;">"test"\n<a
href="javascript:document.f6.SLID.value='F39'; document.f6.submit();"
onMouseOut="window.status='';" title="Select"
onMouseOver="window.status='Select'; return true;">BODY \n
<a href="javascript:document.f6.SLID.value='F40';
document.f6.submit();" onMouseOut="window.status='';"
title="Select" onMouseOver="window.status='Select'; return
true;">DIV id <a
href="javascript:document.f6.SLID.value='F41'; document.f6.submit();"
onMouseOut="window.status='';" title="Select"
onMouseOver="window.status='Select'; return true;">"main-div"\n
<a href="javascript:document.f6.SLID.value='F42';
document.f6.submit();" onMouseOut="window.status='';"
title="Select" onMouseOver="window.status='Select'; return
true;">CSS-WITH-LINK destination <a
href="javascript:document.f6.SLID.value='F43'; document.f6.submit();"
onMouseOut="window.status='';" title="Select"
onMouseOver="window.status='Select'; return true;">TO <a
href="javascript:document.f6.SLID.value='F44'; document.f6.submit();"
onMouseOut="window.status='';" title="Select"
onMouseOver="window.status='Select'; return true;">:index\n
<a href="javascript:document.f6.SLID.value='F45';
document.f6.submit();" onMouseOut="window.status='';"
title="Select" onMouseOver="window.status='Select'; return
true;">IMAGE source <a
href="javascript:document.f6.SLID.value='F46'; document.f6.submit();"
onMouseOut="window.status='';" title="Select"
onMouseOver="window.status='Select'; return
true;">RENDER image <a
href="javascript:document.f6.SLID.value='F47'; document.f6.submit();"
onMouseOut="window.status='';" title="Select"
onMouseOver="window.status='Select'; return true;">@image\n
max-height <a href="javascript:document.f6.SLID.value='F48';
document.f6.submit();" onMouseOut="window.status='';"
title="Select" onMouseOver="window.status='Select'; return
true;">m-h\n"

The words I’m trying to insert the pipe before have non-breaking spaces
(&nbsp;) around them. I had it working where I iterate through
everything and return a new string, but I end up losing all line breaks
and non-breaking spaces using:

new_body = ''
template_body.to_html.split("&nbsp;").each do |el|
  el.split("\n").each do |e|
    unless e.empty? or e =~ /<\/?[^>]*>/
      e = '|' + e
    end
  end
  new_body += el
end
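
For anyone reading along, here is a minimal sketch of why a split-based loop like this drops the separators and never adds the pipes. It assumes the separator is the literal entity &nbsp;, and the sample string is a made-up stand-in for the real to_html output:

```ruby
# Hypothetical sample; the real input would be template_body.to_html.
html = "<a href=\"#\">DIV</a>&nbsp;id&nbsp;<a href=\"#\">\"main-div\"</a>\n"

# split throws the delimiter away, so gluing the pieces back together
# without it loses every &nbsp;.
pieces = html.split("&nbsp;")
rejoined = pieces.join

# Assigning to a block variable only rebinds the local name;
# the element stored in the array is untouched.
words = ["id"]
words.each { |w| w = "|" + w }

# map collects the new values, and join puts the delimiter back.
piped = pieces.map { |piece| piece =~ /<\/?[^>]*>/ ? piece : "|" + piece }.join("&nbsp;")
```

Here rejoined has no &nbsp; left in it, words is still ["id"], and piped keeps both the &nbsp; separators and the trailing newline.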

Any ideas?

Thanks,
~Jeremy


#2

On Sun, Apr 19, 2009 at 6:59 AM, Jeremy W.
removed_email_address@domain.invalid wrote:

Does this work?

in_a = false
result = ""
s.scan(/<[^>]+>|[^<]+/).each do |e|
  if e =~ /<a/
    in_a = true
    result << e
  elsif e =~ /<\/a/
    in_a = false
    result << e
  elsif !in_a && e =~ /\A\s*\w/
    result << "|#{e}"
  else
    result << e
  end
end
result
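
One thing to watch with a loop like this: prefixing the whole text segment puts the pipe in front of any leading whitespace, and a segment that starts with the literal entity &nbsp; does not match \A\s*\w at all. Below is a sketch of a variant that uses the same scan tokenizer but puts the pipe directly before each word. The method name pipe_outside_anchors and the sample input are made up for illustration, and it assumes no attribute value contains a ">" character:

```ruby
# Hypothetical helper; same tokenizing idea as the loop above.
def pipe_outside_anchors(html)
  in_a = false
  html.scan(/<[^>]+>|[^<]+/).map do |token|
    if token =~ /\A<a\b/i
      in_a = true            # opening <a ...> tag
      token
    elsif token =~ /\A<\/a\b/i
      in_a = false           # closing </a> tag
      token
    elsif in_a
      token                  # link text stays as it is
    else
      # Outside a link: leave entities such as &nbsp; alone and
      # prefix every other word (hyphens allowed) with a pipe.
      token.gsub(/&\w+;|[\w-]+/) { |m| m[0, 1] == "&" ? m : "|" + m }
    end
  end.join
end

sample = "<a href=\"#\">DIV</a>&nbsp;id&nbsp;<a href=\"#\">\"main-div\"</a>\n"
piped_sample = pipe_outside_anchors(sample)
```

Because nothing is split away, the &nbsp; entities and the \n come back exactly where they were.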

Andrew T.
http://ramblingsonrails.com
http://www.linkedin.com/in/andrewtimberlake

“I have never let my schooling interfere with my education” - Mark Twain


#3

Jeremy W. wrote:

The goal of what I’m trying to do is take a string and insert pipes '|'
before words that are not inside of <a> tags.

Yet you were not able to provide an example of the desired result?

The words I’m trying to insert the pipe before have non-breaking spaces
(&nbsp;) around them.

Based on my interpretation of your description, I think this is what
you want:

result = str.gsub(/&nbsp;([^<]*)&nbsp;/m, '&nbsp;|\1&nbsp;')
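
A quick worked example of what a substitution like this does, assuming the separators are literal &nbsp; entities. Both sample strings are invented for illustration:

```ruby
# Hypothetical sample with one bare word between two links.
str = "<a href=\"#\">DIV</a>&nbsp;id&nbsp;<a href=\"#\">\"main-div\"</a>\n"

# \1 is the captured text between the two &nbsp; entities; the entities
# themselves are written back in the replacement, so nothing is lost.
result = str.gsub(/&nbsp;([^<]*)&nbsp;/m, '&nbsp;|\1&nbsp;')

# [^<]* is greedy: two bare words in a row are captured as one chunk,
# so only the first of them gets a pipe.
two_words = "<a href=\"#\">X</a>&nbsp;a&nbsp;b&nbsp;<a href=\"#\">Y</a>"
greedy = two_words.gsub(/&nbsp;([^<]*)&nbsp;/m, '&nbsp;|\1&nbsp;')
```

Note that the pattern knows nothing about tags, so text inside a link that happens to sit between two &nbsp; entities would get a pipe as well.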


#4

7stud – wrote:

Jeremy W. wrote:

The goal of what I’m trying to do is take a string and insert pipes '|'
before words that are not inside of <a> tags.

Yet you were not able to provide an example of the desired result?

I’ll try what you have, but here is the desired result:

HEAD
TITLE "test"
BODY
DIV |id "main-div"
CSS-WITH-LINK |destination TO :index
IMAGE |source RENDER |image @image

I’m getting…

HEAD

TITLE
“test”
BODY

DIV
|id
“main-div”
CSS-WITH-LINK
|destination
TO
:index

  IMAGE

|source
RENDER
|image
@image


#5

Jeremy W. wrote:

7stud – wrote:

Jeremy W. wrote:

The goal of what I’m trying to do is take a string and insert pipes '|'
before words that are not inside of <a> tags.

Yet you were not able to provide an example of the desired result?

I’ll try what you have, but here is the desired result:

HEAD
TITLE "test"
BODY
DIV |id "main-div"
CSS-WITH-LINK |destination TO :index
IMAGE |source RENDER |image @image

Ok. Here’s the deal. When asking this type of question, you need to
post two things:

  1. The starting string.
  2. The result string.

Apparently, you want to know what regex will transform this string:

(the full template_body.to_html output from post #1)

into this string:

"HEAD
TITLE "test"
BODY
DIV |id "main-div"
CSS-WITH-LINK |destination TO :index
IMAGE |source RENDER |image @image"

Good luck with that.


#6

Yeah, sorry. I posted that at 3:00am after a few beers and didn’t think
about it until this morning. I made a few changes based on your
suggestions. So, let’s try this again…

Here is my method
http://rafb.net/p/rVurUa81.html

Here is what I am getting…
http://rafb.net/p/O1c03297.html

Here is what I want
http://rafb.net/p/OmoKsi17.html

My thought was that I would take that HTML string, use a regexp to find
the text that is not inside any anchor tags, add the pipe to it, and
then return the modified HTML string. I want to keep all the &nbsp;
and \n so the formatting remains the same.

At this point, if I can figure out how to place the line breaks back in
the right place, then I may just be set.
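
A small sketch for checking the line breaks, assuming the pipes were added without splitting the string, so the original \n characters are still in place. Stripping the tags and turning each &nbsp; into a plain space should then give a listing that can be compared with the desired one. The piped_html string below is an invented sample, not the real page:

```ruby
# Hypothetical already-piped HTML; the real string comes from the scraper.
piped_html = "<a href=\"#\">DIV</a>&nbsp;|id&nbsp;<a href=\"#\">\"main-div\"</a>\n" +
             "<a href=\"#\">BODY</a>\n"

# Drop every tag, then show each non-breaking space as an ordinary space.
# The \n characters are never touched, so the line structure survives.
text = piped_html.gsub(/<[^>]+>/, "").gsub("&nbsp;", " ")
```

If the lines in text come out one word per line instead, the line breaks are most likely being added at the point where the pieces are printed or joined, not by the substitution itself.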

Thanks for the help guys,

~Jeremy