Regex find everything between

So here’s the problem:

I have a html document that is being spit out to me as a string.

example: "<!doctype html>\n<html lang=“en”>\n \n\n

\n \t\n \t \n \t

My page Testing

\n

some text here

\t\n \t

This is my footer info

\n \t\n \n"

I’m using a regular expression to find all the opening tags of the DOM
elements (<html lang="en">, <h1 class="my-class">, etc.) and it’s
working. This is via the scan() method.

==============================
elements = []
opening_tags = file.scan(/<\w+\s+[^>]*>/)
opening_tags.each do |tag|
  # tries to match anything with a class="editor"
  if tag.match(/class=\"(.*?)editor(.*?)\"/)
    # finds which DOM element it is and returns the close tag
    # example if ‘ ’ returns ‘ ’
    close = get_closing_tag(tag)
    # pushes all matches to the elements array
    file.match(/#{tag}(.+)#{close}]/) { |m| elements << m }
  end
end

=======================================
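
For reference, here is a small sketch of what a scan call like that returns, assuming the pattern is /<\w+\s+[^>]*>/. The input string is a made-up stand-in, since the real document isn't shown in full. Because of the \s+, the pattern only picks up opening tags that carry at least one attribute, so a bare <body> is skipped.

```ruby
# Hypothetical input, not the poster's actual document.
file = "<html lang=\"en\">\n<body>\n<h1 class=\"my-class\">Testing</h1>\n</body>\n</html>"

# "<", a tag name, at least one whitespace character, then anything up to ">".
opening_tags = file.scan(/<\w+\s+[^>]*>/)

p opening_tags
# => ["<html lang=\"en\">", "<h1 class=\"my-class\">"]
```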

So I get the opening tags as I should, and I get a proper closing tag
for each, but /#{tag}(.+)#{close}]/ returns nothing.

Output from Rails.logger.info
+++++++++++++++++++++++++++++++++++++++
==== tag ====
“<h1 class=“my-class”>”
==== close ====


==== /#{tag}(.+)#{close}]/ ====
/

(.+)</p>]/
==== tag ====
“<p class=“my-class icon”>”
==== close ====


==== /#{tag}(.+)#{close}]/ ====
/

(.+)</p>]/
==== tag ====
“<p class=“fred my-class”>”
==== close ====


==== /#{tag}(.+)#{close}]/ ====
/

(.+)</p>]/
======= elements ========
[]

+++++++++++++++++++++++++++++++++++++++

Any help would be appreciated. I’m at my wits end here. If there is a
completely better way to do this, I’m all ears as well.

Thank you in advance.

On Mon, Aug 22, 2011 at 4:53 PM, Keith R. [email protected] wrote:

I’m using regular expression to find all the opening tags of the dom
elements … but /#{tag}(.+)#{close}]/ returns nothing

Try out nokogiri: https://github.com/tenderlove/nokogiri

After you’ve let it parse your document you can use CSS3 or XPath
selectors to find what you are looking for.

Letting someone else do all the dirty work is a good idea for
potentially dirty HTML.

– John-John T.

On Mon, Aug 22, 2011 at 7:53 AM, Keith R. [email protected]
wrote:

I have a html document that is being spit out to me as a string.

I’m using regular expression to find …

If there is a completely better way to do this, I’m all ears as well.

There is: nokogiri -- it’s made for exactly this. Trying to parse XML or
HTML via regex is a path to tears and insanity. :-)

if tag.match(/class=\"(.*?)editor(.*?)\"/) # tries to match anything with a class="editor"

tag = ‘

if tag.match(/class=\"(.*?)editor(.*?)\"/)
  puts 'yes'
else
  puts 'no'
end

--output:--
no

tag = ‘

if tag.match(/class="(.*?)editor(.*?)"/)
  puts 'yes'
else
  puts 'no'
end

--output:--
yes

close = get_closing_tag(tag)

but /#{tag}(.+)#{close}]/ returns nothing

Do you really expect anyone to be able to tell you what’s wrong there?
How would anyone know what get_closing_tag() returns?
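
That said, if get_closing_tag() returns something like '</p>' (an assumption, since it isn't shown), a few things in that last pattern could each explain the empty result: the trailing "]" has to match a literal "]" after the closing tag, "." does not match newlines unless the regex has the /m flag, and the interpolated strings go in unescaped. A minimal sketch on a made-up string, not a diagnosis of the real document:

```ruby
# Hypothetical input; the poster's real document is not shown in full.
html = "<div>\n<p class=\"my-class icon\">some\ntext here</p>\n</div>"

tag   = '<p class="my-class icon">'
close = '</p>' # assumed return value of get_closing_tag(tag)

# The pattern as posted (the stray "]" is written as \] here, same meaning).
as_posted = /#{tag}(.+)#{close}\]/

# Escape the interpolated strings, drop the "]", let "." span newlines (/m),
# and make the quantifier lazy so it stops at the first closing tag.
fixed = /#{Regexp.escape(tag)}(.+?)#{Regexp.escape(close)}/m

posted_result = html.match(as_posted)
fixed_result  = html.match(fixed)

p posted_result    # => nil
p fixed_result[1]  # => "some\ntext here"
```

Even with those fixes, nested or repeated tags will trip this up quickly, which is why a parser is the safer route.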

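require 'nokogiri'

# Parse the HTML file; the block form closes the file handle afterwards
doc = File.open('html.htm') { |f| Nokogiri::HTML(f) }

# Select every element whose class attribute contains the substring "editor"
doc.xpath('//*[contains(@class,"editor")]').each do |el|
  p [
    el.attributes['class'].value, # the full class attribute
    el.children[0].text           # text of the element's first child node
  ]
end

--output:--
["editor_greeting", "Hello world"]
["myeditor_fruit", "Apple"]
["editor_name", "Papillon"]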

==== html.htm:

Test

Hello world

<div class="myeditor_fruit">Apple</div>

<div class="article">
  <div>Not this node.</div>
  <div class="editor_name">Papillon</div>
</div>

===
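
Note that contains(@class,"editor") is a substring test, which is why "myeditor_fruit" shows up in the output. If only a whole class name should count, the check has to be made on the whitespace-separated tokens instead. A plain-Ruby sketch of the difference, using the class strings from the output plus one hypothetical multi-class value:

```ruby
# The first three strings come from the output above; the last one is hypothetical.
classes = ["editor_greeting", "myeditor_fruit", "editor_name", "fred editor"]

# Substring test: what contains(@class, "editor") effectively does.
substring_hits = classes.select { |c| c.include?("editor") }

# Token test: split on whitespace and look for the exact class name.
token_hits = classes.select { |c| c.split.include?("editor") }

p substring_hits # => ["editor_greeting", "myeditor_fruit", "editor_name", "fred editor"]
p token_hits     # => ["fred editor"]
```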

See the following for the basics of xpath:

http://www.w3schools.com/xpath/default.asp

Keith R. wrote in post #1017873:

So here’s the problem:

I have a html document that is being spit out to me as a string.

I’m using regular expression to find all the opening tags of the dom
elements.

For what purpose? What is your ultimate goal?

You guys have just made my week!!! Thank you so much!

Nokogiri works like a charm. Soooo amazing!

I will definitely add this to my list of “must have” gems.

Thank you again.