Forum: Ruby how do I extract the ID name of a Div and its content?

Hans W. (Guest)
on 2008-12-12 13:05
Hi folks,
sorry to bother you with such a mundane question, but I tried for 3 hours
and don't know how :-(
I need to extract the strings, that is, everything between "" or ''.

This is the input:

<body>
<div id='pagewrapper'>
<div id='header'>
<p>Ruby Forum is cool</p>
</div>
<div id='navbar'>
<ul>
<li><%= link_to 'Cities', cities_path %></li>
<li><%= link_to 'Restaurants', restaurants_path %></li>
<li><%= link_to 'Categories', categories_path %></li>
<li><%= link_to 'Products',  products_path %></li>
</ul>

Desired output:

pagewrapper
header
navbar
  Cities
  Restaurants
  Categories
  Products


Please help
Thanks in advance
pannda
Brian C. (Guest)
on 2008-12-12 14:02
Hans Wurst wrote:
> but I tried for 3 hours and don't know how :-(

So, what approach or approaches did you try? What was the solution that
got closest to what you were trying, and what did it output?

There are a whole host of ways you might be approaching this. For
example, you might be using the Hpricot (HTML parsing) library, or REXML
(XML parsing), or simple regular expression matching.

If all you want is the bits between 'single quotes' then a regular
expression match is probably easiest. Try using String#scan, and give it
a regular expression which matches a single quote followed by any number
of non-single-quote characters followed by a single quote.
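A minimal sketch of that String#scan idea, assuming a short made-up input string (the variable names `html` and `quoted` are only for illustration):

```ruby
# Hypothetical sample input, not the poster's real file.
html = "<div id='header'><li><%= link_to 'Cities', cities_path %></li></div>"

# A single quote, any run of non-quote characters (captured), then a closing quote.
# With one capture group, scan returns an array of one-element arrays,
# so flatten turns it into a plain list of strings.
quoted = html.scan(/'([^']*)'/).flatten

p quoted
```

This only holds up as long as every single quote in the input is part of a matched pair.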

Or you could use String#split("'") and keep only the odd-numbered
elements of the returned array.
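A rough sketch of the split approach, again on a made-up input string (`html`, `parts` and `quoted` are illustrative names). Counting from index 0, the quoted pieces land at indices 1, 3, 5 and so on:

```ruby
# Hypothetical sample input, not the poster's real file.
html = "<div id='header'><li><%= link_to 'Cities', cities_path %></li></div>"

# Splitting on the quote character alternates between text outside quotes
# (even indices) and text inside quotes (odd indices).
parts = html.split("'")

# Keep only the elements at odd indices.
quoted = parts.select.with_index { |_part, index| index.odd? }

p quoted
```

As with the regular expression, a stray or unbalanced quote shifts the indices and gives wrong results.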

I'm sure if you post your actual code and what it does, someone will
help you tweak it to work.
Robert D. (Guest)
on 2008-12-12 14:05
(Received via mailing list)
Disclaimer: If there are other ' characters somewhere in the document
(comments, CDATA sections, text elements), this will break miserably
and you will need Hpricot or another HTML parser.
If, however, your data is simple enough to meet that prerequisite,
it becomes very easy...

robert@siena:~/log/ruby/ML 12:56:41
505/6 > cat strings.rb && ruby strings.rb
#!/usr/bin/ruby
# vim: sw=2 ts=2 ft=ruby expandtab tw=0 nu syn=on:
# file: strings.rb


text = DATA.read

p text.scan(/'(.*?)'/)
__END__
<body>
<div id='pagewrapper'>
<div id='header'>
<p>Ruby Forum is cool</p>
</div>
<div id='navbar'>
<ul>
<li><%= link_to 'Cities', cities_path %></li>
<li><%= link_to 'Restaurants', restaurants_path %></li>
<li><%= link_to 'Categories', categories_path %></li>
<li><%= link_to 'Products',  products_path %></li>
</ul>
[["pagewrapper"], ["header"], ["navbar"], ["Cities"], ["Restaurants"],
["Categories"], ["Products"]]
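To get closer to the indented listing asked for in the question, one possible variation is to tell div ids and link labels apart in the pattern itself. This is only a sketch on a shortened copy of the input; the two-alternative regular expression and the name `outline` are assumptions, not part of the script above:

```ruby
# Shortened copy of the input from the question.
html = <<~'HTML'
  <div id='pagewrapper'>
  <div id='header'>
  <div id='navbar'>
  <li><%= link_to 'Cities', cities_path %></li>
  <li><%= link_to 'Products',  products_path %></li>
HTML

# Two alternatives with one capture group each: scan yields [id, nil]
# for a div id and [nil, label] for a link_to label.
outline = html.scan(/id='([^']*)'|link_to '([^']*)'/).map do |id, label|
  id || "  #{label}" # ids stay flush left, labels are indented by two spaces
end

puts outline
```

The same caveat applies: it relies on the markup being as regular as in the sample.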
Hans W. (Guest)
on 2008-12-12 14:13
Thaaaaaaaank You Robert!
Thumbs up! This is exactly what I wanted: clean & simple, without
Hpricot, REXML or overly complicated regexes.

Have a nice weekend
greetz Pannda
Hans W. (Guest)
on 2008-12-12 14:43
Hi Brian
> I'm sure if you post your actual code and what it does, someone will
> help you tweak it to work.

My actual code was this:

require 'rubygems'
require 'yaml'
require 'hpricot'

# Read the input file line by line. I tried three variants, one at a time;
# option 3 is the one left active here.
html_datei = File.readlines(ARGV[0]).collect do |line|
  # option 1: line.gsub(/<\/?[^>]*>/, "").to_yaml
  # option 2: line.gsub(/</, "").gsub(/>/, "").strip.to_yaml.gsub(/---/, "").lstrip
  # option 3:
  line.gsub(/^<\/?[^>]*>/, "").lstrip
end

# All 3 options worked, but I couldn't figure out how to get the text
# between "" or '', so I tried Hpricot.

doc = File.open(ARGV[0]) { |f| Hpricot(f).search("div") }

# But how to go on from here? I couldn't figure out the Hpricot
# documentation, because .to_inner_html doesn't work.

# Write both results to the output file given as the second argument.
yaml_datei = File.new(ARGV[1], 'w+')
yaml_datei << html_datei
yaml_datei << doc
yaml_datei.close

So, that was my actual code.
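For the YAML part, one possible sketch that combines the scan approach from this thread with the standard yaml library, assuming a made-up input string (`html`, `strings` and `yaml_text` are illustrative names, and the file-writing step is left as a comment):

```ruby
require 'yaml'

# Hypothetical sample input, not the real file passed in ARGV[0].
html = "<div id='navbar'><li><%= link_to 'Cities', cities_path %></li>"

# Collect everything between single quotes as a flat array of strings.
strings = html.scan(/'([^']*)'/).flatten

# Serialize the array as a YAML document.
yaml_text = strings.to_yaml
puts yaml_text

# Writing it out would then be something like:
#   File.write(ARGV[1], yaml_text)
```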