How do I extract the ID name of a Div and its content?


#1

Hi folks,
Sorry to bother you with such a mundane question, but I tried for 3 hours
and don't know how :(
I need to extract the strings, that is, everything between "" or ''.

This is the input:

Ruby Forum is cool

  • <%= link_to 'Cities', cities_path %>
  • <%= link_to 'Restaurants', restaurants_path %>
  • <%= link_to 'Categories', categories_path %>
  • <%= link_to 'Products', products_path %>

Desired output:

pagewrapper
header
navbar
Cities
Restaurants
Categories
Products

Please help
Thanks in advance
pannda


#2

Hans Wurst wrote:

but I tried for 3 hours and don't know how :(

So, what approach or approaches did you try? What was the solution that
got closest to what you were trying, and what did it output?

There are a whole host of ways you might be approaching this. For
example, you might be using the Hpricot (HTML parsing) library, or REXML
(XML parsing), or simple regular expression matching.

If all you want is the bits between ‘single quotes’ then a regular
expression match is probably easiest. Try using String#scan, and give it
a regular expression which matches a single quote followed by any number
of non-single-quote characters followed by a single quote.
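
A minimal sketch of that String#scan idea. The sample line and the variable names are made up for illustration, and the capture group (the parentheses) is an addition so that scan returns only the text between the quotes rather than the quotes themselves:

```ruby
# Hypothetical sample line; any string with single-quoted parts will do.
line = "<%= link_to 'Cities', cities_path %>"

# A single quote, any number of non-single-quote characters, a single quote.
# With a capture group, scan returns one small array per match,
# so flatten turns [["Cities"]] into ["Cities"].
quoted = line.scan(/'([^']*)'/).flatten

p quoted
```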

Or you could use String#split("'") and keep only the odd-indexed
elements (indices 1, 3, 5, ...) of the returned array.
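
The split variant might look roughly like this. Again the sample string is invented, and this only works if the single quotes in the input are properly paired:

```ruby
# Hypothetical sample line with two single-quoted strings.
line = "link_to 'Cities', 'Restaurants'"

# Splitting on the quote character alternates between text outside
# the quotes (even indices) and text inside them (odd indices).
parts = line.split("'")

# Keep only the elements at odd indices.
quoted = parts.select.with_index { |_part, index| index.odd? }

p quoted
```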

I’m sure if you post your actual code and what it does, someone will
help you tweak it to work.


#3

Disclaimer: If there are other ' characters somewhere in the document
(comments, CDATA sections, text elements), this will break miserably
and you will need Hpricot or another HTML parser.
If, however, your data is simple enough to meet that prerequisite,
it becomes very easy…

robert@siena:~/log/ruby/ML 12:56:41
505/6 > cat strings.rb && ruby strings.rb
#!/usr/bin/ruby
# vim: sw=2 ts=2 ft=ruby expandtab tw=0 nu syn=on:
# file: strings.rb

text = DATA.read

p text.scan(/'(.*?)'/)
__END__

Ruby Forum is cool

  • <%= link_to 'Cities', cities_path %>
  • <%= link_to 'Restaurants', restaurants_path %>
  • <%= link_to 'Categories', categories_path %>
  • <%= link_to 'Products', products_path %>
[["pagewrapper"], ["header"], ["navbar"], ["Cities"], ["Restaurants"], ["Categories"], ["Products"]]
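
Since the original question asked for everything between "" or '', here is a hedged variation that accepts both quote styles and returns a flat array instead of the nested one above. The sample text is an assumption (the div markup of the real input is not visible in this thread), and the same disclaimer about stray quote characters applies:

```ruby
# Hypothetical input mixing double-quoted attributes and single-quoted strings.
text = %q(<div id="header"> <%= link_to 'Cities', cities_path %>)

# (["']) captures the opening quote; \1 demands the same character as the
# closing quote. scan yields [quote, content] pairs, so keep only the content.
strings = text.scan(/(["'])(.*?)\1/).map { |_quote, content| content }

p strings
```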

#4

Thaaaaaaaank You Robert!
Thumbs up! This is exactly what I wanted: clean & simple, without
Hpricot, REXML or overly complicated regexes.

Have a nice weekend
greetz Pannda


#5

Hi Brian

I’m sure if you post your actual code and what it does, someone will
help you tweak it to work.

The actual code was this:

require 'rubygems'
require 'yaml'
require 'hpricot'

html_datei = File.open(ARGV[0]).readlines.collect do |line|
  # option 1: line.gsub(/<\/?[^>]*>/, "").to_yaml
  # option 2: line.gsub(/</, "").gsub(/>/, "").strip.to_yaml.gsub(/---/, "").lstrip
  # option 3: line.gsub(/^<\/?[^>]*>/, "").lstrip
end

All 3 options worked, but I couldn't figure out how to get what is in
between those "" or ''.

So I tried Hpricot:

doc = open(ARGV[0]) { |f| Hpricot(f).search("div") }

But how to go from here? I couldn't figure out the documentation of
Hpricot, because .to_inner_html doesn't work.

yaml_datei = File.new(ARGV[1], 'w+')

yaml_datei << html_datei
yaml_datei << doc
yaml_datei.close

So, that was my actual code.
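
For the YAML output step, a rough sketch that combines the quote-matching regex with to_yaml might look like this. The helper name quoted_strings_to_yaml is hypothetical, and reading the whole file at once with File.read is a choice made here, not something taken from the code above:

```ruby
require 'yaml'

# Hypothetical helper: collect everything between "" or '' and dump it as YAML.
def quoted_strings_to_yaml(text)
  # scan yields [quote, content] pairs; keep only the content.
  strings = text.scan(/(["'])(.*?)\1/).map { |_quote, content| content }
  strings.to_yaml
end

# Usage: ruby script.rb input.html output.yml
if __FILE__ == $PROGRAM_NAME && ARGV.length == 2
  input_path, output_path = ARGV
  # The block form of File.open closes the file automatically.
  File.open(output_path, 'w') do |out|
    out << quoted_strings_to_yaml(File.read(input_path))
  end
end
```

The block form of File.open makes the explicit close call unnecessary, and to_yaml on an array of strings produces one "- item" line per string.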