http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/207625
is an answer to most of my requirements, except one.
How can I do a selective traverse_text so that I can skip text of
specific tags?
One option was to use parent.name while traversing over text.
Here is the code that I tried,
require 'hpricot'
class Hpricot::Text
  def set(string)
    @content = string
    self.raw_string = string
  end
end
s = <<HTML
Abcd
this is in java1
- aabbcc
- mmnnoo
- this is in java2
this is in java3
HTML
index = Hpricot.parse(s)
index.traverse_text { |text|
  t = text.to_s.strip
  if text.parent and text.parent.name and text.parent.name != 'java' and
     not t.empty?
    t = "=#{t}="
    text.set(t)
    puts "Modified text to:#{t}"
  end
}
puts index
I get the following output and error:
Modified text to:=Abcd=
Modified text to:=aabbcc=
Modified text to:=mmnnoo=
hpricot-test1.rb:30: undefined method `name' for #<Hpricot::Doc:0x2e49c18> (NoMethodError)
from c:/ruby/lib/ruby/gems/1.8/gems/hpricot-0.4-mswin32/lib/hpricot/traverse.rb:377:in `traverse_text_internal'
from c:/ruby/lib/ruby/gems/1.8/gems/hpricot-0.4-mswin32/lib/hpricot/traverse.rb:366:in `traverse_text_internal'
from c:/ruby/lib/ruby/gems/1.8/gems/hpricot-0.4-mswin32/lib/hpricot/traverse.rb:146:in `each'
from c:/ruby/lib/ruby/gems/1.8/gems/hpricot-0.4-mswin32/lib/hpricot/traverse.rb:146:in `each_child'
from c:/ruby/lib/ruby/gems/1.8/gems/hpricot-0.4-mswin32/lib/hpricot/traverse.rb:366:in `traverse_text_internal'
from c:/ruby/lib/ruby/gems/1.8/gems/hpricot-0.4-mswin32/lib/hpricot/traverse.rb:358:in `traverse_text'
from hpricot-test1.rb:28
Am I making any mistake?
I am new to the world of Ruby and Hpricot … so please bear with me.
Siddharth Karandikar wrote:
http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/207625
is an answer to most of my requirements, except one.
How can I do a selective traverse_text so that I can skip text of
specific tags?
/ … snip lengthy listing of Hpricot error messages
Am I making any mistake?
Rather than describe the problems you are having trying to make Hpricot
deliver a particular result, why not say what you are trying to
accomplish, and we can discuss that instead?
Parsing and extracting particular text from syntactically correct HTML
pages is relatively easy. It only requires a few lines of Ruby code. You
can choose which tags to extract text from, and leave all the others
behind.
In some cases, it is simpler to write your own extraction code than to
try to get a library to do this for you. But this approach requires that
the HTML pages be reasonably error-free – it doesn't work very well if
there are errors in the syntax of the source pages.
If the pages you have to parse are reasonably error-free, you may have a
much easier time getting what you are after than you may think at this
point.
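As a rough illustration of the hand-rolled approach described above, here
is a minimal sketch that pulls the text out of chosen tags with a regular
expression. The helper name extract_text and the sample markup are
hypothetical, and the sketch assumes well-formed HTML in which the chosen
tag is never nested inside itself.

```ruby
# Hypothetical helper: collect the text inside every <tag>...</tag> pair.
# Assumes well-formed HTML; the chosen tag must not be nested in itself.
def extract_text(html, tag)
  name = Regexp.escape(tag)
  pattern = %r{<#{name}\b[^>]*>(.*?)</#{name}>}im
  # scan returns one single-element array per match, so flatten first,
  # then drop any inner markup and surrounding whitespace.
  html.scan(pattern).flatten.map { |inner| inner.gsub(/<[^>]+>/, "").strip }
end

sample = "<h1>Title</h1><p>First <b>bold</b> paragraph</p><p>Second</p>"
paragraphs = extract_text(sample, "p")
headings = extract_text(sample, "h1")
```

Tags that are not asked for are simply left behind, which is the point of
the approach; it will misbehave on broken or deeply nested markup.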
Here is the scenario:
I am trying to publish my blog in two languages, English and my native
language, 'marathi'. The blog posts will be written in plain text. Using
BlueCloth, I am generating the required HTML markup.
I have hacked bluecloth to spit … in required
places,
e.g.
will generate
title
Now, when I get this kind of HTML, I would like to skip all the text
under the 'english' tag and convert all the remaining text to my language,
'marathi' (UTF-8 codes). I am using Hpricot for this.
After that I am thinking of removing all the ‘english’ tags but keeping
the markup surrounding them.
Am I making any mistake?
Sure :).
Some W3C DOM theory:
A document consists of different nodes, in practice subclasses of Node:
Element, Document, Attribute, Comment, Text, ProcessingInstruction, etc.
(just off the top of my head; there are more, like DocumentFragment and
CDATASection, but it is unlikely you will need them here). Not every
node has a name, or children, or a parent. You have to make sure that
the subclass of Node you are talking to actually responds to the method
you are trying to send it.
An Hpricot DOM is not exactly a W3C DOM, but it is mostly similar:
only container nodes (Hpricot::Doc and Hpricot::Elem) have children, not
Hpricot::Text or Hpricot::Comment, and not every node has a name. In
your example Hpricot::Doc does not (similarly, Hpricot::Text and
Hpricot::Comment do not have a name either). Also, an Hpricot::Doc does
not have a parent, I think.
Your problem is that you are traversing up and reach the Doc node, which
does not respond to name.
So you have to change this line of your code:
  if text.parent and text.parent.name and text.parent.name != 'java' and
to something like:
  parent = text.parent
  # or use parent.respond_to?(:name), or check parent.parent against nil
  if parent.instance_of?(Hpricot::Elem)
    # do the stuff
  else
    # you have reached the top node, the Doc; nothing to do
  end
(The else branch is not needed, I just added it for illustration)
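To make the guard concrete without depending on the gem, here is a
minimal sketch using stand-in structs. The names fake_doc, fake_elem,
fake_text and rewritable? are hypothetical and are not part of Hpricot;
they only mimic the shape described above, where an element has a name
and the document does not.

```ruby
# Stand-in node types (hypothetical, not Hpricot's own classes).
# Only the element stand-in has a name member.
fake_doc  = Struct.new(:children, :parent)
fake_elem = Struct.new(:name, :parent)
fake_text = Struct.new(:content, :parent)

# A text node is rewritten only when its parent is a named element,
# that name is not in the skip list, and the text is not blank.
def rewritable?(text, skip = ["java"])
  parent = text.parent
  return false unless parent.respond_to?(:name) # the document has no name
  !skip.include?(parent.name) && !text.content.strip.empty?
end

doc = fake_doc.new([], nil)
under_doc = fake_text.new("stray text", doc)
in_java   = fake_text.new("this is in java1", fake_elem.new("java", doc))
in_li     = fake_text.new("aabbcc", fake_elem.new("li", doc))

results = [under_doc, in_java, in_li].map { |t| rewritable?(t) }
```

Checking respond_to?(:name) rather than a specific class keeps the guard
working for any node type that happens to lack a name.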
HTH,
Peter
__
http://www.rubyrailways.com
Thanks Peter.
I need to improve my knowledge about the DOM in general.
I have modified the code and now do “if p.instance_of? Hpricot::Elem and
…”
Right now, it's working fine for me. I still need to think about all the
possible cases.
Thanks,
Siddharth
Siddharth Karandikar wrote:
will generate
title
Now when I get this kind of html, I would like to skip all the text
under ‘english’ tag and convert all the remaining text to my language
‘marathi’ (utf8 codes). Using Hpricot for this.
Okay, that sounds a great deal more complex than a typical text
extraction task from an HTML page. I assume you mean to preserve some
parts unchanged, while translating other parts, and reassemble the page
at the end of the process.
This could be done using your own custom code, but only if a much more
specific, detailed description were offered. The same thing could be
said of an Hpricot-based approach, by the way.
After that I am thinking of removing all the ‘english’ tags but keeping
the markup surrounding them.
Okay, that part is easy:
data.gsub!(%r{.*?}im,"")
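One possible reading of "removing the 'english' tags but keeping the
markup" is to delete only the opening and closing tags and leave whatever
they wrap in place. The sketch below assumes the marker tag is literally
named english and carries no attributes; the sample string is
hypothetical.

```ruby
# Hypothetical sample; assumes the marker tag is literally <english>
# and carries no attributes.
data = "<h1><english>title</english></h1>\n<p>some text</p>"

# Strip only the opening and closing <english> tags, keeping their content
# and all of the surrounding markup.
stripped = data.gsub(%r{</?english>}i, "")
```

Using gsub rather than gsub! returns a new string and leaves data
untouched, which makes it easier to compare before and after.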
Most tasks in this class are easy to accomplish, as long as the
description is clear and detailed enough.