Why does the #content method in Nokogiri not print the full text?

my-ruby · April 14, 2013, 6:18pm

Here is the documentation: File: README — Documentation for nokogiri (1.15.3)

Why does the code below not print the full text?

Code:

require 'nokogiri'

html = <<-END
<html>

<head>

<title> A Dirge </title>

<link rel     = "schema.DC"
      href    = "http://purl.org/DC/elements/1.0/">

<meta name    = "DC.Title"
      content = "A Dirge">

<meta name    = "DC.Creator"
      content = "Shelley, Percy Bysshe">

<meta name    = "DC.Type"
      content = "poem">

<meta name    = "DC.Date"
      content = "1820">

<meta name    = "DC.Format"
      content = "text/html">

<meta name    = "DC.Language"
      content = "en">

</head>

<body><pre>

        Rough wind, that moanest loud
          Grief too sad for song;
        Wild wind, when sullen cloud
          Knells all the night long;
        Sad storm, whose tears are vain,
        Bare woods, whose branches strain,
        Deep caves and dreary main, -
          Wail, for the world's wrong!

</pre></body>

</html>

END

doc = Nokogiri::HTML::DocumentFragment.parse(html)
doc.children.each do |ch|
  p ch.content if ch.text?
end

Output:

"\n\n \n\n "
"\n\n "

Expected output:

    Rough wind, that moanest loud
          Grief too sad for song;
        Wild wind, when sullen cloud
          Knells all the night long;
        Sad storm, whose tears are vain,
        Bare woods, whose branches strain,
        Deep caves and dreary main, -
          Wail, for the world's wrong!

my-ruby · April 14, 2013, 6:59pm


If you actually look at the structure of doc, the next to last entry
in its children contains children as well, which you need to loop
through. Try this:

(load your code into irb)
require 'pp'
pp doc

and see what the structure is.

my-ruby · April 14, 2013, 7:04pm

Follow-up: since you have a complete HTML document, why treat it as a
fragment? You can call Nokogiri::HTML.parse(html) instead and get the
actual complete document tree with all the proper nesting.

my-ruby · April 14, 2013, 7:07pm

I am learning Nokogiri for the first time, so I don't know much
about how it is used.

Could you tell me, please?

When should I use Nokogiri::HTML.parse(html), and when the other?

my-ruby · April 14, 2013, 7:22pm

On Sun, Apr 14, 2013 at 12:07 PM, Love U Ruby wrote:

Could you tell me please?

No. I will tell you this, though. You have entirely the wrong
strategy for learning how to be a developer. You have adopted the
strategy of "someone must tell me". You need to adopt the strategy of
"try things out until I learn what works". If you get stuck at this
low a level of understanding, you will never progress, and as you have
seen, it just frustrates the people you continuously run back to with
every single step. You may think you are learning, but you are not at
all learning how to learn, which is the more important step. You are
not learning how to solve problems, especially your own. People are
NOT on this list to teach you. We are not your instructors. We answer
questions out of the goodness of our hearts, but repeated trips to the
well for every sip wear everyone here down. Frankly, it makes me want
to leave this list and go elsewhere. It makes it very unenjoyable, and
very unpleasant.

When should I use Nokogiri::HTML.parse(html), and the when the other?

Please compare and contrast the terms “Document” and “Document Fragment”

What do the words “Document” and “Document Fragment” mean to you?

my-ruby · April 14, 2013, 7:20pm

Now, I tried

doc = Nokogiri::HTML::DocumentFragment.parse(html)
pp doc
doc.children.each do |ch|
  p ch.content if ch.text?
end

output:

children = [
#(Text "\n\n Rough wind, that moanest loud\n
Grief too sad for song;\n Wild wind, when sullen cloud\n
Knells all the night long;\n Sad storm, whose tears are
vain,\n Bare woods, whose branches strain,\n Deep
caves and dreary main, -\n Wail, for the world’s wrong!\n\n
")]
}),
#(Text "\n\n ")

–
"\n\n \n\n "
"\n\n "

Where do the characters in the middle of the first "\n\n \n\n " go?

my-ruby · April 14, 2013, 7:23pm

I see.

You did not actually read what pp doc told you, did you?

my-ruby · April 14, 2013, 7:25pm

I have given the partial output that I got from pp here.

my-ruby · April 14, 2013, 7:59pm

You copying it into a message and you reading it are two entirely
different things.

my-ruby · April 14, 2013, 9:41pm

Finally, I got the output I was looking for:

require 'nokogiri'
require 'pp'

html = <<-END
<html>

    <head>

    <title> A Dirge </title>

    <link rel     = "schema.DC"
          href    = "http://purl.org/DC/elements/1.0/">

    <meta name    = "DC.Title"
          content = "A Dirge">

    <meta name    = "DC.Creator"
          content = "Shelley, Percy Bysshe">

    <meta name    = "DC.Type"
          content = "poem">

    <meta name    = "DC.Date"
          content = "1820">

    <meta name    = "DC.Format"
          content = "text/html">

    <meta name    = "DC.Language"
          content = "en">

    </head>

    <body><pre>

            Rough wind, that moanest loud
              Grief too sad for song;
            Wild wind, when sullen cloud
              Knells all the night long;
            Sad storm, whose tears are vain,
            Bare woods, whose branches strain,
            Deep caves and dreary main, -
              Wail, for the world's wrong!

    </pre></body>

    </html>
 END

doc = Nokogiri::HTML::DocumentFragment.parse(html)

doc.children.each do |ch|
  puts ch.child.content if ch.node_name == 'pre'
end

output:

        Rough wind, that moanest loud
          Grief too sad for song;
        Wild wind, when sullen cloud
          Knells all the night long;
        Sad storm, whose tears are vain,
        Bare woods, whose branches strain,
        Deep caves and dreary main, -
          Wail, for the world's wrong!

my-ruby · April 14, 2013, 10:44pm

On Sun, 14 Apr 2013 18:19:00 +0200, Love U Ruby wrote:

doc = Nokogiri::HTML::DocumentFragment.parse(html)
doc.children.each do |ch|
  p ch.content if ch.text?
end

ch.text? returns true only when a node is a text node, i.e., when it's
not a tag. Since the document root contains no text itself apart from
whitespace, this just prints the whitespace. Remove the if ch.text?
part to print contents of everything (or just use doc.content).

my-ruby · April 15, 2013, 8:09am

Thank you for the pp doc hint. It helped me a great deal. Just one more thing to ask: can you suggest other ways I could solve the same problem? I just want to learn Nokogiri. Give me only hints; I
will try to use them to solve the same assignment as above.

my-ruby · April 15, 2013, 10:18am

Thank you very much for your comments.

my-ruby · April 16, 2013, 5:05am

Write your own Mechanize gem.

my-ruby · April 20, 2013, 9:26am

I am just looking for guidance on usage: when should I choose each of
the two classes below?

Nokogiri::HTML::Document and Nokogiri::HTML::DocumentFragment

And when should I use the parse method of each?

my-ruby · April 20, 2013, 10:59am

What do you suppose the meaning of "fragment" is, and why would that
make a distinction?

my-ruby · April 20, 2013, 11:05am

I understand, but I am looking for the ideal use case for each, so that
I can pick the right one. That is, when must I use
Nokogiri::HTML::Document, and when the other?

my-ruby · April 20, 2013, 12:04pm

If you understand the difference, then you have your ‘perfect’ use-case.

my-ruby · May 29, 2013, 10:45pm

Hi,

I wrote the code below:

require 'nokogiri'
require 'open-uri'

# On Ruby 3.0 and later, open-uri is used through URI.open; a bare open(url) no longer fetches URLs.
doc = Nokogiri::HTML(URI.open('http://www.homeshop18.com/'))
p doc.css("div#megamenu-sub-nav li span:nth-child(2)").map { |x| x.parent.text.strip }
#=> ["books", "clothing", "footwear", "fashion accessories", "health & beauty",
#    "jewellery", "watches", "mobiles", "gsm mobiles\r\rnew", "upto 62% off\r\rnew",
#    "camera & camcorders", "computers", "electronics", "home & kitchen",
#    "household appliances", "kids & toys", "gift & flowers", "office & stationery"]

But in the array output, I am getting two items that I don't expect:
"gsm mobiles\r\rnew" and "upto 62% off\r\rnew".

Could anyone tell me where I made the mistake?