How can I count the number of elements in an HTML page?

Hi there, I’m using net/http to retrieve some html pages and now I
want to count the number of items in a list on the page. The
response.body is stored as a string.

The HTML looks something like this:

Section Heading I'm interested in:

  • foo
  • bar

So what I want to do is count the number of li’s in a particular div
section. In this case the answer is 2. It might be more, it might be
0.

I can find the section I want with a regex but I don’t know how to
iterate through the string looking for particular elements. I was
thinking about taking the section I’m interested in and saving it as
an array and then iterating through each array element (html line)
that way, but I thought there might be a quicker way to do it.

suggestions?

TIA

On Tue, Oct 5, 2010 at 10:45 PM, Paul wrote:

iterate through the string looking for particular elements. I was thinking about taking the section I'm interested in and saving it as an array and then iterating through each array element (html line) that way, but I thought there might be a quicker way to do it.

suggestions?

I’d use Nokogiri. Off the top of my head, it would be something like
(untested):

require 'nokogiri'

html_string=<<END
#[your html]
END

doc = Nokogiri::HTML(html_string)
puts doc.search("/div/ul/li").size

Maybe you will need to adjust the xpath search, but I think it should
be something like that.

Hope this helps,

Jesus.

Nokogiri allows direct access to HTML elements as data. I use it a lot
in my work. Try something like:

require 'nokogiri'
page = Nokogiri::HTML(response.body)
count = 0
# Find the div by its exact class attribute, then count the <li> items in its lists.
page.xpath("//div[@class='first section']").each do |element|
  count += element.xpath("./ul/li").size
end

Or something along those lines… (I didn’t test this first).

If you aren't concerned with the content of each row, perhaps just
count the number of lines in the array? I guess you would have to
read in the page line by line, but that isn't too hard.
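A rough sketch of that line-by-line idea, using only core Ruby. The method name `count_items_by_line`, the `section_marker` argument, and the sample markup are made up for illustration. It assumes the section starts on the line containing the marker, ends at the first `</div>` after it, and has no nested divs.

```ruby
# Count <li> tags line by line, starting at the line that contains
# section_marker and stopping at the first closing </div>.
# Assumption: the section has no nested <div> elements.
def count_items_by_line(html, section_marker)
  in_section = false
  count = 0
  html.each_line do |line|
    in_section = true if !in_section && line.include?(section_marker)
    next unless in_section
    # Match "<li>" as well as "<li class=...>", in any letter case.
    count += line.scan(/<li[\s>]/i).size
    break if line.include?("</div>")
  end
  count
end

# Hypothetical sample page; the real markup will differ.
sample = <<~HTML
  <div class="other"><ul><li>x</li></ul></div>
  <div class="first section">
  <ul>
  <li>foo</li>
  <li>bar</li>
  </ul>
  </div>
HTML

puts count_items_by_line(sample, 'class="first section"')  # prints 2
```

This breaks as soon as the page layout changes (nested divs, a commented-out list), so treat it as a quick hack rather than a parser.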

I can find the section I want with a regex but I don’t know how to
iterate through the string looking for particular elements. I was
thinking about taking the section I’m interested in and saving it as
an array and then iterating through each array element (html line)
that way, but I thought there might be a quicker way to do it.

$html.scan(%r{<div.*first section.*</div>}m).to_s.scan(/<li>/).size

2010/10/5 Jesús Gabriel y Galán:

    I can find the section I want with a regex but I don't know how to

    html_string=<<END
    #[your html]
    END

    doc = Nokogiri::HTML(html_string)
    puts doc.search("/div/ul/li").size

    Maybe you will need to adjust the xpath search, but I think it should
    be something like that.

    I am now at my computer so I can test this. It seems that
    Nokogiri::HTML yields a complete HTML document, adding <html> and
    <body> tags around the fragment, so this works:

    irb(main):002:0> require 'nokogiri'
    => true
    irb(main):003:0> html_string =<<END
    […snip: the HTML fragment from the original post…]
    irb(main):021:0" END
    […snip…]
    irb(main):034:0> doc.search("/html/body/div/ul/li").size
    => 2

    Hope this helps,

    Jesus.

    Hi,

    The "jazzez" gem lists a count for each and every HTML tag.

    jazzez uses the mechanize and Hpricot libraries.

    For more details: http://jazzez.wordpress.com

    Thanks
    Raveendran

    For more details :

    1. Get the Html tags

    Ex.

    require 'jazzez'
    output = Jazzez.new
    output.tagdetails("google.com")

    Output:

    1<html tag(s)
    1 tag(s)
    1<head tag(s)
    1 tag(s)
    1<body tag(s)
    1 tag(s)
    2<table tag(s)
    2 tag(s)
    3<tr tag(s)
    3 tag(s)
    9<td tag(s)
    9 tag(s)
    0<th tag(s)
    0 tag(s)
    0<l tag(s)
    0 tag(s)
    0<link tag(s)
    1<p tag(s)
    1 tag(s)
    4<div tag(s)
    4 tag(s)
    0<span tag(s)
    0 tag(s)
    4<script tag(s)
    4 tag(s)
    0<ul tag(s)
    0 tag(s)
    0<ol tag(s)
    0 tag(s)
    16<a tag(s)
    15 tag(s)
    0<h1 tag(s)
    0 tag(s)
    0<h2 tag(s)
    0 tag(s)
    0<h3 tag(s)
    0 tag(s)
    0<h4 tag(s)
    0 tag(s)
    0<h5 tag(s)
    0 tag(s)
    0<h6 tag(s)
    0 tag(s)
    4<font tag(s)
    4 tag(s)
    0<select tag(s)
    0 tag(s)
    0<option tag(s)
    0 tag(s)

    Thanks
    Raveendran

    On Fri, Oct 8, 2010 at 4:10 PM, Paul wrote:

    search with .*?

    I’ve got nothing against Nokogiri or the other solutions but I was
    hoping for a solution like this that just uses the core libraries for
    portability.

    Cheers! Paul.

    I would try REXML, then. It’s an XML parser in the standard library.
    http://ruby-doc.org/stdlib/libdoc/rexml/rdoc/index.html

    I'd reserve regex parsing of XML for very informal situations where
    I just need a quick, non-rigorous solution (i.e. a one-time solution
    that I plan to verify personally). I am pretty sure that it is not
    possible to correctly parse XML with a regex.
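For what it's worth, here is a minimal sketch of the REXML route. The method name `count_list_items` and the sample fragment are invented for illustration. REXML is an XML parser, so this assumes the page (or at least the section) is well-formed XML, which a lot of real-world HTML is not. On recent Ruby versions REXML ships as a bundled gem rather than a default library, so it may need installing in some setups.

```ruby
require "rexml/document"

# Count <li> elements directly under a <ul> inside the <div> whose
# class attribute equals div_class. Raises REXML::ParseException if
# the input is not well-formed XML.
def count_list_items(xml, div_class)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, "//div[@class='#{div_class}']/ul/li").size
end

# Hypothetical well-formed fragment standing in for the real page.
fragment = <<~XML
  <div class="first section">
    <ul>
      <li>foo</li>
      <li>bar</li>
    </ul>
  </div>
XML

puts count_list_items(fragment, "first section")  # prints 2
```

If the pages are sloppy HTML (unclosed tags, unquoted attributes), REXML will refuse to parse them, and an HTML-aware parser is the safer choice.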

    On Fri, Oct 8, 2010 at 11:10 PM, Paul wrote:

    search with .*?

    I’ve got nothing against Nokogiri or the other solutions but I was
    hoping for a solution like this that just uses the core libraries for
    portability.

    You have to be careful, then, about all the possible combinations that
    make a valid HTML but make the above regexp fail:

    irb(main):025:0> string=<<EOS
    irb(main):026:0" <div class="first section"><ul><li >something</li><li>something else</li></ul></div>
    irb(main):027:0" EOS
    => "<div class=\"first section\"><ul><li >something</li><li>something else</li></ul></div>\n"
    irb(main):029:0> string.scan(%r{<div.*first section.*</div>}m).to_s.scan(/<li>/).size
    => 1

    irb(main):031:0> require 'nokogiri'
    => true
    irb(main):032:0> doc = Nokogiri::HTML(string)
    => #<Nokogiri::HTML::Document:0x…fdb940e66 name="document"
    children=[#<Nokogiri::XML::DTD:0x…fdb940ce0 name="html">,
    #<Nokogiri::XML::Element:0x…fdb940cb8 name="html"
    children=[#<Nokogiri::XML::Element:0x…fdb94043e name="body"
    children=[#<Nokogiri::XML::Element:0x…fdb940268 name="div"
    attributes=[#<Nokogiri::XML::Attr:0x…fdb9401c8 name="class"
    value="first section">]
    children=[#<Nokogiri::XML::Element:0x…fdb93fe8a name="ul"
    children=[#<Nokogiri::XML::Element:0x…fdb93fcbe name="li"
    children=[#<Nokogiri::XML::Text:0x…fdb93fb2e "something">]>,
    #<Nokogiri::XML::Element:0x…fdb93fa70 name="li"
    children=[#<Nokogiri::XML::Text:0x…fdb93f91c "something
    else">]>]>]>]>]>]>
    irb(main):033:0> doc.search("/html/body/div/ul/li").size
    => 2

    In general: parsing HTML with regexp can get messy. Best leave the
    work to a proper library that handles all the strange nuances.

    Jesus.
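To make the point concrete, here is a small sketch in plain Ruby showing how a literal `/<li>/` match miscounts on markup variations that are all legal HTML. The method name `naive_li_count` and the sample strings are invented for illustration; each sample really contains two live list items.

```ruby
# The naive counter: only matches the exact text "<li>".
def naive_li_count(html)
  html.scan(/<li>/).size
end

# Hypothetical samples. Each one has exactly two real list items.
samples = {
  "plain"            => '<ul><li>a</li><li>b</li></ul>',
  "space before >"   => '<ul><li >a</li><li>b</li></ul>',
  "uppercase tag"    => '<ul><LI>a</LI><li>b</li></ul>',
  "with attribute"   => '<ul><li class="x">a</li><li>b</li></ul>',
  "inside a comment" => '<ul><!-- <li>old</li> --><li>a</li><li>b</li></ul>'
}

samples.each do |label, html|
  puts "#{label}: #{naive_li_count(html)}"
end
# plain: 2
# space before >: 1
# uppercase tag: 1
# with attribute: 1
# inside a comment: 3
```

A looser pattern such as `/<li[\s>]/i` fixes the first few cases, but the commented-out item still gets counted, which is the kind of nuance a parser handles for you.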

    On Oct 5, 9:33 pm, Steel S. wrote:

    I can find the section I want with a regex but I don’t know how to
    iterate through the string looking for particular elements. I was
    thinking about taking the section I’m interested in and saving it as
    an array and then iterating through each array element (html line)
    that way, but I thought there might be a quicker way to do it.

    $html.scan(%r{<div.*first section.*</div>}m).to_s.scan(/<li>/).size

    Thanks Steel. This worked fine. I just needed to make it a lazy
    search with .*?

    I’ve got nothing against Nokogiri or the other solutions but I was
    hoping for a solution like this that just uses the core libraries for
    portability.

    Cheers! Paul.
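For anyone finding this later, a sketch of why the lazy `.*?` matters, in core Ruby only. The method name `section_li_count` and the two-section sample page are made up for illustration. It uses `String#[]` with a regex instead of `scan(...).to_s`, which is an assumption about style rather than what was posted above.

```ruby
# Pull out the first substring matching pattern and count the literal
# "<li>" tags inside it. Returns 0 when the pattern does not match.
def section_li_count(html, pattern)
  section = html[pattern]
  section ? section.scan(/<li>/).size : 0
end

# Hypothetical page with two sections; only the first one is wanted.
page = <<~HTML
  <div class="first section">
  <ul>
  <li>foo</li>
  <li>bar</li>
  </ul>
  </div>
  <div class="second section">
  <ul>
  <li>baz</li>
  </ul>
  </div>
HTML

greedy = %r{<div.*first section.*</div>}m   # runs on to the LAST </div>
lazy   = %r{<div.*?first section.*?</div>}m # stops at the FIRST </div>

puts section_li_count(page, greedy)  # prints 3
puts section_li_count(page, lazy)    # prints 2
```

The lazy version still stops too early if the section contains a nested div, so it only holds up while the page layout stays this simple.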