How can I count the number of elements in an HTML page?

Hi there, I’m using net/http to retrieve some html pages and now I
want to count the number of items in a list on the page. The
response.body is stored as a string.

The HTML looks something like this:

Section Heading I'm interested in:

  • foo
  • bar

So what I want to do is count the number of li’s in a particular div
section. In this case the answer is 2. It might be more, it might be
0.

I can find the section I want with a regex but I don’t know how to
iterate through the string looking for particular elements. I was
thinking about taking the section I’m interested in and saving it as
an array and then iterating through each array element (html line)
that way, but I thought there might be a quicker way to do it.

suggestions?

TIA

On Tue, Oct 5, 2010 at 10:45 PM, Paul [email protected] wrote:

iterate through the string looking for particular elements. I was thinking about taking the section I'm interested in and saving it as an array and then iterating through each array element (html line) that way, but I thought there might be a quicker way to do it.

suggestions?

I’d use Nokogiri. Off the top of my head, it would be something like
(untested):

require 'nokogiri'

html_string=<<END
#[your html]
END

doc = Nokogiri::HTML(html_string)
puts doc.search("/div/ul/li").size

Maybe you will need to adjust the xpath search, but I think it should
be something like that.

Hope this helps,

Jesus.

Nokogiri allows direct access to HTML elements as data. I use it a lot
in my work. Try something like:

require 'nokogiri'
page = Nokogiri::HTML(response.body)
count = 0
# Add up the li elements found inside each matching div
page.xpath("//div[@class='first section']").each do |element|
  count += element.xpath(".//li").size
end

Or something along those lines… (I didn’t test this first).

If you aren’t concerned with the content of each row, perhaps you
could just count the number of lines in the array? I guess you would
have to read the page in line by line, but that isn’t too hard.
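The line-counting idea above could look something like the sketch below. The method name `count_item_lines` and the sample string are made up for illustration, and it assumes the markup puts at most one li on each line, which real pages do not guarantee. It counts over whatever string it is given, so to limit it to one section you would pass in only that slice of the page.

```ruby
# Count the lines that open a list item.
# Assumption: at most one <li> per line.
def count_item_lines(body)
  body.lines.count { |line| line.match?(/<li[\s>]/) }
end

# Hypothetical sample standing in for response.body
page = "<ul>\n  <li>foo</li>\n  <li>bar</li>\n</ul>\n"
puts count_item_lines(page)  # prints 2
```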

I can find the section I want with a regex but I don’t know how to
iterate through the string looking for particular elements. I was
thinking about taking the section I’m interested in and saving it as
an array and then iterating through each array element (html line)
that way, but I thought there might be a quicker way to do it.

$html.scan(%r{<div.*first section.*</div>}m).to_s.scan(/<li>/).size

    2010/10/5 Jesús Gabriel y Galán [email protected]:

    I can find the section I want with a regex but I don't know how to

    html_string=<<END
    #[your html]
    END

    doc = Nokogiri::HTML(html_string)
    puts doc.search("/div/ul/li").size

    Maybe you will need to adjust the xpath search, but I think it should
    be something like that.

    I am now at my computer so I can test this. It seems that
    Nokogiri::HTML yields a complete HTML document, adding <html> and
    <body> tags around the fragment, so this works:

    irb(main):002:0> require 'nokogiri'
    => true
    irb(main):003:0> html_string =<<END
    [HTML fragment from the question: a div holding the section heading and a ul with two li items, foo and bar]
    END
    […snip…]
    irb(main):034:0> doc.search("/html/body/div/ul/li").size
    => 2

    Hope this helps,

    Jesus.

    Hi,

    The “jazzez” gem lists a count for each and every HTML tag.

    jazzez uses the mechanize and Hpricot libraries.

    For more details --> http://jazzez.wordpress.com

    Thanks
    Raveendran

    For more details:

    1. Get the HTML tags

    Ex.

    require 'jazzez'
    output = Jazzez.new
    output.tagdetails("google.com")

    Output:

    1<html tag(s)
    1</html> tag(s)
    1<head tag(s)
    1</head> tag(s)
    1<body tag(s)
    1</body> tag(s)
    2<table tag(s)
    2</table> tag(s)
    3<tr tag(s)
    3</tr> tag(s)
    9<td tag(s)
    9</td> tag(s)
    0<th tag(s)
    0</th> tag(s)
    0<l tag(s)
    0</l> tag(s)
    0<link tag(s)
    1<p tag(s)
    1</p> tag(s)
    4<div tag(s)
    4</div> tag(s)
    0<span tag(s)
    0</span> tag(s)
    4<script tag(s)
    4</script> tag(s)
    0<ul tag(s)
    0</ul> tag(s)
    0<ol tag(s)
    0</ol> tag(s)
    16<a tag(s)
    15</a> tag(s)
    0<h1 tag(s)
    0</h1> tag(s)
    0<h2 tag(s)
    0</h2> tag(s)
    0<h3 tag(s)
    0</h3> tag(s)
    0<h4 tag(s)
    0</h4> tag(s)
    0<h5 tag(s)
    0</h5> tag(s)
    0<h6 tag(s)
    0</h6> tag(s)
    4<font tag(s)
    4</font> tag(s)
    0<select tag(s)
    0</select> tag(s)
    0<option tag(s)
    0</option> tag(s)

    Thanks
    Raveendran

    On Fri, Oct 8, 2010 at 4:10 PM, Paul [email protected] wrote:

    search with .*?

    I’ve got nothing against Nokogiri or the other solutions but I was
    hoping for a solution like this that just uses the core libraries for
    portability.

    Cheers! Paul.

    I would try REXML, then. It’s an XML parser in the standard library.
    http://ruby-doc.org/stdlib/libdoc/rexml/rdoc/index.html

    I’d reserve regex parsing of XML for very informal situations where
    I just need a quick, non-rigorous solution (i.e. a one-time solution
    that I plan to verify personally). I am pretty sure that it is not
    possible to correctly parse XML with regex.
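As a rough illustration of the REXML suggestion, here is a minimal sketch that counts the li elements with an XPath query. The method name `count_items_rexml`, the constant `SAMPLE_SECTION`, and the h3 heading tag are assumptions made for the example. REXML is an XML parser, so this only works when the markup is well formed (every tag closed, attributes quoted); a lot of real-world HTML will make it raise a parse error.

```ruby
require 'rexml/document'

# Count li elements directly under the ul of the "first section" div.
# Assumption: the markup is well-formed XML; REXML raises otherwise.
def count_items_rexml(markup)
  doc = REXML::Document.new(markup)
  REXML::XPath.match(doc, "//div[@class='first section']/ul/li").size
end

# Hypothetical fragment; the h3 tag is a guess at the heading markup.
SAMPLE_SECTION = <<~HTML
  <div class="first section">
    <h3>Section Heading I'm interested in:</h3>
    <ul>
      <li>foo</li>
      <li>bar</li>
    </ul>
  </div>
HTML

puts count_items_rexml(SAMPLE_SECTION)  # prints 2
```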

    On Fri, Oct 8, 2010 at 11:10 PM, Paul [email protected] wrote:

    search with .*?

    I’ve got nothing against Nokogiri or the other solutions but I was
    hoping for a solution like this that just uses the core libraries for
    portability.

    You have to be careful, then, about all the possible combinations that
    make a valid HTML but make the above regexp fail:

    irb(main):025:0> string=<<EOS
    [HTML fragment: a div with class "first section" holding a ul with two li items, "something" and "something else"; the first li tag is broken across two lines]
    EOS
    […snip…]
    irb(main):029:0> string.scan(%r{<div.*first section.*</div>}m).to_s.scan(/<li>/).size
    => 1

    irb(main):031:0> require 'nokogiri'
    => true
    irb(main):032:0> doc = Nokogiri::HTML(string)
    => #<Nokogiri::HTML::Document:0x…fdb940e66 name="document"
    children=[#<Nokogiri::XML::DTD:0x…fdb940ce0 name="html">,
    #<Nokogiri::XML::Element:0x…fdb940cb8 name="html"
    children=[#<Nokogiri::XML::Element:0x…fdb94043e name="body"
    children=[#<Nokogiri::XML::Element:0x…fdb940268 name="div"
    attributes=[#<Nokogiri::XML::Attr:0x…fdb9401c8 name="class"
    value="first section">]
    children=[#<Nokogiri::XML::Element:0x…fdb93fe8a name="ul"
    children=[#<Nokogiri::XML::Element:0x…fdb93fcbe name="li"
    children=[#<Nokogiri::XML::Text:0x…fdb93fb2e "something">]>,
    #<Nokogiri::XML::Element:0x…fdb93fa70 name="li"
    children=[#<Nokogiri::XML::Text:0x…fdb93f91c "something
    else">]>]>]>]>]>]>
    irb(main):033:0> doc.search("/html/body/div/ul/li").size
    => 2

    In general: parsing HTML with regexp can get messy. Best leave the
    work to a proper library that handles all the strange nuances.

    Jesus.
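To make the failure mode concrete, here is a small sketch comparing a strict pattern with a slightly more tolerant one on markup where an li tag is split across lines. The constant and method names (`STRICT_LI`, `TOLERANT_LI`, `count_tags`, `SPLIT_TAG_SAMPLE`) and the sample string are assumptions for illustration. The tolerant pattern is still not a parser: comments, scripts, or attribute values containing ">" can fool it.

```ruby
STRICT_LI   = /<li>/             # only the exact text "<li>"
TOLERANT_LI = /<li\b[^>]*>/i     # allows whitespace, attributes, any case

def count_tags(html, pattern)
  html.scan(pattern).size
end

# Hypothetical sample: valid HTML, but the first li tag spans two lines.
SPLIT_TAG_SAMPLE = "<ul>\n<li\n>something</li>\n<li>something else</li>\n</ul>\n"

puts count_tags(SPLIT_TAG_SAMPLE, STRICT_LI)    # prints 1
puts count_tags(SPLIT_TAG_SAMPLE, TOLERANT_LI)  # prints 2
```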

    On Oct 5, 9:33 pm, Steel S. [email protected] wrote:

    I can find the section I want with a regex but I don’t know how to
    iterate through the string looking for particular elements. I was
    thinking about taking the section I’m interested in and saving it as
    an array and then iterating through each array element (html line)
    that way, but I thought there might be a quicker way to do it.

    $html.scan(%r{<div.*first section.*</div>}m).to_s.scan(/<li>/).size

    Thanks Steel. This worked fine. I just needed to make it a lazy
    search with .*?

    I’ve got nothing against Nokogiri or the other solutions but I was
    hoping for a solution like this that just uses the core libraries for
    portability.

    Cheers! Paul.
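For reference, a core-library-only version with the lazy match might look like the sketch below. The names `SECTION_RE`, `count_list_items`, and `PAGE`, along with the sample markup, are assumptions for illustration rather than the exact code used in the thread. With a greedy `.*` the match would run on to the last closing div on the page and count the li elements of both sections (5 here); the lazy `.*?` stops at the first one.

```ruby
# Lazy (.*?) match: stop at the first </div> after the opening tag.
# Assumption: the target div contains no nested divs.
SECTION_RE = %r{<div[^>]*class="first section"[^>]*>.*?</div>}m

def count_list_items(html)
  section = html[SECTION_RE]   # matched substring, or nil if absent
  return 0 unless section
  section.scan(/<li>/).size
end

# Hypothetical page with two sections
PAGE = <<~HTML
  <div class="first section">
    <ul><li>foo</li><li>bar</li></ul>
  </div>
  <div class="second section">
    <ul><li>a</li><li>b</li><li>c</li></ul>
  </div>
HTML

puts count_list_items(PAGE)  # prints 2
```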
