How can I count number of elements in an HTML page

paul · October 5, 2010, 10:45pm

Hi there, I’m using net/http to retrieve some html pages and now I
want to count the number of items in a list on the page. The
response.body is stored as a string.

The HTML looks something like this:

Section Heading I'm interested in:

foo
bar

So what I want to do is count the number of li’s in a particular div
section. In this case the answer is 2. It might be more, it might be
0.

I can find the section I want with a regex but I don’t know how to
iterate through the string looking for particular elements. I was
thinking about taking the section I’m interested in and saving it as
an array and then iterating through each array element (html line)
that way, but I thought there might be a quicker way to do it.

suggestions?

TIA

paul · October 5, 2010, 10:52pm

On Tue, Oct 5, 2010 at 10:45 PM, Paul [email protected] wrote:

iterate through the string looking for particular elements. I was thinking about taking the section I'm interested in and saving it as an array and then iterating through each array element (html line) that way, but I thought there might be a quicker way to do it.
suggestions?

I’d use Nokogiri. Off the top of my head, it would be something like
(untested):

require 'nokogiri'

html_string=<<END
#[your html]
END

doc = Nokogiri::HTML(html_string)
puts doc.search("/div/ul/li").size

Maybe you will need to adjust the xpath search, but I think it should
be something like that.

Hope this helps,

Jesus.

paul · October 5, 2010, 10:53pm

Nokogiri allows direct access to HTML elements as data. I use it a lot
in my work. Try something like:

require 'nokogiri'
page = Nokogiri::HTML(response.body)
count = 0
# Visit each matching div and add up the li elements in its lists.
page.xpath("//div[@class='first section']").each do |element|
  count += element.xpath("./ul/li").size
end

Or something along those lines… (I didn’t test this first).

paul · October 5, 2010, 10:54pm

If you aren’t concerned with the content of each row, perhaps you could
just count the number of lines in the array? I guess you would have to
read in the page line by line, but that isn’t too hard.

paul · October 6, 2010, 3:33am

I can find the section I want with a regex but I don’t know how to
iterate through the string looking for particular elements. I was
thinking about taking the section I’m interested in and saving it as
an array and then iterating through each array element (html line)
that way, but I thought there might be a quicker way to do it.

$html.scan(%r{<div.first section.}m).to_s.scan(/

/).size

paul · October 6, 2010, 9:29am

2010/10/5 Jesús Gabriel y Galán [email protected]:

I can find the section I want with a regex but I don't know how to

html_string=<<END
#[your html]
END

doc = Nokogiri::HTML(html_string)
puts doc.search("/div/ul/li").size

Maybe you will need to adjust the xpath search, but I think it should
be something like that.

I am now at my computer so I can test this. It seems that
Nokogiri::HTML yields a complete HTML document, adding <html> and <body>
tags around the fragment, so this works:

irb(main):002:0> require 'nokogiri'
=> true
irb(main):003:0> html_string =<<END
irb(main):004:0"

irb(main):005:0"

irb(main):006:0" Section Heading I’m interested in:
irb(main):007:0"

irb(main):008:0"

irb(main):010:0"
irb(main):011:0" foo
irb(main):012:0"
irb(main):013:0"

irb(main):015:0"
irb(main):016:0" bar
irb(main):017:0"
irb(main):018:0"

irb(main):020:0"

irb(main):021:0" END
[…snip…]
irb(main):034:0> doc.search("/html/body/div/ul/li").size
=> 2

Hope this helps,

Jesus.

paul · October 6, 2010, 10:55am

Hi,

The "jazzez" gem lists a count for each and every HTML tag.

jazzez uses the mechanize and Hpricot libraries.

For more details: http://jazzez.wordpress.com

Thanks
Raveendran

paul · October 6, 2010, 10:57am

For more details :

Get the Html tags

Ex.

require 'jazzez'
output= Jazzez.new
output.tagdetails("google.com")

Output:

1<html tag(s)
1 tag(s)
1<head tag(s)
1 tag(s)
1<body tag(s)
1 tag(s)
2<table tag(s)
2 tag(s)
3<tr tag(s)
3 tag(s)
9<td tag(s)
9 tag(s)
0<th tag(s)
0 tag(s)
0<l tag(s)
0 tag(s)
0<link tag(s)
1<p tag(s)
1 tag(s)
4<div tag(s)
4 tag(s)
0<span tag(s)
0 tag(s)
4<script tag(s)
4 tag(s)
0<ul tag(s)
0 tag(s)
0<ol tag(s)
0 tag(s)
16<a tag(s)
15 tag(s)
0<h1 tag(s)
0 tag(s)
0<h2 tag(s)
0 tag(s)
0<h3 tag(s)
0 tag(s)
0<h4 tag(s)
0 tag(s)
0<h5 tag(s)
0 tag(s)
0<h6 tag(s)
0 tag(s)
4<font tag(s)
4 tag(s)
0<select tag(s)
0 tag(s)
0<option tag(s)
0 tag(s)

Thanks
Raveendran

paul · October 10, 2010, 2:05am

On Fri, Oct 8, 2010 at 4:10 PM, Paul [email protected] wrote:

search with .*?

I’ve got nothing against Nokogiri or the other solutions but I was
hoping for a solution like this that just uses the core libraries for
portability.

Cheers! Paul.

I would try REXML, then. It’s an XML parser in the standard library.
http://ruby-doc.org/stdlib/libdoc/rexml/rdoc/index.html

I’d reserve regex parsing of XML for very informal situations where I
just need a quick, non-rigorous solution (i.e. a one-time solution that
I plan to verify personally). I am pretty sure that it is not possible
to correctly parse XML with regex.
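
A minimal sketch of the REXML route, assuming the section is well-formed XML (REXML is a strict parser and raises REXML::ParseException on sloppy HTML). The class name comes from this thread; the h2 tag and the variable names are assumptions for illustration. On Ruby 3.0 and later REXML ships as a bundled gem, so it may need to be listed in a Gemfile:

```ruby
require 'rexml/document'

# Sample markup modelled on the question; the <h2> is a guess.
html = <<~HTML
  <div class="first section">
    <h2>Section Heading I'm interested in:</h2>
    <ul>
      <li>foo</li>
      <li>bar</li>
    </ul>
  </div>
HTML

doc = REXML::Document.new(html)

# XPath.match returns an array of the matching li elements.
items = REXML::XPath.match(doc, "//div[@class='first section']/ul/li")

puts items.size
```

If the real pages are not valid XML (unclosed tags, unquoted attributes), this will fail at parse time rather than return a wrong count.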

paul · October 10, 2010, 2:12am

On Fri, Oct 8, 2010 at 11:10 PM, Paul [email protected] wrote:

search with .*?

I’ve got nothing against Nokogiri or the other solutions but I was
hoping for a solution like this that just uses the core libraries for
portability.

You have to be careful, then, about all the possible combinations that
are valid HTML but make the above regexp fail:

irb(main):025:0> string=<<EOS
irb(main):026:0"

something

something else

irb(main):027:0" EOS
=> “<div class="first section">

something else

\n”
irb(main):029:0> string.scan(%r{<div.first
section.}m).to_s.scan(/

/).size
=> 1

irb(main):031:0> require 'nokogiri'
=> true
irb(main):032:0> doc = Nokogiri::HTML(string)
=> #<Nokogiri::HTML::Document:0x…fdb940e66 name="document"
children=[#<Nokogiri::XML::DTD:0x…fdb940ce0 name="html">,
#<Nokogiri::XML::Element:0x…fdb940cb8 name="html"
children=[#<Nokogiri::XML::Element:0x…fdb94043e name="body"
children=[#<Nokogiri::XML::Element:0x…fdb940268 name="div"
attributes=[#<Nokogiri::XML::Attr:0x…fdb9401c8 name="class"
value="first section">]
children=[#<Nokogiri::XML::Element:0x…fdb93fe8a name="ul"
children=[#<Nokogiri::XML::Element:0x…fdb93fcbe name="li"
children=[#<Nokogiri::XML::Text:0x…fdb93fb2e "something">]>,
#<Nokogiri::XML::Element:0x…fdb93fa70 name="li"
children=[#<Nokogiri::XML::Text:0x…fdb93f91c "something else">]>]>]>]>]>]>
irb(main):033:0> doc.search("/html/body/div/ul/li").size
=> 2

In general: parsing HTML with regexp can get messy. Best leave the
work to a proper library that handles all the strange nuances.
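
To illustrate the kind of nuance meant here, a small sketch with made-up markup: both list items are valid HTML, but a pattern that looks for the literal text <li> only sees one of them. The attribute, the upper-case tag and the variable names are assumptions for the example:

```ruby
# Two valid list items; the second uses upper case and an attribute.
html = <<~HTML
  <div class="first section">
    <ul>
      <li>something</li>
      <LI class="odd">something else</LI>
    </ul>
  </div>
HTML

# Matches only the exact text "<li>".
naive = html.scan(/<li>/).size

# Allows attributes and any letter case; \b keeps <link> out.
tolerant = html.scan(/<li\b[^>]*>/i).size

puts naive     # 1
puts tolerant  # 2
```

Even the more tolerant pattern would still count items inside HTML comments or scripts, which a real parser handles for you.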

Jesus.

paul · October 8, 2010, 11:17pm

On Oct 5, 9:33 pm, Steel S. [email protected] wrote:

I can find the section I want with a regex but I don’t know how to
iterate through the string looking for particular elements. I was
thinking about taking the section I’m interested in and saving it as
an array and then iterating through each array element (html line)
that way, but I thought there might be a quicker way to do it.

$html.scan(%r{<div.first section.}m).to_s.scan(/

/).size

Thanks Steel. This worked fine. I just needed to make it a lazy
search with .*?

I’ve got nothing against Nokogiri or the other solutions but I was
hoping for a solution like this that just uses the core libraries for
portability.
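
For reference, a core-only sketch of the lazy search. The exact markup is an assumption (in particular the h2 tag, the second div and the variable names are invented), and it only works while the section contains no nested div:

```ruby
# Sample page with two sections; only the first should be counted.
html = <<~HTML
  <div class="first section">
    <h2>Section Heading I'm interested in:</h2>
    <ul>
      <li>foo</li>
      <li>bar</li>
    </ul>
  </div>
  <div class="second section">
    <ul>
      <li>baz</li>
    </ul>
  </div>
HTML

# Lazy .*? stops at the first </div> after the opening tag.
lazy_section = html[%r{<div[^>]*class="first section".*?</div>}m]
lazy_count = lazy_section ? lazy_section.scan(/<li\b/i).size : 0

# Greedy .* runs on to the last </div> and swallows the next section.
greedy_section = html[%r{<div[^>]*class="first section".*</div>}m]
greedy_count = greedy_section ? greedy_section.scan(/<li\b/i).size : 0

puts lazy_count    # 2
puts greedy_count  # 3
```

If the section is missing from the page, the match returns nil and the count falls back to 0.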

Cheers! Paul.
