Hpricot - strip out images and show first x words

cmaxvv · July 19, 2009, 9:42pm

Hey all

I have a bunch of html in p tags and i want to show a shortened summary -
like the first 25 words followed by ‘…’. I also want to strip out any
images. I’m using hpricot for this and am finding myself writing a
convoluted messy method. Can anyone show me a clean and simple way? Is
Hpricot even the right tool for the job?

Here’s the hacky and not-really-to-spec mess i have so far.

def lede(word_count = 25)
  doc = self.hpricot_body
  #wipe images
  doc.search("img").remove
  paras = doc.search("//p")
  text = ""
  while paras.size > 0 && text.split(" ").size < word_count
    text += paras.shift.to_html
  end
  if (arr = text.split(" ")).size > word_count
    return arr[0, word_count].join(" ") + " ..."
  else
    return arr.join(" ")
  end
end

thanks
max

cmaxvv · July 19, 2009, 11:23pm

Max W. wrote:

Hey all

I have a bunch of html in p tags and i want to show a shortened summary -
like the first 25 words followed by ‘…’. I also want to strip out any
images. I’m using hpricot for this and am finding myself writing a
convoluted messy method. Can anyone show me a clean and simple way? Is
Hpricot even the right tool for the job?

See if this helps:

require "rubygems"
require "hpricot"

html =<<ENDOFHTML

hello:

first paragraph of longer text

second paragraph

bye:

ENDOFHTML

doc = Hpricot(html)
results = []

paras = doc.search("//p")

paras.each do |para|
  para.search("img").remove
  text = para.inner_html

  if text.length <= 25
    results << text
  else
    results << "#{text[0, 25]}..."
  end
end

p results

["first paragraph of...", "second paragraph"]

cmaxvv · July 19, 2009, 11:45pm

Whoops! You wanted 25 words:

require "rubygems"
require "hpricot"

html =<<ENDOFHTML

hello:

<p>first paragraph of longer text apple apple apple apple apple apple apple apple apple apple pear pear pear pear pear pear pear pear pear pear pear pear ball ball ball ball ball ball ball ball ball ball ball ball</p>

<p>second paragraph</p>

bye:

ENDOFHTML

doc = Hpricot(html)
results = []
max_words = 25

paras = doc.search("//p")

paras.each do |para|
  para.search("img").remove

  text = para.inner_html
  words = text.split()

  if words.length <= max_words
    results << words.join(" ")
  else
    results << "#{words[0, max_words].join(" ")}..."
  end
end

p results
["first paragraph of longer text apple apple apple apple apple
apple apple apple apple apple pear pear pear pear pear pear pear pear
pear pear...", "second paragraph"]

cmaxvv · July 20, 2009, 2:17am

Max W. wrote:

Hi 7stud, thanks.

This seems to return the first 25 words of every paragraph?

Well, what does this say:

I have a bunch of html in p tags and i want to show a shortened
summary - like the first 25 words followed by ‘…’.

If you want relevant answers, then you have to post precise questions.

So now you just want the first 25 words of your html?

html =<<ENDOFHTML

From the UKTI blog:

\n

Early on 20 January 2009, nine British new media innovators and I arrived at the Foreign Office for a meeting with the Foreign Secretary, David Milliband MP, to discuss how government can support this important sector.
ENDOFHTML

max_words = 25

no_images = html.gsub(/<\s*img.*?>/, "")
words = no_images.split()

if words.length <= max_words
  puts words.join(" ")
else
  puts "#{words[0, max_words].join(" ")}..."
end

--output:--

From the UKTI blog:

Early on 20 January 2009, nine British new media innovators and I arrived at the Foreign Office for a...

cmaxvv · July 20, 2009, 12:51am

Hi 7stud, thanks.

This seems to return the first 25 words of every paragraph? Maybe i’m
not returning the right thing from the method though…here’s how i
wrapped up your code (in lede2 method) and mine (in lede method):

def lede(word_count = 25)
  doc = self.hpricot_body
  #wipe images
  doc.search("img").remove
  paras = doc.search("//p")
  text = ""
  while paras.size > 0 && text.split(" ").size < word_count
    text += paras.shift.to_html
  end
  if (arr = text.split(" ")).size > word_count
    return arr[0, word_count].join(" ") + " ..."
  else
    return arr.join(" ")
  end
end

def lede2(word_count = 25)
  doc = self.hpricot_body
  results = []
  paras = doc.search("//p")
  paras.each do |para|
    para.search("img").remove
    text = para.inner_html
    words = text.split()
    if words.length <= word_count
      results << words.join(" ")
    else
      results << "#{words[0, word_count].join(" ")}..."
    end
  end
  results.collect{|text| "<p>#{text}</p>"}.join
end

And here are the results of calling each method on the same post: first
i show the html content (body_rendered) that we use in the lede methods.

post.body_rendered
=> "

From the UKTI blog:
\n
Early on 20 January 2009,
nine British new media innovators and I arrived at the Foreign Office
for a meeting with the Foreign Secretary, David Milliband
MP, to discuss how government can support this important
sector. Over breakfast, the delegates, Mr Milliband, Foreign Office
Minister Gillian Merron and Sir Andrew
Cahn explored this key area for opportunities in wealth
creation, and to understand how the public sector can find new ways of
working using new media tools.
\n
The biggest
challenge raised by the companies was the apparent dearth of funding
opportunities for new start-ups in this economic climate and beyond.
There are few investors in the UK outside London, and most companies
seek funding from US sources.
\n
Admittedly, the digital sector has
a stronger ability to weather the storm than other market sectors; after
the bubble of 2000 burst, start-ups regrouped and re-emerged from the
ashes as \342\200\230Web
2.0\342\200\262 in 2005, aiming to generate businesses with sturdy
and market-resilient plans that can face inclement weather. However,
delegates called for funds to bridge the gap between
\302\24310-\302\243250k, and to support training programmes for new
talent into the digital and creative industries.
\n
The Foreign
Secretary, who has a track record as a blogger and new media enthusiast,
also wanted to explore with his visitors the contribution new
technologies can make to diplomacy and international
problem-solving.
\n
British new media developers excel at building
social entrepreneurial applications, ensuring that their social
networking, social software (e.g., blogs) and other social systems
(e.g., search, data visualisation) work to ensure participation in the
community.
\n
New media leverages communities based on
commonalities, rather than proximity, encouraging participation on an
equal playing field. It has been crucial in breaking down international
and social barriers, exposing participants first-hand to news and news
sources, encouraging them to engage with people of different religions,
cultures and creeds, of different abilities and languages. It is
transforming the way our children learn, the way our teachers teach and
the way we do business. In short, it has put the person back into the
technology, lowering the barriers for knowledge and sharing.
\n
Yet
the challenge remains in opening up policy debates in such a way as they
mobilise the many-to-many networks which new media supports. Delegates
suggested using gaming technologies to break down boundaries of
participation and communication, and to open up public assets to
communities who may be able to offer better solutions than those that
have come before.
\n
The Business Breakfast was organised by the ICT
Sector Team and is part of a series of meetings aimed at
facilitating abetter understanding between business and government on
key issues affecting businesses. Many thanks to all
involved!
\n
The
Attendees
The attendees had been hand-picked to represent
the spectrum of digital services developed in the UK, from innovators in
social entrepreneurship and education to broadcasters and videogame
developers:
\n
4iP
Channel 4 Education
Chinwag
Dopplr
Mind Candy
MySociety
School of Everything
TTGames
UnLtdWorld
"
post.lede
=> "
From the UKTI blog:

Early on 20 January 2009,
nine British new media innovators and I arrived at the Foreign Office
for a meeting ..."
post.lede2
=> "

From the UKTI blog:

Early on 20 January 2009,
nine British new media innovators and I arrived at the Foreign Office
for a meeting with the Foreign Secretary, <a...

The
biggest challenge raised by the companies was the apparent dearth of
funding opportunities for new start-ups in this economic climate and
beyond. There are...

Admittedly, the digital sector has a stronger
ability to weather the storm than other market sectors; after the bubble
of 2000 burst, start-ups regrouped and...

The Foreign Secretary,
who has a track record as a blogger and new media enthusiast, also
wanted to explore with his visitors the contribution
new...

British new media developers excel at building social
entrepreneurial applications, ensuring that their social networking,
social software (e.g., blogs) and other social systems (e.g.,
search,...

New media leverages communities based on commonalities,
rather than proximity, encouraging participation on an equal playing
field. It has been crucial in breaking down international...

Yet
the challenge remains in opening up policy debates in such a way as they
mobilise the many-to-many networks which new media supports. Delegates
suggested...

The Business Breakfast was organised by the ICT
Sector Team and is part of a series of meetings aimed at
facilitating abetter understanding...

The Attendees
The
attendees had been hand-picked to represent the spectrum of digital
services developed in the UK, from innovators in social entrepreneurship
and...

4iP
Channel 4 Education
Chinwag
Dopplr
Mind Candy
MySociety
School of Everything
TTGames
UnLtdWorld
"
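
One way to get a single summary instead of a truncated copy of every paragraph is to keep a running word budget across the paragraphs and stop as soon as it is spent. The following is only a minimal sketch: the `lede_paragraphs` name is made up, and it assumes the paragraph text has already been pulled out into plain strings (for instance from each p element), so it does not touch Hpricot at all.

```ruby
# Hypothetical helper, not part of Hpricot. `paragraphs` is assumed to be an
# array of plain-text strings, one per <p>, with images already removed.
def lede_paragraphs(paragraphs, word_count = 25)
  remaining = word_count
  result = []

  paragraphs.each do |para|
    break if remaining <= 0
    words = para.split
    next if words.empty?

    if words.length <= remaining
      # Whole paragraph fits inside the budget.
      result << words.join(" ")
      remaining -= words.length
    else
      # Budget runs out inside this paragraph: cut it here and stop.
      result << words.first(remaining).join(" ") + " ..."
      remaining = 0
    end
  end

  result.map { |text| "<p>#{text}</p>" }.join
end

paras = ["From the UKTI blog:", "Early on 20 January 2009, nine British new media innovators arrived."]
puts lede_paragraphs(paras, 8)
# → <p>From the UKTI blog:</p><p>Early on 20 January ...</p>
```

Note that if the budget happens to run out exactly at the end of a paragraph, this sketch adds no ellipsis even when more paragraphs follow.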

cmaxvv · July 20, 2009, 11:30am

Max W. wrote:

7stud – wrote:

Well, what does this say:

I have a bunch of html in p tags and i want to show a shortened
summary - like the first 25 words followed by ‘…’.

I guess that was a bit ambiguous, sorry. Anyway, i’m interested to see
you don’t think it’s worth bothering with hpricot in this case, and just
use a regex - i was wondering if hpricot was overkill myself.

One problem that just occurred to me is that treating the whole html
like text and taking the first 25 words will make it into invalid html -
because we have start tags with no matching end tags.

I stopped using Hpricot because you said this was your desired output:

post.lede
=> “

From the UKTI blog:

Early on 20 January 2009,
nine British new media innovators and I arrived at the Foreign Office
for a meeting …”

cmaxvv · July 20, 2009, 10:23am

7stud – wrote:

Well, what does this say:

I have a bunch of html in p tags and i want to show a shortened
summary - like the first 25 words followed by ‘…’.

I guess that was a bit ambiguous, sorry. Anyway, i’m interested to see
you don’t think it’s worth bothering with hpricot in this case, and just
use a regex - i was wondering if hpricot was overkill myself.

One problem that just occurred to me is that treating the whole html
like text and taking the first 25 words will make it into invalid html -
because we have start tags with no matching end tags. So, ideally i
would preserve the start and end tags and strip the content down to 25
words. That’s why i used hpricot initially.

Anyway, thanks for your help.
max
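
If valid output matters more than keeping the inner markup, one simple option is to drop every tag, truncate the plain text, and wrap the result in a single p element. This is a rough sketch using only regexes; the `plain_lede` name and the list of block-level tags are assumptions for illustration, and a regex is not a real HTML parser (comments or a ">" inside an attribute value would confuse it).

```ruby
# Hypothetical helper: reduces HTML to a one-paragraph plain-text summary.
def plain_lede(html, word_count = 25)
  # Turn a few block-level tags into spaces so words in neighbouring
  # paragraphs do not get glued together, then drop all remaining tags
  # (this also removes any <img>).
  text = html.gsub(%r{</?(?:p|div|br|li)\b[^>]*>}i, " ").gsub(/<[^>]*>/, "")

  words = text.split
  summary = words.first(word_count).join(" ")
  summary += " ..." if words.length > word_count
  "<p>#{summary}</p>"
end

html = '<p>From the <a href="#">UKTI blog</a>:</p><p><img src="x.png" />Early on 20 January 2009</p>'
puts plain_lede(html, 5)
# → <p>From the UKTI blog: Early ...</p>
```

The price is that links and emphasis inside the summary are lost.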

cmaxvv · July 20, 2009, 11:44am

So, ideally i would preserve the start and end tags and
strip the content down to 25 words.

It’s easy enough to slap a “</p>” on the end. But something else you
might not have considered is: what if the 25th word is inside a tag, for
instance:

<a href="blah"
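
To make that concrete, here is a small illustration of what happens when raw HTML is split on whitespace. The sample markup is invented for the demonstration.

```ruby
# Invented sample markup, just to show the failure mode.
html = '<p>See the <a href="blah" title="UKTI blog">post</a> for details</p>'

# Naive approach: treat the markup as plain text and keep the first 4 "words".
snippet = html.split.first(4).join(" ")
puts snippet
# → <p>See the <a href="blah"
```

The cut lands in the middle of the a tag, so appending a closing p tag cannot repair the result.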

cmaxvv · July 20, 2009, 11:43am

I didn’t say it was the desired output, i just said that was what my
crappy version was currently doing. Anyway, i’ve troubled you enough.

thanks for all your help.

cmaxvv · July 20, 2009, 11:50am

7stud – wrote:

So, ideally i would preserve the start and end tags and
strip the content down to 25 words.

It’s easy enough to slap a “</p>” on the end. But something else you
might not have considered is: what if the 25th word is inside a tag, for
instance:
<a href="blah"

Yeah, i know, it’s a bit complicated isn’t it. Slapping a </p> on the
end isn’t enough because there could be a load of unfinished tags - for
example, half a p tag with half an a tag inside it. Anything really.
That’s why ideally i would just consider the inner content of tags and
when it comes to tags inside tags, either remove them completely or
leave them as they are. This seems like such a common thing on the net,
to have a short section followed by ‘read more’ for example, that i
thought there would be an easy way to do it.
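
One possible approach, sketched here with plain Ruby rather than Hpricot, is to tokenise the markup into whole tags and runs of text, count words only in the text, and close whatever tags are still open once the budget is spent. Everything below is an assumption for illustration: the `truncate_html` name, the `VOID_TAGS` list and the regex tokeniser are made up, and the regex will not cope with every kind of markup (for example a ">" inside an attribute value).

```ruby
# Tags that never have a closing tag (a partial, assumed list).
VOID_TAGS = %w[br hr img input meta link].freeze

# Hypothetical helper: keeps the first `word_count` words of text, drops
# images, and closes any tags left open at the cut-off point.
def truncate_html(html, word_count = 25)
  # Whole tags and text runs become separate tokens, so a tag is never split.
  tokens = html.scan(/<[^>]+>|[^<]+/)
  out = []
  open_tags = []
  remaining = word_count

  tokens.each_with_index do |token, i|
    if token.start_with?("<")
      name = token[%r{\A</?\s*(\w+)}, 1].to_s.downcase
      if name.empty?                      # comments, doctype: pass through
        out << token
        next
      end
      next if name == "img"               # strip images
      if token.start_with?("</")
        next unless open_tags.last == name  # ignore stray closing tags
        open_tags.pop
      elsif !VOID_TAGS.include?(name) && !token.end_with?("/>")
        open_tags.push(name)
      end
      out << token
    else
      words = token.split
      if words.length < remaining
        out << token
        remaining -= words.length
      else
        # Budget is used up in this text run; add "..." only if text was cut.
        rest = tokens[(i + 1)..-1]
        more = words.length > remaining ||
               rest.any? { |t| !t.start_with?("<") && t =~ /\S/ }
        out << token[/\A\s*/] + words.first(remaining).join(" ")
        out << " ..." if more
        break
      end
    end
  end

  # Close still-open tags, innermost first.
  open_tags.reverse_each { |name| out << "</#{name}>" }
  out.join
end

puts truncate_html('<p>Early on <a href="blah">20 January 2009</a>, nine British</p>', 4)
# → <p>Early on <a href="blah">20 January ...</a></p>
```

Punctuation that is separated from a word by a tag (such as a colon right after a closing a tag) counts as its own word here, so the limit is approximate.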