RSS parsing error

Young_Gyu_P · July 13, 2009, 4:54pm

Lately I have been trying to parse ‘http://www.forbes.com/news/index.xml’
using Feedzirra.
If you open this URL, you can see what the problem is.

They added an unnecessary HTML tag, which makes the RSS malformed.

But Google's RSS reader processes the feed correctly without any
problem.
What I wonder is how they manage it while I can't.

Please help me narrow the gap between me and Google ^.^

Have a happy day.

Young_Gyu_P · July 14, 2009, 1:50pm

Hi,

In “Re: rss parsing error.” on Tue, 14 Jul 2009 00:26:37 +0900,
Juvenn W. wrote:

Hi, Young:
I think you could check out Universal Feed Parser at feedparser.org,
which is a Python package, mainly created by Mark Pilgrim. I'm
guessing Google Reader uses it in the backend.
As far as I know, there's no equivalent Ruby package.

The feed can be parsed with the RSS parser bundled with Ruby.
We don't need to use Universal Feed Parser.
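
As a rough illustration of that suggestion, here is a minimal sketch using Ruby's `rss` library with validation turned off. The helper name `parse_feed_leniently` and the sample feed are made up for this example, and it has not been run against the Forbes feed. On Ruby 3.0 and later `rss` is a bundled gem, so it may need `gem install rss` first.

```ruby
require 'rss'

# Hypothetical helper (not part of any library): parse an RSS document
# held in a string, with spec validation disabled.
def parse_feed_leniently(xml_text)
  # The second argument (false) tells RSS::Parser not to validate the feed.
  feed = RSS::Parser.parse(xml_text, false)
  return nil if feed.nil?

  {
    title: feed.channel.title,
    items: feed.items.map { |item| { title: item.title, link: item.link } }
  }
end

# Made-up sample data, including a CDATA section inside <link>.
SAMPLE_FEED = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <rss version="2.0">
    <channel>
      <title>Sample feed</title>
      <link>http://example.com/</link>
      <description>Made-up data for illustration</description>
      <item>
        <title>First story</title>
        <link><![CDATA[http://example.com/first]]></link>
      </item>
    </channel>
  </rss>
XML

p parse_feed_leniently(SAMPLE_FEED)
```

Disabling validation only relaxes the RSS-level checks. XML that is not well formed will most likely still raise an error, so this is not a cure for every broken feed.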

Young_Gyu_P · July 13, 2009, 5:27pm

Hi, Young:
I think you could check out Universal Feed Parser at feedparser.org,
which is a Python package, mainly created by Mark Pilgrim. I'm
guessing Google Reader uses it in the backend.
As far as I know, there's no equivalent Ruby package.

Regards,

Juvenn W.

Young_Gyu_P · September 7, 2009, 6:40am

Has anybody tried comparing the performance of Feedzirra and Universal
Feed Parser? Which is faster when processing thousands of feeds?

Young_Gyu_P · September 7, 2009, 8:05pm

Young Gyu P. wrote:

Lately I have been trying to parse ‘http://www.forbes.com/news/index.xml’
using Feedzirra.
If you open this URL, you can see what the problem is.

They added an unnecessary HTML tag, which makes the RSS malformed.

Glancing at the output of their feed I see no malformed RSS. I do see
them “exercising some options” that most feeds don’t, such as embedding
CDATA in the link tags.

Using Nokogiri to parse this feed is easy:

#!/usr/bin/env ruby

require 'nokogiri'
require 'open-uri'

url = 'http://www.forbes.com/news/index.xml'
# Use URI.open (Ruby 2.5+); plain open(url) no longer fetches URLs on Ruby 3.0+.
xml = Nokogiri::XML(URI.open(url))

puts "Feed title:       #{ (xml%'title').content }"
puts "Feed description: #{ (xml%'description').content }"
puts "Feed link:        #{ (xml%'link').content }"

# get the first item ('/' is Nokogiri's alias for search, '%' for at)
item = (xml/'item').first
puts "Item title:       #{ (item%'title').content }"
puts "Item link:        #{ (item%'link').content }"
puts "Item pubDate:     #{ (item%'pubDate').content }"
puts "Item description: #{ (item%'description').content }"
puts "Item author:      #{ (item%'author').content }"

Not all feeds are this straightforward or well constructed. That's where
a pre-built parsing library comes in handy, but I haven't found
one yet that handles everything out there correctly. Even Google's
reader gets it wrong on some malformed feeds.

Aaron P. (AKA tenderlove) has done a great job with Nokogiri.
I’ve tested a lot of feeds and seen occasions where the built-in RSS
reader and other libraries puked or spun off and never returned. I’ve
run into feeds that caused Hpricot to be unable to strip broken HTML
embedded inside the descriptions, but Nokogiri was able to handle it.
So, if you can’t get a library to do what you want, jump in with
Nokogiri and give it a try.