Webpage to RSS

Does anyone have a ruby based script lying around that would transform
updates to a webpage to RSS or some other feed format?
We don’t use a CMS for our website but there are items that are updated
often and RSS feeds might be appreciated. Someone must have done this
before I guess.
So is there a script that might do that? The categories are separated by
h2 tags and the items are in li tags.

Bart

Hi,
Since this is such a specialized task, it really depends on the website you
are transforming. I would suggest you take a look at Hpricot
(http://code.whytheluckystiff.net/hpricot) and at the RSS class in the
standard library. It shouldn’t be too hard to roll one up, and we can always
help. I am usually in #ruby-lang on freenode after 7 every night.

Chris
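
As a rough illustration of the feed-generation half of that suggestion, here is a minimal sketch using RSS::Maker from Ruby’s rss library (part of the standard library at the time of this thread; newer Ruby versions ship it as a separate gem). The site URL, the titles and the `updates` array are made-up placeholders; in a real script they would come from the parsed page.

```ruby
require "rss"

# Hypothetical data: in a real script these would be scraped from the page.
updates = [
  { title: "Somefiles description",
    link:  "http://example.com/files#somefiles",
    date:  Time.utc(2006, 10, 9) },
]

feed = RSS::Maker.make("2.0") do |maker|
  # RSS 2.0 requires a title, link and description on the channel.
  maker.channel.title = "Website updates"
  maker.channel.link = "http://example.com/"
  maker.channel.description = "Items recently added to the site"

  # One feed item per update found on the page.
  updates.each do |update|
    maker.items.new_item do |item|
      item.title = update[:title]
      item.link = update[:link]
      item.date = update[:date]
    end
  end
end

puts feed.to_s
```

Writing the result of `feed.to_s` to a file next to the page would probably be enough for a site without a CMS.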

Hi,

On Oct 9, 9:12 pm, Bart B. wrote:

Does anyone have a ruby based script lying around that would transform
updates to a webpage to RSS or some other feed format?

You could use hpricot (http://code.whytheluckystiff.net/hpricot/) to
parse the HTML and then use feedtools
(http://sporkmonger.com/articles/2005/08/11/tutorial/) to generate the
RSS.

Lutz

On 10/10/06, Chris C. wrote:

On 10/9/06, Bart B. wrote:

Bart

The Rails Recipes book by Chad F. has something similar that is almost
ready to use. If you don’t have the book, you can still download the
sample code, I guess.

On Wed, Oct 11, 2006 at 12:30:16AM +0900, Bart B. wrote:

  • Somefiles description. Addition date.

    I can cope with setting a date in the RSS; the problem is parsing this
    structure. There is no surrounding element for the ul, and I need both the
    structure and the substructure information because the combination of those
    two defines the effective identity of the ul and its items.
    There seems to be no method to “give me everything between two specific tags
    and then go on to the next one”…

    I’m not sure I understand exactly, but here’s my impression of what you’re
    trying to do.

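    require 'rubygems'
    require 'hpricot'

    doc = Hpricot(html_string) # html_string holds the page source
    (doc/:h3).each do |h3|
      rss_title = h3 # okay, so you have the 3rd-level header
      rss_contents = Hpricot::Elements[]

      # collect the siblings after the header, up to and including the next ul
      ele = h3
      while ele = ele.next_sibling
        rss_contents << ele
        break if ele.respond_to?(:name) and ele.name == "ul"
      end
    end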

    So, basically, you can use next_sibling (or previous_sibling) to walk back
    and forth between HTML brothers and sisters. I store it in an
    Hpricot::Elements array, since you can then just call rss_contents.to_html
    or do other searches on it.

    This is available since changeset [49], so you’ll need to either install
    from SVN or monkeypatch.

    _why

    [49] http://code.whytheluckystiff.net/hpricot/changeset/49
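
For readers without Hpricot at hand, the same "collect everything after the header up to the next ul" walk can be sketched in plain Ruby. This is only an illustration under stated assumptions: `Sibling` and `sections_by_h3` are hypothetical names, the page is modelled as a flat array of sibling elements rather than a parsed document, and "Substructure 2" and "Another item." are invented sample data.

```ruby
# Hypothetical stand-in for a parsed element (not an Hpricot class).
Sibling = Struct.new(:name, :text)

# For every h3 in a flat list of siblings, collect the elements that follow it,
# stopping after the first ul (the same stop condition as the loop above).
def sections_by_h3(siblings)
  sections = []
  siblings.each_with_index do |node, index|
    next unless node.name == "h3"
    contents = []
    siblings.drop(index + 1).each do |following|
      contents << following
      break if following.name == "ul"
    end
    sections << [node, contents]
  end
  sections
end

siblings = [
  Sibling.new("h2", "Structure 1"),
  Sibling.new("h3", "Substructure 1"),
  Sibling.new("p",  "Substructure info"),
  Sibling.new("ul", "Somefiles description. Addition date."),
  Sibling.new("h3", "Substructure 2"),
  Sibling.new("ul", "Another item."),
]

sections_by_h3(siblings).each do |header, contents|
  puts "#{header.text}: #{contents.map(&:name).join(', ')}"
end
```

Each pair holds the header and the elements that belong to it, which is roughly what `rss_title` and `rss_contents` hold in the Hpricot version.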

    why the lucky stiff wrote:

        break if ele.respond_to?(:name) and ele.name == "ul"
      end
    end

    So, basically, you can use next_sibling (or previous_sibling) to walk
    back and forth between HTML brothers and sisters. I store it in an
    Hpricot::Elements array, since you can then just call
    rss_contents.to_html or do other searches on it.

    This is available since changeset [49], so you’ll need to either install
    from SVN or monkeypatch.

    The next_sibling and previous_sibling methods are just what I needed.
    Now for an svn checkout…

    Thanks a lot!
    Bart

    Lutz H. wrote:

    You could use hpricot (http://code.whytheluckystiff.net/hpricot/) to
    parse the HTML and then use feedtools
    (http://sporkmonger.com/articles/2005/08/11/tutorial/) to generate the
    RSS.

    Wow, hpricot seems pretty nice. I noticed the hype, but now I understand…
    One question though: do you see a way of parsing a structure like this with
    hpricot:

    Structure 1

    Substructure 1

    Substructure info

    • Somefiles description. Addition date.

    I can cope with setting a date in the RSS; the problem is parsing this
    structure. There is no surrounding element for the ul, and I need both the
    structure and the substructure information because the combination of those
    two defines the effective identity of the ul and its items.
    There seems to be no method to “give me everything between two specific tags
    and then go on to the next one”…

      Thanks for the pointers
      Bart
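
The identity problem described in this question (each item belongs to the combination of the structure and substructure headers above it) can be sketched as a single pass that remembers the most recent headers. This is a plain-Ruby illustration, not Hpricot code: `Element`, `FeedEntry` and `feed_entries` are hypothetical names, the page is assumed to be already flattened into a list of top-level elements, and the mapping of "Structure" to h2 and "Substructure" to h3 is an assumption based on the thread.

```ruby
# Hypothetical flattened view of the page: one entry per top-level element.
# For a ul, :items holds the text of its li children.
Element = Struct.new(:name, :text, :items)
FeedEntry = Struct.new(:category, :title)

# Walk the elements in document order, tracking the current h2 and h3,
# and emit one entry per li tagged with the combined header path.
def feed_entries(elements)
  h2 = h3 = nil
  entries = []
  elements.each do |el|
    case el.name
    when "h2"
      h2 = el.text
      h3 = nil # a new structure resets the substructure
    when "h3"
      h3 = el.text
    when "ul"
      category = [h2, h3].compact.join(" / ")
      el.items.each { |li| entries << FeedEntry.new(category, li) }
    end
  end
  entries
end

page = [
  Element.new("h2", "Structure 1"),
  Element.new("h3", "Substructure 1"),
  Element.new("p",  "Substructure info"),
  Element.new("ul", nil, ["Somefiles description. Addition date."]),
]

feed_entries(page).each { |entry| puts "#{entry.category}: #{entry.title}" }
# prints "Structure 1 / Substructure 1: Somefiles description. Addition date."
```

The combined category string could serve as the RSS item category or as part of its guid, so that items with the same description under different headers stay distinct.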