Hpricot HTML Parsing

Hi,
I’m getting funky characters when parsing HTML using Hpricot.
How can I remove these funky characters?

Anyone have a fix / workaround for this?

thanks in advance,
Suja

Could you describe these ‘funky characters’?

Lee J. wrote:

Could you describe these ‘funky characters’?

Like ‘�’ in this text.
“By Mike Monson CHAMPAIGN � Effective today the city of Champaign is
closing three bridges and posting load limits on three others.”

Hi Suja,

two suggestions:

  • check the encoding used by the page you’re hashpricoting (doh -
    think I just invented a verb, or what).
  • puts $KCODE to see if you’re running in Unicode or not. If you are
    hashpricoting a page encoded in UTF-8 but $KCODE is set to none (or if
    the page is in Latin-1 but $KCODE is set to U), then you’ll have to
    convert the text to the matching encoding, using iconv for instance.
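A rough sketch of the conversion step, in case it helps. It assumes Ruby 1.9 or later, where String#force_encoding and String#encode do the job that iconv and $KCODE did on 1.8. The helper name to_utf8 is made up, and Windows-1252 as the page encoding is only a guess for illustration; check what the page actually declares.

```ruby
# Hypothetical helper: treat the raw bytes as source_encoding and
# transcode them to UTF-8. Assumes Ruby 1.9+ (no iconv needed).
def to_utf8(raw, source_encoding)
  raw.dup.force_encoding(source_encoding).encode("UTF-8")
end

# Sample bytes (an assumption, not taken from the real page):
# 0x96 is an en dash in Windows-1252 but is not valid UTF-8 on its own,
# so it shows up as a replacement character when read as UTF-8.
raw = "CHAMPAIGN \x96 Effective today".b

fixed = to_utf8(raw, "Windows-1252")
puts fixed
```

If the guessed source encoding is wrong, the result will still contain odd characters, so it is worth confirming the encoding before converting.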

cheers

Thibaut

“By Mike Monson CHAMPAIGN ? Effective today the city of Champaign is
closing three bridges and posting load limits on three others.”

hint hint :
http://www.news-gazette.com/news/local/2007/09/14/city_closes_three_bridges_limits_loads

The minus character you see after CHAMPAIGN is not a regular “-”.
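One way to check which character it really is: list every non-ASCII character with its code point. This is only a sketch, assuming Ruby 1.9+ and a string that is already valid UTF-8. The helper name non_ascii_chars is made up, and the sample text uses an en dash (U+2013) purely as an example; the character on the actual page may be a different one.

```ruby
# Hypothetical helper: return [index, character, code point] for every
# non-ASCII character in text.
def non_ascii_chars(text)
  text.each_char.with_index
      .reject { |ch, _| ch.ascii_only? }
      .map { |ch, i| [i, ch, format("U+%04X", ch.ord)] }
end

# Assumed sample: an en dash where a plain "-" might be expected.
sample = "CHAMPAIGN \u2013 Effective"

non_ascii_chars(sample).each do |index, ch, code_point|
  puts "#{index}: #{ch} #{code_point}"
end
```

A regular "-" is U+002D and plain ASCII, so it would not appear in this list at all.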