Saving web pages: charset problems and symbol problems


#1

Hi all!

I think a lot of Ruby scripts are written for web crawling, web scraping,
and many other web-related applications. I'm working with the web too: I'm
trying to save the text of many different websites. At the moment I'm
trying to solve two problems:

1 - How to standardize the charset of web pages. There are a lot of
different charsets, and I think there must be a better solution than
checking the charset of every page and converting it to the proper one
each time. (By the way, what is the best method to detect the charset of
a file? The file command is not very reliable, I think.)
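
To make question 1 more concrete, here is a rough sketch of the kind of normalization I mean, using only Ruby's built-in String#encode. The helper name to_utf8, the declared_charset argument (for example a value taken from the Content-Type header), and the Windows-1252 fallback are all my own assumptions, not an established recipe, and this is not real charset detection.

```ruby
# Hypothetical helper: turn raw page bytes into a UTF-8 string.
# declared_charset is an assumed input, e.g. read from the HTTP
# Content-Type header or a <meta> tag; it may be nil or wrong.
def to_utf8(raw, declared_charset = nil)
  # Try the declared charset first, then plain UTF-8.
  [declared_charset, "UTF-8"].compact.each do |name|
    begin
      encoding = Encoding.find(name)
    rescue ArgumentError
      next # unknown charset label, try the next candidate
    end
    text = raw.dup.force_encoding(encoding)
    # Only accept the candidate if the bytes are valid in it.
    return text.encode("UTF-8", undef: :replace) if text.valid_encoding?
  end
  # Assumed last resort: read the bytes as Windows-1252 and replace
  # anything that cannot be converted.
  raw.dup.force_encoding("Windows-1252")
     .encode("UTF-8", invalid: :replace, undef: :replace)
end
```

For example, to_utf8(bytes, "ISO-8859-1") would convert Latin-1 bytes, and to_utf8(bytes) would try UTF-8 before falling back. If a page declares the wrong charset, this still produces wrong characters, which is the part I don't know how to solve.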

2 - How to convert HTML to plain text. I use Hpricot, but a lot of very
unusual symbols remain in the output, like “€” or “””. Which is the most
commonly used method?
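
As an illustration of what I would like to end up with for question 2, here is a minimal sketch in plain Ruby. It does not use Hpricot. The html_to_text name and the small NAMED_ENTITIES table are hypothetical and cover only a handful of entities. Stripping tags with regular expressions is fragile on real pages, so take it as a sketch of the steps (drop markup, decode entities, tidy whitespace) and not as a robust converter.

```ruby
# Hypothetical, deliberately tiny entity table; real HTML defines many more.
NAMED_ENTITIES = {
  "amp" => "&", "lt" => "<", "gt" => ">", "quot" => '"', "apos" => "'",
  "nbsp" => " ", # simplified to a plain space so it collapses below
  "euro" => "\u20AC", "ldquo" => "\u201C", "rdquo" => "\u201D"
}.freeze

# Rough HTML-to-text conversion; assumes the input is already UTF-8
# and that numeric references hold valid code points.
def html_to_text(html)
  # Drop script and style elements together with their contents.
  text = html.gsub(%r{<(script|style)\b.*?</\1\s*>}mi, " ")
  # Drop all remaining tags (crude: breaks on ">" inside attributes).
  text = text.gsub(/<[^>]*>/, " ")
  # Decode entities after tag removal, so "&lt;b&gt;" stays literal text.
  text = text.gsub(/&(#[xX]\h+|#\d+|[A-Za-z]+);/) do
    ref = Regexp.last_match(1)
    if ref.start_with?("#x", "#X")
      [ref[2..-1].to_i(16)].pack("U")  # hexadecimal reference
    elsif ref.start_with?("#")
      [ref[1..-1].to_i].pack("U")      # decimal reference
    else
      NAMED_ENTITIES.fetch(ref) { "&#{ref};" } # unknown names stay as-is
    end
  end
  # Collapse runs of whitespace into single spaces.
  text.gsub(/\s+/, " ").strip
end
```

With this, a fragment such as "10&nbsp;&euro;" should come out as the digits, a space, and the euro sign. Symbols that still look wrong after this step are probably a charset problem (question 1) and not an HTML problem.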

Thanks a lot