Hello
I have just started my adventures with Ruby and I want to write a simple parser:
require 'net/http'
require 'uri'

# Ruby 1.9 only: treat external and internal data as UTF-8 by default
if RUBY_VERSION =~ /1\.9/
  Encoding.default_external = Encoding::UTF_8
  Encoding.default_internal = Encoding::UTF_8
end

url = URI.parse('example url')
response = Net::HTTP.start(url.host, url.port) do |http|
  http.get(url.path)
end
main_page = response.body
links = main_page.slice!(/
.+</table>/)
I am getting an error: parser.rb:17:in `slice!': invalid byte sequence in
UTF-8 (ArgumentError)
Could somebody explain how to resolve this problem?
None of the solutions I have found work for me.
Regards
marek
March 3, 2011, 9:34pm
2
2011/3/3 Marek K. [email protected] :
main_page = response.body
links = main_page.slice!(/
.+</table>/)
Add this line to check the body you received:
puts main_page.inspect
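Beyond `inspect`, it can help to ask Ruby how the string is tagged and whether its bytes are valid for that tag. The following is only a minimal sketch: the helper name `describe_body` is hypothetical, and a hard-coded sample with one stray high byte stands in for `response.body`.

```ruby
# Hypothetical helper: summarises how Ruby sees a downloaded string.
def describe_body(body)
  {
    encoding: body.encoding,            # the encoding Ruby assumes for the bytes
    valid: body.valid_encoding?,        # false if the bytes do not fit that encoding
    high_bytes: body.bytes.select { |b| b > 0x7F }.uniq # non-ASCII bytes worth inspecting
  }
end

# Sample data standing in for response.body (an assumption, not the real page).
sample = "abc\xB6def".b.force_encoding(Encoding::UTF_8)
report = describe_body(sample)
puts report.inspect
```

If `valid` comes back false, the bytes listed under `high_bytes` are a good hint about which single-byte encoding the page may really use.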
marek
March 3, 2011, 9:37pm
3
“Iñaki Baz C.” [email protected] wrote in post #985298:
2011/3/3 Marek K. [email protected] :
main_page = response.body
links = main_page.slice!(/
.+</table>/)
Add this line to check the body you received:
puts main_page.inspect
Thanks for your answer.
It returns the HTML of the page.
marek
March 3, 2011, 9:56pm
4
2011/3/3 Marek K. [email protected] :
It returns the HTML of the page.
Sure, but does it print a “strange” symbol on your screen?
marek
March 3, 2011, 10:16pm
5
2011/3/3 Marek K. [email protected] :
Everything seems to look OK; the only strange thing might be:
Biura nieruchomo\xB6ci | Agencje nieruchomo\xB6ci
Are the Polish letters in the page the problem?
Maybe that page is not encoded in UTF-8.
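A quick way to test that guess is to take the bytes from the fragment above and ask Ruby whether they form valid UTF-8. This is a sketch on a hard-coded sample, not on the real page; the variable names are illustrative only.

```ruby
# The fragment printed by inspect, as raw bytes tagged UTF-8.
fragment = "nieruchomo\xB6ci".b.force_encoding(Encoding::UTF_8)

# 0xB6 on its own is a UTF-8 continuation byte, so the string is not valid UTF-8.
puts fragment.valid_encoding?

# Regexp-based methods refuse strings whose bytes do not match their encoding.
error = begin
  fragment.slice(/o.ci/)
  nil
rescue ArgumentError => e
  e
end
puts error.message if error
```

The same kind of ArgumentError is what `slice!` reported in the original script.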
marek
March 3, 2011, 10:02pm
6
Everything seems to look OK; the only strange thing might be:
Biura nieruchomo\xB6ci | Agencje nieruchomo\xB6ci
Are the Polish letters in the page the problem?
marek
March 3, 2011, 10:28pm
7
Its encoding is ISO-8859-2.
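Given that, one common approach is to tag the downloaded bytes as ISO-8859-2 and then transcode them to UTF-8 before running any regexp over them. A minimal sketch on sample data (the string below is an assumption standing in for `response.body`):

```ruby
# Raw bytes as they would arrive from an ISO-8859-2 page (sample data).
raw = "Biura nieruchomo\xB6ci".b

# Tag the bytes with their real encoding, then transcode to UTF-8.
utf8 = raw.force_encoding(Encoding::ISO_8859_2).encode(Encoding::UTF_8)

puts utf8.valid_encoding?
puts utf8.slice(/nieruchomo.ci/) # regexps work once the string is valid UTF-8
```

`force_encoding` only changes the label on the bytes, while `encode` actually rewrites them, which is why both steps are needed here.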
marek
March 4, 2011, 1:35am
8
…then why did you say this:
Encoding.default_external = Encoding::UTF_8
marek
March 4, 2011, 11:00pm
9
The problem is solved. My IDE was causing it.
Regards
marek
March 4, 2011, 10:31am
10
I forgot to check the encoding.
I set Encoding.default_external = Encoding::UTF_8 because without it I
got the same error, so I assumed that the page encoding was UTF-8.
Is there any chance that it will work with ISO-8859-2?
I tried another solution: I added # coding: iso-8859-2 to the top of the file
and, of course, removed Encoding.default_external = Encoding::UTF_8, but I still
get the same problem.
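The `# coding:` comment only sets the encoding of string literals in the source file itself; it does not relabel data downloaded at run time, so the response body has to be converted explicitly. The sketch below is one possible way to do that, not the fix that was eventually used in this thread. The helper names `charset_from` and `body_to_utf8` and the `DEFAULT_CHARSET` fallback are hypothetical; with a real response the header value would come from `response['content-type']`.

```ruby
DEFAULT_CHARSET = "ISO-8859-2" # assumption: fallback for this particular site

# Hypothetical helper: picks the charset out of a Content-Type header value.
def charset_from(content_type)
  name = content_type.to_s[/charset=["']?([^\s;"']+)/i, 1] || DEFAULT_CHARSET
  Encoding.find(name)
rescue ArgumentError
  Encoding.find(DEFAULT_CHARSET) # unknown charset label, use the fallback
end

# Hypothetical helper: relabels the raw bytes, then transcodes them to UTF-8.
def body_to_utf8(raw_body, content_type)
  raw_body.dup.force_encoding(charset_from(content_type))
          .encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
end

# Sample data standing in for response.body and response['content-type'].
page = body_to_utf8("Agencje nieruchomo\xB6ci".b, "text/html; charset=iso-8859-2")
puts page
```

Replacing invalid or undefined characters instead of raising is a design choice for scraping; stricter code could drop those options and handle the exception.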