Setting encoding of pages in Capybara

Hi all,

Quick encoding question: say I’m trying to grab data from a Japanese page using Capybara and Rack::Test, and I get badly encoded text in the response, e.g. running this script:

require 'rubygems'
require 'capybara'
require 'rack/test'
require 'rack/proxy'

Capybara.default_selector = :css

class Japan < Rack::Proxy
  def rewrite_env(env)
    env['HTTP_HOST'] = 'l-tike.com'
    env
  end
end

session = Capybara::Session.new(:rack_test, Japan.new)
session.visit '/pickup/concert_more.html'
puts session.body

You’ll see weird characters in the output, and I can’t find nodes that should be there with CSS/XPath. How do I set the encoding so that Nokogiri parses the page properly?

Hi,

On Tue, Sep 7, 2010 at 6:27 AM, James C. [email protected]
wrote:

Hi all,

Quick encoding question: say I’m trying to grab data from a Japanese page using Capybara and Rack::Test, and I get badly encoded text in the response, e.g. running this script:

First, a quick note: this question is probably more appropriate for the capybara or nokogiri mailing lists. You’re likely to get a quicker response from those groups.

env['HTTP_HOST'] = 'l-tike.com'
env
end
end

session = Capybara::Session.new(:rack_test, Japan.new)
session.visit '/pickup/concert_more.html'
puts session.body

It looks like this page claims (in its header) to be encoded in Shift_JIS, but it is actually encoded in UTF-8. LibXML’s encoding guesses are not perfect, and in this case the misleading declaration causes it to trust the header and use the wrong encoding.
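One workaround on the client side, sketched below with plain Ruby strings (assuming the response bytes really are UTF-8 despite the header): re-tag the body with `String#force_encoding` before parsing it, since `force_encoding` changes only the encoding label, not the bytes.

```ruby
# Bytes that are valid UTF-8 for "日本語", but — as with the page above —
# arrive tagged with the wrong encoding because of a misleading header.
bytes = "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E".dup.force_encoding('Shift_JIS')

# Re-tag the string as UTF-8; the underlying bytes are untouched.
utf8 = bytes.force_encoding('UTF-8')

utf8.valid_encoding?  # => true
utf8                  # => "日本語"
```

In the Capybara case you would apply the same re-tagging to `session.body` before handing it to your own parsing code.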

If this page is edited so its Content-Type meta tag declares charset=UTF-8 instead of charset=Shift_JIS, then all is well.

Perhaps someone with more experience than I have with non-Western character sets will have a deeper insight into libxml’s behavior here?