How can I get malformed UTF-8 characters to display properly?

Hello everyone,

I’m scraping a lot of sites for a project, and occasionally the
scraped content will have “malformed UTF-8” characters. When the
scraped content is processed (basically a database record is created),
these characters often don’t appear as they’re supposed to.

Normally, the following code works great:

str.unpack("U*").collect {|s| (s > 127 ? "&##{s};" : s.chr) }.join("")
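To make the behaviour concrete, here is a minimal sketch of that one-liner on a valid string and on a malformed one. The wrapper name `to_entities` and the sample strings are made up for illustration, and I'm assuming Ruby 1.9 or later so the `\u` escape works.

```ruby
# Hypothetical wrapper around the one-liner; the name is mine, not from any library.
def to_entities(str)
  str.unpack("U*").collect { |s| (s > 127 ? "&##{s};" : s.chr) }.join("")
end

# Valid UTF-8: the e-acute (code point 233) becomes a numeric entity.
valid = to_entities("caf\u00E9")   # => "caf&#233;"

# Malformed UTF-8: 0xD5 announces a two-byte sequence, but "d" is not a
# continuation byte, so unpack("U*") raises instead of returning code points.
malformed_error = begin
  to_entities("you\xD5d")
  nil
rescue ArgumentError => e
  e.class
end
# malformed_error => ArgumentError
```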

But it won’t work with these “malformed UTF-8” characters. So I’ve
written the following to handle these characters, but it still isn’t
perfect. For example, I scraped this page:

The alt attribute of the first thumbnail, steel surround, contains the
text “Steel has that effect where you’d least expect it”. The ’
character shows up as Õ when I use the method below, and the “d” is
just swallowed.
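A guess on my part, not something I've confirmed: the page may not be UTF-8 at all. Byte 0xD5 is the right single quote in MacRoman and Õ in ISO-8859-1, which would match what I'm seeing. The sketch below only illustrates that guess; it assumes Ruby 1.9 or later (for `force_encoding` and `encode`), and the sample bytes are mine rather than taken from the page.

```ruby
# The single byte in question, held as raw binary data.
raw = "\xD5".b

# Interpreted as MacRoman, it is the right single quotation mark (U+2019).
as_mac_roman = raw.dup.force_encoding("macRoman").encode("UTF-8")

# Interpreted as ISO-8859-1, it is Õ (U+00D5), which is what shows up.
as_latin1 = raw.dup.force_encoding("ISO-8859-1").encode("UTF-8")

# Read as UTF-8, 0xD5 followed by "d" is not a valid sequence.
looks_like_utf8 = "you\xD5d".valid_encoding?   # => false
```

If that guess is right, transcoding the whole page from its real encoding before processing would avoid the per-character handling entirely.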

data.gsub!(/\323/, '"')

require 'oniguruma'

o = Oniguruma::ORegexp.new('[^[:ascii:]]')
# o = Oniguruma::ORegexp.new('[^[:ascii:]]', {:encoding =>

chars = []
data.each_char{|c|chars << c}
chars.collect do |c|
  if o.match c
  rescue ArgumentError
    add_log_message("Has malformed UTF-8 characters")
    # Handling malformed UTF-8: a huge pain and possibly a future cause of problems.
    bytes = []
    c.each_byte{|b| bytes << b}
    # Assumes we're handling, at most, 2-byte strings. We have no way of knowing whether
    # the malformed character is supposed to be one byte or two, but we're assuming it's 1.
    ["&##{bytes[0]}"] + bytes[1...-1].collect{|b|b.chr}
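One thing worth checking in that last line is the range. If it is the exclusive three-dot form, it drops the final byte, which might be where the "d" goes. This is a small sketch under the assumption that the malformed "character" arrives as the two bytes 0xD5 and "d"; the variable names are only for illustration.

```ruby
# The two bytes that would make up the malformed "character": 0xD5 and "d".
bytes = "\xD5d".bytes   # => [213, 100]

# Exclusive range (three dots): stops before the last byte, so "d" is lost.
exclusive = (["&##{bytes[0]}"] + bytes[1...-1].collect { |b| b.chr }).join
# exclusive => "&#213"

# Inclusive range (two dots): keeps every byte after the first.
inclusive = (["&##{bytes[0]}"] + bytes[1..-1].collect { |b| b.chr }).join
# inclusive => "&#213d"
```

Note that neither version emits the closing semicolon of the entity, so the result relies on the browser being lenient.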

Any suggestions?


