Extended ASCII character handling

"200 Millionen Jahre später # 17.39 \n",
"200 Millionen Jahre später # 9.87 3404211707 \n",
"A l'assaut de l'invisible 1977 # 4.91 \n",
"A l'assaut de l'invisible 1990 # 5.18 226603779 \n",

The above 4 lines are data I was attempting to load into an array to
test some code. I was getting what I thought were strange results until
I realized not all characters were being loaded into the element
resulting in column alignment problems.

The data above was cut from a file that had been manipulated a dozen
times in ruby arrays before being written to a file. So it appears the
default way ruby handles extended ASCII(?) is fine.

I have two questions:

  1. Should I ever have to worry about data scraped from web pages
     not being handled correctly by Ruby?

  2. How do I flag this data so that I can manipulate it properly, that
     is, load it into an array or write it to a file?

Tried playing with the following, but even if the code below is correct,
the extended ASCII characters are lost by the time it gets to IRB:

str = String.new
str.encode("US-ASCII")
str = "Millionen Jahre später"

Any suggestions on where I might find some insight?

Thanks Don

On Thu, 2010-11-18 at 01:01 +0900, Don N. wrote:

test some code. I was getting what I thought were strange results until
I realized not all characters were being loaded into the element
resulting in column alignment problems.

The data above was cut from a file that had been manipulated a dozen
times in ruby arrays before being written to a file. So it appears the
default way ruby handles extended ASCII(?) is fine.

I have two questions

  1. Should I ever have to worry about data being scraped from web pages
    not being handled correctly by ruby.

That would depend very much on how you scrape the data and whether you
handle things like meta tags correctly.

On 17.11.2010 17:01, Don N. wrote:

test some code. I was getting what I thought were strange results until
I realized not all characters were being loaded into the element
resulting in column alignment problems.

The data above was cut from a file that had been manipulated a dozen
times in ruby arrays before being written to a file. So it appears the
default way ruby handles extended ASCII(?) is fine.

I have two questions

  1. Should I ever have to worry about data being scraped from web pages
    not being handled correctly by ruby.

That depends on how you read the data from the web pages.

2)How do I flag this data to allow me to manipulate it properly. That is
load it into an array or write to a file.

You need to set encodings properly. You can do that when opening the
file. Example:

irb(main):001:0> io = File.open "x", "r"
=> #<File:x>
irb(main):002:0> io.external_encoding
=> #<Encoding:UTF-8>
irb(main):003:0> io.internal_encoding
=> nil
irb(main):004:0> io.read.encoding
=> #<Encoding:UTF-8>
irb(main):005:0> io.close
=> nil

irb(main):006:0> io = File.open "x", "r:ASCII"
=> #<File:x>
irb(main):007:0> io.external_encoding
=> #<Encoding:US-ASCII>
irb(main):008:0> io.internal_encoding
=> nil
irb(main):009:0> io.read.encoding
=> #<Encoding:US-ASCII>
irb(main):010:0> io.close
=> nil

See http://blog.grayproductions.net/articles/understanding_m17n
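The write direction of question 2 works the same way. A minimal sketch, assuming you want the file stored as UTF-8 (the filename and content are only examples):

```ruby
# Open for writing with an explicit external encoding; strings tagged with
# a different encoding are transcoded to it on write.
File.open("titles.txt", "w:UTF-8") do |io|
  io.puts "200 Millionen Jahre später # 17.39"
end

# Reading it back with the matching encoding tags the data correctly.
line = File.open("titles.txt", "r:UTF-8") { |io| io.gets }.chomp
line.encoding          # => #<Encoding:UTF-8>
line.valid_encoding?   # => true
```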

Tried playing with the following, but even if the code below is correct,
the extended ASCII characters are lost by the time it gets to IRB:

str = String.new
str.encode("US-ASCII")
str = "Millionen Jahre später"

This won't work - ever. You set the encoding for an instance and then
you reassign str to point to another instance, so all your encoding
settings are lost. Also, there is no "ü" in ASCII, which is 7-bit!

irb(main):011:0> s = "a"
=> "a"
irb(main):012:0> s.encoding
=> #<Encoding:UTF-8>
irb(main):013:0> t = s.encode "ASCII"
=> "a"
irb(main):014:0> t.encoding
=> #<Encoding:US-ASCII>

Now with "ü":

irb(main):015:0> s = "ü"
=> "ü"
irb(main):016:0> s.encoding
=> #<Encoding:UTF-8>
irb(main):017:0> t = s.encode "ASCII"
Encoding::UndefinedConversionError: "\xC3\xBC" from UTF-8 to US-ASCII
	from (irb):17:in `encode'
	from (irb):17
	from /usr/local/bin/irb19:12:in `<main>'
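To spell the distinction out (variable names are just for illustration): encode returns a new string with the bytes actually converted, while force_encoding re-tags the same bytes with a different encoding. A small sketch:

```ruby
s = "später"                  # tagged UTF-8; "ä" occupies two bytes here
t = s.encode("ISO-8859-1")    # new string, bytes transcoded: "ä" is one byte

s.bytesize                    # => 7
t.bytesize                    # => 6

# force_encoding changes only the tag, leaving the bytes alone --
# here it turns valid UTF-8 data into an invalid US-ASCII string:
u = s.dup.force_encoding("US-ASCII")
u.bytesize                    # => 7
u.valid_encoding?             # => false
```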

Kind regards

robert

Don N. wrote in post #962171:

I have two questions

  1. Should I ever have to worry about data being scraped from web pages
    not being handled correctly by ruby.

In ruby 1.9, you have to worry about this very much.

Strings in ruby 1.9 are two-dimensional: they have a sequence of bytes,
and they have an encoding. There are additional ‘dimensions’ based on
the string’s content - empty, ascii_compatible, valid_encoding.

If your scraper library doesn't document how it chooses the encoding to
tag each string it returns, and doesn't document how it handles invalid
encodings when it comes across them, then you have to test its behaviour
for all the various edge cases.

You never have this issue with ruby 1.8, because a string is just a
string of bytes. Of course, the “garbage in, garbage out” principle
still applies; you just don’t choke on the garbage.

2)How do I flag this data to allow me to manipulate it properly. That is
load it into an array or write to a file.

That’s a short question with a long answer, and I’m afraid my own
attempt to answer it is incomplete:
https://github.com/candlerb/string19/blob/master/string19.rb

If you’re reading stuff from a file or a socket yourself, you can
control the process. If you’re trusting a third-party library to fetch
data from somewhere, then you have to trust that library to do the right
thing in the situations you’re interested in.

Tried playing with the following but even if the code below is correct
the extended ascii characters are lost by the time it gets to IRB

irb is not a good predictor of encoding behaviour for ruby 1.9; you'd
be better off writing standalone .rb scripts that you run.

Note that it's one of the 1.9 language inconsistencies that transcoding
is not done on output by default. So if you have read a string from
a file, and carefully tagged it as, say, UTF-8, but your terminal is
IBM437, then

puts my_string

will just squirt the UTF-8 bytes to the terminal and they’ll display
wrongly. You can try something like this:

STDOUT.set_encoding "IBM437"

or
STDOUT.set_encoding "locale"
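Alternatively, you can transcode explicitly before printing. A sketch using IBM437 as the example target (characters with no IBM437 mapping would raise, so :replace is shown as a fallback):

```ruby
my_string = "später"                         # UTF-8 inside the program

cp437 = my_string.encode("IBM437")           # "ä" becomes the single byte 0x84
cp437.bytes.map { |b| format("%02X", b) }    # => ["73", "70", "84", "74", "65", "72"]

# Unmappable characters can be substituted instead of raising:
"λ ok".encode("IBM437", undef: :replace)     # => "? ok"
```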

Regards,

Brian.

I am using Nokogiri (with Mechanize) to scrape the data, and the data I
am concerned with is extracted only from displayable fields (<table
class="result"> …).

The code set/language references I see are

Which is, I believe, what I am calling Extended ASCII (8-bit, 0-255)

AND

//

The scraped data has never caused a problem within the Ruby program
(it would have been very obvious). Can I safely assume that code sets
will never present a problem for this specific application as long as
the retrieval methods do not change?
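If the pages declare their charset (HTTP header or meta tag), the raw bytes can also be re-tagged by hand before any processing. A rough sketch with a hard-coded sample; the regex charset sniff is deliberately naive and for illustration only - in real code, let Mechanize/Nokogiri's declared encoding win, or pass the charset to the parser explicitly:

```ruby
# Raw bytes as they might arrive off the wire (0xE4 is "ä" in ISO-8859-1):
raw = "<meta charset='ISO-8859-1'><td>sp\xE4ter</td>".b

# Naive sniff of the declared charset (illustration only):
charset = raw[/charset=['"]?([A-Za-z0-9-]+)/, 1]     # => "ISO-8859-1"

# Tag the bytes with their real encoding, then transcode to UTF-8:
text = raw.force_encoding(charset).encode("UTF-8")
text.include?("später")    # => true
```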

=========================
That being said, when I open the file with IO it reports
#<Encoding:IBM437>, which would contain the characters giving problems
(but not their correct representation). That is to say, IBM437
character E4 is a graphics character, not the umlaut 'ä' in "später".
The graphic is also what is being displayed in the IRB console.

I have gone through most of the Shades of Gray link, and the only thing
I thought might be of value is LC_CTYPE, but UTF-8 and ISO-8859-1 both
behave identically in my situation. I have removed LC_CTYPE since there
is no problem with internal data and it might cause a problem down the
line after I have forgotten about it.

Also tried saving code & data to a file and running the file (ruby
xxx.rb), and it still reports a multibyte error.

Played with Ruby's command-line encoding settings (ruby -E XXX) and
still received errors regardless of the code set I picked. This may be
related to LC_CTYPE, as I did not start a fresh session, so the old
value may still have been in effect.

The error is:

CodeSet.rb:4: invalid multibyte char (US-ASCII)

That is US-ASCII, which is 7-bit. The extended ASCII code sets
ISO-8859-1 and IBM437 are 8-bit, but I cannot seem to set this.
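That particular error is about the source file's own encoding, not the data's: in 1.9 a .rb file is parsed as US-ASCII unless its first line says otherwise (-E sets the default encodings for IO, not for the source). A magic comment on line 1 is the usual fix - a sketch:

```ruby
# encoding: UTF-8
# With the magic comment above, Ruby 1.9 parses this file as UTF-8, so an
# 8-bit literal like the one below is legal. Without it, 1.9 stops at
# parse time with "invalid multibyte char (US-ASCII)".
str = "später"
str.encoding    # => #<Encoding:UTF-8>
str.length      # => 6
```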

=======================

I can edit the data file externally and read the data into an array
without problems, so I will assume there is no need to pursue the code
set settings at this time.

Will not update unless I have a revelation.

BTW, the recommended link was excellent; I will save the URL as a
resource.