Forum: Ruby how to remove strange characters

Li CN (alex-osu3)
on 2008-10-07 18:30
Hi all,

I grab some info from a webpage. Sometimes I get some strange
characters as follows (by p):
 To depart in a hurry; abscond:  \342\200\234Your horse
has\nabsquatulated!\342\200\235 (Robert M. Bird) To die.

or (by print):
To depart in a hurry; abscond:  “Your horse has absquatulated!”
(Robert M. Bird) To die.

Any idea how to get rid of them?


Thanks,

Li
Stephen Celis (Guest)
on 2008-10-07 20:15
(Received via mailing list)
Hi,

On Tue, Oct 7, 2008 at 11:28 AM, Li Chen <chen_li3@yahoo.com> wrote:
> I grab some info from a webpage. Sometimes I get some strange
> characters as follows (by p):
>  To depart in a hurry; abscond:  \342\200\234Your horse
> has\nabsquatulated!\342\200\235 (Robert M. Bird) To die.
>
> or (by print):
> To depart in a hurry; abscond:  “Your horse has absquatulated!”
> (Robert M. Bird) To die.
>
> Any idea how to get rid of them?

Those are multi-byte characters (curly quotes, in this case). You
probably don't want to get rid of them, but you can use the iconv
library to transliterate them back to their ASCII almost-equivalents:

>> string = "To depart in a hurry; abscond:  \342\200\234Your horse has\nabsquatulated!\342\200\235 (Robert M. Bird) To die."
=> "To depart in a hurry; abscond:  \342\200\234Your horse has\nabsquatulated!\342\200\235 (Robert M. Bird) To die."
>> require 'iconv'
=> true
>> puts Iconv.iconv('ascii//translit', 'utf-8', string).to_s
To depart in a hurry; abscond:  "Your horse has
absquatulated!" (Robert M. Bird) To die.
=> nil
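
Note that Iconv was dropped from the standard library in Ruby 2.0, so the session above only runs as-is on 1.8/1.9 (or with the iconv gem). Here is a minimal sketch of the same idea for newer Rubies, assuming the input is valid UTF-8. The ASCII_EQUIVALENTS table and the to_ascii helper are made-up names for illustration, not part of any library, and the table only covers a handful of characters:

```ruby
# Hypothetical lookup table: typographic characters and the ASCII
# stand-ins chosen for them here. Extend it for your own pages.
ASCII_EQUIVALENTS = {
  "\u201C" => '"',   # left double quote  (UTF-8 bytes \342\200\234)
  "\u201D" => '"',   # right double quote (UTF-8 bytes \342\200\235)
  "\u2018" => "'",   # left single quote
  "\u2019" => "'",   # right single quote
  "\u2026" => "..."  # horizontal ellipsis (UTF-8 bytes \342\200\246)
}.freeze

# Hypothetical helper: swaps the mapped characters, then replaces anything
# else that has no ASCII form with "?". Expects valid UTF-8 input.
def to_ascii(text)
  text.gsub(Regexp.union(ASCII_EQUIVALENTS.keys), ASCII_EQUIVALENTS)
      .encode("US-ASCII", undef: :replace, replace: "?")
end

puts to_ascii("abscond:  \u201CYour horse has\nabsquatulated!\u201D (Robert M. Bird)")
```

Unlike //TRANSLIT, this only transliterates what is in the table; everything else outside ASCII becomes "?".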

Stephen
Li CN (alex-osu3)
on 2008-10-08 16:26
Stephen Celis wrote:

> Those are multi-byte characters (curly quotes, in this case). You
> probably don't want to get rid of them, but you can use the iconv
> library to transliterate them back to their ASCII almost-equivalents:
>
> >> string = "To depart in a hurry; abscond:  \342\200\234Your horse has\nabsquatulated!\342\200\235 (Robert M. Bird) To die."
> => "To depart in a hurry; abscond:  \342\200\234Your horse has\nabsquatulated!\342\200\235 (Robert M. Bird) To die."
> >> require 'iconv'
> => true
> >> puts Iconv.iconv('ascii//translit', 'utf-8', string).to_s
> To depart in a hurry; abscond:  "Your horse has
> absquatulated!" (Robert M. Bird) To die.
> => nil
>
> Stephen

Thank you,

Li
Li CN (alex-osu3)
on 2008-10-08 18:16
Hi Stephen and others,

Iconv only works for some characters. It fails on the following
string.

Any idea?

Thanks,

Li


C:\Users\Alex>irb
irb(main):001:0> require 'iconv'
=> true
irb(main):002:0> string1 = "Fatal injury or ruin:\223Hath some fond lover tic'd thee to thy bane?\224\342\200\246"
=> "Fatal injury or ruin:\223Hath some fond lover tic'd thee to thy bane?\224\342\200\246"
irb(main):003:0> puts Iconv.iconv('ASCII//TRANSLIT','utf-8',string1).to_s
Iconv::IllegalSequence: "\223Hath some fond "...
        from (irb):3:in `iconv'
        from (irb):3
irb(main):004:0>
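
A likely reason for the IllegalSequence error: \223 and \224 are not valid UTF-8 on their own (they are bare continuation bytes), while \342\200\246 is a proper UTF-8 ellipsis. If those two bytes are Windows-1252 curly quotes (an assumption; 0x93 and 0x94 are the curly double quotes in that code page), the string mixes two encodings and a UTF-8 decoder has to reject it. Below is a rough sketch for Ruby 2.1 or later, where String#scrub is available. repair_mixed is a made-up helper name:

```ruby
# Hypothetical helper: keeps the valid UTF-8 parts of +text+ and
# reinterprets each invalid byte run as Windows-1252 (an assumption
# about where the stray bytes came from).
def repair_mixed(text)
  utf8 = text.dup.force_encoding("UTF-8")
  return utf8 if utf8.valid_encoding?

  utf8.scrub do |bad_bytes|
    bad_bytes.dup.force_encoding("Windows-1252")
             .encode("UTF-8", undef: :replace, replace: "?")
  end
end

string1 = "Fatal injury or ruin:\223Hath some fond lover tic'd thee to thy bane?\224\342\200\246"
puts repair_mixed(string1)
```

Once the string is valid UTF-8, it can be transliterated or stripped like any other text.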
Pablo Q. (Guest)
on 2008-10-08 18:37
(Received via mailing list)
What do you think about doing something like this?

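class String
  # ASCII punctuation codes that are replaced along with everything
  # outside 32..124.
  REPLACED_CODES = [34, 35, 37, 42, 43, 44, 45, 47, 60, 61, 62, 63, 91, 92, 93, 94, 96, 123]

  # Replaces each unwanted byte with +replacement+, in place. Works byte
  # by byte, so a multi-byte character yields one replacement per byte.
  def remove_nonascii(replacement)
    cleaned = each_byte.map do |byte|
      if byte < 32 || byte > 124 || REPLACED_CODES.include?(byte)
        replacement
      else
        byte.chr
      end
    end
    replace(cleaned.join)
  end
end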

"Fatal injury or ruin:\223Hath some fond lover tic'd thee to thy bane?\224\342\200\246".remove_nonascii('+')

=> "Fatal injury or ruin:+Hath some fond lover tic'd thee to thy bane+++++"

As you can see, it replaces each unwanted character with '+'.


R. Kumar (sentinel)
on 2008-10-09 05:47
Li Chen wrote:
> Hi all,
>
> I grab some info from a webpage. Sometimes I get some strange
> characters as follows (by p):
>  To depart in a hurry; abscond:  \342\200\234Your horse
> has\nabsquatulated!\342\200\235 (Robert M. Bird) To die.

Here's a quick hack I used recently. The characters were messing up my
ncurses display, and I did not need them.

dataitem.gsub!(/[^[:space:][:print:]]/,'')

I found this while googling; IIRC, it's used somewhere in RoR.
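
One caveat, as far as I know: on Ruby 1.9 and later the POSIX classes are Unicode-aware for UTF-8 strings, so curly quotes count as printable and are kept, and a string containing invalid bytes makes gsub raise. Below is a sketch of a byte-level equivalent for Ruby 2.0 or later (it relies on String#b). strip_non_ascii is a made-up name:

```ruby
# Hypothetical helper: drops every byte that is neither printable ASCII
# (0x20..0x7E) nor ASCII whitespace. Works on raw bytes, so invalid or
# mixed encodings do not raise.
def strip_non_ascii(text)
  text.b.gsub(/[^\x20-\x7E\s]/, "").force_encoding("US-ASCII")
end

dataitem = "Fatal injury or ruin:\223Hath some fond lover tic'd thee to thy bane?\224\342\200\246"
puts strip_non_ascii(dataitem)
```

This returns a new string rather than modifying dataitem in place, which is a deliberate difference from gsub!.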
Li CN (alex-osu3)
on 2008-10-09 21:53
Nit Khair wrote:
> Here's a quick hack I used recently. The characters were messing up my
> ncurses display, and I did not need them.
>
> dataitem.gsub!(/[^[:space:][:print:]]/,'')
>
> I found this while googling; IIRC, it's used somewhere in RoR.

It works in the scenario where iconv doesn't work. Good job!!!

Li
Bilyk, Alex (Guest)
on 2008-10-10 02:57
(Received via mailing list)
There is no one-click installer for 1.9 on Windows as far as I can tell.
Downloading and unpacking the zipped binaries didn't get me very far, as
both ruby and irb complain that something is missing. Does the binary
distribution require me to install anything else, such as libraries? If
so, what additional stuff do I need to make 1.9 work, and where can I
get it?

Thanks,
Alex