Perl regexp to ruby one conversion?

unknown · March 23, 2006, 1:43pm

I have a Perl regexp:

$field =~
m/^(
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*$/x;

It detects whether $field consists of valid UTF-8 characters, and I’d like to
convert it into a Ruby regexp.

How can I do that?

unknown · March 23, 2006, 3:00pm

On Mar 23, 2006, at 6:43 AM, Une bévue wrote:

| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*$/x;

able to detect if $field is of UTF-8 chars or not and i’d like to
convert it into a ruby regexp.

How to do that ?

The expression looks fine to me. Did you try using it?

James Edward G. II

unknown · March 23, 2006, 3:38pm

James Edward G. II [email protected] wrote:

The expression looks fine to me. Did you try using it?

yes, without the correct result, here is my code :

field='&é§è!çàîûtybvn¤'
utf8rgx=Regexp.new('m/^(
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*$/x')

the test :

flag=(field === utf8rgx)
p "flag = #{flag}"

the result being :
"flag = false"

i’m sure my encoding is utf-8…

maybe i’ve misunderstood “===” ?

because when trying :

truc = 'toto'
rgx=Regexp.new('^toto$')
flag=(truc === rgx)
p "flag = #{flag}"

i got :

=> "flag = false" ///seems NOT OK to me

flag=(truc =~ rgx)
p "flag = #{flag}"

=> "flag = 0" ///seems OK to me

unknown · March 23, 2006, 3:55pm

On Mar 23, 2006, at 8:38 AM, Une bévue wrote:

utf8rgx=Regexp.new('m/^(
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*$/x')

Try changing this to:

utf8rgx = / … /x

Hope that helps.

James Edward G. II
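
To illustrate the suggestion above: Regexp.new expects only the pattern source, so Perl-style delimiters and trailing flags inside the string are treated as literal text. The following is a minimal sketch with a shortened pattern; the variable names are made up for the example and are not from the thread.

```ruby
# Regexp.new takes the pattern source only; "m/", "/" and a trailing "x"
# inside the string are ordinary characters to match, not syntax.
from_perl_syntax = Regexp.new('m/^toto$/x')

# Options go in the second argument instead.
from_source = Regexp.new('^ toto $', Regexp::EXTENDED)

# The equivalent regexp literal, with the x flag after the closing slash.
literal = /^ toto $/x

p from_perl_syntax =~ 'toto'   # => nil
p from_source =~ 'toto'        # => 0
p literal =~ 'toto'            # => 0
```

With a multi-line pattern full of spaces and # comments, as in the original post, forgetting the extended option also means that all the whitespace and comments become part of the pattern.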

unknown · March 23, 2006, 3:50pm

On Thu, 2006-03-23 at 23:38 +0900, Une bévue wrote:

| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
flag=(field === utf8rgx)
p "flag = #{flag}"

You’ll need to switch those around, as I showed in my response to your
other thread. flag will then be true, but unfortunately I think too
often:

utf8rgx === "onlyascii"

=> true

I think to do that kind of test you’d have to remove the first line
(matching ASCII chars) and not anchor the regexp with ^ and $.
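
On the point about switching the operands: === is an ordinary method called on its left-hand side, so the order matters. A small sketch (the variable names are only for illustration):

```ruby
pattern = /\Atoto\z/
word    = "toto"

# String#=== compares the string with its argument for equality;
# a Regexp is not a String, so this is false.
p word === pattern     # => false

# Regexp#=== tests whether the argument matches the pattern.
p pattern === word     # => true

# =~ returns the index of the match start (or nil).
p pattern =~ word      # => 0
```

This is also why a regexp works on the left of a case/when clause: when calls pattern === value.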

Incidentally, I believe that the regexp above is best translated to Ruby
like this:

utf8rgx = /^(.)*$/u

You should also look into $KCODE (specifically $KCODE = 'u').

(Caveat to the above: I’m not much of an encoding expert at all).

unknown · March 23, 2006, 4:14pm

Ross B. [email protected] wrote:

Incidentally, I believe that the regexp above is best translated to Ruby
like this:
  utf8rgx = /^(.)*$/u
You should also look into $KCODE (specifically $KCODE = 'u').

(Caveat to the above: I’m not much of an encoding expert at all).

ok thanks for all, may be it could be better streaming out all of the
html tags and bringing only part of what’s in the …

unknown · March 23, 2006, 4:13pm

James Edward G. II [email protected] wrote:

Try changing this to:

utf8rgx = / … /x

Hope that helps.

ok, thanks, i see what u mean !

unknown · March 23, 2006, 5:47pm

“U” == Une bévue [email protected] writes:

U> the above regexp doesn’t work as expected with ruby, i’ve compared the
U> output for the same files with perl and ruby, ruby says always “yes it
U> is UTF-8”, where perl says NO over an ISO-8859-1 encoded file… (even
U> after wipping out the first line the first ^and the last $)

moulon% cat b.rb
field='&é§è!çàîûtybvn¤'
utf8rgx=Regexp.new('^(
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*$', Regexp::EXTENDED)

p utf8rgx =~ field
moulon%

moulon% file b.rb
b.rb: ISO-8859 text
moulon%

moulon% ruby b.rb
nil
moulon%

Guy Decoux

unknown · March 23, 2006, 5:38pm

James Edward G. II [email protected] wrote:

Try changing this to:

utf8rgx = / … /x

the above regexp doesn’t work as expected with ruby. i’ve compared the
output for the same files with perl and ruby: ruby always says “yes it
is UTF-8”, where perl says NO over an ISO-8859-1 encoded file… (even
after wiping out the first line, the first ^ and the last $)

then, for the time being, i’ll use the perl script from ruby in a command
line fashion…

unknown · March 23, 2006, 6:14pm

ts [email protected] wrote:

p utf8rgx =~ field
moulon%

moulon% file b.rb
b.rb: ISO-8859 text
moulon%

moulon% ruby b.rb
nil
moulon%

i don’t understand your post )))

my rb file is UTF-8 encoded, at best i can have an answer, from this
script, being the reverse of what is wanted )))

otherwise i get always true…

unknown · March 23, 2006, 6:20pm

“U” == Une bévue [email protected] writes:

U> i don’t understand your post )))

U> ts [email protected] wrote:

moulon% file b.rb
b.rb: ISO-8859 text
moulon%

my file is ISO-8859 encoded

moulon% ruby b.rb
nil
moulon%

and ruby say NO

U> output for the same files with perl and ruby, ruby says always “yes it
^^^^^^^
U> is UTF-8”, where perl says NO over an ISO-8859-1 encoded file… (even
^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^

Guy Decoux

unknown · March 23, 2006, 7:13pm

ts [email protected] wrote:

my file is ISO-8859 encoded

ok i’ve done one “biso.rb” ISO encoded and the result is ok :

ruby biso.rb
nil
"false"

with :
field='&éèàçôîûêâöïü'
utf8rgx=Regexp.new('^(
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*$', Regexp::EXTENDED)
p utf8rgx =~ field
p (utf8rgx === field).to_s

BUT, in “butf.rb” (an UTF-8 encoded file) i do :
field='&é§è!çàîûtybvn¤'
utf8rgx=Regexp.new('^(
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*$', Regexp::EXTENDED)

p utf8rgx =~ field
p (utf8rgx === field).to_s

str=""
File.open("tut_exceptions.html").each { |l| str << l}

p utf8rgx =~ str
p (utf8rgx === str).to_s

and get :

ruby butf.rb
0
"true"
0
"true"

this file comes from :
http://www.rubycentral.com/book/tut_exceptions.html

with the following meta tag :
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"
Note that Firefox agrees with “iso-8859-1”, and so does my text editor.

So it is seen as a UTF-8 file but isn’t. Maybe this is due to the html
tags, so i wiped them out, saving the file tut_exceptions.html as
tut_exceptions.txt without any tags, nor even one < or >, and retried on
that file :

ruby butf.rb
0
"true"
0
"true"

(i’ve only changed the :
File.open("tut_exceptions.html").each { |l| str << l}

to :
File.open("tut_exceptions.txt").each { |l| str << l}
--------------------------^^^

however :

file tut_exceptions.txt
tut_exceptions.txt: UTF-8 Unicode English text

Maybe this isn’t a good example, because most of the chars are US-ASCII
anyway, the file being written in English.

over :
http://www.linux-france.org/
saying it is a :

and Firefox also agrees with that, then with the regexp i get :

ruby butf.rb
0
"true"
0
"true"

…

unknown · March 23, 2006, 10:37pm

Hi,

On Thu, 23 Mar 2006 19:13:51 +0100, “Une bévue”
[email protected] wrote:

utf8rgx=Regexp.new('^(
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*$', Regexp::EXTENDED)

As I understand it utf8rgx matches any string that is utf8, which includes
pure ascii strings (see first line).
So it should match http://www.rubycentral.com/book/tut_exceptions.html.

First, here is a working version:

$ cat utf8tst.rb
utf8rgx = /\A(
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*\z/x

p utf8rgx === ARGF.read
$ curl -s http://www.linux-france.org/ | ruby utf8tst.rb
false
$ curl -s http://www.rubycentral.com/book/tut_exceptions.html | ruby utf8tst.rb
true

Your problem was that in Perl ^ and $ only match at the beginning and end of
the string, but in Ruby they also match at the beginning and end of each line.
So if a string contains, for example, a single empty line, it always matches:

irb(main):001:0> a = "xxx\n\nyyyy"
=> "xxx\n\nyyyy"
irb(main):002:0> a =~ /^(w)*$/
=> 4

So for beginning and end of string in ruby you need \A and \z:

irb(main):003:0> a =~ /\A(w)*\z/
=> nil

Hope that helps,
Dominik
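
The anchor difference described above can be checked with a small script. This is only an illustrative sketch with a simpler pattern (\d* instead of the UTF-8 alternation); the variable names are not from the thread.

```ruby
# A string that does not consist of digits, but contains an empty line.
text = "abc\n\ndef"

line_anchored   = /^\d*$/     # ^ and $ match at every line boundary
string_anchored = /\A\d*\z/   # \A and \z match only at the ends of the string

# The empty line at index 4 satisfies ^ and $ with zero digits in between.
p text =~ line_anchored     # => 4

# Nothing can satisfy \A...\z unless the whole string is digits.
p text =~ string_anchored   # => nil
```

So a validation pattern of the form ^(...)*$ succeeds on almost any multi-line input in Ruby, which matches the "always true" results reported for whole files.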

unknown · March 23, 2006, 10:53pm

Dominik B. [email protected] wrote:

Hope that helps,

fine, thanks a lot, it works. You explained very well why the ruby version
works on a string like string="&éçàôûîêäë" BUT NOT on files, because of
the \n… Here is a script that compares the perl output with the ruby one :
def isFileUtf8Encoded(fileName)
  utf8rgx = /\A(
  [\x09\x0A\x0D\x20-\x7E] # ASCII
  | [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
  | \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
  | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
  | \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
  | \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
  | [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
  | \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
  )*\z/x
  # Read the whole file into one string so that \A and \z span all of it.
  str = ""
  File.open(fileName).each { |l| str << l }
  return (utf8rgx === str)
end

p isFileUtf8Encoded("lutte-ouvriere.html") # => false
p isFileUtf8Encoded("l_harmatan.html") # => false
p isFileUtf8Encoded("tut_exceptions.html") # => false
p isFileUtf8Encoded("butf.rb") # => true
p isFileUtf8Encoded("biso.rb") # => false

p `perl IsUTF-8.pl "lutte-ouvriere.html"` # => "0"
p `perl IsUTF-8.pl "l_harmatan.html"` # => "0"
p `perl IsUTF-8.pl "tut_exceptions.html"` # => "0"
p `perl IsUTF-8.pl "butf.rb"` # => "1"
p `perl IsUTF-8.pl "biso.rb"` # => "0"

p $KCODE # => "UTF8"

the perl script being (called from the ruby one) :

#!/usr/bin/perl

sub isFileUtf8Encoded
{
    my ($fn) = @_;
    my $string = '';
    open (F, $fn) || die "Unable to open file $fn : $!";
    # Slurp the file line by line into a single string.
    while (my $line = <F>) {
        $string .= $line;
    }
    close F;
    my $flag = ($string =~
    m/^(
    [\x09\x0A\x0D\x20-\x7E] # ASCII
    | [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
    | \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
    | \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
    | \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
    | [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
    | \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
    )*$/x);
    # A failed match yields an empty string; normalise it to 0.
    if( $flag != 1 )
    {
        return 0;
    }
    return $flag;
}
print isFileUtf8Encoded($ARGV[0])
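
For readers on a newer Ruby: the thread predates Ruby 1.9, where strings carry an encoding and $KCODE no longer applies. The following is a rough sketch of how the same byte-level check might look there, assuming Ruby 2.4 or later (for Regexp#match? and String#b). The names UTF8_BYTES, utf8_bytes? and utf8_file? are hypothetical and not from the thread; the /n flag makes the regexp operate on raw bytes.

```ruby
# Same alternation as in the thread, compiled as a byte-oriented (/n) regexp.
UTF8_BYTES = /\A(?:
    [\x09\x0A\x0D\x20-\x7E]            # ASCII
  | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
  | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
  | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
  | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
  | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
  | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
  | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
)*\z/nx

# Hypothetical helper: true if the bytes of +data+ pass the check above.
def utf8_bytes?(data)
  UTF8_BYTES.match?(data.b)   # String#b gives a binary (ASCII-8BIT) copy
end

# Hypothetical helper: the same check on a file's raw bytes.
def utf8_file?(path)
  utf8_bytes?(File.binread(path))
end

p utf8_bytes?("h\u00E9llo")   # "é" encoded as UTF-8
p utf8_bytes?("h\xE9llo")     # "é" encoded as ISO-8859-1
```

Newer Rubies also provide String#valid_encoding?, which is simpler but slightly broader: it accepts control characters that the first alternative of this pattern rejects.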