Forum: Ruby perl regexp to ruby one conversion ?

unknown (Guest)
on 2006-03-23 14:43
(Received via mailing list)
I have a Perl regexp:

$field =~
  m/^(
     [\x09\x0A\x0D\x20-\x7E]            # ASCII
   | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
   |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
   | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
   |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
   |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
   | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
   |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
  )*$/x;

It detects whether $field contains only valid UTF-8 characters, and I'd
like to convert it into a Ruby regexp.

How can I do that?
James G. (Guest)
on 2006-03-23 16:00
(Received via mailing list)
On Mar 23, 2006, at 6:43 AM, Une bévue wrote:

>    | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
>    |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
>   )*$/x;
>
> able to detect if $field is of UTF-8 chars or not and i'd like to
> convert it into a ruby regexp.
>
> How to do that ?

The expression looks fine to me.  Did you try using it?

James Edward G. II
unknown (Guest)
on 2006-03-23 16:38
(Received via mailing list)
James Edward G. II <removed_email_address@domain.invalid> wrote:

>
> The expression looks fine to me.  Did you try using it?

Yes, but it doesn't give the correct result. Here is my code:

field='&é§è!çàîûtybvn¤'
utf8rgx=Regexp.new('m/^(
   [\x09\x0A\x0D\x20-\x7E]            # ASCII
 | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
 |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
 | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
 |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
 |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
 | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
 |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
)*$/x')

The test:

flag=(field === utf8rgx)
p "flag = #{flag}"

The result is:
"flag = false"

I'm sure my encoding is UTF-8...

Maybe I've misunderstood "==="?

Because when I try:

truc = 'toto'
rgx=Regexp.new('^toto$')
flag=(truc === rgx)
p "flag = #{flag}"

I get:
# => "flag = false"      ///seems NOT OK to me

flag=(truc =~ rgx)
p "flag = #{flag}"
# => "flag = 0"          ///seems OK to me
Ross B. (Guest)
on 2006-03-23 16:50
(Received via mailing list)
On Thu, 2006-03-23 at 23:38 +0900, Une bévue wrote:
>  | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
> flag=(field === utf8rgx)
> p "flag = #{flag}"
>

You'll need to switch those around, as I showed in my response to your
other thread. flag will then be true, but unfortunately I think too
often:

utf8rgx === "onlyascii"
# => true

I think to do that kind of test you'd have to remove the first line
(matching ASCII chars) and not anchor the regexp with ^ and $.

Incidentally, I believe that the regexp above is best translated to Ruby
like this:

	utf8rgx = /^(.)*$/u

You should also look into $KCODE (specifically $KCODE = 'u').

(Caveat to the above: I'm not much of an encoding expert at all).
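
On the operand order: String#=== is plain equality, so a String compared against a Regexp is simply false, while Regexp#=== performs a match. A minimal sketch of the difference, reusing the 'toto' example from earlier in the thread (the helper name compare_forms is made up for illustration):

```ruby
# Returns the result of each comparison form for one string/pattern pair.
def compare_forms(text, pattern)
  {
    string_on_left: (text === pattern),  # String#===: equality, false for a Regexp
    regexp_on_left: (pattern === text),  # Regexp#===: true when the pattern matches
    match_operator: (text =~ pattern)    # index where the match starts, or nil
  }
end

result = compare_forms('toto', Regexp.new('^toto$'))
p result[:string_on_left]  # false
p result[:regexp_on_left]  # true
p result[:match_operator]  # 0
```

So in a test like flag = (field === utf8rgx), the regexp is never actually applied; it has to be the receiver.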
James G. (Guest)
on 2006-03-23 16:55
(Received via mailing list)
On Mar 23, 2006, at 8:38 AM, Une bévue wrote:

> utf8rgx=Regexp.new('m/^(
>    [\x09\x0A\x0D\x20-\x7E]            # ASCII
>  | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
>  |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
>  | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
>  |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
>  |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
>  | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
>  |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
> )*$/x')

Try changing this to:

utf8rgx = / ... /x

Hope that helps.

James Edward G. II
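
To spell out why the Regexp.new('m/.../x') form fails: Regexp.new takes the bare pattern, so Perl-style delimiters and trailing flags inside the string become literal pattern text, and the /x behaviour has to be requested through the options argument (or with a /.../x literal). A small sketch, with constant names chosen only for illustration:

```ruby
# Perl-style delimiters and flags inside the string are taken as literal pattern text.
WITH_DELIMITERS = Regexp.new('m/^toto$/x')

# The pattern alone goes in the string; options go in the second argument.
WITH_OPTION = Regexp.new('^ toto $   # whitespace and comments are ignored',
                         Regexp::EXTENDED)

p WITH_DELIMITERS =~ 'toto'  # nil
p WITH_OPTION =~ 'toto'      # 0
```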
unknown (Guest)
on 2006-03-23 17:13
(Received via mailing list)
James Edward G. II <removed_email_address@domain.invalid> wrote:

> Try changing this to:
>
> utf8rgx = / ... /x
>
> Hope that helps.

OK, thanks, I see what you mean!
unknown (Guest)
on 2006-03-23 17:14
(Received via mailing list)
Ross B. <removed_email_address@domain.invalid> wrote:

> Incidentally, I believe that the regexp above is best translated to Ruby
> like this:
>
>       utf8rgx = /^(.)*$/u
>
> You should also look into $KCODE (specifically $KCODE = 'u').
>
> (Caveat to the above: I'm not much of an encoding expert at all).

OK, thanks for everything. Maybe it would be better to strip out all of
the HTML tags and keep only part of what's in the <body/>...
unknown (Guest)
on 2006-03-23 18:38
(Received via mailing list)
James Edward G. II <removed_email_address@domain.invalid> wrote:

>
> Try changing this to:
>
> utf8rgx = / ... /x

The above regexp doesn't work as expected in Ruby. I've compared the
output for the same files with Perl and Ruby: Ruby always says "yes, it
is UTF-8", whereas Perl says NO for an ISO-8859-1 encoded file (even
after removing the first line, the leading ^ and the trailing $).

So, for the time being, I'll call the Perl script from Ruby on the
command line...
ts (Guest)
on 2006-03-23 18:47
(Received via mailing list)
>>>>> "U" == Une bévue <removed_email_address@domain.invalid> writes:

U> the above regexp doesn't work as expected with ruby, i've compared the
U> output for the same files with perl and ruby, ruby says always "yes it
U> is UTF-8", where perl says NO over an ISO-8859-1 encoded file... (even
U> after wipping out the first line the first ^and the last $)

moulon% cat b.rb
field='&é§è!çàîûtybvn¤'
utf8rgx=Regexp.new('^(
   [\x09\x0A\x0D\x20-\x7E]            # ASCII
 | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
 |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
 | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
 |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
 |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
 | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
 |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
)*$', Regexp::EXTENDED)

p utf8rgx =~ field
moulon%

moulon% file b.rb
b.rb: ISO-8859 text
moulon%

moulon% ruby b.rb
nil
moulon%


Guy Decoux
unknown (Guest)
on 2006-03-23 19:14
(Received via mailing list)
ts <removed_email_address@domain.invalid> wrote:

> p utf8rgx =~ field
> moulon%
>
> moulon% file b.rb
> b.rb: ISO-8859 text
> moulon%
>
> moulon% ruby b.rb
> nil
> moulon%

I don't understand your post.

My .rb file is UTF-8 encoded. At best, this script gives me the
reverse of the answer I want.

Otherwise I always get true...
ts (Guest)
on 2006-03-23 19:20
(Received via mailing list)
>>>>> "U" == Une bévue <removed_email_address@domain.invalid> writes:

U> i don't understand your post )))

U> ts <removed_email_address@domain.invalid> wrote:

>> moulon% file b.rb
>> b.rb: ISO-8859 text
>> moulon%

 my file is ISO-8859 encoded

>> moulon% ruby b.rb
>> nil
>> moulon%

 and Ruby says NO

U> output for the same files with perl and ruby, ruby says always "yes it
                                                                  ^^^^^^^
U> is UTF-8", where perl says NO over an ISO-8859-1 encoded file... (even
   ^^^^^^^^                              ^^^^^^^^^^^^^^^^^^^^^^^

Guy Decoux
unknown (Guest)
on 2006-03-23 20:13
(Received via mailing list)
ts <removed_email_address@domain.invalid> wrote:

>
>  my file is ISO-8859 encoded

OK, I've made an ISO-8859-encoded "biso.rb" and the result is correct:

> ruby biso.rb
nil
"false"

with:
field='&éèàçôîûêâöïü'
utf8rgx=Regexp.new('^(
   [\x09\x0A\x0D\x20-\x7E]            # ASCII
 | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
 |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
 | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
 |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
 |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
 | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
 |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
)*$', Regexp::EXTENDED)
p utf8rgx =~ field
p (utf8rgx === field).to_s

But in "butf.rb" (a UTF-8 encoded file) I have:
field='&é§è!çàîûtybvn¤'
utf8rgx=Regexp.new('^(
   [\x09\x0A\x0D\x20-\x7E]            # ASCII
 | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
 |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
 | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
 |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
 |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
 | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
 |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
)*$', Regexp::EXTENDED)

p utf8rgx =~ field
p (utf8rgx === field).to_s

str=""
File.open("tut_exceptions.html").each { |l| str << l}

p utf8rgx =~ str
p (utf8rgx === str).to_s


and I get:
> ruby butf.rb
0
"true"
0
"true"


This file comes from:
<http://www.rubycentral.com/book/tut_exceptions.html>

which has the following meta tag:
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"
Note that Firefox agrees with "iso-8859-1", and so does my text editor.

So the file is reported as UTF-8 although it isn't. Thinking this might be
due to the HTML tags, I stripped them out, saving tut_exceptions.html as
tut_exceptions.txt with no tags left (not even a single < or >), and
retried on that file:
 ruby butf.rb
0
"true"
0
"true"


I only changed:
File.open("tut_exceptions.html").each { |l| str << l}

to:
File.open("tut_exceptions.txt").each { |l| str << l}
--------------------------^^^

However:
> file tut_exceptions.txt
tut_exceptions.txt: UTF-8 Unicode English text

Maybe this isn't a good example, because the file is written in English
and most of its characters are plain US-ASCII.

For:
<http://www.linux-france.org/>
which declares:
<meta http-equiv="Content-type" content="text/html; charset=iso-8859-15"/>

and with which Firefox also agrees, the regexp gives me:

> ruby butf.rb
0
"true"
0
"true"

....
Dominik B. (Guest)
on 2006-03-23 23:37
(Received via mailing list)
Hi,

On Thu, 23 Mar 2006 19:13:51 +0100, "Une bévue"
<removed_email_address@domain.invalid> wrote:

> utf8rgx=Regexp.new('^(
>    [\x09\x0A\x0D\x20-\x7E]            # ASCII
>  | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
>  |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
>  | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
>  |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
>  |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
>  | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
>  |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
> )*$', Regexp::EXTENDED)

As I understand it, utf8rgx matches any string that is valid UTF-8, which
includes pure ASCII strings (see the first line of the pattern).
So it should match http://www.rubycentral.com/book/tut_exceptions.html.

First, here is a working version:

$ cat utf8tst.rb
utf8rgx = /\A(
    [\x09\x0A\x0D\x20-\x7E]            # ASCII
  | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
  |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
  | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
  |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
  |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
  | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
  |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
)*\z/x

p utf8rgx === ARGF.read
$ curl -s http://www.linux-france.org/ | ruby utf8tst.rb
false
$ curl -s http://www.rubycentral.com/book/tut_exceptions.html | ruby utf8tst.rb
true


Your problem was that in Perl (without the /m modifier) ^ and $ match only
at the beginning and end of the string, but in Ruby they also match at the
beginning and end of every line. So if a string contains, for example, a
single empty line, the pattern always matches:

irb(main):001:0> a = "xxx\n\nyyyy"
=> "xxx\n\nyyyy"
irb(main):002:0> a =~ /^(w)*$/
=> 4

So for beginning and end of string in ruby you need \A and \z:

irb(main):003:0> a =~ /\A(w)*\z/
=> nil

Hope that helps,
Dominik
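
The same anchor difference on a simpler check, as a rough sketch (the constant names are illustrative, not from the thread). A line-anchored pattern accepts a string as soon as one of its lines passes, which is why a file containing a single well-formed or empty line satisfied the ^...$ version:

```ruby
LINE_ANCHORED   = /^[a-z]*$/    # ^ and $ also match at line boundaries inside the string
STRING_ANCHORED = /\A[a-z]*\z/  # \A and \z match only at the very start and end

mixed = "abc\n123"              # first line is lowercase letters, second line is not

p LINE_ANCHORED =~ mixed        # 0   (the first line alone satisfies the pattern)
p STRING_ANCHORED =~ mixed      # nil (the whole string must consist of a-z)
p STRING_ANCHORED =~ "abc"      # 0
```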
unknown (Guest)
on 2006-03-23 23:53
(Received via mailing list)
Dominik B. <removed_email_address@domain.invalid> wrote:

> Hope that helps,

Fine, thanks a lot, it works. You explained very well why the Ruby version
worked on strings like string="&éçàôûîêäë" but not on files, because of
the \n... Here is a script that compares the Perl output with the Ruby one:
def isFileUtf8Encoded(fileName)
  utf8rgx = /\A(
      [\x09\x0A\x0D\x20-\x7E]            # ASCII
    | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
    |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
    |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
    |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
    | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
    |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
  )*\z/x
  str=""
  File.open("#{fileName}").each { |l| str << l}
  return (utf8rgx === str)
end

p isFileUtf8Encoded("lutte-ouvriere.html") # => false
p isFileUtf8Encoded("l_harmatan.html")     # => false
p isFileUtf8Encoded("tut_exceptions.html") # => false
p isFileUtf8Encoded("butf.rb")             # => true
p isFileUtf8Encoded("biso.rb")             # => false

p `perl IsUTF-8.pl "lutte-ouvriere.html"`  # => "0"
p `perl IsUTF-8.pl "l_harmatan.html"`      # => "0"
p `perl IsUTF-8.pl "tut_exceptions.html"`  # => "0"
p `perl IsUTF-8.pl "butf.rb"`              # => "1"
p `perl IsUTF-8.pl "biso.rb"`              # => "0"

p $KCODE                                   # => "UTF8"
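
For readers on Ruby 1.9 or later, where strings carry their own encoding and $KCODE no longer has any effect, the check could be sketched as below. Assumptions: the /n flag and the (?: ) non-capturing group are additions to the thread's pattern so that the byte escapes compile and are matched against raw bytes, and the helper names utf8_by_regexp? and utf8_by_builtin? are hypothetical. The second helper swaps the regexp for Ruby's built-in String#valid_encoding?, which is a different technique from the one discussed in this thread.

```ruby
# Byte-level UTF-8 pattern from the thread, compiled as a binary regexp (/n).
UTF8_BYTES = /\A(?:
    [\x09\x0A\x0D\x20-\x7E]            # ASCII
  | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
  |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
  | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
  |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
  |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
  | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
  |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
)*\z/nx

# Hypothetical helper: applies the pattern to a binary copy of the data.
def utf8_by_regexp?(data)
  UTF8_BYTES.match?(data.b)
end

# Hypothetical helper: asks Ruby's own encoding validator instead of the regexp.
def utf8_by_builtin?(data)
  data.dup.force_encoding(Encoding::UTF_8).valid_encoding?
end

utf8_sample   = "caf\u00E9"  # "é" stored as the two bytes C3 A9
latin1_sample = "caf\xE9".b  # "é" stored as the single ISO-8859-1 byte E9

p utf8_by_regexp?(utf8_sample)     # true
p utf8_by_regexp?(latin1_sample)   # false
p utf8_by_builtin?(utf8_sample)    # true
p utf8_by_builtin?(latin1_sample)  # false
```

For a file, File.binread(path) returns the raw bytes to pass to either helper. The two checks are not identical: the regexp also rejects control characters such as NUL, which valid_encoding? accepts as valid UTF-8.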

The Perl script (called from the Ruby one) is:

#!/usr/bin/perl

sub isFileUtf8Encoded
{
        my ($fn) = @_;
        $string='';
        open (F, $fn) || die "Unable to open file $fn : $!";
        while ($line = <F>) {
                $string.=$line;
        }
        close F;
        $flag = ($string =~
          m/^(
             [\x09\x0A\x0D\x20-\x7E]            # ASCII
           | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
           |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
           | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
           |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
           |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
           | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
           |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
          )*$/x);
                if( $flag != 1 )
                {
                   return 0;
                }
        return $flag;
}
print isFileUtf8Encoded($ARGV[0])