Ctype functionality without a gem?

Dobai-Pataky_BSSSSl · October 28, 2010, 9:51am

Hello,

Is there a built-in method for identifying character types a la
ctype() in C? I would like to avoid requiring a gem dependency.

I’m considering the following approach, if nothing exists, but it
seems like overkill to me.

def ctype(c)
case c
when /[[:alnum:]]/; :alnum
when /[[:alpha:]]/; :alpha

etc…

end
end

def alnum?(c); ctype(c) == :alnum end
def alpha?(c); ctype(c) == :alpha end

etc…

Also, it requires some 1.8 vs 1.9 special cases for ‘ascii’ and
‘word’, if not more.
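
For reference, here is a minimal sketch of one way to get ctype-style predicates with no gem, assuming Ruby 1.9 or later. The module name CType, the list of classes, and the method names are hypothetical and not taken from any library. It defines one predicate per POSIX bracket class rather than a single case statement, because a character that is alpha is also alnum, so a single return value cannot report both.

```ruby
# Hypothetical helper module; the names here are illustrative only.
module CType
  # POSIX bracket classes to expose as predicates.
  CLASSES = %w[alnum alpha digit lower upper space punct xdigit].freeze

  CLASSES.each do |name|
    # Anchored with \A and \z so exactly one character must match.
    pattern = /\A[[:#{name}:]]\z/
    define_singleton_method("#{name}?") { |c| pattern === c }
  end
end
```

Usage would look like CType.alpha?("a") or CType.digit?("7"). Each predicate returns false for strings longer than one character.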

Regards,
Ammar

ammar · October 28, 2010, 10:49am

On Thu, Oct 28, 2010 at 9:50 AM, Ammar A. wrote:

Is there a built-in method for identifying character types a la
ctype() in C? I would like to avoid requiring a gem dependency.

None other than regexp, as far as I know.

def alnum?(c); ctype(c) == :alnum end
def alpha?(c); ctype(c) == :alpha end

etc…

What do you need that for? Why not directly use a regexp to match a
string? Often you can use capturing groups in a single regexp, e.g.

irb(main):009:0> %w{foo bar 123}.each do |s|
irb(main):010:1* if /(\d+)|(\w+)/ =~ s
irb(main):011:2> puts "number" if $1
irb(main):012:2> puts "chars" if $2
irb(main):013:2> end
irb(main):014:1> end
chars
chars
number
=> ["foo", "bar", "123"]

Kind regards

robert

ammar · October 28, 2010, 11:23am

On Thu, Oct 28, 2010 at 11:47 AM, Robert K. wrote:

On Thu, Oct 28, 2010 at 9:50 AM, Ammar A. wrote:

Is there a built-in method for identifying character types a la
ctype() in C? I would like to avoid requiring a gem dependency.

chars
number
=> ["foo", "bar", "123"]

That’s very cool. I’ll have to remember that one.

I do actually need to test if a character is of a certain type, all by
itself, as part of a parser I’m working on. I came up with this quick
solution for now, Character type identification module · GitHub

Thanks,
Ammar

ammar · October 28, 2010, 12:02pm

On 10/28/2010 11:16 AM, Ammar A. wrote:

irb(main):010:1* if /(\d+)|(\w+)/ =~ s
irb(main):011:2> puts "number" if $1
irb(main):012:2> puts "chars" if $2
irb(main):013:2> end
irb(main):014:1> end
chars
chars
number
=> ["foo", "bar", "123"]

That’s very cool. I’ll have to remember that one.

It is a bit brittle, though. Watch:

%w{foo bar 123 345bar}.each do |s|
if /(\d+)|(\w+)/ =~ s
puts "number" if $1
puts "chars" if $2
end
end
#=>
chars
chars
number
number

I do actually need to test if a character is of a certain type, all by
itself, as part of a parser I’m working on. I came up with this quick
solution for now, Character type identification module · GitHub

Thanks,
Ammar

This is similarly fragile. If you are only testing a single character,
it seems fine; if you are going to test whole strings, it will not be
enough.

Also, the whole solution is not necessarily fast - I do love regular
expressions, but I would keep looking for other methods to meet your
requirements. It really depends on what your goal is. If it is about
bit streams, you might want to look into bit-struct or similar
approaches that use Array#pack and String#unpack.
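
As a rough illustration of a regexp-free check, here is a sketch that classifies a single ASCII character by its code point. The method names are hypothetical, it only covers ASCII, and nothing here has been benchmarked, so treat any speed benefit as an assumption to measure.

```ruby
# Hypothetical ASCII-only predicates based on code points (no regexp).
def ascii_alpha?(c)
  b = c.ord                      # code point of the first character
  (65..90).cover?(b) || (97..122).cover?(b)   # A-Z or a-z
end

def ascii_digit?(c)
  (48..57).cover?(c.ord)         # 0-9
end

def ascii_alnum?(c)
  ascii_alpha?(c) || ascii_digit?(c)
end

# String#unpack yields the raw byte values of a whole string at once.
bytes = "a1".unpack("C*")        # [97, 49]
```

Note that String#ord raises on an empty string, so callers would need to guard against that.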

Regards,

t.

ammar · October 28, 2010, 12:37pm

On Thu, Oct 28, 2010 at 12:02 PM, Anton B. wrote:

irb(main):009:0> %w{foo bar 123}.each do |s|
That’s very cool. I’ll have to remember that one.
chars
chars
number
number

Yes, of course. The regexp was a quick hack only to demonstrate the
mechanism. You can easily fix that by properly anchoring the regexp.
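
A possible anchored version is sketched below. The method name classify is hypothetical, and the second group uses [[:alpha:]]+ instead of \w+ (a change from the original hack) because \w also matches digits, which would make "345bar" count as chars.

```ruby
# Hypothetical helper: classify a whole string, not just a prefix.
# \A and \z anchor the match to the start and end of the string.
def classify(s)
  if /\A(?:(\d+)|([[:alpha:]]+))\z/ =~ s
    $1 ? :number : :chars
  end
end

%w{foo bar 123 345bar}.each do |s|
  puts "#{s}: #{classify(s).inspect}"
end
```

With this, a mixed string such as "345bar" matches neither group and classify returns nil.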

requirements. It really depends on what your goal is. If it is about
bit streams, you might want to look into bit-struct or similar
approaches that use Array#pack and String#unpack.

If I have to implement a parser manually I would typically use regexp
for scanning. You can even make this fairly readable by using /x.

input.scan %r{
  # white space
  (\s+)
  # integer
  | ([-+]?\d+)
  # keyword: if
  | (if)
  # etc
}x do |m|
  …
end

Of course, if the number of tokens is large this will become very
awkward. Better use a proper parser generator then.
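
To make the idea concrete, here is a small runnable sketch of such a scanner. The constant TOKEN, the method tokenize, the token names, and the extra identifier group are all assumptions added for illustration. One group is set per match, and the position of the non-nil group tells you the token type.

```ruby
# Hypothetical token pattern; /x allows whitespace and comments inside.
TOKEN = %r{
    (\s+)          # white space
  | ([-+]?\d+)     # integer
  | (if)\b         # keyword: if
  | (\w+)          # identifier, added for this sketch
}x

# Returns an array of [type, text] pairs.
def tokenize(input)
  input.scan(TOKEN).map do |ws, int, kw, ident|
    if ws
      [:space, ws]
    elsif int
      [:integer, int]
    elsif kw
      [:keyword, kw]
    else
      [:ident, ident]
    end
  end
end

p tokenize("if x 42")
```

Be aware that String#scan silently skips characters that match none of the alternatives, so a real parser would need to detect and report those gaps.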

Cheers

robert

ammar · October 28, 2010, 12:30pm

On Thu, Oct 28, 2010 at 1:02 PM, Anton B. wrote:

I came up with this quick
solution for now, Character type identification module · GitHub

This is similarly fragile. If you are only testing a single character,
it seems fine; if you are going to test whole strings, it will not be
enough.

I do only need to test individual characters. Extending the code to
match entire strings should be easy:

/\A[[:alnum:]]+\z/

Also, the whole solution is not necessarily fast - I do love regular
expressions, but I would keep looking for other methods to meet your
requirements. It really depends on what your goal is. If it is about
bit streams, you might want to look into bit-struct or similar
approaches that use Array#pack and String#unpack.

I agree, and wish this was built-in. Interesting suggestions. Thanks.

Regards,
Ammar