Remove all illegal chars from string

Hi,

Is there a simple way to remove all but the legal chars from a string.
where the legal chars are for example: a-z A-Z 0-9
So everything should be removed from the string but these characters.
“exam@p Le3|§” --> “exampLe”

I don’t know very much about regular expressions, so I don’t know if
it’s possible with sub or gsub. My first Idea was to loop over the
string and check every character but I wondered if there is something
more simple or better.

Thanks

thomas coopman wrote:

Hi,

Is there a simple way to remove all but the legal chars from a string.
where the legal chars are for example: a-z A-Z 0-9
So everything should be removed from the string but these characters.
“exam@p Le3|§” --> “exampLe”

I don’t know very much about regular expressions, so I don’t know if
it’s possible with sub or gsub. My first Idea was to loop over the
string and check every character but I wondered if there is something
more simple or better.

Thanks

irb(main):006:0> s = "abcd??ABCD!!0123"
=> "abcd??ABCD!!0123"
irb(main):007:0> s
=> "abcd??ABCD!!0123"
irb(main):008:0> s.tr('^a-zA-Z0-9','')
=> "abcdABCD0123"

thomas coopman wrote:

Hi,

Is there a simple way to remove all but the legal chars from a string. where the legal chars are for example: a-z A-Z 0-9
So everything should be removed from the string but these characters.
“exam@p Le3|§” --> “exampLe”

I don’t know very much about regular expressions, so I don’t know if it’s possible with sub or gsub. My first Idea was to loop over the string and check every character but I wondered if there is something more simple or better.

Thanks

str.gsub(/[^a-zA-Z]/, '') should do it.

On Jun 28, 2006, at 8:09 AM, thomas coopman wrote:

Hi,

Is there a simple way to remove all but the legal chars from a
string. where the legal chars are for example: a-z A-Z 0-9
So everything should be removed from the string but these characters.
“exam@p Le3|§” --> “exampLe”

string.delete("^a-zA-Z0-9")

Hope that helps.

James Edward G. II

“t” == thomas coopman writes:

t> Is there a simple way to remove all but the legal chars from a
string.
t> where the legal chars are for example: a-z A-Z 0-9
t> So everything should be removed from the string but these characters.
t> “exam@p Le3|§” → “exampLe”

Well you can try with String#tr

moulon% ruby -e 'p "exam@p Le3|§".tr("^a-zA-Z0-9", "")'
"exampLe3"
moulon%

which means replace all characters, except a-z A-Z 0-9, with ""
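
For comparison, here is a small sketch showing that String#tr with an empty replacement and String#delete with the same negated set give the same result. The helper name keep_alnum is made up for illustration and is not from the thread.

```ruby
# Hypothetical helper (the name is an assumption, not from the thread):
# keep only ASCII letters and digits, dropping everything else.
def keep_alnum(str)
  # A leading ^ negates the set, so this deletes every character
  # that is NOT in a-z, A-Z or 0-9.
  str.delete('^a-zA-Z0-9')
end

input = 'abcd??ABCD!!0123'
puts keep_alnum(input)             # abcdABCD0123
puts input.tr('^a-zA-Z0-9', '')    # abcdABCD0123 (empty replacement deletes)
```

Both calls return a new string and leave the original untouched.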

Guy Decoux

sender: “thomas coopman” date: “Wed, Jun 28, 2006 at 10:09:35PM +0900” <<<EOQ
Hi,
Hi,

Is there a simple way to remove all but the legal chars from a string.
where the legal chars are for example: a-z A-Z 0-9
So everything should be removed from the string but these characters.
“exam@p Le3|§” --> “exampLe”
Taking your example, if legal chars are a-z A-Z 0-9 then
the output of:
“exam@p Le3|§”
should be:
“exampLe3” and not “exampLe”…

I don’t know very much about regular expressions, so I don’t know if
it’s possible with sub or gsub. My first Idea was to loop over the
string and check every character but I wondered if there is something
more simple or better.
Yes, this is why regexps were invented :)

irb

irb(main):001:0> "exam@p Le3|§".gsub(/[^a-zA-Z0-9]/,'')
=> "exampLe3"

Thanks
You’re welcome,
Alex

On 6/28/06, thomas coopman wrote:

Hi,

Is there a simple way to remove all but the legal chars from a string. where the legal chars are for example: a-z A-Z 0-9
So everything should be removed from the string but these characters.
“exam@p Le3|§” → “exampLe”

I don’t know very much about regular expressions, so I don’t know if it’s possible with sub or gsub. My first Idea was to loop over the string and check every character but I wondered if there is something more simple or better.

It’s easy with gsub and a regex:

stringToCheck.gsub!(/[^a-zA-Z0-9]/, "")

The ‘^’ inverts the list of characters.
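
One thing to watch with the bang version: gsub! changes the string in place and returns nil when nothing was replaced, while gsub always returns a new string. A minimal sketch, with strip_illegal as a made-up helper name:

```ruby
# Hypothetical helper for illustration: returns a cleaned copy and
# leaves the caller's string untouched.
def strip_illegal(str)
  str.gsub(/[^a-zA-Z0-9]/, '')
end

dirty = 'exam@p Le3|'.dup   # dup so the string is not frozen
clean = 'exampLe3'.dup

p strip_illegal(dirty)               # "exampLe3" (dirty is unchanged)
p dirty.gsub!(/[^a-zA-Z0-9]/, '')    # "exampLe3" (dirty is now modified)
p clean.gsub!(/[^a-zA-Z0-9]/, '')    # nil, because nothing was replaced
```

So don't chain method calls onto the result of gsub! unless you know a replacement happened.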

Les

On Jun 28, 2006, at 9:09 AM, thomas coopman wrote:

something more simple or better.

Thanks

The regexp for such a thing would be:

"exam@p Le3|§".gsub(/[^a-zA-Z0-9]/, "")
=> "exampLe3" (you listed 3 as a legal character in the above
email. /[^a-zA-Z]/ would remove numbers as well)

Probably about time to learn some regular expressions. Have a look
at Regular Expression Tutorial - Learn How to Use Regular Expressions
You’ll really start to like them once you learn even just the basic
matching ideas.
-Mat

Also, if you’re willing to accept underscores in the accepted character
list, you could just use the \W character class, which is equal to
[^A-Za-z0-9_].
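
To make the underscore difference concrete, here is a quick sketch comparing the two patterns. The method names are invented for the example.

```ruby
# Hypothetical helpers (names are assumptions for illustration).

# \W matches anything that is not a letter, digit or underscore,
# so underscores are kept.
def strip_non_word(str)
  str.gsub(/\W/, '')
end

# The explicit negated class also removes underscores.
def strip_non_alnum(str)
  str.gsub(/[^A-Za-z0-9]/, '')
end

sample = 'snake_case name!'
puts strip_non_word(sample)    # snake_casename
puts strip_non_alnum(sample)   # snakecasename
```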

This is definitely regex territory. And gsub() is the thing:

ex = "exam@p Le3|§"
puts ex.gsub( /[^A-Za-z0-9]/, '' )

exampLe3

Or

puts "exam@p Le3|§".gsub( /[^A-Za-z0-9]/, '' )

exampLe3

I assume you wanted the 3 in there, since you asked for numbers in your
range of characters. When doing something like this, in my opinion, it’s
best to not try to roll your own.

On 28/06/06, thomas coopman wrote:

Is there a simple way to remove all but the legal chars from a string. where the legal chars are for example: a-z A-Z 0-9
So everything should be removed from the string but these characters.
“exam@p Le3|§” → “exampLe”

Yes, there is:

str = "exam@p Le3|§"
str.gsub(/[^a-z0-9]/i, '') # => "exampLe3"

Paul.

On Jun 28, 2006, at 14:09, thomas coopman wrote:

something more simple or better.

Thanks

It’s dead easy:

test = "exam@p Le3|§"
test.gsub(/[^A-Za-z0-9]/, '')
=> "exampLe3"

Quick explanation:

[] defines a group you want to treat as one character. If you only
wanted vowels, ferinstance, you’d use [aeiou].

^ as the first character in a group means the opposite of that
group. Everything that isn’t a vowel would be like this: [^aeiou]

A-Z is a range of characters, and you can do smaller ranges like c-q
or whatever; the ordering used to determine the range is the
character encoding. This means A-z is a valid range too, but it is not
the same as A-Za-z: in ASCII it also takes in the six characters
between Z and a ([ \ ] ^ _ `), so spell out both ranges. Note that a-Z
is invalid because a > Z in ASCII.

There are also shortcuts for some classes of characters: \d is
equivalent to [0-9], and \w is close to [A-Za-z0-9] but also includes
the underscore character ‘_’.

So the regular expression says ‘match any single character that is
not in the ranges A-Z, a-z, or 0-9’. #gsub takes everything matched
by the regular expression, and replaces it with nothing.
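
A small sketch of the pieces above (sets, negated sets, and ranges), with one caveat on ranges: in ASCII, A-z spans more than just letters. The constant names are made up for the example.

```ruby
# Constant names are assumptions for illustration only.
VOWELS      = /[aeiou]/         # any single vowel
NON_VOWELS  = /[^aeiou]/        # any single character that is not a vowel
LOOSE_RANGE = /[^A-z0-9]/       # A-z also covers [ \ ] ^ _ ` in ASCII
STRICT      = /[^A-Za-z0-9]/    # only letters and digits are kept

puts 'regexp'.gsub(VOWELS, '')       # rgxp
puts 'regexp'.gsub(NON_VOWELS, '')   # ee

text = 'a[b]_c^d 1'
puts text.gsub(LOOSE_RANGE, '')      # a[b]_c^d1 (only the space is removed)
puts text.gsub(STRICT, '')           # abcd1
```

Spelling out A-Za-z costs three characters and avoids letting the punctuation between Z and a slip through.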

matthew smillie.

Is there a simple way to remove all but the legal chars from
a string. where the legal chars are for example: a-z A-Z 0-9
So everything should be removed from the string but these
characters. “exam@p Le3|§” → “exampLe”

p "exam@p Le3|§".gsub(/[^a-zA-Z0-9]/, "")

gegroet,
Erik V. - http://www.erikveen.dds.nl/

Is there a simple way to remove all but the legal chars from a string. where the legal chars are for example: a-z A-Z 0-9
So everything should be removed from the string but these characters.
“exam@p Le3|§” --> “exampLe”

I don’t know very much about regular expressions, so I don’t know if it’s possible with sub or gsub. My first Idea was to loop over the string and check every character but I wondered if there is something more simple or better.

str="exam@p  L_e3|§"
puts str.gsub(/[^a-zA-Z0-9]/, "")
# yields:
#	exampLe3

a bit shorter would be

puts str.gsub(/\W/, "")
# but "word"-characters (\w) and "non-word"-characters (\W) also
# contain the '_', so this would yield:
#	exampL_e3

Benedikt

ALLIANCE, n. In international politics, the union of two thieves who
have their hands so deeply inserted in each other’s pockets that
they cannot separately plunder a third.
(Ambrose Bierce, The Devil’s Dictionary)

On Wed, 2006-06-28 at 22:09 +0900, thomas coopman wrote:

Is there a simple way to remove all but the legal chars from a string.
where the legal chars are for example: a-z A-Z 0-9
So everything should be removed from the string but these characters.
“exam@p Le3|§” --> “exampLe”

I don’t know very much about regular expressions, so I don’t know if
it’s possible with sub or gsub. My first Idea was to loop over the
string and check every character but I wondered if there is something
more simple or better.

An excerpt from my upcoming book, Ruby Phrasebook:

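"""
# chomp drops the trailing newline, which would otherwise count as a bad character
new_password = gets.chomp
# the parentheses matter: without them Ruby evaluates '...' != 0 first
if new_password.count('^A-Za-z._') != 0 then
  puts "Bad Password"
else
  # do something
end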

This works by using a special syntax that’s shared by .count, .tr,
.delete, and .squeeze. A parameter beginning with ^ negates the list; the
list consists of any valid characters in the active character set and
may contain ranges formed with -. If more than one parameter list is
given to these functions, the lists of characters are intersected using
set logic; that is, only characters in both lists are used for
filtering.

You might also want to simply replace all “evil” characters with _ (such
as perhaps from a CGI form post):

evil_input = '`cat /etc/passwd`'

evil_input.tr('./`', '_')

#=> "_cat _etc_passwd_"
"""

In your specific question, you will want to use .delete:

'exam@p Le3|§'.delete '^A-Za-z'
#=> "exampLe"
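
The set syntax shared by count, delete and squeeze can be tried out with a short sketch like the one below. The helper name bad_chars is made up here and is not taken from the book.

```ruby
# Hypothetical helper (name is an assumption): counts the characters
# that fall outside the allowed set A-Z, a-z, '.' and '_'.
def bad_chars(str)
  str.count('^A-Za-z._')
end

puts bad_chars('good.name')   # 0
puts bad_chars('bad name!')   # 2 (the space and the '!')

# Two sets are intersected: lowercase letters that are also non-vowels,
# so only lowercase consonants are deleted.
puts 'hello world'.delete('a-z', '^aeiou')   # eo o

# squeeze collapses runs of characters from the set; the spaces are kept.
puts 'aaa  bbb'.squeeze('a-z')               # a  b
```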

On Wednesday 28 June 2006 14:09, thomas coopman wrote:

Is there a simple way to remove all but the legal chars from a string.
where the legal chars are for example: a-z A-Z 0-9 So everything should be
removed from the string but these characters. “exam@p Le3|§” → “exampLe”

"exam@p Le3|§".gsub(/\W/, '')

will return

"exampLe3"

I strongly suggest you learn about regular expressions. You can start
with the Wikipedia article on regular expressions, which has many links
to tutorials. You can find info on Ruby regular expressions at
http://www.rubycentral.com/book/tut_stdtypes.html, though it might be
best to get yourself a Ruby book.

Anselm

On 6/28/06, thomas coopman wrote:

You’ve hit the nail on the head. Use gsub on the string.

"exam@p Le3|§".gsub(/\W/, '') # => "exampLe3"

The \W in the regular expression matches every character that is not a
valid word character: i.e. [^a-zA-Z0-9_]

Blessings,
TwP

Thomas,

s = "exam@p Le3|§"
s.gsub(/[^a-zA-Z0-9]/, '') # => "exampLe3"

Thanks,

David

So everything should be removed from the string but these characters.
“exam@p Le3|§” --> “exampLe”

I don’t know very much about regular expressions, so I don’t
know if it’s possible with sub or gsub.

Character sets with ranges: [a-z]
Negated sets: [^a-z]

"exam@p Le3|§".gsub(/[^a-zA-Z0-9]/,'') # => "exampLe3"

ben

thomas coopman wrote:

Hi,

Is there a simple way to remove all but the legal chars from a string. where the legal chars are for example: a-z A-Z 0-9
So everything should be removed from the string but these characters.
“exam@p Le3|§” → “exampLe”

The easiest way to find stuff is to search comp.lang.ruby through
Google groups:

http://groups.google.com/group/comp.lang.ruby/browse_frm/thread/c9b63420fe8f66a9?q=remove+non-ASCII&

ruby-talk-google was set up in late April and doesn’t have much
searchable history