Gsub and backslashes

Dobai-Pataky_BSSSSl · November 20, 2010, 11:16pm

Consider the string
\1\2\3
that is
"\\1\\2\\3"

I feel really stupid … but this simple substitution pattern does not
do what I expect.

"\\1\\2\\3".gsub(/\\/,"\\\\")

What I want is to change single backslashes to double backslashes. The
result of the above substitution is “no change”

On the other hand
"\\1\\2\\3".gsub(/\\/,"\\\\\\\\")
does do what I want … but I am clueless as to why.

ralphshnelvar · November 20, 2010, 11:34pm

On Sun, Nov 21, 2010 at 12:13 AM, Ralph S. [email protected]
wrote:

On the other hand
"\\1\\2\\3".gsub(/\\/,"\\\\\\\\")
does do what I want … but I am clueless as to why.

Backslashes are tricky. What’s happening here is that each escaped
backslash ("\\") yields one backslash, which then escapes what
comes after it, in this case another escaped backslash that in turn
yields one backslash. In other words, four backslashes yield two
backslashes, which is an escaped backslash (i.e. one backslash).
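
A minimal sketch of the two layers described above, assuming current Ruby behaviour; the variable names are made up for illustration. The string literal halves the backslashes first, then gsub's replacement syntax halves them again:

```ruby
# The target string: backslash-1, backslash-2, backslash-3 (6 characters).
s = "\\1\\2\\3"

# Layer 1: the string literal. Each "\\" pair in source is one backslash.
two  = "\\\\"      # 2 characters: two backslashes
four = "\\\\\\\\"  # 4 characters: four backslashes

# Layer 2: gsub's replacement syntax. Each pair of backslashes in the
# replacement string stands for one literal backslash in the result.
unchanged = s.gsub(/\\/, two)   # every backslash replaced by one backslash
doubled   = s.gsub(/\\/, four)  # every backslash replaced by two

puts two.length   # 2
puts four.length  # 4
puts unchanged    # \1\2\3
puts doubled      # \\1\\2\\3
```

Printing with puts rather than inspecting the value keeps irb's own escaping out of the picture.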

HTH,
Ammar

ralphshnelvar · November 20, 2010, 11:41pm

On Sun, Nov 21, 2010 at 12:34 AM, Ammar A. [email protected]
wrote:

What I want is to change single backslashes to double backslashes. The result
of the above substitution is “no change”

I should have added that you can get the same result with 3
backslashes. So 6 of them will give you two.

"\\1\\2\\3".gsub(/\\/,"\\\\\\").scan /./
=> ["\\", "\\", "1", "\\", "\\", "2", "\\", "\\", "3"]

Regards,
Ammar

ralphshnelvar · November 21, 2010, 2:56pm

On Sun, Nov 21, 2010 at 11:57 AM, botp [email protected] wrote:

#1

#4
"\\1\\2\\3".gsub(/(\\)/){$1+$1}.scan /./
#=> ["\\", "\\", "1", "\\", "\\", "2", "\\", "\\", "3"]

#1 & #2 samples uses group backreferences, ruby may need second parsing pass
for this feature to work…

#3 & #4 uses code blocks. may not need second pass. backreferences can be
had using $n notation.

botp’s excellent suggestions reminded me of another one:

"\\1\\2\\3".gsub(/\\/, '\&\&')
=> "\\\\1\\\\2\\\\3"

Regards,
Ammar

ralphshnelvar · November 21, 2010, 10:02pm

Ralph S. wrote in post #962847:

Consider the string
\1\2\3
that is
"\\1\\2\\3"

I feel really stupid … but this simple substitution pattern does not
do what I expect.

"\\1\\2\\3".gsub(/\\/,"\\\\")

Here you are replacing one backslash with one backslash.

The trouble is, in the replacement string, ‘\1’ has a special meaning
(insert the value of the first capture). Because of this, a literal
backslash is backslash-backslash.

So to replace with two backslashes you need
backslash-backslash-backslash-backslash. And inside a double or single
quoted string, a single backslash is represented as "\\" or '\\'

irb(main):001:0> "\\1\\2\\3".gsub(/\\/,"\\\\\\\\")
=> "\\\\1\\\\2\\\\3"

The second level of backslashing isn’t used with the block form, since
if you want to use captured subexpressions you can use #{$1} instead of
\1. Hence as an alternative:

irb(main):002:0> "\\1\\2\\3".gsub(/\\/) { "\\\\" }
=> "\\\\1\\\\2\\\\3"
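
To compare the string form and the block form side by side, here is a small sketch; the variable names are illustrative only. In the string form the replacement text is scanned for sequences such as \0, \1 and \&, while the block's return value is inserted untouched:

```ruby
s = "\\1\\2\\3"

# String form: "\\0" is backslash + 0, which gsub reads as "the whole match".
via_backref = s.gsub(/\\/, "\\0\\0")

# Single quotes leave \& intact, and gsub also reads \& as the whole match.
via_match = s.gsub(/\\/, '\&\&')

# Block form: only the string-literal layer applies, so "\\\\" is simply
# two backslashes and is inserted as-is.
via_block = s.gsub(/\\/) { "\\\\" }

# Captures are reachable through $1 inside the block.
via_capture = s.gsub(/(\\)/) { $1 + $1 }

puts via_backref  # \\1\\2\\3
puts via_block    # \\1\\2\\3
```

All four variants should produce the same nine-character string.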

ralphshnelvar · November 22, 2010, 12:27am

On Sun, Nov 21, 2010 at 11:02 PM, Brian C. [email protected]
wrote:

Ralph S. wrote in post #962847:

"\\1\\2\\3".gsub(/\\/,"\\\\")

Here you are replacing one backslash with one backslash.

The trouble is, in the replacement string, ‘\1’ has a special meaning
(insert the value of the first capture). Because of this, a literal
backslash is backslash-backslash.

That’s a keen observation, but the fact that they happen to be
back-references doesn’t seem to play a part in this situation.

"\\a\\b\\c".gsub(/\\/,"\\\\")
=> "\\a\\b\\c"
"\\a\\b\\c".gsub(/\\/,"\\\\\\")
=> "\\\\a\\\\b\\\\c"

Regards,
Ammar

ralphshnelvar · November 21, 2010, 10:58am

On Sun, Nov 21, 2010 at 6:13 AM, Ralph S. [email protected]
wrote:

What I want is to change single backslashes to double backslashes. The
result of the above substitution is “no change”

On the other hand
"\\1\\2\\3".gsub(/\\/,"\\\\\\\\")
does do what I want … but I am clueless as to why.

there are many ways,

#1
"\\1\\2\\3".gsub(/(\\)/,"\\1\\1").scan /./
#=> ["\\", "\\", "1", "\\", "\\", "2", "\\", "\\", "3"]

#2
"\\1\\2\\3".gsub(/(\\)/,'\1\1').scan /./
#=> ["\\", "\\", "1", "\\", "\\", "2", "\\", "\\", "3"]

#3
"\\1\\2\\3".gsub(/\\/){"\\\\"}.scan /./
#=> ["\\", "\\", "1", "\\", "\\", "2", "\\", "\\", "3"]

#4
"\\1\\2\\3".gsub(/(\\)/){$1+$1}.scan /./
#=> ["\\", "\\", "1", "\\", "\\", "2", "\\", "\\", "3"]

#1 & #2 use group backreferences; ruby may need a second parsing
pass for this feature to work…

#3 & #4 use code blocks, which may not need a second pass. backreferences
can be had using $n notation.

best regards -botp

ralphshnelvar · November 22, 2010, 1:29pm

On Mon, Nov 22, 2010 at 10:38 AM, Robert K.
[email protected] wrote:

backslash in a replacement string one needs to escape it and hence
have two backslashes. Coincidentally a backslash is also special in a
string (even in a single quoted string). So you need two levels of
escaping, makes 2 * 2 = 4 backslashes on the screen for one literal
replacement backslash.

Actually, 3 backslashes will yield one backslash. The first two result
in one (escaped), and the third one, escaped by the previous escaped
backslash ends up being one. My second example showed this, using 6
backslashes instead of 8. Using 4 backslashes works because the second
pair yields an escaped backslash, but it is not necessary.

Regards,
Ammar

ralphshnelvar · November 22, 2010, 9:39am

On Mon, Nov 22, 2010 at 12:27 AM, Ammar A. [email protected]
wrote:

That’s a keen observation, but the fact that they happen to be
back-references doesn’t seem to play a part in this situation.

"\\a\\b\\c".gsub(/\\/,"\\\\")
=> "\\a\\b\\c"
"\\a\\b\\c".gsub(/\\/,"\\\\\\")
=> "\\\\a\\\\b\\\\c"

The key point to understand IMHO is that a backslash is special in
replacement strings. So, whenever one wants to have a literal
backslash in a replacement string one needs to escape it and hence
have two backslashes. Coincidentally a backslash is also special in a
string (even in a single quoted string). So you need two levels of
escaping, makes 2 * 2 = 4 backslashes on the screen for one literal
replacement backslash.

Additionally, people are often confused by the fact that IRB by default
uses #inspect for showing expression values, which will display twice
as many backslashes as are present in the string.
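
A quick way to see the difference between what a string contains and what #inspect shows; this is only a sketch, with `s` standing in for the string from the original question:

```ruby
s = "\\1\\2\\3"

puts s.length             # 6
puts s                    # \1\2\3
puts s.inspect            # "\\1\\2\\3"
p s                       # same text as the line above
puts s.chars.count("\\")  # 3
```

The string holds three backslashes; the inspected form shows six because each one is escaped for display.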

Can we please make a big red sticker and put it on every Ruby
installer and source tar to inform people of this and the local
variable method ambiguity. These two seem to be the issues that pop
up most of the time.

Kind regards

robert

ralphshnelvar · November 22, 2010, 2:53pm

On Mon, Nov 22, 2010 at 1:28 PM, Ammar A. [email protected]
wrote:

My second example showed this, using 6
backslashes instead of 8. Using 4 backslashes works because the second
pair yields an escaped backslash, but it is not necessary.

That does not work reliably under all circumstances though:

irb(main):006:0> "abc".gsub /./, "\\\n"
=> "\\\n\\\n\\\n"
irb(main):007:0> puts("abc".gsub /./, "\\\n")
\
\
\
=> nil
irb(main):008:0> "abc".gsub /./, "\\\\n"
=> "\\n\\n\\n"
irb(main):009:0> puts("abc".gsub /./, "\\\\n")
\n\n\n
=> nil
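
The same comparison as a script rather than an irb session, as a rough sketch with illustrative variable names. With three backslashes the last one pairs up with the n and becomes a newline; with four it does not:

```ruby
# Three backslashes then n: parsed as "\\" followed by "\n",
# i.e. one backslash and a real newline (2 characters).
three = "\\\n"

# Four backslashes then n: parsed as "\\" "\\" "n",
# i.e. two backslashes and the letter n (3 characters).
four = "\\\\n"

r3 = "abc".gsub(/./, three)
r4 = "abc".gsub(/./, four)

p three.length        # 2
p four.length         # 3
p r3.include?("\n")   # true: real newlines ended up in the result
p r4.include?("\n")   # false: backslash + n, no newline
```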

It is safer to use 4 backslashes. This is the only robust way to do
this even though sometimes you can simply use a single backslash (e.g.
'\1' instead of '\\1') because string parsing is a bit tolerant under some
circumstances:

irb(main):014:0> '\1'
=> "\\1"
irb(main):015:0> '\\1'
=> "\\1"

but

irb(main):019:0> "\n"
=> "\n"
irb(main):020:0> "\\n"
=> "\\n"
irb(main):021:0> "\1"
=> "\x01"
irb(main):022:0> "\\1"
=> "\\1"

Kind regards

robert

ralphshnelvar · November 22, 2010, 6:27pm

On Mon, Nov 22, 2010 at 3:53 PM, Robert K.
[email protected] wrote:

irb(main):006:0> "abc".gsub /./, "\\\n"
=> nil

I think these examples are somewhat misleading, because the escaped
newline (\n) normally includes a backslash. Taking that into account,
i.e. not counting the one that is part of the newline character, the first
example is only using 2 backslashes, and the second example is using
3. The same goes for its friends, \a, \r, \f, etc.

It is safer to use 4 backslashes. This is the only robust way to do
this even though sometimes you can simply use a single backslash (e.g.
'\1' instead of '\\1') because string parsing is a bit tolerant under some
circumstances:

I don’t think this is tolerance from the string parser, it is
recognition of the \1 as a valid octal value.

irb(main):014:0> '\1'
=> "\\1"
irb(main):015:0> '\\1'
=> "\\1"

Here the single quotes are coming into play. Octal escapes are not
recognized within them. But it outputs the string in double quotes,
“forcing” the backslash to be escaped in the output. Backslashes need
to be escaped in single quoted string, just like they do in double
quoted ones, so in the second example ('\\1'), it’s just one
backslash, again.

but

irb(main):019:0> "\n"
=> "\n"
irb(main):020:0> "\\n"
=> "\\n"
irb(main):021:0> "\1"
=> "\x01"
irb(main):022:0> "\\1"
=> "\\1"

Here the double quotes are taking effect. The first correctly prints a
newline, the second an escaped one, the third gets recognized as an
octal escape, and the last escapes the meaning of the backslash that
would otherwise cause the 1 to be interpreted as an octal value.
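
The four cases can be checked directly. This is a small sketch (the variable names are illustrative) comparing contents rather than inspected output:

```ruby
single_one = '\1'    # backslash + "1" (2 characters)
single_two = '\\1'   # the same 2 characters: \\ collapses to one backslash
double_one = "\1"    # octal escape: a single character with code 1
double_two = "\\1"   # backslash + "1"

p single_one == single_two   # true
p double_one.bytes           # [1]
p double_two == single_one   # true
p "\n".length                # 1 (a real newline)
p "\\n".length               # 2 (backslash followed by n)
```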

Maybe using 4 backslashes is safer, overall, but I wouldn’t make it a
rule. At least not without explaining these special cases that include
a leading backslash in their normal representation.

Regards,
Ammar

ralphshnelvar · November 22, 2010, 8:25pm

On 22.11.2010 18:21, Ammar A. wrote:

That does not work reliably under all circumstances though:
irb(main):009:0> puts("abc".gsub /./, "\\\\n")
\n\n\n
=> nil

I think these examples are somewhat misleading, because the escaped
newline (\n) normally includes a backslash. Taking that into account,
i.e. not counting the one that is part of newline character, the first
example is only using 2 backslashes, and the second example is using
3. The same goes for its friends, \a, \r, \f, etc.

That is the very point of my posting: you cannot always use three
slashes reliably because - ooops - all of a sudden the last one may be
part of something else. In other cases, it happens to work:

irb(main):002:0> "abc".gsub /./, "\\\y"
=> "\\y\\y\\y"
irb(main):003:0> "abc".gsub /./, "\\\\y"
=> "\\y\\y\\y"

Now if someone changes “y” to “n” in the first case the (probably
unintended) effect is dramatic. Or consider a replacement string ‘foo
\1 bar’ which at some point in time is changed to “foo \1 bar \n”
unsuspectingly and which suddenly does not only yield a newline but some
weird octal character. This would have been avoided if the original
string did contain two backslashes already.
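
The requoting scenario can be sketched like this. The strings are the ones from the paragraph above; the variable names and the one-character input "x" are made up for the demonstration:

```ruby
before = 'foo \1 bar'       # single-quoted: backslash + 1 survives
after  = "foo \1 bar \n"    # double-quoted: \1 silently becomes byte 0x01
robust = "foo \\1 bar \n"   # doubled backslash survives the change of quotes

p "x".gsub(/(x)/, before)   # the capture is inserted for \1
p "x".gsub(/(x)/, after)    # no backreference left, a control character instead
p "x".gsub(/(x)/, robust)   # the capture is inserted, plus the newline
```

Writing the backreference with two backslashes from the start means switching quote styles later does not change its meaning.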

irb(main):015:0> '\\1'
=> "\\1"

Here the single quotes are coming into play. Octal escapes are not
recognized within them. But it outputs the string in double quotes,
“forcing” the backslash to be escaped in the output. Backslashes need
to be escaped in single quoted string, just like they do in double
quoted ones, so in the second example ('\\1'), it’s just one
backslash, again.

Apparently I was not clear enough. The point is, that there is some
tolerance. Both sequences (line 14 and 15) produce the same output
although they differ in backslash usage. This does not work if you try
to write '\' to get a single backslash. For that you need '\\'. If you
use two backslashes in both cases it’s clear what happens and there is
no room for errors.

Here the double quotes are taking effect. The first correctly prints a
newline, the second an escaped one,

This is not an “escaped newline” but merely a backslash followed by
character “n”. Whether that is considered “escaped” in some way depends
on the code that processes this string. If at all this is an escaped
“n”.

the third gets recognized as an
octal escape, and the last escapes the meaning of the backslash that
would otherwise cause the 1 to be interpreted as an octal value.

Correct.

Maybe using 4 backslashes is safer, overall, but I wouldn’t make it a
rule. At least not without explaining these special cases that include
a leading backslash in their normal representation.

My precise reason to make it a rule is that it is simple and beginners
do not have to remember all these special cases that you find so worth
mentioning.

Actually I do not like those special cases and would rather suggest to
remove them since they make things unnecessarily complicated. The
repeated occurrence of newbie confusion and the very discussion we are
having here proves that the logic creates more confusion than clarity.
The only reason I do not suggest to change this is the fact that this
might break a lot of code.

Kind regards

robert

ralphshnelvar · November 23, 2010, 10:18am

On Mon, Nov 22, 2010 at 10:06 PM, Ammar A. [email protected]
wrote:

----8<----
either their way, or their way in a way one did not expect.
But rules can be made to allow for some flexibility (just think
of method calls with or without brackets in Ruby).

This is not an “escaped newline” but merely a backslash followed by
character “n”. Whether that is considered “escaped” in some way depends on
the code that processes this string. If at all this is an escaped “n”.

You are correct sir. For someone who was nitpicking, I misspoke.

No problem. Apparently we both enjoy nitpicking. :-))

My precise reason to make it a rule is that it is simple and beginners do
not have to remember all these special cases that you find so worth
mentioning.

This might be six of one, half a dozen of the other kind of situation.
People would start to ask if the backslash in the \n case would count
in the “just add 4” rule, or not? 4 backslashes in total or 5? It
seems to only shift the issue slightly, and temporarily, until one has
to actually understand what is really going on.

Hmm… Maybe.

Actually I do not like those special cases and would rather suggest to
remove them since they make things unnecessarily complicated. The repeated
occurrence of newbie confusion and the very discussion we are having here
proves that the logic creates more confusion than clarity. The only reason I
do not suggest to change this is the fact that this might break a lot of
code.

I agree, but this long “heritage” that goes back to the 60s is
probably very hard to shake. Maybe a new language can break away from
it.

In Ruby’s case the heritage does not go back to the sixties but rather
to the nineties (1997) if I am not mistaken.

Out of curiosity, what could these beasts be replaced with? Constants?

I’d leave everything as is except drop special cases like ‘\1’ (this
would either be an octal escape as in a double quoted string or rather
just “1”). In single quoted strings only ’ would be special if
preceded by a backslash. In double quoted strings I would have those
characters which are special currently (", n, r, a, t and probably
others I’m not thinking of right now). I am undecided whether I would
make all others errors or tolerant (e.g. “\z” would either be a syntax
error or just “z”). I have a slight tendency to the more strict
variant though because otherwise people might be left wondering what
\z means when it is just “z”; also, this would help detect typing
errors (maybe someone wanted to type “\t” which is just a key away in
my German keyboard).

Kind regards

robert

ralphshnelvar · November 22, 2010, 10:07pm

On Mon, Nov 22, 2010 at 9:25 PM, Robert K.
[email protected] wrote:

On 22.11.2010 18:21, Ammar A. wrote:

I don’t think this is tolerance from the string parser, it is
recognition of the \1 as a valid octal value.

irb(main):014:0> '\1'
=> "\\1"
irb(main):015:0> '\\1'
=> "\\1"
----8<----

Apparently I was not clear enough. The point is, that there is some
tolerance. Both sequences (line 14 and 15) produce the same output
although they differ in backslash usage. This does not work if you try to
write '\' to get a single backslash. For that you need '\\'. If you use
two backslashes in both cases it’s clear what happens and there is no room
for errors.

I guess I took issue with the word tolerance. I don’t think of lexers
and parsers as tolerant. They are quite ruthless and dictatorial. It’s
either their way, or their way in a way one did not expect.

This is not an “escaped newline” but merely a backslash followed by
character “n”. Whether that is considered “escaped” in some way depends on
the code that processes this string. If at all this is an escaped “n”.

You are correct sir. For someone who was nitpicking, I misspoke.

My precise reason to make it a rule is that it is simple and beginners do
not have to remember all these special cases that you find so worth
mentioning.

This might be six of one, half a dozen of the other kind of situation.
People would start to ask if the backslash in the \n case would count
in the “just add 4” rule, or not? 4 backslashes in total or 5? It
seems to only shift the issue slightly, and temporarily, until one has
to actually understand what is really going on.

Actually I do not like those special cases and would rather suggest to
remove them since they make things unnecessarily complicated. The repeated
occurrence of newbie confusion and the very discussion we are having here
proves that the logic creates more confusion than clarity. The only reason I
do not suggest to change this is the fact that this might break a lot of
code.

I agree, but this long “heritage” that goes back to the 60s is
probably very hard to shake. Maybe a new language can break away from
it.

Out of curiosity, what could these beasts be replaced with? Constants?

Cheers,
Ammar

ralphshnelvar · November 23, 2010, 12:41pm

On Tue, Nov 23, 2010 at 11:17 AM, Robert K.
[email protected] wrote:

On Mon, Nov 22, 2010 at 10:06 PM, Ammar A. [email protected] wrote:

I guess I took issue with the word tolerance. I don’t think of lexers
and parsers as tolerant. They are quite ruthless and dictatorial. It’s
either their way, or their way in a way one did not expect.

But rules can be made to allow for some flexibility (just think
of method calls with or without brackets in Ruby).

That’s a good example, and I now understand what you meant by
tolerance.

This is not an “escaped newline” but merely a backslash followed by
character “n”. Whether that is considered “escaped” in some way depends
on
the code that processes this string. If at all this is an escaped “n”.

You are correct sir. For someone who was nitpicking, I misspoke.

No problem. Apparently we both enjoy nitpicking. :-))

I agree, but this long “heritage” that goes back to the 60s is

probably very hard to shake. Maybe a new language can break away from
it.

In Ruby’s case the heritage does not go back to the sixties but rather
to the nineties (1997) if I am not mistaken.

I was thinking of C, which I believe introduced these escapes, but I’m
not
sure.

variant though because otherwise people might be left wondering what
\z means when it is just “z”; also, this would help detect typing
errors (maybe someone wanted to type “\t” which is just a key away in
my German keyboard).

I like the idea of treating unnecessary escapes as syntax errors, or at
least warnings. I see this a lot in regular expressions, especially in
character sets. Characters that don’t need to be escaped (like ? and *)
are preceded with a backslash, just to be safe I guess, making for
harder to read code, as you noted.

Regards,
Ammar

ralphshnelvar · November 23, 2010, 4:48pm

On Tue, Nov 23, 2010 at 12:39 PM, Ammar A. [email protected]
wrote:

On Tue, Nov 23, 2010 at 11:17 AM, Robert K.
[email protected] wrote:

On Mon, Nov 22, 2010 at 10:06 PM, Ammar A. [email protected] wrote:

I agree, but this long “heritage” that goes back to the 60s is

probably very hard to shake. Maybe a new language can break away from
it.

In Ruby’s case the heritage does not go back to the sixties but rather
to the nineties (1997) if I am not mistaken.

I was thinking of C, which I believe introduced these escapes, but I’m not
sure.

Yeah, but I don’t want to change \n, \t etc. in double quoted strings.
I mostly want to get rid of ‘\1’ which is something completely
specific to Ruby.

variant though because otherwise people might be left wondering what
\z means when it is just “z”; also, this would help detect typing
errors (maybe someone wanted to type “\t” which is just a key away in
my German keyboard).

I like the idea of treating unnecessary escapes as syntax errors, or at
least warnings. I see this a lot in regular expressions, especially in
character sets. Characters that don’t need to be escaped (like ? and *) are
preceded with a backslash, just to be safe I guess, making for
harder to read code, as you noted.

Exactly. I would not want to get rid of optional brackets for example
because lack of brackets can make code much more readable (apart from
foo.bar=(123) looking weird). It’s always a question of balance. I
have to say that Matz did a remarkable job at this in Ruby in general.
This is just one of very few things that could be better (class
variables is another one I can think of right now).

Kind regards

robert