Does Ruby need a "line separator" class?

I’ve run into a problem where Ruby can’t handle newlines on Windows
because the regexp is explicitly looking for \n and not \r\n.

In the Java world, there is a system property to represent line
separator so that you can write code that is cross-platform with respect
to line separation on Unix/Windows/Mac. Is there an equivalent
abstraction of the newline character in Ruby? If not, where does it
belong?

For some reason, I thought I read somewhere that sometimes the “\n”
character is overloaded in this way (to represent a “newline” regardless
of platform), but not sure if I’m misremembering.

Thanks,
Wes

On 7/31/06, Xavier N. wrote:

print "\n"s.

That’s because there’s an intermediate IO layer that transforms CRLF
into LF in CRLF platforms on reading, and LF back to CRLF on writing.

This has come up in the JRuby project fairly frequently since Java wants
to normalize line-terminators internally to the underlying platform,
rather than normalizing to \n and handling conversion on read-write.
Xavier, are you saying that Ruby has in its IO layer code to convert
from CRLF to LF on input/output, and this is the primary means of
normalizing newlines? We have had in our bug tracker a patch that
resolves JRuby’s newline issues in a similar way, but had not committed
it pending research into whether this would be appropriate and
sufficient.

On Jul 31, 2006, at 5:40 PM, Wes G. wrote:

I’ve run into a problem where Ruby can’t handle newlines on Windows
because the regexp is explicitly looking for \n and not \r\n.

It shouldn’t look for CRLFs. The rules of the game in languages that
inherit the newline normalization approach from C (those include C++
and Perl, for instance, but not Java) are that if you work in text
mode and the text file follows runtime conventions, you only read and
print "\n"s.

That’s because there’s an intermediate IO layer that transforms CRLF
into LF in CRLF platforms on reading, and LF back to CRLF on writing.

In Java this is handled in a different way: "\n" is not portable in
Java, and portable Java code uses method calls like println. But in
Ruby, a portable regexp that assumes text mode and data following the
runtime platform’s newline conventions has to use "\n"; no CR ever
gets into the string.
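
As a rough illustration of the point above (a sketch, not a description of Ruby's internals): the file below is written and read back in binary mode so the bytes are the same on every platform, and the CRLF-to-LF step that a text-mode read performs on a CRLF platform is only simulated with gsub. The helper name simulate_text_mode_read is hypothetical.

```ruby
require "tempfile"

# Hypothetical helper: mimics what a text-mode read does on a CRLF
# platform, i.e. each CRLF pair becomes a single LF.
def simulate_text_mode_read(raw_bytes)
  raw_bytes.gsub("\r\n", "\n")
end

raw = nil
Tempfile.create("newline_demo") do |f|
  f.binmode
  f.write("first\r\nsecond\r\n")  # bytes as a Windows program would store them
  f.flush
  raw = File.binread(f.path)      # binary read: no translation at all
end

text = simulate_text_mode_read(raw)

# After the translation the regexp only ever needs to mention "\n".
words = text.scan(/^(\w+)\n/).flatten   # => ["first", "second"]
```

In binary mode the CRs are still there (raw contains "\r\n"), which is the situation a regexp should not have to deal with when the data follows the platform's conventions.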

– fxn

Xavier,

That’s interesting.

In a pure Ruby (Rails) app, I’ve had to modify regexps to handle the
\r\n sequence so that my regexps will work in a Windows environment.
I’m guessing that this is related to the “file follows runtime
conventions” in your post. Meaning that the file that I’m processing
(which is actually sourced externally) did not conform to C runtime
conventions when it was written.

In general, this seems simple enough to handle: you just allow for
optional \r and \n combinations in your regexp (assuming the multiline
flag is set on the regexp), like so:

[^\r\n]*
[\r\n]*
(\r*\n*)
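
A slightly stricter variant of the patterns above, as a minimal sketch: instead of accepting any run of \r and \n characters, match exactly one line terminator at a time. The constant ANY_NEWLINE and the helper split_any_newline are hypothetical names used only for illustration.

```ruby
# Hypothetical pattern: CRLF, lone CR, or lone LF.
# CRLF is listed first so the pair counts as a single terminator.
ANY_NEWLINE = /\r\n|\r|\n/

# Hypothetical helper: split text into lines regardless of convention.
def split_any_newline(text)
  text.split(ANY_NEWLINE)
end

split_any_newline("a\r\nb\nc\rd")  # => ["a", "b", "c", "d"]
split_any_newline("a\r\n\r\nb")    # => ["a", "", "b"]
```

The second call shows the difference from a pattern such as [\r\n]*: the blank line between "a" and "b" is preserved rather than swallowed.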

Wes

FWIW, I’m pursuing this question because of the JRuby issue.

On 7/31/06, Wes G. wrote:

FWIW, I’m pursuing this question because of the JRuby issue.

I figured as much :) From what Xavier says, we may be closer (with Ola’s
patch) than previously thought…

On Jul 31, 2006, at 6:15 PM, Charles O Nutter wrote:

had in our bug tracker a patch that resolves JRuby’s newline issues in a
similar way, but had not committed it pending research into whether this
would be appropriate and sufficient.

If I am not mistaken, in Ruby that is delegated to stdio. After a
quick code inspection I think the exact point where that is done is
in the call to write():

r = write(fileno(f), RSTRING(str)->ptr+offset, l);

That’s in the function io_fwrite(), line 455 of io.c in Ruby 1.8.4.

In Perl that was delegated to stdio as well until 5.8.0, where the I/O
layer was replaced with PerlIO, which is now responsible for that
filtering on CRLF platforms.

– fxn

On Jul 31, 2006, at 6:23 PM, Wes G. wrote:

In a pure Ruby (Rails) app, I’ve had to modify regexps to handle the
\r\n sequence so that my regexps will work in a Windows environment.
I’m guessing that this is related to the “file follows runtime
conventions” in your post. Meaning that the file that I’m processing
(which is actually sourced externally) did not conform to C runtime
conventions when it was written.

Yes, that is an important point.

When we talk about portability as far as newlines are concerned we are
assuming the newline conventions of the platform and the data match.
A portable line-oriented script might fail if it is running on Linux
processing text files from a FAT32 partition that were generated by
some Windows program. There are a lot of common situations where
conventions may not match. A portable line-oriented script is not
supposed to handle those situations; a robust line-oriented script
should do something sensible with foreign conventions.

Web programming is one of them, because you cannot assume anything
about input that comes from a text area or an uploaded text file, for
instance. In that case you had better normalize first (written on the
fly):

normalized_text_area = text_area.gsub(/\015\012/, "\n").gsub(/\015/, "\n")

Now normalized_text_area holds the normalized text and all standard
line-oriented idioms will work.

In Ruby we are done because "\n" is "\012" everywhere; in Perl that
gets slightly more complicated because "\n" eq "\015" on Mac OS pre-X.
But you see the idea and why you do that.
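
A minimal sketch of that normalization wrapped in a helper, assuming the input may mix CRLF, lone CR and LF. The method name normalize_newlines and the sample text_area value are made up for illustration.

```ruby
# Hypothetical helper wrapping the two-step normalization:
# first CRLF -> LF, then any remaining lone CR -> LF.
def normalize_newlines(text)
  text.gsub(/\015\012/, "\n").gsub(/\015/, "\n")
end

# Sample input with mixed conventions, as a form submission might contain.
text_area  = "name: Wes\r\nlang: Ruby\rplatform: Windows\n"
normalized = normalize_newlines(text_area)

# "\n" and "\012" denote the same one-character string in Ruby.
same = ("\n" == "\012")  # => true

# Standard line-oriented idioms now behave as expected.
keys = normalized.each_line.map { |line| line.chomp.split(": ").first }
# => ["name", "lang", "platform"]
```

The order of the two gsub calls matters: replacing lone CRs first would turn each CRLF into two LFs.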

– fxn (<-- whose article about newlines for O’Reilly is about to
appear)

On Jul 31, 2006, at 7:27 PM, Charles O Nutter wrote:

(i.e. we
can’t normalize \r\n to \n again because they’re already \n on disk).

If those files are only handled by that application there is no
problem because \ns are precisely what the script should see.

For instance, if you pass a Unix text file to a line-oriented script
running on Windows the script will work as long as it only reads.
That’s because LFs not following a CR are left untouched by the I/O
layer, and by a happy coincidence LFs is what readline expects. So
everything works, by chance, but works.

Problem is the application generates text files that do not follow
the conventions of the platform, and other programs may assume they do.

– fxn

On 7/31/06, Xavier N. wrote:

Yes, that is an important point.

When we talk about portability as far as newlines are concerned we are
assuming the newline conventions of the platform and the data match.
A portable line-oriented script might fail if it is running on Linux
processing text files from a FAT32 partition that were generated by
some Windows program. There are a lot of common situations where
conventions may not match. A portable line-oriented script is not
supposed to handle those situations; a robust line-oriented script
should do something sensible with foreign conventions.

A large part of our problem is that we currently tend to normalize
everything to \n…all the time. That has the effect of also writing out
\n to the filesystem for newlines, which as you describe above causes
problems when trying to re-read. So for the case in question, we run
Rails…it generates files with newlines…we normalize those newlines to \n
and write such to disk…and then future use of those files (in this case,
ERB templates) fails because the newlines aren’t handled correctly (i.e.
we can’t normalize \r\n to \n again because they’re already \n on disk).

So it seems the IO approach may do well for us, where newlines are read
from platform-specific and written to platform-specific.
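
The "convert at the IO boundary" idea could be sketched roughly as follows. This is only an illustration of the approach, not JRuby's or Ola's actual patch; the module NewlineLayer and its methods are hypothetical, and the sketch assumes the text held in memory contains only "\n".

```ruby
# Hypothetical module: strings inside the program only ever contain "\n";
# the platform separator exists only in the bytes on disk.
module NewlineLayer
  # On input: platform separator -> "\n".
  def self.decode(raw, platform_sep)
    platform_sep == "\n" ? raw.dup : raw.gsub(platform_sep, "\n")
  end

  # On output: "\n" -> platform separator.
  # Assumes text holds no separator sequences other than "\n".
  def self.encode(text, platform_sep)
    platform_sep == "\n" ? text.dup : text.gsub("\n", platform_sep)
  end
end

template = "<h1>Title</h1>\n<p>Body</p>\n"
on_disk  = NewlineLayer.encode(template, "\r\n")  # what a CRLF platform stores
in_ruby  = NewlineLayer.decode(on_disk, "\r\n")   # what Ruby code sees again
```

With both directions handled in one place, a file written by the application reads back as the same "\n"-only text, and a file that already uses bare LFs passes through decode unchanged.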

On Jul 31, 2006, at 6:54 PM, Xavier N. wrote:

normalized_text_area = text_area.gsub(/\015\012/, "\n").gsub(/\015/, "\n")

Just for the archives, this normalizes in Ruby with only one pass:

normalized_text_area = text_area.gsub(/\015\012?/, "\n")

though it is less explicit. Let me add, now that we are on it, that if
the text is Unicode it may come with a few more codes for newlines.
All in all this is a PITA, like character encodings, but it is what
we’ve got for historical reasons.
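
A small sketch of both points, under the assumption that the text is UTF-8. The constant and method names are hypothetical, and the three Unicode code points handled here (NEL U+0085, LINE SEPARATOR U+2028, PARAGRAPH SEPARATOR U+2029) are just some of the additional line breaks Unicode defines, not a complete list.

```ruby
# The one-pass pattern: a CR optionally followed by an LF.
ASCII_NEWLINES = /\015\012?/

# Assumption: UTF-8 text. Adds three Unicode line-break code points.
UNICODE_NEWLINES = /\r\n?|[\u0085\u2028\u2029]/

# Hypothetical helper: map every terminator matched above to "\n".
def normalize_unicode_newlines(text)
  text.gsub(UNICODE_NEWLINES, "\n")
end

sample   = "a\r\nb\rc\n"
one_pass = sample.gsub(ASCII_NEWLINES, "\n")
two_pass = sample.gsub(/\015\012/, "\n").gsub(/\015/, "\n")
# one_pass and two_pass are both "a\nb\nc\n"

normalize_unicode_newlines("a\u2028b\r\nc\u0085d\re")  # => "a\nb\nc\nd\ne"
```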

– fxn

I was thinking about this a little more.

Why wouldn’t JRuby just take advantage of the Java runtime’s
normalization facility in this case, using the JVM’s notion of “newline”
on the particular platform to handle I/O?

Is the JRuby issue that only some of the code that is doing I/O is
pure Java and some other set of the code is Ruby so that trying to
always use the JVM “line separator” concept won’t work?

Wes

In this particular case, could
java.lang.System.getProperty("line.separator") be used to handle
platform-specific reading/writing? That way, you get to piggyback on
the multiplatform support built into Java. If the low-level I/O code is
centralized, it seems like this would be the way to go.

Are there performance implications for this approach? Seems like you
could just grab all of the system specific newline properties from the
System object upon the initialization of the JRuby interpreter and just
refer to them later.
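
Reading the separator once at startup might look something like this sketch. It assumes that on JRuby the JVM property is consulted through JRuby's Java integration, and that on other interpreters a guess from the host OS name is acceptable; PLATFORM_LINE_SEPARATOR and to_platform_newlines are hypothetical names, not part of Ruby or JRuby.

```ruby
require "rbconfig"

# Computed once at load time and reused afterwards.
# Assumption: JRuby reports RUBY_PLATFORM as "java"; elsewhere we guess
# from the host OS name in RbConfig.
PLATFORM_LINE_SEPARATOR =
  if RUBY_PLATFORM == "java"
    require "java"
    java.lang.System.getProperty("line.separator")
  else
    RbConfig::CONFIG["host_os"] =~ /mswin|mingw/ ? "\r\n" : "\n"
  end

# Hypothetical helper: convert internal "\n" to the platform separator.
def to_platform_newlines(text, separator = PLATFORM_LINE_SEPARATOR)
  text.gsub("\n", separator)
end

to_platform_newlines("a\nb", "\r\n")  # => "a\r\nb"
```

Since the lookup happens once, the per-write cost is only the gsub itself.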

Wes

On 7/31/06, Wes G. wrote:

The issues get complicated, but the biggest underlying issue is that we
can’t easily look like unix on unix and windows on windows because Java
looks basically the same everywhere…that is except for crap specific to
unix and windows. If we pretend to be one or the other all the time,
then the other platform breaks. If we try to emulate both, we run into
things where we simply can’t do it…we can act like both unix and windows
for some things but not others. Ultimately we try to normalize things to
some amorphous “java” platform, but then Ruby has no idea what we’re
talking about and falls back on either windows or unix behavior.

We’ve mostly been able to trick Ruby into doing the right things on
different platforms, and this will probably work the same way. It’s just
a matter of figuring out where newlines get normalized, normalizing them
ourselves to something appropriate internally for Java on platform X,
and then handling the conversion of that normalized format back out to
the platform again. Figuring out exactly what happens to \r\n everywhere
it’s encountered within Windows-based C Ruby will help us figure out
where the in/out has to happen.

It’s a bit more complicated than that…bring this up on the JRuby dev
list and others can chime in there as well.
