Text Parsing Help

Dobai-Pataky_BSSSSl · December 2, 2010, 6:27pm

Greetings,

I am new to Ruby and programming and am trying to parse a text file, but
have encountered some difficulties.

Basically, the text file contains lines in the following format (where
\n is not really a newline but the text “\n”):

TextString [SYMBOL]\nDefinition

I need to replace the text \n with a tab, as I am attempting to separate
all the “tokens” by tabs. The issue here is that \n happens to be a
newline character. I tried searching the forums and tried the following
code:

lineItem = line.gsub("\\\\n", "\t")

but it doesn’t seem to work and the \n is not being replaced. How can I
convert the \n text into a tab?

Any help is greatly appreciated!

jestermania · December 2, 2010, 6:34pm

On 12/2/2010 11:27 AM, Jester M. wrote:

I need to replace the text \n with a tab, as I am attempting to separate
all the “tokens” by tabs. The issue here is that \n happens to be a
newline character. I tried searching the forums and tried the following
code:

lineItem = line.gsub("\\\\n", "\t")

but it doesn’t seem to work and the \n is not being replaced. How can I
convert the \n text into a tab?

In Ruby, the literal "\n" is a string consisting of only a newline
character. If you want the string to literally be backslash n (\n),
then you would use "\\n". The backslash is a special character within
string literals, so if you want it to appear literally in your string,
you have to escape it with another backslash. Your example in the gsub
call above is actually creating a search string of backslash backslash n
(\\n) because you have 4 backslashes preceding the n, but that text does
not appear in your input.
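
One way to check this is to count characters. The sketch below is only an illustration: the variable names are made up, and the sample line is taken from the format described in the question rather than from the real input file.

```ruby
# Each literal differs only in how many backslashes precede the n.
newline       = "\n"     # 1 character: a real newline
backslash_n   = "\\n"    # 2 characters: backslash, n
two_backslash = "\\\\n"  # 3 characters: backslash, backslash, n

puts newline.length        # 1
puts backslash_n.length    # 2
puts two_backslash.length  # 3

# Single quotes keep the backslash literal, which mimics text read from a file.
line = 'TextString [SYMBOL]\nDefinition'

fixed  = line.gsub("\\n", "\t")    # matches backslash + n
missed = line.gsub("\\\\n", "\t")  # looks for backslash + backslash + n, finds nothing

p fixed           # the \n text has become a tab
p missed == line  # true, nothing was replaced
```

With a string pattern, gsub matches the characters literally, so only the two-character pattern finds anything in this sample.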

-Jeremy

jestermania · December 2, 2010, 6:41pm

On Thu, Dec 2, 2010 at 6:27 PM, Jester M. wrote:

lineItem = line.gsub("\\\\n", "\t")

This may be useful:

$ irb

s = 'car\nplane\ntrain \n boat' # because of the ' quotes, the \n is not interpreted as a newline
=> "car\\nplane\\ntrain \\n boat"
s.gsub(/\\n/, "\t") # here the \\n is really '\n', but "\t" is really a tab
=> "car\tplane\ttrain \t boat"

The result has tabs now.

Peter

jestermania · December 2, 2010, 7:04pm

On Thu, Dec 2, 2010 at 7:01 PM, Jester M. wrote:

end

How would I use the ' ' with the line variable?

You don't need it, because what you read from the file is already the
character '\' followed by the character 'n'. Peter needed it because he
was typing Ruby string literals.
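
To see this without relying on an existing file, the following sketch writes a line containing the two characters backslash and n to a temporary file and reads it back. The file contents and variable names are invented for illustration, and Tempfile.create assumes a reasonably recent Ruby.

```ruby
require "tempfile"

line   = nil
tokens = nil

Tempfile.create("input") do |file|
  # Single quotes: the file receives a backslash and an n, not a newline.
  file.write('car\nplane\ntrain' + "\n")
  file.flush

  # Text read from the file needs no quoting; it already holds those characters.
  line   = File.readlines(file.path).first
  tokens = line.chomp.gsub("\\n", "\t").split("\t")
end

p line    # inspect form shows the backslashes plus the trailing newline
p tokens  # ["car", "plane", "train"]
```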

Jesus.

jestermania · December 2, 2010, 7:00pm

Thanks for the help! I have a question though regarding Peter’s reply:

s = 'car\nplane\ntrain \n boat' # because of the ' quotes, the \n is not interpreted as a newline

Currently, my code is:

IO.readlines("input.txt").each do |line|
  lineItem = line.gsub(/\\n/, "\t")
end

How would I use the ' ' with the line variable?

jestermania · December 2, 2010, 8:07pm

Yes, but I tried the code and it is still not working. I used a puts
statement to output the results to see whether the “\n” text was truly
being replaced by a tab.

#!/usr/bin/ruby -w

IO.readlines("input.txt").each do |line|
  lineItem = line.gsub(/\\n/, "\t")
  puts lineItem.split("\t")
end

However, the results were that the output still had \n text.

jestermania · December 2, 2010, 11:38pm

On Thu, Dec 2, 2010 at 8:10 PM, Jester M. wrote:

However, the results were that the output still had \n text.

I hope my example below can explain what happens

$ ruby -v
ruby 1.8.7 (2010-01-10 patchlevel 249) [i486-linux]

I used this input.txt file for testing

car\nplane\ntrain \n boat

second line, first token \n second token

irb(main):013:0> IO.readlines("input.txt").each do |line|
irb(main):014:1* lineItem = line.gsub(/\\n/, "\t")
irb(main):015:1> puts lineItem.split("\t").inspect
irb(main):016:1> end
["car", "plane", "train ", " boat\n"] # the first line is parsed and split correctly into this array
["\n"] # the second line only has a newline
["second line, first token ", " second token\n"] # correct too
=> ["car\\nplane\\ntrain \\n boat\n", "\n", "second line, first token \\n second token\n"]

This last line is the result of IO.readlines("input.txt"), because the
"each" method returns self after it has iterated over all the elements.

irb(main):017:0> IO.readlines("input.txt").each do |line|
irb(main):018:1* lineItem = line.gsub(/\\n/, "\t")
irb(main):019:1> puts lineItem.split("\t")
irb(main):020:1> end
car
plane
train
boat

second line, first token
second token
=> ["car\\nplane\\ntrain \\n boat\n", "\n", "second line, first token \\n second token\n"]

So, one trick is to use .inspect and .class to better understand what
object you are looking at and what its content really is.
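
As a small illustration of that trick, the sketch below prints the same string with and without inspect. The sample string is invented; it holds the two characters backslash and n in the middle and a real newline at the end.

```ruby
line = 'car\nplane' + "\n"  # backslash + n in the middle, real newline at the end

puts line.class    # String
puts line          # shows car\nplane; the real newline just ends the line
puts line.inspect  # shows "car\\nplane\n", so both kinds of \n are distinguishable
puts line.length   # 11
```

In the inspect output a doubled backslash stands for one literal backslash, while a single \n stands for a real newline.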

Also, you could use chomp to get rid of the newline at the end of the
last entry in your array of tokens.
So, a shorter piece of code that may be useful is:

irb(main):025:0> IO.readlines("input.txt").map do |line|
irb(main):026:1* line.chomp.gsub(/\\n/, "\t")
irb(main):027:1> end
=> ["car\tplane\ttrain \t boat", "", "second line, first token \t second token"]

Now the tab delimiters that you wanted are between the tokens
in the resulting output.

HTH,

Peter

jestermania · December 3, 2010, 12:51am

On Thu, Dec 2, 2010 at 1:10 PM, Jester M. wrote:

However, the results were that the output still had \n text.

"\n" is a newline
"\\n" is a backslash, letter n
'\n' is the same as "\\n" but you can ignore that if it is confusing,
because it only counts when you enter it as a literal.

You say you want to see whether "\n" is being replaced by a tab, but you
are replacing /\\n/ (btw, you could use a string here). You say the
output has \n in the text. By that, I assume you mean it has a newline,
but are misinterpreting it as the "\\n" which you replaced. If this is
accurate, you should decide whether you wish to replace "\n" or "\\n".
As Peter said, using inspect (i.e. puts line.inspect) is a good way to
see your String data.

Also, if you don't already have tabs that you also wish to split on,
then you don't need the gsub step; you can just split on the "\\n". Here
are a couple of examples to hopefully make it a little easier to see.
"a\\nb\nc".split("\n") # => ["a\\nb", "c"]
"a\\nb\nc".split("\\n") # => ["a", "b\nc"]
"a\\nb\nc\td".gsub("\n","\t").split("\t") # => ["a\\nb", "c", "d"]
"a\\nb\nc\td".gsub("\\n","\t").split("\t") # => ["a", "b\nc", "d"]

jestermania · December 4, 2010, 5:28am

Ah hah! I figured it out, the txt file had the wrong encoding. I
encoded it with UTF-8 in Notepad++ and everything works as expected. I
thank everyone for writing these meaningful replies.

jestermania · December 4, 2010, 3:29am

Peter/Josh,

Thanks once again for the helpful posts. I am learning quite a bit
which is good. However, I just tried to replicate Peter’s example and
when I attempted to use the .inspect method, the output was not what I
expected:

INPUT FILE <input.txt>

car\nplane\ntrain \n boat

second line, first token \n second token

OUTPUT

["\377\376c\000a\000r\000\\\000n\000p\000l\000a\000n\000e\000\\\000n\000t\000r\000a\000i\000n\000 \000\\\000n\000 \000b\000o\000a\000t\000\r\000\n"]
["\000\r\000\n"]
["\000s\000e\000c\000o\000n\000d\000 \000l\000i\000n\000e\000,\000 \000f\000i\000r\000s\000t\000 \000t\000o\000k\000e\000n\000 \000\\\000n\000 \000s\000e\000c\000o\000n\000d\000 \000t\000o\000k\000e\000n\000"]

Do you know why there are so many numbers, like \377 and \000?
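
For reference, \377\376 is the octal form of the bytes 0xFF 0xFE, which is the byte order mark of a UTF-16 little-endian file, and each \000 is the zero high byte of a two-byte character. That is consistent with the file having been saved in a UTF-16 encoding, although this is an inference from the output above. A minimal sketch of decoding such bytes follows; it assumes Ruby 1.9 or later (Ruby 1.8 has no String#encode), and the sample data is invented rather than read from the real input.txt.

```ruby
# Build the bytes that a UTF-16LE file with a byte order mark would contain.
sample = "\uFEFF" + 'car\nplane' + "\r\n"
raw    = sample.encode("UTF-16LE").force_encoding("BINARY")

p raw.bytes.first(2)  # [255, 254], i.e. \377\376 in octal

# Decode to UTF-8 and drop the byte order mark before processing.
text   = raw.dup.force_encoding("UTF-16LE").encode("UTF-8").sub(/\A\uFEFF/, "")
tokens = text.chomp.gsub("\\n", "\t").split("\t")

p tokens  # ["car", "plane"]
```

For a real file, the raw bytes could come from File.binread instead of the invented sample. Re-saving the file as UTF-8 in an editor avoids the decoding step altogether.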