Parse a CSV-like file

RebhanS_Gilbert · February 6, 2007, 3:33pm

Hi,

I have a text file with a format like this:

AP850KP;INCLIB;E023889;AP013;240107;0730
AP850SD$;INCLIB;E052337;AP013;240107;0730
AP850SDA;INCLIB;E050441;AP013;240107;0730
AP850SDI;INCLIB;E023889;AP013;240107;0730
AP850SDO;INCLIB;E052337;AP013;240107;0730
AP850SDS;INCLIB;E050441;AP013;240107;0730
…

I want to get a collection for every E followed by digits,
so for the example above I want to get:

collections:
E023889
E052337
E050441
…

Each collection should contain datasets with the rest of the line, so
for example E023889 would have:

[AP850KP;INCLIB;AP013;240107;0730,AP850SDI;INCLIB;AP013;240107;0730]

Questions:
What kind of collection is best? Is an array sufficient?

Right now I have:

efas=Array.new
File.open("mycsvfile", "r").each do |line|
  if line =~ /(\w+.?);(\w+);(\w+);(\w+);(\w+);(\w+)/
    efas<<$3.to_s<<',' unless efas.include?($3.to_s)
  end
end
puts efas.to_s.chop

So I have all the E\d+ values, but how do I go further?

Are there better ways than regular expressions?
Any ideas?

Regards, Gilbert

RebhanS_Gilbert · February 6, 2007, 3:38pm

On Tue, Feb 06, 2007 at 11:32:27PM +0900, Rebhan, Gilbert wrote:

questions=
what kind of collection is the best ? is an array sufficient ?

Depends what you want to do with it. If you want to be able to find an
entry E123456 quickly, then you’d use a hash. If you want to keep only the
first/last entry for a particular key (as it seems you do), using a hash
speeds things up here too.

 puts efas.to_s.chop

Try:

efas = Hash.new
…
efas[$3] = [$1,$2,$4,$5,$6] unless efas.has_key?($3)
…
puts efas.inspect

Are there better ways as regular expressions ?

You could look at String#split instead
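
A minimal sketch of the String#split route, assuming the E-number is always the third field. The helper name group_rest_by_ticket is made up for illustration; it stores the rest of each line (without the E-number) under its key:

```ruby
# Hypothetical helper: groups the remaining fields of each line under its E-number.
# Assumes the E-number is always the third of the ';'-separated fields.
def group_rest_by_ticket(lines)
  groups = Hash.new { |hash, key| hash[key] = [] }
  lines.each do |line|
    fields = line.chomp.split(';')
    key = fields[2]
    next unless key && key =~ /\AE\d+\z/  # skip lines without an E-number in field 3
    fields.delete_at(2)                   # drop the key from the stored fields
    groups[key] << fields.join(';')
  end
  groups
end

sample = [
  'AP850KP;INCLIB;E023889;AP013;240107;0730',
  'AP850SD$;INCLIB;E052337;AP013;240107;0730',
  'AP850SDI;INCLIB;E023889;AP013;240107;0730'
]
groups = group_rest_by_ticket(sample)
p groups['E023889']
#=> ["AP850KP;INCLIB;AP013;240107;0730", "AP850SDI;INCLIB;AP013;240107;0730"]
```

Unlike the \w+ regex, split does not care that a field such as AP850SD$ contains a non-word character.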

HTH,

Brian.

RebhanS_Gilbert · February 6, 2007, 3:55pm

Hi,

RebhanS_Gilbert · February 6, 2007, 4:30pm

On Feb 6, 7:32 am, “Rebhan, Gilbert” [email protected]
wrote:

i want to get a collection for every E followed by digits,
so with the example above, i want to get =

lines = DATA.readlines.map{ |line|
  line.chomp.split( ';' )
}
lookup = {}
lines.each{ |data|
  key = data.find{ |value| /^E/ =~ value }
  lookup[ key ] = data
}
p lookup[ "E050441" ]
#=> ["AP850SDS", "INCLIB", "E050441", "AP013", "240107", "0730"]
__END__
AP850KP;INCLIB;E023889;AP013;240107;0730
AP850SD$;INCLIB;E052337;AP013;240107;0730
AP850SDA;INCLIB;E050441;AP013;240107;0730
AP850SDI;INCLIB;E023889;AP013;240107;0730
AP850SDO;INCLIB;E052337;AP013;240107;0730
AP850SDS;INCLIB;E050441;AP013;240107;0730

RebhanS_Gilbert · February 6, 2007, 4:36pm

Gavin K. wrote:

On Feb 6, 7:32 am, “Rebhan, Gilbert” [email protected]
wrote:

i want to get a collection for every E followed by digits,
so with the example above, i want to get =

lines = DATA.readlines.map{ |line|
  line.chomp.split( ';' )
}
lookup = {}
lines.each{ |data|
  key = data.find{ |value| /^E/ =~ value }
  lookup[ key ] = data
}
p lookup[ "E050441" ]
#=> ["AP850SDS", "INCLIB", "E050441", "AP013", "240107", "0730"]
__END__
AP850KP;INCLIB;E023889;AP013;240107;0730
AP850SD$;INCLIB;E052337;AP013;240107;0730
AP850SDA;INCLIB;E050441;AP013;240107;0730
AP850SDI;INCLIB;E023889;AP013;240107;0730
AP850SDO;INCLIB;E052337;AP013;240107;0730
AP850SDS;INCLIB;E050441;AP013;240107;0730

I think he wants to append this array with information each time he sees
the same key, so modify your code like so:

lines = DATA.readlines.map{ |line|
  line.chomp.split( ';' )
}
lookup = {}
lines.each{ |data|
  key = data.find{ |value| /^E/ =~ value }
  lookup[ key ] ||= []
  lookup[ key ] << data
}

RebhanS_Gilbert · February 6, 2007, 4:53pm

On 2/6/07, Rebhan, Gilbert [email protected] wrote:

AP850SDI;INCLIB;E023889;AP013;240107;0730
E050441
…

each collection should contain datasets with the rest of the line, so
f.e.
E023889 would have =

[AP850KP;INCLIB;AP013;240107;0730,AP850SDI;AP013;240107;0730]

questions=
what kind of collection is the best ? is an array sufficient ?

Just for fun, here’s a Ruport example:

require "rubygems"
require "ruport"
DATA = <<-EOS
AP850KP;INCLIB;E023889;AP013;240107;0730
AP850SD$;INCLIB;E052337;AP013;240107;0730
AP850SDA;INCLIB;E050441;AP013;240107;0730
AP850SDI;INCLIB;E023889;AP013;240107;0730
AP850SDO;INCLIB;E052337;AP013;240107;0730
AP850SDS;INCLIB;E050441;AP013;240107;0730
EOS

table = Ruport::Data::Table.parse(DATA, :has_names => false,
                                  :csv_options=>{:col_sep=>";"})

table.column_names = %w[c1 c2 c3 c4 c5 c6] # BUG! you shouldn't need colnames

e = table.column(2).uniq
e.each { |x| table.create_group(x) { |r| r[2].eql?(x) } }

groups = table.groups

groups.attributes
#=> ["E023889", "E052337", "E050441"]

groups["E023889"].map { |r| r[0] }
#=> ["AP850KP", "AP850SDI"]

groups.each { |t| p t[0].c1 }
#=> "AP850KP"
#=> "AP850SD$"
#=> "AP850SDA"

===============

note that in making this example, I found a small bug in Ruport’s
grouping support which I will fix

RebhanS_Gilbert · February 6, 2007, 5:55pm

On Feb 6, 8:36 am, Drew O. [email protected] wrote:

lookup[ key ] << data

}

Curses, I didn’t read carefully enough. Right you are. (And, though
it’s not clear from his example, he might not even need to split the
original line into arrays of pieces, but just keep the lines.)

RebhanS_Gilbert · February 6, 2007, 6:00pm

On Feb 6, 8:36 am, Drew O. [email protected] wrote:

I think he wants to append this array with information each time he sees
the same key […]

So here’s another version:

lookup = Hash.new{ |h,k| h[k]=[] }

DATA.each_line{ |line|
  line.chomp!
  warn "No key in '#{line}'" unless key = line[ /\bE\w+/ ]
  lookup[ key ] << line
}

p lookup[ "E050441" ]
#=> ["AP850SDA;INCLIB;E050441;AP013;240107;0730",
#=>  "AP850SDS;INCLIB;E050441;AP013;240107;0730"]

require 'pp'
pp lookup
#=> {"E050441"=>
#=>   ["AP850SDA;INCLIB;E050441;AP013;240107;0730",
#=>    "AP850SDS;INCLIB;E050441;AP013;240107;0730"],
#=>  "E052337"=>
#=>   ["AP850SD$;INCLIB;E052337;AP013;240107;0730",
#=>    "AP850SDO;INCLIB;E052337;AP013;240107;0730"],
#=>  "E023889"=>
#=>   ["AP850KP;INCLIB;E023889;AP013;240107;0730",
#=>    "AP850SDI;INCLIB;E023889;AP013;240107;0730"]}

__END__
AP850KP;INCLIB;E023889;AP013;240107;0730
AP850SD$;INCLIB;E052337;AP013;240107;0730
AP850SDA;INCLIB;E050441;AP013;240107;0730
AP850SDI;INCLIB;E023889;AP013;240107;0730
AP850SDO;INCLIB;E052337;AP013;240107;0730
AP850SDS;INCLIB;E050441;AP013;240107;0730

RebhanS_Gilbert · February 7, 2007, 9:48am

Hi,

RebhanS_Gilbert · February 6, 2007, 4:14pm

On Tue, Feb 06, 2007 at 11:54:59PM +0900, Rebhan, Gilbert wrote:

that belong to the different E…
…
puts efas.inspect
*/

that gives me only one dataset in the hash, but there are more
entries that have E123456 in it.

I was just following your original example, which only kept the first
line for a particular E key.

If you want to keep them all, then I’d use a hash with each element
being an array.

 efas[$3] ||= []               # create empty array if necessary
 efas[$3] << [$1,$2,$4,$5,$6]  # add a new line

So, given the following input

aaa;bbb;E123;ddd;eee;fff
ggg;hhh;E123;iii;jjj;kkk

you should get

efas = {
  "E123" => [
    ["aaa","bbb","ddd","eee","fff"],
    ["ggg","hhh","iii","jjj","kkk"],
  ],
}

puts efas["E123"].size # 2
puts efas["E123"][0][3] # "eee"
puts efas["E123"][1][3] # "jjj"

In practice, to make it easier to manipulate this data, you’d probably
want to create a class to represent each object, rather than using a
5-element array.

You would give each attribute a sensible name. I don’t know what these
values mean, so I’ve just called them a to e here.

class Myclass
  attr_accessor :a, :b, :c, :d, :e
  def initialize(a, b, c, d, e)
    @a = a
    @b = b
    @c = c
    @d = d
    @e = e
  end
end

…
efas[$3] ||= []
efas[$3] << Myclass.new($1,$2,$4,$5,$6)
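
As a possible shortcut, Ruby's built-in Struct generates the same kind of accessor class in one line. The sketch below is only an illustration: the name Entry and the field names are guesses (borrowed from how the columns are used elsewhere in this thread), not something the file format defines.

```ruby
# Field names are assumptions for illustration; the file format does not name its columns.
Entry = Struct.new(:name, :folder, :user, :date, :time)

efas = Hash.new { |hash, key| hash[key] = [] }

line = 'AP850KP;INCLIB;E023889;AP013;240107;0730'
name, folder, ticket, user, date, time = line.split(';')
efas[ticket] << Entry.new(name, folder, user, date, time)

entry = efas['E023889'].first
puts entry.folder   # INCLIB
puts entry.date     # 240107
```

Struct also gives you ==, to_a and a readable inspect for free, which helps when debugging the hash contents.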

HTH,

Brian.

RebhanS_Gilbert · February 7, 2007, 11:29am

Hi,

RebhanS_Gilbert · February 7, 2007, 10:43am

On Wed, Feb 07, 2007 at 05:47:26PM +0900, Rebhan, Gilbert wrote:

AP540RBP;INCLIB;E052337;AP013;240107;0730

in the subfolder which is field 2
the format might look like
File.open(“mycsvfile”, “r”).each do |line|
if line =~ /(\w+.?);(\w+);(\w+);(\w+);(\w+);(\w+)/
     efas<<$3.to_s<<',' unless efas.include?($3.to_s)
i get an array with all ticketnr
then i create a folderstructure for every index in that array
and put the files in it, but i don’t get it.

Any ideas ?

I’d do all the work on-the-fly. Untested code:

require 'fileutils'
SRCDIR="/path_to_src"
DSTDIR="/path_to_dst"

def copy_ticket(filename, folder, ticket, user, date, time)
  srcdir = SRCDIR + File::SEPARATOR + folder
  dstdir = DSTDIR + File::SEPARATOR + ticket + File::SEPARATOR + folder
  FileUtils.mkdir_p(dstdir)
  FileUtils.cp(srcdir + File::SEPARATOR + filename,
               dstdir + File::SEPARATOR + filename)

  # write out status file (only once per ticket/folder)
  statusfile = dstdir + File::SEPARATOR + "status.txt"
  unless File.exist?(statusfile)   # File.exist? replaces the older exists? spelling
    File.open(statusfile, "w") do |sf|
      sf.puts "user=#{user}"
      sf.puts "date=#{date}"
      sf.puts "time=#{time}"
    end
  end
end

def process_meta(f)
  f.each_line do |line|
    # first field may contain non-word characters (e.g. AP850SD$), so match [^;]+ there
    next unless line =~ /^([^;]+);(\w+);(\w+);(\w+);(\w+);(\w+)$/
    copy_ticket($1,$2,$3,$4,$5,$6)
  end
end

# Main program

File.open("mycsvfile") do |f|
  process_meta(f)
end

If you want to build up a hash of ticket IDs seen, you can do that in
process_meta as well. I’d pass in an empty hash, and update it in the
each_line loop.
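
A rough sketch of that idea, kept separate from the copying so it can be tried on its own. The method name collect_tickets and the choice to store filenames per ticket are assumptions for illustration; StringIO stands in for the real file here.

```ruby
require 'stringio'

# Sketch: same loop shape as process_meta, but it only records what it sees.
# `seen` is an (initially empty) Hash supplied by the caller and updated in place.
def collect_tickets(io, seen)
  io.each_line do |line|
    fields = line.chomp.split(';')
    next unless fields.size == 6          # ignore malformed lines
    filename, _folder, ticket = fields
    (seen[ticket] ||= []) << filename
  end
  seen
end

tickets = {}
input = StringIO.new(
  "AP850KP;INCLIB;E023889;AP013;240107;0730\n" \
  "AP850SDA;INCLIB;E050441;AP013;240107;0730\n" \
  "AP850SDI;INCLIB;E023889;AP013;240107;0730\n"
)
collect_tickets(input, tickets)
p tickets['E023889']
#=> ["AP850KP", "AP850SDI"]
```

Because the hash is passed in, the caller can reuse it across several input files and inspect it afterwards.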

HTH,

Brian.

RebhanS_Gilbert · February 7, 2007, 3:31pm

On Wed, Feb 07, 2007 at 07:28:03PM +0900, Rebhan, Gilbert wrote:

srcdir = SRCDIR + File::SEPARATOR + folder
dstdir = DSTDIR + File::SEPARATOR + ticket + File::SEPARATOR + folder
filename=filename<<EXT
…

is there a better way ?

That’s OK, just beware that the way you’ve done it you’ve modified the
string which was passed in. e.g.

a="foobar"
copy_ticket(a, "/tmp", "E123", "x", "y", "z")
puts a

will print "foobar.txt"

To avoid that:

filename = filename + EXT

(which creates a new String object, and then updates the local variable
‘filename’ to point to this new object)

This is an interesting “small” file-chomping task. I wonder what the
equivalent Java program would look like

B.

RebhanS_Gilbert · February 7, 2007, 3:51pm

Hi,

filename=filename<<EXT
…

is there a better way ?

/*
That’s OK, just beware that the way you’ve done it you’ve modified the
string which was passed in. e.g.
…
*/

Yup, I know, but I read somewhere that
string concatenation via << is better/quicker than +
because no new String object gets created.
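
To make the trade-off concrete, here is a small sketch of both behaviours. The EXT value ".txt" is an assumption taken from the "foobar.txt" output quoted above, and the two method names are invented for the demonstration.

```ruby
EXT = ".txt"  # assumed value, inferred from the "foobar.txt" example

# Appends in place: no new String, but the caller's object changes too.
def append_in_place(filename)
  filename << EXT
end

# Builds a new String: one extra allocation, caller's object untouched.
def append_copy(filename)
  filename + EXT
end

a = String.new("foobar")   # String.new returns an unfrozen string
append_copy(a)
puts a                     # foobar
append_in_place(a)
puts a                     # foobar.txt

# Middle ground: copy once, then append to the private copy with <<
b = String.new("foobar")
c = b.dup << EXT
puts b                     # foobar
puts c                     # foobar.txt
```

For a single append per file the speed difference is unlikely to matter, so avoiding the surprise for the caller seems the safer default.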

Regards, Gilbert

RebhanS_Gilbert · February 7, 2007, 12:33pm

Hi,

RebhanS_Gilbert · February 7, 2007, 5:16pm

Just an idea…

gegroet,
Erik V. - http://www.erikveen.dds.nl/

hash =
File.open("input.txt") do |f|
  f.readlines.collect do |line|
    k = line.scan(/;(E\d+);/).flatten.shift
    v = line.scan(/;E\d+;(.*)/).flatten.shift

    [k, v]
  end.select do |k, v|
    k and v
  end.inject({}) do |h, (k, v)|
    (h[k] ||= []) << v ; h
  end.inject({}) do |h, (k, v)|
    h[k] = v.join(",") ; h
  end
end

p hash

RebhanS_Gilbert · February 7, 2007, 7:27pm

Nice abstraction… ;]

(By heart: This group_by is part of one of the Rails packages.)

gegroet,
Erik V. - http://www.erikveen.dds.nl/

module Enumerable
  def hash_by(&block)
    inject({}){|h, o| (h[block[o]] ||= []) << o ; h}
  end

  def group_by(&block)
    #hash_by(&block).values
    hash_by(&block).sort.transpose.pop
  end
end

p hash
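
For readers on a current Ruby: Enumerable#group_by is now part of the core library and returns a Hash (so it behaves like hash_by above, not like the array-returning group_by), which means the monkey patch is no longer needed and would shadow the built-in. A minimal sketch on the sample data from this thread:

```ruby
lines = [
  'AP850KP;INCLIB;E023889;AP013;240107;0730',
  'AP850SD$;INCLIB;E052337;AP013;240107;0730',
  'AP850SDA;INCLIB;E050441;AP013;240107;0730',
  'AP850SDI;INCLIB;E023889;AP013;240107;0730',
  'AP850SDO;INCLIB;E052337;AP013;240107;0730',
  'AP850SDS;INCLIB;E050441;AP013;240107;0730'
]

# Built-in group_by: the block's return value becomes the hash key.
# line[regex, 1] returns the first capture group, or nil if there is no match.
by_ticket = lines.group_by { |line| line[/;(E\d+);/, 1] }

p by_ticket.keys
#=> ["E023889", "E052337", "E050441"]
p by_ticket['E050441']
#=> ["AP850SDA;INCLIB;E050441;AP013;240107;0730", "AP850SDS;INCLIB;E050441;AP013;240107;0730"]
```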