Part of a script I can't figure out

Detlef_R · April 10, 2014, 12:45am

Hello,

This script reads a set of input files and extracts some data from the
file name.

The file names look like,

./f0/51.54_E900_50.02_E1100_19_ri_OA_f0_S3A_v1_36.35.1_ON_0.1lr.out.txt
./f1/50.26_E2500_56.10_E100_39_ri_OA_f1_S3A_v1_36.35.1_ON_0.1lr.out.txt
etc

One column is selected from each input file to be included in the
output.

Each input file has a series of columns with headers like E100, E200,
E300, etc. The script looks in the file name for the column to put in
the output. In the examples above, it would find f0=E900, f1=E2500. The
header name for the column to be output is the second field if the file
names are tokenized using _ as the delimiter. That’s how it would work
in awk anyway. In the output, the f0 and the column header are
concatenated into new headers that look like, f0_E900.
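
To make the field numbering concrete, here is a minimal sketch of that awk-style tokenizing in Ruby. The variable names are made up for illustration and none of this comes from the script itself.

```ruby
# Hypothetical illustration; not part of the original script.
path = './f0/51.54_E900_50.02_E1100_19_ri_OA_f0_S3A_v1_36.35.1_ON_0.1lr.out.txt'

prefix = File.basename(File.dirname(path))   # directory name, "f0"
fields = File.basename(path).split('_')      # tokenize the file name on "_"

second_field = fields[1]                     # "E900"
fourth_field = fields[3]                     # "E1100"

puts "#{prefix}_#{second_field}"             # f0_E900
puts "#{prefix}_#{fourth_field}"             # f0_E1100
```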

The way that the script is set up now, it finds the column to include in
the output from the second field. I need to change this so that it will
find the column specified in the 4th field. In the example above, that
would be f0=E1100, f1=E100.

This is the script (apparently this forum doesn’t use [code] tags):

#!/usr/bin/ruby

(prefixes,data)=Dir['f[0-9]/*.out.txt'].sort.map { |file|
  # Skip the first 8 lines; the next line is the header row, the rest are records.
  (colindex,*records)=File.readlines(file).drop(8).map { |l| l.chomp.split(/\t/) }
  # Map each header name to its column position.
  colindex=Hash[*colindex.each_with_index.to_a.flatten]
  records=records.sort_by { |r| r[colindex['CVorder']].to_i }.transpose
  # $1 = leading f0, f1, ... of the path; $2 = E value in the 2nd _-delimited field.
  file=~/\A(f.)\/\d+\.\d+_(E\d+)_/||raise
  [%w[CVorder Name RIexp],[$1,[$2,"#{$1}_#{$2}"]]].map { |g|
    g.map { |c|
      (k,n)=c.is_a?(Array) ? c : [c,c]
      [n,*records[colindex[k]]]
    }
  }
}.transpose
prefixes.all? { |x| x==prefixes[0] }||raise
print(data.transpose.inject(prefixes[0]) { |a,b| a+b }.transpose.map { |r| r.join("\t")+"\n" }.join)

What I need at the moment is to find the column in the 4th field instead
of the column in the 2nd field.

I think that this can be done by changing a line in the script from,

[%w[CVorder Name RIexp],[$1,[$2,"#{$1}_#{$2}"]]].map { |g|

to,

[%w[CVorder Name RIexp],[$1,[$2,"#{$1}_#{$4}"]]].map { |g|

That looks like it is getting the correct column, although I am unsure
whether it is really outputting a different one. The problem is that the
output no longer contains the header values (E1100, E100). The output is
supposed to have columns labeled f0_E1100, f1_E100, but with the change
to #{$4} the headers are just f0_, f1_ and the E*** value is missing.

I think that the E*** value is set up on the line,

file=~/\A(f.)\/\d+\.\d+_(E\d+)_/||raise

but that is even more Sanskrit than most sed and I can’t make it out.
This is the only place in the script that I see _ and E, so I guess this
must be it (he says with great conviction).

Can anyone translate this for me and let me know what changes I need to
make. I will need to use the script both ways, meaning that sometimes I
will need to find #{$2} and sometimes #{$4}. I will need to pass in an
argument to specify what I am trying to find (or have two separate
scripts, which is lame).

I am not actually sure that I am getting the correct columns with the
change from #{$2} to #{$4}. Overall, it is very unclear how this script
works and that is making it difficult to adjust for what I need now.

I have attached the script and a set of input files. The file
get_cols_and_sort_works_wrong-column.ruby is the original version that
does not output the columns I want. The file
get_cols_and_sort_works_no-headers.ruby has the change I mentioned
above.

The script is run as,
./get_cols_and_sort_works_wrong-column.ruby > out_wrong-column.txt
or
./get_cols_and_sort_works_no-headers.ruby > out_no-headers.txt

It would be a big help if someone can shed some light on this.

LMHmedchem

lmhmedchem · April 16, 2014, 12:05am

This Regular Expression:

/\A(f.)\/\d+\.\d+_(E\d+)_/

Only has 2 capture groups, so you will only find $1 and $2.
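
A quick way to check this is the sketch below. It uses the pattern in the form shown in the Rubular link, and one of the sample paths without the leading ./ (which is how Dir[] would return it). The variable names are only for illustration.

```ruby
file = 'f0/51.54_E900_50.02_E1100_19_ri_OA_f0_S3A_v1_36.35.1_ON_0.1lr.out.txt'

file =~ /\A(f.)\/\d+\.\d+_(E\d+)_/ or raise 'file name did not match'

p $1   # "f0"
p $2   # "E900"
p $4   # nil, because the pattern has no 4th group

# nil interpolates as an empty string, which is why the header loses its E value.
header = "#{$1}_#{$4}"
p header   # "f0_"
```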

This is demonstrated here: Rubular: ^(f.)\/\d+\.\d+_(E\d+)_

I changed “\A” to “^” to demonstrate against more than 1 filename.

If you want to find the column named in the 4th field, you need to
expand that Regexp to include the 4th field as a capture group.

In the above example I set it up as the 3rd capture group, so you’ll
find its value under $3
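
The attached file is not reproduced in this thread, so the following is only a sketch of what such an expanded pattern could look like. It assumes the 3rd field has the same digits.digits shape as the 1st; the exact pattern in the attachment may differ.

```ruby
# Assumed pattern: the 1st and 3rd fields look like 51.54, the 2nd and 4th like E900.
pattern = /\A(f.)\/\d+\.\d+_(E\d+)_\d+\.\d+_(E\d+)_/

files = %w[
  f0/51.54_E900_50.02_E1100_19_ri_OA_f0_S3A_v1_36.35.1_ON_0.1lr.out.txt
  f1/50.26_E2500_56.10_E100_39_ri_OA_f1_S3A_v1_36.35.1_ON_0.1lr.out.txt
]

found = files.map do |file|
  file =~ pattern or raise "unexpected file name: #{file}"
  [$1, $2, $3]   # prefix, 2nd field, 4th field
end

found.each do |prefix, second, fourth|
  puts "#{prefix}: 2nd field #{second}, 4th field #{fourth}"
end
```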

Now you modify the input to accept an argument:

offset = ARGV[0] =~ /offset/i

This will look for the word “offset”, and will return nil if it is not
there. This allows for a simple boolean check like this:

(offset ? $3:$2)

Using this you can call your script with or without the argument,
rendering different results:

test.rb > test.txt
test.rb offset > offset.txt
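
Put together, the selection could be sketched as below. The method name column_for is hypothetical, and the three-group pattern is the same assumption as before, not necessarily what the attached file contains.

```ruby
# Sketch only: column_for is a made-up helper, not part of the original script.
def column_for(file, argv)
  offset = argv[0] =~ /offset/i   # nil when the argument is absent
  file =~ /\A(f.)\/\d+\.\d+_(E\d+)_\d+\.\d+_(E\d+)_/ or raise "unexpected file name: #{file}"
  column = offset ? $3 : $2       # 4th field with "offset", 2nd field without
  [column, "#{$1}_#{column}"]     # [column to look up, header to print]
end

file = 'f0/51.54_E900_50.02_E1100_19_ri_OA_f0_S3A_v1_36.35.1_ON_0.1lr.out.txt'
p column_for(file, [])            # ["E900", "f0_E900"]
p column_for(file, ['offset'])    # ["E1100", "f0_E1100"]
```

Note that both the lookup key and the label have to switch together. In the [$2,"#{$1}_#{$4}"] attempt only the label changed, so the data would still have come from the $2 column.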

I’ve attached the updated file (you might want to rename the extension
back to “.ruby”), let me know if it works for you. It appears to have
worked for me.

I suggest that you comment code and write it out in full unless disk
space is a pressing requirement. Other people’s code golfing is often
the bane of my existence, and adding comments as you learn what each
line does will help you later, or the next poor guy who has to edit it.
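
As one small example of what "writing it out" might look like, here is the header-to-position line from the script next to a longer equivalent. The header row is hypothetical; the real one is the first line left after the script drops the first 8 lines of each input file.

```ruby
# Hypothetical header row, for illustration only.
header = %w[CVorder Name RIexp E100 E200]

# Golfed form used in the script:
golfed = Hash[*header.each_with_index.to_a.flatten]

# The same thing written out:
colindex = {}
header.each_with_index do |name, position|
  colindex[name] = position
end

p colindex == golfed   # true
p colindex['RIexp']    # 2
```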

For more information on Regexp, rubular.com is a great website to play
around with realtime pattern matching.