Select "columns" from multidimensional array?

virtuoso · February 12, 2013, 6:45pm

On Tue, Feb 12, 2013 at 2:06 PM, Joel P. wrote:

Robert K. wrote in post #1096452:

anybody can override header values or complete
rows / columns violating your class’s idea of internal state.

This class only gets added into scripts, and I’m the only one who knows
Ruby where I work, so I don’t really see this being a problem yet.

I think you misunderstand OO. The point of OO is to build
abstractions of real world phenomena which behave in certain ways.
That does not have to do with how many people use the code; proper OO
abstractions help you even if you are the only one working with the
code because they encapsulate specific behavior and you as a user of
classes do not need to worry any longer about internals. By thinking
in proper abstractions you actually make your life easier since the
system is easier to understand.

I can rip the headers off data and also read my code back later and see
what it’s doing.

I would actually rather have Row and Column as specific items which can
be asked for their header and iterate through all their values.
Example: Sample matrix with abstraction of rows and columns. This can certainly be improved, it's just to convey an idea. · GitHub

def filter( header, regex )
idx = self[0].index header
skip_headers { |xl| xl.select { |ar| ar[idx] =~ regex } }
end
That combines too much logic in one method IMHO. I’d rather select a
row based on header and then I would use #select on that.

In this case, I’m filtering the data like Excel does. This means I’m
keeping all of the columns, but only specific rows based on the values
in a single column.

Right, and as I said I’d rather make that two separate steps. That is
much more modular and hence reusable.

So in short, my question is how can I return my class type after using
Array’s methods on my child-class?
Do you mean as return value from #map and the like? Well, you can’t
without overriding all methods with this approach, I’m afraid. That’s
one of the reasons why this approach does not work well.
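
One way around this is to wrap an Array instead of inheriting from it, and to re-wrap the result of only those methods that should return your own type. The following is a minimal sketch under that assumption; `Sheet` and its methods are hypothetical names, not code from this thread.

```ruby
# Hypothetical wrapper: holds an Array instead of inheriting from it.
class Sheet
  include Enumerable

  def initialize(rows)
    @rows = rows
  end

  # Enumerable builds #map, #find, #count etc. on top of #each.
  def each(&block)
    return to_enum(:each) unless block
    @rows.each(&block)
    self
  end

  # Only the methods that should return a Sheet are re-wrapped explicitly.
  def select(&block)
    self.class.new(@rows.select(&block))
  end

  def to_a
    @rows.dup
  end
end

sheet = Sheet.new([[1, "a"], [2, "b"], [3, "c"]])
odd   = sheet.select { |row| row[0].odd? }
puts odd.class        # Sheet
puts odd.to_a.inspect # [[1, "a"], [3, "c"]]
```

The cost is that every method which should keep the type has to be written once, but the set is small and under your control.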

You’re probably thinking in terms of classes which are constantly active
as objects accessible to multiple users, whereas I’m just using this
class to make scripts easier to write.

See my initial statement: I think you are vastly underestimating the
value of OOP.

Cheers

robert

virtuoso · February 13, 2013, 6:50pm

On Tue, Feb 12, 2013 at 10:49 PM, Robert K. wrote:

On Tue, Feb 12, 2013 at 9:26 PM, Joel P. wrote:

Thanks for the advice and examples, I’ll see whether I can understand
how the classes and methods work with each other there and set about
experimenting with them.

I didn’t put commenting in the gist. If there’s anything unclear feel
free to ask.

Maybe one explanatory sentence: I chose to use two values for
addressing cells in the Matrix class I put on github. That is a
design decision. You may want to choose something else (e.g. “R3”)
but the main point of the example was this: it does make sense to turn
things you talk and reason about into classes (i.e. Matrix, Column,
Row). One can do that and still maintain connection, i.e. if you
write through a Row or Column instance the Matrix gets updated. That
way you can easily maintain consistency in the Matrix and yet present
the user abstractions which are more appropriate for his particular
use case (e.g. if you want to sum all values in a column you obtain
the column and then iterate through all values and sum them up).
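
The write-through idea can be illustrated with a much smaller example than the gist. This is only a sketch: `Grid`, `ColumnView` and `rows_in` are hypothetical names, and the storage layout is an assumption made for brevity.

```ruby
# Hypothetical, simplified classes; not the code from the gist.
class Grid
  def initialize
    @cells = {}
  end

  def [](row, col)
    @cells[[row, col]]
  end

  def []=(row, col, value)
    @cells[[row, col]] = value
  end

  def column(col)
    ColumnView.new(self, col)
  end

  # Row keys that have a cell in the given column, in insertion order.
  def rows_in(col)
    @cells.keys.select { |_row, c| c == col }.map(&:first)
  end
end

# A view holds no data of its own; it reads and writes through the grid.
class ColumnView
  include Enumerable

  def initialize(grid, col)
    @grid = grid
    @col = col
  end

  def []=(row, value)
    @grid[row, @col] = value
  end

  def each
    return to_enum(:each) unless block_given?
    @grid.rows_in(@col).each { |row| yield @grid[row, @col] }
    self
  end
end

grid = Grid.new
grid[1, :a] = 10
grid[2, :a] = 5
col = grid.column(:a)
col[3] = 7               # writing through the view updates the grid
puts grid[3, :a]         # 7
puts col.inject(:+)      # 22
```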

Kind regards

robert

virtuoso · February 12, 2013, 10:52pm

On Tue, Feb 12, 2013 at 9:26 PM, Joel P. wrote:

Thanks for the advice and examples, I’ll see whether I can understand
how the classes and methods work with each other there and set about
experimenting with them.

I didn’t put commenting in the gist. If there’s anything unclear feel
free to ask.

One thing which put me off generating a custom class “from scratch” is
that Array appears to be equal to its content (I assume this is a
language shortcut), but it seems “custom” objects’ values have to be
accessed via their accessors.
I was hoping for some more succinct syntax than this sort of thing:
puts [] #Array is so easy to create
puts CustomObject.new([]).value #This looks clunky next to that

You can get quite close, for example you can do

def M(*a)
YourCustomMatrix.new(a)
end

use

M(1,2,3,4)

or

M = Object.new
def M.[](*a)
  YourCustomMatrix.new(a)
end

use

M[1,2,3,4]

I’d love to get accustomed to proper OO thinking, but I’ll inevitably make
all the rookie mistakes in the process.

Yes, it will take time. Mistakes are what you will learn from. Given
that, I should probably shut up and let you make your personal
mistakes.

It’s a lot to get used to all at
once given that I’ve been using Ruby for less than a year, and I have no
training other than helpful hints and googling. Thanks again for your
patience.

You’re welcome!

Kind regards

robert

virtuoso · February 14, 2013, 11:48pm

Interesting Matrix build. It’s giving me a bit of a headache just trying
to figure out the links involved.

So MatrixPart defines the methods and the “parent” matrix (held as an
instance variable); and row and column both use these methods and both
access the variable which points to the matrix they’re part of.

The rows and columns can be selected based on given headers, and each
will reference the other… and this is where my head explodes:

def index( row, col )
  @row_headers.index( row ) * @col_headers.size + @col_headers.index( col )
end

It takes a bit of getting used to, but thanks to Ruby’s flexible array
class adding nil values automatically when you specify an index higher
than the upper boundary, that works.
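
To make the arithmetic concrete, here is a small standalone sketch of the same row-major calculation. The constants and the method name `flat_index` are made up for the illustration; only the formula comes from the method quoted above.

```ruby
ROW_HEADERS = %w[r1 r2 r3]
COL_HEADERS = %w[a b]

# Position of a cell inside one flat Array: all of row 1, then row 2, ...
def flat_index(row, col)
  ROW_HEADERS.index(row) * COL_HEADERS.size + COL_HEADERS.index(col)
end

data = []
data[flat_index("r3", "b")] = 42   # 2 * 2 + 1 = 5

p flat_index("r1", "a")  # 0
p data                   # [nil, nil, nil, nil, nil, 42]
```

Assigning past the end of the Array fills the gap with nil, which is why no cell has to be initialised up front.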

I guess with a bit more poking and prodding I could figure out how to
append, insert, and delete rows and columns. After all, it’s only a math
problem in the end. All the interconnected references (especially the
layered yields) still make my head spin though

virtuoso · February 13, 2013, 8:47pm

I haven’t had a chance to look into your example yet; I’ve been reading
up on OOP.
I intend to take the ideas I’ve been coming up with for ease-of-use
within the Array class and use those, your Matrix example, and whatever
else occurs to me to form a new set of classes which can handle my data
and the operations I regularly need to perform. Then it’s time to play
with scenarios and see what happens.

virtuoso · February 15, 2013, 3:41pm

On Thu, Feb 14, 2013 at 11:50 PM, Joel P. wrote:

Interesting Matrix build. It’s giving me a bit of a headache just trying
to figure out the links involved.

So MatrixPart defines the methods and the “parent” matrix (held as an
instance variable); and row and column both use these methods and both
access the variable which points to the matrix they’re part of.

Yes, Row and Column are a facade to the “real” data and provide a
different interface to it which presents a different abstraction:
while the Matrix has two dimensions a Row and a Column only have one.

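It takes a bit of getting used to, but thanks to Ruby’s flexible array
class adding nil values automatically when you specify an index higher
than the upper boundary, that works.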
Since you obviously understood the method now I am not sure why you
say your head explodes over this piece of code.

Btw, with a small change you can change storage of data from an Array
to a Hash making the Matrix class better suited for sparse matrices.
And here comes an important aspect of that implementation: only the
Matrix class had to change, there was absolutely no change necessary
for the other three classes! This shows how Matrix’s API isolated
client code from inner workings of this class. This is what OO is
about.
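
As a rough illustration of that point (not the code from the gist revision), a Hash keyed by the pair of coordinates keeps the same `[]` and `[]=` interface while storing only the cells that were actually assigned. `SparseGrid` is a hypothetical name.

```ruby
# Hypothetical sparse variant: only assigned cells take up space.
class SparseGrid
  def initialize
    @data = {}
  end

  def [](row, col)
    @data[[row, col]]
  end

  def []=(row, col, value)
    @data[[row, col]] = value
  end

  def size
    @data.size
  end
end

g = SparseGrid.new
g[1_000, 1_000] = :far_away
p g[1_000, 1_000]  # :far_away
p g[1, 1]          # nil
p g.size           # 1
```

Code that only calls `[]` and `[]=` cannot tell which storage is behind them.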

https://gist.github.com/rklemme/4771651/revisions

matrix.rb

#!/usr/bin/ruby

class Matrix

  def initialize(row_headers, col_headers)
    @row_headers = row_headers
    @col_headers = col_headers
  end  

  def [](row, col)


I guess with a bit more poking and prodding I could figure out how to
append, insert, and delete rows and columns. After all, it’s only a math
problem in the end. All the interconnected references (especially the
layered yields) still make my head spin though

You’ll get used to that - and with a bit of oil the squeaking goes
away as well.

Kind regards

robert

virtuoso · February 15, 2013, 3:55pm

Hah, I wrote that head exploding comment first and then managed to work
out what it did afterwards. Still took a few minutes of smashing my head
into the desk to make room for the new thought though ;¬)

Using a Hash sounds like a good idea. I already tried rewriting the
selector into something a bit more excel-like (although I won’t bore you
with all the little changes):

def []( addr )
  col, row = addr.upcase.scan( /([A-Z]+)(\d+)/ ).flatten
  data[ index( row, col ) ]
end

def []=( addr, val )
  col, row = addr.upcase.scan( /([A-Z]+)(\d+)/ ).flatten
  data[ index( row, col ) ] = val
end

m = Matrix.new(%w{A B C}, %w{1 2 3 4})
m["A1"] = 123
m["B4"] = 123

I haven’t gotten around to changing all the “row, col” to “col, row”
references, so it looks a bit weird, but I’m just experimenting with
options at the moment. I’ll have a go at Hashing it up as well.

Naturally I have many questions floating around in my head, but I’ll try
to work them out through the scientific method of repeated failed
attempts

virtuoso · February 17, 2013, 9:14pm

I’ve attached my attempt at converting your code to suit mine (hope you
don’t mind the plagiarism).
I have a list of some of my plans to add functionality at the top, and
I’ve rewritten your test at the bottom to suit the new options.

I’d be interested to know whether there are any things I’m doing
drastically wrong… I think the rows? and columns? might be able to be
done more succinctly, for example.

virtuoso · February 15, 2013, 6:21pm

On Fri, Feb 15, 2013 at 3:55 PM, Joel P. wrote:

Hah, I wrote that head exploding comment first and then managed to work
out what it did afterwards. Still took a few minutes of smashing my head
into the desk to make room for the new thought though

LOL

Using a Hash sounds like a good idea. I already tried rewriting the
selector into something a bit more excel-like (although I won’t bore you
with all the little changes):

Good!

m = Matrix.new(%w{A B C}, %w{1 2 3 4})
I would probably not initialize the matrix then. Excel is dynamic as
well. So I’d start with a blank slate and only remember a max value
for row and column.

m["A1"] = 123
m["B4"] = 123

I haven’t gotten around to changing all the “row, col” to “col, row”
references, so it looks a bit weird, but I’m just experimenting with
options at the moment. I’ll have a go at Hashing it up as well.

Good. With Ruby, these types of experiments are so much fun.

Naturally I have many questions floating around in my head, but I’ll try
to work them out through the scientific method of repeated failed
attempts

Actually that’s probably the best method to learn: we learn much more
through our failures than from our successes.

Kind regards

robert

virtuoso · February 17, 2013, 9:30pm

Oopsie, this:
data.keys.map { |k| k[/\d+/] }.max.to_i
should be this:
data.keys.map { |k| k[/\d+/].to_i }.max

Seeing mistakes already

virtuoso · February 17, 2013, 9:19pm

Robert K. wrote in post #1097127:

On Fri, Feb 15, 2013 at 3:55 PM, Joel P. wrote:

Below is a valuable statement.

Actually that’s probably the best method to learn: we learn much more
through our failures than from our successes.

Kind regards

robert

virtuoso · February 17, 2013, 10:29pm

On Sun, Feb 17, 2013 at 9:14 PM, Joel P. wrote:

I’ve attached my attempt at converting your code to suit mine (hope you
don’t mind the plagiarism).

No, it’s not a doctoral thesis. (allusion to German politics)

I have a list of some of my plans to add functionality at the top, and
I’ve rewritten your test at the bottom to suit the new options.

I’d be interested to know whether there are any things I’m doing
drastically wrong… I think the rows? and columns? might be able to be
done more succinctly, for example.

First of all their names seem weird. Methods with a question mark at
the end are usually intended to be used in a boolean context. But
rows? and columns? return the row and column counts. Plus, they do
not gracefully deal with an empty Matrix. I guess for everyday use it
will be more efficient to store max values in two members because you
will need them often for iteration.

Method #array could use Column’s #map method. If you rewrite
iteration methods like this

def each_row
  return to_enum(:each_row) unless block_given?

  (1..rows?).each do |idx|
    yield row( idx )
  end
  self
end

Then you can even do

def array
each_row.map do |row|
row.to_a
end
end

And I would consider changing []= to actually remove the entry if the
value is nil. (Excel does it differently, I believe. It will not
reduce the area it considers “used” when emptying the last cell in the
last column or row.)

Ah, and I would add a validity check for addresses - otherwise one can
get any String into the Hash as key - even things which are not valid
addresses.
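
Both of these suggestions (dropping the entry on nil and rejecting invalid addresses) could look roughly like the sketch below. `CellStore`, the `ADDRESS` pattern and `normalize` are hypothetical; the pattern assumes addresses of the form letters followed by a row number starting at 1.

```ruby
class CellStore
  ADDRESS = /\A([A-Z]+)([1-9]\d*)\z/   # e.g. "A1", "BC42"

  def initialize
    @data = {}
  end

  def [](addr)
    @data[normalize(addr)]
  end

  # Assigning nil removes the entry instead of storing a nil value.
  def []=(addr, value)
    key = normalize(addr)
    if value.nil?
      @data.delete(key)
    else
      @data[key] = value
    end
  end

  def size
    @data.size
  end

  private

  # Upcases the address and refuses anything that is not a cell address.
  def normalize(addr)
    key = addr.to_s.upcase
    fail ArgumentError, "invalid address: #{addr.inspect}" unless key =~ ADDRESS
    key
  end
end

store = CellStore.new
store["a1"] = 123
p store["A1"]   # 123
store["A1"] = nil
p store.size    # 0
```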

But generally I think you get the hang of it.

Kind regards

robert

virtuoso · February 18, 2013, 1:05pm

Nice tips! Thanks for the help again.

I had no idea how to use to_enum, I’ll have to read up on that. I’ve
done all the Ruby courses I could find at Codecademy which filled in a
few gaps I had in my knowledge. I’m still reading the Book of Ruby as
well.

Hopefully this one is more stable:

I’ve decided to leave the “Matrix” class name alone in case I need to
use it within the same scope later. I’ve renamed this “RubyExcel” for
want of a better term.

I fixed all the things you mentioned (I think).

I’ve added the ability to upload a multidimensional array into the data.
It carries the option to overwrite or append as a switch.

I set the reference list of column references to a Constant.

I’ve removed “array” and added “to_a” and “to_s”.

I’ve added “find” to return a “cell address” when given a value

I still have a long list of things I want to add, and I’m sure I’ll
think of more. I’m surprised I haven’t found anything equivalent out
there, to be honest. Maybe all the real pros are using databases to
parse their output

virtuoso · February 18, 2013, 2:42pm

Nice link. I agree with the sentiment there, and I’ll think more
carefully about using boolean switches in future.
I’ve split that method into “load” and “append”, each passing arguments
to private “import_data”.
I added the rescue when I realised the method was returning the number
of rows and I wanted it to return success or failure as a boolean; I
forgot it was catching my exceptions as well. Now it’s true or
exception.
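
For readers following along, the split described here might look something like this. It is a guess at the shape only: `DataTable` and the body of `import_data` are hypothetical, since the attached file is not part of the thread.

```ruby
class DataTable
  def initialize
    @rows = []
  end

  # Replace whatever is there with the given rows.
  def load(rows)
    import_data(rows, false)
  end

  # Keep the existing rows and add the given ones at the end.
  def append(rows)
    import_data(rows, true)
  end

  def to_a
    @rows.map(&:dup)
  end

  private

  # Shared work; raises instead of rescuing so callers see what went wrong.
  def import_data(rows, keep_existing)
    unless rows.is_a?(Array) && rows.all? { |r| r.is_a?(Array) }
      fail ArgumentError, "expected an Array of Arrays"
    end
    @rows = [] unless keep_existing
    @rows.concat(rows.map(&:dup))
    true
  end
end

table = DataTable.new
table.load([[1, 2]])
table.append([[3, 4]])
p table.to_a   # [[1, 2], [3, 4]]
```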

I do use switches occasionally, here’s one example where I think it’s
justified (from my older Excel_Sheet<Array class):

def filter( header, regex, switch=true )
  fail ArgumentError, "#{regex} is not valid Regexp" unless regex.class == Regexp
  idx = self[0].index header
  fail ArgumentError, "#{header} is not a valid header" if idx.nil?
  operator = ( switch ? :=~ : :!~ )
  Excel_Sheet.new skip_headers { |xl| xl.select { |ar| ar[idx].send( operator, regex ) } }
end

Mostly I just did that because I was learning how to use symbols, but it
makes the Regex more flexible with the minimum amount of repetition or
long-winded “if” statements.

virtuoso · February 18, 2013, 3:54pm

On Mon, Feb 18, 2013 at 2:43 PM, Joel P. wrote:

Excel_Sheet.new skip_headers { |xl| xl.select { |ar| ar[idx].send( operator, regex ) } }
end

Mostly I just did that because I was learning how to use symbols, but it
makes the Regex more flexible with the minimum amount of repetition or
long-winded “if” statements.

I think there are better solutions - in order of increasing quality:

1. Rename the parameter from “switch” to “negate”.
2. Pass the operator symbol directly.
3. Add a method Regexp#negate which will return an instance which has
   matching logic reversed and remove the parameter.
4. yield ar[idx] (i.e. delegate the decision to the block passed in)
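
A sketch of the last option, where the block decides whether a row stays. `HeaderTable` is a hypothetical stand-in for the sheet class, assuming the first row holds the headers.

```ruby
# Hypothetical table: the first row is the header, the rest is data.
class HeaderTable
  def initialize(rows)
    @rows = rows
  end

  # Keeps the header plus every data row for which the block returns
  # a truthy value for the cell under the given header.
  def filter(header)
    idx = @rows[0].index(header)
    fail ArgumentError, "#{header} is not a valid header" if idx.nil?
    kept = @rows.drop(1).select { |row| yield row[idx] }
    self.class.new([@rows[0]] + kept)
  end

  def to_a
    @rows
  end
end

t = HeaderTable.new([
  %w[Account Type],
  %w[P100 Large],
  %w[P200 Small],
  %w[Q300 Large]
])

p t.filter("Account") { |v| v =~ /^P/ }.filter("Type") { |v| v !~ /^Large/ }.to_a
# [["Account", "Type"], ["P200", "Small"]]
```

Negation then needs no extra parameter; the caller simply writes `!~` in the block.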

I am still not fond of the #skip_headers approach. As far as I can
see you want to copy the sheet into a new one and include only cells
whose content matches a regular expression (or maybe other filter
criteria). I don’t see the rest of your current version of the
implementation but the line with self[0] somehow looks strange.

Kind regards

robert

virtuoso · February 18, 2013, 1:56pm

On Mon, Feb 18, 2013 at 1:05 PM, Joel P. wrote:

Nice tips! Thanks for the help again.

You’re welcome.

I had no idea how to use to_enum, I’ll have to read up on that. I’ve
done all the Ruby courses I could find at Codecademy which filled in a
few gaps I had in my knowledge. I’m still reading the Book of Ruby as
well.

Basically that method stores self and method name along with arguments
given in a new object which implements #each as delegation to the
method stored and also includes Enumerable. You can cook your own for
learning purposes

E = Struct.new :obj, :method, :arguments do
include Enumerable

def each(&b)
obj.send(method, *arguments, &b)
self
end
end

irb(main):017:0> s = "foo bar"
=> "foo bar"
irb(main):018:0> e = E.new s, :each_char
=> #<struct E obj="foo bar", method=:each_char, arguments=nil>
irb(main):019:0> e.map {|x| "<#{x.inspect}>"}
=> ["<\"f\">", "<\"o\">", "<\"o\">", "<\" \">", "<\"b\">", "<\"a\">", "<\"r\">"]

Note that #map is defined in Enumerable

I’ve decided to leave the “Matrix” class name alone in case I need to
use it within the same scope later. I’ve renamed this “RubyExcel” for
want of a better term.

+1

I’ve added the ability to upload a multidimensional array into the data.
It carries the option to overwrite or append as a switch.

Two remarks:

1. don’t catch the exception inside the method
2. better create two methods - one for overwrite and one for append
   (even if they internally delegate common work to another method)

This will give you a clearer API. See here for more reasoning:

I’ve added “find” to return a “cell address” when given a value

You may also want to add #select etc. Or you include Enumerable and
implement #each as delegation to the Hash’s #each and get #select and
all others for free.
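
A minimal sketch of the Enumerable route, assuming the cells live in a Hash keyed by address. `CellHash` is a hypothetical name.

```ruby
class CellHash
  include Enumerable

  def initialize
    @data = {}
  end

  def []=(addr, value)
    @data[addr] = value
  end

  # Delegates to Hash#each, so the block receives [address, value] pairs.
  def each(&block)
    return to_enum(:each) unless block
    @data.each(&block)
    self
  end
end

cells = CellHash.new
cells["A1"] = 5
cells["B2"] = 50
cells["C3"] = 5

p cells.select { |addr, value| value == 5 }.map(&:first)  # ["A1", "C3"]
p cells.find { |addr, value| value == 50 }                # ["B2", 50]
```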

I still have a long list of things I want to add, and I’m sure I’ll
think of more. I’m surprised I haven’t found anything equivalent out
there, to be honest. Maybe all the real pros are using databases to
parse their output

Kind regards

robert

virtuoso · February 18, 2013, 5:17pm

On Mon, Feb 18, 2013 at 4:50 PM, Joel P. wrote:

I went with “filter” with an optional true/false regex switch because it
seemed like the simplest way to use it, and closest to my own experience
in using Excel’s filters.
Passing the symbol feels less intuitive, and yielding to a block means
writing more code, particularly when I’m writing a quick method chain.
The notation I set up feels natural to me when chaining criteria. For
example I can just do this:
data.filter( 'Account', /^P/ ).filter( 'Type', /^Large/, false )

What do you mean by “writing more code”? It is as short as

data.filter( 'Account', /^P/ ).filter( 'Type' ) {|x| /^Large/ =~ x}

You could even implement Regexp#to_proc like this

class Regexp
def to_proc
lambda {|s| self =~ s}
end
end

and then do

data.filter( 'Account', /^P/ ).filter( 'Type', &/^Large/)

If I only want to keep Parts of “Type1” and “Type3” then I could use
“select” and some Regex, but I might pick up the Header as well if I’m
not careful.
Using a method like “skip_headers” allows me to select or reject
elements of the data without losing the identifiers in the first row,
which I’m almost always going to need at the end when I output the data
into human-readable format.

But wouldn’t you want to make the decision what is a header and what
not more flexible? Possible criteria that come to mind are

first n lines / columns
first lines / columns where all values match a particular regexp
any line or column where all values match a particular regexp

I’m also dealing with entire rows rather than individual cells, and
since the source data can change its content and order, using the
headers to identify the data source for a given operation is essential.
Using skip_headers both allows me to preserve them while sorting through
data, and also puts them back on again for the next time I need to
reference them.

So basically you want a view on the data which omits a few rows and
columns. Given that there are so many potential criteria I’d probably
pass in one object implementing === as a row header detector and one
as a column header detector. Since Proc implements === as call you
can also easily provide a lambda there. The argument to === would be
the Column or Row instance, respectively, so the position as well as the cell
contents can be evaluated to decide whether something constitutes a
header row / column. Once you have that in place you could create
convenience methods using one of the criteria mentioned above. Just a
few thoughts.
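
The detector idea could be sketched like this. `Line`, `split_headers` and the two lambdas are hypothetical names for the illustration; the only language feature relied on is that Proc#=== calls the proc with its argument.

```ruby
# Hypothetical row object carrying its position and its cell values.
Line = Struct.new(:position, :cells)

# Splits lines into [headers, data]; the detector only has to respond to ===.
def split_headers(lines, detector)
  lines.partition { |line| detector === line }
end

# Two possible criteria: by position, or by cell contents.
first_line = lambda { |line| line.position == 1 }
no_digits  = lambda { |line| line.cells.none? { |v| v =~ /\d/ } }

lines = [
  Line.new(1, %w[Type Flag Unique_ID]),
  Line.new(2, %w[Type1 1 A001]),
  Line.new(3, %w[Type2 0 A002])
]

headers, data = split_headers(lines, first_line)
p headers.map(&:position)  # [1]
p data.map(&:position)     # [2, 3]

headers, _data = split_headers(lines, no_digits)
p headers.map(&:position)  # [1]
```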

Kind regards

robert

virtuoso · February 18, 2013, 4:50pm

I went with “filter” with an optional true/false regex switch because it
seemed like the simplest way to use it, and closest to my own experience
in using Excel’s filters.
Passing the symbol feels less intuitive, and yielding to a block means
writing more code, particularly when I’m writing a quick method chain.
The notation I set up feels natural to me when chaining criteria. For
example I can just do this:
data.filter( 'Account', /^P/ ).filter( 'Type', /^Large/, false )

Regarding the usage of skip_headers
Say I have this data:

Type Flag Unique_ID
Type1 1 A001
Type2 0 A002
Type1 0 A003
Type3 1 A004
Type1 1 A005

If I only want to keep Parts of “Type1” and “Type3” then I could use
“select” and some Regex, but I might pick up the Header as well if I’m
not careful.
Using a method like “skip_headers” allows me to select or reject
elements of the data without losing the identifiers in the first row,
which I’m almost always going to need at the end when I output the data
into human-readable format.
I’m also dealing with entire rows rather than individual cells, and
since the source data can change its content and order, using the
headers to identify the data source for a given operation is essential.
Using skip_headers both allows me to preserve them while sorting through
data, and also puts them back on again for the next time I need to
reference them.
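
For readers who have not seen the class, a method with this behaviour might be written roughly as follows. This is a re-creation under assumptions: `SheetWithHeader` is a hypothetical name, and unlike the original (which wraps the result in Excel_Sheet.new at the call site) this version returns an instance of its own class.

```ruby
# Hypothetical re-creation; the original Excel_Sheet class is not shown here.
class SheetWithHeader < Array
  # Yields the data rows only, then puts the header row back on top
  # of whatever the block returns.
  def skip_headers
    header = first
    body = yield(drop(1))
    self.class.new([header] + body.to_a)
  end
end

sheet = SheetWithHeader.new([
  %w[Type Flag Unique_ID],
  %w[Type1 1 A001],
  %w[Type2 0 A002],
  %w[Type1 0 A003],
  %w[Type3 1 A004],
  %w[Type1 1 A005]
])

kept = sheet.skip_headers { |rows| rows.select { |r| r[0] =~ /\AType[13]\z/ } }
p kept.first             # ["Type", "Flag", "Unique_ID"]
p kept.map { |r| r[2] }  # ["Unique_ID", "A001", "A003", "A004", "A005"]
```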

virtuoso · February 18, 2013, 8:41pm

On Mon, Feb 18, 2013 at 6:23 PM, Joel P. wrote:

That Regexp to proc idea looks good. I could use proc form for a
positive match and a normal block for the negative. I’ll see if I can
get something like this working when I write filter method for
RubyExcel.

Using the new class I can implement something like skip_headers by
passing a starting value to “rows” or “columns”. This makes it more
flexible as well. I’ve rewritten those iterators using optional start
and end points:

Actually the even more general concept is filter anything: one might
not only want to skip headers but general rows or columns.

def rows( start_row = 1, end_row = maxrow )
  fail TypeError, 'Data is empty' if maxrow == 0
  fail ArgumentError, 'The starting row must be less than the maximum row' if maxrow < start_row
  return to_enum(:rows) unless block_given?

You need to pass arguments start_row and end_row here as well!

return to_enum(:rows, start_row, end_row) unless block_given?
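
The difference is easy to see in a stripped-down version. `RowList` is a hypothetical class that keeps rows in a plain Array; only the to_enum line mirrors the correction above.

```ruby
class RowList
  def initialize(rows)
    @rows = rows
  end

  def maxrow
    @rows.size
  end

  # 1-based, inclusive range of rows.
  def rows(start_row = 1, end_row = maxrow)
    # Forward the arguments, otherwise the enumerator restarts at row 1.
    return to_enum(:rows, start_row, end_row) unless block_given?
    (start_row..end_row).each { |idx| yield @rows[idx - 1] }
    self
  end
end

list = RowList.new([%w[Header], %w[a], %w[b]])
p list.rows(2).to_a   # [["a"], ["b"]]
p list.rows.count     # 3
```

With `to_enum(:rows)` alone, `list.rows(2).to_a` would include the header row again, because the enumerator calls `rows` with the default arguments.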

Plus it means I can do “rows.count”, which is the same as VBA syntax.

I vaguely understand the idea of passing something in to compare to a
header type. I’m not sure how I’d implement it though, since the only
headers I ever deal with are row 1, and they tend to look pretty similar
to the data itself.

Well, then we should probably add a method #index to Row and Column
which returns the numeric index. That makes checking whether it’s the
n’th row / column easy. Basically for one of the two the method is
just an alias (though not if there is a base-0 versus base-1 difference).

Kind regards

robert

virtuoso · February 18, 2013, 6:23pm

That Regexp to proc idea looks good. I could use proc form for a
positive match and a normal block for the negative. I’ll see if I can
get something like this working when I write filter method for
RubyExcel.

Using the new class I can implement something like skip_headers by
passing a starting value to “rows” or “columns”. This makes it more
flexible as well. I’ve rewritten those iterators using optional start
and end points:

def rows( start_row = 1, end_row = maxrow )
  fail TypeError, 'Data is empty' if maxrow == 0
  fail ArgumentError, 'The starting row must be less than the maximum row' if maxrow < start_row
  return to_enum(:rows) unless block_given?
  ( start_row..end_row ).each do |idx|
    yield row( idx )
  end
  self
end

Now I can use rows(2) to skip the headers if necessary. It might be a
bit confusing when rows(1) actually returns from 1 to the end, but I’ve
already got row(1) for that purpose and it makes it shorter to iterate
through all of them. Plus it means I can do “rows.count”, which is the
same as VBA syntax.

I vaguely understand the idea of passing something in to compare to a
header type. I’m not sure how I’d implement it though, since the only
headers I ever deal with are row 1, and they tend to look pretty similar
to the data itself.