There's probably a simpler answer to this than the ways I've come up
with.
What's the best way to select columns from a two-dimensional array?
I build arrays to match excel-style formatting, like this but larger:
__________________
a = [
[ 'A1', 'A2', 'A3' ],
[ 'B1', 'B2', 'B3' ],
[ 'C1', 'C2', 'C3' ]
]
def get_cols (multi_array, headers )
indices = []
headers.each { |val| indices << multi_array[0].index(val) }
indices.compact!
multi_array.map do |ar|
indices.map { |idx| ar[idx] }
end
end
get_cols a, %w(A1 A3)
=> [["A1", "A3"], ["B1", "B3"], ["C1", "C3"]]
__________________
I haven't been able to work out a way to do this without writing
long-winded code. Is there a simple solution?
Thanks.
on 2013-01-31 10:45
on 2013-01-31 11:26
On Thu, Jan 31, 2013 at 10:46 AM, Joel Pearson <lists@ruby-forum.com> wrote: > [ 'C1', 'C2', 'C3' ] > end > > end > > get_cols a, %w(A1 A3) > > => [["A1", "A3"], ["B1", "B3"], ["C1", "C3"]] Is this the desired output? For getting the columns filtering the ones that have a header: def get_cols(multi_array, headers) multi_array.transpose.select {|(header,_)| headers.include? header} end 1.9.2p290 :001 > a = [ 1.9.2p290 :002 > [ 'A1', 'A2', 'A3' ], 1.9.2p290 :003 > [ 'B1', 'B2', 'B3' ], 1.9.2p290 :004 > [ 'C1', 'C2', 'C3' ] 1.9.2p290 :005?> ] 1.9.2p290 :008 > def get_cols(multi_array, headers) 1.9.2p290 :009?> multi_array.transpose.select {|(header,_)| headers.include? header} 1.9.2p290 :010?> end => nil 1.9.2p290 :011 > get_cols a, %w(A1 A3) => [["A1", "B1", "C1"], ["A3", "B3", "C3"]] Jesus.
on 2013-01-31 11:34
Praise Jesus! My initial output was the desired one, but the "transpose" method is what I was missing. Thanks :) I didn't know about that "|header,_|" trick either, very nice.
on 2013-01-31 13:08
Documenting the end result.
I added a sort to keep the input column order, and reordered the inputs
to make the array itself optional for ease-of-use within the parent
class.
def get_cols( headers, multi_array=nil)
multi_array = @data if multi_array.nil?
multi_array.transpose.select { |header,_| headers.include?(header)
}.sort_by { |header,_| headers.index(header) || headers.length
}.transpose
end
irb(main):017:0> get_cols %w(A3 A1), a
=> [["A3", "A1"], ["B3", "B1"], ["C3", "C1"]]
on 2013-01-31 13:12
On Thu, Jan 31, 2013 at 11:34 AM, Joel Pearson <lists@ruby-forum.com> wrote: > Praise Jesus! > > My initial output was the desired one, but the "transpose" method is > what I was missing. Thanks :) I'd really start by creating a class for this - or use Matrix from the standard library. irb(main):008:0> m = Matrix[[1,2,3],[4,5,6]] => Matrix[[1, 2, 3], [4, 5, 6]] irb(main):009:0> m.row 1 => Vector[4, 5, 6] irb(main):010:0> m.column 1 => Vector[2, 5] Kind regards robert
on 2013-01-31 15:14
Thanks robert My current approach is HTML Table -> Nokogiri Nodeset -> Multidimensional Array -> Excel / TSV A Matrix looks like a useful way of grabbing the values I need when I have to alter specifics in the data.
on 2013-01-31 15:57
On Thu, Jan 31, 2013 at 3:14 PM, Joel Pearson <lists@ruby-forum.com> wrote: > Thanks robert You're welcome! > My current approach is > HTML Table -> Nokogiri Nodeset -> Multidimensional Array -> Excel / TSV > > A Matrix looks like a useful way of grabbing the values I need when I > have to alter specifics in the data. Whatever you do - reuse class Matrix, write your own - it is the most reasonable thing to have a specific class for handling this instead of writing functions which work with a nested Array structure. It will make your life much easier because then you Matrix class can enforce proper internal state which you cannot as easily when using a set of functions to manipulate an Array structure. Kind regards robert
on 2013-01-31 15:58
Hi Joel, Worth mentioning that transpose requires each "row" to contain the same number of "columns" or you will get an index error. Here's a Gist on how to alter Array#transpose to allow a block for populating missing elements. https://gist.github.com/3876888 Best randym
on 2013-01-31 16:45
Looks like good advice, both. My concern with matrices is being able to modify elements in the same way as I could in in a multidimensional array, but I assume that's the reason for creating a child class. There's a whole thread full of people waxing philosophical about the subject! I've never written a class based on someone else's before, sounds like fun. I'll see what happens when I play with it a bit.
on 2013-02-02 22:03
Robert Klemme wrote in post #1094583: > write your own - it is the most > reasonable thing to have a specific class for handling this instead of > writing functions which work with a nested Array structure. It's only just dawned on me just how useful this kind of thing could be. I could have a class tailor-made to my expected outputs, yet versatile enough to adapt to new challenges... If I learn how to use blocks effectively as well, then it could do pretty much anything I need. All hail Ruby!
on 2013-02-03 18:04
On Sat, Feb 2, 2013 at 10:04 PM, Joel Pearson <lists@ruby-forum.com> wrote: > Robert Klemme wrote in post #1094583: >> write your own - it is the most >> reasonable thing to have a specific class for handling this instead of >> writing functions which work with a nested Array structure. > > It's only just dawned on me just how useful this kind of thing could be. > I could have a class tailor-made to my expected outputs, yet versatile > enough to adapt to new challenges... That's the whole point of OO - or rather software engineering in general: create proper abstractions. Advantage of OO is that one can first reason about the interface and hide all the details behind that - including the state necessary to make the interface of the class work as needed. > If I learn how to use blocks > effectively as well, then it could do pretty much anything I need. > All hail Ruby! :-) Cheers robert
on 2013-02-03 23:43
I decided to try and build on the Array class as I don't really
understand Matrices yet. I've added a few handy methods. The hidden Bang
stuff is justified, I think, as this class is intended to mimic Excel's
layout.
I'll add more useful bits as I come up with them, this is just an
experiment at the moment.
class Excel_Sheet<Array
def initialize( val=[] )
fail ArgumentError, 'Must be multidimensional array' unless
val[0].class == Array || val.empty?
super( val )
end
def columns
ensure_shape
self[0].length
end
def rows
self.length
end
def ensure_shape
max_size = self.max_by(&:length).length
self.map! { |ar| ar.length == max_size ? ar : ar + Array.new(
max_size - ar.length, nil) }
end
def get_cols( headers )
ensure_shape
self.transpose.select { |header,_| headers.include?(header)
}.sort_by { |header,_| headers.index(header) || headers.length
}.transpose
end
def get_cols!( headers )
self.replace get_cols
end
def to_s
self.map { |ar| ar.map { |el| "#{el}".strip.gsub( /\s/, ' ' ) }.join
"\t" }.join "\n"
end
end
on 2013-02-04 00:05
And my first attempt at using a block with it: def skip_headers yield self[1..-1] end test.skip_headers do |row| end p row end It works!
on 2013-02-04 14:12
On Sun, Feb 3, 2013 at 11:43 PM, Joel Pearson <lists@ruby-forum.com> wrote: > I decided to try and build on the Array class as I don't really > understand Matrices yet. I've added a few handy methods. The hidden Bang > stuff is justified, I think, as this class is intended to mimic Excel's > layout. > class Excel_Sheet<Array I wouldn't do that. With the basic types it is usually much better to use delegation (i.e. have a member of that type) than exposing the full API via inheritance. The whole point of OO is to control internal state which is usually quite difficult when exposing a complete API of Array because anybody can insert and remove elements. Btw. the "self." in your code are superfluous. Kind regards robert
on 2013-02-04 14:32
>> class Excel_Sheet<Array > > I wouldn't do that. With the basic types it is usually much better to > use delegation (i.e. have a member of that type) than exposing the > full API via inheritance. The whole point of OO is to control > internal state which is usually quite difficult when exposing a > complete API of Array because anybody can insert and remove elements. > > Btw. the "self." in your code are superfluous. > > Kind regards > > robert Ok, I wasn't sure how to create an object which behaved like an array. I've tried this instead: class Excel_Sheet def initialize( val=[] ) val = %w(A1 A2 A3 B1 B2 B3 C1 C2 C3).each_slice(3).to_a if val == 'test' fail ArgumentError, 'Must be multidimensional array' unless val[0].class == Array || val.empty? replace val end end It seems to work ok. I didn't know how to use self at first, I think I understand it a bit better now. Thanks for the advice :)
on 2013-02-12 13:19
I've decided to inherit from array after all, since all I want to do
with this is extend support for multidimensional arrays, but without
overwriting any of Array's methods.
Anyway, the obstacle I've hit is one I can avoid, but I was wondering
whether I'm doing something wrong, or whether there's a nice Rubyish way
around this. Here's a simplified version to demonstrate the issue:
___________________
class Excel_Sheet<Array
def initialize( val=[] )
val = %w(A1 B1 C1 A2 B2 C2 A3 B3 C3).each_slice(3).to_a if val ==
'test'
super ( val )
end
def skip_headers
block_given? ? ( [ self[0] ] + yield( self[1..-1] ) ) : (
self[1..-1] )
end
def filter( header, regex )
idx = self[0].index header
skip_headers { |xl| xl.select { |ar| ar[idx] =~ regex } }
end
end
___________________
When I do this sort of thing:
result = object.filter('Header', /value1|value2/)
I get the return as an Array, so I can't use my extra methods on it
anymore.
Here's my current workaround. It's the only way I could think of doing
this but it doesn't look right.
___________________
def filter( header, regex )
idx = self[0].index header
Excel_Sheet.new skip_headers { |xl| xl.select { |ar| ar[idx] =~ regex
} }
end
___________________
So in short, my question is how can I return my class type after using
Array's methods on my child-class?
on 2013-02-12 13:30
On Tue, Feb 12, 2013 at 1:19 PM, Joel Pearson <lists@ruby-forum.com> wrote: > I've decided to inherit from array after all, since all I want to do > with this is extend support for multidimensional arrays, but without > overwriting any of Array's methods. I usually do not engage in predictions since I don't have a crystal ball but in this case I'll say: you won't get happy with that approach. For example, anybody can override header values or complete rows / columns violating your class's idea of internal state. > super ( val ) > end > > def skip_headers > block_given? ? ( [ self[0] ] + yield( self[1..-1] ) ) : ( > self[1..-1] ) > end What is this supposed to do? Ah, I think I see. I'd probably name it differently, i.e. each_data_cell or something. > def filter( header, regex ) > idx = self[0].index header > skip_headers { |xl| xl.select { |ar| ar[idx] =~ regex } } > end > > end That combines too much logic in one method IMHO. I'd rather select a row based on header and then I would use #select on that. > So in short, my question is how can I return my class type after using > Array's methods on my child-class? Do you mean as return value from #map and the like? Well, you can't without overriding all methods with this approach, I'm afraid. That's one of the reasons why this approach does not work well. :-) Kind regards robert
on 2013-02-12 14:06
Robert Klemme wrote in post #1096452: > anybody can override header values or complete > rows / columns violating your class's idea of internal state. This class only gets added into scripts, and I'm the only one who knows Ruby where I work, so I don't really see this being a problem yet. >> def skip_headers >> block_given? ? ( [ self[0] ] + yield( self[1..-1] ) ) : ( >> self[1..-1] ) >> end > What is this supposed to do? Ah, I think I see. I'd probably name it > differently, i.e. each_data_cell or something. This basically allows me to make summaries or edits within the data itself without having to worry about the headers, then it tacks the headers back on again once I've made the changes. The switch is just so I can rip the headers off data and also read my code back later and see what it's doing. >> def filter( header, regex ) >> idx = self[0].index header >> skip_headers { |xl| xl.select { |ar| ar[idx] =~ regex } } >> end > That combines too much logic in one method IMHO. I'd rather select a > row based on header and then I would use #select on that. In this case, I'm filtering the data like Excel does. This means I'm keeping all of the columns, but only specific rows based on the values in a single column. >> So in short, my question is how can I return my class type after using >> Array's methods on my child-class? > Do you mean as return value from #map and the like? Well, you can't > without overriding all methods with this approach, I'm afraid. That's > one of the reasons why this approach does not work well. :-) You're probably thinking in terms of classes which are constantly active as objects accessible to multiple users, whereas I'm just using this class to make scripts easier to write. My usage in this case is: 1) Run a report which returns a table of data 2) Filter the data on a set of criteria 3) Output the results to Excel 4) Exit Using the class I've built I can easily pare down the data to the important sections, I've added other handy methods suitable to a multidimensional array such as "find" which returns the header and row, "to_s" which converts to TSV format, and several more. It saves me rewriting the same stuff over and over. It's great when I want to build excel-style tabs out of the data, like this little snippet: ______________ @data[:EXT] = @WIP_data.filter 'Job Status', /External Repair/ @data[:Awaiting_Parts] = @WIP_data.filter 'Job Status', /Awaiting Parts/ @data.each { |name, data| @data[:Summary].push [ name.to_str, ( data.length -1 ) ] } reporter.output_to_excel_tabs @data, filename ______________ Ok, in that case I'm happy with an array return, but there are plenty of sequential events where I'd like to have my own class type returned. Is passing it through the initializer again really the best way to do that?
on 2013-02-12 17:56
On 12.02.2013 13:06, Joel Pearson wrote: > active > Using the class I've built I can easily pare down the data to the > @data[:EXT] = @WIP_data.filter 'Job Status', /External Repair/ > of > sequential events where I'd like to have my own class type returned. > Is > passing it through the initializer again really the best way to do > that? As Robert says I'm not sure there is any (clean) way to do what you want, which is why delegation as opposed to inheritance is often used in this scenario. I'm sure this isn't a complete solution, but does this rewritten snippet (untested!) help at all? class ExcelSheet def initialize( val=[] ) val = %w(A1 B1 C1 A2 B2 C2 A3 B3 C3).each_slice(3).to_a if val == 'test' @arr = val end def skip_headers block_given? ? ( [ @arr[0] ] + yield( @arr[1..-1] ) ) : ( @arr[1..-1] ) end def filter( header, regex ) idx = @arr[0].index header @arr = skip_headers { |xl| xl.select { |ar| ar[idx] =~ regex } } end def method_missing(meth,*args,&block) ret = @arr.send meth, *args, &block ret.instance_of?(Array) ? ExcelSheet.new(@ret) : ret end end
on 2013-02-12 18:45
On Tue, Feb 12, 2013 at 2:06 PM, Joel Pearson <lists@ruby-forum.com> wrote: > Robert Klemme wrote in post #1096452: >> anybody can override header values or complete >> rows / columns violating your class's idea of internal state. > > This class only gets added into scripts, and I'm the only one who knows > Ruby where I work, so I don't really see this being a problem yet. I think you misunderstand OO. The point of OO is to build abstractions of real world phenomena which behave in certain ways. That does not have to do with how many people use the code; proper OO abstractions help you even if you are the only one working with the code because they encapsulate specific behavior and you as a user of classes do not need to worry any longer about internals. By thinking in proper abstractions you actually make your life easier since the system is easier to understand. > I can rip the headers off data and also read my code back later and see > what it's doing. I would actually rather have Row and Colum as specific items which can be asked for their header and iterate through all their values. Example: https://gist.github.com/rklemme/4771651 >>> def filter( header, regex ) >>> idx = self[0].index header >>> skip_headers { |xl| xl.select { |ar| ar[idx] =~ regex } } >>> end >> That combines too much logic in one method IMHO. I'd rather select a >> row based on header and then I would use #select on that. > > In this case, I'm filtering the data like Excel does. This means I'm > keeping all of the columns, but only specific rows based on the values > in a single column. Right, and as I said I'd rather make that two separate steps. That is much more modular and hence reusable. >>> So in short, my question is how can I return my class type after using >>> Array's methods on my child-class? >> Do you mean as return value from #map and the like? Well, you can't >> without overriding all methods with this approach, I'm afraid. That's >> one of the reasons why this approach does not work well. :-) > > You're probably thinking in terms of classes which are constantly active > as objects accessible to multiple users, whereas I'm just using this > class to make scripts easier to write. See my initial statement: I think you are vastly underestimating the value of OOP. Cheers robert
on 2013-02-12 21:25
Thanks for the advice and examples, I'll see whether I can understand how the classes and methods work with each other there and set about experimenting with them. Once thing which put me off generating a custom class "from scratch" is that Array appears to be equal to its content (I assume this is a language shortcut), but it seems "custom" objects' values have to be accessed via their accessors. I was hoping for some more succinct syntax than this sort of thing: puts [] #Array is so easy to create puts CustomObject.new([]).value #This looks clunky next to that I'd love to get accustomed proper OO thinking, but I'll inevitably make all the rookie mistakes in the process. It's a lot to get used to all at once given that I've been using Ruby for less than a year, and I have no training other than helpful hints and googling. Thanks again for your patience.
on 2013-02-12 22:52
On Tue, Feb 12, 2013 at 9:26 PM, Joel Pearson <lists@ruby-forum.com> wrote: > Thanks for the advice and examples, I'll see whether I can understand > how the classes and methods work with each other there and set about > experimenting with them. I didn't put commenting in the gist. If there's anything unclear feel free to ask. > Once thing which put me off generating a custom class "from scratch" is > that Array appears to be equal to its content (I assume this is a > language shortcut), but it seems "custom" objects' values have to be > accessed via their accessors. > I was hoping for some more succinct syntax than this sort of thing: > puts [] #Array is so easy to create > puts CustomObject.new([]).value #This looks clunky next to that You can get quite close, for example you can do def M(*a) YourCustomMatrix.new(a) end # use M(1,2,3,4) or M = Object.new def M.[](*a) YourCustomMatrix.new(a) end # use M[1,2,3,4] > I'd love to get accustomed proper OO thinking, but I'll inevitably make > all the rookie mistakes in the process. Yes, it will take time. Mistakes are what you will learn from. Given that, I should probably shut up and let you make your personal mistakes. :-) > It's a lot to get used to all at > once given that I've been using Ruby for less than a year, and I have no > training other than helpful hints and googling. Thanks again for your > patience. You're welcome! Kind regards robert
on 2013-02-13 18:50
On Tue, Feb 12, 2013 at 10:49 PM, Robert Klemme <shortcutter@googlemail.com> wrote: > On Tue, Feb 12, 2013 at 9:26 PM, Joel Pearson <lists@ruby-forum.com> wrote: >> Thanks for the advice and examples, I'll see whether I can understand >> how the classes and methods work with each other there and set about >> experimenting with them. > > I didn't put commenting in the gist. If there's anything unclear feel > free to ask. Maybe one explanatory sentence: I chose to use two values for addressing cells in the Matrix class I put on github. That is a design decision. You may want to choose something else (e.g. "R3") but the main point of the example was this: it does make sense to turn things you talk and reason about into classes (i.e. Matrix, Column, Row). One can do that and still maintain connection, i.e. if you write through a Row or Column instance the Matrix gets updated. That way you can easily maintain consistency in the Matrix and yet present the user abstractions which are more appropriate for his particular use case (e.g. if you want to sum all values in a column you obtain the column and then iterate through all values and sum them up). Kind regards robert
on 2013-02-13 20:47
I haven't had a chance to look into your example yet; I've been reading up on OOP. I intend to take the ideas I've been coming up with for ease-of-use within the Array class and use those, your Matrix example, and whatever else occurs to me to form a new set of classes which can handle my data and the operations I regularly need to perform. Then it's time to play with scenarios and see what happens.
on 2013-02-14 23:48
Interesting Matrix build. It's giving me a bit of a headache just trying to figure out the links involved. So MatrixPart defines the methods and the "parent" matrix (held as an instance variable); and row and column both use these methods and both access the variable which points to the matrix they're part of. The rows and columns can be selected based on given headers, and each will reference the other... and this is where my head explodes: def index( row, col ) @row_headers.index( row ) * @col_headers.size + @col_headers.index( col ) end It takes a bit of getting used to, but thanks to Ruby's flexible array class adding nil values automatically when you specify an index higher than the upper boundary, that works. I guess with a bit more poking and prodding I could figure out how to append, insert, and delete rows and columns. After all, it's only a math problem in the end. All the interconnected references (especially the layered yields) still make my head spin though :)
on 2013-02-15 15:41
On Thu, Feb 14, 2013 at 11:50 PM, Joel Pearson <lists@ruby-forum.com> wrote: > Interesting Matrix build. It's giving me a bit of a headache just trying > to figure out the links involved. > > So MatrixPart defines the methods and the "parent" matrix (held as an > instance variable); and row and column both use these methods and both > access the variable which points to the matrix they're part of. Yes, Row and Column are a facade to the "real" data and provide a different interface to it which presents a different abstraction: while the Matrix has two dimensions a Row and a Column only have one. > than the upper boundary, that works. Since you obviously understood the method now I am not sure why you say your head explodes over this piece of code. Btw, with a small change you can change storage of data from an Array to a Hash making the Matrix class better suited for sparse matrices. And here comes an important aspect of that implementation: only the Matrix class had to change, there was absolutely no change necessary for the other three classes! This shows how Matrix's API isolated client code from inner workings of this class. This is what OO is about. https://gist.github.com/rklemme/4771651/revisions > I guess with a bit more poking and prodding I could figure out how to > append, insert, and delete rows and columns. After all, it's only a math > problem in the end. All the interconnected references (especially the > layered yields) still make my head spin though :) You'll get used to that - and with a bit of oil the squeaking goes away as well. ;-) Kind regards robert
on 2013-02-15 15:55
Hah, I wrote that head exploding comment first and then managed to work
out what it did afterwards. Still took a few minutes of smashing my head
into the desk to make room for the new thought though ;¬)
Using a Hash sounds like a good idea. I already tried rewriting the
selector into something a bit more excel-like (although I won't bore you
with all the little changes):
def []( addr )
col, row = addr.upcase.scan( /([A-Z]+)(\d+)/ ).flatten
data[ index( row, col ) ]
end
def []=( addr, val )
col, row = addr.upcase.scan( /([A-Z]+)(\d+)/ ).flatten
data[ index( row, col ) ] = val
end
m = Matrix.new(%w{A B C}, %w{1 2 3 4})
m["A1"] = 123
m["B4"] = 123
I haven't gotten around to changing all the "row, col" to "col, row"
references, so it looks a bit weird, but I'm just experimenting with
options at the moment. I'll have a go at Hashing it up as well.
Naturally I have many questions floating around in my head, but I'll try
to work them out through the scientific method of repeated failed
attempts :)
on 2013-02-15 18:21
On Fri, Feb 15, 2013 at 3:55 PM, Joel Pearson <lists@ruby-forum.com> wrote: > Hah, I wrote that head exploding comment first and then managed to work > out what it did afterwards. Still took a few minutes of smashing my head > into the desk to make room for the new thought though ;) LOL > Using a Hash sounds like a good idea. I already tried rewriting the > selector into something a bit more excel-like (although I won't bore you > with all the little changes): Good! > m = Matrix.new(%w{A B C}, %w{1 2 3 4}) I would probably not initialize the matrix then. Excel is dynamic as well. So I'd start with a blank slate and only remember a max value for row and column. > m["A1"] = 123 > m["B4"] = 123 > > I haven't gotten around to changing all the "row, col" to "col, row" > references, so it looks a bit weird, but I'm just experimenting with > options at the moment. I'll have a go at Hashing it up as well. Good. With Ruby, these types of experiments are so much fun. :-) > Naturally I have many questions floating around in my head, but I'll try > to work them out through the scientific method of repeated failed > attempts :) Actually that's probably the best method to learn: we learn much more through our failures than from our successes. Kind regards robert
on 2013-02-17 21:14
I've attached my attempt at converting your code to suit mine (hope you don't mind the plagarism :p ) I have a list of some of my plans to add functionality at the top, and I've rewritten your test at the bottom to suit the new options. I'd be interested to know whether there are any things I'm doing drastically wrong... I think the rows? and columns? might be able to be done more succinctly, for example.
on 2013-02-17 21:19
Robert Klemme wrote in post #1097127: > On Fri, Feb 15, 2013 at 3:55 PM, Joel Pearson <lists@ruby-forum.com> > wrote: Below is a valuable statement. :) > Actually that's probably the best method to learn: we learn much more > through our failures than from our successes. > > Kind regards > > robert
on 2013-02-17 21:30
Oopsie, this:
data.keys.map { |k| k[/\d+/] }.max.to_i
should be this:
data.keys.map { |k| k[/\d+/].to_i }.max
Seeing mistakes already :p
on 2013-02-17 22:29
On Sun, Feb 17, 2013 at 9:14 PM, Joel Pearson <lists@ruby-forum.com> wrote: > I've attached my attempt at converting your code to suit mine (hope you > don't mind the plagarism :p ) No, it's not a doctoral thesis. :-) (allusion to German politics) > I have a list of some of my plans to add functionality at the top, and > I've rewritten your test at the bottom to suit the new options. > > I'd be interested to know whether there are any things I'm doing > drastically wrong... I think the rows? and columns? might be able to be > done more succinctly, for example. First of all their names seem weird. Methods with a question mark at the end are usually intended to be used in a boolean context. But rows? and columns? return the row and column counts. Plus, they do not gracefully deal with an empty Matrix. I guess for everyday use it will be more efficient to store max values in two members because you will need them often for iteration. Method #array could use Columns's #map method. If you rewrite iteration methods like this def each_row return to_enum(:each_row) unless block_given? (1..rows?).each do |idx| yield row( idx ) end self end Then you can even do def array each_row.map do |row| row.to_a end end And I would consider changing []= to actually remove the entry if the value is nil. (Excel does it differently, I believe. It will not reduce the area it considers "used" when emptying the last cell in the last column or row.) Ah, and I would add a validity check for addresses - otherwise one can get any String into the Hash as key - even things which are not valid addresses. But generally I think you get the hang of it. Kind regards robert
on 2013-02-18 13:05
Nice tips! Thanks for the help again. I had no idea how to use to_enum, I'll have to read up on that. I've done all the Ruby courses I could find at Codecademy which filled in a few gaps I had in my knowledge. I'm still reading the Book of Ruby as well. Hopefully this one is more stable: I've decided to leave the "Matrix" class name alone in case I need to use it within the same scope later. I've renamed this "RubyExcel" for want of a better term. I fixed all the things you mentioned (I think). I've added the ability to upload a multidimensional array into the data. It carries the option to overwrite or append as a switch. I set the reference list of column references to a Constant. I've removed "array" added "to_a" and "to_s" I've added "find" to return a "cell address" when given a value I still have a long list of things I want to add, and I'm sure I'll think of more. I'm surprised I haven't found anything equivalent out there, to be honest. Maybe all the real pros are using databases to parse their output :p
on 2013-02-18 13:56
On Mon, Feb 18, 2013 at 1:05 PM, Joel Pearson <lists@ruby-forum.com> wrote: > Nice tips! Thanks for the help again. You're welcome. > I had no idea how to use to_enum, I'll have to read up on that. I've > done all the Ruby courses I could find at Codecademy which filled in a > few gaps I had in my knowledge. I'm still reading the Book of Ruby as > well. Basically that method stores self and method name along with arguments given in a new object which implements #each as delegation to the method stored and also includes Enumerable. You can cook your own for learning purposes E = Struct.new :obj, :method, :arguments do include Enumerable def each(&b) obj.send(method, *arguments, &b) self end end irb(main):017:0> s = "foo bar" => "foo bar" irb(main):018:0> e = E.new s, :each_char => #<struct E obj="foo bar", method=:each_char, arguments=nil> irb(main):019:0> e.map {|x| "<#{x.inspect}>"} => ["<\"f\">", "<\"o\">", "<\"o\">", "<\" \">", "<\"b\">", "<\"a\">", "<\"r\">"] Note that #map is defined in Enumerable > I've decided to leave the "Matrix" class name alone in case I need to > use it within the same scope later. I've renamed this "RubyExcel" for > want of a better term. +1 > I've added the ability to upload a multidimensional array into the data. > It carries the option to overwrite or append as a switch. Two remarks: - don't catch the exception inside the method - better create two methods - one for overwrite and one for append (even if they internally delegate common work to another method) This will give you a clearer API. See here for more reasoning: http://stackoverflow.com/questions/1331630/use-met... > I've added "find" to return a "cell address" when given a value You may also want to add #select etc. Or you include Enumerable and implement #each as delegation to the Hash's #each and get #select and all others for free. > I still have a long list of things I want to add, and I'm sure I'll > think of more. I'm surprised I haven't found anything equivalent out > there, to be honest. Maybe all the real pros are using databases to > parse their output :p :-) Kind regards robert
on 2013-02-18 14:42
Nice link. I agree with the sentiment there, and I'll think more
carefully about using boolean switches in future.
I've split that method into "load" and "append", each passing arguments
to private "import_data".
I added the rescue when I realised the method was returning the number
of rows and I wanted it to return success or failure as a boolean, I
forgot it was catching my exceptions as well. Now it's true or
exception.
I do use switches occasionally, here's one example where I think it's
justified (from my older Excel_Sheet<Array class):
def filter( header, regex, switch=true )
fail ArgumentError, "#{regex} is not valid Regexp" unless regex.class
== Regexp
idx = self[0].index header
fail ArgumentError, "#{header} is not a valid header" if idx.nil?
operator = ( switch ? :=~ : :!~ )
Excel_Sheet.new skip_headers { |xl| xl.select { |ar| ar[idx].send(
operator, regex ) } }
end
Mostly I just did that because I was learning how to use symbols, but it
makes the Regex more flexible with the minimum amount of repetition or
long-winded "if" statements.
on 2013-02-18 15:54
On Mon, Feb 18, 2013 at 2:43 PM, Joel Pearson <lists@ruby-forum.com> wrote: > operator, regex ) } } > end > > Mostly I just did that because I was learning how to use symbols, but it > makes the Regex more flexible with the minimum amount of repetition or > long-winded "if" statements. I think there are better solutions - in order of increasing quality: 1. Rename the parameter from "switch" to "negate". 2. Pass the operator symbol directly. 3. Add a method Regexp#negate which will return an instance which has matching logic reversed and remove the parameter. 4. yield ar[idx] (i.e. delegate the decision to the block passed in) I am still not fond of the #skip_headers approach. As far as I can see you want to copy the sheet into a new one and include only cells whose content matches a regular expression (or maybe other filter criteria). I don't see the rest of your current version of the implementation but the line with self[0] somehow looks strange. Kind regards robert
on 2013-02-18 16:50
I went with "filter" with an optional true/false regex switch because it seemed like the simplest way to use it, and closest to my own experience in using Excel's filters. Passing the symbol feels less intuitive, and yielding to a block means writing more code, particularly when I'm writing a quick method chain. The notation I set up feels natural to me when chaining criteria. For example I can just do this: data.filter( 'Account', /^P/ ).filter( 'Type', /^Large/, false ) Regarding the usage of skip_headers Say I have this data: Type Flag Unique_ID Type1 1 A001 Type2 0 A002 Type1 0 A003 Type3 1 A004 Type1 1 A005 If I only want to keep Parts of "Type1" and "Type3" then I could use "select" and some Regex, but I might pick up the Header as well if I'm not careful. Using a method like "skip_headers" allows me to select or reject elements of the data without losing the identifiers in the first row, which I'm almost always going to need at the end when I output the data into human-readable format. I'm also dealing with entire rows rather than individual cells, and since the source data can change its content and order, using the headers to identify the data source for a given operation is essential. Using skip_headers both allows me to preserve them while sorting through data, and also puts them back on again for the next time I need to reference them.
on 2013-02-18 17:17
On Mon, Feb 18, 2013 at 4:50 PM, Joel Pearson <lists@ruby-forum.com> wrote: > I went with "filter" with an optional true/false regex switch because it > seemed like the simplest way to use it, and closest to my own experience > in using Excel's filters. > Passing the symbol feels less intuitive, and yielding to a block means > writing more code, particularly when I'm writing a quick method chain. > The notation I set up feels natural to me when chaining criteria. For > example I can just do this: > data.filter( 'Account', /^P/ ).filter( 'Type', /^Large/, false ) What do you mean by "writing more code"? It is as short as data.filter( 'Account', /^P/ ).filter( 'Type' ) {|x| /^Large/ =~ x} You could even implement Regexp#to_proc like this class Regexp def to_proc lambda {|s| self =~ s} end end and then do data.filter( 'Account', /^P/ ).filter( 'Type', &/^Large/) > If I only want to keep Parts of "Type1" and "Type3" then I could use > "select" and some Regex, but I might pick up the Header as well if I'm > not careful. > Using a method like "skip_headers" allows me to select or reject > elements of the data without losing the identifiers in the first row, > which I'm almost always going to need at the end when I output the data > into human-readable format. But wouldn't you want to make the decision what is a header and what not more flexible? Possible criteria that come to mind are - first n lines / columns - first lines / columns where all values match a particular regexp - any line or column where all values match a particular regexp > I'm also dealing with entire rows rather than individual cells, and > since the source data can change its content and order, using the > headers to identify the data source for a given operation is essential. > Using skip_headers both allows me to preserve them while sorting through > data, and also puts them back on again for the next time I need to > reference them. So basically you want a view on the data which omits a few rows and columns. Given that there are so many potential criteria I'd probably pass in one object implementing === as a row header detector and one as a column header detector. Since Proc implements === as call you can also easily provide a lambda there. The argument to === would be the Column respective Row instance so the position as well as the cell contents can be evaluated to decide whether something constitutes a header row / column. Once you have that in place you could create convenience methods using one of the criteria mentioned above. Just a few thoughts. Kind regards robert
on 2013-02-18 18:23
That Regexp to proc idea looks good. I could use proc form for a
positive match and a normal block for the negative. I'll see if I can
get something like this working when I write filter method for
RubyExcel.
Using the new class I can implement something like skip_headers by
passing a starting value to "rows" or "columns". This makes it more
flexible as well. I've rewritten those iterators using optional start
and end points:
def rows( start_row = 1, end_row = maxrow )
fail TypeError, 'Data is empty' if maxrow == 0
fail ArgumentError, 'The starting row must be less than the maximum
row' if maxrow < start_row
return to_enum(:rows) unless block_given?
( start_row..end_row ).each do |idx|
yield row( idx )
end
self
end
Now I can use rows(2) to skip the headers if necessary. It might be a
bit confusing when rows(1) actually returns from 1 to the end, but I've
already got row(1) for that purpose and it makes it shorter to iterate
through all of them. Plus it means I can do "rows.count", which is the
same as VBA syntax.
I vaguely understand the idea of passing something in to compare to a
header type. I'm not sure how I'd implement it though, since the only
headers I ever deal with are row 1, and they tend to look pretty similar
to the data itself.
on 2013-02-18 20:41
On Mon, Feb 18, 2013 at 6:23 PM, Joel Pearson <lists@ruby-forum.com> wrote: > That Regexp to proc idea looks good. I could use proc form for a > positive match and a normal block for the negative. I'll see if I can > get something like this working when I write filter method for > RubyExcel. > > Using the new class I can implement something like skip_headers by > passing a starting value to "rows" or "columns". This makes it more > flexible as well. I've rewritten those iterators using optional start > and end points: Actually the even more general concept is filter anything: one might not only want to skip headers but general rows or columns. > def rows( start_row = 1, end_row = maxrow ) > fail TypeError, 'Data is empty' if maxrow == 0 > fail ArgumentError, 'The starting row must be less than the maximum > row' if maxrow < start_row > return to_enum(:rows) unless block_given? You need to pass arguments start_row and end_row here as well! return to_enum(:rows, start_row, end_row) unless block_given? > same as VBA syntax. > > I vaguely understand the idea of passing something in to compare to a > header type. I'm not sure how I'd implement it though, since the only > headers I ever deal with are row 1, and they tend to look pretty similar > to the data itself. Well, then we should probably add a method #index to Row and Column which returns the numeric index. That makes checking whether it's the n'th row / column easy. Basically for one of the two the method is just an alias (not if there is a base 0 or based 1 difference though). Kind regards robert
on 2013-02-19 00:47
Nice catch on the arguments, I completely missed that. After more face-to-keyboard action, I came up with a working filter system. it modifies self at the moment rather than returning a copy, which is something I'll have to look into since I'm not sure I want that to be the default behaviour. I've added a index option for row and column, and also added row and column methods to String. Since those methods didn't exist before, and you gave me the idea of modifying an existing class (Regexp), I thought this would be quite a useful way to get the index values straight from the hash keys. In order to get the filter working properly I've created some compact methods which will reconfigure the hash keys and values. That could probably be refactored but it took me so long to get it working properly I dare not touch it again yet! I added empty? to the columns and rows as a helper for the compact method. I didn't like the inspect output so I tidied it up a bit as well, and redefined "to_s" for each type. I've added each_with_address as an option for the columns and rows since they don't access the data hash directly. There might be a neater way to implement this, but I couldn't figure it out. I'm too tired for rational thought now so I'd better call it a day before I find myself thinking that adding ASCII art comments in the shape of ponies and rainbows would improve the code...
on 2013-02-19 10:36
On Tue, Feb 19, 2013 at 12:47 AM, Joel Pearson <lists@ruby-forum.com> wrote: > After more face-to-keyboard action, I came up with a working filter > system. it modifies self at the moment rather than returning a copy, > which is something I'll have to look intosince I'm not sure I want that > to be the default behaviour. You cannot change self. Or do you mean you change matrix contents? IMHO that's a bad idea since you basically only want a filtered /view/ on the data. > I've added a index option for row and column, and also added row and > column methods to String. Since those methods didn't exist before, and > you gave me the idea of modifying an existing class (Regexp), I thought > this would be quite a useful way to get the index values straight from > the hash keys. IMHO that is not a good idea. The method I suggested for Regex is of general use because matching is what Regex instances are for. But the extraction of a row or column identifier is closely bound to your class's logic. So if at all I'd put it as a class method in class RubyExcel. > ... > I'm too tired for rational thought now so I'd better call it a day > before I find myself thinking that adding ASCII art comments in the > shape of ponies and rainbows would improve the code... :-) Frankly, I don't have the time right now to take a closer look. I hope I can followup at a later point in time. Kind regards robert
on 2013-02-19 10:51
Using the filter method on my previous attempt would return a copy of the object containing only the required data. I was hoping for something similar with this so I can split into multiple sections using filter logic but still keep the original if required, but I'm not sure how to do this. This is the kind of thing I'd probably want to use: m = #Main matrix n = m.filter args #Creates Subset without altering "m" m.filter! other_args #Alters its own data A class method sounds like a good idea. I'll have to play with that and see what I can come up with...
on 2013-02-19 11:03
On Tue, Feb 19, 2013 at 10:51 AM, Joel Pearson <lists@ruby-forum.com> wrote: > Using the filter method on my previous attempt would return a copy of > the object containing only the required data. I was hoping for something > similar with this so I can split into multiple sections using filter > logic but still keep the original if required, but I'm not sure how to > do this. This is the kind of thing I'd probably want to use: > > m = #Main matrix > n = m.filter args #Creates Subset without altering "m" > m.filter! other_args #Alters its own data That's another reasonable way to do it. Similar to how String#gsub and String#gsub! go about this. It entirely depends on your use case. If you would want to modify the original data through a view then of course you need a similar approach as with Row and Column. > A class method sounds like a good idea. I'll have to play with that and > see what I can come up with... Cheers robert
on 2013-02-19 13:52
Ok... new changes: Used "Me" to reference the class so I don't have to keep typing the classname when referencing class methods. Removed String alterations in favour of the class methods "self.row" and "self.column" which take a string as an argument. Combined various checks again multidimensional array input into a method "multi_array?" Added "insert_columns" and "insert_rows", which after long hours of threatening my laptop with physical violence I've finally gotten into some sort of working order. Added "delete_column" and "delete_row" Modified the "compact" methods to work with the above. Modified "calc_columns" and "calc_rows" to accept a multidimensional array as an optional argument, otherwise they reference "data". Added "cells" so the API gives the option to list all populated cell addresses. Added strip! and upcase! Modified the example case at the end to work with the major changes. It's still far from complete, but it's taking form well enough that I might be able to start replacing my old class with it in future projects. If I can get a clean and tidy version eventually I might make it into a gem. Sadly I did succumb to the urge to add ASCII art comments. I'll probably remove it when I drift back closer to sanity.
on 2013-02-20 00:14
I've now had time to do a bit more testing and debugging, and I've managed to come up with a (relatively) stable and usable framework. Even more surprising, I now have a working version of pretty much everything on my wish list. I suppose I can just call .dup or .clone for when I want to take different sections of the data down different paths. I've tried to refactor some of the more esoteric code, and improve the "encapsulation" by hiding the underlying hash wherever I spotted it poking out. I've actually managed to use recursion (in the "unique" method) without permanently looping! I'll just gloss over the amount of permanent loops I went through to get that right... I've had to make some hefty use of reverse_each to sort out deleting entire sections and compacting the rest afterwards. As usual, I've probably made many silly mistakes and done things the hard way. Any advice would be appreciated (although not demanded ;-) ). I think the next thing to do is put this class through some real-world scenarios, so I'll be using it as a replacement for ye olde Excel_Sheet class.
on 2013-02-20 01:18
hi Joel i am timo i follow this thread you are doing great starting to look like it will be a worthwhile effort lately i have hoped for a few pointers in this same topic may i ask have you compared any current ruby forge programs? a few mainly older Ruby programs seem to exist for MS Excel in Ruby http://rubyforge.org/projects/spreadsheet (2011) beta and on ruby-lang http://raa.ruby-lang.org/project/hyogen/ ( 2007) http://raa.ruby-lang.org/project/ruby-statsample/ (2010) http://raa.ruby-lang.org/project/parseexcel/ (2006)
on 2013-02-20 09:25
Thanks Timo. I have used the spreadsheet gem and parseexcel, but I switched to using win32ole for more advanced options and for full compatibility with office 2010 formats. I haven't been keeping up to date with their progress since, it's always worth looking into them. I haven't seen hyogen or statsample before, I'll have a look. There are a few reasons I've set out to build my own class(es) for this: 1) To create an API which feels natural for me. My previous scripting experience is mostly in VBA, so I'm accustomed to using Excel's API and my own custom functions. I wanted to get as close to this feel as possible in order to get comfortable learning Ruby. Of course, in the process I'm picking up on the concepts of OOP, so my approach is starting to change as well. 2) To work alongside my existing code. This is a side-project to my main one, which is a sort of web-scraper specifically designed to work with my company's online database. I have a few other classes which perform tasks such as automated reporting with advanced criteria, filling in forms quickly with minimal user input, and just general helper applications to improve productivity. It seemed only natural to create a class capable of handling all the data analysis tasks which I had previously written using standard Array methods. 3) The actual interaction I have with Excel tends to be minimal. I have a few data dump methods which will take an array or a hash of arrays, and output a TSV or Excel file automatically. The focus of the gems you mentioned seems to be more on read-write operations like this, and I was after something which would let me analyse and restructure large amounts of data into an ordered summary. The reason I tend to order my operations as if it were manual Excel interactions is this is how customers give me their current methodology, and my API makes it easier for me to translate their approach into code. 4) To learn how to do it myself rather than relying on existing programs :) I'd be happy to share my newfound knowledge with you, although I have to give the credit mostly to Robert. I'd never have made it this far without his guidance. Incidentally I wrote a Matrix movie "Digital Rain" effect (which works via win32ole) while I was trying to work out how to interact with Excel efficiently. I've attached it in case you're interested.
on 2013-02-20 13:00
From what I've seen of statsample so far it looks more numerically oriented, whereas I tend to be dealing with lots of strings, some numbers, and a few dates. Hyogen also looks more interested in the read-write than the analysis in the interim. My input tends to come from Nokogiri and I can already output to excel with some advanced formatting options using my own code. I have attached one of the functions I wrote to output the data to Excel (part of a larger class). As you can see, it's pretty simple to use; and I always have the option of just handing over the win3ole object if I need more advanced options (which happens occasionally). It could use some refactoring, but it does the job. As I already have this kind of functionality available, it's the ability to sort through a table of data efficiently in the memory which interests me. I know a database would be ideal for that kind of operation, but I have portability in mind as a given report program may be on a variety of machines, and the only thing I know they have for sure is Office 2007 or higher.
on 2013-02-20 14:08
hey thanks for some concise notes on this important topic yes at length you guys built a nifty way to filter data in the memory even present live updates of cell vectors in the sheets your main data is strings my excel woes lately is the veteran car club parts catalogs i scan pages to tiff then image OCR to Word correct it and paste into Excel with aims like matching compatible pistons to various makes and models
on 2013-02-20 14:34
Sounds messy. Is the data on the page graphical rather than text based? Why do you need Word?
on 2013-02-21 13:40
I thought I could avoid the issue of splitting the data into multiple
outputs by creating copies of the class, but apparently I was wrong:
irb(main):002:0> m = RubyExcel.new.load [['a','b']]
=> columns: 2, rows: 1, values: 2
irb(main):003:0> a = m.clone
irb(main):004:0> a['A1'] = nil
irb(main):005:0> puts a
b
irb(main):006:0> puts m
b
What am I doing wrong? Is @data inside m not being duplicated along with
the class instance?
on 2013-02-21 14:14
On Thu, Feb 21, 2013 at 1:40 PM, Joel Pearson <lists@ruby-forum.com> wrote: > irb(main):005:0> puts a > b > > irb(main):006:0> puts m > b > > What am I doing wrong? Is @data inside m not being duplicated along with > the class instance? #clone and #dup are _always_ shallow copies. If you want to change that you need to override them, e.g. class Excel def dup super.tap do |copy| %w{@table foo}.each {|iv| copy.instance_variable_set(iv, instance_variable_get(iv).dup)} end end end Kind regards robert
on 2013-02-21 14:50
Thanks! Before I start rewriting everything to try and use this, though... Should I really be doing this sort of thing just to be able to differentiate between bang and non-bang methods? Would my time be better spent trying to find a way of returning a seperate instance of the class without modifying the current one? I did consider doing something like modifying a local variable, passing it to the "load" method of a new instance, then returning that new instance; but it seemed a rather long-winded method, and I think I'd have to rewrite all the methods around it.
on 2013-02-21 15:42
Oh, that was simpler than I thought. I need to do some further testing
but this seems to work:
______________________________________________
def dup
Me.new.load @data.clone
end
def load ( multi_array )
if multi_array.is_a? Hash
@data = multi_array
calc_dimensions
self
else
@data = {}
import_data multi_array
end
end
______________________________________________
irb(main):006:0> m = RubyExcel.new.load [['a','b']]
=> columns: 2, rows: 1, values: 2
a b
irb(main):007:0> a = m.dup
=> columns: 2, rows: 1, values: 2
a b
irb(main):010:0> a['A1'] = nil
columns: 2, rows: 1, values: 1
b
irb(main):013:0> p m
columns: 2, rows: 1, values: 2
a b
on 2013-02-22 02:37
thanks yes you're right the task is tedious and messy first i place the old book in my scanner and convert image to tiff tiff is the format accepted in ms-office document imaging OCR tool any text found in the tiff image is output to Word all in a single column i clear all text formatting and correct many typos by examining it visually i check the OCR-output text in word against the input tiff side by side then copy and paste text ie part numbers and names into excel columns i put each catalog section items into an xls sheet to make the data accessible i hope these could be used to create an index of a database of tech drawings
on 2013-02-22 10:34
There are various ways to create a database from excel files, so that should be fine. I can't really think of a more effective way to pull the data from books, apart from trying out different OCR programs. The necessity of scanning and manual checking pretty much renders moot any attempt to accelerate that process. Depending on the amount of data, you might actually get better performance out of a skilled typist doing transcription. It might be worth timing the two against each other. I assume there's no chance of getting the original data these books were printed from? That would be the ideal solution.
on 2013-02-26 14:26
Maybe I should start a new thread for this... Anyway, I've worked out some of the kinks (probably in an unecessarily convoluted fashion) and added some documentation code after learning the basics of "yard". I've managed to add plenty of little helpers for modifying code, and I'm about to start real-life testing. As always, any constructive criticism is welcome!
on 2013-02-27 16:21
Hah, I'm such an idiot. I've replaced the hideously inefficient
compact_rows & columns with these:
def compact_rows
load to_a.reject { |ar| ar.all? { |el| el.nil? } }
end
def compact_columns
load to_a.transpose.reject { |ar| ar.all? { |el| el.nil? } }
end
Much better :)
Please log in before posting. Registration is free and takes only a minute.
Existing account
(Switch to SSL-encrypted connection)
NEW: Do you have a Google/GoogleMail or Yahoo account? No registration required!
Log in with Google account | Log in with Yahoo account
Log in with Google account | Log in with Yahoo account
No account? Register here.