Forum: Ruby Select "columns" from multidimensional array?

Posted by Joel Pearson (virtuoso)
on 2013-01-31 10:45
There's probably a simpler answer to this than the ways I've come up
with.
What's the best way to select columns from a two-dimensional array?

I build arrays to match excel-style formatting, like this but larger:
__________________

a = [
  [ 'A1', 'A2', 'A3' ],
  [ 'B1', 'B2', 'B3' ],
  [ 'C1', 'C2', 'C3' ]
]

def get_cols (multi_array, headers )

  indices = []
  headers.each { |val| indices << multi_array[0].index(val) }
  indices.compact!

  multi_array.map do |ar|
    indices.map { |idx| ar[idx] }
  end

end

get_cols a, %w(A1 A3)

=> [["A1", "A3"], ["B1", "B3"], ["C1", "C3"]]
__________________

I haven't been able to work out a way to do this without writing
long-winded code. Is there a simple solution?

Thanks.
Posted by "Jesús Gabriel y Galán" <jgabrielygalan@gmail.com> (Guest)
on 2013-01-31 11:26
(Received via mailing list)
On Thu, Jan 31, 2013 at 10:46 AM, Joel Pearson <lists@ruby-forum.com> 
wrote:
>   [ 'C1', 'C2', 'C3' ]
>   end
>
> end
>
> get_cols a, %w(A1 A3)
>
> => [["A1", "A3"], ["B1", "B3"], ["C1", "C3"]]

Is this the desired output?

For getting the columns filtering the ones that have a header:

def get_cols(multi_array, headers)
  multi_array.transpose.select {|(header,_)| headers.include? header}
end

1.9.2p290 :001 > a = [
1.9.2p290 :002 >       [ 'A1', 'A2', 'A3' ],
1.9.2p290 :003 >       [ 'B1', 'B2', 'B3' ],
1.9.2p290 :004 >       [ 'C1', 'C2', 'C3' ]
1.9.2p290 :005?>   ]
1.9.2p290 :008 > def get_cols(multi_array, headers)
1.9.2p290 :009?>     multi_array.transpose.select {|(header,_)|
headers.include? header}
1.9.2p290 :010?>   end
 => nil
1.9.2p290 :011 > get_cols a, %w(A1 A3)
 => [["A1", "B1", "C1"], ["A3", "B3", "C3"]]


Jesus.
Posted by Joel Pearson (virtuoso)
on 2013-01-31 11:34
Praise Jesus!

My initial output was the desired one, but the "transpose" method is
what I was missing. Thanks :)

I didn't know about that "|header,_|" trick either, very nice.
Posted by Joel Pearson (virtuoso)
on 2013-01-31 13:08
Documenting the end result.
I added a sort to keep the input column order, and reordered the inputs 
to make the array itself optional for ease-of-use within the parent 
class.

def get_cols( headers, multi_array=nil)
  multi_array = @data if multi_array.nil?
  multi_array.transpose.select { |header,_| headers.include?(header) 
}.sort_by { |header,_| headers.index(header) || headers.length 
}.transpose
end

irb(main):017:0> get_cols %w(A3 A1), a
=> [["A3", "A1"], ["B3", "B1"], ["C3", "C1"]]
Posted by Robert Klemme (robert_k78)
on 2013-01-31 13:12
(Received via mailing list)
On Thu, Jan 31, 2013 at 11:34 AM, Joel Pearson <lists@ruby-forum.com> 
wrote:
> Praise Jesus!
>
> My initial output was the desired one, but the "transpose" method is
> what I was missing. Thanks :)

I'd really start by creating a class for this - or use Matrix from the
standard library.

irb(main):008:0> m = Matrix[[1,2,3],[4,5,6]]
=> Matrix[[1, 2, 3], [4, 5, 6]]
irb(main):009:0> m.row 1
=> Vector[4, 5, 6]
irb(main):010:0> m.column 1
=> Vector[2, 5]

Kind regards

robert
Posted by Joel Pearson (virtuoso)
on 2013-01-31 15:14
Thanks robert

My current approach is
HTML Table -> Nokogiri Nodeset -> Multidimensional Array -> Excel / TSV

A Matrix looks like a useful way of grabbing the values I need when I 
have to alter specifics in the data.
Posted by Robert Klemme (robert_k78)
on 2013-01-31 15:57
(Received via mailing list)
On Thu, Jan 31, 2013 at 3:14 PM, Joel Pearson <lists@ruby-forum.com> 
wrote:
> Thanks robert

You're welcome!

> My current approach is
> HTML Table -> Nokogiri Nodeset -> Multidimensional Array -> Excel / TSV
>
> A Matrix looks like a useful way of grabbing the values I need when I
> have to alter specifics in the data.

Whatever you do - reuse class Matrix, write your own - it is the most
reasonable thing to have a specific class for handling this instead of
writing functions which work with a nested Array structure.  It will
make your life much easier because then you Matrix class can enforce
proper internal state which you cannot as easily when using a set of
functions to manipulate an Array structure.

Kind regards

robert
Posted by Randy Morgan (randym)
on 2013-01-31 15:58
(Received via mailing list)
Hi Joel,

Worth mentioning that transpose requires each "row" to contain the same
number of "columns" or you will get an index error.

Here's a Gist on how to alter Array#transpose to allow a block for
populating missing elements.

https://gist.github.com/3876888

Best

randym
Posted by Joel Pearson (virtuoso)
on 2013-01-31 16:45
Looks like good advice, both. My concern with matrices is being able to 
modify elements in the same way as I could in in a multidimensional 
array, but I assume that's the reason for creating a child class.
There's a whole thread full of people waxing philosophical about the 
subject!

I've never written a class based on someone else's before, sounds like 
fun. I'll see what happens when I play with it a bit.
Posted by Joel Pearson (virtuoso)
on 2013-02-02 22:03
Robert Klemme wrote in post #1094583:
> write your own - it is the most
> reasonable thing to have a specific class for handling this instead of
> writing functions which work with a nested Array structure.

It's only just dawned on me just how useful this kind of thing could be. 
I could have a class tailor-made to my expected outputs, yet versatile 
enough to adapt to new challenges... If I learn how to use blocks 
effectively as well, then it could do pretty much anything I need.
All hail Ruby!
Posted by Robert Klemme (robert_k78)
on 2013-02-03 18:04
(Received via mailing list)
On Sat, Feb 2, 2013 at 10:04 PM, Joel Pearson <lists@ruby-forum.com> 
wrote:
> Robert Klemme wrote in post #1094583:
>> write your own - it is the most
>> reasonable thing to have a specific class for handling this instead of
>> writing functions which work with a nested Array structure.
>
> It's only just dawned on me just how useful this kind of thing could be.
> I could have a class tailor-made to my expected outputs, yet versatile
> enough to adapt to new challenges...

That's the whole point of OO - or rather software engineering in
general: create proper abstractions.  Advantage of OO is that one can
first reason about the interface and hide all the details behind that
- including the state necessary to make the interface of the class
work as needed.

> If I learn how to use blocks
> effectively as well, then it could do pretty much anything I need.
> All hail Ruby!

:-)

Cheers

robert
Posted by Joel Pearson (virtuoso)
on 2013-02-03 23:43
I decided to try and build on the Array class as I don't really 
understand Matrices yet. I've added a few handy methods. The hidden Bang 
stuff is justified, I think, as this class is intended to mimic Excel's 
layout.

I'll add more useful bits as I come up with them, this is just an 
experiment at the moment.

class Excel_Sheet<Array

  def initialize( val=[] )
    fail ArgumentError, 'Must be multidimensional array' unless 
val[0].class == Array || val.empty?
    super( val )
  end

  def columns
    ensure_shape
    self[0].length
  end

  def rows
    self.length
  end

  def ensure_shape
    max_size = self.max_by(&:length).length
    self.map! { |ar| ar.length == max_size ? ar : ar + Array.new( 
max_size - ar.length, nil) }
  end

  def get_cols( headers )
    ensure_shape
    self.transpose.select { |header,_| headers.include?(header) 
}.sort_by { |header,_| headers.index(header) || headers.length 
}.transpose
  end

  def get_cols!( headers )
    self.replace get_cols
  end

  def to_s
    self.map { |ar| ar.map { |el| "#{el}".strip.gsub( /\s/, ' ' ) }.join 
"\t" }.join "\n"
  end

end
Posted by Joel Pearson (virtuoso)
on 2013-02-04 00:05
And my first attempt at using a block with it:

def skip_headers
  yield self[1..-1]
end

test.skip_headers do |row| end
  p row
end

It works!
Posted by Robert Klemme (robert_k78)
on 2013-02-04 14:12
(Received via mailing list)
On Sun, Feb 3, 2013 at 11:43 PM, Joel Pearson <lists@ruby-forum.com> 
wrote:
> I decided to try and build on the Array class as I don't really
> understand Matrices yet. I've added a few handy methods. The hidden Bang
> stuff is justified, I think, as this class is intended to mimic Excel's
> layout.

> class Excel_Sheet<Array

I wouldn't do that.  With the basic types it is usually much better to
use delegation (i.e. have a member of that type) than exposing the
full API via inheritance.  The whole point of OO is to control
internal state which is usually quite difficult when exposing a
complete API of Array because anybody can insert and remove elements.

Btw. the "self." in your code are superfluous.

Kind regards

robert
Posted by Joel Pearson (virtuoso)
on 2013-02-04 14:32
>> class Excel_Sheet<Array
>
> I wouldn't do that.  With the basic types it is usually much better to
> use delegation (i.e. have a member of that type) than exposing the
> full API via inheritance.  The whole point of OO is to control
> internal state which is usually quite difficult when exposing a
> complete API of Array because anybody can insert and remove elements.
>
> Btw. the "self." in your code are superfluous.
>
> Kind regards
>
> robert

Ok, I wasn't sure how to create an object which behaved like an array.
I've tried this instead:

class Excel_Sheet

  def initialize( val=[] )
    val = %w(A1 A2 A3 B1 B2 B3 C1 C2 C3).each_slice(3).to_a if val == 
'test'
    fail ArgumentError, 'Must be multidimensional array' unless 
val[0].class == Array || val.empty?
    replace val
  end

end

It seems to work ok.

I didn't know how to use self at first, I think I understand it a bit 
better now.

Thanks for the advice :)
Posted by Joel Pearson (virtuoso)
on 2013-02-12 13:19
I've decided to inherit from array after all, since all I want to do 
with this is extend support for multidimensional arrays, but without 
overwriting any of Array's methods.

Anyway, the obstacle I've hit is one I can avoid, but I was wondering 
whether I'm doing something wrong, or whether there's a nice Rubyish way 
around this. Here's a simplified version to demonstrate the issue:

___________________
class Excel_Sheet<Array

  def initialize( val=[] )
    val = %w(A1 B1 C1 A2 B2 C2 A3 B3 C3).each_slice(3).to_a if val == 
'test'
    super ( val )
  end

  def skip_headers
    block_given? ? ( [ self[0] ] + yield( self[1..-1] ) ) : ( 
self[1..-1] )
  end

  def filter( header, regex )
    idx = self[0].index header
    skip_headers { |xl| xl.select { |ar| ar[idx] =~ regex } }
  end

end
___________________

When I do this sort of thing:
result = object.filter('Header', /value1|value2/)

I get the return as an Array, so I can't use my extra methods on it 
anymore.

Here's my current workaround. It's the only way I could think of doing 
this but it doesn't look right.
___________________
def filter( header, regex )
  idx = self[0].index header
  Excel_Sheet.new skip_headers { |xl| xl.select { |ar| ar[idx] =~ regex 
} }
end
___________________

So in short, my question is how can I return my class type after using 
Array's methods on my child-class?
Posted by Robert Klemme (robert_k78)
on 2013-02-12 13:30
(Received via mailing list)
On Tue, Feb 12, 2013 at 1:19 PM, Joel Pearson <lists@ruby-forum.com> 
wrote:
> I've decided to inherit from array after all, since all I want to do
> with this is extend support for multidimensional arrays, but without
> overwriting any of Array's methods.

I usually do not engage in predictions since I don't have a crystal
ball but in this case I'll say: you won't get happy with that
approach.  For example, anybody can override header values or complete
rows / columns violating your class's idea of internal state.

>     super ( val )
>   end
>
>   def skip_headers
>     block_given? ? ( [ self[0] ] + yield( self[1..-1] ) ) : (
> self[1..-1] )
>   end

What is this supposed to do?  Ah, I think I see.  I'd probably name it
differently, i.e. each_data_cell or something.

>   def filter( header, regex )
>     idx = self[0].index header
>     skip_headers { |xl| xl.select { |ar| ar[idx] =~ regex } }
>   end
>
> end

That combines too much logic in one method IMHO.  I'd rather select a
row based on header and then I would use #select on that.

> So in short, my question is how can I return my class type after using
> Array's methods on my child-class?

Do you mean as return value from #map and the like?  Well, you can't
without overriding all methods with this approach, I'm afraid.  That's
one of the reasons why this approach does not work well. :-)

Kind regards

robert
Posted by Joel Pearson (virtuoso)
on 2013-02-12 14:06
Robert Klemme wrote in post #1096452:
> anybody can override header values or complete
> rows / columns violating your class's idea of internal state.

This class only gets added into scripts, and I'm the only one who knows 
Ruby where I work, so I don't really see this being a problem yet.

>>   def skip_headers
>>     block_given? ? ( [ self[0] ] + yield( self[1..-1] ) ) : (
>> self[1..-1] )
>>   end
> What is this supposed to do?  Ah, I think I see.  I'd probably name it
> differently, i.e. each_data_cell or something.

This basically allows me to make summaries or edits within the data 
itself without having to worry about the headers, then it tacks the 
headers back on again once I've made the changes. The switch is just so 
I can rip the headers off data and also read my code back later and see 
what it's doing.

>>   def filter( header, regex )
>>     idx = self[0].index header
>>     skip_headers { |xl| xl.select { |ar| ar[idx] =~ regex } }
>>   end
> That combines too much logic in one method IMHO.  I'd rather select a
> row based on header and then I would use #select on that.

In this case, I'm filtering the data like Excel does. This means I'm 
keeping all of the columns, but only specific rows based on the values 
in a single column.

>> So in short, my question is how can I return my class type after using
>> Array's methods on my child-class?
> Do you mean as return value from #map and the like?  Well, you can't
> without overriding all methods with this approach, I'm afraid.  That's
> one of the reasons why this approach does not work well. :-)

You're probably thinking in terms of classes which are constantly active 
as objects accessible to multiple users, whereas I'm just using this 
class to make scripts easier to write.

My usage in this case is:

1) Run a report which returns a table of data
2) Filter the data on a set of criteria
3) Output the results to Excel
4) Exit

Using the class I've built I can easily pare down the data to the 
important sections, I've added other handy methods suitable to a 
multidimensional array such as "find" which returns the header and row, 
"to_s" which converts to TSV format, and several more. It saves me 
rewriting the same stuff over and over.

It's great when I want to build excel-style tabs out of the data, like 
this little snippet:
______________
@data[:EXT] = @WIP_data.filter 'Job Status', /External Repair/
@data[:Awaiting_Parts] = @WIP_data.filter 'Job Status', /Awaiting Parts/

@data.each { |name, data| @data[:Summary].push [ name.to_str, ( 
data.length -1 ) ] }

reporter.output_to_excel_tabs @data, filename
______________

Ok, in that case I'm happy with an array return, but there are plenty of 
sequential events where I'd like to have my own class type returned. Is 
passing it through the initializer again really the best way to do that?
Posted by Alex Gutteridge (Guest)
on 2013-02-12 17:56
(Received via mailing list)
On 12.02.2013 13:06, Joel Pearson wrote:
> active
> Using the class I've built I can easily pare down the data to the
> @data[:EXT] = @WIP_data.filter 'Job Status', /External Repair/
> of
> sequential events where I'd like to have my own class type returned.
> Is
> passing it through the initializer again really the best way to do
> that?

As Robert says I'm not sure there is any (clean) way to do what you
want, which is why delegation as opposed to inheritance is often used in
this scenario. I'm sure this isn't a complete solution, but does this
rewritten snippet (untested!) help at all?

class ExcelSheet

   def initialize( val=[] )
     val = %w(A1 B1 C1 A2 B2 C2 A3 B3 C3).each_slice(3).to_a if val ==
'test'
     @arr = val
   end

   def skip_headers
     block_given? ? ( [ @arr[0] ] + yield( @arr[1..-1] ) ) : (
@arr[1..-1] )
   end

   def filter( header, regex )
     idx = @arr[0].index header
     @arr = skip_headers { |xl| xl.select { |ar| ar[idx] =~ regex } }
   end

   def method_missing(meth,*args,&block)
     ret = @arr.send meth, *args, &block
     ret.instance_of?(Array) ? ExcelSheet.new(@ret) : ret
   end

end
Posted by Robert Klemme (robert_k78)
on 2013-02-12 18:45
(Received via mailing list)
On Tue, Feb 12, 2013 at 2:06 PM, Joel Pearson <lists@ruby-forum.com> 
wrote:
> Robert Klemme wrote in post #1096452:
>> anybody can override header values or complete
>> rows / columns violating your class's idea of internal state.
>
> This class only gets added into scripts, and I'm the only one who knows
> Ruby where I work, so I don't really see this being a problem yet.

I think you misunderstand OO.  The point of OO is to build
abstractions of real world phenomena which behave in certain ways.
That does not have to do with how many people use the code; proper OO
abstractions help you even if you are the only one working with the
code because they encapsulate specific behavior and you as a user of
classes do not need to worry any longer about internals.  By thinking
in proper abstractions you actually make your life easier since the
system is easier to understand.

> I can rip the headers off data and also read my code back later and see
> what it's doing.

I would actually rather have Row and Colum as specific items which can
be asked for their header and iterate through all their values.
Example: https://gist.github.com/rklemme/4771651

>>>   def filter( header, regex )
>>>     idx = self[0].index header
>>>     skip_headers { |xl| xl.select { |ar| ar[idx] =~ regex } }
>>>   end
>> That combines too much logic in one method IMHO.  I'd rather select a
>> row based on header and then I would use #select on that.
>
> In this case, I'm filtering the data like Excel does. This means I'm
> keeping all of the columns, but only specific rows based on the values
> in a single column.

Right, and as I said I'd rather make that two separate steps.  That is
much more modular and hence reusable.

>>> So in short, my question is how can I return my class type after using
>>> Array's methods on my child-class?
>> Do you mean as return value from #map and the like?  Well, you can't
>> without overriding all methods with this approach, I'm afraid.  That's
>> one of the reasons why this approach does not work well. :-)
>
> You're probably thinking in terms of classes which are constantly active
> as objects accessible to multiple users, whereas I'm just using this
> class to make scripts easier to write.

See my initial statement: I think you are vastly underestimating the
value of OOP.

Cheers

robert
Posted by Joel Pearson (virtuoso)
on 2013-02-12 21:25
Thanks for the advice and examples, I'll see whether I can understand 
how the classes and methods work with each other there and set about 
experimenting with them.

Once thing which put me off generating a custom class "from scratch" is 
that Array appears to be equal to its content (I assume this is a 
language shortcut), but it seems "custom" objects' values have to be 
accessed via their accessors.
I was hoping for some more succinct syntax than this sort of thing:
puts [] #Array is so easy to create
puts CustomObject.new([]).value #This looks clunky next to that

I'd love to get accustomed proper OO thinking, but I'll inevitably make 
all the rookie mistakes in the process. It's a lot to get used to all at 
once given that I've been using Ruby for less than a year, and I have no 
training other than helpful hints and googling. Thanks again for your 
patience.
Posted by Robert Klemme (robert_k78)
on 2013-02-12 22:52
(Received via mailing list)
On Tue, Feb 12, 2013 at 9:26 PM, Joel Pearson <lists@ruby-forum.com> 
wrote:
> Thanks for the advice and examples, I'll see whether I can understand
> how the classes and methods work with each other there and set about
> experimenting with them.

I didn't put commenting in the gist.  If there's anything unclear feel
free to ask.

> Once thing which put me off generating a custom class "from scratch" is
> that Array appears to be equal to its content (I assume this is a
> language shortcut), but it seems "custom" objects' values have to be
> accessed via their accessors.
> I was hoping for some more succinct syntax than this sort of thing:
> puts [] #Array is so easy to create
> puts CustomObject.new([]).value #This looks clunky next to that

You can get quite close, for example you can do

def M(*a)
  YourCustomMatrix.new(a)
end

# use
M(1,2,3,4)

or

M = Object.new
def M.[](*a)
  YourCustomMatrix.new(a)
end

# use
M[1,2,3,4]

> I'd love to get accustomed proper OO thinking, but I'll inevitably make
> all the rookie mistakes in the process.

Yes, it will take time.  Mistakes are what you will learn from.  Given
that, I should probably shut up and let you make your personal
mistakes. :-)

> It's a lot to get used to all at
> once given that I've been using Ruby for less than a year, and I have no
> training other than helpful hints and googling. Thanks again for your
> patience.

You're welcome!

Kind regards

robert
Posted by Robert Klemme (robert_k78)
on 2013-02-13 18:50
(Received via mailing list)
On Tue, Feb 12, 2013 at 10:49 PM, Robert Klemme
<shortcutter@googlemail.com> wrote:
> On Tue, Feb 12, 2013 at 9:26 PM, Joel Pearson <lists@ruby-forum.com> wrote:
>> Thanks for the advice and examples, I'll see whether I can understand
>> how the classes and methods work with each other there and set about
>> experimenting with them.
>
> I didn't put commenting in the gist.  If there's anything unclear feel
> free to ask.

Maybe one explanatory sentence: I chose to use two values for
addressing cells in the Matrix class I put on github.  That is a
design decision.  You may want to choose something else (e.g. "R3")
but the main point of the example was this: it does make sense to turn
things you talk and reason about into classes (i.e. Matrix, Column,
Row).  One can do that and still maintain connection, i.e. if you
write through a Row or Column instance the Matrix gets updated.  That
way you can easily maintain consistency in the Matrix and yet present
the user abstractions which are more appropriate for his particular
use case (e.g. if you want to sum all values in a column you obtain
the column and then iterate through all values and sum them up).

Kind regards

robert
Posted by Joel Pearson (virtuoso)
on 2013-02-13 20:47
I haven't had a chance to look into your example yet; I've been reading 
up on OOP.
I intend to take the ideas I've been coming up with for ease-of-use 
within the Array class and use those, your Matrix example, and whatever 
else occurs to me to form a new set of classes which can handle my data 
and the operations I regularly need to perform. Then it's time to play 
with scenarios and see what happens.
Posted by Joel Pearson (virtuoso)
on 2013-02-14 23:48
Interesting Matrix build. It's giving me a bit of a headache just trying 
to figure out the links involved.

So MatrixPart defines the methods and the "parent" matrix (held as an 
instance variable); and row and column both use these methods and both 
access the variable which points to the matrix they're part of.

The rows and columns can be selected based on given headers, and each 
will reference the other... and this is where my head explodes:

def index( row, col )
  @row_headers.index( row ) * @col_headers.size + @col_headers.index( 
col )
end

It takes a bit of getting used to, but thanks to Ruby's flexible array 
class adding nil values automatically when you specify an index higher 
than the upper boundary, that works.

I guess with a bit more poking and prodding I could figure out how to 
append, insert, and delete rows and columns. After all, it's only a math 
problem in the end. All the interconnected references (especially the 
layered yields) still make my head spin though :)
Posted by Robert Klemme (robert_k78)
on 2013-02-15 15:41
(Received via mailing list)
On Thu, Feb 14, 2013 at 11:50 PM, Joel Pearson <lists@ruby-forum.com> 
wrote:
> Interesting Matrix build. It's giving me a bit of a headache just trying
> to figure out the links involved.
>
> So MatrixPart defines the methods and the "parent" matrix (held as an
> instance variable); and row and column both use these methods and both
> access the variable which points to the matrix they're part of.

Yes, Row and Column are a facade to the "real" data and provide a
different interface to it which presents a different abstraction:
while the Matrix has two dimensions a Row and a Column only have one.

> than the upper boundary, that works.
Since you obviously understood the method now I am not sure why you
say your head explodes over this piece of code.

Btw, with a small change you can change storage of data from an Array
to a Hash making the Matrix class better suited for sparse matrices.
And here comes an important aspect of that implementation: only the
Matrix class had to change, there was absolutely no change necessary
for the other three classes!  This shows how Matrix's API isolated
client code from inner workings of this class.  This is what OO is
about.

https://gist.github.com/rklemme/4771651/revisions

> I guess with a bit more poking and prodding I could figure out how to
> append, insert, and delete rows and columns. After all, it's only a math
> problem in the end. All the interconnected references (especially the
> layered yields) still make my head spin though :)

You'll get used to that - and with a bit of oil the squeaking goes
away as well. ;-)

Kind regards

robert
Posted by Joel Pearson (virtuoso)
on 2013-02-15 15:55
Hah, I wrote that head exploding comment first and then managed to work 
out what it did afterwards. Still took a few minutes of smashing my head 
into the desk to make room for the new thought though ;¬)

Using a Hash sounds like a good idea. I already tried rewriting the 
selector into something a bit more excel-like (although I won't bore you 
with all the little changes):

  def []( addr )
    col, row = addr.upcase.scan( /([A-Z]+)(\d+)/ ).flatten
    data[ index( row, col ) ]
  end

  def []=( addr, val )
    col, row = addr.upcase.scan( /([A-Z]+)(\d+)/ ).flatten
    data[ index( row, col ) ] = val
  end

m = Matrix.new(%w{A B C}, %w{1 2 3 4})
m["A1"] = 123
m["B4"] = 123

I haven't gotten around to changing all the "row, col" to "col, row" 
references, so it looks a bit weird, but I'm just experimenting with 
options at the moment. I'll have a go at Hashing it up as well.

Naturally I have many questions floating around in my head, but I'll try 
to work them out through the scientific method of repeated failed 
attempts :)
Posted by Robert Klemme (robert_k78)
on 2013-02-15 18:21
(Received via mailing list)
On Fri, Feb 15, 2013 at 3:55 PM, Joel Pearson <lists@ruby-forum.com> 
wrote:
> Hah, I wrote that head exploding comment first and then managed to work
> out what it did afterwards. Still took a few minutes of smashing my head
> into the desk to make room for the new thought though ;)

LOL

> Using a Hash sounds like a good idea. I already tried rewriting the
> selector into something a bit more excel-like (although I won't bore you
> with all the little changes):

Good!

> m = Matrix.new(%w{A B C}, %w{1 2 3 4})
I would probably not initialize the matrix then. Excel is dynamic as
well.  So I'd start with a blank slate and only remember a max value
for row and column.

> m["A1"] = 123
> m["B4"] = 123
>
> I haven't gotten around to changing all the "row, col" to "col, row"
> references, so it looks a bit weird, but I'm just experimenting with
> options at the moment. I'll have a go at Hashing it up as well.

Good.  With Ruby, these types of experiments are so much fun. :-)

> Naturally I have many questions floating around in my head, but I'll try
> to work them out through the scientific method of repeated failed
> attempts :)

Actually that's probably the best method to learn: we learn much more
through our failures than from our successes.

Kind regards

robert
Posted by Joel Pearson (virtuoso)
on 2013-02-17 21:14
Attachment: XL.rb (2,82 KB)
I've attached my attempt at converting your code to suit mine (hope you 
don't mind the plagarism :p )
I have a list of some of my plans to add functionality at the top, and 
I've rewritten your test at the bottom to suit the new options.

I'd be interested to know whether there are any things I'm doing 
drastically wrong... I think the rows? and columns? might be able to be 
done more succinctly, for example.
Posted by Love U Ruby (my-ruby)
on 2013-02-17 21:19
Robert Klemme wrote in post #1097127:
> On Fri, Feb 15, 2013 at 3:55 PM, Joel Pearson <lists@ruby-forum.com>
> wrote:

Below is a valuable statement. :)

> Actually that's probably the best method to learn: we learn much more
> through our failures than from our successes.
>
> Kind regards
>
> robert
Posted by Joel Pearson (virtuoso)
on 2013-02-17 21:30
Oopsie, this:
data.keys.map { |k| k[/\d+/] }.max.to_i
should be this:
data.keys.map { |k| k[/\d+/].to_i }.max

Seeing mistakes already :p
Posted by Robert Klemme (robert_k78)
on 2013-02-17 22:29
(Received via mailing list)
On Sun, Feb 17, 2013 at 9:14 PM, Joel Pearson <lists@ruby-forum.com> 
wrote:
> I've attached my attempt at converting your code to suit mine (hope you
> don't mind the plagarism :p )

No, it's not a doctoral thesis. :-)  (allusion to German politics)

> I have a list of some of my plans to add functionality at the top, and
> I've rewritten your test at the bottom to suit the new options.
>
> I'd be interested to know whether there are any things I'm doing
> drastically wrong... I think the rows? and columns? might be able to be
> done more succinctly, for example.

First of all their names seem weird.  Methods with a question mark at
the end are usually intended to be used in a boolean context.  But
rows? and columns? return the row and column counts.  Plus, they do
not gracefully deal with an empty Matrix.  I guess for everyday use it
will be more efficient to store max values in two members because you
will need them often for iteration.

Method #array could use Columns's #map method. If you rewrite
iteration methods like this

  def each_row
    return to_enum(:each_row) unless block_given?

    (1..rows?).each do |idx|
      yield row( idx )
    end
    self
  end

Then you can even do

  def array
    each_row.map do |row|
      row.to_a
    end
  end

And I would consider changing []= to actually remove the entry if the
value is nil.  (Excel does it differently, I believe. It will not
reduce the area it considers "used" when emptying the last cell in the
last column or row.)

Ah, and I would add a validity check for addresses - otherwise one can
get any String into the Hash as key - even things which are not valid
addresses.

But generally I think you get the hang of it.

Kind regards

robert
Posted by Joel Pearson (virtuoso)
on 2013-02-18 13:05
Attachment: RubyExcel.rb (3,82 KB)
Nice tips! Thanks for the help again.

I had no idea how to use to_enum, I'll have to read up on that. I've 
done all the Ruby courses I could find at Codecademy which filled in a 
few gaps I had in my knowledge. I'm still reading the Book of Ruby as 
well.

Hopefully this one is more stable:

I've decided to leave the "Matrix" class name alone in case I need to 
use it within the same scope later. I've renamed this "RubyExcel" for 
want of a better term.

I fixed all the things you mentioned (I think).

I've added the ability to upload a multidimensional array into the data. 
It carries the option to overwrite or append as a switch.

I set the reference list of column references to a Constant.

I've removed "array" added "to_a" and "to_s"

I've added "find" to return a "cell address" when given a value

I still have a long list of things I want to add, and I'm sure I'll 
think of more. I'm surprised I haven't found anything equivalent out 
there, to be honest. Maybe all the real pros are using databases to 
parse their output :p
Posted by Robert Klemme (robert_k78)
on 2013-02-18 13:56
(Received via mailing list)
On Mon, Feb 18, 2013 at 1:05 PM, Joel Pearson <lists@ruby-forum.com> 
wrote:
> Nice tips! Thanks for the help again.

You're welcome.

> I had no idea how to use to_enum, I'll have to read up on that. I've
> done all the Ruby courses I could find at Codecademy which filled in a
> few gaps I had in my knowledge. I'm still reading the Book of Ruby as
> well.

Basically that method stores self and method name along with arguments
given in a new object which implements #each as delegation to the
method stored and also includes Enumerable.  You can cook your own for
learning purposes

E = Struct.new :obj, :method, :arguments do
  include Enumerable

  def each(&b)
    obj.send(method, *arguments, &b)
    self
  end
end

irb(main):017:0> s = "foo bar"
=> "foo bar"
irb(main):018:0> e = E.new s, :each_char
=> #<struct E obj="foo bar", method=:each_char, arguments=nil>
irb(main):019:0> e.map {|x| "<#{x.inspect}>"}
=> ["<\"f\">", "<\"o\">", "<\"o\">", "<\" \">", "<\"b\">", "<\"a\">", 
"<\"r\">"]

Note that #map is defined in Enumerable

> I've decided to leave the "Matrix" class name alone in case I need to
> use it within the same scope later. I've renamed this "RubyExcel" for
> want of a better term.

+1

> I've added the ability to upload a multidimensional array into the data.
> It carries the option to overwrite or append as a switch.

Two remarks:

 - don't catch the exception inside the method
 - better create two methods - one for overwrite and one for append
(even if they internally delegate common work to another method)

This will give you a clearer API.  See here for more reasoning:
http://stackoverflow.com/questions/1331630/use-met...

> I've added "find" to return a "cell address" when given a value

You may also want to add #select etc.  Or you include Enumerable and
implement #each as delegation to the Hash's #each and get #select and
all others for free.

> I still have a long list of things I want to add, and I'm sure I'll
> think of more. I'm surprised I haven't found anything equivalent out
> there, to be honest. Maybe all the real pros are using databases to
> parse their output :p

:-)

Kind regards

robert
Posted by Joel Pearson (virtuoso)
on 2013-02-18 14:42
Nice link. I agree with the sentiment there, and I'll think more 
carefully about using boolean switches in future.
I've split that method into "load" and "append", each passing arguments 
to private "import_data".
I added the rescue when I realised the method was returning the number 
of rows and I wanted it to return success or failure as a boolean, I 
forgot it was catching my  exceptions as well. Now it's true or 
exception.

I do use switches occasionally, here's one example where I think it's 
justified (from my older Excel_Sheet<Array class):

def filter( header, regex, switch=true )
  fail ArgumentError, "#{regex} is not valid Regexp" unless regex.class 
== Regexp
  idx = self[0].index header
  fail ArgumentError,  "#{header} is not a valid header" if idx.nil?
  operator = ( switch ? :=~ : :!~ )
  Excel_Sheet.new skip_headers { |xl| xl.select { |ar| ar[idx].send( 
operator, regex ) } }
end

Mostly I just did that because I was learning how to use symbols, but it 
makes the Regex more flexible with the minimum amount of repetition or 
long-winded "if" statements.
Posted by Robert Klemme (robert_k78)
on 2013-02-18 15:54
(Received via mailing list)
On Mon, Feb 18, 2013 at 2:43 PM, Joel Pearson <lists@ruby-forum.com> 
wrote:

> operator, regex ) } }
> end
>
> Mostly I just did that because I was learning how to use symbols, but it
> makes the Regex more flexible with the minimum amount of repetition or
> long-winded "if" statements.

I think there are better solutions - in order of increasing quality:

1. Rename the parameter from "switch" to "negate".
2. Pass the operator symbol directly.
3. Add a method Regexp#negate which will return an instance which has
matching logic reversed and remove the parameter.
4. yield ar[idx] (i.e. delegate the decision to the block passed in)

I am still not fond of the #skip_headers approach.  As far as I can
see you want to copy the sheet into a new one and include only cells
whose  content matches a regular expression (or maybe other filter
criteria).  I don't see the rest of your current version of the
implementation but the line with self[0] somehow looks strange.

Kind regards

robert
Posted by Joel Pearson (virtuoso)
on 2013-02-18 16:50
I went with "filter" with an optional true/false regex switch because it 
seemed like the simplest way to use it, and closest to my own experience 
in using Excel's filters.
Passing the symbol feels less intuitive, and yielding to a block means 
writing more code, particularly when I'm writing a quick method chain. 
The notation I set up feels natural to me when chaining criteria. For 
example I can just do this:
data.filter( 'Account', /^P/ ).filter( 'Type', /^Large/, false )

Regarding the usage of skip_headers
Say I have this data:

Type  Flag  Unique_ID
Type1  1  A001
Type2  0  A002
Type1  0  A003
Type3  1  A004
Type1  1  A005

If I only want to keep Parts of "Type1" and "Type3" then I could use 
"select" and some Regex, but I might pick up the Header as well if I'm 
not careful.
Using a method like "skip_headers" allows me to select or reject 
elements of the data without losing the identifiers in the first row, 
which I'm almost always going to need at the end when I output the data 
into human-readable format.
I'm also dealing with entire rows rather than individual cells, and 
since the source data can change its content and order, using the 
headers to identify the data source for a given operation is essential. 
Using skip_headers both allows me to preserve them while sorting through 
data, and also puts them back on again for the next time I need to 
reference them.
Posted by Robert Klemme (robert_k78)
on 2013-02-18 17:17
(Received via mailing list)
On Mon, Feb 18, 2013 at 4:50 PM, Joel Pearson <lists@ruby-forum.com> 
wrote:
> I went with "filter" with an optional true/false regex switch because it
> seemed like the simplest way to use it, and closest to my own experience
> in using Excel's filters.
> Passing the symbol feels less intuitive, and yielding to a block means
> writing more code, particularly when I'm writing a quick method chain.
> The notation I set up feels natural to me when chaining criteria. For
> example I can just do this:
> data.filter( 'Account', /^P/ ).filter( 'Type', /^Large/, false )

What do you mean by "writing more code"?  It is as short as

data.filter( 'Account', /^P/ ).filter( 'Type' ) {|x| /^Large/ =~ x}

You could even implement Regexp#to_proc like this

class Regexp
  def to_proc
    lambda {|s| self =~ s}
  end
end

and then do

data.filter( 'Account', /^P/ ).filter( 'Type', &/^Large/)

> If I only want to keep Parts of "Type1" and "Type3" then I could use
> "select" and some Regex, but I might pick up the Header as well if I'm
> not careful.
> Using a method like "skip_headers" allows me to select or reject
> elements of the data without losing the identifiers in the first row,
> which I'm almost always going to need at the end when I output the data
> into human-readable format.

But wouldn't you want to make the decision what is a header and what
not more flexible?  Possible criteria that come to mind are
 - first n lines / columns
 - first lines / columns where all values match a particular regexp
 - any line or column where all values match a particular regexp

> I'm also dealing with entire rows rather than individual cells, and
> since the source data can change its content and order, using the
> headers to identify the data source for a given operation is essential.
> Using skip_headers both allows me to preserve them while sorting through
> data, and also puts them back on again for the next time I need to
> reference them.

So basically you want a view on the data which omits a few rows and
columns.  Given that there are so many potential criteria I'd probably
pass in one object implementing === as a row header detector and one
as a column header detector.  Since Proc implements === as call you
can also easily provide a lambda there.  The argument to === would be
the Column respective Row instance so the position as well as the cell
contents can be evaluated to decide whether something constitutes a
header row / column.  Once you have that in place you could create
convenience methods using one of the criteria mentioned above.  Just a
few thoughts.

Kind regards

robert
Posted by Joel Pearson (virtuoso)
on 2013-02-18 18:23
That Regexp to proc idea looks good. I could use proc form for a 
positive match and a normal block for the negative. I'll see if I can 
get something like this working when I write filter method for 
RubyExcel.

Using the new class I can implement something like skip_headers by 
passing a starting value to "rows" or "columns". This makes it more 
flexible as well. I've rewritten those iterators using optional start 
and end points:

def rows( start_row = 1, end_row = maxrow )
  fail TypeError, 'Data is empty' if maxrow == 0
  fail ArgumentError, 'The starting row must be less than the maximum 
row' if maxrow < start_row
  return to_enum(:rows) unless block_given?
  ( start_row..end_row ).each do |idx|
    yield row( idx )
  end
  self
end

Now I can use rows(2) to skip the headers if necessary. It might be a 
bit confusing when rows(1) actually returns from 1 to the end, but I've 
already got row(1) for that purpose and it makes it shorter to iterate 
through all of them. Plus it means I can do "rows.count", which is the 
same as VBA syntax.

I vaguely understand the idea of passing something in to compare to a 
header type. I'm not sure how I'd implement it though, since the only 
headers I ever deal with are row 1, and they tend to look pretty similar 
to the data itself.
Posted by Robert Klemme (robert_k78)
on 2013-02-18 20:41
(Received via mailing list)
On Mon, Feb 18, 2013 at 6:23 PM, Joel Pearson <lists@ruby-forum.com> 
wrote:
> That Regexp to proc idea looks good. I could use proc form for a
> positive match and a normal block for the negative. I'll see if I can
> get something like this working when I write filter method for
> RubyExcel.
>
> Using the new class I can implement something like skip_headers by
> passing a starting value to "rows" or "columns". This makes it more
> flexible as well. I've rewritten those iterators using optional start
> and end points:

Actually the even more general concept is filter anything: one might
not only want to skip headers but general rows or columns.

> def rows( start_row = 1, end_row = maxrow )
>   fail TypeError, 'Data is empty' if maxrow == 0
>   fail ArgumentError, 'The starting row must be less than the maximum
> row' if maxrow < start_row
>   return to_enum(:rows) unless block_given?

You need to pass arguments start_row and end_row here as well!

return to_enum(:rows, start_row, end_row) unless block_given?

> same as VBA syntax.
>
> I vaguely understand the idea of passing something in to compare to a
> header type. I'm not sure how I'd implement it though, since the only
> headers I ever deal with are row 1, and they tend to look pretty similar
> to the data itself.

Well, then we should probably add a method #index to Row and Column
which returns the numeric index.  That makes checking whether it's the
n'th row / column easy.  Basically for one of the two the method is
just an alias (not if there is a base 0 or based 1 difference though).

Kind regards

robert
Posted by Joel Pearson (virtuoso)
on 2013-02-19 00:47
Attachment: RubyExcel.rb (6,73 KB)
Nice catch on the arguments, I completely missed that.

After more face-to-keyboard action, I came up with a working filter
system. it modifies self at the moment rather than returning a copy,
which is something I'll have to look into since I'm not sure I want that
to be the default behaviour.

I've added a index option for row and column, and also added row and
column methods to String. Since those methods didn't exist before, and
you gave me the idea of modifying an existing class (Regexp), I thought
this would be quite a useful way to get the index values straight from
the hash keys.

In order to get the filter working properly I've created some compact
methods which will reconfigure the hash keys and values. That could
probably be refactored but it took me so long to get it working properly
I dare not touch it again yet!

I added empty? to the columns and rows as a helper for the compact
method.

I didn't like the inspect output so I tidied it up a bit as well, and
redefined "to_s" for each type.

I've added each_with_address as an option for the columns and rows since
they don't access the data hash directly. There might be a neater way to
implement this, but I couldn't figure it out.

I'm too tired for rational thought now so I'd better call it a day
before I find myself thinking that adding ASCII art comments in the
shape of ponies and rainbows would improve the code...
Posted by Robert Klemme (robert_k78)
on 2013-02-19 10:36
(Received via mailing list)
On Tue, Feb 19, 2013 at 12:47 AM, Joel Pearson <lists@ruby-forum.com> 
wrote:

> After more face-to-keyboard action, I came up with a working filter
> system. it modifies self at the moment rather than returning a copy,
> which is something I'll have to look intosince I'm not sure I want that
> to be the default behaviour.

You cannot change self.  Or do you mean you change matrix contents?
IMHO that's a bad idea since you basically only want a filtered /view/
on the data.

> I've added a index option for row and column, and also added row and
> column methods to String. Since those methods didn't exist before, and
> you gave me the idea of modifying an existing class (Regexp), I thought
> this would be quite a useful way to get the index values straight from
> the hash keys.

IMHO that is not a good idea.  The method I suggested for Regex is of
general use because matching is what Regex instances are for.  But the
extraction of a row or column identifier is closely bound to your
class's logic.  So if at all I'd put it as a class method in class
RubyExcel.

> ...
> I'm too tired for rational thought now so I'd better call it a day
> before I find myself thinking that adding ASCII art comments in the
> shape of ponies and rainbows would improve the code...

:-)

Frankly, I don't have the time right now to take a closer look.  I
hope I can followup at a later point in time.

Kind regards

robert
Posted by Joel Pearson (virtuoso)
on 2013-02-19 10:51
Using the filter method on my previous attempt would return a copy of 
the object containing only the required data. I was hoping for something 
similar with this so I can split into multiple sections using filter 
logic but still keep the original if required, but I'm not sure how to 
do this. This is the kind of thing I'd probably want to use:

m = #Main matrix
n = m.filter args #Creates Subset without altering "m"
m.filter! other_args #Alters its own data

A class method sounds like a good idea. I'll have to play with that and 
see what I can come up with...
Posted by Robert Klemme (robert_k78)
on 2013-02-19 11:03
(Received via mailing list)
On Tue, Feb 19, 2013 at 10:51 AM, Joel Pearson <lists@ruby-forum.com> 
wrote:
> Using the filter method on my previous attempt would return a copy of
> the object containing only the required data. I was hoping for something
> similar with this so I can split into multiple sections using filter
> logic but still keep the original if required, but I'm not sure how to
> do this. This is the kind of thing I'd probably want to use:
>
> m = #Main matrix
> n = m.filter args #Creates Subset without altering "m"
> m.filter! other_args #Alters its own data

That's another reasonable way to do it.  Similar to how String#gsub
and String#gsub! go about this.  It entirely depends on your use case.
 If you would want to modify the original data through a view then of
course you need a similar approach as with Row and Column.

> A class method sounds like a good idea. I'll have to play with that and
> see what I can come up with...

Cheers

robert
Posted by Joel Pearson (virtuoso)
on 2013-02-19 13:52
Attachment: RubyExcel.rb (10,3 KB)
Ok... new changes:

Used "Me" to reference the class so I don't have to keep typing the
classname when referencing class methods.

Removed String alterations in favour of the class methods "self.row" and
"self.column" which take a string as an argument.

Combined various checks again multidimensional array input into a method
"multi_array?"

Added "insert_columns" and "insert_rows", which after long hours of
threatening my laptop with physical violence I've finally gotten into
some sort of working order.

Added "delete_column" and "delete_row"

Modified the "compact" methods to work with the above.

Modified "calc_columns" and "calc_rows" to accept a multidimensional
array as an optional argument, otherwise they reference "data".

Added "cells" so the API gives the option to list all populated cell
addresses.

Added strip! and upcase!

Modified the example case at the end to work with the major changes.

It's still far from complete, but it's taking form well enough that I
might be able to start replacing my old class with it in future
projects.
If I can get a clean and tidy version eventually I might make it into a
gem.

Sadly I did succumb to the urge to add ASCII art comments. I'll probably
remove it when I drift back closer to sanity.
Posted by Joel Pearson (virtuoso)
on 2013-02-20 00:14
Attachment: RubyExcel.rb (11,7 KB)
I've now had time to do a bit more testing and debugging, and I've 
managed to come up with a (relatively) stable and usable framework. Even 
more surprising, I now have a working version of pretty much everything 
on my wish list. I suppose I can just call .dup or .clone for when I 
want to take different sections of the data down different paths.

I've tried to refactor some of the more esoteric code, and improve the 
"encapsulation" by hiding the underlying hash wherever I spotted it 
poking out.

I've actually managed to use recursion (in the "unique" method) without 
permanently looping! I'll just gloss over the amount of permanent loops 
I went through to get that right...

I've had to make some hefty use of reverse_each to sort out deleting 
entire sections and compacting the rest afterwards.

As usual, I've probably made many silly mistakes and done things the 
hard way. Any advice would be appreciated (although not demanded ;-) ).

I think the next thing to do is put this class through some real-world 
scenarios, so I'll be using it as a replacement for ye olde Excel_Sheet 
class.
Posted by Timothy G. (timothy_g29)
on 2013-02-20 01:18
(Received via mailing list)
hi Joel
i am timo i follow this thread you are doing great
starting to look like it will be a worthwhile effort
lately i have hoped for a few pointers in this same topic
may i ask have you compared any current ruby forge programs?
a few mainly older Ruby programs seem to exist for MS Excel in Ruby
http://rubyforge.org/projects/spreadsheet   (2011) beta
and on ruby-lang
http://raa.ruby-lang.org/project/hyogen/   ( 2007)
http://raa.ruby-lang.org/project/ruby-statsample/  (2010)
http://raa.ruby-lang.org/project/parseexcel/      (2006)
Posted by Joel Pearson (virtuoso)
on 2013-02-20 09:25
Attachment: matrix.rbw (2,75 KB)
Thanks Timo.

I have used the spreadsheet gem and parseexcel, but I switched to using 
win32ole for more advanced options and for full compatibility with 
office 2010 formats. I haven't been keeping up to date with their 
progress since, it's always worth looking into them. I haven't seen 
hyogen or statsample before, I'll have a look.

There are a few reasons I've set out to build my own class(es) for this:

1) To create an API which feels natural for me. My previous scripting 
experience is mostly in VBA, so I'm accustomed to using Excel's API and 
my own custom functions. I wanted to get as close to this feel as 
possible in order to get comfortable learning Ruby. Of course, in the 
process I'm picking up on the concepts of OOP, so my approach is 
starting to change as well.

2) To work alongside my existing code. This is a side-project to my main 
one, which is a sort of web-scraper specifically designed to work with 
my company's online database. I have a few other classes which perform 
tasks such as automated reporting with advanced criteria, filling in 
forms quickly with minimal user input, and just general helper 
applications to improve productivity. It seemed only natural to create a 
class capable of handling all the data analysis tasks which I had 
previously written using standard Array methods.

3) The actual interaction I have with Excel tends to be minimal. I have 
a few data dump methods which will take an array or a hash of arrays, 
and output a TSV or Excel file automatically. The focus of the gems you 
mentioned seems to be more on read-write operations like this, and I was 
after something which would let me analyse and restructure large amounts 
of data into an ordered summary. The reason I tend to order my 
operations as if it were manual Excel interactions is this is how 
customers give me their current methodology, and my API makes it easier 
for me to translate their approach into code.

4) To learn how to do it myself rather than relying on existing programs 
:)

I'd be happy to share my newfound knowledge with you, although I have to 
give the credit mostly to Robert. I'd never have made it this far 
without his guidance.

Incidentally I wrote a Matrix movie "Digital Rain" effect (which works 
via win32ole) while I was trying to work out how to interact with Excel 
efficiently. I've attached it in case you're interested.
Posted by Joel Pearson (virtuoso)
on 2013-02-20 13:00
Attachment: Demo.rb (1,72 KB)
From what I've seen of statsample so far it looks more numerically 
oriented, whereas I tend to be dealing with lots of strings, some 
numbers, and a few dates.

Hyogen also looks more interested in the read-write than the analysis in 
the interim. My input tends to come from Nokogiri and I can already 
output to excel with some advanced formatting options using my own code.

I have attached one of the functions I wrote to output the data to Excel 
(part of a larger class).
As you can see, it's pretty simple to use; and I always have the option 
of just handing over the win3ole object if I need more advanced options 
(which happens occasionally). It could use some refactoring, but it does 
the job.

As I already have this kind of functionality available, it's the ability 
to sort through a table of data efficiently in the memory which 
interests me. I know a database would be ideal for that kind of 
operation, but I have portability in mind as a given report program may 
be on a variety of machines, and the only thing I know they have for 
sure is Office 2007 or higher.
Posted by Timothy G. (timothy_g29)
on 2013-02-20 14:08
(Received via mailing list)
hey thanks for some concise notes on this important topic
yes at length you guys built a nifty way to filter data in the memory
even present live updates of cell vectors in the sheets
your main data is strings
my excel woes lately is the veteran car club parts catalogs
i scan pages to tiff then image OCR to Word correct it and paste into 
Excel
with aims like matching compatible pistons to various makes and models
Posted by Joel Pearson (virtuoso)
on 2013-02-20 14:34
Sounds messy. Is the data on the page graphical rather than text based? 
Why do you need Word?
Posted by Joel Pearson (virtuoso)
on 2013-02-21 13:40
I thought I could avoid the issue of splitting the data into multiple 
outputs by creating copies of the class, but apparently I was wrong:

irb(main):002:0> m = RubyExcel.new.load [['a','b']]
=> columns: 2, rows: 1, values: 2

irb(main):003:0> a = m.clone

irb(main):004:0> a['A1'] = nil

irb(main):005:0> puts a
        b

irb(main):006:0> puts m
        b

What am I doing wrong? Is @data inside m not being duplicated along with 
the class instance?
Posted by Robert Klemme (robert_k78)
on 2013-02-21 14:14
(Received via mailing list)
On Thu, Feb 21, 2013 at 1:40 PM, Joel Pearson <lists@ruby-forum.com> 
wrote:
> irb(main):005:0> puts a
>         b
>
> irb(main):006:0> puts m
>         b
>
> What am I doing wrong? Is @data inside m not being duplicated along with
> the class instance?

#clone and #dup are _always_ shallow copies.  If you want to change
that you need to override them, e.g.

class Excel
  def dup
    super.tap do |copy|
      %w{@table foo}.each  {|iv| copy.instance_variable_set(iv,
instance_variable_get(iv).dup)}
    end
  end
end

Kind regards

robert
Posted by Joel Pearson (virtuoso)
on 2013-02-21 14:50
Thanks!
Before I start rewriting everything to try and use this, though...
Should I really be doing this sort of thing just to be able to 
differentiate between bang and non-bang methods? Would my time be better 
spent trying to find a way of returning a seperate instance of the class 
without modifying the current one?

I did consider doing something like modifying a local variable, passing 
it to the "load" method of a new instance, then returning that new 
instance; but it seemed a rather long-winded method, and I think I'd 
have to rewrite all the methods around it.
Posted by Joel Pearson (virtuoso)
on 2013-02-21 15:42
Oh, that was simpler than I thought. I need to do some further testing 
but this seems to work:
______________________________________________
def dup
  Me.new.load @data.clone
end

def load ( multi_array )
  if multi_array.is_a? Hash
    @data = multi_array
    calc_dimensions
    self
  else
    @data = {}
    import_data multi_array
  end
end
______________________________________________

irb(main):006:0> m = RubyExcel.new.load [['a','b']]
=> columns: 2, rows: 1, values: 2
a       b

irb(main):007:0> a = m.dup
=> columns: 2, rows: 1, values: 2
a       b

irb(main):010:0> a['A1'] = nil
columns: 2, rows: 1, values: 1
        b

irb(main):013:0> p m
columns: 2, rows: 1, values: 2
a       b
Posted by Timothy G. (timothy_g29)
on 2013-02-22 02:37
(Received via mailing list)
thanks yes you're right the task is tedious and messy
first i place the old book in my scanner and convert image to tiff
tiff is the format accepted in ms-office document imaging OCR tool
any text found in the tiff image is output to Word all in a single 
column
i clear all text formatting and correct many typos by examining it 
visually
i check the OCR-output text in word against the input tiff side by side
then copy and paste text ie part numbers and names into excel columns
i put each catalog section items into an xls sheet to make the data
accessible
i hope these could be used to create an index of a database of tech 
drawings
Posted by Joel Pearson (virtuoso)
on 2013-02-22 10:34
There are various ways to create a database from excel files, so that 
should be fine. I can't really think of a more effective way to pull the 
data from books, apart from trying out different OCR programs. The 
necessity of scanning and manual checking pretty much renders moot any 
attempt to accelerate that process.

Depending on the amount of data, you might actually get better 
performance
out of a skilled typist doing transcription. It might be worth timing 
the two against each other.

I assume there's no chance of getting the original data these books were 
printed from? That would be the ideal solution.
Posted by Joel Pearson (virtuoso)
on 2013-02-26 14:26
Attachment: RubyExcel.rb (16,7 KB)
Maybe I should start a new thread for this...

Anyway, I've worked out some of the kinks (probably in an unecessarily 
convoluted fashion) and added some documentation code after learning the 
basics of "yard". I've managed to add plenty of little helpers for 
modifying code, and I'm about to start real-life testing.

As always, any constructive criticism is welcome!
Posted by Joel Pearson (virtuoso)
on 2013-02-27 16:21
Hah, I'm such an idiot. I've replaced the hideously inefficient 
compact_rows & columns with these:

def compact_rows
  load to_a.reject { |ar| ar.all? { |el| el.nil? } }
end

def compact_columns
  load to_a.transpose.reject { |ar| ar.all? { |el| el.nil? } }
end

Much better :)
Please log in before posting. Registration is free and takes only a minute.
Existing account (Switch to SSL-encrypted connection)
NEW: Do you have a Google/GoogleMail or Yahoo account? No registration required!
Log in with Google account | Log in with Yahoo account
No account? Register here.