Select "columns" from multidimensional array?

virtuoso · February 19, 2013, 12:47am

Nice catch on the arguments, I completely missed that.

After more face-to-keyboard action, I came up with a working filter
system. it modifies self at the moment rather than returning a copy,
which is something I’ll have to look into since I’m not sure I want that
to be the default behaviour.

I’ve added a index option for row and column, and also added row and
column methods to String. Since those methods didn’t exist before, and
you gave me the idea of modifying an existing class (Regexp), I thought
this would be quite a useful way to get the index values straight from
the hash keys.

In order to get the filter working properly I’ve created some compact
methods which will reconfigure the hash keys and values. That could
probably be refactored but it took me so long to get it working properly
I dare not touch it again yet!

I added empty? to the columns and rows as a helper for the compact
method.

I didn’t like the inspect output so I tidied it up a bit as well, and
redefined “to_s” for each type.

I’ve added each_with_address as an option for the columns and rows since
they don’t access the data hash directly. There might be a neater way to
implement this, but I couldn’t figure it out.

I’m too tired for rational thought now so I’d better call it a day
before I find myself thinking that adding ASCII art comments in the
shape of ponies and rainbows would improve the code…

virtuoso · February 19, 2013, 10:51am

Using the filter method on my previous attempt would return a copy of
the object containing only the required data. I was hoping for something
similar with this so I can split into multiple sections using filter
logic but still keep the original if required, but I’m not sure how to
do this. This is the kind of thing I’d probably want to use:

m = #Main matrix
n = m.filter args #Creates Subset without altering “m”
m.filter! other_args #Alters its own data

A class method sounds like a good idea. I’ll have to play with that and
see what I can come up with…

virtuoso · February 19, 2013, 10:36am

On Tue, Feb 19, 2013 at 12:47 AM, Joel P. [email protected]
wrote:

After more face-to-keyboard action, I came up with a working filter
system. it modifies self at the moment rather than returning a copy,
which is something I’ll have to look intosince I’m not sure I want that
to be the default behaviour.

You cannot change self. Or do you mean you change matrix contents?
IMHO that’s a bad idea since you basically only want a filtered /view/
on the data.

I’ve added a index option for row and column, and also added row and
column methods to String. Since those methods didn’t exist before, and
you gave me the idea of modifying an existing class (Regexp), I thought
this would be quite a useful way to get the index values straight from
the hash keys.

IMHO that is not a good idea. The method I suggested for Regex is of
general use because matching is what Regex instances are for. But the
extraction of a row or column identifier is closely bound to your
class’s logic. So if at all I’d put it as a class method in class
RubyExcel.

…
I’m too tired for rational thought now so I’d better call it a day
before I find myself thinking that adding ASCII art comments in the
shape of ponies and rainbows would improve the code…

Frankly, I don’t have the time right now to take a closer look. I
hope I can followup at a later point in time.

Kind regards

robert

virtuoso · February 19, 2013, 11:03am

On Tue, Feb 19, 2013 at 10:51 AM, Joel P. [email protected]
wrote:

Using the filter method on my previous attempt would return a copy of
the object containing only the required data. I was hoping for something
similar with this so I can split into multiple sections using filter
logic but still keep the original if required, but I’m not sure how to
do this. This is the kind of thing I’d probably want to use:

m = #Main matrix
n = m.filter args #Creates Subset without altering “m”
m.filter! other_args #Alters its own data

That’s another reasonable way to do it. Similar to how String#gsub
and String#gsub! go about this. It entirely depends on your use case.
If you would want to modify the original data through a view then of
course you need a similar approach as with Row and Column.

A class method sounds like a good idea. I’ll have to play with that and
see what I can come up with…

Cheers

robert

virtuoso · February 19, 2013, 1:52pm

Ok… new changes:

Used “Me” to reference the class so I don’t have to keep typing the
classname when referencing class methods.

Removed String alterations in favour of the class methods “self.row” and
“self.column” which take a string as an argument.

Combined various checks again multidimensional array input into a method
“multi_array?”

Added “insert_columns” and “insert_rows”, which after long hours of
threatening my laptop with physical violence I’ve finally gotten into
some sort of working order.

Added “delete_column” and “delete_row”

Modified the “compact” methods to work with the above.

Modified “calc_columns” and “calc_rows” to accept a multidimensional
array as an optional argument, otherwise they reference “data”.

Added “cells” so the API gives the option to list all populated cell
addresses.

Added strip! and upcase!

Modified the example case at the end to work with the major changes.

It’s still far from complete, but it’s taking form well enough that I
might be able to start replacing my old class with it in future
projects.
If I can get a clean and tidy version eventually I might make it into a
gem.

Sadly I did succumb to the urge to add ASCII art comments. I’ll probably
remove it when I drift back closer to sanity.

virtuoso · February 20, 2013, 1:18am

hi Joel
i am timo i follow this thread you are doing great
starting to look like it will be a worthwhile effort
lately i have hoped for a few pointers in this same topic
may i ask have you compared any current ruby forge programs?
a few mainly older Ruby programs seem to exist for MS Excel in Ruby
http://rubyforge.org/projects/spreadsheet (2011) beta
and on ruby-lang
http://raa.ruby-lang.org/project/hyogen/ ( 2007)
http://raa.ruby-lang.org/project/ruby-statsample/ (2010)
http://raa.ruby-lang.org/project/parseexcel/ (2006)

virtuoso · February 20, 2013, 9:25am

Thanks Timo.

I have used the spreadsheet gem and parseexcel, but I switched to using
win32ole for more advanced options and for full compatibility with
office 2010 formats. I haven’t been keeping up to date with their
progress since, it’s always worth looking into them. I haven’t seen
hyogen or statsample before, I’ll have a look.

There are a few reasons I’ve set out to build my own class(es) for this:

To create an API which feels natural for me. My previous scripting
experience is mostly in VBA, so I’m accustomed to using Excel’s API and
my own custom functions. I wanted to get as close to this feel as
possible in order to get comfortable learning Ruby. Of course, in the
process I’m picking up on the concepts of OOP, so my approach is
starting to change as well.
To work alongside my existing code. This is a side-project to my main
one, which is a sort of web-scraper specifically designed to work with
my company’s online database. I have a few other classes which perform
tasks such as automated reporting with advanced criteria, filling in
forms quickly with minimal user input, and just general helper
applications to improve productivity. It seemed only natural to create a
class capable of handling all the data analysis tasks which I had
previously written using standard Array methods.
The actual interaction I have with Excel tends to be minimal. I have
a few data dump methods which will take an array or a hash of arrays,
and output a TSV or Excel file automatically. The focus of the gems you
mentioned seems to be more on read-write operations like this, and I was
after something which would let me analyse and restructure large amounts
of data into an ordered summary. The reason I tend to order my
operations as if it were manual Excel interactions is this is how
customers give me their current methodology, and my API makes it easier
for me to translate their approach into code.
To learn how to do it myself rather than relying on existing programs

I’d be happy to share my newfound knowledge with you, although I have to
give the credit mostly to Robert. I’d never have made it this far
without his guidance.

Incidentally I wrote a Matrix movie “Digital Rain” effect (which works
via win32ole) while I was trying to work out how to interact with Excel
efficiently. I’ve attached it in case you’re interested.

virtuoso · February 20, 2013, 12:14am

I’ve now had time to do a bit more testing and debugging, and I’ve
managed to come up with a (relatively) stable and usable framework. Even
more surprising, I now have a working version of pretty much everything
on my wish list. I suppose I can just call .dup or .clone for when I
want to take different sections of the data down different paths.

I’ve tried to refactor some of the more esoteric code, and improve the
“encapsulation” by hiding the underlying hash wherever I spotted it
poking out.

I’ve actually managed to use recursion (in the “unique” method) without
permanently looping! I’ll just gloss over the amount of permanent loops
I went through to get that right…

I’ve had to make some hefty use of reverse_each to sort out deleting
entire sections and compacting the rest afterwards.

As usual, I’ve probably made many silly mistakes and done things the
hard way. Any advice would be appreciated (although not demanded ).

I think the next thing to do is put this class through some real-world
scenarios, so I’ll be using it as a replacement for ye olde Excel_Sheet
class.

virtuoso · February 20, 2013, 2:08pm

hey thanks for some concise notes on this important topic
yes at length you guys built a nifty way to filter data in the memory
even present live updates of cell vectors in the sheets
your main data is strings
my excel woes lately is the veteran car club parts catalogs
i scan pages to tiff then image OCR to Word correct it and paste into
Excel
with aims like matching compatible pistons to various makes and models

virtuoso · February 20, 2013, 1:00pm

From what I’ve seen of statsample so far it looks more numerically
oriented, whereas I tend to be dealing with lots of strings, some
numbers, and a few dates.

Hyogen also looks more interested in the read-write than the analysis in
the interim. My input tends to come from Nokogiri and I can already
output to excel with some advanced formatting options using my own code.

I have attached one of the functions I wrote to output the data to Excel
(part of a larger class).
As you can see, it’s pretty simple to use; and I always have the option
of just handing over the win3ole object if I need more advanced options
(which happens occasionally). It could use some refactoring, but it does
the job.

As I already have this kind of functionality available, it’s the ability
to sort through a table of data efficiently in the memory which
interests me. I know a database would be ideal for that kind of
operation, but I have portability in mind as a given report program may
be on a variety of machines, and the only thing I know they have for
sure is Office 2007 or higher.

virtuoso · February 21, 2013, 1:40pm

I thought I could avoid the issue of splitting the data into multiple
outputs by creating copies of the class, but apparently I was wrong:

irb(main):002:0> m = RubyExcel.new.load [[‘a’,‘b’]]
=> columns: 2, rows: 1, values: 2

irb(main):003:0> a = m.clone

irb(main):004:0> a[‘A1’] = nil

irb(main):005:0> puts a
b

irb(main):006:0> puts m
b

What am I doing wrong? Is @data inside m not being duplicated along with
the class instance?

virtuoso · February 21, 2013, 2:14pm

On Thu, Feb 21, 2013 at 1:40 PM, Joel P. [email protected]
wrote:

irb(main):005:0> puts a
b

irb(main):006:0> puts m
b

What am I doing wrong? Is @data inside m not being duplicated along with
the class instance?

#clone and #dup are always shallow copies. If you want to change
that you need to override them, e.g.

class Excel
def dup
super.tap do |copy|
%w{@table foo}.each {|iv| copy.instance_variable_set(iv,
instance_variable_get(iv).dup)}
end
end
end

Kind regards

robert

virtuoso · February 20, 2013, 2:34pm

Sounds messy. Is the data on the page graphical rather than text based?
Why do you need Word?

virtuoso · February 21, 2013, 2:50pm

Thanks!
Before I start rewriting everything to try and use this, though…
Should I really be doing this sort of thing just to be able to
differentiate between bang and non-bang methods? Would my time be better
spent trying to find a way of returning a seperate instance of the class
without modifying the current one?

I did consider doing something like modifying a local variable, passing
it to the “load” method of a new instance, then returning that new
instance; but it seemed a rather long-winded method, and I think I’d
have to rewrite all the methods around it.

virtuoso · February 21, 2013, 3:42pm

Oh, that was simpler than I thought. I need to do some further testing
but this seems to work:

def dup
Me.new.load @data.clone
end

def load ( multi_array )
if multi_array.is_a? Hash
@data = multi_array
calc_dimensions
self
else
@data = {}
import_data multi_array
end
end

irb(main):006:0> m = RubyExcel.new.load [[‘a’,‘b’]]
=> columns: 2, rows: 1, values: 2
a b

irb(main):007:0> a = m.dup
=> columns: 2, rows: 1, values: 2
a b

irb(main):010:0> a[‘A1’] = nil
columns: 2, rows: 1, values: 1
b

irb(main):013:0> p m
columns: 2, rows: 1, values: 2
a b

virtuoso · February 22, 2013, 2:37am

thanks yes you’re right the task is tedious and messy
first i place the old book in my scanner and convert image to tiff
tiff is the format accepted in ms-office document imaging OCR tool
any text found in the tiff image is output to Word all in a single
column
i clear all text formatting and correct many typos by examining it
visually
i check the OCR-output text in word against the input tiff side by side
then copy and paste text ie part numbers and names into excel columns
i put each catalog section items into an xls sheet to make the data
accessible
i hope these could be used to create an index of a database of tech
drawings

virtuoso · February 22, 2013, 10:34am

There are various ways to create a database from excel files, so that
should be fine. I can’t really think of a more effective way to pull the
data from books, apart from trying out different OCR programs. The
necessity of scanning and manual checking pretty much renders moot any
attempt to accelerate that process.

Depending on the amount of data, you might actually get better
performance
out of a skilled typist doing transcription. It might be worth timing
the two against each other.

I assume there’s no chance of getting the original data these books were
printed from? That would be the ideal solution.

virtuoso · February 26, 2013, 2:26pm

Maybe I should start a new thread for this…

Anyway, I’ve worked out some of the kinks (probably in an unecessarily
convoluted fashion) and added some documentation code after learning the
basics of “yard”. I’ve managed to add plenty of little helpers for
modifying code, and I’m about to start real-life testing.

As always, any constructive criticism is welcome!

virtuoso · February 27, 2013, 4:21pm

Hah, I’m such an idiot. I’ve replaced the hideously inefficient
compact_rows & columns with these:

def compact_rows
load to_a.reject { |ar| ar.all? { |el| el.nil? } }
end

def compact_columns
load to_a.transpose.reject { |ar| ar.all? { |el| el.nil? } }
end

Much better

virtuoso · November 30, 2013, 9:29pm

A related question… on values from nested arrays / json

https://www.ruby-forum.com/topic/4418971

Kindly advice