Statistical software and data transformation?

How many ruby-ists have to do statistical analysis or data cleaning
prior to analysis?

Is it not something that web developers often do?

What well-known software is out there for statistics or data
transformation, open source or at least free of charge? I mean besides
R; I think I understand what R’s strengths and limitations are.

There are a number of applications listed at

http://directory.fsf.org/math/stats/

but I do not know how mature they are (except for the one I submitted,
vilno).

Is there currently a successful project building genuinely
user-friendly open-source statistical software, usually with a GUI, to
compete with SAS (JMP) or SPSS? (R is more for research statistics,
with a tough learning curve.)

Appreciate your feedback,

Robert

Robert wrote:

I do a lot of data cleaning/pre-processing. Most of it is numerical data
rather than more “traditional” business data mining like
name/address/zip code stuff. My main current modus operandi is

  1. Do the data extraction in Perl. I’d use Ruby, but
    a) I learned Perl years ago and just learned Ruby about a year ago
    b) There are no other Ruby programmers around for backup.

  2. Load the extracted data into a PostgreSQL database. I used to use
    Access, then migrated to SQL Server, and now I’m on PostgreSQL.

  3. Do SQL queries for the easy stuff and R (via RODBC) for the fancy
    stuff.
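
If I were doing this in Ruby, steps 1 and 2 might look roughly like the
sketch below. The file name, table name and column layout are
placeholders, and it assumes the csv standard library plus the pg gem
rather than whatever driver I’d actually be using:

require 'csv'
require 'pg'   # substitute your PostgreSQL driver of choice

# Step 2 target: a staging table for the cleaned rows (names are made up).
conn = PG.connect(dbname: 'warehouse')
conn.exec(<<~SQL)
  CREATE TABLE IF NOT EXISTS measurements (
    sample_id text,
    taken_on  date,
    value     numeric
  )
SQL

# Step 1: extract the fields we care about from a raw tab-delimited dump,
# dropping rows with missing values as a trivial bit of cleaning.
CSV.foreach('raw_export.txt', col_sep: "\t", headers: true) do |row|
  next if row['value'].to_s.strip.empty?
  conn.exec_params(
    'INSERT INTO measurements (sample_id, taken_on, value) VALUES ($1, $2, $3)',
    [row['sample_id'], row['taken_on'], row['value']]
  )
end

# Step 3: the easy stuff stays in SQL; the fancy stuff goes to R.
p conn.exec('SELECT count(*), avg(value) FROM measurements').first

The point is just that the extract and load steps are a screenful of
code each; the statistics stay in SQL and R.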

Mind you, I’ve been doing this with minor alterations in the tools for
something like 15 years, so I haven’t really dug into the way other
folks do it. But projects are starting to appear, both open source and
commercial, in the so-called ETL (Extract, Transform, Load) arena, that
promise to revolutionize this type of work. One name that sticks in my
mind on the open-source side is Pentaho, but I have not had a chance to
check it out. Most of the big ETL products are Java-based, IIRC.

As for the learning curve of R, there are a few GUI front-ends that
take some of the sting out of it, but the basic underlying philosophy
of R is that it is a language (and a damn good one!) for
scientific/statistical/graphical computing. The GUI builders expect you
to start with the GUI and learn the language, rather than continue using
the GUI like you would Excel, Minitab, or some of the other packages.
That said, the most complete and user-friendly is probably R Commander
(Rcmdr), which works with R on both Windows and Linux.

This is something I’d like to see built in Rails – you’ve got the RDBMS
back ends, the AJAX and MVC GUI tools, the ORM, etc. There is an
interface to R from Ruby, but IIRC the bridge logic between the two
languages currently only works on Linux – there’s no way yet for a
Windows Ruby program to hook up with the R DLL. There are some R DCOM
interfaces, though – that might be the way to do it on a Windows
machine.
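
The bridge in question is, if memory serves, the RSRuby gem. A rough
sketch of driving R from Ruby with it looks something like this (the
exact method names are from memory, so treat them as an assumption):

require 'rsruby'   # the Ruby-to-R bridge; Linux-only at the moment

r = RSRuby.instance        # one embedded R interpreter per process

# R functions are exposed as methods on the bridge object; Ruby Arrays
# cross over as R vectors and come back as Arrays.
data = r.rnorm(100)        # 100 draws from a standard normal
puts r.mean(data)
puts r.sd(data)

# Arbitrary R expressions can also be evaluated from a string.
p r.eval_R('coef(lm(dist ~ speed, data = cars))')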

By the way, I think the Windows R UI is far superior to the one on
Linux. The Linux version hasn’t changed substantially from its origins:
it’s a simple xterm-based X Windows application.

On Sep 19, 2007, at 9:21 PM, M. Edward (Ed) Borasky wrote:

  1. Do the data extraction in Perl. I’d use Ruby, but
    a) I learned Perl years ago and just learned Ruby about a year ago
    b) There are no other Ruby programmers around for backup.

We’re all hurt, Ed. You know how we enjoy those “Help me extract
this data with a one-liner…” posts. ;-)

James Edward G. II

On 9/20/07, M. Edward (Ed) Borasky [email protected] wrote:

Mind you, I’ve been doing this with minor alterations in the tools for
something like 15 years, so I haven’t really dug into the way other
folks do it. But projects are starting to appear, both open source and
commercial, in the so-called ETL (Extract, Transform, Load) arena, that
promise to revolutionize this type of work. One name that sticks in my
mind on the open-source side is Pentaho, but I have not had a chance to
check it out. Most of the big ETL products are Java-based, IIRC.

On the Ruby front, ActiveWarehouse may be worth a look:

http://activewarehouse.rubyforge.org/etl/

I haven’t had a chance to play with it. It seems a bit Rails-focused.
Level of activity is high.

Best regards,

John M. wrote:

http://activewarehouse.rubyforge.org/etl/

I haven’t had a chance to play with it. It seems a bit Rails-focused.
Level of activity is high.

Best regards,

Yeah … I’ve seen that too. Then again, when it comes to databases and
Ruby, what isn’t Rails-focused?

Well … Nitro … Iowa … etc. … :-) I’m playing with Og at the
moment, but not with big datasets.

On 9/20/07, John M. [email protected] wrote:

On the Ruby front there may be ActiveWarehouse :

http://activewarehouse.rubyforge.org/etl/

I haven’t had a chance to play with it. It seems a bit Rails-focused.
Level of activity is high.

FWIW, ActiveWarehouse has a Rails plugin on one side but it also has
an ETL Gem called, not surprisingly, ActiveWarehouse ETL. The
documentation is available here:

http://activewarehouse.rubyforge.org/docs/activewarehouse-etl.html

We (the contributors) have worked hard to make something that is
pretty easy to extend and that attempts to be idiomatic Ruby as much
as possible. Take a look and feel free to join the ActiveWarehouse
discussion list.
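
To give a flavour of it, a control file is just Ruby that the etl
command evaluates. A stripped-down example looks something like the
following; the field names are invented and the options are paraphrased
from the docs above rather than copied exactly:

# example.ctl -- read a delimited extract, clean one field, write it out.
source :in, {
  :file   => 'people.txt',
  :parser => :delimited
},
[
  :first_name,
  :last_name,
  :email
]

# A row-level transform: normalise the email address.
transform(:email) { |name, value, row| value.to_s.downcase.strip }

destination :out, {
  :file => 'people_out.txt'
},
[
  :order => [:first_name, :last_name, :email]
]

You then run it with something like "etl example.ctl".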

V/r
Anthony

Robert wrote:

How many ruby-ists have to do statistical analysis or data cleaning
prior to analysis?

I use Ruby quite a lot at work for data cleaning, transformation and
also for generating SPSS syntax. For example, I used it to create a long
set of commands for linking together waves of the longitudinal British
Household Panel Study.

What well-known software is out there for statistics or data
transformation?

Possibly R Commander, which is a Tk interface onto R:
http://socserv.mcmaster.ca/jfox/Misc/Rcmdr/

I haven’t ever used it myself, though; it seems to have a good feature
set but is missing some things I use in SPSS, eg Probit models.

Is there currently a successful project building genuinely
user-friendly open-source statistical software, usually with a GUI, to
compete with SAS (JMP) or SPSS? (R is more for research statistics,
with a tough learning curve.)

Not that I know of. I agree re R - on numerous attempts I’ve never
managed to get anywhere with it (I have 8 years programming experience
and a postgrad in Research Methods). It also seems much more geared to
the needs of natural rather than social science.

There are things I don’t like about SPSS too, apart from price - some
interface aspects, and its syntax. I’ve written GUI software in Ruby for
qualitative data analysis, but my inclination to create a competitor to
SPSS on the quant side (eg a GUI round ruby’s R bindings) is limited.
It’s partly a frank appreciation of the difficulty of the task, and
partly down to the fact that SPSS is provided “free” to UK academics by
nationwide licensing agreements with universities.

alex

Alex F. wrote:

I use Ruby quite a lot at work for data cleaning, transformation and
also for generating SPSS syntax. For example, I used it to create a long
set of commands for linking together waves of the longitudinal British
Household Panel Study.

Interesting … how does Ruby compare with other languages for this
purpose? We might be getting SPSS and if it’s as bizarre as I remember,
I’m going to need some way of preserving my sanity while using it.

Possibly R Commander, which is a Tk interface onto R:
http://socserv.mcmaster.ca/jfox/Misc/Rcmdr/

I haven’t ever used it myself, though; it seems to have a good feature
set but is missing some things I use in SPSS, eg Probit models.

Yes, R Commander is pretty good for a beginner, but it’s a crutch IMHO.
R and its ancestor S were deliberately designed to be programming
languages and interactive environments.

Not that I know of. I agree re R - on numerous attempts I’ve never
managed to get anywhere with it (I have 8 years programming experience
and a postgrad in Research Methods). It also seems much more geared to
the needs of natural rather than social science.

Outside of “pure statistics”, the two most highly-developed application
areas for R are biology (http://www.bioconductor.org) and quantitative
finance aka “program trading”. Quantitative finance, however, tends to
jump on bandwagons and jump off onto the “next big thing” quickly as
well.

It used to be you’d walk into a quant shop and they’d all be coding in
APL. Then you’d walk into the place a year later and they’d have
something else. So the “golden days” of R among quants may have passed.
I think they’re into OCaml these days. Or is it Haskell? :-)

There are things I don’t like about SPSS too, apart from price - some
interface aspects, and its syntax.

I was talking to a colleague about this just yesterday. I left Minitab
for R for two reasons:

  1. It didn’t have a real programming language, and
  2. The system as distributed couldn’t do a non-linear regression out of
    the box.

SPSS has been around a long time. As far as I can remember, the only
thing older was the UCLA Bio-Med package from the early 1960s! Does it
still read like a hodge-podge of FORTRAN, macro assembler, JCL and such?

I’ve written GUI software in Ruby for
qualitative data analysis, but my inclination to create a competitor to
SPSS on the quant side (eg a GUI round ruby’s R bindings) is limited.
It’s partly a frank appreciation of the difficulty of the task, and
partly down to the fact that SPSS is provided “free” to UK academics by
nationwide licensing agreements with universities.

There are a couple of other GUI projects for R. There is an “R-gui”
mailing list where they all hang out. But it’s hard to argue with the
basic philosophy. R is supposed to be a programming language, not a
statistics package. For that matter, Ruby is supposed to be a
programming language, too. :-)

I’ve been a programmer for a long time and it didn’t take me long to
learn R. In a sense, S and R are dialects of Lisp, so if you’re used to
procedural languages as opposed to functional languages, you’ll have a
steeper learning curve. And if you’re used to object-oriented
programming as done in Smalltalk, Java or Ruby, you’ll find R’s
“objects” and “classes” totally different. They’re a bit like Common
Lisp’s CLOS in some senses, but not enough that you’d be able to
transfer any preconceived notions. I don’t tend to use them – I’m
perfectly happy with a “define-functions-from-the-bottom-up” programming
style I learned from Lisp 1.5.

M. Edward (Ed) Borasky wrote:

Alex F. wrote:

I use Ruby quite a lot at work for data cleaning, transformation and
also for generating SPSS syntax. For example, I used it to create a long
set of commands for linking together waves of the longitudinal British
Household Panel Study.

Interesting … how does Ruby compare with other languages for this
purpose? We might be getting SPSS and if it’s as bizarre as I remember,
I’m going to need some way of preserving my sanity while using it.

Ruby works nicely for generating SPSS syntax, mainly because of its
highly functional String/Hash/Array/Regexp classes. For preparing data,
Excel’s also useful, because it includes basic statistical functions (eg
normal distribution), and because you can copy-n-paste data from a
spreadsheet into SPSS’s Data Editor.
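
As a trivial illustration of the sort of thing I mean, here’s roughly
how I’d spit out repetitive wave-linking syntax; the wave prefixes and
variable names below are invented for the example, not the real BHPS
ones:

# Generate repetitive SPSS syntax with plain string interpolation.
waves = ('a'..'e').to_a
vars  = %w[age sex income region]

syntax = waves.map do |w|
  renames = vars.map { |v| "(#{w}#{v} = #{v}_#{w})" }.join(' ')
  ["GET FILE = 'wave_#{w}.sav'.",
   "RENAME VARIABLES #{renames}.",
   "SAVE OUTFILE = 'wave_#{w}_renamed.sav'.",
   ''].join("\n")
end.join("\n")

File.open('link_waves.sps', 'w') { |f| f.write(syntax) }

Each wave file gets its variables suffixed with the wave letter, ready
for merging; the same trick works for any long run of near-identical
commands.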

You might want to evaluate Stata as an alternative to SPSS. I haven’t
used it but several more quantitatively-oriented researchers I know
speak well of it.

Not that I know of. I agree re R - on numerous attempts I’ve never
managed to get anywhere with it (I have 8 years programming experience
and a postgrad in Research Methods). It also seems much more geared to
the needs of natural rather than social science.

Outside of “pure statistics”, the two most highly-developed application
areas for R are biology (http://www.bioconductor.org) and quantitative
finance aka “program trading”. Quantitative finance, however, tends to
jump on bandwagons and jump off onto the “next big thing” quickly as well.

I guess in the “softer” end of social science where I work the data sets
are relatively small and the analyses not computationally intensive. So
GUI ease-of-use for occasional users is an important distinguishing
feature.

SPSS has been around a long time. As far as I can remember, the only
thing older was the UCLA Bio-Med package from the early 1960s! Does it
still read like a hodge-podge of FORTRAN, macro assembler, JCL and such?

Don’t know its heritage, but it’s an ugly baby… here’s some code to
create a composite key of a four-digit year and a one-digit UK region
code:

STRING REGION_YEAR(A6).
COMPUTE REGION_YEAR = CONCAT(
  STRING(Region, F1), '_', STRING(Year, F4)
).
EXECUTE.

I’ve been a programmer for a long time and it didn’t take me long to
learn R. In a sense, S and R are dialects of Lisp, so if you’re used to
procedural languages as opposed to functional languages, you’ll have a
steeper learning curve. And if you’re used to object-oriented
programming as done in Smalltalk, Java or Ruby, you’ll find R’s
“objects” and “classes” totally different.

Interesting - I’ve been programming for eight years, but my experience
is almost all in Ruby, Perl and JavaScript, with a bit of C++. That’s
probably why I find the SPSS and R syntax so uncomfortable.

alex