R for the Enterprise(oracle.com) |
R for the Enterprise(oracle.com) |
Oracle has basically wrapped up R and made it embedded in the Database. One nice touch is that they have made overloaded versions of most of R's base and stat functions, and let them handle the "ore" dataframes; these overloaded versons are basically wrappers on the "big-data" functions inside the DB.
If this is really interesting to you, here are a set of PDFs describing Oracles approach: http://www.oracle.com/technetwork/database/options/advanced-...
Bad news is: you are limited to Oracle approaches to scale and growth; while there's some to love, there's also lots to hate.
I worry that some of this commercialization of R is going to cause trouble, as we start having non-compatible forks of R creating non-replicable analyses. If I build a great model that can only be replicated via some vendor's proprietary approach, then it may be great for my business, but it doesn't move the field forward.
Yes, because Redhat is doing so much better than Oracle with its OSS approach...
And it's not like other major companies support OSS with their products, like say IBM...
As a programmer, it seems like an awkward language, although no more awkward than SAS, SPSS, etc. And as I do more analysis the language makes more sense. It's a special-purpose language made to do a specific task.
The general workflow for doing data analysis is 1) import the data 2) clean it and format it properly as input for a pre-built package that does the actual analysis 3) feed it to the package and 4) interpret the results.
To that end, typically R programs are short and pretty declarative. R packages contain C or FORTRAN extensions that do all the heavy lifting. Substantial amount of imperative R code is going to be slow. For instance, looping over a vector is always worse than applying a vector transformation, and R provides a rich set of transformations for all its data types.
R has gotten popular because the proprietary guys dropped the ball at the universities. I recall reading a posting by one researcher who said he switched to R because his students could only use SAS at the school's stats lab, whereas they could run R on their computers at home.
Once researchers switched to R, they started publishing their work with code meant to be run with R. The cutting edge is important in stats, so people want a short lead time between when a new test or model is published and when it's available. SAS's "cathedral" can't really keep up. Combine that with SAS's licensing costs (both arms and a kidney too) as well as its overall "mainframey" feel, and you can see why R is winning.
EDIT: Another big win for R that I forgot to mention is its support for visualizations. A step that should perhaps come after importing the data above is investigating it with various diagnostic charts (scatter charts, box charts, etc) these are all just function calls in R. In addition, R has a powerful graphics engine and there are a huge number of packages available to create more sophisticated visualizations: http://addictedtor.free.fr/graphiques/
John Chambers created "S" at Bell Labs. S was a programming language designed for interactive statistical analysis. Much like gcc and icc are implementations of C compilers, R and S-PLUS are implementations of S. S-PLUS was/is the primary proprietary implementation of the S language, whereas R is the primary free one (also, sometimes called GNU S). (SAS and SPSS are completely different languages/systems as far as I know.) I think that statisticians at some point made a conscious effort to publish their work in R, rather than S-PLUS (or any other statistical system like SAS) because it was more widely available. That in turn led R to be a viable competitor to S-PLUS (and other systems) because it had vast amounts of recent statistical libraries, often implemented by the people who developed the techniques. That said, SAS and SPSS seem to pretty much still have social science students locked up --- the market for R is probably statisticians who are also excellent functional programmers.
This history is in really marked contrast to MATLAB and its corresponding free version Octave, where computer scientists pretty much refuse to use Octave, despite MATLAB's massive price tag to pretty much everyone involved (even with 90% discounts).
(That said, if anyone lived through the change over from S-PLUS to R, I'd love to hear if this history is wrong!)
Do you have any insights as to why Octave does not have higher adoption?
Computer science professors probably view a couple hundred dollars per MATLAB network license as a tiny expense on a $1m+ grant (whereas statistics grants are apparently often smaller), and they may be charged for it in departmental overhead anyway (removing the incentive to cut costs).
The type of people who could contribute either core code or toolbox type code to Octave often have an extremely rare quantitative skill set that is worth hundreds of dollars an hour, so there is a huge incentive to get paid to do similar work instead. There probably isn't much community recognition (to balance things out) for implementing a library in Octave. (Though, in the R world there are certain recognizable superstars like Hadley Wickham.)
Graduate students (who might work for cheap on these problems) are probably more focused on publications and networking.
As long as all of this is the case, Octave will always kind of just be a worse MATLAB that happens to be open source, so a new user choosing between them will probably just choose MATLAB by default.
It is true that we have a lot of trouble attracting new contributors. Most of our users keep demanding features that seem to us unimportant but to them are all the world: a GUI ("whatever for?", we think. "Use a real text editor!"), a JIT compiler ("here's a nickle, get better vectorised code, kid"), perfect Matlab compatibility (a never-ending chase, not very fun, in which we must always be behind).
Of these, we're finally slowly listening to our users. Two of our current three GSoC students are working on a GUI and a JIT compiler respectively. I have wild hope that this will attract more users and developers. I'm also currently hosting an Octave conference in a few days towards this goal:
http://www.octave.org/wiki/index.php?title=OctConf_2012
By the way, Octave is GNU (so is R, supposedly), so we're not really open source; we're free. ;-)I don't know why Octave hasn't been able to replicate R's success. I don't know if R's not really being GNU despite in name has something to it (R developers routinely try to find new ways to get around the GPL and link R to non-free code, and I don't doubt that this linking to Oracle's database is another example of that). I don't know if it's just that a lot of people with big money care more about statistics and R than they care about Octave (banks and brokers for R, electrical and civil engineers for Otave). Maybe our code sucks more than R's.
Do you have any suggestion how to make Octave the standard instead of Matlab? The recent gratis classes that emerged from Stanford gave Octave a lot of publicity. Do you have any suggestion of what else we might do?
I do not think Octave ever tried to replicate its workflow (which is not general programmer centric at all) and domain-specific documentation but merely focused on the underlying language compatibility, which is really the least important part of Matlab.
On top of that, I seem to recall, Matlab was one of the first of the specialist programming toolsets to offer a very competitive "Student Edition". This was a godsend for schools and universities before the Internet took off.
In short, Octave was too little too late, and Numpy/Scipy, while catching up fast, has supporting tools spread all over the place as well as being geared more to general programmers who want access to convenient numerics rather than numerical modellers/engineers wanting a RAD tool.
Numpy/Scipy etc. may well overtake Matlab eventually, but that will be purely a function of its infrastructure, not the something as mundane as even the nice language (which admittedly was its initial driving force). At least in this respect, it has done a lot better than Octave in much less time.
MATLAB has a nice GUI and IDE. It also generates good graphs with minimal effort.
Octave has a command-line REPL.
This is why we are doing Octave.