Are ML and Statistics Complementary? [pdf] (ics.uci.edu)

68 points by snippyhollow 10 years ago | 31 comments

ktamura 10 years ago |

They definitely are as far as their roles at (most) startups are concerned.

Unless your startup's core strategy involves machine learning, statistics tends to come in handier than machine learning in the early days. Most likely, what moves your company is not a data product built atop machine learning models but the ability to draw less wrong conclusions from your data, which is the very definition of statistics. Also, in the early days of a startup, you run into small/missing data problems: you have very few customers and very incomplete datasets with a lot of gotchas. Interpreting such bad data is no small feat, but it's definitely different from training your Random Forest model against millions of observations.
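To make the small-data point concrete, here is a minimal sketch of one standard tool for it, the Wilson score interval for a proportion. The customer counts are hypothetical and the function name `wilson_interval` is just an illustration, not something from the paper or the thread.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1 + z * z / trials
    center = (p_hat + z * z / (2 * trials)) / denom
    half_width = z * math.sqrt(
        p_hat * (1 - p_hat) / trials + z * z / (4 * trials * trials)
    ) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the edges.
    return max(0.0, center - half_width), min(1.0, center + half_width)

# Hypothetical early-stage numbers: 3 of 12 trial customers converted.
low, high = wilson_interval(3, 12)
print(f"observed rate {3 / 12:.2f}, plausible range {low:.2f} to {high:.2f}")
```

With a dozen customers the interval is wide enough that most "our conversion rate is X" claims are premature; the same observed rate on a few hundred customers gives a much tighter range.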

tristanz 10 years ago |

LeCun has a comment on this paper here: https://www.facebook.com/yann.lecun/posts/10153293764562143

jupiter90000 10 years ago | |

Thanks for sharing that, he's got some interesting stuff to say about this topic.

washedup 10 years ago |

Here is a link to the paper referenced in the beginning: http://courses.csail.mit.edu/18.337/2015/docs/50YearsDataSci...

Great read for anyone interested in the debate.

nextos 10 years ago |

I think they will eventually converge.

Probabilistic programming is already a hint of this. The most general class of probability distributions is that of non-deterministic programs. ML is just a quick and dirty way to write these programs.

murbard2 10 years ago | |

It's not just a way to write them; it's a way to do inference. Probabilistic programming is extremely powerful in terms of representation, but inference is, in general, intractable. Yes, you can express all those ML models as probabilistic programs, but the sampler isn't going to perform nearly as well as the original algorithm.
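A toy sketch of that gap, assuming a coin with a uniform prior on its bias (the model, the observed counts, and all function names here are made up for illustration). The generic approach reruns the generative program and keeps only runs that reproduce the data (rejection sampling); the specialized approach uses the known conjugate result.

```python
import random

def coin_program(rng, n_flips):
    """Generative 'probabilistic program': draw a bias, then flip the coin."""
    bias = rng.random()  # prior: Uniform(0, 1)
    heads = sum(rng.random() < bias for _ in range(n_flips))
    return bias, heads

def rejection_posterior_mean(observed_heads, n_flips, n_runs, seed=0):
    """Generic inference: rerun the program, keep runs that match the data."""
    rng = random.Random(seed)
    kept = []
    for _ in range(n_runs):
        bias, heads = coin_program(rng, n_flips)
        if heads == observed_heads:
            kept.append(bias)
    if not kept:
        raise ValueError("no run matched the data; increase n_runs")
    return sum(kept) / len(kept), len(kept) / n_runs

def conjugate_posterior_mean(observed_heads, n_flips):
    """Specialized inference: Uniform prior + binomial data gives Beta(k+1, n-k+1)."""
    return (observed_heads + 1) / (n_flips + 2)

# Hypothetical data: 7 heads in 10 flips.
approx_mean, acceptance_rate = rejection_posterior_mean(7, 10, 20000)
exact_mean = conjugate_posterior_mean(7, 10)
print(f"sampler: {approx_mean:.3f} (kept {acceptance_rate:.1%} of runs), exact: {exact_mean:.3f}")
```

Both routes should land on roughly the same posterior mean, but the generic sampler throws away most of its work even on a one-parameter model, and the waste grows quickly as the data get larger or the model gets richer.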

p4wnc6 10 years ago |

What is commonly understood as 'statistics' is just a specialized subset of machine learning. Machine learning generalizes statistics.

The correct complement to machine learning is cryptography -- trying to intentionally build things that are provably intractable to reverse engineer.

51109 10 years ago | |

Working with both statisticians and pure machine learners on the same task, I did notice tendencies, presuppositions and modi operandi that differed (beyond one being a specialized subset of the other). As this position paper says, machine learners like to throw computation and parameters at the problem, whereas statisticians are more careful and sober. As an analogy, a statistician will approach a cliff very carefully, stomping the ground to make sure it is sturdy enough to carry a human. They'll approach the edge of the cliff until they have their p-measures, and that is their model. Machine learners will jump head-first off the cliff, and when you listen you can hear them yell: Cross-validatioooohhhh... as they plummet down.

I like the complement with cryptography. I would add another coding method: compression, i.e. approximating the simplest model with explanatory power.

p4wnc6 10 years ago | | |

I have had the exact opposite experience with machine learning and statistics. In my experience, those who come from the 'statistics' side tend to use constructs, like null hypothesis significance testing, which are not consistent even from a theoretical point of view. And further, when they use them, they do awful things like p-hacking, or using a direct comparison of t-stats as a model selection criterion, which bring theoretical problems of their own, not to mention lots of statistical biases and so forth.
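For anyone who hasn't seen why p-hacking is a problem, a minimal simulation sketch (all parameters and function names are hypothetical choices for illustration): every "effect" below is pure noise, each test is a z-test with known variance 1, and we compare one planned test against reporting the best of 20.

```python
import math
import random

def two_sided_p(z):
    """Two-sided p-value for a z statistic under a standard normal null."""
    return math.erfc(abs(z) / math.sqrt(2))

def best_p_among_null_tests(rng, n_tests, n_samples):
    """Run n_tests z-tests on pure noise and return the smallest p-value."""
    best = 1.0
    for _ in range(n_tests):
        sample_mean = sum(rng.gauss(0, 1) for _ in range(n_samples)) / n_samples
        z = sample_mean * math.sqrt(n_samples)  # known variance of 1
        best = min(best, two_sided_p(z))
    return best

def false_positive_rate(n_experiments=1000, n_tests=20, n_samples=30, alpha=0.05, seed=0):
    """Share of all-noise experiments where the best of n_tests looks significant."""
    rng = random.Random(seed)
    hits = sum(
        best_p_among_null_tests(rng, n_tests, n_samples) < alpha
        for _ in range(n_experiments)
    )
    return hits / n_experiments

honest_rate = false_positive_rate(n_tests=1)
hacked_rate = false_positive_rate(n_tests=20)
print(f"one planned test: {honest_rate:.2f}, best of 20: {hacked_rate:.2f}")
```

With independent tests, the chance that at least one of 20 null results clears alpha = 0.05 is 1 - 0.95^20, about 0.64, so the "best of 20" rate should sit far above the nominal 5%.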

I find the machine learning approach is far more humble. It starts out by saying that I, as a domain expert or a statistician, probably don't know any better than a lay person what is going to work for prediction or how to best attribute efficacy for explanation. Rather than coming at the problem from a position of hubris, where my stats background and I know what to do, I will try to arrive at an algorithmic solution that has provable inference properties, and then allow it to work and commit to it.

Either side can lead to failings if you just try to throw an off-the-shelf method at a problem without thinking, but there's a difference between criticizing the naivety with which a given practitioner uses the method versus criticizing the method itself.

When we look at the methods themselves, I see much more care, humility, and attention to avoiding statistical fallacies in the machine learning world. I see a lot of sloppy hacks and approaches that are invalid from first principles (like NHST) on the 'statistics' side. And even when we consider how practitioners use them, both sides are pretty much equally guilty of just throwing methods at a problem like a black box. Machine learning is no more of a black box than a garbage-can regression from which t-stats will be used for model selection. However, all of the notorious misuses of p-values and the conflation that creeps into policy questions (questions for which a conditional posterior is necessarily required, but for which likelihood functions are substituted as a proxy for the posterior) seem uniquely problematic for the 'statistics' side.

Three papers that I recommend for this sort of discussion are:

[1] "Bayesian estimation supersedes the t-test" by Kruschke, http://www.indiana.edu/~kruschke/BEST/BEST.pdf

[2] "Statistical Modeling: The Two Cultures" by Breiman, https://projecteuclid.org/euclid.ss/1009213726

[3] "Let's put the garbage-can regressions and garbage-can probits where they belong" by Achen, http://www.columbia.edu/~gjw10/achen04.pdf

Tarrosion 10 years ago | |

That's a strong and not-at-all obvious statement. Can you elaborate?

sjg007 10 years ago | |

Machine learning does not generalize statistics; mathematics does.

p4wnc6 10 years ago | | |

Machine learning is one subfield of mathematics that generalizes another, further subfield (statistics).

sjg007 10 years ago |

This is a great summary of the field.

_0w8t 10 years ago |

I think expecting an explanation for the results of modern machine learning is wishful thinking. I personally cannot explain my gut feelings. So why should we expect an explanation when a machine deals with the same class of problems?

Besides, it is easy to get a wrong explanation and, as Vladimir Vapnik observed in his "3 metaphors for complex world" ( http://www.lancaster.ac.uk/users/esqn/windsor04/handouts/vap... ), "actions based on your understanding of God’s thoughts can bring you to catastrophe".