AutoML toolkit for neural architecture search and hyper-parameter tuning

AutoML toolkit for neural architecture search and hyper-parameter tuning (github.com)

147 points by msalvaris 7 years ago | 59 comments

I manage a machine learning team for a large financial services company and AutoML tools, Microsoft’s NNI included, are on our radar.

I think the `future of work` for machine learning practitioners will quickly separate into two groups: a very small and elite group that performs research, and a much larger group that uses AutoML but whose jobs deal more with data preparation (which also gets automated) and ML devops, supporting models in production.

mlthoughts2018 7 years ago | |

This sounds like parody to me. There are so many problems in applied statistics, and neural networks are not helpful for most of them. Consider Bayesian analysis for very small data sets as an example (just the tip of the iceberg).

In financial services in particular, there are tons of time series and regression problems on small data such that a neural network (beyond perhaps some super small MLP) would be a ridiculous thing to try.

I think the breakdown of workload you described will only happen in business departments where there is a need for large scale embedding models, enhanced multi-modal search indices, computer vision and natural language applications, and maybe a handful of things that eventually productize reinforcement learning. I could also see this happening in businesses that can benefit from synthetically generated content, like stock photography, essays / news summaries / some fiction, website generators, probably more.

What I described above is a tiny drop in the ocean of applied statistics problems that businesses have to solve.

DebtDeflation 7 years ago | | |

It's another example of the FAANG + Bay Area Startups world versus the other 99% of Corporate America. In the latter world, most of the "machine learning" in production is traditional stuff like Random Forest, SVM, and more recently Gradient Boosting. Hell, Marketing departments across the country are still running old school decision tree (CART and CHAID) models and logistic regression models written in SAS 20+ years ago. DL/NN is a minuscule proportion of production ML in the enterprise space.

byebyetech 7 years ago | | |

Deep Learning also works on very small data sets by means of embeddings. A large model trained on large data sets can be used as a feature-extraction tool when training on small data sets.

human_scientist 7 years ago | | |

The parent did not specifically talk about NNs. As I understand it, AutoML could apply to all statistical endeavours that involve estimation (classical or Bayesian).

mjburgess 7 years ago | | |

The problem is "Applied Statistics" became "Machine Learning" which became "AI" which became "Deep Learning".

Throw away all the BS and, yes, it's obvious.

bitL 7 years ago | |

Google, Facebook & MS have already automated even research, i.e. automated selection of a loss function, network architecture, individualized network topology, etc. Amazon is not there yet. The rest of the industry is still in the "stone age", just "considering" using something like AutoML for basic hyperparameter tuning.

bitforger 7 years ago | | |

If you automate it, is it still research? Research implies some sort of hypothesis testing, yes?

I suppose OP means there will be two groups: people who use AutoML and people who try to make AutoML better.

noelsusman 7 years ago | |

Hasn't this always been the case? Actually fitting a model has always been a pretty small part of an applied statistician's job. The real work is everything before and after that point.

williamsmj 7 years ago |

I'd be interested in the creator's thoughts on this paper, "Random Search and Reproducibility for Neural Architecture Search", https://arxiv.org/abs/1902.07638, posted on the arxiv last week. Among other conclusions, they find:

"Our results show that random search with early-stopping is a competitive NAS baseline, e.g., it performs at least as well as ENAS, a leading NAS method, on both benchmarks"

ENAS, the specific algorithm that they find does no better than random search, is in this library. My understanding is that the results are pretty generic though, i.e. NAS is very far from a solved problem. (Hyperparameter tuning for "classical" models is another matter. That's commoditized and available as a service at this point, see tpot, DataRobot, etc.)
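For readers unfamiliar with the baseline being discussed, here is a minimal sketch of random search with a simple early-stopping rule. Everything in it is an assumption for illustration: `validation_loss` is a hypothetical toy stand-in for real training, the search space (`lr`, `momentum`) is made up, and the stopping rule (drop a trial once it trails the current best at the same epoch) is a simplification, not the procedure used in the paper or in NNI.

```python
import random

def validation_loss(config, epoch):
    """Hypothetical stand-in for 'train one more epoch, then measure validation loss'."""
    # Toy surface: best near lr=0.1, momentum=0.9, and improving with more epochs.
    distance = (config["lr"] - 0.1) ** 2 + (config["momentum"] - 0.9) ** 2
    return distance + 1.0 / (epoch + 1)

def random_search(n_trials=30, max_epochs=10, grace=2, seed=0):
    rng = random.Random(seed)
    best_config, best_curve = None, None
    epochs_run = 0
    for _ in range(n_trials):
        # Sample a configuration at random (log-uniform learning rate).
        config = {"lr": 10 ** rng.uniform(-3, 0), "momentum": rng.uniform(0.5, 0.99)}
        curve = []
        for epoch in range(max_epochs):
            curve.append(validation_loss(config, epoch))
            epochs_run += 1
            # Early stopping: after a grace period, abandon trials that
            # trail the incumbent's loss at the same epoch.
            if best_curve is not None and epoch >= grace and curve[epoch] > best_curve[epoch]:
                break
        else:
            # The trial ran to completion; keep it if it beats the incumbent.
            if best_curve is None or curve[-1] < best_curve[-1]:
                best_config, best_curve = config, curve
    return best_config, best_curve[-1], epochs_run

best_config, best_loss, epochs_run = random_search()
print(best_config, best_loss, epochs_run)
```

The point of the early stop is only to save budget: `epochs_run` can never exceed `n_trials * max_epochs`, and it is usually far lower because most random configurations are dropped after the grace period.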

wongarsu 7 years ago |

> We support Linux (Ubuntu 16.04 or higher), MacOS (10.14.1) in our current stage.

No Windows support in a Microsoft product. Curious.

This looks very useful for tuning hyper-parameters, and the fact that the tuned algorithm is treated as a black box makes this very flexible.

yeahhhhh 7 years ago | |

Actually, they will support Windows later. Because many developers usually train their deep learning models on Linux, they support Linux and Mac first.

perturbation 7 years ago |

Their example with LightGBM (https://nni.readthedocs.io/en/latest/gbdt_example.html) is very cool - I wanted to put together a custom script with mlflow + catboost + mlrMBO to do something similar, but this puts everything together in one package.

I think this does everything MLFlow does and more (besides maybe helping with deployment?)

yzh 7 years ago |

I'm working on automatic hyper-parameter tuning and network optimization. I've always thought that people put too much focus on NAS, which aims to create a whole new network from scratch, and not nearly enough on hyper-parameter tuning and local structural optimizations for an existing network, which I think is more in demand, at least in industry. It looks less cool than NAS though; maybe that's the reason.

sgt101 7 years ago |

I don't understand - isn't this model fishing? How is it different?

thanatropism 7 years ago | |

With training, test and validation sets.

In good old fashioned statistics there's the idea of the jackknife: for the i-th sample run a regression on all the data except i, and store statistics of interest (coefficients, predictions, etc). This gives you an ipso facto sampling distribution for the statistics of interest.

Similar and more common in econometrics is the bootstrap: run your model on something like 1999 resamples of the data (drawn with replacement) and get sampling distributions.

With said sampling distributions, whether from the jackknife or the bootstrap, you're able to test whether your model is valid -- what's the probability that it'll have significant coefficients or an r2/mae/mape score indicating predictive capacity.
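As a rough illustration of the two resampling schemes described above, here is a minimal sketch that builds jackknife and bootstrap sampling distributions for a regression slope. The data are synthetic and the helper names (`slope`, `jackknife_slopes`, `bootstrap_slopes`) are made up for this example; the percentile interval at the end is just one simple way to summarize the bootstrap distribution.

```python
import random

def slope(xs, ys):
    """Ordinary least-squares slope of y on x."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

def jackknife_slopes(xs, ys):
    """Refit with the i-th observation left out, for every i."""
    return [slope(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:]) for i in range(len(xs))]

def bootstrap_slopes(xs, ys, n_resamples=1999, seed=0):
    """Refit on resamples of (x, y) pairs drawn with replacement."""
    rng = random.Random(seed)
    n, out = len(xs), []
    while len(out) < n_resamples:
        idx = [rng.randrange(n) for _ in range(n)]
        if len({xs[i] for i in idx}) < 2:
            continue  # degenerate resample: the slope is undefined
        out.append(slope([xs[i] for i in idx], [ys[i] for i in idx]))
    return out

# Synthetic data (an assumption for the example): y = 2x + 1 plus unit Gaussian noise.
rng = random.Random(42)
xs = [float(i) for i in range(30)]
ys = [2.0 * x + 1.0 + rng.gauss(0, 1) for x in xs]

jack = jackknife_slopes(xs, ys)
boot = sorted(bootstrap_slopes(xs, ys))
# Simple 95% percentile interval from the bootstrap distribution.
ci_low, ci_high = boot[int(0.025 * len(boot))], boot[int(0.975 * len(boot))]
print(min(jack), max(jack), ci_low, ci_high)
```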

Cross-validation (and even scikit-learn is starting to default to five folds, not three) is a "lazy" version of this. You don't get a sampling distribution, but at least you can tell when a given model only appears good because it grips the training data with all its might and doesn't work out-of-sample.
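To make that last point concrete, here is a small sketch of five-fold cross-validation written by hand. The "model" is a deliberately memorizing 1-nearest-neighbour predictor and the data are pure noise, both chosen for illustration only: in-sample the model looks perfect, while the held-out folds show there was nothing to learn.

```python
import random

def one_nn_predict(train, x):
    """Memorizing model: return the y of the closest training x."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

def mae(train, points):
    """Mean absolute error of the 1-NN model fitted on `train`, scored on `points`."""
    return sum(abs(one_nn_predict(train, x) - y) for x, y in points) / len(points)

def k_fold_mae(data, k=5, seed=0):
    """Shuffle, split into k folds, and score each fold with a model fitted on the rest."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    scores = []
    for i, held_out in enumerate(folds):
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        scores.append(mae(train, held_out))
    return scores

rng = random.Random(1)
data = [(x / 10, rng.gauss(0, 1)) for x in range(50)]  # y is pure noise

in_sample = mae(data, data)       # the model reproduces its own training data
out_of_sample = k_fold_mae(data)  # one score per held-out fold
print(in_sample, out_of_sample)
```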

sklearn even offers something close to the jackknife under the ML-y name "leave-one-out" (`LeaveOneOut` in `sklearn.model_selection`).

glial 7 years ago | |

Yes, but that's not necessarily bad. You want a model that effectively captures the structure present in your dataset. There are currently only rules of thumb for model architecture, and it makes sense to explore the model space to determine which architecture and hyper-parameters suit the needs at hand. Two things save this from being a statistical sin. First, the final evaluation set is typically different from the validation set, and evaluation is only performed at the end of the 'fishing expedition', thus providing a reliable measure of the model's ability to generalize. Second, we're doing engineering here, not science, and our goal is to capture the structure of observations, not to make a scientific claim about the values of latent parameters.
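A toy sketch of why the separate evaluation set matters: the labels below are coin flips and every "candidate model" is a fixed random guesser (all of this is an assumption for illustration, not anything NNI does). Picking the best of 200 candidates on the validation set produces an optimistic validation score, and a single evaluation on the untouched test set is what shows how the chosen candidate really generalizes.

```python
import random

rng = random.Random(0)
n_val, n_test, n_candidates = 100, 100, 200

# Coin-flip labels: there is no structure for any model to find.
labels = [rng.random() < 0.5 for _ in range(n_val + n_test)]
val_ids = range(0, n_val)
test_ids = range(n_val, n_val + n_test)

# Each hypothetical candidate is just a fixed random prediction per data point.
candidates = [[rng.random() < 0.5 for _ in labels] for _ in range(n_candidates)]

def accuracy(preds, ids):
    """Fraction of points in `ids` where the prediction matches the label."""
    return sum(preds[i] == labels[i] for i in ids) / len(ids)

# The 'fishing expedition': select the candidate that scores best on validation.
best = max(candidates, key=lambda preds: accuracy(preds, val_ids))
val_acc = accuracy(best, val_ids)    # inflated by the selection step
test_acc = accuracy(best, test_ids)  # evaluated once, after selection
print(val_acc, test_acc)
```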

sandGorgon 7 years ago |

Interesting - there's no scikit-learn support, which has long been the mainstay for data scientists everywhere.

Are people migrating from scikit-learn to TensorFlow in production for non-deep-learning use cases?

nurettin 7 years ago |

Do we need a hyper-parameter tuner tuner for this?

mlthoughts2018 7 years ago | |

Stuart Geman (one of the inventors of Gibbs Sampling) always used to say, “Parameters are the death of an algorithm.”

nurettin 7 years ago | | |

Environmental constraints (like width, height) are not bad. I would have argued with Mr. Geman.

angel_j 7 years ago |

Does it test against and prevent over-fitting?

hestefisk 7 years ago |

This is very cool.