MXNet – Deep Learning Framework of Choice at AWS (allthingsdistributed.com)
As others have commented here, there is no evidence that MXNet is that much better (or worse) than the other frameworks.
Those properties combined make TensorFlow the most engineer/practitioner-friendly choice on the market. If AWS seriously hopes to compete with TensorFlow, they need to catch up on support for those seemingly trivial but important details.
And I'm rooting for Amazon (and Facebook, and Microsoft...). TensorFlow needs competition for the hearts and minds of developers.
1: https://www.usenix.org/system/files/conference/osdi16/osdi16...
[full disclosure, I work on the TensorFlow team]
It's a bit of a double-edged sword. As developers, this war gives us free access to well-funded and heavily developed tools. The world has been fundamentally changed by their availability. But at the same time we need to understand that the primary reason they exist is to lock developers into a particular vendor. It's most transparent with Google's TensorFlow, where they were open about their intention to offer TensorFlow services on their cloud platform.
This article more than most exemplifies their desperate attempts. For now it seems to remain mostly that, desperate attempts, with the tools remaining more-or-less platform agnostic. But I foresee a grim future where our best libraries and tools are tied inextricably to a commercial ecosystem.
Wonder what would happen to that scaling efficiency if those GPUs were P40s?
See also the absence of equivalent AlexNet numbers to further obscure attempts at comparing this to the other guys(tm).
Can't wait for Intel's response to this.
> a Deep Learning AMI, which comes pre-installed with the popular open source
> deep learning frameworks mentioned earlier; GPU-acceleration through CUDA
> drivers which are already installed, pre-configured, and ready to rock
You might want to clarify that the negative reviews [0] are from earlier versions which did not include the CUDA drivers. I recently considered this AMI and rejected it for a class [1] because of these reviews.
[0] https://aws.amazon.com/marketplace/reviews/product-reviews?a...
[1] https://www.meetup.com/Cambridge-Artificial-Intelligence-Mee...
Without being backed by any benchmarks? This claim is lazy.
So perhaps I'm not well versed enough in deep learning, but does this mean that they solved the vanishing gradient problem? How are they managing to do this?
This is kind of related to solving the vanishing gradient issue in RNNs by using additive recurrent architectures like LSTMs and GRUs.
Alternatively it's possible to use concatenative skip connections as in DenseNets: https://arxiv.org/abs/1608.06993
Still, using 1000 layers is useless in practice. State-of-the-art image classification models are in the range of 30-100 layers, with residual connections and varying numbers of channels per layer depending on the depth, so as to keep the total number of trainable parameters tractable. The 1000-layer nets are just interesting as a memory scalability benchmark for DL frameworks and as an empirical check that the optimization problem is feasible, but they are of no practical use otherwise (as far as I know).
I learned about it last week, but I don't see much benefit if the goal is good performance.
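To make the point about additive skip connections concrete, here is a toy sketch. It assumes a stack of scalar linear layers with no nonlinearity, and the weight value is hypothetical and chosen only for illustration, so it is not a model of any real network. It only shows why adding the input back keeps the end-to-end gradient from shrinking towards zero.

```python
def chain_gradient(weights, residual):
    """Gradient of the output w.r.t. the input for a stack of scalar linear layers.

    Plain layer:    h_next = w * h       -> local derivative is w
    Residual layer: h_next = h + w * h   -> local derivative is 1 + w
    The end-to-end gradient is the product of the local derivatives (chain rule).
    """
    grad = 1.0
    for w in weights:
        grad *= (1.0 + w) if residual else w
    return grad


depth = 100
weights = [0.01] * depth  # hypothetical small weights, an assumption for this toy example

plain_grad = chain_gradient(weights, residual=False)
residual_grad = chain_gradient(weights, residual=True)

print("plain stack gradient:   ", plain_grad)
print("residual stack gradient:", residual_grad)
```

In the plain stack the gradient is a product of many small factors and collapses; in the residual stack each factor stays close to 1, so the signal survives the depth.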
The computation graph is an in-memory datastructure that can be introspected by the program itself at runtime so as to do symbolic operations (e.g. compute the gradient of one node in the graph with respect to any ancestor input node).
Theano implements this in pure Python and can generate C or CUDA code from string templates (in Python). TensorFlow has a Python API to assemble pre-built operators, which are mainly written in C++ and use the Eigen linear algebra library.
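A rough sketch of the idea, in plain Python: the graph is an ordinary in-memory data structure, so the program can walk it to build a new graph for a derivative and then execute either graph with concrete inputs. The `Node` class and the helper names are hypothetical and are not the API of Theano, TensorFlow or MXNet.

```python
class Node:
    """One node of a tiny computation graph (hypothetical, for illustration only)."""

    def __init__(self, op, parents=(), name=None, value=None):
        self.op = op            # "input", "const", "add" or "mul"
        self.parents = parents  # nodes this node is computed from
        self.name = name        # set for input nodes
        self.value = value      # set for constant nodes

    def __add__(self, other):
        return Node("add", (self, other))

    def __mul__(self, other):
        return Node("mul", (self, other))


def placeholder(name):
    return Node("input", name=name)


def const(value):
    return Node("const", value=value)


def evaluate(node, feed):
    """Execute the graph for the concrete input values given in `feed`."""
    if node.op == "input":
        return feed[node.name]
    if node.op == "const":
        return node.value
    a, b = (evaluate(p, feed) for p in node.parents)
    return a + b if node.op == "add" else a * b


def gradient(node, wrt):
    """Build a new graph computing d(node)/d(wrt) by inspecting the existing graph."""
    if node is wrt:
        return const(1.0)
    if node.op in ("input", "const"):
        return const(0.0)
    a, b = node.parents
    if node.op == "add":                          # sum rule
        return gradient(a, wrt) + gradient(b, wrt)
    return gradient(a, wrt) * b + a * gradient(b, wrt)  # product rule


x, y = placeholder("x"), placeholder("y")
z = x * y + x                                     # only builds the graph, computes nothing
feed = {"x": 3.0, "y": 4.0}

z_value = evaluate(z, feed)
dz_dx = evaluate(gradient(z, x), feed)
dz_dy = evaluate(gradient(z, y), feed)
print(z_value, dz_dx, dz_dy)
```

Real frameworks differ in many ways (reverse-mode differentiation, tensors instead of scalars, code generation), so treat this only as an illustration of "graph as data".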
I found this comment interesting. Is this really the summary of what machine learning is about?
Not trying to be sarcastic, I just can't think of any way other than the ML way.
Microsoft wants you to use CNTK on Azure. Amazon wants you to use MXNet on AWS. Google wants you to use TensorFlow on GCP.
It's irrelevant whether these frameworks can be used outside their home platform by broke college students. That's a red herring. The cloud vendors are looking to sell enterprise contracts, and they need to check all of the boxes.
This strategy makes complete sense from a business perspective, and you really cannot fault them for doing it.
If they can achieve 109x speed up with 128 GPUs using synchronous data parallelism with a batch size tuned for optimal single GPU convergence time, then this is very impressive (but quite unlikely).
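For reference, the scaling efficiency implied by those two numbers is just the speedup divided by the device count. A quick sketch (the function name is my own):

```python
def scaling_efficiency(speedup, num_devices):
    """Fraction of ideal linear scaling achieved (1.0 means perfectly linear)."""
    return speedup / num_devices


efficiency = scaling_efficiency(speedup=109, num_devices=128)
print(f"{efficiency:.1%} of linear scaling")
```

So the quoted figures correspond to roughly 85% of ideal linear scaling, which says nothing by itself about whether convergence per epoch is preserved.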
However I don't think that publishing training benchmarks on Inception v3 (vs say AlexNet) is a fraud. Inception v3 is close to the state of the art and very good at using few parameters & inference FLOPS for a good test accuracy.
Inception v3 has been publicly available for quite a long time in a variety of DL toolkits along with pre-trained weights.
I mean, who cares about AlexNet any more? It's 2016 already. It trains in under 2h on a single machine. Distributing it doesn't make much sense.
Amazon is at its best when it's customer obsessed and at its worst when it puts politics first.
All IMO of course.
A platform that runs AlexNet well has excellent computation performance for the convolution layers but it also has excellent algorithms/communication for parallelizing the model/data by whatever means.
Networks that attempt to minimize computation and/or communication are cool, but they should be considered in that light IMO.
It's also a great estimate of the low-end for strong scaling. There's a lot of bread and butter machine learning at this level in my experience.
These are orthogonal to memory management and neural net framework choices.
I would be extremely impressed if someone developed an algorithm that could accomplish this task without using any type of statistical/machine learning.
But this sounds exactly like expression templates.
Once the graph is defined, it can be passed along with concrete values for the input nodes to the runtime framework to execute the section of the graph of interest (possibly with code generation + compilation).
I could believe you if you told me that the validation loss and test accuracy of the large distributed model remain as good as those of the sequential, single-GPU model after the same total number of epochs. But this is not a given, and if it's not the case I would find those benchmarks deceptive.
Both X and Y are related to the dataset and network complexity. A rough guess I often use is num_classes < X < 10 * num_classes and Y ~= 10 * X. To accelerate the convergence for batch sizes between X and Y, we can increase the data augmentation, the learning rate, or both. The basic idea is to add more noise to the SGD training to avoid falling into suboptimal points too easily.
The paper you mentioned studies the extreme case where batch size >> Y. They used CIFAR 10 (num_classes = 10) and batch size (20% num_examples = 12K). I was also surprised that they extended our earlier work to CNNs and showed promising results (Sec 4.2).
But as the paper authors also mention, there is little theory to say about that. I expect that the research community will have fun with it for a while.
But back to the MXNet benchmark: we did successfully tune the hyper-parameters with 128 GPUs and batch size = 32 * 128 to match the convergence of a single machine on the ImageNet 1K dataset. So we think our setting is reasonable. But the main point here is that we mainly wanted to show how fast the system can go, so that researchers can more easily try more efficient distributed algorithms on it.
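To spell out what "batch size = 32 * 128" means under synchronous data parallelism, here is a small sketch under stated assumptions: a one-parameter linear model, made-up data, and workers simulated in a single process. It is not the MXNet implementation. It only illustrates that averaging equal-sized per-worker gradients gives the same gradient as one large batch, which is why the hyper-parameters have to be tuned for the global batch size rather than the per-GPU one.

```python
def grad_mse(w, batch):
    """Gradient of the mean squared error for the 1-D model y_hat = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def sync_data_parallel_grad(w, batch, num_workers):
    """Split the global batch evenly, compute one gradient per worker, then average."""
    per_worker = len(batch) // num_workers
    shards = [batch[i * per_worker:(i + 1) * per_worker] for i in range(num_workers)]
    worker_grads = [grad_mse(w, shard) for shard in shards]
    return sum(worker_grads) / num_workers


per_gpu_batch, num_gpus = 32, 128               # figures quoted in the comment above
global_batch_size = per_gpu_batch * num_gpus    # effective batch size seen by SGD

# Made-up data for illustration: y = 3x + 1 on a few repeating x values.
data = [(float(i % 7), 3.0 * float(i % 7) + 1.0) for i in range(global_batch_size)]

g_single = grad_mse(0.5, data)                              # one big batch on one device
g_parallel = sync_data_parallel_grad(0.5, data, num_gpus)   # simulated workers

print(global_batch_size, g_single, g_parallel)
```

The two gradients agree up to floating-point rounding, so each synchronous step behaves like a single-device step with a batch of 4096 examples.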