Distributing a Fully Connected Neural Network Across a Cluster (iamtrask.github.io)
For anyone actually interested in some interesting techniques for multi-GPU DNN training, http://arxiv.org/pdf/1404.5997v2.pdf and references therein are probably a good start.
From what I understand, you're suggesting that for every node in a layer, you colocate the edge on the same machine?
For every node in every other layer, I colocate the edge on the same machine. That way, when a group of, say, 10 nodes in layer 1 are each sending a weighted message to a single node in layer 2, they can pre-combine their messages (as a weighted sum) and send only that one value over the network. This happens for every node in the second layer, reducing network I/O (this is the first optimization).
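A minimal sketch of the pre-combining idea, not the author's actual implementation. It assumes each machine holds a group of layer-1 nodes together with their outgoing edge weights; the names `partial_sums` and `combine` and the two example machines are hypothetical, and the "network" is simulated with plain lists.

```python
def partial_sums(activations, weights):
    """Pre-combine locally: one weighted sum per layer-2 node.

    weights[i][j] is the edge weight from local layer-1 node i
    to layer-2 node j (hypothetical layout for this sketch).
    """
    n_out = len(weights[0])
    return [
        sum(a * w[j] for a, w in zip(activations, weights))
        for j in range(n_out)
    ]


def combine(partials):
    """On the receiving side, add up the partial sums from every machine."""
    return [sum(values) for values in zip(*partials)]


# Two hypothetical machines, each holding 2 layer-1 nodes
# and their edges to 3 layer-2 nodes.
machine_a = ([1.0, 2.0], [[0.5, 1.0, -1.0], [0.25, 0.0, 2.0]])
machine_b = ([3.0, 4.0], [[1.0, 1.0, 1.0], [0.5, -0.5, 0.0]])
machines = (machine_a, machine_b)

# Each machine sends one value per layer-2 node instead of one per edge.
partials = [partial_sums(acts, w) for acts, w in machines]
layer2_inputs = combine(partials)

# Values sent over the network: one per edge (naive) vs. one per
# (machine, layer-2 node) pair (pre-combined).
naive_messages = sum(len(acts) * len(w[0]) for acts, w in machines)
combined_messages = sum(len(p) for p in partials)

print(layer2_inputs, naive_messages, combined_messages)
```

In this toy setup the number of values crossing the network drops from one per edge to one per machine per layer-2 node, so the saving grows with the number of layer-1 nodes grouped on each machine.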