PDF slides: Scalable Machine Learning (based on slides by Fei-Fei Li, Justin Johnson, Serena Yeung, Song Han)
Neural networks consist of layers of nodes, where neighboring layers are connected to each other. The layers can be Fully Connected (FC) or partially connected, e.g. to preserve the spatial structure of an image (convolution layers). An input signal spreads forward over these connections, each of which has a weight, as the product of weight and input signal. The "score" of a network node is the sum of all incoming weighted signals. It is fed into an activation function; popular activation functions are tanh, sigmoid and ReLU (the most popular). The quality of the network's output is measured by a loss function that also takes into account a regularization term; this regularization is intended to penalize complex models over simple ones. The models are trained by taking the derivative of the loss with respect to the weights (the "gradient") and backpropagating the error (the difference between the output and the correct supervised training answer) from the output side, layer by layer. The weights are then adjusted along the gradient, scaled by a learning rate.
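As a minimal sketch of these ideas, the code below implements a tiny fully connected network in NumPy: a ReLU activation, a squared-error loss with an L2 regularization term, manual backpropagation of the error, and a weight update scaled by a learning rate. All sizes and hyperparameters are made up for illustration.

    import numpy as np

    # Toy data: 4 examples, 3 input features, 1 output value (made-up sizes).
    X = np.random.randn(4, 3)
    y = np.random.randn(4, 1)

    # Randomly initialized weights of a 2-layer fully connected network.
    W1 = np.random.randn(3, 5) * 0.1
    W2 = np.random.randn(5, 1) * 0.1

    lr, reg = 1e-2, 1e-4   # learning rate and regularization strength

    for step in range(100):
        # Forward pass: weighted sums ("scores") fed into a ReLU activation.
        h = np.maximum(0, X @ W1)          # hidden layer activation
        out = h @ W2                       # network output
        err = out - y                      # error w.r.t. the supervised answer
        loss = (err ** 2).mean() + reg * ((W1 ** 2).sum() + (W2 ** 2).sum())

        # Backward pass: propagate the error from the output side, layer by layer.
        dout = 2 * err / len(X)
        dW2 = h.T @ dout + 2 * reg * W2
        dh = dout @ W2.T
        dh[h <= 0] = 0                     # gradient of the ReLU activation
        dW1 = X.T @ dh + 2 * reg * W1

        # Gradient descent: adjust the weights, scaled by the learning rate.
        W1 -= lr * dW1
        W2 -= lr * dW2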
It is perfectly possible to implement neural networks in numerical or statistics packages such as NumPy or R. But this is neither the easiest nor the fastest way to work with neural networks. Deep learning frameworks such as TensorFlow and PyTorch allow one to specify a network topology, score, activation and loss functions, and then generate most of the machinery for you. Specifically, they generate the code to compute gradients and handle all backpropagation.
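As a sketch of what such a framework automates, the same small network can be written in PyTorch, where the gradients and the backward pass are generated automatically (sizes and hyperparameters are again illustrative):

    import torch

    X = torch.randn(4, 3)
    y = torch.randn(4, 1)

    # A small fully connected network; PyTorch generates the backward pass.
    model = torch.nn.Sequential(
        torch.nn.Linear(3, 5),
        torch.nn.ReLU(),
        torch.nn.Linear(5, 1),
    )
    opt = torch.optim.SGD(model.parameters(), lr=1e-2, weight_decay=1e-4)
    loss_fn = torch.nn.MSELoss()

    for step in range(100):
        opt.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()   # gradients computed automatically via backpropagation
        opt.step()        # weight update scaled by the learning rate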
TensorFlow is the deep learning framework developed by Google Brain. Its language allows one to create a (static) neural network specification that is then compiled by TensorFlow into efficient training programs (e.g. doing operator fusion) and deployment programs (e.g. doing quantization). These programs run on CPUs, but also on GPUs and TPUs (as discussed later). PyTorch is a similar framework, developed by Facebook. Rather than a fully new language, it embeds neural network specifications in plain Python. This creates more flexibility to experiment and to make neural processing dynamic, but also means less optimization is possible and the deployed model will always depend on Python. Keras is an example of an even higher-level interface for deep learning. It comes with predefined optimizer strategies and loss, regularization and activation functions, and works on top of TensorFlow.
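A sketch of the even higher-level Keras style, where the optimizer, loss and activation functions are picked from predefined building blocks (the layer sizes are illustrative, not prescribed by the slides):

    from tensorflow import keras

    # Keras hides graph construction, gradients and the training loop entirely.
    model = keras.Sequential([
        keras.layers.Dense(128, activation="relu", input_shape=(784,)),
        keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="sgd",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    # model.fit(x_train, y_train, epochs=5)   # training data not shown here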
Neural networks consist of weights, and weights are numbers. Looking deep down at how this is ultimately handled by computer hardware, these weights are typically represented in one of the following ways:
When scaling up deep learning, the following hardware challenges appear:
To counter these challenges, we have discussed a number of optimizations that make models smaller or reduce the computational needs of model training and deployment.
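One such optimization, mentioned earlier for deployment, is quantization. A minimal hand-rolled sketch of the idea (not the API of any particular framework): FP32 weights are mapped to 8-bit integers plus a scale factor, shrinking the stored model roughly 4x at the cost of a small rounding error.

    import numpy as np

    w = np.random.randn(1000).astype(np.float32)       # original FP32 weights

    scale = np.abs(w).max() / 127.0                    # one scale per tensor
    w_int8 = np.clip(np.round(w / scale), -127, 127).astype(np.int8)

    w_restored = w_int8.astype(np.float32) * scale     # dequantize at run time
    print(w.nbytes, w_int8.nbytes)                     # 4000 vs 1000 bytes
    print(np.abs(w - w_restored).max())                # small quantization error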
CPUs have evolved along Moore's law, but their clock speed has not improved much in the last 10 years, and the number of cores per chip is also no longer increasing. More power now comes from so-called SIMD instructions (a Single Instruction performs an operation on Multiple Data items). SIMD has been in CPUs for 20 years (e.g. the MMX, SSE and AVX instruction sets). This feature evolved from 64-bit SIMD to 128-bit in 15 years, but has recently grown quickly through 256-bit to 512-bit SIMD in new CPU designs. SIMD with 512 bits means that one "+" (plus), "*" (multiply), etc. instruction works on 64 8-bit numbers (or 32 16-bit, or 16 32-bit numbers). Intel has launched a chip for deep learning called Knights Landing (KNL) that provides 7 TFLOP (FP32) thanks to SIMD. With that, this special KNL CPU is the fastest CPU for machine learning by some margin.
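To make the data-parallel pattern concrete: 512 bits hold 16 single-precision (32-bit) numbers, so one SIMD add processes 16 additions at once. The tiny NumPy sketch below only illustrates that pattern; NumPy's vectorized kernels are typically compiled down to such SIMD instructions, but how many elements land in one instruction depends on the build and the CPU.

    import numpy as np

    # 16 single-precision numbers fit in one 512-bit SIMD register (16 x 32 bits).
    a = np.arange(16, dtype=np.float32)
    b = np.ones(16, dtype=np.float32)

    # Conceptually one SIMD "+" instruction: all 16 additions happen in parallel.
    c = a + b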
Despite that, the deep learning community has already switched to training with GPUs (graphics processors). GPUs arguably enabled the breakthrough of deep learning. A CPU chip has a few powerful cores that can do very different things at the same time. To make that possible, the chip needs to contain a lot of control logic and cache memory. A GPU chip contains almost no memory or control logic; it consists of very many simple cores. These GPU cores must all do the same thing at the same time on different data -- it is like a 1000x SIMD processor. GPUs are great for doing heavy computation in parallel. GPUs traditionally come on graphics cards that are placed in an extension slot of the computer. Graphics cards have their own memory, which is faster than normal memory, but also smaller. The higher bandwidth it provides is needed to feed the GPU with input data quickly enough. One has to be careful that the GPU does not become bandwidth-bound, and the deep learning model must fit in this limited memory.
Programming GPUs is hard: one has to use low-level programming interfaces such as CUDA (NVIDIA only) or OpenCL (slower). All major deep learning libraries (TensorFlow, PyTorch, MXNet, etc.) support training and model evaluation on GPUs. Machine learning training is at least 10x faster on GPUs than on CPUs.
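A sketch of how a framework hides the CUDA details: in PyTorch, moving the model and the data to the GPU is enough, and the library dispatches the actual CUDA kernels (sizes below are made up).

    import torch

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    model = torch.nn.Linear(1024, 1024).to(device)  # weights now live in GPU memory
    x = torch.randn(256, 1024, device=device)       # input data moved there too

    y = model(x)   # runs as CUDA kernels on the GPU; no CUDA code written by us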
NVIDIA is the market leader in GPUs. Its current architecture is codenamed Pascal (1.5x faster than KNL for ML), and it has announced its new GPU architecture codenamed Volta. Volta is 1.5x faster than Pascal (10 TFLOP) as a GPU, and even 12x faster thanks to new so-called Tensor Cores (which provide 120 TFLOP for ML training). ML applications are thus influencing mainstream GPU designs.
To make deep learning scalable, one can try to parallelize it across multiple devices (e.g. GPUs), which may also be placed in different computers connected by some network. Given the many layers, nodes, multiplications and additions that are independent of each other, this should be scalable. However, parallel speedup is, as always, limited by communication overheads. Here are some approaches to parallel training:
Can parallelism make deep learning training scalable? Most deep learning frameworks offer multi-GPU training, as long as these multiple GPUs are located inside the same computer. This tends to scale nicely, but the practical maximum number of GPUs in a machine is 8 or so. Note that NVIDIA GPUs inside the same machine can be directly connected to each other with an NVLink cable, which has much higher bandwidth (up to 300 GB/s in Volta) and is roughly 100x faster than a fast computer network (10 Gbit/s ≈ 1 GB/s Ethernet). Therefore, parallelism works best between multiple GPUs on the same machine.
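A sketch of such single-machine multi-GPU training using PyTorch's nn.DataParallel, which replicates the model on each GPU and splits every batch across them (DistributedDataParallel is the more modern alternative; this is only an illustration with made-up sizes):

    import torch

    model = torch.nn.Linear(1024, 10)
    if torch.cuda.device_count() > 1:
        # Replicates the model on each GPU and scatters every batch across them;
        # gradients are summed over the replicas during backward().
        model = torch.nn.DataParallel(model)
    model = model.cuda()

    x = torch.randn(512, 1024).cuda()
    out = model(x)   # the batch of 512 is split over the available GPUs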
Not many frameworks support distributing machine learning over multiple computers. It is best supported in Distributed TensorFlow, which offers various distribution methods and parameter synchronization algorithms, but scalability is not trivial to achieve with it.
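A sketch of the (TensorFlow 1.x era) Distributed TensorFlow setup described in the talk referenced below: a cluster of parameter servers and workers, with variables placed on the parameter servers. Host names, ports and the variable shape are made up for illustration.

    import tensorflow as tf  # TensorFlow 1.x style API

    # Made-up cluster layout: one parameter server, two workers.
    cluster = tf.train.ClusterSpec({
        "ps":     ["ps0.example.com:2222"],
        "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
    })
    server = tf.train.Server(cluster, job_name="worker", task_index=0)

    # Variables are placed on the parameter server, ops on this worker.
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:worker/task:0", cluster=cluster)):
        w = tf.get_variable("w", shape=[784, 10])
        # ... build the rest of the model and training ops here ...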
For technical background, see the following material:
Distributed TensorFlow (TensorFlow Dev Summit 2017)