Since TensorFlow was open-sourced by Google, I have been studying the material on its website in depth. The website has a number of links to Christopher Olah's blog, with excellent and thought-provoking articles that I have tremendously enjoyed reading and pondering. Starting with his article Neural Networks, Manifolds, and Topology, they have all made me think hard about how and why deep neural networks actually work, and I am grateful, because in the process my understanding of these networks has improved by leaps and bounds. Here I would like to share my own perspective and current thoughts on how deep neural networks might be working. I assume the reader has some basic familiarity with deep neural networks.
How Deep Neural Networks Work
Let's focus on simple multi-layer neural networks for classification, in which neurons perform a weighted sum of their inputs, add a bias term, and then apply a reasonable non-linearity to the result to squash/crop it, and assume that the last layer of the network is a Softmax layer (which I won't really focus on here). How does such a simple multi-layer neural network work? What does it do with a multi-dimensional input that is fed to it?
Let's start with an even simpler network where the input data is in R² (i.e., 2D space). Assume in the first layer of the network we have a bunch of neurons. Each of them has a weight vector that can be thought of as defining a hyperplane (in our case, a straight line) in our 2D input space: the set of points whose dot product with the weight vector is zero. (The bias term simply shifts this line; I will ignore it for now.) Points on one side of this line have a positive dot product with the weight vector, and the farther they are from the line, the more positive the dot product is. Points on the opposite side produce a negative dot product, and the farther they are, the more negative it is. Thus, the weight vector of this neuron, through the dot product, induces a gradient over the input space (imagine a grayscale gradient). You can represent this gradient on a third axis. That is, let's define an R² → R function from our original 2D input space to R. If you visualize this function in 3D, it is a flat plane that passes through that very line: one side of the plane is above the 2D plane of our inputs (corresponding to points with positive dot products), and the other side is below it (corresponding to points with negative dot products). This plane represents the gradient induced by our dot product, and it looks like a slope over our otherwise horizontal 2D input plane. Next, imagine each neuron of the first layer has its own such surface/slope/function/map. Note that for this first layer, we used the concept of a hyperplane defined by a weight vector (the weight vector being the normal of the hyperplane); we also interpreted the dot product between the input vector and the weight vector (which is the length of the projection of the input vector onto the weight vector, scaled by the length of the weight vector) as the shade of the input vector with respect to the hyperplane; and then we reinterpreted the shade as the height of the input vector in an induced 3D space.
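To make the "slope" picture concrete, here is a minimal NumPy sketch of a single first-layer neuron before its non-linearity. The weight vector, bias, and sample points are illustrative values I made up, not taken from any trained network.

```python
import numpy as np

# Hypothetical weight vector and bias for one first-layer neuron.
w = np.array([1.0, 2.0])
b = 0.0

def pre_activation(points, w, b):
    """Height of the neuron's plane above each 2D input point: w . x + b."""
    return points @ w + b

# Three sample inputs: one on the line w . x = 0 and one on each side of it.
points = np.array([
    [2.0, -1.0],   # on the line
    [1.0, 1.0],    # positive side
    [-1.0, -1.0],  # negative side
])

heights = pre_activation(points, w, b)   # -> [0., 3., -3.]

# The height grows linearly with the signed distance from the line;
# the scaling factor is the length of the weight vector.
signed_distance = heights / np.linalg.norm(w)
```

The sign of each height tells us which side of the line the point lies on, and dividing by the length of w recovers the geometric distance from the line.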
Now, let's move to the second layer. From the second layer on, we won't look at the weight vector of a neuron as inducing a hyperplane in a higher-dimensional space any more (i.e., there is no need to think about hyperplanes in n-dimensional space, where n is the number of inputs of a layer). Instead we try to see what we can do with the results of the preceding layer, which are R² → R functions (3D surfaces) computed by the neurons of the previous layer: the slopes described above, each bent by its neuron's non-linearity. We would like to characterize what it means for the output of our second-layer neuron to have a specific value; or more specifically, what is the shape/locus of all the original 2D input points that generate the same output at this neuron. This neuron still does a dot product: it weights each of its inputs differently and then adds them up. So, let's just do that, taking its inputs to be the 3D surfaces (more accurately, the R² → R functions) generated by the previous layer. The neuron multiplies each of those surfaces by the corresponding weight, scaling it, and then adds them all up, resulting in a new 3D map/surface (R² → R function). This new surface looks like a landscape with hills and valleys, and it is more interesting and complex than the simple slopes that came from the previous layer. (It is the non-linearity of the previous layer that makes this possible: a weighted sum of perfectly flat planes would just be another flat plane.) Now, think of any point in the original 2D input space that is mapped to a point on this 3D surface, and notice that the height of that point is the output of the second-layer neuron. Moreover, think about cross sections of this 3D surface at different heights. Each of these cross sections (say, one at height H) represents all points of the original 2D space that generate the same output value, H, at this second-layer neuron. This neuron may have a bias term. What that bias term does is lift up or push down the whole 3D surface.
This neuron may also be followed by a non-linearity. The non-linearity modifies the 3D surface along its third dimension (more mathematically, the non-linearity is composed with the R² → R function). For example, if the non-linearity is a sigmoid, it squashes the high peaks and deep valleys of the 3D surface. If the non-linearity is a ReLU, it cuts off everything below height zero, creating flat plateaus where the valleys were. Other neurons in the second layer use their weights in other ways to combine the surfaces of the previous layer into different 3D surfaces (R² → R functions).
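The second-layer picture can be sketched in a few lines of NumPy. This is only an illustration under assumptions of my own: three first-layer neurons with ReLU non-linearities, and weights and biases (W1, b1, w2, b2) chosen by hand rather than learned.

```python
import numpy as np

def relu(z):
    """Cut off everything below height zero."""
    return np.maximum(z, 0.0)

# Hypothetical first layer: three neurons over 2D inputs.
W1 = np.array([[1.0, 0.0],
               [0.0, 1.0],
               [-1.0, -1.0]])
b1 = np.array([0.0, 0.0, 0.5])

# Hypothetical second-layer neuron: one weight per first-layer surface.
w2 = np.array([1.0, 1.0, -2.0])
b2 = -0.5

def second_layer_pre_activation(points):
    """Weighted sum of the first-layer surfaces, lifted or lowered by the bias."""
    first_layer_surfaces = relu(points @ W1.T + b1)  # one column per neuron
    return first_layer_surfaces @ w2 + b2

def second_layer_output(points):
    """The same surface after the ReLU flattens its below-zero regions."""
    return relu(second_layer_pre_activation(points))

points = np.array([[1.0, 1.0], [-1.0, -1.0]])
heights = second_layer_pre_activation(points)   # -> [1.5, -5.5]
outputs = second_layer_output(points)           # -> [1.5, 0.]
```

Evaluating second_layer_output on a dense grid of points and keeping those whose value is close to some H would trace out the cross section at height H described above.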
What happens in the next layer is very similar, except that the resulting mountain range is going to be much bumpier, and the density of variations on those surfaces will likely be higher. This process continues on to the last layer. By then, the 3D surfaces associated with each of the last layer's neurons could be quite bumpy and rocky! At this point, the expectation is that for any given class of inputs, the output neuron associated with that class produces a greater height on its 3D surface for members of that class than any other output neuron does, while producing smaller heights for inputs belonging to any other class. This means that points on the input 2D plane that belong to a given class must have been mapped to the highest peaks/regions of the 3D surface associated with their corresponding output neuron, and to lower points on the 3D surfaces of all other neurons (likely in their valleys). So that is how a neural network with 2D inputs works. It creates a multitude of 3D maps in each layer that get combined with each other in different ways in the next layer, eventually generating complex 3D surfaces (R² → R functions) in the last layer. Each of these surfaces singles out the inputs of its own class by assigning high peaks to them, and assigns lower heights (e.g., valleys) to any other input, lower than the height that input receives from the neuron responsible for its class.
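As a small sketch of this last step, suppose we already know the heights of three output neurons' surfaces at two input points (the numbers below are made up for illustration). The predicted class is simply the neuron with the tallest surface at that point, and a Softmax on top does not change that ordering.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical heights: rows are input points, columns are output neurons.
heights = np.array([[2.0, -1.0, 0.5],
                    [0.1, 0.3, 3.0]])

probs = softmax(heights)

# The winning class is the neuron whose surface is highest at each point.
predicted = heights.argmax(axis=1)   # -> [0, 2]
```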
The above analysis will unsurprisingly make us think of each neuron as composing a nonlinear function with the weighted sum of multiple R² → R functions, plus a bias. And each layer can be thought of as a mapping of a set of R² → R functions to another set of R² → R functions. So, what is going on in a neural network can also be thought of as some sort of complex/convoluted, yet structured, form of functional composition. Perhaps this type of composition has a name. If not, we can call it Neuronal Functional Composition!
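Written out, this composition might look as follows. The notation is introduced here only for illustration: f_j^{(l)} stands for the function computed by neuron j of layer l, \sigma for the non-linearity, and w and b for that neuron's weights and bias.

```latex
f_j^{(1)}(x) = \sigma\!\left( \sum_{k=1}^{2} w_{jk}^{(1)} x_k + b_j^{(1)} \right),
\qquad
f_j^{(l+1)}(x) = \sigma\!\left( \sum_{i} w_{ji}^{(l+1)} f_i^{(l)}(x) + b_j^{(l+1)} \right),
\qquad x \in \mathbb{R}^2 .
```

Each f_j^{(l)} is itself a function from the 2D input plane to a height, so every layer turns one set of such functions into another.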
When we train a neural network using back propagation and stochastic gradient descent, we are adjusting the way the 3D surfaces that are input to the last layer's neurons are combined by each output neuron (by adjusting the weights and biases), to improve the overall prediction for the minibatch of inputs. We also modify the 3D surfaces at the input of the last layer, by back propagating an error to the previous layers, and so on down the network. So, what training does is shape/carve the landscape of all these 3D surfaces/functions throughout the network.
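Here is a minimal sketch of one such adjustment, restricted to the last layer only. It assumes a Softmax output with a cross-entropy loss, random stand-in values for the heights of the penultimate surfaces, and a learning rate picked by hand; back propagation into earlier layers is left out.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(probs, labels):
    """Mean negative log-probability assigned to the correct class."""
    return -np.mean(np.log(probs[np.arange(len(labels)), labels]))

rng = np.random.default_rng(0)

# Stand-in minibatch: heights of 4 penultimate surfaces at 8 input points.
hidden = rng.normal(size=(8, 4))
labels = np.array([0, 1, 2, 0, 1, 2, 0, 1])

# Last-layer weights and biases, started at zero for simplicity.
W = np.zeros((4, 3))
b = np.zeros(3)

probs = softmax(hidden @ W + b)
loss_before = cross_entropy(probs, labels)

# Gradient of the mean cross-entropy with respect to the output heights.
one_hot = np.eye(3)[labels]
grad_heights = (probs - one_hot) / len(labels)

# Gradients for how the output neurons combine the penultimate surfaces.
grad_W = hidden.T @ grad_heights
grad_b = grad_heights.sum(axis=0)

learning_rate = 0.1  # hand-picked, illustrative value
W -= learning_rate * grad_W
b -= learning_rate * grad_b

loss_after = cross_entropy(softmax(hidden @ W + b), labels)
```

After the step, each output surface has been nudged upward at points of its own class and downward elsewhere, so the loss on this minibatch goes down slightly.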
Now, the above was a simple network with 2D inputs, and it was easy (in hindsight) to visualize what the neurons were doing by utilizing a third dimension. It would be much harder to visualize in our minds this same process for neural networks with higher-dimensional inputs. However, it is still true that for a network with n-dimensional inputs, the neurons of that network are combining Rⁿ → R functions into Rⁿ → R functions.
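The same idea carries over unchanged in code, even though we can no longer picture it. The sketch below assumes a small ReLU network with random weights and 5-dimensional inputs (all sizes are arbitrary); whatever the input dimension, every neuron in every layer still yields exactly one height per input point.

```python
import numpy as np

def forward(x, layers):
    """Apply each (W, b) pair followed by a ReLU; return every layer's outputs."""
    activations = []
    h = x
    for W, b in layers:
        h = np.maximum(h @ W + b, 0.0)
        activations.append(h)
    return activations

n = 5  # input dimension, chosen arbitrarily
rng = np.random.default_rng(0)

# Hypothetical network: n inputs -> 4 neurons -> 3 neurons.
layers = [
    (rng.normal(size=(n, 4)), np.zeros(4)),
    (rng.normal(size=(4, 3)), np.zeros(3)),
]

x = rng.normal(size=(10, n))   # ten input points in n-dimensional space
acts = forward(x, layers)

# acts[0][:, j] holds neuron j's height at each of the ten points,
# i.e. samples of one function from n-dimensional inputs to a single number.
```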
I will add some diagrams as soon as I can, to help with the visualization of the 2D example.