Neural networks

In our previous examples, we have mainly discussed linear regressions, in which the prediction is a weighted sum of the input features. We have touched on using polynomial terms to fit more complex curves. However, as we add more features to our model, deciding when to use a transformation of an original feature becomes a matter of trial and error. Using neural networks, we are able to fit a much more complex function, y = f(X), to our data, without the need to engineer or transform our existing features.

Structure of neural networks

When we learned the optimal parameter values that minimized loss in our regressions, we were effectively training a one-layer neural network:

Figure 1.10 – One-layer neural network

Here, we take each of our features, X_i, as an input, illustrated here by a node. We wish to learn the parameters (weights), written here as θ_i, which are represented as connections in this diagram. The sum of the products of each feature and its weight gives us our final prediction, y:

y = θ_1 X_1 + θ_2 X_2 + ... + θ_n X_n
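As a minimal sketch of this calculation in plain Python (the function name `predict` and the feature and weight values are illustrative assumptions, not taken from the text), the one-layer network is just a sum of products:

```python
def predict(features, weights):
    """Return the sum of feature * weight products (a one-layer network)."""
    if len(features) != len(weights):
        raise ValueError("each feature needs exactly one weight")
    return sum(x * w for x, w in zip(features, weights))


# Illustrative values only: three features and one weight per feature.
features = [1.0, 2.0, 3.0]
weights = [0.5, -1.0, 2.0]

y = predict(features, weights)  # 1.0*0.5 + 2.0*(-1.0) + 3.0*2.0
```

Learning, in this picture, means adjusting the entries of `weights` until the predictions match the observed values of y as closely as possible.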

A neural network simply builds upon this initial concept, adding extra layers to the calculation, thus increasing the complexity and the parameters learned, giving us something like this: 

Figure 1.11 – Fully connected network

Every input node is connected to every node in the next layer. This is known as a fully connected layer. The output from the fully connected layer is then multiplied by its own additional weights in order to predict y. Therefore, our predictions are no longer a function of just one weight per feature but now include multiple learned weights against each feature. A given feature is no longer affected by a single weight alone; it is also affected by the weights on each of its connections into the fully connected layer.
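The forward calculation through such a network might be sketched as follows. This is a simplified illustration in plain Python, assuming no bias terms and no activation functions; the helper name `fully_connected` and all weight values are made up for the example:

```python
def fully_connected(inputs, weights):
    """Compute one fully connected layer.

    `weights` holds one list per output node, and each list holds one
    weight per input, so every node sees every input value.
    """
    return [
        sum(x * w for x, w in zip(inputs, node_weights))
        for node_weights in weights
    ]


x = [1.0, 2.0]  # two input features

# Three hidden nodes, each with one weight per input feature.
hidden_weights = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]
# One output node with one weight per hidden node.
output_weights = [[1.0, 1.0, 1.0]]

hidden = fully_connected(x, hidden_weights)
y = fully_connected(hidden, output_weights)[0]
```

Note that the first feature now contributes to the prediction through three separate hidden-layer weights rather than a single one.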

Since each node within the fully connected layer takes all values of X as input, the neural network is able to learn interactions between the input features. Multiple fully connected layers can be chained together to learn even more complex features. In this book, we will see that all of the neural networks we build use this concept: chaining together multiple layers of different varieties in order to construct even more complex models. However, there is one additional key element to cover before we can fully understand neural networks: activation functions.

Activation functions

While chaining layers of weights together gives us more parameters to learn, our final prediction will ultimately still be a linear combination of the features, because a linear function of a linear function is itself linear. If we wish our neural networks to learn a truly complex, non-linear function, then we must introduce an element of nonlinearity into our model. This is done through the use of activation functions:

Figure 1.12 – Activation functions in neural networks
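To see why stacked layers without activation functions add no expressive power, the sketch below collapses two linear layers into a single equivalent layer. It assumes no bias terms; the helper names `linear_layer` and `collapse` and the weight values are illustrative only:

```python
import math


def linear_layer(inputs, weights):
    """Weighted sums only: one list of weights per output node."""
    return [
        sum(x * w for x, w in zip(inputs, node_weights))
        for node_weights in weights
    ]


def collapse(hidden_weights, output_weights):
    """Combine two stacked linear layers into one equivalent layer."""
    n_inputs = len(hidden_weights[0])
    n_hidden = len(hidden_weights)
    return [
        [
            sum(out_node[j] * hidden_weights[j][i] for j in range(n_hidden))
            for i in range(n_inputs)
        ]
        for out_node in output_weights
    ]


hidden_weights = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]
output_weights = [[1.0, 1.0, 1.0]]
single_layer_weights = collapse(hidden_weights, output_weights)

x = [1.0, 2.0]
two_layer_y = linear_layer(linear_layer(x, hidden_weights), output_weights)[0]
one_layer_y = linear_layer(x, single_layer_weights)[0]

# Both routes give the same prediction: the extra layer changed nothing.
assert math.isclose(two_layer_y, one_layer_y)
```

Whatever weights are chosen, the two-layer network can always be rewritten as one weight per feature, which is exactly the regression we started with.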

We apply an activation function to each node within our fully connected layer. What this means is that each node in the fully connected layer takes a sum of features and weights as input, applies a nonlinear function to the resulting value, and outputs the transformed result. While there are many different activation functions, the most frequently used in recent times is ReLU, or the Rectified Linear Unit:

Figure 1.13 – Representation of ReLU output

ReLU is a very simple non-linear function that returns y = 0 when X ≤ 0 and y = X when X > 0. After introducing these activation functions to our model, our final learned function becomes nonlinear, meaning we can fit a wider range of functions than we would have been able to using a combination of conventional regression and feature engineering alone.
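A short sketch of ReLU applied to the outputs of a fully connected layer is shown below. The hidden-node values and output weights are made-up numbers chosen for illustration:

```python
def relu(x):
    """Rectified Linear Unit: 0 for x <= 0, x itself for x > 0."""
    return max(0.0, x)


# Illustrative weighted sums arriving at three hidden nodes.
weighted_sums = [1.5, -1.0, 4.0]
activated = [relu(z) for z in weighted_sums]

# Illustrative output weights, one per hidden node.
output_weights = [1.0, 1.0, 1.0]
y = sum(w * a for w, a in zip(output_weights, activated))
```

The negative value is clipped to zero before it reaches the output, so the prediction is no longer a plain weighted sum of the inputs.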

How do neural networks learn?

The act of learning from our data using neural networks is slightly more complicated than when we learned using basic regressions. While we still use gradient descent as before, the actual loss function we need to differentiate becomes significantly more complex. In a one-layered neural network with no activation functions, we can easily calculate the derivative of the loss function as it is easy to see how the loss function changes as we vary each parameter. However, in a multi-layered neural network with activation functions, this is more complex.

We must first perform a forward pass, in which, using the model's current state, we compute the predicted value of y and evaluate this against the true value of y in order to obtain a measure of loss. Using this loss, we move backward through the network, calculating the gradient of the loss with respect to each parameter within the network. This tells us in which direction to update each parameter so that we move closer to the point where loss is minimized. This is known as backpropagation. We can calculate the derivative of the loss function with respect to each parameter using the chain rule:

∂L/∂θ = (∂L/∂o) × (∂o/∂θ)

Here, L is the loss, θ is a given parameter, and o is the output at each given node within the network. So, to summarize, the four main steps we take when performing gradient descent on neural networks are as follows:

  1. Perform a forward pass using your data, calculating the total loss of the network.
  2. Using backpropagation, calculate the gradient of the loss with respect to each parameter at each node in the network.
  3. Update the values of these parameters, moving toward the direction where loss is minimized.
  4. Repeat until convergence.
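
The four steps above can be sketched in plain Python for a very small network: one input, two hidden nodes with ReLU, and one output. Everything here is an assumption made for illustration: the mean squared error loss, the starting weights, the learning rate, the absence of bias terms, and the target function y = |x|. Step 4 is simplified to a fixed number of epochs rather than a true convergence check.

```python
def relu(z):
    return max(0.0, z)


def train(xs, ys, hidden_w, output_w, learning_rate=0.05, epochs=200):
    """Gradient descent on a 1-input, ReLU-hidden, 1-output network."""
    hidden_w, output_w = list(hidden_w), list(output_w)
    n = len(xs)
    history = []  # total loss recorded at each epoch
    for _ in range(epochs):  # step 4: repeat
        grad_hidden = [0.0] * len(hidden_w)
        grad_output = [0.0] * len(output_w)
        total_loss = 0.0
        for x, y in zip(xs, ys):
            # Step 1: forward pass and loss (mean squared error).
            z = [w * x for w in hidden_w]
            h = [relu(v) for v in z]
            pred = sum(v * hj for v, hj in zip(output_w, h))
            total_loss += (pred - y) ** 2 / n
            # Step 2: backpropagation using the chain rule.
            d_pred = 2.0 * (pred - y) / n
            for j in range(len(hidden_w)):
                grad_output[j] += d_pred * h[j]
                d_h = d_pred * output_w[j]
                d_z = d_h * (1.0 if z[j] > 0 else 0.0)  # slope of ReLU
                grad_hidden[j] += d_z * x
        history.append(total_loss)
        # Step 3: move each parameter against its gradient.
        hidden_w = [w - learning_rate * g for w, g in zip(hidden_w, grad_hidden)]
        output_w = [w - learning_rate * g for w, g in zip(output_w, grad_output)]
    return hidden_w, output_w, history


xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [abs(x) for x in xs]  # a target no single straight line can fit
hidden_w, output_w, history = train(xs, ys, [0.5, -0.5], [0.5, 0.5])
```

Comparing `history[0]` with `history[-1]` shows the loss falling as training proceeds. In practice, a library such as PyTorch computes these gradients automatically, so the backward pass is rarely written by hand.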

Overfitting in neural networks

We saw that, in the case of our regressions, it was possible to add so many features that the model overfit the training data. This gets to a point where the model fits the training data perfectly but does not generalize well to an unseen test set of data. This is a common problem in neural networks, as the increased complexity of the models means that it is often possible to fit a function to the training set of data that doesn't necessarily generalize. The following is a plot of the total loss on the training and test sets of data after each forward and backward pass over the dataset (known as an epoch):

Figure 1.14 – Test and training epochs

Here, we can see that as we continue to train the network, the training loss gets smaller over time as we move closer to the point where the total loss is minimized. While this generalizes well to the test set of data up to a point, after a while, the total loss on the test set of data begins to increase as our function overfits to the data in the training set. One solution to this is early stopping. Because we want our model to make good predictions on data it hasn't seen before, we can stop training our model at the point where test loss is minimized. A fully trained NLP model may be able to easily classify sentences it has seen before, but the measure of a model that has truly learned something is its ability to make predictions on unseen data.
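
Early stopping can be sketched as a scan over the test loss recorded after each epoch. The function name, the `patience` parameter (how many non-improving epochs to tolerate before stopping), and the loss values below are all assumptions made for illustration:

```python
def early_stopping_epoch(test_losses, patience=2):
    """Return the index of the epoch with the lowest test loss seen
    before the loss fails to improve for `patience` epochs in a row."""
    if not test_losses:
        raise ValueError("need at least one recorded loss")
    best_epoch = 0
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(test_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # test loss has started rising: stop training
    return best_epoch


# Made-up test losses: falling at first, then rising as overfitting sets in.
test_losses = [0.90, 0.60, 0.45, 0.40, 0.42, 0.47, 0.55]
best = early_stopping_epoch(test_losses)
```

In a real training loop, the model's parameters would be saved whenever a new best loss is seen, so that the saved copy corresponds to the epoch returned here.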