Most of today’s artificial intelligence is implemented using some form of neural network. In my last two articles, I introduced neural networks and showed you how to build a neural network in Java. The power of a neural network derives largely from its capacity for deep learning, and that capacity is based on the concept and execution of gradient descent backpropagation. I’ll conclude this short series of articles with a quick dive into backpropagation and gradient descent in Java.

## Backpropagation in machine learning

It has been said that AI is not that smart; that it is largely just backpropagation. So what is this cornerstone of modern machine learning?

To understand backpropagation, you must first understand how a neural network works. Basically, a neural network is a directed graph of nodes called *neurons*. Neurons have a specific structure that takes inputs, multiplies them by weights, adds a bias value, and runs all of that through an activation function. Neurons feed their output to other neurons until the output neurons are reached. The output neurons produce the output of the network. (See Machine Learning Styles: Introduction to Neural Networks for a fuller introduction.)

I’ll assume from here that you understand how a network and its neurons are structured, including feedforward. The example and discussion will focus on gradient descent backpropagation. Our neural network will have a single output node, two “hidden” nodes, and two input nodes. Using a relatively simple example will make it easier to see the math involved with the algorithm. Figure 1 shows an example neural network diagram.

The idea of gradient descent backpropagation is to consider the entire network as a multivariate function that provides input to a loss function. The loss function calculates a number that represents how well the network is performing by comparing the output of the network to known good results. The input data set along with good results is known as the training set. The loss function is designed to increase the value of the number as the behavior of the network moves further from what is correct.

Gradient descent algorithms take the loss function and use partial derivatives to determine what each variable (the weights and biases) in the network contributed to the loss value. The algorithm then moves backward through the network, visiting each variable and adjusting it to decrease the loss value.
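To make "what each variable contributed" concrete, here is a minimal sketch (my own, not from the article) that estimates the partial derivative of a toy loss function with respect to each variable by nudging one variable at a time and holding the other fixed:

```java
// A toy "loss" of two variables, and finite-difference estimates of each
// variable's contribution to it. The function and names are illustrative only.
public class PartialDemo {
    // loss(w, b) = (w * 2 + b - 1)^2 -- a stand-in for a real network loss
    static double loss(double w, double b) {
        double diff = w * 2 + b - 1;
        return diff * diff;
    }

    public static void main(String[] args) {
        double w = 0.5, b = 0.5, h = 1e-6;
        // Partial derivative with respect to w: nudge w, hold b fixed
        double dLdw = (loss(w + h, b) - loss(w, b)) / h;
        // Partial derivative with respect to b: nudge b, hold w fixed
        double dLdb = (loss(w, b + h) - loss(w, b)) / h;
        System.out.println("dL/dw = " + dLdw + ", dL/db = " + dLdb);
    }
}
```

The variable with the larger partial derivative (here `w`) is contributing more steeply to the loss, so gradient descent will adjust it more aggressively.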

## Gradient descent calculation

Understanding gradient descent involves some calculus concepts. The first is the notion of a *derivative*. MathsIsFun.com has a great introduction to derivatives. In short, a derivative gives you the slope (or rate of change) of a function at a single point. Put another way, the derivative of a function gives us the rate of change at the given input. (The beauty of the calculation is that it allows us to find the change without another reference point, or rather, it allows us to assume an infinitesimally small change in the input.)

The next important notion is the *partial derivative*. A partial derivative takes a multidimensional (also known as multivariate) function and isolates just one of its variables to find the slope along that dimension.

Derivatives answer the question: What is the rate of change (or slope) of a function at a specific point? Partial derivatives answer the question: given multiple input variables to the equation, what is the rate of change for just this variable?

Gradient descent uses these ideas to visit each variable in an equation and adjust it to minimize the output of the equation. That is exactly what we want when training our network. If we think of the loss function as plotted on a chart, we want to move in increments toward the minimum of the function. That is, we want to *find the global minimum*.

Note that the size of an increment is known as the “learning rate” in machine learning.
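The stepping process can be sketched in a few lines of Java. This is my own one-variable illustration, not the article's network code: it descends the parabola *(x − 3)²*, whose minimum is at *x = 3*, scaling each step by the learning rate.

```java
// Gradient descent on a single variable: repeatedly step against the slope,
// scaled by the learning rate. Function and names are illustrative only.
public class DescentDemo {
    static double f(double x) { return (x - 3) * (x - 3); } // minimum at x = 3
    static double fPrime(double x) { return 2 * (x - 3); }  // derivative of f

    public static double descend(double start, double learningRate, int steps) {
        double x = start;
        for (int i = 0; i < steps; i++) {
            x -= learningRate * fPrime(x); // move opposite the slope
        }
        return x;
    }

    public static void main(String[] args) {
        System.out.println(descend(0.0, 0.1, 100)); // converges toward 3
    }
}
```

Try a learning rate of 1.1 instead of 0.1 and the iterate diverges; too large an increment overshoots the minimum, which is why the learning rate matters.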

### Gradient descent in code

We’ll stick close to the code as we explore the mathematics of gradient descent backpropagation. When the math gets too abstract, looking at the code will help keep us grounded. Let’s start by looking at our `Neuron` class, shown in Listing 1.

#### Listing 1. A Neuron class

```
class Neuron {
    Random random = new Random();
    private Double bias = random.nextGaussian();
    private Double weight1 = random.nextGaussian();
    private Double weight2 = random.nextGaussian();

    public double compute(double input1, double input2){
        return Util.sigmoid(this.getSum(input1, input2));
    }
    public Double getWeight1() { return this.weight1; }
    public Double getWeight2() { return this.weight2; }
    public Double getSum(double input1, double input2){
        return (this.weight1 * input1) + (this.weight2 * input2) + this.bias;
    }
    public Double getDerivedOutput(double input1, double input2){
        return Util.sigmoidDeriv(this.getSum(input1, input2));
    }
    public void adjust(Double w1, Double w2, Double b){
        this.weight1 -= w1;
        this.weight2 -= w2;
        this.bias -= b;
    }
}
```

The `Neuron` class has only three `Double` members: `weight1`, `weight2`, and `bias`. It also has a few methods. The method used for feedforward is `compute()`. It accepts two inputs and does the work of the neuron: multiply each by the appropriate weight, add the bias, and run the result through a sigmoid function.
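Wiring neurons together gives us the feedforward pass for the Figure 1 network: two inputs feed two hidden neurons, whose outputs feed the single output neuron. This is a self-contained sketch of that wiring (the sigmoid is inlined here so the example compiles on its own; it is covered next):

```java
import java.util.Random;

// Feedforward for the Figure 1 shape: inputs -> two hidden neurons -> output.
public class FeedForwardDemo {
    static double sigmoid(double in) { return 1 / (1 + Math.exp(-in)); }

    static class Neuron {
        final Random random = new Random();
        double bias = random.nextGaussian();
        double weight1 = random.nextGaussian();
        double weight2 = random.nextGaussian();

        double compute(double input1, double input2) {
            // weights * inputs + bias, squeezed through the activation
            return sigmoid(weight1 * input1 + weight2 * input2 + bias);
        }
    }

    public static void main(String[] args) {
        Neuron hidden1 = new Neuron(), hidden2 = new Neuron(), output = new Neuron();
        double input1 = 0.5, input2 = 0.25;
        // The hidden layer's outputs become the output neuron's inputs
        double prediction = output.compute(
            hidden1.compute(input1, input2),
            hidden2.compute(input1, input2));
        System.out.println(prediction); // always strictly between 0 and 1
    }
}
```

Because the weights and biases start as random Gaussians, the prediction is meaningless at first; training exists to adjust them until it isn’t.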

Before we continue, let’s review the concept of *sigmoid activation*, which I also discussed in my introduction to neural networks. Listing 2 shows a Java-based sigmoid activation function.

#### Listing 2. Util.sigmoid()

```
public static double sigmoid(double in){
    return 1 / (1 + Math.exp(-in));
}
```

The sigmoid function raises Euler’s number (`Math.exp`) to the negative of the input, adds 1, and divides 1 by the result. The effect is to squeeze the output between 0 and 1, with large positive and negative inputs approaching the limits asymptotically.
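A quick check makes the squeezing behavior visible: an input of 0 maps to exactly 0.5, while inputs far from zero are pushed toward the 0 and 1 limits.

```java
// Sampling the sigmoid to show how it squeezes its output into (0, 1).
public class SigmoidDemo {
    public static double sigmoid(double in) {
        return 1 / (1 + Math.exp(-in));
    }

    public static void main(String[] args) {
        System.out.println(sigmoid(0));   // exactly 0.5
        System.out.println(sigmoid(10));  // very close to 1
        System.out.println(sigmoid(-10)); // very close to 0
    }
}
```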

Going back to the `Neuron` class in Listing 1, beyond the `compute()` method we have `getSum()` and `getDerivedOutput()`. `getSum()` just does the *weights * inputs + bias* calculation. Notice that `compute()` takes the result of `getSum()` and runs it through `sigmoid()`. The `getDerivedOutput()` method runs the result of `getSum()` through a different function: the *derivative* of the sigmoid function.

## The derivative in action

Now take a look at Listing 3, which shows a sigmoid derivative function in Java. We’ve talked about derivatives conceptually; here’s one in action.

#### Listing 3. Sigmoid derivative

```
public static double sigmoidDeriv(double in){
    double sigmoid = Util.sigmoid(in);
    return sigmoid * (1 - sigmoid);
}
```

Remembering that a derivative tells us the rate of change of a function at a single point on its graph, we can get an idea of what this derivative says: *tell me the rate of change of the sigmoid function for the given input*. You could say that it tells us what impact the pre-activation sum in Listing 1 has on the final activation result.
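One way to convince yourself the closed-form derivative is right (a sanity check of my own, not from the article) is to compare it against the slope between two nearby points on the sigmoid curve:

```java
// Compare the closed-form sigmoid derivative against a finite-difference
// estimate of the slope at the same point.
public class DerivCheck {
    public static double sigmoid(double in) { return 1 / (1 + Math.exp(-in)); }

    public static double sigmoidDeriv(double in) {
        double s = sigmoid(in);
        return s * (1 - s);
    }

    public static void main(String[] args) {
        double x = 0.75, h = 1e-6;
        // Slope between two points a tiny distance h apart
        double numeric = (sigmoid(x + h) - sigmoid(x)) / h;
        System.out.println(sigmoidDeriv(x)); // closed form
        System.out.println(numeric);         // nearly the same value
    }
}
```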

### Derivative rules

You may be wondering how we know that the sigmoid derivative function in Listing 3 is correct. The answer is that we know a derivative function is correct if it has been verified by others and if we know that functions differentiated *according to specific rules* are accurate. We don’t have to go back to first principles and rediscover these rules; once we understand what they say and are confident that they are accurate, we just accept and apply them, much as we accept and apply the rules for simplifying algebraic equations.

So, in practice, we find derivatives by following the derivative rules. If you look at the sigmoid function and its derivative, you’ll see that the latter can be arrived at by following those rules. For the purposes of gradient descent, we need to know the derivative rules, trust that they work, and understand how they are applied. We will use them to find the role that each of the weights and biases plays in the final loss result of the network.
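As a worked illustration (mine, not reproduced from the article), applying the reciprocal and chain rules to the sigmoid yields exactly the form used in Listing 3:

```latex
\sigma(x) = \frac{1}{1 + e^{-x}}
\qquad
\sigma'(x) = \frac{e^{-x}}{\left(1 + e^{-x}\right)^{2}}
           = \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}}
           = \sigma(x)\,\bigl(1 - \sigma(x)\bigr)
```

The last step uses the identity that the second factor equals *1 − σ(x)*, which is why Listing 3 can reuse `sigmoid` instead of recomputing exponentials.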

### Notation

The notation *f'(x)*, read "*f* prime of *x*," is a way of saying "the derivative of *f* of *x*." Another is the Leibniz notation:

$$\frac{d}{dx}f(x)$$

The two are equivalent:

$$f'(x) = \frac{d}{dx}f(x)$$

Another notation that you will see shortly is the partial derivative notation:

$$\frac{\partial f}{\partial x}$$

This says: *give me the derivative of f with respect to the variable x*.

### The chain rule

The most curious of the derivative rules is the *chain rule*. It says that when a function is composite (a function within a function, *a.k.a.* a higher-order function) you can expand it like this:

$$f(x) = g(h(x)) \quad\Rightarrow\quad f'(x) = g'(h(x)) \cdot h'(x)$$

We will use the chain rule to unpack our network and obtain partial derivatives for each weight and bias.
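Here is a hedged sketch of the chain rule doing backpropagation’s job for a single weight. The loss is *(sigmoid(w·x + b) − target)²*, and the chain rule unpacks its derivative with respect to *w* one layer at a time; a finite-difference estimate confirms the result. (The setup and names are my own, not the article’s network.)

```java
// Chain rule applied to one weight of a one-neuron "network", checked
// against a numeric estimate of the same derivative.
public class ChainRuleDemo {
    public static double sigmoid(double in) { return 1 / (1 + Math.exp(-in)); }

    public static double sigmoidDeriv(double in) {
        double s = sigmoid(in);
        return s * (1 - s);
    }

    public static double loss(double w, double x, double b, double target) {
        double diff = sigmoid(w * x + b) - target;
        return diff * diff;
    }

    // Chain rule: dLoss/dw = dLoss/dPred * dPred/dSum * dSum/dw
    public static double lossDerivW(double w, double x, double b, double target) {
        double sum = w * x + b;
        double pred = sigmoid(sum);
        return 2 * (pred - target) // derivative of the squared error
             * sigmoidDeriv(sum)   // derivative of the activation
             * x;                  // derivative of (w * x + b) w.r.t. w
    }

    public static void main(String[] args) {
        double w = 0.4, x = 0.9, b = 0.1, target = 1.0, h = 1e-6;
        double analytic = lossDerivW(w, x, b, target);
        double numeric = (loss(w + h, x, b, target) - loss(w, x, b, target)) / h;
        System.out.println(analytic + " vs " + numeric); // agree closely
    }
}
```

For the full network, the same factoring just grows more links in the chain: each hidden layer contributes one more derivative factor between the loss and the weight being adjusted.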
