# Building A Program That Actually Learns

I decided to build a program that can learn.

Technically, it is a two-layer neural network that takes a set of inputs and desired outputs, and learns how to predict those outputs.

I’m already 3 months into the Deep Learning Nanodegree at Udacity, so I’ve already built this a few times. But I wanted to take a step back and dive deeper into the mechanics behind it.

My goal in writing this blog post is twofold:

1. Provide guidance for anyone interested in learning how to build a neural network
2. Solidify what I’ve learned by explaining it to others

I’m no expert in machine learning, so if you see ways I can improve my code or provide a better explanation, please let me know.

Before we jump in, please also check out Andrew Trask’s post: A Neural Network in 11 lines of Python (Part 1). My post is really just a fork of Andrew’s in which I try to simplify and explain concepts that confused me the first time around. So all credit goes to Andrew.

If you’d like to see the full code, you can view it on my GitHub: https://github.com/byanofsky/simple-2-layer-neural-network

Let’s begin!

## What Are We Trying To Accomplish?

The general idea behind Deep Learning is to feed a neural network a bunch of inputs and the desired outputs. The network then performs a series of calculations until it can accurately predict the outputs from the inputs. Then, if done correctly, it can take a new set of inputs and accurately predict their outputs.

For example, we can feed a bunch of images of animals and tell the neural network what each image is: cat, dog, cat, horse, dog, etc. Once it learns, we can then feed it a new image of a cat and it can say, “I’m 95% certain that’s a picture of a cat.”

Image classification requires a more complex neural network than what we’ll be building here. But it uses the same building blocks.

In this specific implementation, we’ll feed the network 3 inputs, each being a 1 or a 0. And the associated output will be a 1 or a 0.

Visually, here is what we are working with: $\begin{bmatrix}0&0&1\\0&1&1\\1&0&1\\1&1&1\end{bmatrix} \longrightarrow \begin{bmatrix}0\\0\\1\\1\end{bmatrix}$

Each row in the first matrix is a set of 3 inputs, leading to an output in the same row in the second matrix.

So we have 4 sets of data, each with 3 inputs and one expected output.

Now let’s think through this problem before diving into the code.

## Thinking Through The Problem

One thing Andrew taught us in Udacity’s Deep Learning course is to think through a problem before introducing it to a neural network.

In this case, looking through the inputs and outputs, you might notice that the output is directly correlated with the first input in each set. When the first input is a 0, output is 0. When it is a 1, the output is 1. The other two inputs are irrelevant.
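
As a quick sanity check on that observation, here is a minimal sketch that compares each input column against the output. It uses the same dataset shown above; the loop and variable names are only for illustration.

```python
import numpy as np

# Same dataset as in this post
X = np.array([[0, 0, 1],
              [0, 1, 1],
              [1, 0, 1],
              [1, 1, 1]])
y = np.array([[0], [0], [1], [1]])

# For each input column, check whether it matches the output column exactly
for col in range(X.shape[1]):
    matches = np.array_equal(X[:, col], y[:, 0])
    print(f"input {col + 1} matches output: {matches}")
```

Only the first input matches the output in every row, which is the pattern we hope the network discovers on its own.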

This is what we want our neural network to figure out.

But we’ll do so without providing it any guidance.

Now it is time to start setting up the neural network.

## Visualizing The Network

Visually, here is how our neural network is going to look.

We’ll have 3 inputs, which is our Layer 0.

Connecting our first layer to the second layer are synapses, or weights. Each input value is multiplied by the associated weight value. These values are then summed together, and passed through a function (we’ll discuss this function soon). The resulting value is our value for Layer 1, and in this case, also our output value.
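
To make that description concrete, here is a rough sketch of the calculation for a single set of inputs. The weight values are made up for illustration (the real network starts with random weights), and the squashing function on the last step is the sigmoid discussed later in this post.

```python
import numpy as np

inputs = np.array([1, 0, 1])            # one row of inputs (Layer 0)
weights = np.array([0.5, -0.2, 0.1])    # hypothetical weights, one per input

# Multiply each input by its weight, then sum the results
weighted_sum = np.sum(inputs * weights)   # 1*0.5 + 0*(-0.2) + 1*0.1

# Pass the sum through a function that squashes it into the range (0, 1)
layer_1 = 1 / (1 + np.exp(-weighted_sum))

print(weighted_sum, layer_1)
```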

During training, our network will start with random weight values. The resulting output from these random weights will then be compared to the expected output. This difference is our error.

Going from input to output is our “forward propagation”.

Next comes “back propagation”.

The error is used in a few calculations to figure out how much to adjust the weights by.

With these adjusted weights, the neural network will run forwards again, find the error, then backwards, over and over until it reaches the set number of iterations.

With each iteration, the network attempts to reduce the error, so that its outputs move closer and closer to the expected outputs.

Now that we have a visual understanding of the network, let’s jump into the code.

If you want to skip to the code, you can view it here: https://github.com/byanofsky/simple-2-layer-neural-network

## Initializing The Variables

I’ll be using Python to code this neural network. I’ll be coding within a Jupyter notebook, and the only dependency I’ll have is Numpy (used in the matrix calculations).

First, we’ll need to import numpy:

    import numpy as np


Now, let’s turn our inputs and outputs into numpy arrays.

For our input layer, we’ll have 3 inputs. You’ll also notice that we have 4 different sets of inputs.

So we’ll create a 4×3 matrix:

    # Input dataset
    X = np.array([[0, 0, 1],
                  [0, 1, 1],
                  [1, 0, 1],
                  [1, 1, 1]])


Our outputs are similar. Like the inputs, we’ll have 4 sets. But each set only has 1 output.

So our outputs will be a 4×1 matrix:

    # Output dataset
    y = np.array([[0],
                  [0],
                  [1],
                  [1]])


Now we want to initialize our weights (the synapses connecting layer 0 to layer 1).

As Andrew says, “There is quite a bit of theory that goes into weight initialization. For now, just take it as a best practice that it’s a good idea to have a mean of zero in weight initialization.”

So we’ll follow his practice and initialize the weights randomly with a mean of 0, and values between -1 and 1:

    # Seed random numbers. Useful for testing so that each time we run,
    # the random set is the same.
    np.random.seed(1)

    # Initialize weights randomly with mean 0
    W0 = 2 * np.random.random((3, 1)) - 1


Our weight matrix has a shape of 3×1.

It took me a while to figure out exactly why, but it works out like this:

Each set has 3 inputs, and 1 output. Our weights need to connect each input to the output. So the weights should take in the 3 inputs and return 1 output.
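
One way to convince yourself of this is to print the shapes. The sketch below reuses the inputs and weight initialization from this post; the shape and range checks themselves are my addition.

```python
import numpy as np

np.random.seed(1)

X = np.array([[0, 0, 1],
              [0, 1, 1],
              [1, 0, 1],
              [1, 1, 1]])
W0 = 2 * np.random.random((3, 1)) - 1

print(X.shape)              # (4, 3): 4 sets, 3 inputs each
print(W0.shape)             # (3, 1): 3 inputs in, 1 output out
print(np.dot(X, W0).shape)  # (4, 1): one value per set

# np.random.random draws from [0, 1), so 2 * r - 1 lands in [-1, 1)
print(W0.min() >= -1 and W0.max() < 1)
```

The inner dimensions (3 and 3) have to match for the multiplication to work, and the outer dimensions (4 and 1) give the shape of the result.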

I like to think of it like this:

Now it’s time for training our neural network.

## Training Our Network

It’s time to create our training loop. With each iteration, our network will perform a forward propagation and backward propagation, bringing our error closer and closer to 0.

In general, the more iterations we train for, the more accurate the network becomes. But more iterations also mean longer training time, and after a certain point we’ll see diminishing returns. Training for too long can also lead to overfitting (which you’ll come across on more complex networks).

For now, let’s stick with what Andrew suggests and train our network over 10,000 iterations:

    # Train the network
    for i in range(10000):
        # Code will go here
        pass


Let’s build our forward propagation first.

### Forward Propagation

This is our process which leads from inputs to an output. And here is the code:

    # Train the network
    for i in range(10000):
        # Forward propagation
        l0 = X
        l1 = sigmoid(np.dot(l0, W0))


Our Layer 0 is just our inputs, so we set that to X. From iteration to iteration, our Layer 0 will remain constant because our inputs will not be changing.

To go from Layer 0 to Layer 1 requires some calculations I referenced earlier.

First, we’ll take every input from all 4 sets, and multiply them by their associated weight. These values are then summed together for each set. Because Layer 0 and the weights are matrices, this whole calculation is a matrix multiplication: each set’s result is the dot product of its 3 inputs with the 3 weights.

Visually, the calculation looks like this:

np.dot() is a numpy function for calculating the dot product of two arrays. When both are 2-D matrices, as they are here, it performs matrix multiplication.
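
Here is a small sketch showing that np.dot() gives the same result as doing the multiply-and-sum by hand for each row. The weight values are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
import numpy as np

X = np.array([[0, 0, 1],
              [0, 1, 1],
              [1, 0, 1],
              [1, 1, 1]])
W = np.array([[0.5], [-0.2], [0.1]])   # hypothetical weights, shape (3, 1)

# Manual version: for each row, multiply inputs by weights and sum
manual = np.array([[sum(X[r, c] * W[c, 0] for c in range(3))] for r in range(4)])

# Matrix version: one call handles all four rows at once
with_dot = np.dot(X, W)

print(with_dot.ravel())
print(np.allclose(manual, with_dot))
```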

Now, what about the sigmoid function?

### Sigmoid Explained

The sigmoid function is another concept that took me a while to completely grasp.

In essence, a sigmoid function takes a value x and returns a value y that is between 0 and 1. That output can be read as a probability.

And it graphs like this:

As you can see, when x is 0, the output is 0.5.

As x gets larger in either direction, the output gets closer to 0 (when x is negative) or 1 (when x is positive).

How does this integrate into our neural network?

Well, once we take our weighted inputs and sum them together, we pass the sum into the sigmoid function, which returns a probability of the output being 1. The closer the result of sigmoid is to 1, the more likely the result is 1. The closer it is to 0, the more likely the result is 0.

So if our network wants a set of inputs to output 1, it had better make x (the weighted sum) a large, positive number.

Let’s create a function for sigmoid:

    def sigmoid(x):
        return 1 / (1 + np.exp(-x))


So, in Layer 1, we have our output, which is the result of multiplying each of our inputs by its associated weight, summing these values for each set, and passing that value through the sigmoid function, yielding a probability.
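
To get a feel for how sigmoid behaves, it can help to evaluate it at a few points. This is just a quick check using the same function as above; the sample values are arbitrary.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Large negative inputs approach 0, zero gives 0.5, large positive inputs approach 1
for value in [-10, -1, 0, 1, 10]:
    print(value, sigmoid(value))
```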

Now it’s time to see how far off we are from the correct value.

### Figuring Out The Error

Because we had 4 sets of data, our output currently stored in Layer 1 will have 4 “predictions”.

We can compare these predictions to the expected values stored in “y”, which will yield our error:

    # Error
    l1_error = y - l1


Our error value, like “l1” and “y”, will also be a 4×1 matrix containing the error for each set of data.
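
As an illustration, here is what the error looks like for a set of made-up predictions. The values in l1 are hypothetical, not actual output from the network, and the mean squared error line is an extra summary number I find handy for tracking progress.

```python
import numpy as np

y = np.array([[0], [0], [1], [1]])
l1 = np.array([[0.3], [0.6], [0.7], [0.9]])   # hypothetical predictions

l1_error = y - l1
print(l1_error.ravel())   # negative: prediction too high, positive: too low
print(l1_error.shape)     # one error per set of data

# A single number summarizing the error across all 4 sets
mse = np.mean(l1_error ** 2)
print(mse)
```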

Now that we know our error, we need to back propagate the error and update our weights accordingly.

### Back Propagation

Now we get into the meaty stuff.

Having our neural network make predictions is cool, but those predictions aren’t accurate. Back propagation is where our neural network starts to learn and make its predictions more accurate.

Take a look at the sigmoid curve again: you’ll notice that, as the y values approach 0 or 1, the curve flattens out, but when the y value is closer to 0.5, it’s steeper.

This steepness is the “gradient”, or “slope”, of the curve, and can be represented mathematically by the derivative of sigmoid. Luckily, the derivative of sigmoid is rather easy to express: $\text{sigmoid}(x) \cdot (1 - \text{sigmoid}(x))$

Remember that the output from Layer 1 is the sigmoid value. So if we want to figure out the derivative value, it can be expressed as:

    x * (1 - x)


Where “x” is the value of Layer 1.

So if the output from Layer 1 is close to 0.5, it means our neural network is pretty uncertain about the result. 0.5 is basically saying it’s a 50/50 chance of being a 1 or 0. So we have high uncertainty and need to have the network adjust.

But if Layer 1’s value is close to 1 or 0, we have a lower uncertainty.
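
If you want to verify the derivative expression rather than take it on faith, here is a minimal sketch that compares it against a numerical estimate of the slope. The input values in z and the step size h are arbitrary choices for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])   # arbitrary example inputs to sigmoid
out = sigmoid(z)

# Slope computed from the sigmoid output, using out * (1 - out)
slope_from_output = out * (1 - out)

# Slope estimated numerically with a small central difference
h = 1e-5
slope_numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)

print(out)
print(slope_from_output)
print(np.allclose(slope_from_output, slope_numeric))
```

The slope is largest (0.25) where the output is 0.5 and shrinks toward 0 as the output approaches 0 or 1, which matches the uncertainty intuition above.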

But remember that we also have our error value.

So, we multiply our error and the value of the sigmoid derivative.

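This value becomes our delta value. We can express it in code as follows. Note that passing True as the second argument tells sigmoid to return the derivative, which assumes the function takes an optional second parameter and returns x * (1 - x) when it is set:

    # Backpropagation
    l1_delta = l1_error * sigmoid(l1, True)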
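
The sigmoid function defined earlier only takes one argument, so for the call above to work it needs a second one. Here is one possible way to extend it. The parameter name deriv is my assumption, and the predictions in l1 are hypothetical values used only to show the arithmetic.

```python
import numpy as np

def sigmoid(x, deriv=False):
    # When deriv is True, x is assumed to already be a sigmoid output
    if deriv:
        return x * (1 - x)
    return 1 / (1 + np.exp(-x))

y = np.array([[0], [0], [1], [1]])
l1 = np.array([[0.3], [0.6], [0.7], [0.9]])   # hypothetical predictions

l1_error = y - l1
l1_delta = l1_error * sigmoid(l1, True)       # error scaled by the slope
print(l1_delta.ravel())
```

Notice that the prediction of 0.9 gets a much smaller delta than the others: its error is small and it sits on a flat part of the curve.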


So we now know our delta. The last step is to update our weights according to how large our error and uncertainty were.

For each input, we multiply its value in each of the 4 sets by that set’s delta and sum the results. That sum is then added to the associated weight to give us our updated weights.

    # Update weights
    W0 += np.dot(l0.T, l1_delta)


Note that we transpose the Layer 0 matrix. This is because we need to multiply all input 1 values across all 4 sets by their associated “l1_delta”, sum those together, and add the result to the associated weight. Then we do the same with all input 2’s and input 3’s.
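
The sketch below spells out that calculation by hand and compares it with the transposed dot product. The delta values are hypothetical, picked only so the sums are easy to check.

```python
import numpy as np

l0 = np.array([[0, 0, 1],
               [0, 1, 1],
               [1, 0, 1],
               [1, 1, 1]])
l1_delta = np.array([[-0.063], [-0.144], [0.063], [0.009]])   # hypothetical deltas

# Manual version: for each input column, sum input * delta over the 4 sets
manual = np.array([[np.sum(l0[:, i] * l1_delta[:, 0])] for i in range(3)])

# Matrix version: (3, 4) dot (4, 1) gives one adjustment per weight, shape (3, 1)
update = np.dot(l0.T, l1_delta)

print(update.ravel())
print(np.allclose(manual, update))
```

Without the transpose, the shapes (4, 3) and (4, 1) would not line up for the multiplication at all.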

Now we let our network train until it has trained over all iterations.
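
Putting the pieces from this post together, a complete training script might look roughly like the sketch below. The two-argument sigmoid (with a parameter I have named deriv) and the progress printing every 2000 iterations are my assumptions about how to wire things up; see the GitHub repo for the actual full code.

```python
import numpy as np

def sigmoid(x, deriv=False):
    # When deriv is True, x is assumed to already be a sigmoid output
    if deriv:
        return x * (1 - x)
    return 1 / (1 + np.exp(-x))

X = np.array([[0, 0, 1],
              [0, 1, 1],
              [1, 0, 1],
              [1, 1, 1]])
y = np.array([[0], [0], [1], [1]])

np.random.seed(1)
W0 = 2 * np.random.random((3, 1)) - 1

for i in range(10000):
    # Forward propagation
    l0 = X
    l1 = sigmoid(np.dot(l0, W0))

    # Error
    l1_error = y - l1
    if i % 2000 == 0:
        print(i, np.mean(l1_error ** 2))   # mean squared error so far

    # Backpropagation and weight update
    l1_delta = l1_error * sigmoid(l1, True)
    W0 += np.dot(l0.T, l1_delta)

print("Output After Training:")
print(l1)
```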

## Final Output

Once the network is complete, it produces this output:

You’ll see I’ve had it print results every 2000 iterations, showing the iteration number, L1 values, and error (I used mean squared error here to emphasize the error).

Finally, it prints the Output After Training.

You can see the first two output values are very close to 0, while the last two output values are very close to 1.

So it learned how each input is associated with the output and can technically predict the appropriate output!

Even cooler is comparing the L1 values from iteration 0 to the final output values.

You can actually see it learning!

More To Come