Reinforcement learning is very simple at its core, and it can be used for a wide variety of tasks, such as teaching agents to play games, drive cars, and fly planes. In this article, we will go over the core concepts of reinforcement learning so that the next time you read an advanced paper, you’ll have some idea of what it’s talking about.
What is reinforcement learning
Reinforcement learning is just another technique for training a model to do a task; the difference is that instead of learning from labeled answers, the model learns from rewards it receives for its actions.
When to use reinforcement learning
Reinforcement learning only beats supervised learning and its alternatives when there is no single clear answer to a problem. For example, in a game of Snake you may have four possible moves: two that move away from the apple but avoid running into yourself later, one that keeps going straight, and one that moves toward the apple but will eventually force the snake to run into itself.
Reinforcement learning also has a much higher ceiling than supervised learning. For example, if you recorded each frame and each of your moves while playing TrackMania and then trained a model on that data, the model could only ever be as good as you; it would never be better. Reinforcement learning works much more like a human learning on their own, while supervised learning works more like a person being taught directly how to do something. Supervised learning may be easier, but in the end it can be limiting.
How Reinforcement Learning Works
Before we can fully understand how reinforcement learning works, we first need to understand how a neural network works. In a nutshell, a neural network is just a mathematical function: it takes an input and gives an output. To keep things simple, let’s think of one with a single input, a single output, and zero hidden layers. It could be represented as f(x) = mx + b. Look familiar?
Then we establish what we are going to train this model on. In this example, we’ll say that we want f(x) to be 5 when x is 1. To do this, we run the model, f(1) = m(1) + b, and gather the output (m and b are randomly initialized). Then we compare what the model says f(1) is against what we want it to be (5). Next, we run what is called “backpropagation” on the model to tune it so that its output is closer to 5. After a few iterations, the function could look something like f(x) = 3.3x + 1.7. To make the line more precise (with just f(1) = 5 to go off of, our m and b values could be any pair of numbers that adds to five), we would just need to include more training examples.
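Here’s a rough sketch of that idea in plain Python. The learning rate, iteration count, and the second training example are made up for illustration, and the “backpropagation” here is just the gradient update for this one-line function:

```python
import random

# A tiny "neural network": f(x) = m*x + b with randomly initialized m and b.
m = random.random()
b = random.random()
learning_rate = 0.1

# Training examples: we want f(1) = 5, plus a second (illustrative) example f(2) = 8.
examples = [(1.0, 5.0), (2.0, 8.0)]

for step in range(1000):
    for x, target in examples:
        prediction = m * x + b
        error = prediction - target  # how far off the model is
        # "Backpropagation" for this tiny model is just nudging m and b
        # in the direction that shrinks the error.
        m -= learning_rate * error * x
        b -= learning_rate * error

print(m, b)  # converges toward m = 3, b = 2, so f(1) = 5 and f(2) = 8
```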
Note that this is a very simplified explanation; check out 3blue1brown’s series on the topic for a proper deep dive.
Reinforcement learning works in a very similar way, except that we don’t know the desired output, only the reward associated with what the model gives us.
Say that you have a model that plays Snake. You want the snake to move towards the apple, so you might give a reward of +1 if the snake moves towards the apple and -1 if it moves away.
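One way that reward rule could look in code (the grid coordinates, function names, and the use of Manhattan distance are just assumptions for illustration):

```python
def manhattan_distance(a, b):
    # Number of grid blocks between two (x, y) positions.
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def reward_for_move(head_before, head_after, apple):
    """+1 if the snake's head got closer to the apple, -1 otherwise."""
    before = manhattan_distance(head_before, apple)
    after = manhattan_distance(head_after, apple)
    return 1 if after < before else -1

# Example: the head moves from (5, 5) to (5, 4) and the apple is at (5, 1).
print(reward_for_move((5, 5), (5, 4), (5, 1)))  # 1
```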
The first time you ask your model for an action, it spits out an essentially random array of numbers, and the highest number corresponds to the action the model wishes to take.
For example, let’s say that you are making a Snake game. As inputs to the model, you could pass the angle to the apple and the distance to the apple. You could represent this as [90, 4], showing that the apple is 4 blocks directly above the snake’s head.
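Here’s a small sketch of how you might build that input. The coordinate convention (y growing downward, 90 degrees meaning “directly above the head”) and the use of Manhattan distance for the block count are assumptions for illustration:

```python
import math

def encode_state(head, apple):
    """Return [angle_to_apple_in_degrees, distance_in_blocks]."""
    dx = apple[0] - head[0]
    dy = head[1] - apple[1]       # positive when the apple is above the head
    angle = math.degrees(math.atan2(dy, dx))
    distance = abs(dx) + abs(dy)  # Manhattan distance in grid blocks
    return [angle, distance]

# Apple 4 blocks directly above the head -> [90.0, 4]
print(encode_state(head=(5, 5), apple=(5, 1)))
```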
In our example, the output would look something like [0.075, 0.5, 0.4, 0.025]. This tells you that the model is 50% sure that the action to take is the second one. You can map this to a direction very easily: say action 1 is up, action 2 is down, action 3 is left, and action 4 is right.
If in this hypothetical example going down was the right move, you would give the model a reward of +1.
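Picking the action and handing out the reward could look something like this, using the hypothetical output values from above:

```python
# Hypothetical model output for the input [90, 4]
output = [0.075, 0.5, 0.4, 0.025]
actions = ["up", "down", "left", "right"]

# The chosen action is the index of the highest output value.
chosen = output.index(max(output))
print(actions[chosen])  # "down" (index 1, value 0.5)

# In this hypothetical example, going down happened to be the right move,
# so the game gives the model a reward of +1.
reward = 1
```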
For your reward to have any effect on training, you need to amplify the correct options and diminish the incorrect ones. We know action 2 was the correct one because it was associated with a reward of +1, which means all the others can be regarded as “incorrect”. To apply this, we take the reward (+1), multiply it by a variable called “gamma”, which controls how much impact rewards have on training (bigger is not always better), and then add the original value.
In our example, action 2 was chosen as the action the model wanted to take (to determine this we find the highest value in the outputs, in this case 0.5), and it received a reward of +1. If we set our gamma to 0.9, we get the formula reward * gamma + original, and filling in the values gives 1 * 0.9 + 0.5, which equals 1.4.
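In code, that amplification step is a single line (again using the hypothetical output array from above):

```python
gamma = 0.9
reward = 1
output = [0.075, 0.5, 0.4, 0.025]
chosen = 1  # action 2 (index 1) had the highest value, 0.5

# Amplify the chosen action: reward * gamma + original value
output[chosen] = reward * gamma + output[chosen]
print(output)  # [0.075, 1.4, 0.4, 0.025]
```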
I do not recommend diminishing the other values (for example, by multiplying them by -1 * reward), because if your reward is negative you could accidentally end up amplifying other incorrect values. Instead, we set action 2 in the original array equal to the new value and train the model on the input and the updated output (in this case, our updated output is [0.075, 1.4, 0.4, 0.025]). The model will train itself on these new values, and the next time it gets an input of [90, 4] it will put much more emphasis on the second value than it originally would have.
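Putting it all together, one full update step could look roughly like this. It assumes a Keras-style model exposing predict and fit, plus a reward_fn callback supplied by the game; those names are assumptions for illustration, not a specific project’s API:

```python
import numpy as np

GAMMA = 0.9  # how much impact rewards have on training

def train_step(model, state, reward_fn):
    """One hypothetical reinforcement-learning update for the snake model."""
    output = model.predict(np.array([state]))[0]  # e.g. [0.075, 0.5, 0.4, 0.025]
    chosen = int(np.argmax(output))               # action the model wants to take
    reward = reward_fn(chosen)                    # +1 or -1 from the game

    # Amplify only the chosen action's value: reward * gamma + original value.
    target = output.copy()
    target[chosen] = reward * GAMMA + target[chosen]

    # Train the model on the same input with the updated output as the target.
    model.fit(np.array([state]), np.array([target]), verbose=0)
```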
In a nutshell
A model is trained by modifying its weights and biases (the m and b variables in the example) to get it to output values closer and closer to the desired output.
In supervised learning, those desired outputs are predetermined; in reinforcement learning, they are determined with the following formula:
output[chosenAction] = reward * gamma + output[chosenAction]