Exploration vs. Exploitation - Learning the Optimal Reinforcement Learning Policy

What's up, guys? Welcome back to this series on reinforcement learning. Last time, we left our discussion of Q-learning with the question of how an agent chooses to either explore the environment or exploit it in order to select its actions. To answer this question, we'll introduce a type of strategy called an epsilon greedy strategy, so let's get to it.

We're going to pick up right where we left off last time by continuing our explanation of Q-learning with the lizard game example. Remember, in each episode, the agent (the lizard, in our case) starts out by choosing an action from the starting state based on the current Q-value estimates in the Q-table. The lizard chooses the action with the highest Q-value for the given state. Since we know that all of the Q-values are first initialized to zero, though, there's no way for the lizard to differentiate between them at the starting state of the first episode. So the question remains: what action does the agent start with? Furthermore, for subsequent states, is it really as straightforward as just selecting the action with the highest Q-value for the given state? Additionally, we know that we need a balance of exploration and exploitation to choose actions, but how exactly is this achieved? With an epsilon greedy strategy, so let's explore that strategy now.

To get the balance between exploitation and exploration, we use what's called an epsilon greedy strategy. With this strategy, we define an exploration rate epsilon that we initially set to 1. This exploration rate is the probability that our agent will explore the environment rather than exploit it. With epsilon equal to 1, it's one hundred percent certain that the agent will start out by exploring the environment. As the agent learns more about the environment, at the start of each new episode, epsilon will decay by some rate that we set, so that the likelihood of exploration becomes less and less probable as the agent learns more and more about the environment. The agent will become, in a sense, greedy in terms of exploiting the environment once it's had the opportunity to explore and learn more about it.

To determine whether the agent will choose exploration or exploitation at each time step, we generate a random number between zero and one. If this number is greater than epsilon, then the agent will choose its next action via exploitation, i.e. it will choose the action with the highest Q-value for its current state from the Q-table. Otherwise, its next action will be chosen via exploration, i.e. randomly choosing its action and exploring what happens in the environment.

So, recall that we first started talking about the exploration-exploitation trade-off last time because we were discussing how the lizard should choose its very first action, since all the actions have a Q-value of zero at the start. Well, now we should know that the action will be chosen randomly via exploration, since the exploration rate is set to 1 initially, meaning that with one hundred percent probability, the lizard will explore the environment during the first episode of the game rather than exploit it.

Alright, so after the lizard takes an action, it observes the next state, the reward gained from its action, and then updates the Q-value in the Q-table for the action it took from the previous state. Let's suppose the lizard chose to move right as its action from the starting state. We can see the reward the lizard gets in this new state is -1 since, recall, empty tiles have a reward of -1.

To update the Q-value for the action of moving right taken from the previous state, we use the Bellman equation that we highlighted previously. We want to make the Q-value for the given state-action pair as close as we can to the right-hand side of the Bellman equation, so that the Q-value will eventually converge to the optimal Q-value q*. This will happen over time by iteratively comparing the loss between the Q-value and the optimal Q-value for the given state-action pair, and then updating the Q-value over and over again each time the agent encounters this same state-action pair. The objective of this process is to reduce the loss between the Q-value and the optimal Q-value.

To actually see how we update the Q-value, we first need to introduce the idea of a learning rate. The learning rate is a number between zero and one, which can be thought of as how quickly the agent abandons the previous Q-value in the Q-table for the new Q-value for a given state-action pair. So, for example, suppose we have a Q-value in the Q-table for some arbitrary state-action pair that the agent has experienced in a previous time step. Well, if the agent experiences that same state-action pair at a later time step, once it's learned more about the environment, the Q-value will need to be updated to reflect the change in expectations the agent now has for the future return. We don't want to just overwrite the old Q-value, though. Rather, we use the learning rate as a tool to determine how much information we keep about the previously computed Q-value for the state-action pair versus the new Q-value calculated for the same state-action pair at a later time step.

We'll denote the learning rate with the symbol alpha, and we'll arbitrarily set alpha equal to 0.7 for our lizard game example. The higher the learning rate, the more quickly the agent will adopt the new Q-value. For example, if the learning rate is 1, the estimate for the Q-value for a given state-action pair would be the straight-up newly calculated Q-value, and it wouldn't consider any Q-values that had previously been calculated for the given state-action pair in earlier time steps.

Now, let's see how exactly the new Q-value is calculated using the learning rate. Specifically, we'll see how the Q-value is calculated for the example of the lizard taking the action of moving right from the starting state. The formula for calculating the new Q-value for a state-action pair (s, a) at time t is this:

q_new(s, a) = (1 - alpha) * q(s, a) + alpha * (R(s, a) + gamma * max q(s', a'))

So, our new Q-value is equal to a weighted sum of our old value and the learned value. The old value in our case is zero, since this is the first time the agent is experiencing this particular state-action pair, and we multiply this old value by 1 - alpha. Our learned value is the reward we received from moving right from the starting state, plus the discounted estimate of the optimal future Q-value for the next state-action pair (s', a') at time t+1. This entire learned value is then multiplied by our learning rate. All of the math for this calculation of our concrete example state-action pair is shown here, so take a moment to pause and make sure you've got everything down.

Alright, so now we'll take this new Q-value we just calculated and store it in our Q-table for this particular state-action pair. We've now done everything needed for a single time step, and this same process will happen for each time step until termination in each episode.

Oh, and speaking of termination, we can also specify a max number of steps that our agent can take before the episode auto-terminates. With the way the game is set up right now, termination will only occur if the lizard reaches the state with five crickets or the state with the bird. We could define some condition, though, that states that if the lizard hasn't reached termination by either one of these two states after one hundred steps, then we terminate the game after the hundredth step.

Now, finally, once the Q-function converges to the optimal Q-function, we can obtain our optimal policy.

Alright, now, I know all of that was kind of a lot, so I've provided really condensed, summarized steps for how Q-learning works, covering everything from the last video and this video. They're available on the corresponding blog for this video, along with all the full details for everything we've covered so far, so check that out and be sure you're taking advantage of that resource. In the next video, we're going to see how we can implement this Q-learning algorithm step by step in code, using Python to play a simple game.
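As a preview, here's a minimal Python sketch of the epsilon-greedy action selection and Q-value update steps described above. This is not the video's implementation (that arrives in the next video): the grid size, the state/action encoding, and the epsilon decay value are illustrative assumptions. Alpha = 0.7 is the learning rate chosen in this video, and gamma = 0.99 is the discount rate used on the corresponding blog.

```python
import random
import numpy as np

n_states = 16          # e.g. a 4x4 grid of tiles (assumed layout)
n_actions = 4          # up, down, left, right
q_table = np.zeros((n_states, n_actions))   # all Q-values start at zero

alpha = 0.7            # learning rate, as chosen in the video
gamma = 0.99           # discount rate (the value used on the blog)
epsilon = 1.0          # exploration rate, initially 1
epsilon_decay = 0.99   # per-episode decay rate (an assumed value)

def choose_action(state):
    """Epsilon-greedy: explore with probability epsilon, else exploit."""
    if random.random() > epsilon:
        return int(np.argmax(q_table[state]))   # exploit: highest Q-value
    return random.randrange(n_actions)          # explore: random action

def update_q(state, action, reward, next_state):
    """New Q-value = weighted sum of the old value and the learned value."""
    learned = reward + gamma * np.max(q_table[next_state])
    q_table[state, action] = (1 - alpha) * q_table[state, action] + alpha * learned

# One concrete step: moving right from the start onto an empty tile (reward -1).
# The old Q-value is zero, so the new value is alpha * (-1 + gamma * 0) = -0.7.
update_q(0, 1, -1.0, 1)
print(q_table[0, 1])   # prints -0.7

epsilon *= epsilon_decay   # decay exploration at the start of each new episode
```

Note how alpha weights the old value against the learned value: with alpha equal to 1, the old value would be discarded entirely, exactly as described above.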
We'll have lots more to talk about there. Let me know in the comments if you're stoked to actually start getting your hands dirty and implement some of this stuff, and be sure to leave a thumbs up if you are. Thanks for contributing to collective intelligence, and I'll see you in the next one.

Artificial intelligence will change our world. Machines that can think like humans, yet crunch through massive amounts of data in a very inhuman way, may be exactly the tool we need to solve our biggest problems. AI may also unlock the remaining mysteries of the human mind, and this is what intrigues me the most. I'm not a neuroscientist, but as a computer scientist, I write programs that can learn: learn from their experience, and change and adapt. As for the environment, we mainly use games at DeepMind. Why is that? Well, games are very, very useful because we can readily evaluate how well the agent does, by pitting it against a human player, for example, or comparing its score to other agents. But more importantly, we can think of games as being like a microcosm of human ability, because they are so diverse and so ubiquitous across human culture. So they're incredibly valuable if what we want is to both develop and demonstrate artificial intelligence.

15 thoughts on “Exploration vs. Exploitation – Learning the Optimal Reinforcement Learning Policy”

  1. Check out the corresponding blog and other resources for this video at:

  2. I am a bit confused about how we got the Q-value update equation, specifically where the learning rate and the (1 – alpha) term were introduced.
    Never mind, amazing things happen when you write down an equation and simplify it 🙂 Thanks for the video!

  3. Outstanding video series. Many thanks for going through all this effort to teach us this intriguing concept. I have one question on this video, which I would deeply appreciate if someone could clarify:

    I didn’t quite understand the logic behind the Q-value update formula. Specifically, I didn’t understand how placing a fixed learning rate, in such a way that it always favours one Q-value over the other (i.e. between the old Q-value and the learned Q-value), will optimise and converge to the optimal Q-value. It’s always going to favour one Q-value over the other (i.e. give it a higher weight), even if the other Q-value might be yielding a better value over a number of iterations. The formula just seems to be acting as a bias rather than a learning optimisation formula. I hope this question makes sense, and many thanks in advance for any answers.

  4. Edit: I got it from your blog. The Bellman equation is actually used as the second part in the Q-learning update function.

    Original question:
    The explanation is awesome! But I have a doubt here. The Bellman equation was introduced in the previous episode, but it seems when we implement Q-learning, we just need that Q value function to update the Q values in the Q-table? So where is the Bellman equation used in Q-learning or the Bellman equation is just a conceptual idea? Thanks!

  5. Your other video series are a lot more intuitive… this whole RL series is a lot less intuitive… you just cover the math behind it. For someone who doesn't know it well, the best thing would be to start by going through the lizard game step by step, and then cover the math… I was very much a fan of your videos, but honestly, this whole series is not very well explained.

  6. Hi, on your blog post, when you put the values into your equation, where did you get the value of γ as 0.99?

    I am a bit dumb.. it would be great if you could explain. Thanks for the videos..

    I have finished your fundamentals of deep learning series. It was absolutely great. Can you please add an RNN/LSTM video? It would also be great to see some KDD99, word2vec, and autoencoders. Thank you for your effort.

  7. Explanation is really good for me (university student in artificial intelligence)! Superbe video and audio quality, really understandable graphics!! This channel deserves more! Thank you!

  8. What I learned:
    1. Exploration and exploitation: the key is whether or not to choose the highest Q-value for the given state.
    2. Balance: the epsilon greedy strategy (EGS).
    3. EGS: we set an epsilon to decide, which is the probability of exploring. Setting it to 1 at first means a 100% chance to explore (not choosing the best action, but a random one).
    4. Greedy: the agent becomes greedy once it has learned the environment (no more exploring, just exploitation).
    5. Before, I thought updating the Q-table was easy: every time, you just update with the value you learned, -1 for example. Now I get it. First, you must set a learning rate, since you can't just forget the old value. And the value is not only the reward for this step; it's the return including this step and the steps after (the Bellman equation).
    6. Max steps: we set a condition to stop.

    Questions:
    1. When the lizard knows the bird will kill it, does it need to explore that state again?
    2. I don't get the idea behind the chapter on updating the Q-value. In my view, this article would make sense without it.
    3. I think the example is not good enough, since max q(s', a') is zero.

  9. Very good explanation. It would be great if you had worked out the example completely. (preferably step by step). Thanks!

  10. The learning process seems to rely on both epsilon and alpha. The role of epsilon is clear, but alpha is more confusing. If I were making this stuff up, I would have picked alpha to depend on how many times that specific action was taken at the current state. But I'm not the inventor, so I am going to assume that we should choose alpha and epsilon in order to get our policy to converge to the optimum policy as quickly as possible. Is that right? Of course, this begs the question: How do we know that we will converge to the optimum policy?

  11. Great videos! Thanks! I'm learning a lot (I think–but maybe not). Here's something I'm puzzled about. In the definition of "loss" there is an expression "q*(s,a)" and an expression "q(s,a)" with no subscript. But I'm not sure what policies are being used for each. There are three policies in the back of my mind. There is the true optimum policy which is what we are trying to find, but we don't know it yet. There is the current policy that takes into account the exploration rate, "epsilon". And there is the current approximation to the optimum policy, by which I mean the current policy with epsilon = 0. Could you please make it clear to me which policy goes with each of these q's? Thanks!

  12. Two comments:
    1. I think this is very good and about as clear as it could be using the formal equations. A different approach which helped me a lot, perhaps usable in conjunction, is to walk through updating the Q table “by hand”. Here is a good treatment I found, which might make a great video. http://mnemstudio.org/path-finding-q-learning-tutorial.htm
    2. Your Keras playlist is terrific. If you could do Keras in this series and not just PyTorch, that would be very much appreciated.

  13. The chunking is fairly good so far. I like the progression. Great video and audio quality. The graphics are very good also. I'm hoping that the issue of calculating max value of future state/action pairs is well covered. I'm wondering what happens when you don't know what the actual future values will be. (obviously not like the 4×4 square problem)
