 # Exploration vs. Exploitation – Learning the Optimal Reinforcement Learning Policy

## 15 thoughts on “Exploration vs. Exploitation – Learning the Optimal Reinforcement Learning Policy”

1. deeplizard says:

Check out the corresponding blog and other resources for this video at:
http://deeplizard.com/learn/video/mo96Nqlo1L8

2. Nabil Baalbaki says:

I am a bit confused about how we got the Q-value update equation, specifically where the learning rate and the (1 – alpha) term were introduced.
Never mind, amazing things happen when you write down an equation and simplify it 🙂 Thanks for the video.
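For anyone stuck on the same point, the simplification being alluded to can be written out (using the notation from the series):

```latex
q^{\text{new}}(s,a) = (1-\alpha)\,q(s,a) + \alpha\bigl(r + \gamma \max_{a'} q(s',a')\bigr)
                    = q(s,a) + \alpha\bigl(r + \gamma \max_{a'} q(s',a') - q(s,a)\bigr)
```

The second form shows the update is just the old estimate nudged toward the learned value by a step of size alpha.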

3. hazzaldo N says:

Outstanding video series. Many thanks for going through all this effort to teach us this intriguing concept. I have one question on this video, which I would deeply appreciate if someone could clarify:

I didn’t quite understand the logic behind the Q-value update formula. Specifically, I don’t see how a fixed learning rate, which always weights one Q-value over the other (i.e. the old Q-value versus the learned Q-value), can optimise and converge to the optimal Q-value. Since one of the two always gets the higher weight, even if the other might be yielding a better value over a number of iterations, the formula seems to act as a bias rather than a learning optimisation. I hope this question makes sense, and many thanks in advance for any answers.
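On the convergence worry above: a quick numeric sketch (the constants here are illustrative assumptions, not from the video) shows that even a fixed learning rate does not permanently bias the estimate, because the weighted average is applied repeatedly and the old value's influence shrinks geometrically.

```python
# Even with a fixed alpha, repeatedly averaging toward a (fixed) target
# converges to that target; the old value's weight decays as (1-alpha)^n.
alpha, target = 0.1, 5.0

q = 0.0  # start far from the target
for _ in range(200):
    q = (1 - alpha) * q + alpha * target  # same form as the Q-update

print(round(q, 4))  # -> 5.0
```

In real Q-learning the target itself moves (it depends on the Q-table), but the same averaging mechanism is what pulls each entry toward its optimal value.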

4. Hang Chen says:

Edit: I got it from your blog. The Bellman equation is actually used as the second part of the Q-learning update function.

Original question:
The explanation is awesome! But I have a doubt here. The Bellman equation was introduced in the previous episode, but it seems that when we implement Q-learning, we just need the Q-value update function to update the values in the Q-table? So where is the Bellman equation used in Q-learning, or is it just a conceptual idea? Thanks!
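To make the connection concrete, here is a minimal sketch of where the Bellman equation lives inside the Q-learning update: it supplies the "target" term. The table shape, alpha, and gamma below are illustrative assumptions, not values from the video.

```python
import numpy as np

alpha, gamma = 0.1, 0.99

def q_learning_update(q_table, s, a, reward, s_next):
    # Bellman target: immediate reward plus discounted best future value
    target = reward + gamma * np.max(q_table[s_next])
    # Weighted average of the old estimate and the Bellman target
    q_table[s, a] = (1 - alpha) * q_table[s, a] + alpha * target
    return q_table

q = q_learning_update(np.zeros((4, 2)), s=0, a=1, reward=1.0, s_next=2)
print(q[0, 1])  # -> 0.1
```

So the Bellman equation is not just conceptual: it defines what the Q-values should satisfy at optimality, and the update repeatedly nudges the table toward that fixed point.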

5. arian vc says:

This series has been the best introductory course I've seen so far.

6. RatedRudy says:

Your other video series are a lot more intuitive; this whole RL series is a lot less so, since it mostly covers the math behind it. For someone who doesn't know the topic well, the best approach would be to start by going through the lizard game step by step, and then cover the math. I was very much a fan of your videos, but honestly this series is not very well explained.

7. Nepal Crypto says:

Hi, in your blog post, when you put the values into the equation, where did you get the value of γ = 0.99?

I am a bit dumb... it would be great if you could explain. Thanks for the videos.

I have finished your fundamentals of deep learning series. It was absolutely great. Can you please add RNN/LSTM videos? It would also be great to see some KDD99, word2vec, and autoencoder content. Thank you for your effort.

8. Luigi Faticoso says:

The explanation is really good for me (a university student in artificial intelligence)! Superb video and audio quality, and really understandable graphics! This channel deserves more attention! Thank you!

9. 刘新新 says:

What I learned:
1. Exploration vs. exploitation: the key question is whether or not to choose the highest Q-value for a given state.
2. Balancing the two: the epsilon-greedy strategy (EGS).
3. EGS: we set an epsilon that gives the probability of exploring. Setting it to 1 at first means a 100% chance to explore (choose a random action rather than the best one).
4. Greedy: the agent becomes greedy as it learns the environment (no more exploration, just exploitation).
5. I used to think updating the Q-table was easy, e.g. just overwriting with the value you learned (-1, say). Now I get it: first you must set a learning rate, so you don't completely forget the old value. Also, the new value is not only this step's reward; it is the return including this step and the steps after it (the Bellman equation).
6. Max steps: we set a condition to stop an episode.
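The epsilon-greedy points above can be sketched in a few lines. The decay schedule and constants here are illustrative assumptions, not values from the video.

```python
import random

def choose_action(q_row, epsilon):
    """With probability epsilon explore (random action); otherwise exploit."""
    if random.random() < epsilon:
        return random.randrange(len(q_row))                 # explore
    return max(range(len(q_row)), key=lambda a: q_row[a])   # exploit

epsilon, decay, min_epsilon = 1.0, 0.99, 0.01
for episode in range(100):
    # ... run one episode, picking actions via choose_action(q_row, epsilon) ...
    epsilon = max(min_epsilon, epsilon * decay)  # grow greedier over time

print(choose_action([0.2, 0.9, 0.1], epsilon=0.0))  # -> 1
```

With epsilon = 0 the agent is purely greedy, which is the "no more exploring, just exploiting" endpoint described in point 4.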

Questions:
1. Once the lizard knows the bird will kill it, does it still need to explore that state again?
2. I don't get the point of the chapter on updating the Q-value. In my view, the article would still make sense without it.
3. I don't think the example is good enough, since max Q(s',a') is zero.

10. Amey Naik says:

Very good explanation. It would be great if you had worked out the example completely (preferably step by step). Thanks!
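In that spirit, here is one fully worked update step. The numbers (alpha, reward, and the next-state maximum) are illustrative assumptions chosen for the arithmetic, not values from the blog.

```python
alpha, gamma = 0.7, 0.99

q_old = 0.0        # current Q(s, a)
reward = -1.0      # reward for taking action a in state s
max_q_next = 0.0   # max over a' of Q(s', a') in the new state

# (1 - 0.7) * 0.0  +  0.7 * (-1.0 + 0.99 * 0.0)  =  -0.7
q_new = (1 - alpha) * q_old + alpha * (reward + gamma * max_q_next)
print(q_new)  # -> -0.7
```

On later visits to the same state-action pair, q_old would be -0.7 and max_q_next would generally be nonzero, so the same arithmetic keeps refining the entry.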

11. Chris Freiling says:

The learning process seems to rely on both epsilon and alpha. The role of epsilon is clear, but alpha is more confusing. If I were making this stuff up, I would have picked alpha to depend on how many times that specific action was taken at the current state. But I'm not the inventor, so I am going to assume that we should choose alpha and epsilon in order to get our policy to converge to the optimum policy as quickly as possible. Is that right? Of course, this begs the question: How do we know that we will converge to the optimum policy?
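For what it's worth, the visit-count idea described above is a classical choice: a per-(state, action) learning rate such as alpha_n = 1/N(s, a), which decays in exactly the way the standard convergence conditions for tabular Q-learning require (step sizes that sum to infinity but whose squares sum to a finite value). A minimal sketch, with illustrative names:

```python
from collections import defaultdict

visit_count = defaultdict(int)  # N(s, a), starting at zero for every pair

def adaptive_alpha(state, action):
    """Return alpha_n = 1 / N(s, a) after incrementing the visit count."""
    visit_count[(state, action)] += 1
    return 1.0 / visit_count[(state, action)]

print(adaptive_alpha("s0", "left"))  # -> 1.0
print(adaptive_alpha("s0", "left"))  # -> 0.5
```

In practice many tutorials (including this series) use a small fixed alpha instead, which works well empirically even though the formal convergence guarantee assumes a decaying step size.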

12. Chris Freiling says:

Great videos! Thanks! I'm learning a lot (I think–but maybe not). Here's something I'm puzzled about. In the definition of "loss" there is an expression "q*(s,a)" and an expression "q(s,a)" with no subscript. But I'm not sure what policies are being used for each. There are three policies in the back of my mind. There is the true optimum policy which is what we are trying to find, but we don't know it yet. There is the current policy that takes into account the exploration rate, "epsilon". And there is the current approximation to the optimum policy, by which I mean the current policy with epsilon = 0. Could you please make it clear to me which policy goes with each of these q's? Thanks!

13. garyblauer says:

14. Peter Heinrich says:

15. Subrat Swain says: