Machine Learning (VI)
Reinforcement Learning
Reinforcement Learning (I) | chiblog (chierhy.github.io)
Reinforcement Learning (II) | chiblog (chierhy.github.io)
RL Concepts
policy gradient
version 0: weight each action by its immediate reward; this fails because of reward delay
version 1: cumulative reward
version 2: discount factor
version 3: whether a reward is good or bad is relative, so subtract a baseline b
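The progression of versions above can be sketched in a few lines of numpy. The function names (`discounted_returns`, `advantages_with_baseline`) and the choice of the mean return as the baseline b are illustrative assumptions, not part of the original notes:

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Versions 1-2: cumulative reward with a discount factor.
    G'_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ..."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def advantages_with_baseline(rewards, gamma=0.99):
    """Version 3: subtract a baseline b so rewards become relative.
    Here b is the mean return, one common simple choice (assumed)."""
    g = discounted_returns(rewards, gamma)
    return g - g.mean()

# Toy episode with reward delay: the reward only arrives at the end,
# yet every earlier action receives a share of it via the return.
rewards = [0.0, 0.0, 0.0, 1.0]
print(discounted_returns(rewards, gamma=0.9))       # [0.729 0.81 0.9 1.0]
print(advantages_with_baseline(rewards, gamma=0.9))
```

With the baseline subtracted, the weights sum to zero, so some actions are pushed up and others down even though all raw rewards are non-negative.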
- On-policy limitation: each time you update the model parameters, you need to collect the whole training set again.
Off-policy → Proximal Policy Optimization (PPO) • The actor being trained has to account for how it differs from the actor that interacts with the environment.
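PPO handles that difference by clipping the importance ratio between the two actors. A minimal numpy sketch of the PPO-Clip surrogate objective (function name and inputs are illustrative; in practice the log-probabilities come from the policy networks):

```python
import numpy as np

def ppo_clip_objective(new_logp, old_logp, advantage, eps=0.2):
    """PPO-Clip surrogate objective.
    The ratio between the actor being trained and the actor that
    collected the data is clipped to [1 - eps, 1 + eps], so one
    gradient step cannot move the policy too far off-policy."""
    ratio = np.exp(new_logp - old_logp)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # Take the pessimistic (elementwise minimum) bound, averaged over samples.
    return np.minimum(unclipped, clipped).mean()
```

When the two actors agree (equal log-probabilities) the objective reduces to the mean advantage; when the new actor drifts, the clipped term caps how much the objective can reward that drift.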
Actor-Critic
how to estimate the value function V(s), e.g. Monte Carlo (MC) or temporal-difference (TD)
version 3.5: use the critic's value V(s) as the baseline b
version 4: "average minus average", Advantage Actor-Critic
Tip for Actor-Critic • The parameters of the actor and the critic can be shared.
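The "average minus average" idea of version 4 can be written as the one-step TD advantage: the expected return after taking the action, r_t + γV(s_{t+1}), minus the expected return before, V(s_t). A small sketch (the function name and the zero terminal value are assumptions):

```python
import numpy as np

def td_advantage(rewards, values, gamma=0.99):
    """Version 4 (Advantage Actor-Critic):
    A_t = r_t + gamma * V(s_{t+1}) - V(s_t),
    i.e. the critic's average after the action minus its average before.
    Assumes `values` covers one episode and V = 0 after the terminal state."""
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    next_values = np.append(values[1:], 0.0)  # V(terminal) = 0
    return rewards + gamma * next_values - values
```

A positive A_t means the action did better than the critic expected from that state, so the actor raises its probability; a negative A_t lowers it.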
DQN, Deep Q Network
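The core of DQN is the regression target the Q-network is trained toward, computed with a separate frozen target network. A minimal sketch of that target (the function name is illustrative, and `next_q_values` stands in for the target network's output on the next state):

```python
import numpy as np

def dqn_target(reward, next_q_values, done, gamma=0.99):
    """DQN regression target: y = r + gamma * max_a' Q_target(s', a').
    `done` is 1.0 at episode end (no bootstrapping past a terminal state),
    0.0 otherwise. The online Q-network is then trained to predict y."""
    return reward + gamma * (1.0 - done) * np.max(next_q_values)
```

In the full algorithm this target is computed over minibatches sampled from a replay buffer, and the target network's weights are only periodically copied from the online network.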
Reward Shaping
Most of the time the reward is 0, so every action looks about equally good, e.g. playing Go or controlling a robot arm.
- Reward shaping: curiosity. Obtaining extra reward when the agent sees something new (but meaningful).
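One simple way to instantiate a curiosity bonus is count-based novelty: the agent gets a shaped reward that shrinks as a state is revisited. This is only one hedged illustration of "extra reward for seeing something new"; the class name and the 1/sqrt(N) form are assumptions, not the specific method in the notes:

```python
from collections import Counter
import math

class CountCuriosity:
    """Count-based curiosity bonus: bonus(s) = beta / sqrt(N(s)),
    where N(s) is how many times state s has been visited.
    Novel states pay a large bonus; familiar states pay almost nothing."""

    def __init__(self, beta=1.0):
        self.beta = beta
        self.counts = Counter()

    def bonus(self, state):
        self.counts[state] += 1
        return self.beta / math.sqrt(self.counts[state])
```

The curiosity bonus is added to the (mostly zero) environment reward, so in sparse-reward tasks like Go or robot-arm control the agent still gets a learning signal from exploration.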
No Reward: Learning from Demonstration
- Motivation • Even defining a reward can be challenging in some tasks. • Hand-crafted rewards can lead to uncontrolled behavior.
- Imitation Learning
- Inverse Reinforcement Learning
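The simplest form of imitation learning is behavior cloning: treat the expert's (state, action) pairs as a supervised dataset and fit a policy to them. A toy sketch with a linear policy fit by least squares (the synthetic data and the linear form are assumptions for illustration):

```python
import numpy as np

# Behavior cloning: supervised learning on expert demonstrations.
# Synthetic expert: a noise-free linear policy action = state @ true_w.
rng = np.random.default_rng(0)
expert_states = rng.normal(size=(100, 4))
true_w = np.array([1.0, -2.0, 0.5, 0.0])
expert_actions = expert_states @ true_w  # expert (state, action) pairs

# Fit the cloned policy by least squares; it then imitates the expert
# with action = state @ w, never querying the reward at all.
w, *_ = np.linalg.lstsq(expert_states, expert_actions, rcond=None)
```

Inverse reinforcement learning goes one step further: instead of copying actions directly, it infers a reward function under which the expert's demonstrations are optimal, then trains an actor against that learned reward.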
Reference: An Overview of Deep Reinforcement Learning (ntu.edu.tw)