Machine Learning (VI)
Reinforcement Learning
Reinforcement Learning (I) | chiblog (chierhy.github.io)
Reinforcement Learning (II) | chiblog (chierhy.github.io)
RL Concepts
policy gradient
version 0: weight each action by its immediate reward; this fails because of reward delay
version 1: cumulative reward
version 2: discount factor
version 3: whether a reward is good or bad is relative, so subtract a baseline b
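The progression of versions above can be sketched in a few lines of numpy. The function names (`discounted_returns`, `advantages_with_baseline`) and the choice of the mean return as the baseline b are illustrative assumptions, not part of the original notes:

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Versions 1-2: cumulative reward with a discount factor.
    G'_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ..."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def advantages_with_baseline(rewards, gamma=0.99):
    """Version 3: subtract a baseline b so rewards become relative.
    Here b is the mean return, one common simple choice (assumed)."""
    g = discounted_returns(rewards, gamma)
    return g - g.mean()

# Toy episode with reward delay: the reward only arrives at the end,
# yet every earlier action receives a share of it via the return.
rewards = [0.0, 0.0, 0.0, 1.0]
print(discounted_returns(rewards, gamma=0.9))       # [0.729 0.81 0.9 1.0]
print(advantages_with_baseline(rewards, gamma=0.9))
```

With the baseline subtracted, the weights sum to zero, so some actions are pushed up and others down even though all raw rewards are non-negative.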
- On-policy limitation: each time you update the model parameters, you need to collect the whole training set again.
Off-policy → Proximal Policy Optimization (PPO) • The actor being trained has to account for how it differs from the actor that interacts with the environment.
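PPO handles that difference by clipping the importance ratio between the two actors. A minimal numpy sketch of the PPO-Clip surrogate objective (function name and inputs are illustrative; in practice the log-probabilities come from the policy networks):

```python
import numpy as np

def ppo_clip_objective(new_logp, old_logp, advantage, eps=0.2):
    """PPO-Clip surrogate objective.
    The ratio between the actor being trained and the actor that
    collected the data is clipped to [1 - eps, 1 + eps], so one
    gradient step cannot move the policy too far off-policy."""
    ratio = np.exp(new_logp - old_logp)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # Take the pessimistic (elementwise minimum) bound, averaged over samples.
    return np.minimum(unclipped, clipped).mean()
```

When the two actors agree (equal log-probabilities) the objective reduces to the mean advantage; when the new actor drifts, the clipped term caps how much the objective can reward that drift.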
Actor-Critic
how to estimate the value function V(s), e.g. Monte Carlo (MC) or temporal-difference (TD)
version 3.5: use the critic's value V(s) as the baseline b
version 4: "average minus average", Advantage Actor-Critic
Tip for Actor-Critic • The parameters of the actor and the critic can be shared.
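The "average minus average" idea of version 4 can be written as the one-step TD advantage: the expected return after taking the action, r_t + γV(s_{t+1}), minus the expected return before, V(s_t). A small sketch (the function name and the zero terminal value are assumptions):

```python
import numpy as np

def td_advantage(rewards, values, gamma=0.99):
    """Version 4 (Advantage Actor-Critic):
    A_t = r_t + gamma * V(s_{t+1}) - V(s_t),
    i.e. the critic's average after the action minus its average before.
    Assumes `values` covers one episode and V = 0 after the terminal state."""
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    next_values = np.append(values[1:], 0.0)  # V(terminal) = 0
    return rewards + gamma * next_values - values
```

A positive A_t means the action did better than the critic expected from that state, so the actor raises its probability; a negative A_t lowers it.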
DQN, Deep Q Network
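The core of DQN is the regression target the Q-network is trained toward, computed with a separate frozen target network. A minimal sketch of that target (the function name is illustrative, and `next_q_values` stands in for the target network's output on the next state):

```python
import numpy as np

def dqn_target(reward, next_q_values, done, gamma=0.99):
    """DQN regression target: y = r + gamma * max_a' Q_target(s', a').
    `done` is 1.0 at episode end (no bootstrapping past a terminal state),
    0.0 otherwise. The online Q-network is then trained to predict y."""
    return reward + gamma * (1.0 - done) * np.max(next_q_values)
```

In the full algorithm this target is computed over minibatches sampled from a replay buffer, and the target network's weights are only periodically copied from the online network.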
Reward Shaping
Most of the time the reward is 0, so every action looks about equally good, e.g. playing Go or controlling a robot arm.
- Reward shaping: curiosity. Obtaining extra reward when the agent sees something new (but meaningful).
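One simple way to instantiate a curiosity bonus is count-based novelty: the agent gets a shaped reward that shrinks as a state is revisited. This is only one hedged illustration of "extra reward for seeing something new"; the class name and the 1/sqrt(N) form are assumptions, not the specific method in the notes:

```python
from collections import Counter
import math

class CountCuriosity:
    """Count-based curiosity bonus: bonus(s) = beta / sqrt(N(s)),
    where N(s) is how many times state s has been visited.
    Novel states pay a large bonus; familiar states pay almost nothing."""

    def __init__(self, beta=1.0):
        self.beta = beta
        self.counts = Counter()

    def bonus(self, state):
        self.counts[state] += 1
        return self.beta / math.sqrt(self.counts[state])
```

The curiosity bonus is added to the (mostly zero) environment reward, so in sparse-reward tasks like Go or robot-arm control the agent still gets a learning signal from exploration.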
No Reward: Learning from Demonstration
- Motivation • Even defining a reward can be challenging in some tasks. • Hand-crafted rewards can lead to uncontrolled behavior.
- Imitation Learning
- Inverse Reinforcement Learning
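The simplest form of imitation learning is behavior cloning: treat the expert's (state, action) pairs as a supervised dataset and fit a policy to them. A toy sketch with a linear policy fit by least squares (the synthetic data and the linear form are assumptions for illustration):

```python
import numpy as np

# Behavior cloning: supervised learning on expert demonstrations.
# Synthetic expert: a noise-free linear policy action = state @ true_w.
rng = np.random.default_rng(0)
expert_states = rng.normal(size=(100, 4))
true_w = np.array([1.0, -2.0, 0.5, 0.0])
expert_actions = expert_states @ true_w  # expert (state, action) pairs

# Fit the cloned policy by least squares; it then imitates the expert
# with action = state @ w, never querying the reward at all.
w, *_ = np.linalg.lstsq(expert_states, expert_actions, rcond=None)
```

Inverse reinforcement learning goes one step further: instead of copying actions directly, it infers a reward function under which the expert's demonstrations are optimal, then trains an actor against that learned reward.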
Reference: An Overview of Deep Reinforcement Learning (ntu.edu.tw)