RL learning

Reinforcement Learning (1) | chiblog (chierhy.github.io)

Reinforcement Learning (2) | chiblog (chierhy.github.io)

An Overview of Deep Reinforcement Learning (ntu.edu.tw)

RL concepts

  • [two images, not preserved]

policy gradient

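  • version 0: weight each action by its immediate reward; this is short-sighted because of reward delay (an action's payoff may arrive many steps later)

  • version 1: cumulative reward; weight each action by the sum of all rewards from that step onward

  • version 2: discount factor γ < 1; rewards further in the future are scaled down by increasing powers of γ

  • version 3: whether a reward is good or bad is relative, so subtract a baseline b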

  • On-policy: each time you update the model parameters, you need to collect the whole training set again, because the old data was gathered with the previous parameters.

  • Off-policy → Proximal Policy Optimization (PPO)

    • The actor being trained has to know how it differs from the actor that interacts with the environment.

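The step from version 1 to version 3 can be sketched in a few lines of plain Python. This is only an illustration, not code from the lecture: the function name `action_weights`, the constant baseline value, and the toy reward list are all assumptions made for the example.

```python
def action_weights(rewards, gamma=0.99, baseline=0.0):
    """Return one weight A_t per time step for a single episode.

    rewards:  per-step rewards r_1 .. r_N of one episode
    gamma:    discount factor; gamma=1.0 gives the plain cumulative reward (version 1)
    baseline: constant b subtracted from every return (version 3); how to pick b
              is not specified in these notes, so a fixed number is assumed here
    """
    weights = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the return of the step after it:
    # G'_t = r_t + gamma * G'_{t+1}
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        weights[t] = running - baseline
    return weights


# Toy episode: the only reward arrives at the last step (reward delay).
rewards = [0.0, 0.0, 1.0]
version1 = action_weights(rewards, gamma=1.0)                 # cumulative reward
version2 = action_weights(rewards, gamma=0.5)                 # discounted
version3 = action_weights(rewards, gamma=0.5, baseline=0.5)   # minus baseline b
print(version1)  # [1.0, 1.0, 1.0]
print(version2)  # [0.25, 0.5, 1.0]
print(version3)  # [-0.25, 0.0, 0.5]
```

With version 0 the first two actions would get weight 0; versions 1 and 2 give them credit for the later reward, and version 3 makes the earliest action look worse than average.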

Actor-Critic

  • how to estimate [formula image not preserved]

  • version 3.5: [formula image not preserved]

  • version 4: average minus average, Advantage Actor-Critic [formula image not preserved]

  • Tip for Actor-Critic

    • The actor and the critic can share parameters.

  • DQN, Deep Q Network
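The formulas for version 3.5 and version 4 were images and did not survive in these notes, so the sketch below uses the commonly taught forms as an assumption: version 3.5 takes the critic's value V(s_t) as the baseline, and version 4 replaces the sampled return with r_t + γ·V(s_{t+1}). The function names, the `gamma` parameter, and the hand-written critic values are hypothetical.

```python
def advantage_v35(rewards, values, gamma=0.99):
    """Assumed version 3.5: A_t = G'_t - V(s_t).

    G'_t is the discounted return actually sampled in this episode;
    values[t] is the critic's estimate V(s_t).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        advantages[t] = running - values[t]
    return advantages


def advantage_v4(rewards, values, gamma=0.99):
    """Assumed version 4 (Advantage Actor-Critic): A_t = r_t + gamma * V(s_{t+1}) - V(s_t).

    values must have one more entry than rewards: values[-1] is the critic's
    estimate for the state after the last step (0.0 if the episode ended).
    """
    return [
        rewards[t] + gamma * values[t + 1] - values[t]
        for t in range(len(rewards))
    ]


# Two-step toy episode with made-up critic outputs.
rewards = [0.0, 1.0]
values = [0.5, 1.0, 0.0]   # V(s_1), V(s_2), V(terminal)
print(advantage_v35(rewards, values, gamma=1.0))  # [0.5, 0.0]
print(advantage_v4(rewards, values, gamma=1.0))   # [0.5, 0.0]
```

The two versions agree here only because the toy critic happens to predict the remaining return exactly at s_2; in general version 4 depends on a single reward plus the critic, while version 3.5 depends on the whole sampled tail of the episode.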

Reward Shaping

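  • When the reward is sparse (0 most of the time), every action looks about equally good, for example when playing Go or controlling a robot arm.
  • Curiosity as reward shaping: the agent obtains extra reward when it sees something new (but meaningful).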
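A rough way to see the "extra reward for something new" idea is a visit-count bonus. This is a simpler stand-in, not the curiosity method itself: it rewards any unseen state and cannot tell meaningful novelty from noise. The class `CountBonus`, its `scale` parameter, and the state name are hypothetical.

```python
import math
from collections import Counter


class CountBonus:
    """Adds a novelty bonus that shrinks as a state is visited more often."""

    def __init__(self, scale=1.0):
        self.scale = scale
        self.visits = Counter()   # states must be hashable (e.g. strings or tuples)

    def shaped_reward(self, state, reward):
        # Bonus is scale / sqrt(number of visits so far, including this one).
        self.visits[state] += 1
        bonus = self.scale / math.sqrt(self.visits[state])
        return reward + bonus


shaper = CountBonus(scale=1.0)
first = shaper.shaped_reward("room_a", 0.0)       # first visit: full bonus
for _ in range(3):
    again = shaper.shaped_reward("room_a", 0.0)   # visits 2, 3, 4
print(first)  # 1.0
print(again)  # 0.5
```

Even though the environment reward is 0 throughout, the shaped reward is not, so the agent gets a learning signal for exploring.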

No Reward: Learning from Demonstration