强化学习的数学原理 (Mathematical Foundations of Reinforcement Learning) - Notes
P1 Basic Concepts
State: The status of the agent with respect to the environment.
- Grid-world example: the state is the location of the agent (e.g., one of the cells $s_1, \dots, s_9$ in a $3 \times 3$ grid).

State space: the set of all states, $\mathcal{S} = \{s_i\}$.
Action: For each state, there are several possible actions, e.g., $a_1, \dots, a_5$ (move up, right, down, left, or stay still).

Action space of a state: the set of all possible actions at a state, $\mathcal{A}(s) = \{a_i\}$.
State transition: When taking an action, the agent may move from one state to another. It defines the interaction with the environment.
- Tabular representation: can only represent deterministic cases.

- State transition probability: use conditional probabilities $p(s^\prime | s, a)$ to describe state transitions; the transition can be deterministic or stochastic.
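A minimal Python sketch of the two representations (the state names, action names, and probabilities here are illustrative, not from the notes):

```python
import random

# Deterministic transition table: next_state[s][a] is the unique successor.
next_state = {"s1": {"right": "s2", "down": "s4"}}

# Stochastic transitions: p(s' | s, a) stored as a distribution over successors.
transition_prob = {("s1", "right"): {"s2": 0.8, "s4": 0.2}}

def step(s, a):
    """Sample a successor state s' according to p(s' | s, a)."""
    dist = transition_prob[(s, a)]
    return random.choices(list(dist), weights=dist.values())[0]

print(step("s1", "right"))  # "s2" with prob 0.8, "s4" with prob 0.2
```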
Policy: tells the agent what actions to take at a state.
- Intuitive representation: arrows drawn in the grid, one per state, showing which action the policy takes there.

- Mathematical representation: a conditional probability $\pi(a|s)$, with $\sum_{a \in \mathcal{A}(s)} \pi(a|s) = 1$.
- Tabular representation: can represent both deterministic and stochastic policies.
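As a sketch, a stochastic tabular policy can be stored row by row as $\pi(a|s)$ and sampled from (the state and action names are made up for illustration):

```python
import random

# Tabular stochastic policy: pi[s] maps each action a to pi(a | s);
# the probabilities in each row sum to 1.
pi = {
    "s1": {"right": 0.5, "down": 0.5},  # stochastic at s1
    "s2": {"down": 1.0},                # deterministic at s2
}

def sample_action(s):
    """Draw an action a with probability pi(a | s)."""
    row = pi[s]
    return random.choices(list(row), weights=row.values())[0]
```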

Reward: can be interpreted as a human-machine interface: a real number obtained after taking an action.
- A positive reward encourages the agent to take such actions.
- A negative reward punishes the agent for taking such actions.

- Tabular representation: a table with one row per state and one column per action, where each entry is the reward for taking that action in that state; like the state-transition table, it can only represent deterministic cases.

- Mathematical description: a conditional probability $p(r|s, a)$.
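For instance, a sketch of the reward probability as a lookup table (the numbers are invented for illustration):

```python
import random

# Reward probability p(r | s, a): at s1, action a1 yields
# reward -1 with probability 0.9 and reward 0 with probability 0.1.
reward_prob = {("s1", "a1"): {-1: 0.9, 0: 0.1}}

def sample_reward(s, a):
    """Draw a reward r with probability p(r | s, a)."""
    dist = reward_prob[(s, a)]
    return random.choices(list(dist), weights=dist.values())[0]
```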
Trajectory: a state-action-reward chain, e.g., $s_1 \xrightarrow{a_1} r_2,\ s_2 \xrightarrow{a_2} r_3,\ s_3 \xrightarrow{a_3} \cdots$

Return: the sum of all the rewards collected along a specific trajectory. For an infinite trajectory, this plain sum may diverge.

Discounted return: introduce a discount rate $\tau \in \left[0, 1 \right)$ to solve the infinite return problem: discounted return $= r_1 + \tau r_2 + \tau^2 r_3 + \cdots$
- If $\tau$ is close to 0, the value of the discounted return is dominated by the rewards obtained in the near future.
- If $\tau$ is close to 1, the value of the discounted return is dominated by the rewards obtained in the far future.
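A small worked example with an invented reward sequence shows how $\tau$ shifts the weight between near and far rewards:

```python
def discounted_return(rewards, tau):
    """Compute r_1 + tau*r_2 + tau^2*r_3 + ... for a finite reward list."""
    return sum(tau**k * r for k, r in enumerate(rewards))

rewards = [0, 0, 0, 1, 1, 1]            # rewards arrive late in the trajectory
print(discounted_return(rewards, 0.1))  # ~0.0011: far rewards nearly ignored
print(discounted_return(rewards, 0.9))  # ~1.976: far rewards still matter
```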
Episode: the agent may stop at some terminal states. The resulting trajectory is called an episode (trial).
An episode is usually assumed to be a finite trajectory. Tasks with episodes are called episodic tasks, while tasks with no terminal state are called continuing tasks. An episodic task can be treated as a continuing task, e.g., by making the terminal state absorbing: every action keeps the agent there with zero reward.
Markov decision process (MDP)
Sets:
- State: the set of states $\mathcal{S}$
- Action: the set of actions $\mathcal{A}(s)$, associated with each state $s \in \mathcal{S}$
- Reward: the set of rewards $\mathcal{R}(s, a)$
Probability distributions:
- State transition probability: at state $s$ , taking action $a$, the probability to transit to state $s^\prime$ is $p(s^\prime | s, a)$
- Reward probability: at state $s$, taking action $a$, the probability to get reward $r$ is $p(r|s, a)$
- Policy: at state $s$, the probability to choose action $a$ is $\pi(a|s)$
- Markov property: memoryless property, i.e., $p(s_{t+1} | a_{t+1}, s_t, \dots, a_1, s_0) = p(s_{t+1} | a_{t+1}, s_t)$ and $p(r_{t+1} | a_{t+1}, s_t, \dots, a_1, s_0) = p(r_{t+1} | a_{t+1}, s_t)$
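Putting the three distributions together, here is a minimal sketch that samples a trajectory from a hypothetical two-state MDP (all names and numbers are illustrative):

```python
import random

def sample(dist):
    """Draw an outcome from a {outcome: probability} dict."""
    return random.choices(list(dist), weights=dist.values())[0]

# Illustrative MDP: s1 moves to s2 with reward 0; s2 stays at s2 with reward 1.
pi     = {"s1": {"a1": 1.0}, "s2": {"a1": 1.0}}                  # pi(a|s)
p_next = {("s1", "a1"): {"s2": 1.0}, ("s2", "a1"): {"s2": 1.0}}  # p(s'|s,a)
p_rew  = {("s1", "a1"): {0: 1.0}, ("s2", "a1"): {1: 1.0}}        # p(r|s,a)

def rollout(s, steps):
    """Generate a state-action-reward chain of the given length."""
    traj = []
    for _ in range(steps):
        a = sample(pi[s])                # a  ~ pi(. | s)
        r = sample(p_rew[(s, a)])        # r  ~ p(. | s, a)
        s_next = sample(p_next[(s, a)])  # s' ~ p(. | s, a)
        traj.append((s, a, r))
        s = s_next
    return traj

print(rollout("s1", 3))  # [('s1', 'a1', 0), ('s2', 'a1', 1), ('s2', 'a1', 1)]
```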
P2 Bellman Equation
Motivation
The return reflects the efficacy of a policy: starting from the same state, the policy that yields a greater return is better. In the grid-world example, policy 1 is the best and policy 2 is the worst.

State Value
Consider the following single-step process: $S_t \xrightarrow{A_t} R_{t+1},\ S_{t+1}$
This step is governed by the following probability distributions:
- $S_t \to A_t$ is governed by the policy $\pi(A_t = a | S_t = s)$
- $S_t, A_t \to R_{t+1}$ is governed by the reward probability $p(R_{t+1} = r | S_t = s, A_t = a)$
- $S_t, A_t \to S_{t+1}$ is governed by the state transition probability $p(S_{t+1} = s^\prime | S_t = s, A_t = a)$
Consider the following multi-step trajectory: $S_t \xrightarrow{A_t} R_{t+1},\ S_{t+1} \xrightarrow{A_{t+1}} R_{t+2},\ S_{t+2} \xrightarrow{A_{t+2}} R_{t+3},\ \cdots$
The discounted return: $G_t = R_{t+1} + \tau R_{t+2} + \tau^2 R_{t+3} + \cdots$, where $\tau \in \left[0, 1\right)$ is the discount rate.
State value: the expectation of $G_t$:
$$v_\pi(s) = \mathbb{E}\left[G_t | S_t = s\right]$$
- It is a function of $s$: a conditional expectation with the condition that the state starts from $s$.
- It is based on the policy $\pi$: for a different policy, the state value may be different.
- It represents the “value” of a state: the greater the state value, the better the policy.
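To make the definition concrete, $v_\pi(s)$ can be estimated by Monte Carlo: average the discounted returns of many trajectories starting from $s$. A sketch, where `sample_rewards` is any environment-specific sampler (e.g., built from the hypothetical `rollout` above); truncating the infinite sum at a finite horizon is an approximation:

```python
def mc_state_value(sample_rewards, s, tau=0.9, episodes=1000):
    """Monte Carlo estimate of v_pi(s) = E[G_t | S_t = s].
    sample_rewards(s) must return one (truncated) reward sequence from s."""
    total = 0.0
    for _ in range(episodes):
        rewards = sample_rewards(s)
        total += sum(tau**k * r for k, r in enumerate(rewards))
    return total / episodes

# Usage with the MDP sketch above:
# v_s1 = mc_state_value(lambda s: [r for (_, _, r) in rollout(s, 100)], "s1")
```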
Bellman equation
Consider a random trajectory: $S_t \xrightarrow{A_t} R_{t+1},\ S_{t+1} \xrightarrow{A_{t+1}} R_{t+2},\ S_{t+2} \xrightarrow{A_{t+2}} R_{t+3},\ \cdots$
The return can be written as:
$$G_t = R_{t+1} + \tau R_{t+2} + \tau^2 R_{t+3} + \cdots = R_{t+1} + \tau \left(R_{t+2} + \tau R_{t+3} + \cdots\right) = R_{t+1} + \tau G_{t+1}$$
Then:
$$v_\pi(s) = \mathbb{E}\left[G_t | S_t = s\right] = \mathbb{E}\left[R_{t+1} | S_t = s\right] + \tau\, \mathbb{E}\left[G_{t+1} | S_t = s\right]$$
For the first term $\mathbb{E}\left[R_{t+1} | S_t = s\right]$ (the expected immediate reward):
$$\mathbb{E}\left[R_{t+1} | S_t = s\right] = \sum_a \pi(a|s) \sum_r p(r|s, a)\, r$$
For the second term $\mathbb{E}\left[G_{t+1} | S_t = s\right]$ (the expected future return; by the Markov property, $G_{t+1}$ depends only on $S_{t+1}$):
$$\mathbb{E}\left[G_{t+1} | S_t = s\right] = \sum_{s^\prime} \mathbb{E}\left[G_{t+1} | S_{t+1} = s^\prime\right] p(s^\prime|s) = \sum_{s^\prime} v_\pi(s^\prime) \sum_a p(s^\prime|s, a)\, \pi(a|s)$$
Therefore:
$$v_\pi(s) = \sum_a \pi(a|s) \left[\sum_r p(r|s, a)\, r + \tau \sum_{s^\prime} p(s^\prime|s, a)\, v_\pi(s^\prime)\right], \quad \forall s \in \mathcal{S}$$
This is the Bellman equation: it holds for every state and relates the value of a state to the values of its successor states.
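As a sanity check, for the hypothetical two-state MDP sketched earlier ($s_1$ moves to $s_2$ with reward 0; $s_2$ stays at $s_2$ with reward 1), the Bellman equation reduces to two scalar equations that can be solved in closed form or by fixed-point iteration:

```python
# Bellman equations for the two-state example (deterministic policy):
#   v(s1) = 0 + tau * v(s2)
#   v(s2) = 1 + tau * v(s2)
tau = 0.9
v_s2 = 1 / (1 - tau)  # closed form from the second equation: 10.0
v_s1 = tau * v_s2     # 9.0

# Fixed-point iteration on the same equations converges to these values.
v = {"s1": 0.0, "s2": 0.0}
for _ in range(500):
    v = {"s1": tau * v["s2"], "s2": 1 + tau * v["s2"]}
print(v_s1, v_s2, v)  # 9.0, 10.0, and the iterates agree
```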