P1 Basic Concepts


State: The status of the agent with respect to the environment.

  • grid-world example: the location of the agent, denoted $s_i$.


State space: the set of all states, $\mathcal{S} = \{s_i\}$.

Action: for each state, there are several possible actions, denoted $a_j$.


Action space of a state: the set of all possible actions at a state, $\mathcal{A}(s_i) = \{a_j\}$. Different states can have different action spaces.

State transition: When taking an action, the agent may move from one state to another. It defines the interaction with the environment.

  1. Tabular representation: can only represent deterministic cases.


  2. State transition probability: uses the conditional probability $p(s^\prime \mid s, a)$ to describe the state transition. This can be deterministic or stochastic.

Policy: tells the agent what actions to take at a state.

  1. Intuitive representation:


  2. Mathematical representation: the conditional probability $\pi(a \mid s)$, the probability of taking action $a$ at state $s$.
  3. Tabular representation: can represent both deterministic and stochastic policies.

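As a sketch, a stochastic tabular policy can be stored as a nested mapping from states to action probabilities; the state and action names below are made up for illustration.

```python
import random

# Hypothetical tabular policy pi(a|s): for each state, a distribution
# over the actions available there. The entries must sum to 1 per state.
policy = {
    "s1": {"right": 0.5, "down": 0.5},   # stochastic at s1
    "s2": {"down": 1.0},                 # deterministic at s2
}

def sample_action(pi, state, rng=random):
    """Draw an action a ~ pi(a|s)."""
    actions = list(pi[state])
    weights = [pi[state][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]
```

A deterministic policy is just the special case where one action has probability 1 at every state.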

Reward: a real number the agent gets after taking an action. It can be interpreted as a human-machine interface: a way for us to guide the agent toward the desired behavior.

  • A positive reward encourages the agent to take such actions.
  • A negative reward discourages (punishes) such actions.


  1. Tabular representation:


  2. Mathematical description: the conditional probability $p(r \mid s, a)$, the probability of getting reward $r$ after taking action $a$ at state $s$.

Trajectory: a state-action-reward chain, e.g., $s_1 \xrightarrow{a_1} s_2 \xrightarrow{a_2} s_3 \xrightarrow{a_3} \cdots$, together with the reward received at each step.


Return: the sum of all the rewards collected along a specific trajectory, e.g., $\text{return} = r_1 + r_2 + r_3 + \cdots$


Discounted return: introduce a discount rate $\tau \in \left[0, 1 \right)$ to solve the infinite-return problem: $\text{discounted return} = r_1 + \tau r_2 + \tau^2 r_3 + \cdots$

  • If $\tau$ is close to 0, the value of the discounted return is dominated by the rewards obtained in the near future.
  • If $\tau$ is close to 1, the value of the discounted return is dominated by the rewards obtained in the far future.
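The discounted return of a reward sequence can be computed by backward accumulation; a minimal sketch (the rewards and the value of `tau` are arbitrary):

```python
def discounted_return(rewards, tau):
    """Return r_1 + tau*r_2 + tau^2*r_3 + ... for a finite reward list."""
    g = 0.0
    for r in reversed(rewards):   # Horner-style: g <- r + tau * g
        g = r + tau * g
    return g

# With a constant reward of 1 and tau = 0.9, the infinite discounted return
# converges to 1 / (1 - 0.9) = 10; a long finite sum approaches that value.
```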

Episode: when following a policy, the agent may stop at some terminal state. The resulting trajectory is called an episode (or a trial).

An episode is usually assumed to be a finite trajectory; tasks with episodes are called episodic tasks. An episodic task can be converted into a continuing task (one with no terminal state), for example by treating the terminal state as an absorbing state.

Markov decision process (MDP)

  • Sets:

    • State: the set of states $\mathcal{S}$
    • Action: the set of actions $\mathcal{A}(s)$, associated with each state $s \in \mathcal{S}$
    • Reward: the set of rewards $\mathcal{R}(s, a)$
  • Probability distribution

    • State transition probability: at state $s$ , taking action $a$, the probability to transit to state $s^\prime$ is $p(s^\prime | s, a)$
    • Reward probability: at state $s$, taking action $a$, the probability to get reward $r$ is $p(r|s, a)$
  • Policy: at state $s$, the probability to choose action a is $\pi(a|s)$
  • Markov property: memoryless property, i.e., $p(s_{t+1} \mid a_{t+1}, s_t, \dots, a_1, s_0) = p(s_{t+1} \mid a_{t+1}, s_t)$ and $p(r_{t+1} \mid a_{t+1}, s_t, \dots, a_1, s_0) = p(r_{t+1} \mid a_{t+1}, s_t)$
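The two probability distributions above can be represented jointly in code. A minimal sketch with made-up states, actions, and rewards, where each `(s, a)` pair maps to a list of `(probability, next_state, reward)` outcomes:

```python
import random

# Joint model of p(s'|s,a) and p(r|s,a) for a tiny hypothetical MDP.
model = {
    ("s1", "right"): [(1.0, "s2", 0.0)],
    ("s1", "down"):  [(0.8, "s3", 0.0), (0.2, "s1", -1.0)],  # stochastic
    ("s2", "down"):  [(1.0, "s3", 1.0)],
}

def step(p, state, action, rng=random):
    """Sample (s', r) given (s, a), as described by the model."""
    outcomes = p[(state, action)]
    probs = [prob for prob, _, _ in outcomes]
    _, s_next, reward = rng.choices(outcomes, weights=probs, k=1)[0]
    return s_next, reward
```

The Markov property is built in: the outcome distribution depends only on the current `(state, action)` pair, not on any earlier history.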

P2 Bellman Equation


Motivation

The return reflects the efficacy of a policy: comparing the returns of the policies below, policy 1 is the best and policy 2 is the worst.


State Value

Consider the following single-step process:

$$S_t \xrightarrow{A_t} R_{t+1},\ S_{t+1}$$

This step is governed by the following probability distributions:

  • $S_t \to A_t$ is governed by the policy $\pi(A_t = a \mid S_t = s)$
  • $S_t, A_t \to R_{t+1}$ is governed by the reward probability $p(R_{t+1} = r \mid S_t = s, A_t = a)$
  • $S_t, A_t \to S_{t+1}$ is governed by the state transition probability $p(S_{t+1} = s^\prime \mid S_t = s, A_t = a)$

Consider the following multi-step trajectory:

$$S_t \xrightarrow{A_t} R_{t+1},\ S_{t+1} \xrightarrow{A_{t+1}} R_{t+2},\ S_{t+2} \xrightarrow{A_{t+2}} R_{t+3},\ \cdots$$

The discounted return is $G_t = R_{t+1} + \tau R_{t+2} + \tau^2 R_{t+3} + \cdots$, where $\tau \in \left[0, 1\right)$ is the discount rate.

State value: the expectation of $G_t$:

$$v_\pi(s) = \mathbb{E}\left[G_t \mid S_t = s\right]$$

  • It is a function of $s$. It is a conditional expectation with the condition that the state starts from $s$.
  • It is based on the policy $\pi$. For a different policy, the state value may be different.
  • It represents the “value” of a state: if the state value is greater, the policy is better.
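Since the state value is an expectation, it can be estimated by averaging the discounted returns of many trajectories started from a state. A minimal sketch on a made-up two-state chain (state `"A"` moves to `"B"` with reward 0; `"B"` is absorbing with reward 1 per step); everything here is deterministic, so the average is exact:

```python
def rollout_return(start, tau, horizon=200):
    """Discounted return of one (truncated) trajectory from `start`."""
    s, g, discount = start, 0.0, 1.0
    for _ in range(horizon):
        s, r = ("B", 0.0) if s == "A" else ("B", 1.0)  # deterministic dynamics
        g += discount * r
        discount *= tau
    return g

def state_value(start, tau, episodes=100):
    """Monte Carlo estimate of v(start): the mean return over episodes."""
    return sum(rollout_return(start, tau) for _ in range(episodes)) / episodes

# v("B") approaches 1/(1-tau) and v("A") approaches tau/(1-tau):
# different states have different values, as the definition says.
```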

Bellman equation

Consider a random trajectory:

$$S_t \xrightarrow{A_t} R_{t+1},\ S_{t+1} \xrightarrow{A_{t+1}} R_{t+2},\ S_{t+2} \xrightarrow{A_{t+2}} R_{t+3},\ \cdots$$

The return can be written as:

$$G_t = R_{t+1} + \tau R_{t+2} + \tau^2 R_{t+3} + \cdots = R_{t+1} + \tau G_{t+1}$$

Then:

$$v_\pi(s) = \mathbb{E}\left[G_t \mid S_t = s\right] = \mathbb{E}\left[R_{t+1} \mid S_t = s\right] + \tau\, \mathbb{E}\left[G_{t+1} \mid S_t = s\right]$$

For the first term $\mathbb{E}\left[R_{t+1} \mid S_t = s\right]$:

$$\mathbb{E}\left[R_{t+1} \mid S_t = s\right] = \sum_a \pi(a \mid s) \sum_r p(r \mid s, a)\, r$$

For the second term $\mathbb{E}\left[G_{t+1} \mid S_t = s\right]$, using the Markov property ($\mathbb{E}\left[G_{t+1} \mid S_{t+1} = s^\prime, S_t = s\right] = \mathbb{E}\left[G_{t+1} \mid S_{t+1} = s^\prime\right]$):

$$\mathbb{E}\left[G_{t+1} \mid S_t = s\right] = \sum_{s^\prime} \mathbb{E}\left[G_{t+1} \mid S_{t+1} = s^\prime\right] p(s^\prime \mid s) = \sum_{s^\prime} v_\pi(s^\prime) \sum_a p(s^\prime \mid s, a)\, \pi(a \mid s)$$

Therefore:

$$v_\pi(s) = \sum_a \pi(a \mid s) \left[ \sum_r p(r \mid s, a)\, r + \tau \sum_{s^\prime} p(s^\prime \mid s, a)\, v_\pi(s^\prime) \right], \quad \forall s \in \mathcal{S}$$

This is the Bellman equation: it relates the value of a state to the values of the states it can transition to, and it holds for every state simultaneously.
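One way to solve the Bellman equation numerically is fixed-point iteration: repeatedly apply the right-hand side until the values converge (which they do for $\tau < 1$, since the update is a contraction). A minimal sketch on a made-up two-state problem, where the policy has already been averaged out into an expected immediate reward `r_pi[s]` and transition probabilities `P_pi[s][s']`:

```python
# r_pi[s]      = sum_a pi(a|s) sum_r p(r|s,a) * r        (expected reward)
# P_pi[s][s'] = sum_a pi(a|s) p(s'|s,a)                  (policy-averaged transitions)
r_pi = {"s1": 0.0, "s2": 1.0}
P_pi = {"s1": {"s2": 1.0}, "s2": {"s2": 1.0}}  # s2 is absorbing
tau = 0.5

v = {s: 0.0 for s in r_pi}
for _ in range(100):  # contraction mapping: converges since tau < 1
    v = {s: r_pi[s] + tau * sum(p * v[sp] for sp, p in P_pi[s].items())
         for s in r_pi}

# Exact solution: v(s2) = 1/(1-tau) = 2 and v(s1) = tau * v(s2) = 1,
# which the iteration approaches geometrically.
```

Later lectures solve the same equation in closed form by writing it as the matrix-vector equation and inverting; the iteration above is the model-based analogue that scales to larger state spaces.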