In the Reinforcement Learning (RL) paradigm, an agent takes an action in a given state and receives a reward. Options in Reinforcement Learning are temporally extended actions (macro-actions). The options framework (Precup, 2000; Sutton, Precup & Singh, 1999) provides a natural way of incorporating such extended actions into a Reinforcement Learning system. It shows that options can be used interchangeably with primitive actions in the planning and learning methods of RL. In RL the problem is defined as a Markov Decision Process (MDP), and a set of options defined over this MDP constitutes a Semi-Markov Decision Process (SMDP).

##### Markov Decision Process Framework:

A Reinforcement Learning problem is modeled as a Markov Decision Process, which comprises a state space $S$ and an action space $A$. The agent interacts with the environment in discrete time steps $t = 0, 1, 2, 3, \dots$; at each step it takes an action $a \in A$ in a state $s \in S$ and receives a reward $r$. For finite sets $S$ and $A$, the transition dynamics of the environment can be modeled as: $P^a_{ss'} = Pr[s_{t+1} = s' | s_t = s, a_t = a ]$

and the one-step expected reward is given by: $r_s^a = E[r_{t+1} | s_t = s, a_t = a ]$
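To make the two quantities concrete, here is a minimal sketch of a tabular MDP in Python. The two states, the two actions, and every probability and reward value are made up for illustration; `P`, `R`, and `step` are hypothetical names, not something defined in the paper.

```python
import random

# Hypothetical two-state, two-action MDP used only for illustration.
STATES = ["s0", "s1"]
ACTIONS = ["stay", "move"]

# P[(s, a)] maps each next state s' to Pr[s_{t+1} = s' | s_t = s, a_t = a].
P = {
    ("s0", "stay"): {"s0": 0.9, "s1": 0.1},
    ("s0", "move"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s0": 0.0, "s1": 1.0},
    ("s1", "move"): {"s0": 0.7, "s1": 0.3},
}

# R[(s, a)] is the one-step expected reward E[r_{t+1} | s_t = s, a_t = a].
R = {
    ("s0", "stay"): 0.0,
    ("s0", "move"): -1.0,
    ("s1", "stay"): 1.0,
    ("s1", "move"): -1.0,
}


def step(state, action, rng=random):
    """Sample the next state from P and return it with the expected reward."""
    dist = P[(state, action)]
    next_state = rng.choices(list(dist.keys()), weights=list(dist.values()), k=1)[0]
    return next_state, R[(state, action)]
```

Each row of `P` sums to 1, so it is a valid distribution over next states. For simplicity `step` returns the expected reward rather than sampling a random one.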

The agent's aim is to learn a Markov policy that maximises the expected cumulative future reward from each state. This can be understood mathematically through the Bellman equation (details of which can be found in the paper).
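For reference, the Bellman equation can be written with the transition probabilities $P^a_{ss'}$ and one-step rewards $r_s^a$ defined above. The discount factor $\gamma \in [0,1)$ is an assumption introduced here, since it is not defined in this post. For a policy $\pi(s,a)$ giving the probability of taking action $a$ in state $s$, the value of a state satisfies:

$$V^\pi(s) = \sum_{a \in A} \pi(s,a) \left[ r_s^a + \gamma \sum_{s' \in S} P^a_{ss'} V^\pi(s') \right]$$

and the optimal value function satisfies:

$$V^*(s) = \max_{a \in A} \left[ r_s^a + \gamma \sum_{s' \in S} P^a_{ss'} V^*(s') \right]$$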

##### Options:

Options, i.e. extended actions, are described by three components:

1. Option policy $\pi: S \times A \rightarrow [0,1]$: the probability of choosing an action in a state under the option's policy (it maps each state-action pair to a probability between 0 and 1).
2. Termination condition $\beta: S^+ \rightarrow [0,1]$: the probability that the option terminates in a particular state.
3. Initiation set $I \subseteq S$: the set of all states in which the option can begin.
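The three components map naturally onto a small data structure. The sketch below is one possible representation, not the paper's own. The class name `Option`, its field names, and the five-state corridor used in the example (states `0` to `4`, action `"right"`) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, Hashable

State = Hashable
Action = Hashable


@dataclass(frozen=True)
class Option:
    # Option policy pi: policy(s) returns {action: probability} for state s.
    policy: Callable[[State], Dict[Action, float]]
    # Termination condition beta: termination(s) is a probability in [0, 1].
    termination: Callable[[State], float]
    # Initiation set I: the states in which the option may begin.
    initiation_set: FrozenSet[State]

    def can_start(self, state: State) -> bool:
        """Return True if the option is available in the given state."""
        return state in self.initiation_set


# Hypothetical example: in a corridor with states 0..4, keep moving right
# and terminate with certainty once state 4 is reached.
go_right = Option(
    policy=lambda s: {"right": 1.0},
    termination=lambda s: 1.0 if s == 4 else 0.0,
    initiation_set=frozenset({0, 1, 2, 3}),
)
```

Here `go_right.can_start(4)` is false because state 4 is outside the initiation set, which illustrates how $I$ limits where an option is available.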

A Markov option executes as follows: the next action $a_t$ is selected according to the option policy, the state transitions to $s_{t+1}$, and there the option either terminates with probability $\beta(s_{t+1})$ or continues, choosing the next action according to the option policy. When the option terminates, the agent chooses a new option in the current state $s_k$, which must lie in the initiation set of the new option.
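The execution loop described above can be sketched as a short Python function. This is an illustrative sketch under several assumptions: `run_option`, `env_step`, and `corridor_step` are hypothetical names, rewards are summed without discounting, and `max_steps` is a safety cap added here, not part of the framework.

```python
import random


def run_option(state, policy, beta, initiation_set, env_step,
               rng=random, max_steps=1000):
    """Run one option until it terminates.

    policy(s) returns {action: probability}; beta(s) is the termination
    probability in s; env_step(s, a) returns (next_state, reward).
    Returns (final_state, total_reward, number_of_steps).
    """
    if state not in initiation_set:
        raise ValueError("option cannot be initiated in this state")
    total_reward = 0.0
    steps = 0
    while steps < max_steps:
        # Select a_t according to the option policy.
        dist = policy(state)
        action = rng.choices(list(dist.keys()), weights=list(dist.values()), k=1)[0]
        # The environment transitions to s_{t+1}.
        state, reward = env_step(state, action)
        total_reward += reward
        steps += 1
        # Terminate in s_{t+1} with probability beta(s_{t+1}).
        if rng.random() < beta(state):
            break
    return state, total_reward, steps


def corridor_step(state, action):
    """Hypothetical deterministic corridor with states 0..4 and reward -1 per step."""
    next_state = min(state + 1, 4) if action == "right" else max(state - 1, 0)
    return next_state, -1.0


final_state, total_reward, steps = run_option(
    0,
    policy=lambda s: {"right": 1.0},
    beta=lambda s: 1.0 if s == 4 else 0.0,
    initiation_set={0, 1, 2, 3},
    env_step=corridor_step,
)
```

Because the example policy and termination condition are deterministic, the option started in state 0 walks right for four steps and stops in state 4. After it stops, the agent would pick a new option whose initiation set contains the final state.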

An important thing to notice is that the initiation set and termination condition restrict the range of applicability of an option: the option policy only has to be defined over $I$ and not over the complete state space $S$.

From the above description it can be seen that the termination of an option depends entirely on the termination condition $\beta$.

MORE COMING SOON  (To be continued…) 