#### Introduction

For an intelligent autonomous agents, not only performing a task efficiently and accurately is important but also adapting to new situation and task is necessary. In order to achieve this agent should have proper representation of its knowledge which it acquired while performing previous tasks and it should be able to selectively transfer this knowledge when performing a new task. These are two aims of Life long learning.

Lifelong Learning is the continued learning of tasks, from one or more domains, over the course of a lifetime, by a lifelong learning system. A lifelong learning system efficiently and effectively (1) retains the knowledge it has learned; (2) selectively transfers knowledge to learn new tasks; and (3) ensures the effective and efficient interaction between (1) and (2)(Silver, Yang, and Li 2013)

One way of representing and transferring the knowledge is via abstract representation of task and planning over this representation.

Aim of this document is to list few of this abstract representation and planning methods on them and study its applicability to Robotics.

#### Broad Pipeline

The document describes block 2 and 3 of the pipeline.

#### Related Work

##### Temporally Extended Actions: Options

Learning a task with very long horizon can require many parameters and make learning difficult. It will also have high sample complexity, meaning it will require either very good exploration of state-space or many sample trajectories. Option framework (Precup et.al) reduces this requirement by breaking down one complex policy in many simple policies.

Formally, Option is defined by three tuple <>, where I is initiation set, where option can begin. It can also be represented as indicator function where = 1 if option can begin in state s, is option policy and is termination condition which gives probability of option terminating in a particular state.

In problems with high dimensions defining options manually is difficult. In such cases its important to automatically discover useful options. This can be done broadly in two ways, by exploring the state space or by imitation learning. Since exploration can be computationally expensive, second method is explored in details below.

Goal of agent in imitation learning scenario is to find a policy which maximizes the log-likelihood assigned to trajectories which comes from expert.

In case of options, the primitive actions will come from option policy and options will be picked from set of options by a meta-control policy . This is done is following manner. An option is picked in the initial state and action is chosen according to option policy and state is changed from to where option can terminate with probability , if the option terminates new option is chosen according to and so on.

This line of work is implemented using Expectation-gradient algorithm described in paper: Multi-level discovery of Deep Options. (Roy Fox et.al)

###### Understanding Expectation-gradient in details (paper link)

If is parameter used to approximate option policy, termination function of option and policy over options then we want to find that maximizes the log-likelihood assigned to given set of trajectories obtained by expert.

Consider a trajectory then the hierarchical generative model in the paper is given by,

which simplifies to

where is the current parameter taken as fixed outside the gradient.

If we expand by definition we find that log-likelihood gradient can be computed as sum of log probability gradients of various parametrized networks (option policy, termination condition, policy over options)

Intuitively, the algorithm attempts to jointly optimize three objectives (Directly from paper)

- In the E-step, to probabilistically infer the option boundaries where b “ 1 appears likely in v — this segments the trajectory into regimes where we expect h to change and employ different control law.
- In the E-step, to infer the option selection after a switch, given by w, and in the G-step to reduce the cross-entropy loss between that distribution, weighted by the probability of a switch, and the meta-control policy.
- In the G-step, to reduce the cross-entropy loss between the empirical action distribution, weighted by the probability for h, and the control policy for h.

###### Few Thoughts

- Since there are many different parameters to be inferred, sample complexity will be high as well as training using above method can be difficult.
- The above method is tested for discrete state and action space but the paper mentions that it can also be applied to continuous case by using hybrid networks (effectiveness to be verified)
- How many options should be considered for a given task is calculated empirically.

##### Slightly Modified form of Option: Universal Option Model

In the above described option model we saw that options are discovered according to single pre-specified reward function, which agent is trying to maximize. But there are many situations with multiple reward function which can be specified at any time. In such setting above option model will have have to start from scratch which is inefficient. *Universal Option Model* considers this problem, of learning models of options for real-time abstract planning in which reward function can be specified at any time.

###### Universal Function Approximators

To be written…

###### Few Thoughts (Directly from paper)

- It is very important to note the the distinction between task and goals. Task may have different MDP dynamics, whereas goal only changes the reward function but not the transition structure of an MDP. (Methods which generalize across tasks rather than goals will be discussed below.)
- Even though we know the description above, Its still unclear intuitively as well as mathematically proper distinction between goals, sub-goals, task! (To be investigated)

##### Options with Continuous State and Action: Skills

The frame work described above works with task having discrete state and action set. In robotics, most task have continuous state and action set. One way to use options in robotics is to discretized the state and action space, but this results is very high complexity and in some cases discretization is infeasible. Framework proposed by George Konidaris (et. al) deals with skill discovery and planning over skills in continuous state-action domain.

Formally skill is defined in similar way as options, but since skills work in continuous domain where one state may not appear twice, the ‘sets’ in above definition are replaced by target regions and skill discovery means finding the region in state space where the skill can initiate, terminate and skill policy over the area of interest.

In continuous domains computational complexity increases even more. There fore skill discovery via exploration can be computationally intensive. Therefore it can also be implemented using expert’s demonstration.

Intuitively speaking, useful skills will be those which are near the path to goal. Therefore it’ll be good idea first discover skill which definitely achieves the goal, meaning its termination condition lies in goal region. Then finding a skill whose termination condition lies in initiation region of previously discovered skill and so on. This process of discovery is called Skill-Chaining. In the process described above, finding initiation set is difficult task, as this is region from there target will be reached. This can be treated as a classification problem by collecting labeled example of states from which the target is reached. Also there can be multiple regions in state space from where the target region can be achieved, this is called Skill-trees

Important question to answer in such formulation is, how many skills should be formed and how this segmentation will be performed? Intuitively, a new skill is created when value function becomes too complex to be represented by single skill. This segmentation is performed using change-point detection method. A change point detection problem is where given a data set and candidate model set and assuming data is segmented such that, data from one segment comes from single model finding these segmentation points and how many such points exists. Change point is where the data changes the model.

This change point detection can be used on sampled expert trajectories to find skill segments. Even though we studied the skill discovery using skill chain, order in which is skills are discovered is not according to the chain and requires further planning to use them to achieve desired goal.

###### Understanding change-point detection in details

Since essence of skill is that it is specific some part of state space and it is task oriented it can be represented using task-relevant state space rather than complete state space. This representation is call *Abstraction*. An abstraction M is defined as pair of functions where is *state abstraction*, mapping over all state-space to smaller state space and is *motor abstraction*, mapping full action space to smaller action space. It is assumed that each abstraction has a basis function, , defined over which can be used to define the value function.

In change point detection, A set, Q, of models is given with prior p(q) for each . The whole process then can be seen as hidden Markov model with models as latent variables and data points as observed data. Data of segment length *l* is seen for , marginal probability of a segment length l is modeled with probability mass function g(l) and cumulative distribution function G(l) = . Transition between model (change point) can be understood by diagram below from G. Konidaris thesis.

Transitions occur when the model changes to a new model. The transition from model to occurs with probability g(j−i−1)p(qj), while the emission probability for observed data , …, y_{j−1} is P(i, j,)(1 − G(j − i − 1)).

In segmenting the demonstration trajectory using change-point, basis function associated with abstraction of skill work as models and sample return, from state at each time t as target variable. This effectively performs changepoint detection on the value function sample obtained from the trajectory.

###### Skill learning via exploration and deep networks

As mentioned earlier, learning skill from scratch by exploration of continuous state action space can be difficult and inefficient but there are few interesting methods that work towards mitigating this problem and learn skills via exploration.

###### 1. Skill discovery using Intrinsic Motivation

To be written…

###### 2. Skill discovery using Stochastic Neural Networks

To be written…

###### Few Thoughts

- Skill discovery using skill chaining is specific to a particular goal, meaning it will discover only those skill which will take agent to goal, generalizing this would be an interesting problem. One approach will be use things like Universal Value Function Approximators or Universal option model.

##### Slightly modified form of Skill: Parametrized Skills

To be written…

##### Abstract Markov Decision Process

The frameworks described above are bottom up construction of hierarchy. Meaning, the hierarchy is constructed from primitive action or state to abstract action or state (in above case extended action). Abstract Markov Decision process is top down hierarchy of MDPs. Having top down hierarchy comes with several advantages over previous methods like, its not necessary to plan for complete environment MDP but its possible to plan for only those MDP which help achieve the goal. If the goal of the task changes, its not necessary to change the reward function at each level of hierarchy. Also, the above described frameworks work in model-free RL setting while AMDPs can work in model-based RL setting.

Formally, AMDP is defined as 6 tuple where is state space, is action space consisting not only the primitive actions but also subgoals to be achieved, is transition model, is reward, are absorbing states or goal states after which transitions are not allowed and F is *state projection function* that maps states from environment MDP to abstract MDP states.

###### Top-down Planning with AMDPs

As mentioned earlier one key property of AMDPs is that planning can be done over AMDPs in top-down manner where first the policy for the root AMDP is computed then policies for subgoals chosen by root policy are computed recursively. Consequently, agent never have to find policy for subgoals which were not chosen by root policy. For such top-down approach transition model and reward function for each AMDP is required. Important paragraph from paper –> Important thing to remember in such approach is that, When a parent AMDP invokes a lower-level node, it expects that particular subgoal to be achievable; there is currently no mechanism for backtracking. This is potentially problematic if the lower-level AMDP’s failure conditions are not expressible in the higher-level state abstraction, since the parent would then have no means to reason about the failure. We currently avoid this problem by ensuring that termination sets include all failure conditions, and designing higher-level state spaces that are capable of representing such failures. However, if the abstractions are themselves imperfect (e.g., learned), then we would likely need to address the problem of failure recovery.

### References

- Richard S. Sutton, Doina Precup, Satinder Singh: Between MDPs and semi-MDPs:A framework for temporal abstraction in reinforcement learning.
- PhD Thesis, George Konidaris: AUTONOMOUS ROBOT SKILL ACQUISITION
- George Konidaris, Scott Kuindersma, Roderic Grupen, Andrew Barto: CST: Constructing Skill Trees by Demonstration.
- Fox, Roy; Krishnan, Sanjay; Stoica, Ion; Goldberg, Ken: Multi-Level Discovery of Deep Options
- Chen Tessler, Shahar Givony, Tom Zahavy, Daniel J. Mankowitz, Shie Mannor: A Deep Hierarchical Approach to Lifelong Learning in Minecraft, arXiv:1604.07255
- Tom Schaul, Dan Horgan, Karol Gregor, David Silver: Universal Value Function Approximators.
- Carlos Florensa, Yan Duan, Pieter Abbeel: STOCHASTIC NEURAL NETWORKS FOR HIERARCHICAL REINFORCEMENT LEARNING
- Tejas D. Kulkarni, Karthik R. Narasimhan, Ardavan Saeedi, Joshua B. Tenenbaum: Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation, arXiv:1604.06057
- Nakul Gopalan, Marie desJardins, Michael L. Littman, James MacGlashan,Shawn Squire,Stefanie Tellex,John Winder,Lawson L.S. Wong: Planning with Abstract Markov Decision Processes
- Bruno Castro da Silva, George Konidaris, Andrew G. Barto: Learning Parameterized Skills
- Hengshuai Yao, Csaba Szepesvari, Rich Sutton, Joseph Modayil, Shalabh Bhatnagar: Universal Option Models.