Apprenticeship Learning via Inverse Reinforcement Learning

Here we will look at the inverse reinforcement learning approach to apprenticeship learning. This approach is attractive because a key problem in learning an MDP's parameters is exploration: how can we ensure that all relevant parts of the MDP are visited sufficiently often? Learning from an expert's demonstrations largely sidesteps this problem. Pieter Abbeel and Andrew Y. Ng of Stanford University approached apprenticeship learning via inverse reinforcement learning.

Abbeel & Ng believed that the MDP formalism is useful for many problems because it is often easier to specify the reward function than to directly specify the value function, and/or the optimal policy. For example, we look back at Abbeel & Ng’s driving scenario; When driving, we typically trade off many different desired things, such as maintaining safe following distance, keeping away from the curb, staying far from any pedestrians, maintaining a reasonable speed, perhaps a slight preference for driving in the middle lane, not changing lane too often, and so on. To specify a reward function for the driving task, they would have to assign a set of weights stating exactly how they would like to trade off these different factors. For this Abbeel & Ng develops an algorithm using quadratic programming, which requires a QP solver, that uses an inverse reinforcement learning algorithm to compute the optimal policy for the MDP. Abbeel & Ng noted that although they called one stop of their quadratic algorithm an inverse reinforcement learning step, their algorithm does not necessarily recover the underlying reward function correctly.

For their experiment, Abbeel & Ng returned to the car driving simulation and applied apprenticeship learning to try to learn different driving styles. The car has five actions: three steer it into one of the three lanes, and two cause it to drive off the road, but parallel to it, on either the left or the right side. Here Abbeel & Ng were not worried about whether the car drives off the road; they wanted to demonstrate a variety of driving styles, including some corresponding to highly unsafe driving, to see whether the algorithm could mimic the same “style” in every instance. In every instance, the algorithm was qualitatively able to mimic the demonstrated driving style, and for each of the five driving styles tested, the feature expectations of the learned controller closely matched those of the expert.
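The quantities being compared here are feature expectations: the expected discounted sum of feature vectors along a trajectory. Below is a minimal sketch of how they can be estimated from recorded demonstrations, assuming a user-supplied feature map `phi` (the actual driving features, such as lane occupancy and collision or off-road indicators, stand in for whatever the simulator exposes).

```python
import numpy as np

def feature_expectations(trajectories, phi, gamma=0.99):
    """Empirical feature expectations from demonstrated trajectories.

    trajectories : list of state sequences [s_0, s_1, ...], e.g. states
                   recorded while the expert drove the simulator
    phi          : hypothetical feature map; phi(s) returns a vector of
                   features of state s
    gamma        : discount factor

    Returns the average discounted feature counts
        mu_E = (1/m) * sum_i sum_t gamma**t * phi(s_t_i).
    """
    m = len(trajectories)
    mu = None
    for traj in trajectories:
        for t, s in enumerate(traj):
            f = (gamma ** t) * np.asarray(phi(s), dtype=float)
            mu = f if mu is None else mu + f
    return mu / m
```

The same estimator can serve as the `estimate_mu` placeholder in the sketch above, applied to trajectories sampled from a learned policy rather than from the expert.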
