Apprenticeship Learning Using Linear Programming

In apprenticeship learning, the goal is to learn a policy for a Markov decision process (MDP) that performs at least as well as a policy demonstrated by an expert. The problem was first introduced by Abbeel & Ng in 2004; the objective is to find a good policy for an autonomous agent, called the "apprentice", in an environment where the reward function is difficult to specify directly. First, we will look at the linear programming approach to finding an optimal policy in an MDP. Umar Syed of Princeton University, Michael Bowling of the University of Alberta, and Robert E. Schapire of Princeton University developed a technique that builds on this LP formulation of policy optimization to produce an apprentice policy that is at least as good as the expert's demonstrated policy. By using an off-the-shelf linear programming solver, Syed et al. were able to see a substantial improvement in running time over other methods for apprenticeship learning.
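For reference, the LP approach to MDP planning optimizes over an occupancy measure x(s,a), the expected discounted number of visits to each state-action pair, rather than over policies directly. With discount factor \gamma, initial state distribution \alpha, and transition probabilities P(s' \mid s, a), a standard form of this linear program is

\[
\max_{x \ge 0} \; \sum_{s,a} x(s,a)\, R(s,a)
\quad \text{subject to} \quad
\sum_{a} x(s,a) \;=\; \alpha(s) + \gamma \sum_{s',a'} P(s \mid s', a')\, x(s',a') \quad \text{for all } s,
\]

and an optimal policy can be read off as \pi(a \mid s) = x(s,a) / \sum_{a'} x(s,a'). The equality constraints here are the Bellman flow constraints that reappear below in the LPAL algorithm.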

The car-driving task that Abbeel & Ng used to introduce apprenticeship learning in 2004 illustrates what the problem is about. When a person drives a car, their behavior can be viewed as maximizing "some reward function," and this reward function depends on just a few key properties of each environment state: the speed of the car, the positions of other cars, the terrain, and so on. The second observation is that demonstrations of good policies by experts are often plentiful; this is certainly true for car driving, as it is in many other applications. Syed et al. consider two algorithms for this setting: the MWAL algorithm and the LPAL algorithm.
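Concretely, the setup assumes, following Abbeel & Ng, that the true reward is an unknown weighted combination of known basis reward functions (features) \phi_i, so that a policy's value decomposes along the same features:

\[
R(s,a) \;=\; \sum_i w_i\, \phi_i(s,a),
\qquad
V_i(\pi) \;=\; \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^t\, \phi_i(s_t,a_t) \,\middle|\, \pi\right],
\qquad
V(\pi) \;=\; \sum_i w_i\, V_i(\pi).
\]

If the unknown weights are assumed nonnegative and normalized, then an apprentice whose basis values V_i(\pi) are all at least the expert's is guaranteed to achieve at least the expert's overall value, whichever weights the expert was actually optimizing.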

The MWAL (Multiplicative Weights Apprenticeship Learning) algorithm is iterative: in each iteration it computes an optimal policy with respect to the current weight vector over the basis reward functions, and then updates the weights multiplicatively, shifting weight toward basis rewards on which the current policy does worse than the expert and away from those on which it does better. The second algorithm, LPAL (Linear Programming Apprenticeship Learning), uses the Bellman flow constraints to find a good apprentice policy in a much more direct fashion than MWAL: the flow constraints describe the feasible set of occupancy measures (and hence policies), and a single linear program maximizes a lower bound on the apprentice's basis reward values relative to the expert's. Now we look at an experiment Syed et al. ran to test these algorithms.
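As a rough illustration of LPAL, here is a minimal sketch on a tiny randomly generated MDP using NumPy and scipy.optimize.linprog. All numbers (transitions, features, the "expert" policy) are made-up toy values, and the variable names are ours rather than the paper's; the point is only the shape of the LP: maximize a margin B subject to B <= sum_{s,a} x(s,a)*phi_i(s,a) - V_E[i] for every basis reward i, together with the Bellman flow constraints on x.

    import numpy as np
    from scipy.optimize import linprog

    # --- Toy problem (illustrative values only) ------------------------------
    nS, nA, nF = 3, 2, 2            # states, actions, basis reward features
    gamma = 0.9
    rng = np.random.default_rng(0)

    P = rng.random((nS, nA, nS))    # P[s, a, s'] = transition probability
    P /= P.sum(axis=2, keepdims=True)
    alpha = np.full(nS, 1.0 / nS)   # initial state distribution
    phi = rng.random((nS, nA, nF))  # basis reward features phi_i(s, a)

    # Hypothetical "expert" policy pi_E(a | s); in practice its basis values
    # would be estimated from demonstration trajectories.
    pi_E = rng.random((nS, nA))
    pi_E /= pi_E.sum(axis=1, keepdims=True)

    # Expert basis values: solve the Bellman flow equations for pi_E.
    P_pi = np.einsum('sap,sa->sp', P, pi_E)                    # state-to-state kernel
    d_E = np.linalg.solve(np.eye(nS) - gamma * P_pi.T, alpha)  # discounted visitation
    x_E = d_E[:, None] * pi_E                                  # expert occupancy measure
    V_E = np.einsum('sa,saf->f', x_E, phi)                     # expert feature expectations

    # --- LPAL-style LP: variables are x(s, a) (flattened) plus the margin B --
    nx = nS * nA
    c = np.zeros(nx + 1)
    c[-1] = -1.0                    # linprog minimizes, so minimize -B to maximize B

    # Margin constraints: B <= sum_{s,a} x(s,a) * phi_i(s,a) - V_E[i] for each i
    A_ub = np.zeros((nF, nx + 1))
    A_ub[:, :nx] = -phi.reshape(nx, nF).T
    A_ub[:, -1] = 1.0
    b_ub = -V_E

    # Bellman flow constraints:
    #   sum_a x(s,a) - gamma * sum_{s',a} P(s | s', a) x(s', a) = alpha(s)
    A_eq = np.zeros((nS, nx + 1))
    for s in range(nS):
        for sp in range(nS):
            for a in range(nA):
                A_eq[s, sp * nA + a] = (sp == s) - gamma * P[sp, a, s]
    b_eq = alpha

    bounds = [(0, None)] * nx + [(None, None)]   # x >= 0, B unrestricted
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)

    # Recover the apprentice policy from the optimal occupancy measure.
    x = res.x[:nx].reshape(nS, nA)
    pi_A = x / x.sum(axis=1, keepdims=True)
    print("margin B:", -res.fun)
    print("apprentice policy:\n", pi_A)

The apprentice policy is recovered from the optimal occupancy measure exactly as in the planning LP above; here the expert's basis values are computed from a known expert policy only for the sake of a self-contained example.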

Their experiment was based on Abbeel & Ng's car-driving example. It consists of a busy virtual three-lane highway in a simulator, and the available actions are to move left, move right, drive faster, or drive slower. The purpose is to test both the MWAL and LPAL algorithms to see which runs faster and how well each matches the expert's policy. In their results, Syed et al. found that the LPAL algorithm is much faster than any of the MWAL variations, and they concluded that LPAL is most appropriate for problems with large state spaces or many basis reward functions. Although LPAL has an advantage over MWAL in computation time, MWAL has the advantage that it can still perform well when the expert's policy is far from ideal, e.g. an "expert" that crashes into every car it sees and fails the task. Thus, neither algorithm is uniformly better: each has drawbacks under certain conditions in the tests that were done.
