### Abstract

In robotics, the ultimate goal of reinforcement learning is to endow robots with the ability to learn, improve, adapt and reproduce tasks with dynamically changing constraints based on exploration and autonomous learning. We give a summary of the state-of-the-art of reinforcement learning in the context of robotics, in terms of both algorithms and policy representations. Numerous challenges faced by the policy representation in robotics are identified. Three recent examples for the application of reinforcement learning to real-world robots are described: a pancake flipping task, a bipedal walking energy minimization task and an archery-based aiming task. In all examples, a state-of-the-art expectation-maximization-based reinforcement learning is used, and different policy representations are proposed and evaluated for each task. The proposed policy representations offer viable solutions to six rarely-addressed challenges in policy representations: correlations, adaptability, multi-resolution, globality, multi-dimensionality and convergence. Both the successes and the practical difficulties encountered in these examples are discussed. Based on insights from these particular cases, conclusions are drawn about the state-of-the-art and the future perspective directions for reinforcement learning in~robotics.

### Bibtex reference

@article{Kormushev13ROB,
author="Kormushev, P. and Calinon, S. and Caldwell, D. G.",
title="Reinforcement Learning in Robotics: Applications and Real-World Challenges",
journal="Robotics",
year="2013",
Volume="2",
number="3",
pages="122--148",
}

### Video

After being instructed how to hold the bow and release the arrow, the robot learns by itself to aim and shoot arrows at the target. It learns to hit the center of the target in only 8 trials.

The learning algorithm, called ARCHER (Augmented Reward Chained Regression) algorithm, was developed and optimized specifically for problems like the archery training, which have a smooth solution space and prior knowledge about the goal to be achieved. In the case of archery, we know that hitting the center corresponds to the maximum reward we can get. Using this prior information about the task, we can view the position of the arrow's tip as an augmented reward. ARCHER uses a chained local regression process that iteratively estimates new policy parameters which have a greater probability of leading to the achievement of the goal of the task, based on the experience so far. An advantage of ARCHER over other learning algorithms is that it makes use of richer feedback information about the result of a rollout.

For the archery training, the ARCHER algorithm is used to modulate and coordinate the motion of the two hands, while an inverse kinematics controller is used for the motion of the arms. After every rollout, the image processing part recognizes automatically where the arrow hits the target which is then sent as feedback to the ARCHER algorithm. The image recognition is based on Gaussian Mixture Models for color-based detection of the target and the arrow's tip.

The experiments are performed on a 53-DOF humanoid robot iCub. The distance between the robot and the target is 3.5m, and the height of the robot is 104cm.

Authors of the video:
Petar Kormushev, Sylvain Calinon, Ryo Saegusa and Giorgio Metta
Italian Institute of Technology