Reinforcement Learning

Reinforcement learning describes a setting in which the only training signal a machine learning system receives from its environment is an indication of success or failure, issued after the agent has acted over a sequence of decision cycles. This learning problem can be formulated as a Markov Decision Process (MDP) and treated within the framework of dynamic programming.
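
As a brief illustration of that formulation, the following is a minimal value-iteration sketch for a finite MDP. The state and action counts and the randomly filled tables P and R are hypothetical stand-ins for an actual environment model, not anything taken from the project:

    # Minimal value-iteration sketch for a finite MDP (illustrative only;
    # the tables P and R are hypothetical placeholders for a real model).
    import numpy as np

    n_states, n_actions = 5, 2
    gamma = 0.95                      # discount factor

    # P[s, a, s'] = transition probability, R[s, a] = expected reward.
    P = np.random.dirichlet(np.ones(n_states), size=(n_states, n_actions))
    R = np.random.uniform(-1.0, 1.0, size=(n_states, n_actions))

    V = np.zeros(n_states)
    for _ in range(1000):
        # Bellman optimality backup:
        # V(s) <- max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]
        V_new = np.max(R + gamma * (P @ V), axis=1)
        if np.max(np.abs(V_new - V)) < 1e-8:  # stop once values converge
            break
        V = V_new

An optimal policy can then be induced from the converged values by acting greedily with respect to the same backup.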

Within the soccer simulation project Brainstormers 2D, one of our main motivations has been to investigate reinforcement learning (RL) methods in complex domains such as simulated soccer, and to develop new variants and practical algorithms. In that domain, we consider it particularly important not only to demonstrate the feasibility of RL in principle for a specific problem, but to actually deploy the learned behavior in our competition team.

Use Case: Learning Competitive Behaviors for Simulated Soccer Agents

In a highly competitive domain like robotic soccer, well-developed player skills (the term skills refers to a player's low-level capabilities, such as intercepting the ball or kicking it with a certain speed in a specific direction) are of fundamental importance for the team's performance. Therefore, one aim of the Brainstormers 2D project has been not only to make the agents learn their skills autonomously, but to have them acquire optimal, or at least near-optimal, low-level capabilities.

Since most reinforcement learning algorithms try to learn state or state-action value functions, from which a policy for action choice can be induced, a crucial question is how to approximate that function. The video shows the application of an RL approach that uses a case-based value function approximator to learn a ball-interception behavior in simulated robotic soccer. The player receives a positive reward when it manages to intercept the ball, and a small negative reward for each time step in which it has not yet reached the ball. The results are shown after different stages of learning: the agent acquired an average-quality interception policy within a very short time, i.e. with a small number of training episodes and a limited case memory. A case-based function approximator thus yields good learning results quickly, although more powerful function approximation mechanisms (such as neural networks) bring about much better, and often optimal, results in the long run.
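
To make the idea of a case-based value function approximator concrete, here is a minimal sketch: cases are stored (state, value) pairs, and a query is answered by averaging the values of the k nearest stored cases. The class and function names, the distance metric, the memory limit, and the reward magnitudes are all illustrative assumptions, not the actual Brainstormers implementation:

    # Sketch of a case-based value function approximator using
    # k-nearest-neighbour averaging over a bounded case memory.
    # Illustrative only; state encoding and parameters are assumed.
    import numpy as np

    class CaseBasedV:
        def __init__(self, k=3, max_cases=1000):
            self.k, self.max_cases = k, max_cases
            self.states, self.values = [], []   # the case memory

        def predict(self, s):
            """Estimate V(s) as the mean value of the k nearest cases."""
            if not self.states:
                return 0.0
            d = np.linalg.norm(np.asarray(self.states) - np.asarray(s), axis=1)
            nearest = np.argsort(d)[:self.k]
            return float(np.mean(np.asarray(self.values)[nearest]))

        def add(self, s, v):
            """Store a new case while the memory limit is not exceeded."""
            if len(self.states) < self.max_cases:
                self.states.append(np.asarray(s, dtype=float))
                self.values.append(float(v))

    # Reward scheme as described above: a small negative step cost until
    # the ball is intercepted, a positive reward on success (values assumed).
    def reward(intercepted):
        return 1.0 if intercepted else -0.05

A learning step would then add or update cases using a standard temporal-difference target, computed from the step reward and the predicted value of the successor state.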

See the following publication for more details: