Approximating the Value Function in the Actor-Critic Architecture using the Temporal Dynamics of Spiking Neural Networks

Abstract

The human ability to learn from sparse rewards has been modeled with the temporal difference learning mechanism, using an actor-critic architecture (Montague, Dayan, & Sejnowski, 1996). These models incorporate an "adaptive critic" which learns a "value function": a mapping from the learner's current situation to expected future reward. In complex environments, a "value function approximator" (VFA) must be implemented to allow generalization between similar situations. While some implementations of VFAs have been successful (Tesauro, 1992), this approach does not consistently converge to a solution (Boyan & Moore, 1995). With the goal of developing a general and reliable VFA mechanism that captures human-level learning performance, we have explored the use of spiking neural networks, including liquid state machines, as a technique for VFA learning in complex environments. We report on simulations demonstrating the benefits and pitfalls of using the temporal dynamics of neural spikes to encode the learner's state.
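
For reference, the following is a minimal Python sketch of the standard temporal-difference actor-critic update that the abstract builds on, not the spiking-network or liquid state machine implementation studied in the paper. The toy chain environment and all names (chain_step, n_states, the learning rates) are hypothetical, and the linear value function approximator over one-hot features is effectively tabular; it is shown only to make the critic's value-function update concrete.

    # Minimal, generic sketch of a TD(0) actor-critic with a linear
    # value function approximator (VFA). Illustrative only; not the
    # spiking-network approach described in the abstract.
    import numpy as np

    n_states = 5     # toy chain: states 0..4, sparse reward at the right end
    n_actions = 2    # 0 = move left, 1 = move right
    alpha_v = 0.1    # critic (value) learning rate
    alpha_p = 0.05   # actor (policy) learning rate
    gamma = 0.95     # discount factor

    def features(s):
        """One-hot state features; a real VFA would generalize across states."""
        phi = np.zeros(n_states)
        phi[s] = 1.0
        return phi

    w = np.zeros(n_states)                    # critic weights: V(s) ~= w . phi(s)
    theta = np.zeros((n_states, n_actions))   # actor action preferences

    def chain_step(s, a):
        """Toy dynamics: step left/right; reward only at the terminal state."""
        s_next = min(max(s + (1 if a == 1 else -1), 0), n_states - 1)
        r = 1.0 if s_next == n_states - 1 else 0.0
        done = s_next == n_states - 1
        return s_next, r, done

    for episode in range(500):
        s, done = 0, False
        while not done:
            # Actor: softmax over action preferences in the current state.
            prefs = theta[s]
            probs = np.exp(prefs - prefs.max())
            probs /= probs.sum()
            a = np.random.choice(n_actions, p=probs)

            s_next, r, done = chain_step(s, a)

            # Critic: TD error delta = r + gamma * V(s') - V(s).
            v_s = w @ features(s)
            v_next = 0.0 if done else w @ features(s_next)
            delta = r + gamma * v_next - v_s

            # Critic update: move V(s) toward the TD target.
            w += alpha_v * delta * features(s)

            # Actor update: reinforce the chosen action by the TD error.
            grad = -probs
            grad[a] += 1.0
            theta[s] += alpha_p * delta * grad

            s = s_next

    print("Learned state values:", np.round(w, 2))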
