Since its introduction, the reward prediction error theory of dopamine has explained a wealth of empirical phenomena, providing a unifying framework for understanding the representation of reward and value in the brain1-3. According to the now canonical theory, reward predictions are represented as a single scalar quantity, which supports learning about the expectation, or mean, of stochastic outcomes. Here we propose an account of dopamine-based reinforcement learning inspired by recent artificial intelligence research on distributional reinforcement learning4-6. We hypothesized that the brain represents possible future rewards not as a single mean, but instead as a probability distribution, effectively representing multiple future outcomes simultaneously and in parallel. This idea implies a set of empirical predictions, which we tested using single-unit recordings from the mouse ventral tegmental area. Our findings provide strong evidence for a neural realization of distributional reinforcement learning.
The reward prediction error (RPE) theory of dopamine derives from work in the artificial intelligence (AI) field of reinforcement learning (RL)7. Since the link to neuroscience was first made, however, RL has made substantial advances8,9, revealing factors that greatly enhance the effectiveness of RL algorithms10. In some cases, the relevant mechanisms invite comparison with neural function, suggesting hypotheses concerning reward-based learning in the brain11-13. Here we examine a promising recent development in AI research and investigate its potential neural correlates. Specifically, we consider a computational framework referred to as distributional reinforcement learning4-6 (Fig. 1a, b).
Similar to the traditional form of temporal-difference RL, on which the dopamine theory was based, distributional RL assumes that reward-based learning is driven by an RPE, which signals the difference between received and anticipated reward. (For simplicity, we introduce the theory in terms of a single-step transition model, but the same principles hold for the general multi-step (discounted return) case; see Supplementary Information.) The key difference in distributional RL lies in how 'anticipated reward' is defined. In traditional RL, the reward prediction is represented as a single quantity: the average over all potential reward outcomes, weighted by their respective probabilities. By contrast, distributional RL uses a multiplicity of predictions. These predictions vary in their degree of optimism about upcoming reward. More optimistic predictions anticipate obtaining greater future rewards; less optimistic predictions anticipate more meager outcomes. Together, the entire range of predictions captures the full distribution over possible rewards.
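To make the idea concrete, the following minimal sketch in Python illustrates one common way such a family of predictions can be learned in the single-step setting described above: each predictor weights positive and negative prediction errors asymmetrically, so optimistic predictors settle high in the reward distribution and pessimistic ones settle low. The function name, learning rate, and the specific asymmetry parameters are illustrative assumptions, not taken from the original model.

import numpy as np

def distributional_td_update(values, taus, reward, lr=0.05):
    # Per-predictor reward prediction errors (single-step case, as above).
    deltas = reward - values
    # Asymmetric weighting: optimistic predictors (tau > 0.5) amplify positive
    # errors; pessimistic predictors (tau < 0.5) amplify negative errors.
    scale = np.where(deltas > 0, taus, 1.0 - taus)
    return values + lr * scale * deltas

# A small family of predictors spanning pessimistic to optimistic.
taus = np.linspace(0.1, 0.9, 5)
values = np.zeros_like(taus)

rng = np.random.default_rng(0)
for _ in range(20000):
    reward = rng.choice([0.0, 1.0])                     # stochastic 50/50 reward
    values = distributional_td_update(values, taus, reward)

print(values)  # predictions spread out across the outcome distribution

In this toy example the reward is 0 or 1 with equal probability: a single traditional predictor would converge to the mean of 0.5, whereas the five asymmetric predictors settle at graded values (approximately equal to their tau parameters), jointly sketching the shape of the whole reward distribution rather than only its average.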