Reward is enough
Abstract
In this article we hypothesise that intelligence, and its associated abilities, can be understood as subserving the maximisation of reward. Accordingly, reward is enough to drive behaviour that exhibits abilities studied in natural and artificial intelligence, including knowledge, learning, perception, social intelligence, language, generalisation and imitation. This is in contrast to the view that specialised problem formulations are needed for each ability, based on other signals or objectives. Furthermore, we suggest that agents that learn through trial and error experience to maximise reward could learn behaviour that exhibits most if not all of these abilities, and therefore that powerful reinforcement learning agents could constitute a solution to artificial general intelligence.
1. Introduction
Expressions of intelligence in animal and human behaviour are so bountiful and so varied that there is an ontology of associated abilities to name and study them, e.g. social intelligence, language, perception, knowledge representation, planning, imagination, memory, and motor control. What could drive agents (natural or artificial) to behave intelligently in such a diverse range of ways?
One possible answer is that each ability arises from the pursuit of a goal that is designed specifically to elicit that ability. For example, the ability of social intelligence has often been framed as the Nash equilibrium of a multi-agent system; the ability of language by a combination of goals such as parsing, part-of-speech tagging, lexical analysis, and sentiment analysis; and the ability of perception by object segmentation and recognition. In this paper, we consider an alternative hypothesis: that the generic objective of maximising reward is enough to drive behaviour that exhibits most if not all abilities that are studied in natural and artificial intelligence.
This hypothesis may startle because the sheer diversity of abilities associated with intelligence seems to be at odds with any generic objective. However, the natural world faced by animals and humans, and presumably also the environments faced in the future by artificial agents, are inherently so complex that they require sophisticated abilities in order to succeed (for example, to survive) within those environments. Thus, success, as measured by maximising reward, demands a variety of abilities associated with intelligence. In such environments, any behaviour that maximises reward must necessarily exhibit those abilities. In this sense, the generic objective of reward maximisation contains within it many or possibly even all the goals of intelligence.
Reward thus provides two levels of explanation for the bountiful expressions of intelligence found in nature. First, different forms of intelligence may arise from the maximisation of different reward signals in different environments, resulting for example in abilities as distinct as echolocation in bats, communication by whale-song, or tool use in chimpanzees. Similarly, artificial agents may be required to maximise a variety of reward signals in future environments, resulting in new forms of intelligence with abilities as distinct as laser-based navigation, communication by email, or robotic manipulation.
Second, the intelligence of even a single animal or human is associated with a cornucopia of abilities. According to our hypothesis, all of these abilities subserve a singular goal of maximising that animal or agent's reward within its environment. In other words, the pursuit of one goal may generate complex behaviour that exhibits multiple abilities associated with intelligence. Indeed, such reward-maximising behaviour may often be consistent with specific behaviours derived from the pursuit of separate goals associated with each ability.
For example, a squirrel's brain may be understood as a decision-making system that receives sensations from, and sends motor commands to, the squirrel's body. The behaviour of the squirrel may be understood as maximising a cumulative reward such as satiation (i.e. negative hunger). In order for a squirrel to minimise hunger, the squirrel's brain must presumably have abilities of perception (to identify good nuts), knowledge (to understand nuts), motor control (to collect nuts), planning (to choose where to cache nuts), memory (to recall locations of cached nuts) and social intelligence (to bluff about locations of cached nuts, to ensure they are not stolen). Each of these abilities associated with intelligence may therefore be understood as subserving a singular goal of hunger minimisation (see Fig. 1).
Fig. 1. The reward-is-enough hypothesis postulates that intelligence, and its associated abilities, can be understood as subserving the maximisation of reward by an agent acting in its environment. For example, a squirrel acts so as to maximise its consumption of food (top, reward depicted by acorn symbol), or a kitchen robot acts to maximise cleanliness (bottom, reward depicted by bubble symbol). To achieve these goals, complex behaviours are required that exhibit a wide variety of abilities associated with intelligence (depicted on the right as a projection from an agent's stream of experience onto a set of abilities expressed within that experience).
As a second example, a kitchen robot may be implemented as a decision-making system that receives sensations from, and sends actuator commands to, the robot's body. The singular goal of the kitchen robot is to maximise a reward signal measuring cleanliness. In order for a kitchen robot to maximise cleanliness, it must presumably have abilities of perception (to differentiate clean and dirty utensils), knowledge (to understand utensils), motor control (to manipulate utensils), memory (to recall locations of utensils), language (to predict future mess from dialogue), and social intelligence (to encourage young children to make less mess). A behaviour that maximises cleanliness must therefore yield all these abilities in service of that singular goal (see Fig. 1).
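Both examples instantiate the same abstract interface: an agent that receives observations (sensations), emits actions (motor or actuator commands), and accumulates a scalar reward (e.g. satiation or cleanliness). The sketch below makes that interface concrete; the Environment and Agent classes and their method names are illustrative assumptions introduced for exposition only, not the formalism used in this paper (which is given in Section 2).

```python
# A minimal sketch of the agent-environment interaction implied by the squirrel
# and kitchen-robot examples. Class and method names are illustrative assumptions.

class Environment:
    def reset(self):
        """Return the initial observation (sensations from the body/world)."""
        raise NotImplementedError

    def step(self, action):
        """Apply an action; return (observation, reward, done)."""
        raise NotImplementedError


class Agent:
    def act(self, observation):
        """Map the current observation to an action (motor/actuator command)."""
        raise NotImplementedError


def run_episode(agent, env):
    """Run one episode and return the cumulative reward the agent obtained."""
    observation = env.reset()
    cumulative_reward = 0.0
    done = False
    while not done:
        action = agent.act(observation)
        observation, reward, done = env.step(action)
        cumulative_reward += reward
    return cumulative_reward
```

Under this view, the squirrel and the kitchen robot differ only in their environments and reward signals; the abilities listed above are whatever behaviour is needed to make the cumulative reward large.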
When abilities associated with intelligence arise as solutions to a singular goal of reward maximisation, this may in fact provide a deeper understanding since it explains why such an ability arises (e.g. that classification of crocodiles is important to avoid being eaten). In contrast, when each ability is understood as the solution to its own specialised goal, the why question is side-stepped in order to focus upon what that ability does (e.g. discriminating crocodiles from logs). Furthermore, a singular goal may also provide a broader understanding of each ability that may include characteristics that are otherwise hard to formalise, such as dealing with irrational agents in social intelligence (e.g. pacifying an angry aggressor), grounding language to perceptual experience (e.g. dialogue regarding the best way to peel a fruit), or understanding haptics in perception (e.g. picking a sharp object from a pocket). Finally, implementing abilities in service of a singular goal, rather than for their own specialised goals, also answers the question of how to integrate abilities, which otherwise remains an outstanding issue.
Having established that reward maximisation is a suitable objective for understanding the problem of intelligence, one may consider methods for solving the problem. One might then expect to find such methods in natural intelligence, or choose to implement them in artificial intelligence. Among possible methods for maximising reward, the most general and scalable approach is to learn to do so, by interacting with the environment by trial and error. We conjecture that an agent that can effectively learn to maximise reward in this manner would, when placed in a rich environment, give rise to sophisticated expressions of general intelligence.
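As one concrete, hedged illustration of learning by trial and error, the sketch below shows tabular Q-learning, a standard reinforcement-learning algorithm. It is used here purely as an example of a reward-maximising learner, not as a method proposed or endorsed by this paper; the environment interface (reset, step, actions) is an assumption carried over from the earlier sketch.

```python
import random
from collections import defaultdict


def q_learning(env, episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning: improve an action-value estimate Q(state, action)
    purely from trial-and-error interaction, so that acting greedily with
    respect to Q tends to maximise cumulative reward.

    Assumes env exposes reset(), step(action) -> (state, reward, done), and a
    list env.actions of discrete actions; these names are illustrative
    assumptions, not part of the paper's formalism.
    """
    Q = defaultdict(float)
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Epsilon-greedy: mostly exploit current knowledge, sometimes explore.
            if random.random() < epsilon:
                action = random.choice(env.actions)
            else:
                action = max(env.actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # Temporal-difference update towards reward plus discounted future value.
            best_next = max(Q[(next_state, a)] for a in env.actions)
            target = reward + (0.0 if done else gamma * best_next)
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state = next_state
    return Q
```

The conjecture is not about this particular algorithm, but about any sufficiently capable learner of this kind: given a rich enough environment, the behaviour it acquires in pursuit of reward would exhibit the abilities discussed above.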
A recent salutary example of both problem and solution based on reward maximisation comes from the game of Go. Research initially focused largely upon distinct abilities, such as openings, shape, tactics, and endgames, each formalised using distinct objectives such as sequence memorisation, pattern recognition, local search, and combinatorial game theory [32]. AlphaZero [49] focused instead on a singular goal: maximising a reward signal that is 0 until the final step, and then +1 for winning or −1 for losing. This ultimately resulted in a deeper understanding of each ability – for example, discovering new opening sequences [65], using surprising shapes within a global context [40], understanding global interactions between local battles [64], and playing safe when ahead [40]. It also yielded a broader set of abilities that had not previously been satisfactorily formalised – such as balancing influence and territory, thickness and lightness, and attack and defence. AlphaZero's abilities were innately integrated into a unified whole, whereas integration had proven highly problematic in prior work [32]. Thus, maximising wins proved to be enough, in a simple environment such as Go, to drive behaviour exhibiting a variety of specialised abilities. Furthermore, applying the same method to different environments such as chess or shogi [48] resulted in new abilities such as piece mobility and colour complexes [44]. We argue that maximising rewards in richer environments – more comparable in complexity to the natural world faced by animals and humans – could yield further, and perhaps ultimately all abilities associated with intelligence.
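The reward signal described above for AlphaZero is deliberately sparse, and can be sketched as follows. The game_state interface (is_over, is_draw, winner) is an illustrative assumption; only the 0 / +1 / −1 structure comes from the text.

```python
def terminal_reward(game_state, player):
    """Sparse win/loss reward of the kind described above: zero on every
    non-terminal step, then +1 for a win and -1 for a loss. The game_state
    interface is an illustrative assumption; only the reward structure itself
    is taken from the text.
    """
    if not game_state.is_over():
        return 0.0  # no intermediate reward shaping during the game
    if game_state.is_draw():
        return 0.0  # e.g. drawn games in chess or shogi receive no reward
    return 1.0 if game_state.winner() == player else -1.0
```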
The rest of this paper is organised as follows. In Section 2 we formalise the objective of reward maximisation as the problem of reinforcement learning. In Section 3 we present our main hypothesis. We consider several important abilities associated with intelligence, and discuss how reward maximisation may yield those abilities. In Section 4 we turn to the use of reward maximisation as a solution strategy. We present related work in Section 5 and finally, in Section 6 we discuss possible weaknesses of the hypothesis and consider several alternatives.