Making sense of sensory input
Abstract
This paper attempts to answer a central question in unsupervised learning: what does it mean to “make sense” of a sensory sequence? In our formalization, making sense involves constructing a symbolic causal theory that both explains the sensory sequence and also satisfies a set of unity conditions. The unity conditions insist that the constituents of the causal theory – objects, properties, and laws – must be integrated into a coherent whole. On our account, making sense of sensory input is a type of program synthesis, but it is unsupervised program synthesis.
Our second contribution is a computer implementation, the Apperception Engine, that was designed to satisfy the above requirements. Our system is able to produce interpretable human-readable causal theories from very small amounts of data, because of the strong inductive bias provided by the unity conditions. A causal theory produced by our system is able to predict future sensor readings, as well as retrodict earlier readings, and impute (fill in the blanks of) missing sensory readings, in any combination. In fact, it is able to do all three tasks simultaneously.
We tested the engine in a diverse variety of domains, including cellular automata, rhythms and simple nursery tunes, multi-modal binding problems, occlusion tasks, and sequence induction intelligence tests. In each domain, we tested our engine's ability to predict future sensor values, retrodict earlier sensor values, and impute missing sensory data. The Apperception Engine performs well in all these domains, significantly out-performing neural net baselines. We note in particular that in the sequence induction intelligence tests, our system achieved human-level performance. This is notable because our system is not a bespoke system designed specifically to solve intelligence tests, but a general-purpose system for making sense of any sensory sequence.
1. Introduction
Imagine a machine, equipped with various sensors, that receives a stream of sensory information. It must, somehow, make sense of this stream of sensory data. But what does it mean, exactly, to “make sense” of sensory data? We have an intuitive understanding of what is involved in making sense of the sensory stream – but can we specify precisely what is involved? Can this intuitive notion be formalized?
One approach is to treat the sensory sequence as the input to a supervised learning problem: given a sequence $(x_1, \dots, x_t)$ of sensory data from time steps $1$ to $t$, maximize the probability of the next datum $x_{t+1}$. This family of approaches seeks to maximize $p(x_{t+1} \mid x_1, \dots, x_t)$. More generally, as well as training the system to predict future sensor readings, we may also train it to retrodict past sensor readings (maximizing $p(x_1 \mid x_2, \dots, x_t)$), and to impute missing intermediate values (maximizing $p(x_i \mid x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_t)$ for some $1 < i < t$).
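To make this supervised framing concrete, here is a minimal sketch using a first-order Markov (bigram) model over discrete sensor values. The model choice, function names, and toy sequence are our own illustration, not part of any system described in this paper:

```python
from collections import Counter, defaultdict

# Illustrative only: a first-order Markov model over discrete sensor
# readings, standing in for p(x_{t+1} | x_1, ..., x_t).
def fit_transitions(seq):
    counts = defaultdict(Counter)
    for a, b in zip(seq, seq[1:]):
        counts[a][b] += 1
    return counts

def predict_next(counts, seq):
    # argmax_x p(x | x_t): the most frequent successor of the last datum.
    return counts[seq[-1]].most_common(1)[0][0]

def retrodict_first(counts, seq):
    # Retrodiction: the most likely predecessor of the first datum.
    return max(counts, key=lambda a: counts[a][seq[0]])

def impute(counts, seq, i):
    # Imputation heuristic: score each candidate x for position i by
    # its predecessor and successor transition counts.
    return max(counts,
               key=lambda x: counts[seq[i - 1]][x] * counts[x][seq[i + 1]])

seq = list("ababab")
counts = fit_transitions(seq)
print(predict_next(counts, seq))     # 'a' (after the final 'b')
print(retrodict_first(counts, seq))  # 'b' (most likely predecessor of 'a')
print(impute(counts, seq, 3))        # 'b' (between 'a' and 'a')
```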
We believe there is more to “making sense” than prediction, retrodiction, and imputation. Predicting the future state of one's photoreceptors may be part of what is involved in making sense – but it is not on its own sufficient. The ability to predict, retrodict, and impute is a sign, a surface manifestation, that one has made sense of the input. We want to define the underlying mental model that is constructed when one makes sense of the sensory input, and to show how constructing this mental model ipso facto enables one to predict, retrodict, and impute.
In this paper, we assume that making sense of sensory input involves constructing a symbolic theory that explains the sensory input [9], [10], [11], [12]. A number of authors, including Lake [13] and Marcus [14], have argued that constructing an explanatory theory is a key component of common sense. Following Spelke and others [15], [16], we assume the theory must posit objects that persist over time, with properties that change over time according to general laws. Further, we assume, following John McCarthy and others [17], [18], [19], that making sense of the surface sensory perturbations requires positing latent objects: some sensory sequences can only be made intelligible by hypothesizing an underlying reality, distinct from the surface features of our sensors, that explains the surface phenomena. This underlying reality consists of latent objects that causally interact with our sensors to produce the sensory perturbations we are given as input. Once we have constructed such a theory, we can apply it to predict future sensor readings, to retrodict past readings, or to impute missing values.
Now constructing a symbolic theory that explains the sensory sequence is necessary for making sense of the sequence. But it is not, we claim, sufficient. There is one further ingredient in our characterisation of “making sense”: the requirement that the theory exhibits a particular form of unity. The constituents of the theory – objects, properties, and atoms – must be integrated into a coherent whole. Specifically, our unity condition requires that the objects are interrelated via chains of binary relations, the properties are connected via exclusion relations, and the atoms are unified by jointly satisfying the theory's constraints. This extra unity condition is necessary, we argue, for the theory to achieve good accuracy at prediction, retrodiction, and imputation.
This paper makes two main contributions. The first is a formalization of what it means to “make sense” of the stream of sensory data. According to our definition, making sense of a sensory sequence involves positing a symbolic causal theory – a set of objects, a set of concepts, a set of initial conditions, a set of rules, and a set of constraints – that together satisfy two conditions. First, the theory must explain the sensory readings it is given. Second, the theory must satisfy a particular type of unity. Our definition of unity involves four conditions. (i) Spatial unity: all objects must be unified in space via a chain of binary relations. (ii) Conceptual unity: all concepts must be unified via constraints. (iii) Static unity: all propositions that are true at the same time must jointly satisfy the set of constraints. (iv) Temporal unity: all the states must be unified into a sequence by causal rules.
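As a concrete (and purely hypothetical) illustration of these ingredients, a causal theory might be represented as the following data structure; the field names and example atoms are our own, not the paper's formal notation:

```python
from dataclasses import dataclass

# Hypothetical encoding of the five ingredients of a causal theory.
# Field names and example atoms are illustrative only.
@dataclass
class Theory:
    objects: set             # e.g. {"light1", "sensor1"}
    concepts: set            # predicates, e.g. {"on", "off", "part_of"}
    initial_conditions: set  # ground atoms holding at the first time step
    rules: set               # causal rules that drive state transitions
    constraints: set         # e.g. "every object is on xor off"

theory = Theory(
    objects={"light1", "sensor1"},
    concepts={"on", "off", "part_of"},
    initial_conditions={"on(light1)", "part_of(sensor1, light1)"},
    rules={"on(X) causes off(X)", "off(X) causes on(X)"},
    constraints={"forall X: on(X) xor off(X)"},
)
```

On this representation, the unity conditions become checks on the structure: spatial unity asks that the objects are connected by a chain of binary atoms (here, part_of), conceptual unity that every concept figures in some constraint, static unity that the atoms true at each time step jointly satisfy the constraints, and temporal unity that successive states are linked by the causal rules.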
Our second contribution is a description of a particular computer system, the Apperception Engine, that was designed to satisfy the conditions described above. We introduce a causal language, Datalog⊃, designed for reasoning about infinite temporal sequences. Given a sensory sequence, our system synthesizes a Datalog⊃ program that, when executed, generates a trace that both explains the sensory sequence and also satisfies the four conditions of unity. This can be seen as a form of unsupervised program synthesis [22]. In traditional supervised program synthesis, we are given input/output pairs and search for a program that, when executed on the inputs, produces the desired outputs. Here, in unsupervised program synthesis, we are given a sensory sequence and search for a causal theory that, when executed, generates a trajectory that both respects the sensory sequence and satisfies the conditions of unity.
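Schematically, this unsupervised synthesis can be pictured as generate-and-test over candidate theories with a preference for simpler ones, as in the sketch below. This is our own abstraction: the actual engine's search is far more structured, and the helpers execute and unified are assumptions, not part of any real API:

```python
# Schematic generate-and-test view of unsupervised program synthesis.
# sensory_sequence and each state of a trace are sets of ground atoms.
def synthesize(sensory_sequence, candidate_theories, execute, unified):
    best = None
    for theory in candidate_theories:
        trace = execute(theory, steps=len(sensory_sequence))
        # The trace must *cover* the readings: each reading is a subset
        # of the corresponding state (latent atoms may be extra).
        covers = all(reading <= state
                     for reading, state in zip(sensory_sequence, trace))
        if covers and unified(theory) and (best is None or cost(theory) < cost(best)):
            best = theory
    return best

def cost(theory):
    # Simplicity bias: prefer theories with fewer rules and objects
    # (assuming the Theory sketch above).
    return len(theory.rules) + len(theory.objects)
```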
The Apperception Engine has a number of appealing features. (1) Because the causal theories it generates are symbolic, they are human-readable and hence verifiable: we can understand precisely how the system is making sense of its sensory data. (2) Because of the strong inductive bias (both the design of the causal language, Datalog⊃, and the unity conditions that must be satisfied), the system is data-efficient, able to make sense of the shortest and scantiest of sensory sequences. (3) The system generates a causal model that accurately predicts future sensory input. But that is not all it can do: it can also retrodict previous values and impute missing sensory values in the middle of the sensory stream. In fact, it can predict, retrodict, and impute simultaneously. (4) The Apperception Engine has been tested in a diverse variety of domains, with encouraging results. The five domains are elementary cellular automata, rhythms and nursery tunes, “Seek Whence” and C-test sequence induction intelligence tests [24], multi-modal binding tasks, and occlusion problems. These tasks were chosen because they require cognition rather than mere classificatory perception, and because they are simple for humans but hard for modern machine learning systems, e.g. neural networks. The Apperception Engine performs well in all these domains, significantly out-performing neural net baselines. These results are significant because neural systems typically struggle to solve the binding problem (where information from different modalities must be combined into different aspects of one unified object) and fail on occlusion tasks (in which objects are sometimes visible and sometimes obscured from view).
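As an example of the kind of input used in the first domain, the snippet below generates the trace of a one-dimensional elementary cellular automaton (here, Rule 110); the encoding of cells as sensor readings is our assumption, purely for illustration:

```python
# Generate the trace of a 1D elementary cellular automaton (Rule 110),
# the kind of sequence used as sensory input in the cellular-automata
# domain. Each cell's next value is looked up from the rule number's
# bits, indexed by the (left, centre, right) neighbourhood.
def eca_step(cells, rule=110):
    n = len(cells)
    return [
        (rule >> (cells[(i - 1) % n] * 4 + cells[i] * 2 + cells[(i + 1) % n])) & 1
        for i in range(n)
    ]

cells = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
for t in range(5):
    print("".join(".#"[c] for c in cells))
    cells = eca_step(cells)
```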
We note in particular that in the sequence induction intelligence tests, our system achieved human-level performance. This is notable because the Apperception Engine was not designed to solve these induction tasks; it is not a bespoke hand-engineered solution to this particular domain. Rather, it is a general-purpose system that attempts to make sense of any sensory sequence. This is, we believe, a highly suggestive result [25].
In ablation tests, we examined what happens when each of the four unity conditions is turned off. The system's performance deteriorates noticeably whenever a unity condition is ablated, indicating that the unity conditions do vital work in the engine's attempts to make sense of the incoming barrage of sensory data.