Making sense of raw input
Abstract
How should a machine intelligence perform unsupervised structure discovery over streams of sensory input? One approach to this problem is to cast it as an apperception task [1]. Here, the task is to construct an explicit interpretable theory that both explains the sensory sequence and also satisfies a set of unity conditions, designed to ensure that the constituents of the theory are connected in a relational structure.
However, the original formulation of the apperception task had one fundamental limitation: it assumed the raw sensory input had already been parsed into a set of discrete categories, so that all the system had to do was receive this already-digested symbolic input, and make sense of it. But what if we don't have access to pre-parsed input? What if our sensory sequence is raw unprocessed information?
The central contribution of this paper is a neuro-symbolic framework for distilling interpretable theories out of streams of raw, unprocessed sensory experience. First, we extend the definition of the apperception task to include ambiguous (but still symbolic) input: sequences of sets of disjunctions. Next, we use a neural network to map raw sensory input to disjunctive input. Our binary neural network is encoded as a logic program, so the weights of the network and the rules of the theory can be found jointly by solving a single SAT problem. This way, we are able to jointly learn how to perceive (mapping raw sensory information to concepts) and apperceive (combining concepts into declarative rules).
1. Introduction
There are, broadly speaking, two approaches to interpreting the results of machine learning systems [2], [3], [4]. In one approach, post-hoc interpretation, we take an existing machine learning system that has already been trained and try to understand its inner state. In the other approach, designing explicitly interpretable machine learning systems, we constrain the design of the machine learning system to guarantee, in advance, that its results will be interpretable.
In this paper, we take the second approach to unsupervised learning. Our system takes as input a temporal sequence of raw unprocessed sensory information, and produces an interpretable theory capturing the regularities in that sequence. It combines an unsupervised program synthesis system for constructing explicit first-order theories, with a binary neural network that transforms raw unprocessed sensory information into symbolic information that can be accessed by the program synthesis system. Thus, the system jointly synthesizes an explanatory symbolic theory, connected to a learned, sub-symbolic perceptual front-end.
1.1. Unsupervised learning and apperception
Consider a machine, equipped with various sensors, receiving a stream of sensory information. Somehow, it must make sense of this sensory stream. But what, exactly, does “making sense” involve, and how, exactly, should it be performed?
Unsupervised learning occupies a curious position within the space of AI tasks in that, while it is acknowledged to be of central importance to the progress of the field, it is also frustratingly ill-defined. What, exactly, does it mean to “make sense” of unlabelled data? There is no consensus on what the problem is, let alone the solution.
Self-supervised learning has emerged as a well-defined sub-field within unsupervised learning [6], [7]. Here, the task is to use the unlabelled sensory sequence as a source of supervised learning problems: we try to predict future states given previous states. Now, the vague, under-specified unsupervised learning problem has been replaced by the well-defined task of predicting certain data points conditioned on others.
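To make this concrete, here is a minimal sketch (our own illustration, not code from any of the cited systems) of how self-supervised learning manufactures prediction problems from an unlabelled sequence; the function name and the toy sequence are purely illustrative.

```python
# Minimal illustration: hold out each state and predict it from the
# states that precede it, turning an unlabelled sequence into a set of
# supervised prediction problems.
def make_prediction_tasks(sequence, context=2):
    """Yield (history, target) pairs for next-state prediction."""
    for t in range(context, len(sequence)):
        yield sequence[t - context:t], sequence[t]

# A toy symbolic sensor reading.
observations = ["on", "off", "on", "off", "on"]
for history, target in make_prediction_tasks(observations):
    print(f"given {history}, predict {target!r}")
```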
But there is more, we submit, to making sense than just predicting (or retrodicting) held-out states. Predicting future held-out states is certainly part of what is involved in making sense of the sensory given—but it is not, on its own, sufficient.
Recently, we proposed an alternative approach to unsupervised learning [1]. The problem of “making sense” of sequences is formalised as an apperception task. Here, the task is to construct an explicit theory that both explains the sequence and also satisfies a set of unity conditions designed to ensure that the constituents of the theory—the objects, properties, and propositions—are combined together in a relational structure. We developed an implementation, the Apperception Engine, and showed, in a range of experiments, how this system is able to outperform recurrent networks and other baselines on a range of tasks, including Hofstadter's Seek Whence dataset [8].
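To give a flavour of what such a theory looks like, the sketch below uses illustrative Python data structures, not the engine's actual syntax, to describe the kind of unified theory that might explain a sensor reading that alternates between on and off; the formal definitions of theories and unity conditions are given in [1].

```python
# Purely illustrative: a toy "theory" for the alternating reading
# on, off, on, off, ...  (the real representation is defined in [1]).
theory = {
    # Initial conditions: what holds at the first time step.
    "initial": ["on(sensor_1)"],
    # Causal rules: how properties change from one time step to the next.
    "causal_rules": [
        "on(X) >> off(X)",    # if X is on now, it is off next
        "off(X) >> on(X)",    # if X is off now, it is on next
    ],
    # A unity-style constraint tying the concepts together: every object
    # is in exactly one of the two incompatible states at every time step.
    "constraints": ["forall X: exactly_one(on(X), off(X))"],
}
```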
But in our initial implementation, there was one fundamental limitation: we assumed the sensory input was provided in symbolic form. We assumed some other system had already parsed the raw sensory input into a set of discrete categories, so that all the Apperception Engine had to do was receive this already-digested symbolic input, and make sense of it. But what if we don't have access to pre-parsed input? What if our sensory sequence is raw unprocessed information—a sequence of noisy pixel arrays from a video camera, for example?
1.2. Overview
Our central contribution is an approach for unsupervised learning of interpretable symbolic theories from raw unprocessed sensory data. We achieve this through a major extension of the Apperception Engine so that it is able to work from this raw input. This involves two phases. First, we extend the Apperception Engine to receive ambiguous (but still symbolic) input: sequences of sets of disjunctions. Second, we use a neural network to map raw sensory input to disjunctive input. Our binary neural network is encoded as a logic program, so the weights of the network and the rules of the theory can be found jointly by solving a single SAT problem. This way, we are able to simultaneously learn how to perceive (mapping raw sensory information to concepts) and apperceive (combining concepts into rules).
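The sketch below illustrates the joint-solving idea on a deliberately tiny scale. It is our own toy encoding, not the paper's: it assumes the third-party python-sat (pysat) package, a two-input binary neuron whose output is the OR of its weighted inputs, and two hypothetical candidate rules. The point is only that weight variables and rule-choice variables live in one CNF and are fixed by a single SAT call.

```python
# Toy joint perception/apperception encoding (illustrative only).
# Requires the python-sat package: pip install python-sat
from pysat.solvers import Glucose3

W1, W2 = 1, 2          # Boolean weights of a 2-input binary neuron
RULE_A, RULE_B = 3, 4  # choice between two candidate symbolic rules

solver = Glucose3()

# Exactly one candidate rule is part of the theory.
solver.add_clause([RULE_A, RULE_B])
solver.add_clause([-RULE_A, -RULE_B])

# Perceptual constraints: if RULE_A is to explain the observations, the
# concept must fire on raw input (1, 0)  =>  RULE_A implies w1 ...
solver.add_clause([-RULE_A, W1])
# ... and must stay silent on raw input (0, 1)  =>  RULE_A implies not w2.
solver.add_clause([-RULE_A, -W2])
# A later observation that RULE_B cannot explain rules it out entirely.
solver.add_clause([-RULE_B])

if solver.solve():
    model = set(solver.get_model())
    print("weights:", {"w1": W1 in model, "w2": W2 in model})
    print("chosen rule:", "A" if RULE_A in model else "B")
```

In the full system, of course, the network has many weights and the candidate rules are generated by the program synthesis machinery, but the weights and rules are still determined together by one SAT problem.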
We tested our system in three domains. In the first domain, the Apperception Engine learned to solve sequence induction tasks, where the sequence was represented by noisy MNIST images [9]. In the second, it learned the dynamics of Sokoban from a sequence of noisy pixel arrays. In the third, it learned to make sense of sequences of noisy ambiguous data without knowledge of the underlying spatial structure of the generative model.
This system is, to the best of our knowledge, the first that is able to learn explicit, provably correct dynamics of non-trivial games from raw pixel input. We discover that the generic inductive biases embedded in our system suffice to induce these game dynamics from very sparse data: fewer than two dozen game traces. We see this as a step toward machines that can flexibly adapt and even synthesize their own world models [10], [11], starting from raw sub-symbolic input, while organizing and representing those models in a format that humans can comprehend, debug, and verify.
In Section 3, we describe the Apperception Engine as it operates on discrete symbolic input. In Section 4, we extend the system to handle raw unprocessed input. Section 5 shows our experimental results.