Possible Mechanism of Internal Visual Perception: Context-dependent Processing by Predictive Coding and Reservoir Computing Network

Hiroto Tamura; Yuichi Katori; Kazuyuki Aihara

doi:10.2991/jrnal.k.190531.009


Volume 6, Issue 1, June 2019, Pages 42 - 47

Authors

Hiroto Tamura¹,*, Yuichi Katori²,³, Kazuyuki Aihara³

¹Graduate School of Engineering, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8656, Japan

²Future University Hakodate, 116-2 Kamedanakano-cho, Hakodate, Hokkaido 041-8655, Japan

³Institute of Industrial Science, The University of Tokyo, 4-6-1 Komaba, Meguro-ku, Tokyo 153-8505, Japan

*Corresponding author. Email: hiroto0324@sat.t.u-tokyo.ac.jp

Received 12 October 2018, Accepted 28 November 2018, Available Online 25 June 2019.

DOI: 10.2991/jrnal.k.190531.009
Keywords: Visual system; perception; predictive coding; reservoir computing; context; nonlinear dynamics
Abstract: Predictive coding is a widely accepted hypothesis on how our internal visual perceptions are generated. Dynamical predictive coding with reservoir computing (PCRC) models have been proposed, but how they work remains to be clarified. Therefore, we first construct a simple PCRC network and analyze the nonlinear dynamics underlying it. Since the influence of contexts is another important factor in visual perception, we also construct PCRC networks for a context-dependent task and observe their attractor landscapes in each context.
Copyright: © 2019 The Authors. Published by Atlantis Press SARL.
Open Access: This is an open access article distributed under the CC BY-NC 4.0 license (http://creativecommons.org/licenses/by-nc/4.0/).

1. INTRODUCTION

It is widely known that what we see is not the raw visual sensory input. Instead, our brains integrate the sensory inputs and reconstruct an internal image in a form that we can easily understand. For example, although the actual visual input is two-dimensional and received separately by both eyes, we perceive a single three-dimensional image. However, exactly how internal visual perceptions are generated in the visual cortex remains a matter of debate.

Predictive coding is one of the most widely accepted hypotheses on internal perception. In the predictive coding framework, a perceived image is not merely the integrated visual sensory input but the result of a prediction made by an internal generative model. Predictive coding also assumes that the generative model is optimized to minimize the residual error between the prediction and the actual sensory input. In particular, the hierarchical predictive coding model [1] postulates that top-down signals from a higher-order area carry predictions of lower-level neural activities, whereas bottom-up signals from a lower-order area carry the residual errors between the predictions and the actual lower-level activities, so that the ascending signals have much less redundancy.

However, this model is suitable only for static visual inputs and cannot deal with temporally changing visual images, i.e., movies. Fukino et al. [2] then proposed the Predictive Coding with Reservoir Computing (PCRC) model, which implements the generative model with a dynamical reservoir and can predict temporally changing auditory inputs. Furthermore, hierarchical PCRC models for more complex auditory inputs were proposed by Ara and Katori [3,4].

Here, reservoir computing [5] refers to a type of Recurrent Neural Network (RNN) approach with a simple learning strategy. When a reservoir computing network is trained, only the output connections are modified, while the recurrent and feedback connections are fixed at randomly assigned values.

However, precisely how these PCRC models [2–4] work largely remains to be clarified. Moreover, these conventional models cannot perceive unlearned inputs. In addition, they are driven not by the prediction error itself but by the sum of the error and their own prediction, which is equal to the original sensory input.

In this study, therefore, we first modify them and construct a simple one-layer PCRC network exactly driven by the prediction errors, which can perceive even unlearned inputs. Then we analyze the nonlinear dynamics underlying the trained network, in order to clarify the mechanism of the behavior.

The influence of contexts, which refer to situations, goals, and relevant past experiences, is another important factor in visual perception. For example, even identical sensory stimuli can result in very different perceptions depending on the context. Indeed, RNN models for context-dependent tasks have been proposed [6].

Therefore, we also construct a PCRC network for a simple context-dependent perception task. We again analyze the trained network in order to reveal how it perceives the sensory stimuli in each context. We further construct a PCRC network that can perceive higher-dimensional visual inputs, in order to show that the proposed network is a possible mechanism of visual perception. We observe that a mismatch between the context and the type of sensory stimulus induces a perceptual error that exhibits complex visual features.

2. SIMPLE PCRC

In this section, we construct a simple one-layer PCRC network exactly driven by the prediction errors. We also analyze the nonlinear dynamics underlying the trained network to elucidate how it works.

2.1. Network Architecture and Dynamics

We use a leaky-integrator RNN, defined by the equation:

$$\tau \dot{x} = -x + W^{\mathrm{REC}} r + W^{\mathrm{FB}} z + W^{\mathrm{IN}} (d - z), \tag{1}$$

where τ is the membrane time constant, N is the number of neurons, x(t) ≔ (x₁(t), ..., x_N(t))^T ∈ ℝ^N represents the membrane potentials or activities of the neurons at time t ∈ ℝ, and r ≔ (ϕ(x₁), ..., ϕ(x_N))^T ∈ ℝ^N represents the firing rates of the neurons with ϕ(x) ≔ tanh(x). W^REC ∈ ℝ^{N×N} is a random recurrent connectivity matrix whose elements are sampled i.i.d. from 𝒩(0, g²/N) with the parameter g ∈ ℝ. M is the dimension of the input and output. The output of the network z ≔ W^OUT r ∈ ℝ^M represents the prediction and is fed back through the weights W^FB ∈ ℝ^{N×M}, whose elements are independently and uniformly sampled from [−1, 1]. The residual error between the target (or sensory input) d(t) ∈ ℝ^M and the prediction z(t) is fed through the weights W^IN ∈ ℝ^{N×M}, whose elements are also independently and uniformly sampled from [−1, 1]. The output weights W^OUT ∈ ℝ^{M×N} are initially set to zero and are modified during training. Note that the last term of Equation (1), W^IN (d − z), is unique to the PCRC network. This network architecture is illustrated in Figure 1, though contexts are ignored in this section.

In order to simulate these dynamics numerically, we introduce the discrete-time version of Equation (1), derived by the Euler method:

$$x(n+1) = x(n) + \frac{\delta}{\tau}\left[-x(n) + W^{\mathrm{REC}} r(n) + W^{\mathrm{FB}} z(n) + W^{\mathrm{IN}} \bigl(d(n) - z(n)\bigr)\right], \tag{2}$$

where n ∈ ℤ is the discrete time step, δ is the small time interval, and other notations follow Equation (1).

Throughout this paper, we use N = 1000, g = 1.2, τ = 100 ms, and δ = 10. In this section, we use M = 2 for visibility of the dynamics.
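As a rough illustration of Equation (2), the Euler update can be sketched in NumPy as follows. This is a minimal sketch and not the authors' implementation: the function names `make_network` and `euler_step` are hypothetical, δ is assumed to be given in milliseconds like τ, and the default N is smaller than the N = 1000 used in this paper to keep the example light.

```python
import numpy as np


def make_network(N=200, M=2, g=1.2, seed=0):
    """Draw the fixed random weights described for Equation (1)."""
    rng = np.random.default_rng(seed)
    # Recurrent weights: i.i.d. Gaussian with variance g^2 / N.
    W_rec = rng.normal(0.0, g / np.sqrt(N), size=(N, N))
    # Feedback and input weights: uniform on [-1, 1].
    W_fb = rng.uniform(-1.0, 1.0, size=(N, M))
    W_in = rng.uniform(-1.0, 1.0, size=(N, M))
    # Output weights start at zero and are the only trained weights.
    W_out = np.zeros((M, N))
    return W_rec, W_fb, W_in, W_out


def euler_step(x, d, W_rec, W_fb, W_in, W_out, tau=100.0, delta=10.0):
    """One step of Equation (2); tau and delta are assumed to be in ms."""
    r = np.tanh(x)          # firing rates
    z = W_out @ r           # prediction (network output)
    dx = -x + W_rec @ r + W_fb @ z + W_in @ (d - z)
    return x + (delta / tau) * dx, z
```

Under the assumed units, one 0.2 s presentation of a constant input corresponds to 20 calls of `euler_step` with the same `d`.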

2.2. Task

We present the constant vectors d¹, ..., d^N_D ∈ ℝ^M in turn as the sensory inputs to the network, where N_D is the number of trials. In the i-th trial, the network is trained to keep outputting the target dⁱ until the next target dⁱ⁺¹ is presented. Each sensory input dⁱ is presented for 0.2 s, and its elements are independently and uniformly sampled from [1, 2].

Since the network actually receives the prediction error dⁱ − z(t) as its input, it is required to decode this error into the original sensory input dⁱ. This corresponds to the framework of predictive coding.

2.3. Learning Rule

We train W^OUT with the First-Order Reduced and Controlled Error (FORCE) learning algorithm [7], which is based on the recursive least-squares filter. Its update rule is:

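$$e(t) := W^{\mathrm{OUT}}(t - \Delta t)\, r(t) - d(t), \tag{3}$$

$$s(t) := P(t - \Delta t)\, r(t) \left(1 + r^{T}(t)\, P(t - \Delta t)\, r(t)\right)^{-1}, \tag{4}$$

$$P(t) = P(t - \Delta t) - s(t)\, r^{T}(t)\, P(t - \Delta t), \tag{5}$$

$$W^{\mathrm{OUT}}(t) = W^{\mathrm{OUT}}(t - \Delta t) - e(t)\, s^{T}(t), \tag{6}$$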

where the initial value for P(t) ∈ ℝ^{N×N} is given by

$$P(0) = \frac{1}{\alpha} I, \quad (\alpha \in \mathbb{R}). \tag{7}$$

In this algorithm, the inverse of P(t) is a running estimate of the autocorrelation matrix of the firing rates r(t) plus a regularization term:

$$P^{-1}(t) = \int r(t')\, r^{T}(t')\, dt' + \alpha I. \tag{8}$$

Throughout this paper, we use α = 0.02 and Δt = δ.
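The update rule in Equations (3)–(6) can be sketched as a single NumPy function. This is an illustrative sketch only: the name `force_update` and its argument order are hypothetical, and the discrete sum of outer products stands in for the integral in Equation (8).

```python
import numpy as np


def force_update(W_out, P, r, d):
    """One FORCE (recursive least-squares) step.

    W_out: (M, N) output weights, P: (N, N) running inverse-correlation
    estimate, r: (N,) firing rates, d: (M,) target.
    """
    e = W_out @ r - d                    # Eq. (3): error before the update
    Pr = P @ r
    s = Pr / (1.0 + r @ Pr)              # Eq. (4): gain vector
    P_new = P - np.outer(s, r @ P)       # Eq. (5): s r^T P
    W_new = W_out - np.outer(e, s)       # Eq. (6): e s^T
    return W_new, P_new, e
```

Starting from `P = np.eye(N) / alpha` as in Equation (7), repeated calls keep the inverse of `P` equal to the accumulated sum of `np.outer(r, r)` plus `alpha * np.eye(N)`, which is the discrete counterpart of Equation (8).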

2.4. Results and Analysis

We trained the network for 1000 trials (i.e., N_D = 1000). In each trial of the test phase, the sensory input dⁱ is presented for 5.0 s. As shown in Figure 2, the training resulted in almost perfect performance. Figure 2 also shows that at the beginning of the i-th trial, the prediction error dⁱ − z(t) is fed to the network as a sharp pulse, but it immediately decays to zero and the network settles into a fixed point x̄ⁱ where z(t) ≡ dⁱ.

In order to reveal the underlying mechanism of this behavior, we analyze the nonlinear dynamics of the trained network. In what follows, because of its pulse-like behavior, we regard the term W^IN (d − z) as an external force and separate it from the network's own dynamics; i.e., we here analyze the dynamics:

$$\tau \dot{x} = -x + W^{\mathrm{REC}} r + W^{\mathrm{FB}} z. \tag{9}$$

Following the approach of Sussillo and Barak [8], we define the scalar function q(x) ≔ |ẋ|²/2, which is close to zero if x is an approximate fixed point, or slow point. Figure 3a shows that almost all the q values at the end of trials are very low, and that the corresponding slow points are located on a 2D-manifold in the phase space. Figure 3a also shows that at the beginning of the i-th trial, the pulse-like prediction error dⁱ − z(t) drives the trajectory out of the 2D-manifold, but in the subsequent relaxation phase the trajectory is attracted back to the 2D-manifold, and the projection of the trajectory onto the manifold corresponds to the total movement dⁱ − dⁱ⁻¹.

Furthermore, by analyzing the linearized system around each slow point on the 2D-manifold, we uncover the stability of this manifold. Linearizing Equation (9) around the slow point x̄ (q(x̄) ≃ 0), we obtain the dynamics of the perturbation δx ≔ x − x̄:

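$$\dot{\delta x} = \frac{1}{\tau}\left[-I + \left(W^{\mathrm{REC}} + W^{\mathrm{FB}} W^{\mathrm{OUT}}\right) R'(\bar{x})\right] \delta x, \tag{10}$$

where R′_ij(x̄) ≔ δ_ij ϕ′(x̄_i), with δ_ij the Kronecker delta.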

For almost all the slow points, the linearized systems around them have only eigenvalues with negative real parts, as shown in Figure 3b. This suggests that almost all the slow points are locally stable, and that the 2D-manifold composed of them attracts any trajectory in its vicinity. Nevertheless, this manifold attractor is not perfectly continuous, and there is a slow flow on it. The trajectory on the manifold is therefore attracted to a specific slow point on the manifold where the output z(t) is close to but not equal to the target dⁱ, which leads to the small prediction error shown in Figure 2.
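The quantities used in this analysis, namely q(x) for the autonomous dynamics of Equation (9) and the Jacobian of Equation (10), can be sketched as follows. The function names are hypothetical and the code is an illustrative sketch rather than the analysis code used for Figure 3; in particular, it does not show how candidate slow points are collected.

```python
import numpy as np


def autonomous_velocity(x, W_rec, W_fb, W_out, tau=100.0):
    """Right-hand side of Equation (9) divided by tau, with z = W_out r."""
    r = np.tanh(x)
    return (-x + (W_rec + W_fb @ W_out) @ r) / tau


def q_value(x, W_rec, W_fb, W_out, tau=100.0):
    """q(x) = |x_dot|^2 / 2, which is close to zero at a slow point."""
    v = autonomous_velocity(x, W_rec, W_fb, W_out, tau)
    return 0.5 * float(v @ v)


def jacobian(x, W_rec, W_fb, W_out, tau=100.0):
    """Matrix of Equation (10): (1/tau) [-I + (W_rec + W_fb W_out) R'(x)]."""
    W_eff = W_rec + W_fb @ W_out
    phi_prime = 1.0 - np.tanh(x) ** 2    # derivative of tanh
    # Right-multiplying by the diagonal matrix R' scales the columns of W_eff.
    return (-np.eye(x.size) + W_eff * phi_prime[np.newaxis, :]) / tau


def is_locally_stable(x, W_rec, W_fb, W_out, tau=100.0):
    """True if every eigenvalue of the Jacobian has a negative real part."""
    eigvals = np.linalg.eigvals(jacobian(x, W_rec, W_fb, W_out, tau))
    return bool(np.max(eigvals.real) < 0.0)
```

A point x would be treated as a slow point when `q_value` is small, and as locally stable when `is_locally_stable` returns True; any threshold on q is a choice left to the user.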

Up to this point, we have shown the performance of the trained network only on discontinuously changing sensory inputs. Here we show that the same network can also perceive continuously changing sensory inputs. For example, Figure 4a shows that the network output succeeds in following a sinusoidal input. In this case, the network trajectory x(t) keeps travelling around the 2D-manifold in the phase space, as shown in Figure 4b. This behavior results from the balance between the attracting force of the 2D-manifold and the driving force of the prediction error d(t) − z(t). Even in the general case, the same mechanism enables the network to perceive continuously changing inputs.

Throughout this section, we have shown the case of M = 2 for simplicity, but the same scenario holds for the case of general M.

3. CONTEXT-DEPENDENT PCRC

In this section, we construct a simple PCRC network for the context-dependent perception task. We also analyze the trained network to elucidate how it switches the processing depending on contexts.

3.1. Network Architecture and Dynamics

We add to Equation (1) the term of the context signal from external modules:

$$\tau \dot{x} = -x + W^{\mathrm{REC}} r + W^{\mathrm{FB}} z + W^{\mathrm{IN}} (d - z) + W^{\mathrm{CON}} c, \tag{11}$$

where the context c(t) ∈ ℝ^L is fed through the weights W^CON ∈ ℝ^{N×L}, whose elements are independently and uniformly sampled from [−1, 1]. In this section we use M = 4 and L = 2 for simplicity, and the other settings follow Section 2.

3.2. Task and Learning Rule

We present the constant vectors d¹, …, d^N_D ∈ ℝ^M in turn as the sensory inputs to the network, and train the network to keep outputting each given constant vector until the next target is presented. We also present the context c¹ ≔ (0, 1)^T during trials 1 to N_D/2, and the context c² ≔ (1, 0)^T during trials N_D/2 + 1 to N_D. Each sensory input dⁱ is presented for 0.2 s, and its elements are given as dⁱ = (d₁ⁱ, 1/d₁ⁱ, d₂ⁱ, 1/d₂ⁱ)^T in the context c¹ and dⁱ = (d₁ⁱ, d₂ⁱ, d₂ⁱ/2, d₁ⁱ/2)^T in the context c². In each trial, d₁ⁱ and d₂ⁱ are independently and uniformly sampled from [1, 2]. Note that the essential dimension of the input is M/2 in each context, but the trained network is required to switch the type of processing depending on the context.
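To make the task concrete, the construction of one (input, context) pair can be sketched as follows. The helper names `context_for_trial` and `sample_trial` are hypothetical and the code only illustrates the definitions above, assuming 1-indexed trials and an even N_D.

```python
import numpy as np


def context_for_trial(i, n_trials):
    """Context index (1 or 2) for the 1-indexed trial i out of n_trials."""
    return 1 if i <= n_trials // 2 else 2


def sample_trial(context_id, rng):
    """Draw one sensory input d (M = 4) and its context vector c (L = 2)."""
    d1, d2 = rng.uniform(1.0, 2.0, size=2)
    if context_id == 1:
        d = np.array([d1, 1.0 / d1, d2, 1.0 / d2])
        c = np.array([0.0, 1.0])     # c^1 = (0, 1)^T
    else:
        d = np.array([d1, d2, d2 / 2.0, d1 / 2.0])
        c = np.array([1.0, 0.0])     # c^2 = (1, 0)^T
    return d, c
```

In both contexts the four components of `d` are determined by the two sampled values, which is why the essential input dimension is M/2.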

We train W^OUT with the FORCE learning algorithm used in Section 2.

3.3. Results and Analysis

We trained the network for 1000 trials in each of the contexts c¹ and c² (i.e., N_D = 2000). In each trial of the test phase, the sensory input dⁱ is presented for 1.0 s. As shown in Figure 5a, the training resulted in almost perfect performance, and the pulse-like prediction error drives the network from one slow point to another, as in the context-free case in Section 2. Figure 5b shows that two different 2D-manifold attractors are formed for the contexts c¹ and c², respectively. This suggests that the network switches its attractor landscape depending on the context, and that the same mechanism as in Section 2 enables the network to perceive the sensory inputs in each context.

We next evaluate the performance of the trained network when the type of the sensory input (dⁱ = (d₁ⁱ, 1/d₁ⁱ, d₂ⁱ, 1/d₂ⁱ)^T or dⁱ = (d₁ⁱ, d₂ⁱ, d₂ⁱ/2, d₁ⁱ/2)^T) does not match the context. In this case, we present each sensory input dⁱ for 5.0 s. Figure 6 shows that the context mismatch keeps the prediction errors away from zero, so that the network fails to perceive the sensory inputs; nevertheless, the network trajectories sometimes settle into slow points.

4. CONTEXT-DEPENDENT PCRC FOR VISUAL DATA

In this section, we construct a context-dependent PCRC network that can perceive higher-dimensional visual inputs, in order to demonstrate that the proposed network is a possible mechanism of visual perception. We also observe the complex features of the perceptual error induced by the context mismatch.

4.1. Network Architecture, Task, and Learning Rules

The network architecture and settings follow those of Section 3, except the dimension of the input and output: M = 20.

We use the Modified National Institute of Standards and Technology (MNIST) data set, which is widely used for handwritten numeral recognition tasks, as the high-dimensional visual sensory stimuli. As preprocessing, we first compress the MNIST data whose labels are “0” or “1” into 20 dimensions, using Non-negative Matrix Factorization (NMF). We next randomly choose one of the compressed MNIST samples as the sensory input dⁱ and present it to the network for 0.2 s in the i-th trial. At the same time, we present the context c¹ if dⁱ has the “0” label and the context c² if dⁱ has the “1” label (i.e., each context represents the category of the visual sensory input). We train the network to keep outputting the presented sensory input dⁱ during each trial. In the test phase, we present unlearned compressed MNIST data to the network as the sensory inputs. The trained network is required to form slow points in its phase space that correspond even to unlearned MNIST data.
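The compression step can be illustrated with a small NMF routine. The NMF algorithm used in this study is not specified here, so the sketch below assumes the standard multiplicative-update rule for the Frobenius objective as a stand-in; the function names `nmf` and `decode` are hypothetical, and any non-negative data matrix (one sample per row) can take the place of the MNIST images.

```python
import numpy as np


def nmf(V, k, n_iter=200, seed=0, eps=1e-9):
    """Factorize a non-negative matrix V (samples x pixels) as V ~ W @ H.

    Rows of W are the k-dimensional codes; rows of H are the basis images.
    Uses multiplicative updates (an assumed stand-in algorithm).
    """
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.uniform(0.1, 1.0, size=(n, k))
    H = rng.uniform(0.1, 1.0, size=(k, m))
    for _ in range(n_iter):
        # eps guards against division by zero and keeps the factors finite.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


def decode(codes, H):
    """Map k-dimensional codes (or network outputs z) back to pixel space."""
    return codes @ H
```

With k = 20, each row of `W` would play the role of a compressed sensory input dⁱ, and `decode` corresponds to transforming a network output back into the dimension of the original images with the matrix obtained from NMF.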

We use the FORCE algorithm again during training.

4.2. Results and Analysis

We trained the network for 2000 trials in each of the contexts c¹ and c² (i.e., N_D = 4000). In each trial of the test phase, we present a randomly chosen unlearned MNIST sample for 5.0 s as the sensory input dⁱ. Figure 7a shows that the trained network almost succeeded in perceiving unlearned MNIST inputs, and that the pulse-like prediction error drives the network from one slow point to another, as in the cases above. Figure 7b shows that two different manifold attractors are formed for the “0”-label and “1”-label MNIST inputs, respectively, although the actual shapes of these manifolds cannot be observed in the 3D PCA space.

We next evaluate the performance of the trained network when the label of the sensory input does not match the context. As shown in Figure 8, the context mismatch keeps the prediction errors away from zero, but the network trajectories nevertheless settle into slow points.

We further visualize these erroneous predictions z at slow points and compare them with the original inputs d, by inversely transforming the output z into the dimension of the original MNIST data using the matrix generated by NMF. As a result, in the erroneous predictions, the original sensory inputs and the predictions for the wrong-label MNIST image overlap each other, as illustrated in Figure 9.

5. DISCUSSION

We first proposed a simple one-layer PCRC network driven by the prediction error, which can perceive even unlearned inputs. We analyzed the nonlinear dynamics underlying the trained network and revealed that the network perceives the sensory stimuli using a low-dimensional manifold attractor in its phase space. Since low-dimensional manifold attractors have also been observed in trained RNNs in previous studies [6,8,9], using them for information processing may be a natural strategy for RNNs.

Next, we constructed the simple PCRC network for the context-dependent task, and observed that the different attractor-landscape is formed on each context. Throughout this study, we used the PCRC networks with only one layer and assumed the context signals to be fed from the external module, for simplicity. However, the hierarchy plays a key role in the predictive coding framework [1], and how the context signals are generated remains to be clarified. Therefore, it is our future work to build the hierarchical PCRC model composed of the one-layer networks which are analyzed in this study, and incorporate the modules which generate the context signals inside the model.

Finally, we constructed the context-dependent PCRC network for the compressed MNIST data task and demonstrated that the proposed network is a possible mechanism of visual perception. The perceptual errors induced by the context mismatch exhibited complex features and, interestingly, share some common features with the symptoms of hallucinations in dementia with Lewy bodies [10], in which patients see people who are not there against a background that actually exists. Studying the relation between these perceptual errors and such hallucinations is also left for future work.

CONFLICTS OF INTEREST

There are no conflicts of interest.

ACKNOWLEDGMENTS

The authors are grateful to T. Kohno for useful discussions. This paper is partially supported by AMED under Grant Number JP18dm0307009, NEC Corporation, and JSPS KAKENHI Grant Number JP16K00246, and also based on results obtained from a project subsidized by the New Energy and Industrial Technology Development Organization (NEDO).

Authors Introduction

Mr. Hiroto Tamura

He received the B.E. degree in applied mathematics in 2017 from Waseda University, Japan, and the M.E. degree in electronic engineering in 2019 from the University of Tokyo, Japan. Currently, he is a PhD candidate at the Graduate School of Engineering, the University of Tokyo. His research interests include computational neuroscience and nonlinear dynamics.

Dr. Yuichi Katori

He received the PhD degree in science in 2007 from the University of Tokyo, Japan. Currently, he is an Associate Professor at Future University Hakodate. His research interests include mathematical modeling of complex systems, computational neuroscience, and brain-like artificial intelligence systems.

Dr. Kazuyuki Aihara

He received the B.E. degree in electrical engineering in 1977 and the PhD degree in electronic engineering in 1982 from the University of Tokyo, Japan. Currently, he is a Professor at the Institute of Industrial Science, the Graduate School of Information Science and Technology, and the Graduate School of Engineering, the University of Tokyo. His research interests include mathematical modeling of complex systems, parallel distributed processing with spatio-temporal neurodynamics, and time series analysis of complex data.

REFERENCES

[1] RPN Rao and DH Ballard, Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects, Nat. Neurosci., Vol. 2, 1999, pp. 79-87.

[2] M Fukino, Y Katori, and K Aihara, A computational model for pitch pattern perception with the echo state network, in 2016 International Symposium on Nonlinear Theory and its Applications (Yugawara, Japan, 2016), pp. 271-274.

[3] K Ara and Y Katori, Hierarchical network model of auditory information processing using dynamical predictive coding and non-negative matrix factorization, in 23rd International Conference on Artificial Life and Robotics (Beppu, Japan, 2018), pp. 41-46.

[4] Y Katori, Network model for dynamics of perception with reservoir computing and predictive coding, Advances in Cognitive Neurodynamics (VI), Chapter 11, Springer Nature Singapore Pvt., Ltd., Singapore, 2018, pp. 89-95.

[5] M Lukoševičius and H Jaeger, Reservoir computing approaches to recurrent neural network training, Comput. Sci. Rev., Vol. 3, 2009, pp. 127-149.

[6] V Mante, D Sussillo, KV Shenoy, and WT Newsome, Context-dependent computation by recurrent dynamics in prefrontal cortex, Nature, Vol. 503, 2013, pp. 78-84.

[7] D Sussillo and LF Abbott, Generating coherent patterns of activity from chaotic neural networks, Neuron, Vol. 63, 2009, pp. 544-557.

[8] D Sussillo and O Barak, Opening the black box: low-dimensional dynamics in high-dimensional recurrent neural networks, Neural Comput., Vol. 25, 2013, pp. 626-649.

[9] CL Beer and O Barak, Dynamics of dynamics: following the formation of a line attractor, 2018. arXiv:1805.09603

[10] D Collerton, JP Taylor, I Tsuda, H Fujii, S Nara, K Aihara, et al., How can we see things that are not there? Current insights into complex visual hallucinations, J. Consciousness Stud., Vol. 23, 2016, pp. 195-227.

Journal: Journal of Robotics, Networking and Artificial Life
Volume-Issue: 6 - 1
Pages: 42 - 47
Publication Date: 2019/06/25
ISSN (Online): 2352-6386
ISSN (Print): 2405-9021
