Neural Implementation of Causal Inference

Master of Science Thesis in Biomedical Engineering

MASIH RAHMATI

Department of Signals and Systems
Division of Biomedical Engineering
CHALMERS UNIVERSITY OF TECHNOLOGY
Göteborg, Sweden, 2011
Report No. EX096/2011

REPORT NO. EX096/2011

Neural Implementation of Causal Inference
A neural model of the causal inference task using Probabilistic Population Codes (PPC)

Masih Rahmati

Supervisor: Dr. Wei Ji Ma, Department of Neuroscience, Baylor College of Medicine, Houston, United States, 2011
Examiner: Dr. Yngve Hamnerius, Department of Signals and Systems, Chalmers University of Technology, Gothenburg, Sweden, 2011

Acknowledgment

I owe my deepest gratitude to my supervisor, Dr. Wei Ji Ma (Department of Neuroscience, Baylor College of Medicine, Houston, Texas, USA). This thesis would not have been possible without his kind guidance, patience and support from the initial to the final stage. I also wish to thank Dr. Jeff Beck (Gatsby Computational Neuroscience Unit, University College London, London, UK) for his great suggestions during this work. Finally, I thank Dr. Yngve Hamnerius (Department of Signals and Systems, Chalmers University of Technology, Gothenburg, Sweden) as my examiner and advisor at Chalmers.

Contents

Introduction
Chapter 1: Background
  1.1. Behavioral models of cue combination
  1.2. Causal inference
  1.3. Neural Models
    Probabilistic Population Coding
    Stimulus, preferred stimulus, tuning curve
    Variability
    Encoding
    Decoding
  1.4. Cue Combination using PPC
Chapter 2: Problem Statement
  2.1. Neural implementation of Causal Inference
Chapter 3: Methods
  3.1. Stochastic Gradient Descent
  3.2. VBLR
  3.3. Test criteria
Chapter 4: Simulation Results
  4.1. Training
  4.2. Test
  4.3. Results
    4.3.1. Linear and LDN
    4.3.2. Quadratic
    4.3.3. QDN
  4.4. Discussion and conclusion
References
Appendix A

Introduction

Imagine that you are walking in the jungle. Suddenly you see a movement in the bushes and simultaneously hear a sound. At that moment you should take different possibilities into account. You may think that the sound you heard and the movement had different sources (for example, the sound came from your stepping on a piece of wood and the wind just moved the bushes). On the other hand, you may conclude from the two clues that there is one source for the sound and the movement (for example, an animal is behind the bushes and about to attack).

There are many different tasks in which the brain has to infer whether a set of stimuli originate from a common source or from different sources. The task of inferring the number of sources of a set of stimuli is a form of causal inference. In this project we try to address how the brain performs the causal inference task using neurons. More precisely, we want to investigate how the brain combines two neural population activities; in our case we consider visual and auditory neural populations, for example but not necessarily located in primary visual and auditory cortices (V1 and A1). Our approach is based on the theory that neural populations encode probability distributions over stimuli. Specifically, we use probabilistic population codes [1] to implement the causal inference model proposed by [2]. In [1], Ma et al.
proposed a method in which the brain combines different pieces of information encoded as probability distributions using a Bayesian approach; this is called Probabilistic Population Coding (PPC). In [2], Kording et al. suggest a Bayesian behavioral model for causal inference and stimulus estimation. We use the PPC framework to design and implement a neural structure capable of performing causal inference according to the model presented by Kording et al.

Briefly, we suppose that there is a parameter C in the outside world which determines whether there is one source, C = 1, or two different sources, C = 2, for the visual and auditory stimuli. We then expect our neural structure to be capable of calculating the probability of each of these cases given the visual and auditory neural population activities, \(P(C=1 \mid \mathbf{r}_A, \mathbf{r}_V)\) and \(P(C=2 \mid \mathbf{r}_A, \mathbf{r}_V)\), where \(\mathbf{r}_A\) and \(\mathbf{r}_V\) are the auditory and visual neural population activities, respectively.

There are several challenges we have to deal with in implementing this model. One is that the network should be able to do the task regardless of the reliabilities of the two inputs. Going back to our jungle example, different environmental situations can result in different input reliabilities. Visual reliability could be high or low depending on lighting conditions, and auditory reliability could be high or low depending on the level of background noise.

To measure the performance of our network, we compare its results with the results of the behavioral model proposed by Kording et al. when fed with the same test input data. Chapter 1 reviews previous work on this problem. Chapter 2 contains a detailed statement of the problem. Chapter 3 describes the methods we used, and the simulation results are presented in Chapter 4.

Chapter 1: Background

In many ecologically important situations, the brain combines multiple pieces of information (cues) from one or multiple sensory modalities. For example, when trying to understand what someone is saying, one combines auditory information with visual information about the movement of the lips. To judge the depth of a surface, one might combine visual and tactile information. In such cue combination, the way in which the brain combines these pieces of information is of interest. It is not always a good idea to combine, since the cues might have been produced by different sources. Evaluating the probability that two cues have the same source, and using this information to judge the underlying stimulus or stimuli, is known as causal inference. The goal of this thesis is to investigate the neural basis of an optimal model of cue combination in this general setting.

We will study cue combination at a behavioral and at a neural level. At the behavioral level, we abstract the internal representation of a stimulus without referring to neurons. Yet it is possible to write down a model that describes how a probabilistically optimal observer would solve the task. The predictions of this model can then be compared with behavioral experiments in humans, as has been done in previous studies. At the neural level, the goal is to implement the computations of the optimal behavioral model using a biologically plausible model of neural populations. In this chapter we review previous studies of cue combination at both the behavioral and neural levels.

1.1. Behavioral models of cue combination

During the last decades, many researchers have focused on probabilistic interpretations of cue combination.
For example, one study investigated the integration of visual and proprioceptive position information [3]. The authors concluded that hand position is better predicted by a model that integrates visual and proprioceptive information than by a model that depends on only one cue. They also showed that the weight of each cue in the integration depends on the precision of the corresponding unimodal information. Another study considered the combination of texture and motion cues in depth perception [4]. There, two optimal cue integration models were used to analyze subjects' behavioral data from a task in which subjects were asked to adjust the height of an ellipse until it matched the depth of a simulated cylinder defined by texture and motion cues. The generative model of the cue integration task in general is shown in figure 1.1, where \(x_i\) is the internal representation of the \(i\)-th cue in the brain.

Figure 1.1 Generative model of cue integration for multiple cues from one source.

A generative model is a useful tool to visualize the structure of perceptual tasks. Using generative models, we can show the links between the different variables of the model and the probability distribution over each parameter. It also helps us to calculate joint and conditional probabilities of the different variables of the network. In our generative model there are two kinds of parameters. First, a "state-of-the-world variable", which can be a location, an orientation, etc. In cue combination, it is the variable in the outside world that generates the cues, \(s\) in figure 1.1. Second, an "observation", which is a noisy representation of the state-of-the-world variable. In cue combination, an observation is the internal representation of a cue inside the brain, \(x_i\) in figure 1.1. The generative model then shows how the observations arise stochastically from the state-of-the-world variable. In figure 1.1, each node represents a random variable and each arrow represents a conditional distribution.

When an observer knows the functional forms of \(p(x_1 \mid s)\) and \(p(x_2 \mid s)\), he can use Bayes' rule to calculate \(p(s \mid x_1, x_2)\). Mathematically:

\[
P(s \mid x_1, x_2) = \frac{P(x_1, x_2 \mid s)\, P(s)}{P(x_1, x_2)} \tag{1.1}
\]

Here, \(P(x_1, x_2)\) is the normalization factor which ensures that \(P(s \mid x_1, x_2)\) is a probability distribution and integrates to one. Therefore:

\[
P(s \mid x_1, x_2) \propto P(x_1, x_2 \mid s)\, P(s) \tag{1.2}
\]

where \(P(s \mid x_1, x_2)\) is the posterior and \(P(s)\) is the prior, the probability the observer assigns to \(s\) without any information about the current trial. \(P(x_1, x_2 \mid s)\) is the likelihood function over \(s\), which modifies the prior using information from the current trial. Intuitively, Bayes' rule involves moving "against the direction of the arrows" in the generative model. In cue integration, we consider a uniform prior over \(s\).

As \(x_1\) and \(x_2\) are internal representations of two cues originating from a common stimulus, they are not independent. However, for a given \(s\) they are conditionally independent, which means:

\[
P(x_1, x_2 \mid s) = P(x_1 \mid s)\, P(x_2 \mid s) \tag{1.3}
\]

Therefore we can write equation (1.2) as:

\[
P(s \mid x_1, x_2) \propto P(x_1 \mid s)\, P(x_2 \mid s) \tag{1.4}
\]

We assume that \(P(x_i \mid s)\) has a Gaussian form, \(P(x_i \mid s) = \mathcal{N}(x_i;\, s, \sigma_i^2)\), a normal distribution with mean \(s\) and variance \(\sigma_i^2\).
For a given \(s\):

\[
P(x_i \mid s) = \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\!\left(-\frac{(x_i - s)^2}{2\sigma_i^2}\right) \tag{1.5}
\]

From equation (1.4) we can write:

\[
P(s \mid x_1, x_2) \propto \exp\!\left(-\frac{(x_1 - s)^2}{2\sigma_1^2}\right) \exp\!\left(-\frac{(x_2 - s)^2}{2\sigma_2^2}\right) \propto \exp\!\left(-\frac{(s - \mu)^2}{2\sigma^2}\right) \tag{1.6}
\]

with:

\[
\mu = \frac{\dfrac{x_1}{\sigma_1^2} + \dfrac{x_2}{\sigma_2^2}}{\dfrac{1}{\sigma_1^2} + \dfrac{1}{\sigma_2^2}}, \qquad \sigma^2 = \frac{1}{\dfrac{1}{\sigma_1^2} + \dfrac{1}{\sigma_2^2}} \tag{1.7}
\]

Therefore \(P(s \mid x_1, x_2)\) is a Gaussian with the mean and variance given in equation (1.7), so there is a specific \(s\), equal to \(\mu\), that maximizes \(P(s \mid x_1, x_2)\). Since the likelihood is also Gaussian, the same \(s = \mu\) maximizes \(P(x_1, x_2 \mid s)\); the proportionality of posterior and likelihood in equation (1.1), with a flat prior, implies that this \(s\) maximizes the posterior too.

From equation (1.7) we can calculate the value of \(s\) which maximizes the probability of \(s\) given the data on a single trial, \(x_1\) and \(x_2\). It is called the maximum-a-posteriori (MAP) estimate of \(s\). We can write this best estimate as:

\[
\hat{s} = w_1 x_1 + w_2 x_2 \tag{1.8}
\]

where:

\[
w_1 = \frac{1/\sigma_1^2}{1/\sigma_1^2 + 1/\sigma_2^2}, \qquad w_2 = \frac{1/\sigma_2^2}{1/\sigma_1^2 + 1/\sigma_2^2} \tag{1.9}
\]

In summary, we supposed that there is one stimulus \(s\) which has two internal representations \(x_1\) and \(x_2\) in the brain. On each trial, for given \(x_1\) and \(x_2\), a maximum-likelihood estimator computes the best estimate of the stimulus, \(\hat{s}\) in equation (1.8). This is the idea that many researchers [5-7] have used to study the cue integration task.

In [4], Ernst and Banks studied an experiment to investigate the integration of visual and haptic information. They measured the variances of visual and haptic estimates of height and used these variances in maximum-likelihood estimation (MLE). In such experiments, the experimenter presents two stimuli \(s_1\) and \(s_2\) — a visual and a haptic stimulus, respectively — which are slightly different, but in a way that the observer still believes there is a single underlying stimulus.

If the experimenter presents the two stimuli \(s_1\) and \(s_2\) repeatedly over many trials, the MLE will follow a distribution \(P(\hat{s} \mid s_1, s_2)\). The two random variables \(x_1\) and \(x_2\) are the internal representations of \(s_1\) and \(s_2\); therefore the means of \(x_1\) and \(x_2\) are \(s_1\) and \(s_2\). We can then calculate the mean and variance of the distribution of \(\hat{s}\) from equation (1.8):

\[
\langle \hat{s} \rangle = w_1 s_1 + w_2 s_2, \qquad \operatorname{var}(\hat{s}) = \frac{\sigma_1^2 \sigma_2^2}{\sigma_1^2 + \sigma_2^2} \tag{1.10}
\]

If there is just one stimulus \(s\), i.e. \(s_1 = s_2 = s\), equation (1.10) reduces to \(\langle \hat{s} \rangle = s\). This is different from equation (1.8), since that equation calculates the best estimate on a single trial while equation (1.10) gives the mean of the best guess over many trials.

Equation (1.9) says that the weight of each cue is inversely proportional to the variance of its likelihood. In other words, the more reliable a cue is, the larger the weight the brain gives to it in cue integration. In their study, Ernst and Banks showed that this optimal model behaves very similarly to humans in a visual-haptic task.

In another study [5], Alais and Burr investigated how the brain integrates visual and auditory cues when watching a ventriloquist playing a puppet. They designed an experiment in which subjects were asked to localize short light blobs or sound clicks. They showed that in this case the brain combines auditory and visual stimuli using a weighted sum of the stimuli, where the weight of each stimulus is proportional to its reliability. Their model also follows equations (1.1) to (1.10), with visual and auditory cues as \(x_1\) and \(x_2\).
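To make equations (1.7)–(1.9) concrete, here is a minimal numerical sketch (Python/NumPy; the function and variable names are ours, not taken from the studies above). It assumes Gaussian likelihoods and a flat prior:

```python
import numpy as np

rng = np.random.default_rng(0)

def combine_cues(x1, x2, sigma1, sigma2):
    """Reliability-weighted cue combination, equations (1.7)-(1.9).
    Returns the MAP estimate s_hat and the posterior variance."""
    w1 = (1 / sigma1**2) / (1 / sigma1**2 + 1 / sigma2**2)
    w2 = 1 - w1
    s_hat = w1 * x1 + w2 * x2
    var_post = 1 / (1 / sigma1**2 + 1 / sigma2**2)
    return s_hat, var_post

# One simulated trial: a reliable visual cue and an unreliable auditory cue
s_true = 5.0
sigma_v, sigma_a = 1.0, 3.0
x_v = s_true + sigma_v * rng.standard_normal()  # internal representation x_1
x_a = s_true + sigma_a * rng.standard_normal()  # internal representation x_2
s_hat, var_post = combine_cues(x_v, x_a, sigma_v, sigma_a)
# s_hat lies closer to x_v, because the more reliable visual cue carries more weight
```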
1.2. Causal inference

In the cue combination studies discussed above, it is assumed that there is a common source for the two perceived stimuli. But in the real world this is not always the case. In many situations there are two different sources for two stimuli, and the brain should first infer whether there is one source or two, and then integrate the cues only if they have a common cause. For example, in the case of the ventriloquist, if the performer is not skilled enough, the brain can infer that the sources of the sound and the movement are not the same, and so it does not integrate the visual and auditory stimuli. Conversely, if the ventriloquist performs well enough, the brain infers one cause for the visual and auditory stimuli and integrates them. In psychophysics experiments, it has been found that there is considerable interaction between perceptual unification, localization bias, and subjects' uncertainty in localization estimation [8].

Wallace et al. showed in their study [8] that as the disparity between auditory and visual stimuli increases, the localization bias increases. Their results also show that increasing spatial disparity results in a lower percentage of unity reports. In all trials in which subjects reported spatially unified stimuli, variability was significantly lower; that is, the subject was more certain than in cases where subjects reported disjoint stimuli. In unified-stimuli cases, higher disparity results in higher variability, while in non-unity judgments, higher disparity results in lower variability. Therefore, a complete model of cue combination / multisensory perception should be able to deal with situations in which it is not certain whether two cues have a single source or different sources.

In [2], Kording et al. proposed a causal inference model of multisensory perception. In this paper they suggest a model of an optimal Bayesian observer for causal inference as well as source localization. Their model is based on the fact that the brain has access only to noisy representations of the source locations, and should infer the best estimate of those locations while it is considerably uncertain about the presence of a common cause. The optimal observer model defines how two cues should be combined depending on the observer's inference about the presence of a common cause. If the observer believes strongly that the cues have a common cause, they are fused. If the observer believes strongly that there are two different causes, they are segregated. In practice, the optimal observer always has some uncertainty about the number of causes and adjusts its cue combination depending on its degree of belief about the causal structure. Causal inference models try to predict the conditions under which the brain perceives a common cause or different causes, as well as the way in which two cues should be combined.

In their study, Kording et al. modeled cases where subjects are presented with both visual and auditory stimuli at the same time. The task is to estimate the locations of the light and sound sources. When there is a common cause for the two stimuli, one stimulus can be used to improve the estimate of the other stimulus's source. One point to consider is that in some cases there are two sources for the two stimuli, but they are so close that the observer reports one cause for both of them. Figure 1.2 shows the task.

Figure 1.2 Causal inference task.
To perform the causal inference task and estimate the positions of the causes, this model combines two pieces of information using Bayesian statistics. The first piece is the likelihood, which is provided by the observed visual and auditory stimuli. Because of noise in our sensory system, a single stimulus does not provide exact information about its cause; rather, it provides a probability distribution over where the cause could be. The second piece of information is the priors: over the unity of the causes, for causal inference, and over the positions of the sources, for position estimation. Combining these two pieces of information, the causal inference model infers whether there is a common cause or two different causes, and estimates their positions.

The task consists of calculating the probability of a common cause given \(x_A\) and \(x_V\), the noisy internal representations of the auditory and visual stimuli \(s_A\) and \(s_V\) in the outside world. Treating the inference of the number of causes as a binary decision — common cause or different causes — the brain can use a threshold. If the probability of a common cause is higher than the threshold, the brain reports a common cause; if it is smaller, the brain infers two different causes for the two perceived stimuli. Figure 1.3 shows the generative model of the task.

Figure 1.3 Generative model of causal inference.

As can be seen in the generative model, the binary variable \(C\) determines whether there is one cause or two different causes. If \(C=1\), the auditory and visual stimuli have a common cause; if \(C=2\), there are two different causes for \(s_A\) and \(s_V\). On each trial, \(C\) is drawn from a Bernoulli distribution with \(P(C=1) = p_{\text{common}}\) and \(P(C=2) = 1 - p_{\text{common}}\). If there is a common source (\(C=1\)), a stimulus — here, the position of the source — is drawn from a normal prior distribution \(\mathcal{N}(0, \sigma_p^2)\), a normal distribution with mean 0 and standard deviation \(\sigma_p\). This means we expect that in most trials the stimulus is close to the center. If \(C=1\), we have \(s_A = s_V = s\). Conversely, if there are two different causes (\(C=2\)), \(s_A\) and \(s_V\) are drawn independently from the normal prior \(\mathcal{N}(0, \sigma_p^2)\). The noise corrupting the auditory and visual signals is modeled as unbiased Gaussian noise with standard deviations \(\sigma_A\) and \(\sigma_V\). Thus the internal representations of the auditory and visual stimuli are drawn from \(\mathcal{N}(s_A, \sigma_A^2)\) and \(\mathcal{N}(s_V, \sigma_V^2)\), respectively. Given this generative model, the goal is to calculate \(P(C \mid x_A, x_V)\).

According to Bayes' rule:

\[
P(C \mid x_A, x_V) = \frac{P(x_A, x_V \mid C)\, P(C)}{P(x_A, x_V)} \tag{1.11}
\]

We also know that \(P(C=1 \mid x_A, x_V) + P(C=2 \mid x_A, x_V) = 1\). With the prior probabilities \(P(C=1) = p_{\text{common}}\) and \(P(C=2) = 1 - p_{\text{common}}\), we can rewrite equation (1.11) as:

\[
P(C=1 \mid x_A, x_V) = \frac{P(x_A, x_V \mid C=1)\, p_{\text{common}}}{P(x_A, x_V \mid C=1)\, p_{\text{common}} + P(x_A, x_V \mid C=2)\,(1 - p_{\text{common}})} \tag{1.12}
\]

In view of the conditional independence of \(x_A\) and \(x_V\) given the stimuli, we can write from the generative model in figure 1.3:

\[
P(x_A, x_V \mid C=1) = \iint P(x_A, x_V \mid s_A, s_V, C=1)\, P(s_A, s_V \mid C=1)\, ds_A\, ds_V = \int P(x_A \mid s)\, P(x_V \mid s)\, P(s)\, ds \tag{1.13}
\]

All three factors inside the integral have Gaussian distributions; for \(C=1\) they are \(\mathcal{N}(s, \sigma_A^2)\), \(\mathcal{N}(s, \sigma_V^2)\) and \(\mathcal{N}(0, \sigma_p^2)\), respectively.
We can therefore calculate \(P(x_A, x_V \mid C=1)\) analytically:

\[
P(x_A, x_V \mid C=1) = \frac{1}{2\pi \sqrt{\sigma_A^2 \sigma_V^2 + \sigma_A^2 \sigma_p^2 + \sigma_V^2 \sigma_p^2}} \exp\!\left[ -\frac{1}{2}\, \frac{(x_A - x_V)^2 \sigma_p^2 + x_A^2 \sigma_V^2 + x_V^2 \sigma_A^2}{\sigma_A^2 \sigma_V^2 + \sigma_A^2 \sigma_p^2 + \sigma_V^2 \sigma_p^2} \right] \tag{1.14}
\]

For \(C=2\) we can write:

\[
P(x_A, x_V \mid C=2) = \iint P(x_A \mid s_A)\, P(x_V \mid s_V)\, P(s_A)\, P(s_V)\, ds_A\, ds_V = \int P(x_A \mid s_A)\, P(s_A)\, ds_A \int P(x_V \mid s_V)\, P(s_V)\, ds_V \tag{1.15}
\]

As for \(C=1\), all distributions in these integrals are Gaussian. As a result, we can find an analytical expression for \(P(x_A, x_V \mid C=2)\):

\[
P(x_A, x_V \mid C=2) = \frac{1}{2\pi \sqrt{(\sigma_A^2 + \sigma_p^2)(\sigma_V^2 + \sigma_p^2)}} \exp\!\left[ -\frac{1}{2} \left( \frac{x_A^2}{\sigma_A^2 + \sigma_p^2} + \frac{x_V^2}{\sigma_V^2 + \sigma_p^2} \right) \right] \tag{1.16}
\]

After calculating the probabilities of \(C=1\) and \(C=2\), the observer has to make a decision. We use MAP estimation to decide whether \(C\) equals 1 or 2. Since \(P(C=1 \mid x_A, x_V) + P(C=2 \mid x_A, x_V) = 1\), the observer uses 0.5 as the decision threshold: if \(P(C=1 \mid x_A, x_V) > 0.5\) the observer reports \(C=1\), and if \(P(C=1 \mid x_A, x_V) < 0.5\) the observer reports \(C=2\). Figure 1.4 shows the performance of the causal inference model.

Figure 1.4 (a) The likelihoods of \(s_A\) and \(s_V\) as functions of the stimulus. (b) The best guesses for the stimulus given \(x_A\) and \(x_V\), under \(C=1\) and \(C=2\). (c) The bias — which shows the influence of vision on the perceived auditory stimulus — plotted as a function of the difference between the two stimuli; trials for which the reported \(C\) equals 1 are plotted in blue, and trials for which the reported \(C\) equals 2 in green. (d) The relative number of trials on which the subject reports \(C = 1\), as a function of spatial disparity.

The causal inference model gives us an analytical method to calculate the probability of a common cause or different causes given the internal representations of two stimuli. But, as we know, the brain performs any function via its neural populations and networks. So, to check the neural plausibility of the causal inference model, we need a neural structure capable of implementing it.
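The model above is straightforward to simulate. The following sketch (Python/NumPy; the function name and parameter values are ours) evaluates equations (1.12), (1.14) and (1.16) and applies the 0.5 MAP threshold:

```python
import numpy as np

def p_common_cause(x_a, x_v, sigma_a, sigma_v, sigma_p, p_common=0.5):
    """Posterior probability of a common cause, equations (1.12), (1.14), (1.16)."""
    # Likelihood under C = 1, eq. (1.14)
    v1 = sigma_a**2 * sigma_v**2 + sigma_a**2 * sigma_p**2 + sigma_v**2 * sigma_p**2
    like1 = np.exp(-0.5 * ((x_a - x_v)**2 * sigma_p**2 + x_a**2 * sigma_v**2
                           + x_v**2 * sigma_a**2) / v1) / (2 * np.pi * np.sqrt(v1))
    # Likelihood under C = 2, eq. (1.16)
    va, vv = sigma_a**2 + sigma_p**2, sigma_v**2 + sigma_p**2
    like2 = np.exp(-0.5 * (x_a**2 / va + x_v**2 / vv)) / (2 * np.pi * np.sqrt(va * vv))
    # Posterior, eq. (1.12)
    return like1 * p_common / (like1 * p_common + like2 * (1 - p_common))

# MAP decision: report a common cause when the posterior exceeds 0.5
post = p_common_cause(x_a=2.0, x_v=4.0, sigma_a=3.0, sigma_v=1.0, sigma_p=10.0)
report_common = post > 0.5
```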
1.3. Neural Models

Probabilistic Population Coding

In most of the brain's activities, the brain receives one or multiple stimuli from the outside world in the form of physical signals such as light, sound, touch, etc. These stimuli contain different amounts of information. Depending on its nature — visual, auditory, etc. — each stimulus goes to its corresponding region of the brain to be processed. Inside the brain these stimuli are represented as neural activities; the information carried by a stimulus is encoded in the neural activity of the corresponding part of the brain. The brain then decodes and processes this information in preparation for operations such as decision making, object recognition, etc. The general schematic of this procedure is shown in figure 1.5.

Figure 1.5 Schematic of what the brain does with a stimulus: the stimulus is encoded in a neural representation, computations are performed on it, and a posterior is decoded for the decision.

Before describing how neural populations encode information, some basic concepts need to be introduced.

Stimulus, preferred stimulus, tuning curve

A stimulus can be defined as any physical feature of the world. It can be the distance between two objects, the number of events in a specific period of time, the color of an object, etc. Each stimulus is processed in a specific part of the brain consisting of a population of neurons. For simplicity we can suppose that there is a line of neurons in each part. When the corresponding part receives the stimulus, its neurons start firing spikes. The activity of a neural population in response to a received stimulus encodes the information content of that stimulus.

Each neuron in a neural population is most sensitive to a specific value of the stimulus. In other words, different neurons in a neural population are tuned to different values of the stimulus. The value to which a neuron is tuned is called the preferred stimulus of that neuron. Figure 1.6 shows the preferred stimulus of a single neuron. The plotted curve is called the tuning curve of the neuron; it shows the mean response of the neuron to all possible stimuli. The neuron fires most intensely in response to the stimulus to which it is tuned.

Figure 1.6 (a) Tuning curve and preferred stimulus of one neuron; (b) tuning curves of a neural population.

Variability

When a neuron is stimulated, it fires spikes at a rate that depends on the value of the stimulus. However, it has been shown that if we stimulate a neuron with the same stimulus on different trials, its firing rate differs from trial to trial. In a neural population, this variability in neural activity grows with the mean activity: a larger mean neural activity results in a larger variance of the distribution. Figure 1.7 shows population activity \(\mathbf{r} = \{r_1, r_2, \ldots, r_N\}\) for a single trial and over many trials.

Figure 1.7 Population activity and neural variability.

Encoding

When there is variability, on any single trial one cannot say exactly what the value of the stimulus is. This leads to the conclusion that neural populations encode probability distributions over stimuli, not their exact values [9, 10]. One of the most important reasons for encoding probability distributions instead of deterministic stimulus values is the role of uncertainty in neural computations. The generative model in figure 1.8 shows the relation between the stimulus and the population activity.

Figure 1.8 Relationship between stimulus \(s\) and population activity \(\mathbf{r}\): encoding is described by \(p(\mathbf{r} \mid s)\), inference by \(p(s \mid \mathbf{r})\).
One distribution that satisfies this mean-variance relation in neural activity is the Poisson distribution [11]. This distribution expresses the probability of a given number of events in a fixed interval of time, physical distance, etc., with a known average rate, independently of where the interval begins. For a Poisson distribution, if the expected number of occurrences is \(\lambda\), then the probability of exactly \(k\) occurrences during the given interval is:

\[
f(k; \lambda) = \frac{\lambda^k e^{-\lambda}}{k!} \tag{1.17}
\]

If we denote the response distribution of the \(i\)-th neuron to a given stimulus by \(p(r_i \mid s)\), then for Poisson variability the variance of the spike count equals its mean. Substituting \(f_i(s)\) for \(\lambda\) and \(r_i\) for \(k\) in equation (1.17), we can write:

\[
p(r_i \mid s) = \frac{e^{-f_i(s)}\, f_i(s)^{r_i}}{r_i!} \tag{1.18}
\]

where \(r_i\) is an integer representing the spike count of the \(i\)-th neuron in the population and \(f_i(s)\) is its tuning curve. Then, for the neural population \(\mathbf{r} = \{r_1, r_2, \ldots, r_N\}\), we have:

\[
p(\mathbf{r} \mid s) = \prod_{i=1}^{N} p(r_i \mid s) = \prod_{i=1}^{N} \frac{e^{-f_i(s)}\, f_i(s)^{r_i}}{r_i!} \tag{1.19}
\]

Here we assume that the firing of each of the \(N\) neurons in the population is conditionally independent of the other neurons. According to Bayes' rule, \(p(s \mid \mathbf{r}) \propto p(\mathbf{r} \mid s)\, p(s)\). In this picture, \(x\) — the internal representation of \(s\) — is the value of \(s\) that maximizes the likelihood:

\[
x = \arg\max_s L_{\mathbf{r}}(s) = \arg\max_s p(\mathbf{r} \mid s) \tag{1.20}
\]

Figure 1.9 shows \(L_{\mathbf{r}}(s)\) and \(x\).

Figure 1.9 The interpretation of \(x\).

Decoding

Information encoded in population activity needs to be decoded in order to be used by other parts of the brain for tasks such as judgment and decision making. That is, knowing the population activity, the brain should estimate the value of the perceived stimulus — the inference direction in the generative model. To do this, the observer can calculate the probability distribution over the stimulus, \(p(s \mid \mathbf{r})\), and then read out the best value of \(s\) from this distribution. Using Bayes' rule, the posterior over the stimulus can be written as:

\[
p(s \mid \mathbf{r}) = \frac{p(\mathbf{r} \mid s)\, p(s)}{p(\mathbf{r})} \tag{1.21}
\]

where \(p(\mathbf{r} \mid s)\) is the likelihood, \(p(s)\) is the prior probability of the stimulus, and \(p(\mathbf{r})\) is the normalization factor:

\[
p(\mathbf{r}) = \sum_i p(\mathbf{r} \mid s_i)\, p(s_i) \tag{1.22}
\]

Equation (1.21) means that we can calculate the probability of a stimulus given the population activity \(\mathbf{r}\) and the prior probability over the stimulus. This is shown in figure 1.10.

Figure 1.10 Posterior distribution given population activity.

From equations (1.19) and (1.21) we can write:

\[
p(s \mid \mathbf{r}) \propto p(s) \prod_i \frac{e^{-f_i(s)}\, f_i(s)^{r_i}}{r_i!} \tag{1.23}
\]

Using a flat distribution for \(p(s)\), it can be shown from equation (1.23) that the posterior distribution \(p(s \mid \mathbf{r})\) converges to a Gaussian distribution with some mean \(\mu\) and variance \(\sigma^2\). Using maximum a posteriori (MAP) estimation to read out the best guess, the observer chooses the mean of the posterior as the stimulus estimate. The activity of each neuron in the population depends on the distance between its preferred stimulus and the input stimulus, so the peak of the posterior — the mean of the distribution — lies very close to the preferred stimulus of the neuron with the highest activity in the population. It is shown in [12-14] that the variance of the posterior is inversely proportional to the gain (the height of the hill of activity), i.e. \(\sigma^2 \propto 1/g\).
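As an illustration of this encoding-decoding loop, here is a small simulation (Python/NumPy; the tuning-curve parameters are arbitrary choices of ours). It draws independent Poisson population activity as in equation (1.19) and decodes the posterior of equation (1.23) under a flat prior:

```python
import numpy as np

rng = np.random.default_rng(1)

s_pref = np.linspace(-40, 40, 81)   # preferred stimuli of the population
sigma_tc = 10.0                     # tuning-curve width
gain = 15.0                         # gain (height of the tuning curves)

def tuning(s):
    """Gaussian tuning curves f_i(s)."""
    return gain * np.exp(-(s - s_pref)**2 / (2 * sigma_tc**2))

def encode(s):
    """One trial of independent Poisson population activity, eq. (1.19)."""
    return rng.poisson(tuning(s))

def decode(r, s_grid):
    """Posterior over s given activity r, eq. (1.23), with a flat prior."""
    log_post = np.array([np.sum(r * np.log(tuning(s)) - tuning(s)) for s in s_grid])
    post = np.exp(log_post - log_post.max())   # subtract the max for stability
    return post / post.sum()

r = encode(5.0)
s_grid = np.linspace(-40, 40, 401)
posterior = decode(r, s_grid)
s_map = s_grid[np.argmax(posterior)]  # MAP estimate; the variance shrinks as gain grows
```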
One of the most important reasons for decoding probability distributions instead of deterministic stimulus values is the role of uncertainty in neural computations. To use sensory information for both judgment and motor control, the brain needs to take into account the uncertainty in the information it receives — how reliable that information is.

1.4. Cue Combination using PPC

To show the usefulness of encoding probability distributions, here we describe an example of multisensory integration using PPC. In cue combination, the brain integrates two cues, such as an auditory and a visual cue, each of which contains information about the stimulus with some uncertainty. Figure 1.11 shows this cue integration task.

Figure 1.11 An example of cue integration: likelihoods \(p(s \mid \mathbf{r}_A)\) and \(p(s \mid \mathbf{r}_V)\) over the stimulus.

The relation between the means and variances of the cues and the inferred stimulus follows equation (1.7), which describes the Bayes-optimal solution. As mentioned before, the human brain performs very close to equation (1.7). The question now is how neurons can achieve this relationship. Figure 1.12 shows two cues encoded in two population activities \(\mathbf{r}_1\) and \(\mathbf{r}_2\) with gains \(g_1\) and \(g_2\), respectively, together with the relation between a third population \(\mathbf{r}_3\) and the inputs \(\mathbf{r}_1\) and \(\mathbf{r}_2\).

Figure 1.12 The relationship between the population activities of two cues and a third population which encodes the distribution of the stimulus: the gains add, \(g_3 = g_1 + g_2\), and the encoded variances satisfy \(1/\sigma_3^2 = 1/\sigma_1^2 + 1/\sigma_2^2\).

In fact these PPCs, \(\mathbf{r}_1\) and \(\mathbf{r}_2\), encode the likelihood functions \(p(\mathbf{r}_1 \mid s)\) and \(p(\mathbf{r}_2 \mid s)\). Using the same number of neurons in both populations and the same tuning curves for corresponding neurons, it can be shown that optimal Bayesian inference is equivalent to summing the activities of the two populations neuron by neuron. In other words, we can construct a new population \(\mathbf{r}_3\) with:

\[
\mathbf{r}_3 = \mathbf{r}_1 + \mathbf{r}_2 \tag{1.24}
\]

If \(\mathbf{r}_1\) and \(\mathbf{r}_2\) have Poisson distributions, then so does \(\mathbf{r}_3\). Moreover, \(\mathbf{r}_3\) encodes the likelihood function with a variance \(\sigma_3^2\) that is inversely proportional to the gain of \(\mathbf{r}_3\).

As can be seen in figure 1.12, the gain of the final population, \(g_3\), equals the sum of the gains of the first two populations, \(g_3 = g_1 + g_2\). Since the gains are inversely proportional to the variances of the corresponding encoded distributions, with the same proportionality constant, we conclude that \(1/\sigma_3^2 = 1/\sigma_1^2 + 1/\sigma_2^2\), which is exactly equation (1.7). As a consequence, the variance of the distribution encoded by \(\mathbf{r}_3\) is the variance of the posterior probability \(p(s \mid \mathbf{r}_1, \mathbf{r}_2)\).

To prove that summing population activities results in Bayes-optimal inference, we assume that neural variability follows the so-called Poisson-like distribution, defined as:

\[
p(\mathbf{r}_k \mid s, g_k) = \phi_k(\mathbf{r}_k, g_k)\, \exp\!\left(\mathbf{h}_k(s)^T \mathbf{r}_k\right) \tag{1.25}
\]

We then want to prove that \(p(\mathbf{r}_3 \mid s)\) — with \(\mathbf{r}_3 = \mathbf{r}_1 + \mathbf{r}_2\) — is also a Poisson-like distribution.
To do this we write:

\[
\begin{aligned}
p(\mathbf{r}_3 \mid s) &= \int p(\mathbf{r}_1 \mid s)\, p(\mathbf{r}_3 - \mathbf{r}_1 \mid s)\, d\mathbf{r}_1 \\
&= \int \phi_1(\mathbf{r}_1, g_1)\, e^{\mathbf{h}(s)^T \mathbf{r}_1}\, \phi_2(\mathbf{r}_3 - \mathbf{r}_1, g_2)\, e^{\mathbf{h}(s)^T (\mathbf{r}_3 - \mathbf{r}_1)}\, d\mathbf{r}_1 \\
&= e^{\mathbf{h}(s)^T \mathbf{r}_3} \int \phi_1(\mathbf{r}_1, g_1)\, \phi_2(\mathbf{r}_3 - \mathbf{r}_1, g_2)\, d\mathbf{r}_1 \\
&= \phi_3(\mathbf{r}_3, g_3)\, e^{\mathbf{h}(s)^T \mathbf{r}_3}
\end{aligned} \tag{1.26}
\]

with \(\phi_3(\mathbf{r}_3, g_3) = \int \phi_1(\mathbf{r}_1, g_1)\, \phi_2(\mathbf{r}_3 - \mathbf{r}_1, g_2)\, d\mathbf{r}_1\). So we can decode the information content of the population generated by summing the two former populations; the information encoded by the new population is a combination of the information from the two input populations. In equation (1.25), \(k\) can be 1, 2 or 3, and \(\mathbf{h}(s)\) is the kernel defined by:

\[
\mathbf{h}'_k(s) = \Sigma_k^{-1}(s, g_k)\, \mathbf{f}'_k(s, g_k) \tag{1.27}
\]

where \(\Sigma_k\) is the covariance matrix of \(\mathbf{r}_k\) and \(\mathbf{f}'_k\) is the derivative of the tuning curves. This is the general form. If the noise is independent Poisson and the tuning curves are identically shaped across gains and across the two populations, then we have:

\[
\mathbf{h}(s) = \log \mathbf{f}(s) \tag{1.28}
\]

\[
\phi_k(\mathbf{r}_k, g_k) = \exp\!\left(\sum_i r_{ki} \log g_k\right) \frac{\exp(-c\, g_k)}{\prod_i r_{ki}!} \tag{1.29}
\]

where \(c\) is a constant.
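The claim of equation (1.24) can be checked numerically. The sketch below (Python/NumPy; parameters and names are ours) assumes dense, translation-invariant Gaussian tuning curves with independent Poisson noise, so that the gain only adds a stimulus-independent constant to the log posterior; it decodes \(\mathbf{r}_1\), \(\mathbf{r}_2\) and \(\mathbf{r}_3 = \mathbf{r}_1 + \mathbf{r}_2\) and compares the inverse variances:

```python
import numpy as np

rng = np.random.default_rng(2)

s_pref = np.linspace(-40, 40, 81)
sigma_tc = 10.0
s_grid = np.linspace(-20, 20, 801)
# log f_i(s), up to an s-independent constant, for all grid values at once
log_f = -(s_grid[:, None] - s_pref[None, :])**2 / (2 * sigma_tc**2)

def poisson_pop(s, gain):
    """One trial of Poisson activity with gain-scaled Gaussian tuning curves."""
    return rng.poisson(gain * np.exp(-(s - s_pref)**2 / (2 * sigma_tc**2)))

def posterior_var(r):
    """Variance of the posterior encoded by r (flat prior assumed)."""
    log_post = log_f @ r
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    mean = np.sum(s_grid * post)
    return np.sum((s_grid - mean)**2 * post)

r1 = poisson_pop(0.0, gain=10.0)   # e.g. an auditory population
r2 = poisson_pop(0.0, gain=30.0)   # e.g. a visual population
v1, v2, v3 = posterior_var(r1), posterior_var(r2), posterior_var(r1 + r2)
# Up to sampling noise, 1/v3 is close to 1/v1 + 1/v2, reproducing eq. (1.7)
```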
So far, we have argued that probabilistic population coding is a very convenient way for neural populations to encode probability distributions over outside-world parameters. In this chapter we reviewed the literature on causal inference and cue combination at both the behavioral and the neural level. We saw that, using the causal inference model, the observer can infer whether there is a common cause for two cues, as well as estimate the value of the stimulus. In the next chapter we discuss the neural implementation of the causal inference model — inferring the number of causes — using probabilistic population codes (PPC).

Chapter 2: Problem Statement

In the causal inference model presented in the previous chapter, the observer uses the noisy internal representations of stimuli to calculate the probability that those stimuli have a common cause. So far, we have taken the internal representations to be rather abstract representations of the stimuli. In reality, a stimulus is internally represented through a pattern of firing rates in a population of neurons. The central problem in this work is to find a neural network that takes these patterns of activity as input and outputs the probability that the originating stimuli have a common cause.

2.1. Neural implementation of Causal Inference

In Section 1.2 we presented an analytical calculation of the probability of a common cause for a given pair of internal representations. Now we reformulate that calculation using neural activities instead of internal representations of stimuli. For concreteness, we refer to auditory and visual stimuli, but the framework can be applied to any set of stimuli.

Information about the auditory and visual stimuli, \(s_A\) and \(s_V\), is encoded in the auditory and visual neural populations \(\mathbf{r}_A\) and \(\mathbf{r}_V\). As mentioned in Section 1.3, this information is encoded as the likelihoods \(p(\mathbf{r}_A \mid s)\) and \(p(\mathbf{r}_V \mid s)\), respectively. In the previous chapter we calculated the probability of \(C=1\) and \(C=2\) given the internal representations of \(s_A\) and \(s_V\), namely \(x_A\) and \(x_V\). Now we need a method to calculate \(P(C \mid \mathbf{r}_A, \mathbf{r}_V)\), which is the same quantity except that we know \(\mathbf{r}_A\) and \(\mathbf{r}_V\) instead of \(x_A\) and \(x_V\).

We can rewrite (1.12), (1.13) and (1.15) for \(\mathbf{r}_A\) and \(\mathbf{r}_V\) instead of \(x_A\) and \(x_V\):

\[
P(C=1 \mid \mathbf{r}_A, \mathbf{r}_V) = \frac{P(\mathbf{r}_A, \mathbf{r}_V \mid C=1)\, p_{\text{common}}}{P(\mathbf{r}_A, \mathbf{r}_V)} \tag{2.1}
\]

where

\[
P(\mathbf{r}_A, \mathbf{r}_V \mid C=1) = \int P(\mathbf{r}_A \mid s)\, P(\mathbf{r}_V \mid s)\, P(s)\, ds \tag{2.2}
\]

\[
P(\mathbf{r}_A, \mathbf{r}_V \mid C=2) = \int P(\mathbf{r}_A \mid s_A)\, p(s_A)\, ds_A \int P(\mathbf{r}_V \mid s_V)\, p(s_V)\, ds_V \tag{2.3}
\]

Unlike the distributions of \(x_A\) and \(x_V\) given \(s\), which are Gaussian, the distributions of \(\mathbf{r}_A\) and \(\mathbf{r}_V\) given \(s\) are Poisson-like, as given by (1.25). Substituting into (2.2) gives:

\[
P(\mathbf{r}_A, \mathbf{r}_V \mid C=1) = \phi_A(\mathbf{r}_A, g_A)\, \phi_V(\mathbf{r}_V, g_V) \int e^{\mathbf{h}(s)^T \mathbf{r}_A}\, e^{\mathbf{h}(s)^T \mathbf{r}_V}\, p(s)\, ds \tag{2.4}
\]

and (2.3) turns into:

\[
P(\mathbf{r}_A, \mathbf{r}_V \mid C=2) = \phi_A(\mathbf{r}_A, g_A)\, \phi_V(\mathbf{r}_V, g_V) \int e^{\mathbf{h}(s)^T \mathbf{r}_A}\, p(s)\, ds \int e^{\mathbf{h}(s)^T \mathbf{r}_V}\, p(s)\, ds \tag{2.5}
\]

with \(p(s)\) a Gaussian distribution as before. In the case of independent Poisson variability and Gaussian tuning curves, we can write from (2.4) and (2.5), with the \(\phi\) factors canceling (and taking \(p_{\text{common}} = 1/2\), so that the prior odds drop out):

\[
\log \frac{P(C=1 \mid \mathbf{r}_A, \mathbf{r}_V)}{P(C=2 \mid \mathbf{r}_A, \mathbf{r}_V)} = \log \int e^{\log \mathbf{f}(s) \cdot \mathbf{r}_A}\, e^{\log \mathbf{f}(s) \cdot \mathbf{r}_V}\, p(s)\, ds - \log \int e^{\log \mathbf{f}(s) \cdot \mathbf{r}_A}\, p(s)\, ds - \log \int e^{\log \mathbf{f}(s) \cdot \mathbf{r}_V}\, p(s)\, ds \tag{2.6}
\]

Equation (2.6) gives us the log-posterior ratio, which we can threshold to decide whether \(C=1\) or \(C=2\).

We are now looking for a population of neurons — called \(\mathbf{z}\) — which encodes the posterior probability of the number of causes, \(p(C \mid \mathbf{r}_A, \mathbf{r}_V)\). Having such a neural population allows us to use the same encoding and decoding techniques for information about \(C\). As mentioned before, we suppose that neural populations have Poisson-like variability; the same assumption is needed for \(\mathbf{z}\) in order to follow the same logic in inferring \(C\):

\[
p(C \mid \mathbf{z}) \propto e^{\mathbf{h}(C)^T \mathbf{z}} \tag{2.7}
\]

This allows for recursive computation and facilitates decoding. Assuming \(\mathbf{z}\) encodes the probability of a common cause, we can define a variable \(d\) as:

\[
d = \log \frac{p(C=1 \mid \mathbf{z})}{p(C=2 \mid \mathbf{z})} \tag{2.8}
\]

Having \(p(C \mid \mathbf{z})\), we need to read out the value of \(C\) from its distribution. If \(d > 0\) then \(p(C=1 \mid \mathbf{z}) > p(C=2 \mid \mathbf{z})\), and if \(d < 0\) then \(p(C=2 \mid \mathbf{z}) > p(C=1 \mid \mathbf{z})\). This is MAP readout of \(C\) from its distribution. From equations (2.7) and (2.8) we can write:

\[
d = \log \frac{e^{\mathbf{h}(C=1)^T \mathbf{z}}}{e^{\mathbf{h}(C=2)^T \mathbf{z}}} = \left(\mathbf{h}(C=1) - \mathbf{h}(C=2)\right)^T \mathbf{z} = \Delta \mathbf{h}^T \mathbf{z} \tag{2.9}
\]

\[
p(C=1 \mid \mathbf{z}) = \frac{1}{1 + e^{-\Delta \mathbf{h}^T \mathbf{z}}}, \qquad p(C=2 \mid \mathbf{z}) = 1 - p(C=1 \mid \mathbf{z}) = \frac{1}{1 + e^{\Delta \mathbf{h}^T \mathbf{z}}} \tag{2.10}
\]

So we have shown how the probability of a common cause can be calculated from the activity of a single neural population.

To find a neural network that can perform near-optimal causal inference, we now need to determine the relation between the input population activities, \(\mathbf{r}_A\) and \(\mathbf{r}_V\), and the population which encodes the probability of a common cause, \(\mathbf{z}\). In other words, we need to find the neural operations, \(\mathbf{R}\), that can be used in the network structure to implement the causal inference model. \(\mathbf{R}\) is in fact a hidden layer in the network that determines which neurons from each population are combined, and in what form, to improve the network's discrimination ability. Figure 2.1 shows such a structure.
Figure 2.1 Neural network structure for implementing the causal inference model: the input populations \(\mathbf{r}_A\) and \(\mathbf{r}_V\) feed the hidden layer \(\mathbf{R}\), which is mapped by weights \(\tilde{\mathbf{W}}\) to the output population \(\mathbf{z}\).

We can construct the neurons in the hypothetical \(\mathbf{R}\) layer to compute products of activities of neurons in the input populations \(\mathbf{r}_A\) and \(\mathbf{r}_V\); this effectively computes a kind of correlation between neurons. For example, if there is no correlation between the neurons of the two input populations, the layer \(\mathbf{R}\) simply consists of the neurons of the two populations placed side by side. Alternatively, if we suppose that there are correlations between each neuron and its corresponding neuron in the other population, we can make layer \(\mathbf{R}\) consist of nodes that are the products of corresponding pairs of neurons from the two populations. Figure 2.2 shows these two examples.

Figure 2.2 Two examples of possible structures of \(\mathbf{R}\): (a) the concatenation \([\mathbf{r}_A \;\; \mathbf{r}_V]\); (b) products \(r_{Ai} r_{Vj}\) of neurons from the two populations.

The last step in completing the network is to find a set of weights which allows the network to calculate the same probability as equation (2.6). In other words, we need a set of weights \(\tilde{\mathbf{W}}\) such that

\[
\mathbf{z} = \tilde{\mathbf{W}} \mathbf{R} \tag{2.11}
\]

One important point here is that \(\tilde{\mathbf{W}}\) is a matrix which maps \(\mathbf{R}\) to \(\mathbf{z}\). Substituting this \(\mathbf{z}\) into equation (2.10), we have:

\[
p(C=1 \mid \mathbf{z}) = \frac{1}{1 + e^{-\Delta \mathbf{h}^T (\tilde{\mathbf{W}} \mathbf{R})}}, \qquad p(C=2 \mid \mathbf{z}) = \frac{1}{1 + e^{\Delta \mathbf{h}^T (\tilde{\mathbf{W}} \mathbf{R})}} \tag{2.12}
\]

As will be shown in the next chapter, because we do not know \(\Delta \mathbf{h}\), what we actually learn is the combined vector \(\mathbf{W} = \tilde{\mathbf{W}}^T \Delta \mathbf{h}\), so the final result is:

\[
p(C=1 \mid \mathbf{z}) = \frac{1}{1 + e^{-\mathbf{W} \cdot \mathbf{R}}}, \qquad p(C=2 \mid \mathbf{z}) = \frac{1}{1 + e^{\mathbf{W} \cdot \mathbf{R}}} \tag{2.13}
\]

To calculate the probability of a common cause from equation (2.13), we need to find the \(\mathbf{W}\) which gives a result as close as possible to the optimal probability calculated from equations (2.1)–(2.6). But before that, we need to determine the elements of the \(\mathbf{R}\) layer. The simplest case is to use \(\mathbf{R} = [\mathbf{r}_A \;\; \mathbf{r}_V]\), which contains no correlations between different neurons.

We can consider the causal inference task over many trials as a classification process. Data points are patterns of neural activity \((\mathbf{r}_A, \mathbf{r}_V)\), and they have to be classified as originating from the same source (\(C=1\)) or from different sources (\(C=2\)). Figure 2.3 shows the scatter plot of the activities of two corresponding neurons in the auditory and visual populations for both \(C=1\) and \(C=2\) trials. In reality the data points are not two-dimensional as in the figure; each data point has \(N\) dimensions, where \(N\) is the number of neurons in the populations, so each point is a population activity pattern in an \(N\)-dimensional space.

Figure 2.3 Demonstration of the discrimination boundary: activities \(r_{Ai}\) and \(r_{Vi}\) of corresponding auditory and visual neurons for \(C=1\) and \(C=2\) trials.

In the simplest case we can use \(\mathbf{R}_{\text{Lin}} = [\mathbf{r}_A \;\; \mathbf{r}_V]\), which corresponds to a linear boundary. But as can be seen in figure 2.3, a linear boundary cannot discriminate the data points well. One of the simplest boundaries that might do a better job is a quadratic boundary, shown as a green curve in figure 2.3. To obtain such a boundary we need quadratic elements in \(\mathbf{R}\), so we set \(\mathbf{R}_{\text{Quad}} = [\mathbf{r}_A \;\; \mathbf{r}_V \;\; \mathbf{r}_A \mathbf{r}_A^T \;\; \mathbf{r}_V \mathbf{r}_V^T \;\; \mathbf{r}_A \mathbf{r}_V^T]\). This \(\mathbf{R}\) contains the quadratic elements \(r_{Ai} r_{Aj}\), \(r_{Vi} r_{Vj}\) and \(r_{Ai} r_{Vj}\), which represent correlations not only between different neurons in the same population but also between neurons in different populations.

Calculating \(d\) in equation (2.9), it can be shown that in the analytical formula for \(d\), \(\mathbf{R}\) contains nonlinear functions of \(\mathbf{r}_A\) and \(\mathbf{r}_V\) as well as terms with \(\mathbf{r}_A\) and \(\mathbf{r}_V\) in the denominator. This denominator term containing \(\mathbf{r}_A\) and \(\mathbf{r}_V\) is called divisive normalization.
\(\mathbf{R}\) then becomes \(\mathbf{R}_{\text{LDN}} = \left[\mathbf{R}_{\text{Lin}} \;\; \dfrac{\mathbf{R}_{\text{Lin}}}{\mathbf{a} \cdot \mathbf{r}}\right]\) for linear with divisive normalization, and \(\mathbf{R}_{\text{QDN}} = \left[\mathbf{R}_{\text{Quad}} \;\; \dfrac{\mathbf{R}_{\text{Quad}}}{\mathbf{a} \cdot \mathbf{r}}\right]\) for quadratic with divisive normalization. Here \(\mathbf{a}\) is a constant vector which must be learned through an iterative algorithm. Taking \(\mathbf{a}\) into account, formula (2.13) can be rewritten as:

\[
p(C=1 \mid \mathbf{z}) = \frac{1}{1 + \exp\!\left(-\dfrac{\mathbf{W} \cdot \mathbf{R}}{\mathbf{a} \cdot \mathbf{r}}\right)}, \qquad p(C=2 \mid \mathbf{z}) = \frac{1}{1 + \exp\!\left(\dfrac{\mathbf{W} \cdot \mathbf{R}}{\mathbf{a} \cdot \mathbf{r}}\right)} \tag{2.14}
\]

Therefore we tested four networks, characterized by the following operations:

\(\mathbf{R}_{\text{Lin}} = [1 \;\; \mathbf{r}_A \;\; \mathbf{r}_V]\) — Linear (Lin)
\(\mathbf{R}_{\text{LDN}} = \left[\mathbf{R}_{\text{Lin}} \;\; \mathbf{R}_{\text{Lin}} / (\mathbf{a} \cdot \mathbf{r})\right]\) — Linear with divisive normalization (LDN)
\(\mathbf{R}_{\text{Quad}} = [1 \;\; \mathbf{r}_A \;\; \mathbf{r}_V \;\; \mathbf{r}_A \mathbf{r}_A^T \;\; \mathbf{r}_V \mathbf{r}_V^T \;\; \mathbf{r}_A \mathbf{r}_V^T]\) — Quadratic (Quad)
\(\mathbf{R}_{\text{QDN}} = \left[\mathbf{R}_{\text{Quad}} \;\; \mathbf{R}_{\text{Quad}} / (\mathbf{a} \cdot \mathbf{r})\right]\) — Quadratic with divisive normalization (QDN)

We also added the element 1 to \(\mathbf{R}\) in all networks as a bias. As can be seen in equation (2.13), this bias lets the network shift the value of \(\mathbf{W} \cdot \mathbf{R}\) independently of the input, which helps the network learn any bias in the input data.

So far we have suggested a neural network structure that can implement causal inference. For the linear and quadratic networks, \(\mathbf{R}\) is calculated from \(\mathbf{r}_A\) and \(\mathbf{r}_V\), and we need a set of weights, \(\mathbf{W}\), to calculate \(p(C \mid \mathbf{z})\). For the LDN and QDN networks, we also need a vector \(\mathbf{a}\) to construct \(\mathbf{R}\). Therefore, for the LDN and QDN networks there are two sets of weights, \(\mathbf{W}\) and \(\mathbf{a}\), to be learned. In the next chapter we discuss learning algorithms for the network and present simulation results for the networks we trained.

Chapter 3: Methods

In the previous chapter, we proposed a neural network structure for implementing the causal inference model. The goal of the network is to calculate \(p(C \mid \mathbf{z})\). In this chapter, we explain the two learning algorithms we used to train the network.

3.1. Stochastic Gradient Descent

In chapter 2 we presented two ways to calculate the probability distribution over \(C\), \(p(C \mid \mathbf{z})\): the analytical optimal method, equations (2.1) to (2.5), and the neural network, equation (2.13). The second method uses a set of weights, \(\mathbf{W}\), for the linear (Lin) and quadratic (Quad) networks, and a set of weights plus a divisive normalization vector, \(\mathbf{W}\) and \(\mathbf{a}\), for the LDN and QDN networks. The goal is then to find \(\mathbf{W}\) and \(\mathbf{a}\) such that the two probability distributions, calculated analytically and by the network, are as close as possible to each other. For this purpose we need a measure of the distance between two probability distributions.

One of the most common measures of the distance between two probability distributions is the Kullback-Leibler divergence [15]. If \(p(x)\) and \(q(x)\) are two distributions over a discrete random variable \(X\), then the KL divergence \(D_{KL}\) is defined as:

\[
D_{KL}(p \,\|\, q) = \sum_i p(x_i) \log \frac{p(x_i)}{q(x_i)} \tag{3.1}
\]

In the case of a continuous variable, the sum turns into an integral. In our problem, we substitute the distribution calculated from the analytical method for \(p\) and the distribution calculated by the neural network for \(q\). Then \(D_{KL}(p \,\|\, q)\) measures the distance between these two distributions:

\[
D_{KL}(p \,\|\, q) = \left\langle \sum_C p(C \mid \mathbf{R}) \log \frac{p(C \mid \mathbf{R})}{q(C \mid \mathbf{R})} \right\rangle_{\mathbf{R}} \tag{3.2}
\]

with

\[
q(C=1 \mid \mathbf{R}) = \frac{1}{1 + e^{-\mathbf{W} \cdot \mathbf{R}}} \tag{3.3}
\]

From equation (3.2), it can be seen that \(D_{KL}(p \,\|\, q)\) is a function of \(\mathbf{W}\) (and of \(\mathbf{a}\) for the networks with divisive normalization).
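For reference, here is a sketch of the network read-out (3.3) and a Monte-Carlo estimate of (3.2) from per-trial posteriors (Python/NumPy; the function names and the clipping constant are ours):

```python
import numpy as np

def network_posterior(R, W):
    """q(C=1 | R), equation (3.3): a sigmoid read-out of the hidden layer."""
    return 1.0 / (1.0 + np.exp(-(R @ W)))

def kl_divergence(p1, q1):
    """Monte-Carlo estimate of equation (3.2).
    p1, q1: arrays of p(C=1|R_i) and q(C=1|R_i, W) over sampled trials."""
    eps = 1e-12  # guard against log(0)
    p, q = np.clip(p1, eps, 1 - eps), np.clip(q1, eps, 1 - eps)
    return np.mean(p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q)))
```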
We therefore have to find the \(\mathbf{W}\) — for networks without divisive normalization — or the \(\mathbf{W}\) plus \(\mathbf{a}\) — for networks with divisive normalization — that minimizes \(D_{KL}(p \,\|\, q)\). To find the minimum of \(D_{KL}(p \,\|\, q)\) as a function of \(\mathbf{W}\), we use the stochastic gradient descent (SGD) algorithm. This algorithm finds the minimum of a function iteratively, starting from an initial point. In each iteration it evaluates the gradient of \(D_{KL}(p \,\|\, q)\) at the current point and moves in the direction opposite to the gradient, until it reaches a minimum. This can be formulated as:

\[
\mathbf{W}_{n+1} = \mathbf{W}_n + \Delta \mathbf{W} \tag{3.4}
\]

where \(\mathbf{W}_n\) and \(\mathbf{W}_{n+1}\) are the weight vectors in the \(n\)-th and \((n+1)\)-th iterations, respectively. With \(\varepsilon\) defined as the learning rate, \(\Delta \mathbf{W}\) is:

\[
\Delta \mathbf{W} = -\varepsilon \frac{\partial D_{KL}}{\partial \mathbf{W}} \tag{3.5}
\]

From equations (3.2) and (3.3) we can write, for a single pattern \(\mathbf{R}\) (coding \(\tilde{C} = +1\) for \(C=1\) and \(\tilde{C} = -1\) for \(C=2\)):

\[
\frac{\partial D_{KL}(p \,\|\, q)}{\partial \mathbf{W}} = -\sum_C p(C \mid \mathbf{R}) \left[ \tilde{C} \left(1 - q(C \mid \mathbf{R}, \mathbf{W})\right) \right] \mathbf{R} \tag{3.6}
\]

Equation (3.6) calculates \(\partial D_{KL} / \partial \mathbf{W}\) for one trial in each iteration. We choose \(\mathbf{R}\) stochastically from the generative model in figure 1.3 — which is why the method is called stochastic gradient descent. To find a \(\Delta \mathbf{W}\), and finally a \(\mathbf{W}\), independent of any specific \(\mathbf{R}\), we use the average of \(\partial D_{KL} / \partial \mathbf{W}\) over many trials. Equation (3.6) can thus be rewritten as:

\[
\frac{\partial D_{KL}(p \,\|\, q)}{\partial \mathbf{W}} = -\sum_{\mathbf{R}, C} p(\mathbf{R})\, p(C \mid \mathbf{R}) \left[ \tilde{C} \left(1 - q(C \mid \mathbf{R}, \mathbf{W})\right) \right] \mathbf{R} = -\sum_{\mathbf{R}, C} p(C, \mathbf{R}) \left[ \tilde{C} \left(1 - q(C \mid \mathbf{R}, \mathbf{W})\right) \right] \mathbf{R} \tag{3.7}
\]

Equation (3.7) is the weighted sum of the function \(\tilde{C}\,(1 - q(C \mid \mathbf{R}, \mathbf{W}))\,\mathbf{R}\), i.e. its average over \(C\) and \(\mathbf{R}\). Instead of averaging in this way, we can use sampling: we sample \(C_i\) and \(\mathbf{R}_i\) from \(p(C, \mathbf{R})\) according to the generative model. We can therefore write, instead of equation (3.7):

\[
\frac{\partial D_{KL}(p \,\|\, q)}{\partial \mathbf{W}} \approx -\sum_i \left[ \tilde{C}_i \left(1 - q(C_i \mid \mathbf{R}_i, \mathbf{W})\right) \right] \mathbf{R}_i \tag{3.8}
\]

From equations (3.4), (3.5) and (3.8) we can write:

\[
\mathbf{W}_{n+1} = \mathbf{W}_n + \varepsilon \sum_i \left[ \tilde{C}_i \left(1 - q(C_i \mid \mathbf{R}_i, \mathbf{W}_n)\right) \right] \mathbf{R}_i \tag{3.9}
\]

Equation (3.9) updates \(\mathbf{W}\) for networks both with and without divisive normalization. In networks with divisive normalization we also need to find the best \(\mathbf{a}\), the one that minimizes \(D_{KL}(p \,\|\, q)\). We again use stochastic gradient descent, now minimizing \(D_{KL}(p \,\|\, q)\) with respect to \(\mathbf{a}\):

\[
\frac{\partial D_{KL}(p \,\|\, q)}{\partial \mathbf{a}} = \sum_i \tilde{C}_i \left(1 - q(C_i \mid \mathbf{R}_i, \mathbf{W}, \mathbf{a})\right) \frac{(\mathbf{W}_2 \cdot \mathbf{r}_{\text{comb},i})}{(\mathbf{a} \cdot \mathbf{r}_i)^2}\, \mathbf{r}_i \tag{3.10}
\]

Here \(\mathbf{W}_1\), \(\mathbf{W}_2\), \(\mathbf{r}\) and \(\mathbf{r}_{\text{comb}}\) are defined as follows:

\(\mathbf{r} = [1 \;\; \mathbf{r}_A \;\; \mathbf{r}_V]\)
\(\mathbf{r}_{\text{comb}} = \mathbf{r}\) for LDN
\(\mathbf{r}_{\text{comb}} = [1 \;\; \mathbf{r}_A \;\; \mathbf{r}_V \;\; \mathbf{r}_A \mathbf{r}_A^T \;\; \mathbf{r}_V \mathbf{r}_V^T \;\; \mathbf{r}_A \mathbf{r}_V^T]\) for QDN
\(\mathbf{W} = (\mathbf{W}_1, \mathbf{W}_2)\), where \(\mathbf{W}_1\) corresponds to the elements of \(\mathbf{R}\) without divisive normalization and \(\mathbf{W}_2\) to the elements with divisive normalization.

One important choice is when to stop updating \(\mathbf{W}\) and \(\mathbf{a}\). We continue updating them until a convergence criterion is satisfied. This criterion can be a threshold on \(\Delta \mathbf{W}\), a threshold on the rate of change of the information loss — defined under the test criteria below — or any other measure of convergence. In our simulations, we used the information loss rate as the convergence measure.
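The update (3.9) is compact in practice. The following sketch (Python/NumPy; all names are ours, and the toy data merely stand in for samples \((C_i, \mathbf{R}_i)\) from the generative model) implements one SGD step for a network without divisive normalization, using the \(\tilde{C} = \pm 1\) coding introduced above:

```python
import numpy as np

rng = np.random.default_rng(3)

def sgd_step(W, R_batch, C_batch, lr=0.001):
    """One stochastic-gradient update of equation (3.9).
    R_batch: (trials, features); C_batch: +1 for C=1, -1 for C=2."""
    q = 1.0 / (1.0 + np.exp(-C_batch * (R_batch @ W)))  # q(C_i | R_i, W)
    grad = (C_batch * (1.0 - q)) @ R_batch              # summed over trials
    return W + lr * grad

# Toy usage with synthetic, linearly separable data
T, D = 1000, 10
W_true = rng.standard_normal(D)
R_batch = rng.standard_normal((T, D))
C_batch = np.where(R_batch @ W_true > 0, 1.0, -1.0)
W = np.zeros(D)                      # start from zero weights, as in chapter 4
for _ in range(200):
    W = sgd_step(W, R_batch, C_batch)
```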
3.2. VBLR

The second learning algorithm we used is Variational Bayesian Logistic Regression (VBLR). VBLR finds a simpler approximation to the sigmoid function (3.3). To obtain the best approximation, it finds the \(\mathbf{W}\) which maximizes a lower bound on the sigmoid function. In this algorithm some latent parameters are defined, and the approximation is then optimized with respect to those latent parameters as well as \(\mathbf{W}\). In our simulations we used the VBLR code written by Jan Drugowitsch [16].

One important point about using both the SGD and VBLR algorithms is that in the LDN and QDN networks we have two sets of weights to be learned, \(\mathbf{W}\) and \(\mathbf{a}\). To handle this, in each iteration we first update \(\mathbf{W}\) and then use this \(\mathbf{W}\) in the calculations for updating \(\mathbf{a}\). This idea comes from the Expectation-Maximization (EM) algorithm, in which the E step calculates the expectation of the log-likelihood of the model parameters and the M step finds the values of some hidden parameters which maximize the calculated expectation.

3.3. Test criteria

After finding the optimized \(\mathbf{W}\) and \(\mathbf{a}\), we calculate the information loss rate to measure the performance of the network. The information loss rate is defined as the KL divergence normalized by the mutual information between \(\mathbf{R}\) and \(C\):

\[
I_{\text{loss}} = \frac{D_{KL}\big(p(C \mid \mathbf{R}) \,\|\, q(C \mid \mathbf{R}, \mathbf{W})\big)}{I(C, \mathbf{R})} \approx \frac{\sum_i \left[\log p(C_i \mid \mathbf{R}_i) - \log q(C_i \mid \mathbf{R}_i, \mathbf{W})\right]}{\sum_i \left[\log p(C_i \mid \mathbf{R}_i) - \log p(C_i)\right]} \tag{3.11}
\]

In (3.11), both \(D_{KL}(p \,\|\, q)\) and \(I(C, \mathbf{R})\) are approximated by averaging over sampled trials. It can be shown that the larger the number of trials we use, the better the approximation.

In the real world, different stimuli have different reliabilities, or gains. So the network should be able to discriminate \(C=1\) from \(C=2\) trials across different gains. Therefore, in both training and testing, we used different gains on different trials. To check the performance of the network, we also create the scatter plot of \(p(C=1 \mid \mathbf{R})\) against \(q(C=1 \mid \mathbf{R}, \mathbf{W})\) over many trials.
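Here is a sketch of the information-loss criterion (3.11) (Python/NumPy; names are ours). Instead of using only the sampled \(C_i\) of each trial, it averages over both values of \(C\) weighted by \(p(C \mid \mathbf{R}_i)\), which estimates the same expectations with lower variance:

```python
import numpy as np

def information_loss(p1, q1, p_common=0.5):
    """Normalized KL divergence, equation (3.11), estimated over trials.
    p1[i] = p(C=1 | R_i) from the optimal model,
    q1[i] = q(C=1 | R_i, W) from the trained network."""
    eps = 1e-12
    p = np.clip(p1, eps, 1 - eps)
    q = np.clip(q1, eps, 1 - eps)
    d_kl = np.mean(p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q)))
    mutual_info = np.mean(p * np.log(p / p_common)
                          + (1 - p) * np.log((1 - p) / (1 - p_common)))
    return d_kl / mutual_info
```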
Regarding gain, or reliability, we first trained and tested networks with fixed, equal visual and auditory gains, meaning that all trials in all iterations have the same auditory and visual gains. We then trained networks with multiple gains, i.e., different reliabilities. Using multiple, unequal visual and auditory gains is much closer to real circumstances. However, it is easier for the network to learn the weights for fixed, equal gains, because in this case there is less variance in the training data, which makes the data easier to classify. The fixed-gain case therefore lets us find the proper order of magnitude of the parameters we need for the more complicated case.

4.2. Test

We have two main criteria for measuring the network's performance. The first is the information loss defined by (3.11): the smaller the information loss, the better the network performance. The second is the scatter plot, over many trials, of p(C=1|R), the probability calculated with the analytical method, against q(C=1|R, W), the probability calculated using the network weights. In a scatter plot, the network performs better when the points lie closer to the diagonal. A sample scatter plot with error bars, which show the standard deviation of the data, is shown in figure 4.1.

Figure 4.1 Scatter plot of p(C=1|R) and q(C=1|R, W).

4.3. Results

First we present the information loss of all four networks for both fixed and multiple gains; then we present the scatter plot and information loss diagram of each network for both fixed and multiple gains.

Figure 4.2 The information loss of all networks for both fixed and multiple gains.

As can be seen in figure 4.2, the Lin and LDN networks have a large information loss. This was predicted in view of the discrimination boundary in figure 2.3.

4.3.1. Linear and LDN

Figures 4.3-a and 4.3-b show that the linear network performs completely at chance for both fixed and multiple gains: the probability of C=1 estimated by the linear network is close to 0.5 for any input pattern of activity. This very poor performance is the result of using too limited a set of operations for this network. Figures 4.3-c and 4.3-d show that the LDN network also does not do well in discriminating C=1 from C=2: the scatter plot is still completely off the diagonal and, as for the Lin network, stays very close to 0.5.

Figure 4.3 Scatter plots of network and analytical results for P(C=1): (a) linear network with fixed, equal gains; (b) linear network with multiple, unequal gains; (c) LDN network with fixed, equal gains; (d) LDN network with multiple, unequal gains.

4.3.2. Quadratic

Quadratic with fixed gain

Figure 4.4 Information loss diagram and scatter plot of the Quadratic SGD (stochastic gradient descent) network and analytical results for P(C=1) with fixed train gain and fixed test gain.

Figure 4.5 Information loss diagram and scatter plot of the Quadratic VBLR network and analytical results for P(C=1) with fixed train gain and fixed test gain.
Figures 4.4 and 4.5 show the P(C=1) calculated with a Quad network. The network in figure 4.4 was trained with the stochastic gradient descent (SGD) algorithm, and the network in figure 4.5 with the Variational Bayesian Logistic Regression (VBLR) algorithm. All training trials had fixed, equal visual and auditory gains. This is not compatible with real conditions, where the input gains can be different and unequal; however, equal, fixed auditory and visual gains result in less variance in the training data, so the network can learn the weights better and faster. As can be seen in these figures, the scatter plot is almost diagonal, but the error bars are large in comparison with the mean. The results of the VBLR network are almost identical to those of the SGD network, which shows that the large error bars are not related to the learning algorithm. The information loss is under 10%, but for an acceptable network we need an information loss smaller than 2%. This is why the quadratic network is not suitable for this task.

Figure 4.6 Information loss diagram and scatter plot of the Quadratic SGD network and analytical results for P(C=1) with fixed training gain and multiple test gains.

Figure 4.7 Information loss diagram and scatter plot of the Quadratic VBLR network and analytical results for P(C=1) with fixed train gain and multiple test gains.

In addition to testing the Quad networks trained with equal, fixed gains on that same gain, we tested these networks (SGD and VBLR) with multiple, unequal auditory and visual gains. As can be seen in figures 4.6 and 4.7, the networks trained on data with fixed, equal gains also perform well on test data with multiple, unequal gains. A test over all possible combinations of auditory and visual gains was done; figures 4.8 to 4.10 show the scatter plots and information loss of these combinations for the SGD and VBLR networks.

Figure 4.8 Scatter plots for different combinations of auditory and visual gains for a Quad network trained with the SGD algorithm using fixed and equal auditory and visual gains.

It can be seen in figures 4.8 and 4.9, for the SGD and VBLR networks respectively, that the network performance is much better when the gains are equal, and that as the difference between the gains gets larger, both the deviation and the error bars increase. We could not find an intuitive explanation for the way the line starts folding down as the difference between the gains increases. This shows that the network cannot discriminate C=1 from C=2 properly when the auditory and visual stimuli have different gains. The scatter plots also show larger deviation for cases with smaller gains.
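The sweep over gain combinations behind figures 4.8 to 4.10 can be organized as a simple double loop. The sketch below is illustrative only: simulate_trials, optimal_posterior and network_posterior are hypothetical stand-ins for the generative model, the analytical posterior and the trained network, p_c1 is a placeholder prior, and the trial count is arbitrary. information_loss is the function sketched in section 3.3.

    import numpy as np

    gains = range(2, 9)    # gain values 2..8, as in the figure panels
    p_c1 = 0.5             # placeholder value for the prior p(C=1)
    loss = np.zeros((len(gains), len(gains)))

    for i, g_a in enumerate(gains):
        for j, g_v in enumerate(gains):
            # hypothetical helpers: sample trials, then evaluate both posteriors
            C, R = simulate_trials(g_a, g_v, n_trials=10000)
            p = optimal_posterior(R)     # p(C=1 | R)
            q = network_posterior(R)     # q(C=1 | R, W, a)
            # probability assigned to the true class on each trial
            p_true = np.where(C == 1, p, 1.0 - p)
            q_true = np.where(C == 1, q, 1.0 - q)
            prior_true = np.where(C == 1, p_c1, 1.0 - p_c1)
            loss[i, j] = information_loss(p_true, q_true, prior_true)

Plotting loss as an image over (gain_A, gain_V) gives heat maps of the kind shown in figure 4.10, while the per-cell (p, q) pairs give the scatter panels of figures 4.8 and 4.9.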
[Figure: 7-by-7 grid of scatter plots, one per gain combination (Gain_A, Gain_V) from (2,2) to (8,8).]

Figure 4.9 Scatter plots for different combinations of auditory and visual gains for a Quad network trained with the VBLR algorithm using fixed and equal auditory and visual gains.

The information loss for these combinations is shown in figure 4.10 for the SGD and VBLR networks. Here also we can see that there is less information loss for combinations with a smaller difference between the auditory and visual gains.
[Figure: 7-by-7 grid of scatter plots, one per gain combination (Gain_A, Gain_V) from (2,2) to (8,8).]

Figure 4.10 Information loss for different combinations of auditory and visual gains; a) VBLR network trained with a fixed gain, b) SGD network trained with a fixed gain.

Quadratic with multiple gains

Figure 4.11 Information loss diagram and scatter plot of the Quadratic SGD network and analytical results for P(C=1) with multiple train gains and multiple test gains.

Figure 4.12 Information loss diagram and scatter plot of the Quadratic VBLR network and analytical results for P(C=1) with multiple train gains and multiple test gains.

Figures 4.11 and 4.12 show the results of Quad networks trained with multiple and unequal gains using SGD and VBLR respectively. As we predicted, multiple and unequal gains result in more complex training data, which makes it harder for the network to learn proper weights. This can be seen as more deviation from the diagonal line in the scatter plot, larger error bars, and higher information loss.

Figure 4.13 Information loss diagram and scatter plot of the Quadratic SGD network and analytical results for P(C=1) with multiple train gains and fixed test gain.
Figure 4.14 Information loss diagram and scatter plot of the Quadratic VBLR network and analytical results for P(C=1) with multiple train gains and fixed test gain.

We also used data with fixed and equal gains to test the networks trained with multiple and unequal gains. It can be seen that the error bars in the scatter plot are smaller, and the information loss is also smaller in this case. In fact, figures 4.13 and 4.14 show the best results among the different pairs of equal auditory and visual gains, so we can conclude that the network performs differently for different combinations of gains. This is shown more clearly in figures 3 and 4 of Appendix A, which give the scatter plots for the different gain combinations, and in figure 4.15, which gives the information loss of the different gain combinations for the networks trained with the SGD and VBLR algorithms respectively.

Figure 4.15 Information loss for different combinations of auditory and visual gains; a) SGD network trained with multiple gains, b) VBLR network trained with multiple gains.

4.3.3. QDN

QDN with fixed gain

Figure 4.16 Information loss diagram and scatter plot of the QDN SGD network and analytical results for P(C=1) with fixed train gain and fixed test gain.

Figure 4.17 Information loss diagram and scatter plot of the QDN VBLR network and analytical results for P(C=1) with fixed train gain and fixed test gain.

Figures 4.16 and 4.17 show the P(C=1) calculated with a QDN network. The network in figure 4.16 was trained with the SGD algorithm and the network in figure 4.17 with the VBLR algorithm. All training trials had fixed and equal visual and auditory gains. This is not compatible with real conditions, where the input gains can be different and unequal; however, fixed, equal auditory and visual gains result in less variance in the training data, so the network can learn the weights better and faster. As can be seen in these figures, the scatter plot is almost diagonal but with large error bars in comparison with the mean. The results of the VBLR network are almost identical to those of the SGD network, which shows that the large error bars are not related to the learning algorithm. The information loss is around 8%. This is far from what we expected from a QDN network: given the boundary between the two classes' data and the effect of divisive normalization, we expected the QDN structure to perform much better, with less than 3% information loss. The results show, however, that the combination of quadratic terms and first-order normalization does not discriminate the two classes well enough.
Figure 4.18 Information loss diagram and scatter plot of the QDN SGD network and analytical results for P(C=1) with fixed train gain and multiple test gains.

Figure 4.19 Information loss diagram and scatter plot of the QDN VBLR network and analytical results for P(C=1) with fixed train gain and multiple test gains.

In addition to testing the QDN networks trained with equal and fixed gains on that same gain, we tested these networks (SGD and VBLR) with multiple and unequal auditory and visual gains. As can be seen in figures 4.18 and 4.19, the networks trained on data with fixed and equal gains perform poorly on test data with multiple and unequal gains. The reason may be the larger role the gains play in training: in the QDN network there is an additional set of weights, the divisive normalization factors, to be learnt. This increases the effect of variance in the data and, as a result, makes it more difficult for the network to learn the structure of the training data. A test over all possible combinations of auditory and visual gains was done. Here again, for both SGD and VBLR training, the network performance is much better when the gains are equal, and as the difference between the gains gets larger, both the deviation and the error bars increase. The scatter plots are shown in figures 5 and 6 of Appendix A. The information loss for these combinations is shown in figure 4.20 for the SGD and VBLR networks. Here also we can see that there is less information loss for combinations with a smaller difference between the auditory and visual gains.

Figure 4.20 Information loss for different combinations of auditory and visual gains for QDN; a) VBLR network trained with a fixed gain, b) SGD network trained with a fixed gain.

QDN with multiple gains

Figure 4.21 Information loss diagram and scatter plot of the QDN SGD network and analytical results for P(C=1) with multiple train gains and multiple test gains.

Figure 4.22 Information loss diagram and scatter plot of the QDN VBLR network and analytical results for P(C=1) with multiple train gains and multiple test gains.

Figures 4.21 and 4.22 show the results of QDN networks trained with multiple and unequal gains using SGD and VBLR respectively. As we predicted, multiple and unequal gains result in more complex training data, which makes it harder for the network to learn proper weights. This can be seen as more deviation from the diagonal line in the scatter plot, larger error bars, and higher information loss.
Figure 4.23 Information loss diagram and scatter plot of the QDN SGD network and analytical results for P(C=1) with multiple train gains and fixed test gain.

Figure 4.24 Information loss diagram and scatter plot of the QDN VBLR network and analytical results for P(C=1) with multiple train gains and fixed test gain.

We also used data with fixed and equal gains to test the networks trained with multiple and unequal gains. It can be seen that the error bars in the scatter plot are smaller, and the information loss is also smaller in this case. In fact, figures 4.23 and 4.24 show the best results among the different pairs of equal auditory and visual gains, so we can conclude that the network performs differently for different combinations of gains. This is shown more clearly in figures 7 and 8 of Appendix A, and in figure 4.25, which gives the information loss of the different gain combinations for the networks trained with the SGD and VBLR algorithms respectively.

Figure 4.25 Information loss for different combinations of auditory and visual gains for QDN; a) SGD network trained with multiple gains, b) VBLR network trained with multiple gains.

4.4. Discussion and conclusion

Our simulation results show that a linear discriminator performs very poorly as a neural structure for carrying out the causal inference task. Although the quadratic basis function performs better than the linear one, it still results in a large information loss. This large information loss could be explained by the need for divisive normalization when implementing causal inference using probabilistic population codes. However, we found that even a network that uses first-order divisive normalization as in (2.14) does not decrease the information loss to an acceptable level. We used two different learning algorithms, SGD and VBLR, which train the network via two different approaches. The nearly identical results of these two learning methods show that the poor performance of the networks is unlikely to be due to the learning method.

One way to improve the network performance would be to construct a more complicated basis function, for example a higher-order polynomial and/or higher-order divisive normalization. The question is then to what order the basis function and divisive normalization factors should be increased. One possible approach is to manipulate formula (2.6) in a way that lets us extract higher-order terms.

References

1. Ma, W.J., et al., Bayesian inference with probabilistic population codes. Nat Neurosci, 2006. 9(11): p. 1432-8.
2. Kording, K.P., et al., Causal inference in multisensory perception. PLoS One, 2007. 2(9): e943.
3. van Beers, R.J., A.C. Sittig, and J.J.
Gon, Integration of proprioceptive and visual position-information: An experimentally supported model. J Neurophysiol, 1999. 81(3): p. 1355-64.
4. Jacobs, R.A., Optimal integration of texture and motion cues to depth. Vision Res, 1999. 39(21): p. 3621-9.
5. Yuille, A.L. and H.H. Bülthoff, Bayesian decision theory and psychophysics, in Perception as Bayesian Inference, D.C. Knill and W. Richards, Editors. 1996, Cambridge University Press: New York, NY, USA. p. 123-161.
6. Ernst, M.O. and M.S. Banks, Humans integrate visual and haptic information in a statistically optimal fashion. Nature, 2002. 415(6870): p. 429-33.
7. Alais, D. and D. Burr, The ventriloquist effect results from near-optimal bimodal integration. Curr Biol, 2004. 14(3): p. 257-62.
8. Wallace, M.T., et al., Unifying multisensory signals across time and space. Exp Brain Res, 2004. 158(2): p. 252-8.
9. Sanger, T.D., Probability density estimation for the interpretation of neural population codes. J Neurophysiol, 1996. 76(4): p. 2790-3.
10. Zemel, R.S., P. Dayan, and A. Pouget, Probabilistic interpretation of population codes. Neural Comput, 1998. 10(2): p. 403-30.
11. Dayan, P. and L.F. Abbott, Theoretical Neuroscience: Computational and Mathematical Modeling of Neural Systems. 2001, Cambridge, Mass.: MIT Press.
12. Seung, H.S. and H. Sompolinsky, Simple models for reading neuronal population codes. Proc Natl Acad Sci U S A, 1993. 90(22): p. 10749-53.
13. Hinton, G.E., Products of experts, in International Conference on Artificial Neural Networks (ICANN). 1999. London, England: IEEE.
14. Snippe, H.P., Parameter extraction from population codes: a critical assessment. Neural Comput, 1996. 8(3): p. 511-29.
15. Bishop, C.M., Pattern Recognition and Machine Learning. Information Science and Statistics. 2006, New York: Springer. p. 498-505.
16. http://www.lnc.ens.fr/~jdrugowi/code_vb.html

Appendix A

The following figures show scatter plots of the posterior probability of C=1 calculated with the optimal method against that calculated with the trained neural network, for both the SGD and VBLR training methods. It can be seen that, for all the basis functions we used, the network performance is much better when the gains are equal. As the difference between the gains gets larger, both the deviation and the error bars increase. We could not find an intuitive explanation for the way the line starts folding down as the difference between the gains increases. This shows that these networks cannot discriminate C=1 from C=2 properly when the auditory and visual stimuli have different gains. The scatter plots also show larger deviation for cases with smaller gains.

Figure 1 Scatter plots for different combinations of auditory and visual gains for a Quad network trained with the SGD algorithm using fixed and equal auditory and visual gains.
[Figure: 7-by-7 grid of scatter plots, one per gain combination (Gain_A, Gain_V) from (2,2) to (8,8).]

Figure 2 Scatter plots for different combinations of auditory and visual gains for a Quad network trained with the VBLR algorithm using fixed and equal auditory and visual gains.
[Figure: 7-by-7 grid of scatter plots, one per gain combination (Gain_A, Gain_V) from (2,2) to (8,8).]

Figure 3 Scatter plots for different combinations of auditory and visual gains for a Quad network trained with the SGD algorithm using multiple and unequal auditory and visual gains.
[Figure: 7-by-7 grid of scatter plots, one per gain combination (Gain_A, Gain_V) from (2,2) to (8,8).]

Figure 4 Scatter plots for different combinations of auditory and visual gains for a Quad network trained with the VBLR algorithm using multiple and unequal auditory and visual gains.
[Figure: 7-by-7 grid of scatter plots, one per gain combination (Gain_A, Gain_V) from (2,2) to (8,8).]

Figure 5 Scatter plots for different combinations of auditory and visual gains for a QDN network trained with the SGD algorithm using fixed and equal auditory and visual gains.

[Figure: grid of scatter plots over gain combinations (Gain_A, Gain_V); truncated in the source.]