KEY WORDS : A Posteriori Probability Estimation, Density Estimation, Cost Functions, Neural Networks
ABSTRACT
A new algorithm to estimate the 'a posteriori' probabilities of the targets with neural networks in classification problems is presented here. It is based on the estimation of probability density functions with some restrictions that impose the classiffier structure. The method is applied to train perceptrons by means of Gaussian mixtures; it shows a faster convergence that other gradient based methods of posteriori probability estimation. The resulting algorithm establishes some bridge between parametric and non-parametric techniques of posteriori probability estimation.
Basically, we can consider two different methods for estimating a posteriori class probabilities: one is based on the so called Density Estimation (DE), in which hypothesis about the Probability Density Function (PDF) are made, and from that, conditional or a posteriori probabilities can be calculated. The other one is what we call Probability Estimation, and is based in SSB classifiers.
The use of SSB cost functions avoids making any assumption about the data distribution. This is an advantage over probability estimation methods based on estimating the probability density function (PDF) of the classes. However, learning is usually slow.
In this paper we try to establish some linkage between density function estimation and posteriori probability estimation, in order to find algorithms exploiting the advantages of both approaches.
(1) y = (1+exp(-w'x-w_{0}))^{-1}where w' indicates the trasposed of w. Assumme a binary class problems, with classes 0 and 1 and, also, that y is an estimate of the posteriori probability p(1|x); we can write, using the Bayes formula.
(2) y =p_{1} f(x|1)/(p_{0} f(x|0)+p_{1} f(x|1))where p_{1} and p_{0} denote the 'a priori' class probabilities, f_{0}(x)=f(x|0) and f_{1}(x)=f(x|1) are the conditional class probabilities. Combining (1) and (2) we arrive at
(3) p_{0} f_{0}(x) exp(0.5(w'x+w_{0})) = p_{1} f_{1}(x) exp(-0.5(w'x+w_{0}))where f_{0}(x)=f(x|0) and f_{1}(x)=f(x|1). Eq.(3) shows the conditions that should verify the class distributions so that the perceptron can compute exactly the posteriori probabilities. Any pair of PDF's verifying (3) are called, in the following, an implicit PDF pair of the perceptron.
Defining
(4) f_{c}(x) = p_{0} f_{0}(x) exp(0.5(w'x+w_{0})) / qwhere q is some normalizing factor ensuring that f_{c}(x) is a PDF (with area 1). Defining a PDF f_{c}(x), we can generate an arbitrary pair of implicit density functions. For instance, if f_{c}(x) is a zero-mean Gaussian function, we obtain a Gaussian pair. In the following, f_{c}(x) will be called a centered PDF.
Alternatively, in the paper we explore the idea of making hypothesis about the implicit pair, and estimating the parameters of this pair based on data. The method can be summarized as follows:
However, the Gaussian pair is very restrictive, and if the data distributions does not match a Gaussian pair, the weight estimation can be too inaccurate. In the paper, we explore the idea of assumming that fc(x) is a Gaussian mixture,
(5) f_{c}(x) = q_{1} N(x,v_{1}) + q_{2} N(x,v_{2}) + ...+ q_{m} N(x, v_{m})and we derive the learning rules resulting from this. Simulation results will be provided in the paper, showing that the proposed method is faster than other SSB approaches.
Finally, we note that, for the sake of clarity, we discuss here a binary classification problem based on the perceptron; however it can be easily generalized to multi-class problems and more general classifiers. The general case is also discussed in the final version of the paper.