TITLE: Estimation of A Posteriori Class Probabilities by Means of Constrained Probability Density Functions.

AUTHORS: Juan I. Arribas-Sanchez (nacho@pluton.tel.uva.es), Jesus Cid-Sueiro (jesus@tel.uva.es) and Anibal R. Figueiras-Vidal (arfv@ing.uc3m.es)

KEY WORDS: A Posteriori Probability Estimation, Density Estimation, Cost Functions, Neural Networks

ABSTRACT

A new algorithm to estimate the 'a posteriori' probabilities of the target classes with neural networks in classification problems is presented here. It is based on the estimation of probability density functions under the restrictions imposed by the classifier structure. The method is applied to train perceptrons by means of Gaussian mixtures; it shows faster convergence than other gradient-based methods of a posteriori probability estimation. The resulting algorithm establishes a bridge between parametric and non-parametric techniques of a posteriori probability estimation.

Introduction

It is well known that when a neural network is trained to minimize a mean square error [1] or the cross entropy [2] between the targets and the network outputs, the network provides, after convergence, estimates of the 'a posteriori' probabilities of the classes; any cost function providing a posteriori probability estimates is called Strict Sense Bayesian (SSB). General conditions for SSB cost functions are studied in [3]-[5].

Basically, we can consider two different methods for estimating a posteriori class probabilities. The first is the so-called Density Estimation (DE) approach, in which hypotheses about the Probability Density Function (PDF) of each class are made, and from them the conditional or a posteriori probabilities are calculated. The second is what we call Probability Estimation, which is based on SSB classifiers.

The use of SSB cost functions avoids making any assumption about the data distribution. This is an advantage over probability estimation methods based on estimating the class-conditional PDFs. However, learning is usually slow.

In this paper we try to establish a link between density estimation and a posteriori probability estimation, in order to find algorithms exploiting the advantages of both approaches.

Sigmoidal Perceptrons

Consider the simple case of the logistic perceptron, with output given by
(1)    y = (1 + exp(-w'x - w0))^(-1)
where w' denotes the transpose of w. Assume a binary classification problem, with classes 0 and 1, and also that y is an estimate of the a posteriori probability P(1|x); using the Bayes formula, we can write
(2)    y =p1 f(x|1)/(p0 f(x|0)+p1 f(x|1))
where p1 and p0 denote the 'a priori' class probabilities, and f0(x) = f(x|0) and f1(x) = f(x|1) are the class-conditional PDFs. Combining (1) and (2) we arrive at
(3)    p0 f0(x) exp(0.5(w'x+w0)) = p1 f1(x) exp(-0.5(w'x+w0))
Eq. (3) shows the condition that the class distributions must satisfy so that the perceptron can compute exactly the a posteriori probabilities. Any pair of PDFs verifying (3) is called, in the following, an implicit PDF pair of the perceptron.

Let us define

(4)    fc(x) = p0 f0(x) exp(0.5(w'x+w0)) / q

where q is a normalizing factor ensuring that fc(x) is a PDF (i.e., it integrates to one). Conversely, by choosing a PDF fc(x) we can generate an arbitrary pair of implicit density functions. For instance, if fc(x) is a zero-mean Gaussian function, we obtain a Gaussian pair. In the following, fc(x) will be called a centered PDF.
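As a numerical illustration of (1)-(4), the following Python/NumPy sketch builds an implicit pair from a Gaussian centered PDF and checks that the Bayes posterior of Eq. (2) coincides with the perceptron output of Eq. (1). The concrete values of w, w0 and p1, and the particular Gaussian chosen for fc, are arbitrary assumptions made only for this sketch.

    # Numerical check of the implicit-pair relation, assuming a Gaussian centered PDF fc(x).
    import numpy as np
    from scipy.stats import multivariate_normal

    rng = np.random.default_rng(0)
    d = 2
    w, w0 = np.array([1.0, -0.5]), 0.3                        # perceptron parameters
    p1 = 0.4
    p0 = 1.0 - p1                                             # 'a priori' class probabilities
    fc = multivariate_normal(mean=np.zeros(d), cov=np.eye(d)) # centered PDF

    x = rng.normal(size=d)                                    # an arbitrary test point
    z = w @ x + w0

    # Implicit pair from (4) and (3); the normalizing factor q cancels in the
    # posterior, so it is omitted here.
    f0 = fc.pdf(x) * np.exp(-0.5 * z) / p0
    f1 = fc.pdf(x) * np.exp(+0.5 * z) / p1

    posterior = p1 * f1 / (p0 * f0 + p1 * f1)                 # Bayes formula, Eq. (2)
    logistic = 1.0 / (1.0 + np.exp(-z))                       # perceptron output, Eq. (1)
    print(posterior, logistic)                                # the two values coincide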

Estimating implicit pairs

Note that, when a logistic perceptron is trained to minimize an SSB cost, no assumptions are made about the implicit pair. The learning rules apply whatever the actual implicit pair is: even if the actual class PDFs are not an implicit pair, the minimization of an SSB cost provides an approximation to the 'a posteriori' probability of class 1 (for the binary classification problem stated above).
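For reference, a minimal sketch of this non-parametric route is given below: stochastic gradient minimization of the cross entropy, an SSB cost, for a logistic perceptron. The synthetic two-Gaussian data, the step size and the single training pass are illustrative assumptions, not taken from the paper.

    # Stochastic gradient minimization of the cross entropy for a logistic perceptron.
    import numpy as np

    rng = np.random.default_rng(1)
    n, d, eta = 2000, 2, 0.05
    t = rng.integers(0, 2, size=n).astype(float)             # target class labels
    X = rng.normal(size=(n, d)) + np.outer(t, [1.5, 1.0])    # two Gaussian classes

    w, w0 = np.zeros(d), 0.0
    for xk, tk in zip(X, t):                                 # one stochastic-gradient pass
        y = 1.0 / (1.0 + np.exp(-(w @ xk + w0)))             # perceptron output, Eq. (1)
        w = w + eta * (tk - y) * xk                          # cross-entropy gradient step
        w0 = w0 + eta * (tk - y)
    # After convergence, y(x) approximates the 'a posteriori' probability P(1|x).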

Alternatively, in this paper we explore the idea of making hypotheses about the implicit pair and estimating the parameters of this pair from the data. The method can be summarized as follows:

  1. Assume a centered PDF fc(x,v), where v is some parameter vector to be estimated.
  2. Express f0 and f1 as a function of w and v.
  3. Obtain maximum log-likelihood estimates of w and v.
  4. Use the estimate of w as the perceptron weight vector.
For instance, we can state that fc(x,v) is a Gaussian function, where the parameter vector v includes the components of the mean vector and of the covariance matrix. It is easy to show that, following the above procedure, the estimation of vectors v and w is very fast.
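As an illustration of steps 1-4 under this assumption, the sketch below exploits the fact that a Gaussian centered PDF induces a Gaussian implicit pair with a common covariance matrix, so that the maximum-likelihood estimates of the pair (class means, pooled covariance, priors) yield the perceptron weights in closed form. The function name fit_perceptron_gaussian_pair and the particular estimators are assumptions made for this sketch, not the paper's own derivation.

    # Closed-form weight estimation assuming a Gaussian implicit pair with common covariance.
    import numpy as np

    def fit_perceptron_gaussian_pair(X, t):
        """X: (n, d) samples, t: (n,) binary labels; returns perceptron (w, w0)."""
        X0, X1 = X[t == 0], X[t == 1]
        p1 = len(X1) / len(X)                                # 'a priori' probability estimates
        p0 = 1.0 - p1
        m0, m1 = X0.mean(axis=0), X1.mean(axis=0)            # ML estimates of the class means
        C = (np.cov(X0.T, bias=True) * len(X0)
             + np.cov(X1.T, bias=True) * len(X1)) / len(X)   # pooled ML covariance
        Cinv = np.linalg.inv(C)
        w = Cinv @ (m1 - m0)                                 # perceptron weight vector
        w0 = -0.5 * (m1 + m0) @ Cinv @ (m1 - m0) + np.log(p1 / p0)
        return w, w0

On data that actually follows a Gaussian pair (such as the synthetic set of the previous sketch), a single pass over the samples suffices, which is consistent with the fast estimation of v and w claimed above.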

However, the Gaussian pair is very restrictive, and if the data distribution does not match a Gaussian pair, the weight estimation may be too inaccurate. In the paper, we explore the idea of assuming that fc(x) is a Gaussian mixture,

(5)    fc(x) = q1 N(x,v1) + q2 N(x,v2) + ...+ qm N(x, vm)
and we derive the learning rules resulting from this.  Simulation results will be provided in the paper, showing that the proposed method is faster than other SSB approaches.
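The learning rules themselves are derived in the paper; purely to make step 3 concrete for the mixture case, the sketch below writes one possible form of the joint log-likelihood of labelled samples under assumptions made here: the implicit pair is generated from the mixture (5) through (3)-(4), the class priors are those implied by (3), and the normalizing integrals are obtained in closed form from the Gaussian moment-generating function. Maximizing this quantity over (w, w0) and the mixture parameters, e.g. by gradient ascent, would follow the outline above; the function name and parameterization are illustrative only.

    # Schematic joint log-likelihood under a Gaussian-mixture centered PDF, Eq. (5).
    import numpy as np
    from scipy.stats import multivariate_normal

    def joint_log_likelihood(X, t, w, w0, mix_q, means, covs):
        """Average joint log-likelihood of labelled samples (X, t) under the
        implicit pair generated by the mixture centered PDF of Eq. (5) and the
        perceptron parameters (w, w0)."""
        z = X @ w + w0                                       # perceptron activation
        # fc(x) = sum_i q_i N(x; m_i, C_i), Eq. (5)
        log_fc = np.log(sum(qi * multivariate_normal(mi, Ci).pdf(X)
                            for qi, mi, Ci in zip(mix_q, means, covs)))
        # I(+/-) = integral of fc(x) exp(+/-0.5(w'x + w0)) dx, closed form for Gaussians;
        # Eq. (3) then implies the priors p0 = I(-)/(I(-)+I(+)), p1 = I(+)/(I(-)+I(+)).
        I = [sum(qi * np.exp(s * (w @ mi + w0) + 0.125 * (w @ Ci @ w))
                 for qi, mi, Ci in zip(mix_q, means, covs))
             for s in (-0.5, +0.5)]
        # per-sample term: log fc(x) + (2t-1) 0.5 (w'x + w0) - log(I(-) + I(+))
        return np.mean(log_fc + (2 * t - 1) * 0.5 * z) - np.log(I[0] + I[1])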

Finally, we note that, for the sake of clarity, we discuss here a binary classification problem based on the perceptron; however, the method can be easily generalized to multi-class problems and more general classifiers. The general case is also discussed in the final version of the paper.

Conclusions

In this paper we propose the application of parameter estimation methods to train neural classifiers. The method requires making some hypotheses about the implicit PDFs, but these can be assumed to be as general as desired. The resulting method shows faster learning than other non-parametric approaches.

REFERENCES

[1] D. W. Ruck, S. K. Rogers, M. Kabrisky, M. E. Oxley and B. W. Suter, "The multilayer perceptron as an approximation to a Bayes optimal discriminant function", IEEE Transactions on Neural Networks, vol. 1, no. 4, pp. 296-298, Dec. 1990.
[2] S. Amari, "Backpropagation and stochastic gradient descent method", Neurocomputing, vol. 5, pp. 185-196, June 1993.
[3] J. Cid-Sueiro and A. R. Figueiras-Vidal, "Cost Functions to Estimate Class Probabilities", Proceedings of the European Conference on Signal Analysis and Prediction (ECSAP'97), pp. 113-116, Praha, 24-27 June 1997.
[4] J. I. Arribas-Sanchez and J. Cid-Sueiro, "Bayesian Approaches to the Estimation of A Posteriori Probabilities", Proceedings of the IASTED International Conference on Signal Processing and Communications, pp. 107-110, Canary Islands, Spain, Feb. 1998.
[5] J. Cid-Sueiro, J. I. Arribas-Sánchez and A. R. Figueiras-Vidal, "Cost Functions to Estimate A Posteriori Probability in Multi-Class Problems", submitted to IEEE Transactions on Neural Networks.