2015 Neil Avenue

The Ohio State University

Columbus, OH 43210

USA

email: ulugb@er4.eng.ohio-state.edu

Tel: 1-614-292-6502

Fax: 1-614-292-7596

Pattern recognition theory is the field of applied mathematics in which principles and methods are developed and implemented to perform classification and identification of objects, phenomena, processes, and signals. Here we designate an object as a mathematical primitive that can be specified by a finite set of features, or properties, which characterize the objects [Mathematical Encyclopedia, 1984].

There are two main approaches typically used to solve pattern recognition problems: the statistical approach and the syntactic approach. Statistical approaches, which are the focus of this research, model the feature variables stochastically, under the assumption that the data can be described by a probabilistic model. In contrast, syntactic approaches usually rely on fixed structural relationships between the constituent groups of the considered feature variables, although stochastic structures are possible [Fu and Booth, 1986].

There are basically two broad classes of problems [D.J. Hand, 1997] considered in statistical pattern recognition: supervised and unsupervised learning. In supervised learning (classification), the classes of the objects are known a priori (prespecified by a ``teacher'' or ``supervisor'') and the goal is to formulate ``good'' rules to assign objects to their appropriate classes. In contrast, the aim of unsupervised learning (clustering) is to formulate a class structure, i.e., to decide how many classes there are and to assign objects to their appropriate classes.

In this paper, we will investigate supervised learning and in particular the application of Vector Quantization (VQ) techniques to supervised learning. It should be noted that while VQ techniques are generally considered unsupervised, the work presented here is more closely related to recent work by Kohonen and Gray, and is a departure from strictly unsupervised techniques.

For supervised learning, Bayes decision theory develops the best (in terms of probability of error, P_{e}, or more generally Bayes risk, R) decision making process under complete knowledge of the underlying probabilistic model. In applications, however, complete knowledge of the underlying statistics is seldom the case. Instead, what is typically available is a training set, T = {(x_{1}, y_{1}), (x_{2}, y_{2}), ..., (x_{N}, y_{N})}, consisting of N feature vectors, x_{i}, and their class labels, y_{i}.

Under the umbrella of Bayes decision theory and Bayes' theorem, there are basically two main approaches to designing a classifier using the training set in supervised learning. The first approach is to estimate the class conditional probabilities, p(x|w_{i}), and the a priori class probabilities, p(w_{i}). Alternatively, one can estimate the posterior probabilities, p(w_{i}|x), directly. The first approach can be loosely called **informative learning** while the latter can be considered **discriminative** in nature.

Furthermore, both the informative and the discriminative learning approach can be performed via parametric or nonparametric techniques. In parametric techniques, a form (or model) for the quantity being estimated, such as Gaussian, uniform, linear, or quadratic, is assumed (or sometimes known), and the parameters of the model are then estimated using the training set. In comparison, nonparametric techniques obtain the desired probability estimates from the training set without assuming a specific form for the estimated quantities.
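To make the parametric, informative route concrete, the following sketch fits an assumed one-dimensional Gaussian model per class and then classifies by maximizing p(x|w_{i}) p(w_{i}); the data, names, and parameter values are illustrative, not taken from this paper.

```python
import math
import random

# Sketch of parametric informative learning with an assumed 1-D
# Gaussian class-conditional model; all names and data are illustrative.
random.seed(0)

# Synthetic two-class training set: class 0 ~ N(0,1), class 1 ~ N(4,1).
x0 = [random.gauss(0.0, 1.0) for _ in range(200)]
x1 = [random.gauss(4.0, 1.0) for _ in range(200)]

def fit_gaussian(xs):
    """Parametric step: estimate the model's mean and variance."""
    m = sum(xs) / len(xs)
    v = sum((x - m) ** 2 for x in xs) / len(xs)
    return m, v

params = [fit_gaussian(x0), fit_gaussian(x1)]
# A priori class probabilities estimated by relative frequency.
priors = [len(x0) / 400, len(x1) / 400]

def classify(x):
    """Assign x to the class maximizing p(x|w_i) * p(w_i)."""
    scores = []
    for (m, v), p in zip(params, priors):
        lik = math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
        scores.append(lik * p)
    return scores.index(max(scores))
```

A nonparametric informative technique would replace `fit_gaussian` with, e.g., a histogram or kernel density estimate, leaving the Bayes-rule decision step unchanged.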

For classifier design in supervised learning with a finite amount of training data, generalization is one of the most important considerations, if not the most important. Generalization is a measure of how well the classifier designed using the training set will do on previously unobserved (test) data. For this purpose, classifier complexity (as measured, e.g., by the Vapnik-Chervonenkis (VC) dimension, or more crudely by the number of classifier parameters) must be neither too high nor too low: too high and the classifier overfits (memorizes) the training set; too low and it underfits. Either extreme degrades generalization on previously unobserved data.

In parametric techniques the complexity of the classifier can typically be controlled by the choice of the model family and the model order (number of parameters to be estimated). In contrast, in nonparametric classifiers complexity may be controlled in slightly more indirect or nonintuitive ways, sometimes via so-called smoothing parameters.

The k-nearest neighbor classifier is the quintessential nonparametric discriminative classification algorithm. For k=1, the nearest neighbor classifier corresponds to a VQ classifier with the whole training set as the codebook. As such it is of high complexity and cannot achieve the Bayes error even asymptotically (as the size of the training set goes to infinity). Thus, clever selection of a smaller codebook for the VQ classifier is a way of reducing the complexity of this technique (for improved generalization), as well as of increasing efficiency and processing speed as a result of the reduction in storage requirements.
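The 1-NN rule viewed as a VQ classifier can be sketched as follows: the codebook is the entire labeled training set, and an observation takes the label of its nearest codevector under the L_{2}-norm (the toy data and names here are illustrative).

```python
import math

# Sketch: the 1-nearest-neighbor rule as a VQ classifier whose codebook
# is the whole labeled training set (toy data, illustrative names).
codebook = [(0.0, 0.0), (1.0, 1.0), (4.0, 4.0), (5.0, 4.0)]
labels = [0, 0, 1, 1]

def nn_classify(x, codebook, labels):
    """Return the label of the codevector nearest to x in the L2 norm."""
    dists = [math.dist(x, c) for c in codebook]
    return labels[dists.index(min(dists))]
```

Selecting a smaller labeled codebook, the theme of the algorithms compared in this paper, amounts to shrinking `codebook` (and `labels`) while trying to preserve, or improve, classification accuracy.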

VQ has its origins in signal processing, where it is used for compact, accurate representation of input signals or images. However, since VQ results in a partitioning of the input space, it can also be used in pattern recognition. While VQ's use in pattern recognition has been mostly confined to unsupervised learning, it can be, and has been, used for supervised learning as well, its use becoming especially popular after the introduction of Kohonen's Learning Vector Quantizer (LVQ) algorithm.

The VQ classification problem can be described as follows. A training set T = {(x_{i}, y_{i}), i = 1, 2, ..., N}, where x_{i} in R^{n} are feature vectors and y_{i} in {1, 2, ..., L} are class labels, is observed. The training set is a sample from a random process {(X_{i}, Y_{i}), i = 1, 2, ...}, and the individual pairs (X_{i}, Y_{i}) are assumed to have a common but unknown distribution F_{XY}(x, y). The goal is to design, based on the training set T, a VQ classifier, consisting of a codebook with labels, that for an observation X provides a good estimate of the class label Y associated with X.

In this paper we review existing techniques for training VQs for supervised learning, propose new algorithms for training VQ classifiers, and compare their performance to other prominent VQ classification algorithms to date, such as Xie's independent quantizer/classifier design, Kohonen's LVQ, and Gray's Bayes risk weighted VQ classifier. In addition, generalizations of the ordinary VQ classifier that extend the polytopal Voronoi regions (induced by the L_{2}-norm) to more complex forms are considered.