The introduction of real-valued degrees of membership [1] has been carried out with success in the area of unsupervised learning, in particular in the clustering problem [2]. The principal advantage obtained by eliminating the Boolean restriction on the cluster memberships is that the combinatorial problem of grouping elements into crisp clusters is transformed into an analytic problem that can be tackled with the tools of analysis.
In the area of supervised learning there is a sharp separation between connectionist approaches, such as neural networks, and symbolic approaches, such as decision trees. Analyticity is exploited in neural networks in order to devise powerful learning algorithms, such as back-propagation, that in principle lead to globally optimal solutions in terms of a collection of parameters (weights, thresholds) that characterize the operation of the basic network units and their connections. The resulting representation is flexible and robust, but not easily amenable to interpretation. On the other hand, decision trees are generated by combinatorial methods that lead to representations that are transparent and easily interpretable, but which suffer from rigidity, especially for regression problems, and from sensitivity to the data used in their construction.
It would be desirable to maintain the ease of interpretation of a decision tree while increasing the flexibility of its representation, so that analytic tools could be used in its construction. By including fuzzy tests in the decision tree generation, we increase its expressive capability and incorporate into its construction characteristics of connectionist methods that are useful for generating a learning device that is flexible, robust and capable of generalizing. Furthermore, the continuous nature of the representation allows us to make use of analytic calculus to design learning algorithms, similar to the back-propagation algorithm, for the construction of globally optimal trees.
A decision tree consists of a collection of nodes arranged in a hierarchical manner. At the top of the hierarchy stands the root node, which encompasses all of the data in the training set. Each of the nodes composing a tree corresponds to a subset of examples of the training data. The membership of a particular data point in a given node is determined by a test that is made on the node immediately superior in the hierarchy, the parent node. We assume for simplicity that there are D continuous (ordinal) predictor variables, which can be collected in a vector x = {x1, x2, ..., xD}. Nominal variables can be included without difficulty. In a binary tree, a given parent node generates two child nodes characterized by complementary sigmoidal membership functions, which depend on three types of parameters: the vector c defines the linear combination of attributes c·x that is involved in the test; the parameter a specifies the center of the split; finally, the quantity w determines the width of the split. Sigmoidal fuzzy splits have been chosen because in the limit w → 0 the usual multivariate crisp splits (c·x < a) are recovered. Thus only one extra parameter per split is introduced by fuzzification.
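The explicit form of the membership functions is not written out above; the sketch below assumes the standard sigmoidal parameterization consistent with the parameters c, a and w just described (the function name `fuzzy_split` is ours).

```python
import numpy as np

def fuzzy_split(x, c, a, w):
    """Complementary sigmoidal memberships of the two children of a split.

    x : (D,) attribute vector
    c : (D,) weights of the linear combination c.x tested at the node
    a : center of the split
    w : width of the split
    As w -> 0 the left membership tends to the indicator of c.x < a,
    so the usual crisp multivariate split is recovered.
    """
    mu_left = 1.0 / (1.0 + np.exp((np.dot(c, x) - a) / w))
    return mu_left, 1.0 - mu_left

# A narrow split behaves almost crisply:
x = np.array([0.2, 0.7])
c = np.array([1.0, 0.0])                       # test on x1 alone
mu_l, mu_r = fuzzy_split(x, c, a=0.5, w=0.01)
print(mu_l, mu_r)                              # mu_l close to 1, since x1 = 0.2 < 0.5
```

Note that the two memberships always sum to one, so fuzzification redistributes, rather than creates, membership mass.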
The basic architecture of the tree is fixed by growing a crisp tree by means of a traditional decision tree generator [3, 4]. The fuzzy splits are initially chosen to have their center at the threshold value of the corresponding crisp split, but with a non-zero width. These initial values are refined by means of a learning algorithm analogous to back-propagation in neural networks.
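As a concrete illustration of this refinement step, the following sketch tunes the center and width of a single fuzzy split (a "stump") by gradient descent on the squared error. Finite-difference gradients stand in for the analytic back-propagation updates, whose exact form is not given here; all names and settings are illustrative.

```python
import numpy as np

def stump_predict(X, c, a, w, y_left, y_right):
    """Membership-weighted output of a single fuzzy split."""
    m = 1.0 / (1.0 + np.exp((X @ c - a) / w))   # left membership per example
    return m * y_left + (1.0 - m) * y_right

def refine(X, y, c, a, w, y_left, y_right, lr=0.02, steps=100):
    """Refine the split center a and width w by gradient descent on the
    mean squared error, using finite-difference gradients as a stand-in
    for the analytic updates."""
    def mse(a_, w_):
        r = stump_predict(X, c, a_, w_, y_left, y_right) - y
        return np.mean(r * r)
    eps = 1e-5
    for _ in range(steps):
        g_a = (mse(a + eps, w) - mse(a - eps, w)) / (2 * eps)
        g_w = (mse(a, w + eps) - mse(a, w - eps)) / (2 * eps)
        a -= lr * g_a
        w = max(w - lr * g_w, 1e-2)             # keep the width positive
    return a, w

# Noisy step data: the true transition is at x1 = 0.4, the initial guess at 0.5.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(200, 1))
y = (X[:, 0] < 0.4).astype(float) + 0.05 * rng.standard_normal(200)
c = np.array([1.0])
a, w = refine(X, y, c, 0.5, 0.3, 1.0, 0.0)
print(a, w)                                     # the refined center and width
```

In a full tree the same idea applies jointly to all splits, since every split parameter influences the prediction error.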
All the terminal nodes in a fuzzy tree are involved in a given prediction. This stands in contrast with the crisp case, where only one terminal node is active during a single prediction. This global character of the prediction, reminiscent of the global character of connectionist methods, is the key to the improvement of the tree's performance by fuzzification.
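To make this global character concrete, here is a sketch of a hypothetical depth-2 fuzzy regression tree: each terminal node contributes to the output with a weight given by the product of the sigmoidal memberships along its path, and the weights sum to one. The tree structure, parameter values and leaf values are all illustrative.

```python
import numpy as np

def mu_left(x, c, a, w):
    """Sigmoidal left membership of a split; the right one is 1 - mu_left."""
    return 1.0 / (1.0 + np.exp((np.dot(c, x) - a) / w))

def fuzzy_tree_predict(x, root, left_child, leaf_values):
    """Prediction of a depth-2 fuzzy tree in which the root's left child
    is itself split and its right child is a leaf. A leaf's weight is the
    product of the memberships along its path, so every terminal node
    contributes to the output."""
    m_root = mu_left(x, *root)
    m_child = mu_left(x, *left_child)
    weights = np.array([m_root * m_child,         # left-left leaf
                        m_root * (1.0 - m_child), # left-right leaf
                        1.0 - m_root])            # right leaf
    return float(np.dot(weights, leaf_values))    # weights sum to one

x = np.array([0.3, 0.8])
root = (np.array([1.0, 0.0]), 0.5, 0.1)           # test on x1, centered at 0.5
left_child = (np.array([0.0, 1.0]), 0.6, 0.1)     # test on x2, centered at 0.6
print(fuzzy_tree_predict(x, root, left_child, np.array([1.0, 2.0, 3.0])))
```

Because the path weights form a partition of unity, the output is always a convex combination of the leaf values, which is what makes the prediction smooth in x.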
In order to illustrate the fuzzification methodology described above, let us examine the results obtained in a regression problem proposed in Ref. [5]. The problem has two predictor variables and a dependent variable containing some normally distributed noise (σ = 0.1).
[1] L. Zadeh. Fuzzy sets. Information and
Control, 8:338-353, 1965.
[2] J. Bezdek and S. Pal, eds. Fuzzy Models for Pattern Recognition. IEEE Press, NY, 1992.
[3] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Chapman & Hall, NY, 1984.
[4] J.R. Quinlan. Induction of decision trees. Machine
Learning, 1(1):81-106, 1986.
[5] V. Cherkassky and F. Mulier. Statistical and neural network techniques for nonparametric regression. In Selecting Models from Data, P. Cheeseman and R. W. Oldford, eds., pages 383-392. Springer-Verlag, NY, 1994.