A Hierarchical Self-Organizing Modular Architecture:
A Case Study in Learning Character Recognition
Alejandro Sierra & Fernando J. Corbacho
Universidad Autónoma Madrid
Dept. Ingeniería Informática
28049 Madrid (SPAIN)
Tel.: +34-1-397-4319 Fax: +34-1-397-5277
Extended Abstract

 As the title indicates the main objective of this paper is the development of algorithms for dynamic incremental task/subtask decomposition. First of all we describe some of the main advantages of modular (hierarchical) architectures, second we provide a critique of some of the most important current modular architectures, and third we provide with mechanisms to make these architectures more dynamic by allowing the autonomous construction of new modules and hierarchical levels.

Modular networks present several advantages over "monolithic" architectures. For instance each module learns to handle a subset of the complete set of training cases. The argument was well exposed by Jacobs et al. (1991) [J91] with respect to the backpropagation training algorithm. "If backpropagation is used to train a single, multilayer network to perform different subtasks on different occasions, there will generally be strong interference effects ... " The idea behind such a system is that the gating network allocates a new case to one or a few experts, and, if the output is incorrect, the weight changes are localized to these selected experts (and the gating network). Whenever an expert gives less error than the weighted average of the errors of all the experts its responsibility for that case is increased, and whenever it does worse than the weighted average its responsibility will be decreased. Nevertheless [J91] as well as Jordan and Jacobs (1994) [JJ94] assume that the number of experts needed for the task subdivision is known in advance in the sense that the structure of the modular architecture is fixed before the training data is presented. Hence, the training data will simply determine the free coefficients and may not play a role in selecting the model architecture.

In this paper we have reduced a number of assumptions in the HME architecture developed in [JJ94]. First of all we remove an implicit assumption corresponding to the ordering (structure) in the input training cases. We have found that if the HME is trained with data ordered in the form "0000111" there is a large temporal interference effect in the sense that the first learned data is forgotten when learning the examples from the second class. This was tested for both training sets in character recognition and the Pima indian database (Ripley, 1996). Our architecture allows for the data to come in any order, in particular "clustered" examples from a single class that may correspond to an inherent property of the environment generating the training cases.

Secondly there is no a priori fixation on the specific structure of the architecture, that is, the network begins with very few modules (e.g. one or two) and new modules may be autonomously created as the training data requires. Model Selection is discussed in [JJ94] "Utilizing the HME approach requires that choices be made regarding the structural parameters of the model, in particular the number of levels and the branching factor of the tree." We believe that in most cases the number of experts/modules (related to the number of subtasks) is not known a priori. Examples of systems inmersed in realistic environments which are highly dynamic and partially unpredictable abound (e.g. robotics). So we claim that it is fundamental for the system to be able to, starting with N classes, cope with N+1 classes while avoiding possible interferences (between what was already learned and what must be learned anew). In contrast in [JJ94] a new sufficiently different class has to be learned by the already trained modules. This will cause to forget part of the information previously learned while learning the new information. This is a side effect of weight sharing in most NN e.g. backpropagation.

In summary [JJ94] assume that different expert networks are appropriate in different regions of the input space. That is, the data can be well described by a collection of functions, each of which is defined over a relatively local region of the input space. Yet if not all data is available at the beginning of the training sequence the available experts will perform a specific partition but when new data becomes available this partition will have to drastically change. In order to avoid this interference phenomena the system must be able to perform autonomous construction of new modules as the training data requires. Hence, we have introduced a mechanism to decide when to construct a new module and how to determine the new configuration of modules. In particular we have found that for new input data from a new class, a very large "distribution" error (i.e. the sum of the squares of the gating net output coefficients) serves as an indication to construct a new module. This distribution error is defined as a function of the homogeneity of the gating weights, that is, represents a measure of how many experts are involved for a particular sample case.

Due to the incremental construction of new modules the gating net must be also dynamically reconfigurable to allow for new modules to be inserted in the previous network structure. We have found that when a new gating output channel is created to take care of the newly inserted module, a phenomenon of interference also occurs within the gating net. That is, the previously learned weights corresponding to the already existing gating channels are "damaged". We have provided a solution to this problem by temporal hierarchical clustering. This method creates a new cluster incorporating the previous existing network and creates a new module "outside" this cluster to account for the new class. This provides for encapsulation of the previously learned gating net as well as module parameters while allowing the overall network to grow. After learning has stabilized the cluster is flattened out. To materialize on the concepts described above we have used character recognition as our test case. In concrete we first train the network with 0s and 1s and incrementally introduce new classes up to ten, that is the environment defining the input space is dynamic.

Related work by Corbacho & Arbib (1997) argued that [JJ94] presented several limitations with respect to the Schema Theory architecture, among others the lack of distributed dynamics and autonomous composition of modules. In this paper we have provided with composition into clusters. Sierra & Santa-Cruz (1998) describe a means to assemble both local learning and global non-linear algorithms to enhance performance of the overall pattern recognition machine without introducing gating nets. Algorithms such as ID3 build classification trees that have axis-parallel splits and use heuristic splitting algorithms (Quinlan, 1986). More recent research has studied decision trees with oblique splits (Murphy et al., 1993; Utgoff and Brodley, 1990). None of these papers, however, has treated the problem of splitting data as a statistical problem, nor have they provided a global goodness-of-fit measure for their trees. Lastly current work on finding underlying "causes" (e.g. Zemel; Becker) bears certain similitude to autonomous construction of networks yet their architecture is fundamentally different in that it does not stress the modularity aspect to avoid interference.