Before
proceeding with this slideshow, it is recommended that you watch the slideshow A
Tutorial on Cascade-correlation, if
you have not yet seen that more general
introduction.
Perhaps the
most significant limitation of current neural-network modeling is the failure to use old knowledge in new learning. Neural networks start each problem from scratch, with random connection weights. Compared to humans, networks have the advantage of concentrating on one task, but the disadvantage of starting every new problem from scratch.
In the following slides, a source network
is represented as a square in order to simplify KBCC diagrams.
In the first input phase, KBCC sets up a pool of recruitment candidates that includes previously learned source networks as well as single hidden units. There are several versions of each
candidate, each with a different set of initially randomized connection
weights from the input units (represented here by the dashed arrow). As in
ordinary cascade-correlation, the output weights are temporarily frozen,
represented here by the solid arrow.
When the correlations with network error
stop increasing, the best-correlating candidate is selected and the other
candidates are discarded.
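As a concrete sketch of this selection step, the following Python code (an illustration written for this tutorial, not the authors' implementation) computes the covariance measure G described later in these notes for each candidate's recorded activations and picks the best-correlating one; all names and array shapes here are assumptions.

import numpy as np

def covariance_score(cand_acts, errors):
    # G for one candidate: absolute covariance between candidate output
    # activations and target-network error, normalized by the number of
    # candidate outputs, the number of network outputs, and the summed
    # squared network error (as described in these notes).
    # cand_acts: (patterns, candidate outputs); errors: (patterns, network outputs)
    v_dev = cand_acts - cand_acts.mean(axis=0)    # candidate-activation deviance
    e_dev = errors - errors.mean(axis=0)          # error deviance
    cov = np.abs(v_dev.T @ e_dev).sum()           # |sum over patterns|, summed over output pairs
    norm = cand_acts.shape[1] * errors.shape[1] * (errors ** 2).sum()
    return cov / norm

def select_best_candidate(candidate_activations, errors):
    # The best-correlating candidate is kept; the others are discarded.
    scores = [covariance_score(a, errors) for a in candidate_activations]
    return int(np.argmax(scores))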
The best-correlating candidate (here a
source network) is installed into the network, its input weights are frozen
(indicated by the solid arrow), and training reverts to the output phase, in
which all weights (indicated by the dashed arrows) entering the output units
are trained in order to reduce error.
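A minimal sketch of an output phase might look like the following; it uses plain gradient descent on the summed squared error in place of the Quickprop procedure normally used in cascade-correlation, and every name below is an assumption made for illustration.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def output_phase(acts, targets, out_w, lr=0.1, epochs=100):
    # Only the weights entering the output units are trained; all other
    # weights stay frozen. acts: (patterns, units feeding the outputs);
    # targets and the returned activations: (patterns, network outputs).
    for _ in range(epochs):
        out = sigmoid(acts @ out_w)                # current output activations V
        err = out - targets                        # V - T
        grad = acts.T @ (err * out * (1.0 - out))  # gradient of the squared error (constant factor folded into lr)
        out_w -= lr * grad
    return out_w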
Further recruitments may need to be made.
In this case, a single hidden unit is installed downstream of the first
(source network) recruit. The thick arrows represent the idea that the
recruited source network may have multiple inputs and multiple outputs. As in
ordinary cascade-correlation, training continues until all output-unit
activations are within score-threshold of their targets on all of the training
patterns.
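For illustration, that stopping test can be written as a one-line check; the particular score-threshold value below is an assumed default, not a value taken from these slides.

import numpy as np

def all_outputs_correct(outputs, targets, score_threshold=0.4):
    # Training stops when every output activation is within score-threshold
    # of its target on every training pattern.
    return bool(np.all(np.abs(outputs - targets) < score_threshold))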
The function to minimize in the output phases is the squared error in the target network, summed over patterns and outputs. V_o,p is the actual output activation at output o for pattern p, and T_o,p is the target output activation at output o for pattern p. The difference between them is squared and summed over all outputs and patterns.
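Reconstructed from this description (the notation is assumed rather than copied from the published paper), the output-phase objective is:

\[
F \;=\; \sum_{o}\sum_{p}\bigl(V_{o,p} - T_{o,p}\bigr)^{2}
\]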
The function to maximize in the input phases is a normalized covariance between candidate activation and network error. G for each candidate c is an absolute covariance between candidate activation and network error, standardized by the number of outputs for the candidate (#O_c), the number of outputs in the target network (#O), and the squared error in the target network summed across outputs o and patterns p. V is the actual output activation of candidate c for pattern p, and mean V is the mean output activation of candidate c. E is the target-network error at output o for pattern p, and mean E is the mean target-network error at output o. The products of candidate-activation deviance and error deviance are summed over patterns p; the absolute value of that sum is then summed over network outputs o and candidate outputs o_c.
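Reconstructed from this description (assumed notation; the published formula may express the normalization slightly differently), the input-phase objective for candidate c is:

\[
G_{c} \;=\; \frac{\sum_{o}\sum_{o_{c}}\Bigl|\sum_{p}\bigl(V_{o_{c},p}-\overline{V}_{o_{c}}\bigr)\bigl(E_{o,p}-\overline{E}_{o}\bigr)\Bigr|}
{\#O_{c}\,\cdot\,\#O\,\cdot\,\sum_{o}\sum_{p}E_{o,p}^{2}}
\]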
In this illustrative experiment, KBCC
learned source networks representing the components of a cross, which was the
target problem. In each case, the network had to learn to respond positively
to Cartesian points inside the shape and negatively to points outside of the
shape.
Networks learn to identify whether a given pattern falls inside a class that has a two-dimensional uniform distribution. There are 2 linear input units and 1 sigmoid output unit indicating whether a point is inside or outside a class with a given shape, size, and position. The network is trained with 225 patterns forming a 15 x 15 grid covering the whole input space: blue points are training patterns inside the target shape, and red points are training patterns outside it. Generalization is assessed with 200 randomly determined test patterns uniformly distributed over the input space, and network responses are also plotted over a fine 220 x 220 grid of input patterns: white indicates points classified as inside the target shape, black outside, and gray uncertain.
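As a rough sketch of this setup (the coordinate range and rectangle bounds below are assumptions for illustration, not values taken from the slides), the training grid and its labels could be generated like this:

import numpy as np

xs = np.linspace(-1.0, 1.0, 15)                        # 15 values per input dimension
grid = np.array([(x, y) for y in xs for x in xs])      # 15 x 15 = 225 training patterns

def inside_rectangle(p, x_min=-0.6, x_max=0.6, y_min=-0.2, y_max=0.2):
    # True for points inside an axis-aligned horizontal rectangle (illustrative bounds).
    return (x_min <= p[0] <= x_max) and (y_min <= p[1] <= y_max)

targets = np.array([1.0 if inside_rectangle(p) else 0.0 for p in grid])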
This horizontal rectangle constituted
another source network.
The best solution without any hidden units is to classify all patterns as being outside of the target class, because the target class contains only 57 of the 225 training patterns; always responding outside is therefore correct on 168 of the 225 patterns. The target cross pattern is shown in blue.
The first recruit is the source network
representing the horizontal rectangle.
The second recruit is the vertical
rectangle, allowing completely correct performance on the target cross.
The shapes
of the target solutions (3, 4) closely resemble the shapes of recruited source knowledge (1, 2). KBCC implements a new kind of connectionist
compositionality, composing a cross from
its component rectangles while keeping the
components intact. This concatenative compositionality
was thought to be beyond the ability of neural
networks.
This plot shows the mean speedup in learning afforded by KBCC under ten different source-network conditions. Homogeneous subsets are indicated by brackets. The no-knowledge condition is slowest to learn, followed by rotated components and irrelevant knowledge. Exact individual components
afford significantly faster learning. Even faster is having both exact components (illustrated here in red,
an example of which was presented on the
previous few slides). Fastest of all was
having the full exact knowledge provided by
a source network that had previously learned the cross itself.
Such results were obtained in a variety
of artificial and real problem domains. KBCC has shown the ability to deal
with a variety of source-knowledge transformations including translation and
size changes in addition to rotation.
Source
knowledge in KBCC can be either learned or inserted, and it can be relatively crisp or fuzzy. In all of
these cases, KBCC achieves a natural and
seamless integration of learning by
analogy and learning by induction. Consequently, we are excited about the possibility of being able to
simulate the kind of knowledge-based
learning at which people excel.
References
Rivest, F., & Shultz, T. R. (2002). Application of knowledge-based cascade-correlation to vowel recognition. IEEE International Joint Conference on Neural Networks 2002 (pp. 53-58).
Shultz, T. R., & Rivest, F. (2001). Knowledge-based cascade-correlation: Using knowledge to speed learning. Connection Science, 13, 43-72.
Thivierge, J.-P., & Shultz, T. R. (2002). Finding relevant knowledge: KBCC applied to splice-junction determination. IEEE International Joint Conference on Neural Networks 2002 (pp. 1401-1405).