Cascade-correlation (CC) is a generative, feed-forward learning algorithm for artificial neural networks (Fahlman & Lebiere, 1990). It has been used extensively in the simulation of various aspects of cognitive development (Shultz et al., 1995).
An artificial neural network (ANN) is composed of units and connections between the units. Units can be pictorially represented as circles. Units become variously active depending on their inputs. In a very rough sense, units in ANNs can be seen as analogous to neurons or perhaps groups of neurons. Psychologically, a pattern of activation across units can be interpreted as a cognition or idea.
Connection weights determine an organizational topology for a network and allow units to send activation to each other. In a very rough sense, connection weights in ANNs can be regarded as analogous to neural synapses. Input units code the problem being presented to the network. Output units code the network’s response to the input problem. Hidden units perform essential intermediate computations. Psychologically, the matrix of connection weights can be regarded as a network’s long-term memory. In cascade-correlation, there are cross-connections that bypass hidden units.
A problem is presented to a network as a vector (ordered list) of real numbers specifying how active each input is.
Activation is passed forward from the inputs to the next layer of units, in this case the first hidden unit.
Then activation is passed from the inputs and first hidden unit to the second hidden unit. Notice the cascaded connection weight between the two hidden units.
Finally, activation is passed to the output units from all input and hidden units. The pattern of activation across the output units represents the network’s response to the particular problem that was presented on the inputs. Because activation flows in one direction only, from the inputs toward the outputs, cascade-correlation is known as a feed-forward algorithm.
The net input to a unit is the sum, across all sending units, of each sending unit's activation multiplied by the weight of its connection to the receiving unit.
Useful hidden units have non-linear activation functions, such as the sigmoid function shown here. In this equation, e is the exponential and x is the net input to unit i.
The sigmoid activation function has a floor and ceiling, somewhat like real neurons. Output units designed to make binary decisions also typically have a sigmoid activation function.
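The forward pass described above can be sketched in a few lines of Python. This is a minimal illustration, not the original implementation: the function names and the list-based weight layout are assumptions, and the standard logistic form of the sigmoid (floor 0, ceiling 1) is assumed. Some cascade-correlation implementations shift the sigmoid so that it ranges from -0.5 to 0.5.

```python
import math


def net_input(activations, weights):
    """Sum of products of sending-unit activations and their connection weights."""
    if len(activations) != len(weights):
        raise ValueError("need exactly one weight per sending unit")
    return sum(a * w for a, w in zip(activations, weights))


def sigmoid(x):
    """Logistic sigmoid of net input x (assumed form: floor 0, ceiling 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def forward(inputs, hidden_weights, output_weights):
    """Propagate one input pattern through a cascaded network.

    inputs         -- list of input activations
    hidden_weights -- one weight list per hidden unit, in recruitment order;
                      hidden unit k has one weight per input plus one weight
                      per earlier hidden unit (the cascaded connections)
    output_weights -- one weight list per output unit, covering all inputs
                      (the cross-connections) and all hidden units
    """
    senders = list(inputs)
    for weights in hidden_weights:
        # Each hidden unit receives from the inputs and every earlier hidden unit,
        # then becomes a sender for all units downstream of it.
        senders.append(sigmoid(net_input(senders, weights)))
    return [sigmoid(net_input(senders, w)) for w in output_weights]
```

For example, `forward([0.0, 0.0], [[1.0, 1.0]], [[0.0, 0.0, 2.0]])` describes a network with two inputs, one hidden unit, and one output unit that listens only to the hidden unit.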
Networks must be trained before they can provide correct solutions to problems. CC training involves both adjusting connection weights and adding new hidden units. CC networks begin without any hidden units. Trainable connection weights are drawn in this slideshow as dashed arrows. Initially, these weights have random values, generating random performance. Weights are adjusted to reduce discrepancy (error) between the actual output vector and a target output vector of correct activations. One of the units at the input level in CC is called the bias unit because it always has an input activation of 1.0, regardless of the input pattern being presented. With trainable connection weights to all downstream units, the bias unit effectively implements a learnable resting activation level for each hidden and output unit.
Network error is defined as the sum of squared differences between actual (A) and target (T) output activations. These squared differences are summed over output units (o) and training patterns (p). Connection weights are adjusted to minimize error. The training patterns are vector pairs of input activations and target output activations.
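The error measure just described can be written directly as a double sum. The sketch below assumes that actual and target activations are stored as lists of output vectors, one vector per training pattern; the function name is illustrative only.

```python
def sum_squared_error(actual, target):
    """Sum over patterns p and output units o of (A_po - T_po) ** 2.

    actual, target -- lists of output vectors, one vector per training pattern
    """
    return sum(
        (a - t) ** 2
        for a_vec, t_vec in zip(actual, target)  # over patterns p
        for a, t in zip(a_vec, t_vec)            # over output units o
    )
```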
When error reduction stagnates, a hidden unit is recruited. As the first hidden unit is added, its input weights are frozen (shown in solid arrows), and training of the output weights resumes. CC networks are generative in the sense that the learning algorithm builds the internal topology of a network. Thus, CC networks grow as they learn.
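The text does not say how stagnation is detected, so the following is only one plausible sketch. It assumes that error is recorded after each epoch and treats training as stagnant when the best error over the last few epochs improves on the earlier error by less than a small fraction. The parameter names `patience` and `change_threshold` and their default values are hypothetical.

```python
def has_stagnated(error_history, patience=8, change_threshold=0.01):
    """Return True when error has not improved meaningfully in recent epochs.

    error_history    -- network error recorded after each epoch, oldest first
    patience         -- hypothetical: number of recent epochs to examine
    change_threshold -- hypothetical: minimum relative improvement that counts
    """
    if len(error_history) <= patience:
        return False  # too few epochs recorded to judge
    reference = error_history[-patience - 1]      # error just before the window
    best_recent = min(error_history[-patience:])  # best error inside the window
    return reference - best_recent < change_threshold * abs(reference)
```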
If a second hidden unit is required, it is installed downstream of the first hidden unit. After each hidden unit is recruited, training of the output weights resumes. Thus, these phases are known as output phases. Learning continues in this fashion until network error is reduced to the extent that all output units have activations within a certain range of their targets on all training patterns. That range is called score-threshold, a parameter that can be manipulated to control depth of learning.
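The stopping criterion can be expressed as a simple check over all patterns and output units. In this sketch the function name is an assumption, and the default value of 0.4 is chosen only for illustration; the text treats score-threshold as a parameter to be set by the modeler.

```python
def within_score_threshold(actual, target, score_threshold=0.4):
    """True if every output is within score_threshold of its target on every pattern.

    actual, target -- lists of output vectors, one vector per training pattern
    """
    return all(
        abs(a - t) <= score_threshold
        for a_vec, t_vec in zip(actual, target)
        for a, t in zip(a_vec, t_vec)
    )
```

A smaller score-threshold demands outputs closer to their targets and so produces deeper learning; a larger one lets training stop earlier.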
Hidden units are recruited during so-called input phases. Output weights are frozen (shown by solid large arrow). Eight candidate units each have initially random, trainable connection weights from the input units (shown by dashed large arrow). These input weights are adjusted in order to maximize the correlation between each candidate's activation and network error at the output units, over the training patterns.
The function to maximize during each input phase is a modified correlation between candidate hidden unit activation and network error. For each output unit (o), the covariance between hidden unit activation (h) and network error (e) is summed across patterns (p); the absolute values of these covariances are then summed across output units and standardized by the sum of squared error deviations. Terms in angled brackets are means. Both error minimization and correlation maximization use the same algorithm, called Quickprop (Fahlman, 1988).
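The modified correlation can be computed as follows. This is a sketch under stated assumptions: the function name and data layout are illustrative, and returning 0.0 when the error does not vary at all is an added guard against division by zero, not something the text specifies.

```python
def candidate_score(h, e):
    """Modified correlation between a candidate's activations and output error.

    h -- candidate activation on each pattern: h[p]
    e -- network error on each pattern at each output unit: e[p][o]
    """
    n_patterns = len(h)
    n_outputs = len(e[0])
    h_mean = sum(h) / n_patterns
    abs_covariance_sum = 0.0
    squared_error_deviations = 0.0
    for o in range(n_outputs):
        e_mean = sum(e[p][o] for p in range(n_patterns)) / n_patterns
        # Covariance over patterns for this output unit (not divided by n).
        cov = sum((h[p] - h_mean) * (e[p][o] - e_mean) for p in range(n_patterns))
        abs_covariance_sum += abs(cov)
        squared_error_deviations += sum(
            (e[p][o] - e_mean) ** 2 for p in range(n_patterns)
        )
    if squared_error_deviations == 0.0:
        return 0.0  # assumed guard: error is constant, nothing to track
    return abs_covariance_sum / squared_error_deviations
```

Because the absolute value is taken per output unit, a candidate that tracks error negatively scores as well as one that tracks it positively.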
When these correlations stop increasing, the candidate with the highest correlation (regardless of sign) is selected and the other candidates are discarded.
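Selection among the candidates reduces to picking the largest magnitude. The sketch below assumes the candidates' final correlations are held in a list; the function name is illustrative.

```python
def select_best_candidate(correlations):
    """Return the index of the candidate whose correlation is largest in magnitude."""
    return max(range(len(correlations)), key=lambda i: abs(correlations[i]))
```

With correlations of `[0.2, -0.9, 0.5]`, the second candidate (index 1) is recruited, since its sign is ignored.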
The recruited hidden unit is then installed into the network. Its freshly trained input weights are frozen (indicated by solid large arrow) so that it will always be able to track those aspects of network error that it was trained to track. The new hidden unit is provided with random, trainable output weights (of sign opposite to that of its correlation with network error), and output training resumes.
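The sign rule for the new unit's weights to the output units can be sketched as below. The function name, the uniform distribution, and the `weight_range` parameter are assumptions made for illustration; the text specifies only that the weights are random and opposite in sign to the unit's correlation with error.

```python
import random


def initial_output_weights(correlations, weight_range=1.0, rng=random):
    """One starting weight per output unit, signed opposite to the correlation.

    correlations -- signed correlation of the new unit with error at each output unit
    weight_range -- hypothetical upper bound on the initial weight magnitude
    rng          -- source of randomness offering uniform(); defaults to the random module
    """
    weights = []
    for c in correlations:
        magnitude = rng.uniform(0.0, weight_range)
        # A unit that rises with error gets a negative weight, and vice versa,
        # so that its contribution starts out working against the error.
        weights.append(-magnitude if c > 0 else magnitude)
    return weights
```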
Fahlman, S. E. (1988). Faster-learning variations on back-propagation: An empirical study. In D. S. Touretzky, G. E. Hinton, & T. J. Sejnowski (Eds.), Proceedings of the 1988 Connectionist Models Summer School (pp. 38-51). Los Altos, CA: Morgan Kaufmann.
Fahlman, S. E., & Lebiere, C. (1990). The cascade-correlation learning architecture. In D. S. Touretzky (Ed.), Advances in neural information processing systems 2 (pp. 524-532). Los Altos, CA: Morgan Kaufmann.
Shultz, T. R., Schmidt, W. C., Buckingham, D., & Mareschal, D. (1995). Modeling cognitive development with a generative connectionist algorithm. In T. J. Simon & G. S. Halford (Eds.), Developing cognitive competence: New approaches to process modeling (pp. 205-261). Hillsdale, NJ: Erlbaum.