Cascade-correlation (CC) is a generative, feed-forward learning algorithm for artificial neural networks (Fahlman & Lebiere, 1990). It has been used extensively in the simulation of various aspects of cognitive development (Shultz et al., 1995).
An artificial neural network (ANN) is composed of units and connections between the units. Units can be pictorially represented as circles. Units become variously active depending on their inputs. In a very rough sense, units in ANNs can be seen as analogous to neurons or perhaps groups of neurons. Psychologically, a pattern of activation across units can be interpreted as a cognition or idea.
Connection weights determine an organizational topology for a network and allow units to send activation to each other. In a very rough sense, connection weights in ANNs can be regarded as analogous to neural synapses.
Input units code the problem being presented to the network. Output units code the network’s response to the input problem. Hidden units perform essential intermediate computations. Psychologically, the matrix of connection weights can be regarded as a network’s long-term memory. In cascade-correlation, there are cross-connections that bypass hidden units.
A problem is presented to a network as a vector (ordered list) of real numbers specifying how active each input is.
Activation is passed forward from the inputs to the next layer of units, in this case the first hidden unit.
Then activation is passed from the inputs and first hidden unit to the second hidden unit. Notice the cascaded connection weight between the two hidden units.
Finally, activation is passed to the output units from all input and hidden units. The pattern of activation across the output units represents the network’s response to the particular problem that was presented on the inputs. Because activation flows in only one direction, forward from inputs to outputs, cascade-correlation is known as a feed-forward algorithm.
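The forward pass described above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names, the weight layout (bias first, then inputs, then earlier hidden units), and the use of a logistic sigmoid on every hidden and output unit are assumptions made for the example.

```python
import math

def sigmoid(x):
    """Logistic sigmoid, squashing a net input into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def cascade_forward(inputs, hidden_weights, output_weights):
    """Forward pass through a cascade network (illustrative layout).

    inputs         -- list of input activations
    hidden_weights -- one weight list per hidden unit; hidden unit k has
                      1 + len(inputs) + k weights (bias, inputs, earlier hiddens)
    output_weights -- one weight list per output unit, covering the bias,
                      every input, and every hidden unit (cross-connections)
    """
    activations = [1.0] + list(inputs)  # bias unit first, then the inputs
    for weights in hidden_weights:
        if len(weights) != len(activations):
            raise ValueError("hidden unit has the wrong number of weights")
        net = sum(w * a for w, a in zip(weights, activations))
        activations.append(sigmoid(net))  # later units receive this activation
    outputs = []
    for weights in output_weights:
        if len(weights) != len(activations):
            raise ValueError("output unit has the wrong number of weights")
        outputs.append(sigmoid(sum(w * a for w, a in zip(weights, activations))))
    return outputs

# Two inputs, two cascaded hidden units, one output unit.
response = cascade_forward(
    [0.3, 0.7],
    hidden_weights=[[0.1, 1.0, -1.0], [0.0, 0.5, 0.5, 2.0]],
    output_weights=[[0.0, 0.2, -0.2, 1.5, -1.5]],
)
```

Because each hidden unit's activation is appended to the list that later units read from, the second hidden unit automatically receives the cascaded connection from the first, and the outputs receive direct connections from the inputs as well as from both hidden units.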
The net input to a unit is the sum, across all of its sending units, of each sending unit's activation multiplied by the weight of its connection to the receiving unit.
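Written out as an equation, the net input might look as follows. The symbols are assumptions chosen for this sketch: x_i is the net input to receiving unit i, y_j is the activation of sending unit j, and w_ij is the weight on the connection from j to i.

```latex
x_i = \sum_{j} w_{ij} \, y_j
```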
Useful hidden units have non-linear activation functions, such as the sigmoid function shown here. In this equation, e is the base of the natural exponential and x is the net input to unit i.
The sigmoid activation function has a floor and ceiling, somewhat like real neurons. Output units designed to make binary decisions also typically have a sigmoid activation function.
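The floor and ceiling are easy to see numerically. The slide's own equation is not reproduced in these notes, so the sketch below assumes the standard logistic form, 1 / (1 + e^(-x)), with a floor of 0 and a ceiling of 1; an implementation could equally use a shifted or rescaled variant.

```python
import math

def sigmoid(x):
    """Logistic sigmoid 1 / (1 + e^(-x)), written to avoid overflow.

    For negative x the algebraically equivalent form e^x / (1 + e^x)
    is used, so math.exp never receives a large positive argument.
    """
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

# Activation approaches the floor for very negative net input and
# the ceiling for very positive net input.
for net_input in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(net_input, round(sigmoid(net_input), 4))
```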
Networks must be trained before they can provide correct solutions to problems. CC training involves both adjusting connection weights and adding new hidden units. CC networks begin without any hidden units. Trainable connection weights are drawn in this slideshow as dashed arrows. Initially, these weights have random values, generating random performance. Weights are adjusted to reduce discrepancy (error) between the actual output vector and a target output vector of correct activations.
One of the units at the input level in CC is called the bias unit because it always has an input activation of 1.0, regardless of the input pattern being presented. With trainable connection weights to all downstream units, the bias unit effectively implements a learnable resting activation level for each hidden and output unit.
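A small sketch can make the idea of a learnable resting level concrete. The function name and the logistic activation are assumptions for illustration; the point is only that when every other input is silent, the unit's activation is determined by its bias weight alone.

```python
import math

def unit_activation(inputs, weights, bias_weight):
    """Activation of one unit that also receives the bias unit.

    The bias unit's activation is fixed at 1.0, so its contribution
    to the net input is simply bias_weight * 1.0.
    """
    net = bias_weight * 1.0 + sum(w * a for w, a in zip(weights, inputs))
    return 1.0 / (1.0 + math.exp(-net))

# With all other inputs at zero, only the bias weight matters.
resting_low = unit_activation([0.0, 0.0], [0.8, -0.4], bias_weight=-2.0)
resting_high = unit_activation([0.0, 0.0], [0.8, -0.4], bias_weight=2.0)
```

Here `resting_low` is below 0.5 and `resting_high` is above it, even though the input pattern and the other weights are identical.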
Network error is defined as the sum of squared differences between actual (A) and target (T) output activations. These squared differences are summed over output units (o) and training patterns (p). Connection weights are adjusted to minimize error. The training patterns are vector pairs of input activations and target output activations.
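The error definition above can be written as a single equation. The subscript notation is an assumption of this sketch: A_op and T_op are the actual and target activations of output unit o on training pattern p, and E is the network error.

```latex
E = \sum_{o} \sum_{p} \left( A_{op} - T_{op} \right)^{2}
```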
When error reduction stagnates, a hidden unit is recruited. As the first hidden unit is added, its input weights are frozen (shown as solid arrows), and training of the output weights resumes. CC networks are generative in the sense that the learning algorithm builds the internal topology of a network. Thus, CC networks grow as they learn.
If a second hidden unit is required, it is installed downstream of the first hidden unit. After each hidden unit is recruited, training of the output weights resumes. Thus, these phases are known as output phases. Learning continues in this fashion until network error is reduced to the point where all output units have activations within a certain range of their targets on all training patterns. That range is called score-threshold, a parameter that can be manipulated to control depth of learning.
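The stopping rule can be sketched as a simple check over all patterns and output units. The function name, the default value of 0.4, and the choice of "less than or equal" are assumptions for illustration; the notes above specify only that every output must fall within score-threshold of its target.

```python
def within_score_threshold(actual, target, score_threshold=0.4):
    """Return True when learning may stop.

    actual, target  -- lists of output vectors, one per training pattern
    score_threshold -- maximum allowed |actual - target| on any output
                       (0.4 is an illustrative value, not taken from the notes)
    """
    return all(
        abs(a - t) <= score_threshold
        for a_vec, t_vec in zip(actual, target)
        for a, t in zip(a_vec, t_vec)
    )

actual = [[0.9], [0.2]]
target = [[1.0], [0.0]]
print(within_score_threshold(actual, target, score_threshold=0.4))   # True
print(within_score_threshold(actual, target, score_threshold=0.15))  # False
```

Lowering the threshold demands closer agreement with the targets and so forces deeper learning, while raising it lets training stop earlier.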
Hidden units are recruited during so-called input phases. Output weights are frozen (shown by the large solid arrow). Eight candidate units each have initially random, trainable connection weights from the input units (shown by the large dashed arrow). These input weights are adjusted in order to maximize a correlation between their activation and network error at the output units, over the training patterns.
The function to maximize during each input phase is a modified correlation between candidate hidden unit activation and network error. For each output unit (o), the covariance between hidden unit activation (h) and network error (e) is summed across patterns (p); the absolute values of these covariances are then summed across output units and standardized by the sum of squared error deviations. Terms in angled brackets are means.
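Based on that verbal description, the function can be reconstructed as the equation below. The symbol C for the modified correlation and the subscript layout are assumptions of this sketch: h_p is the candidate's activation on pattern p, and e_op is the error at output unit o on pattern p.

```latex
C = \frac{\displaystyle \sum_{o} \left| \sum_{p} \left( h_{p} - \langle h \rangle \right) \left( e_{op} - \langle e_{o} \rangle \right) \right|}
         {\displaystyle \sum_{o} \sum_{p} \left( e_{op} - \langle e_{o} \rangle \right)^{2}}
```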
Both error minimization and correlation maximization use the same algorithm, called Quickprop (Fahlman, 1988).
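The core idea of Quickprop is to treat the error curve for each weight as a parabola and to jump toward that parabola's minimum using the current and previous slopes. The sketch below is a deliberately simplified, single-weight version: the function name, the parameter names `lr` and `mu`, their default values, and the fallback to a plain gradient step are assumptions for illustration, and refinements in the published algorithm are omitted.

```python
def quickprop_minimize(grad, w, lr=0.1, mu=1.75, steps=50):
    """Minimize a one-parameter function given its gradient (simplified).

    grad  -- function returning the slope dE/dw at w
    lr    -- learning rate for the plain gradient step used to start
    mu    -- maximum growth factor: a step may be at most mu times
             as large as the previous step
    """
    prev_slope, prev_step = None, 0.0
    for _ in range(steps):
        slope = grad(w)
        if slope == 0.0:
            break  # stationary point reached
        if prev_slope is None or prev_step == 0.0:
            step = -lr * slope  # no history yet: ordinary gradient descent
        elif slope * prev_slope > 0 and abs(slope) >= abs(prev_slope):
            # Slope did not shrink, so the fitted parabola has no usable
            # minimum ahead; take the largest allowed step instead.
            step = mu * prev_step
        else:
            # Secant estimate of the parabola's minimum, capped by mu.
            step = min(mu, slope / (prev_slope - slope)) * prev_step
        w += step
        prev_slope, prev_step = slope, step
    return w

# Minimize E(w) = (w - 3)^2, whose slope is 2 * (w - 3).
w_min = quickprop_minimize(lambda w: 2.0 * (w - 3.0), w=0.0)
```

The same routine can maximize a quantity such as the candidate correlation by passing the negative of its gradient.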
When these correlations stop increasing, the candidate with the highest correlation (regardless of sign) is selected and the other candidates are discarded.
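Candidate scoring and selection can be sketched as follows, using the modified correlation between candidate activation and network error described earlier. The function names and the nested-list data layout are assumptions for illustration, and the training of candidate input weights is left out.

```python
def candidate_score(h, e):
    """Modified correlation between one candidate and network error.

    h -- candidate activation on each pattern, h[p]
    e -- network error on each pattern at each output unit, e[p][o]
    """
    n_patterns = len(h)
    n_outputs = len(e[0])
    h_mean = sum(h) / n_patterns
    numerator = 0.0
    denominator = 0.0
    for o in range(n_outputs):
        e_o = [e[p][o] for p in range(n_patterns)]
        e_mean = sum(e_o) / n_patterns
        covariance = sum(
            (h[p] - h_mean) * (e_o[p] - e_mean) for p in range(n_patterns)
        )
        numerator += abs(covariance)  # sign is ignored
        denominator += sum((x - e_mean) ** 2 for x in e_o)
    return numerator / denominator if denominator else 0.0

def select_candidate(candidate_activations, e):
    """Return (index, score) of the candidate with the highest score."""
    scores = [candidate_score(h, e) for h in candidate_activations]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]

errors = [[0.1], [-0.1], [0.2], [-0.2]]
flat = [0.5, 0.5, 0.5, 0.5]      # ignores the error entirely
tracker = [0.9, 0.1, 0.9, 0.1]   # rises and falls with the error
inverse = [0.1, 0.9, 0.1, 0.9]   # tracks the error with opposite sign
best_index, best_score = select_candidate([flat, tracker], errors)
```

Because the absolute value is taken, `tracker` and `inverse` receive the same score: a unit that correlates negatively with error is just as useful as one that correlates positively.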
The recruited hidden unit is then installed into the network. Its freshly trained input weights are frozen (indicated by the large solid arrow) so that it will always be able to track those aspects of network error that it was trained to track. The new hidden unit is given random, trainable connection weights to the output units (of sign opposite to that of its correlation with network error), and output training resumes.
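The sign rule for the new unit's weights can be illustrated with a short sketch. The function name, the `scale` parameter, and the uniform distribution are assumptions; the notes state only that the weights are random and opposite in sign to the unit's correlation with network error.

```python
import random

def init_new_unit_weights(correlations, scale=1.0, rng=random):
    """Random starting weights from a newly installed hidden unit.

    correlations -- signed correlation of the new unit with network
                    error at each output unit
    Each weight gets a random magnitude in [0, scale] and a sign
    opposite to the corresponding correlation, so the unit starts out
    pushing against the error it tracks.
    """
    weights = []
    for c in correlations:
        magnitude = rng.uniform(0.0, scale)
        weights.append(-magnitude if c > 0 else magnitude)
    return weights

rng = random.Random(0)  # seeded only to make the example repeatable
new_weights = init_new_unit_weights([0.8, -0.3], rng=rng)
```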
Fahlman, S. E. (1988). Faster-learning variations on back-propagation: An empirical study. In D. S. Touretzky, G. E. Hinton, & T. J. Sejnowski (Eds.), Proceedings of the 1988 Connectionist Models Summer School (pp. 38-51). Los Altos, CA: Morgan Kaufmann.
Fahlman, S. E., & Lebiere, C. (1990). The cascade-correlation learning architecture. In D. S. Touretzky (Ed.), Advances in neural information processing systems 2 (pp. 524-532). Los Altos, CA: Morgan Kaufmann.
Shultz, T. R., Schmidt, W. C., Buckingham, D., & Mareschal, D. (1995). Modeling cognitive development with a generative connectionist algorithm. In T. J. Simon & G. S. Halford (Eds.), Developing cognitive competence: New approaches to process modeling (pp. 205-261). Hillsdale, NJ: Erlbaum.