Before
proceeding with this slideshow, it is recommended that you watch the slideshow A
Tutorial on Cascade-correlation, if
you have not yet seen that more general
introduction.
Perhaps the
most significant limitation of current neural-network modeling is the failure to use old knowledge in new learning. Neural networks start each problem from scratch, with random connection weights. Compared to humans, networks have the advantage of concentrating on one task, but the disadvantage of starting every new problem from scratch.
In the following slides, a source network
is represented as a square in order to simplify KBCC diagrams.
In the first input phase, KBCC sets up a pool of recruitment candidates that includes previously learned source networks as well as single hidden units. There are several versions of each
candidate, each with a different set of initially randomized connection
weights from the input units (represented here by the dashed arrow). As in
ordinary cascade-correlation, the output weights are temporarily frozen,
represented here by the solid arrow.
When the correlations with network error
stop increasing, the best-correlating candidate is selected and the other
candidates are discarded.
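As a concrete sketch of this selection step, the following Python code (an illustration written for this tutorial, not the authors' implementation) computes the covariance measure G described later in these notes for each candidate's recorded activations and picks the best-correlating one; all names and array shapes here are assumptions.

import numpy as np

def covariance_score(cand_acts, errors):
    # G for one candidate: absolute covariance between candidate output
    # activations and target-network error, normalized by the number of
    # candidate outputs, the number of network outputs, and the summed
    # squared network error (as described in these notes).
    # cand_acts: (patterns, candidate outputs); errors: (patterns, network outputs)
    v_dev = cand_acts - cand_acts.mean(axis=0)    # candidate-activation deviance
    e_dev = errors - errors.mean(axis=0)          # error deviance
    cov = np.abs(v_dev.T @ e_dev).sum()           # |sum over patterns|, summed over output pairs
    norm = cand_acts.shape[1] * errors.shape[1] * (errors ** 2).sum()
    return cov / norm

def select_best_candidate(candidate_activations, errors):
    # The best-correlating candidate is kept; the others are discarded.
    scores = [covariance_score(a, errors) for a in candidate_activations]
    return int(np.argmax(scores))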
The best-correlating candidate (here a
source network) is installed into the network, its input weights are frozen
(indicated by the solid arrow), and training reverts to the output phase, in
which all weights (indicated by the dashed arrows) entering the output units
are trained in order to reduce error.
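A minimal sketch of an output phase might look like the following; it uses plain gradient descent on the summed squared error in place of the Quickprop procedure normally used in cascade-correlation, and every name below is an assumption made for illustration.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def output_phase(acts, targets, out_w, lr=0.1, epochs=100):
    # Only the weights entering the output units are trained; all other
    # weights stay frozen. acts: (patterns, units feeding the outputs);
    # targets and the returned activations: (patterns, network outputs).
    for _ in range(epochs):
        out = sigmoid(acts @ out_w)                # current output activations V
        err = out - targets                        # V - T
        grad = acts.T @ (err * out * (1.0 - out))  # gradient of the squared error (constant factor folded into lr)
        out_w -= lr * grad
    return out_w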
Further recruitments may need to be made.
In this case, a single hidden unit is installed downstream of the first
(source network) recruit. The thick arrows represent the idea that the
recruited source network may have multiple inputs and multiple outputs. As in
ordinary cascade-correlation, training continues until all output-unit
activations are within score-threshold of their targets on all of the training
patterns.
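For illustration, that stopping test can be written as a one-line check; the particular score-threshold value below is an assumed default, not a value taken from these slides.

import numpy as np

def all_outputs_correct(outputs, targets, score_threshold=0.4):
    # Training stops when every output activation is within score-threshold
    # of its target on every training pattern.
    return bool(np.all(np.abs(outputs - targets) < score_threshold))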
The function to minimize in the output phases is the squared error in the target network, summed over patterns and outputs. V_o,p is the actual output activation at output o for pattern p, and T_o,p is the target output activation at output o for pattern p. The difference between them is squared and summed over all outputs and patterns.
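Reconstructed from this description (the notation is assumed rather than copied from the published paper), the output-phase objective is:

\[
F \;=\; \sum_{o}\sum_{p}\bigl(V_{o,p} - T_{o,p}\bigr)^{2}
\]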
The function to maximize in the input phases is a normalized covariance between candidate activation and network error. G for each candidate c is an absolute covariance between candidate activation and network error, standardized by the number of outputs for the candidate (#O_c), the number of outputs in the target network (#O), and the squared error in the target network summed across outputs o and patterns p. V is the actual output activation of candidate c for pattern p, and mean V is the mean output activation of candidate c. E is the target-network error at output o for pattern p, and mean E is the mean target-network error at output o. The products of candidate-activation deviance and error deviance are summed over patterns p; the absolute value of that sum is then summed over network outputs o and candidate outputs o_c.
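Reconstructed from this description (assumed notation; the published formula may express the normalization slightly differently), the input-phase objective for candidate c is:

\[
G_{c} \;=\; \frac{\sum_{o}\sum_{o_{c}}\Bigl|\sum_{p}\bigl(V_{o_{c},p}-\overline{V}_{o_{c}}\bigr)\bigl(E_{o,p}-\overline{E}_{o}\bigr)\Bigr|}
{\#O_{c}\,\cdot\,\#O\,\cdot\,\sum_{o}\sum_{p}E_{o,p}^{2}}
\]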
In this illustrative experiment, KBCC
learned source networks representing the components of a cross, which was the
target problem. In each case, the network had to learn to respond positively
to Cartesian points inside the shape and negatively to points outside of the
shape.
Networks learn to identify whether a given pattern falls inside a class that has a two-dimensional uniform distribution. There are 2 linear input units and 1 sigmoid output unit indicating whether a point is inside or outside a class with a given shape, size, and position. The network is trained with 225 patterns forming a 15 x 15 grid covering the whole input space: blue points are training patterns inside the target shape, and red points are training patterns outside it. Generalization is assessed with 200 randomly determined test patterns uniformly distributed over the input space, and network responses are also plotted over a fine 220 x 220 grid of input patterns: white indicates points classified as inside the target shape, black outside, and gray uncertain.
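As a rough sketch of this setup (the coordinate range and rectangle bounds below are assumptions for illustration, not values taken from the slides), the training grid and its labels could be generated like this:

import numpy as np

xs = np.linspace(-1.0, 1.0, 15)                        # 15 values per input dimension
grid = np.array([(x, y) for y in xs for x in xs])      # 15 x 15 = 225 training patterns

def inside_rectangle(p, x_min=-0.6, x_max=0.6, y_min=-0.2, y_max=0.2):
    # True for points inside an axis-aligned horizontal rectangle (illustrative bounds).
    return (x_min <= p[0] <= x_max) and (y_min <= p[1] <= y_max)

targets = np.array([1.0 if inside_rectangle(p) else 0.0 for p in grid])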
This horizontal rectangle constituted
another source network.
The best solution without any hidden units is to classify all patterns as being outside of the target class, because the target class contains only 57 of the 225 training patterns; always responding outside is therefore correct on 168 of the 225 patterns. The target cross pattern is shown in blue.
The first recruit is the source network
representing the horizontal rectangle.
The second recruit is the vertical
rectangle, allowing completely correct performance on the target cross.
The shapes
of the target solutions (3, 4) closely resemble the shapes of recruited source knowledge (1, 2). KBCC implements a new kind of connectionist
compositionality, composing a cross from
its component rectangles while keeping the
components intact. This concatenative compositionality
was thought to be beyond the ability of neural
networks.
This plot shows the mean speedup in learning afforded by KBCC under ten different source-network conditions. Homogeneous subsets are indicated by brackets. The no-knowledge condition is slowest to learn, followed by rotated components and irrelevant knowledge. Exact individual components
afford significantly faster learning. Even faster is having both exact components (illustrated here in red,
an example of which was presented on the
previous few slides). Fastest of all was
having the full exact knowledge provided by
a source network that had previously learned the cross itself.
Such results were obtained in a variety
of artificial and real problem domains. KBCC has shown the ability to deal
with a variety of source-knowledge transformations including translation and
size changes in addition to rotation.
Source
knowledge in KBCC can be either learned or inserted, and it can be relatively crisp or fuzzy. In all of
these cases, KBCC achieves a natural and
seamless integration of learning by
analogy and learning by induction. Consequently, we are excited about the possibility of being able to
simulate the kind of knowledge-based
learning at which people excel.
References
Rivest, F., & Shultz, T. R. (2002). Application of knowledge-based cascade-correlation to vowel recognition. IEEE International Joint Conference on Neural Networks 2002 (pp. 53-58).
Shultz, T. R., & Rivest, F. (2001). Knowledge-based cascade-correlation: Using knowledge to speed learning. Connection Science, 13, 43-72.
Thivierge, J.-P., & Shultz, T. R. (2002). Finding relevant knowledge: KBCC applied to splice-junction determination. IEEE International Joint Conference on Neural Networks 2002 (pp. 1401-1405).