Stable Encoding of Large Finite-State Automata in
Recurrent Neural Networks with Sigmoid Discriminants
Christian W. Omlin
a;b
, C. Lee Giles
a;c
a
NEC Research Institute, 4 Independence Way, Princeton, NJ 08540
b
CS Department, Rensselaer Polytechnic Institute, Troy, NY 12180
c
UMIACS, U. of Maryland, College Park, MD 20742
University of Maryland Technical Report CS-TR-3337 & UMIACS-TR-94-101
Abstract
We propose an algorithm for encoding deterministic nite-state automata (DFAs) in second-order re-
current neural networks with sigmoidal discriminant function and we prove that the languages accepted
by the constructed network and the DFA are identical. The desired nite-state network dynamics is
achieved by programming a small subset of all weights. A worst case analysis reveals a relationship be-
tween the weight strength and the maximum allowed network size which guarantees nite-state behavior
of the constructed network. We illustrate the method by encoding random DFAs with 10, 100, and 1,000
states. While the theory predicts that the weight strength scales with the DFA size, we nd the weight
strength to be almost constant for all the experiments. These results can be explained by noting that
the generated DFAs represent average cases. We empirically demonstrate the existence of extreme DFAs
for which the weight strength scales with DFA size.
1 INTRODUCTION
Itispossible totrainrecurrent neuralnetworks tobehave likedeterministicnite-state automata[Elman, 1990,
Frasconi et al., 1991,Giles et al., 1992,Pollack, 1991,Servan-Schreiber et al., 1991,Watrous and Kuhn, 1992].
The internal representation of learned DFA states can deteriorate due to the dynamical nature of re-
current networks making predictions about the generalization performance of trained recurrent networks
dicult [Zeng et al., 1993]. Methods for constructing DFAs in recurrent networks with hard-limiting neu-
rons discriminant functions have been proposed [Alon et al., 1991, Horne and Hush, 1994, Minsky, 1967];
1
methods for constructing networks with sigmoidal and radial-basis discriminant functions are discussed in
[Frasconi et al., 1993, Gori et al., 1994, Giles and Omlin, 1993]. We prove that recurrent networks with con-
tinuous sigmoidal discriminant functions can be constructed such that the encoded nite-state dynamics
remains stable indenitely. Notice that we do not claim that such a stable representation can be learned.
The stability of the internal DFA representation is based on on the premise that the discriminant functions
have suciently high gain such that the neurons always operate in a saturated mode.
We show empirically that the gain does not scale with the size of networks that implement average DFAs;
however, there exist extreme cases of DFAs where we observe a scaling of the neurons' gain with network
size.
2 ENCODING DFA DYNAMICS
2.1 Finite State Automata
Adeterministicnite-state automaton(DFA) isan acceptor foraregular languageL(M). Formally,a DFAM
is a5-tuple M =< ;Q;R;F; > where  = fa
1
;:::;a
m
gis the alphabet ofthe languageL, Q = fq
1
;:::;q
n
g
is a set of states, RQ is the start state, F  Q is a set of accepting states and  : Q ! Q denes state
transitions in M. A string is accepted by the DFA M if an accepting state is reached; otherwise, the string
is rejected.
2.2 Recurrent Network
We implement DFAs in discrete-time, recurrent networks with second-order weights W
ijk
. The continuous
network dynamics are described by the following equations:
S
(t+1)
i
= h(a
i
(t)) =
1
1+ e
BnZra
i
(t)
; a
i
(t) = b
i
+
X
j;k
W
ijk
S
(t)
j
I
(t)
k
; (1)
where b
i
is the bias associated with hidden recurrent state neurons S
i
; I
k
denotes the input neuron for
symbol a
k
. The product S
(t)
j
I
(t)
k
directly corresponds to the state transition (q
j
;a
k
) = q
i
. After a string
has been processed, the output of a designated neuron S
0
decides whether the network accepts or rejects a
string. The network accepts a given string if the value of the output neuron S
t
0
at the end of the string is
greater than 0.5; otherwise, the network rejects the string.
2.3 Encoding Algorithm
The encoding algorithmachieves a nearly orthonormal internal representation of the desired DFA dynamics;
it constructs a network with n+1 recurrent state neurons (includingthe output neuron) and m input neurons
2
from a DFA with n states and m input symbols. There is a one-to-one correspondence between state neurons
S
i
and DFA states q
i
. Consider a DFA state transition (q
j
;a
k
) = q
i
. Setting W
ijk
to a large positive value
+H will ensure that S
(t+1)
i
will be high and setting W
jjk
to a large negative value BnZrH will guarantee that
the output S
(t+1)
j
will be low. Furthermore, if state q
i
is an accepting state, then we program the weight
W
0jk
to +H; otherwise, we set W
0jk
to BnZrH. We set the bias terms b
i
of all state neurons S
i
to BnZrH=2. For
each DFA state transition, at most three weights of the network have to be programmed. The initial state
S
0
of the network is S
0
= (S
0
0
;1;0;0;:::;0). The initial value of the response neuron S
0
0
is 0 if the DFA's
initial state q
0
is a rejecting state and 1 otherwise.
3 ANALYSIS
We prove the stability of DFA encodings in recurrent neural networks using xed point analysis. We only
give the proofs of the theorems which establish our results; for proofs of auxiliary lemmas see [Omlin, 1995].
3.1 Fixed Point Analysis
Recall that the recurrent network changes its state according to equation (1). Our DFA encoding algorithm
yields a special form of that equation describing the dynamics of a constructed network:
S
(t+1)
i
= g(H(2x
i
BnZr1)=2) = h(x;H) =
1
1+e
H=2(1BnZr2x)
(2)
The bias term BnZrH=2 is common to all state neurons. Hx is the weighted sum feeding into neuron S
(t+1)
i
.
Under certain conditions, the discriminant function h(:) has xed points which allow a stable internal repre-
sentation of DFA states. We state here without proof the following facts about xed points of the function
h(:); the proofs can be found in [Omlin, 1995].
Lemma 3.1.1 For 0 < H < 4, h(x;H) has the following xed point:

0
= 0:5
Furthermore, h(x;H) converge to 
0
for any choice of a start value x
0
.
Lemma 3.1.2 For H  4, h(x;H) has three xed points 
0
= 0:5, 
BnZr
and 
+
.
Lemma 3.1.3 for x < 
0
and 
0
< x; h
t
(x;H) (with H > 4) converges to 
BnZr
and 
+
, respectively.
Lemma 3.1.4 For arbitrary H > 0, the two xed points 
BnZr
and 
+
are related as follows:

BnZr
+ 
+
= 1
3.2 Worst Case Analysis
We will now investigate the conditions under which a constructed network implements a given DFA. The
main result shows how large a network may be for xed H such that the languages of the DFA and the
3
constructed network are identical:
Theorem 3.2.1 Let  denote the maximum fraction of DFA states q
j
from which there are state transitions
(q
j
;a
k
) = q
i
for a xed choice of a
k
and any q
i
. Then, a sparse recurrent neural network RNN with n+1 sig-
moidal state neurons, m input neurons, at most 3mn second-order weights with alphabet 
w
= fBnZrH;0;+Hg
(4 < H
min
< H < H
pred
), n + 1 biases with alphabet 
b
= fBnZrH=2g, and maximum fan-out 3m can be
constructed from a DFA M with n states and m input symbols of states such that the internal state repre-
sentation remains stable, i.e. S
i
> 0:5 when q
i
is the current DFA state and S
i
< 0:5 otherwise if
n <
1
2  
BnZr

(H)
(1+
1
H
) with 
+

(H) >
1
4
(3+
1
H
)
for a proper choice of H.
Proof: >From the DFA encoding algorithm, we can derive four dierent types of neuron state changes:
low ! high:
S
t+1
i
= h(S
t
j
H +
X
S
l
C
i;l
S
t
l
H BnZrS
t
i
H) (S
t
j
: high;S
t
l
;S
t
i
: low) (3)
S
t+1
i
= h(S
t
j
H +
X
S
l
C
i;l
S
t
l
H + S
t
i
H) (S
t
j
: high;S
t
l
;S
t
i
: low) (4)
high ! high:
S
t+1
i
= h(S
t
i
H +
X
S
l
C
i;l
S
t
l
H) (S
t
i
: high;S
t
l
: low) (5)
high ! low:
S
t+1
i
= h(S
t
i
H +
X
S
l
C
i;l
S
t
l
H) (S
t
i
: high;S
t
l
: low) (6)
low ! low:
S
t+1
i
= h(S
t
i
H +
X
S
l
C
i;l
S
t
l
H) (S
t
i
;S
t
l
: low) (7)
S
t+1
i
= h(S
t
i
H +
X
S
l
C
i;l
S
t
l
H) (S
t
i
;S
t
l
: low) (8)
where
C
i;l
= fS
j
j W
ijl
= Hg (9)
4
The inputs I
t
k
are not shown explicitly since we assume that each input symbol is assigned a separate input
neuron in a one-hot encoding. The DFA state transitions corresponding to the these types of neuron state
changes are shown in gure 1.
Note that the term BnZrH=2 is implicit; they cause neuron outputs which do not correspond to the cur-
rent DFA state to be reset to a small value. The term S
t
j
 H where S
t
j
is a high signal represents the
principal contribution to the neuron S
t+1
i
which is responsible for driving the output of neuron S
t+1
i
high
when the network executes a DFA state transition (q
j
;a
k
) = q
i
. All other terms are the residual contribu-
tions to the input of neuron S
t+1
i
where the signal S
t
l
is low. The term
P
S
t
l
H contributes to the total input
of state neuron S
t+1
i
if there are other transitions (q
l
;a
k
) = q
i
in the DFA from which the recurrent network
is constructed. Since there is a one-to-one correspondence between state neurons and DFA states, there will
always be a negative contribution BnZrS
t
i
H for the current DFA state transition (q
j
;a
k
) = q
i
, i.e. only S
t
i
can drive the signal S
t+1
i
low. Equations (3) and (4) only dier with respect to the sign of the residual input
S
t
i
H. If there is a state transition (q
i
;a
k
) = q
i
, then equation (4) applies; otherwise, there is a residual
low signal S
t
i
trying to drive S
t+1
i
low and equation (3) applies. Similarly,either equation (7) or (8) is chosen
for state transitions of the type low ! low. The above equations account for all possible contributions to
the net input of all state neurons because the encoding algorithm constructs a sparse recurrent network.
Before we proceed, we make some observations about the equations (3)-(9) which will simplify the anal-
ysis.
Observation 3.2.1 An analysis for state transitions of type low ! high and low ! low also covers state
transitions of type high ! high and high ! low, respectively.
The state transition types high ! high and high ! low cause stronger high and low signals, respectively,
than the state transitions types low ! high and low ! low. Thus, an analysis that shows stability of high
and low signals for these types of state transitions implies the stability of high and low signals for state
transitions of type high ! high and high ! low, respectively.
Observation 3.2.2 Of the two possible equations (3) and (4) for state transitions low ! high, the former
equation also covers the latter equation.
If high signals can be kept stable for equation (3), then they certainly can also be kept stable for equation
(4). No separate analysis is necessary for thetwo equations.
Observation 3.2.3 Of the two possible equations (7) and (8) for state transitions low ! low, the latter
equation also covers the former equation.
5
iq
jq
q l
m
1
ql
a k
a k
a k
a k
iq
jq
q l
m
1
ql
a k
a k
a k
a k
q k
t
t+1 t+1
t
iq
q l
m
1
ql a
k
a k
a k
t
t+1
iq
jq
q l
m
1
ql
a k
a k
a k
a k
q k
t
t+1
iq
q l
m
1
ql a
k
a k
a k
iq
jq
q l
m
1
ql
a k
a k
a k
a k
q k
t
t+1
qp
a k t
t+1
qp
a k
(a) (b)
(c) (d)
(e) (f)
Figure 1: Neuron State Changes and Corresponding DFA State Transitions: The gure (a)-(f)
illustrate the DFA state transitions corresponding to allpossible state changes of neuron S
i
; the DFA state(s)
participating in the current transitions are marked with t and t+1. (a) low ! high (no self-loop on q
i
) (b)
low ! high (with self-loop on q
i
) (c) high ! high (necessarily a self-loop on q
i
) (d) high ! low (necessarily
no self-loop on q
i
) (e) low ! low (no self-loop on q
i
) (f) low ! low (with self-loop on q
i
). Notice that, even
though state q
i
is neither the source nor the target of the current state transition in cases (e) and (f), the
corresponding state neuron S
i
still receives residual inputs from state neurons S
l
1
;:::;S
l
m
.
6
An argument similar to that given for validity of observation 3.2.2 can be given.
Thus, we are left with only the two following types of state transitions which represent the worst cases:
low ! low:
S
t+1
i
= h(S
t
i
H +
X
S
l
C
i;l
S
t
l
H) (S
t
i
;S
t
l
: low) (10)
low ! high:
S
t+1
i
= h(S
t
j
H +
X
S
l
C
i;l
S
t
l
H BnZrS
t
i
H) (S
t
j
: high;S
t
l
;S
t
i
: low) (11)
Thus, the network equation (1) takes on the special form
a
i
 BnZrH=2+ Hx
i
with x
i
=
X
j
S
t
j
(12)
since all but one input neuron have value 0 at any given time t. Notice that the number of terms in the sum
P
j
S
t
j
is equal to the number of DFA states q
j
for which there are transitions (q
j
;a
k
) = q
i
for the current
input symbol a
k
.
The ideal case where neurons do not receive residual inputs from other neurons, successive network state
changes can be expressed as the iteration of the function h
t
(x;H)
h
t
(x;H) =
8
<
:
h(x;H) t = 1
h(h
tBnZr1
(x;H);H) t > 1
(13)
with x = 0 and x = 1 for low and high signals, respectively
We can now dene a new function h
t

(x;H) which takes the residual inputs into consideration.
Since the initial output value of all state neurons except the neuron assigned to a DFAs start state are
zero, the residual inputs under this worst case assumption are identical for all neurons; let x denote the
residual neuron inputs. Then, the function h
t

(x;H) is dened as
h
t

(x;H) =
8
<
:
h(x;H) t = 1
h(h
tBnZr1

(x+ x;H)+x;H) t > 1
(14)
The initial values for low and high signals are x = 0 and x = 1, respectively.
7
Consider a state q
i
and let D
ik
denote the number of states q
j
such that (q
j
;a
k
) = q
i
for each symbol
a
k
. Setting D = maxfD
ik
g, each recurrent state neuron receives residual input from at most  =
D
n
recur-
rent neurons for a chosen input symbol a
k
.
We make the following simplifying assumption:
Assumption 3.2.1 Each state neuron receives residual inputs from exactly  other neurons for all input
symbols.
Although this is generally not the case, this worst case covers all possible DFAs and simplies the analysis.
Then, we can quantify x for the case of low signals:
Lemma 3.2.1 The low signals are bounded from above by the xed point 
BnZr

of the function
h
t

BnZr
(x;H) =
8
<
:
h(0;H) t = 1
h(nh
tBnZr1

BnZr
(x;H);H) t > 1
(15)
Similarly, we can quantify high signals:
Lemma 3.2.2 The high signals are bounded from below by the xed point 
+

of the function
h
t

+
(x;H) =
8
<
:
h(1;H) t = 1
h(h
tBnZr1

+
(x;H)BnZrh
tBnZr1

BnZr
(x;H);H) t > 1
(16)
The derivation of these iterated functions can be found in [Omlin, 1995]; the functions (16) and (17) converge
toward their xed points 
BnZr

and 
+

according to lemma 3.1.3.
In practice, only few neurons ever exceed or fall below the xed points 
BnZr
and 
+
, respectively. Fur-
thermore, the network has a built-in reset mechanism which allows low and high signals to be strengthened.
Low signals S
t
j
are strengthened to g(BnZrH=2) when there exists no state transition (:;a
k
) = q
j
. In that
case, the neuron S
t
j
receives no inputs from any of the other neurons; its output becomes less than 
BnZr
since
g(BnZrH=2) = h(0;H) < 
BnZr
for H > 4. Similarly, high signals S
t
i
get strengthened if either low signals feeding
into neuron S
i
on a current state transition (fq
j
g;a
k
) = q
i
have been strengthened during the previous
time step or when the number of positive residual inputs to neuron S
i
compensates for a weak high signal
from neurons fq
j
g. Thus only a small number of neurons will have S
t
j
> 
BnZr
or S
t
j
< 
+
and only for a nite
amount of time. For the majority of neurons we have S
t
j
 
BnZr
and S
t
j
 
+
. Since constructed recurrrent
networks are able to regenerate their internal signals and since typical DFAs do not have the worst case
8
properties assumed in this analysis, the conditions guaranteeing stable low and high signals are generally
much too strong for some given DFA. Scaling issues are discussed elsewhere [Omlin and Giles, 1994].
In order for the internal DFA state representation to remain stable, the low and high signals must remain
suciently distinguishable:
Denition 3.2.1 An encoding of DFA states in a SORNN is called stable if all the low and high signals are
less and larger than 0.5, respectively.
We now return to the worst case state equations (10) and (11).
In order for the function (16) to converge toward the low xed point 
BnZr

, the argument of equation (10) must
satisfy the following invariant property:
BnZr
H
2
+ H n
BnZr

<
1
2
(17)
Similarly, the argument of equation (11) must satisfy the following invariant property in order for function
(17) to converge toward the high xed point 
+

:
BnZr
H
2
+ H 
+

BnZrH(1BnZr
+
)
1
2
: (18)
Solving inequalities (18) and (19) for n and 
+

, respectively, we obtain the conditions for stable low and
high signals of the theorem.
Although we have stated the conditions for stable low and high signals separately, the condition for sta-
ble signals can be simplied as follows:
Corollary 3.2.1 If all neurons (not including the output neuron) have at least two weights W
ixk
= W
iyk
=
H for all input symbols a
k
, then the internal representation of a DFA remains stable if

BnZr

(H) <
1
2n
(1+
1
H
)
Proof: Substituting 1BnZr
+

for 
BnZr

in theorem 3.2.1 we get
1BnZr
+

<
1
2n
(1 +
1
H
) (19)
1BnZr
1
2n
(1+
1
H
) < 
+

(20)
2nBnZr1
2n
+
2nBnZr1
2Hn
< 
+

(21)
9
But stable high signals require
3
4
+
1
4H
< 
+

(22)
Comparing inequalities (22) and (23), we conclude that the former implies the latter for n  2.
We can now state the following theorem for the construction of recurrent networks for specic DFAs:
Corollary 3.2.2 Let L(M
DFA
) denote the language accepted by a DFA M with n states and let L(M
RNN
)
be the language accepted by the sparse RNN constructed from M; then, we have L(M
RNN
) = L(M
DFA
) if

BnZr

(H) <
1
2n
and 
+

(H) >
1
4
(3+
1
H
)
Proof: We just need to establish a condition for correct string classication. As we will see, the condition
for stable dynamics in partially recurrent networks is not sucient to guarantee correct string classication.
For the case of an ungrammatical strings, the following condition must be satised:
BnZr
H
2
BnZrH 
+

+(nBnZr1)H 
BnZr

<
1
2
(23)
where we have made the usual simplication about the convergence of the outputs to the xed points 
BnZr

and 
+

; furthermore, we assume that only one DFA state is a rejecting state; then the output neuron's
residual inputs from all other state neurons is positive, weakening the intended high signal for the network's
output neuron. Notice that the output neuron is the only neuron which can be forced toward a low signal
by neurons other than itself.
A similar condition can be formulated for grammatical strings:
BnZr
H
2
+H 
+

BnZr(nBnZr1)H 
BnZr

>
1
2
(24)
The above two inequalities can be simplied into a single inequality:
BnZr2H 
+

+ 2(nBnZr1)H 
BnZr

< 0 (25)
Solving for 
BnZr

, we get the following condition for the correct output of a network:

BnZr

<
1
n
(26)
Thus we have the following conditions for stable low signals and correct string classication:
for (H > 1).

BnZr

(H) <
8
>
<
>
:
1
2n
(1+
1
H
) <
1
n
for H > 1 (dynamics)
1
n
(classication)
(27)
10
Comparing these two conditions, it follows that the condition for correct string classication implies the
condition for stable low signals for rho  1.
Other scenarios of worst cases can be considered: Instead of partial interconnectivity, a worst case anal-
ysis can be carried out where it is assumed that each neuron receives inputs from all other neurons. In that
case, the conditions for stable nite-state dynamics dominate the condition for correct string classication.
Obviously, those conditions imply the conditions for partially recurrent networks. It is also possible to con-
sider a fully recurrent networks where all those weights that are not programmedto +H or BnZrH are assigned
random values from an interval [BnZrW;W]. That scenario applies when a recurrent network initialized with
the knowledge of some DFA is further trained on data for knowledge renement of revision; in that case, the
weight strength H and the value W can be chosen such that the weights initialized to random values do not
destroy the encoded knowledge.
4 EXPERIMENTS
In order to empirically validate our analysis, we constructed networks from randomly generated DFAs with
10, 100 and 1,000 states. For each of the three DFAs, we randomly generated dierent test sets each consist-
ing of 1,000 strings of length 10, 100, and 1,000, respectively. The randomly generated, minimized 100-state
DFA with alphabet  = f0;1gwe encoded into a recurrent network with 101 state neurons is shown in gure
2. The networks' generalization performance on these test sets for rule strength H = f0:0;0:1;0:2;:::;7:0g
are shown in gures 3-5. A misclassication of these long strings for arbitrary large values of H would indi-
cate a network's failure to maintain the stable nite-state dynamics that was encoded. However, we observe
that the networks can implement stable DFAs as indicated by the perfect generalization performance for
some choice of the rule strength H and chosen test set. Thus, we have empirical evidence which supports
our analysis.
All three networks achieve perfect generalization for all three test sets for approximately the same value of
H. Apparently, the network size plays an insignicantrole in determiningfor which valueof H stabilityofthe
internal DFA representation is reached, at least across the considered 3 orders of magnitudeof network sizes.
For the 100-state DFA in gure 2, we computed the value H
pred
= 13:8 which guarantees stable nite-
state dynamics of the constructed network. H
pred
was computed from a worst case analysis; it should thus
come as no surprise that 4 < H
exp
< H
pred
.
11
1
2
3
4
5
6
7
8 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
7475
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
Figure 2: Randomly generated 100-state DFA: The minimal DFA has 100 states and alphabet  =
f0;1g. State 1 is the start state. States with and without double circles are accepting and rejecting states,
respectively.
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35
0.4
0 1 2 3 4 5 6 7
Classification Error (x 100%)
H
RNN Encoding of 10-state DFA
"string.length_10"
"string.length_100"
"string.length_1000"
Figure 3: Performance of 10-state DFA: The network classication performance on three randomly-
generated data sets consisting of 1,000 strings of length 10 (3), 100 (+), and 1,000 (2), respectively, as a
function of the rule strength H (in 0.1 increments) is shown. The network achieves perfect classication on
the strings of length 1,000 for H > 6:0.
12
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35
0.4
0.45
0.5
0 1 2 3 4 5 6 7
Classification Error (x 100%)
H
RNN Encoding of 100-state DFA
"string.length_10"
"string.length_100"
"string.length_1000"
Figure 4: Performance of 100-state DFA: The network classication performance on three randomly-
generated data sets consisting of 1,000 strings of length 10 (3), 100 (+), and 1,000 (2), respectively, as a
function of the rule strength H (in 0.1 increments) is shown. The network achieves perfect classication on
the strings of length 1,000 for H > 6:2.
0
0.1
0.2
0.3
0.4
0.5
0.6
0 1 2 3 4 5 6 7
Classification Error (x 100%)
H
RNN Encoding of 1000-state DFA
"string.length_10"
"string.length_100"
"string.length_1000"
Figure 5: Performance of 1000-state DFA: The network classication performances on three randomly-
generated data sets consisting of 1,000 strings of length 10 (3), 100 (+), and 1,000 (2), respectively, as a
function of the rule strength H (in 0.1 increments). The network achieves perfect classication on the strings
of length 1,000 for H > 6:1.
13
5 SCALING ISSUES
5.1 Preliminaries
The worst case analysis of section makes the following predictions about the implementation of arbitrary
DFAs:
(1) neural DFAs can be constructed that are stable for arbitrary string length for nite value of the
weight strength H,
(2) for most neural DFA implementations, H
emp
< H
pred
, and
(3) the value of H scales with the DFA size, i.e. the larger the DFA and thus the network, the larger
H will be for guaranteed stability.
Predictions (1) and (2) are supported by our experiments. However, when we compare the values H
emp
in
the above experiments for DFAs of dierent sizes, we nd that H
emp
 6 for allthree DFAs. This observation
seems inconsistent with the theory. The reason for this inconsistency lies in the assumption of a worst case
for the analysis, whereas the DFAs we implemented represent average cases. For the construction of the
randomly generated 100-state DFA we found correct classication of strings of length 1,000 for H
emp
= 6:3.
This value corresponds to a DFA whose states have `average' indegree n = 1:5. [The magicvalue 6 also seems
to occur for networks which are trained. Consider a neuron S
i
; then, the weight which causes transitions
between dynamical attractors often has a value  6 [Tino, 1994].]
However, there exist DFAs which exhibit the scaling behavior that is predicted by the theory. We will
briey discuss such DFAs. That discussion will be followed by an analysis of the condition for stable DFA
encodings for asymptotically large DFAs.
5.2 DFA States with Large Indegree
We can approximate the worst case analysis by considering an extreme case of a DFA:
(1) Select an arbitrary DFA state q

;
(2) select a fraction  of states q
j
and set (q
j
;a
k
) = q

.
(3) For low values of , a constructed network behaves similar to a randomly generated DFA.
(4) As the number of states q
j
for which (q
j
;a
k
) = q

increases, the behavior gradually moves
toward the worst case analysis where one neuron. receives a large number of residual inputs with
for a designated input symbol a
k
.
14
3
4
5
6
7
8
9
10
11
0 20 40 60 80 100
weight strength H
maximum indegree
captionScaling Weight Strength: An accepting state q

in 10 randomly generated 100-state DFAs was se-
lected. The number of states q
j
for which (q
j
;0) = q

was gradually increased in increments of 5% of all
DFA states. The graph shows the minimum value of H
emp
for correct classication of 100 strings of length
100. H
emp
increases up to  = 75%; for  > 75%, the DFA becomes degenerated causing H
emp
to decrease
again.
We constructed a network from a randomly generated DFA M
0
with 100 states and two input symbols. We
derived DFAs M

1
;M

2
;:::;M

R
where thefraction of DFA states q
j
from M

i
to M

i+1
with (q
j
;a
k
) = q

increased by ; for our experiments, we chose  = 0:05. Obviously, the languages L(M

i
) change for
dierent values of 
i
. The graph in gure 6 shows for 10 randomly generated DFAs with 100 state the
minimum weight strength H necessary to correctly classify 100 strings of length 100 - a new data set was
randomly generated for each DFA - as a function of  in 5% increments. We observe that H
emp
generally
increases with increasing values of ; in all cases, the hint strength H
emp
sharply decline for some percentage
value . As the number of connections +H to a single state neuron S
i
increases, the number of residual
inputs which can cause unstable internal DFA representation and incorrect classication decreases. Let us
assume that the extreme DFA state q

is an accepting state. Then, the input to output neuron S
t+1
0
is
BnZr
H
2
+H S
t

+
X
q
l
2F
S
t
l
H BnZr
X
q
l
62F
S
t
l
H (28)
For correct classication, the net input must be larger than 0.5. As the value of  increases, the number
of terms in the rst and second sum increase and decrease, respectively. Thus, smaller values of H lead to
correct string classication. A similar argument can be made if q

is a rejecting state.
We observe that there are two runs where outliers occur, i.e. H

i
> H

i+1
even though we have 
i
< 
i+1
.
Since the value H

depends on the randomly generated DFA, the choice for q

and the test set, we can
15
expect such an uncharacteristic behavior to occur in some cases.
5.3 Asymptotic Case Analysis
We are interested in nding an expression for the average number of residual inputs to a neuron in large
DFAs. Since we are dealing with a second-order network architecture, disjoint parts of the network partic-
ipate in the computation of the next state for any given input symbol. Thus, we can limit our analysis to
DFAs with a single input symbol.
Consider a DFA M and its underlying graph G(V;E) whose vertices V and directed edges E are the DFA
states Q and state transitions , respectively. We assume that G(V;E) is randomly generated: For any given
vertex v
j
, a directed edge e
ij
is drawn to another vertex v
i
with equal probability 1=n for all vertices of G.
The number of directed edges entering any given vertex v
i
from other vertices v
m
is the number of residual
inputs state neuron S
i
receives from other state neurons S
m
. Thus we only need to compute the expected
numberof incomingedges (\in-degree") fora DFA generated according to the aboveprobabilitydistribution.
The probability p(d = k) for a vertex to have in-degree k follows a binomial distribution; thus, the av-
erage in-degree is given by the expected value of k which can be written as:
Efd = kg =
n
X
k=1
k
0
@
n
k
1
A
1
n
k
(1BnZr
1
n
)
nBnZrk
For n !1 and  = np  1 where p is the probability that an event occurs (in our case we have p = 1=n and
thus  = 1) and p ! 0 the binomial distribution asymptotically converges toward the Poisson distribution:
Efd = kg =
1
X
k=1
k e
BnZr

k
k!
With  = 1, we conclude
Efd = kg = e
BnZr1
1
X
k=1
1
(kBnZr1)!
= e
BnZr1
1
X
k=0
1
k!
= e
BnZr1
e = 1
We can now state the following asymptotic result for the construction of large DFAs:
Theorem 5.3.1 Let n denote the number of states in a DFA. For n ! 1, the languages accepted by the
DFA M and the constructed neural RNN are identical only for H !1.
16
Proof: Recall the worst case equations (10) and (11) for state transitions of type low ! low and low ! high.
For the asymptotic case n !1, these equations simplify.
The worst case equations of section 3.2.1 apply here also; however, the residual inputs are zero. Thus
the following two conditions for stable low and high signals, respectively, must be satised:
BnZr
H
2
+H 
BnZr

<
1
2
(29)
and
BnZr
H
2
+H 
+

BnZrH 
BnZr

<
1
2
(30)
Combining these two inequalities and solving for 
BnZr

leads to the condition 
BnZr

(H) <
1
3
. Thus, we have the
following conditions for stability of the nite-state dynamics and correct string classication in asymptoti-
cally large DFAs:
stability of signals: 
BnZr

(H) <
1
3
correct string classication: 
BnZr

(H) <
1
n
Unlike in the case of the worst case analysis for partially recurrent networks,the condition for correct string
classication dominatesthe conditions for stable nite-state dynamics. As a matter of fact, stable nite-state
dynamics alone does not require H ! 1; however, correct string classication requires 
BnZr

! 0 and thus
H !1 for n !1.
6 CONCLUSION
We investigated how deterministic nite-state automata (DFAs) can be encoded into sparse second-order
recurrent neural networks. The operation performed by the second-order architecture is akin to DFA state
transitions, making DFA encoding a straightforward operation. We have proven that our algorithm can con-
struct a sparse recurrent network with O(n) state neurons, O(mn) weights and limited fan-out of size O(m)
from any DFA state with n states and m input symbols such that the DFA and the constructed network
accept the same regular language. The DFA dynamics is achieved by programming some of the weights to
values +H or BnZrH. A worst case analysis has revealed a quantitative relationship between the rule strength
H
pred
and the maximum allowed network size such that the network dynamics remains robust for arbitrary
string length. This is only a proof of existence, i.e. we do not make any claims that such a solution can be
learned.
17
Our empirical results suggest, that the weight strength H
exp
 6 is independent of the network size for
typical DFAs. Extreme DFAs can be constructed for the weight strength scales with the network size, i.e.
H
exp
approaches H
pred
.
7 ACKNOWLEDGMENT
We would like to acknowledge useful discussions with B.G. Horne, L. R. Leerink, and T. Lin.
References
[Alon et al., 1991] Alon, N., Dewdney, A., and Ott, T. (1991). Ecient simulation of nite automata by
neural nets. Journal of the Association for Computing Machinery, 38(2):495{514.
[Elman, 1990] Elman, J. (1990). Finding structure in time. Cognitive Science, 14:179{211.
[Frasconi et al., 1991] Frasconi, P., Gori, M., Maggini, M., and Soda, G. (1991). A unied approach for
integrating explicit knowledge and learning by example in recurrent networks. In Proceedings of the
International Joint Conference on Neural Networks, volume 1, page 811. IEEE 91CH3049-4.
[Frasconi et al., 1993] Frasconi, P., Gori, M., and Soda, G. (1993). Injecting nondeterministic nite state
automata into recurrent networks. Technical report, Dipartimento di Sistemi e Informatica, Universita di
Firenze, Italy, Florence, Italy.
[Giles et al., 1992] Giles, C., Miller, C., Chen, D., Chen, H., Sun, G., and Lee, Y. (1992). Learning and ex-
tracting nite state automatawith second-order recurrent neural networks. Neural Computation, 4(3):380.
[Giles and Omlin, 1993] Giles, C. and Omlin, C. (1993). Extraction, insertion and renement of symbolic
rules in dynamically driven recurrent neural networks. Connection Science, 5(3 & 4):307{337.
[Gori et al., 1994] Gori, M., Maggini, M., and Soda, G.(1994). Insertion of nite state automatain recurrent
radial basis function networks. Technical report, Dipartimento di Sistemi e Informatica, Universita di
Firenze, Italy.
[Horne and Hush, 1994] Horne, B. and Hush, D. (1994). Bounds on the complexity of recurrent neural
network implementations of nite state machines. In Advances in Neural Information Processing Systems
6, pages 359{366. Morgan Kaufmann.
[Minsky, 1967] Minsky, M. (1967). Computation: Finite and Innite Machines, chapter 3, pages 32{66.
Prentice-Hall, Inc., Englewood Clis, NJ.
18
[Omlin, 1995] Omlin, C. (1995). Symbolic Knowledge in Recurrent Neural Networks: Issues of Training and
Representation. PhD thesis, Computer Science Department, Rensselaer Polytechnic Institute, Troy, NY
12180.
[Omlin and Giles, 1994] Omlin, C. and Giles, C. (1994). Stable encoding of large nite-state automata in
recurrent neural networks with sigmoid discriminants. Neural Computation. Submitted.
[Pollack, 1991] Pollack, J. (1991). The induction of dynamical recognizers. Machine Learning, 7:227{252.
[Servan-Schreiber et al., 1991] Servan-Schreiber, D., Cleeremans, A., and McClelland, J. (1991). Graded
state machine: The representation of temporal contingencies in simple recurrent networks. Machine
Learning, 7:161.
[Tino, 1994] Tino, P. (1994). Personal communication.
[Watrous and Kuhn, 1992] Watrous, R. and Kuhn, G. (1992). Induction of nite-state languages using
second-order recurrent networks. Neural Computation, 4(3):406.
[Zeng et al., 1993] Zeng, Z., Goodman, R., and Smyth, P. (1993). Learning nite state machines with self-
clustering recurrent networks. Neural Computation, 5(6):976{990.
19