Stable Encoding of Large Finite-State Automata in Recurrent Neural Networks with Sigmoid Discriminants Christian W. Omlin a;b , C. Lee Giles a;c a NEC Research Institute, 4 Independence Way, Princeton, NJ 08540 b CS Department, Rensselaer Polytechnic Institute, Troy, NY 12180 c UMIACS, U. of Maryland, College Park, MD 20742 University of Maryland Technical Report CS-TR-3337 & UMIACS-TR-94-101 Abstract We propose an algorithm for encoding deterministic nite-state automata (DFAs) in second-order re- current neural networks with sigmoidal discriminant function and we prove that the languages accepted by the constructed network and the DFA are identical. The desired nite-state network dynamics is achieved by programming a small subset of all weights. A worst case analysis reveals a relationship be- tween the weight strength and the maximum allowed network size which guarantees nite-state behavior of the constructed network. We illustrate the method by encoding random DFAs with 10, 100, and 1,000 states. While the theory predicts that the weight strength scales with the DFA size, we nd the weight strength to be almost constant for all the experiments. These results can be explained by noting that the generated DFAs represent average cases. We empirically demonstrate the existence of extreme DFAs for which the weight strength scales with DFA size. 1 INTRODUCTION Itispossible totrainrecurrent neuralnetworks tobehave likedeterministic nite-state automata[Elman, 1990, Frasconi et al., 1991,Giles et al., 1992,Pollack, 1991,Servan-Schreiber et al., 1991,Watrous and Kuhn, 1992]. The internal representation of learned DFA states can deteriorate due to the dynamical nature of re- current networks making predictions about the generalization performance of trained recurrent networks dicult [Zeng et al., 1993]. Methods for constructing DFAs in recurrent networks with hard-limiting neu- rons discriminant functions have been proposed [Alon et al., 1991, Horne and Hush, 1994, Minsky, 1967]; 1 methods for constructing networks with sigmoidal and radial-basis discriminant functions are discussed in [Frasconi et al., 1993, Gori et al., 1994, Giles and Omlin, 1993]. We prove that recurrent networks with con- tinuous sigmoidal discriminant functions can be constructed such that the encoded nite-state dynamics remains stable inde nitely. Notice that we do not claim that such a stable representation can be learned. The stability of the internal DFA representation is based on on the premise that the discriminant functions have suciently high gain such that the neurons always operate in a saturated mode. We show empirically that the gain does not scale with the size of networks that implement average DFAs; however, there exist extreme cases of DFAs where we observe a scaling of the neurons' gain with network size. 2 ENCODING DFA DYNAMICS 2.1 Finite State Automata Adeterministic nite-state automaton(DFA) isan acceptor foraregular languageL(M). Formally,a DFAM is a5-tuple M =< ;Q;R;F; > where  = fa 1 ;:::;a m gis the alphabet ofthe languageL, Q = fq 1 ;:::;q n g is a set of states, RQ is the start state, F  Q is a set of accepting states and  : Q ! Q de nes state transitions in M. A string is accepted by the DFA M if an accepting state is reached; otherwise, the string is rejected. 2.2 Recurrent Network We implement DFAs in discrete-time, recurrent networks with second-order weights W ijk . The continuous network dynamics are described by the following equations: S (t+1) i = h(a i (t)) = 1 1+ e BnZra i (t) ; a i (t) = b i + X j;k W ijk S (t) j I (t) k ; (1) where b i is the bias associated with hidden recurrent state neurons S i ; I k denotes the input neuron for symbol a k . The product S (t) j I (t) k directly corresponds to the state transition (q j ;a k ) = q i . After a string has been processed, the output of a designated neuron S 0 decides whether the network accepts or rejects a string. The network accepts a given string if the value of the output neuron S t 0 at the end of the string is greater than 0.5; otherwise, the network rejects the string. 2.3 Encoding Algorithm The encoding algorithmachieves a nearly orthonormal internal representation of the desired DFA dynamics; it constructs a network with n+1 recurrent state neurons (includingthe output neuron) and m input neurons 2 from a DFA with n states and m input symbols. There is a one-to-one correspondence between state neurons S i and DFA states q i . Consider a DFA state transition (q j ;a k ) = q i . Setting W ijk to a large positive value +H will ensure that S (t+1) i will be high and setting W jjk to a large negative value BnZrH will guarantee that the output S (t+1) j will be low. Furthermore, if state q i is an accepting state, then we program the weight W 0jk to +H; otherwise, we set W 0jk to BnZrH. We set the bias terms b i of all state neurons S i to BnZrH=2. For each DFA state transition, at most three weights of the network have to be programmed. The initial state S 0 of the network is S 0 = (S 0 0 ;1;0;0;:::;0). The initial value of the response neuron S 0 0 is 0 if the DFA's initial state q 0 is a rejecting state and 1 otherwise. 3 ANALYSIS We prove the stability of DFA encodings in recurrent neural networks using xed point analysis. We only give the proofs of the theorems which establish our results; for proofs of auxiliary lemmas see [Omlin, 1995]. 3.1 Fixed Point Analysis Recall that the recurrent network changes its state according to equation (1). Our DFA encoding algorithm yields a special form of that equation describing the dynamics of a constructed network: S (t+1) i = g(H(2x i BnZr1)=2) = h(x;H) = 1 1+e H=2(1BnZr2x) (2) The bias term BnZrH=2 is common to all state neurons. Hx is the weighted sum feeding into neuron S (t+1) i . Under certain conditions, the discriminant function h(:) has xed points which allow a stable internal repre- sentation of DFA states. We state here without proof the following facts about xed points of the function h(:); the proofs can be found in [Omlin, 1995]. Lemma 3.1.1 For 0 < H < 4, h(x;H) has the following xed point:  0 = 0:5 Furthermore, h(x;H) converge to  0 for any choice of a start value x 0 . Lemma 3.1.2 For H  4, h(x;H) has three xed points  0 = 0:5,  BnZr and  + . Lemma 3.1.3 for x <  0 and  0 < x; h t (x;H) (with H > 4) converges to  BnZr and  + , respectively. Lemma 3.1.4 For arbitrary H > 0, the two xed points  BnZr and  + are related as follows:  BnZr +  + = 1 3.2 Worst Case Analysis We will now investigate the conditions under which a constructed network implements a given DFA. The main result shows how large a network may be for xed H such that the languages of the DFA and the 3 constructed network are identical: Theorem 3.2.1 Let  denote the maximum fraction of DFA states q j from which there are state transitions (q j ;a k ) = q i for a xed choice of a k and any q i . Then, a sparse recurrent neural network RNN with n+1 sig- moidal state neurons, m input neurons, at most 3mn second-order weights with alphabet  w = fBnZrH;0;+Hg (4 < H min < H < H pred ), n + 1 biases with alphabet  b = fBnZrH=2g, and maximum fan-out 3m can be constructed from a DFA M with n states and m input symbols of states such that the internal state repre- sentation remains stable, i.e. S i > 0:5 when q i is the current DFA state and S i < 0:5 otherwise if n < 1 2   BnZr  (H) (1+ 1 H ) with  +  (H) > 1 4 (3+ 1 H ) for a proper choice of H. Proof: >From the DFA encoding algorithm, we can derive four di erent types of neuron state changes: low ! high: S t+1 i = h(S t j H + X S l C i;l S t l H BnZrS t i H) (S t j : high;S t l ;S t i : low) (3) S t+1 i = h(S t j H + X S l C i;l S t l H + S t i H) (S t j : high;S t l ;S t i : low) (4) high ! high: S t+1 i = h(S t i H + X S l C i;l S t l H) (S t i : high;S t l : low) (5) high ! low: S t+1 i = h(S t i H + X S l C i;l S t l H) (S t i : high;S t l : low) (6) low ! low: S t+1 i = h(S t i H + X S l C i;l S t l H) (S t i ;S t l : low) (7) S t+1 i = h(S t i H + X S l C i;l S t l H) (S t i ;S t l : low) (8) where C i;l = fS j j W ijl = Hg (9) 4 The inputs I t k are not shown explicitly since we assume that each input symbol is assigned a separate input neuron in a one-hot encoding. The DFA state transitions corresponding to the these types of neuron state changes are shown in gure 1. Note that the term BnZrH=2 is implicit; they cause neuron outputs which do not correspond to the cur- rent DFA state to be reset to a small value. The term S t j  H where S t j is a high signal represents the principal contribution to the neuron S t+1 i which is responsible for driving the output of neuron S t+1 i high when the network executes a DFA state transition (q j ;a k ) = q i . All other terms are the residual contribu- tions to the input of neuron S t+1 i where the signal S t l is low. The term P S t l H contributes to the total input of state neuron S t+1 i if there are other transitions (q l ;a k ) = q i in the DFA from which the recurrent network is constructed. Since there is a one-to-one correspondence between state neurons and DFA states, there will always be a negative contribution BnZrS t i H for the current DFA state transition (q j ;a k ) = q i , i.e. only S t i can drive the signal S t+1 i low. Equations (3) and (4) only di er with respect to the sign of the residual input S t i H. If there is a state transition (q i ;a k ) = q i , then equation (4) applies; otherwise, there is a residual low signal S t i trying to drive S t+1 i low and equation (3) applies. Similarly,either equation (7) or (8) is chosen for state transitions of the type low ! low. The above equations account for all possible contributions to the net input of all state neurons because the encoding algorithm constructs a sparse recurrent network. Before we proceed, we make some observations about the equations (3)-(9) which will simplify the anal- ysis. Observation 3.2.1 An analysis for state transitions of type low ! high and low ! low also covers state transitions of type high ! high and high ! low, respectively. The state transition types high ! high and high ! low cause stronger high and low signals, respectively, than the state transitions types low ! high and low ! low. Thus, an analysis that shows stability of high and low signals for these types of state transitions implies the stability of high and low signals for state transitions of type high ! high and high ! low, respectively. Observation 3.2.2 Of the two possible equations (3) and (4) for state transitions low ! high, the former equation also covers the latter equation. If high signals can be kept stable for equation (3), then they certainly can also be kept stable for equation (4). No separate analysis is necessary for thetwo equations. Observation 3.2.3 Of the two possible equations (7) and (8) for state transitions low ! low, the latter equation also covers the former equation. 5 iq jq q l m 1 ql a k a k a k a k iq jq q l m 1 ql a k a k a k a k q k t t+1 t+1 t iq q l m 1 ql a k a k a k t t+1 iq jq q l m 1 ql a k a k a k a k q k t t+1 iq q l m 1 ql a k a k a k iq jq q l m 1 ql a k a k a k a k q k t t+1 qp a k t t+1 qp a k (a) (b) (c) (d) (e) (f) Figure 1: Neuron State Changes and Corresponding DFA State Transitions: The gure (a)-(f) illustrate the DFA state transitions corresponding to allpossible state changes of neuron S i ; the DFA state(s) participating in the current transitions are marked with t and t+1. (a) low ! high (no self-loop on q i ) (b) low ! high (with self-loop on q i ) (c) high ! high (necessarily a self-loop on q i ) (d) high ! low (necessarily no self-loop on q i ) (e) low ! low (no self-loop on q i ) (f) low ! low (with self-loop on q i ). Notice that, even though state q i is neither the source nor the target of the current state transition in cases (e) and (f), the corresponding state neuron S i still receives residual inputs from state neurons S l 1 ;:::;S l m . 6 An argument similar to that given for validity of observation 3.2.2 can be given. Thus, we are left with only the two following types of state transitions which represent the worst cases: low ! low: S t+1 i = h(S t i H + X S l C i;l S t l H) (S t i ;S t l : low) (10) low ! high: S t+1 i = h(S t j H + X S l C i;l S t l H BnZrS t i H) (S t j : high;S t l ;S t i : low) (11) Thus, the network equation (1) takes on the special form a i  BnZrH=2+ Hx i with x i = X j S t j (12) since all but one input neuron have value 0 at any given time t. Notice that the number of terms in the sum P j S t j is equal to the number of DFA states q j for which there are transitions (q j ;a k ) = q i for the current input symbol a k . The ideal case where neurons do not receive residual inputs from other neurons, successive network state changes can be expressed as the iteration of the function h t (x;H) h t (x;H) = 8 < : h(x;H) t = 1 h(h tBnZr1 (x;H);H) t > 1 (13) with x = 0 and x = 1 for low and high signals, respectively We can now de ne a new function h t  (x;H) which takes the residual inputs into consideration. Since the initial output value of all state neurons except the neuron assigned to a DFAs start state are zero, the residual inputs under this worst case assumption are identical for all neurons; let x denote the residual neuron inputs. Then, the function h t  (x;H) is de ned as h t  (x;H) = 8 < : h(x;H) t = 1 h(h tBnZr1  (x+ x;H)+x;H) t > 1 (14) The initial values for low and high signals are x = 0 and x = 1, respectively. 7 Consider a state q i and let D ik denote the number of states q j such that (q j ;a k ) = q i for each symbol a k . Setting D = maxfD ik g, each recurrent state neuron receives residual input from at most  = D n recur- rent neurons for a chosen input symbol a k . We make the following simplifying assumption: Assumption 3.2.1 Each state neuron receives residual inputs from exactly  other neurons for all input symbols. Although this is generally not the case, this worst case covers all possible DFAs and simpli es the analysis. Then, we can quantify x for the case of low signals: Lemma 3.2.1 The low signals are bounded from above by the xed point  BnZr  of the function h t  BnZr (x;H) = 8 < : h(0;H) t = 1 h(nh tBnZr1  BnZr (x;H);H) t > 1 (15) Similarly, we can quantify high signals: Lemma 3.2.2 The high signals are bounded from below by the xed point  +  of the function h t  + (x;H) = 8 < : h(1;H) t = 1 h(h tBnZr1  + (x;H)BnZrh tBnZr1  BnZr (x;H);H) t > 1 (16) The derivation of these iterated functions can be found in [Omlin, 1995]; the functions (16) and (17) converge toward their xed points  BnZr  and  +  according to lemma 3.1.3. In practice, only few neurons ever exceed or fall below the xed points  BnZr and  + , respectively. Fur- thermore, the network has a built-in reset mechanism which allows low and high signals to be strengthened. Low signals S t j are strengthened to g(BnZrH=2) when there exists no state transition (:;a k ) = q j . In that case, the neuron S t j receives no inputs from any of the other neurons; its output becomes less than  BnZr since g(BnZrH=2) = h(0;H) <  BnZr for H > 4. Similarly, high signals S t i get strengthened if either low signals feeding into neuron S i on a current state transition (fq j g;a k ) = q i have been strengthened during the previous time step or when the number of positive residual inputs to neuron S i compensates for a weak high signal from neurons fq j g. Thus only a small number of neurons will have S t j >  BnZr or S t j <  + and only for a nite amount of time. For the majority of neurons we have S t j   BnZr and S t j   + . Since constructed recurrrent networks are able to regenerate their internal signals and since typical DFAs do not have the worst case 8 properties assumed in this analysis, the conditions guaranteeing stable low and high signals are generally much too strong for some given DFA. Scaling issues are discussed elsewhere [Omlin and Giles, 1994]. In order for the internal DFA state representation to remain stable, the low and high signals must remain suciently distinguishable: De nition 3.2.1 An encoding of DFA states in a SORNN is called stable if all the low and high signals are less and larger than 0.5, respectively. We now return to the worst case state equations (10) and (11). In order for the function (16) to converge toward the low xed point  BnZr  , the argument of equation (10) must satisfy the following invariant property: BnZr H 2 + H n BnZr  < 1 2 (17) Similarly, the argument of equation (11) must satisfy the following invariant property in order for function (17) to converge toward the high xed point  +  : BnZr H 2 + H  +  BnZrH(1BnZr + ) 1 2 : (18) Solving inequalities (18) and (19) for n and  +  , respectively, we obtain the conditions for stable low and high signals of the theorem. Although we have stated the conditions for stable low and high signals separately, the condition for sta- ble signals can be simpli ed as follows: Corollary 3.2.1 If all neurons (not including the output neuron) have at least two weights W ixk = W iyk = H for all input symbols a k , then the internal representation of a DFA remains stable if  BnZr  (H) < 1 2n (1+ 1 H ) Proof: Substituting 1BnZr +  for  BnZr  in theorem 3.2.1 we get 1BnZr +  < 1 2n (1 + 1 H ) (19) 1BnZr 1 2n (1+ 1 H ) <  +  (20) 2nBnZr1 2n + 2nBnZr1 2Hn <  +  (21) 9 But stable high signals require 3 4 + 1 4H <  +  (22) Comparing inequalities (22) and (23), we conclude that the former implies the latter for n  2. We can now state the following theorem for the construction of recurrent networks for speci c DFAs: Corollary 3.2.2 Let L(M DFA ) denote the language accepted by a DFA M with n states and let L(M RNN ) be the language accepted by the sparse RNN constructed from M; then, we have L(M RNN ) = L(M DFA ) if  BnZr  (H) < 1 2n and  +  (H) > 1 4 (3+ 1 H ) Proof: We just need to establish a condition for correct string classi cation. As we will see, the condition for stable dynamics in partially recurrent networks is not sucient to guarantee correct string classi cation. For the case of an ungrammatical strings, the following condition must be satis ed: BnZr H 2 BnZrH  +  +(nBnZr1)H  BnZr  < 1 2 (23) where we have made the usual simpli cation about the convergence of the outputs to the xed points  BnZr  and  +  ; furthermore, we assume that only one DFA state is a rejecting state; then the output neuron's residual inputs from all other state neurons is positive, weakening the intended high signal for the network's output neuron. Notice that the output neuron is the only neuron which can be forced toward a low signal by neurons other than itself. A similar condition can be formulated for grammatical strings: BnZr H 2 +H  +  BnZr(nBnZr1)H  BnZr  > 1 2 (24) The above two inequalities can be simpli ed into a single inequality: BnZr2H  +  + 2(nBnZr1)H  BnZr  < 0 (25) Solving for  BnZr  , we get the following condition for the correct output of a network:  BnZr  < 1 n (26) Thus we have the following conditions for stable low signals and correct string classi cation: for (H > 1).  BnZr  (H) < 8 > < > : 1 2n (1+ 1 H ) < 1 n for H > 1 (dynamics) 1 n (classi cation) (27) 10 Comparing these two conditions, it follows that the condition for correct string classi cation implies the condition for stable low signals for rho  1. Other scenarios of worst cases can be considered: Instead of partial interconnectivity, a worst case anal- ysis can be carried out where it is assumed that each neuron receives inputs from all other neurons. In that case, the conditions for stable nite-state dynamics dominate the condition for correct string classi cation. Obviously, those conditions imply the conditions for partially recurrent networks. It is also possible to con- sider a fully recurrent networks where all those weights that are not programmedto +H or BnZrH are assigned random values from an interval [BnZrW;W]. That scenario applies when a recurrent network initialized with the knowledge of some DFA is further trained on data for knowledge re nement of revision; in that case, the weight strength H and the value W can be chosen such that the weights initialized to random values do not destroy the encoded knowledge. 4 EXPERIMENTS In order to empirically validate our analysis, we constructed networks from randomly generated DFAs with 10, 100 and 1,000 states. For each of the three DFAs, we randomly generated di erent test sets each consist- ing of 1,000 strings of length 10, 100, and 1,000, respectively. The randomly generated, minimized 100-state DFA with alphabet  = f0;1gwe encoded into a recurrent network with 101 state neurons is shown in gure 2. The networks' generalization performance on these test sets for rule strength H = f0:0;0:1;0:2;:::;7:0g are shown in gures 3-5. A misclassi cation of these long strings for arbitrary large values of H would indi- cate a network's failure to maintain the stable nite-state dynamics that was encoded. However, we observe that the networks can implement stable DFAs as indicated by the perfect generalization performance for some choice of the rule strength H and chosen test set. Thus, we have empirical evidence which supports our analysis. All three networks achieve perfect generalization for all three test sets for approximately the same value of H. Apparently, the network size plays an insigni cantrole in determiningfor which valueof H stabilityofthe internal DFA representation is reached, at least across the considered 3 orders of magnitudeof network sizes. For the 100-state DFA in gure 2, we computed the value H pred = 13:8 which guarantees stable nite- state dynamics of the constructed network. H pred was computed from a worst case analysis; it should thus come as no surprise that 4 < H exp < H pred . 11 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 7475 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 Figure 2: Randomly generated 100-state DFA: The minimal DFA has 100 states and alphabet  = f0;1g. State 1 is the start state. States with and without double circles are accepting and rejecting states, respectively. 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0 1 2 3 4 5 6 7 Classification Error (x 100%) H RNN Encoding of 10-state DFA "string.length_10" "string.length_100" "string.length_1000" Figure 3: Performance of 10-state DFA: The network classi cation performance on three randomly- generated data sets consisting of 1,000 strings of length 10 (3), 100 (+), and 1,000 (2), respectively, as a function of the rule strength H (in 0.1 increments) is shown. The network achieves perfect classi cation on the strings of length 1,000 for H > 6:0. 12 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 0 1 2 3 4 5 6 7 Classification Error (x 100%) H RNN Encoding of 100-state DFA "string.length_10" "string.length_100" "string.length_1000" Figure 4: Performance of 100-state DFA: The network classi cation performance on three randomly- generated data sets consisting of 1,000 strings of length 10 (3), 100 (+), and 1,000 (2), respectively, as a function of the rule strength H (in 0.1 increments) is shown. The network achieves perfect classi cation on the strings of length 1,000 for H > 6:2. 0 0.1 0.2 0.3 0.4 0.5 0.6 0 1 2 3 4 5 6 7 Classification Error (x 100%) H RNN Encoding of 1000-state DFA "string.length_10" "string.length_100" "string.length_1000" Figure 5: Performance of 1000-state DFA: The network classi cation performances on three randomly- generated data sets consisting of 1,000 strings of length 10 (3), 100 (+), and 1,000 (2), respectively, as a function of the rule strength H (in 0.1 increments). The network achieves perfect classi cation on the strings of length 1,000 for H > 6:1. 13 5 SCALING ISSUES 5.1 Preliminaries The worst case analysis of section makes the following predictions about the implementation of arbitrary DFAs: (1) neural DFAs can be constructed that are stable for arbitrary string length for nite value of the weight strength H, (2) for most neural DFA implementations, H emp < H pred , and (3) the value of H scales with the DFA size, i.e. the larger the DFA and thus the network, the larger H will be for guaranteed stability. Predictions (1) and (2) are supported by our experiments. However, when we compare the values H emp in the above experiments for DFAs of di erent sizes, we nd that H emp  6 for allthree DFAs. This observation seems inconsistent with the theory. The reason for this inconsistency lies in the assumption of a worst case for the analysis, whereas the DFAs we implemented represent average cases. For the construction of the randomly generated 100-state DFA we found correct classi cation of strings of length 1,000 for H emp = 6:3. This value corresponds to a DFA whose states have `average' indegree n = 1:5. [The magicvalue 6 also seems to occur for networks which are trained. Consider a neuron S i ; then, the weight which causes transitions between dynamical attractors often has a value  6 [Tino, 1994].] However, there exist DFAs which exhibit the scaling behavior that is predicted by the theory. We will brie y discuss such DFAs. That discussion will be followed by an analysis of the condition for stable DFA encodings for asymptotically large DFAs. 5.2 DFA States with Large Indegree We can approximate the worst case analysis by considering an extreme case of a DFA: (1) Select an arbitrary DFA state q  ; (2) select a fraction  of states q j and set (q j ;a k ) = q  . (3) For low values of , a constructed network behaves similar to a randomly generated DFA. (4) As the number of states q j for which (q j ;a k ) = q  increases, the behavior gradually moves toward the worst case analysis where one neuron. receives a large number of residual inputs with for a designated input symbol a k . 14 3 4 5 6 7 8 9 10 11 0 20 40 60 80 100 weight strength H maximum indegree captionScaling Weight Strength: An accepting state q  in 10 randomly generated 100-state DFAs was se- lected. The number of states q j for which (q j ;0) = q  was gradually increased in increments of 5% of all DFA states. The graph shows the minimum value of H emp for correct classi cation of 100 strings of length 100. H emp increases up to  = 75%; for  > 75%, the DFA becomes degenerated causing H emp to decrease again. We constructed a network from a randomly generated DFA M 0 with 100 states and two input symbols. We derived DFAs M  1 ;M  2 ;:::;M  R where thefraction of DFA states q j from M  i to M  i+1 with (q j ;a k ) = q  increased by ; for our experiments, we chose  = 0:05. Obviously, the languages L(M  i ) change for di erent values of  i . The graph in gure 6 shows for 10 randomly generated DFAs with 100 state the minimum weight strength H necessary to correctly classify 100 strings of length 100 - a new data set was randomly generated for each DFA - as a function of  in 5% increments. We observe that H emp generally increases with increasing values of ; in all cases, the hint strength H emp sharply decline for some percentage value . As the number of connections +H to a single state neuron S i increases, the number of residual inputs which can cause unstable internal DFA representation and incorrect classi cation decreases. Let us assume that the extreme DFA state q  is an accepting state. Then, the input to output neuron S t+1 0 is BnZr H 2 +H S t  + X q l 2F S t l H BnZr X q l 62F S t l H (28) For correct classi cation, the net input must be larger than 0.5. As the value of  increases, the number of terms in the rst and second sum increase and decrease, respectively. Thus, smaller values of H lead to correct string classi cation. A similar argument can be made if q  is a rejecting state. We observe that there are two runs where outliers occur, i.e. H  i > H  i+1 even though we have  i <  i+1 . Since the value H  depends on the randomly generated DFA, the choice for q  and the test set, we can 15 expect such an uncharacteristic behavior to occur in some cases. 5.3 Asymptotic Case Analysis We are interested in nding an expression for the average number of residual inputs to a neuron in large DFAs. Since we are dealing with a second-order network architecture, disjoint parts of the network partic- ipate in the computation of the next state for any given input symbol. Thus, we can limit our analysis to DFAs with a single input symbol. Consider a DFA M and its underlying graph G(V;E) whose vertices V and directed edges E are the DFA states Q and state transitions , respectively. We assume that G(V;E) is randomly generated: For any given vertex v j , a directed edge e ij is drawn to another vertex v i with equal probability 1=n for all vertices of G. The number of directed edges entering any given vertex v i from other vertices v m is the number of residual inputs state neuron S i receives from other state neurons S m . Thus we only need to compute the expected numberof incomingedges (\in-degree") fora DFA generated according to the aboveprobabilitydistribution. The probability p(d = k) for a vertex to have in-degree k follows a binomial distribution; thus, the av- erage in-degree is given by the expected value of k which can be written as: Efd = kg = n X k=1 k 0 @ n k 1 A 1 n k (1BnZr 1 n ) nBnZrk For n !1 and  = np  1 where p is the probability that an event occurs (in our case we have p = 1=n and thus  = 1) and p ! 0 the binomial distribution asymptotically converges toward the Poisson distribution: Efd = kg = 1 X k=1 k e BnZr  k k! With  = 1, we conclude Efd = kg = e BnZr1 1 X k=1 1 (kBnZr1)! = e BnZr1 1 X k=0 1 k! = e BnZr1 e = 1 We can now state the following asymptotic result for the construction of large DFAs: Theorem 5.3.1 Let n denote the number of states in a DFA. For n ! 1, the languages accepted by the DFA M and the constructed neural RNN are identical only for H !1. 16 Proof: Recall the worst case equations (10) and (11) for state transitions of type low ! low and low ! high. For the asymptotic case n !1, these equations simplify. The worst case equations of section 3.2.1 apply here also; however, the residual inputs are zero. Thus the following two conditions for stable low and high signals, respectively, must be satis ed: BnZr H 2 +H  BnZr  < 1 2 (29) and BnZr H 2 +H  +  BnZrH  BnZr  < 1 2 (30) Combining these two inequalities and solving for  BnZr  leads to the condition  BnZr  (H) < 1 3 . Thus, we have the following conditions for stability of the nite-state dynamics and correct string classi cation in asymptoti- cally large DFAs: stability of signals:  BnZr  (H) < 1 3 correct string classi cation:  BnZr  (H) < 1 n Unlike in the case of the worst case analysis for partially recurrent networks,the condition for correct string classi cation dominatesthe conditions for stable nite-state dynamics. As a matter of fact, stable nite-state dynamics alone does not require H ! 1; however, correct string classi cation requires  BnZr  ! 0 and thus H !1 for n !1. 6 CONCLUSION We investigated how deterministic nite-state automata (DFAs) can be encoded into sparse second-order recurrent neural networks. The operation performed by the second-order architecture is akin to DFA state transitions, making DFA encoding a straightforward operation. We have proven that our algorithm can con- struct a sparse recurrent network with O(n) state neurons, O(mn) weights and limited fan-out of size O(m) from any DFA state with n states and m input symbols such that the DFA and the constructed network accept the same regular language. The DFA dynamics is achieved by programming some of the weights to values +H or BnZrH. A worst case analysis has revealed a quantitative relationship between the rule strength H pred and the maximum allowed network size such that the network dynamics remains robust for arbitrary string length. This is only a proof of existence, i.e. we do not make any claims that such a solution can be learned. 17 Our empirical results suggest, that the weight strength H exp  6 is independent of the network size for typical DFAs. Extreme DFAs can be constructed for the weight strength scales with the network size, i.e. H exp approaches H pred . 7 ACKNOWLEDGMENT We would like to acknowledge useful discussions with B.G. Horne, L. R. Leerink, and T. Lin. References [Alon et al., 1991] Alon, N., Dewdney, A., and Ott, T. (1991). Ecient simulation of nite automata by neural nets. Journal of the Association for Computing Machinery, 38(2):495{514. [Elman, 1990] Elman, J. (1990). Finding structure in time. Cognitive Science, 14:179{211. [Frasconi et al., 1991] Frasconi, P., Gori, M., Maggini, M., and Soda, G. (1991). A uni ed approach for integrating explicit knowledge and learning by example in recurrent networks. In Proceedings of the International Joint Conference on Neural Networks, volume 1, page 811. IEEE 91CH3049-4. [Frasconi et al., 1993] Frasconi, P., Gori, M., and Soda, G. (1993). Injecting nondeterministic nite state automata into recurrent networks. Technical report, Dipartimento di Sistemi e Informatica, Universita di Firenze, Italy, Florence, Italy. [Giles et al., 1992] Giles, C., Miller, C., Chen, D., Chen, H., Sun, G., and Lee, Y. (1992). Learning and ex- tracting nite state automatawith second-order recurrent neural networks. Neural Computation, 4(3):380. [Giles and Omlin, 1993] Giles, C. and Omlin, C. (1993). Extraction, insertion and re nement of symbolic rules in dynamically driven recurrent neural networks. Connection Science, 5(3 & 4):307{337. [Gori et al., 1994] Gori, M., Maggini, M., and Soda, G.(1994). Insertion of nite state automatain recurrent radial basis function networks. Technical report, Dipartimento di Sistemi e Informatica, Universita di Firenze, Italy. [Horne and Hush, 1994] Horne, B. and Hush, D. (1994). Bounds on the complexity of recurrent neural network implementations of nite state machines. In Advances in Neural Information Processing Systems 6, pages 359{366. Morgan Kaufmann. [Minsky, 1967] Minsky, M. (1967). Computation: Finite and In nite Machines, chapter 3, pages 32{66. Prentice-Hall, Inc., Englewood Cli s, NJ. 18 [Omlin, 1995] Omlin, C. (1995). Symbolic Knowledge in Recurrent Neural Networks: Issues of Training and Representation. PhD thesis, Computer Science Department, Rensselaer Polytechnic Institute, Troy, NY 12180. [Omlin and Giles, 1994] Omlin, C. and Giles, C. (1994). Stable encoding of large nite-state automata in recurrent neural networks with sigmoid discriminants. Neural Computation. Submitted. [Pollack, 1991] Pollack, J. (1991). The induction of dynamical recognizers. Machine Learning, 7:227{252. [Servan-Schreiber et al., 1991] Servan-Schreiber, D., Cleeremans, A., and McClelland, J. (1991). Graded state machine: The representation of temporal contingencies in simple recurrent networks. Machine Learning, 7:161. [Tino, 1994] Tino, P. (1994). Personal communication. [Watrous and Kuhn, 1992] Watrous, R. and Kuhn, G. (1992). Induction of nite-state languages using second-order recurrent networks. Neural Computation, 4(3):406. [Zeng et al., 1993] Zeng, Z., Goodman, R., and Smyth, P. (1993). Learning nite state machines with self- clustering recurrent networks. Neural Computation, 5(6):976{990. 19