Using Recurrent Neural Networks to Learn the Structure of Interconnection Networks  UNIVERSITY OF MARYLAND TECHNICAL REPORT UMIACS-TR-94-20 AND CS-TR-3226 Mark W. Goudreau a and C. Lee Giles b;c a Department of Computer Science University of Central Florida, Orlando, FL 32816 b NEC Research Institute 4 Independence Way, Princeton, NJ 08540 c Institute for Advanced Computer Studies University of Maryland, College Park, MD 20742 15 February 1994 Abstract A modi ed Recurrent Neural Network (RNN) is used to learn a Self-Routing Interconnection Network (SRIN) from a set of routing examples. The RNN is modi ed so that it has several distinct initial states. This is equivalent to a single RNN learning multiple di erent synchronous sequential machines. We de ne such a sequential machine structure as augmented and show that a SRIN is essentially an Augmented Synchronous Sequential Machine (ASSM). As an example, we learn a small six-switch SRIN. After training we extract the net- work's internal representation of the ASSM and corresponding SRIN.  This paper is adapted from ( Goudreau, 1993, Chapter 6 ) . A shortened version of this paper was published in ( Goudreau & Giles, 1993 ) . 1 1 Introduction The use of Recurrent Neural Networks (RNNs) to learn Synchronous Sequential Machines (SSMs) from examples is a problem which has been studied extensively. A related topic that, to the authors' knowledge, has not been studied previously is the use of RNNs to learn SSMs for which several distinct initial states are possible. This problem is interesting because it maps directly into the problem of learning the structure of an Interconnection Network (IN) from examples. Learning an IN from examples is an unusual approach. Traditionally, INs have been designed (and not learned) based on several criteria, including speed, complexity, ease of route calcu- lation, and fault tolerance. Numerous di erent types of INs have been proposed. A detailed description of many of the INs that have been applied to parallel computing can be found in Siegel's book ( Siegel, 1990 ) . In this paper, the learning of Self-Routing Interconnection Networks (SRINs) is discussed. SRINs are described in detail in Section 2. They can be used to describe many commonly used INs. If one considers a parallel computing system, the idea is that the processors have certain communication requirements with other processors, and certain message headers (also described in Section 2) must be used that allow the message to pass through the SRIN and reach the desired destination processor. The message headers provide routing information to the switches in the SRIN. The method that is proposed makes use of a second-order Single-Layer Recurrent Neural Network (SLRNN) to learn the training data. The training data is a table of source processors, message headers, and destination processors. Once the training data has been learned, the structure of the SRIN can be extracted from the SLRNN. One topic that is related to the learning of INs was presented by Hillis ( Hillis, 1990 ) . In that paper, Hillis makes use of simulated evolution to construct sorting networks. It should be also be mentioned that neural networks have been previously used for interconnection network routing: for example, see ( Brown, 1989; Brown & Liu, 1990; Funabiki et al., 1991; Funabiki et al., 1993; Goudreau & Giles, 1992; Hakim & Meadows, 1990; Lee & Chang, 1993; Marrakchi & Troudet, 1989; Melsa et al., 1990a; Melsa et al., 1990b; Takefuji & Lee, 1991; Thomopoulos et al., 1991; Troudet & Walters, 1991 ) . However, none of these methods learned the structure of the interconnection networks; the structure of the interconnection network was always directly mapped into the neural network. 2 Self-Routing Interconnection Networks In this section we describe SRINs. The purpose of a SRIN is to allow a set of processors to communicate amongst themselves using a store-and-forward method- 2 ?header body end-symbol Figure 1: A message for a SRIN. The header is a string of symbols. The body is also a string of symbols. The header is separated from the body by the end-symbol, . nn n m 0 1 N-1 0 Figure 2: A 1N self-routing switch. If the heading symbol is a i j , the message is routed from input m 0 to output n j . Messages are bu ered until they can be routed. Messages are routed in a First-In, First-Out (FIFO) basis. ology. For store-and-forward routing, a message travels along the path towards its destination one step at a time. Once a switch has sent a message to the next switch, it is free to be used to route a di erent message. A SRIN does not use an external controller to route messages. Rather, the switches in a SRIN are smart; they examine the message that is being sent and decide which way to route it. A message, as it is de ned in this paper, consists of two parts: a header and a body. The header and the body are separated by an end-symbol, which will be denoted by . A schematic diagram of a message is shown in Figure 1. The SRIN uses self-routing switches to route messages. A self-routing switch with M inputs and N outputs will be called a M N self-routing switch. A drawing of a 1N self-routing switch is shown in Figure 2. Figure 3 is a drawing of a M N self-routing switch. Intuitively, the routing works in the following manner. When a message arrives at a switch, the switch strips o the rst symbol of the message header and examines it. If the symbol is not the end-symbol, , the switch sends the message (minus the header symbol that it just examined) through the appropriate output port. If the 3 nn n mm m 0 1 N-1 0 1 M-1 Figure 3: An M  N self-routing switch. Again, input messages are routed on a First-In, First-Out (FIFO) basis. symbol is the end-symbol, then the message should be given to the processor that is associated with the switch. In other words, once the header has been stripped down so that only the end-symbol is left, the message does not get passed through the SRIN any longer. Although it is not shown in Figures 2 and 3, it must be remembered that there is a connection from each switch to its associated processor. This can be thought of as another output port for the end-symbol, . The switches work in a First-In, First-Out (FIFO) manner. If a message can not be routed immediately, it is bu ered until it can be routed. 2.1 A Formal Description of a Self-Routing Interconnection Net- work We will now present a more formal description of an SRIN. The SRIN will have a set of M processors, P = fp 0 ;p 1 ;:::;p MBnZr1 g. Each processor, p j , will be associated with a set of switches, Q j . Each set Q j must contain at least one switch. (Otherwise, the processor would have no way to communicate with the other processors.) Note that not all of the switches in the SRIN need to be associated with a processor. Some switches in an SRIN might never be used to connect to a processor. Such switches are called don't care switches, or free switches. A message can be routed through a free switch, but a free switch should never be the rst switch nor last switch in a route; to do so would imply that the free switch is associated with some processor. For the sake of convenience, we will associate some processor with each free switch, even though such an association is meaningless since it is never used. Now, the SRIN has the set of switches Q = Q 0 [ Q 1 [[ Q MBnZr1 . The processor function, , performs the mapping, : Q ! P. That is, if q is a switch, then (q) is the processor associated with that switch. We will let R be the nite input alphabet for the header and the body. The end- 4 symbol, , is not a member of R; that is,  62 R. The end-symbol is only used to separate the header from the body. One typical alphabet would be R = fr 0 ;r 1 g. In general, however, the magnitude of the alphabet can be greater than two. Since most computing environments are binary, the situation becomes more complicated when the magnitude of the alphabet is greater than two. In such cases, the members of the alphabet must be encoded in some way. There does not need to be a size limitation for the header nor the body. In a binary system, the end-symbol might consist of a string of zeros and ones that is illegal in the header. Alternatively, one might send the header and the body separately, in which case the position of the end-symbol will be understood by the receiving switch. Another approach would be to designate the rst byte of the header to represent the length of the header. There are many di erent ways to implement the end-symbol, but for our purposes here we will assume the end-symbol is just a symbol that can be transmitted in one time step. We now de ne the switch transition function, , which performs the mapping,  : Q R ! Q. If q is a switch and r is the input symbol that is taken from the front of the header, then (q;r) is the next switch that the message will be sent to. Finally, when processor p j sends a message, it starts the message o from one of the switches in the set Q j . Each processor will have a switch that is designated for this purpose. We de ne the switch function, , which performs the mapping, : P ! Q. If p is a processor, then (p) is the switch that performs the rst stage of the routing for any messages that p sends. We will call the switch (p) the designated switch for processor p. The SRIN can now be de ned formally. De nition 1 A self-routing interconnection network is a 7-tuple, (P;Q;R;; ; ;), where:  P is a nite, nonempty set of processors.  Q is a nite, nonempty set of switches.  R is a nite, nonempty set of input symbols.   : QR ! Q is the switch transition function.  : Q ! P is the processor function.  : P ! Q is the switch function.   is the end-symbol. 5 2.2 An Example of a Self-Routing Interconnection Network Figure 4 shows an example of a SRIN. This SRIN has the set of switches Q = fq 0 ;q 1 ;q 2 ;q 3 ;q 4 ;q 5 g. In Figure 4, the switches are shown as white boxes with their labels in the upper left corner. The outputs, labeled r 0 and r 1 , are on the right side of each switch. The inputs come to the left side of the switch. A switch with a * in its lower left corner is a designated switch. The set of processors is P = fp 0 ;p 1 ;p 2 ;p 3 g. In Figure 4, the processors are repre- sented by shaded areas. For this example, we have the input alphabet R = fr 0 ;r 1 g. Thus, each switch has two output ports. In practice, not all of the output ports need to be connected; some can be don't cares if they are never used for routing. The number of input ports for each switch can be zero or any larger integer. If a switch has zero input ports, it must be a designated switch or it will have no purpose in the SRIN. The processor function, , is shown here: (q 0 ) = p 0 (q 1 ) = p 0 (q 2 ) = p 1 (q 3 ) = p 2 (q 4 ) = p 3 (q 5 ) = p 3 (1) The switch transition function, , is shown here: (q 0 ;r 0 ) = q 4 (q 0 ;r 1 ) = q 2 (q 1 ;r 0 ) = q 3 (q 1 ;r 1 ) = q 3 (q 2 ;r 0 ) = q 1 (q 2 ;r 1 ) = q 4 (q 3 ;r 0 ) = q 0 (q 3 ;r 1 ) = q 4 (q 4 ;r 0 ) = q 1 (q 4 ;r 1 ) = q 5 (q 5 ;r 0 ) = q 5 (q 5 ;r 1 ) = q 2 (2) Finally, the switch function, , is shown here: (p 0 ) = q 0 (p 1 ) = q 2 (p 2 ) = q 3 (p 3 ) = q 4 (3) Suppose processor p 1 has data to send to processor p 2 . One possible way to send the data there is to use the header r 1 r 0 r 1 . The message starts in switch q 2 = (p 1 ). The 6 r0 r1 r0 r1 r0 r1 r0 r1 r0 r1 r0 r1 q q q q q q * * * * p p p p 0 1 2 3 4 5 0 1 2 3 Figure 4. A sample self-routing interconnection network. 7 switch q 2 strips o the left-most symbol in the header, in this case r 1 , and routes the message to q 4 = (q 2 ;r 1 ). The message then goes to switch q 1 = (q 4 ;r 0 ), and at last to switch q 3 = (q 1 ;r 1 ). At this point, the header has been spent and the message is led by the end-symbol. Switch q 3 therefore delivers the message to processor p 2 = (q 3 ). 3 Synchronous Sequential Machines In this section we discuss the relationship between SSMs and SRINs. FSAs and SSMs are thoroughly described in ( Hopcroft & Ullman, 1979; Kohavi, 1978 ) . We will use the de nition of SSMs that is provided in ( Kohavi, 1978 ) . 1 De nition 2 A synchronous sequential machine is a quintuple, (O;S;I;;), where:  O is a nite, nonempty set of outputs symbols.  S is a nite, nonempty set of states.  I is a nite, nonempty set of inputs symbols.   : S I ! S is the state transition function.   : S ! O is the output function. >From De nitions 1 and 2, it is clear that SRINs and SSMs are very similar. In fact, it only takes a slight expansion of the de nition of SSMs to make them directly equivalent to SRINs. We will describe how SRINs are equivalent to Augmented SSMs (ASSMs), which will be de ned below. Let each processor in P be an output symbol in O. Similarly, let switch in Q be a state in S, and each input symbol in R be an input symbol in I. The switch transition function, , becomes the state transition function, . The processor function, , becomes the output function, . Now the only components of the SRIN that are not equivalent to components in the SSM are the end-symbol, , and the switch function, . The ASSM will have an end-symbol, . The meaning of the end-symbol in this context is merely that the input string has reached its conclusion, and the ASSM can now output the value corresponding to the input string. The ASSM will also have a state function, . The state function  performs the mapping,  : O ! S. In this context, the state 1 Speci cally, our de nition is for a Moore machine. 8 function allows for some set of initial states in the ASSM. Thus, each input string that is to be entered into the ASSM must have an output symbol associated with it. This output symbol allows the ASSM to choose the correct starting state. The ASSM can now be de ned formally. De nition 3 As augmented synchronous sequential machine is an 7-tuple, (O;S;I;;;;), where:  O is a nite, nonempty set of outputs.  S is a nite, nonempty set of states.  I is a nite, nonempty set of inputs.   : S I ! S is the state transition function.   : S ! O is the output function.   : O ! S is the state function.   is the end-symbol. It is now clear from De nitions 1 and 3 that SRINs and ASSMs are equivalent. 4 Machine Inference Since SRINs and ASSMs are equivalent, there are many issues that have been ex- plored for ASSMs that can now be used for SRINs. For example, just as one can minimize the size of an ASSM by merging equivalent states ( Kohavi, 1978 ) , one can minimize the size of a SRIN by merging equivalent switches. What we are interested in is the inference of a SRIN from examples. A great deal of work has been done on the problem of machine inference. It has been shown that, in the worstcase, inferring a SSM fromsparse data is an intractable problem ( Angluin, 1978; Gold, 1978; Kearns & Valiant, 1989; Pitt & Warmuth, 1989 ) . Approaches that can be used to infer SSMs will now be examined. 4.1 Recurrent Neural Network Approaches The literature on the use of neural networks for grammatical inference and nite- state machine learning is now well-established ( Cleeremans et al., 1989; Giles et al., 9 1992; Giles et al., 1992; Mozer & Bachrach, 1991; Pollack, 1991; Watrous & Kuhn, 1992; Zeng et al., 1993 ) . These approaches use RNNs to represent SSMs. For the work done in this paper, the approach described in ( Giles et al., 1992; Giles et al., 1992 ) will be used (see Section 5.2). We refer readers who are interested in the details to those references. In Section 5, there is a qualitative explanation of the RNN approach to learning SRINs. Until recently, the RNN approach for SSM inference that is used in this paper had only been possible for unknown SSMs with a small number of states (approximately 30). It should be pointed out that the limited success of this approaches is due to the learning algorithms. Generally, the RNNs haverich representational capabilities. However, recent work has shown that certain types of large SSMs, with thousands of states, are learnable ( Giles & Horne, 1994 ) . Furthermore, the performance of the RNNs can sometimes be improved by using \hints" if partial information about the structure of the SSM is known ( Giles & Omlin, 1993 ) . Other approaches that use neural networks for grammatical inference exist that will not be used in this paper. For example, the use of update graphs has been proposed by Rivest and Schapire ( Rivest & Schapire, 1987a; Rivest & Schapire, 1987b; Schapire, 1988 ) . An update graph is an alternate representation of a FSA that can be much smaller than the FSA for certain environments that often arise in practice. Update graphs can be mapped to a connectionist system that can learn the environment from examples ( Mozer & Bachrach, 1990; Mozer & Bachrach, 1991 ) . 4.2 Traditional Approaches Other methods for grammatical inference, which do not use neural networks, have demonstratedsomepromising results. In fact, apolynomial time algorithm proposed by Trakhtenbrot and Barzdin ( Trakhtenbrot & Barzdin, 1973 ) has been shown to be able to infer some very large nite automata. The algorithm produces a machine that is consistent with a sparsely labeled tree, but the machine that is produced is not necessarily the minimum machine that is consistent with the data. Lang ( Lang, 1992 ) performed several experiments using this algorithm for random nite automata with 1000 states and 2000 transitions. Given enough training examples, the algorithm was almost always able to construct a machine that was similar to the correct machine. 10 source header destination processor processor . . . . . . . . . p 0 r 1 p 1 p 2 r 1 r 1 r 0 r 0 r 1 p 1 p 2 r 0 r 0 r 0 p 0 p 3 r 0 r 1 p 2 p 1 r 1 r 1 r 0 p 3 . . . . . . . . . Table 1: An example of training data for a SRIN. The data is consistent with the SRIN in Figure 4. 5 Recurrent Neural Networks to Learn Interconnec- tion Networks The problem that we are trying to solve is posed in the following form. We have a training list of source processor, header, and destination processor combinations that must be implemented by a SRIN. We must infer an SRIN that can accomplish all of the routings described in the training list. Hopefully, the SRIN will also be able to generalize. That is, we would like the SRIN to perform correct routings even for examples that are not on the training list. For example, Table 1 contains data for some such problem. The data in Table 1 is consistent with the SRIN in Figure 4. 5.1 Recurrent Network with Several Distinct Initial States The structure of a second order SLRNN is shown in Figure 5. There are M inputs lines, x 1 ;x 2 ;:::;x M . The value of input x i (1  i  M) at time t is x t i . There is a single layer of N neurons, y 1 ;y 2 ;:::;y N . The output value of neuron y i (1  i  N) at time t is y t i . At each time step, these output values are stored in a bank of latches to act as the \state" of the network. The state is fed back as an input to the layer of neurons on the subsequent time step. For a second order SLRNN, neuron j and input k to have a combined e ect on neuron i that is quanti ed by the parameter w ijk . The output of neuron i is de ned by the following equation: y t i = g( N X j=1 M X k=1 w ijk y tBnZr1 j x t k ) (4) 11 N t 1 t M t 1 t N t-1 1 t-1 N1 yy yy y y xx Figure 5: A Single-Layer Recurrent Neural Network (SLRNN). There are M input bits, N state bits, and (up to) N output bits. The bank of N latches is shown on the right. The activation function, g(x), is the sigmoid function shown here: g(x) = 1 1+e BnZrax (5) The second order SLRNN is used to infer the ASSM that is equivalent to the un- known SRIN. Again, the approach used in ( Giles et al., 1992; Giles et al., 1992 ) will be used here. The SLRNN will learn the training data, and the ASSM will be extracted from the SLRNN. A gradient descent algorithm that is similar to the one proposed by Williams and Zipser ( Williams & Zipser, 1989 ) will be used to train the SLRNN. If it were not possible to extract the ASSM from the SLRNN, then the SLRNN would not be useful for this problem. Without ASSM extraction, the SLRNN could still generalize, but when the structure of a SRIN is the desired result such generalization is not useful. The problem of inferring an ASSM, as opposed to inferring an SSM, is that in an ASSM there will generally be several possible initial states. In an SSM, a single initial state is generally assumed. With this in mind, the reader should be able to see that the SLRNN that is trying to learn an ASSM will, in fact, be trying to learn a SSM for each possible initial state. However, all of these SSMs will have the same structure; the only di erence is that the initial state varies. Since all of the SSMs have the same structure, it can reasonably be hoped that the SLRNN will try to merge these multiple SSMs into a single ASSM. In fact, this turns out to be the case. Rather than learning several di erent SSMs in di erent sections of the state 12 space, the SLRNN will take advantage of the identical structures of the SSMs and merge them. Intuitively, the input vectors of the SLRNN will represent the inputs and the end- symbol of the ASSM (and therefore the input symbols and the end-symbol of the SRIN). The state vectors of the SLRNN represent the states of the ASSM (and the switches of the SRIN). And the output vectors of the SLRNN represent the outputs of the the ASSM (and the processors of the SRIN). We will use simple one-hot encodings for the input symbols and the output symbols. Recall that a one-hot code is a code for which each symbol is represented by a vector that has one element equal to one while all of the other elements are equal to zero. Additionally, the initial state, which depends on the source processor, will be a one- hot code. This structure is chosen because it is known that a solution will exist to map the SLRNN to the desired ASSM ( Goudreau et al., 1994 ) . The solution that is known to exist requires the use of one-hot codes for the states and the inputs. The representation that the SLRNN actually learns, however, can have states that are not in a one-hot code. The SLRNN might construct a solution that is di erent from the one-hot solution. Clearly, there must be enough neurons to represent the processors (outputs) with a one-hot code. Therefore, the number of neurons must at least be equal to the number of processors. For the one-hot solution to exist, however, there must be one neuron for each switch as well. Unfortunately, one does not generally know the number of switches beforehand. It is necessary to estimate the number of switches, and provide at least that many neurons. This is one of the weaknesses that is common to many neural network approaches: often it is not clear what size neural network would be best. One approach is to start with as many switch neurons as reasonably possible; if training is successful, then reduce the number of neurons using a destructive heuristic ( Omlin & Giles, 1993 ) . The SLRNN will work in the following manner. The input to the SLRNN will be an initial state vector and a series of input vectors. The initial state vector and the rst input vector will be applied to the SLRNN at time step one, the second input vector will be applied at time step two, and so on. Note that the last input vector will always correspond to the end-symbol. The outputs of the neurons are not examined until the entire input has been applied. After the end-symbol is entered, gradient descent techniques can be used to train the SLRNN so that the actual output gets closer to the desired output. There are many techniques that can be used to help the SLRNN correctly handle all of the training data. The statevectors that are created through the use of the gradient descent techniques are not examined during training, but they are examined after training to describe the structure of the underlying SRIN. As mentioned before, these states actually represent switches in the SRIN. Since the neurons have a continuous activation 13 function, it is generally necessary to use some partitioning and clustering techniques to make a group of states equivalent. After this is done, an ASSM can be extracted from the SLRNN. The extracted ASSM might not be minimal, but in this case standard SSM minimization techniques can be used. 5.2 A Training Example with Analysis Perhaps the easiest wayto describe the mapping to the SLRNN would be to describe the structure of the SLRNN given data for the SRIN from Figure 4. We will choose to have N = 8 neurons in the SLRNN. Since one-hot codes will be used, four initial states vectors are necessary, corresponding to the four source processors that are present in the datatable. In reality, however, it should be remembered thatthe state vectors in the SLRNN will actually correspond to switches from the SRIN, rather than processors. Thus, when we choose state vectors for the four source processors, wearereally choosingstatevectorsfortheir respective designated switches (asshown in Equation 3). Let h n;m be a vector of dimension n that has a value \1" in position m and a value \0" in all of the other positions. For example, h 5;2 = [0;1;0;0;0] T . When the source processor is processor p 0 (which is equivalent to saying that the designated switch is q 0 , see Equation 3) the initial statevectorwill be h 8;1 . Similarly, source processor p 1 will make the initial state vector h 8;3 , source processor p 2 will make the initial state vector h 8;4 , and source processor p 3 will make the initial state vector h 8;5 . This mapping of source processors to initial state vectors can be made arbitrarily. There is no reason not to use vector h 8;2 , for example. The only reason that this vector was not used was because of the way the training data was generated. Now the input symbols, r 0 and r 1 , as well as the end-symbol, , can be mapped to input vectors. There are three symbols that must be mapped to input vectors, so we will have three vectors of length three for this purpose. Let r 0 correspond to input vector h 3;1 , r 1 correspond to input vector h 3;2 , and  correspond to input vector h 3;3 . Finally, the processors can now be mapped to output vectors. There are eight neurons in the SLRNN, but only the outputs of four of them are needed for the output vectors. Therefore, the output values will only be taken from neurons 0, 1, 2, and 3. Destination processor p 0 will correspond to output vector h 4;1 , destination processor p 1 will correspond to output vector h 4;2 , destination processor p 2 will correspond to output vector h 4;3 , and destination processor p 3 will correspond to output vector h 4;4 . Now that the speci cs of the encoding have been explained, we can use the training data to train the SLRNN. The data table contained an entry for every header up to length 11 (including the end-symbol) for every processor. The data table starts with 14 strings of length one and concludes with strings of length 11. The SLRNN does not try to learn all of the data in this table simultaneously. Rather, it rst learns the rst 20 lines of the data table. Then the resulting SLRNN is checked against the rest of the data table to see how it generalizes. If perfect generalization does not occur, the SLRNN adds 20 more lines to its training data. Once these 40 lines are learned, generalization is checked again. This process is repeated until all of the lines in the data table have been learned. This heuristic approach, which involves incremental expansion of the training data, has proven to be quite successful in practice. For our experiment, by the time the SLRNN learned the rst 280 lines of the table, the SLRNN was able to generalize for all of the remaining strings in the table. The full table had 8188 lines. 5.3 Extraction of the Interconnection Network from the Trained Neural Network It was mentioned in Section 5.1 that this problem can be thought of as the problem of learning several separate SSMs, in this case four. Given this fact, one can examine the four separate SSMs that are generated from the four di erent initial states. Table 2 shows the unminimized SSM with initial state corresponding to switch q 0 that was extracted from the SLRNN. This machine shall be called M1. Details on the method of SSM extraction that was used can be found in ( Giles et al., 1992; Giles et al., 1992 ) . The left column (S) contains the number of each state. State \1" in Tables 2 to 9 corresponds to the initial state. The next column (O) contains the output value associated with that state. The following two columns contain the next state given input zero (NS 0 ) and given input one (NS 1 ). The nal column (QR) contains the quantized representation for the state in the SLRNN. It should be kept in mind that the SLRNN actually uses real valued state vectors, as does the clustering algorithm that was used for SSM extraction. The SSM extraction algorithm makes use of a real valued representative vector for each state. In the SLRNN, whenever a state vector is \close" to a representative state vector, it is assumed that the two vectors implement the same state. To simplify our analysis, the real valued representative state vectors are quantized. Any value greater than or equal to 0.5 in a representative state vector is set to 1 after quantization, while any value less than 0.5 is set to 0. Henceforth, when state vectors are mentioned, we will actually be talking about these quantized representative state vectors. Thus, in general, the state vectors in Table 2 actually represent some volume of the state space. The unminimized SSMs given designated switches q 2 , q 3 , and q 4 are shown in Ta- bles 3 (machine M2), 4 (machine M3), and 5 (machine M4), respectively. Once the unminimized SSMs and their state vector representations are known, the 15 S O NS 0 NS 1 QR 1 1 2 3 10000000 2 4 4 5 01111100 3 2 4 6 00100001 4 1 7 7 11000011 5 4 8 9 11010100 6 4 4 5 01011100 7 3 10 6 00010101 8 4 11 9 11001101 9 2 4 2 00100101 10 1 6 3 10000001 11 4 11 9 11011101 Table 2: Machine M1, the unminimized SSM with initial state corresponding to switch q 0 . Column S contains the state. Column O contains the output. Column NS 0 contains the next state given input 0. Column NS 1 contains the next state given input 1. Column QR contains the quantized representation for the state in the SLRNN. S O NS 0 NS 1 QR 1 2 2 3 00100000 2 1 4 4 11000011 3 4 2 5 01011110 4 3 6 7 00010101 5 4 8 9 11010100 6 1 7 10 10000101 7 4 2 5 01011100 8 4 11 12 11001101 9 2 2 7 01100101 10 2 2 7 00100001 11 4 11 12 11011101 12 2 2 13 00100101 13 4 2 5 01111100 Table 3: Machine M2, the unminimized SSM with initial state corresponding to switch q 2 . 16 S O NS 0 NS 1 QR 1 3 2 3 00010000 2 1 3 4 10000101 3 4 5 6 01011100 4 2 5 3 00100001 5 1 7 7 11000011 6 4 8 9 11010100 7 3 2 10 00010101 8 4 11 12 11001101 9 2 5 3 01100101 10 4 5 6 01010100 11 4 8 9 11011101 12 2 5 13 00100101 13 4 5 6 01111100 Table 4: Machine M3, the unminimized SSM with initial state corresponding to switch q 3 . S O NS 0 NS 1 QR 1 4 2 3 00001000 2 1 4 4 11000011 3 4 5 6 11000101 4 3 7 8 00010101 5 4 9 10 11001101 6 2 2 11 01100101 7 1 11 12 10000101 8 4 2 13 01010100 9 4 9 10 11011101 10 2 2 14 00100101 11 4 2 13 01011100 12 2 2 11 00100001 13 4 5 6 11010100 14 4 2 13 01111100 Table 5: Machine M4, the unminimized SSM with initial state corresponding to switch q 4 . 17 S O NS 0 NS 1 QR 1 1 2 3 10000000 10000001 2 4 4 5 01011100 01111100 3 2 4 2 00100001 00100101 4 1 6 6 11000011 5 4 5 3 11001101 11010100 11011101 6 3 1 2 00010101 Table 6: The minimal SSM with initial state corresponding to switch q 0 . This SSM is equivalent to machine M1 in Table 2. SSMs can be minimized and merged. Table 6 contains the minimized SSM that is extracted from the SLRNN when the initial state vector corresponds to designated switch q 0 . Each state in the minimized machine can be associated with one or more state vectors. For example, state 1 of the SSM in Table 6 is associated with two state vectors, 10000000 and 10000001. The minimal SSMs with initial states corresponding to designated switches q 2 , q 3 , and q 4 are shown in Tables 7, 8, and 9, respectively. With the information in Tables 6 to 9, we can merge the SSMs into an ASSM. States in the SSMs with any overlapping vector representations are assumed to be equivalent. The results show that the SLRNN has indeed \merged" the SSMs. In fact, if we ignore the quantized representations of the states, the four minimal SSMs are equivalent except for the fact that they have di erent initial states. That is, if we ignore the initial states, the four SSMs can be relabeled so that they are identical. It should be clear how Table 6 corresponds to the SRIN in Figure 4. State 1 corresponds to switch q 0 , state 2 corresponds to switch q 4 , state 3 corresponds to switch q 2 , state 4 corresponds to switch q 1 , state 5 corresponds to switch q 5 , and state 6 corresponds to switch q 3 . Simple observation will show that the other three SSMs also correspond to the SRIN in Figure 4. Table 10 merges this information to show how the SLRNN represents the switches of Figure 4. State vectorsare shownand the machines thatutilized them are presented. Examination of Table 10 shows that, for the most part, machines M1, M2, M3, and M4 make use of state vectors that are approximately equal. For example, all four 18 S O NS 0 NS 1 QR 1 2 2 3 00100000 00100001 00100101 01100101 2 1 4 4 11000011 3 4 2 5 01011100 01011110 01111110 4 3 6 3 00010101 5 4 5 1 11001101 11010100 11011101 6 1 3 1 10000101 Table 7: The minimal SSM with initial state corresponding to switch q 2 . This SSM is equivalent to machine M2 in Table 3. S O NS 0 NS 1 QR 1 3 2 3 00010000 00010101 2 1 3 4 10000101 3 4 5 6 01010100 01011100 01111100 4 2 5 3 00100001 00100101 01100101 5 1 1 1 10000101 6 4 6 4 11001101 11010100 11011101 Table 8: The minimal SSM with initial state corresponding to switch q 3 . This SSM is equivalent to machine M3 in Table 4. 19 S O NS 0 NS 1 QR 1 4 2 3 00001000 01010100 01011100 01111100 2 1 4 4 11000011 3 4 3 5 11000101 11001101 11010100 11011101 4 3 6 1 00010101 5 2 2 1 00100001 00100101 01100101 6 1 1 5 10000101 Table 9: The minimal SSM with initial state corresponding to switch q 4 . This SSM is equivalent to machine M4 in Table 5. switch QR machine q 0 10000000 M1 10000001 M1 10000101 M2,M3,M4 q 1 11000011 M1,M2,M3,M4 q 2 00100000 M2 00100001 M1,M2,M3,M4 00100101 M1,M2,M3,M4 01100101 M2,M3,M4 q 3 00010000 M3 00010101 M1,M2,M3,M4 q 4 00001000 M4 01010100 M3,M4 01011100 M1,M2,M3,M4 01011110 M2 01111100 M1,M2,M3,M4 q 5 11000101 M4 11001101 M1,M2,M3,M4 11010100 M1,M2,M3,M4 11011101 M1,M2,M3,M4 Table 10. How the SLRNN represents the switches from the SRIN in Figure 4. 20 machines use the state region 00010101 to represent switch q 3 from Figure 4. Another interesting fact is that the state vectors that represent equivalent states tend to be close to each other in the state space. That is, the state vectors for equivalent states tend to have small Hamming distances from one another. For example, machine M1 uses two state regions to represent switch q 4 , 01011100 and 01111100. This fact leads one to believe that the SLRNN actually uses something like a cohesive sub-space for each state. It seems likely that the unminimized SSMs that were extracted have equivalent states due to the nature of the SSM extraction algorithm that is used. If other clustering approaches were used ( Das & Mozer, 1994; Watrous & Kuhn, 1992; Zeng et al., 1993 ) , it is possible that minimal SSMs could have been extracted directly from the SLRNN. Furthermore, values such as 0.49 and 0.51 will have di erent values after quantization, although they are in fact quite close in the state space. Thus, two state vectors that are quite close in terms of Euclidean distance (before quantization) might be quite far from each other after quantization. This is another possible explanation for the fact that the extracted SSM was not minimal. Another observation is that the SLRNN did not learn anything approaching the one-hot solution that was described in Section 5. In fact, the only states vectors that were one-hot after quantization were the initial state vectors that were forced upon the SLRNN. However, while these initial state vectors were not returned to, they did seem to give the SLRNN a bias on its state representation. Finally, the factthat the correct SRIN was extracted means thatgood generalization for strings longer than those in the training data has obviously been achieved. 6 Conclusions A radical approach to the construction of interconnection networks has been pre- sented. This approach uses training data from an unknown interconnection network to teach a RNN to generate an interconnection network that is capable of routing the training data. It was shown that this problem maps directly to the problem of learning a SSM with several distinct initial states. The proposed approach took advantage of previous work on the use of RNNs to inference SSMs. However, it should be noted that the relationship between interconnection networks and SSMs might also allow for some non-neural networkapproaches to be used for the same problem. It seems likely that such methods could be varied slightly to accommodate the interconnection network inference problem, just as the RNN method for SSM inference can be varied slightly to perform interconnection network inference. 21 It was demonstrated that given a table of training data, it is possible to use a RNN to generate the structure of an interconnection network that is capable of routing the training data. Furthermore, the interconnection networkthat is generated might be able to generalize for inputs that are not in the training data. A sample problem was used to illustrate the methodology. This work clearly pointed out the need for further research into the use of RNNs to inference larger SSMs. To date, RNNs have had limited success for large problems in grammatical inference, but some recent results are promising ( Giles & Horne, 1994 ) . The concept of learning interconnection networks is an unusual one for the intercon- nection network community. It remains to be seen whether such learning approaches will become a useful method for interconnection network design. 7 Acknowledgement The authors would like to acknowledge useful discussions with Sanjeev Kulkarni and Cli B. Miller. References Angluin, D. (1978). On the complexity of minimum inference of regular sets. Infor- mation and Control, 39, 337{350. Brown, T. X. (1989). Neural networks for switching. IEEE Communications Mag- azine, 27(11), 72{81. Brown, T. X. & Liu, K.-H. (1990). Neural network design of a Banyan network controller. IEEE Journal on Selected Areas of Communication, 8(8), 1428{ 1438. Cleeremans, A., Servan-Schreiber, D., & McClelland, J. L. (1989). Finite state automata and simple recurrent networks. Neural Computation, 1, 372{381. Das, S. & Mozer, M. (1994). A hybrid clustering/gradient descent architecture for nite state machine induction. In Cowan, J., Tesauro, G., & Alspector, J. (Eds.), Advances in Neural Information Processing Systems 6. San Mateo, CA: Morgan Kaufmann. Funabiki, N., Takefuji, Y., & Lee, K. C. (1991). A neural network model for trac controls in multistage interconnection networks. In Proceedings of the Interna- tional Joint Conference on Neural Networks 1991, (pp. A898). 22 Funabiki, N., Takefuji, Y., & Lee, K. C. (1993). Comparisons of seven neural net- work models on trac control problems in multistage interconnection networks. IEEE Transactions on Computers, 42(4), 497{501. Giles, C. L. & Horne, B. G. (1994). Some large DFA's are learnable with recurrent neural networks. NEC Research Institute, Inc., Princeton, NJ. Giles, C. L., Miller, C. B., Chen, D., Chen, H. H., Sun, G. Z., & Lee, Y. C. (1992). Learning and extracting nite state automata with second-order recurrent neu- ral networks. Neural Computation, 4(3), 393{405. Giles, C. L., Miller, C. B., Chen, D., Sun, G. Z., Chen, H. H., & Lee, Y. C. (1992). Extracting and learning an unknown grammar with recurrent neural networks. In Moody, J., Hanson, S., & Lippmann, R. (Eds.), Advances in Neural Informa- tion Processing Systems 4, (pp. 317{324)., San Mateo, CA. Morgan Kaufmann Publishers. Giles, C.L. & Omlin, C.W. (1993). Extraction, insertion and re nement of symbolic rules in dynamically-driven recurrent neural networks. Connection Science, 5(3,4), 307{337. Special Issue on Architectures for Integrating Symbolic and Neural Processes. Gold, E. M. (1978). Complexity of automaton identi cation from given data. In- formation and Control, 37, 302{320. Goudreau, M. W. (1993). Neural Network Applications for Interconnection Net- works. PhD thesis, Princeton University, Princeton, NJ. Goudreau, M. W. & Giles, C. L. (1992). Routing in random multistage intercon- nection networks: Comparing exhaustive search, greedy and neural network approaches. International Journal of Neural Systems, 3(2), 125{142. Goudreau, M. W. & Giles, C. L. (1993). Discovering the structure of a self-routing interconnection network with a recurrent neural network. In Alspector, J., Goodman, R., & Brown, T. X. (Eds.), Proceedings of the International Work- shop on Applications of Neural Networks to Telecommunications, (pp. 52{59)., Hillsdale, NJ. Lawrence Erlbaum Associates, Inc. Goudreau, M. W., Giles, C. L., Chakradhar, S. T., & Chen, D. (1994). First-order vs. second-order single layer recurrent neural networks. To be published in IEEE Transactions on Neural Networks. Hakim, N. Z. & Meadows, H. E. (1990). A neural network approach to the setup of the Benes switch. In Infocom 90, (pp. 397{402). Hillis, W. D. (1990). Co-evolving parasites improve simulated evolution as an opti- mization procedure. Physics D, 42, 228{234. 23 Hopcroft, J. E. & Ullman, J. D. (1979). Introduction to Automata Theory, Lan- guages, and Computation. Reading, MA:Addison-Wesley Publishing Company, Inc. Kearns, M. & Valiant, L. (1989). Cryptographic limitations on learning boolean for- mulae and nite automata. In Proceedings of the 21st Annual ACM Symposium on Theory of Computing. ACM Press. Kohavi, Z. (1978). Switching and Finite Automata Theory (second ed.). New York, NY: McGraw-Hill, Inc. Lang, K. (1992). Random DFA's can be approximately learned from sparse uni- form examples. In Proceedings of the Fifth ACM Workshop on Computational Learning Theory, (pp. 45{52). ACM Press. Lee, S.-L. & Chang, S. (1993). Neural networks for routing of communication net- works with unreliable components. IEEE Transactions on Neural Networks, 4(5), 854{863. Marrakchi, A. M. & Troudet, T. (1989). A neural net arbitrator for large crossbar packet-switches. IEEE Transactions on Circuits and Systems, 36(7), 1039{ 1041. Melsa, P. J. W., Kenney, J. B., & Rohrs, C. E. (1990a). A neural network solution for call routing with preferential call placement. In Proceedings of the 1990 Global Telecommunications Conference, (pp. 1377{1382). Melsa, P. J. W., Kenney, J. B., & Rohrs, C. E. (1990b). A neural network solution for routing in three stage interconnection networks. In Proceedings of the 1990 International Symposium on Circuits and Systems, (pp. 483{486). Mozer, M. C. & Bachrach, J. (1990). Discovering the structure of a reactive envi- ronment by exploration. Neural Computation, 2(4), 447{457. Mozer, M. C. & Bachrach, J. (1991). SLUG: A connectionist architecture for in- ferring the structure of nite-state environments. Machine Learning, 7(2/3), 139{160. Omlin, C. W. & Giles, C. L. (1993). Pruning recurrent neural networks for im- proved generalization performance. Technical Report TR 93-6, Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY. Pitt, L. & Warmuth, M. (1989). The minimum DFA consistency problem cannot be approximated within any polynomial. In Proceedings of the 21st Annual ACM Symposium on Theory of Computing. ACM Press. 24 Pollack, J. B. (1991). The induction of dynamical recognizers. Machine Learning, 7(2/3), 227{252. Rivest, R. L. & Schapire, R. E. (1987a). Diversity-based inference of nite au- tomata. In Proceedings of the Twenty-Eighth Annual Symposium on Founda- tions of Computer Science, (pp. 78{87). Rivest, R. L. & Schapire, R. E. (1987b). A new approach to unsupervised learning in deterministic environments. In Langley, P. (Ed.), Proceedings of the Fourth International Workshop on Machine Learning. Schapire, R. E. (1988). Diversity-based inference of nite automata. Master's thesis, Massachusetts Institute of Technology, Cambridge, MA. Siegel, H. J. (1990). Interconnection Networks for Large Scale Parallel Processing. New York: McGraw-Hill. Takefuji, Y. & Lee, K. C. (1991). An arti cial hysteresis binary neuron: A model suppressing the oscillatory behavior of neural dynamics. Biological Cybernetics, 64, 353{356. Thomopoulos, S. C. A., Zhang, L., & Wann, C. D. (1991). Neural network imple- mentation of the shortest path algorithm for trac routing in communication networks. In Proceedings of the International Joint Conference on Neural Net- works 1991, (pp. 2693{2702)., Singapore. Trakhtenbrot, B. & Barzdin, Y. (1973). Finite Automata: Behavior and Synthesis. Amsterdam: North-Holland Publishing Company. Troudet, T. P. & Walters, S. M. (1991). Neural network architecture for crossbar switch control. IEEE Transactions on Circuits and Systems, 38(1), 42{56. Watrous, R. L. & Kuhn, G. M. (1992). Induction of nite-state languages using second-order recurrent networks. Neural Computation, 4(3), 406{414. Williams, R. J. & Zipser, D. (1989). A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1(2), 270{280. Zeng, Z., Goodman, R. M., & Smyth, P. (1993). Learning nite state machines with self-clustering recurrent networks. Neural Computation, 5(6), 976{990. 25