Learning DFA: Evolution versus Evidence Driven State Merging
Simon M. Lucas and T. Jeff Reynolds
Department of Computer Science, University of Essex, Colchester, Essex CO4 3SQ
{sml,reynt}@essex.ac.uk

Abstract

Learning Deterministic Finite Automata (DFA) is a hard task that has been much studied within machine learning and evolutionary computation research. This paper presents a new method for evolving DFAs, where only the transition matrix is evolved, and the state labels are chosen to optimize the fit between final states and training set labels. This new procedure reduces the size and, in particular, the complexity of the search space. We present results on the Tomita languages, and also on a set of random DFA induction problems of varying target size and training set density. The Tomita results show that we can learn the languages with far fewer fitness evaluations than previous evolutionary methods. On the random DFA task we compare our method with the Evidence Driven State Merging (EDSM) algorithm, which is one of the most powerful known DFA learning algorithms. We show that our method outperforms EDSM when the target DFA is small (fewer than 32 states) and the training set is sparse.

1 Introduction

Learning deterministic finite automata (DFA) from samples of labelled data is an interesting problem that has been extensively studied. It has been shown to be a hard task by a number of criteria [14, 8], and is a good benchmark for evaluating machine learning algorithms. More discussion can be found in [5, 13]. An attractive property for comparing learning algorithms is that tasks can readily be generated which vary in the complexity of the target that has to be learnt as well as in the amount of data which is supplied to the learning process. This paper compares an evolutionary method with the Evidence Driven State Merging (EDSM) algorithm due to Price, appearing in Lang, Pearlmutter and Price [9].
This was shown to be a very successful algorithm for learning DFA in the Abbadingo One competition [9], and has since been refined by Cicchello and Kremer [3]. There have been many attempts to learn DFA (or their associated regular languages) using evolutionary approaches [4, 11], by training recurrent neural networks using appropriate variations of the back-propagation training algorithm [6, 19], and also by evolving recurrent neural networks [1]. Note that learning some DFA that is consistent with the training data is a trivial problem: one could simply construct the prefix tree acceptor. Hence it is usual to add a further constraint, that the DFA should generalise to unseen test data, or that the challenge is to find the smallest DFA that is consistent with the training set; these two goals are generally mutually compatible.

2 Deterministic Finite Automata

A Deterministic Finite Automaton (DFA) is conventionally defined as a 5-tuple (Q, Σ, δ, q0, F), where Q is the set of states, Σ is the set of input symbols, δ : Q × Σ → Q is the state transition function, q0 is the start state and F (where F ⊆ Q) is the set of accepting states. This form of DFA is also known as a complete DFA since there is a transition from every state to some other state for every input. For convenience of representation, and also to generalize the problem to multi-class problems, we adopt a modified representation described next.

2.1 Representation and Search Space

We enumerate the states Q as the set of integers 0 . . . (n − 1), where there are n states, and we always use state zero as the start state. Similarly, for the set of inputs we use the integers in the range 0 . . . (|Σ| − 1). The transition function is implemented as a matrix of size (n × |Σ|) indexed on current state and current input, where each element of the matrix is in the range 0 . . . (n − 1). We label each state with its output class, where each label is in the range 0 . . . (c − 1), where c is the number of string classes.
The vector o represents these labels, where o_i is the output label for state i. For conventional DFA usage of accepting or rejecting strings the number of possible outputs is two, with some arbitrary convention chosen such as zero for an accepting state and one for a rejecting state. This definition, however, can be used for c-way classification problems, and is not restricted to merely accepting or rejecting strings. Using this representation, the only possible aspects of the system to evolve are the number of states, the state label vector, and the state transition matrix. For each run of the evolutionary algorithm (EA, described in the next section) we can either fix the number of states by using prior knowledge of the target DFA size, or else we can re-run the EA with an increasing number of states until optimal performance is achieved on the training set, or we exceed a time-bound. In all the experiments reported below we fixed the number of states using prior knowledge of the problem. When we run the EA with the number of states fixed at n, we can partition the search space into two parts: the transition matrix and the output vector. The size of the search space, S, is given in Equation 1 below.

S = n^(n|Σ|) · c^n   (1)

Note that the true search space is smaller than this by at least a factor of (n − 1)! due to the existence of isomorphisms. For all the experiments reported in this paper we deal with binary input strings (|Σ| = 2) and a two-class problem, where each string is accepted or rejected (c = 2).

S = n^(2n) · 2^n   (2)

The 2^n term comes from the number of ways of labelling the states. We eliminate this from the search using our smart labelling scheme, described next.

2.2 Smart Tuning the Output Labels

If we evolve both the transition matrix and the output vector, there will certainly exist some epistasis or dependency between these two parts.
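As a concrete illustration of the representation of Section 2.1 and the search-space count of Equation 2, consider the following minimal sketch (Python, for exposition only; the paper's implementation is in Java and all names here are ours):

```python
import random

def random_transition_matrix(n, n_symbols=2, seed=0):
    """delta[state][symbol] -> next state; states are 0..n-1, state 0 is the start."""
    rng = random.Random(seed)
    return [[rng.randrange(n) for _ in range(n_symbols)] for _ in range(n)]

def final_state(delta, string):
    """Run the DFA from the start state over a sequence of integer symbols."""
    state = 0
    for symbol in string:
        state = delta[state][symbol]
    return state

def search_space(n):
    """Equation 2: binary inputs, two classes -> n^(2n) transition matrices
    times 2^n ways of labelling the states."""
    return n ** (2 * n) * 2 ** n
```

For example, search_space(3) evaluates to 3^6 · 2^3 = 5832, before discounting isomorphic machines.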
This means that to improve a particular solution, the EA may need to make adjustments both to the transition matrix and to the set of output labels. To overcome this, we devised an algorithm for optimally selecting the output labels given the transition matrix and the training set; we refer to this as Smart Tuning the Output Labels. The procedure is simple. Let h[i][c] be an array denoting the number of times the DFA finished in state i for pattern class c. Given a set of training strings T, and having initialized the elements of h to zero: for each string t in T increment h[f(t)][c], where f(t) is the final state reached on completion of reading string t, and c is the class of t. For each state we then choose the output label:

o_i = argmax_c h[i][c]   (3)

This is a very efficient procedure which can be directly incorporated into the fitness function, since the number of correctly classified strings is simply the sum over all states of the argmax terms from Equation 3 above. The search space is now:

S = n^(2n)   (4)

Hence, we reduce the search space by a factor of 2^n. We found that this simple trick allowed the EA to find the optimal solution in significantly fewer evaluations than when we evolved both the transition matrix and the output vector (i.e. the state labels). This is to be expected, since for each transition matrix under consideration we directly determine the optimal output vector.

3 Evolutionary Algorithm (EA)

We used a multi-start random hill-climber. This simplifies the design of the EA, as it obviates the need to experiment with population size, selection methods, and, more significantly, the need to define a meaningful crossover operator that avoids the problem of competing conventions [16]. Random hill-climbers, also known as a (1+1) Evolution Strategy [2], are the simplest form of evolutionary algorithm, but often perform competitively with more complex EAs [12].
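The smart labelling procedure of Section 2.2 folds directly into the fitness function; it can be sketched as follows (a Python sketch with our own names, assuming training examples are (symbol-sequence, class) pairs):

```python
def smart_fitness(delta, training_set, n, n_classes=2):
    """Choose labels o_i = argmax_c h[i][c] (Equation 3) and return
    (labels, fitness), where fitness is the fraction of the training
    set classified correctly under those optimal labels."""
    h = [[0] * n_classes for _ in range(n)]
    for string, cls in training_set:
        state = 0
        for symbol in string:          # f(t): final state for string t
            state = delta[state][symbol]
        h[state][cls] += 1
    labels = [max(range(n_classes), key=lambda c: h[i][c]) for i in range(n)]
    correct = sum(max(h[i]) for i in range(n))
    return labels, correct / len(training_set)
```

Summing max(h[i]) over the states gives the number of correctly classified strings in the same pass that computes the labels, which is why the optimal labelling is essentially free.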
The random hill-climber takes a random solution, then each time around the loop produces a mutated version which it evaluates. If the mutated version has fitness better than or equal to the current solution, it is accepted; otherwise it is rejected. Using our DFA representation no mutated copies are made. Instead, the current solution is modified in-place, and the change is reverted if it causes a decrease in performance. The algorithm also notes whether any improvement took place, and keeps a count of the number of steps taken without an improvement in fitness. If the number of evaluations since restarting exceeds a parameter we call noImprovementLimit, then the current solution is recorded and the hill-climber is restarted. This procedure is run until a perfect score on the training set is achieved, or until the number of allowed fitness evaluations maxEvals is reached. The best solution from all restarts is then returned as the final result.

3.1 Fitness Function

We use a simple measure of fitness: the proportion of strings in the training set that are classified correctly. Hence, a DFA that classifies every string incorrectly will have a fitness of zero, and one that classifies every string correctly will have a fitness of one. On balanced datasets (where the number of strings in each class is approximately equal) we expect randomly constructed DFAs to have a mean fitness of 0.5.

3.2 Fitness Evaluation Efficiency

Note that one of the benefits of evolving DFAs directly in this way, as compared with evolving a neural network to act as a DFA, is that the time taken for each fitness evaluation is much reduced. Significantly, the cost of fitness evaluation is independent of the number of states in the DFA, and depends only linearly on the sum of the lengths of the sequences in the training set.

4 Evidence Driven State Merging Algorithm

The Evidence Driven State Merging (EDSM) algorithm emerged from the Abbadingo One competition [9], held in 1997.
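The multi-start hill-climber of Section 3 might be sketched as below (a self-contained Python sketch, not the authors' Java implementation; the fitness helper from Section 2.2 is repeated so the block runs on its own, and the default noImprovementLimit follows the 2n² neighbourhood-size choice reported in Section 6.1):

```python
import random

def smart_fitness(delta, training_set, n, n_classes=2):
    # Optimal-label fitness (Section 2.2): fraction of training strings correct.
    h = [[0] * n_classes for _ in range(n)]
    for string, cls in training_set:
        state = 0
        for symbol in string:
            state = delta[state][symbol]
        h[state][cls] += 1
    return sum(max(row) for row in h) / len(training_set)

def hill_climb(n, training_set, max_evals=20000, no_improvement_limit=None, seed=0):
    """Multi-start (1+1) hill climber over the transition matrix only."""
    rng = random.Random(seed)
    if no_improvement_limit is None:
        no_improvement_limit = 2 * n * n   # neighbourhood size
    def restart():
        return [[rng.randrange(n) for _ in range(2)] for _ in range(n)]
    delta = restart()
    fit = smart_fitness(delta, training_set, n)
    best, best_fit, stale = [row[:] for row in delta], fit, 0
    for _ in range(max_evals):
        if best_fit == 1.0:                # perfect training score: stop
            break
        s, a = rng.randrange(n), rng.randrange(2)
        old = delta[s][a]
        delta[s][a] = rng.randrange(n)     # mutate in place
        new_fit = smart_fitness(delta, training_set, n)
        if new_fit >= fit:                 # accept equal-or-better
            stale = stale + 1 if new_fit == fit else 0
            fit = new_fit
        else:
            delta[s][a] = old              # revert the mutation
            stale += 1
        if fit > best_fit:
            best, best_fit = [row[:] for row in delta], fit
        if stale > no_improvement_limit:   # record best so far and restart
            delta, stale = restart(), 0
            fit = smart_fitness(delta, training_set, n)
    return best, best_fit
```

On a toy task such as "strings of even length", a two-state machine reaches training fitness 1.0 within a handful of restarts.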
The intention of this competition was to stimulate both theoreticians and experimentalists to try out their ideas on significant DFA learning problems. Among its strengths were that a number of large unseen problems were made available for solution, and the possibility of test-set tuning was avoided by providing only 1 bit of information to an entrant algorithm, i.e. success or failure (at 99% classification accuracy). A DFA can be simply inferred by constructing a prefix tree equivalent to the training data and progressively merging states [18]. For the Abbadingo One competition the nominal size of each target DFA is known. If the initial prefix tree is larger than the target size then the operation of merging states is presumed to lead towards the target DFA. In general many valid merges will be possible at any step. A valid merge preserves the consistency of the DFA with the training data and normally produces a new DFA which is a generalization of the previous one. The new DFA is said to be correct if it is consistent with the target machine. However, without exhaustive training data, merges amount to guesses about the behaviour of the target machine on unseen data. These guesses can turn out to be wrong. In particular it is important to guess correctly in the early stages, or recovery is impossible. A plausible strategy is to search through the space of possible merge choices. This would find all DFAs accessible from the initial prefix tree. In practice the space is too large for a complete search, though a beam search can be successful [7] in some cases. A beam search is still expensive, however, and failed to solve the more difficult problems in the Abbadingo competition. A more successful approach was to focus on minimizing the risk of making wrong guesses by weighing heuristic evidence and choosing the merges which are most likely to lead to the target machine.
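The prefix-tree construction and the evidence-counting idea can be illustrated with the following sketch. This is a deliberate simplification of EDSM, not the algorithm itself: no states are actually merged and no determinization is performed; the score merely counts the pairs of identically labelled states that a merge of p and q would fold together, returning None when the fold would contradict the training data.

```python
def prefix_tree(training_set):
    """Prefix tree acceptor: trans[node] maps symbol -> child node,
    label[node] is the string class, or None for unlabelled nodes."""
    trans, label = [{}], [None]
    for string, cls in training_set:
        node = 0
        for sym in string:
            if sym not in trans[node]:
                trans.append({})
                label.append(None)
                trans[node][sym] = len(trans) - 1
            node = trans[node][sym]
        label[node] = cls
    return trans, label

def evidence_score(trans, label, p, q):
    """Score a candidate merge of states p and q by counting pairs of
    labelled states folded together; None means the merge is invalid."""
    score, stack, seen = 0, [(p, q)], set()
    while stack:
        a, b = stack.pop()
        if (a, b) in seen:
            continue
        seen.add((a, b))
        if label[a] is not None and label[b] is not None:
            if label[a] != label[b]:
                return None            # contradicts the training data
            score += 1
        for sym, child in trans[a].items():
            if sym in trans[b]:        # children must merge too
                stack.append((child, trans[b][sym]))
    return score
```

A greedy EDSM-style driver would score all candidate pairs, perform the highest-scoring valid merge, and repeat.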
The EDSM algorithm scores a possible merge by counting the number of tests that the merge passes while being validated. If the tests are assumed independent and have an equal probability of success, then the merge which passes the most tests has the highest probability of being correct. A further improvement is obtained by considering all possible pairs of states as candidates for merging. In general this requires the recursive consideration of the children of the candidate pair. Each potential merge generates a partition of the DFA states into equivalence classes, and a total score can be computed for the multiple merge implied. When the best score is found all states can be merged in one step. In practice EDSM can be sped up by eliminating states from consideration for merging if they are more than a certain depth from the initial state [9]. This does not detract significantly from its performance. Potential merges may tie for first place. In our implementation we search through the possible merges (p, q) in upper triangular matrix raster scan order [(0, 1), (0, 2), . . . , (1, 2), (1, 3), . . .] and select the first of any tying merges that we come across. We tested our implementation of EDSM by verifying that it obtained similar results to those reported in [9]. Note that the EDSM algorithm has been further refined by Cicchello and Kremer. They analyzed the EDSM heuristic and found it to be very good in the later stages of convergence to the correct target DFA, but less good at the initial choices of merge. In fact they found that a 27% improvement could be made over EDSM by searching through the choices made in the first 5 merges. We would expect to improve the EDSM results reported below by adding search to the algorithm, but at the expense of speed.

5 Experimental Setup

Each method under test (EDSM and EA) is seen as a black box that takes as input a set of labelled strings, and produces as output a DFA: its best guess at the underlying DFA.
This is the true interface for EDSM, but for the EA there is another input: the maximum allowed size of DFA. In this sense, the algorithms are not truly comparable, since the EA requires more guidance than EDSM. This can be seen as both an advantage and a disadvantage. In cases where we know the size of the target, being able to inform the learner of this gives it an advantage. In general, though, this is usually more of a hindrance, and the learner that figures out the size of the target DFA for itself would normally be preferred. Note that all evolutionary methods suffer from this drawback, and either choose a fixed size that is expected to be sufficient, or, in the case of Genetic Programming methods, usually adopt some method of controlling the size distribution of the population; see [15] for recent ideas on this. We also compared our system with the Genetic Programming method of Luke et al. [11]. In this approach each DFA within a population is represented by a genome which contains an unbounded number of genes. Each gene represents a state of the DFA. Further attributes of each gene, called a chemical template, are used to represent gene regulation which controls state transitions when the DFA is exercised. It should be noted that we are evolving a fixed-size structure (though we could easily modify this by allowing insertion and deletion of states), so the comparison with variable-size methods such as [11] could be considered unfair. More work would be needed to properly analyze this, though Luke et al. do not give the size distribution of their initial population. To enable direct comparison of results with [11] we used the average of the per-class accuracies as the measure of test-set accuracy. This measure is preferred when the languages are unbalanced, e.g. when there are more strings in the language than not in the language, which is the case for the Tomita languages.
6 Results

6.1 Tomita Languages

A common benchmark for learning DFAs is the Tomita suite of 7 target DFAs first defined in [17]. We used the training set specified in [11]. We note that the description of Tomita-3 and the regular expression given in [11] disagree; we chose to use the description and re-classified two of the training strings as a consequence. The results reported in that paper are among the best known for an evolution-based system on the Tomita languages, so we have included them for comparison with our system. We first studied the number of function evaluations needed to find a DFA consistent with the training set, comparing a standard random hill-climbing approach (Plain) with our optimal state labelling method (Smart). For the standard random hill-climber, we fixed the number of states to be 10. We report results from two versions of our Smart method. Smart shows results where we also fixed the maximum number of states to be 10, whereas for nSmart we set the number of states to be exactly the number of states in the minimal DFA consistent with the training set. Table 1 summarizes these results, showing the average number of fitness evaluations needed by each system, together with the Genetic Programming (GP) system of Luke et al. [11]. Note that in all cases the simple random hill-climber requires far fewer fitness evaluations than the GP method, and that in all cases apart from language 1, the Smart version requires far fewer than the Plain one. For these experiments we set maxEvals to 100,000 and noImprovementLimit to 2n², the latter based on the size of the neighbourhood. Interestingly, when we fix the number of states for each problem (nSmart) the average number of evaluations may significantly increase, as observed on problems 3, 4, 5 and 7. This may seem counter-intuitive, since on the face of it we've reduced the size of the search space.
However, searching for the minimal DFA is a harder problem than searching for some larger consistent DFA that may have some slack in it.

Tomita No.   n   Plain   Smart   nSmart      GP
    1        2     107      25       15      30
    2        3     186      37       40    1010
    3        5    1809     237      833   12450
    4        4    1453     177      654    7870
    5        4    1059     195      734   13670
    6        3     734      93       82    2580
    7        5    1243     188     1377   11320

Table 1. Number of states in the minimal DFA for each Tomita language, and average number of fitness evaluations required by each system to learn the training set.

Table 2 summarizes the generalization accuracy of our EAs, EDSM and GP on this learning task. We ran the EAs with 20 different random seeds for each target. We ran EDSM only once for each language because our EDSM is deterministic for a given training set. The performance over the 7 Tomita targets indicates that the EA has the best performance when we tell it the number of target states (nSmart), but if we omit this information and just choose some arbitrary small value for n, then the test-set accuracy is much poorer. However, we really need a bigger test, so our next section will deal with larger, randomly generated DFAs. In every single run nSmart found a minimal DFA consistent with the training data. Therefore, cases where the average test-set accuracy of nSmart is less than 100% indicate that the problem is under-specified, and that there exist many distinct minimal DFA that are consistent with the training set. In these cases getting the correct DFA is down to chance. To demonstrate this point with a specific example, consider two examples of three-state DFAs learned by the EA for language 2 (Figures 1 and 2; start state outlined in bold). The DFA in Figure 1 scores 100% accuracy on the test set, while the one shown in Figure 2 scores only 83%.

Tomita No.   Smart   nSmart   EDSM     GP
    1         81.8   100      52.4   88.4
    2         88.8    95.5    91.8   84.0
    3         71.8    90.8    86.1   66.3
    4         61.1   100     100     65.3
    5         65.9   100     100     68.7
    6         61.9   100     100     95.9
    7         62.6    82.9    71.9   67.7

Table 2.
Average test-set accuracy on the Tomita languages for various methods.

Figure 1. An evolved minimal DFA with perfect generalization for language 2.

This latter DFA is more frequently produced by the EA. This implies that the search landscape induced by our operators and the training set just happens to make the EA more likely to find the incorrect DFA. From a general DFA learning perspective, however, both are minimal consistent DFAs and should really be considered as equally good solutions, given the training set.

6.2 Run-times

Table 3 shows the average elapsed time to learn the Tomita training sets. All timings are based on Java implementations running on a 2.4 GHz Pentium. Note that the smart hill-climber significantly outperforms the plain hill-climber. This is partially explained by the smaller number of fitness evaluations required (see Table 1), and partially by the reduced book-keeping, since in the smart hill-climber we replace the copy/mutate operation with an in-place mutation.

Figure 2. An evolved minimal DFA with 83% test set score for language 2.

Algorithm   t (ms)
EDSM          37
Plain         33
Smart          1.6

Table 3. Average elapsed time in milliseconds to learn the Tomita languages.

6.3 Random Target DFAs

We followed the Abbadingo One [9] style of DFA and dataset generation. To generate a random DFA of nominal size n, we do the following:

1. Generate a random degree-2 digraph with 5n/4 nodes (= states).
2. Choose an initial state randomly.
3. Find all states reachable from the initial state.
4. Label all states with a toss of a fair coin.

DFAs generated this way have a size centred near n and a depth centred near 2 log2(n) − 2. We follow Abbadingo in generating DFAs until we find one that has a depth of exactly 2 log2(n) − 2. This does not really suit our EA, which would rather have a target with an exactly known number of states; however, we retain comparability.
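The four generation steps above can be sketched as follows (a Python sketch under our own naming; regenerating until the depth equals 2 log2(n) − 2 is left to the caller):

```python
import random
from collections import deque

def random_target_dfa(n, seed=0):
    """Abbadingo-style target: random degree-2 digraph on 5n/4 nodes,
    trimmed to the states reachable from a random start state, with
    states labelled by fair coin flips."""
    rng = random.Random(seed)
    m = 5 * n // 4
    delta = [[rng.randrange(m) for _ in range(2)] for _ in range(m)]
    start = rng.randrange(m)
    # Breadth-first search for reachable states; dist also gives the depth.
    dist = {start: 0}
    queue = deque([start])
    while queue:
        s = queue.popleft()
        for a in (0, 1):
            t = delta[s][a]
            if t not in dist:
                dist[t] = dist[s] + 1
                queue.append(t)
    # Renumber reachable states so that the start state becomes state 0.
    order = sorted(dist, key=lambda s: dist[s])
    index = {s: i for i, s in enumerate(order)}
    trimmed = [[index[delta[s][a]] for a in (0, 1)] for s in order]
    labels = [rng.randrange(2) for _ in order]
    return trimmed, labels, max(dist.values())
```

The size of the trimmed machine is centred near n, as the text notes, because a fraction of the 5n/4 states is typically unreachable.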
For each target DFA we generate a training set by drawing, without replacement, from the set of 16n² − 1 possible input strings of length 0 . . . (2 log2(n) + 3) inclusive. The number of training examples is varied from a density of 0.01 to 0.20. Density is defined as the proportion of the total number of possible input strings. The test set is the set of remaining input strings. EDSM is run once, where each merge is the best indicated by the heuristic, i.e. no search is performed. For the EA we set maxEvals to 1,000,000 and noImprovementLimit to 10,000. The size of DFA that each EA is set to generate is 5n/4, where n is the nominal size. This number is a compromise: it means that the EA is set to run with more than enough states to solve the problem, though of course it may choose not to use some of the states. For each nominal size n rising from 4 to 16 in powers of 2, and each density, we generated 100 different random DFA targets along with 100 different random training sets. We drop the repetitions to 10 for nominal size 32, simply because of the long time each experiment takes. Each of the figures below shows graphs of the average performance of both EDSM and the EA as they vary against training set density. Error bars show the standard errors calculated from the data taken. Several points emerge from the results:

- Both algorithms perform very poorly on extremely sparse data, i.e. their performance on unseen test data is close to the random expectation of 0.5. A possible exception to this is that the EA is better on extremely sparse data for the smallest target DFAs (nominal size 4 states).
- For targets in the nominal range (n = 4 . . . 16), and in a middle range of sparsity, the EA performs significantly better. This better performance is lost as we scale to larger targets (n = 32), where EDSM is better.
- As more training data is made available, both algorithms approach 100% performance on the test set, i.e. the target DFA is successfully learnt.
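The train/test split described at the start of this subsection can be sketched as follows (a Python sketch with our own names; note that the string lengths 0 . . . 2 log2(n) + 3 give exactly 2^(2 log2(n)+4) − 1 = 16n² − 1 strings):

```python
import math
import random
from itertools import product

def all_strings(n):
    """All binary strings of length 0 .. 2*log2(n) + 3 (16n^2 - 1 of them)."""
    max_len = int(2 * math.log2(n)) + 3
    strings = [()]
    for length in range(1, max_len + 1):
        strings.extend(product((0, 1), repeat=length))
    return strings

def split_data(target_delta, target_labels, n, density, seed=0):
    """Draw a training set without replacement at the given density;
    the remaining strings become the test set."""
    rng = random.Random(seed)
    def classify(s):
        state = 0
        for sym in s:
            state = target_delta[state][sym]
        return target_labels[state]
    pool = all_strings(n)
    rng.shuffle(pool)
    k = round(density * len(pool))
    train = [(s, classify(s)) for s in pool[:k]]
    test = [(s, classify(s)) for s in pool[k:]]
    return train, test
```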
Averages and error bars do not tell the whole story. It is also of interest how often both algorithms can be said to actually find the target successfully. The Abbadingo One competition chose to use a performance of better than 0.99 on an unseen test set as an indication that the target has been found. Table 4 shows how often the two algorithms succeed by this success measure. We pick a density of 0.08 because it represents a point of contrast between the two algorithms. Note that in the case of n = 8 the EA is twice as likely to be successful as EDSM. This superiority diminishes for n = 16, and is reversed for n = 32. The algorithms fail for opposite reasons. When the EA fails, it is because it has failed to learn the training data. On the other hand, EDSM always returns a DFA that is consistent with the training data, but often fails to generalize. When it does fail, the DFA it constructs is usually significantly larger than the target DFA.

Figure 3. EDSM versus smart hill climber (target n = 4).

Figure 4. EDSM versus smart hill climber (target n = 8).

Figure 5. EDSM versus smart hill climber (target n = 16).

Target Size   Smart   EDSM   nRuns
     4          27      24     100
     8          35      18     100
    16          41      34     100
    32           3       7      10

Table 4. Number of successful runs on sparse data with respect to nominal target size.

Figure 6. EDSM versus smart hill climber (target n = 32).

7 Discussion

We have taken a very simple approach to evolving DFAs. The method not only appears to outperform other evolutionary approaches (e.g. [11, 1]), but also outperforms the powerful heuristic method EDSM on a certain class of problem. It would be interesting to compare with Sage [7], but we've not yet had a chance to do this.
7.1 Can we be smarter?

So far we have observed that applying a smart state labelling scheme makes the job of evolving a DFA much simpler than attempting to simultaneously evolve both the transition matrix and the state labels. It is therefore natural to look for other ways in which we may improve the performance of the random hill-climber. Some investigation shows that there is indeed such a method, but we have not yet implemented or evaluated it. The method is based on the observation that we can assign credit to the transitions of the DFA based on the input strings they are involved in processing. Suppose we keep a count, for each transition, of the number of correctly recognized strings it is involved in, and also a count of the number of incorrectly recognized strings it is involved in. Since we know the total number of strings in the training set, this allows us to calculate a measure of fitness for each transition as the proportion of total strings that the transition is involved in labelling correctly. Modifying a transition can only affect the score for the strings that it is involved in classifying. If these are all incorrectly classified strings, then the modification will either maintain or increase the overall score. In particular, some transitions are unreachable, and are not involved in classifying any strings; it would seem futile to spend time modifying these. Note, however, that this is a measure that must be recalculated for each fitness evaluation, since a previously unused transition can become highly used as a result of modifying a different transition in the matrix. Hence, a modified sampling procedure should improve the performance of the hill-climber by avoiding the waste of time involved in making and evaluating futile modifications.

7.2 Trick or General Principle?

The results demonstrate a significant improvement using the smart labelling method compared to evolving the entire DFA.
A question that naturally arises is whether we should see this as a neat trick that improves performance on the problem of DFA induction, or as a more general principle that can be applied elsewhere. One immediate possibility would be to apply the same principle to evolving Finite State Transducers (FSTs). Lucas [10] obtained FSTs by evolving both the state transition matrix and the output matrix. It would be interesting to investigate evolving only the transition matrix for the FST, leaving the entries of the output matrix to be free variables whose values are chosen to optimize the transduction score on the training set. While this is a little more complex than the procedure for optimally assigning state labels for the DFA, and depends on the fitness function used, it should still be possible to formulate an efficient method for doing this.

8 Conclusions

In this paper we presented a new scheme for evolving DFA. The method is simple: use a multi-start random hill-climber to optimize the transition matrix of the DFA, and use a smart state labelling scheme to optimally choose the state labels, given the transition matrix and the training set. We evaluated the scheme on two types of data: the Tomita languages, and randomly constructed target DFAs with randomly constructed training samples of varying density. On the Tomita languages we find that our system learns a small DFA consistent with the training set typically in many fewer fitness evaluations than previous evolutionary methods. We argue that whether or not the generalization is better than that of other methods is a dubious question to ask. Faced with a number of distinct minimal DFA that are consistent with the training set, picking one with perfect generalization is more of a lottery than a scientific process. The average time taken to learn a Tomita language with our method is 1.6 ms, which compares very favourably with other methods.
On the (Abbadingo-style) random DFA problems, we find that our evolutionary method outperforms the well-known heuristic method EDSM when the target DFA are small and the training sample is sparse. For larger machines with 32 states, our evolutionary method fails and EDSM then clearly outperforms it. We are currently investigating ways of making our evolutionary approach perform better on these larger problems.

Acknowledgements

The authors would like to thank the members of the Natural and Evolutionary Computation group at the University of Essex, UK, for helpful comments and discussion.

References

[1] P. J. Angeline, G. M. Saunders, and J. P. Pollack. An evolutionary algorithm that constructs recurrent neural networks. IEEE Transactions on Neural Networks, 5(1):54-65, January 1994.
[2] H.-G. Beyer. Toward a theory of evolution strategies: The (μ, λ)-theory. Evolutionary Computation, 2(4):381-407, 1994.
[3] O. Cicchello and S. C. Kremer. Beyond EDSM. Lecture Notes in Computer Science, 2484:37-48, 2002.
[4] P. Dupont. Regular grammatical inference from positive and negative samples by genetic search: The GIG method. In R. C. Carrasco and J. Oncina, editors, Grammatical Inference and Applications (ICGI-94), pages 236-245. Springer, Berlin, Heidelberg, 1994.
[5] P. Dupont, L. Miclet, and E. Vidal. What is the search space of the regular inference? In R. C. Carrasco and J. Oncina, editors, Grammatical Inference and Applications (ICGI-94), pages 25-37. Springer, Berlin, Heidelberg, 1994.
[6] C. Giles, G. Sun, H. Chen, Y. Lee, and D. Chen. Higher order recurrent neural networks and grammatical inference. In D. Touretzky, editor, Advances in Neural Information Processing Systems 2. Morgan Kaufmann, San Mateo, CA, 1990.
[7] H. Juille and J. B. Pollack. A sampling-based heuristic for tree search applied to grammar induction. In Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI-98), Madison, Wisconsin, USA, 1998. AAAI Press Books.
[8] M. Kearns and L. G. Valiant. Cryptographic limitations on learning Boolean formulae and finite automata. In Proceedings of the ACM Symposium on Theory of Computing (STOC-89), pages 433-444, 1989.
[9] K. J. Lang, B. A. Pearlmutter, and R. A. Price. Results of the Abbadingo One DFA learning competition and a new evidence-driven state merging algorithm. Lecture Notes in Computer Science, 1433:1-12, 1998.
[10] S. M. Lucas. Evolving finite state transducers: Some initial explorations. In Proceedings of the 6th European Conference on Genetic Programming, pages 130-141, 2003.
[11] S. Luke, S. Hamahashi, and H. Kitano. Genetic Programming. In W. Banzhaf et al., editors, GECCO-99: Proceedings of the Genetic and Evolutionary Computation Conference. Morgan Kaufmann, 1999.
[12] M. Mitchell, J. Holland, and S. Forrest. When will a genetic algorithm outperform hill climbing? In J. Cowan, G. Tesauro, and J. Alspector, editors, Advances in Neural Information Processing Systems 6, pages 51-58. Morgan Kaufmann, San Mateo, CA, 1994.
[13] A. L. Oliveira and J. P. M. Silva. Efficient search techniques for the inference of minimum size finite automata. In String Processing and Information Retrieval, pages 81-89, 1998.
[14] L. Pitt and M. Warmuth. The minimum consistent DFA problem cannot be approximated within any polynomial. Journal of the ACM, 40(1), 1993.
[15] R. Poli. A simple but theoretically-motivated method to control bloat in genetic programming. In Proceedings of the 6th European Conference on Genetic Programming, pages 204-217, 2003.
[16] J. D. Schaffer, D. Whitley, and L. J. Eshelman. Combinations of genetic algorithms and neural networks: A survey of the state of the art. In D. Whitley and J. D. Schaffer, editors, COGANN-92, International Workshop on Combinations of Genetic Algorithms and Neural Networks, pages 1-37. IEEE Computer Society, 1992.
[17] M. Tomita. Dynamic construction of finite automata from examples using hill climbing. In Proc.
of the 4th Annual Cognitive Science Conference, USA, pages 105-108, 1982.
[18] B. A. Trakhtenbrot and Y. M. Barzdin. Finite Automata. North-Holland, Amsterdam, 1973.
[19] R. Watrous and G. Kuhn. Induction of finite-state automata using second-order recurrent networks. In J. Moody, S. Hanson, and R. Lippmann, editors, Advances in Neural Information Processing Systems 4, pages 309-316. Morgan Kaufmann, San Mateo, CA, 1992.