C O M P U T A T I O N A L C O M P L E X I T Y IN T W O - L E V E L MORPHOLOGY
G. Edward Barton, Jr. M.I.T. Artificial Intelligence Laboratory 545 Technology Square Cambridge, MA 02139
Morphological analysis must take into account the spelling-change processes of a language as well as its possible configurations of stems, affixes, and inflectional markings. The computational difficultyof the task can be clarified by investigating specific models of morphological processing. The use of finite-state machinery in the "twolevel" model by K i m m o Koskenniemi gives it the appearance of computational efficiency, but closer examination shows the model does not guarantee efficient processing. Reductions of the satisfiability problem show that finding the proper lexical/surface correspondence in a two-level generation or recognition problem can be computationally difficult. The difficulty increases if unrestricted deletions (null characters) are allowed.
difficulty of the problems does not seem to be a precompilation effect. In addition to knowing the stems, affixes, and cooccurrence restrictions of a language, a successful morphological analyzer must take into account the spelling-change processes that often accompany affixation. In English, the program must expect love+ing to appear as loving, fly+s as flies, lie+ing as lying, and big+er as bigger. Its knowledge must be sufficiently sophisticated to distinguish such surface forms as hopped and hoped. Crosslinguistically, spelllng-change processes m a y span either a limited or a more extended range of characters, and the material that triggers a change m a y occur either before or after the character that is affected. (Reduplication, a complex copying process that m a y also be found, will not be considered here.)
The K i m m o system described by Karttunen (1983} is attractive for putting morphological knowledge to use in processing. K i m m o is an implementation of the "two-level" model of morphology that K i m m o Koskenniemi proposed and developed in his Ph.D. thesis.I A system of lexicons in the dictionary component regulates the sequence of roots and affixes at the lexical level, while several finite-state transducers in the automaton component -- ~ 20 transducers for Finnish, for instance -- mediate the correspondence between lexical and surface forms. Null characters allow the automata to handle insertion and deletion processes. The overall system can be used either for generation or for recognition. The finite-state transducers of the automaton component serve to implement spelling changes, which m a y be triggered by either left or right context and which m a y ignore irrelevant intervening characters. As an example, the following automaton describes a simplified "Y-change" process that changes y to i before suffix es: IUniversity of Helsinki,Finland, circaFall 1983.
The "dictionary lookup" stage in a natural-language system can involve much more than simple retrieval. Inflectional endings, prefixes, suffixes, spelling-change processes, reduplication, non-concatenative morphology, and clitics may cause familiar words to show up in heavily disguised form, requiring substantial morphological analysis. Superficially, it seems that word recognition might potentially be complicated and difficult. This paper examines the question more formally by investigating the computational characteristics of the "twolevel" model of morphological processes. Given the kinds of constraints that can be encoded in two-level systems, how difficult could it be to translate between lexical and surface forms? Although the use of finite-state machinery in the two-level model gives it the appearance of computational efficiency, the model itself does not guarantee efficient processing. Taking the Kimmo system (Karttunen, 1983) for concreteness, it will be shown that the general problem of mapping between ]exical and surface forms in two-level systems is computationally difficult in the worst case; extensive backtracking is possible. If null characters are excluded, the generation and recognition problems are NP-complete in the worst case. If null characters are completely unrestricted, the problems is PSPACEcomplete, thus probably even harder. The fundamental
"Y-Change" 5 5
y state state state state state 1: 2. 3. 4: S: i 2 0 0 2 2 y y 4 0 0 4 4 * s = = 1 0 0 1 1 (lexicalcharacters) (surface characters) ( n o r m a l state) (require * s ) (require s ) (forbid+s) (forbids) = s 1 1 3 0 0 1 8 1 1 0
Examined in detail, the runtime complexity of Kimmo processing can be traced to three main sources. The recognizer and generator must both run the finite-state machines of the automaton component; in addition, the recognizer must descend the letter trees that make up a lexicon. The recognizer must also decide which suffix lexicon to explore at the end of an entry. Finally, both the recognizer and the generator must discover the correct lexical-surface correspondence. All these aspects of runtime processing are apparent in traces of implemented Kimmo recognition, for instance when the recognizer analyzes the English surface form s p i e l (in 61 steps) according to K a r t t u n e n and Wittenburg's (1983) analysis (Figure 1). The stepping of transducers and letter-trees is ubiquitous. The search for the lexical-surface correspondence is also clearly displayed; for example, before backtracking to discover the correct lexical entry s p i e l , the recognizer considers the lexical string spy+ with y surfacing as i and + as e. Finally, after finding the putative root spy the recognizer must decide whether to search the lexicon I that contains the zero verbal ending of the present indicative, the lexicon AG storing the agentive suffix *er, or one of several other lexicons inhabited by inflectional endings such as +ed. The finite-state framework makes it easy to step the automata; the letter-trees are likewise computationally well-behaved. It is more troublesome to navigate through the lexicons of the dictionary component, and the current implementation spends considerable time wandering about. However, changing the implementation of the dictionary component can sharply reduce this source of complexity; a merged dictionary with bit-vectors reduces the number of choices among alternative lexicons by allowing several to be searched at once (Barton, 1985). More ominous with respect to worst-case behavior is the backtracking that results from local ambiguity in the construction of the lexical-surface correspondence. Even if only one possibility is globally compatible with the constraints imposed by the lexicon and the automata, there may not be enough evidence at every point in processing to choose the correct lexical-surface pair. Search behavior results. In English examples, misguided search subtrees are necessarily shallow because the relevant spelling-change processes are local in character. Since long-distance harmony processes are also possible, there can potentially be a long interval before the acceptability of a lexical-surfaee pair is ultimately determined. For instance, when vowel alternations within a verb stem are conditioned by the occurrence of particular tense suffixes, the recognizer must sometimes see the end of the word before making final decisions about the stem.
The details of this notation will not be explained here; basic familiarity with the Kimmo system is assumed. For further introduction, see Barton (1985), K a r t t u n e n (1983), and references cited therein.
At first glance, the finite-state machines of the twolevel model appear to promise unfailing computational efficiency. Both recognition and generation are built on the simple process of stepping the machines through the input. Lexical lookup is also fast, interleaved character by character with the quick left-to-right steps of the automata. The fundamental efficiency of finite-state machines promises to make the speed of Kimmo processing largely independent of the nature of the constraints that the automata encode: The most important technical feature of Koskenniemi's and our implementation of the Two-level model is that morphological rules are represented in the processor as automata, more specifically, as finite state transducers . . . . One important consequence of compiling [the grammar rules into automata] is that the complexity of the linguistic description of a language has no significant effect on the speed at which the forms of that language can be recognized or generated. This is due to the fact that finite state machines are very fast to operate because of their simplicity . . . . Although Finnish, for example, is morphologically a much more complicated language than English, there is no difference of the same magnitude in the processing times for the two languages . . . . [This fact] has some psycholinguistie interest because of the common sense observation that we talk about "simple" and "complex" languages b u t not about "fast" and "slow" ones. (Karttunen, 1983:166f) For this kind of interest in the model to be sustained, it must be the model itself that wipes out processing difficulty, rather than some accidental property of the encoded morphological constraints.
Recognizing surface f o r m " s p i e l " . 1 s 184.108.40.206.1.1 2 sp 220.127.116.11.1.1 3 spy 18.104.22.168.1.1 4 "spy" ends, new lelXlCOn N 5 "0" ends. new l e x i c o n C1 6 spy XXX e x t r a input 7 (5) spy+ 22.214.171.124.1.1 8 spy+ XXX 9 (5) spy + 126.96.36.199.1.1 10 spy+ XXX 11 (4) "spy" ends, new l e x t c o n 1 12 spy XXX e x t r a t n p u t 13 (4) "spy" ends, new l e x i c o n P3 14 spy+ 188.8.131.52.1.1 15 spy+ XXX 16 (14) spy+ 1,184.108.40.206.1 17 spy+ XXX 18 (4) "spy" ends, new l e x t c o n PS 19 spy+ 220.127.116.11.1.1 20 spy+e 18.104.22.168.4.1 Zl spy+e XXX 22 (20) spy÷e 22.214.171.124.3.1 23 spy+e XXX 24 (19) spy+ 126.96.36.199.1.1 25 spy+e XXX Epenthesls 26 (4) "spy" ends, new l e x i c o n PP 27 spy+ 188.8.131.52.1.1 28 spy+e 184.108.40.206.4.1 zg spy+e XXX 30 (28) spy+e 220.127.116.11.3.1 31 spy+e XXX 32 (27) spy+ 18.104.22.168.1.1 33 spy+e XXX Epenthests 34 (4) "spy" ends. new l e x i c o n PR 35 spy+ 22.214.171.124,1,1 36 spy+ XXX 37 (38) spy+ 126.96.36.199.1.1 38 spy+ XXX 39 (4) "spy" ends. new l e x t c o n AG 40 spy+ 188.8.131.52.1. I 41 spy+e 184.108.40.206.4.1 42 spy+e XXX 43 (41) spy+e 1.1.4,1.3,1 44 spy+e XXX 45 (40) spy+ 220.127.116.11.1.1 46 spy+e XXX Epenthests 47 (4) "spy" ends. new l e x t c o n AB 48 spy+ 1,18.104.22.168.1 49 spy+ XXX 50 (48) Spy+ 1,22.214.171.124.1 51 spy÷ XXX 52 (3) spt 126.96.36.199.2.8 53 spte 188.8.131.52.6.1
LLL+]H+ LLL÷---+XXX÷ -~-+XXX+ LLL+---+-*-+XXX+
---+---+XXX+ ---+---+LLL+LLL+**-÷ ---+XXX+
Key to t r e e nodes: --LLL AAA XXX III *'* normal t r e v e r s a l new lexicon blocking by automata no lexlcal-surface p a i r s compatible with surface char and dictionary blocking by leftover input a n a l y s i s found
58 (53) sple 56 spiel 57 " s p i e l " ends. 58 "0" ends. 59 "spiel" 60 (58) s p i e l + 61 spiel+ (("spiel" (N SG)))
XXX 184.108.40.206.5.6 220.127.116.11, I. I
new l e x t c o n N new l e x i c o n Cl *** r e s u l t
F i g u r e ]: These traces show the steps that the KIMMOrecognizer for English goes through while analyzing the surface form s p i e l . Each llne of the table oil the left shows the le]dcal string and automaton states at the end of a step. If some autoz,mton blocked, the automaton states axe replaced by ~ , XXI entry. An XXX entry with no autonmto,, n:une indicates that the ]exical string could not bc extended becau,~e the surface c],aracter .'tnd h,xical letter tree together ruh'd out ,-dl feasible p,'drs. After xn XXX or *** entry, the recognizer backtracks and picks up from a previous choice point. indicated by the paxenthesized step l*lU,zl)er before the lexical .~tring. The tree Ol, the right depicts the search graphically, reading from left to right and top t . ])ottoln with vertir;d b;trs linking the choices at each choice p o i n t The flhntres were generated witl, a ](IM M() hnplen*entation written i , an ;I.llgll*t,llter| version of MACI,ISI'I,t,sed initiMly on Kltrttllnel*',,? (1983:182ff) ;dgorithni description; the diction;n'y m . l antomaton contpouents for E,glish were taken front 1.;artt,ne, and Wittenlmrg (1983) with minor ('llikllgCS. This iJz*ple*l*Vl*tatio*)se;u'?h(.s del.th-tlr,~t a,s K m t t u , e n ' s does, but explores the Mternatives at a giwm depth in a different order from Karttttnen's.
Ignoring the problem of choosing among alternative lexicons, it is easy to see that the use of finite-state machinery helps control only one of the two remaining sources of complexity. Stepping the automata should be fast, but the finite-state framework does not guarantee speed in the task of guessing the correct lexical-surface correspondence. The search required to find the correspondence may predominate. In fact, the Kimmo recognition and generation problems bear an uncomfortable resemblance to problems in the computational class NP. Informally, problems in NP have solutions that may be hard to guess but are easy to verify - - just the situation that might hold in the discovery of a Kimmo lexical-surface correspondence, since the automata can verify an acceptable correspondence quickly but may need search to discover one.
The notation is unambiguous without parentheses because is required to be in CNF. Second, construct a Kimmo automaton component A in three parts. (A varies from formula to formula only when the formulas involve different sets of variables.) The alphabet specification should list the variables in a together with the special characters T, F, minus sign, and comma; the equals sign should be declared as the Kimmo wildcard character, as usual. The consistency automata, one for each variable in a, should be constructed on the following model:
"x-consistency" 3 3
x T 2 2
x F 3 0
= = 1 2
(lezical characters) (surface characters} (x undecided} (x true} (xfalsc}
THE COMPLEXITY OF TWO-LEVEL MORPHOLOGY
The Kimmo algorithms contain the seeds of complexity, for local evidence does not always show how to construct a lexical-surface correspondence that will satisfy the constraints expressed in a set of two-level automata. These seeds can be exploited in mathematical reductions to show that two-level automata can describe computationally difficult problems in a very natural way. It follows that the finite-state two-level framework itself cannot guarantee computational efficiency. If the words of natural languages are easy to analyze, the efficiency of processing must result from some additional property that natural languages have, beyond those that are captured in the twolevel model. Otherwise, computationally difficult problems might turn up in the two-level automata for some natural language, just as they do in the artificially constructed languages here. In fact, the reductions are abstractly modeled on the Kimmo treatment of harmony processes and other long-distance dependencies in natural languages. The reductions use the computationally difficult Boolean satisfiability problems SAT and 3SAT, which involve deciding whether a CNF formula has a satisfying truth-assignment. It is easy to encode an arbitrary SAT problem as a Kimmo generation problem, hence the general problem of mapping from lexical to surface forms in Kimmo systems is NP-complete. 2 Given a CNF formula ~, first construct a string o by notational translation: use a minus sign for negation, a comma for conjunction, and no explicit operator for disjunction. Then the o corresponding to the formula (~ v y)&(~ v z ) & ( x v y v z) is - x y . - y z .xyz. 2Membership in NP is also required for this conclusion. A later section ("The Effect of Nulls~) shows membershipin NP by sketching how a nondeterministicmachinecould quicklysolve Kimmogeneration and recognitionproblems.
The consistency automaton for variable x constrains the mapping from variables in the lexical string to truth-values in the surface string, ensuring that whatever value is assigned to x in one occurrence must be assigned to x in every occurrence. Finally, use the following satisfaction automaton, which does not vary from formula to formula:
"satisfaction" 3 4
1. 2: 3.
= T 2 2 1
= F 1 2 2
3 2 0
, , 0 1 0
(lexical characters} (surface characters} (no true seen in this group) (true seen in this group} (-F counts as true)
The satisfaction automaton determines whether the truthvalues assigned to the variables cause the formula to come out true. Since the formula is in CNF, the requirement is that the groups between commas must all contain at least one true value. The net result of the constraints imposed by the consistency and satisfaction automata is that some surface string can be generated from a just in case the original formula has a satisfying truth-assignment. Furthermore, A and o can be constructed in time polynomial in the length of ~; thus SAT is polynomial-time reduced to the Kimmo generation problem, and the general case of Kimmo generation is at least as hard as SAT. Incidentally, note that it is local rather than global ambiguity that causes trouble; the generator system in the reduction can go through quite a bit of search even when there is just one final answer. Figure 2 traces the operation of the Kimmo generation algorithm on a (uniquely) satisfiable formula. Like the generator, the Kimmo recognizer can also be used to solve computationally difficult problems. One easy reduction treats 3SAT rather than SAT, uses negated alphabet symbols instead of a negation sign, and replaces the satisfaction automaton with constraints from the dictionary component; see Barton (1985) for details.
Generating from lexical form "-xy. -yz. -y-z,xyz" 1 1,1.1,3 38 + 2 -F 3,1,1,2 3g 3 -FF 3.3,1,2 40 (3) 4 -FF, 3,3,1.1 41 5 -FF, 3,3,1,3 42 6 -FF, -T XXX y-con. 43 7 + -FF, -F 3,3,1,2 44 + 3,3,3,2 45 8 -FF, -FF g -FF, -FF. 3,3.3,1 46 10 -FF, -FF, 3,3,3,3 47 (45) -FF, -FF, -T XXX y-con. 48 ll 12 + -FF, -FF, -F 3,3,3,2 49 13 -FF -FF, -F3,3,3,2 50 14 -FF -FF, -F-T XXX z-con. 51 + 15 + -FF -FF, -F-F 3,3,3,2 52 16 -FF -FF, - F - F , 3,3.3,1 53 17 -FF - F F , - F - F , T XXX x-con. 54 + 18 + -FF -FF,-F-F,F 3,3,3,1 55 -FF - F F , - F - F , F T XXX y-con. 56 (2) lg 57 20 + -FF -FF, -F-F,FF 3,3,3,1 -FF -FF,-F-F,FFT XXX z-con. 58 21 5g (57) 22 + -FF -FF,-F-F,FFF 3,3,3,1 23 -FF -FF,-F-F,FFF XXX satis, nf. 60 24 (8) -FF -FT 3,3,2,2 61 25 -FF -FT, 3,3,2,1 62 26 -FF -FT,3,3,2,3 63 + -FF -FT,-T XXX y-con. 64 27 3,3,2,2 65 28 + -FF -FT,-F 3,3,2,2 66 (64) 29 -FF - F T , - F 30 -FF - F T , - F - F XXX z-con. 67 31 + -FF -FT,-F-T 3,3,2,2 68 32 -FF -FT,-F-T. 3,3,2,1 6g