8.2 Extra Goals

Any DCG rule is really syntactic sugar for an ordinary Prolog rule. So it’s not really too surprising that we’re allowed to make use of extra arguments. Similarly, it shouldn’t come as too much of a surprise that we can call any Prolog predicate whatsoever from the right hand side of a DCG rule.

The DCG of the previous section can, for example, be adapted to work with Prolog numbers (instead of the successor representation of numbers) by using calls to Prolog’s built-in arithmetic functionality. We simply count how many a s, b s, and c s have been generated. Here’s the code:

   s  -->  ablock(Count),bblock(Count),cblock(Count).
   
   ablock(0)  -->  [].
   ablock(NewCount)  -->  [a],ablock(Count),
                                             {NewCount  is  Count  +  1}.
   
   bblock(0)  -->  [].
   bblock(NewCount)  -->  [b],bblock(Count),
                                             {NewCount  is  Count  +  1}.
   
   cblock(0)  -->  [].
   cblock(NewCount)  -->  [c],cblock(Count),
                                             {NewCount  is  Count  +  1}.

As this example suggests, extra goals can be written (anywhere) on the right side of a DCG rule, but must be placed between curly brackets. When Prolog encounters such curly brackets while translating a DCG into its internal representation, it just takes the extra goals specified between the curly brackets over into the translation. So, the second rule for the non-terminal ablock above would be translated as follows:

   ablock(NewCount,A,B):-
         ’C’(A,  a,  C),
         ablock(Count,  C,  B),
         NewCount  is  Count  +  1.

Incidentally, if you play around with this DCG, you will find that there are actually some problems with it. In contrast to the one that we saw in the last section, this new version only works correctly when used in the recognition mode. If you try to generate with it, it will at some point enter an infinite loop. We won’t bother to fix this problem here (apart from anything else, we find the earlier succ based approach more elegant).

The possibility of adding arbitrary Prolog goals to the right hand side of DCG rules, makes DCGs very powerful (it means that we can do anything that we can do in plain Prolog). In general, however, this capability is not used much, which tends to suggest that the basic DCG notation is well designed. There is, however, one classic application for extra goals in computational linguistics: with the help of extra goals, we can neatly separate grammar rules and lexical information. Let’s see how.

Separating rules and lexicon

We are going to separate rules and lexicon. That is, we are going to eliminate all mention of individual words in our DCGs and instead record all the information about individual words separately in a lexicon. To see what is meant by this, let’s return to our basic grammar:

   np  -->  det,n.
   
   vp  -->  v,np.
   vp  -->  v.
   
   det  -->  [the].
   det  -->  [a].
   
   n  -->  [woman].
   n  -->  [man].
   
   v  -->  [shoots].

We are now going to write a DCG that generates exactly the same language, but in which no rule mentions any individual word. All the information about individual words will be recorded separately.

Here is an example of a (very simple) lexicon. Lexical entries are encoded by using a predicate lex/2 whose first argument is a word, and whose second argument is a syntactic category.

   lex(the,det).
   lex(a,det).
   lex(woman,n).
   lex(man,n).
   lex(shoots,v).

And here is a simple grammar that could go with this lexicon. In essence it’s the same as the previous one. In fact, the only rules that have changed are those that mentioned specific words, that is, the det , n , and v rules.

   np  -->  det,n.
   
   vp  -->  v,np.
   vp  -->  v.
   
   det  -->  [Word],{lex(Word,det)}.
   n  -->  [Word],{lex(Word,n)}.
   v  -->  [Word],{lex(Word,v)}.

Consider the new det rule. This rule part says “a det can consist of a list containing a single element Word ” (note that Word is a variable). Then the extra goal adds the crucial stipulation: “so long as Word unifies with something that is listed in the lexicon as a determiner”. With our present lexicon, this means that Word must be matched either with the word “a” or “the”. So this single rule replaces the two previous DCG rules for det .

This explains the “how” of separating rules from lexicon, but it doesn’t explain the “why”. Is it really so important? Is this new way of writing DCGs really that much better?

The answer is an unequivocal yes! It’s much better, and for at least two reasons.

The first reason is theoretical. Arguably rules should not mention specific lexical items. The purpose of rules is to list general syntactic facts, such as the fact that sentence can be made up of a noun phrase followed by a verb phrase. The rules for s , np , and vp describe such general syntactic facts, but the old rules for det , n , and v don’t. Instead, the old rules simply list particular facts: that “a” is a determiner, that “the” is a determiner, and so on. From theoretical perspective it is much neater to have a single rule that says “anything is a determiner (or a noun, or a verb, or any other grammatical category) if it is listed as such in the lexicon”. And this, of course, is precisely what our new DCG rules say.

The second reason is more practical. One of the key lessons computational linguists have learnt over the last twenty or so years is that the lexicon is by far the most interesting, important (and expensive!) repository of linguistic knowledge. Bluntly, if you want to get to grips with natural language from a computational perspective, you need to know a lot of words, and you need to know a lot about them.

Now, our little lexicon, with its simple two-place lex entries, is a toy. But a real lexicon is (most emphatically!) not. A real lexicon is likely to be very large (it may contain hundreds of thousands of words) and moreover, the information associated with each word is likely to be very rich. Our lex entries give only the syntactical category of each word, but a real lexicon will give much more, such as information about its phonological, morphological, semantic, and pragmatic properties.

Because real lexicons are big and complex, from a software engineering perspective it is best to write simple grammars that have a simple, well-defined way, of pulling out the information they need from vast lexicons. That is, grammars should be thought of as separate entities which can access the information contained in lexicons. We can then use specialised mechanisms for efficiently storing the lexicon and retrieving data from it.

Our new DCG rules, though simple, illustrate the basic idea. The new rules really do just list general syntactic facts, and the extra goals act as an interface to our lexicon that lets the rules find exactly the information they need. Furthermore, we now take advantage of Prolog’s first argument indexing which makes looking up a word in the lexicon more efficient. First argument indexing is a technique for making Prolog’s knowledge base access more efficient. If in the query the first argument is instantiated it allows Prolog to ignore all clauses where the first argument’s functor and arity is different. This means, for example, that we can get all the possible categories of man immediately without having to even look at the lexicon entries for all the other hundreds or thousands of words that we might have in our lexicon.

eXTReMe Tracker
© 2006-2012 Patrick Blackburn, Johan Bos, Kristina Striegnitz