op(500,yfx,+)declares the symbol '+' to be an infix binary left-associative operator with binding looseness 500 (binding looseness is the opposite of precedence - anti-precedence, so to speak).
The second argument of op/3 is called the operator's specifier. It is a symbol that encodes three kinds of information concerning the operator:
Note that yfy is not allowed as an operator specifier because that would mean an ambiguous way of parsing the operator by associating either to the left or to the right.
The binding looseness used by Prolog's op/3 works in fact as the opposite of the precedence level used in parsing: the smaller a Prolog operator's binding looseness measure is, the more it takes precedence for parsing. These binding looseness measures range inclusively from 1 (maximum precedence) to 1200 (minimum precedence).
The third argument of op/3 can be any syntactically well-formed Prolog functor. In particular, these need not be known as operators prior to runtime. Prolog's tokenizer only recognizes such a token as a functor. Thus, any functor, whether declared operator or not, can always be parsed as a prefix operator preceding a parenthesized comma-separated sequence of arguments. Whether it is a declared operator determines how it may be parsed otherwise. In Sicstus Prolog, for example, declaring:
?- X = 1+2.yields this query's answer:
?- X = 1+2 ? X = 1 + 2. yesand declaring:
?- X = +(1,2).yields the same query's answer:
?- X = 1+2 ? X = 1 + 2. yesProlog's parser can accommodate dynamic operators for two reasons:
Operator-precedence parsing is possible only for a restricted class of context-free grammars: the so-called operator grammars. A context-free grammar is an operator grammar if and only if no production's right-hand side is empty or contains two adjacent nonterminals.
For example, the two-rule context-free grammar where '(', ')', 'ID', '+', '*', '-', and '/' are terminal symbols:
Ex -> Ex Op Ex | '(' Ex ')' | '-' Ex | 'ID' Op -> '+' | '*' | '-' | '/'is not an operator grammar. But the equivalent one-rule context-free grammar:
Ex -> Ex '+' Ex | Ex '*' Ex | Ex '-' Ex | Ex '/' Ex | '(' Ex ')' | '-' Ex | 'ID'is. It is not difficult to see that a Prolog term can be recognized by an operator grammar. Namely,
Term -> 'VAR' | 'FUN' | 'FUN' '(' Body ')' | 'FUN' Term | Term 'FUN' Term | Term 'FUN' | '(' Term ')' Body -> Term | Term ',' Bodywhere 'VAR', 'FUN', '(', ')', ',', are terminal symbols; in particular, 'FUN', is dynamically tokenized setting the terminal 'FUN' string value to the functor the tokenizer recognized it to be — and can thus easily accommodate dynamic operators.
Thus, although they can recognize a much larger class of context-free languages, conventional shift-reduce parsers for LR grammars cannot accommodate parse-time ambiguity resolution. Although this makes parsing more efficient, it also forbids a parser generated by a yacc-like parser generator to support Prolog style operators.
In what follows, we propose to reorganize the structure of the implementation of a yacc-style parser generator to accommodate Prolog-style dynamic operators. We do so:
%dynamic opis used to declare that the parser of the grammar being specified will allow defining, or redefining, dynamic operators of category op. The effect of this declaration is to create a non-terminal symbol named op that stands for this token category. Three implicit grammar rules are then automatically defined:
op : op_ | _op_ | _op ;which introduce, respectively, prefix, infix, and postfix, subcategories for operators of category op. These are terminal symbols standing as generic tokens that denote specific operators for each fixity. They are always an uppercase version of the dynamic category symbol, with underscores attached on each side as a mnemonic way of showing where their arguments are expected.
Once a dynamic operator category has ben declared such as above, specific operators of this category may be defined in the grammar specicification as follows:
%op <operator> <specifier> <looseness>or, equivalently:
%op <operator> <looseness> <specifier>For example,
%op '+' yfx 500or, equivalently,
%op '+' 500 yfxdeclares the symbol '+' to be an infix binary left-associative operator of category op, with binding looseness 500, as in Prolog.
In addition, the generated parser defines the following method:
public final void op (String operator, String specifier, int looseness)whose effect is to define, or redefine, an operator for the token category op dynamically using the given (Prolog-style) specifier and (Prolog-style) binding looseness. It is this method that can be invoked in a parser's semantic action at parse time.
An operator's category name may be used in a grammar specification wherever an operator of that category is expected. However, one may limit occurrences of such operators to those of specific fixity only using the terminal symbols:
Expression : op_ Expression | Expression _op | Expression _op_ Expression ;
A major modification in the parser generator and the generic parser must also be made regarding the parser's actions. A state may have contending actions on a given input. Such a state is deemed conflictual if and only if the input creating the conflict is a dynamic operator, or if one of its conflicting actions is a reduction with a rule with a tag that is a dynamic operator. (Recall that a rule's tag is the last terminal symbol occurring in its body, if one does; otherwise the rule has no tag.) All other states can be treated as usual, resolving potential conflicts using the conventional method based on precedence and associativity. Clearly, a dynamic operator token category does not have this information but delegates it to the specific token, which will be known only at parse time. At parser-construction time, a pseudo-action is generated for conflictual states which delays decision until parse time. It uses the state's table associating a set of actions with the token creating the conflict in this state. These sets of conflicting actions are thus recorded for each conflictual state.
When a token is identified and the current state is a conflictual state, which action to perform is determined by choosing in the action set associated to the state according to the same disambiguation rules followed by the static table construction but using the current precedence and associativity values of the specific operator being read. If a reduce action in the set involves a rule tagged with a dynamic operator, which precedence and associativity values to use for the rule are those of the specific operator tag for that rule, which can be obtained in the current stack. The stack offset of that operator will depend on which of the dynamic operator's rules is being considered.
%token '!' %dynamic op1 %op1 '!' yf 200 %dynamic op2 %op2 '!' yfx 500 %% Expression : '!' Expression | Expression _op1 | Expression _op2_ Expression ; %%For this grammar, the character '!' may be tokenized as either '!', 'op1', or 'op2'.
The tokenizer can therefore be made to dispense with guaranteeing a token's lexical category. Looking up its token category tables, the parser then determines the set of admissible lexical categories for this token in the current state (i.e., those for which it has an action defined). If more than one token remain in the set, a choice point for this state is created. Such a choice point records the current state of parsing for backtracking purposes. Namely, the grammar state, and the token set. The tokens are then tried in the order of the set, and upon error, backtracking resets the parser at the latest choice point deprived of the token that was chosen for it.
Note that the use of backtracking for token identification is not a guarantee of complete recovery. First, full backtracking is generally not a feasible nor desirable option as it would entail possibly keeping an entire input stream in memory as the buffer grows. The option is to keep only a fixed-size buffer and flush from the choice point stack any choice point that becomes stale when this buffer overflows. In effect, this enforces an automatic commit whenever a token choice is not invalidated within the time it takes to read further tokens as allowed by the buffer size.
Second, although backtracking restores the parser's state, it does not automatically undo the side effects that may have been performed by the execution of any semantic action encountered between the failure state and the restored state. If there are any, these must be undone manually. We thus let the grammar formalism allow to specify undo actions to perform when a rule is backtracked over.
The only limitation -- shallow backtracking -- is not serious, and in fact the choice-point stack's size can be specified arbitrarily large if need be. However, any input that overruns the choice-point stack's default depth is in fact cleaning up space by getting rid of older and less likely to be used choice-points. Indeed, failure occurs generally shortly after a wrong choice has been made. We give separately a more detailed specification of the implementation of the shallow backtracking scheme that is adequate for this purpose.
We also wish to preserve compatibility with Prolog's conventions. Recall that the number argument in a Prolog 'op'/3' declaration denotes the binding looseness of the operator, which is inversely related to parsing precedence. The range of these numbers is the interval [1,1200]. To make this compatible with the foregoing yacc commands, the Grammar class defines two constants:
static final int MIN_PRECEDENCE = 1; static final int MAX_PRECEDENCE = 1200;In order to have the binding looseness to be such that 1200 corresponds to minimum precedence and 1 to maximum precedence, we simply define the precedence level of binding looseness n to be 1200-n+1. Thus, a declaration such as:
%op '+' yfx 500associates to binary '+' a precedence level of 701 (i.e., 1200-500+1).
More generaly, dynamic operators can be declared with the form:
%op <operator> <specifier> <precedence>or:
%op <operator> <precedence> <specifier>or the simpler form:
%op <operator> <specifier>leaving the precedence implicit, and defaulting to the precedence level effective at the command's execution time.
The first encountered token declaration with implicit precedence (i.e., a conventional yacc token command or a two-argument dynamic operator command) uses the initial precedence level set to Grammar.MIN_PRECEDENCE, then increments it by a fixed increment. This increment is 10 by default, but using the command:
%precstep <number>sets the increment to the given number. This command may be used several times. Each subsequent declaration with implicit precedence uses the current precedence level, then increments the precedence level by the current precedence increment. Any attempt to set a precedence level outside the [1,1200] range is ignored: the closest bound is used instead (i.e., 1 if less and 1200 if more), and a warning is issued.