This manual describes IPC/IPCSEM, a family of tools for generating parsers that perform tokenization and pattern-matching on text. The manual includes the following sections:
1.0 Introduction to IPC
The Interactive Parser Constructor (IPC) is a tool for creating parsers: programs that recognize lexical patterns in text. IPC reads the given input file for a description of the parser to generate. The description is a context-free grammar (CFG) in the form of BNF-style production rules. As output, IPC generates a C include file, PT.h, which defines data structures containing the grammar-specific information needed to parse any text that can be generated from that CFG. This include file is compiled with parser.c to produce an executable parser.
IPC offers several features that enhance the computational power of a parser. The interactive capability of IPC allows the user to view and resolve grammar conflicts before creating a parser, which minimizes error-edit-run cycles. The capacity to add semantic functions to syntax rules extends the parser and allows it to implement a variety of operations, from creating simple parse trees to generating complete target code.
1.1 Running IPC
The IPC program can be run either in interactive (prompt) mode or with command-line arguments. By default, the interactive mode prompts for the following information:
Once this information has been provided, IPC begins to read the grammar file. If the grammar file does not contain any syntax errors or rule conflicts, the message file and the parse table file are generated. If syntax errors are encountered, you must correct the grammar file and re-run IPC. Rule conflicts, however, can be resolved interactively through question-and-answer prompts.
1.2 Grammar File
The format for IPC grammar files is a version of BNF syntax in the following form:
LHS0 = t0 LHS1 t1 ... tn-1 LHSn tn ;
where t0...tn are terminal tokens and LHS1...LHSn are non-terminals. The following input specifies a simple expression grammar, which can be used to parse text like "a + b * 10 + 20" and "c * 6 + 8 * 2":
Z = E ;
E = E "+" T ;
E = T ;
T = T "*" F ;
T = F ;
F = "#id" ;
F = "#int" ;
By default, IPC recognizes four built-in terminals: "#id", which matches identifiers (variables), and "#int", "#real" and "#string", which match integer, real and string literals respectively. Here is another simple example that extends the first example by allowing assignment statements and parentheses.
Since IPC generates bottom-up parsers, rules with LHS non-terminals "lower down" are reduced first, followed by rules with LHS non-terminals "higher up." In the context of the previous grammar, the expression "a + b * 10 + 20" would be parsed correctly, with "b * 10" having higher precedence. Hence E = "(" E ")" has higher precedence (as it should) than any other grammar rule. A somewhat more complicated example:
DeclList = DeclList ";" Decl ; SemFunc1
DeclList = Decl ; SemFunc2
Decl = IdList ":" Type ; SemFunc3
IdList = IdList "," "#id" ; SemFunc4
IdList = "#id" ; SemFunc5
Type = ScalarType ; SemFunc6
Type = "array" "(" ScalarTypeList ")" "of" Type ; SemFunc7
ScalarType = "#id" ; SemFunc8
ScalarType = Bound ".." Bound ; SemFunc9
Bound = Sign IntLiteral ; SemFunc10
Bound = "#id" ; SemFunc11
...
This is the beginning of a grammar for the declaration statements of a Pascal-like language. The last item on each line, SemFunc1...SemFunc11, is a semantic function that will be called when the rule to its left is used in a reduction.
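The line shape described above (LHS, "=", RHS elements, ";", optional semantic function name) can be illustrated with a small splitter. This is only a sketch and not IPC's own reader: the Rule type, split_rule() and the size limits are hypothetical, and it assumes that elements are separated by whitespace and contain none themselves.

```c
#include <assert.h>
#include <string.h>

#define MAX_RHS  16
#define MAX_TEXT 128

// Hypothetical in-memory form of one grammar-file line.
typedef struct {
    char text[MAX_TEXT];        // private copy that the pointers below index into
    const char *lhs;            // left-hand-side non-terminal
    const char *rhs[MAX_RHS];   // right-hand-side elements, in order
    int rhs_len;
    const char *sem_func;       // NULL when no semantic function is named
} Rule;

// Split a line such as:  E = E "+" T ; function1
// Returns 0 on success, -1 if the line does not have the "LHS = ... ;" shape.
int split_rule(const char *line, Rule *r)
{
    char *tok;
    int seen_semicolon = 0;

    assert(line != NULL && r != NULL);
    if (strlen(line) >= MAX_TEXT) return -1;
    strcpy(r->text, line);
    r->lhs = NULL;
    r->rhs_len = 0;
    r->sem_func = NULL;

    tok = strtok(r->text, " \t\n");
    if (tok == NULL) return -1;
    r->lhs = tok;

    tok = strtok(NULL, " \t\n");
    if (tok == NULL || strcmp(tok, "=") != 0) return -1;

    while ((tok = strtok(NULL, " \t\n")) != NULL) {
        if (seen_semicolon) {           // first token after the bare ';'
            r->sem_func = tok;
            break;
        }
        if (strcmp(tok, ";") == 0) {    // a quoted ";" terminal does not match
            seen_semicolon = 1;
            continue;
        }
        if (r->rhs_len == MAX_RHS) return -1;
        r->rhs[r->rhs_len++] = tok;
    }
    return seen_semicolon ? 0 : -1;
}
```

Because the quoted terminal ";" keeps its quotes as part of the token, it is stored as an RHS element, while the bare semicolon ends the rule.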
1.3 Message File
The IPC message file contains three major sections. The first section shows the FIRST, FOLLOW and SET information for each rule. The second section, PARSE TABLE, is the text (readable) version of the data structures in PT.h and has the following format: LHS, Action, Rule. The last section contains IPC’s internal data structure statistics, showing the number of productions, alternative rule sets, rows and columns in the parse table, etc.
GRAMMAR FILE "grammar": No Format Errors

FIRST ( Z ): { #id #int }
FOLLOW( Z ): { $ }
FIRST ( E ): { #int #id }
FOLLOW( E ): { + $ }
FIRST ( T ): { #id #int }
FOLLOW( T ): { * $ + }
FIRST ( F ): { #int #id }
FOLLOW( F ): { + $ * }

SET 0:
  ITEMS
    <Z> -> * <E>, { $ }
    <E> -> * <E> "+" <T>, { + $ }
    <E> -> * <T>, { + $ }
    <T> -> * <T> "*" <F>, { * + $ }
    <T> -> * <F>, { * + $ }
    <F> -> * "#id", { * + $ }
    <F> -> * "#int", { * + $ }
  TRANSITIONS
    on "#int" goto Set 5
    on "#id" goto Set 4
    on <F> goto Set 3
    on <T> goto Set 2
    on <E> goto Set 1
SET 1:
  ITEMS
    <Z> -> <E> *, { $ }
    <E> -> <E> * "+" <T>, { $ + }
  TRANSITIONS
    on "+" goto Set 6
SET 2:
  ITEMS
    <E> -> <T> *, { $ + }
    <T> -> <T> * "*" <F>, { $ + * }
  TRANSITIONS
    on "*" goto Set 7
...
SET 9:
  ITEMS
    <T> -> <T> "*" <F> *, { $ + * }

PARSE TABLE
State 0:
    "#id", S, 4
    "#int", S, 5
    <E>, G, 1
    <F>, G, 3
    <T>, G, 2
State 1:
    "$", A
    "+", S, 6
State 2:
    "$", R, 2
    "*", S, 7
    "+", R, 2
...
State 9:
    "$", R, 3
    "*", R, 3
    "+", R, 3

LR(1) Data Structure Statistics for the Grammar "grammar":

Statistic                                 Corresponding Constant
-----------------------------------------------------------------
Grammar Data Structures:
  Number of productions               4   MAX_PROD
  Number of alternates                7   MAX_ALT (MAX_RULE)
  Number of elements                 11   MAX_ELEM
  Length of the grammar name space   24   MAX_SPACE
Item Sets Data Structures:
  Number of item sets                10   MAX_SET (MAX_ROW)
  Total number of kernel items       13   MAX_ITEM
  Number of items in largest set      7   MAX_TEMP
Parse Table Data Structures:
  Number of parse table rows         10   MAX_ROW
  Number of parse table columns      11   MAX_COLUMN
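The FIRST sets in the listing can be reproduced by hand with a small fixed-point computation. The sketch below is not IPC's implementation: the symbol numbering, the Prod type and compute_first() are hypothetical, and it relies on the expression grammar having no rule that derives the empty string, so only the first RHS symbol of each rule matters.

```c
#include <assert.h>
#include <stddef.h>

// Symbols of the expression grammar; non-terminals come first.
enum { Z, E, T, F, N_NONTERM,
       PLUS = N_NONTERM, STAR, ID, INT };

typedef struct { int lhs; int rhs[3]; int len; } Prod;

const Prod GRAMMAR[] = {
    { Z, { E },          1 },
    { E, { E, PLUS, T }, 3 },
    { E, { T },          1 },
    { T, { T, STAR, F }, 3 },
    { T, { F },          1 },
    { F, { ID },         1 },
    { F, { INT },        1 },
};
#define N_PROD ((int)(sizeof GRAMMAR / sizeof GRAMMAR[0]))

// first[A] is a bit set over symbol numbers: bit s is set when
// terminal s can begin a string derived from non-terminal A.
void compute_first(unsigned first[N_NONTERM])
{
    int changed = 1;
    int i;

    assert(first != NULL);
    for (i = 0; i < N_NONTERM; i++)
        first[i] = 0;

    while (changed) {                   // repeat until no set grows
        changed = 0;
        for (i = 0; i < N_PROD; i++) {
            int head = GRAMMAR[i].rhs[0];
            unsigned add = (head < N_NONTERM) ? first[head] : (1u << head);
            unsigned merged = first[GRAMMAR[i].lhs] | add;
            if (merged != first[GRAMMAR[i].lhs]) {
                first[GRAMMAR[i].lhs] = merged;
                changed = 1;
            }
        }
    }
}
```

For this grammar every non-terminal ends up with the set containing only ID and INT, which agrees with the FIRST lines of the message file.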
1.4 Parse Table
This C "include" file, called "PT.h" by default, contains all the necessary data for generation of a parser capable of analyzing text of the given grammar. Here is the content of PT.h for the first example grammar:
typedef struct pt_entry
In addition to the PT array, the RULE array contains all the grammar rules in the order in which they were specified in the grammar file. R_NODE.Name is an index into the G_LIST array, which holds the entire terminal/non-terminal vocabulary sorted alphabetically, while R_NODE.Length is the number of RHS elements for that rule.
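To make the relationship between RULE and G_LIST concrete, here is a hypothetical sketch. Only the names R_NODE, Name, Length, RULE and G_LIST come from this manual; the field types, the table contents, the assumption that Name refers to the rule's LHS, and the helpers rule_lhs() and rule_pop_count() are illustrative and will differ from the PT.h that IPC actually generates.

```c
#include <assert.h>
#include <stddef.h>

// Assumed layout of one RULE entry.
typedef struct r_node {
    int Name;     // index into G_LIST (assumed here to be the rule's LHS)
    int Length;   // number of RHS elements
} R_NODE;

// Illustrative vocabulary in plain strcmp() order.
const char *G_LIST[] = { "#id", "#int", "*", "+", "E", "F", "T", "Z" };

// Rules in grammar-file order.
const R_NODE RULE[] = {
    { 7, 1 },   // Z = E
    { 4, 3 },   // E = E "+" T
    { 4, 1 },   // E = T
    { 6, 3 },   // T = T "*" F
    { 6, 1 },   // T = F
    { 5, 1 },   // F = "#id"
    { 5, 1 },   // F = "#int"
};
#define N_RULE ((int)(sizeof RULE / sizeof RULE[0]))

// Name of the non-terminal that a reduction by this rule produces.
const char *rule_lhs(int rule)
{
    assert(rule >= 0 && rule < N_RULE);
    return G_LIST[RULE[rule].Name];
}

// Number of stack entries that a reduction by this rule removes.
int rule_pop_count(int rule)
{
    assert(rule >= 0 && rule < N_RULE);
    return RULE[rule].Length;
}
```

With this layout a reduction needs only the rule number: Length says how many entries to pop, and Name says which non-terminal to push in their place.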
1.5 Conflict Resolution
When the IPC program detects grammar ambiguity -- a grammar is ambiguous if more than one derivation exists for some input string -- it will interactively prompt for clarification to resolve the conflict(s). Here is a blatant example of an ambiguous grammar:
Using the following dialog, IPC resolves the ambiguity and generates a correct parser. Comments are listed to the right to clarify the user's actions. Also, note the * (called the dot) in each rule; it is a good indicator of which rule you must choose.
Parse Table Conflicts, Set 14:
1 = Resolve Conflicts
2 = Print Conflicts
3 = Print Derivation Trace
Enter function number (1/2/3): 1
Set 14, S/R Conflict, Input_Symbol "+" -- Items that conflict:
1.
2.0 Introduction to Parser
The actual parsing of a language (an expression, strings, program source code, etc.) is the responsibility of a program called a parser. To build a parser, you must compile and link parser.c. This may be accomplished using the following UNIX command:
$ cc parser.c -o parser
By default, parser.c uses the parse table file PT.h that was generated by IPC. If you specified a different filename for the parse table file, you must modify parser.c and change the #include "PT.h" line to name your parse table include file.
2.1 Running the Parser
The parser that you have created can operate either interactively or by using command-line arguments. The interactive mode offers the following prompts.
If the parser is successful in parsing the source text file, it will print "Text syntactically correct" and then exit to the command prompt.
2.2 Parser Message File
The verbose version of the parser’s message file contains all the information on how your source file was parsed. Using the simple expression grammar from Section 1.2, parsing the expression "a + b * 10 + 20" results in this message file:
Parsing text "expression":

STACK CONTENTS ..... NEXT LEXEME
-----------------------------------------------------------------------
0 ..... "a"
0 "a" 4 ..... "+"
0 <F> 3 ..... "+"
0 <T> 2 ..... "+"
0 <E> 1 ..... "+"
0 <E> 1 "+" 6 ..... "b"
0 <E> 1 "+" 6 "b" 4 ..... "*"
0 <E> 1 "+" 6 <F> 3 ..... "*"
0 <E> 1 "+" 6 <T> 8 ..... "*"
0 <E> 1 "+" 6 <T> 8 "*" 7 ..... "10"
0 <E> 1 "+" 6 <T> 8 "*" 7 "10" 5 ..... "+"
0 <E> 1 "+" 6 <T> 8 "*" 7 <F> 9 ..... "+"
0 <E> 1 "+" 6 <T> 8 ..... "+"
0 <E> 1 ..... "+"
0 <E> 1 "+" 6 ..... "20"
0 <E> 1 "+" 6 "20" 5 ..... "EOF"
0 <E> 1 "+" 6 <F> 3 ..... "EOF"
0 <E> 1 "+" 6 <T> 8 ..... "EOF"
0 <E> 1 ..... "EOF"
0 <E> 1 ..... "EOF"
Each message line shows the contents of the stack and the next lexeme (token) to be parsed. As each lexeme is processed, the stack contents form a pattern that the parser can recognize and reduce via a rule whose RHS matches the pattern on the stack. Each lexeme or non-terminal is followed by a SET number (or, in the case of "#id", "#int", etc., the rule number) that was used for its derivation. All the SETs are listed in the IPC message file.
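The shift and reduce steps behind such a trace can be sketched as a small table-driven loop. The code below is an illustration and not the contents of parser.c: the tables were derived by hand for the simple expression grammar, with state numbers chosen to agree with the trace, and the token numbering, action encoding and lr_parse() are assumptions of this sketch.

```c
#include <assert.h>
#include <stddef.h>

enum { T_ID, T_INT, T_PLUS, T_STAR, T_EOF, N_TERM };   // terminals
enum { NT_E, NT_T, NT_F, N_NONTERM };                  // non-terminals

// Action encoding: 0 = error, +n = shift to state n,
// -n = reduce by rule n, ACC = accept.
#define ACC 100
#define STACK_MAX 64

// Rules 1..6; rule 0 (Z = E) is represented by ACC. Index 0 is unused.
const int RULE_LHS[] = { 0, NT_E, NT_E, NT_T, NT_T, NT_F, NT_F };
const int RULE_LEN[] = { 0, 3,    1,    3,    1,    1,    1    };

const int ACTION[10][N_TERM] = {
    //         id  int   +    *   EOF
    /* 0 */ {  4,  5,   0,   0,   0  },
    /* 1 */ {  0,  0,   6,   0,  ACC },
    /* 2 */ {  0,  0,  -2,   7,  -2  },
    /* 3 */ {  0,  0,  -4,  -4,  -4  },
    /* 4 */ {  0,  0,  -5,  -5,  -5  },
    /* 5 */ {  0,  0,  -6,  -6,  -6  },
    /* 6 */ {  4,  5,   0,   0,   0  },
    /* 7 */ {  4,  5,   0,   0,   0  },
    /* 8 */ {  0,  0,  -1,   7,  -1  },
    /* 9 */ {  0,  0,  -3,  -3,  -3  },
};

const int GOTO[10][N_NONTERM] = {
    //         E  T  F
    /* 0 */ {  1, 2, 3 },
    /* 1 */ {  0, 0, 0 },
    /* 2 */ {  0, 0, 0 },
    /* 3 */ {  0, 0, 0 },
    /* 4 */ {  0, 0, 0 },
    /* 5 */ {  0, 0, 0 },
    /* 6 */ {  0, 8, 3 },
    /* 7 */ {  0, 0, 9 },
    /* 8 */ {  0, 0, 0 },
    /* 9 */ {  0, 0, 0 },
};

// Returns 1 if the token sequence (terminated by T_EOF) is accepted,
// 0 on a syntax error or stack overflow.
int lr_parse(const int *tokens)
{
    int stack[STACK_MAX];   // state numbers only; symbols are implied by states
    int top = 0;
    size_t pos = 0;

    assert(tokens != NULL);
    stack[0] = 0;
    for (;;) {
        int act;

        assert(tokens[pos] >= 0 && tokens[pos] < N_TERM);
        act = ACTION[stack[top]][tokens[pos]];
        if (act == ACC) return 1;
        if (act == 0) return 0;
        if (act > 0) {                          // shift: push state, consume token
            if (top + 1 >= STACK_MAX) return 0;
            stack[++top] = act;
            pos++;
        } else {                                // reduce: pop RHS, push goto state
            int rule = -act;
            top -= RULE_LEN[rule];
            stack[top + 1] = GOTO[stack[top]][RULE_LHS[rule]];
            top++;
        }
    }
}
```

Feeding it the tokens of "a + b * 10 + 20" visits the same sequence of states as the message file, for example state 9 after "10" is reduced to F and state 8 after T "*" F is reduced to T.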
3.0 Introduction to Semantic Functions
By default, the parser that you generate can only inform you about the syntactic correctness of the source language. This is not sufficient for generating target code. For example, a CFG does not allow type checking of variables. Hence, there is motivation to extend the parser by writing semantic functions.
As was indicated earlier, semantic functions are C functions that the parser calls before it reduces the stack via a production rule. Each rule in your grammar file can be associated with a different semantic function. IPC includes the names of these functions in the parse table that it creates for use by the parser. The functions themselves must be compiled and linked with the parser.
Since the semantic functions become part of the parser, they can access all the information and structures that the parser has at its disposal. This combination creates a powerful tool that is capable of performing any compile-related task, from simple expression rewriting to full target-language code generation.
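One way to picture "the parser calls a semantic function before it reduces" is a table of function pointers indexed by rule number. This is a sketch under assumptions, not the mechanism parser.c is known to use: SEM_FUNC, SEM_TABLE, call_semantic(), on_add(), on_mul() and the calls counter are hypothetical names, and STACK is left as an opaque stand-in for the parser's real stack type.

```c
#include <assert.h>
#include <stddef.h>

// Opaque stand-in; the real STACK is defined by the parser.
typedef struct stack STACK;

// Every semantic function takes the parser stack and returns an int.
typedef int (*SEM_FUNC)(STACK *);

int calls[2];   // invocation counters, present only to make the sketch observable

int on_add(STACK *s) { (void)s; calls[0]++; return 0; }
int on_mul(STACK *s) { (void)s; calls[1]++; return 0; }

// One slot per rule; NULL means the rule has no semantic function.
const SEM_FUNC SEM_TABLE[] = {
    NULL,     // Z = E
    on_add,   // E = E "+" T
    NULL,     // E = T
    on_mul,   // T = T "*" F
};
#define N_SEM ((int)(sizeof SEM_TABLE / sizeof SEM_TABLE[0]))

// Run the semantic function for a rule, if any. A parser would call this
// just before popping the rule's RHS, while the RHS is still on the stack.
// Returns the function's result, or 0 when the rule has none.
int call_semantic(int rule, STACK *s)
{
    assert(rule >= 0 && rule < N_SEM);
    return SEM_TABLE[rule] ? SEM_TABLE[rule](s) : 0;
}
```

Calling the function before the pop matters: it is what lets a semantic function read the lexemes of the RHS from the stack.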
3.1 Examples
The first step in creating a semantic function is to augment the grammar file. Here is the simple expression grammar with some semantic functions:
Z = E ; function0
E = E "+" T ; function1
E = T ; function2
T = T "*" F ; function3
T = F ; function4
F = "#id" ; function5
F = "#int" ; function6
Next, we must implement these functions to perform a task. For our example, we will write these functions to create a post-fix representation of an expression input in in-fix form.
#include <stdio.h> // for printf

int function6(STACK *L1) // the parser's stack is passed as the only argument
{
    int top;

    top = L1->top; // array index of the top of the stack
    //
    // Now that we know where the top is, we need to look at the rule.
    // Rule F = "#int" has 1 item on its RHS, so that is the only symbol
    // that we have to print.
    //
    printf("%s ", L1->Data[top].Symbol);
    return (0); // some compilers will complain if we don't do this
}

int function5(STACK *L1)
{
    //
    // For our example, this function performs the same task as
    // function6, so we can just call that one.
    //
    return function6(L1);
}

int function4(STACK *L1)
{
    // We don't need this function to do anything and could have excluded it
    // from our grammar.
    //
    return (0);
}

int function3(STACK *L1)
{
    int top;

    top = L1->top;
    //
    // Looking at the rule T = T "*" F, we can see that the operator is at top-1.
    // Since the operator is the symbol that we want, we have to decrement the
    // stack top (our local variable) to get it.
    //
    top = top - 1;
    printf("%s ", L1->Data[top].Symbol);
    return (0);
}

int function2(STACK *L1)
{
    // We don't need this function either.
    //
    return (0);
}

int function1(STACK *L1)
{
    //
    // This function is identical to function3. The rule E = E "+" T has the
    // same RHS order and length as the rule handled by function3.
    //
    return function3(L1);
}

int function0(STACK *L1)
{
    //
    // This function outputs a single newline to end our print string.
    //
    printf("\n");
    return (0);
}
Running the resulting parser with the input "a + b * 10 + 20" will output "a b 10 * + 20 +".
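The index arithmetic used by function3 (the operator of T = T "*" F sits at top-1) generalizes to any rule: the RHS occupies the topmost entries of the stack, with the last RHS element at the top. The sketch below uses a mock STACK that models only the members used in this manual's examples (top and Data[i].Symbol); the ENTRY type, the array size and the helper rhs_symbol() are hypothetical, and the real definitions are those in parser.c.

```c
#include <assert.h>
#include <stddef.h>

// Mock of the parser stack, limited to the members used in the examples.
typedef struct { const char *Symbol; } ENTRY;
typedef struct { int top; ENTRY Data[16]; } STACK;

// Return the i-th RHS element (0-based) of the rule about to be reduced.
// rhs_len is the number of RHS elements; the last one is at L1->top.
const char *rhs_symbol(const STACK *L1, int rhs_len, int i)
{
    assert(L1 != NULL);
    assert(rhs_len >= 1 && i >= 0 && i < rhs_len);
    assert(L1->top - (rhs_len - 1) >= 0);
    return L1->Data[L1->top - (rhs_len - 1) + i].Symbol;
}
```

For a stack holding b, "*" and 10 with top equal to 2, rhs_symbol(&s, 3, 1) yields the operator, which is the same entry function3 reaches with top - 1.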
The semantic functions that you write need not be the only functions in your C file. You may add as many support functions as you need. For our next and final example, we implement an abstract syntax tree (AST) using the same grammar, by making calls to support routines not shown (e.g., makeTree() ).
int function6(STACK *L1)
{
    int top;
    ITEM item;

    top = L1->top;
    item = createItem(L1->Data[top].Symbol); // leaf node for the operand
    pushStack(item);
    return (0);
}

int function5(STACK *L1)
{
    return function6(L1);
}

int function4(STACK *L1)
{
    return (0);
}

int function3(STACK *L1)
{
    int top;
    ITEM item1, item2, node;
    char binOp;

    top = (L1->top) - 1;
    // Taking only the first character is OK since our only operators are * and +
    binOp = *(L1->Data[top].Symbol);
    item2 = popStack(); // right operand was pushed last
    item1 = popStack();
    node = makeTree(item1, item2, binOp);
    pushStack(node);
    return (0);
}

int function2(STACK *L1)
{
    return (0);
}

int function1(STACK *L1)
{
    return function3(L1);
}

int function0(STACK *L1)
{
    ITEM rootNode;

    rootNode = popStack();
    printTree(rootNode);
    return (0);
}
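The support routines called above are not shown in this manual. One possible shape for them is sketched here purely as an assumption: the NODE layout, the fixed-size item stack, the 32-character text limit and the post-order printTree() are choices made for illustration, and only the function names and the ITEM type name come from the example.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// Assumed tree node: an operand lexeme or a one-character operator.
typedef struct node {
    char text[32];
    struct node *left, *right;
} NODE;
typedef NODE *ITEM;

#define MAX_ITEMS 64
ITEM itemStack[MAX_ITEMS];   // separate from the parser's own stack
int itemTop = -1;

// Allocate a leaf holding a copy of the symbol (truncated to fit).
ITEM createItem(const char *symbol)
{
    ITEM n = calloc(1, sizeof *n);   // zero-filled, so text stays terminated
    assert(n != NULL && symbol != NULL);
    strncpy(n->text, symbol, sizeof n->text - 1);
    return n;
}

void pushStack(ITEM item)
{
    assert(itemTop + 1 < MAX_ITEMS);
    itemStack[++itemTop] = item;
}

ITEM popStack(void)
{
    assert(itemTop >= 0);
    return itemStack[itemTop--];
}

// Build an interior node for a binary operator over two subtrees.
ITEM makeTree(ITEM left, ITEM right, char binOp)
{
    char op[2];
    ITEM n;

    op[0] = binOp;
    op[1] = '\0';
    n = createItem(op);
    n->left = left;
    n->right = right;
    return n;
}

// Print the tree in post-order: left subtree, right subtree, then the node.
void printTree(ITEM root)
{
    if (root == NULL) return;
    printTree(root->left);
    printTree(root->right);
    printf("%s ", root->text);
}
```

The sketch never frees its nodes, which is acceptable for a short-lived parser run but would need a matching cleanup routine in a longer-lived tool.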
As should be clear by now, semantic functions can significantly extend the parser. You can use them to create the data structures needed to implement intermediate forms (ASTs, DAGs, three-address code, etc.), or even to implement a code generator that takes the intermediate form as input and generates target code from it. Hence, semantic functions provide sufficient flexibility to allow the parser to metamorphose into whatever compile-related tool is desired, even into a complete compiler.