In 1970, Alexander Birman laid the groundwork for packrat parsing by introducing the "TMG recognition scheme" (TS), and "generalized TS" (gTS). TS was based upon Robert M. McClure's TMG compiler-compiler, and gTS was based upon Dewey Val Schorre's META compiler-compiler.
Birman's work was later refined by Aho and Ullman; and renamed as Top-Down Parsing Language (TDPL), and Generalized TDPL (GTDPL), respectively. These algorithms were the first of their kind to employ deterministic top-down parsing with backtracking.[2][3]
Bryan Ford developed PEGs as an expansion of GTDPL and TS. Unlike CFGs, PEGs are unambiguous and can match well with machine-oriented languages. PEGs, similar to GTDPL and TS, can also express all LL(k) and LR(k). Bryan also introduced Packrat as a parser that uses memoization techniques on top of a simple PEG parser. This was done because PEGs have an unlimited lookahead capability resulting in a parser with exponential time performance in the worst case.[2][3]
Packrat keeps track of the intermediate results for all mutually recursive parsing functions. Each parsing function is only called once at a specific input position. In some instances of packrat implementation, if there is insufficient memory, certain parsing functions may need to be called multiple times at the same input position, causing the parser to take longer than linear time.[4]
The packrat parser takes in input the same syntax as PEGs: a simple PEG is composed of terminal and nonterminal symbols, possibly interleaved with operators that compose one or several derivation rules.[2]
Symbols
Nonterminal symbols are indicated with capital letters (e.g., )
Terminal symbols are indicated with lowercase (e.g., )
Expressions are indicated with lower-case Greek letter (e.g., )
Expressions can be a mix of terminal symbols, nonterminal symbols and operators
Operators
Syntax Rules
Operator
Semantics
Sequence
Success: If and are recognized
Failure: If or are not recognized
Consumed: and in case of success
Ordered choice
Success: If any of is recognized starting from the left
Failure: All of do not match
Consumed: The atomic expression that has generated a success
so if multiple succeed the first one is always returned
And predicate
Success: If is recognized
Failure: If is not recognized
Consumed: No input is consumed
Not predicate
Success: If is not recognized
Failure: If is recognized
Consumed: No input is consumed
One or more
Success: Try to recognize one or multiple time
Failure: If is not recognized
Consumed: The maximum number that is recognized
Zero or more
Success: Try to recognize zero or multiple time
Failure: Cannot fail
Consumed: The maximum number that is recognized
Zero or one
Success: Try to recognize zero or once
Failure: Cannot fail
Consumed: if it is recognized
Terminal range
[]
Success: Recognize any terminal that are inside the range . In the case of , can be any letter from h to z
Failure: If no terminal inside of can be recognized
Consumed: if it is recognized
Any character
Success: Recognize any character in the input
Failure: If no character in the input
Consumed: any character in the input
Rules
A derivation rule is composed by a nonterminal symbol and an expression .
A special expression is the starting point of the grammar.[2] In case no is specified, the first expression of the first rule is used.
An input string is considered accepted by the parser if the is recognized. As a side-effect, a string can be recognized by the parser even if it was not fully consumed.[2]
An extreme case of this rule is that the grammar matches any string.
This can be avoided by rewriting the grammar as
Example
This grammar recognizes a palindrome over the alphabet , with an optional digit in the middle.
Example strings accepted by the grammar include: and .
Left recursion
Left recursion happens when a grammar production refers to itself as its left-most element, either directly or indirectly. Since Packrat is a recursive descent parser, it cannot handle left recursion directly.[5] During the early stages of development, it was found that a production that is left-recursive can be transformed into a right-recursive production.[6] This modification significantly simplifies the task of a Packrat parser. Nonetheless, if there is an indirect left recursion involved, the process of rewriting can be quite complex and challenging. If the time complexity requirements are loosened from linear to superlinear, it is possible to modify the memoization table of a Packrat parser to permit left recursion, without altering the input grammar.[5]
Iterative combinator
The iterative combinator , , needs special attention when used in a Packrat parser. As a matter of fact, the use of iterative combinators introduces a secret recursion that does not record intermediate results in the outcome matrix. This can lead to the parser operating with a superlinear behaviour. This problem can be resolved apply the following transformation:[1]
Original
Translated
With this transformation, the intermediate results can be properly memoized.
Memoization technique
Memoization is an optimization technique in computing that aims to speed up programs by storing the results of expensive function calls. This technique essentially works by caching the results so that when the same inputs occur again, the cached result is simply returned, thus avoiding the time-consuming process of re-computing.[7] When using packrat parsing and memoization, it's noteworthy that the parsing function for each nonterminal is solely based on the input string. It does not depend on any information gathered during the parsing process. Essentially, memoization table entries do not affect or rely on the parser's specific state at any given time.[8]
Packrat parsing stores results in a matrix or similar data structure that allows for quick look-ups and insertions. When a production is encountered, the matrix is checked to see if it has already occurred. If it has, the result is retrieved from the matrix. If not, the production is evaluated, the result is inserted into the matrix, and then returned.[9] When evaluating the entire matrix in a tabular approach, it would require space.[9] Here, represents the number of nonterminals, and represents the input string size.
In a naïve implementation, the entire table can be derived from the input string starting from the end of the string.
The Packrat parser can be improved to update only the necessary cells in the matrix through a depth-first visit of each subexpression tree. Consequently, using a matrix with dimensions of is often wasteful, as most entries would remain empty.[5] These cells are linked to the input string, not to the nonterminals of the grammar. This means that increasing the input string size would always increase memory consumption, while the number of parsing rules changes only the worst space complexity.[1]
Cut operator
Another operator called cut has been introduced to Packrat to reduce its average space complexity even further. This operator utilizes the formal structures of many programming languages to eliminate impossible derivations. For instance, control statements parsing in a standard programming language is mutually exclusive from the first recognized token, e.g.,.[10]
Operator
Semantics
Cut
if is recognized but is not, skip the evaluation of the alternative.
In the first case don't evaluate if was recognized
The second rule is can be rewritten as and the same rules can be applied.
When a Packrat parser uses cut operators, it effectively clears its backtracking stack. This is because a cut operator reduces the number of possible alternatives in an ordered choice. By adding cut operators in the right places in a grammar's definition, the resulting Packrat parser only needs a nearly constant amount of space for memoization.[10]
The algorithm
Sketch of an implementation of a Packrat algorithm in a Lua-like pseudocode.[5]
INPUT(n)-- return the character at position nRULE(R:Rule,P:Position)entry=GET_MEMO(R,P)-- return the number of elements previously matched in rule R at position Pifentry==nilthenreturnEVAL(R,P);endreturnentry;EVAL(R:Rule,P:Position)start=P;forchoiceinR.choices-- Return a list of choiceacc=0;forsymbolinchoicethen-- Return each element of a rule, terminal and nonterminalifsymbol.is_terminalthenifINPUT(start+acc)==symbol.terminalthenacc=acc+1;--Found correct terminal skip pass itelsebreak;endelseres=RULE(symbol.nonterminal,start+acc);-- try to recognize a nonterminal in position start+accSET_MEMO(symbol.nonterminal,start+acc,res);-- we memoize also the failure with special value failifres==failthenbreak;endacc=acc+res;endifsymbol==choice.last-- check if we have matched the last symbol in a choice if so returnreturnacc;endendreturnfail;--if no choice match return fail
Example
Given the following context, a free grammar that recognizes simple arithmetic expressions composed of single digits interleaved by sum, multiplication, and parenthesis.
Denoted with the line terminator we can apply the packrat algorithm
Derivation of
Syntax tree
Action
Packrat Table
Derivation Rules
Input shifted
ɛ
Notes
Input left
Input doesn't match the first element in the derivation.
Backtrack to the first grammar rule with unexplored alternative
Index
1
2
3
4
5
6
7
S
A
M
P
D
2
*
(
3
+
4
)
No update because no terminal was recognized
Derivation Rules
Input shifted
Notes
Input left
Shift input by one after deriving terminal
Index
1
2
3
4
5
6
7
S
A
M
P
1
D
1
2
*
(
3
+
4
)
Update:
D(1) = 1;
P(1) = 1;
Derivation Rules
Input shifted
Notes
Input left
Shift input by two terminal
Index
1
2
3
4
5
6
7
S
A
M
P
1
D
1
2
*
(
3
+
4
)
No update because no nonterminal was fully recognized
Derivation Rules
Input shifted
Notes
Input left
Input doesn't match the first element in the derivation.
Backtrack to the first grammar rule with unexplored alternative
Index
1
2
3
4
5
6
7
S
A
M
P
1
D
1
2
*
(
3
+
4
)
No update because no terminal was recognized
Derivation Rules
Input shifted
Notes
Input left
Shift input by one after deriving terminal
but the new input will not match inside so an unroll is necessary to
Index
1
2
3
4
5
6
7
S
A
M
P
1
1
D
1
1
2
*
(
3
+
4
)
Update:
D(4) = 1;
P(4) = 1;
Derivation Rules
Input shifted
Notes
Input left
Roll Back to
And we don't expand it has we have an hit in the memoization table P(4) ≠ 0 so shift the input by P(4).
Shift also the from
Index
1
2
3
4
5
6
7
S
A
M
1
P
1
1
D
1
1
2
*
(
3
+
4
)
Hit on P(4)
Update M(4) = 1 as M was recognized
Derivation Rules
Input shifted
Notes
Input left
Input doesn't match the first element in the derivation.
Backtrack to the first grammar rule with unexplored alternative
Index
1
2
3
4
5
6
7
S
A
M
1
P
1
1
D
1
1
2
*
(
3
+
4
)
No update because no terminal was recognized
Derivation Rules
Input shifted
Notes
Input left
Shift input by one after deriving terminal
but the new input will not match inside so an unroll is necessary
Index
1
2
3
4
5
6
7
S
A
M
1
P
1
1
1
D
1
1
1
2
*
(
3
+
4
)
Update:
D(6) = 1;
P(6) = 1;
Derivation Rules
Input shifted
Notes
Input left
Roll Back to
And we don't expand it has we have an hit in the memoization table P(6) ≠ 0 so shift the input by P(6).
but the new input will not match inside so an unroll is necessary
Index
1
2
3
4
5
6
7
S
A
M
1
1
P
1
1
1
D
1
1
1
2
*
(
3
+
4
)
Hit on P(6)
Update M(6) = 1 as M was recognized
Derivation Rules
Input shifted
Notes
Input left
Roll Back to
And we don't expand it has we have an hit in the memoization table M(6) ≠ 0 so shift the input by M(6).
Also shift from
Index
1
2
3
4
5
6
7
S
A
3
M
1
1
P
1
5
1
1
D
1
1
1
2
*
(
3
+
4
)
Hit on M(6)
Update A(4) = 3 as A was recognized
Update P(3)=5 as P was recognized
Derivation Rules
Input shifted
Notes
Input left
Roll Back to as terminal
Index
1
2
3
4
5
6
7
S
A
3
M
1
1
P
1
5
1
1
D
1
1
1
2
*
(
3
+
4
)
No update because no terminal was recognized
Derivation Rules
Input shifted
Notes
Input left
we don't expand it as we have a hit in the memoization table P(3) ≠ 0, so shift the input by P(3).
Index
1
2
3
4
5
6
7
S
A
3
M
7
1
1
P
1
5
1
1
D
1
1
1
2
*
(
3
+
4
)
Hit on P(3)
Update M(1)=7 as M was recognized
Derivation Rules
Input shifted
Notes
Input left
Roll Back to as terminal
Index
1
2
3
4
5
6
7
S
A
3
M
7
1
1
P
1
5
1
1
D
1
1
1
2
*
(
3
+
4
)
No update because no terminal was recognized
Derivation Rules
Input shifted
Notes
Input left
We don't expand it as we have a hit in the memoization table M(1) ≠ 0, so shift the input by M(1).
S was totally reduced, so the input string is recognized.
^ abScience, International Journal of Scientific Research in; Ijsrset, Engineering and Technology. "A Survey of Packrat Parser". A Survey of Packrat Parser.