Recursive Descent Parser

programming compilers programmingconcept


First encountered in the syllabus of Dabeaz's (David Beazley) "Write a Compiler" course notebook

Also see Peter Norvig's ((How to Write a (Lisp) Interpreter (in Python))) (archived)


A recursive descent parser is a top-down parsing technique where each non-terminal in a grammar corresponds to a function in the parser. The parser starts at the top-level rule (usually the start symbol) and recursively calls functions for each sub-rule, consuming tokens from the input when they match expected patterns.


Uses a set of mutually recursive functions to process the input.


Limitations:

  • left recursion

    • Direct left recursion (A → Aα | β) causes infinite recursion and must be eliminated via transformation.

    • Indirect left recursion (A → Bα, B → Aβ) also needs resolution.
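As a sketch of the standard fix (hypothetical helper names, not from the code below): the left-recursive rule `expr ::= expr '+' term | term` is rewritten to the equivalent iterative form `expr ::= term { '+' term }*`, which a recursive descent parser implements as a loop rather than a self-call:

```python
def parse_expr(tokens, i=0):
    """expr ::= term { '+' term }*   (left recursion eliminated)

    A direct translation of expr ::= expr '+' term would call
    parse_expr() before consuming any input -> infinite recursion.
    Returns (value, next_position).
    """
    value, i = parse_term(tokens, i)
    while i < len(tokens) and tokens[i] == '+':
        right, i = parse_term(tokens, i + 1)
        value += right
    return value, i

def parse_term(tokens, i):
    """term ::= NUM   (single digits, for brevity)"""
    if i >= len(tokens) or not tokens[i].isdigit():
        raise SyntaxError('Expected NUM')
    return int(tokens[i]), i + 1
```

For example, `parse_expr(list("1+2+3"))` returns `(6, 5)`: the loop consumes each `'+' term` pair left to right, so the result still associates to the left even though the grammar is no longer left-recursive.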


Dabeaz implementation from Python Cookbook

David Beazley's (Dabeaz) code from the Python Cookbook (3rd edition):


You should start with a formal specification of the grammar in BNF (Backus-Naur Form) or EBNF (Extended Backus-Naur Form). The grammar of algebraic expressions:

expr ::= expr + term
	| expr - term
	| term
term ::= term * factor
	| term / factor
	| factor
factor ::= ( expr )
	| NUM

import re
import collections

# Token specification
NUM = r'(?P<NUM>\d+)'
PLUS = r'(?P<PLUS>\+)'
MINUS = r'(?P<MINUS>-)'
TIMES = r'(?P<TIMES>\*)'
DIVIDE = r'(?P<DIVIDE>/)'
LPAREN = r'(?P<LPAREN>\()'
RPAREN = r'(?P<RPAREN>\))'
WS = r'(?P<WS>\s+)'

master_pat = re.compile('|'.join([NUM, PLUS, MINUS, TIMES,
                                  DIVIDE, LPAREN, RPAREN, WS]))

# Tokenizer
Token = collections.namedtuple('Token', ['type','value'])

def generate_tokens(text):
	scanner = master_pat.scanner(text)
	for m in iter(scanner.match, None):
		tok = Token(m.lastgroup, m.group())
		if tok.type != 'WS':
			yield tok

# Parser
class ExpressionEvaluator:
	'''
	Implementation of a recursive descent parser. Each method
	implements a single grammar rule. Use the ._accept() method
	to test and accept the current lookahead token. Use the ._expect()
	method to exactly match and discard the next token on the input
	(or raise a SyntaxError if it doesn't match).
	'''
	def parse(self,text):
		self.tokens = generate_tokens(text)
		self.tok = None   # Last symbol consumed
		self.nexttok = None  # Next symbol tokenized
		self._advance() # Load first lookahead token
		return self.expr()
		
	def _advance(self):
		'Advance one token ahead'
		self.tok, self.nexttok = self.nexttok, next(self.tokens, None)
		
	def _accept(self,toktype):
		'Test and consume the next token if it matches toktype'
		if self.nexttok and self.nexttok.type == toktype:
			self._advance()
			return True
		else:
			return False
			
	def _expect(self,toktype):
		'Consume next token if it matches toktype or raise SyntaxError'
		if not self._accept(toktype):
			raise SyntaxError('Expected ' + toktype)

	# Grammar rules follow
		
	def expr(self):
		"expression ::= term { ('+'|'-') term }*"
		exprval = self.term()
		while self._accept('PLUS') or self._accept('MINUS'):
			op = self.tok.type
			right = self.term()
			if op == 'PLUS':
				exprval += right
			elif op == 'MINUS':
				exprval -= right
		return exprval
		
	def term(self):
		"term ::= factor { ('*'|'/') factor }*"
		termval = self.factor()
		while self._accept('TIMES') or self._accept('DIVIDE'):
			op = self.tok.type
			right = self.factor()
			if op == 'TIMES':
				termval *= right
			elif op == 'DIVIDE':
				termval /= right
		return termval
		
	def factor(self):
		"factor ::= NUM | ( expr )"
		if self._accept('NUM'):
			return int(self.tok.value)
		elif self._accept('LPAREN'):
			exprval = self.expr()
			self._expect('RPAREN')
			return exprval
		else:
			raise SyntaxError('Expected NUMBER or LPAREN')

How it works in the interactive interpreter:

>>> e = ExpressionEvaluator()
>>> e.parse('2 + 3')
5
>>> e.parse('2 + 3 * 4')
14
>>> e.parse('2 + (3 + 4) * 5')
37
>>> e.parse('2 + 3 + 4')
9
>>>
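The Cookbook also shows a variant that builds a parse tree instead of computing a value. A sketch of that idea follows (the tokenizer plumbing is repeated so the snippet runs on its own); each grammar rule returns a tuple (op, left, right) rather than a number:

```python
import re
import collections

Token = collections.namedtuple('Token', ['type', 'value'])
master_pat = re.compile(
    r'(?P<NUM>\d+)|(?P<PLUS>\+)|(?P<MINUS>-)|(?P<TIMES>\*)'
    r'|(?P<DIVIDE>/)|(?P<LPAREN>\()|(?P<RPAREN>\))|(?P<WS>\s+)')

def generate_tokens(text):
    for m in iter(master_pat.scanner(text).match, None):
        if m.lastgroup != 'WS':
            yield Token(m.lastgroup, m.group())

class ExpressionTreeBuilder:
    '''Same grammar rules as ExpressionEvaluator, but each rule
    builds a (op, left, right) tuple instead of computing a value.'''
    def parse(self, text):
        self.tokens = generate_tokens(text)
        self.tok = None
        self.nexttok = None
        self._advance()
        return self.expr()

    def _advance(self):
        self.tok, self.nexttok = self.nexttok, next(self.tokens, None)

    def _accept(self, toktype):
        if self.nexttok and self.nexttok.type == toktype:
            self._advance()
            return True
        return False

    def _expect(self, toktype):
        if not self._accept(toktype):
            raise SyntaxError('Expected ' + toktype)

    def expr(self):
        "expression ::= term { ('+'|'-') term }*"
        exprval = self.term()
        while self._accept('PLUS') or self._accept('MINUS'):
            op = '+' if self.tok.type == 'PLUS' else '-'
            exprval = (op, exprval, self.term())
        return exprval

    def term(self):
        "term ::= factor { ('*'|'/') factor }*"
        termval = self.factor()
        while self._accept('TIMES') or self._accept('DIVIDE'):
            op = '*' if self.tok.type == 'TIMES' else '/'
            termval = (op, termval, self.factor())
        return termval

    def factor(self):
        "factor ::= NUM | ( expr )"
        if self._accept('NUM'):
            return int(self.tok.value)
        elif self._accept('LPAREN'):
            exprval = self.expr()
            self._expect('RPAREN')
            return exprval
        raise SyntaxError('Expected NUMBER or LPAREN')
```

For example, `ExpressionTreeBuilder().parse('2 + 3 * 4')` gives `('+', 2, ('*', 3, 4))`: the tree nesting reflects operator precedence directly, because the grammar puts `*` and `/` one level below `+` and `-`.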

The overall idea of writing a recursive descent parser is generally simple.

To start, you take every grammar rule and turn it into a function or method. So, if your grammar looks like this:


expr ::= term { ('+'|'-') term }*
term ::= factor { ('*'|'/') factor }*
factor ::= '(' expr ')'
	| NUM

You start by turning it into a set of methods like this:

class ExpressionEvaluator:
	...
	def expr(self):
		...
	def term(self):
		...
	def factor(self):
		...

Key Characteristics


  1. Top-Down Parsing

    • It begins at the start symbol and expands productions recursively.

    • Works best for LL(k) grammars (left-to-right parsing, leftmost derivation, k-token lookahead).

  2. Backtracking (if naive)

    • If a rule fails, the parser may revert to an earlier state and try another rule.

    • Simple implementations suffer from exponential backtracking, making them impractical for ambiguous grammars.

  3. Handling of Left Recursion

    • Both direct (A → Aα | β) and indirect (A → Bα, B → Aβ) left recursion cause infinite recursion and must be eliminated by rewriting the grammar, as noted under Limitations above.

  4. Lookahead (LL(k))

    • Typically uses one-token lookahead (LL(1)) but can be extended to LL(k) for more complex grammars.

    • With predictive parsing, backtracking is avoided if the lookahead uniquely determines the next rule.

  5. Mutual Recursion Between Functions

    • Each grammar rule is implemented as a recursive function.

    • Sub-rules are handled by calling appropriate functions.

  6. Efficiency Considerations

    • Left factoring is needed if multiple rules share common prefixes.

    • Packrat parsing (memoization) can be applied to eliminate redundant computations.








Barış Özmen © 2025