Recursive Descent Parser

programming compilers programmingconcept


First encountered in the syllabus of Dabeaz's (David Beazley) "Write a Compiler" course notebook

Also see Peter Norvig's ((How to Write a (Lisp) Interpreter (in Python))) (archived)


A recursive descent parser is a top-down parsing technique where each non-terminal in a grammar corresponds to a function in the parser. The parser starts at the top-level rule (usually the start symbol) and recursively calls functions for each sub-rule, consuming tokens from the input when they match expected patterns.


Uses a set of mutually recursive functions to process the input.


Limitations:

  • left recursion

    • Direct left recursion (A → Aα | β) causes infinite recursion and must be eliminated via transformation.

    • Indirect left recursion (A → Bα, B → Aβ) also needs resolution.
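As a sketch of the standard fix (hypothetical helper names, not from the code below): the left-recursive rule `expr ::= expr '+' term | term` is rewritten to the equivalent iterative form `expr ::= term { '+' term }*`, which a recursive descent parser implements as a loop rather than a self-call:

```python
def parse_expr(tokens, i=0):
    """expr ::= term { '+' term }*   (left recursion eliminated)

    A direct translation of expr ::= expr '+' term would call
    parse_expr() before consuming any input -> infinite recursion.
    Returns (value, next_position).
    """
    value, i = parse_term(tokens, i)
    while i < len(tokens) and tokens[i] == '+':
        right, i = parse_term(tokens, i + 1)
        value += right
    return value, i

def parse_term(tokens, i):
    """term ::= NUM   (single digits, for brevity)"""
    if i >= len(tokens) or not tokens[i].isdigit():
        raise SyntaxError('Expected NUM')
    return int(tokens[i]), i + 1
```

For example, `parse_expr(list("1+2+3"))` returns `(6, 5)`: the loop consumes each `'+' term` pair left to right, so the result still associates to the left even though the grammar is no longer left-recursive.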


Dabeaz implementation from Python Cookbook

David Beazley's (Dabeaz) code from the Python Cookbook (3rd edition):


You should start with a formal specification of the grammar in BNF (Backus-Naur Form) or EBNF (Extended Backus-Naur Form). The grammar of algebraic expressions:

expr ::= expr + term
	| expr - term
	| term
term ::= term * factor
	| term / factor
	| factor
factor ::= ( expr )
	| NUM

import re
import collections

# Token specification
NUM = r'(?P<NUM>\d+)'
PLUS = r'(?P<PLUS>\+)'
MINUS = r'(?P<MINUS>-)'
TIMES = r'(?P<TIMES>\*)'
DIVIDE = r'(?P<DIVIDE>/)'
LPAREN = r'(?P<LPAREN>\()'
RPAREN = r'(?P<RPAREN>\))'
WS = r'(?P<WS>\s+)'

master_pat = re.compile('|'.join([NUM, PLUS, MINUS, TIMES,
                                  DIVIDE, LPAREN, RPAREN, WS]))

# Tokenizer
Token = collections.namedtuple('Token', ['type','value'])

def generate_tokens(text):
	scanner = master_pat.scanner(text)
	for m in iter(scanner.match, None):
		tok = Token(m.lastgroup, m.group())
		if tok.type != 'WS':
			yield tok

# Parser
class ExpressionEvaluator:
	'''
	Implementation of a recursive descent parser. Each method
	implements a single grammar rule. Use the ._accept() method
	to test and accept the current lookahead token. Use the ._expect()
	method to exactly match and discard the next token on the input
	(or raise a SyntaxError if it doesn't match).
	'''
	def parse(self,text):
		self.tokens = generate_tokens(text)
		self.tok = None   # Last symbol consumed
		self.nexttok = None  # Next symbol tokenized
		self._advance() # Load first lookahead token
		return self.expr()
		
	def _advance(self):
		'Advance one token ahead'
		self.tok, self.nexttok = self.nexttok, next(self.tokens, None)
		
	def _accept(self,toktype):
		'Test and consume the next token if it matches toktype'
		if self.nexttok and self.nexttok.type == toktype:
			self._advance()
			return True
		else:
			return False
			
	def _expect(self,toktype):
		'Consume next token if it matches toktype or raise SyntaxError'
		if not self._accept(toktype):
			raise SyntaxError('Expected ' + toktype)

	# Grammar rules follow
		
	def expr(self):
		"expression ::= term { ('+'|'-') term }*"
		exprval = self.term()
		while self._accept('PLUS') or self._accept('MINUS'):
			op = self.tok.type
			right = self.term()
			if op == 'PLUS':
				exprval += right
			elif op == 'MINUS':
				exprval -= right
		return exprval
		
	def term(self):
		"term ::= factor { ('*'|'/') factor }*"
		termval = self.factor()
		while self._accept('TIMES') or self._accept('DIVIDE'):
			op = self.tok.type
			right = self.factor()
			if op == 'TIMES':
				termval *= right
			elif op == 'DIVIDE':
				termval /= right
		return termval
		
	def factor(self):
		"factor ::= NUM | ( expr )"
		if self._accept('NUM'):
			return int(self.tok.value)
		elif self._accept('LPAREN'):
			exprval = self.expr()
			self._expect('RPAREN')
			return exprval
		else:
			raise SyntaxError('Expected NUMBER or LPAREN')

How it works in the interactive interpreter:

>>> e = ExpressionEvaluator()
>>> e.parse('2 + 3')
5
>>> e.parse('2 + 3 * 4')
14
>>> e.parse('2 + (3 + 4) * 5')
37
>>> e.parse('2 + 3 + 4')
9
>>>
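The Cookbook also shows a variant that builds a parse tree instead of computing a value. A sketch of that idea follows (the tokenizer plumbing is repeated so the snippet runs on its own); each grammar rule returns a tuple (op, left, right) rather than a number:

```python
import re
import collections

Token = collections.namedtuple('Token', ['type', 'value'])
master_pat = re.compile(
    r'(?P<NUM>\d+)|(?P<PLUS>\+)|(?P<MINUS>-)|(?P<TIMES>\*)'
    r'|(?P<DIVIDE>/)|(?P<LPAREN>\()|(?P<RPAREN>\))|(?P<WS>\s+)')

def generate_tokens(text):
    for m in iter(master_pat.scanner(text).match, None):
        if m.lastgroup != 'WS':
            yield Token(m.lastgroup, m.group())

class ExpressionTreeBuilder:
    '''Same grammar rules as ExpressionEvaluator, but each rule
    builds a (op, left, right) tuple instead of computing a value.'''
    def parse(self, text):
        self.tokens = generate_tokens(text)
        self.tok = None
        self.nexttok = None
        self._advance()
        return self.expr()

    def _advance(self):
        self.tok, self.nexttok = self.nexttok, next(self.tokens, None)

    def _accept(self, toktype):
        if self.nexttok and self.nexttok.type == toktype:
            self._advance()
            return True
        return False

    def _expect(self, toktype):
        if not self._accept(toktype):
            raise SyntaxError('Expected ' + toktype)

    def expr(self):
        "expression ::= term { ('+'|'-') term }*"
        exprval = self.term()
        while self._accept('PLUS') or self._accept('MINUS'):
            op = '+' if self.tok.type == 'PLUS' else '-'
            exprval = (op, exprval, self.term())
        return exprval

    def term(self):
        "term ::= factor { ('*'|'/') factor }*"
        termval = self.factor()
        while self._accept('TIMES') or self._accept('DIVIDE'):
            op = '*' if self.tok.type == 'TIMES' else '/'
            termval = (op, termval, self.factor())
        return termval

    def factor(self):
        "factor ::= NUM | ( expr )"
        if self._accept('NUM'):
            return int(self.tok.value)
        elif self._accept('LPAREN'):
            exprval = self.expr()
            self._expect('RPAREN')
            return exprval
        raise SyntaxError('Expected NUMBER or LPAREN')
```

For example, `ExpressionTreeBuilder().parse('2 + 3 * 4')` gives `('+', 2, ('*', 3, 4))`: the tree nesting reflects operator precedence directly, because the grammar puts `*` and `/` one level below `+` and `-`.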

The overall idea of writing a recursive descent parser is generally simple.

To start, you take every grammar rule and turn it into a function or method. So, if your grammar looks like this:


expr ::= term { ('+'|'-') term }*
term ::= factor { ('*'|'/') factor }*
factor ::= '(' expr ')'
	| NUM

You start by turning it into a set of methods like this:

class ExpressionEvaluator:
	...
	def expr(self):
		...
	def term(self):
		...
	def factor(self):
		...

Key Characteristics


  1. Top-Down Parsing

    • It begins at the start symbol and expands productions recursively.

    • Works best for LL(k) grammars (left-to-right parsing, leftmost derivation, k-token lookahead).

  2. Backtracking (if naive)

    • If a rule fails, the parser may revert to an earlier state and try another rule.

    • Simple implementations suffer from exponential backtracking, making them impractical for ambiguous grammars.

  3. Handling of Left Recursion

    • Both direct (A → Aα | β) and indirect (A → Bα, B → Aβ) left recursion cause infinite recursion and must be eliminated by rewriting the grammar, as noted under Limitations above.

  4. Lookahead (LL(k))

    • Typically uses one-token lookahead (LL(1)) but can be extended to LL(k) for more complex grammars.

    • With predictive parsing, backtracking is avoided if the lookahead uniquely determines the next rule.

  5. Mutual Recursion Between Functions

    • Each grammar rule is implemented as a recursive function.

    • Sub-rules are handled by calling appropriate functions.

  6. Efficiency Considerations

    • Left factoring is needed if multiple rules share common prefixes.

    • Packrat parsing (memoization) can be applied to eliminate redundant computations.








Barış Özmen © 2025