# The Earley Algorithm Explained
Let be a grammar
A state is an element
is a production in the set of grammar productions , and - The dot • represents the current position in that production and
- The origin position
is the position in the input at which the matching of this production began: .
The set of active states when the input prefix
- Input position 0 is the position prior to input.
- Input position
n
is the position after accepting the nth token. Informally, input positions can be thought of as locations at token boundaries.
More precisely,
Observe that it holds that
Idea
So the idea is that if state
A state is said finished when its current position is the last position of the right side of the production
The parser is seeded with
- prediction,
- scanning, and
- completion.
These three operations are repeated until no new states can be added to the set
# Prediction
(where j is the start of the substring), and is in the set of non terminals - add states
to grammar productions having Y on the left hand side.
Duplicate states are not added to the state set, only new ones.
# Scanning
If
# Completion
# Acceptance
The algorithm accepts if an state with the form
If there are several of these states it means the grammar is ambiguous.
If there are none, the input is rejected.
# Algorithm Pseudocode
We augment the initial grammar with the rule γ → •S
We also assume that the lexical analysis has been made and the parameter tokens
contains an array with the token objects of the input string.
function EarleyParse(tokens, grammar) {
function init(length) {
let S = [];
for(k =0; k <= length; k++) {
S[k] = new Set();
}
return S;
}
let S = init(tokens.length);
function PREDICTOR([A → α•Bβ, j], k) {
grammar.rules(B).forEach((B → γ) => S[k].add([B → •γ, k]))
}
function SCANNER([A → α•aβ, j], k) {
// "a" is a terminal like ID while tokens[k] is a token object with a "type" and a value like "temp"
if (tokens[k].type == a) S[k+1].add([A → αa•β, j])
}
function COMPLETER([B → γ•, s], k) {
S[s].filter([A → α•Bβ, j]) => S[k].add([A → αB•β, j])
}
S[0].add([γ → •grammar.start, 0]);
for(k = 0; k <= tokens.length; k++) {
S[k].forEach(state => {
if (!state.isFinished())
if (state.nextElement() in grammar.nonTerminal)
PREDICTOR(state, k) // non-terminal
else
SCANNER(state, k) // terminal
else
COMPLETER(state, k) // state matches the pattern [B → γ•, s]
})
}
return S;
}
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
En este código γ
denota una nueva variable sintáctica auxiliar que produce el símbolo de arranque grammar.start
.
La función de aceptación sería:
function ACCEPTS(S, tokens) {
return S[tokens.length].has((γ → S•, 0))
}
2
3
# Ambiguous Grammar Example
Consider the following grammar:
➜ examples git:(main) ✗ cat nearley-algorithm-ambiguous.ne
e -> e "-" e {% ([L, _, R]) => `minus(${L}, ${R})` %}
| "1" {% id %}
2
We compile it:
➜ examples git:(main) ✗ npx nearleyc nearley-algorithm-ambiguous.ne -o nearley-algorithm-ambiguous.js
and the trace the execution:
➜ examples git:(main) ✗ npx nearley-test nearley-algorithm-ambiguous.js -i '1-1-1'
Table length: 6
Number of parses: 2
Parse Charts
Chart: 0
0: {e → ● e "-" e}, from: 0
1: {e → ● "1"}, from: 0
Chart: 1
0: {e → "1" ● }, from: 0
1: {e → e ● "-" e}, from: 0
Chart: 2
0: {e → e "-" ● e}, from: 0
1: {e → ● e "-" e}, from: 2
2: {e → ● "1"}, from: 2
Chart: 3
0: {e → "1" ● }, from: 2
1: {e → e ● "-" e}, from: 2
2: {e → e "-" e ● }, from: 0
3: {e → e ● "-" e}, from: 0
Chart: 4
0: {e → e "-" ● e}, from: 0
1: {e → e "-" ● e}, from: 2
2: {e → ● e "-" e}, from: 4
3: {e → ● "1"}, from: 4
Chart: 5
0: {e → "1" ● }, from: 4
1: {e → e ● "-" e}, from: 4
2: {e → e "-" e ● }, from: 2
3: {e → e "-" e ● }, from: 0
4: {e → e ● "-" e}, from: 2
5: {e → e "-" e ● }, from: 0
6: {e → e ● "-" e}, from: 0
7: {e → e ● "-" e}, from: 0
Parse results:
[ 'minus(minus(1, 1), 1)', 'minus(1, minus(1, 1))' ]
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
To get the full derivations from the chart, you need to search for a completed state matching the pattern e →
from 0
in the last chart.
Chart: 5
0: {e → "1" ● }, from: 4
1: {e → e ● "-" e}, from: 4
2: {e → e "-" e ● }, from: 2
3: {e → e "-" e ● }, from: 0
4: {e → e ● "-" e}, from: 2
6: {e → e ● "-" e}, from: 0
2
3
4
5
6
7
The fact that state 3 {e → e "-" e ● }, from: 0
exist means the state {e → e "-" ● e }, from: 0
must also exist somewhere.
In this case such state appears in Charts 2 and 4.
The advance from {e → e "-" ● e }, from: 0
(Chart 4) to {e → e "-" e ● }, from: 0
in Chart 5
can be produced due to any of the completed states
0 and 2 in Chart 5 0: {e → "1" ● } from: 4
and 2: {e → e "-" e ● }, from: 2
, that leads to two different syntax trees for 1-1-1
:
Parse results:
[ 'minus(minus(1, 1), 1)', 'minus(1, minus(1, 1))' ]
2
# Second Example
Consider this example from Wikipedia (opens new window):
# https://en.wikipedia.org/wiki/Earley_parser#Example
P -> S
S -> S "+" M
| M
M -> M "*" T
| T
T -> "1"
| "2"
| "3"
| "4"
2
3
4
5
6
7
8
9
10
11
12
13
14
executed with nearley-test wikipedia.js -i '2+3*4'
➜ examples git:(main) ✗ nearleyc wikipedia.ne -o wikipedia.js
➜ examples git:(main) ✗ nearley-test wikipedia.js -i '2+3*4'
Table length: 6
Number of parses: 1
Parse Charts
Chart: 0
0: {P → ● S}, from: 0
1: {S → ● S "+" M}, from: 0
2: {S → ● M}, from: 0
3: {M → ● M "*" T}, from: 0
4: {M → ● T}, from: 0
5: {T → ● "1"}, from: 0
6: {T → ● "2"}, from: 0
7: {T → ● "3"}, from: 0
8: {T → ● "4"}, from: 0
Chart: 1
0: {T → "2" ● }, from: 0
1: {M → T ● }, from: 0
2: {M → M ● "*" T}, from: 0
3: {S → M ● }, from: 0
4: {S → S ● "+" M}, from: 0
5: {P → S ● }, from: 0
Chart: 2
0: {S → S "+" ● M}, from: 0
1: {M → ● M "*" T}, from: 2
2: {M → ● T}, from: 2
3: {T → ● "1"}, from: 2
4: {T → ● "2"}, from: 2
5: {T → ● "3"}, from: 2
6: {T → ● "4"}, from: 2
Chart: 3
0: {T → "3" ● }, from: 2
1: {M → T ● }, from: 2
2: {M → M ● "*" T}, from: 2
3: {S → S "+" M ● }, from: 0
4: {S → S ● "+" M}, from: 0
5: {P → S ● }, from: 0
Chart: 4
0: {M → M "*" ● T}, from: 2
1: {T → ● "1"}, from: 4
2: {T → ● "2"}, from: 4
3: {T → ● "3"}, from: 4
4: {T → ● "4"}, from: 4
Chart: 5
0: {T → "4" ● }, from: 4
1: {M → M "*" T ● }, from: 2
2: {M → M ● "*" T}, from: 2
3: {S → S "+" M ● }, from: 0
4: {S → S ● "+" M}, from: 0
5: {P → S ● }, from: 0
Parse results:
[
[
[ [ [ [ '2' ] ] ], '+', [ [ [ '3' ] ], '*', [ '4' ] ] ]
]
]
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
The state {P → S ● }, from: 0
represents a completed parse. This state also appears in charts 1
(corresponding to 2 • + 3 * 4
) and 3 (2 + 3 • * 4
), since they are complete sentences.
# Loup Blog "Earley Parsing Explained"
# Advanced: Optimizing Right Recursion. Loup blog
Joop Leo’s optimizations for right-Recursion original paper
# Toby Ho video "on the Earley Parsing Algorithm"
Toby Ho illustrates step by step the algorithm.
# Video on "Earley Parsing" by Natalie Parde
# "Early Parser" in the Wikipedia
# Marpa: a Perl parser generator based on the Earley Algorithm
See What is the Marpa algorithm? (opens new window) By Jeffrey Kegler on November 18, 2011 in Jeffrey Kegler's Blog (opens new window)
# Other Earley Parsers in JS
# NearleyJS: A parser generator based on the Earley Algorithm
See section NearleyJS
# lagodiuk/earley-parser-js
See repo lagodiuk/earley-parser-js (opens new window).
The implementation is in file earley-oop.js (opens new window).
A fork of this repo for ULL-ESIT-PL is in https://github.com/ULL-ESIT-PL/earley-parser-js (opens new window)
# Others
See also npm packages