# The Earley Algorithm Explained

Let be a grammar and an input sequence of tokens .

A state is an element where

is a production in the set of grammar productions , and
The dot • represents the current position in that production and
The origin position is the position in the input at which the matching of this production began: .

The set of active states when the input prefix is being analyzed is called .

Input position 0 is the position prior to input.
Input position n is the position after accepting the nth token. Informally, input positions can be thought of as locations at token boundaries.

More precisely, is the set of states whose production rule appears in a derivation from the symbol

Observe that it holds that

Idea

So the idea is that if state is in (with ) is because was used in some partial parse tree that produces and has derived onto .

A state is said finished when its current position is the last position of the right side of the production , that is, when there is no symbol to the right of the dot • in the visual representation of the state.

The parser is seeded with consisting of only the top-level rule. The parser then repeatedly executes three operations:

prediction,
scanning, and
completion.

These three operations are repeated until no new states can be added to the set

# Prediction

(where j is the start of the substring), and is in the set of non terminals
add states to grammar productions having Y on the left hand side.

Duplicate states are not added to the state set, only new ones.

# Scanning

If is the next terminal in the input stream, then of the form , add to .

# Completion

of the form , find all states in of the form and add to .

# Acceptance

The algorithm accepts if an state with the form ends up in , where is the start symbol of the grammar and is the input length.

If there are several of these states it means the grammar is ambiguous.

If there are none, the input is rejected.

# Algorithm Pseudocode

We augment the initial grammar with the rule γ → •S

We also assume that the lexical analysis has been made and the parameter tokens contains an array with the token objects of the input string.

function EarleyParse(tokens, grammar) {

    function init(length) {
        let S = [];
        for(k =0; k <= length; k++) {
          S[k] = new Set();
        }
        return S;     
    }

    let S = init(tokens.length);

    function PREDICTOR([A → α•Bβ, j], k) {
      grammar.rules(B).forEach((B → γ) => S[k].add([B → •γ, k]))
    }

    function SCANNER([A → α•aβ, j], k) {
      // "a" is a terminal like ID while tokens[k] is a token object with a "type" and a value like "temp"
      if (tokens[k].type == a) S[k+1].add([A → αa•β, j])
    }

    function COMPLETER([B → γ•, s], k) {
      S[s].filter([A → α•Bβ, j]) => S[k].add([A → αB•β, j])
    }

    S[0].add([γ → •grammar.start, 0]); 
    for(k = 0; k <= tokens.length; k++) {
        S[k].forEach(state => {  
            if (!state.isFinished()) 
                if (state.nextElement() in grammar.nonTerminal) 
                    PREDICTOR(state, k)         // non-terminal
                else
                    SCANNER(state, k)             // terminal
            else 
                COMPLETER(state, k)  // state matches the pattern [B → γ•, s]     
        })
    }
    return S;
}

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39

En este código γ denota una nueva variable sintáctica auxiliar que produce el símbolo de arranque grammar.start.

La función de aceptación sería:

function  ACCEPTS(S, tokens) {
  return S[tokens.length].has((γ → S•, 0))
}

1
2
3

# Ambiguous Grammar Example

Consider the following grammar:

➜  examples git:(main) ✗ cat nearley-algorithm-ambiguous.ne

e -> e "-" e {% ([L, _, R]) => `minus(${L}, ${R})` %}
   | "1"     {% id %}

1
2

We compile it:

➜  examples git:(main) ✗ npx nearleyc nearley-algorithm-ambiguous.ne -o nearley-algorithm-ambiguous.js

and the trace the execution:

➜  examples git:(main) ✗ npx nearley-test nearley-algorithm-ambiguous.js -i '1-1-1'  
Table length: 6
Number of parses: 2
Parse Charts
Chart: 0
0: {e →  ● e "-" e}, from: 0
1: {e →  ● "1"}, from: 0

Chart: 1
0: {e → "1" ● }, from: 0
1: {e → e ● "-" e}, from: 0

Chart: 2
0: {e → e "-" ● e}, from: 0
1: {e →  ● e "-" e}, from: 2
2: {e →  ● "1"}, from: 2

Chart: 3
0: {e → "1" ● }, from: 2
1: {e → e ● "-" e}, from: 2
2: {e → e "-" e ● }, from: 0
3: {e → e ● "-" e}, from: 0

Chart: 4
0: {e → e "-" ● e}, from: 0
1: {e → e "-" ● e}, from: 2
2: {e →  ● e "-" e}, from: 4
3: {e →  ● "1"}, from: 4

Chart: 5
0: {e → "1" ● }, from: 4
1: {e → e ● "-" e}, from: 4
2: {e → e "-" e ● }, from: 2
3: {e → e "-" e ● }, from: 0
4: {e → e ● "-" e}, from: 2
5: {e → e "-" e ● }, from: 0
6: {e → e ● "-" e}, from: 0
7: {e → e ● "-" e}, from: 0


Parse results: 
[ 'minus(minus(1, 1), 1)', 'minus(1, minus(1, 1))' ]

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42

To get the full derivations from the chart, you need to search for a completed state matching the pattern e → ● from 0 in the last chart.

Chart: 5
0: {e → "1" ● }, from: 4
1: {e → e ● "-" e}, from: 4
2: {e → e "-" e ● }, from: 2
3: {e → e "-" e ● }, from: 0
4: {e → e ● "-" e}, from: 2
6: {e → e ● "-" e}, from: 0

1
2
3
4
5
6
7

The fact that state 3 {e → e "-" e ● }, from: 0 exist means the state {e → e "-" ● e }, from: 0 must also exist somewhere. In this case such state appears in Charts 2 and 4.

The advance from {e → e "-" ● e }, from: 0 (Chart 4) to {e → e "-" e ● }, from: 0 in Chart 5 can be produced due to any of the completed states 0 and 2 in Chart 5 0: {e → "1" ● } from: 4 and 2: {e → e "-" e ● }, from: 2, that leads to two different syntax trees for 1-1-1:

Parse results: 
[ 'minus(minus(1, 1), 1)', 'minus(1, minus(1, 1))' ]

1
2

# Second Example

Consider this example from Wikipedia (opens new window):

# https://en.wikipedia.org/wiki/Earley_parser#Example

P -> S

S -> S "+" M
   | M

M -> M "*" T
   | T

T -> "1"
   | "2"
   | "3"
   | "4"

1
2
3
4
5
6
7
8
9
10
11
12
13
14

executed with nearley-test wikipedia.js -i '2+3*4'

➜  examples git:(main) ✗ nearleyc wikipedia.ne -o wikipedia.js 
➜  examples git:(main) ✗ nearley-test wikipedia.js -i '2+3*4'
Table length: 6
Number of parses: 1
Parse Charts
Chart: 0
0: {P →  ● S}, from: 0
1: {S →  ● S "+" M}, from: 0
2: {S →  ● M}, from: 0
3: {M →  ● M "*" T}, from: 0
4: {M →  ● T}, from: 0
5: {T →  ● "1"}, from: 0
6: {T →  ● "2"}, from: 0
7: {T →  ● "3"}, from: 0
8: {T →  ● "4"}, from: 0

Chart: 1
0: {T → "2" ● }, from: 0
1: {M → T ● }, from: 0
2: {M → M ● "*" T}, from: 0
3: {S → M ● }, from: 0
4: {S → S ● "+" M}, from: 0
5: {P → S ● }, from: 0

Chart: 2
0: {S → S "+" ● M}, from: 0
1: {M →  ● M "*" T}, from: 2
2: {M →  ● T}, from: 2
3: {T →  ● "1"}, from: 2
4: {T →  ● "2"}, from: 2
5: {T →  ● "3"}, from: 2
6: {T →  ● "4"}, from: 2

Chart: 3
0: {T → "3" ● }, from: 2
1: {M → T ● }, from: 2
2: {M → M ● "*" T}, from: 2
3: {S → S "+" M ● }, from: 0
4: {S → S ● "+" M}, from: 0
5: {P → S ● }, from: 0

Chart: 4
0: {M → M "*" ● T}, from: 2
1: {T →  ● "1"}, from: 4
2: {T →  ● "2"}, from: 4
3: {T →  ● "3"}, from: 4
4: {T →  ● "4"}, from: 4

Chart: 5
0: {T → "4" ● }, from: 4
1: {M → M "*" T ● }, from: 2
2: {M → M ● "*" T}, from: 2
3: {S → S "+" M ● }, from: 0
4: {S → S ● "+" M}, from: 0
5: {P → S ● }, from: 0


Parse results: 
[
  [
    [ [ [ [ '2' ] ] ], '+', [ [ [ '3' ] ], '*', [ '4' ] ] ]
  ]
]

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63

The state {P → S ● }, from: 0 represents a completed parse. This state also appears in charts 1 (corresponding to 2 • + 3 * 4) and 3 (2 + 3 • * 4), since they are complete sentences.