# Gramáticas y Lenguajes Generados

# Gramáticas Independientes del Contexto

Supongamos una gramática con alfabeto , conjunto de variables sintácticas (o no terminales) , reglas de producción y símbolo de arranque .

Por ejemplo, en la gramática de Egg este es el conjunto de reglas de producción:

expression: STRING
          | NUMBER
          | WORD apply

apply: /* vacio */
     | '(' (expression ',')* expression? ')' apply

1
2
3
4
5
6

Sólo hay dos variables sintácticas . El símbolo de arranque es .

El conjunto de tokens es:

Observe que algunos de los tokens son a su vez lenguajes de cierta complejidad, cuya definición está en otro nivel de abstracción, el nivel léxico y que se pueden definir mediante un mecanismo mas secillo como son las expresiones regulares.

Por ejemplo, en una definición de Egg inicial podríamos definir así lo que entendemos por espacios o blancos, esto es, que partes del texto no son significativas para que nuestro programa pueda entender la estructura de la frase:

WHITES = /(\s|[#;].*|\/\*(.|\n)*?\*\/)*/

así como los tokens mas complejos:

STRING = /"((?:[^"\\]|\\.)*)"/
NUMBER = /([-+]?\d*\.?\d+([eE][-+]?\d+)?)/
WORD   = /([^\s(),"]+)/

1
2
3

# Ejercicio

Construye una derivación para la frase

print(**(g,f)(8))

Observa que el resultado del análisis léxico sería un stream como este:

WORD["print"] "(" WORD[**] "(" WORD[g] "," WORD[f] ")" "(" NUMBER[8] ")" ")"

Solución:

En la solución que sigue, abreviamos expression por , apply por , WORD por y NUMBER por :

(Aquí )

(Ya que )

(Aquí hizo )

(La última a hizo )

(La última hace )

(después de aplicar reiteradas veces las reglas)

En forma gráfica, tenemos el árbol sintáctico concreto que sigue:

Este es el mismo diagrama hecho usando mermaid (opens new window):

# Lenguaje Generado por Una Gramática

Para cada variable sintáctica el lenguaje generado desde la variable se define como:

Esto es, es el conjunto de frases del alfabeto que derivan en varias substituciones desde la variable .

El lenguaje Egg es el conjunto de frases donde es la gramática definida arriba.

El problema a considerar es el de construir para un lenguaje una función parseA() que reconozca las frases del lenguaje .

Siguiendo con el ejemplo de Egg, en tenemos frases como:

()
(4,b)
(4, +(5,c))
(4,)
/* nada */

Recuerda que:

apply: /* vacio */ 
     | '(' (expression ',')* expression? ')' apply

1
2

y que:

# ECMAScript A Complex Language Specification

This Ecma Standard (opens new window) defines the ECMAScript 2022 Language.

It is the thirteenth edition of the ECMAScript Language Specification. Since publication of the first edition in 1997, ECMAScript has grown to be one of the world's most widely used general-purpose programming languages. It is best known as the language embedded in web browsers but has also been widely adopted for server and embedded applications.

Although ECMAScript started as a language with a simple design, over the years that design has become more and more complex. The following section is just an illustration of how some design decisions have led to increased complexity in interpreting and implementing the language.

# Lexical Ambiguity Example

The source text of an ECMAScript is first converted into a sequence of input elements, which are

tokens,
line terminators,
comments, or
white space.

The source text is scanned from left to right, repeatedly taking the longest possible sequence of code points as the next input element.

In ECMAScript, there are several situations where the identification of lexical input elements is sensitive to the syntactic grammar context that is consuming the input elements.

This requires multiple goal symbols for the lexical grammar. The use of multiple lexical goals ensures that there are no lexical ambiguities that would affect automatic semicolon insertion.

For example, there are no syntactic grammar contexts where both a leading division or division-assignment, and a leading RegularExpressionLiteral (opens new window) are permitted.

This is not affected by semicolon insertion (see 12.5 (opens new window)); in examples such as lines 4 and 5 in the following code:

let {a, b, hi, g, c, d} = require('./hidden-amb')
a = b
/hi/g.exec(c).map(d)
console.log(a);

1
2
3
4

where the first non-whitespace, non-comment code point after a LineTerminator (opens new window) is the / (U+002F unicode name SOLIDUS) and the syntactic context allows division or division-assignment, no semicolon is inserted at the LineTerminator!.

That is, the above example is interpreted in the same way as:

a = b / hi / g.exec(c).map(d);

When we run the code above, we get:

➜  prefix-lang git:(master) ✗ node examples/lexical-ambiguity.js
1

1
2

The contents of file examples/hidden-amb.js explain why the output is 1:

let tutu = { map(_) { return 2}}
let a = 5, b = 8, hi = 4, c = "hello", d =
    g = { exec(_) { return tutu; }}
module.exports = {a, b, hi, c, d, g}

1
2
3
4

See the code in the repo crguezl/js-lexical-ambiguity (opens new window)

# ECMAScript Language: Grammar

# ECMAScript Language: Lexical Specification

# ECMA TC39 at GitHub

# The Design of Programming Languages

See section The Design of Programming Languages