Internals: Parser implementation

The SQL string is first processed by Lexer and converted to TokenStream object aggregating implementations of Token. Parser then goes over that stream and builds the Abstract Syntax Tree composed of Node implementations.

Lexer class

The class is based on flex lexer defined in src/backend/parser/scan.l file of Postgres sources.

Lexer does not create Tokens for whitespace and comments. It also does some preprocessing: unescapes strings and identifiers that used Unicode escapes and removes UESCAPE clauses.

TokenType enum

This is an int-backed enum containing possible types for Tokens. The backing values are bitmasks that can be used for checking that the concrete type matches a generic one

if (0 !== ($token->getType()->value & TokenType::PARAMETER->value)) {
    echo "Token represents a parameter placeholder";
}

Tokens can only have a concrete type rather than a generic one (with the notable exception of TokenType::IDENTIFIER), additionally TokenType::UNICODE_STRING and TokenType::UNICODE_IDENTIFIER are only used inside Lexer.

Keyword enum

This is a string-backed enum containing the list of all keywords for the most recent Postgres version. It is generated from src/include/parser/kwlist.h file.

It has two methods corresponding to additional keyword properties from the above file:

getType(): TokenType

Returns a case of TokenType representing the category of keyword. Postgres has a lot of keywords, but most of these may be used as identifiers without the need to quote them.

The case returned will always be a “subtype” of generic TokenType::KEYWORD.

isBareLabel(): bool

Returns whether the keyword may be used as column alias in SELECT statement / RETURNING clause without the AS keyword.

Token interface and its implementations

The Token interface represents a token that has knowledge of its type, value and position in input string.

namespace sad_spirit\pg_builder;

interface Token extends \Stringable
{
    public function matches(TokenType $type, string|string[]|null $values = null) : bool;
    public function matchesAnyKeyword(Keyword ...$keywords): bool;

    public function getPosition() : int;
    public function getType() : TokenType;
    public function getKeyword() : ?Keyword;
    public function getValue() : string;
}
matches()

Checks whether current token matches given type and/or value. $type is matched like a bitmask (see above) and then value is checked against given $values.

matchesAnyKeyword()

Checks whether current token matches any of the given keywords. This can only return true if the token represents a keyword (e.g. is an instance of KeywordToken).

The following implementations of Token are available:

tokens\EOFToken

Represents end of input.

tokens\KeywordToken

Represents a keyword. This returns a non-null value from getKeyword() and may return true from matchesAnyKeyword().

tokens\StringToken

Token defined by a type and a string value. E.g. token with type TokenType::STRING and foo value represents literal 'foo' while the one with TokenType::IDENTIFIER and foo value represents identifier foo.

TokenStream

This class represents a stream of Tokens.

namespace sad_spirit\pg_builder;

class TokenStream implements \Stringable
{
    // Movement within stream
    public function next() : Token;
    public function skip(int $number) : void;
    public function isEOF() : bool;
    public function getCurrent() : Token;
    public function look(int $number = 1) : Token;
    public function reset() : void;

    // These map to methods of current Token
    public function matches(TokenType $type, string|string[]|null $values = null) : bool;
    public function getKeyword() : ?Keyword;
    public function matchesAnyKeyword(Keyword ...$keywords): ?Keyword;

    // Wrappers for common matches() cases
    public function matchesSpecialChar(string|string[] $char) : bool;
    public function matchesAnyType(TokenType ...$types) : bool;
    public function matchesKeywordSequence(Keyword|Keyword[] ...$keywords): bool

    // These throw SyntaxException if the current Token does not match the given values
    public function expect(TokenType $type, string|string[]|null $values = null): Token;
    public function expectKeyword(Keyword ...$keywords) : Keyword;
}

Token implementations and TokenStream implement magic __toString() method allowing easy debug output:

use sad_spirit\pg_builder\Lexer;

$lexer = new Lexer();
echo $lexer->tokenize('select * from some_table');

yields

keyword 'select' at position 0
special character '*' at position 7
keyword 'from' at position 9
identifier 'some_table' at position 14
end of input

Parser

This is a LL(*) recursive descent parser. It tries to closely follow a part of bison grammar defined in src/backend/parser/gram.y file of Postgres sources, but the implementation is completely independent.

Note

The part that is reimplemented starts around the PreparableStmt production in gram.y.

Differences from Postgres parser: the following constructs are not supported

  • TABLE name alias for SELECT * FROM name

  • SELECT INTO

  • WHERE CURRENT OF cursor for UPDATE and DELETE queries

  • Undocumented TREAT() function