Parsing SQL

Parsing is done by \sad_spirit\pg_builder\Parser class backed by \sad_spirit\pg_builder\Lexer. The latter splits the SQL string into tokens, the former goes over the tokens building the Abstract Syntax Tree.

This section describes usage of these classes and some implementation details that may be of interest to those trying to extend pg_builder in some way.

Parser API

Tip

It is generally not necessary to manually call methods that parse SQL statements or their parts: use StatementFactory::createFromString() to parse complete statements or one of its builder methods to create Statement instances from scratch. Statement instances created by StatementFactory will automatically accept strings for their properties and call relevant Parser methods.

It may be necessary to configure the Parser instance, however. Parser constructor accepts an instance of Lexer (see below for its configuration options) and an optional instance of a class implementing CacheItemPoolInterface from PSR-6.

parseSomething() methods

All public Parser methods that have parse prefix and process (parts of) SQL statements are actually overloaded via __call() magic method. It contains code for getting / setting cache if available, tokenizing strings with Lexer and forwarding a call to a protected method that does actual parsing work.

Several dozens of such methods are defined, e.g.

parseStatement(string|TokenStream $input): Statement

Parses a complete SQL statement. Used internally by StatementFactory::createFromString().

parseTypeName(string|TokenStream $input): nodes\TypeName

Parses a type name. Used by converters\BuilderSupportDecorator so that it can handle any type name Postgres itself can.

Other parse*() methods are used by Node implementations that accept strings for their properties or array offsets.

Caching of ASTs

Parser can automatically cache ASTs generated by its parseSomething() methods. You only need to provide an instance of class implementing CacheItemPoolInterface from PSR-6 either to Parser constructor or to its setCache(CacheItemPoolInterface $cache): void method.

$parser = new Parser(new Lexer(), new CacheImplementation());

$parser->setCache(new AnotherCacheImplementation());

ASTs will be stored in cache under keys having parsetree- prefix.

Tip

Unserializing AST is at least 4-5 times faster than creating it from SQL. Use cache if possible.

Lexer API

The class has only one public method

tokenize(string $sql): \sad_spirit\pg_builder\TokenStream

Tokenizes the input string. Usually you don’t need to call it yourself as it is automatically called by Parser when a string is passed to any of its parse*() methods.

You may need to set options via Lexer’s constructor, however:

'standard_conforming_strings'

Has the same meaning as postgresql.conf parameter of the same name: when true (default), then backslashes in '...' strings are treated literally, when false they are treated as escape characters. Backslashes in e'...' strings are always treated as escape characters, of course.

use sad_spirit\pg_builder\Lexer;

$strings = <<<TEST
'foo\\\\bar' e'foo\\\\bar'
TEST;

$lexerStandard = new Lexer([
    'standard_conforming_strings' => true
]);

$lexerNonStandard = new Lexer([
    'standard_conforming_strings' => false
]);

echo $lexerStandard->tokenize($strings)
     . "\n\n"
     . $lexerNonStandard->tokenize($strings);

will output

string literal 'foo\\bar' at position 0
string literal 'foo\bar' at position 11
end of input

string literal 'foo\bar' at position 0
string literal 'foo\bar' at position 11
end of input

Parser implementation details

The SQL string is first processed by Lexer and converted to TokenStream object aggregating implementations of Token. Parser then goes over that stream and builds the Abstract Syntax Tree composed of Node implementations.

Lexer class

The class is based on flex lexer defined in src/backend/parser/scan.l file of Postgres sources.

Lexer does not create Tokens for whitespace and comments. It also does some preprocessing: unescapes strings and identifiers that used Unicode escapes and removes UESCAPE clauses.

TokenType enum

This is an int-backed enum containing possible types for Tokens. The backing values are bitmasks that can be used for checking that the concrete type matches a generic one

if (0 !== ($token->getType()->value & TokenType::PARAMETER->value)) {
    echo "Token represents a parameter placeholder";
}

Tokens can only have a concrete type rather than a generic one (with the notable exception of TokenType::IDENTIFIER), additionally TokenType::UNICODE_STRING and TokenType::UNICODE_IDENTIFIER are only used inside Lexer.

Keyword enum

This is a string-backed enum containing the list of all keywords for the most recent Postgres version. It is generated from src/include/parser/kwlist.h file.

It has two methods corresponding to additional keyword properties from the above file:

getType(): TokenType

Returns a case of TokenType representing the category of keyword. Postgres has a lot of keywords, but most of these may be used as identifiers without the need to quote them.

The case returned will always be a “subtype” of generic TokenType::KEYWORD.

isBareLabel(): bool

Returns whether the keyword may be used as column alias in SELECT statement / RETURNING clause without the AS keyword.

Token interface and its implementations

The Token interface represents a token that has knowledge of its type, value and position in input string.

namespace sad_spirit\pg_builder;

interface Token extends \Stringable
{
    public function matches(TokenType $type, string|string[]|null $values = null) : bool;
    public function matchesAnyKeyword(Keyword ...$keywords): bool;

    public function getPosition() : int;
    public function getType() : TokenType;
    public function getKeyword() : ?Keyword;
    public function getValue() : string;
}
matches()

Checks whether current token matches given type and/or value. $type is matched like a bitmask (see above) and then value is checked against given $values.

matchesAnyKeyword()

Checks whether current token matches any of the given keywords. This can only return true if the token represents a keyword (e.g. is an instance of KeywordToken).

The following implementations of Token are available:

tokens\EOFToken

Represents end of input.

tokens\KeywordToken

Represents a keyword. This returns a non-null value from getKeyword() and may return true from matchesAnyKeyword().

tokens\StringToken

Token defined by a type and a string value. E.g. token with type TokenType::STRING and foo value represents literal 'foo' while the one with TokenType::IDENTIFIER and foo value represents identifier foo.

TokenStream

This class represents a stream of Tokens.

namespace sad_spirit\pg_builder;

class TokenStream implements \Stringable
{
    // Movement within stream
    public function next() : Token;
    public function skip(int $number) : void;
    public function isEOF() : bool;
    public function getCurrent() : Token;
    public function look(int $number = 1) : Token;
    public function reset() : void;

    // These map to methods of current Token
    public function matches(TokenType $type, string|string[]|null $values = null) : bool;
    public function getKeyword() : ?Keyword;
    public function matchesAnyKeyword(Keyword ...$keywords): ?Keyword;

    // Wrappers for common matches() cases
    public function matchesSpecialChar(string|string[] $char) : bool;
    public function matchesAnyType(TokenType ...$types) : bool;
    public function matchesKeywordSequence(Keyword|Keyword[] ...$keywords): bool

    // These throw SyntaxException if the current Token does not match the given values
    public function expect(TokenType $type, string|string[]|null $values = null): Token;
    public function expectKeyword(Keyword ...$keywords) : Keyword;
}

Token implementations and TokenStream implement magic __toString() method allowing easy debug output:

use sad_spirit\pg_builder\Lexer;

$lexer = new Lexer();
echo $lexer->tokenize('select * from some_table');

yields

keyword 'select' at position 0
special character '*' at position 7
keyword 'from' at position 9
identifier 'some_table' at position 14
end of input

Parser

This is a LL(*) recursive descent parser. It tries to closely follow a part of bison grammar defined in src/backend/parser/gram.y file of Postgres sources, but the implementation is completely independent.

Note

The part that is reimplemented starts around the PreparableStmt production in gram.y.

Differences from Postgres parser: the following constructs are not supported

  • TABLE name alias for SELECT * FROM name

  • SELECT INTO

  • WHERE CURRENT OF cursor for UPDATE and DELETE queries

  • Undocumented TREAT() function