Internals: Parser implementation
The SQL string is first processed by Lexer and converted to a TokenStream object that aggregates
implementations of Token. Parser then goes over that stream and builds an Abstract Syntax Tree composed of
Node implementations.
Lexer class
The class is based on the flex lexer defined in the src/backend/parser/scan.l file of Postgres sources.
Lexer does not create Tokens for whitespace and comments. It also does some preprocessing: it unescapes
strings and identifiers that use Unicode escapes and removes UESCAPE clauses.
TokenType enum
This is an int-backed enum containing possible types for Tokens. The backing values are bitmasks that can be used
to check whether a concrete type matches a generic one:
if (0 !== ($token->getType()->value & TokenType::PARAMETER->value)) {
    echo "Token represents a parameter placeholder";
}
A Token can only have a concrete type rather than a generic one, with the notable exception of
TokenType::IDENTIFIER. Additionally, TokenType::UNICODE_STRING and TokenType::UNICODE_IDENTIFIER are
only used inside Lexer.
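The bitmask idea can be illustrated with a minimal, self-contained sketch. The DemoTokenType enum, its case names, its values and the isA() helper are all hypothetical and chosen only for this illustration; they are not the cases or values of the library's TokenType. The sketch assumes that each generic type owns one bit and that every concrete type includes the bit of its generic parent.

```php
<?php

declare(strict_types=1);

// Hypothetical enum for illustration only: names and values are assumptions,
// NOT those of sad_spirit\pg_builder\TokenType.
enum DemoTokenType: int
{
    // Generic types: one dedicated bit each
    case LITERAL   = 0x100;
    case PARAMETER = 0x200;

    // Concrete types: the generic bit plus distinguishing low bits
    case STRING               = 0x101;
    case INTEGER              = 0x102;
    case POSITIONAL_PARAMETER = 0x201;
    case NAMED_PARAMETER      = 0x202;

    // Same test as in the snippet above; meaningful when $generic is a generic type
    public function isA(self $generic): bool
    {
        return 0 !== ($this->value & $generic->value);
    }
}

var_dump(DemoTokenType::NAMED_PARAMETER->isA(DemoTokenType::PARAMETER)); // bool(true)
var_dump(DemoTokenType::STRING->isA(DemoTokenType::PARAMETER));          // bool(false)
```

Note that the non-zero test only gives a sensible answer when the right-hand side is a generic type: two concrete types under the same generic parent share that parent's bit, so they would also "match" each other.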
Keyword enum
This is a string-backed enum containing the list of all keywords for the most recent Postgres version.
It is generated from the src/include/parser/kwlist.h file.
It has two methods corresponding to additional keyword properties from that file:
getType(): TokenType
    Returns a case of TokenType representing the category of keyword. Postgres has a lot of keywords, but most of these may be used as identifiers without the need to quote them.
    The case returned will always be a “subtype” of generic TokenType::KEYWORD.
isBareLabel(): bool
    Returns whether the keyword may be used as column alias in SELECT statement / RETURNING clause without the AS keyword.
Token interface and its implementations
The Token interface represents a token that knows its type, value and position in the input string.
namespace sad_spirit\pg_builder;

interface Token extends \Stringable
{
    public function matches(TokenType $type, string|string[]|null $values = null) : bool;
    public function matchesAnyKeyword(Keyword ...$keywords): bool;
    public function getPosition() : int;
    public function getType() : TokenType;
    public function getKeyword() : ?Keyword;
    public function getValue() : string;
}
matches()
    Checks whether current token matches given type and/or value. $type is matched like a bitmask (see above) and then value is checked against given $values.
matchesAnyKeyword()
    Checks whether current token matches any of the given keywords. This can only return true if the token represents a keyword (e.g. is an instance of KeywordToken).
The following implementations of Token are available:
tokens\EOFToken
    Represents end of input.
tokens\KeywordToken
    Represents a keyword. This returns a non-null value from getKeyword() and may return true from matchesAnyKeyword().
tokens\StringToken
    Token defined by a type and a string value. E.g. token with type TokenType::STRING and foo value represents literal 'foo' while the one with TokenType::IDENTIFIER and foo value represents identifier foo.
TokenStream
This class represents a stream of Tokens.
namespace sad_spirit\pg_builder;

class TokenStream implements \Stringable
{
    // Movement within stream
    public function next() : Token;
    public function skip(int $number) : void;
    public function isEOF() : bool;
    public function getCurrent() : Token;
    public function look(int $number = 1) : Token;
    public function reset() : void;

    // These map to methods of current Token
    public function matches(TokenType $type, string|string[]|null $values = null) : bool;
    public function getKeyword() : ?Keyword;
    public function matchesAnyKeyword(Keyword ...$keywords): ?Keyword;

    // Wrappers for common matches() cases
    public function matchesSpecialChar(string|string[] $char) : bool;
    public function matchesAnyType(TokenType ...$types) : bool;
    public function matchesKeywordSequence(Keyword|Keyword[] ...$keywords): bool;

    // These throw SyntaxException if the current Token does not match the given values
    public function expect(TokenType $type, string|string[]|null $values = null): Token;
    public function expectKeyword(Keyword ...$keywords) : Keyword;
}
Token implementations and TokenStream implement the magic __toString() method,
which allows easy debug output:
use sad_spirit\pg_builder\Lexer;
$lexer = new Lexer();
echo $lexer->tokenize('select * from some_table');
yields
keyword 'select' at position 0
special character '*' at position 7
keyword 'from' at position 9
identifier 'some_table' at position 14
end of input
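Besides printing the whole stream, it can be walked token by token with the movement methods listed above. The following is a sketch rather than a canonical usage pattern: it assumes the package is installed through Composer (the autoload path is an assumption) and that next() advances the stream by one token. Only methods named in the listings above are used.

```php
<?php

declare(strict_types=1);

// Assumption: Composer autoloader location; adjust to the actual project layout
require_once __DIR__ . '/vendor/autoload.php';

use sad_spirit\pg_builder\Lexer;

$lexer  = new Lexer();
$stream = $lexer->tokenize('select * from some_table');

$lines = [];
// Stop once the current token is the end-of-input token
while (!$stream->isEOF()) {
    $token   = $stream->getCurrent();
    // TokenType is an enum, so ->name gives the case name for display
    $lines[] = sprintf(
        '%s %s @%d',
        $token->getType()->name,
        $token->getValue(),
        $token->getPosition()
    );
    // Assumed to move the stream to the following token
    $stream->next();
}

echo implode("\n", $lines), "\n";
```

Reading the token with getCurrent() before calling next() keeps the loop independent of what next() returns.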
Parser
This is an LL(*) recursive descent parser. It tries to closely follow a part of the bison grammar defined
in the src/backend/parser/gram.y file of Postgres sources, but the implementation is completely independent.
Note
The part that is reimplemented starts around the PreparableStmt production in gram.y.
Differences from Postgres parser: the following constructs are not supported
- TABLE name alias for SELECT * FROM name
- SELECT INTO
- WHERE CURRENT OF cursor for UPDATE and DELETE queries
- Undocumented TREAT() function