Parsing SQL
Parsing is done by the \sad_spirit\pg_builder\Parser class backed by \sad_spirit\pg_builder\Lexer.
The latter splits the SQL string into tokens; the former goes over those tokens, building the Abstract Syntax Tree.
This section describes the usage of these classes and some implementation details that may be of interest to those trying to extend pg_builder in some way.
Parser
API
Tip
It is generally not necessary to manually call methods that parse SQL statements or their parts: use
StatementFactory::createFromString() to parse complete statements,
or one of its builder methods to create Statement instances from scratch.
Statement instances created by StatementFactory will automatically accept strings for their properties
and call the relevant Parser methods.
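As a quick sketch of that workflow (this assumes StatementFactory can be constructed with default dependencies and that a parsed SELECT exposes a where property accepting SQL strings; check your pg_builder version):

```php
// Hypothetical usage sketch: StatementFactory wires up Parser itself.
use sad_spirit\pg_builder\StatementFactory;

$factory = new StatementFactory();

// Parses the complete statement via Parser::parseStatement()
$select = $factory->createFromString('select id, name from users');

// Assigning a string here triggers the relevant parse*() method
// behind the scenes (an assumption about the "where" property).
$select->where->condition = 'id > 10';
```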
It may be necessary to configure the Parser instance, however. The Parser constructor accepts
an instance of Lexer (see below for its configuration options) and an optional instance
of a class implementing CacheItemPoolInterface from PSR-6.
parseSomething() methods
All public Parser methods that have the parse prefix and process (parts of) SQL statements are
actually overloaded via the __call() magic method. That method contains the code for getting / setting the cache if available,
tokenizing strings with Lexer, and forwarding the call to a protected method that does the actual parsing work.
Several dozen such methods are defined, e.g.
parseStatement(string|TokenStream $input): Statement
Parses a complete SQL statement. Used internally by StatementFactory::createFromString().
parseTypeName(string|TokenStream $input): nodes\TypeName
Parses a type name. Used by converters\BuilderSupportDecorator so that it can handle any type name Postgres itself can.
Other parse*() methods are used by Node implementations that accept strings for their properties or array offsets.
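For illustration, these parse*() methods can also be called directly — a sketch using only the methods documented above:

```php
use sad_spirit\pg_builder\Lexer;
use sad_spirit\pg_builder\Parser;

$parser = new Parser(new Lexer());

// Both calls go through __call(): the string is tokenized by Lexer
// and then handed to the protected method doing the actual work.
$statement = $parser->parseStatement('select foo from bar');
$typeName  = $parser->parseTypeName('numeric(10, 2)');
```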
Caching of ASTs
Parser can automatically cache the ASTs generated by its parseSomething() methods.
You only need to provide an instance of a class implementing CacheItemPoolInterface from
PSR-6, either to the Parser constructor or
to its setCache(CacheItemPoolInterface $cache): void method.
$parser = new Parser(new Lexer(), new CacheImplementation());
$parser->setCache(new AnotherCacheImplementation());
ASTs will be stored in the cache under keys having the parsetree- prefix.
Tip
Unserializing an AST is at least 4-5 times faster than creating it from SQL. Use a cache if possible.
Lexer
API
The class has only one public method
tokenize(string $sql): \sad_spirit\pg_builder\TokenStream
Tokenizes the input string. Usually you don't need to call it yourself, as it is automatically called by
Parser when a string is passed to any of its parse*() methods.
You may need to set options via Lexer's constructor, however:
'standard_conforming_strings'
Has the same meaning as the postgresql.conf parameter of the same name: when
true (the default), backslashes in '...' strings are treated literally; when false,
they are treated as escape characters. Backslashes in e'...' strings are always treated as escape characters, of course.
use sad_spirit\pg_builder\Lexer;
$strings = <<<TEST
'foo\\\\bar' e'foo\\\\bar'
TEST;
$lexerStandard = new Lexer([
'standard_conforming_strings' => true
]);
$lexerNonStandard = new Lexer([
'standard_conforming_strings' => false
]);
echo $lexerStandard->tokenize($strings)
. "\n\n"
. $lexerNonStandard->tokenize($strings);
will output
string literal 'foo\\bar' at position 0
string literal 'foo\bar' at position 11
end of input
string literal 'foo\bar' at position 0
string literal 'foo\bar' at position 11
end of input
Parser implementation details
The SQL string is first processed by Lexer and converted to a TokenStream object aggregating implementations
of Token. Parser then goes over that stream and builds the Abstract Syntax Tree composed of
Node implementations.
Lexer class
The class is based on the flex lexer defined in the src/backend/parser/scan.l file of Postgres sources.
Lexer does not create Tokens for whitespace and comments. It also does some preprocessing: it unescapes
strings and identifiers that used Unicode escapes and removes UESCAPE clauses.
TokenType enum
This is an int-backed enum containing the possible types for Tokens. The backing values are bitmasks that can be used
to check whether a concrete type matches a generic one:
if (0 !== ($token->getType()->value & TokenType::PARAMETER->value)) {
echo "Token represents a parameter placeholder";
}
Tokens can only have a concrete type rather than a generic one (with the notable exception of
TokenType::IDENTIFIER); additionally, TokenType::UNICODE_STRING and TokenType::UNICODE_IDENTIFIER are
only used inside Lexer.
Keyword enum
This is a string-backed enum containing the list of all keywords for the most recent Postgres version.
It is generated from the src/include/parser/kwlist.h file.
It has two methods corresponding to additional keyword properties from the above file:
getType(): TokenType
Returns a case of TokenType representing the category of the keyword. Postgres has a lot of keywords, but most of these may be used as identifiers without the need to quote them. The case returned will always be a "subtype" of the generic TokenType::KEYWORD.
isBareLabel(): bool
Returns whether the keyword may be used as a column alias in a SELECT statement / RETURNING clause without the AS keyword.
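A minimal sketch of using these two methods; it assumes the enum cases are named after the keywords themselves (e.g. Keyword::BETWEEN), which should be checked against the generated enum in your pg_builder version:

```php
use sad_spirit\pg_builder\Keyword;
use sad_spirit\pg_builder\TokenType;

$keyword = Keyword::BETWEEN; // assumed case name

// getType() returns a "subtype" of TokenType::KEYWORD,
// so the bitmask check below succeeds for any keyword
$type = $keyword->getType();
var_dump(0 !== ($type->value & TokenType::KEYWORD->value));

// Whether the keyword may be used as a column alias without AS
var_dump($keyword->isBareLabel());
```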
Token interface and its implementations
The Token interface represents a token that has knowledge of its type, value, and position in the input string.
namespace sad_spirit\pg_builder;
interface Token extends \Stringable
{
public function matches(TokenType $type, string|string[]|null $values = null) : bool;
public function matchesAnyKeyword(Keyword ...$keywords): bool;
public function getPosition() : int;
public function getType() : TokenType;
public function getKeyword() : ?Keyword;
public function getValue() : string;
}
matches()
Checks whether the current token matches the given type and/or value. $type is matched like a bitmask (see above) and then the value is checked against the given $values.
matchesAnyKeyword()
Checks whether the current token matches any of the given keywords. This can only return true if the token represents a keyword (i.e. is an instance of KeywordToken).
The following implementations of Token are available:
tokens\EOFToken
Represents the end of input.
tokens\KeywordToken
Represents a keyword. This returns a non-null value from getKeyword() and may return true from matchesAnyKeyword().
tokens\StringToken
A token defined by a type and a string value. E.g. a token with type TokenType::STRING and value foo represents the literal 'foo', while one with type TokenType::IDENTIFIER and value foo represents the identifier foo.
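As an illustration of the matching methods on a KeywordToken (this assumes a Keyword::SELECT enum case exists):

```php
use sad_spirit\pg_builder\Keyword;
use sad_spirit\pg_builder\Lexer;
use sad_spirit\pg_builder\TokenType;

$stream = (new Lexer())->tokenize('select 1');
$token  = $stream->getCurrent();

// The first token is a KeywordToken, so both checks should pass:
// matches() compares the type bitmask, matchesAnyKeyword() the keyword
var_dump($token->matches(TokenType::KEYWORD));
var_dump($token->matchesAnyKeyword(Keyword::SELECT));
```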
TokenStream
This class represents a stream of Tokens.
namespace sad_spirit\pg_builder;
class TokenStream implements \Stringable
{
// Movement within stream
public function next() : Token;
public function skip(int $number) : void;
public function isEOF() : bool;
public function getCurrent() : Token;
public function look(int $number = 1) : Token;
public function reset() : void;
// These map to methods of current Token
public function matches(TokenType $type, string|string[]|null $values = null) : bool;
public function getKeyword() : ?Keyword;
public function matchesAnyKeyword(Keyword ...$keywords): ?Keyword;
// Wrappers for common matches() cases
public function matchesSpecialChar(string|string[] $char) : bool;
public function matchesAnyType(TokenType ...$types) : bool;
public function matchesKeywordSequence(Keyword|Keyword[] ...$keywords): bool;
// These throw SyntaxException if the current Token does not match the given values
public function expect(TokenType $type, string|string[]|null $values = null): Token;
public function expectKeyword(Keyword ...$keywords) : Keyword;
}
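The movement methods above allow walking a stream manually. A sketch, assuming next() returns the current token and advances the pointer, as is conventional for such streams:

```php
use sad_spirit\pg_builder\Lexer;

$stream = (new Lexer())->tokenize('select * from some_table');

// Walk the stream until the EOF token is reached
while (!$stream->isEOF()) {
    $token = $stream->next();
    printf("%s at position %d\n", $token->getValue(), $token->getPosition());
}
```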
Token implementations and TokenStream implement the magic __toString() method,
allowing easy debug output:
use sad_spirit\pg_builder\Lexer;
$lexer = new Lexer();
echo $lexer->tokenize('select * from some_table');
yields
keyword 'select' at position 0
special character '*' at position 7
keyword 'from' at position 9
identifier 'some_table' at position 14
end of input
Parser
This is an LL(*) recursive descent parser. It tries to closely follow the part of the bison grammar defined
in the src/backend/parser/gram.y file of Postgres sources, but the implementation is completely independent.
Note
The part that is reimplemented starts around the PreparableStmt production in gram.y.
Differences from the Postgres parser: the following constructs are not supported:
TABLE name alias for SELECT * FROM name
SELECT INTO
WHERE CURRENT OF cursor for UPDATE and DELETE queries
the undocumented TREAT() function