abstract class PhutilLexer
Arcanist Technical Documentation

Slow, inefficient regexp-based lexer. Define rules like this:

array(
  'start'  => array(...),
  'state1' => array(...),
  'state2' => array(...),
)

Lexers start at the state named 'start'. Each state should have a list of rules which can match in that state. A list of rules looks like this:

array(
  array('\s+', 'space'),
  array('\d+', 'digit'),
  array('\w+', 'word'),
)

The lexer operates by processing each rule in the current state in order. When one matches, it produces a token. For example, the lexer above would lex this text:

3 asdf

...to produce these tokens (assuming the rules are for the 'start' state):

array('digit', '3', null),
array('space', ' ', null),
array('word', 'asdf', null),
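
The first-match behavior described above can be sketched in a few lines of standalone PHP. This is an illustration under stated assumptions, not the PhutilLexer implementation: the helper sketch_lex_single_state() is hypothetical, it handles a single state only, and it assumes rule patterns do not contain the "/" delimiter.

```php
<?php

/**
 * Illustrative only: a minimal first-match lexer for a single state.
 * sketch_lex_single_state() is a hypothetical helper, not part of
 * PhutilLexer.
 */
function sketch_lex_single_state(array $rules, $input) {
  $tokens = array();
  $position = 0;
  $length = strlen($input);

  while ($position < $length) {
    $matched = false;

    // Try each rule in order; the first rule that matches wins.
    foreach ($rules as $rule) {
      list($pattern, $type) = $rule;

      // \G anchors the match at the offset passed to preg_match().
      $regexp = '/\G(?:'.$pattern.')/';
      if (!preg_match($regexp, $input, $match, 0, $position)) {
        continue;
      }

      // Skip empty matches so the loop always makes progress.
      if ($match[0] === '') {
        continue;
      }

      $tokens[] = array($type, $match[0], null);
      $position += strlen($match[0]);
      $matched = true;
      break;
    }

    if (!$matched) {
      throw new Exception("No rule matches at byte {$position}.");
    }
  }

  return $tokens;
}

$rules = array(
  array('\s+', 'space'),
  array('\d+', 'digit'),
  array('\w+', 'word'),
);

$tokens = sketch_lex_single_state($rules, '3 asdf');
```

Rule order matters here: '\w+' also matches digits, so "3" becomes a 'digit' token only because the '\d+' rule is listed before the '\w+' rule.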

A rule can also cause a state transition:

array('zebra', 'animal', 'saw_zebra'),

This rule matches the text "zebra", emits a token of type "animal", and changes the lexer state to "saw_zebra", so the lexer starts using the rules from that state.

To pop the lexer's state, you can use the special state '!pop'.
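
State transitions and '!pop' can be sketched as follows. This is a simplified illustration that assumes states are kept on a stack, so that '!pop' returns to the previous state. The helper sketch_lex_with_states() and the 'feature' token type are hypothetical and are not part of PhutilLexer.

```php
<?php

/**
 * Illustrative only: a first-match lexer with a state stack.
 * sketch_lex_with_states() is a hypothetical helper, not part of
 * PhutilLexer.
 */
function sketch_lex_with_states(
  array $rules,
  $input,
  $initial_state = 'start') {

  $tokens = array();
  $states = array($initial_state);
  $position = 0;
  $length = strlen($input);

  while ($position < $length) {
    // The rules of the state on top of the stack are the active rules.
    $state = end($states);
    $matched = false;

    foreach ($rules[$state] as $rule) {
      $pattern = $rule[0];
      $type = $rule[1];
      $next = isset($rule[2]) ? $rule[2] : null;

      $regexp = '/\G(?:'.$pattern.')/';
      if (!preg_match($regexp, $input, $match, 0, $position)) {
        continue;
      }
      if ($match[0] === '') {
        continue;
      }

      $tokens[] = array($type, $match[0], null);
      $position += strlen($match[0]);

      if ($next === '!pop') {
        // Return to the previous state.
        array_pop($states);
        if (!$states) {
          throw new Exception('Popped the last lexer state.');
        }
      } else if ($next !== null) {
        // Enter a new state.
        $states[] = $next;
      }

      $matched = true;
      break;
    }

    if (!$matched) {
      throw new Exception(
        "No rule in state '{$state}' matches at byte {$position}.");
    }
  }

  return $tokens;
}

$rules = array(
  'start' => array(
    array('zebra', 'animal', 'saw_zebra'),
    array('\s+', 'space'),
    array('\w+', 'word'),
  ),
  'saw_zebra' => array(
    array('\s+', 'space'),
    array('stripes', 'feature', '!pop'),
    array('\w+', 'word', '!pop'),
  ),
);

$tokens = sketch_lex_with_states($rules, 'stripes zebra stripes');
```

With these rules, the first "stripes" is lexed as a 'word' in the 'start' state, while the second one is lexed as a 'feature' because the lexer is in the 'saw_zebra' state when it reaches it.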

Finally, you can provide additional options in the fourth parameter. Supported options are 'case-insensitive' and 'context'.

Possible values for 'context' are 'push' (push the token value onto the context stack), 'pop' (pop the context stack and use the popped value as the context for the token), and 'discard' (pop the context stack and throw away the value).

For example, to lex text like this:

Class::CONSTANT

You can use a rule set like this:

'start' => array(
  array('\w+(?=::)', 'class', 'saw_class', array('context' => 'push')),
),
'saw_class' => array(
  array('::', 'operator'),
  array('\w+', 'constant', '!pop', array('context' => 'pop')),
),

This would parse the above text into this token stream:

array('class', 'Class', null),
array('operator', '::', null),
array('constant', 'CONSTANT', 'Class'),
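
The context stack can be sketched in the same style. This is a simplified illustration of the 'push', 'pop', and 'discard' behavior described above, not the PhutilLexer implementation. The helper sketch_lex_with_context() is hypothetical, and the 'case-insensitive' option is not handled.

```php
<?php

/**
 * Illustrative only: a first-match lexer with a state stack and a
 * context stack. sketch_lex_with_context() is a hypothetical helper,
 * not part of PhutilLexer.
 */
function sketch_lex_with_context(
  array $rules,
  $input,
  $initial_state = 'start') {

  $tokens = array();
  $states = array($initial_state);
  $context = array();
  $position = 0;
  $length = strlen($input);

  while ($position < $length) {
    $matched = false;

    foreach ($rules[end($states)] as $rule) {
      // Fill in the optional third and fourth rule parameters.
      $rule = $rule + array(null, null, null, array());
      list($pattern, $type, $next, $options) = $rule;

      $regexp = '/\G(?:'.$pattern.')/';
      if (!preg_match($regexp, $input, $match, 0, $position)) {
        continue;
      }
      if ($match[0] === '') {
        continue;
      }

      $value = $match[0];
      $token_context = null;
      $mode = isset($options['context']) ? $options['context'] : null;
      if ($mode === 'push') {
        // Remember this token's value for a later token.
        $context[] = $value;
      } else if ($mode === 'pop') {
        // Attach the most recently pushed value to this token.
        $token_context = array_pop($context);
      } else if ($mode === 'discard') {
        // Drop the most recently pushed value.
        array_pop($context);
      }

      $tokens[] = array($type, $value, $token_context);
      $position += strlen($value);

      if ($next === '!pop') {
        array_pop($states);
      } else if ($next !== null) {
        $states[] = $next;
      }

      $matched = true;
      break;
    }

    if (!$matched) {
      throw new Exception("No rule matches at byte {$position}.");
    }
  }

  return $tokens;
}

$rules = array(
  'start' => array(
    array('\w+(?=::)', 'class', 'saw_class', array('context' => 'push')),
  ),
  'saw_class' => array(
    array('::', 'operator'),
    array('\w+', 'constant', '!pop', array('context' => 'pop')),
  ),
);

$tokens = sketch_lex_with_context($rules, 'Class::CONSTANT');
```

Run against "Class::CONSTANT", the sketch yields the three tokens listed above, with 'Class' attached as the context of the 'constant' token.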

For a concrete implementation, see PhutilPHPFragmentLexer.

Tasks

Lexer Implementation

  • abstract protected function getRawRules() — Return a set of rules for this lexer. See description in PhutilLexer.

Lexer Rules

  • protected function getRules() — Process, normalize, and validate the raw lexer rules.

Lexer Tokens

  • public function getTokens($input, $initial_state) — Lex an input string into tokens.

Other Methods

  • public function __get($name)
  • public function __set($name, $value)
  • public function current()
  • public function key()
  • public function next()
  • public function rewind()
  • public function valid()
  • private function throwOnAttemptedIteration()
  • public function getPhobjectClassConstant($key, $byte_limit) — Read the value of a class constant.
  • public function mergeTokens($tokens) — Merge adjacent tokens of the same type. For example, if a comment is tokenized as <"//", "comment">, this method will merge the two tokens into a single combined token.
  • public function getLexerState()

Methods

public function __get($name)
Inherited

This method is not documented.
Parameters
$name
Return
wild

public function __set($name, $value)
Inherited

This method is not documented.
Parameters
$name
$value
Return
wild

public function current()
Inherited

This method is not documented.
Return
wild

public function key()
Inherited

This method is not documented.
Return
wild

public function next()
Inherited

This method is not documented.
Return
wild

public function rewind()
Inherited

This method is not documented.
Return
wild

public function valid()
Inherited

This method is not documented.
Return
wild

private function throwOnAttemptedIteration()
Inherited

This method is not documented.
Return
wild

public function getPhobjectClassConstant($key, $byte_limit)
Inherited

Phobject

Read the value of a class constant.

This is the same as just typing self::CONSTANTNAME, but throws a more useful message if the constant is not defined and allows the constant to be limited to a maximum length.

Parameters
string  $key  Name of the constant.
int|null  $byte_limit  Maximum number of bytes permitted in the value.
Return
string  Value of the constant.
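
The behavior described above (a clearer error for a missing constant, plus an optional length limit) can be approximated in standalone PHP. This is a sketch only: sketch_read_class_constant() and SketchWidget are hypothetical names, the real method is an instance method inherited from Phobject, and its actual checks and messages may differ.

```php
<?php

// Hypothetical class used only for this illustration.
final class SketchWidget {
  const WIDGETKEY = 'widget';
}

/**
 * Illustrative only: read a class constant by name, failing with a
 * descriptive message if it is missing or too long. This is a
 * hypothetical helper, not Phobject::getPhobjectClassConstant().
 */
function sketch_read_class_constant($class, $key, $byte_limit = null) {
  $name = $class.'::'.$key;

  // defined() accepts "ClassName::CONSTANT" for class constants.
  if (!defined($name)) {
    throw new Exception(
      "Class \"{$class}\" must define a \"{$key}\" constant.");
  }

  $value = constant($name);

  // strlen() counts bytes, which matches a byte limit.
  if ($byte_limit !== null && strlen((string)$value) > $byte_limit) {
    throw new Exception(
      "Constant \"{$name}\" must be at most {$byte_limit} bytes long.");
  }

  return $value;
}

$value = sketch_read_class_constant('SketchWidget', 'WIDGETKEY', 32);
```

Here $value is 'widget'. Asking for an undefined constant, or passing a limit smaller than six bytes, throws an exception instead.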

abstract protected function getRawRules()

Return a set of rules for this lexer. See description in PhutilLexer.

Return
dict  Lexer rules.

protected function getRules()

Process, normalize, and validate the raw lexer rules.

Return
wild

public function getTokens($input, $initial_state)

Lex an input string into tokens.

Parameters
string  $input  Input string.
string  $initial_state  Initial lexer state.
Return
list  List of lexer tokens.

public function mergeTokens($tokens)

Merge adjacent tokens of the same type. For example, if a comment is tokenized as <"//", "comment">, this method will merge the two tokens into a single combined token.

Parameters
array  $tokens
Return
wild
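
Merging adjacent tokens of the same type can be sketched as below. This is an illustration, not the library's implementation: sketch_merge_tokens() is a hypothetical helper, it compares token types only, and it keeps the context of the first token in each run, which is an assumption.

```php
<?php

/**
 * Illustrative only: merge runs of adjacent tokens that share a type.
 * sketch_merge_tokens() is a hypothetical helper, not
 * PhutilLexer::mergeTokens().
 */
function sketch_merge_tokens(array $tokens) {
  $merged = array();

  foreach ($tokens as $token) {
    $last = count($merged) - 1;

    if ($last >= 0 && $merged[$last][0] === $token[0]) {
      // Same type as the previous token: append the text to it.
      $merged[$last][1] .= $token[1];
    } else {
      // Different type (or first token): start a new run.
      $merged[] = $token;
    }
  }

  return $merged;
}

$tokens = array(
  array('comment', '//', null),
  array('comment', ' hello', null),
  array('space', "\n", null),
  array('word', 'x', null),
);

$merged = sketch_merge_tokens($tokens);
```

The two 'comment' tokens collapse into a single token with the value "// hello", and the 'space' and 'word' tokens pass through unchanged.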

public function getLexerState()

This method is not documented.
Return
wild