NPEG DSL Documentation

This document describes NPEG framework DSL notation.

Legend

Framework Notation Meaning
e An expression consisting of non-terminals or terminals
. Any char value, printable or not
_ Nothing, no character is matched nor consumed
eN A numbered expression
e1, e2, …, eN A list of expressions
(e1 e2 … eN) A sequence of expressions
e {lower bound, upper bound} At least lower bound and at most upper bound repetitions of expression e
e {lower bound,} Lower bound or more repetitions of expression e
e {,upper bound} At most upper bound repetitions of expression e
e* Zero or more repetitions of expression e
e+ One or more repetitions of expression e
e? Zero or one occurences of expression e
&e Predicate - Peeks without consuming any bytes of the input. Used to verify input matches the e before consumming the next expression.
!e Predicate - Peeks without consuming any bytes of the input. Used to verify the input negates the truth value of e before consumming the next expression.
e1 / e2 Or operation, is true if either e1 or e2 are true. If e1 succeeds, e2 is not evaluated.
<USER INPUT> Some sort of user input
stack(NAME) A named stack
stack.top

Top-most element of a stack

Non-Terminals

Input Framework Notation Meaning
Capturing Group (?< TokenName > e )

Use this non-terminal element to develop the expected abstract syntax tree, AST. If the expression, e, is matched by input, a new AstNode named TokenName will be linked into the final abstract syntax tree. The root node in the final grammar is required to be a capturing group.

Notation Variations:

Optional parameter: \rsc means reduce to single child.

Optional parameter: \ir means intermediate representation; This instructs code generation process to create plain objects in language specified that will be mapped to the AST.

Optional parameter: \rn means replacement node. This instructs the code generation process that you will provide a custom object that implements IAstNodeReplacement during Ast Creation through use of a callback.

\ir and \rn are exclusive and cannot be used together.

(?< TokenName \rsc >)

(?< TokenName \rn = [ typename, assemblyname ] >)

(?< TokenName \ir = [ callbackname ] >)

(?< TokenName \rsc \rn = [ typename, assemblyname ] >)

(?< TokenName \rsc \ir = [ callbackname ] >)

Dynamic Back Reference \k< PreviouslyCapturedTokenName >

Use this non-terminal to parse xml like structures that need to match current position with previously captured text of the input.

This element is dependent on:

  • Capturing Group TokenName
  • Flag to enable back referencing

Notation Variations:

\k<PreviouslyCapturedTokenName[\i]> by default the previously captured text requires current position to match using case sensitivity. By specifiying switch [\i] the match becomes case-insensitive.

And Predicate & e

A predicate does not consume characters. Used to test a positive logic condition for decision branching in grammar.

Not Predicate ! e

A predicate does not consume characters. Used to invert the logic test condition for decision branching in grammar.

Limiting Repetitions e { lower,upper } This non-terminal is useful for matching a bounded number of occurences of an expression. lower and upper may be the same value, in which case exactly lower = upper repetitions of the expression e will be required to exist at the current position of the input iterator.

Notation Variations:

e{lower,}: Unlimited Repetitions with a lower bound. Requires at least lower satisfactions of the expression e. The absence of an upper bound is signaled by passing -1 in the upper bound argument in the C and C++ implementations of NPEG, and with a null reference in the C# implementation.

e{,upper}: Tries to match e at most upper times with the input stream content. The routine stops the first time e cannot be satisfied, however it does not return false, also if the expression is not found a single time in the input. The absence of a lower bound is signaled by passing -1 in the lower bound argument in the C and C++ implementations of NPEG, and with a null reference in the C# implementation.

One Or More e + Equivalent to, but faster than, Limiting Repetition with the bounds {1,}, this routine matches at least one occurence of e, and continues to do so until the expression fails to be satisfied by the input stream data.
Failure to match e at least once, results in a 0 return value and a restauration of the input iterator position.
Optional e? Equivalent to, but faster than, Limiting Repetition with the bounds {,1}, this routine tries to match e with the current input stream content.
Failure to match e even once leads to restauration of the input iterator position, but still results in a non-zero return value.
Prioritized Choice (e1|e2) First tries to match the expression e1 with the input stream content. If e1 could be satisfied only the characters corresponding to e1 are consumed by this operation. If satisfaction of e1 fails, the input iterator position is restored and e2 is evaluated.
Failure to satisfy neither of e1 and e2, leads to restauration of the input iterator position and a zero return value.
Sequence ( e1 e2 ) First tries to match the expression e1 then e2 with the input stream content. The characters corresponding to both e1 and e2 are consumed by this operation.
Failure to satisfy e1 and then e2, leads to restauration of the input iterator position and a zero return value.
Zero Or More e * Equivalent to, but faster than, Limiting Repetition with the bounds {0,}, this routine matches all occurences of e until the expression first fails to be satisfied by the input stream data.
Always returns true, but when the matching fails the input iterator is restored to the position after the last satisfaction of e, or in case of a failure to satisfy e in the first attempt, to the state before entering the Zero Or More routine.

Terminals

Input Framework Notation Meaning
Any Character . Consumes one character, printable or non-printable. Will always return success (boolean true) as long as there is a character in the input stream that can be consumed.
Character Class <CharacterClassChar> Consumes one character of the input stream, if it’s in the user-specified character class, else the index of the input iterator is restored to it’s position at the time of the call and 0 is returned.
Character classes are defined by either passing a range of characters enclosed in bracktes, of the form [a-o], a group of individual characters of the form [bq_@], or a combination of the two [a-zA-Z0-9_.@].
Code Point <BinaryValueOfNumberString>

This routine allows the user to specify the numerical values of the chars to be matched and consumed. Essentially it is the same as matching literal strings, however it’s more convenient for matching non-printable characters.
A speciality of Npeg is that it allows for wildcards in binary and hexadecimal input, i.e. by placing x’s in the input string one can tell the parser to accept any value in the corresponding bits.
A failure to match the input results in a restauration of the input stream index and a 0 return value.

Notation Variations:

Binary strings:#b10101, if the input string has a length that is no multiple of 8 (8bits = 1 byte), the string is automatically padded with leading zeros.
Binary strings with wildcards:#b10xx1, in this example the values of the 7th and 8th bit (due to padding with three leading zeros) would be ignored.

Hexadecimal strings:#xa90ef, if the length of the input string is no multiple of two (two hexadecimal digits = 4bit), it is padded with one leading 0.
Hexadecimal strings with wildcards:#xx90ef, in this example the values of the 5th through 8th bits (due to padding with a leading zeros) would be ignored.

Decimal strings:#1234, In case of a decimal input value, the number string is parsed as one 32-bit integer value. The number of characters being consumed can vary from one to four, depending on how many of the higher position bytes of the int are all zero. In case of #0 the framework assumes that the user explicitly wants to match zero.
There is no option for “don’t cares” with decimal input.

Fatal _ Allows the Npeg client to stop parsing when something unexpected happens in the client code.
A call of fatal results in a ParsingFatalTerminalException exception.
The user can pass the routine a string that describes the error. This string is stored in the exception object.
Literal < StringToBeMatched > Tries to match the current input stream content with StringToBeMatched.
Failure to do so, results in a restauration of the input stream index and a 0 return value.

Case-Sensitive Matching: <StringToBeMatched \c > The literal terminal element requires the users to specify whether they want the string to be matched with or without case-sensitivity.

Warn _ Places a user warning in a vector which can be accessed through the getWarnings member of Npeg C++ or through the warnings field of the npeg_context structure in the C implementation.
The routine always returns true.

Code sample

Since we support natively several languages see the github project for details on your specific language API.  The following code snippet is for C# developers but is similar to the other languages.

          String grammar = @"

        S: [\s]+;
        (?<Gate>): ('*' / 'AND') / ('~*' / 'NAND') / ('+' / 'OR') / ('~+' / 'NOR') / ('^' / 'XOR') / ('~^' / 'XNOR');
        ValidVariable: '""' (?<Variable>[a-zA-Z0-9]+) '""'  / '\'' (?<Variable>[a-zA-Z0-9]+) '\'' / (?<Variable>[a-zA-Z]);
        VarProjection1: ValidVariable /  (?<Invertor>'!' ValidVariable);
        VarProjection2: VarProjection1 / '(' Expression ')' / (?<Invertor>'!' '(' Expression ')');
        Expression: S? VarProjection2 S? (Gate S? VarProjection2 S?)*;
        (?<BooleanEquation>): Expression !.;
        
    ".Trim();

AExpression ROOT = PEGrammar.Load(grammar);
var iterator = new StringInputIterator("((((!X*Y*Z)+(!X*Y*!Z)+(X*Z))))");
var visitor = new NpegParserVisitor(iterator);
ROOT.Accept(visitor);
Assert.IsTrue(visitor.IsMatch);
AstNode node = visitor.AST;