Verbatim++: Verified, Optimized, and Semantically Rich Lexing with Derivatives
Lexers and parsers are attractive targets for attackers because they often sit at the boundary between a software system’s internals and the outside world. Formally verified lexers can reduce the attack surface of these systems, thus making them more secure.
One recent step in this direction is the development of Verbatim, a verified lexer based on the concept of Brzozowski derivatives. However, two limitations restrict the tool’s usefulness. First, its running time is quadratic in the length of its input string. Second, the lexer produces tokens with a simple “tag and string” representation, which limits the tool’s ability to integrate with parsers that operate on more expressive token representations.
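To make the derivative-based approach concrete, the following is a minimal, unverified sketch of a maximal-munch lexer built on Brzozowski derivatives. It is an illustration only and is not Verbatim's Coq implementation; all names (`deriv`, `nullable`, `longest_prefix`, `lex`, `RULES`) are hypothetical. The sketch rescans the remaining input for every token, so it can take quadratic time in the worst case; that is a property of this sketch and not a claim about Verbatim's internals.

```python
from dataclasses import dataclass

# Regular-expression AST (hypothetical names, for illustration only).
@dataclass(frozen=True)
class Empty: pass                    # matches no string
@dataclass(frozen=True)
class Eps: pass                      # matches only the empty string
@dataclass(frozen=True)
class Char: c: str                   # matches one literal character
@dataclass(frozen=True)
class Alt: l: object; r: object      # l | r
@dataclass(frozen=True)
class Seq: l: object; r: object      # l followed by r
@dataclass(frozen=True)
class Star: inner: object            # zero or more repetitions

# Smart constructors that apply a few simplifications.
def alt(l, r):
    if l == Empty(): return r
    if r == Empty() or l == r: return l
    return Alt(l, r)

def seq(l, r):
    if l == Empty() or r == Empty(): return Empty()
    if l == Eps(): return r
    return Seq(l, r)

def nullable(r):
    """True iff r matches the empty string."""
    if isinstance(r, (Eps, Star)): return True
    if isinstance(r, (Empty, Char)): return False
    if isinstance(r, Alt): return nullable(r.l) or nullable(r.r)
    return nullable(r.l) and nullable(r.r)  # Seq

def deriv(r, c):
    """Brzozowski derivative: matches w exactly when r matches c + w."""
    if isinstance(r, (Empty, Eps)): return Empty()
    if isinstance(r, Char): return Eps() if r.c == c else Empty()
    if isinstance(r, Alt): return alt(deriv(r.l, c), deriv(r.r, c))
    if isinstance(r, Star): return seq(deriv(r.inner, c), r)
    left = seq(deriv(r.l, c), r.r)          # Seq
    return alt(left, deriv(r.r, c)) if nullable(r.l) else left

def longest_prefix(r, s, start):
    """Length of the longest prefix of s[start:] matched by r, or None."""
    best = None
    for i in range(start, len(s)):
        r = deriv(r, s[i])
        if r == Empty():                    # no extension can match
            break
        if nullable(r):
            best = i + 1 - start
    return best

def lex(rules, s):
    """Maximal munch: longest match wins; earlier rules win ties."""
    tokens, pos = [], 0
    while pos < len(s):
        best_tag, best_len = None, 0
        for tag, r in rules:
            n = longest_prefix(r, s, pos)
            if n is not None and n > best_len:
                best_tag, best_len = tag, n
        if best_tag is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_tag, s[pos:pos + best_len]))
        pos += best_len
    return tokens

def any_of(chars):
    r = Empty()
    for c in chars:
        r = alt(r, Char(c))
    return r

def plus(r):
    return seq(r, Star(r))

RULES = [("NUM", plus(any_of("0123456789"))), ("WS", plus(Char(" ")))]
```

For example, `lex(RULES, "12 345")` yields three "tag and string" tokens: a `NUM` token for `12`, a `WS` token for the space, and a `NUM` token for `345`.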
In this work, we present a suite of extensions to Verbatim that overcomes these limitations while preserving the tool’s original correctness guarantees. The extended lexer achieves effectively linear performance on a JSON benchmark through a combination of optimizations that, to our knowledge, has not been previously verified. The enhanced version of Verbatim also enables users to augment their lexical specifications with custom semantic actions, and it uses these actions to produce semantically rich tokens, i.e., tokens that carry values of arbitrary, user-defined types. All extensions were implemented and verified with the Coq Proof Assistant.
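The idea of semantic actions can be sketched as follows. This is a hypothetical illustration in Python, not Verbatim's Coq interface: the names `ACTIONS` and `apply_actions` and the token tags are assumptions. Each tag is paired with a user-supplied function that turns the matched string into a value of the user's chosen type. In Coq, the result type could depend on the tag; Python does not check this.

```python
# Hypothetical semantic actions: one function per token tag.
ACTIONS = {
    "NUM": int,                     # "42"   -> 42
    "STR": lambda s: s[1:-1],       # '"hi"' -> 'hi' (strip the quotes)
    "TRUE": lambda _: True,         # "true" -> True
}

def apply_actions(raw_tokens, actions=ACTIONS):
    """Map (tag, lexeme) pairs to (tag, value) pairs.

    Tags without a registered action keep the raw lexeme as their value.
    """
    return [(tag, actions.get(tag, lambda s: s)(lexeme))
            for tag, lexeme in raw_tokens]
```

With these actions, a raw token `("NUM", "42")` becomes `("NUM", 42)`, so a downstream parser receives an integer and does not have to re-parse the string.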