RE/flex
RE/flex (regex-centric, fast lexical analyzer)[1][2] is a free and open source computer program written in C++ that generates fast lexical analyzers (also known as "scanners" or "lexers")[3][4] in C++. RE/flex offers full Unicode support, indentation anchors, word boundaries, lazy quantifiers (non-greedy, lazy repeats), and performance tuning options. RE/flex accepts Flex lexer specifications and offers options to generate scanners for Bison parsers. RE/flex includes a fast C++ regular expression library. HistoryThe RE/flex project was designed and implemented by professor Robert van Engelen in 2016 and released as free open source. The software evolved with several contributions made by others. The RE/flex tool generates lexical analyzers based on regular expression ("regex") libraries, instead of fixed DFA tables generated by traditional lexical analyzer generators.[5] Lexer specificationThe RE/flex lexical analyzer generator accepts an extended syntax of Flex lexer specifications as input. The RE/flex specification syntax is more expressive than the traditional Flex lexer specification syntax and may include indentation anchors, word boundaries, lazy quantifiers (non-greedy, lazy repeats), and new actions such as A lexer specification is of the form: Definitions
%%
Rules
%%
User Code
The Definitions section includes declarations and customization options, followed by name-pattern pairs to define names for regular expression patterns. Named patterns may be referenced in other patterns by embracing them in %top{
#include <inttypes.h> // strtol()
}
%class{
public:
int value; // yyFlexLexer class public member
}
%init{
value = 0; // yyFlexLexer initializations
}
%option flex
digit [0-9]
number {digit}+
The Rules section defines pattern-action pairs. The following example defines a rule to translate a number to the lexer class integer {number} value = strtol(yytext, NULL, 10);
\s // skip white space
. throw *yytext;
The User Code section typically defines C/C++ functions, for example a int main()
{
yyFlexLexer lexer;
try
{
while (lexer.yylex() != 0)
std::cout << "number=" << lexer.value << std::endl;
}
catch (int ch)
{
std::cerr << "Error: unknown character code " << ch << std::endl;
}
}
The Source code outputThe generated algorithm for lexical analysis is based on the concept that any regular expression engine can in principle be used to tokenize input into tokens: given a set of n regular expression patterns for , a regular expression of the form This approach makes it possible for any regex library that supports group captures to be utilized as a matcher. However, note that all groupings of the form The following RE/flex-generated int yyFlexLexer::yylex()
{
if (!has_matcher())
matcher("(p1)|(p2)|...|(pn)"); // new matcher engine for regex pattern (p1)|(p2)|...|(pn)
while (true)
{
switch (matcher().scan()) // scan and match next token, get capture index
{
case 0: // no match
if (... EOF reached ...)
return 0;
output(matcher().input()); // echo the current input character
break;
case 1: // pattern p1 matched
... // Action for pattern p1
break;
case 2: // pattern p2 matched
... // Action for pattern p2
break;
... // and so on for patterns up to pn
}
}
}
If none of the n patterns match and the end-of-file (EOF) is not reached, the so-called "default rule" is invoked. The default rule echo's the current input character and advances the scanner to the next character in the input. The regular expression pattern %%
p1 Action for pattern p1
p2 Action for pattern p2
...
pn Action for pattern pn
%%
From this specification, RE/flex generates the aforementioned std::ifstream ifs(filename, std::ios::in);
yyFlexLexer lexer(ifs);
int token;
while ((token = lexer.yylex()) != 0)
std::cout << "token = " << token << std::endl;
ifs.close();
Note that This example tokenizes a file. A lexical analyzer often serves as a tokenizer for a parser generated by a parser generator such as Bison. CompatibilityRE/flex is compatible with Flex specifications when By contrast to Flex, RE/flex scanners are thread-safe by default on work with reentrant Bison parsers. Unicode supportRE/flex supports Unicode regular expression patterns in lexer specifications and automatically tokenizes UTF-8, UTF-16, and UTF-32 input files. Code pages may be specified to tokenize input files encoded in ISO/IEC 8859 1 to 16, Windows-1250 to Windows-1258, CP-437, CP-850, CP-858, MacRoman, KOI-8, EBCDIC, and so on. Normalization to UTF-8 is automatically performed by internal incremental buffering for (partial) pattern matching with Unicode regular expression patterns. Indent, nodent, and dedent matchingRE/flex integrates indent and dedent matching directly in the regular expression syntax with new %%
^[ \t]+ std::cout << "| "; // nodent: text is aligned to current indent margin
^[ \t]*\i std::cout << "> "; // indent: matched with \i
^[ \t]*\j std::cout << "< "; // dedent: matched with \j
\j std::cout << "< "; // dedent: for each extra level dedented
%%
Lazy quantifiersLazy quantifiers may be associated with repeats in RE/flex regular expression patterns to simplify the expressions using non-greedy repeats, when applicable. Normally matching is "greedy", meaning that the longest pattern is matched. For example, the pattern As a practical application of lazy quantifiers, consider matching C/C++ multiline comments of the form Other pattern matchersBesides the built-in RE/flex POSIX regex pattern matcher, RE/flex also supports PCRE2, Boost.Regex and std::regex pattern matching libraries. PCRE2 and Boost.Regex offer a richer regular expression pattern syntax with Perl pattern matching semantics, but are slower due to their intrinsic NFA-based matching algorithm. TranslationLex, Flex and RE/flex translate regular expressions to DFA, which are implemented in tables for run-time scanning. RE/flex differs from Lex and Flex in that the generated tables contain a list of opcode words executed by a virtual machine to perform pattern matching. In addition, a DFA implemented in code instead of opcode tables is generated with the For example, the following direct-coded DFA for pattern void reflex_code_INITIAL(reflex::Matcher& m)
{
int c0 = 0, c1 = 0;
m.FSM_INIT(c1);
S0:
m.FSM_FIND();
c1 = m.FSM_CHAR();
if ('a' <= c1 && c1 <= 'z') goto S5;
if (c1 == '_') goto S5;
if ('A' <= c1 && c1 <= 'Z') goto S5;
if ('0' <= c1 && c1 <= '9') goto S5;
return m.FSM_HALT(c1);
S5:
m.FSM_TAKE(1);
c1 = m.FSM_CHAR();
if ('a' <= c1 && c1 <= 'z') goto S5;
if (c1 == '_') goto S5;
if ('A' <= c1 && c1 <= 'Z') goto S5;
if ('0' <= c1 && c1 <= '9') goto S5;
return m.FSM_HALT(c1);
}
A list of virtual machine opcode words for pattern REFLEX_CODE_DECL reflex_code_INITIAL[11] =
{
0x617A0005, // 0: GOTO 5 ON 'a'-'z'
0x5F5F0005, // 1: GOTO 5 ON '_'
0x415A0005, // 2: GOTO 5 ON 'A'-'Z'
0x30390005, // 3: GOTO 5 ON '0'-'9'
0x00FFFFFF, // 4: HALT
0xFE000001, // 5: TAKE 1
0x617A0005, // 6: GOTO 5 ON 'a'-'z'
0x5F5F0005, // 7: GOTO 5 ON '_'
0x415A0005, // 8: GOTO 5 ON 'A'-'Z'
0x30390005, // 9: GOTO 5 ON '0'-'9'
0x00FFFFFF, // 10: HALT
};
Debugging and profilingThe RE/flex built-in profiler can be used to measure the performance of the generated scanner automatically. The profiler instruments the scanner source code to collect run-time metrics. At run-time when the instrumented scanner terminates, the profiler reports the number of times a rule is matched and the cumulative time consumed by the matching rule. Profiling includes the time spent in the parser when the rule returns control to the parser. This allows for fine-tuning the performance of the generated scanners and parsers. Lexer rules that are hot spots, i.e. computationally expensive, are detected and can be optimized by the user in the lexer source code. Also debugging of the generated scanner is supported with Flex-compatible options. Debugging outputs annotated lexer rules during scanning. See alsoReferences
External links |