-- Copyright 2006-2020 Mitchell mitchell.att.foicica.com. See License.txt.

local M = {}

--[=[ This comment is for LuaDoc.
---
-- Lexes Scintilla documents and source code with Lua and LPeg.
--
-- ## Writing Lua Lexers
--
-- Lexers highlight the syntax of source code. Scintilla (the editing component
-- behind [Textadept][]) traditionally uses static, compiled C++ lexers which
-- are notoriously difficult to create and/or extend. On the other hand, Lua
-- makes it easy to rapidly create new lexers, extend existing ones, and embed
-- lexers within one another. Lua lexers tend to be more readable than C++
-- lexers too.
--
-- Lexers are Parsing Expression Grammars, or PEGs, composed with the Lua
-- [LPeg library][]. The following table comes from the LPeg documentation and
-- summarizes all you need to know about constructing basic LPeg patterns. This
-- module provides convenience functions for creating and working with other
-- more advanced patterns and concepts.
--
-- Operator             | Description
-- ---------------------|------------
-- `lpeg.P(string)`     | Matches `string` literally.
-- `lpeg.P(`_`n`_`)`    | Matches exactly _`n`_ characters.
-- `lpeg.S(string)`     | Matches any character in set `string`.
-- `lpeg.R("`_`xy`_`")` | Matches any character between range `x` and `y`.
-- `patt^`_`n`_         | Matches at least _`n`_ repetitions of `patt`.
-- `patt^-`_`n`_        | Matches at most _`n`_ repetitions of `patt`.
-- `patt1 * patt2`      | Matches `patt1` followed by `patt2`.
-- `patt1 + patt2`      | Matches `patt1` or `patt2` (ordered choice).
-- `patt1 - patt2`      | Matches `patt1` if `patt2` does not also match.
-- `-patt`              | Equivalent to `("" - patt)`.
-- `#patt`              | Matches `patt` but consumes no input.
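--
-- For example, composing a few of these operators, a pattern that matches a
-- double-quoted, lowercase word might look like the following (a quick
-- illustration of composition; this pattern is not one the module defines):
--
--     local quoted_word = lpeg.P('"') * lpeg.R('az')^1 * lpeg.P('"')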
--
-- The first part of this document deals with rapidly constructing a simple
-- lexer. The next part deals with more advanced techniques, such as custom
-- coloring and embedding lexers within one another. Following that is a
-- discussion about code folding, or being able to tell Scintilla which code
-- blocks are "foldable" (temporarily hideable from view). After that are
-- instructions on how to use Lua lexers with the aforementioned Textadept
-- editor. Finally there are comments on lexer performance and limitations.
--
-- [LPeg library]: http://www.inf.puc-rio.br/~roberto/lpeg/lpeg.html
-- [Textadept]: http://foicica.com/textadept
--
-- ### Lexer Basics
--
-- The *lexers/* directory contains all lexers, including your new one. Before
-- attempting to write one from scratch though, first determine if your
-- programming language is similar to any of the 100+ languages supported. If
-- so, you may be able to copy and modify that lexer, saving some time and
-- effort. The filename of your lexer should be the name of your programming
-- language in lower case followed by a *.lua* extension. For example, a new
-- Lua lexer has the name *lua.lua*.
--
-- Note: Try to refrain from using one-character language names like "c", "d",
-- or "r". For example, Lua lexers for those language names are named "ansi_c",
-- "dmd", and "rstats", respectively.
--
-- #### New Lexer Template
--
-- There is a *lexers/template.txt* file that contains a simple template for a
-- new lexer. Feel free to use it, replacing the '?'s with the name of your
-- lexer. Consider this snippet from the template:
--
--     -- ? LPeg lexer.
--
--     local lexer = require('lexer')
--     local token, word_match = lexer.token, lexer.word_match
--     local P, S = lpeg.P, lpeg.S
--
--     local lex = lexer.new('?')
--
--     -- Whitespace.
--     local ws = token(lexer.WHITESPACE, lexer.space^1)
--     lex:add_rule('whitespace', ws)
--
--     [...]
--
--     return lex
--
-- The first 3 lines of code simply define often-used convenience variables.
-- The fourth and last lines [define](#lexer.new) and return the lexer object
-- Scintilla uses; they are very important and must be part of every lexer. The
-- fifth line defines something called a "token", an essential building block
-- of lexers. You will learn about tokens shortly. The sixth line defines a
-- lexer grammar rule, which you will learn about later, as well as token
-- styles. (Be aware that it is common practice to combine these two lines for
-- short rules.) Note, however, the `local` prefix in front of variables, which
-- is needed so as not to affect Lua's global environment. All in all, this is
-- a minimal, working lexer that you can build on.
--
-- #### Tokens
--
-- Take a moment to think about your programming language's structure. What
-- kind of key elements does it have? In the template shown earlier, one
-- predefined element all languages have is whitespace. Your language probably
-- also has elements like comments, strings, and keywords. Lexers refer to
-- these elements as "tokens". Tokens are the fundamental "building blocks" of
-- lexers. Lexers break down source code into tokens for coloring, which
-- results in the syntax highlighting familiar to you. It is up to you how
-- specific your lexer is when it comes to tokens. Perhaps only distinguishing
-- between keywords and identifiers is necessary, or maybe recognizing
-- constants and built-in functions, methods, or libraries is desirable. The
-- Lua lexer, for example, defines 11 tokens: whitespace, keywords, built-in
-- functions, constants, built-in libraries, identifiers, strings, comments,
-- numbers, labels, and operators. Even though constants, built-in functions,
-- and built-in libraries are subsets of identifiers, Lua programmers find it
-- helpful for the lexer to distinguish between them all. It is perfectly
-- acceptable to just recognize keywords and identifiers.
--
-- In a lexer, tokens consist of a token name and an LPeg pattern that matches
-- a sequence of characters recognized as an instance of that token. Create
-- tokens using the [`lexer.token()`]() function. Let us examine the
-- "whitespace" token defined in the template shown earlier:
--
--     local ws = token(lexer.WHITESPACE, lexer.space^1)
--
-- At first glance, the first argument does not appear to be a string name and
-- the second argument does not appear to be an LPeg pattern. Perhaps you
-- expected something like:
--
--     local ws = token('whitespace', S('\t\v\f\n\r ')^1)
--
-- The `lexer` module actually provides a convenient list of common token names
-- and common LPeg patterns for you to use. Token names include
-- [`lexer.DEFAULT`](), [`lexer.WHITESPACE`](), [`lexer.COMMENT`](),
-- [`lexer.STRING`](), [`lexer.NUMBER`](), [`lexer.KEYWORD`](),
-- [`lexer.IDENTIFIER`](), [`lexer.OPERATOR`](), [`lexer.ERROR`](),
-- [`lexer.PREPROCESSOR`](), [`lexer.CONSTANT`](), [`lexer.VARIABLE`](),
-- [`lexer.FUNCTION`](), [`lexer.CLASS`](), [`lexer.TYPE`](), [`lexer.LABEL`](),
-- [`lexer.REGEX`](), and [`lexer.EMBEDDED`]().
-- Patterns include
-- [`lexer.any`](), [`lexer.alpha`](), [`lexer.digit`](), [`lexer.alnum`](),
-- [`lexer.lower`](), [`lexer.upper`](), [`lexer.xdigit`](), [`lexer.graph`](),
-- [`lexer.print`](), [`lexer.punct`](), [`lexer.space`](), [`lexer.newline`](),
-- [`lexer.nonnewline`](), [`lexer.dec_num`](), [`lexer.hex_num`](),
-- [`lexer.oct_num`](), [`lexer.integer`](), [`lexer.float`](),
-- [`lexer.number`](), and [`lexer.word`](). You may use your own token names
-- if none of the above fit your language, but an advantage to using predefined
-- token names is that your lexer's tokens will inherit the universal syntax
-- highlighting color theme used by your text editor.
--
-- ##### Example Tokens
--
-- So, how might you define other tokens like keywords, comments, and strings?
-- Here are some examples.
--
-- **Keywords**
--
-- Instead of matching _n_ keywords with _n_ `P('keyword_`_`n`_`')` ordered
-- choices, use another convenience function: [`lexer.word_match()`](). It is
-- much easier and more efficient to write word matches like:
--
--     local keyword = token(lexer.KEYWORD, lexer.word_match[[
--       keyword_1 keyword_2 ... keyword_n
--     ]])
--
--     local case_insensitive_keyword = token(lexer.KEYWORD, lexer.word_match([[
--       KEYWORD_1 keyword_2 ... KEYword_n
--     ]], true))
--
--     local hyphened_keyword = token(lexer.KEYWORD, lexer.word_match[[
--       keyword-1 keyword-2 ... keyword-n
--     ]])
--
-- In order to more easily separate or categorize keyword sets, you can use Lua
-- line comments within keyword strings. Such comments will be ignored. For
-- example:
--
--     local keyword = token(lexer.KEYWORD, lexer.word_match[[
--       -- Version 1 keywords.
--       keyword_11, keyword_12 ... keyword_1n
--       -- Version 2 keywords.
--       keyword_21, keyword_22 ... keyword_2n
--       ...
--       -- Version N keywords.
--       keyword_m1, keyword_m2 ... keyword_mn
--     ]])
--
-- **Comments**
--
-- Line-style comments with a prefix character (or characters) are easy to
-- express with LPeg:
--
--     local shell_comment = token(lexer.COMMENT, lexer.to_eol('#'))
--     local c_line_comment = token(lexer.COMMENT, lexer.to_eol('//', true))
--
-- The comments above start with a '#' or "//" and go to the end of the line.
-- The second comment recognizes the next line also as a comment if the current
-- line ends with a '\' escape character.
--
-- C-style "block" comments with a start and end delimiter are also easy to
-- express:
--
--     local c_comment = token(lexer.COMMENT, lexer.range('/*', '*/'))
--
-- This comment starts with a "/\*" sequence and contains anything up to and
-- including an ending "\*/" sequence. The ending "\*/" is optional so the
-- lexer can recognize unfinished comments as comments and highlight them
-- properly.
--
-- **Strings**
--
-- Most programming languages allow escape sequences in strings such that a
-- sequence like "\\"" in a double-quoted string indicates that the '"' is not
-- the end of the string. [`lexer.range()`]() handles escapes inherently.
--
--     local dq_str = lexer.range('"')
--     local sq_str = lexer.range("'")
--     local string = token(lexer.STRING, dq_str + sq_str)
--
-- In this case, the lexer treats '\' as an escape character in a string
-- sequence.
--
-- **Numbers**
--
-- Most programming languages have the same format for integer and float
-- tokens, so it might be as simple as using a predefined LPeg pattern:
--
--     local number = token(lexer.NUMBER, lexer.number)
--
-- However, some languages allow postfix characters on integers.
--
--     local integer = P('-')^-1 * (lexer.dec_num * S('lL')^-1)
--     local number = token(lexer.NUMBER, lexer.float + lexer.hex_num + integer)
--
-- Your language may need other tweaks, but it is up to you how fine-grained
-- you want your highlighting to be. After all, you are not writing a compiler
-- or interpreter!
--
-- #### Rules
--
-- Programming languages have grammars, which specify valid token structure.
-- For example, comments usually cannot appear within a string. Grammars
-- consist of rules, which are simply combinations of tokens. Recall from the
-- lexer template the [`lexer.add_rule()`]() call, which adds a rule to the
-- lexer's grammar:
--
--     lex:add_rule('whitespace', ws)
--
-- Each rule has an associated name, but rule names are completely arbitrary
-- and serve only to identify and distinguish between different rules. Rule
-- order is important: if text does not match the first rule added to the
-- grammar, the lexer tries to match the second rule added, and so on. Right
-- now this lexer simply matches whitespace tokens under a rule named
-- "whitespace".
--
-- To illustrate the importance of rule order, here is an example of a
-- simplified Lua lexer:
--
--     lex:add_rule('whitespace', token(lexer.WHITESPACE, ...))
--     lex:add_rule('keyword', token(lexer.KEYWORD, ...))
--     lex:add_rule('identifier', token(lexer.IDENTIFIER, ...))
--     lex:add_rule('string', token(lexer.STRING, ...))
--     lex:add_rule('comment', token(lexer.COMMENT, ...))
--     lex:add_rule('number', token(lexer.NUMBER, ...))
--     lex:add_rule('label', token(lexer.LABEL, ...))
--     lex:add_rule('operator', token(lexer.OPERATOR, ...))
--
-- Note how identifiers come after keywords. In Lua, as with most programming
-- languages, the characters allowed in keywords and identifiers are in the
-- same set (alphanumerics plus underscores). If the lexer added the
-- "identifier" rule before the "keyword" rule, all keywords would match
-- identifiers and thus incorrectly highlight as identifiers instead of
-- keywords. The same idea applies to function, constant, etc. tokens that you
-- may want to distinguish between: their rules should come before identifiers.
--
-- So what about text that does not match any rules? For example, in Lua, the
-- '!' character is meaningless outside a string or comment. Normally the
-- lexer skips over such text. If instead you want to highlight these "syntax
-- errors", add an additional end rule:
--
--     lex:add_rule('whitespace', ws)
--     ...
--     lex:add_rule('error', token(lexer.ERROR, lexer.any))
--
-- This identifies and highlights any character not matched by an existing
-- rule as a `lexer.ERROR` token.
--
-- Even though the rules defined in the examples above contain a single token,
-- rules may consist of multiple tokens. For example, a rule for an HTML tag
-- could consist of a tag token followed by an arbitrary number of attribute
-- tokens, allowing the lexer to highlight all tokens separately. That rule
-- might look something like this:
--
--     lex:add_rule('tag', tag_start * (ws * attributes)^0 * tag_end^-1)
--
-- Note, however, that lexers with complex rules like these are more prone to
-- lose track of their state, especially if they span multiple lines.
--
-- #### Summary
--
-- Lexers primarily consist of tokens and grammar rules. At your disposal are a
-- number of convenience patterns and functions for rapidly creating a lexer.
-- If you choose to use predefined token names for your tokens, you do not
-- have to define how the lexer highlights them.
-- The tokens will inherit the default syntax highlighting color theme your
-- editor uses.
--
-- ### Advanced Techniques
--
-- #### Styles and Styling
--
-- The most basic form of syntax highlighting is assigning different colors to
-- different tokens. Instead of highlighting with just colors, Scintilla allows
-- for more rich highlighting, or "styling", with different fonts, font sizes,
-- font attributes, and foreground and background colors, just to name a few.
-- The unit of this rich highlighting is called a "style". Styles are simply
-- Lua tables of properties. By default, lexers associate predefined token
-- names like `lexer.WHITESPACE`, `lexer.COMMENT`, `lexer.STRING`, etc. with
-- particular styles as part of a universal color theme. These predefined
-- styles are contained in [`lexer.styles`](), and you may define your own
-- styles. See that table's documentation for more information. As with token
-- names, LPeg patterns, and styles, there is a set of predefined color names,
-- but they vary depending on the current color theme in use. Therefore, it is
-- generally not a good idea to manually define colors within styles in your
-- lexer since they might not fit into a user's chosen color theme. Try to
-- refrain from even using predefined colors in a style because that color may
-- be theme-specific. Instead, the best practice is to either use predefined
-- styles or derive new color-agnostic styles from predefined ones. For
-- example, Lua "longstring" tokens use the existing `lexer.styles.string`
-- style instead of defining a new one.
--
-- ##### Example Styles
--
-- Defining styles is pretty straightforward. An empty style that inherits the
-- default theme settings is simply an empty table:
--
--     local style_nothing = {}
--
-- A similar style but with a bold font face looks like this:
--
--     local style_bold = {bold = true}
--
-- You can derive new styles from predefined ones without having to rewrite
-- them. This operation leaves the old style unchanged. For example, if you had
-- a "static variable" token whose style you wanted to base off of
-- `lexer.styles.variable`, it would probably look like:
--
--     local style_static_var = lexer.styles.variable .. {italics = true}
--
-- The color theme files in the *lexers/themes/* folder give more examples of
-- style definitions.
--
-- #### Token Styles
--
-- Lexers use the [`lexer.add_style()`]() function to assign styles to
-- particular tokens. Recall the token definition and rule from the lexer
-- template:
--
--     local ws = token(lexer.WHITESPACE, lexer.space^1)
--     lex:add_rule('whitespace', ws)
--
-- Why is a style not assigned to the `lexer.WHITESPACE` token? As mentioned
-- earlier, lexers automatically associate tokens that use predefined token
-- names with a particular style. Only tokens with custom token names need
-- manual style associations. As an example, consider a custom whitespace
-- token:
--
--     local ws = token('custom_whitespace', lexer.space^1)
--
-- Assigning a style to this token looks like:
--
--     lex:add_style('custom_whitespace', lexer.styles.whitespace)
--
-- Do not confuse token names with rule names. They are completely different
-- entities. In the example above, the lexer associates the "custom_whitespace"
-- token with the existing style for `lexer.WHITESPACE` tokens. If instead you
-- prefer to color the background of whitespace a shade of grey, it might look
-- like:
--
--     lex:add_style('custom_whitespace',
--       lexer.styles.whitespace .. {back = lexer.colors.grey})
--
-- Remember to refrain from assigning specific colors in styles, but in this
-- case, all user color themes probably define `colors.grey`.
--
-- #### Line Lexers
--
-- By default, lexers match the arbitrary chunks of text passed to them by
-- Scintilla. These chunks may be a full document, only the visible part of a
-- document, or even just portions of lines. Some lexers need to match whole
-- lines. For example, a lexer for the output of a file "diff" needs to know if
-- the line started with a '+' or '-' and then style the entire line
-- accordingly. To indicate that your lexer matches by line, create the lexer
-- with an extra parameter:
--
--     local lex = lexer.new('?', {lex_by_line = true})
--
-- Now the input text for the lexer is a single line at a time. Keep in mind
-- that line lexers do not have the ability to look ahead at subsequent lines.
--
-- #### Embedded Lexers
--
-- Lexers embed within one another very easily, requiring minimal effort. In
-- the following sections, the lexer being embedded is called the "child"
-- lexer and the lexer a child is being embedded in is called the "parent".
-- For example, consider an HTML lexer and a CSS lexer. Either lexer stands
-- alone for styling their respective HTML and CSS files. However, CSS can be
-- embedded inside HTML. In this specific case, the CSS lexer is the "child"
-- lexer with the HTML lexer being the "parent". Now consider an HTML lexer
-- and a PHP lexer. This sounds a lot like the case with CSS, but there is a
-- subtle difference: PHP _embeds itself into_ HTML while CSS is _embedded in_
-- HTML. This fundamental difference results in two types of embedded lexers:
-- a parent lexer that embeds other child lexers in it (like HTML embedding
-- CSS), and a child lexer that embeds itself into a parent lexer (like PHP
-- embedding itself in HTML).
--
-- ##### Parent Lexer
--
-- Before embedding a child lexer into a parent lexer, the parent lexer needs
-- to load the child lexer. This is done with the [`lexer.load()`]() function.
-- For example, loading the CSS lexer within the HTML lexer looks like:
--
--     local css = lexer.load('css')
--
-- The next part of the embedding process is telling the parent lexer when to
-- switch over to the child lexer and when to switch back. The lexer refers to
-- these indications as the "start rule" and "end rule", respectively, and
-- they are just LPeg patterns. Continuing with the HTML/CSS example, the
-- transition from HTML to CSS is when the lexer encounters a "style" tag with
-- a "type" attribute whose value is "text/css":
--
--     local css_tag = P('<style') * P(function(input, index)
--       if input:find('^[^>]+type="text/css"', index) then
--         return index
--       end
--     end)
--
-- This pattern looks for the beginning of a "style" tag and searches its
-- attribute list for the text "`type="text/css"`". (In this simplified
-- example, the Lua pattern does not consider whitespace between the '=' nor
-- does it consider that using single quotes is valid.) If there is a match,
-- the functional pattern returns a value instead of `nil`. In this case, the
-- value returned does not matter because we ultimately want to style the
-- "style" tag as an HTML tag, so the actual start rule looks like this:
--
--     local css_start_rule = #css_tag * tag
--
-- Now that the parent knows when to switch to the child, it needs to know
-- when to switch back.
-- In the case of HTML/CSS, the switch back occurs when the lexer encounters
-- an ending "style" tag, though the lexer should still style the tag as an
-- HTML tag:
--
--     local css_end_rule = #P('</style>') * tag
--
-- Once the parent loads the child lexer and defines the child's start and end
-- rules, it embeds the child with the [`lexer.embed()`]() function:
--
--     lex:embed(css, css_start_rule, css_end_rule)
--
-- ##### Child Lexer
--
-- The process for instructing a child lexer to embed itself into a parent is
-- very similar to embedding a child into a parent: first, load the parent
-- lexer into the child lexer with the [`lexer.load()`]() function and then
-- create start and end rules for the child lexer. However, in this case, call
-- [`lexer.embed()`]() with switched arguments. For example, in the PHP lexer:
--
--     local html = lexer.load('html')
--     local php_start_rule = token('php_tag', '<?php ')
--     local php_end_rule = token('php_tag', '?>')
--     lex:add_style('php_tag', lexer.styles.embedded)
--     html:embed(lex, php_start_rule, php_end_rule)
--
-- #### Lexers with Complex State
--
-- A vast majority of lexers are not stateful and can operate on any chunk of
-- text in a document. However, there may be rare cases where a lexer does
-- need to keep track of some sort of persistent state. Rather than using
-- `lpeg.P` function patterns that set state variables, it is recommended to
-- make use of Scintilla's built-in, per-line state integers via
-- [`lexer.line_state`](). It was designed to accommodate up to 32 bit flags
-- for tracking state. [`lexer.line_from_position()`]() will return the line
-- for any position given to an `lpeg.P` function pattern. (Any positions
-- derived from that position argument will also work.)
--
-- Writing stateful lexers is beyond the scope of this document.
--
-- ### Code Folding
--
-- When reading source code, it is occasionally helpful to temporarily hide
-- blocks of code like functions, classes, comments, etc. This is the concept
-- of "folding". In many Scintilla-based editors, such as Textadept, little
-- indicators in the editor margins appear next to code that can be folded at
-- places called "fold points". When the user clicks an indicator, the editor
-- hides the code associated with the indicator until the user clicks the
-- indicator again. The lexer specifies these fold points and what code
-- exactly to fold.
--
-- The fold points for most languages occur on keywords or character
-- sequences. Examples of fold keywords are "if" and "end" in Lua and examples
-- of fold character sequences are '{', '}', "/\*", and "\*/" in C for code
-- block and comment delimiters, respectively. However, these fold points
-- cannot occur just anywhere. For example, lexers should not recognize fold
-- keywords that appear within strings or comments. The
-- [`lexer.add_fold_point()`]() function allows you to conveniently define
-- fold points with such granularity. For example, consider C:
--
--     lex:add_fold_point(lexer.OPERATOR, '{', '}')
--     lex:add_fold_point(lexer.COMMENT, '/*', '*/')
--
-- The first assignment states that any '{' or '}' that the lexer recognizes
-- as a `lexer.OPERATOR` token is a fold point. Likewise, the second
-- assignment states that any "/\*" or "\*/" that the lexer recognizes as part
-- of a `lexer.COMMENT` token is a fold point. The lexer does not consider any
-- occurrences of these characters outside their defined tokens (such as in a
-- string) as fold points. How do you specify fold keywords?
-- Here is an example for Lua:
--
--     lex:add_fold_point(lexer.KEYWORD, 'if', 'end')
--     lex:add_fold_point(lexer.KEYWORD, 'do', 'end')
--     lex:add_fold_point(lexer.KEYWORD, 'function', 'end')
--     lex:add_fold_point(lexer.KEYWORD, 'repeat', 'until')
--
-- If your lexer has case-insensitive keywords as fold points, simply add a
-- `case_insensitive_fold_points = true` option to [`lexer.new()`](), and
-- specify keywords in lower case.
--
-- If your lexer needs to do some additional processing in order to determine
-- if a token is a fold point, pass a function that returns an integer to
-- `lex:add_fold_point()`. Returning `1` indicates the token is a beginning
-- fold point and returning `-1` indicates the token is an ending fold point.
-- Returning `0` indicates the token is not a fold point. For example:
--
--     local function fold_strange_token(text, pos, line, s, symbol)
--       if ... then
--         return 1 -- beginning fold point
--       elseif ... then
--         return -1 -- ending fold point
--       end
--       return 0
--     end
--
--     lex:add_fold_point('strange_token', '|', fold_strange_token)
--
-- Any time the lexer encounters a '|' that is a "strange_token", it calls the
-- `fold_strange_token` function to determine if '|' is a fold point. The
-- lexer calls these functions with the following arguments: the text to
-- identify fold points in, the beginning position of the current line in the
-- text to fold, the current line's text, the position in the current line the
-- fold point text starts at, and the fold point text itself.
--
-- #### Fold by Indentation
--
-- Some languages have significant whitespace and/or no delimiters that
-- indicate fold points. If your lexer falls into this category and you would
-- like to mark fold points based on changes in indentation, create the lexer
-- with a `fold_by_indentation = true` option:
--
--     local lex = lexer.new('?', {fold_by_indentation = true})
--
-- ### Using Lexers
--
-- #### Textadept
--
-- Put your lexer in your *~/.textadept/lexers/* directory so you do not
-- overwrite it when upgrading Textadept. Also, lexers in this directory
-- override default lexers. Thus, Textadept loads a user *lua* lexer instead
-- of the default *lua* lexer. This is convenient for tweaking a default lexer
-- to your liking. Then add a [file type][] for your lexer if necessary.
--
-- [file type]: textadept.file_types.html
--
-- ### Migrating Legacy Lexers
--
-- Legacy lexers are of the form:
--
--     local l = require('lexer')
--     local token, word_match = l.token, l.word_match
--     local P, R, S = lpeg.P, lpeg.R, lpeg.S
--
--     local M = {_NAME = '?'}
--
--     [... token and pattern definitions ...]
--
--     M._rules = {
--       {'rule', pattern},
--       [...]
--     }
--
--     M._tokenstyles = {
--       ['token'] = 'style',
--       [...]
--     }
--
--     M._foldsymbols = {
--       _patterns = {...},
--       ['token'] = {['start'] = 1, ['end'] = -1},
--       [...]
--     }
--
--     return M
--
-- While such legacy lexers will be handled just fine without any changes, it
-- is recommended that you migrate yours. The migration process is fairly
-- straightforward:
--
-- 1. Replace all instances of `l` with `lexer`, as it's better practice and
--    results in less confusion.
-- 2. Replace `local M = {_NAME = '?'}` with `local lex = lexer.new('?')`,
--    where `?` is the name of your legacy lexer. At the end of the lexer,
--    change `return M` to `return lex`.
-- 3. Instead of defining rules towards the end of your lexer, define your
--    rules as you define your tokens and patterns using
--    [`lex:add_rule()`](#lexer.add_rule).
-- 4. Similarly, any custom token names should have their styles immediately
--    defined using [`lex:add_style()`](#lexer.add_style).
-- 5. Convert any table arguments passed to [`lexer.word_match()`]() to a
--    space-separated string of words.
-- 6. Replace any calls to `lexer.embed(M, child, ...)` and
--    `lexer.embed(parent, M, ...)` with
--    [`lex:embed`](#lexer.embed)`(child, ...)` and `parent:embed(lex, ...)`,
--    respectively.
-- 7. Define fold points with simple calls to
--    [`lex:add_fold_point()`](#lexer.add_fold_point). No need to mess with
--    Lua patterns anymore.
-- 8. Any legacy lexer options such as `M._FOLDBYINDENTATION`, `M._LEXBYLINE`,
--    `M._lexer`, etc. should be added as table options to [`lexer.new()`]().
-- 9. Any external lexer rule fetching and/or modifications via `lexer._RULES`
--    should be changed to use [`lexer.get_rule()`]() and
--    [`lexer.modify_rule()`]().
--
-- As an example, consider the following sample legacy lexer:
--
--     local l = require('lexer')
--     local token, word_match = l.token, l.word_match
--     local P, R, S = lpeg.P, lpeg.R, lpeg.S
--
--     local M = {_NAME = 'legacy'}
--
--     local ws = token(l.WHITESPACE, l.space^1)
--     local comment = token(l.COMMENT, '#' * l.nonnewline^0)
--     local string = token(l.STRING, l.delimited_range('"'))
--     local number = token(l.NUMBER, l.float + l.integer)
--     local keyword = token(l.KEYWORD, word_match{'foo', 'bar', 'baz'})
--     local custom = token('custom', P('quux'))
--     local identifier = token(l.IDENTIFIER, l.word)
--     local operator = token(l.OPERATOR, S('+-*/%^=<>,.()[]{}'))
--
--     M._rules = {
--       {'whitespace', ws},
--       {'keyword', keyword},
--       {'custom', custom},
--       {'identifier', identifier},
--       {'string', string},
--       {'comment', comment},
--       {'number', number},
--       {'operator', operator}
--     }
--
--     M._tokenstyles = {
--       ['custom'] = l.STYLE_KEYWORD .. ',bold'
--     }
--
--     M._foldsymbols = {
--       _patterns = {'[{}]'},
--       [l.OPERATOR] = {['{'] = 1, ['}'] = -1}
--     }
--
--     return M
--
-- Following the migration steps would yield:
--
--     local lexer = require('lexer')
--     local token, word_match = lexer.token, lexer.word_match
--     local P, S = lpeg.P, lpeg.S
--
--     local lex = lexer.new('legacy')
--
--     lex:add_rule('whitespace', token(lexer.WHITESPACE, lexer.space^1))
--     lex:add_rule('keyword', token(lexer.KEYWORD, word_match[[foo bar baz]]))
--     lex:add_rule('custom', token('custom', P('quux')))
--     lex:add_style('custom', lexer.styles.keyword .. {bold = true})
--     lex:add_rule('identifier', token(lexer.IDENTIFIER, lexer.word))
--     lex:add_rule('string', token(lexer.STRING, lexer.range('"')))
--     lex:add_rule('comment', token(lexer.COMMENT, lexer.to_eol('#')))
--     lex:add_rule('number', token(lexer.NUMBER, lexer.number))
--     lex:add_rule('operator', token(lexer.OPERATOR, S('+-*/%^=<>,.()[]{}')))
--
--     lex:add_fold_point(lexer.OPERATOR, '{', '}')
--
--     return lex
--
-- ### Considerations
--
-- #### Performance
--
-- There might be some slight overhead when initializing a lexer, but loading
-- a file from disk into Scintilla is usually more expensive. On modern
-- computer systems, I see no difference in speed between Lua lexers and
-- Scintilla's C++ ones. Optimize lexers for speed by re-arranging
-- `lexer.add_rule()` calls so that the most common rules match first. Do keep
-- in mind that order matters for similar rules.
--
-- In some cases, folding may be far more expensive than lexing, particularly
-- in lexers with a lot of potential fold points. If your lexer is exhibiting
-- signs of slowness, try disabling folding in your text editor first.
-- If that speeds things up, you can try reducing the number of fold points
-- you added, overriding `lexer.fold()` with your own implementation, or
-- simply eliminating folding support from your lexer.
--
-- #### Limitations
--
-- Embedded preprocessor languages like PHP cannot completely embed in their
-- parent languages in that the parent's tokens do not support start and end
-- rules. This mostly goes unnoticed, but code like
--
--     <div id="<?php echo $id; ?>">
--
-- will not style correctly.
--
-- #### Troubleshooting
--
-- Errors in lexers can be tricky to debug. Lexers print Lua errors to
-- `io.stderr` and `_G.print()` statements to `io.stdout`. Running your editor
-- from a terminal is the easiest way to see errors as they occur.
--
-- #### Risks
--
-- Poorly written lexers have the ability to crash Scintilla (and thus its
-- containing application), so unsaved data might be lost. However, I have
-- only observed these crashes in early lexer development, when syntax errors
-- or pattern errors are present. Once the lexer actually starts styling text
-- (either correctly or incorrectly, it does not matter), I have not observed
-- any crashes.
--
-- #### Acknowledgements
--
-- Thanks to Peter Odding for his [lexer post][] on the Lua mailing list that
-- provided inspiration, and thanks to Roberto Ierusalimschy for LPeg.
--
-- [lexer post]: http://lua-users.org/lists/lua-l/2007-04/msg00116.html
-- @field DEFAULT (string)
--   The token name for default tokens.
-- @field WHITESPACE (string)
--   The token name for whitespace tokens.
-- @field COMMENT (string)
--   The token name for comment tokens.
-- @field STRING (string)
--   The token name for string tokens.
-- @field NUMBER (string)
--   The token name for number tokens.
-- @field KEYWORD (string)
--   The token name for keyword tokens.
-- @field IDENTIFIER (string)
--   The token name for identifier tokens.
-- @field OPERATOR (string)
--   The token name for operator tokens.
-- @field ERROR (string)
--   The token name for error tokens.
-- @field PREPROCESSOR (string)
--   The token name for preprocessor tokens.
-- @field CONSTANT (string)
--   The token name for constant tokens.
-- @field VARIABLE (string)
--   The token name for variable tokens.
-- @field FUNCTION (string)
--   The token name for function tokens.
-- @field CLASS (string)
--   The token name for class tokens.
-- @field TYPE (string)
--   The token name for type tokens.
-- @field LABEL (string)
--   The token name for label tokens.
-- @field REGEX (string)
--   The token name for regex tokens.
-- @field any (pattern)
--   A pattern that matches any single character.
-- @field ascii (pattern)
--   A pattern that matches any ASCII character (codes 0 to 127).
-- @field extend (pattern)
--   A pattern that matches any ASCII extended character (codes 0 to 255).
-- @field alpha (pattern)
--   A pattern that matches any alphabetic character ('A'-'Z', 'a'-'z').
-- @field digit (pattern)
--   A pattern that matches any digit ('0'-'9').
-- @field alnum (pattern)
--   A pattern that matches any alphanumeric character ('A'-'Z', 'a'-'z',
--   '0'-'9').
-- @field lower (pattern)
--   A pattern that matches any lower case character ('a'-'z').
-- @field upper (pattern)
--   A pattern that matches any upper case character ('A'-'Z').
-- @field xdigit (pattern)
--   A pattern that matches any hexadecimal digit ('0'-'9', 'A'-'F', 'a'-'f').
-- @field cntrl (pattern)
--   A pattern that matches any control character (ASCII codes 0 to 31).
-- @field graph (pattern)
--   A pattern that matches any graphical character ('!' to '~').
-- @field print (pattern)
--   A pattern that matches any printable character (' ' to '~').
-- @field punct (pattern)
--   A pattern that matches any punctuation character ('!' to '/', ':' to '@',
--   '[' to '`', '{' to '~').
-- @field space (pattern)
--   A pattern that matches any whitespace character ('\t', '\v', '\f', '\n',
--   '\r', space).
-- @field newline (pattern)
--   A pattern that matches a sequence of end of line characters.
-- @field nonnewline (pattern)
--   A pattern that matches any single, non-newline character.
-- @field dec_num (pattern)
--   A pattern that matches a decimal number.
-- @field hex_num (pattern)
--   A pattern that matches a hexadecimal number.
-- @field oct_num (pattern)
--   A pattern that matches an octal number.
-- @field integer (pattern)
--   A pattern that matches either a decimal, hexadecimal, or octal number.
-- @field float (pattern)
--   A pattern that matches a floating point number.
-- @field number (pattern)
--   A pattern that matches a typical number, either a floating point, decimal,
--   hexadecimal, or octal number.
-- @field word (pattern)
--   A pattern that matches a typical word. Words begin with a letter or
--   underscore and consist of alphanumeric and underscore characters.
-- @field FOLD_BASE (number)
--   The initial (root) fold level.
-- @field FOLD_BLANK (number)
--   Flag indicating that the line is blank.
-- @field FOLD_HEADER (number)
--   Flag indicating the line is a fold point.
-- @field fold_level (table, Read-only)
--   Table of fold level bit-masks for line numbers starting from 1.
--   Fold level masks are composed of an integer level combined with any of the
--   following bits:
--
--   * `lexer.FOLD_BASE`
--     The initial fold level.
--   * `lexer.FOLD_BLANK`
--     The line is blank.
--   * `lexer.FOLD_HEADER`
--     The line is a header, or fold point.
-- @field indent_amount (table, Read-only)
--   Table of indentation amounts in character columns, for line numbers
--   starting from 1.
-- @field line_state (table)
--   Table of integer line states for line numbers starting from 1.
--   Line states can be used by lexers for keeping track of persistent states.
-- @field property (table)
--   Map of key-value string pairs.
-- @field property_expanded (table, Read-only)
--   Map of key-value string pairs with `$()` and `%()` variable replacement
--   performed in values.
-- @field property_int (table, Read-only)
--   Map of key-value pairs with values interpreted as numbers, or `0` if not
--   found.
-- @field style_at (table, Read-only)
--   Table of style names at positions in the buffer starting from 1.
-- @field folding (boolean)
--   Whether or not folding is enabled.
--   This option is disabled by default.
--   This is an alias for `lexer.property['fold'] = '1|0'`.
-- @field fold_on_zero_sum_lines (boolean)
--   Whether or not to mark as a fold point lines that contain both an ending
--   and starting fold point. For example, `} else {` would be marked as a
--   fold point.
--   This option is disabled by default.
--   This is an alias for `lexer.property['fold.on.zero.sum.lines'] = '1|0'`.
-- @field fold_compact (boolean)
--   Whether or not blank lines after an ending fold point are included in
--   that fold.
--   This option is disabled by default.
--   This is an alias for `lexer.property['fold.compact'] = '1|0'`.
-- @field fold_by_indentation (boolean)
--   Whether or not to fold based on indentation level if a lexer does not
--   have a folder.
--   Some lexers automatically enable this option. It is disabled by default.
--   This is an alias for `lexer.property['fold.by.indentation'] = '1|0'`.
-- @field fold_line_groups (boolean)
--   Whether or not to fold multiple, consecutive line groups (such as line
--   comments and import statements) and only show the top line.
--   This option is disabled by default.
--   This is an alias for `lexer.property['fold.line.groups'] = '1|0'`.
module('lexer')]=]

if not require then
  -- Substitute for Lua's require() function, which does not require the
  -- package module to be loaded.
  -- Note: all modules must be in the global namespace, which is the case in
  -- LexerLPeg's default Lua State.
  function require(name) return name == 'lexer' and M or _G[name] end
end

local lpeg = require('lpeg')
local lpeg_P, lpeg_R, lpeg_S, lpeg_V = lpeg.P, lpeg.R, lpeg.S, lpeg.V
local lpeg_Ct, lpeg_Cc, lpeg_Cp = lpeg.Ct, lpeg.Cc, lpeg.Cp
local lpeg_Cmt, lpeg_C = lpeg.Cmt, lpeg.C
local lpeg_match = lpeg.match

-- Searches for the given *name* in the given *path*.
-- This is a safe implementation of Lua 5.2's `package.searchpath()` function
-- that does not require the package module to be loaded.
local function searchpath(name, path)
  local tried = {}
  for part in path:gmatch('[^;]+') do
    local filename = part:gsub('%?', name)
    local ok, errmsg = loadfile(filename)
    if ok or not errmsg:find('cannot open') then return filename end
    tried[#tried + 1] = string.format("no file '%s'", filename)
  end
  return nil, table.concat(tried, '\n')
end

---
-- Map of color name strings to color values in `0xBBGGRR` or `"#RRGGBB"`
-- format.
-- Note: for applications running within a terminal emulator, only 16 color
-- values are recognized, regardless of how many colors a user's terminal
-- actually supports. (A terminal emulator's settings determine how to
-- actually display these recognized color values, which may end up being
-- mapped to a completely different color set.) In order to use the light
-- variant of a color, some terminals require that a style's `bold` attribute
-- be set along with that normal color. Recognized color values are black
-- (0x000000), red (0x000080), green (0x008000), yellow (0x008080), blue
-- (0x800000), magenta (0x800080), cyan (0x808000), white (0xC0C0C0), light
-- black (0x404040), light red (0x0000FF), light green (0x00FF00), light
-- yellow (0x00FFFF), light blue (0xFF0000), light magenta (0xFF00FF), light
-- cyan (0xFFFF00), and light white (0xFFFFFF).
-- @name colors
-- @class table
M.colors = setmetatable({}, {
  __index = function(_, name)
    local color = M.property['color.' .. name]
    return tonumber(color) or color
  end,
  __newindex = function(_, name, color)
    M.property['color.' .. name] = color
  end
})

-- A style object that distills into a property string that can be read by the
-- LPeg lexer.
local style_obj = {}
style_obj.__index = style_obj

-- Create a style object from a style name, property table, or legacy style
-- string.
function style_obj.new(name_or_props)
  local prop_string = tostring(name_or_props)
  if type(name_or_props) == 'string' and name_or_props:find('^[%w_]+$') then
    prop_string = string.format('$(style.%s)', name_or_props)
  elseif type(name_or_props) == 'table' then
    local settings = {}
    for k, v in pairs(name_or_props) do
      settings[#settings + 1] = type(v) ~= 'boolean' and
        string.format('%s:%s', k, v) or
        string.format('%s%s', v and '' or 'not', k)
    end
    prop_string = table.concat(settings, ',')
  end
  return setmetatable({prop_string = prop_string}, style_obj)
end

-- Returns a new style based on this one with the properties defined in the
-- given table or legacy style string.
function style_obj.__concat(self, props)
  if type(props) == 'table' then props = tostring(style_obj.new(props)) end
  return setmetatable(
    {prop_string = string.format('%s,%s', self.prop_string, props)}, style_obj)
end

-- Returns this style object as property string for use with the LPeg lexer.
function style_obj.__tostring(self) return self.prop_string end
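-- For example, style_obj.new('keyword') distills to "$(style.keyword)", and
-- style_obj.new{bold = true} distills to "bold" (table fields are emitted in
-- pairs() order).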
---
-- Map of style names to style definition tables.
--
-- Style names consist of the following default names as well as the token
-- names defined by lexers.
--
-- * `default`: The default style all others are based on.
-- * `line_number`: The line number margin style.
-- * `control_char`: The style of control character blocks.
-- * `indent_guide`: The style of indentation guides.
-- * `call_tip`: The style of call tip text. Only the `font`, `size`, `fore`,
--   and `back` style definition fields are supported.
-- * `fold_display_text`: The style of text displayed next to folded lines.
-- * `class`, `comment`, `constant`, `embedded`, `error`, `function`,
--   `identifier`, `keyword`, `label`, `number`, `operator`, `preprocessor`,
--   `regex`, `string`, `type`, `variable`, `whitespace`: Some token names
--   used by lexers. Some lexers may define more token names, so this list is
--   not exhaustive.
--
-- Style definition tables may contain the following fields:
--
-- * `font`: String font name.
-- * `size`: Integer font size.
-- * `bold`: Whether or not the font face is bold. The default value is
--   `false`.
-- * `weight`: Integer weight or boldness of a font, between 1 and 999.
-- * `italics`: Whether or not the font face is italic. The default value is
--   `false`.
-- * `underlined`: Whether or not the font face is underlined. The default
--   value is `false`.
-- * `fore`: Font face foreground color in `0xBBGGRR` or `"#RRGGBB"` format.
-- * `back`: Font face background color in `0xBBGGRR` or `"#RRGGBB"` format.
-- * `eolfilled`: Whether or not the background color extends to the end of
--   the line. The default value is `false`.
-- * `case`: Font case, `'u'` for upper, `'l'` for lower, and `'m'` for
--   normal, mixed case. The default value is `'m'`.
-- * `visible`: Whether or not the text is visible. The default value is
--   `true`.
-- * `changeable`: Whether the text is changeable instead of read-only. The
--   default value is `true`.
-- @class table
-- @name styles
M.styles = setmetatable({}, {
  __index = function(_, name) return style_obj.new(name) end,
  __newindex = function(_, name, style)
    if getmetatable(style) ~= style_obj then style = style_obj.new(style) end
    M.property['style.' .. name] = tostring(style)
  end
})

-- Default styles.
local default = {
  'nothing', 'whitespace', 'comment', 'string', 'number', 'keyword',
  'identifier', 'operator', 'error', 'preprocessor', 'constant', 'variable',
  'function', 'class', 'type', 'label', 'regex', 'embedded'
}
for _, name in ipairs(default) do
  M[name:upper()] = name
  M['STYLE_' .. name:upper()] = style_obj.new(name) -- backward compatibility
end
-- Predefined styles.
local predefined = {
  'default', 'line_number', 'brace_light', 'brace_bad', 'control_char',
  'indent_guide', 'call_tip', 'fold_display_text'
}
for _, name in ipairs(predefined) do
  M[name:upper()] = name
  M['STYLE_' .. name:upper()] = style_obj.new(name) -- backward compatibility
end

---
-- Adds pattern *rule* identified by string *id* to the ordered list of rules
-- for lexer *lexer*.
-- @param lexer The lexer to add the given rule to.
-- @param id The id associated with this rule. It does not have to be the same
--   as the name passed to `token()`.
-- @param rule The LPeg pattern of the rule.
-- @see modify_rule
-- @name add_rule
function M.add_rule(lexer, id, rule)
  if lexer._lexer then lexer = lexer._lexer end -- proxy; get true parent
  if not lexer._RULES then
    lexer._RULES = {}
    -- Contains an ordered list (by numerical index) of rule names. This is
    -- used in conjunction with lexer._RULES for building _TOKENRULE.
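    -- For example, after lex:add_rule('whitespace', ws) and
    -- lex:add_rule('keyword', kw), _RULEORDER is {'whitespace', 'keyword'}
    -- and _RULES maps each of those ids to its LPeg pattern.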
    lexer._RULEORDER = {}
  end
  lexer._RULES[id] = rule
  lexer._RULEORDER[#lexer._RULEORDER + 1] = id
  lexer:build_grammar()
end

---
-- Replaces in lexer *lexer* the existing rule identified by string *id* with
-- pattern *rule*.
-- @param lexer The lexer to modify.
-- @param id The id associated with this rule.
-- @param rule The LPeg pattern of the rule.
-- @name modify_rule
function M.modify_rule(lexer, id, rule)
  if lexer._lexer then lexer = lexer._lexer end -- proxy; get true parent
  lexer._RULES[id] = rule
  lexer:build_grammar()
end

---
-- Returns the rule identified by string *id*.
-- @param lexer The lexer to fetch a rule from.
-- @param id The id of the rule to fetch.
-- @return pattern
-- @name get_rule
function M.get_rule(lexer, id)
  if lexer._lexer then lexer = lexer._lexer end -- proxy; get true parent
  return lexer._RULES[id]
end

---
-- Associates string *token_name* in lexer *lexer* with style table *style*.
-- *style* may have the following fields:
--
-- * `font`: String font name.
-- * `size`: Integer font size.
-- * `bold`: Whether or not the font face is bold. The default value is
--   `false`.
-- * `weight`: Integer weight or boldness of a font, between 1 and 999.
-- * `italics`: Whether or not the font face is italic. The default value is
--   `false`.
-- * `underlined`: Whether or not the font face is underlined. The default
--   value is `false`.
-- * `fore`: Font face foreground color in `0xBBGGRR` or `"#RRGGBB"` format.
-- * `back`: Font face background color in `0xBBGGRR` or `"#RRGGBB"` format.
-- * `eolfilled`: Whether or not the background color extends to the end of
--   the line. The default value is `false`.
-- * `case`: Font case, `'u'` for upper, `'l'` for lower, and `'m'` for
--   normal, mixed case. The default value is `'m'`.
-- * `visible`: Whether or not the text is visible. The default value is
--   `true`.
-- * `changeable`: Whether the text is changeable instead of read-only. The
--   default value is `true`.
--
-- Field values may also contain "$(property.name)" expansions for properties
-- defined in Scintilla, theme files, etc.
-- @param lexer The lexer to add a style to.
-- @param token_name The name of the token to associate with the style.
-- @param style A style string for Scintilla.
-- @usage lex:add_style('longstring', lexer.styles.string)
-- @usage lex:add_style('deprecated_func', lexer.styles['function'] ..
--   {italics = true})
-- @usage lex:add_style('visible_ws', lexer.styles.whitespace ..
--   {back = lexer.colors.grey})
-- @name add_style
function M.add_style(lexer, token_name, style)
  local num_styles = lexer._numstyles
  if num_styles == 33 then num_styles = num_styles + 8 end -- skip predefined
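  -- Note: token styles occupy numbers 1 through 32, and Scintilla's
  -- predefined styles occupy 33 through 40, so allocation skips that block.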
  if num_styles >= 256 then print('Too many styles defined (256 MAX)') end
  lexer._TOKENSTYLES[token_name], lexer._numstyles = num_styles, num_styles + 1
  if type(style) == 'table' and not getmetatable(style) then
    style = style_obj.new(style)
  end
  lexer._EXTRASTYLES[token_name] = tostring(style)
  -- If the lexer is a proxy or a child that embedded itself, copy this style
  -- to the parent lexer.
  if lexer._lexer then lexer._lexer:add_style(token_name, style) end
end

---
-- Adds to lexer *lexer* a fold point whose beginning and end tokens are
-- string *token_name* tokens with string content *start_symbol* and
-- *end_symbol*, respectively.
-- In the event that *start_symbol* may or may not be a fold point depending
-- on context, and that additional processing is required, *end_symbol* may be
-- a function that ultimately returns `1` (indicating a beginning fold point),
-- `-1` (indicating an ending fold point), or `0` (indicating no fold point).
-- That function is passed the following arguments:
--
-- * `text`: The text being processed for fold points.
-- * `pos`: The position in *text* of the beginning of the line currently
--   being processed.
-- * `line`: The text of the line currently being processed.
-- * `s`: The position of *start_symbol* in *line*.
-- * `symbol`: *start_symbol* itself.
-- @param lexer The lexer to add a fold point to.
-- @param token_name The token name of text that indicates a fold point.
-- @param start_symbol The text that indicates the beginning of a fold point.
-- @param end_symbol Either the text that indicates the end of a fold point,
--   or a function that returns whether or not *start_symbol* is a beginning
--   fold point (1), an ending fold point (-1), or not a fold point at all
--   (0).
-- @usage lex:add_fold_point(lexer.OPERATOR, '{', '}')
-- @usage lex:add_fold_point(lexer.KEYWORD, 'if', 'end')
-- @usage lex:add_fold_point(lexer.COMMENT, lexer.fold_consecutive_lines('#'))
-- @usage lex:add_fold_point('custom', function(text, pos, line, s, symbol)
--   ... end)
-- @name add_fold_point
function M.add_fold_point(lexer, token_name, start_symbol, end_symbol)
  if not lexer._FOLDPOINTS then lexer._FOLDPOINTS = {_SYMBOLS = {}} end
  local symbols = lexer._FOLDPOINTS._SYMBOLS
  if not symbols[start_symbol] then
    symbols[#symbols + 1], symbols[start_symbol] = start_symbol, true
  end
  if not lexer._FOLDPOINTS[token_name] then
    lexer._FOLDPOINTS[token_name] = {}
  end
  if type(end_symbol) == 'string' then
    if not symbols[end_symbol] then
      symbols[#symbols + 1], symbols[end_symbol] = end_symbol, true
    end
    lexer._FOLDPOINTS[token_name][start_symbol] = 1
    lexer._FOLDPOINTS[token_name][end_symbol] = -1
  else
    lexer._FOLDPOINTS[token_name][start_symbol] = end_symbol -- function or int
  end
  -- If the lexer is a proxy or a child that embedded itself, copy this fold
  -- point to the parent lexer.
  if lexer._lexer then
    lexer._lexer:add_fold_point(token_name, start_symbol, end_symbol)
  end
end

-- (Re)constructs `lexer._TOKENRULE`.
local function join_tokens(lexer)
  local patterns, order = lexer._RULES, lexer._RULEORDER
  local token_rule = patterns[order[1]]
  for i = 2, #order do token_rule = token_rule + patterns[order[i]] end
  lexer._TOKENRULE = token_rule + M.token(M.DEFAULT, M.any)
  return lexer._TOKENRULE
end

-- Metatable for lexer grammars.
-- These grammars are just tables ultimately passed to `lpeg.P()`.
local grammar_mt = {__index = {
  -- Adds lexer *lexer* and any of its embedded lexers to this grammar.
  -- @param lexer The lexer to add.
  add_lexer = function(self, lexer)
    local lexer_name = lexer._PARENTNAME or lexer._NAME
    local token_rule = lexer:join_tokens()
    for _, child in ipairs(lexer._CHILDREN) do
      if child._CHILDREN then self:add_lexer(child) end
      local rules = child._EMBEDDEDRULES[lexer_name]
      local rules_token_rule = self['__' .. child._NAME] or rules.token_rule
      self[child._NAME] = (-rules.end_rule * rules_token_rule)^0 *
        rules.end_rule^-1 * lpeg_V(lexer_name)
      local embedded_child = '_' .. child._NAME
      self[embedded_child] = rules.start_rule *
        (-rules.end_rule * rules_token_rule)^0 * rules.end_rule^-1
      token_rule = lpeg_V(embedded_child) + token_rule
    end
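    -- '__<name>' stores this lexer's token rule (including any embedded-child
    -- entry points) for reuse by nested children; '<name>' is the grammar
    -- rule that repeatedly matches it.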
    self['__' .. lexer_name] = token_rule -- can contain embedded lexer rules
    self[lexer_name] = token_rule^0
  end
}}

-- (Re)constructs `lexer._GRAMMAR`.
-- @param initial_rule The name of the rule to start lexing with. The default
--   value is `lexer._NAME`. Multilang lexers use this to start with a child
--   rule if necessary.
local function build_grammar(lexer, initial_rule)
  if not lexer._RULES then return end
  if lexer._CHILDREN then
    if not initial_rule then initial_rule = lexer._NAME end
    local grammar = setmetatable({initial_rule}, grammar_mt)
    grammar:add_lexer(lexer)
    lexer._INITIALRULE = initial_rule
    lexer._GRAMMAR = lpeg_Ct(lpeg_P(grammar))
  else
    lexer._GRAMMAR = lpeg_Ct(lexer:join_tokens()^0)
  end
end

---
-- Embeds child lexer *child* in parent lexer *lexer* using patterns
-- *start_rule* and *end_rule*, which signal the beginning and end of the
-- embedded lexer, respectively.
-- @param lexer The parent lexer.
-- @param child The child lexer.
-- @param start_rule The pattern that signals the beginning of the embedded
--   lexer.
-- @param end_rule The pattern that signals the end of the embedded lexer.
-- @usage html:embed(css, css_start_rule, css_end_rule)
-- @usage html:embed(lex, php_start_rule, php_end_rule) -- from php lexer
-- @name embed
function M.embed(lexer, child, start_rule, end_rule)
  if lexer._lexer then lexer = lexer._lexer end -- proxy; get true parent
  -- Add child rules.
  if not child._EMBEDDEDRULES then child._EMBEDDEDRULES = {} end
  if not child._RULES then error('Cannot embed lexer with no rules') end
  child._EMBEDDEDRULES[lexer._NAME] = {
    ['start_rule'] = start_rule, token_rule = child:join_tokens(),
    ['end_rule'] = end_rule
  }
  if not lexer._CHILDREN then lexer._CHILDREN = {} end
  local children = lexer._CHILDREN
  children[#children + 1] = child
  -- Add child styles.
  for token, style in pairs(child._EXTRASTYLES) do
    lexer:add_style(token, style)
  end
  -- Add child fold symbols.
  if child._FOLDPOINTS then
    for token_name, symbols in pairs(child._FOLDPOINTS) do
      if token_name ~= '_SYMBOLS' then
        for symbol, v in pairs(symbols) do
          lexer:add_fold_point(token_name, symbol, v)
        end
      end
    end
  end
  lexer:build_grammar()
  child._lexer = lexer -- use parent's tokens if child is embedding itself
end

---
-- Lexes a chunk of text *text* (that has an initial style number of
-- *init_style*) using lexer *lexer*, returning a table of token names and
-- positions.
-- @param lexer The lexer to lex text with.
-- @param text The text in the buffer to lex.
-- @param init_style The current style. Multiple-language lexers use this to
--   determine which language to start lexing in.
-- @return table of token names and positions.
-- @name lex
function M.lex(lexer, text, init_style)
  if not lexer._GRAMMAR then return {M.DEFAULT, #text + 1} end
  if not lexer._LEXBYLINE then
    -- For multilang lexers, build a new grammar whose initial_rule is the
    -- current language.
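    -- (init_style is compared against token style numbers below; a style
    -- named '<lang>_whitespace' identifies which language was active, and
    -- its name selects the grammar's starting rule.)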
    if lexer._CHILDREN then
      for style, style_num in pairs(lexer._TOKENSTYLES) do
        if style_num == init_style then
          local lexer_name = style:match('^(.+)_whitespace') or
            lexer._PARENTNAME or lexer._NAME
          if lexer._INITIALRULE ~= lexer_name then
            lexer:build_grammar(lexer_name)
          end
          break
        end
      end
    end
    return lpeg_match(lexer._GRAMMAR, text)
  else
    local tokens = {}
    local function append(tokens, line_tokens, offset)
      for i = 1, #line_tokens, 2 do
        tokens[#tokens + 1] = line_tokens[i]
        tokens[#tokens + 1] = line_tokens[i + 1] + offset
      end
    end
    local offset = 0
    local grammar = lexer._GRAMMAR
    for line in text:gmatch('[^\r\n]*\r?\n?') do
      local line_tokens = lpeg_match(grammar, line)
      if line_tokens then append(tokens, line_tokens, offset) end
      offset = offset + #line
      -- Use the default style to the end of the line if none was specified.
      if tokens[#tokens] ~= offset then
        tokens[#tokens + 1], tokens[#tokens + 2] = 'default', offset + 1
      end
    end
    return tokens
  end
end

---
-- Determines fold points in a chunk of text *text* using lexer *lexer*,
-- returning a table of fold levels associated with line numbers.
-- *text* starts at position *start_pos* on line number *start_line* with a
-- beginning fold level of *start_level* in the buffer.
-- @param lexer The lexer to fold text with.
-- @param text The text in the buffer to fold.
-- @param start_pos The position in the buffer *text* starts at, counting from
--   1.
-- @param start_line The line number *text* starts on, counting from 1.
-- @param start_level The fold level *text* starts on.
-- @return table of fold levels associated with line numbers.
-- @name fold
function M.fold(lexer, text, start_pos, start_line, start_level)
  local folds = {}
  if text == '' then return folds end
  local fold = M.property_int['fold'] > 0
  local FOLD_BASE = M.FOLD_BASE
  local FOLD_HEADER, FOLD_BLANK = M.FOLD_HEADER, M.FOLD_BLANK
  if fold and lexer._FOLDPOINTS then
    local lines = {}
    for p, l in (text .. '\n'):gmatch('()(.-)\r?\n') do
      lines[#lines + 1] = {p, l}
    end
    local fold_zero_sum_lines = M.property_int['fold.on.zero.sum.lines'] > 0
    local fold_compact = M.property_int['fold.compact'] > 0
    local fold_points = lexer._FOLDPOINTS
    local fold_point_symbols = fold_points._SYMBOLS
    local style_at, fold_level = M.style_at, M.fold_level
    local line_num, prev_level = start_line, start_level
    local current_level = prev_level
    for _, captures in ipairs(lines) do
      local pos, line = captures[1], captures[2]
      if line ~= '' then
        if lexer._CASEINSENSITIVEFOLDPOINTS then line = line:lower() end
        local level_decreased = false
        for _, symbol in ipairs(fold_point_symbols) do
          local word = not symbol:find('[^%w_]')
          local s, e = line:find(symbol, 1, true)
          while s and e do
            --if not word or
            --   line:find('^%f[%w_]' .. symbol .. '%f[^%w_]', s) then
            local word_before = s > 1 and line:find('^[%w_]', s - 1)
            local word_after = line:find('^[%w_]', e + 1)
            if not word or not (word_before or word_after) then
              local symbols = fold_points[style_at[start_pos + pos - 1 + s - 1]]
              local level = symbols and symbols[symbol]
              if type(level) == 'function' then
                level = level(text, pos, line, s, symbol)
              end
              if type(level) == 'number' then
                current_level = current_level + level
                if level < 0 and current_level < prev_level then
                  -- Potential zero-sum line. If the level were to go back up
                  -- on the same line, the line may be marked as a fold header.
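                  -- Record the decrease; the zero-sum branch below (guarded
                  -- by 'fold.on.zero.sum.lines') decides whether this line is
                  -- ultimately marked as a header.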
                  level_decreased = true
                end
              end
            end
            s, e = line:find(symbol, s + 1, true)
          end
        end
        folds[line_num] = prev_level
        if current_level > prev_level then
          folds[line_num] = prev_level + FOLD_HEADER
        elseif level_decreased and current_level == prev_level and
               fold_zero_sum_lines then
          if line_num > start_line then
            folds[line_num] = prev_level - 1 + FOLD_HEADER
          else
            -- Typing within a zero-sum line.
            local level = fold_level[line_num] - 1
            if level > FOLD_HEADER then level = level - FOLD_HEADER end
            if level > FOLD_BLANK then level = level - FOLD_BLANK end
            folds[line_num] = level + FOLD_HEADER
            current_level = current_level + 1
          end
        end
        if current_level < FOLD_BASE then current_level = FOLD_BASE end
        prev_level = current_level
      else
        folds[line_num] = prev_level + (fold_compact and FOLD_BLANK or 0)
      end
      line_num = line_num + 1
    end
  elseif fold and (lexer._FOLDBYINDENTATION or
                   M.property_int['fold.by.indentation'] > 0) then
    -- Indentation based folding.
    -- Calculate indentation per line.
    local indentation = {}
    for indent, line in (text .. '\n'):gmatch('([\t ]*)([^\r\n]*)\r?\n') do
      indentation[#indentation + 1] = line ~= '' and #indent
    end
    -- Find the first non-blank line before start_line. If the current line is
    -- indented, make that previous line a header and update the levels of any
    -- blank lines in between. If the current line is blank, match the level of
    -- the previous non-blank line.
    local current_level = start_level
    for i = start_line, 1, -1 do
      local level = M.fold_level[i]
      if level >= FOLD_HEADER then level = level - FOLD_HEADER end
      if level < FOLD_BLANK then
        local indent = M.indent_amount[i]
        if indentation[1] and indentation[1] > indent then
          folds[i] = FOLD_BASE + indent + FOLD_HEADER
          for j = i + 1, start_line - 1 do
            folds[j] = start_level + FOLD_BLANK
          end
        elseif not indentation[1] then
          current_level = FOLD_BASE + indent
        end
        break
      end
    end
    -- Iterate over lines, setting fold numbers and fold flags.
    for i = 1, #indentation do
      if indentation[i] then
        current_level = FOLD_BASE + indentation[i]
        folds[start_line + i - 1] = current_level
        for j = i + 1, #indentation do
          if indentation[j] then
            if FOLD_BASE + indentation[j] > current_level then
              folds[start_line + i - 1] = current_level + FOLD_HEADER
              current_level = FOLD_BASE + indentation[j] -- for any blanks below
            end
            break
          end
        end
      else
        folds[start_line + i - 1] = current_level + FOLD_BLANK
      end
    end
  else
    -- No folding, reset fold levels if necessary.
    local current_line = start_line
    for _ in text:gmatch('\r?\n') do
      folds[current_line] = start_level
      current_line = current_line + 1
    end
  end
  return folds
end

---
-- Creates and returns a new lexer with the given name.
-- @param name The lexer's name.
-- @param opts Table of lexer options. Options currently supported:
--   * `lex_by_line`: Whether or not the lexer only processes whole lines of
--     text (instead of arbitrary chunks of text) at a time.
--     Line lexers cannot look ahead to subsequent lines.
--     The default value is `false`.
--   * `fold_by_indentation`: Whether or not the lexer does not define any fold
--     points, such that fold points are calculated based on changes in line
--     indentation.
--     The default value is `false`.
--   * `case_insensitive_fold_points`: Whether or not fold points added via
--     `lexer.add_fold_point()` ignore case.
--     The default value is `false`.
--   * `inherit`: Lexer to inherit from.
--     The default value is `nil`.
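-- For example, a hypothetical lexer for a line-based, indentation-folded
-- language might be created like this ('mylang' is just a placeholder name):
--
--     local lex = lexer.new('mylang',
--       {lex_by_line = true, fold_by_indentation = true})
--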
-- @usage lexer.new('rhtml', {inherit = lexer.load('html')})
-- @name new
function M.new(name, opts)
  local lexer = {
    _NAME = assert(name, 'lexer name expected'),
    _LEXBYLINE = opts and opts['lex_by_line'],
    _FOLDBYINDENTATION = opts and opts['fold_by_indentation'],
    _CASEINSENSITIVEFOLDPOINTS = opts and opts['case_insensitive_fold_points'],
    _lexer = opts and opts['inherit']
  }

  -- Create the initial maps for token names to style numbers and styles.
  local token_styles = {}
  for i = 1, #default do token_styles[default[i]] = i end
  for i = 1, #predefined do token_styles[predefined[i]] = i + 32 end
  lexer._TOKENSTYLES, lexer._numstyles = token_styles, #default + 1
  lexer._EXTRASTYLES = {}

  return setmetatable(lexer, {__index = {
    add_rule = M.add_rule, modify_rule = M.modify_rule, get_rule = M.get_rule,
    add_style = M.add_style, add_fold_point = M.add_fold_point,
    join_tokens = join_tokens, build_grammar = build_grammar, embed = M.embed,
    lex = M.lex, fold = M.fold
  }})
end

-- Legacy support for older lexers.
-- Processes the `lex._rules`, `lex._tokenstyles`, and `lex._foldsymbols`
-- tables.
-- Since legacy lexers may be processed up to twice, ensure their default styles
-- and rules are not processed more than once.
local function process_legacy_lexer(lexer)
  local function warn(msg) --[[io.stderr:write(msg, "\n")]] end
  if not lexer._LEGACY then
    lexer._LEGACY = true
    warn("lexers as tables are deprecated; use 'lexer.new()'")
    local token_styles = {}
    for i = 1, #default do token_styles[default[i]] = i end
    for i = 1, #predefined do token_styles[predefined[i]] = i + 32 end
    lexer._TOKENSTYLES, lexer._numstyles = token_styles, #default + 1
    lexer._EXTRASTYLES = {}
    setmetatable(lexer, getmetatable(M.new('')))
    if lexer._rules then
      warn("lexer '_rules' table is deprecated; use 'add_rule()'")
      for _, rule in ipairs(lexer._rules) do
        lexer:add_rule(rule[1], rule[2])
      end
    end
  end
  if lexer._tokenstyles then
    warn("lexer '_tokenstyles' table is deprecated; use 'add_style()'")
    for token, style in pairs(lexer._tokenstyles) do
      -- If this legacy lexer is being processed a second time, only add styles
      -- added since the first processing.
      if not lexer._TOKENSTYLES[token] then lexer:add_style(token, style) end
    end
  end
  if lexer._foldsymbols then
    warn("lexer '_foldsymbols' table is deprecated; use 'add_fold_point()'")
    for token_name, symbols in pairs(lexer._foldsymbols) do
      if type(symbols) == 'table' and token_name ~= '_patterns' then
        for symbol, v in pairs(symbols) do
          lexer:add_fold_point(token_name, symbol, v)
        end
      end
    end
    if lexer._foldsymbols._case_insensitive then
      lexer._CASEINSENSITIVEFOLDPOINTS = true
    end
  elseif lexer._fold then
    lexer.fold = function(self, ...) return lexer._fold(...) end
  end
end

local lexers = {} -- cache of loaded lexers

---
-- Initializes or loads and returns the lexer of string name *name*.
-- Scintilla calls this function in order to load a lexer. Parent lexers also
-- call this function in order to load child lexers and vice-versa. The user
-- calls this function in order to load a lexer when using this module as a Lua
-- library.
-- @param name The name of the lexing language.
-- @param alt_name The alternate name of the lexing language. This is useful
--   for embedding the same child lexer with multiple sets of start and end
--   tokens.
-- @param cache Flag indicating whether or not to load lexers from the cache.
--   This should only be `true` when initially loading a lexer (e.g. not from
--   within another lexer for embedding purposes).
--   The default value is `false`.
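-- @usage local lua = lexer.load('lua') -- stand-alone use as a Lua library;
--   see the sketch at the end of this file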
-- @return lexer object
-- @name load
function M.load(name, alt_name, cache)
  if cache and lexers[alt_name or name] then return lexers[alt_name or name] end

  -- When using this module as a stand-alone module, the `property` and
  -- `property_int` tables do not exist (they are not useful). Create them in
  -- order to prevent errors from occurring.
  if not M.property then
    M.property = setmetatable(
      {['lexer.lpeg.home'] = package.path:gsub('/%?%.lua', '')}, {
        __index = function() return '' end,
        __newindex = function(t, k, v) rawset(t, k, tostring(v)) end
      })
    M.property_int = setmetatable({}, {
      __index = function(t, k) return tonumber(M.property[k]) or 0 end,
      __newindex = function() error('read-only property') end
    })
  end

  -- Load the language lexer with its rules, styles, etc.
  -- However, replace the default `WHITESPACE` style name with a unique
  -- whitespace style name (and then automatically add it afterwards), since
  -- embedded lexing relies on these unique whitespace style names. Note that
  -- loading embedded lexers changes `WHITESPACE` again, so when adding it
  -- later, do not reference the potentially incorrect value.
  M.WHITESPACE = (alt_name or name) .. '_whitespace'
  local path = M.property['lexer.lpeg.home']:gsub(';', '/?.lua;') .. '/?.lua'
  local lexer = dofile(assert(searchpath(name, path)))
  assert(lexer, string.format("'%s.lua' did not return a lexer", name))
  if alt_name then lexer._NAME = alt_name end
  if not getmetatable(lexer) or lexer._LEGACY then
    -- A legacy lexer may need to be processed a second time in order to pick
    -- up any `_tokenstyles` or `_foldsymbols` added after `lexer.embed_lexer()`.
    process_legacy_lexer(lexer)
    if lexer._lexer and lexer._lexer._LEGACY then
      process_legacy_lexer(lexer._lexer) -- mainly for `_foldsymbols` edits
    end
  end
  lexer:add_style((alt_name or name) .. '_whitespace', M.styles.whitespace)

  -- If the lexer is a proxy or a child that embedded itself, set the parent to
  -- be the main lexer. Keep a reference to the old parent name since embedded
  -- child rules reference and use that name.
  if lexer._lexer then
    lexer = lexer._lexer
    lexer._PARENTNAME, lexer._NAME = lexer._NAME, alt_name or name
  end

  if cache then lexers[alt_name or name] = lexer end
  return lexer
end

-- The following are utility functions lexers will have access to.

-- Common patterns.
M.any = lpeg_P(1)
M.alpha = lpeg_R('AZ', 'az')
M.digit = lpeg_R('09')
M.alnum = lpeg_R('AZ', 'az', '09')
M.lower = lpeg_R('az')
M.upper = lpeg_R('AZ')
M.xdigit = lpeg_R('09', 'AF', 'af')
M.graph = lpeg_R('!~')
M.punct = lpeg_R('!/', ':@', '[\'', '{~')
M.space = lpeg_S('\t\v\f\n\r ')
M.newline = lpeg_P('\r')^-1 * '\n'
M.nonnewline = 1 - M.newline
M.dec_num = M.digit^1
M.hex_num = '0' * lpeg_S('xX') * M.xdigit^1
M.oct_num = '0' * lpeg_R('07')^1
M.integer = lpeg_S('+-')^-1 * (M.hex_num + M.oct_num + M.dec_num)
M.float = lpeg_S('+-')^-1 * (
  (M.digit^0 * '.' * M.digit^1 + M.digit^1 * '.' * M.digit^0 * -lpeg_P('.')) *
  (lpeg_S('eE') * lpeg_S('+-')^-1 * M.digit^1)^-1 +
  (M.digit^1 * lpeg_S('eE') * lpeg_S('+-')^-1 * M.digit^1))
M.number = M.float + M.integer
M.word = (M.alpha + '_') * (M.alnum + '_')^0

-- Deprecated.
M.nonnewline_esc = 1 - (M.newline + '\\') + '\\' * M.any
M.ascii = lpeg_R('\000\127')
M.extend = lpeg_R('\000\255')
M.cntrl = lpeg_R('\000\031')
M.print = lpeg_R(' ~')

---
-- Creates and returns a token pattern with token name *name* and pattern
-- *patt*.
-- If *name* is not a predefined token name, its style must be defined via
-- `lexer.add_style()`.
-- @param name The name of the token.
--   If this name is not a predefined token name, then a style needs to be
--   associated with it via `lexer.add_style()`.
-- @param patt The LPeg pattern associated with the token.
-- @return pattern
-- @usage local ws = token(lexer.WHITESPACE, lexer.space^1)
-- @usage local annotation = token('annotation', '@' * lexer.word)
-- @name token
function M.token(name, patt)
  return lpeg_Cc(name) * patt * lpeg_Cp()
end

---
-- Creates and returns a pattern that matches from string or pattern *prefix*
-- until the end of the line.
-- *escape* indicates whether the end of the line can be escaped with a '\'
-- character.
-- @param prefix String or pattern prefix to start matching at.
-- @param escape Optional flag indicating whether or not newlines can be
--   escaped by a '\' character. The default value is `false`.
-- @return pattern
-- @usage local line_comment = lexer.to_eol('//')
-- @usage local line_comment = lexer.to_eol(P('#') + ';')
-- @name to_eol
function M.to_eol(prefix, escape)
  return prefix * (not escape and M.nonnewline or M.nonnewline_esc)^0
end

---
-- Creates and returns a pattern that matches a range of text bounded by
-- strings or patterns *s* and *e*.
-- This is a convenience function for matching more complicated ranges like
-- strings with escape characters, balanced parentheses, and block comments
-- (nested or not). *e* is optional and defaults to *s*. *single_line*
-- indicates whether or not the range must be on a single line; *escapes*
-- indicates whether or not to allow '\' as an escape character; and *balanced*
-- indicates whether or not to handle balanced ranges like parentheses, and
-- requires *s* and *e* to be different.
-- @param s String or pattern start of a range.
-- @param e Optional string or pattern end of a range. The default value is
--   *s*.
-- @param single_line Optional flag indicating whether or not the range must be
--   on a single line.
-- @param escapes Optional flag indicating whether or not the range end may
--   be escaped by a '\' character.
--   The default value is `false` unless *s* and *e* are identical,
--   single-character strings. In that case, the default value is `true`.
-- @param balanced Optional flag indicating whether or not to match a balanced
--   range, like the "%b" Lua pattern. This flag only applies if *s* and *e*
--   are different.
-- @return pattern
-- @usage local dq_str_escapes = lexer.range('"')
-- @usage local dq_str_noescapes = lexer.range('"', false, false)
-- @usage local unbalanced_parens = lexer.range('(', ')')
-- @usage local balanced_parens = lexer.range('(', ')', false, false, true)
-- @name range
function M.range(s, e, single_line, escapes, balanced)
  if type(e) ~= 'string' and type(e) ~= 'userdata' then
    e, single_line, escapes, balanced = s, e, single_line, escapes
  end
  local any = M.any - e
  if single_line then any = any - '\n' end
  if balanced then any = any - s end
  if escapes == nil then
    -- Only allow escapes by default for ranges with identical,
    -- single-character string delimiters.
    escapes = type(s) == 'string' and #s == 1 and s == e
  end
  if escapes then any = any - '\\' + '\\' * M.any end
  if balanced and s ~= e then
    return lpeg_P{s * (any + lpeg_V(1))^0 * lpeg_P(e)^-1}
  else
    return s * any^0 * lpeg_P(e)^-1
  end
end

-- Deprecated function. Use `lexer.range()` instead.
-- Creates and returns a pattern that matches a range of text bounded by
-- *chars* characters.
-- This is a convenience function for matching more complicated delimited
-- ranges like strings with escape characters and balanced parentheses.
-- *single_line* indicates whether or not the range must be on a single line,
-- *no_escape* indicates whether or not to ignore '\' as an escape character,
-- and *balanced* indicates whether or not to handle balanced ranges like
-- parentheses and requires *chars* to be composed of two characters.
-- @param chars The character(s) that bound the matched range.
-- @param single_line Optional flag indicating whether or not the range must be
--   on a single line.
-- @param no_escape Optional flag indicating whether or not the range end
--   character may be escaped by a '\\' character.
-- @param balanced Optional flag indicating whether or not to match a balanced
--   range, like the "%b" Lua pattern. This flag only applies if *chars*
--   consists of two different characters (e.g. "()").
-- @return pattern
-- @usage local dq_str_escapes = lexer.delimited_range('"')
-- @usage local dq_str_noescapes = lexer.delimited_range('"', false, true)
-- @usage local unbalanced_parens = lexer.delimited_range('()')
-- @usage local balanced_parens = lexer.delimited_range('()', false, false,
--   true)
-- @see range
-- @name delimited_range
function M.delimited_range(chars, single_line, no_escape, balanced)
  print("lexer.delimited_range() is deprecated, use lexer.range()")
  local s = chars:sub(1, 1)
  local e = #chars == 2 and chars:sub(2, 2) or s
  local range
  local b = balanced and s or ''
  local n = single_line and '\n' or ''
  if no_escape then
    local invalid = lpeg_S(e .. n .. b)
    range = M.any - invalid
  else
    local invalid = lpeg_S(e .. n .. b) + '\\'
    range = M.any - invalid + '\\' * M.any
  end
  if balanced and s ~= e then
    return lpeg_P{s * (range + lpeg_V(1))^0 * e}
  else
    return s * range^0 * lpeg_P(e)^-1
  end
end

---
-- Creates and returns a pattern that matches pattern *patt* only at the
-- beginning of a line.
-- @param patt The LPeg pattern to match on the beginning of a line.
-- @return pattern
-- @usage local preproc = token(lexer.PREPROCESSOR, lexer.starts_line('#') *
--   lexer.nonnewline^0)
-- @name starts_line
function M.starts_line(patt)
  return lpeg_Cmt(lpeg_C(patt), function(input, index, match, ...)
    local pos = index - #match
    if pos == 1 then return index, ... end
    local char = input:sub(pos - 1, pos - 1)
    if char == '\n' or char == '\r' or char == '\f' then return index, ... end
  end)
end

---
-- Creates and returns a pattern that verifies the first non-whitespace
-- character behind the current match position is in string set *s*.
-- @param s String character set like one passed to `lpeg.S()`.
-- @return pattern
-- @usage local regex = lexer.last_char_includes('+-*!%^&|=,([{') *
--   lexer.range('/')
-- @name last_char_includes
function M.last_char_includes(s)
  s = string.format('[%s]', s:gsub('[-%%%[]', '%%%1'))
  return lpeg_P(function(input, index)
    if index == 1 then return index end
    local i = index
    while input:sub(i - 1, i - 1):match('[ \t\r\n\f]') do i = i - 1 end
    if input:sub(i - 1, i - 1):match(s) then return index end
  end)
end

-- Deprecated function. Use `lexer.range()` instead.
-- Returns a pattern that matches a balanced range of text that starts with
-- string *start_chars* and ends with string *end_chars*.
-- With single-character delimiters, this function is identical to
-- `delimited_range(start_chars .. end_chars, false, true, true)`.
-- @param start_chars The string starting a nested sequence.
-- @param end_chars The string ending a nested sequence.
-- @return pattern
-- @usage local nested_comment = lexer.nested_pair('/*', '*/')
-- @see range
-- @name nested_pair
function M.nested_pair(start_chars, end_chars)
  print("lexer.nested_pair() is deprecated, use lexer.range()")
  local s, e = start_chars, lpeg_P(end_chars)^-1
  return lpeg_P{s * (M.any - s - end_chars + lpeg_V(1))^0 * e}
end

---
-- Creates and returns a pattern that matches any single word in string
-- *words*.
-- *case_insensitive* indicates whether or not to ignore case when matching
-- words.
-- This is a convenience function for simplifying a set of ordered choice word
-- patterns.
-- If *words* is a multi-line string, it may contain Lua line comments (`--`)
-- that will ultimately be ignored.
-- @param words A string list of words separated by spaces.
-- @param case_insensitive Optional boolean flag indicating whether or not the
--   word match is case-insensitive. The default value is `false`.
-- @param word_chars Unused legacy parameter.
-- @return pattern
-- @usage local keyword = token(lexer.KEYWORD, word_match[[foo bar baz]])
-- @usage local keyword = token(lexer.KEYWORD, word_match([[foo-bar foo-baz
--   bar-foo bar-baz baz-foo baz-bar]], true))
-- @name word_match
function M.word_match(words, case_insensitive, word_chars)
  local word_list = {}
  if type(words) == 'table' then
    -- Legacy `word_match(word_list, word_chars, case_insensitive)` form.
    words = table.concat(words, ' ')
    word_chars, case_insensitive = case_insensitive, word_chars
  end
  for word in words:gsub('%-%-[^\n]+', ''):gmatch('%S+') do
    word_list[case_insensitive and word:lower() or word] = true
    for char in word:gmatch('[^%w_]') do
      if not (word_chars or ''):find(char, 1, true) then
        word_chars = (word_chars or '') .. char
      end
    end
  end
  local chars = M.alnum + '_'
  if (word_chars or '') ~= '' then chars = chars + lpeg_S(word_chars) end
  return lpeg_Cmt(chars^1, function(input, index, word)
    if case_insensitive then word = word:lower() end
    return word_list[word] and index or nil
  end)
end

-- Deprecated legacy function. Use `parent:embed()` instead.
-- Embeds child lexer *child* in parent lexer *parent* using patterns
-- *start_rule* and *end_rule*, which signal the beginning and end of the
-- embedded lexer, respectively.
-- @param parent The parent lexer.
-- @param child The child lexer.
-- @param start_rule The pattern that signals the beginning of the embedded
--   lexer.
-- @param end_rule The pattern that signals the end of the embedded lexer.
-- @usage lexer.embed_lexer(M, css, css_start_rule, css_end_rule)
-- @usage lexer.embed_lexer(html, M, php_start_rule, php_end_rule)
-- @usage lexer.embed_lexer(html, ruby, ruby_start_rule, ruby_end_rule)
-- @see embed
-- @name embed_lexer
function M.embed_lexer(parent, child, start_rule, end_rule)
  if not getmetatable(parent) then process_legacy_lexer(parent) end
  if not getmetatable(child) then process_legacy_lexer(child) end
  parent:embed(child, start_rule, end_rule)
end

-- Determines if the previous line is a comment.
-- This is used for determining if the current comment line is a fold point.
-- @param prefix The prefix string defining a comment.
-- @param text The text passed to a fold function.
-- @param pos The pos passed to a fold function.
-- @param line The line passed to a fold function.
-- @param s The s passed to a fold function.
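-- Note: *prefix* is matched as literal text, not as a Lua pattern.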
local function prev_line_is_comment(prefix, text, pos, line, s)
  local start = line:find('%S')
  if start < s and not line:find(prefix, start, true) then return false end
  local p = pos - 1
  if text:sub(p, p) == '\n' then
    p = p - 1
    if text:sub(p, p) == '\r' then p = p - 1 end
    if text:sub(p, p) ~= '\n' then
      while p > 1 and text:sub(p - 1, p - 1) ~= '\n' do p = p - 1 end
      while text:sub(p, p):find('^[\t ]$') do p = p + 1 end
      return text:sub(p, p + #prefix - 1) == prefix
    end
  end
  return false
end

-- Determines if the next line is a comment.
-- This is used for determining if the current comment line is a fold point.
-- @param prefix The prefix string defining a comment.
-- @param text The text passed to a fold function.
-- @param pos The pos passed to a fold function.
-- @param line The line passed to a fold function.
-- @param s The s passed to a fold function.
local function next_line_is_comment(prefix, text, pos, line, s)
  local p = text:find('\n', pos + s)
  if p then
    p = p + 1
    while text:sub(p, p):find('^[\t ]$') do p = p + 1 end
    return text:sub(p, p + #prefix - 1) == prefix
  end
  return false
end

---
-- Returns for `lexer.add_fold_point()` the parameters needed to fold
-- consecutive lines that start with string *prefix*.
-- @param prefix The prefix string (e.g. a line comment).
-- @usage lex:add_fold_point(lexer.COMMENT, lexer.fold_consecutive_lines('--'))
-- @usage lex:add_fold_point(lexer.COMMENT, lexer.fold_consecutive_lines('//'))
-- @usage lex:add_fold_point(
--   lexer.KEYWORD, lexer.fold_consecutive_lines('import'))
-- @name fold_consecutive_lines
function M.fold_consecutive_lines(prefix)
  local property_int = M.property_int
  return prefix, function(text, pos, line, s)
    if property_int['fold.line.groups'] == 0 then return 0 end
    if s > 1 and line:match('^%s*()') < s then return 0 end
    local prev_line_comment = prev_line_is_comment(prefix, text, pos, line, s)
    local next_line_comment = next_line_is_comment(prefix, text, pos, line, s)
    if not prev_line_comment and next_line_comment then return 1 end
    if prev_line_comment and not next_line_comment then return -1 end
    return 0
  end
end

-- Deprecated legacy function. Use `lexer.fold_consecutive_lines()` instead.
-- Returns a fold function (to be passed to `lexer.add_fold_point()`) that
-- folds consecutive line comments that start with string *prefix*.
-- @param prefix The prefix string defining a line comment.
-- @usage lex:add_fold_point(lexer.COMMENT, '--',
--   lexer.fold_line_comments('--'))
-- @usage lex:add_fold_point(lexer.COMMENT, '//',
--   lexer.fold_line_comments('//'))
-- @name fold_line_comments
function M.fold_line_comments(prefix)
  print(
    "lexer.fold_line_comments() is deprecated, " ..
    "use lexer.fold_consecutive_lines()")
  return select(2, M.fold_consecutive_lines(prefix))
end

M.property_expanded = setmetatable({}, {
  -- Returns the string property value associated with string property *key*,
  -- replacing any "$()" and "%()" expressions with the values of their keys.
  __index = function(t, key)
    return M.property[key]:gsub('[$%%]%b()', function(key)
      return t[key:sub(3, -2)]
    end)
  end,
  __newindex = function() error('read-only property') end
})

--[[ The functions and fields below were defined in C.

---
-- Returns the line number (starting from 1) of the line that contains position
-- *pos*, which starts from 1.
-- @param pos The position to get the line number of.
-- @return number
local function line_from_position(pos) end
]]

return M
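--[[ A minimal stand-alone usage sketch (illustrative only; it assumes this
module is on `package.path` and that a *lua.lua* lexer exists in *lexers/*):

  local lexer = require('lexer')
  local lua = lexer.load('lua')
  -- `lex()` returns a flat list of alternating token names and the positions
  -- immediately after each token.
  local tokens = lua:lex('print("hello")')
  for i = 1, #tokens, 2 do print(tokens[i], tokens[i + 1]) end
]]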