Diffstat (limited to 'doc/LPegLexer.html')
-rw-r--r-- | doc/LPegLexer.html | 2608 |
1 files changed, 2608 insertions, 0 deletions
diff --git a/doc/LPegLexer.html b/doc/LPegLexer.html new file mode 100644 index 000000000..1a0049799 --- /dev/null +++ b/doc/LPegLexer.html @@ -0,0 +1,2608 @@ +<?xml version="1.0"?> +<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" + "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> + +<html xmlns="http://www.w3.org/1999/xhtml"> + <head> + <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> + + <title>Lua LPeg Lexers</title> + + <style type="text/css"> + <!-- + /*<![CDATA[*/ + CODE { font-weight: bold; font-family: Menlo,Consolas,Bitstream Vera Sans Mono,Courier New,monospace; } + A:visited { color: blue; } + A:hover { text-decoration: underline ! important; } + A.message { text-decoration: none; font-weight: bold; font-family: Menlo,Consolas,Bitstream Vera Sans Mono,Courier New,monospace; } + A.seealso { text-decoration: none; font-weight: bold; font-family: Menlo,Consolas,Bitstream Vera Sans Mono,Courier New,monospace; } + A.toc { text-decoration: none; } + A.jump { text-decoration: none; } + LI.message { text-decoration: none; font-weight: bold; font-family: Menlo,Consolas,Bitstream Vera Sans Mono,Courier New,monospace; } + H2 { background: #E0EAFF; } + + table { + border: 0px; + border-collapse: collapse; + } + + table.categories { + border: 0px; + border-collapse: collapse; + } + table.categories td { + padding: 4px 12px; + } + + table.standard { + border-collapse: collapse; + } + table.standard th { + background: #404040; + color: #FFFFFF; + padding: 1px 5px 1px 5px; + } + table.standard tr:nth-child(odd) {background: #D7D7D7} + table.standard tr:nth-child(even) {background: #F0F0F0} + table.standard td { + padding: 1px 5px 1px 5px; + } + + .S0 { + color: #808080; + } + .S2 { + font-family: 'Comic Sans MS'; + color: #007F00; + font-size: 9pt; + } + .S3 { + font-family: 'Comic Sans MS'; + color: #3F703F; + font-size: 9pt; + } + .S4 { + color: #007F7F; + } + .S5 { + font-weight: bold; + color: #00007F; + } + .S9 { + 
color: #7F7F00; + } + .S10 { + font-weight: bold; + color: #000000; + } + .S17 { + font-family: 'Comic Sans MS'; + color: #3060A0; + font-size: 9pt; + } + DIV.highlighted { + background: #F7FCF7; + border: 1px solid #C0D7C0; + margin: 0.3em 3em; + padding: 0.3em 0.6em; + font-family: 'Verdana'; + color: #000000; + font-size: 10pt; + } + .provisional { + background: #FFB000; + } + .parameter { + font-style:italic; + } + /*]]>*/ + --> + </style> + </head> + + <body bgcolor="#FFFFFF" text="#000000"> + <table bgcolor="#000000" width="100%" cellspacing="0" cellpadding="0" border="0" + summary="Banner"> + <tr> + <td><img src="SciTEIco.png" border="3" height="64" width="64" alt="Scintilla icon" /></td> + + <td><a href="index.html" + style="color:white;text-decoration:none;font-size:200%">Scintilla</a></td> + </tr> + </table> + + <h1>Lua LPeg Lexers</h1> + + <p>Scintilla's LPeg lexer adds dynamic <a href="http://lua.org">Lua</a> + <a href="http://www.inf.puc-rio.br/~roberto/lpeg/">LPeg</a> lexers to + Scintilla. It is the quickest way to add new or customized syntax + highlighting and code folding for programming languages to any + Scintilla-based text editor or IDE.</p> + + <h2>Features</h2> + + <ul> + <li>Support for <a href="#LexerList">over 100 programming languages</a>.</li> + <li>Easy lexer embedding for multi-language lexers.</li> + <li>Universal color themes.</li> + <li>Comparable speed to native Scintilla lexers.</li> + </ul> + + <h2>Enabling and Configuring the LPeg Lexer</h2> + + <p>Scintilla is <em>not</em> compiled with the LPeg lexer enabled by + default (it is present, but empty). You need to manually enable it with the + <code>LPEG_LEXER</code> flag when building Scintilla and its lexers. You + also need to build and link the Lua source files contained in Scintilla's + <code>lua/src/</code> directory to <code>lexers/LexLPeg.cxx</code>. If your + application has its own copy of Lua, you can ignore Scintilla's copy and + link to yours. 
+ </p>
+ + <p>At this time, only the GTK, curses, and MinGW32 (for win32) platform + makefiles facilitate enabling the LPeg lexer. For example, when building + Scintilla, run <code>make LPEG_LEXER=1</code>. User contributions that + facilitate this for the other platforms are encouraged.</p> + + <p>When Scintilla is compiled with the LPeg lexer enabled, and after + selecting it as the lexer to use via + <a class="message" href="ScintillaDoc.html#SCI_SETLEXER">SCI_SETLEXER</a> or + <a class="message" href="ScintillaDoc.html#SCI_SETLEXERLANGUAGE">SCI_SETLEXERLANGUAGE</a>, + the following property <em>must</em> be set via + <a class="message" href="ScintillaDoc.html#SCI_SETPROPERTY">SCI_SETPROPERTY</a>:</p> + + <table class="standard" summary="Required property"> + <tbody> + <tr> + <td><code>lexer.lpeg.home</code></td> + + <td>The directory containing the Lua lexers. This is the path + where you included Scintilla's <code>lexlua/</code> directory in + your application's installation location.</td> + </tr> + </tbody> + </table> + + <p>The following properties are optional:</p> + + <table class="standard" summary="Optional properties"> + <tbody> + <tr> + <td><code>lexer.lpeg.color.theme</code></td> + + <td>The color theme to use. Color themes are located in the + <code>lexlua/themes/</code> directory. Currently supported themes + are <code>light</code>, <code>dark</code>, <code>scite</code>, and + <code>curses</code>. Your application can define colors and styles + manually through Scintilla properties. The theme files have + examples.</td> + </tr> + + <tr> + <td><code>fold</code></td> + + <td>For Lua lexers that have a folder, folding is turned on if + <code>fold</code> is set to <code>1</code>. The default is + <code>0</code>.</td> + </tr> + + <tr> + <td><code>fold.by.indentation</code></td> + + <td>For Lua lexers that do not have a folder, if + <code>fold.by.indentation</code> is set to <code>1</code>, folding is + done based on indentation level (like Python). 
The default is + <code>0</code>.</td> + </tr> + + <tr> + <td><code>fold.line.comments</code></td> + + <td>If <code>fold.line.comments</code> is set to <code>1</code>, + multiple, consecutive line comments are folded, and only the top-level + comment is shown. There is a small performance penalty for large + source files when this option and folding are enabled. The default is + <code>0</code>.</td> + </tr> + + <tr> + <td><code>fold.on.zero.sum.lines</code></td> + + <td>If <code>fold.on.zero.sum.lines</code> is set to <code>1</code>, + lines that contain both an ending and starting fold point are marked + as fold points. For example, the C line <code>} else {</code> would be + marked as a fold point. The default is <code>0</code>.</td> + </tr> + </tbody> + </table> + + <h2>Using the LPeg Lexer</h2> + + <p>Your application communicates with the LPeg lexer using Scintilla's + <a class="message" href="ScintillaDoc.html#SCI_PRIVATELEXERCALL"><code>SCI_PRIVATELEXERCALL</code></a> + API. The operation constants recognized by the LPeg lexer are based on + Scintilla's existing named constants. Note that some of the names of the + operations do not make perfect sense. This is a tradeoff in order to reuse + Scintilla's existing constants.</p> + + <p>In the descriptions that follow, + <code>SCI_PRIVATELEXERCALL(int operation, void *pointer)</code> means you + would call Scintilla like + <code>SendScintilla(sci, SCI_PRIVATELEXERCALL, operation, pointer);</code></p> + + <h3>Usage Example</h3> + + <p>The curses platform demo, jinx, has a C-source example for using the LPeg + lexer. 
Additionally, here is a pseudo-code example:</p> + + <pre><code> + init_app() { + sci = scintilla_new() + } + + create_doc() { + doc = SendScintilla(sci, SCI_CREATEDOCUMENT, 0, 0) + SendScintilla(sci, SCI_SETDOCPOINTER, 0, doc) + SendScintilla(sci, SCI_SETLEXERLANGUAGE, 0, "lpeg") + home = "/home/mitchell/app/lua_lexers" + SendScintilla(sci, SCI_SETPROPERTY, "lexer.lpeg.home", home) + SendScintilla(sci, SCI_SETPROPERTY, "lexer.lpeg.color.theme", "light") + fn = SendScintilla(sci, SCI_GETDIRECTFUNCTION, 0, 0) + SendScintilla(sci, SCI_PRIVATELEXERCALL, SCI_GETDIRECTFUNCTION, fn) + psci = SendScintilla(sci, SCI_GETDIRECTPOINTER, 0, 0) + SendScintilla(sci, SCI_PRIVATELEXERCALL, SCI_SETDOCPOINTER, psci) + SendScintilla(sci, SCI_PRIVATELEXERCALL, SCI_SETLEXERLANGUAGE, "lua") + } + + set_lexer(lang) { + psci = SendScintilla(sci, SCI_GETDIRECTPOINTER, 0, 0) + SendScintilla(sci, SCI_PRIVATELEXERCALL, SCI_SETDOCPOINTER, psci) + SendScintilla(sci, SCI_PRIVATELEXERCALL, SCI_SETLEXERLANGUAGE, lang) + } + </code></pre> + + <code><a class="message" href="#SCI_CHANGELEXERSTATE">SCI_PRIVATELEXERCALL(SCI_CHANGELEXERSTATE, lua_State *L)</a><br/> + <a class="message" href="#SCI_GETDIRECTFUNCTION">SCI_PRIVATELEXERCALL(SCI_GETDIRECTFUNCTION, int SciFnDirect)</a><br/> + <a class="message" href="#SCI_GETLEXERLANGUAGE">SCI_PRIVATELEXERCALL(SCI_GETLEXERLANGUAGE, char *languageName) → int</a><br/> + <a class="message" href="#SCI_GETSTATUS">SCI_PRIVATELEXERCALL(SCI_GETSTATUS, char *errorMessage) → int</a><br/> + <a class="message" href="#styleNum">SCI_PRIVATELEXERCALL(int styleNum, char *styleName) → int</a><br/> + <a class="message" href="#SCI_SETDOCPOINTER">SCI_PRIVATELEXERCALL(SCI_SETDOCPOINTER, int sci)</a><br/> + <a class="message" href="#SCI_SETLEXERLANGUAGE">SCI_PRIVATELEXERCALL(SCI_SETLEXERLANGUAGE, languageName)</a><br/> + </code> + + <p><b id="SCI_CHANGELEXERSTATE">SCI_PRIVATELEXERCALL(SCI_CHANGELEXERSTATE, lua_State *L)</b><br/> + Tells the LPeg lexer to use <code>L</code> as its 
Lua state instead of + creating a separate state.</p> + + <p><code>L</code> must have already opened the "base", "string", "table", + "package", and "lpeg" libraries. If <code>L</code> is a Lua 5.1 state, it + must have also opened the "io" library.</p> + + <p>The LPeg lexer will create a single <code>lexer</code> package (that can + be used with Lua's <code>require</code> function), as well as a number of + other variables in the <code>LUA_REGISTRYINDEX</code> table with the "sci_" + prefix.</p> + + <p>Rather than including the path to Scintilla's Lua lexers in the + <code>package.path</code> of the given Lua state, set the "lexer.lpeg.home" + property instead. The LPeg lexer uses that property to find and load + lexers.</p> + + <p>Usage:</p> + + <pre><code> + lua = luaL_newstate() + SendScintilla(sci, SCI_PRIVATELEXERCALL, SCI_CHANGELEXERSTATE, lua) + </code></pre> + + <p><b id="SCI_GETDIRECTFUNCTION">SCI_PRIVATELEXERCALL(SCI_GETDIRECTFUNCTION, SciFnDirect)</b><br/> + Tells the LPeg lexer the address of <code>SciFnDirect</code>, the function + that handles Scintilla messages.</p> + + <p>Despite the name <code>SCI_GETDIRECTFUNCTION</code>, it only notifies the + LPeg lexer what the value of <code>SciFnDirect</code> obtained from + <a class="message" href="ScintillaDoc.html#SCI_GETDIRECTFUNCTION"><code>SCI_GETDIRECTFUNCTION</code></a> + is. It does not return anything. Use this if you would like to have the LPeg + lexer set all Lua lexer styles automatically. This is useful for maintaining + a consistent color theme. Do not use this if your application maintains its + own color theme.</p> + + <p>If you use this call, it <em>must</em> be made <em>once</em> for each + Scintilla document that was created using Scintilla's + <a class="message" href="ScintillaDoc.html#SCI_CREATEDOCUMENT"><code>SCI_CREATEDOCUMENT</code></a>. 
+ You must also use the + <a class="message" href="#SCI_SETDOCPOINTER"><code>SCI_SETDOCPOINTER</code></a> LPeg lexer + API call.</p> + + <p>Usage:</p> + + <pre><code> + fn = SendScintilla(sci, SCI_GETDIRECTFUNCTION, 0, 0) + SendScintilla(sci, SCI_PRIVATELEXERCALL, SCI_GETDIRECTFUNCTION, fn) + </code></pre> + + <p>See also: <a class="message" href="#SCI_SETDOCPOINTER"><code>SCI_SETDOCPOINTER</code></a></p> + + <p><b id="SCI_GETLEXERLANGUAGE">SCI_PRIVATELEXERCALL(SCI_GETLEXERLANGUAGE, char *languageName) → int</b><br/> + Returns the length of the string name of the current Lua lexer or stores the + name into the given buffer. If the buffer is long enough, the name is + terminated by a <code>0</code> character.</p> + + <p>For parent lexers with embedded children or child lexers embedded into + parents, the name is in "lexer/current" format, where "lexer" is the actual + lexer's name and "current" is the parent or child lexer at the current caret + position. In order for this to work, you must have called + <a class="message" href="#SCI_GETDIRECTFUNCTION"><code>SCI_GETDIRECTFUNCTION</code></a> + and + <a class="message" href="#SCI_SETDOCPOINTER"><code>SCI_SETDOCPOINTER</code></a>.</p> + + <p><b id="SCI_GETSTATUS">SCI_PRIVATELEXERCALL(SCI_GETSTATUS, char *errorMessage) → int</b><br/> + Returns the length of the error message of the LPeg lexer or Lua lexer error + that occurred (if any), or stores the error message into the given buffer.</p> + + <p>If no error occurred, the returned message will be empty.</p> + + <p>Since the LPeg lexer does not throw errors as they occur, errors can only + be handled passively. 
Note that the LPeg lexer does print all errors to + stderr.</p> + + <p>Usage:</p> + + <pre><code> + SendScintilla(sci, SCI_PRIVATELEXERCALL, SCI_GETSTATUS, errmsg) + if (strlen(errmsg) > 0) { /* handle error */ } + </code></pre> + + <p><b id="styleNum">SCI_PRIVATELEXERCALL(int styleNum, char *styleName) → int</b><br/> + Returns the length of the token name associated with the given style number + or stores the style name into the given buffer. If the buffer is long + enough, the string is terminated by a <code>0</code> character.</p> + + <p>Usage:</p> + + <pre><code> + style = SendScintilla(sci, SCI_GETSTYLEAT, pos, 0) + SendScintilla(sci, SCI_PRIVATELEXERCALL, style, token) + // token now contains the name of the style at pos + </code></pre> + + <p><b id="SCI_SETDOCPOINTER">SCI_PRIVATELEXERCALL(SCI_SETDOCPOINTER, int sci)</b><br/> + Tells the LPeg lexer the address of the Scintilla window (obtained via + Scintilla's + <a class="message" href="ScintillaDoc.html#SCI_GETDIRECTPOINTER"><code>SCI_GETDIRECTPOINTER</code></a>) + currently in use.</p> + + <p>Despite the name <code>SCI_SETDOCPOINTER</code>, it has no relationship + to Scintilla documents.</p> + + <p>Use this call only if you are using the + <a class="message" href="#SCI_GETDIRECTFUNCTION"><code>SCI_GETDIRECTFUNCTION</code></a> + LPeg lexer API call. 
It <em>must</em> be made <em>before</em> each call to + the <a class="message" href="#SCI_SETLEXERLANGUAGE"><code>SCI_SETLEXERLANGUAGE</code></a> + LPeg lexer API call.</p> + + <p>Usage:</p> + + <pre><code> + SendScintilla(sci, SCI_PRIVATELEXERCALL, SCI_SETDOCPOINTER, sci) + </code></pre> + + <p>See also: <a class="message" href="#SCI_GETDIRECTFUNCTION"><code>SCI_GETDIRECTFUNCTION</code></a>, + <a class="message" href="#SCI_SETLEXERLANGUAGE"><code>SCI_SETLEXERLANGUAGE</code></a></p> + + <p><b id="SCI_SETLEXERLANGUAGE">SCI_PRIVATELEXERCALL(SCI_SETLEXERLANGUAGE, const char *languageName)</b><br/> + Sets the current Lua lexer to <code>languageName</code>.</p> + + <p>If you are having the LPeg lexer set the Lua lexer styles automatically, + make sure you call the + <a class="message" href="#SCI_SETDOCPOINTER"><code>SCI_SETDOCPOINTER</code></a> + LPeg lexer API <em>first</em>.</p> + + <p>Usage:</p> + + <pre><code> + SendScintilla(sci, SCI_PRIVATELEXERCALL, SCI_SETLEXERLANGUAGE, "lua") + </code></pre> + + <p>See also: <a class="message" href="#SCI_SETDOCPOINTER"><code>SCI_SETDOCPOINTER</code></a></p> + + <h2 id="lexer">Writing Lua Lexers</h2> + + <p>Lexers highlight the syntax of source code. Scintilla (the editing component + behind <a href="http://foicica.com/textadept">Textadept</a>) traditionally uses static, compiled C++ + lexers, which are notoriously difficult to create and/or extend. On the other + hand, <a href="http://lua.org">Lua</a> makes it easy to rapidly create new lexers, extend existing + ones, and embed lexers within one another. Lua lexers tend to be more + readable than C++ lexers too.</p> + + <p>Lexers are Parsing Expression Grammars, or PEGs, composed with the Lua + <a href="http://www.inf.puc-rio.br/~roberto/lpeg/lpeg.html">LPeg library</a>. The following table comes from the LPeg documentation and + summarizes all you need to know about constructing basic LPeg patterns. 
This + module provides convenience functions for creating and working with other + more advanced patterns and concepts.</p> + + <table class="standard"> + <thead> + <tr> + <th>Operator </th> + <th> Description</th> + </tr> + </thead> + <tbody> + <tr> + <td><code>lpeg.P(string)</code> </td> + <td> Matches <code>string</code> literally.</td> + </tr> + <tr> + <td><code>lpeg.P(</code><em><code>n</code></em><code>)</code> </td> + <td> Matches exactly <em><code>n</code></em> characters.</td> + </tr> + <tr> + <td><code>lpeg.S(string)</code> </td> + <td> Matches any character in set <code>string</code>.</td> + </tr> + <tr> + <td><code>lpeg.R("</code><em><code>xy</code></em><code>")</code> </td> + <td> Matches any character in the range <code>x</code> to <code>y</code>.</td> + </tr> + <tr> + <td><code>patt^</code><em><code>n</code></em> </td> + <td> Matches at least <em><code>n</code></em> repetitions of <code>patt</code>.</td> + </tr> + <tr> + <td><code>patt^-</code><em><code>n</code></em> </td> + <td> Matches at most <em><code>n</code></em> repetitions of <code>patt</code>.</td> + </tr> + <tr> + <td><code>patt1 * patt2</code> </td> + <td> Matches <code>patt1</code> followed by <code>patt2</code>.</td> + </tr> + <tr> + <td><code>patt1 + patt2</code> </td> + <td> Matches <code>patt1</code> or <code>patt2</code> (ordered choice).</td> + </tr> + <tr> + <td><code>patt1 - patt2</code> </td> + <td> Matches <code>patt1</code> if <code>patt2</code> does not match.</td> + </tr> + <tr> + <td><code>-patt</code> </td> + <td> Equivalent to <code>("" - patt)</code>.</td> + </tr> + <tr> + <td><code>#patt</code> </td> + <td> Matches <code>patt</code> but consumes no input.</td> + </tr> + </tbody> + </table> + + + <p>The first part of this document deals with rapidly constructing a simple + lexer. The next part deals with more advanced techniques, such as custom + coloring and embedding lexers within one another. 
Following that is a + discussion about code folding, or being able to tell Scintilla which code + blocks are "foldable" (temporarily hideable from view). After that are + instructions on how to use Lua lexers with the aforementioned Textadept + editor. Finally there are comments on lexer performance and limitations.</p> + + <p><a id="lexer.Lexer.Basics"></a></p> + + <h3>Lexer Basics</h3> + + <p>The <em>lexlua/</em> directory contains all lexers, including your new one. Before + attempting to write one from scratch though, first determine if your + programming language is similar to any of the 100+ languages supported. If + so, you may be able to copy and modify that lexer, saving some time and + effort. The filename of your lexer should be the name of your programming + language in lower case followed by a <em>.lua</em> extension. For example, a new Lua + lexer has the name <em>lua.lua</em>.</p> + + <p>Note: Try to refrain from using one-character language names like "c", "d", + or "r". For example, Lua lexers for those languages are named "ansi_c", "dmd", and "rstats", + respectively.</p> + + <p><a id="lexer.New.Lexer.Template"></a></p> + + <h4>New Lexer Template</h4> + + <p>There is a <em>lexlua/template.txt</em> file that contains a simple template for a + new lexer. Feel free to use it, replacing the '?'s with the name of your + lexer. Consider this snippet from the template:</p> + + <pre><code> + -- ? LPeg lexer. + + local lexer = require('lexer') + local token, word_match = lexer.token, lexer.word_match + local P, R, S = lpeg.P, lpeg.R, lpeg.S + + local lex = lexer.new('?') + + -- Whitespace. + local ws = token(lexer.WHITESPACE, lexer.space^1) + lex:add_rule('whitespace', ws) + + [...] + + return lex + </code></pre> + + <p>The first 3 lines of code simply define often used convenience variables. The + fourth and last lines <a href="#lexer.new">define</a> and return the lexer object + Scintilla uses; they are very important and must be part of every lexer. 
The + fifth line defines something called a "token", an essential building block of + lexers. You will learn about tokens shortly. The sixth line defines a lexer + grammar rule, which you will learn about later, as well as token styles. (Be + aware that it is common practice to combine these two lines for short rules.) + Note, however, the <code>local</code> prefix in front of variables, which is needed + so as not to affect Lua's global environment. All in all, this is a minimal, + working lexer that you can build on.</p> + + <p><a id="lexer.Tokens"></a></p> + + <h4>Tokens</h4> + + <p>Take a moment to think about your programming language's structure. What kind + of key elements does it have? In the template shown earlier, one predefined + element all languages have is whitespace. Your language probably also has + elements like comments, strings, and keywords. Lexers refer to these elements + as "tokens". Tokens are the fundamental "building blocks" of lexers. Lexers + break down source code into tokens for coloring, which results in the syntax + highlighting familiar to you. It is up to you how specific your lexer is when + it comes to tokens. Perhaps only distinguishing between keywords and + identifiers is necessary, or maybe recognizing constants and built-in + functions, methods, or libraries is desirable. The Lua lexer, for example, + defines 11 tokens: whitespace, keywords, built-in functions, constants, + built-in libraries, identifiers, strings, comments, numbers, labels, and + operators. Even though constants, built-in functions, and built-in libraries + are subsets of identifiers, Lua programmers find it helpful for the lexer to + distinguish between them all. It is perfectly acceptable to just recognize + keywords and identifiers.</p> + + <p>In a lexer, tokens consist of a token name and an LPeg pattern that matches a + sequence of characters recognized as an instance of that token. 
Create tokens + using the <a href="#lexer.token"><code>lexer.token()</code></a> function. Let us examine the "whitespace" token + defined in the template shown earlier:</p> + + <pre><code> + local ws = token(lexer.WHITESPACE, lexer.space^1) + </code></pre> + + <p>At first glance, the first argument does not appear to be a string name and + the second argument does not appear to be an LPeg pattern. Perhaps you + expected something like:</p> + + <pre><code> + local ws = token('whitespace', S('\t\v\f\n\r ')^1) + </code></pre> + + <p>The <code>lexer</code> module actually provides a convenient list of common token names + and common LPeg patterns for you to use. Token names include + <a href="#lexer.DEFAULT"><code>lexer.DEFAULT</code></a>, <a href="#lexer.WHITESPACE"><code>lexer.WHITESPACE</code></a>, <a href="#lexer.COMMENT"><code>lexer.COMMENT</code></a>, + <a href="#lexer.STRING"><code>lexer.STRING</code></a>, <a href="#lexer.NUMBER"><code>lexer.NUMBER</code></a>, <a href="#lexer.KEYWORD"><code>lexer.KEYWORD</code></a>, + <a href="#lexer.IDENTIFIER"><code>lexer.IDENTIFIER</code></a>, <a href="#lexer.OPERATOR"><code>lexer.OPERATOR</code></a>, <a href="#lexer.ERROR"><code>lexer.ERROR</code></a>, + <a href="#lexer.PREPROCESSOR"><code>lexer.PREPROCESSOR</code></a>, <a href="#lexer.CONSTANT"><code>lexer.CONSTANT</code></a>, <a href="#lexer.VARIABLE"><code>lexer.VARIABLE</code></a>, + <a href="#lexer.FUNCTION"><code>lexer.FUNCTION</code></a>, <a href="#lexer.CLASS"><code>lexer.CLASS</code></a>, <a href="#lexer.TYPE"><code>lexer.TYPE</code></a>, <a href="#lexer.LABEL"><code>lexer.LABEL</code></a>, + <a href="#lexer.REGEX"><code>lexer.REGEX</code></a>, and <a href="#lexer.EMBEDDED"><code>lexer.EMBEDDED</code></a>. 
Patterns include + <a href="#lexer.any"><code>lexer.any</code></a>, <a href="#lexer.ascii"><code>lexer.ascii</code></a>, <a href="#lexer.extend"><code>lexer.extend</code></a>, <a href="#lexer.alpha"><code>lexer.alpha</code></a>, + <a href="#lexer.digit"><code>lexer.digit</code></a>, <a href="#lexer.alnum"><code>lexer.alnum</code></a>, <a href="#lexer.lower"><code>lexer.lower</code></a>, <a href="#lexer.upper"><code>lexer.upper</code></a>, + <a href="#lexer.xdigit"><code>lexer.xdigit</code></a>, <a href="#lexer.cntrl"><code>lexer.cntrl</code></a>, <a href="#lexer.graph"><code>lexer.graph</code></a>, <a href="#lexer.print"><code>lexer.print</code></a>, + <a href="#lexer.punct"><code>lexer.punct</code></a>, <a href="#lexer.space"><code>lexer.space</code></a>, <a href="#lexer.newline"><code>lexer.newline</code></a>, + <a href="#lexer.nonnewline"><code>lexer.nonnewline</code></a>, <a href="#lexer.nonnewline_esc"><code>lexer.nonnewline_esc</code></a>, <a href="#lexer.dec_num"><code>lexer.dec_num</code></a>, + <a href="#lexer.hex_num"><code>lexer.hex_num</code></a>, <a href="#lexer.oct_num"><code>lexer.oct_num</code></a>, <a href="#lexer.integer"><code>lexer.integer</code></a>, + <a href="#lexer.float"><code>lexer.float</code></a>, and <a href="#lexer.word"><code>lexer.word</code></a>. You may use your own token names if + none of the above fit your language, but an advantage to using predefined + token names is that your lexer's tokens will inherit the universal syntax + highlighting color theme used by your text editor.</p> + + <p><a id="lexer.Example.Tokens"></a></p> + + <h5>Example Tokens</h5> + + <p>So, how might you define other tokens like keywords, comments, and strings? 
+ Here are some examples.</p> + + <p><strong>Keywords</strong></p> + + <p>Instead of matching <em>n</em> keywords with <em>n</em> <code>P('keyword_</code><em><code>n</code></em><code>')</code> ordered + choices, use another convenience function: <a href="#lexer.word_match"><code>lexer.word_match()</code></a>. It is + much easier and more efficient to write word matches like:</p> + + <pre><code> + local keyword = token(lexer.KEYWORD, lexer.word_match[[ + keyword_1 keyword_2 ... keyword_n + ]]) + + local case_insensitive_keyword = token(lexer.KEYWORD, lexer.word_match([[ + KEYWORD_1 keyword_2 ... KEYword_n + ]], true)) + + local hyphened_keyword = token(lexer.KEYWORD, lexer.word_match[[ + keyword-1 keyword-2 ... keyword-n + ]]) + </code></pre> + + <p>In order to more easily separate or categorize keyword sets, you can use Lua + line comments within keyword strings. Such comments will be ignored. For + example:</p> + + <pre><code> + local keyword = token(lexer.KEYWORD, lexer.word_match[[ + -- Version 1 keywords. + keyword_11, keyword_12 ... keyword_1n + -- Version 2 keywords. + keyword_21, keyword_22 ... keyword_2n + ... + -- Version N keywords. + keyword_m1, keyword_m2 ... keyword_mn + ]]) + </code></pre> + + <p><strong>Comments</strong></p> + + <p>Line-style comments with a prefix character(s) are easy to express with LPeg:</p> + + <pre><code> + local shell_comment = token(lexer.COMMENT, '#' * lexer.nonnewline^0) + local c_line_comment = token(lexer.COMMENT, + '//' * lexer.nonnewline_esc^0) + </code></pre> + + <p>The comments above start with a '#' or "//" and go to the end of the line. 
+ The second comment recognizes the next line also as a comment if the current + line ends with a '\' escape character.</p> + + <p>C-style "block" comments with a start and end delimiter are also easy to + express:</p> + + <pre><code> + local c_comment = token(lexer.COMMENT, + '/*' * (lexer.any - '*/')^0 * P('*/')^-1) + </code></pre> + + <p>This comment starts with a "/*" sequence and contains anything up to and + including an ending "*/" sequence. The ending "*/" is optional so the lexer + can recognize unfinished comments as comments and highlight them properly.</p> + + <p><strong>Strings</strong></p> + + <p>It is tempting to think that a string is not much different from the block + comment shown above in that both have start and end delimiters:</p> + + <pre><code> + local dq_str = '"' * (lexer.any - '"')^0 * P('"')^-1 + local sq_str = "'" * (lexer.any - "'")^0 * P("'")^-1 + local simple_string = token(lexer.STRING, dq_str + sq_str) + </code></pre> + + <p>However, most programming languages allow escape sequences in strings such + that a sequence like "\"" in a double-quoted string indicates that the + '"' is not the end of the string. The above token incorrectly matches + such a string. 
Instead, use the <a href="#lexer.delimited_range"><code>lexer.delimited_range()</code></a> convenience + function.</p> + + <pre><code> + local dq_str = lexer.delimited_range('"') + local sq_str = lexer.delimited_range("'") + local string = token(lexer.STRING, dq_str + sq_str) + </code></pre> + + <p>In this case, the lexer treats '\' as an escape character in a string + sequence.</p> + + <p><strong>Numbers</strong></p> + + <p>Most programming languages have the same format for integer and float tokens, + so it might be as simple as using a couple of predefined LPeg patterns:</p> + + <pre><code> + local number = token(lexer.NUMBER, lexer.float + lexer.integer) + </code></pre> + + <p>However, some languages allow postfix characters on integers.</p> + + <pre><code> + local integer = P('-')^-1 * (lexer.dec_num * S('lL')^-1) + local number = token(lexer.NUMBER, lexer.float + lexer.hex_num + integer) + </code></pre> + + <p>Your language may need other tweaks, but it is up to you how fine-grained you + want your highlighting to be. After all, you are not writing a compiler or + interpreter!</p> + + <p><a id="lexer.Rules"></a></p> + + <h4>Rules</h4> + + <p>Programming languages have grammars, which specify valid token structure. For + example, comments usually cannot appear within a string. Grammars consist of + rules, which are simply combinations of tokens. Recall from the lexer + template the <a href="#lexer.add_rule"><code>lexer.add_rule()</code></a> call, which adds a rule to the lexer's + grammar:</p> + + <pre><code> + lex:add_rule('whitespace', ws) + </code></pre> + + <p>Each rule has an associated name, but rule names are completely arbitrary and + serve only to identify and distinguish between different rules. Rule order is + important: if text does not match the first rule added to the grammar, the + lexer tries to match the second rule added, and so on. 
Right now this lexer + simply matches whitespace tokens under a rule named "whitespace".</p> + + <p>To illustrate the importance of rule order, here is an example of a + simplified Lua lexer:</p> + + <pre><code> + lex:add_rule('whitespace', token(lexer.WHITESPACE, ...)) + lex:add_rule('keyword', token(lexer.KEYWORD, ...)) + lex:add_rule('identifier', token(lexer.IDENTIFIER, ...)) + lex:add_rule('string', token(lexer.STRING, ...)) + lex:add_rule('comment', token(lexer.COMMENT, ...)) + lex:add_rule('number', token(lexer.NUMBER, ...)) + lex:add_rule('label', token(lexer.LABEL, ...)) + lex:add_rule('operator', token(lexer.OPERATOR, ...)) + </code></pre> + + <p>Note how identifiers come after keywords. In Lua, as with most programming + languages, the characters allowed in keywords and identifiers are in the same + set (alphanumerics plus underscores). If the lexer added the "identifier" + rule before the "keyword" rule, all keywords would match identifiers and thus + incorrectly highlight as identifiers instead of keywords. The same idea + applies to other tokens you may want to distinguish from identifiers, such as + functions and constants: their rules should come before the identifier rule.</p> + + <p>So what about text that does not match any rules? For example in Lua, the '!' + character is meaningless outside a string or comment. Normally the lexer + skips over such text. If instead you want to highlight these "syntax errors", + add an additional rule at the end:</p> + + <pre><code> + lex:add_rule('whitespace', ws) + ... + lex:add_rule('error', token(lexer.ERROR, lexer.any)) + </code></pre> + + <p>This identifies and highlights any character not matched by an existing + rule as a <code>lexer.ERROR</code> token.</p> + + <p>Even though the rules defined in the examples above contain a single token, + rules may consist of multiple tokens. 
For example, a rule for an HTML tag
+ could consist of a tag token followed by an arbitrary number of attribute
+ tokens, allowing the lexer to highlight all tokens separately. That rule
+ might look something like this:</p>
+
+ <pre><code>
+ lex:add_rule('tag', tag_start * (ws * attributes)^0 * tag_end^-1)
+ </code></pre>
+
+ <p>Note however that lexers with complex rules like these are more prone to
+ losing track of their state, especially if they span multiple lines.</p>
+
+ <p><a id="lexer.Summary"></a></p>
+
+ <h4>Summary</h4>
+
+ <p>Lexers primarily consist of tokens and grammar rules. At your disposal are a
+ number of convenience patterns and functions for rapidly creating a lexer. If
+ you choose to use predefined token names for your tokens, you do not have to
+ define how the lexer highlights them. The tokens will inherit the default
+ syntax highlighting color theme your editor uses.</p>
+
+ <p><a id="lexer.Advanced.Techniques"></a></p>
+
+ <h3>Advanced Techniques</h3>
+
+ <p><a id="lexer.Styles.and.Styling"></a></p>
+
+ <h4>Styles and Styling</h4>
+
+ <p>The most basic form of syntax highlighting is assigning different colors to
+ different tokens. Instead of highlighting with just colors, Scintilla allows
+ for richer highlighting, or "styling", with different fonts, font sizes,
+ font attributes, and foreground and background colors, just to name a few.
+ The unit of this rich highlighting is called a "style". Styles are simply
+ strings of comma-separated property settings. By default, lexers associate
+ predefined token names like <code>lexer.WHITESPACE</code>, <code>lexer.COMMENT</code>,
+ <code>lexer.STRING</code>, etc. with particular styles as part of a universal color
+ theme. 
These predefined styles include <a href="#lexer.STYLE_CLASS"><code>lexer.STYLE_CLASS</code></a>, + <a href="#lexer.STYLE_COMMENT"><code>lexer.STYLE_COMMENT</code></a>, <a href="#lexer.STYLE_CONSTANT"><code>lexer.STYLE_CONSTANT</code></a>, + <a href="#lexer.STYLE_ERROR"><code>lexer.STYLE_ERROR</code></a>, <a href="#lexer.STYLE_EMBEDDED"><code>lexer.STYLE_EMBEDDED</code></a>, + <a href="#lexer.STYLE_FUNCTION"><code>lexer.STYLE_FUNCTION</code></a>, <a href="#lexer.STYLE_IDENTIFIER"><code>lexer.STYLE_IDENTIFIER</code></a>, + <a href="#lexer.STYLE_KEYWORD"><code>lexer.STYLE_KEYWORD</code></a>, <a href="#lexer.STYLE_LABEL"><code>lexer.STYLE_LABEL</code></a>, <a href="#lexer.STYLE_NUMBER"><code>lexer.STYLE_NUMBER</code></a>, + <a href="#lexer.STYLE_OPERATOR"><code>lexer.STYLE_OPERATOR</code></a>, <a href="#lexer.STYLE_PREPROCESSOR"><code>lexer.STYLE_PREPROCESSOR</code></a>, + <a href="#lexer.STYLE_REGEX"><code>lexer.STYLE_REGEX</code></a>, <a href="#lexer.STYLE_STRING"><code>lexer.STYLE_STRING</code></a>, <a href="#lexer.STYLE_TYPE"><code>lexer.STYLE_TYPE</code></a>, + <a href="#lexer.STYLE_VARIABLE"><code>lexer.STYLE_VARIABLE</code></a>, and <a href="#lexer.STYLE_WHITESPACE"><code>lexer.STYLE_WHITESPACE</code></a>. Like with + predefined token names and LPeg patterns, you may define your own styles. At + their core, styles are just strings, so you may create new ones and/or modify + existing ones. 
Each style consists of the following comma-separated settings:</p> + + <table class="standard"> + <thead> + <tr> + <th>Setting </th> + <th> Description</th> + </tr> + </thead> + <tbody> + <tr> + <td>font:<em>name</em> </td> + <td> The name of the font the style uses.</td> + </tr> + <tr> + <td>size:<em>int</em> </td> + <td> The size of the font the style uses.</td> + </tr> + <tr> + <td>[not]bold </td> + <td> Whether or not the font face is bold.</td> + </tr> + <tr> + <td>weight:<em>int</em> </td> + <td> The weight or boldness of a font, between 1 and 999.</td> + </tr> + <tr> + <td>[not]italics </td> + <td> Whether or not the font face is italic.</td> + </tr> + <tr> + <td>[not]underlined</td> + <td> Whether or not the font face is underlined.</td> + </tr> + <tr> + <td>fore:<em>color</em> </td> + <td> The foreground color of the font face.</td> + </tr> + <tr> + <td>back:<em>color</em> </td> + <td> The background color of the font face.</td> + </tr> + <tr> + <td>[not]eolfilled </td> + <td> Does the background color extend to the end of the line?</td> + </tr> + <tr> + <td>case:<em>char</em> </td> + <td> The case of the font ('u': upper, 'l': lower, 'm': normal).</td> + </tr> + <tr> + <td>[not]visible </td> + <td> Whether or not the text is visible.</td> + </tr> + <tr> + <td>[not]changeable</td> + <td> Whether the text is changeable or read-only.</td> + </tr> + </tbody> + </table> + + + <p>Specify font colors in either "#RRGGBB" format, "0xBBGGRR" format, or the + decimal equivalent of the latter. As with token names, LPeg patterns, and + styles, there is a set of predefined color names, but they vary depending on + the current color theme in use. Therefore, it is generally not a good idea to + manually define colors within styles in your lexer since they might not fit + into a user's chosen color theme. Try to refrain from even using predefined + colors in a style because that color may be theme-specific. 
Instead, the best
+ practice is to either use predefined styles or derive new color-agnostic
+ styles from predefined ones. For example, Lua "longstring" tokens use the
+ existing <code>lexer.STYLE_STRING</code> style instead of defining a new one.</p>
+
+ <p><a id="lexer.Example.Styles"></a></p>
+
+ <h5>Example Styles</h5>
+
+ <p>Defining styles is pretty straightforward. An empty style that inherits the
+ default theme settings is simply an empty string:</p>
+
+ <pre><code>
+ local style_nothing = ''
+ </code></pre>
+
+ <p>A similar style but with a bold font face looks like this:</p>
+
+ <pre><code>
+ local style_bold = 'bold'
+ </code></pre>
+
+ <p>If you want the same style, but also with an italic font face, define the new
+ style in terms of the old one:</p>
+
+ <pre><code>
+ local style_bold_italic = style_bold..',italics'
+ </code></pre>
+
+ <p>This allows you to derive new styles from predefined ones without having to
+ rewrite them. This operation leaves the old style unchanged. Thus if you
+ had a "static variable" token whose style you wanted to base on
+ <code>lexer.STYLE_VARIABLE</code>, it would probably look like:</p>
+
+ <pre><code>
+ local style_static_var = lexer.STYLE_VARIABLE..',italics'
+ </code></pre>
+
+ <p>The color theme files in the <em>lexlua/themes/</em> folder give more examples of
+ style definitions.</p>
+
+ <p><a id="lexer.Token.Styles"></a></p>
+
+ <h4>Token Styles</h4>
+
+ <p>Lexers use the <a href="#lexer.add_style"><code>lexer.add_style()</code></a> function to assign styles to
+ particular tokens. Recall the token definition from the lexer template:</p>
+
+ <pre><code>
+ local ws = token(lexer.WHITESPACE, lexer.space^1)
+ lex:add_rule('whitespace', ws)
+ </code></pre>
+
+ <p>Why is a style not assigned to the <code>lexer.WHITESPACE</code> token? As mentioned
+ earlier, lexers automatically associate tokens that use predefined token
+ names with a particular style. 
Only tokens with custom token names need
+ manual style associations. As an example, consider a custom whitespace token:</p>
+
+ <pre><code>
+ local ws = token('custom_whitespace', lexer.space^1)
+ </code></pre>
+
+ <p>Assigning a style to this token looks like:</p>
+
+ <pre><code>
+ lex:add_style('custom_whitespace', lexer.STYLE_WHITESPACE)
+ </code></pre>
+
+ <p>Do not confuse token names with rule names. They are completely different
+ entities. In the example above, the lexer associates the "custom_whitespace"
+ token with the existing style for <code>lexer.WHITESPACE</code> tokens. If instead you
+ prefer to color the background of whitespace a shade of grey, it might look
+ like:</p>
+
+ <pre><code>
+ local custom_style = lexer.STYLE_WHITESPACE..',back:$(color.grey)'
+ lex:add_style('custom_whitespace', custom_style)
+ </code></pre>
+
+ <p>Notice that the lexer performs Scintilla-style "$()" property expansion.
+ You may also use "%()". Remember to refrain from assigning specific colors in
+ styles; in this case, however, all user color themes probably define the
+ "color.grey" property.</p>
+
+ <p><a id="lexer.Line.Lexers"></a></p>
+
+ <h4>Line Lexers</h4>
+
+ <p>By default, lexers match the arbitrary chunks of text passed to them by
+ Scintilla. These chunks may be a full document, only the visible part of a
+ document, or even just portions of lines. Some lexers need to match whole
+ lines. For example, a lexer for the output of a file "diff" needs to know if
+ the line started with a '+' or '-' and then style the entire line
+ accordingly. To indicate that your lexer matches by line, create the lexer
+ with an extra parameter:</p>
+
+ <pre><code>
+ local lex = lexer.new('?', {lex_by_line = true})
+ </code></pre>
+
+ <p>Now the input text for the lexer is a single line at a time. 
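</p>
+
+ <p>As an illustrative sketch of such a diff lexer (the "addition" and
+ "deletion" token names and the "color.green" and "color.red" properties are
+ hypothetical, not part of the lexer module), rules that style whole added
+ and removed lines might look like:</p>
+
+ <pre><code>
+ local P = lpeg.P  -- assumes the usual lexer template is in scope
+ -- Match the leading '+' or '-' and the rest of the line as one token.
+ lex:add_rule('addition', token('addition', P('+') * lexer.nonnewline^0))
+ lex:add_rule('deletion', token('deletion', P('-') * lexer.nonnewline^0))
+ lex:add_style('addition', 'fore:$(color.green)')
+ lex:add_style('deletion', 'fore:$(color.red)')
+ </code></pre>
+
+ <p>Because the lexer matches line by line, each rule always sees the start of
+ a line, so the leading '+' or '-' reliably selects the token for the entire
+ line.</p>
+
+ <p>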
Keep in mind
+ that line lexers do not have the ability to look ahead at subsequent lines.</p>
+
+ <p><a id="lexer.Embedded.Lexers"></a></p>
+
+ <h4>Embedded Lexers</h4>
+
+ <p>Lexers embed within one another very easily, requiring minimal effort. In the
+ following sections, the lexer being embedded is called the "child" lexer and
+ the lexer a child is being embedded in is called the "parent". For example,
+ consider an HTML lexer and a CSS lexer. Each lexer stands alone, styling its
+ respective HTML or CSS files. However, CSS can be embedded inside
+ HTML. In this specific case, the CSS lexer is the "child" lexer with the HTML
+ lexer being the "parent". Now consider an HTML lexer and a PHP lexer. This
+ sounds a lot like the case with CSS, but there is a subtle difference: PHP
+ <em>embeds itself into</em> HTML while CSS is <em>embedded in</em> HTML. This fundamental
+ difference results in two types of embedded lexers: a parent lexer that
+ embeds other child lexers in it (like HTML embedding CSS), and a child lexer
+ that embeds itself into a parent lexer (like PHP embedding itself in HTML).</p>
+
+ <p><a id="lexer.Parent.Lexer"></a></p>
+
+ <h5>Parent Lexer</h5>
+
+ <p>Before embedding a child lexer into a parent lexer, the parent lexer needs to
+ load the child lexer. This is done with the <a href="#lexer.load"><code>lexer.load()</code></a> function. For
+ example, loading the CSS lexer within the HTML lexer looks like:</p>
+
+ <pre><code>
+ local css = lexer.load('css')
+ </code></pre>
+
+ <p>The next part of the embedding process is telling the parent lexer when to
+ switch over to the child lexer and when to switch back. The lexer refers to
+ these indications as the "start rule" and "end rule", respectively; both are
+ just LPeg patterns. 
Continuing with the HTML/CSS example, the transition from
+ HTML to CSS is when the lexer encounters a "style" tag with a "type"
+ attribute whose value is "text/css":</p>
+
+ <pre><code>
+ local css_tag = P('<style') * P(function(input, index)
+   if input:find('^[^>]+type="text/css"', index) then
+     return index
+   end
+ end)
+ </code></pre>
+
+ <p>This pattern looks for the beginning of a "style" tag and searches its
+ attribute list for the text "<code>type="text/css"</code>". (In this simplified example,
+ the Lua pattern allows no whitespace around the '=' and does not accept
+ single-quoted attribute values, even though both are valid.) If there is a match, the
+ functional pattern returns a value instead of <code>nil</code>. In this case, the value
+ returned does not matter because we ultimately want to style the "style" tag
+ as an HTML tag, so the actual start rule looks like this:</p>
+
+ <pre><code>
+ local css_start_rule = #css_tag * tag
+ </code></pre>
+
+ <p>Now that the parent knows when to switch to the child, it needs to know when
+ to switch back. In the case of HTML/CSS, the switch back occurs when the
+ lexer encounters an ending "style" tag, though the lexer should still style
+ the tag as an HTML tag:</p>
+
+ <pre><code>
+ local css_end_rule = #P('</style>') * tag
+ </code></pre>
+
+ <p>Once the parent loads the child lexer and defines the child's start and end
+ rules, it embeds the child with the <a href="#lexer.embed"><code>lexer.embed()</code></a> function:</p>
+
+ <pre><code>
+ lex:embed(css, css_start_rule, css_end_rule)
+ </code></pre>
+
+ <p><a id="lexer.Child.Lexer"></a></p>
+
+ <h5>Child Lexer</h5>
+
+ <p>The process for instructing a child lexer to embed itself into a parent is
+ very similar to embedding a child into a parent: first, load the parent lexer
+ into the child lexer with the <a href="#lexer.load"><code>lexer.load()</code></a> function and then create
+ start and end rules for the child lexer. 
However, in this case, call + <a href="#lexer.embed"><code>lexer.embed()</code></a> with switched arguments. For example, in the PHP lexer:</p> + + <pre><code> + local html = lexer.load('html') + local php_start_rule = token('php_tag', '<?php ') + local php_end_rule = token('php_tag', '?>') + lex:add_style('php_tag', lexer.STYLE_EMBEDDED) + html:embed(lex, php_start_rule, php_end_rule) + </code></pre> + + <p><a id="lexer.Lexers.with.Complex.State"></a></p> + + <h4>Lexers with Complex State</h4> + + <p>A vast majority of lexers are not stateful and can operate on any chunk of + text in a document. However, there may be rare cases where a lexer does need + to keep track of some sort of persistent state. Rather than using <code>lpeg.P</code> + function patterns that set state variables, it is recommended to make use of + Scintilla's built-in, per-line state integers via <a href="#lexer.line_state"><code>lexer.line_state</code></a>. It + was designed to accommodate up to 32 bit flags for tracking state. + <a href="#lexer.line_from_position"><code>lexer.line_from_position()</code></a> will return the line for any position given + to an <code>lpeg.P</code> function pattern. (Any positions derived from that position + argument will also work.)</p> + + <p>Writing stateful lexers is beyond the scope of this document.</p> + + <p><a id="lexer.Code.Folding"></a></p> + + <h3>Code Folding</h3> + + <p>When reading source code, it is occasionally helpful to temporarily hide + blocks of code like functions, classes, comments, etc. This is the concept of + "folding". In many Scintilla-based editors, such as Textadept, little indicators + in the editor margins appear next to code that can be folded at places called + "fold points". When the user clicks an indicator, the editor hides the code + associated with the indicator until the user clicks the indicator again. 
The
+ lexer specifies these fold points and what code exactly to fold.</p>
+
+ <p>The fold points for most languages occur on keywords or character sequences.
+ Examples of fold keywords are "if" and "end" in Lua and examples of fold
+ character sequences are '{', '}', "/*", and "*/" in C for code block and
+ comment delimiters, respectively. However, these fold points cannot occur
+ just anywhere. For example, lexers should not recognize fold keywords that
+ appear within strings or comments. The <a href="#lexer.add_fold_point"><code>lexer.add_fold_point()</code></a> function
+ allows you to conveniently define fold points with such granularity. For
+ example, consider C:</p>
+
+ <pre><code>
+ lex:add_fold_point(lexer.OPERATOR, '{', '}')
+ lex:add_fold_point(lexer.COMMENT, '/*', '*/')
+ </code></pre>
+
+ <p>The first call states that any '{' or '}' that the lexer recognizes as
+ a <code>lexer.OPERATOR</code> token is a fold point. Likewise, the second call
+ states that any "/*" or "*/" that the lexer recognizes as part of a
+ <code>lexer.COMMENT</code> token is a fold point. The lexer does not consider any
+ occurrences of these characters outside their defined tokens (such as in a
+ string) as fold points. How do you specify fold keywords? Here is an example
+ for Lua:</p>
+
+ <pre><code>
+ lex:add_fold_point(lexer.KEYWORD, 'if', 'end')
+ lex:add_fold_point(lexer.KEYWORD, 'do', 'end')
+ lex:add_fold_point(lexer.KEYWORD, 'function', 'end')
+ lex:add_fold_point(lexer.KEYWORD, 'repeat', 'until')
+ </code></pre>
+
+ <p>If your lexer has case-insensitive keywords as fold points, simply add a
+ <code>case_insensitive_fold_points = true</code> option to <a href="#lexer.new"><code>lexer.new()</code></a>, and
+ specify keywords in lower case.</p>
+
+ <p>If your lexer needs to do some additional processing in order to determine if
+ a token is a fold point, pass a function that returns an integer to
+ <code>lex:add_fold_point()</code>. 
Returning <code>1</code> indicates the token is a beginning fold + point and returning <code>-1</code> indicates the token is an ending fold point. + Returning <code>0</code> indicates the token is not a fold point. For example:</p> + + <pre><code> + local function fold_strange_token(text, pos, line, s, symbol) + if ... then + return 1 -- beginning fold point + elseif ... then + return -1 -- ending fold point + end + return 0 + end + + lex:add_fold_point('strange_token', '|', fold_strange_token) + </code></pre> + + <p>Any time the lexer encounters a '|' that is a "strange_token", it calls the + <code>fold_strange_token</code> function to determine if '|' is a fold point. The lexer + calls these functions with the following arguments: the text to identify fold + points in, the beginning position of the current line in the text to fold, + the current line's text, the position in the current line the fold point text + starts at, and the fold point text itself.</p> + + <p><a id="lexer.Fold.by.Indentation"></a></p> + + <h4>Fold by Indentation</h4> + + <p>Some languages have significant whitespace and/or no delimiters that indicate + fold points. If your lexer falls into this category and you would like to + mark fold points based on changes in indentation, create the lexer with a + <code>fold_by_indentation = true</code> option:</p> + + <pre><code> + local lex = lexer.new('?', {fold_by_indentation = true}) + </code></pre> + + <p><a id="lexer.Using.Lexers"></a></p> + + <h3>Using Lexers</h3> + + <p><a id="lexer.Textadept"></a></p> + + <h4>Textadept</h4> + + <p>Put your lexer in your <em>~/.textadept/lexers/</em> directory so you do not + overwrite it when upgrading Textadept. Also, lexers in this directory + override default lexers. Thus, Textadept loads a user <em>lua</em> lexer instead of + the default <em>lua</em> lexer. This is convenient for tweaking a default lexer to + your liking. 
Then add a <a href="https://foicica.com/textadept/api.html#textadept.file_types">file type</a> for your lexer if necessary.</p>
+
+ <p><a id="lexer.Migrating.Legacy.Lexers"></a></p>
+
+ <h3>Migrating Legacy Lexers</h3>
+
+ <p>Legacy lexers are of the form:</p>
+
+ <pre><code>
+ local l = require('lexer')
+ local token, word_match = l.token, l.word_match
+ local P, R, S = lpeg.P, lpeg.R, lpeg.S
+
+ local M = {_NAME = '?'}
+
+ [... token and pattern definitions ...]
+
+ M._rules = {
+   {'rule', pattern},
+   [...]
+ }
+
+ M._tokenstyles = {
+   ['token'] = 'style',
+   [...]
+ }
+
+ M._foldsymbols = {
+   _patterns = {...},
+   ['token'] = {['start'] = 1, ['end'] = -1},
+   [...]
+ }
+
+ return M
+ </code></pre>
+
+ <p>While such legacy lexers will be handled just fine without any
+ changes, it is recommended that you migrate yours. The migration process is
+ fairly straightforward:</p>
+
+ <ol>
+ <li>Replace all instances of <code>l</code> with <code>lexer</code>, as it's better practice and
+ results in less confusion.</li>
+ <li>Replace <code>local M = {_NAME = '?'}</code> with <code>local lex = lexer.new('?')</code>, where
+ <code>?</code> is the name of your legacy lexer. 
At the end of the lexer, change + <code>return M</code> to <code>return lex</code>.</li> + <li>Instead of defining rules towards the end of your lexer, define your rules + as you define your tokens and patterns using + <a href="#lexer.add_rule"><code>lex:add_rule()</code></a>.</li> + <li>Similarly, any custom token names should have their styles immediately + defined using <a href="#lexer.add_style"><code>lex:add_style()</code></a>.</li> + <li>Convert any table arguments passed to <a href="#lexer.word_match"><code>lexer.word_match()</code></a> to a + space-separated string of words.</li> + <li>Replace any calls to <code>lexer.embed(M, child, ...)</code> and + <code>lexer.embed(parent, M, ...)</code> with + <a href="#lexer.embed"><code>lex:embed</code></a><code>(child, ...)</code> and <code>parent:embed(lex, ...)</code>, + respectively.</li> + <li>Define fold points with simple calls to + <a href="#lexer.add_fold_point"><code>lex:add_fold_point()</code></a>. No need to mess with Lua + patterns anymore.</li> + <li>Any legacy lexer options such as <code>M._FOLDBYINDENTATION</code>, <code>M._LEXBYLINE</code>, + <code>M._lexer</code>, etc. 
should be added as table options to <a href="#lexer.new"><code>lexer.new()</code></a>.</li>
+ <li>Any external lexer rule fetching and/or modifications via <code>lexer._RULES</code>
+ should be changed to use <a href="#lexer.get_rule"><code>lexer.get_rule()</code></a> and
+ <a href="#lexer.modify_rule"><code>lexer.modify_rule()</code></a>.</li>
+ </ol>
+
+
+ <p>As an example, consider the following sample legacy lexer:</p>
+
+ <pre><code>
+ local l = require('lexer')
+ local token, word_match = l.token, l.word_match
+ local P, R, S = lpeg.P, lpeg.R, lpeg.S
+
+ local M = {_NAME = 'legacy'}
+
+ local ws = token(l.WHITESPACE, l.space^1)
+ local comment = token(l.COMMENT, '#' * l.nonnewline^0)
+ local string = token(l.STRING, l.delimited_range('"'))
+ local number = token(l.NUMBER, l.float + l.integer)
+ local keyword = token(l.KEYWORD, word_match{'foo', 'bar', 'baz'})
+ local custom = token('custom', P('quux'))
+ local identifier = token(l.IDENTIFIER, l.word)
+ local operator = token(l.OPERATOR, S('+-*/%^=&lt;&gt;,.()[]{}'))
+
+ M._rules = {
+   {'whitespace', ws},
+   {'keyword', keyword},
+   {'custom', custom},
+   {'identifier', identifier},
+   {'string', string},
+   {'comment', comment},
+   {'number', number},
+   {'operator', operator}
+ }
+
+ M._tokenstyles = {
+   ['custom'] = l.STYLE_KEYWORD..',bold'
+ }
+
+ M._foldsymbols = {
+   _patterns = {'[{}]'},
+   [l.OPERATOR] = {['{'] = 1, ['}'] = -1}
+ }
+
+ return M
+ </code></pre>
+
+ <p>Following the migration steps would yield:</p>
+
+ <pre><code>
+ local lexer = require('lexer')
+ local token, word_match = lexer.token, lexer.word_match
+ local P, R, S = lpeg.P, lpeg.R, lpeg.S
+
+ local lex = lexer.new('legacy')
+
+ lex:add_rule('whitespace', token(lexer.WHITESPACE, lexer.space^1))
+ lex:add_rule('keyword', token(lexer.KEYWORD, word_match[[foo bar baz]]))
+ lex:add_rule('custom', token('custom', P('quux')))
+ lex:add_style('custom', lexer.STYLE_KEYWORD..',bold')
+ lex:add_rule('identifier', token(lexer.IDENTIFIER, lexer.word))
+ 
lex:add_rule('string', token(lexer.STRING, lexer.delimited_range('"')))
+ lex:add_rule('comment', token(lexer.COMMENT, '#' * lexer.nonnewline^0))
+ lex:add_rule('number', token(lexer.NUMBER, lexer.float + lexer.integer))
+ lex:add_rule('operator', token(lexer.OPERATOR, S('+-*/%^=<>,.()[]{}')))
+
+ lex:add_fold_point(lexer.OPERATOR, '{', '}')
+
+ return lex
+ </code></pre>
+
+ <p><a id="lexer.Considerations"></a></p>
+
+ <h3>Considerations</h3>
+
+ <p><a id="lexer.Performance"></a></p>
+
+ <h4>Performance</h4>
+
+ <p>There might be some slight overhead when initializing a lexer, but loading a
+ file from disk into Scintilla is usually more expensive. On modern computer
+ systems, I see no difference in speed between Lua lexers and Scintilla's C++
+ ones. Optimize lexers for speed by re-arranging <code>lexer.add_rule()</code> calls so
+ that the most common rules match first. Do keep in mind that order matters
+ for similar rules.</p>
+
+ <p>In some cases, folding may be far more expensive than lexing, particularly
+ in lexers with a lot of potential fold points. If your lexer is exhibiting
+ signs of slowness, try disabling folding in your text editor first. If that
+ speeds things up, you can try reducing the number of fold points you added,
+ overriding <code>lexer.fold()</code> with your own implementation, or simply eliminating
+ folding support from your lexer.</p>
+
+ <p><a id="lexer.Limitations"></a></p>
+
+ <h4>Limitations</h4>
+
+ <p>Embedded preprocessor languages like PHP cannot embed completely in their
+ parent languages because the parent's tokens do not support start and end
+ rules. This mostly goes unnoticed, but code like</p>
+
+ <pre><code>
+ <div id="<?php echo $id; ?>">
+ </code></pre>
+
+ <p>will not style correctly.</p>
+
+ <p><a id="lexer.Troubleshooting"></a></p>
+
+ <h4>Troubleshooting</h4>
+
+ <p>Errors in lexers can be tricky to debug. 
Lexers print Lua errors to + <code>io.stderr</code> and <code>_G.print()</code> statements to <code>io.stdout</code>. Running your editor + from a terminal is the easiest way to see errors as they occur.</p> + + <p><a id="lexer.Risks"></a></p> + + <h4>Risks</h4> + + <p>Poorly written lexers have the ability to crash Scintilla (and thus its + containing application), so unsaved data might be lost. However, I have only + observed these crashes in early lexer development, when syntax errors or + pattern errors are present. Once the lexer actually starts styling text + (either correctly or incorrectly, it does not matter), I have not observed + any crashes.</p> + + <p><a id="lexer.Acknowledgements"></a></p> + + <h4>Acknowledgements</h4> + + <p>Thanks to Peter Odding for his <a href="http://lua-users.org/lists/lua-l/2007-04/msg00116.html">lexer post</a> on the Lua mailing list + that inspired me, and thanks to Roberto Ierusalimschy for LPeg.</p> + + <h2>Lua <code>lexer</code> module API fields</h2> + + <p><a id="lexer.CLASS"></a></p> + + <h3><code>lexer.CLASS</code> (string)</h3> + + <p>The token name for class tokens.</p> + + <p><a id="lexer.COMMENT"></a></p> + + <h3><code>lexer.COMMENT</code> (string)</h3> + + <p>The token name for comment tokens.</p> + + <p><a id="lexer.CONSTANT"></a></p> + + <h3><code>lexer.CONSTANT</code> (string)</h3> + + <p>The token name for constant tokens.</p> + + <p><a id="lexer.DEFAULT"></a></p> + + <h3><code>lexer.DEFAULT</code> (string)</h3> + + <p>The token name for default tokens.</p> + + <p><a id="lexer.ERROR"></a></p> + + <h3><code>lexer.ERROR</code> (string)</h3> + + <p>The token name for error tokens.</p> + + <p><a id="lexer.FOLD_BASE"></a></p> + + <h3><code>lexer.FOLD_BASE</code> (number)</h3> + + <p>The initial (root) fold level.</p> + + <p><a id="lexer.FOLD_BLANK"></a></p> + + <h3><code>lexer.FOLD_BLANK</code> (number)</h3> + + <p>Flag indicating that the line is blank.</p> + + <p><a id="lexer.FOLD_HEADER"></a></p> + + 
<h3><code>lexer.FOLD_HEADER</code> (number)</h3>
+
+ <p>Flag indicating that the line is a fold point.</p>
+
+ <p><a id="lexer.FUNCTION"></a></p>
+
+ <h3><code>lexer.FUNCTION</code> (string)</h3>
+
+ <p>The token name for function tokens.</p>
+
+ <p><a id="lexer.IDENTIFIER"></a></p>
+
+ <h3><code>lexer.IDENTIFIER</code> (string)</h3>
+
+ <p>The token name for identifier tokens.</p>
+
+ <p><a id="lexer.KEYWORD"></a></p>
+
+ <h3><code>lexer.KEYWORD</code> (string)</h3>
+
+ <p>The token name for keyword tokens.</p>
+
+ <p><a id="lexer.LABEL"></a></p>
+
+ <h3><code>lexer.LABEL</code> (string)</h3>
+
+ <p>The token name for label tokens.</p>
+
+ <p><a id="lexer.NUMBER"></a></p>
+
+ <h3><code>lexer.NUMBER</code> (string)</h3>
+
+ <p>The token name for number tokens.</p>
+
+ <p><a id="lexer.OPERATOR"></a></p>
+
+ <h3><code>lexer.OPERATOR</code> (string)</h3>
+
+ <p>The token name for operator tokens.</p>
+
+ <p><a id="lexer.PREPROCESSOR"></a></p>
+
+ <h3><code>lexer.PREPROCESSOR</code> (string)</h3>
+
+ <p>The token name for preprocessor tokens.</p>
+
+ <p><a id="lexer.REGEX"></a></p>
+
+ <h3><code>lexer.REGEX</code> (string)</h3>
+
+ <p>The token name for regex tokens.</p>
+
+ <p><a id="lexer.STRING"></a></p>
+
+ <h3><code>lexer.STRING</code> (string)</h3>
+
+ <p>The token name for string tokens.</p>
+
+ <p><a id="lexer.STYLE_BRACEBAD"></a></p>
+
+ <h3><code>lexer.STYLE_BRACEBAD</code> (string)</h3>
+
+ <p>The style used for unmatched brace characters.</p>
+
+ <p><a id="lexer.STYLE_BRACELIGHT"></a></p>
+
+ <h3><code>lexer.STYLE_BRACELIGHT</code> (string)</h3>
+
+ <p>The style used for highlighted brace characters.</p>
+
+ <p><a id="lexer.STYLE_CALLTIP"></a></p>
+
+ <h3><code>lexer.STYLE_CALLTIP</code> (string)</h3>
+
+ <p>The style used by call tips if <a href="#buffer.call_tip_use_style"><code>buffer.call_tip_use_style</code></a> is set. 
+ Only the font name, size, and color attributes are used.</p> + + <p><a id="lexer.STYLE_CLASS"></a></p> + + <h3><code>lexer.STYLE_CLASS</code> (string)</h3> + + <p>The style typically used for class definitions.</p> + + <p><a id="lexer.STYLE_COMMENT"></a></p> + + <h3><code>lexer.STYLE_COMMENT</code> (string)</h3> + + <p>The style typically used for code comments.</p> + + <p><a id="lexer.STYLE_CONSTANT"></a></p> + + <h3><code>lexer.STYLE_CONSTANT</code> (string)</h3> + + <p>The style typically used for constants.</p> + + <p><a id="lexer.STYLE_CONTROLCHAR"></a></p> + + <h3><code>lexer.STYLE_CONTROLCHAR</code> (string)</h3> + + <p>The style used for control characters. + Color attributes are ignored.</p> + + <p><a id="lexer.STYLE_DEFAULT"></a></p> + + <h3><code>lexer.STYLE_DEFAULT</code> (string)</h3> + + <p>The style all styles are based off of.</p> + + <p><a id="lexer.STYLE_EMBEDDED"></a></p> + + <h3><code>lexer.STYLE_EMBEDDED</code> (string)</h3> + + <p>The style typically used for embedded code.</p> + + <p><a id="lexer.STYLE_ERROR"></a></p> + + <h3><code>lexer.STYLE_ERROR</code> (string)</h3> + + <p>The style typically used for erroneous syntax.</p> + + <p><a id="lexer.STYLE_FOLDDISPLAYTEXT"></a></p> + + <h3><code>lexer.STYLE_FOLDDISPLAYTEXT</code> (string)</h3> + + <p>The style used for fold display text.</p> + + <p><a id="lexer.STYLE_FUNCTION"></a></p> + + <h3><code>lexer.STYLE_FUNCTION</code> (string)</h3> + + <p>The style typically used for function definitions.</p> + + <p><a id="lexer.STYLE_IDENTIFIER"></a></p> + + <h3><code>lexer.STYLE_IDENTIFIER</code> (string)</h3> + + <p>The style typically used for identifier words.</p> + + <p><a id="lexer.STYLE_INDENTGUIDE"></a></p> + + <h3><code>lexer.STYLE_INDENTGUIDE</code> (string)</h3> + + <p>The style used for indentation guides.</p> + + <p><a id="lexer.STYLE_KEYWORD"></a></p> + + <h3><code>lexer.STYLE_KEYWORD</code> (string)</h3> + + <p>The style typically used for language keywords.</p> + + <p><a 
id="lexer.STYLE_LABEL"></a></p> + + <h3><code>lexer.STYLE_LABEL</code> (string)</h3> + + <p>The style typically used for labels.</p> + + <p><a id="lexer.STYLE_LINENUMBER"></a></p> + + <h3><code>lexer.STYLE_LINENUMBER</code> (string)</h3> + + <p>The style used for all margins except fold margins.</p> + + <p><a id="lexer.STYLE_NUMBER"></a></p> + + <h3><code>lexer.STYLE_NUMBER</code> (string)</h3> + + <p>The style typically used for numbers.</p> + + <p><a id="lexer.STYLE_OPERATOR"></a></p> + + <h3><code>lexer.STYLE_OPERATOR</code> (string)</h3> + + <p>The style typically used for operators.</p> + + <p><a id="lexer.STYLE_PREPROCESSOR"></a></p> + + <h3><code>lexer.STYLE_PREPROCESSOR</code> (string)</h3> + + <p>The style typically used for preprocessor statements.</p> + + <p><a id="lexer.STYLE_REGEX"></a></p> + + <h3><code>lexer.STYLE_REGEX</code> (string)</h3> + + <p>The style typically used for regular expression strings.</p> + + <p><a id="lexer.STYLE_STRING"></a></p> + + <h3><code>lexer.STYLE_STRING</code> (string)</h3> + + <p>The style typically used for strings.</p> + + <p><a id="lexer.STYLE_TYPE"></a></p> + + <h3><code>lexer.STYLE_TYPE</code> (string)</h3> + + <p>The style typically used for static types.</p> + + <p><a id="lexer.STYLE_VARIABLE"></a></p> + + <h3><code>lexer.STYLE_VARIABLE</code> (string)</h3> + + <p>The style typically used for variables.</p> + + <p><a id="lexer.STYLE_WHITESPACE"></a></p> + + <h3><code>lexer.STYLE_WHITESPACE</code> (string)</h3> + + <p>The style typically used for whitespace.</p> + + <p><a id="lexer.TYPE"></a></p> + + <h3><code>lexer.TYPE</code> (string)</h3> + + <p>The token name for type tokens.</p> + + <p><a id="lexer.VARIABLE"></a></p> + + <h3><code>lexer.VARIABLE</code> (string)</h3> + + <p>The token name for variable tokens.</p> + + <p><a id="lexer.WHITESPACE"></a></p> + + <h3><code>lexer.WHITESPACE</code> (string)</h3> + + <p>The token name for whitespace tokens.</p> + + <p><a id="lexer.alnum"></a></p> + + 
<h3><code>lexer.alnum</code> (pattern)</h3> + + <p>A pattern that matches any alphanumeric character ('A'-'Z', 'a'-'z', + '0'-'9').</p> + + <p><a id="lexer.alpha"></a></p> + + <h3><code>lexer.alpha</code> (pattern)</h3> + + <p>A pattern that matches any alphabetic character ('A'-'Z', 'a'-'z').</p> + + <p><a id="lexer.any"></a></p> + + <h3><code>lexer.any</code> (pattern)</h3> + + <p>A pattern that matches any single character.</p> + + <p><a id="lexer.ascii"></a></p> + + <h3><code>lexer.ascii</code> (pattern)</h3> + + <p>A pattern that matches any ASCII character (codes 0 to 127).</p> + + <p><a id="lexer.cntrl"></a></p> + + <h3><code>lexer.cntrl</code> (pattern)</h3> + + <p>A pattern that matches any control character (ASCII codes 0 to 31).</p> + + <p><a id="lexer.dec_num"></a></p> + + <h3><code>lexer.dec_num</code> (pattern)</h3> + + <p>A pattern that matches a decimal number.</p> + + <p><a id="lexer.digit"></a></p> + + <h3><code>lexer.digit</code> (pattern)</h3> + + <p>A pattern that matches any digit ('0'-'9').</p> + + <p><a id="lexer.extend"></a></p> + + <h3><code>lexer.extend</code> (pattern)</h3> + + <p>A pattern that matches any ASCII extended character (codes 0 to 255).</p> + + <p><a id="lexer.float"></a></p> + + <h3><code>lexer.float</code> (pattern)</h3> + + <p>A pattern that matches a floating point number.</p> + + <p><a id="lexer.fold_level"></a></p> + + <h3><code>lexer.fold_level</code> (table, Read-only)</h3> + + <p>Table of fold level bit-masks for line numbers starting from zero. + Fold level masks are composed of an integer level combined with any of the + following bits:</p> + + <ul> + <li><code>lexer.FOLD_BASE</code> + The initial fold level.</li> + <li><code>lexer.FOLD_BLANK</code> + The line is blank.</li> + <li><code>lexer.FOLD_HEADER</code> + The line is a header, or fold point.</li> + </ul> + + + <p><a id="lexer.graph"></a></p> + + <h3><code>lexer.graph</code> (pattern)</h3> + + <p>A pattern that matches any graphical character ('!' 
to '~').</p> + + <p><a id="lexer.hex_num"></a></p> + + <h3><code>lexer.hex_num</code> (pattern)</h3> + + <p>A pattern that matches a hexadecimal number.</p> + + <p><a id="lexer.indent_amount"></a></p> + + <h3><code>lexer.indent_amount</code> (table, Read-only)</h3> + + <p>Table of indentation amounts in character columns, for line numbers + starting from zero.</p> + + <p><a id="lexer.integer"></a></p> + + <h3><code>lexer.integer</code> (pattern)</h3> + + <p>A pattern that matches either a decimal, hexadecimal, or octal number.</p> + + <p><a id="lexer.line_state"></a></p> + + <h3><code>lexer.line_state</code> (table)</h3> + + <p>Table of integer line states for line numbers starting from zero. + Line states can be used by lexers for keeping track of persistent states.</p> + + <p><a id="lexer.lower"></a></p> + + <h3><code>lexer.lower</code> (pattern)</h3> + + <p>A pattern that matches any lower case character ('a'-'z').</p> + + <p><a id="lexer.newline"></a></p> + + <h3><code>lexer.newline</code> (pattern)</h3> + + <p>A pattern that matches any set of end of line characters.</p> + + <p><a id="lexer.nonnewline"></a></p> + + <h3><code>lexer.nonnewline</code> (pattern)</h3> + + <p>A pattern that matches any single, non-newline character.</p> + + <p><a id="lexer.nonnewline_esc"></a></p> + + <h3><code>lexer.nonnewline_esc</code> (pattern)</h3> + + <p>A pattern that matches any single, non-newline character or any set of end + of line characters escaped with '\'.</p> + + <p><a id="lexer.oct_num"></a></p> + + <h3><code>lexer.oct_num</code> (pattern)</h3> + + <p>A pattern that matches an octal number.</p> + + <p><a id="lexer.path"></a></p> + + <h3><code>lexer.path</code> (string)</h3> + + <p>The path used to search for a lexer to load. + Identical in format to Lua's <code>package.path</code> string. 
+ The default value is <code>package.path</code>.</p> + + <p><a id="lexer.print"></a></p> + + <h3><code>lexer.print</code> (pattern)</h3> + + <p>A pattern that matches any printable character (' ' to '~').</p> + + <p><a id="lexer.property"></a></p> + + <h3><code>lexer.property</code> (table)</h3> + + <p>Map of key-value string pairs.</p> + + <p><a id="lexer.property_expanded"></a></p> + + <h3><code>lexer.property_expanded</code> (table, Read-only)</h3> + + <p>Map of key-value string pairs with <code>$()</code> and <code>%()</code> variable replacement + performed in values.</p> + + <p><a id="lexer.property_int"></a></p> + + <h3><code>lexer.property_int</code> (table, Read-only)</h3> + + <p>Map of key-value pairs with values interpreted as numbers, or <code>0</code> if not + found.</p> + + <p><a id="lexer.punct"></a></p> + + <h3><code>lexer.punct</code> (pattern)</h3> + + <p>A pattern that matches any punctuation character ('!' to '/', ':' to '@', + '[' to '`', '{' to '~').</p> + + <p><a id="lexer.space"></a></p> + + <h3><code>lexer.space</code> (pattern)</h3> + + <p>A pattern that matches any whitespace character ('\t', '\v', '\f', '\n', + '\r', space).</p> + + <p><a id="lexer.style_at"></a></p> + + <h3><code>lexer.style_at</code> (table, Read-only)</h3> + + <p>Table of style names at positions in the buffer starting from 1.</p> + + <p><a id="lexer.upper"></a></p> + + <h3><code>lexer.upper</code> (pattern)</h3> + + <p>A pattern that matches any upper case character ('A'-'Z').</p> + + <p><a id="lexer.word"></a></p> + + <h3><code>lexer.word</code> (pattern)</h3> + + <p>A pattern that matches a typical word. 
Words begin with a letter or + underscore and consist of alphanumeric and underscore characters.</p> + + <p><a id="lexer.xdigit"></a></p> + + <h3><code>lexer.xdigit</code> (pattern)</h3> + + <p>A pattern that matches any hexadecimal digit ('0'-'9', 'A'-'F', 'a'-'f').</p> + + <h2>Lua <code>lexer</code> module API functions</h2> + + <p><a id="lexer.add_fold_point"></a></p> + + <h3><code>lexer.add_fold_point</code> (lexer, token_name, start_symbol, end_symbol)</h3> + + <p>Adds to lexer <em>lexer</em> a fold point whose beginning and end tokens are string + <em>token_name</em> tokens with string content <em>start_symbol</em> and <em>end_symbol</em>, + respectively. + In the event that <em>start_symbol</em> may or may not be a fold point depending on + context, and that additional processing is required, <em>end_symbol</em> may be a + function that ultimately returns <code>1</code> (indicating a beginning fold point), + <code>-1</code> (indicating an ending fold point), or <code>0</code> (indicating no fold point). 
+ That function is passed the following arguments:</p> + + <ul> + <li><code>text</code>: The text being processed for fold points.</li> + <li><code>pos</code>: The position in <em>text</em> of the beginning of the line currently + being processed.</li> + <li><code>line</code>: The text of the line currently being processed.</li> + <li><code>s</code>: The position of <em>start_symbol</em> in <em>line</em>.</li> + <li><code>symbol</code>: <em>start_symbol</em> itself.</li> + </ul> + + + <p>Fields:</p> + + <ul> + <li><code>lexer</code>: The lexer to add a fold point to.</li> + <li><code>token_name</code>: The token name of text that indicates a fold point.</li> + <li><code>start_symbol</code>: The text that indicates the beginning of a fold point.</li> + <li><code>end_symbol</code>: Either the text that indicates the end of a fold point, or + a function that returns whether or not <em>start_symbol</em> is a beginning fold + point (1), an ending fold point (-1), or not a fold point at all (0).</li> + </ul> + + + <p>Usage:</p> + + <ul> + <li><code>lex:add_fold_point(lexer.OPERATOR, '{', '}')</code></li> + <li><code>lex:add_fold_point(lexer.KEYWORD, 'if', 'end')</code></li> + <li><code>lex:add_fold_point(lexer.COMMENT, '#', lexer.fold_line_comments('#'))</code></li> + <li><code>lex:add_fold_point('custom', function(text, pos, line, s, symbol) + ... end)</code></li> + </ul> + + + <p><a id="lexer.add_rule"></a></p> + + <h3><code>lexer.add_rule</code> (lexer, id, rule)</h3> + + <p>Adds pattern <em>rule</em> identified by string <em>id</em> to the ordered list of rules + for lexer <em>lexer</em>.</p> + + <p>Fields:</p> + + <ul> + <li><code>lexer</code>: The lexer to add the given rule to.</li> + <li><code>id</code>: The id associated with this rule. 
It does not have to be the same + as the name passed to <code>token()</code>.</li> + <li><code>rule</code>: The LPeg pattern of the rule.</li> + </ul> + + + <p>See also:</p> + + <ul> + <li><a href="#lexer.modify_rule"><code>lexer.modify_rule</code></a></li> + </ul> + + + <p><a id="lexer.add_style"></a></p> + + <h3><code>lexer.add_style</code> (lexer, token_name, style)</h3> + + <p>Associates string <em>token_name</em> in lexer <em>lexer</em> with Scintilla style string + <em>style</em>. + Style strings are comma-separated property settings. Available property + settings are:</p> + + <ul> + <li><code>font:name</code>: Font name.</li> + <li><code>size:int</code>: Font size.</li> + <li><code>bold</code> or <code>notbold</code>: Whether or not the font face is bold.</li> + <li><code>weight:int</code>: Font weight (between 1 and 999).</li> + <li><code>italics</code> or <code>notitalics</code>: Whether or not the font face is italic.</li> + <li><code>underlined</code> or <code>notunderlined</code>: Whether or not the font face is + underlined.</li> + <li><code>fore:color</code>: Font face foreground color in "#RRGGBB" or 0xBBGGRR format.</li> + <li><code>back:color</code>: Font face background color in "#RRGGBB" or 0xBBGGRR format.</li> + <li><code>eolfilled</code> or <code>noteolfilled</code>: Whether or not the background color + extends to the end of the line.</li> + <li><code>case:char</code>: Font case ('u' for uppercase, 'l' for lowercase, and 'm' for + mixed case).</li> + <li><code>visible</code> or <code>notvisible</code>: Whether or not the text is visible.</li> + <li><code>changeable</code> or <code>notchangeable</code>: Whether or not the text is changeable or + read-only.</li> + </ul> + + + <p>Property settings may also contain "$(property.name)" expansions for + properties defined in Scintilla, theme files, etc.</p> + + <p>Fields:</p> + + <ul> + <li><code>lexer</code>: The lexer to add a style to.</li> + <li><code>token_name</code>: The name of the token to 
associate with the style.</li> + <li><code>style</code>: A style string for Scintilla.</li> + </ul> + + + <p>Usage:</p> + + <ul> + <li><code>lex:add_style('longstring', lexer.STYLE_STRING)</code></li> + <li><code>lex:add_style('deprecated_function', lexer.STYLE_FUNCTION..',italics')</code></li> + <li><code>lex:add_style('visible_ws', + lexer.STYLE_WHITESPACE..',back:$(color.grey)')</code></li> + </ul> + + + <p><a id="lexer.delimited_range"></a></p> + + <h3><code>lexer.delimited_range</code> (chars, single_line, no_escape, balanced)</h3> + + <p>Creates and returns a pattern that matches a range of text bounded by + <em>chars</em> characters. + This is a convenience function for matching more complicated delimited ranges + like strings with escape characters and balanced parentheses. <em>single_line</em> + indicates whether or not the range must be on a single line, <em>no_escape</em> + indicates whether or not to ignore '\' as an escape character, and <em>balanced</em> + indicates whether or not to handle balanced ranges like parentheses and + requires <em>chars</em> to be composed of two characters.</p> + + <p>Fields:</p> + + <ul> + <li><code>chars</code>: The character(s) that bound the matched range.</li> + <li><code>single_line</code>: Optional flag indicating whether or not the range must be + on a single line.</li> + <li><code>no_escape</code>: Optional flag indicating whether or not the range end + character may be escaped by a '\' character.</li> + <li><code>balanced</code>: Optional flag indicating whether or not to match a balanced + range, like the "%b" Lua pattern. This flag only applies if <em>chars</em> + consists of two different characters (e.g. 
"()").</li> + </ul> + + + <p>Usage:</p> + + <ul> + <li><code>local dq_str_escapes = lexer.delimited_range('"')</code></li> + <li><code>local dq_str_noescapes = lexer.delimited_range('"', false, true)</code></li> + <li><code>local unbalanced_parens = lexer.delimited_range('()')</code></li> + <li><code>local balanced_parens = lexer.delimited_range('()', false, false, + true)</code></li> + </ul> + + + <p>Return:</p> + + <ul> + <li>pattern</li> + </ul> + + + <p>See also:</p> + + <ul> + <li><a href="#lexer.nested_pair"><code>lexer.nested_pair</code></a></li> + </ul> + + + <p><a id="lexer.embed"></a></p> + + <h3><code>lexer.embed</code> (lexer, child, start_rule, end_rule)</h3> + + <p>Embeds child lexer <em>child</em> in parent lexer <em>lexer</em> using patterns + <em>start_rule</em> and <em>end_rule</em>, which signal the beginning and end of the + embedded lexer, respectively.</p> + + <p>Fields:</p> + + <ul> + <li><code>lexer</code>: The parent lexer.</li> + <li><code>child</code>: The child lexer.</li> + <li><code>start_rule</code>: The pattern that signals the beginning of the embedded + lexer.</li> + <li><code>end_rule</code>: The pattern that signals the end of the embedded lexer.</li> + </ul> + + + <p>Usage:</p> + + <ul> + <li><code>html:embed(css, css_start_rule, css_end_rule)</code></li> + <li><code>html:embed(lex, php_start_rule, php_end_rule) -- from php lexer</code></li> + </ul> + + + <p><a id="lexer.fold"></a></p> + + <h3><code>lexer.fold</code> (lexer, text, start_pos, start_line, start_level)</h3> + + <p>Determines fold points in a chunk of text <em>text</em> using lexer <em>lexer</em>, + returning a table of fold levels associated with line numbers. 
+ <em>text</em> starts at position <em>start_pos</em> on line number <em>start_line</em> with a + beginning fold level of <em>start_level</em> in the buffer.</p> + + <p>Fields:</p> + + <ul> + <li><code>lexer</code>: The lexer to fold text with.</li> + <li><code>text</code>: The text in the buffer to fold.</li> + <li><code>start_pos</code>: The position in the buffer <em>text</em> starts at, starting at + zero.</li> + <li><code>start_line</code>: The line number <em>text</em> starts on.</li> + <li><code>start_level</code>: The fold level <em>text</em> starts on.</li> + </ul> + + + <p>Return:</p> + + <ul> + <li>table of fold levels associated with line numbers.</li> + </ul> + + + <p><a id="lexer.fold_line_comments"></a></p> + + <h3><code>lexer.fold_line_comments</code> (prefix)</h3> + + <p>Returns a fold function (to be passed to <code>lexer.add_fold_point()</code>) that folds + consecutive line comments that start with string <em>prefix</em>.</p> + + <p>Fields:</p> + + <ul> + <li><code>prefix</code>: The prefix string defining a line comment.</li> + </ul> + + + <p>Usage:</p> + + <ul> + <li><code>lex:add_fold_point(lexer.COMMENT, '--', + lexer.fold_line_comments('--'))</code></li> + <li><code>lex:add_fold_point(lexer.COMMENT, '//', + lexer.fold_line_comments('//'))</code></li> + </ul> + + + <p><a id="lexer.get_rule"></a></p> + + <h3><code>lexer.get_rule</code> (lexer, id)</h3> + + <p>Returns the rule identified by string <em>id</em>.</p> + + <p>Fields:</p> + + <ul> + <li><code>lexer</code>: The lexer to fetch a rule from.</li> + <li><code>id</code>: The id of the rule to fetch.</li> + </ul> + + + <p>Return:</p> + + <ul> + <li>pattern</li> + </ul> + + + <p><a id="lexer.last_char_includes"></a></p> + + <h3><code>lexer.last_char_includes</code> (s)</h3> + + <p>Creates and returns a pattern that verifies that string set <em>s</em> contains the + first non-whitespace character behind the current match position.</p> + + <p>Fields:</p> + + <ul> + <li><code>s</code>: String 
character set like one passed to <code>lpeg.S()</code>.</li> + </ul> + + + <p>Usage:</p> + + <ul> + <li><code>local regex = lexer.last_char_includes('+-*!%^&|=,([{') * + lexer.delimited_range('/')</code></li> + </ul> + + + <p>Return:</p> + + <ul> + <li>pattern</li> + </ul> + + + <p><a id="lexer.lex"></a></p> + + <h3><code>lexer.lex</code> (lexer, text, init_style)</h3> + + <p>Lexes a chunk of text <em>text</em> (that has an initial style number of + <em>init_style</em>) using lexer <em>lexer</em>, returning a table of token names and + positions.</p> + + <p>Fields:</p> + + <ul> + <li><code>lexer</code>: The lexer to lex text with.</li> + <li><code>text</code>: The text in the buffer to lex.</li> + <li><code>init_style</code>: The current style. Multiple-language lexers use this to + determine which language to start lexing in.</li> + </ul> + + + <p>Return:</p> + + <ul> + <li>table of token names and positions.</li> + </ul> + + + <p><a id="lexer.line_from_position"></a></p> + + <h3><code>lexer.line_from_position</code> (pos)</h3> + + <p>Returns the line number of the line that contains position <em>pos</em>, which + starts from 1.</p> + + <p>Fields:</p> + + <ul> + <li><code>pos</code>: The position to get the line number of.</li> + </ul> + + + <p>Return:</p> + + <ul> + <li>number</li> + </ul> + + + <p><a id="lexer.load"></a></p> + + <h3><code>lexer.load</code> (name, alt_name, cache)</h3> + + <p>Initializes or loads and returns the lexer of string name <em>name</em>. + Scintilla calls this function in order to load a lexer. Parent lexers also + call this function in order to load child lexers and vice-versa. The user + calls this function in order to load a lexer when using this module as a Lua + library.</p> + + <p>Fields:</p> + + <ul> + <li><code>name</code>: The name of the lexing language.</li> + <li><code>alt_name</code>: The alternate name of the lexing language. 
This is useful for + embedding the same child lexer with multiple sets of start and end tokens.</li> + <li><code>cache</code>: Flag indicating whether or not to load lexers from the cache. + This should only be <code>true</code> when initially loading a lexer (e.g. not from + within another lexer for embedding purposes). + The default value is <code>false</code>.</li> + </ul> + + + <p>Return:</p> + + <ul> + <li>lexer object</li> + </ul> + + + <p><a id="lexer.modify_rule"></a></p> + + <h3><code>lexer.modify_rule</code> (lexer, id, rule)</h3> + + <p>Replaces in lexer <em>lexer</em> the existing rule identified by string <em>id</em> with + pattern <em>rule</em>.</p> + + <p>Fields:</p> + + <ul> + <li><code>lexer</code>: The lexer to modify.</li> + <li><code>id</code>: The id associated with this rule.</li> + <li><code>rule</code>: The LPeg pattern of the rule.</li> + </ul> + + + <p><a id="lexer.nested_pair"></a></p> + + <h3><code>lexer.nested_pair</code> (start_chars, end_chars)</h3> + + <p>Returns a pattern that matches a balanced range of text that starts with + string <em>start_chars</em> and ends with string <em>end_chars</em>. 
+ With single-character delimiters, this function is identical to + <code>delimited_range(start_chars..end_chars, false, true, true)</code>.</p> + + <p>Fields:</p> + + <ul> + <li><code>start_chars</code>: The string starting a nested sequence.</li> + <li><code>end_chars</code>: The string ending a nested sequence.</li> + </ul> + + + <p>Usage:</p> + + <ul> + <li><code>local nested_comment = lexer.nested_pair('/*', '*/')</code></li> + </ul> + + + <p>Return:</p> + + <ul> + <li>pattern</li> + </ul> + + + <p>See also:</p> + + <ul> + <li><a href="#lexer.delimited_range"><code>lexer.delimited_range</code></a></li> + </ul> + + + <p><a id="lexer.new"></a></p> + + <h3><code>lexer.new</code> (name, opts)</h3> + + <p>Creates and returns a new lexer with the given name.</p> + + <p>Fields:</p> + + <ul> + <li><code>name</code>: The lexer's name.</li> + <li><code>opts</code>: Table of lexer options. Options currently supported: + + <ul> + <li><code>lex_by_line</code>: Whether or not the lexer only processes whole lines of + text (instead of arbitrary chunks of text) at a time. + Line lexers cannot look ahead to subsequent lines. + The default value is <code>false</code>.</li> + <li><code>fold_by_indentation</code>: Whether or not the lexer does not define any fold + points and fold points should instead be calculated based on changes in line + indentation. + The default value is <code>false</code>.</li> + <li><code>case_insensitive_fold_points</code>: Whether or not fold points added via + <code>lexer.add_fold_point()</code> ignore case. + The default value is <code>false</code>.</li> + <li><code>inherit</code>: Lexer to inherit from. 
+ The default value is <code>nil</code>.</li> + </ul> + </li> + </ul> + + + <p>Usage:</p> + + <ul> + <li><code>lexer.new('rhtml', {inherit = lexer.load('html')})</code></li> + </ul> + + + <p><a id="lexer.starts_line"></a></p> + + <h3><code>lexer.starts_line</code> (patt)</h3> + + <p>Creates and returns a pattern that matches pattern <em>patt</em> only at the + beginning of a line.</p> + + <p>Fields:</p> + + <ul> + <li><code>patt</code>: The LPeg pattern to match on the beginning of a line.</li> + </ul> + + + <p>Usage:</p> + + <ul> + <li><code>local preproc = token(lexer.PREPROCESSOR, lexer.starts_line('#') * + lexer.nonnewline^0)</code></li> + </ul> + + + <p>Return:</p> + + <ul> + <li>pattern</li> + </ul> + + + <p><a id="lexer.token"></a></p> + + <h3><code>lexer.token</code> (name, patt)</h3> + + <p>Creates and returns a token pattern with token name <em>name</em> and pattern + <em>patt</em>. + If <em>name</em> is not a predefined token name, its style must be defined via + <code>lexer.add_style()</code>.</p> + + <p>Fields:</p> + + <ul> + <li><code>name</code>: The name of the token. If this name is not a predefined token name, + then a style needs to be associated with it via <code>lexer.add_style()</code>.</li> + <li><code>patt</code>: The LPeg pattern associated with the token.</li> + </ul> + + + <p>Usage:</p> + + <ul> + <li><code>local ws = token(lexer.WHITESPACE, lexer.space^1)</code></li> + <li><code>local annotation = token('annotation', '@' * lexer.word)</code></li> + </ul> + + + <p>Return:</p> + + <ul> + <li>pattern</li> + </ul> + + + <p><a id="lexer.word_match"></a></p> + + <h3><code>lexer.word_match</code> (words, case_insensitive, word_chars)</h3> + + <p>Creates and returns a pattern that matches any single word in string <em>words</em>. + <em>case_insensitive</em> indicates whether or not to ignore case when matching + words. + This is a convenience function for simplifying a set of ordered choice word + patterns. 
+ If <em>words</em> is a multi-line string, it may contain Lua line comments (<code>--</code>) + that will ultimately be ignored.</p> + + <p>Fields:</p> + + <ul> + <li><code>words</code>: A string list of words separated by spaces.</li> + <li><code>case_insensitive</code>: Optional boolean flag indicating whether or not the + word match is case-insensitive. The default value is <code>false</code>.</li> + <li><code>word_chars</code>: Unused legacy parameter.</li> + </ul> + + + <p>Usage:</p> + + <ul> + <li><code>local keyword = token(lexer.KEYWORD, word_match[[foo bar baz]])</code></li> + <li><code>local keyword = token(lexer.KEYWORD, word_match([[foo-bar foo-baz + bar-foo bar-baz baz-foo baz-bar]], true))</code></li> + </ul> + + + <p>Return:</p> + + <ul> + <li>pattern</li> + </ul> + + <h2 id="LexerList">Supported Languages</h2> + + <p>Scintilla has Lua lexers for all of the languages below. Languages + denoted by a <code>*</code> have native + <a href="#lexer.Code.Folding">folders</a>. For languages without + native folding support, folding based on indentation can be used if + <code>fold.by.indentation</code> is enabled.</p> + + <ol> + <li>Actionscript<code>*</code></li> + <li>Ada</li> + <li>ANTLR<code>*</code></li> + <li>APDL<code>*</code></li> + <li>APL</li> + <li>Applescript</li> + <li>ASM<code>*</code> (NASM)</li> + <li>ASP<code>*</code></li> + <li>AutoIt</li> + <li>AWK<code>*</code></li> + <li>Batch<code>*</code></li> + <li>BibTeX<code>*</code></li> + <li>Boo</li> + <li>C<code>*</code></li> + <li>C++<code>*</code></li> + <li>C#<code>*</code></li> + <li>ChucK</li> + <li>CMake<code>*</code></li> + <li>Coffeescript</li> + <li>ConTeXt<code>*</code></li> + <li>CSS<code>*</code></li> + <li>CUDA<code>*</code></li> + <li>D<code>*</code></li> + <li>Dart<code>*</code></li> + <li>Desktop Entry</li> + <li>Diff</li> + <li>Django<code>*</code></li> + <li>Dockerfile</li> + <li>Dot<code>*</code></li> + <li>Eiffel<code>*</code></li> + <li>Elixir</li> + 
<li>Erlang<code>*</code></li> + <li>F#</li> + <li>Faust</li> + <li>Fish<code>*</code></li> + <li>Forth</li> + <li>Fortran</li> + <li>GAP<code>*</code></li> + <li>gettext</li> + <li>Gherkin</li> + <li>GLSL<code>*</code></li> + <li>Gnuplot</li> + <li>Go<code>*</code></li> + <li>Groovy<code>*</code></li> + <li>Gtkrc<code>*</code></li> + <li>Haskell</li> + <li>HTML<code>*</code></li> + <li>Icon<code>*</code></li> + <li>IDL</li> + <li>Inform</li> + <li>ini</li> + <li>Io<code>*</code></li> + <li>Java<code>*</code></li> + <li>Javascript<code>*</code></li> + <li>JSON<code>*</code></li> + <li>JSP<code>*</code></li> + <li>LaTeX<code>*</code></li> + <li>Ledger</li> + <li>LESS<code>*</code></li> + <li>LilyPond</li> + <li>Lisp<code>*</code></li> + <li>Literate Coffeescript</li> + <li>Logtalk</li> + <li>Lua<code>*</code></li> + <li>Makefile</li> + <li>Man Page</li> + <li>Markdown</li> + <li>MATLAB<code>*</code></li> + <li>MoonScript</li> + <li>Myrddin</li> + <li>Nemerle<code>*</code></li> + <li>Nim</li> + <li>NSIS</li> + <li>Objective-C<code>*</code></li> + <li>OCaml</li> + <li>Pascal</li> + <li>Perl<code>*</code></li> + <li>PHP<code>*</code></li> + <li>PICO-8<code>*</code></li> + <li>Pike<code>*</code></li> + <li>PKGBUILD<code>*</code></li> + <li>Postscript</li> + <li>PowerShell<code>*</code></li> + <li>Prolog</li> + <li>Properties</li> + <li>Pure</li> + <li>Python</li> + <li>R</li> + <li>rc<code>*</code></li> + <li>REBOL<code>*</code></li> + <li>Rexx<code>*</code></li> + <li>ReStructuredText<code>*</code></li> + <li>RHTML<code>*</code></li> + <li>Ruby<code>*</code></li> + <li>Ruby on Rails<code>*</code></li> + <li>Rust<code>*</code></li> + <li>Sass<code>*</code></li> + <li>Scala<code>*</code></li> + <li>Scheme<code>*</code></li> + <li>Shell<code>*</code></li> + <li>Smalltalk<code>*</code></li> + <li>Standard ML</li> + <li>SNOBOL4</li> + <li>SQL</li> + <li>TaskPaper</li> + <li>Tcl<code>*</code></li> + <li>TeX<code>*</code></li> + <li>Texinfo<code>*</code></li> + <li>TOML</li> + 
<li>Vala<code>*</code></li> + <li>VBScript</li> + <li>vCard<code>*</code></li> + <li>Verilog<code>*</code></li> + <li>VHDL</li> + <li>Visual Basic</li> + <li>Windows Script File<code>*</code></li> + <li>XML<code>*</code></li> + <li>Xtend<code>*</code></li> + <li>YAML</li> + </ol> + + <h2>Code Contributors</h2> + + <ul> + <li>Alejandro Baez</li> + <li>Alex Saraci</li> + <li>Brian Schott</li> + <li>Carl Sturtivant</li> + <li>Chris Emerson</li> + <li>Christian Hesse</li> + <li>David B. Lamkins</li> + <li>Heck Fy</li> + <li>Jason Schindler</li> + <li>Jeff Stone</li> + <li>Joseph Eib</li> + <li>Joshua Krämer</li> + <li>Klaus Borges</li> + <li>Larry Hynes</li> + <li>M Rawash</li> + <li>Marc André Tanner</li> + <li>Markus F.X.J. Oberhumer</li> + <li>Martin Morawetz</li> + <li>Michael Forney</li> + <li>Michael T. Richter</li> + <li>Michel Martens</li> + <li>Murray Calavera</li> + <li>Neil Hodgson</li> + <li>Olivier Guibé</li> + <li>Peter Odding</li> + <li>Piotr Orzechowski</li> + <li>Richard Philips</li> + <li>Robert Gieseke</li> + <li>Roberto Ierusalimschy</li> + <li>S. Gilles</li> + <li>Stéphane Rivière</li> + <li>Tymur Gubayev</li> + <li>Wolfgang Seeberg</li> + </ul> + + </body> +</html> |