From 8fe23071ec3f82bf2b602f2ba5edee0cf6bc6fa3 Mon Sep 17 00:00:00 2001
From: Neil
+ Scintilla contains lexers for various types of languages:
+
+ Scintilla
+
+
+ Language Types
+
+ Some languages can be used in different ways. JavaScript is a programming language but also
+ the basis of JSON data files. Similarly,
+ Lisp s-expressions can be used for both source code and data.
+
+ Each language type has common elements such as identifiers in programming languages.
+ These common elements should be identified so that languages can be displayed with common
+ styles for these elements.
+ Style tags are used for this purpose in Scintilla.
+
+ Every style has a list of tags where a tag is a lower-case word containing only the common ASCII letters 'a'-'z'
+ such as "comment" or "operator".
+
+ Tags are ordered from most important to least important.
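+
+ The tag rule above can be sketched as a small check. This is only an illustration: it assumes a tag list
+ is held as a space-separated string, as in the quoted examples in this document, and parse_tag_list is a
+ hypothetical helper name rather than anything provided by Scintilla.

```python
import re

# A tag is a lower-case word made only of the ASCII letters 'a'-'z'.
TAG_PATTERN = re.compile(r"[a-z]+")


def parse_tag_list(text):
    """Split a space-separated tag list (assumed format), rejecting malformed tags.

    The returned list keeps the original order: most important tag first.
    """
    tags = text.split()
    for tag in tags:
        if not TAG_PATTERN.fullmatch(tag):
            raise ValueError(f"invalid tag: {tag!r}")
    return tags


print(parse_tag_list("comment documentation keyword"))
```

+ A name such as "Here-Doc" would be rejected by this check because it contains upper-case letters and punctuation.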
+
+ While applications may assign visual attributes for tag lists in many different ways, one reasonable technique is to
+ apply tag-specific attributes in reverse order so that earlier and more important tags override less important tags.
+ For example, the tag list "error comment documentation keyword" with
+ a set of tag attributes
+ { comment=fore:green,back:very-light-green,font:Serif documentation=fore:light-green error=strikethrough keyword=bold }
+ could be rendered as
+ bold,fore:light-green,back:very-light-green,font:Serif,strikethrough.
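+
+ The reverse-order technique can be sketched as follows. This is an illustration under assumptions, not
+ application or Scintilla code: resolve_attributes is a hypothetical name, attributes are modelled as plain
+ dictionaries, and the attribute values are made up for the example.

```python
def resolve_attributes(tag_list, tag_attributes):
    """Merge per-tag attributes so that earlier (more important) tags win.

    Tags are visited from least to most important; each later update
    overwrites any attribute already set by a less important tag.
    """
    resolved = {}
    for tag in reversed(tag_list.split()):
        resolved.update(tag_attributes.get(tag, {}))
    return resolved


# Hypothetical attribute table, invented for this sketch.
attributes = {
    "comment": {"fore": "green"},
    "line": {"fore": "dark-green", "italic": True},
}

# 'comment' comes first in the tag list, so its foreground overrides
# the one from 'line', while 'italic' from 'line' is kept.
print(resolve_attributes("comment line", attributes))
```

+ Tags with no entry in the attribute table are simply skipped, so an application only needs to define
+ attributes for the tags it cares about.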
+
+ Alternative renderings could check for multi-tag combinations like
+ { comment.documentation=fore:light-green comment.line=dark-green comment=green }.
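+
+ One possible reading of the multi-tag approach is sketched below. The policy that the rule naming the most
+ tags wins is an assumption made for this sketch, and match_rule is a hypothetical name; the rule values are
+ copied from the example above.

```python
def match_rule(tag_list, rules):
    """Return the value of the most specific rule whose tags all occur in tag_list.

    A rule key such as 'comment.documentation' matches when every
    dot-separated part is present in the tag list. Rules naming more tags
    are treated as more specific (an assumed policy); ties keep the
    first rule encountered.
    """
    tags = set(tag_list.split())
    best_key, best_size = None, 0
    for key in rules:
        parts = key.split(".")
        if len(parts) > best_size and tags.issuperset(parts):
            best_key, best_size = key, len(parts)
    return rules[best_key] if best_key is not None else None


rules = {
    "comment.documentation": "fore:light-green",
    "comment.line": "dark-green",
    "comment": "green",
}

print(match_rule("comment documentation keyword", rules))
print(match_rule("comment", rules))
```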
+
+ Commonly, a tag list will contain an optional embedded language; optional statuses; a base type; and a set of type modifiers:
+ embedded-language? status* base-type modifiers*
+
+ The embedded language may be a source (client | server) followed by a language name
+ (javascript | php | python | basic).
+ This may be extended in the future with other programming languages and style-definition languages like CSS.
+
+ The statuses may be (error | unused | predefined | inactive).
+ The error status is used for lexical states that indicate errors in the source code such as unterminated quoted strings.
+ The unused status may indicate a gap in the lexical states, possibly because an old lexical class is no longer used or an upcoming lexical class may fill that position.
+ The predefined status indicates a style in the range 32..39 that is used for non-lexical purposes in Scintilla.
+ The inactive status is used for text that is not currently interpreted such as C++ code that is contained within a '#if 0' preprocessor block.
+
+ The basic types for programming languages are (default | operator | keyword | identifier | literal | comment | preprocessor | label).
+ The default type is commonly used for spaces and tabs between tokens although it may cover other characters in some languages.
+
+ Assembler languages add (instruction | register) to the basic types from programming languages.
+
+ The basic types for markup languages are (default | tag | attribute | comment | preprocessor).
+
+ The basic types for data languages are (default | key | data | comment).
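+
+ The common layout "embedded-language? status* base-type modifiers*" can be split mechanically. The sketch
+ below is an illustration only: StyleTags and split_tag_list are hypothetical names, the tag list is assumed
+ to be a space-separated string, and the first tag after any statuses is taken as the base type without
+ checking it against the base type lists above.

```python
from dataclasses import dataclass, field
from typing import List, Optional

SOURCES = {"client", "server"}
LANGUAGES = {"javascript", "php", "python", "basic"}
STATUSES = {"error", "unused", "predefined", "inactive"}


@dataclass
class StyleTags:
    source: Optional[str] = None
    language: Optional[str] = None
    statuses: List[str] = field(default_factory=list)
    base_type: Optional[str] = None
    modifiers: List[str] = field(default_factory=list)


def split_tag_list(tag_list):
    """Split a tag list into embedded language, statuses, base type and modifiers."""
    tags = tag_list.split()
    result = StyleTags()
    i = 0
    # Optional embedded language: a source followed by a language name.
    if len(tags) >= 2 and tags[0] in SOURCES and tags[1] in LANGUAGES:
        result.source, result.language = tags[0], tags[1]
        i = 2
    # Zero or more statuses.
    while i < len(tags) and tags[i] in STATUSES:
        result.statuses.append(tags[i])
        i += 1
    # The next tag, if any, is treated as the base type; the rest are modifiers.
    if i < len(tags):
        result.base_type = tags[i]
        result.modifiers = tags[i + 1:]
    return result


print(split_tag_list("client javascript inactive literal string"))
```

+ A tag list consisting only of a status, such as "predefined", leaves the base type empty in this sketch.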
+
+ Programming languages may differentiate between line and stream comments and treat documentation comments as distinct from other comments.
+ Documentation comments may be marked up with documentation keywords.
+ The additional attributes commonly used are (line | documentation | keyword | taskmarker).
+
+ Programming and assembler languages contain a rich set of literals including numbers like 7 and 3.89e23;
+ strings like "string\n"; and nullptr, and differentiating between these is often wanted.
+ The common literal types are (numeric | boolean | string | regex | date | time | uuid | nil | compound).
+ Numeric literal types are subdivided into (integer | real).
+ String literal types may add (perhaps multiple) further attributes from (heredoc | character | escapesequence | interpolated | multiline | raw).
+
+ An escape sequence within an interpolated heredoc may thus be literal string heredoc escapesequence.
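+
+ As a rough illustration of how an application might read such a tag list, the sketch below extracts the
+ literal type and its further attributes. describe_literal is a hypothetical name, and the space-separated
+ string form of the tag list is an assumption of this sketch.

```python
LITERAL_TYPES = {
    "numeric", "boolean", "string", "regex", "date",
    "time", "uuid", "nil", "compound",
}
NUMERIC_SUBTYPES = {"integer", "real"}
STRING_ATTRIBUTES = {
    "heredoc", "character", "escapesequence",
    "interpolated", "multiline", "raw",
}


def describe_literal(tag_list):
    """Return (literal type, further attributes), or None if not a literal."""
    tags = tag_list.split()
    if "literal" not in tags:
        return None
    rest = tags[tags.index("literal") + 1:]
    # First recognised literal type after the 'literal' base type, if any.
    kind = next((t for t in rest if t in LITERAL_TYPES), None)
    if kind == "numeric":
        extras = [t for t in rest if t in NUMERIC_SUBTYPES]
    elif kind == "string":
        extras = [t for t in rest if t in STRING_ATTRIBUTES]
    else:
        extras = []
    return kind, extras


print(describe_literal("literal string heredoc escapesequence"))
```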
+
attribute | Markup attribute |
basic | Embedded Basic |
boolean | True or false literal |
character | Single character literal as opposed to a string literal |
client | Script executed on client |
comment | The standard comment type in a language: may be stream or line |
compound | Literal containing multiple subliterals such as a tuple or complex number |
data | A value in a data file |
date | Literal representing a date such as '19/November/1975' |
default | Starting state commonly also used for white space |
documentation | Comment that can be extracted into documentation |
error | State indicating an invalid or erroneous element |
escapesequence | Parts of a string that are not literal such as '\t' for tab in C |
heredoc | Lengthy text literal marked by a word at both ends |
identifier | Name that identifies an object or class of object |
inactive | Code that is not currently interpreted |
instruction | Mnemonic in assembler languages like 'addc' |
integer | Numeric literal with no fraction or exponent like '738' |
interpolated | String that can contain expressions |
javascript | Embedded JavaScript |
key | Element which allows finding associated data |
keyword | Reserved word with special meaning like 'while' |
label | Destination for jumps in programming and assembler languages |
line | Differentiates between stream comments and line comments in languages that have both |
literal | Fixed value in source code |
multiline | Differentiates between single line and multiline elements, commonly strings |
nil | Literal for the null pointer such as nullptr in C++ or NULL in C |
numeric | Literal number like '16' |
operator | Punctuation character such as '&' or '[' |
php | Embedded PHP |
predefined | Style in the range 32..39 that is used for non-lexical purposes |
preprocessor | Element that is recognized in an early stage of translation |
python | Embedded Python |
raw | String type that avoids interpretation: may be used for regular expressions in languages without a specific regex type |
real | Numeric literal which may have a fraction or exponent like '3.84e-15' |
regex | Regular expression literal like '^[a-z]+' |
register | CPU register in assembler languages |
server | Script executed on server |
string | Sequence of characters |
tag | Markup tag like '<br />' |
taskmarker | Word in comment that marks future work like 'FIXME' |
time | Literal representing a time such as '9:34:31' |
unused | Style that is not currently used |
uuid | Universally unique identifier often used in interface definition files which may look like '{098f2470-bae0-11cd-b579-08002b30bfeb}' |
+ Each element in this scheme may be extended in the future. This may be done by revising this document to provide a common approach to new features.
+ Individual lexers may also choose to expose unique language features through new tags.
+
+ Tags could be exposed directly in user interfaces or configuration languages.
+ However, an application may also translate these to match its naming schema.
+ Capitalization and punctuation could be different (like Here-Doc instead of heredoc),
+ terminology changed ("constant" instead of "literal"),
+ or human language changed from English to Chinese or Spanish.
+
+ Starting from a common set of tags makes these modifications tractable.
+
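+ Such a translation can be as simple as a lookup table applied to each tag. The sketch below is illustrative
+ only: DISPLAY_NAMES and display_tags are hypothetical names, and the two mappings are the examples
+ mentioned above.

```python
# Hypothetical application-side naming schema: tag -> display name.
DISPLAY_NAMES = {
    "heredoc": "Here-Doc",
    "literal": "constant",
}


def display_tags(tag_list, names=DISPLAY_NAMES):
    """Translate each tag to the application's name, keeping unknown tags as-is."""
    return [names.get(tag, tag) for tag in tag_list.split()]


print(display_tags("literal string heredoc"))  # ['constant', 'string', 'Here-Doc']
```

+ A table for another human language would follow the same shape, with translated names as the values.
+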
+ The C++ lexer (for example) has inactive states and dynamically allocated substyles.
+ These should be exposed through the metadata mechanism but are not currently.