--[[ Mitchell's lexers/lexer.lua Copyright (c) 2006-2008 Mitchell Foral. All rights reserved. SciTE-tools homepage: http://caladbolg.net/scite.php Send email to: mitchellcaladbolgnet Permission to use, copy, modify, and distribute this file is granted, provided credit is given to Mitchell. ]]-- --- -- Performs lexxing of Scintilla documents. module('lexer', package.seeall) -- Usage: -- Writing a Dynamic Lexer (somewhat brief tutorial): -- This may seem like a daunting task, judging by the length of this document, -- but the process is actually fairly straight- forward. I just need to -- include every little detail in order for you to be able to utilize -- everything I have provided for your lexer development. -- -- In order to setup a dynamic lexer, create a lua script with your lexer's -- name as the filename followed by '.lua' in the /lexers/ directory. Then at -- the top of your lexer, the following must appear: -- module(..., package.seeall) -- Lexers are meant to be modules, not to be loaded in the global namespace. -- The ... parameter means this module assumes the name it is being 'require'd -- with. So doing -- require 'ruby' -- means the lexer will be the table 'ruby' in the global namespace. (Useful -- for a 'require'd lexer to check if another particular lexer has been -- loaded.) -- -- Now you'll need a way to style patterns of text. This is accomplished -- through tokens. -- -- Tokens: -- Each lexer is composed of a series of tokens. Each token contains a state -- identifier and an associated LPeg pattern. Generally the identifier -- should be prefixed or otherwise individualized in some way so as not to -- create conflicts with other lexer states if either your lexer is to be -- embedded in another, or another lexer is to be embedded in yours. You can -- create a token with a specified pattern by calling the 'token' function. -- e.g. local comment = token('comment', comment_pattern) -- local variable = token('my_variable', var_pattern) -- Note that 'comment' is part of the default Types and Styles, so it will -- be colored with the same style as default comments. If you wish for your -- comments to be different, you should create a token with a unique id and -- add_style() your style in your 'LoadStyles' function (discussed later). -- -- What are the default Types and Styles? You can look in -- /lexers/lexer.lua's DefaultTypesAndStyles function. Each lexer has a -- Types and Styles table. They initially contain types and styles that are -- common to nearly every lexer, saving you the trouble of creating the -- same states and styles for every lexer you write. You can of course -- redefine them in your own lexer if you wish, but they must be redefined -- in the LoadStyles function described later. -- -- So now you can create patterns and give them identifiers in a token. Next -- you need to create: the simple patterns that appear in most every lexer; -- styles; and colors. Rather than it being tedious, it has already been done -- for you and is available globally (from lexer.lua): -- Patterns: -- any - matches any single character -- ascii - matches any ASCII character (0..127) -- extend - matches any ASCII extended character (0..255) -- alpha - matches any alphabetic character (A-Z, a-z) -- digit - matches any digit (0-9) -- alnum - matches any alphanumeric character (A-Z, a-z, 0-9) -- lower - matches any lowercase character (a-z) -- upper - matches any uppercase character (A-Z) -- xdigit - matches any hexadecimal digit (0-9, A-F, a-f) -- cntrl - matches any control character (0..31) -- graph - matches any graphical character (! to ~) -- print - matches any printable character (space to ~) -- punct - matches any punctuation character not alphanumeric (! to /, -- : to @, [ to ', { to ~) -- space - matches any whitespace character (\t, \v, \f, \n, \r, space) -- newline - matches any newline characters -- nonnewline - matches any non-newline character -- nonnewline_esc - matches any non-newline character excluding newlines -- escaped with '\\' -- dec_num - matches a decimal number -- hex_num - matches a hexadecimal number -- oct_num - matches an octal number -- integer - matches a decimal, hexadecimal, or octal number -- float - matches a floating point number -- word = matches a typical word starting with a letter or underscore and -- then any alphanumeric or underscore characters -- any_char - token defined as token('default', any) -- Colors via the 'colors' table: -- red, yellow, green, blue, teal, white, black -- Styles: -- style_nothing, style_char, style_comment, -- style_definition, style_error, style_keyword, -- style_number, style_operator, style_string, -- style_preproc, style_tag, style_identifier -- Note: colors and styles are identical to those defined in -- SciTEGlobal.properties. -- -- Okay, so at this time you're probably thinking about keywords and keyword -- lists that were provided in SciTE properties files because you surely will -- want to style those! Unfortunately there is no way to read those keywords, -- but there are a couple functions that will make your life easier. Rather -- than creating a lpeg.P('keyword1') + lpeg.P('keyword2') + ... pattern for -- keywords, you can use a combination of the 'word_list' and 'word_match' -- functions. -- word_list(words) -- Creates a word hash from a given table of [string] words. -- e.g. local keywords = word_list{ 'foo', 'bar', 'baz' } -- word_match(word_list[, chars, case_insensitive]) -- Creates an LPeg pattern that checks to see if the current word is in -- word_list. -- e.g. local keyword = word_match(keywords) -- where keywords is defined in the previous example. -- Optional second parameter chars is a string of characters that count as -- word characters. Default word characters are alpha-numeric or an -- underscore (_). In HTML and CSS for example, the hyphen (-) is -- considered a word character, so '-' would be the value of the second -- argument. -- Optional third [boolen] parameter is whether or not words are matched -- case insensitively. -- These functions make sense to have because the maximum pattern size for a -- lexer is SHRT_MAX - 10, or generally 32757 elements. If an lpeg.P was -- created for each keyword in a language, this number would probably come -- into effect -- especially for embedded languages. Also, it would be SLOW to -- have a pattern for every keyword. 'word_match' gets the identifier once and -- checks if it exists in word_list using a hash, which is very fast. -- -- When you were creating your tokens, you gave them identifer states. For the -- identifier states that aren't part of the default Types and Styles, styles -- will need to be defined for them. For this, a 'style' function is -- available. It's only parameter is a table which can contain the following -- fields: -- font - font name (string) -- size - font size (integer) -- bold - bold font (boolean) -- italic - italic font (boolean) -- underline - underline text (boolean) -- fore - text foreground color (integer)* -- back - text background color (integer)* -- eolfilled - use background color for entire line, not stopping at a -- newline character (boolean) -- characterset - ? -- case - the default text case; 0 for normal case, 1 for uppercase, 2 for -- lowercase (integer) -- visible - text is visible or not (boolean) -- changeable - text is changeable or not (boolean) -- hotspot - text is hotspot or not (boolean) -- --- -- * Use the 'color' function to create appropriate integer values from hex -- colors (#RRGGBB). Arguments are red, green, blue hexadecimal values as -- STRINGS. e.g. red = color('FF', '00', '00') -- Styles can be simple, like: -- style_bold = style { bold = true } -- or they can be composed of existing styles with added -- style_bold_italic = style_bold..{ italic = true } -- or modified fields -- style_normal = style_bold..{ bold = false } -- Note in both cases that style_bold is left unchanged. -- -- Now that you have styles defined for you identifiers, it's time to add them -- to Scintilla. This is done in a global LoadStyles function. LoadStyles is -- called when the lexer has been initialized and Scintilla is ready to setup -- the lexer's styles. The 'add_style' function provides a way to easily -- define your styles. The first parameter is your token identifier, and the -- second is the style you created for it. For example: -- function LoadStyles() -- add_style('variable', style_variable) -- add_style('function', style_function) -- end -- 'add_style' returns the style number of the identifier added. This is -- useful for associating a particular style with the number returned by the -- function GetStyleAt (see below) or SciTE's editor.StyleAt. -- -- Finally! All your tokens have been created. All that is left to do is add -- them to your lexer. This is done in a global LoadTokens function. -- LoadTokens is called when the lexer has been initialized and the lexer -- library is ready to create the LPeg table capture that will lexx any input -- given. The 'add_token' function provides a way to easily define your -- tokens. The first parameter is your lexer, the second is your token -- identifier, and the third is the pattern returned by the 'token' function. -- For example: -- function LoadTokens() -- add_token(mylexer, 'comment', comment) -- add_token(mylexer, 'variable', variable) -- end -- where comment and variable have been defined in an above example as the -- returns of calls to 'token'. -- Keep in mind order matters. If the match to the first token added fails, -- the next token is tried, then the next, etc. If you want one token to match -- before another, move it's declaration before the latter's. Not having -- tokens in proper order can be tricky to debug if something goes wrong. -- Ah, you have all your tokens added, but what if some input does not match? -- This is where a global 'any_char' variable comes in. It is defined as -- any_char = token('default', lpeg.P(1)) -- so that any pattern you hadn't accounted for is styled (one character -- only). You can of course override any_char to display something you can -- recognize if you are debugging your lexer or you count unmatched patterns -- as syntax errors. Now: -- add_token(mylexer, 'any_char', any_char) -- 'add_token' adds your identifier and pattern to a TokenPatterns table. This -- table is available to any other lexer as a means of accessing or modifying -- your lexer's tokens. This is especially useful for embedded lexer -- functionality. See the supplemental section Writing a lexer that will embed -- in another lexer for more details. -- -- The only thing left to do at this point is to lex the document with your -- LPeg tokens. If your approach is to lex the entire document (not line-by- -- line), you're done! /lexers/lexer.lua realizes this is what you intend and -- does it automatically for you. If you wanted to have a line-by-line lexer -- instead of one that lexxes the entire document at once, set a global -- LexByLine variable to true and you're finished. You can lex your own way if -- you'd like by creating a global Lex function that returns a table whose -- indices contain style numbers and positions to style to. The LPeg table -- capture for a lexer is defined as Tokens and the pattern to match a single -- token is defined as Token. -- -- Because you have your styles and colors defined in the lexer itself, you -- may be wondering if your SciTE properties files can still be used. The -- answer is absolutely! All styling information is ignored though. -- -- Optional -- Code Folding: -- It is sometimes convenient to "fold", or not show blocks of code when -- editing, whether they be functions, classes, comments, etc. The basic -- idea behind implementing a folder is to iterate line by line through the -- document, assigning a fold level to each line. Lines to be "hidden" have -- a higher fold level than lines that are the 'fold header's. This means -- that when you click the 'fold header', it folds all lines below that have -- a higher fold level than it. -- In order to implement a folder, define the following global function in -- your lexer: -- Fold(input, start_pos, start_line, start_level) -- Fold is called when Scintilla is ready to fold your document. -- Parameters are: input, which is the text to fold; start_pos, the -- current position in the buffer of the text (used for obtaining style -- information from the document); start_line, the line number the text -- starts at; start_level, the fold level of the text at start_line. -- -- The following Scintilla fold constants are also available (see -- Scintilla's documentation for more detail on what these flags mean): -- SC_FOLDLEVELBASE -- SC_FOLDLEVELWHITEFLAG -- SC_FOLDLEVELHEADERFLAG -- SC_FOLDLEVELBOXHEADERFLAG -- SC_FOLDLEVELBOXFOOTERFLAG -- SC_FOLDLEVELCONTRACTED -- SC_FOLDLEVELUNINDENT -- SC_FOLDLEVELNUMBERMASK -- An important one to remember is SC_FOLDLEVELBASE which is the value -- you'll add your fold levels to if you aren't using the previous line's -- fold level at all (e.g. folding by indent level). -- -- Now you'll want to iterate over each line, setting fold levels as well as -- keeping track of the line number you're on, the current position at the -- end of each line, and the fold level of the previous line. As an example: -- local current_pos, current_line = start_pos, start_line -- local prev_level = start_level -- for line, data in text:gmatch('((.-)\r?\n)') -- local current_level = prev_level -- if #data > 0 -- not an empty line -- local header -- -- code to determine if this will be a header level -- if header then -- -- header level flag -- current_level = bit.bor(prev_level, SC_FOLDLEVELHEADERFLAG) -- else -- -- code to determine fold level, and add (+) it to -- -- current_level -- current_level = current_level + ... -- end -- else -- -- empty line flag -- current_level = bit.bor(prev_level, SC_FOLDLEVELWHITEFLAG) -- end -- SetFoldLevel(current_line, current_level) -- -- -- keep track of necessary buffer information -- prev_level = current_level -- current_line = current_line + 1 -- current_pos = current_pos + #line -- end -- -- important: keep current flags on next line -- local flags_next = bit.band(GetFoldLevel(current_line), -- bit.bnot(SC_FOLDLEVELNUMBERMASK)) -- SetFoldLevel(current_line, bit.bor(prev_level, flags_next)) -- -- That last 'important' section, just copy and paste to the end of your -- Fold function. -- -- In order to get or set fold levels for a specific line, the following -- functions are provided: -- GetFoldLevel(line) -- Returns the fold level + SC_FOLDLEVELBASE of line. -- -- SetFoldLevel(line, level) -- Sets the fold level of line to level (remember to add -- SC_FOLDLEVELBASE to it if you haven't already). -- -- What is the 'bit.band' and 'bit.bor' stuff about? Well that's where -- bitlib comes in. 'bit' is a global table that contains binary operations. -- Briefly: -- bit.band(b1, b2) performs binary & between b1 and b2 -- bit.bor(b1, b2) performs binary | between b1 and b2 -- bit.bnot(b1) performs binary not of b1 -- ... -- -- There are additional Lua functions provided to help you fold your -- document: -- GetStyleAt(position) -- Returns the integer style at position. -- -- GetIndentAmount(line_number) -- Returns the indent amount of line_number (taking into account -- tabsize, tabs or spaces, etc.) -- -- Note: do not use GetProperty for getting fold options from a .properties -- file because SciTE needs to be compiled to forward those specific -- properties to Scintilla. Instead, provide options that can be set at the -- top of the lexer. -- -- There is a new 'fold.by.indentation' property where if the 'fold' -- property is set for a lexer, but there is no Fold function available, the -- document is folded by indentation. This is done in /lexers/lexer.lua and -- should serve as an example of folding in this manner. -- -- Congratulations! You have finished writing a dynamic lexer. Now you can -- either create a properties file for it (don't forget to 'import' it in your -- Global or User properties file), or elsewhere define the necessary -- file.patterns.[lexer_name]=[file_patterns] -- lexer.$(file.patterns.[lexer_name])=[lexer_name] -- in order for the lexer to be loaded automatically when a specific file type -- is opened. -- -- Supplementals: -- Writing a lexer that will have languages embedded in it: -- This is pretty easy. Nothing. That's right. If you've followed the -- rules for creating lexers, no further modifications are necessary. -- If you want to embed languages in the lexer by default: -- 1: Load the child lexer module by doing something like: -- local child = require('child_lexer') -- 2: Load the child lexer's styles in the LoadStyles function. -- e.g. child.LoadStyles() -- 3: Load the child lexer's tokens in the LoadTokens function. -- e.g. child.LoadTokens() -- 4: In the parent's LoadTokens function, use 'embed_language' as -- described below. -- The html.lua lexer is a good example. -- -- Writing a lexer that will embed in another lexer: -- 1: Load the parent lexer module that you will embed your child lexer -- into by doing something like: -- local parent = require('parent_lexer') -- 2: In the LoadTokens function, create start and end tokens for your -- child lexer. They are tokens that define the start and end of your -- embedded lexer respectively. For example, PHP requires a '' to end. Then modify your lexer's 'any_char' token -- (or equivalent, via the TokenPattern table) to a character that does -- not match the end_token. Finally, call the 'make_embeddable' -- function. It accepts 4 parameters: the language to embed, the parent -- language to embed in, the start_token, and the end_token. Here's an -- example: -- local start_token = foo -- local end_token = bar -- child.TokenPatterns.any_char = token('default', 1 - end_token) -- make_embeddable(child, parent, start_token, end_token) -- 3: Use the 'add_langage' function: -- embed_language(parent, child[, preproc=false]) -- parent is the parent lexer module, child is your lexer module, and -- preproc is an optional [boolean] argument that indicates whether -- this embedded language is a preprocessor language. A preprocessor -- language will have its tokens embedded in each of the parent -- language's embedded languages. (Note the SHRT_MAX limitation may -- come into effect.) -- 4: Load the parent lexer's styles in the LoadStyles function. -- e.g. parent.LoadStyles() -- 5: Load the parent lexer's tokens in the LoadTokens function. -- e.g. parent.LoadTokens() -- 6: If your embedded lexer is a preprocessor language, you may want to -- modify some of parent's tokens to embed your lexer in (i.e. strings). -- You can access them through the parent's TokenPatterns table. Then -- you must rebuild the parent's token patterns by calling -- 'rebuild_token' and 'rebuild_tokens' one after the other passing the -- parent lexer as the only parameter. For example: -- parent.TokenPatterns.string = string_with_embedded -- rebuild_token(parent) -- rebuild_tokens(parent) -- 6: If your child lexer, not the parent lexer, is being loaded, specify -- that you want the parent's tokens to be used for lexxing instead of -- child's. Set a global UseOtherTokens variable to be parent's tokens. -- e.g. UseOtherTokens = parent.Tokens -- The php.lua lexer is a good example. -- -- Optimization: -- Lexers can usually be optimized for speed by re-arranging tokens so -- that the most common tokens are recognized first. Be careful that by -- putting some tokens in front of others, the latter tokens may not be -- recognized because the former tokens may 'eat' them because they match -- first. -- -- Affects on SciTE-tools and SciTE-st Lua modules: -- Because most custom styles aren't fixed numbers, both scope-specific -- snippets and key commands need to be tweaked a bit. SCE_* scope constants -- are no longer available. Instead, named keys are scopes in that lexer. -- See /lexers/lexer.lua for default named scopes. Each individual lexer -- uses the 'add_style' function to add additional styles/scopes to it, so -- use the string argument passed as the scope's name. -- -- Additional Examples: -- See the lexers contained in /lexers/. -- Be sure to see /lexers/lexer.lua for more information too. -- -- When things aren't working: -- Lexers can be tricky to debug if you do not write them carefully. Errors -- are printed to STDOUT as well as any print() statements in the lexer -- itself. -- -- Limitations: -- Patterns can only be comprised of SHRT_MAX - 10 or generally 32757 -- elements. This should be suitable for most language lexers however. -- -- Performance: -- The lexer is not quite as efficient as Scintilla's built-in lexers because -- Scintilla uses a variable endStyled to keep track of the last position in -- the document where the syntax is most likely styled correctly so the entire -- document does not need to be lexxed each time styling is needed. This lexer -- does away with that and lexxes all of the text, but only styles the text -- Scintilla asks it to. This has to be done because of the nature of the -- pattern-based styling. -- -- Although the entire document must be lexxed each time, the operation is -- done in nearly O(1) time. A lot of regex-based syntax highlighting editors -- apply each rule to their documents one at a time, coloring pieces of text -- in chunks. This is not O(1) time because generally the entire document must -- be searched through again and again for each pattern. I personally think -- the slight sacrifice in performance is worth the phenominal amount of power -- the dynamic lexer gives to the user, not to mention how easy it is to write -- LPeg lexers. -- -- Disclaimer: -- Because of its dynamic nature, crashes could potentially occur because of -- malformed lexers. In the event that this happens, I CANNOT be liable for -- any damages such as loss of data. You are encouraged, however, to report -- the crash with any information that can produce it, or submit a patch to me -- that fixes the error. -- -- Acknowledgements: -- When Peter Odding posted his original Lua lexer to the Lua mailing list, it -- was just what I was looking for to start making the LPeg lexer I had been -- dreaming of since Roberto announced the library. Until I saw his code, I -- wasn't sure what the best way to go about implementing a lexer was -- at -- least one that Scintilla could utilize. I liked the way he tokenized -- patterns, because it was really easy for me to assign styles to them. I -- also learned much more about LPeg through his amazingly simple, but -- effective script. -- Path to lexers and libraries. if not WIN32 then package.path = lexer_home..'/?.lua' package.cpath = lexer_home..'/?.so' else package.path = lexer_home..'/?.lua' package.cpath = lexer_home..'/?.dll' end -- Available to lexers. _G.lpeg = require 'lpeg' _G.bit = require 'bit' local lpeg = lpeg local bit = bit --- -- Initializes the lexer language. -- Called by LexLPeg.cxx to initialize lexer. -- @param name The name of the lexxing language. function InitLexer(name) _G.Lexer = require(name or 'container') local Lexer = Lexer if Lexer then -- load styles Lexer.Types, Lexer.Styles = DefaultTypesAndStyles() if Lexer.LoadStyles then Lexer.LoadStyles() end -- load tokens if Lexer.LoadTokens then Lexer.LoadTokens() if not Lexer.UseOtherTokens then rebuild_token(Lexer) rebuild_tokens(Lexer) else Lexer.Tokens = Lexer.UseOtherTokens end end else error('Lexer '..name..'does not exist') end end --- -- Performs the lexing of the document, returning a table of tokens for styling -- by Scintilla. -- Called by LexLPeg.cxx to lex the document. -- If the lexer has a LexByLine flat set, the document is lexed one line at a -- time. If the lexer has a specific Lex function, it is used to lex the -- document. Otherwise, the entire document is lexed at once. -- @param text The text to lex. -- @return A table of tokens. lpeg.match returns a table of tokens. Each token -- contains a string identifier ('comment', 'string', etc.) and a position in -- the document the identifier applies to. function RunLexer(text) local tokens = {} local Lexer, StyleTo = Lexer, StyleTo -- Get the tokens in order to style the text. if not Lexer.LexByLine then -- lex whole document return Lexer.Lex and Lexer.Lex(text) or lpeg.match(Lexer.Tokens, text) else -- lex document line by line local function append(tokens, line_tokens, offset) for _, token in ipairs(line_tokens) do token[2] = token[2] + offset tokens[#tokens + 1] = token end end local offset = 0 local Tokens = Lexer.Tokens for line in text:gmatch('[^\n\r\f]*[\n\r\f]*') do if line then local line_tokens = lpeg.match(Tokens, line) if line_tokens then append(tokens, line_tokens, offset) end offset = offset + #line -- Use the default style to the end of the line if none was specified. if tokens[#tokens][2] ~= offset then tokens[#tokens + 1] = { 'default', offset + 1 } end end end return tokens end end --- -- Performs the folding of the document. -- Called by LexLPeg.cxx to fold the document. If the current Lexer has no Fold -- function, folding by indentation is performed unless forbidden by the -- 'fold.by.indentation' property. -- @param text The document text to fold. -- @param start_pos The position in the document text starts at. -- @param start_line The line number text starts on. -- @param start_level The fold level text starts on. function RunFolder(text, start_pos, start_line, start_level) local Lexer = Lexer if Lexer.Fold then Lexer.Fold(text) elseif GetProperty('fold.by.indentation', 1) == 1 and bit then local GetIndentAmount, GetFoldLevel, SetFoldLevel = GetIndentAmount, GetFoldLevel, SetFoldLevel local SC_FOLDLEVELBASE, SC_FOLDLEVELHEADERFLAG, SC_FOLDLEVELWHITEFLAG = SC_FOLDLEVELBASE, SC_FOLDLEVELHEADERFLAG, SC_FOLDLEVELWHITEFLAG -- Indentation based folding. local current_line = start_line local prev_level = start_level for indent, line in text:gmatch('([\t ]*)(.-)\r?\n') do if #line > 0 then local current_level = SC_FOLDLEVELBASE + GetIndentAmount(current_line) if current_level > prev_level then -- next level local level = bit.bor(prev_level, SC_FOLDLEVELHEADERFLAG) SetFoldLevel(current_line - 1, level) -- low indent SetFoldLevel(current_line, current_level) -- high indent elseif current_level < prev_level then -- prev level SetFoldLevel(current_line - 1, prev_level) -- high indent SetFoldLevel(current_line, current_level) -- low indent else -- same level SetFoldLevel(current_line, prev_level) end prev_level = current_level else local level = bit.bor(prev_level, SC_FOLDLEVELWHITEFLAG) SetFoldLevel(current_line, level) end current_line = current_line + 1 end -- Important: keep current flags on next line. local flags_next = bit.band( GetFoldLevel(current_line), bit.bnot(SC_FOLDLEVELNUMBERMASK) ) SetFoldLevel( current_line, bit.bor(prev_level, flags_next) ) end end -- The following are utility functions lexers will have access to. --- -- Creates an LPeg capture table index with the id and position of the capture. -- @param id The string identifier of patt. It is recommended to prefix id with -- something unique to your lexer if your lexer will be embedded in another -- one. -- @param patt The LPeg pattern associated with the identifier. -- @usage local comment = token('comment', comment_pattern) creates a token -- using the default 'comment' identifier. -- @usage local my_var = token('my_variable', variable_pattern) creates a token -- using a custom identifier. Don't forget to use the add_style function to -- associate this identifer with a style. -- @see add_style function token(id, patt) return lpeg.Ct( lpeg.Cc(id) * patt * lpeg.Cp() ) end --- -- Adds a token to a lexer's current ordered list of tokens. -- @param lexer The lexer adding the token to. -- @param id The string identifier of patt. -- @param token_patt The LPeg pattern (returned by the 'token' function) -- associated with the identifier. -- @param exclude Optional flag indicating whether or not to exclude this token -- from lexer.Token when rebuilding. This flag would be set to true when -- tokens are only meant to be accessible to other lexers in the -- lexer.TokenPatterns table. -- @param pos Optional index to insert this token in TokenOrder. -- @usage add_token(lexer, 'comment', comment_token) adds a 'comment' token to -- the current list of tokens in lexer used to build the Tokens pattern used -- to style input. The value of comment_token in this case would be the value -- returned by token('comment', comment_pattern) function add_token(lexer, id, token_patt, exclude, pos) if not lexer then error('add_token() given a nil lexer value') end if not lexer.TokenOrder then --- -- Ordered list of token identifiers for a specific lexer. -- Contains an ordered list (by numerical index) of token identifier strings. -- This is used in conjunction with TokenPatterns for building the Token and -- Tokens lexer variables. This table doesn't need to be modified manually, as -- calls to the 'add_token' function update this list appropriately. -- @class table -- @name TokenOrder -- @see rebuild_token -- @see rebuild_tokens lexer.TokenOrder = {} --- -- List of token identifiers with associated LPeg patterns for a specific -- lexer. -- It provides a public interface to this lexer's tokens by other lexers. This -- list is used in conjunction with TokenOrder and also doesn't need to be -- modified manually. -- @class table -- @name TokenPatterns -- @see rebuild_token -- @see rebuild_tokens lexer.TokenPatterns = {} end local order, patterns = lexer.TokenOrder, lexer.TokenPatterns if not exclude then if pos then table.insert(order, pos, id) else order[#order + 1] = id end end patterns[id] = token_patt end -- common patterns any = lpeg.P(1) ascii = lpeg.R('\000\127') extend = lpeg.R('\000\255') alpha = lpeg.R('AZ', 'az') digit = lpeg.R('09') alnum = lpeg.R('AZ', 'az', '09') lower = lpeg.R('az') upper = lpeg.R('AZ') xdigit = lpeg.R('09', 'AF', 'af') cntrl = lpeg.R('\000\031') graph = lpeg.R('!~') print = lpeg.R(' ~') punct = lpeg.R('!/', ':@', '[\'', '{~') space = lpeg.S('\t\v\f\n\r ') newline = lpeg.S('\r\n\f')^1 nonnewline = 1 - newline nonnewline_esc = 1 - (newline + '\\') + '\\' * any dec_num = digit^1 hex_num = '0' * lpeg.S('xX') * xdigit^1 oct_num = '0' * lpeg.R('07')^1 integer = lpeg.S('+-')^-1 * (hex_num + oct_num + dec_num) float = lpeg.S('+-')^-1 * (digit^0 * '.' * digit^1 + digit^1 * '.' * digit^0 + digit^1) * lpeg.S('eE') * lpeg.S('+-')^-1 * digit^1 word = (alpha + '_') * (alnum + '_')^0 -- common tokens any_char = token('default', any) --- -- Creates an LPeg pattern that matches a range of characters delimitted by a -- specific character(s). -- This can be used to match a string, parenthesis, etc. -- @param chars The character(s) that bound the matched range. -- @param escape Optional escape character. This parameter may be omitted, nil, -- or the empty string. -- @param end_optional Optional flag indicating whether or not an ending -- delimiter is optional or not. If true, the range begun by the start -- delimiter matches until an end delimiter or the end of the input is -- reached. This is useful for finding unmatched delimiters. -- @param balanced Optional flag indicating whether or not that a a balanced -- range is matched. This flag only applies if 'chars' consists of two -- different characters, like parenthesis for example. Any character -- indicating the start of a range requires its end complement. When the -- complement of the first range-start character is found, the match ends. -- @param forbidden Optional string of characters forbidden in a delimited -- range. Each character is part of the set. -- @usage local sq_str = delimited_range("'", '\\') creates a pattern that -- matches a region bounded by "'" characters, but "\'" is not interpreted as -- a region's end. (It is escaped.) -- @usage local paren = delimited_range('()') creates a pattern that matches a -- region contained in parenthesis with no escape character. Note that this -- does not match a balanced pattern; it interprets the first ')' as the -- region's end. -- @usage local paren = delimited_range('()', '\\', true) creates a pattern -- that matches a region contained in balanced parenthesis with an escape -- character. So sequences like '\)' are not interpreted as the end of a -- balanced range. function delimited_range(chars, escape, end_optional, balanced, forbidden) local s = chars:sub(1, 1) local e = #chars == 2 and chars:sub(2, 2) or s local range local b = balanced and s or '' local f = forbidden or '' if not escape or escape == '' then local invalid = lpeg.S(e..f..b) range = any - invalid else local invalid = lpeg.S(e..f..escape..b) range = any - invalid + escape * any end if balanced and s ~= e then return lpeg.P{ s * (range + lpeg.V(1))^0 * e } else if end_optional then e = lpeg.P(e)^-1 end return s * range^0 * e end end --- -- Creates an LPeg pattern that matches a range of characters delimitted by a -- specific character(s) with an embedded pattern. -- This is useful for embedding additional lexers inside strings for example. -- @param chars The character(s) that bound the matched range. -- @param escape Escape character. If there isn't one, nil or the empty string -- should be passed. -- @param id Specifies the identifier used to create tokens that match -- everything but patt. -- @param patt Pattern embedded in the range. -- @param forbidden Optional string of characters forbidden in a delimited -- range. Each character is part of the set. -- @usage local sq_str = delimited_range_with_embedded("'", '\\', 'string', -- emb_language) creates a pattern that matches a region bounded by "'" -- characters. Any contents in the region that do not match emb_language are -- styled as the default 'string' identifier, and any contents matching -- emb_language are styled appropriately as the tokens in emb_language -- indicate. Basically, emb_language is embedded inside a single quoted -- string and styled correctly. function delimited_range_with_embedded(chars, escape, id, patt, forbidden) local s = chars:sub(1, 1) local e = #chars == 2 and chars:sub(2, 2) or s local range, invalid, valid local f = forbidden or '' if not escape or escape == '' then invalid = patt + lpeg.S(e..f) valid = token(id, (any - invalid)^1) else invalid = patt + lpeg.S(e..f..escape) valid = token(id, (any - invalid + escape * any)^1) end range = lpeg.P { (patt + valid * lpeg.V(1))^0 } return token(id, s) * range^-1 * token(id, e) end --- -- Creates an LPeg pattern from a given pattern that matches the beginning of a -- line and returns it. -- @param patt The LPeg pattern to match at the beginning of a line. -- @usage local preproc = starts_line(lpeg.P('#') * alpha^1) creates a pattern -- that matches a C preprocessor directive such as '#include'. function starts_line(patt) return lpeg.P(function(input, idx) if idx == 1 then return idx end local char = input:sub(idx - 1, idx - 1) if char == '\n' or char == '\r' or char == '\f' then return idx end end) * patt end --- -- Creates an LPeg pattern that matches a range of characters delimitted by a -- set of nested delimitters. -- Use this function for multi-character delimitters, delimited_range otherwise -- with balance set to 'true'. -- This is useful for languages with tokens such as nested block comments. -- @param start_chars The string starting delimiter character sequence. -- @param end_chars The string ending delimiter character sequence. -- @param end_optional Optional flag indicating whether or not an ending -- delimiter is optional or not. If true, the range begun by the start -- delimiter matches until an end delimiter or the end of the input is -- reached. This is useful for finding unmatched delimiters. -- @usage local nested_comment = nested_pair('/*', '*/', true) creates a pattern -- that matches a region contained in a nested set of C-style block comments. function nested_pair(start_chars, end_chars, end_optional) local s, e = start_chars, end_optional and lpeg.P(end_chars)^-1 or end_chars return lpeg.P{ s * (any - s - end_chars + lpeg.V(1))^0 * e } end --- -- Creates a Scintilla color. -- @param r The red component of the hexadecimal color [string]. -- @param g The green component of the color [string]. -- @param b The blue component of the color [string]. -- @usage local red = color('FF', '00', '00') creates a Scintilla color based -- on the hexadecimal representation of red. function color(r, g, b) return tonumber(b..g..r, 16) end --- -- Creates a Scintilla style from a table of style properties. -- @param style_table A table of style properties. -- Style properties available: -- font = [string] -- size = [integer] -- bold = [boolean] -- italic = [boolean] -- underline = [boolean] -- fore = [integer]* -- back = [integer]* -- eolfilled = [boolean] -- characterset = ? -- case = [integer] -- visible = [boolean] -- changeable = [boolean] -- hotspot = [boolean] -- * Use the value returned by the color function. -- @usage local bold_italic = style { bold = true, italic = true } -- @see color function style(style_table) setmetatable(style_table, { __concat = function(t1, t2) local t = {} -- duplicate t1 so t1 is unmodified for k,v in pairs(t1) do t[k] = v end for k,v in pairs(t2) do t[k] = v end return t end }) return style_table end local ret, errmsg if color_theme and color_theme ~= '' then ret, errmsg = pcall(dofile, lexer_home..'/themes/'..color_theme..'.lua') if not ret and errmsg then _G.print(errmsg) end end if not color_theme or not ret then --- -- Dark theme initial colors. -- @class table -- @name colors -- @field green The color green. -- @field blue The color blue. -- @field red The color red. -- @field yellow The color yellow. -- @field teal The color teal. -- @field white The color white. -- @field black The color black. -- @field grey The color grey. -- @field purple The color purple. -- @field orange The color orange. -- @see color colors = { green = color('4D', '99', '4D'), blue = color('40', '80', 'C0'), red = color('99', '4C', '4C'), yellow = color('99', '99', '4D'), teal = color('4D', '99', '99'), white = color('AA', 'AA', 'AA'), black = color('33', '33', '33'), grey = color('99', '99', '99'), purple = color('99', '4D', '99'), orange = color('C0', '80', '40'), } -- Useful styles. style_nothing = style { } style_char = style { fore = colors.red, bold = true } style_class = style { fore = colors.white, underline = true } style_comment = style { fore = colors.blue, bold = true } style_constant = style { fore = colors.teal, bold = true } style_definition = style { fore = colors.red, bold = true } style_error = style { fore = colors.red, italic = true } style_function = style { fore = colors.white, bold = true } style_keyword = style { fore = colors.yellow, bold = true } style_number = style { fore = colors.teal } style_operator = style { fore = colors.white, bold = true } style_string = style { fore = colors.green, bold = true } style_preproc = style { fore = colors.blue } style_tag = style { fore = colors.teal, bold = true } style_type = style { fore = colors.green } style_variable = style { fore = colors.white, italic = true } style_identifier = style_nothing -- Default styles. style_default = style{ font = WIN32 and '!Courier New' or '!Bitstream Vera Sans Mono', size = 8, fore = colors.white, back = colors.black } style_line_number = style { fore = colors.black, back = colors.grey } style_bracelight = style { fore = color('66', '99', 'FF'), bold = true } style_bracebad = style { fore = color('FF', '66', '99'), bold = true } style_controlchar = style_nothing style_indentguide = style { fore = colors.grey, back = colors.white } style_calltip = style { fore = colors.white, back = color('44', '44', '44') } end --- -- Returns default Types and Styles common to most every lexer. -- Note this does not need to be called by the lexer. It is called for the -- lexer automatically when it is initialized. -- @return Types and Styles tables. function DefaultTypesAndStyles() --- -- [Local table] Default (initial) Types. -- Contains token identifiers and associated style numbers. -- @class table -- @name types -- @field whitespace The whitespace type (0). -- @field default The default type (1). -- @field comment The comment type (2). -- @field string The string type (3). -- @field number The number type (4). -- @field keyword The keyword type (5). -- @field identifier The identifier type (6). -- @field operator The operator type (7). -- @field error The error type (8). -- @field preprocessor The preprocessor type (9). -- @field constant The constant type (10). -- @field function The function type (11). -- @field class The class type (12). -- @field type The type type (13). local types = { whitespace = 0, default = 1, comment = 2, string = 3, number = 4, keyword = 5, identifier = 6, operator = 7, error = 8, preprocessor = 9, constant = 10, variable = 11, ['function'] = 12, class = 13, type = 14, } --- -- [Local table] Default (initial) Styles. -- Contains style numbers and associated styles. -- @class table -- @name styles local styles = { [0] = style_nothing, [1] = style_nothing, [2] = style_comment, [3] = style_string, [4] = style_number, [5] = style_keyword, [6] = style_identifier, [7] = style_operator, [8] = style_error, [9] = style_preproc, [10] = style_constant, [11] = style_variable, [12] = style_function, [13] = style_class, [14] = style_type, len = 15, -- Predefined styles. [32] = style_default, [33] = style_line_number, [34] = style_bracelight, [35] = style_bracebad, [36] = style_controlchar, [37] = style_indentguide, [38] = style_calltip, } return types, styles end --- -- Adds a new Scintilla style to Scintilla. -- @param id An identifier passed when creating a token. -- @param style A Scintilla style created from style(). -- @usage add_style('comment', my_comment_style) overrides the default style -- for tokens with default identifier 'comment' with a user-defined style. -- @usage add_style('my_variable', variable_style) adds a user-defined style -- for tokens with the identifier 'my_variable'. -- @see token -- @see style function add_style(id, style) local len = Lexer.Styles.len if len == 32 then len = len + 8 end -- skip predefined styles if len < 128 then Lexer.Types[id] = len Lexer.Styles[len] = style Lexer.Styles.len = len + 1 return len else _G.print('Too many styles defined (128 MAX)') end end --- -- Allows a child lexer to be embedded in a parent one. -- An appropriate entry in child.EmbeddedIn is created; then the -- 'embed_language' function can be called to embed the child lexer in the -- parent. -- @param child The child lexer language. -- @param parent The parent lexer language. -- @param start_token The token that signals the beginning of the embedded -- lexer. -- @param end_token The token that signals the end of the embedded lexer. function make_embeddable(child, parent, start_token, end_token) if not child.EmbeddedIn then child.EmbeddedIn = {} end child.EmbeddedIn[parent._NAME] = {} local elang = child.EmbeddedIn[parent._NAME] elang.start_token = start_token elang.token = rebuild_token(child) elang.end_token = end_token end --- -- Embeds a child lexer language in a parent one. -- The 'make_embeddable' function must be called first to prepare the child -- lexer for embedding in the parent. The child's tokens are placed before the -- parent's and maybe inside other embedded lexers depending on the preproc -- argument. -- @param parent The parent lexer language. -- @param child The child lexer language. -- @param preproc Boolean flag specifying if the child lexer is a preprocessor -- language. If so, its tokens are placed before all embedded lexers' tokens. -- @usage embed_language(parent_lang, child_lang) embeds child_lang inside -- parent_lang, keeping other embedded languages unmodified. -- @usage embed_language(parent_lang, child_lang, true) embeds child_lang -- inside parent_lang and all of its other embedded languages. -- @see make_embeddable function embed_language(parent, child, preproc) if not parent.Token then rebuild_token(parent) end local token = parent.Token if not parent.languages then parent.languages = {} parent.languages.preproc = {} end -- Iterate over all languages embedded in parent, putting all embedded -- languages' tokens before parent's tokens. for _, elanguage in ipairs(parent.languages) do local elang = elanguage.EmbeddedIn[parent._NAME] if preproc then -- Preprocessor language; add preproc's tokens before each embedded -- language's tokens. local plang = child.EmbeddedIn[parent._NAME] token = elang.start_token * ( plang.start_token * plang.token^0 * plang.end_token^-1 + elang.token )^0 * elang.end_token^-1 + token parent.languages.preproc[child._NAME] = child else token = elang.start_token * elang.token^0 * elang.end_token^-1 + token end end parent.languages[#parent.languages + 1] = child -- Now add child's tokens before everything else. local elang = child.EmbeddedIn[parent._NAME] local elang_token = elang.start_token * elang.token^0 * elang.end_token^-1 parent.Tokens = lpeg.Ct( (elang_token + token)^0 ) end --- -- (Re)constructs parent.Token. -- Creates the token pattern from parent.TokenOrder, an ordered list of tokens. -- Rebuilding is useful for modifying parent's tokens for embedded lexers. -- Generally calling 'rebuild_tokens' is also necessary after this. -- @param parent The parent lexer language. -- @return token pattern (for convenience), but parent.Token is still modified, -- so setting it manually is not necessary. -- @see rebuild_tokens function rebuild_token(parent) local patterns, order = parent.TokenPatterns, parent.TokenOrder if not (patterns or order) then error('No tokens found') end local token = patterns[ order[1] ] for i = 2, #parent.TokenOrder do if not patterns[ order[i] ] then error('One of your tokens is not a pattern') end token = token + patterns[ order[i] ] end parent.Token = token return token end --- -- (Re)constructs parent.Tokens. -- This is generally called after 'rebuild_token' in order to create the -- pattern used to lexx input. -- @param parent The parent lexer language. -- @see rebuild_token function rebuild_tokens(parent) local token = parent.Token if parent.languages then if #parent.languages.preproc > 0 then for _, planguage in ipairs(parent.languages.preproc) do local plang = planguage.EmbeddedIn[parent._NAME] local eplang = plang.start_token * plang.token^0 * plang.end_token^-1 -- Add preproc's tokens before tokens of all other embedded languages. for _, elanguage in ipairs(parent.languages) do if elanguage ~= planguage then local elang = elanguage.EmbeddedIn[parent._NAME] token = elang.start_token * (eplang + elang.token)^0 * elang.end_token^-1 + token end end token = eplang + token end else for _, elanguage in ipairs(parent.languages) do local elang = elanguage.EmbeddedIn[parent._NAME] token = elang.start_token * elang.token^0 * elang.end_token^-1 + token end end end parent.Tokens = lpeg.Ct(token^0) end --- -- Creates a table of given words for hash lookup. -- This is usually used in conjunction with word_match. -- @param word_table A table of words. -- @usage local keywords = word_list{ 'foo', 'bar', 'baz' } creates a pattern -- that matches words 'foo', 'bar', or 'baz'. -- @see word_match function word_list(word_table) local hash = {} for _, v in ipairs(word_table) do hash[v] = true end return hash end --- -- Creates an LPeg pattern function that checks to see if the current word is -- in word_list, returning the index of the end of the word. (Thus the pattern -- succeeds.) -- @param word_list A word list constructed from word_list. -- @param word_chars Optional string of additional characters considered to be -- part of a word. -- @param case_insensitive Optional boolean flag indicating whether the word -- match is case-insensitive. -- @usage local keyword = token('keyword', word_match(word_list{ 'foo', 'bar', -- 'baz' }, nil, true)) creates a token whose pattern matches any of the -- words 'foo', 'bar', or 'baz' case insensitively. -- @see word_list function word_match(word_list, word_chars, case_insensitive) local chars = '%w_' -- escape 'magic' characters -- TODO: append chars to the end so ^_ can be passed for not including '_'s if word_chars then chars = chars..word_chars:gsub('([%^%]%-])', '%%%1') end return lpeg.P(function(input, index) local s, e, word = input:find('^(['..chars..']+)', index) if word then if case_insensitive then word = word:lower() end return word_list[word] and e + 1 or nil end end) end setfenv(0, lexer)