SciTE Pattern-based Lexxing
by Mitchell
email: mitchell {att} caladbolg {dott} net

Copyright (c) 2006-2007 Mitchell Foral. All rights reserved.

SciTE-tools homepage: http://caladbolg.net/scite.php
Send email to: mitchell {att} caladbolg {dott} net

Permission to use, copy, modify, and distribute this utility is granted,
provided credit is given to Mitchell.

Scintilla's default lexers are essentially C++ character iterators. They can be
hard to write, get complicated, and be messy to change or to add other
languages to, even when the syntax is similar. If you don't believe me, look at
LexHTML.cxx. How it even works I have no idea. Also, in order to have embedded
languages, say Ruby in HTML, you basically have to rewrite the Ruby lexer inside
the HTML one, upping the complexity. Note the HTML lexer already has its style
capacity full: 128 styles. There's no room for another language. This seems
like a waste of styles if you never use ASP or Python in HTML.

I wanted to write a dynamic lexer that used ONLY the styles the user wanted it
to, and I wanted individual lexers to be easy to write, modify, and understand.

Roberto's Lua LPeg library(1) came out recently and I thought it would be the
perfect solution for a dynamic lexer. By dynamic I mean that it can have
dynamic lexical states, dynamic patterns that can be styled, and can style
dynamically loaded [embedded] languages.

The lexer is not quite as efficient as Scintilla's built-in lexers, however,
for a couple of reasons:
  1: Scintilla uses a variable endStyled to keep track of the last position in
     the document where the syntax is most likely styled correctly, so the
     entire document does not need to be lexed each time styling is needed.
     This lexer does away with that and lexes all of the text, but only styles
     the text Scintilla asks it to. This has to be done because of the nature
     of the pattern-based styling.
  2: Although the entire document must be lexed each time, the operation is
     done in a single pass over the text.
     A lot of regex-based syntax highlighting editors apply each rule to their
     documents one at a time, coloring pieces of text in chunks. This is much
     slower than a single pass because generally the entire document must be
     searched through again and again, once for each pattern.

I personally think the slight sacrifice in performance is worth the phenomenal
amount of power the dynamic lexer gives to the user, not to mention how easy it
is to write LPeg lexers.

(1) http://www.inf.puc-rio.br/~roberto/lpeg.html

Requirements:
  - SciTE-st (http://caladbolg.net/scite_st.php)
  - Lua 5.1 (http://lua.org)

  Linux:
    Lua 5.1: Lua 5.1 must be installed on the host system where the shared
    objects and header files can be seen. (Typically /usr/lib and /usr/include
    respectively.)

  Windows:
    Lua 5.1: It's in the SciTE-st root directory.

  Now you must make sure of a few things:
    1: The location of /lexers/ is set properly in your SciTEGlobal.properties
       ('lexer.lua.home' property).
    2: The location of /lexers/lexer.lua is set properly in your
       SciTEGlobal.properties ('lexer.lua.script' property).
    3: PLATFORM is set properly in lexer.lua depending on your operating
       system (linux or windows).
    4: The package paths are set properly in lexer.lua.

Compiling:
  The compiling procedure is the same as a standard Scintilla and SciTE one.
  Remember to make sure the Lua shared object, library, and include files can
  be seen by the compiler/linker.

Writing a Dynamic Lexer (somewhat brief tutorial):
  This may seem like a daunting task, judging by the length of this document,
  but the process is actually fairly straightforward. I just need to include
  every little detail in order for you to be able to utilize everything I have
  provided for your lexer development.

  In order to set up a dynamic lexer, create a Lua script with your lexer's
  name as the filename followed by '.lua' in the /lexers/ directory.
  Then at the top of your lexer, the following must appear:

      module(..., package.seeall)

  Lexers are meant to be modules, not to be loaded in the global namespace.
  The ... parameter means this module assumes the name it is being 'require'd
  with. So doing

      require 'ruby'

  means the lexer will be the table 'ruby' in the global namespace. (Useful
  for a 'require'd lexer to check if another particular lexer has been loaded.)

  Now you'll need a way to style patterns of text. This is accomplished
  through tokens.

  Tokens:
  Each lexer is composed of a series of tokens. Each token contains a state
  identifier and an associated LPeg pattern. Generally the identifier should
  be prefixed or otherwise made unique so that it does not conflict with other
  lexers' states if your lexer is embedded in another, or another lexer is
  embedded in yours.

  You can create a token with a specified pattern by calling the 'token'
  function. e.g.

      local comment = token('comment', comment_pattern)
      local variable = token('my_variable', var_pattern)

  Note that 'comment' is part of the default Types and Styles, so it will be
  colored with the same style as default comments. If you wish for your
  comments to be different, you should create a token with a unique id and
  add_style() your style in your 'LoadStyles' function (discussed later).

  What are the default Types and Styles? You can look in /lexers/lexer.lua's
  DefaultTypesAndStyles function. Each lexer has a Types and Styles table.
  They initially contain types and styles that are common to nearly every
  lexer, saving you the trouble of creating the same states and styles for
  every lexer you write. You can of course redefine them in your own lexer if
  you wish, but they must be redefined in the LoadStyles function described
  later.

  So now you can create patterns and give them identifiers in a token. Next
  you need the simple patterns that appear in almost every lexer, plus styles
  and colors.
  Creating these yourself would be tedious, so it has already been done for
  you; they are available globally (from lexer.lua):

    Patterns: alpha, digit, alphanum, whitespace
    Colors via the 'colors' table: red, yellow, green, blue, teal, white,
      black
    Styles: style_nothing, style_char, style_comment, style_definition,
      style_error, style_keyword, style_number, style_operator, style_string,
      style_preproc, style_tag, style_identifier

  Note: colors and styles are identical to those defined in
  SciTEGlobal.properties.

  Okay, so at this time you're probably thinking about the keywords and
  keyword lists that were provided in SciTE properties files, because you
  surely will want to style those! Unfortunately there is no way to read those
  keywords, but there are a couple of functions that will make your life
  easier. Rather than creating an lpeg.P('keyword1') + lpeg.P('keyword2') + ...
  pattern for keywords, you can use a combination of the 'word_list' and
  'word_match' functions.

    word_list(words)
      Creates a word hash from a given table of [string] words. e.g.

          local keywords = word_list{ 'foo', 'bar', 'baz' }

    word_match(word_list[, chars, case_insensitive])
      Creates an LPeg pattern that checks to see if the current word is in
      word_list. e.g.

          local keyword = word_match(keywords)

      where keywords is defined in the previous example.
      The optional second parameter, chars, is a string of additional
      characters that count as word characters. Default word characters are
      alphanumerics and the underscore (_). In HTML and CSS for example, the
      hyphen (-) is considered a word character, so '-' would be the value of
      the second argument.
      The optional third [boolean] parameter is whether or not words are
      matched case insensitively.

  These functions make sense to have because the maximum pattern size for a
  lexer is SHRT_MAX - 10, or generally 32757 elements. If an lpeg.P were
  created for each keyword in a language, this limit would probably come into
  effect -- especially for embedded languages. Also, it would be SLOW to have
  a pattern for every keyword.
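To see why a hash lookup sidesteps both problems, here is a rough sketch in plain Lua. It is not the implementation in lexer.lua and it does not use LPeg; make_word_set and starts_with_word are hypothetical names used only for illustration. The point is that the word is read once and then checked with a single table lookup, however many keywords there are.

```lua
-- Hypothetical helper (not part of lexer.lua): build a set keyed by word.
local function make_word_set(words, case_insensitive)
  local set = {}
  for _, word in ipairs(words) do
    if case_insensitive then word = word:lower() end
    set[word] = true
  end
  return set
end

-- Hypothetical helper: read the word at the start of `text` once, then do a
-- single hash lookup. `extra_chars` lists additional word characters.
local function starts_with_word(set, text, extra_chars, case_insensitive)
  -- Escape the extra characters so they are literal inside the class.
  local escaped = (extra_chars or ''):gsub('%W', '%%%0')
  local word = text:match('^[%w_' .. escaped .. ']+')
  if not word then return false end
  if case_insensitive then word = word:lower() end
  return set[word] == true
end

local keywords = make_word_set{ 'foo', 'bar', 'baz' }
print(starts_with_word(keywords, 'foo = 1'))   -- the word read is 'foo'
print(starts_with_word(keywords, 'food = 1'))  -- the word read is 'food'
```

Passing '-' as the extra characters mirrors the HTML and CSS case mentioned above.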
  'word_match' gets the identifier once and checks if it exists in word_list
  using a hash, which is very fast.

  When you were creating your tokens, you gave them identifier states. For the
  identifier states that aren't part of the default Types and Styles, styles
  will need to be defined. For this, a 'style' function is available. Its only
  parameter is a table which can contain the following fields:

    font         - font name (string)
    size         - font size (integer)
    bold         - bold font (boolean)
    italic       - italic font (boolean)
    underline    - underline text (boolean)
    fore         - text foreground color (integer)*
    back         - text background color (integer)*
    eolfilled    - use background color for entire line, not stopping at a
                   newline character (boolean)
    characterset - ?
    case         - the default text case; 0 for normal case, 1 for uppercase,
                   2 for lowercase (integer)
    visible      - text is visible or not (boolean)
    changeable   - text is changeable or not (boolean)
    hotspot      - text is hotspot or not (boolean)
    ---
    * Use the 'color' function to create appropriate integer values from hex
      colors (#RRGGBB). Arguments are the red, green, and blue hexadecimal
      values as STRINGS. e.g.

          red = color('FF', '00', '00')

  Styles can be simple, like:

      style_bold = style { bold = true }

  or they can be composed of existing styles with added

      style_bold_italic = style_bold..{ italic = true }

  or modified fields

      style_normal = style_bold..{ bold = false }

  Note in both cases that style_bold is left unchanged.

  Now that you have styles defined for your identifiers, it's time to add them
  to Scintilla. This is done in a global LoadStyles function. LoadStyles is
  called when the lexer has been initialized and Scintilla is ready to set up
  the lexer's styles. The 'add_style' function provides a way to easily define
  your styles. The first parameter is your token identifier, and the second is
  the style you created for it.
  For example:

      function LoadStyles()
        add_style('variable', style_variable)
        add_style('function', style_function)
      end

  'add_style' returns the style number of the identifier added. This is useful
  for associating a particular style with the number returned by the function
  GetStyleAt (see below) or SciTE's editor.StyleAt.

  Finally! All your tokens have been created. All that is left to do is add
  them to your lexer. This is done in a global LoadTokens function. LoadTokens
  is called when the lexer has been initialized and the lexer library is ready
  to create the LPeg table capture that will lex any input given. The
  'add_token' function provides a way to easily define your tokens. The first
  parameter is your lexer, the second is your token identifier, and the third
  is the pattern returned by the 'token' function. For example:

      function LoadTokens()
        add_token(mylexer, 'comment', comment)
        add_token(mylexer, 'variable', variable)
      end

  where comment and variable have been defined in an above example as the
  returns of calls to 'token'.

  Keep in mind that order matters. If the match to the first token added
  fails, the next token is tried, then the next, etc. If you want one token to
  match before another, move its declaration before the latter's. Tokens in
  the wrong order can be tricky to debug if something goes wrong.

  Ah, you have all your tokens added, but what if some input does not match?
  This is where a global 'any_char' variable comes in. It is defined as

      any_char = token('default', lpeg.P(1))

  so that any text you hadn't accounted for is still styled (one character at
  a time). You can of course override any_char to display something you can
  recognize if you are debugging your lexer or if you count unmatched patterns
  as syntax errors. Now:

      add_token(mylexer, 'any_char', any_char)

  'add_token' adds your identifier and pattern to a TokenPatterns table. This
  table is available to any other lexer as a means of accessing or modifying
  your lexer's tokens.
  This is especially useful for embedded lexer functionality. See the
  supplemental section 'Writing a lexer that will embed in another lexer' for
  more details.

  The only thing left to do at this point is to lex the document with your
  LPeg tokens. If your approach is to lex the entire document (not
  line-by-line), you're done! /lexers/lexer.lua realizes this is what you
  intend and does it automatically for you. If you want a line-by-line lexer
  instead of one that lexes the entire document at once, set a global
  LexByLine variable to true and you're finished. You can lex your own way if
  you'd like by creating a global Lex function that returns a table whose
  indices contain style numbers and positions to style to. The LPeg table
  capture for a lexer is defined as Tokens and the pattern to match a single
  token is defined as Token.

  Because you have your styles and colors defined in the lexer itself, you may
  be wondering if your SciTE properties files can still be used. The answer is
  absolutely! All styling information in them is ignored, though.

  Optional -- Code Folding:
  It is sometimes convenient to "fold", or not show, blocks of code when
  editing, whether they be functions, classes, comments, etc. The basic idea
  behind implementing a folder is to iterate line by line through the
  document, assigning a fold level to each line. Lines to be "hidden" have a
  higher fold level than the lines that are their 'fold header's. This means
  that when you click a 'fold header', it folds all lines below it that have a
  higher fold level than it.

  In order to implement a folder, define the following global function in your
  lexer:

      Fold(input, start_pos, start_line, start_level)

  Fold is called when Scintilla is ready to fold your document.
  Parameters are:

    input       - the text to fold
    start_pos   - the current position in the buffer of the text (used for
                  obtaining style information from the document)
    start_line  - the line number the text starts at
    start_level - the fold level of the text at start_line

  The following Scintilla fold constants are also available (see Scintilla's
  documentation for more detail on what these flags mean):

    SC_FOLDLEVELBASE
    SC_FOLDLEVELWHITEFLAG
    SC_FOLDLEVELHEADERFLAG
    SC_FOLDLEVELBOXHEADERFLAG
    SC_FOLDLEVELBOXFOOTERFLAG
    SC_FOLDLEVELCONTRACTED
    SC_FOLDLEVELUNINDENT
    SC_FOLDLEVELNUMBERMASK

  An important one to remember is SC_FOLDLEVELBASE, which is the value you'll
  add your fold levels to if you aren't using the previous line's fold level
  at all (e.g. folding by indent level).

  Now you'll want to iterate over each line, setting fold levels as well as
  keeping track of the line number you're on, the current position at the end
  of each line, and the fold level of the previous line. As an example:

      local current_pos, current_line = start_pos, start_line
      local prev_level = start_level
      -- 'line' includes the line ending; 'data' is the line without it
      for line, data in input:gmatch('((.-)\r?\n)') do
        local current_level = prev_level
        if #data > 0 then -- not an empty line
          local header
          -- code to determine if this will be a header level
          if header then
            -- header level flag
            current_level = bit.bor(prev_level, SC_FOLDLEVELHEADERFLAG)
          else
            -- code to determine fold level, and add (+) it to
            -- current_level
            current_level = current_level + ...
          end
        else
          -- empty line flag
          current_level = bit.bor(prev_level, SC_FOLDLEVELWHITEFLAG)
        end
        SetFoldLevel(current_line, current_level)
        -- keep track of necessary buffer information
        prev_level = current_level
        current_line = current_line + 1
        current_pos = current_pos + #line
      end
      -- important: keep current flags on next line
      local flags_next = bit.band(GetFoldLevel(current_line),
                                  bit.bnot(SC_FOLDLEVELNUMBERMASK))
      SetFoldLevel(current_line, bit.bor(prev_level, flags_next))

  Just copy and paste that last 'important' section to the end of your Fold
  function.

  In order to get or set fold levels for a specific line, the following
  functions are provided:

    GetFoldLevel(line)
      Returns the fold level + SC_FOLDLEVELBASE of line.
    SetFoldLevel(line, level)
      Sets the fold level of line to level (remember to add SC_FOLDLEVELBASE
      to it if you haven't already).

  What is the 'bit.band' and 'bit.bor' stuff about? Well, that's where bitlib
  comes in. 'bit' is a global table that contains binary operations. Briefly:

    bit.band(b1, b2) performs binary & between b1 and b2
    bit.bor(b1, b2)  performs binary | between b1 and b2
    bit.bnot(b1)     performs binary not of b1
    ...

  There are additional Lua functions provided to help you fold your document:

    GetStyleAt(position)
      Returns the integer style at position.
    GetIndentAmount(line_number)
      Returns the indent amount of line_number (taking into account tabsize,
      tabs or spaces, etc.)

  Note: do not use GetProperty for getting fold options from a .properties
  file, because SciTE needs to be compiled to forward those specific
  properties to Scintilla. Instead, provide options that can be set at the top
  of the lexer.

  There is a new 'fold.by.indentation' property: if the 'fold' property is set
  for a lexer but no Fold function is available, the document is folded by
  indentation. This is done in /lexers/lexer.lua and should serve as an
  example of folding in this manner.

  Congratulations! You have finished writing a dynamic lexer.
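Before moving on, here is a rough standalone sketch of the fold-by-indentation idea mentioned above. It is not the code in /lexers/lexer.lua: fold_by_indent is a hypothetical function, it returns the levels in a table instead of calling SetFoldLevel, it counts only leading spaces (no tab handling), and it uses plain addition instead of bit.bor, which gives the same result here only because the flag bits never overlap the small level numbers. The three constants are written out with the values Scintilla defines for them.

```lua
local SC_FOLDLEVELBASE = 0x400
local SC_FOLDLEVELWHITEFLAG = 0x1000
local SC_FOLDLEVELHEADERFLAG = 0x2000

-- Hypothetical sketch: returns a table mapping line index to fold level.
local function fold_by_indent(input)
  if input:sub(-1) ~= '\n' then input = input .. '\n' end
  local lines = {}
  for data in input:gmatch('(.-)\r?\n') do lines[#lines + 1] = data end

  local levels = {}
  local prev_level = SC_FOLDLEVELBASE
  for i, data in ipairs(lines) do
    if data:find('%S') then -- not a blank line
      local indent = #data:match('^ *')
      local level = SC_FOLDLEVELBASE + indent
      -- A line is a fold header if the next non-blank line is indented more.
      for j = i + 1, #lines do
        if lines[j]:find('%S') then
          if #lines[j]:match('^ *') > indent then
            level = level + SC_FOLDLEVELHEADERFLAG
          end
          break
        end
      end
      levels[i] = level
      prev_level = SC_FOLDLEVELBASE + indent -- level number without flags
    else
      -- Blank lines keep the previous level and get the white flag.
      levels[i] = prev_level + SC_FOLDLEVELWHITEFLAG
    end
  end
  return levels
end

local levels = fold_by_indent('if x then\n  y()\n\nend\n')
for i, level in ipairs(levels) do print(i, string.format('0x%X', level)) end
```

In a real Fold function each computed level would be passed to SetFoldLevel for the corresponding document line rather than stored in a table.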
  Now you can either create a properties file for it (don't forget to 'import'
  it in your Global or User properties file), or elsewhere define the
  necessary

      file.patterns.[lexer_name]=[file_patterns]
      lexer.$(file.patterns.[lexer_name])=[lexer_name]

  in order for the lexer to be loaded automatically when a specific file type
  is opened.

Supplementals:

  Writing a lexer that will have languages embedded in it:
    This is pretty easy. You do nothing. That's right. If you've followed the
    rules for creating lexers, no further modifications are necessary.

    If you want to embed languages in the lexer by default:
    1: Load the child lexer module by doing something like:
         local child = require('child_lexer')
    2: Load the child lexer's styles in the LoadStyles function. e.g.
         child.LoadStyles()
    3: Load the child lexer's tokens in the LoadTokens function. e.g.
         child.LoadTokens()
    4: In the parent's LoadTokens function, use 'embed_language' as described
       below.

    The html.lua lexer is a good example.

  Writing a lexer that will embed in another lexer:
    1: Load the parent lexer module that you will embed your child lexer into
       by doing something like:
         local parent = require('parent_lexer')
    2: In the LoadTokens function, create start and end tokens for your child
       lexer. They are tokens that define the start and end of your embedded
       lexer respectively. For example, PHP requires a '<?php' to start and a
       '?>' to end. Then change your lexer's 'any_char' token (or equivalent,
       via the TokenPatterns table) to match any character that does not
       match the end_token. Finally, call the 'make_embeddable' function. It
       accepts 4 parameters: the language to embed, the parent language to
       embed in, the start_token, and the end_token.
       Here's an example:
         local start_token = foo
         local end_token = bar
         child.TokenPatterns.any_char = token('default', 1 - end_token)
         make_embeddable(child, parent, start_token, end_token)
    3: Use the 'embed_language' function:
         embed_language(parent, child[, preproc=false])
       parent is the parent lexer module, child is your lexer module, and
       preproc is an optional [boolean] argument that indicates whether this
       embedded language is a preprocessor language. A preprocessor language
       will have its tokens embedded in each of the parent language's embedded
       languages. (Note the SHRT_MAX limitation may come into effect.)
    4: Load the parent lexer's styles in the LoadStyles function. e.g.
         parent.LoadStyles()
    5: Load the parent lexer's tokens in the LoadTokens function. e.g.
         parent.LoadTokens()
    6: If your embedded lexer is a preprocessor language, you may want to
       modify some of the parent's tokens to embed your lexer in (e.g.
       strings). You can access them through the parent's TokenPatterns table.
       Then you must rebuild the parent's token patterns by calling
       'rebuild_token' and 'rebuild_tokens' one after the other, passing the
       parent lexer as the only parameter. For example:
         parent.TokenPatterns.string = string_with_embedded
         rebuild_token(parent)
         rebuild_tokens(parent)
    7: If your child lexer, not the parent lexer, is being loaded, specify
       that you want the parent's tokens to be used for lexing instead of the
       child's. Set a global UseOtherTokens variable to be the parent's
       tokens. e.g.
         UseOtherTokens = parent.Tokens

    The php.lua lexer is a good example.

  Effects on SciTE-tools Lua modules:
    Because most custom styles aren't fixed numbers, both scope-specific
    snippets and key commands need to be tweaked a bit. SCE_* scope constants
    are no longer available. Instead, scopes are identified by named keys in
    each lexer. See /lexers/lexer.lua for the default named scopes. Each
    individual lexer uses the 'add_style' function to add additional
    styles/scopes to it, so use the string argument passed to 'add_style' as
    the scope's name.
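To make the named-scope idea concrete, here is a small standalone sketch of looking things up by scope name instead of by a fixed SCE_* number. Everything in it is hypothetical: register_scope stands in for the number-returning behaviour of 'add_style', scope_name and snippet_at are illustrative helpers, and numbering from 0 is an arbitrary choice, not what lexer.lua necessarily does.

```lua
-- Hypothetical registry; the real tables live in /lexers/lexer.lua.
local registry = { by_name = {}, by_number = {}, count = 0 }

-- Stand-in for add_style: hands out a style number for a named scope.
local function register_scope(name)
  local number = registry.by_name[name]
  if not number then
    number = registry.count
    registry.count = registry.count + 1
    registry.by_name[name] = number
    registry.by_number[number] = name
  end
  return number
end

-- Map a style number (e.g. from GetStyleAt) back to its scope name.
local function scope_name(style_number)
  return registry.by_number[style_number]
end

-- Snippets keyed by scope name rather than by an SCE_* constant.
local snippets = {
  variable = 'snippet for variables',
  ['function'] = 'snippet for functions',
}

local function snippet_at(style_number)
  return snippets[scope_name(style_number)]
end

local variable_style = register_scope('variable')
local function_style = register_scope('function')
print(scope_name(function_style), snippet_at(function_style))
```

Because the lookup goes through the name, the snippet table keeps working even if the style numbers change between lexers.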
Additional Examples:
  See the lexers contained in /lexers/. Be sure to see /lexers/lexer.lua for
  more information too.

When things aren't working:
  Lexers can be tricky to debug if you do not write them carefully. Errors are
  printed to STDOUT, as is the output of any print() statements in the lexer
  itself.

Limitations:
  Patterns can only be composed of SHRT_MAX - 10, or generally 32757,
  elements. This should be suitable for most language lexers, however.

Disclaimer:
  Because of its dynamic nature, crashes could potentially occur because of
  malformed lexers. In the event that this happens, I CANNOT be liable for any
  damages such as loss of data. You are encouraged, however, to report the
  crash with any information that can reproduce it, or submit a patch to me
  that fixes the error.

Acknowledgements:
  When Peter Odding posted his original Lua lexer to the Lua mailing list, it
  was just what I was looking for to start making the LPeg lexer I had been
  dreaming of since Roberto announced the library. Until I saw his code, I
  wasn't sure what the best way to go about implementing a lexer was -- at
  least one that Scintilla could utilize. I liked the way he tokenized
  patterns, because it was really easy for me to assign styles to them. I also
  learned much more about LPeg through his amazingly simple, but effective
  script.

Questions? Comments? Suggestions? Additions?
mitchell {att} caladbolg {dott} net