aboutsummaryrefslogtreecommitdiffhomepage
path: root/src
AgeCommit message (Collapse)AuthorFilesLines
2024-09-21syntax errors are reported with "echoed" characters, ie. as purely printable ↵Robin Haberkorn1-1/+3
characters * Some characters like LF wouldn't be displayed in the message line correctly. * In fact the Gtk UI cannot display any of the control characters correctly. * I was considering deferring all echoing/formatting to the UIs, so they can use TecoGtkLabel or teco_curses_format_str(). This is not possible since messages transmitted via GError must not contain null-bytes, so these need to be sorted out earlier anyway. * This should also fix syntax errors in PDCurses for Windows where "%C" apparently doesn't work with non-ANSI codepoints.
2024-09-20^W^W and ^V^V can be typed completely with upcarets now and they case fold ↵Robin Haberkorn2-25/+99
all expansions of ^EQq, ^EUq and so on * Previously, there was no way to enter upper-case mode in interactive commands since the Ctrl+W immediate editing command is interpreted everywhere. * Without the case folding of ^EQq/^EUq results, the upper and lower case modes are actually pretty useless considering that modern keyboards have caps lock. So it was clear we need this, regardless of what the classic TECOs did. The TECO-11 manual is not very clear on this. tecoc apparently does not case-fold ^EQq results. * This opens up new idioms, for instance `EUq^W^W^EQq$` in order to upper case register q. It's also the only way you can currently upper-case Unicode codepoints.
2024-09-19Ctrl+^ is no longer translated to a single caret in string building (refs #20)Robin Haberkorn1-4/+21
* Ctrl+^ (30) and Caret+caret (^^) were both translated to a single caret. While there might be some reason to keep this behavior for double-caret, it is certainly pointless for Ctrl+^. * That gives you an easy way to insert Ctrl+^ (code 30) into documents with <I>. Perviously, you either had to insert a double-caret, typing 4 carets in a row, or you had to use <EI> or 30I$. * The special handling of double-caret could perhaps be abolished altogether, as we also have ^Q^ to escape plain carets. The double-caret syntax is very archaic from the time that there was no proper ^Q as far as I recall correctly.
2024-09-19"special" Q-Registers now support EQq/.../ (load) and E%q/.../ (save) commandsRobin Haberkorn4-66/+140
* @EQ$/.../ sets the current directory from the contents of the given file. @E%$/.../ stores the currend directory in the given file. * @EQ*/.../ will fail, just like ^U*...$. @E%*/.../ stores the current buffer's name in the given file. * It's especially useful with the clipboard registers. There could still be a minor bug in @E%~/.../ with regard to EOL normalization as teco_view_save() will use the EOL style of the current document, which may not be the style of the Q-Reg contents. Conversions can generally be avoided for these particular commands. But without teco_view_save() we'd have to care about save point creation.
2024-09-18check that local register is not edited at the end of macro callsRobin Haberkorn3-0/+22
* This was unsafe and could easily result in crashes, since teco_qreg_current would afterwards point to an already freed Q-Register. * Since automatically editing another register or buffer is not easy to do right, we throw an error instead.
2024-09-17fixed searches on completely new and empty documentsRobin Haberkorn1-1/+2
This was throwing glib assertions.
2024-09-16updated lists of external links in sciteco(1) and sciteco(7)Robin Haberkorn1-1/+1
* Unfortunately, the list in sciteco(7) does not format with FreeBSD's man or within SciTECO. * Removed references to the old sciteco.sf.net. We don't have a proper "homepage" for the time being.
2024-09-16Curses: added support for cool Unicode icons (refs #5)Robin Haberkorn7-14/+464
* Practically requires one of the "Nerd Font" fonts, so it's disabled by default. Add 0,512ED to the profile to enable them. * The new ED flag could be used to control Gtk icons as well, but they are left always-enabled for the time being. Is there any reason anybody would like to disable icons in Gtk? * The list of icons has been adapted and extended from exa: https://github.com/ogham/exa/blob/master/src/output/icons.rs * The icons are hardcoded as presorted lists, so we can binary search them. This could change in the future. If there is any demand, they could be made configurable via Q-Registers as well.
2024-09-16fixed rubout of empty forward kill (FK)Robin Haberkorn1-7/+12
Test case: IF$ J IX$ FKF$ ^W The range to delete is empty, Scintilla would not generate an undo action, but SCI_UNDO would still be exected on rubout which removes the "X" too early. * We should really get rid of Scintilla undo actions as they are a source of trouble and complexity. There could be a custom undo token to undo SCI_DELETERANGE that automatically fetches the text that's going to be deleted and stores it in the token's data. This could replace most uses of SCI_UNDO. The rest is to undo insertions, which can easily be replaced with undo__teco_interface_ssm(SCI_DELETERANGE...). * We should really allow rubout tests in the test suite...
2024-09-16minor search optimization: use SCI_GETRANGEPOINTERRobin Haberkorn1-4/+4
* if the buffer gap does not fall into the searched area, the gap will no longer be removed. * If it does fall into the range, there is nothing I can do about it. Only Gnulib's re_search_2() allows searching over two buffers.
2024-09-13remaining types of program counters changed to gsize/gssizeRobin Haberkorn6-26/+26
* This fixes F< to the beginning of the macro, which was broken in 73d574b71a10d4661ada20275cafde75aff6c1ba. teco_machine_main_t::macro_pc actually has to be signed as it is sometimes set to -1.
2024-09-13fixup abb5d23eba21a2aafda0346c0c5dd845561b2aa2: commandline glitches after ↵Robin Haberkorn1-2/+2
errors * teco_cmdline.pc is not correct after an error occurred. Therefore start_pc is initialized with teco_cmdline.effective_len.
2024-09-13fixed up 68578072bfaf6054a96bb6bcedfccb6e56a508fe: negative numbers weren't ↵Robin Haberkorn1-1/+1
parsed correctly
2024-09-12function key macros have been reworked into a more generic key macro featureRobin Haberkorn7-156/+217
* ALL keypresses (the UTF-8 sequences resulting from key presses) can now be remapped. * This is especially useful with Unicode support, as you might want to alias international characters to their corresponding latin form in the start state, so you don't have to change keyboard layouts so often. This is done automatically in Gtk, where we have hardware key press information, but has to be done with key macros in Curses. There is a new key mask 4 (bit 3) for that purpose now. * Also, you might want to define non-ANSI letters to perform special functions in the start state where it won't be accepted by the parser anyway. Suppose you have a macro M→, you could define @^U[^K→]{m→} 1^_U[^K→] This effectively "extends" the parser and allow you to call macro "→" by a single key press. See also #5. * The register prefix has been changed from ^F (for function) to ^K (for key). This is the only thing you have to change in order to migrate existing function key macros. * Key macros are enabled by default. There is no longer any way to disable function key handling in curses, as I never found any reason or need to disable it. Theoretically, the default ESCDELAY could turn out to be too small and function keys don't get through. I doubt that's possible unless on extremely slow serial lines. Even then, you'd have to increase ESCDELAY and instead of disabling function keys simply define an escape surrogate. * The ED flag has been removed and its place is reserved for a future mouse support flag (which does make sense to disable in curses sometimes). fnkeys.tes is consequently also enabled by default in sample.teco_ini. * Key macros are handled as an unit. If one character results in an error, the entire string is rubbed out. This fixes the "CLOSE" key on Gtk. It also makes sure that the original error message is preserved and not overwritten by some subsequent syntax error. It was never useful that we kept inserting characters after the first error.
2024-09-12teco_string_get_coord() returns character offsets now (refs #5)Robin Haberkorn8-16/+21
* This is used for error messages (TECO macro stackframes), so it's important to display columns in characters. * Program counters are in bytes and therefore everywhere gsize. This is by glib convention.
2024-09-11improved file name autocompletionRobin Haberkorn5-7/+74
* pressing ^W in FG now deletes the entire directory component as in EB * commands without glob patterns (eg. EW) can now autocomplete file names containing glob patterns * When the autocompletion contains a glob character in commands accepting glob patterns like EB or EN, we now escape the glob pattern. This already helps if the remaining file name can be autocompleted in one go. Unfortunately, this is still insufficient if we can only partially complete and the partial completion contains glob characters. For instance, if there are 2 files: `file?.txt` and `file?.foo`, completing after `f` will insert `ile[?].`. The second try to press Tab will already do nothing. To fully support these cases, we need a version of teco_file_auto_complete() accepting glob patterns. Perhaps we can simply append `*` to the given glob pattern.
2024-09-11fixed searches in single-byte encoded documentsRobin Haberkorn3-28/+58
* while code is guaranteed to be in valid UTF-8, this cannot be said about the result of string building. * The search pattern can end up with invalid Unicode bytes even when searching on UTF-8 buffers, e.g. if ^EQq inserts garbage. There are currently no checks. * When searching on a raw buffer, it must be possible to search for arbitrary bytes (^EUq). Since teco_pattern2regexp() was always expecting clean UTF-8 input, this would sometimes skip over too many bytes and could even crash. * Instead, teco_pattern2regexp() now takes the <S> target codepage into account.
2024-09-11the SciTECO parser is Unicode-based now (refs #5)Robin Haberkorn27-192/+308
The following rules apply: * All SciTECO macros __must__ be in valid UTF-8, regardless of the the register's configured encoding. This is checked against before execution, so we can use glib's non-validating UTF-8 API afterwards. * Things will inevitably get slower as we have to validate all macros first and convert to gunichar for each and every character passed into the parser. As an optimization, it may make sense to have our own inlineable version of g_utf8_get_char() (TODO). Also, Unicode glyphs in syntactically significant positions may be case-folded - just like ASCII chars were. This is is of course slower than case folding ASCII. The impact of this should be measured and perhaps we should restrict case folding to a-z via teco_ascii_toupper(). * The language itself does not use any non-ANSI characters, so you don't have to use UTF-8 characters. * Wherever the parser expects a single character, it will now accept an arbitrary Unicode/UTF-8 glyph as well. In other words, you can call macros like M§ instead of having to write M[§]. You can also get the codepoint of any Unicode character with ^^x. Pressing an Unicode character in the start state or in Ex and Fx will now give a sane error message. * When pressing a key which produces a multi-byte UTF-8 sequence, the character gets translated back and forth multiple times: 1. It's converted to an UTF-8 string, either buffered or by IME methods (Gtk). On Curses we could directly get a wide char using wget_wch(), but it's not currently used, so we don't depend on widechar curses. 2. Parsed into gunichar for passing into the edit command callbacks. This also validates the codepoint - everything later on can assume valid codepoints and valid UTF-8 strings. 3. Once the edit command handling decides to insert the key into the command line, it is serialized back into an UTF-8 string as the command line macro has to be in UTF-8 (like all other macros). 4. The parser reads back gunichars without validation for passing into the parser callbacks. * Flickering in the Curses UI and Pango warnings in Gtk, due to incompletely inserted and displayed UTF-8 sequences, are now fixed.
2024-09-10win32: fixed opening and saving UTF-8 filenames (refs #5)Robin Haberkorn1-5/+15
* The default ANSI versions of the Win32 API calls worked only as long as we used the ANSI subset of UTF-8 in filenames. * There is g_win32_locale_filename_from_utf8(), but it's not guaranteed to derive an unique filename. * Therefore we define UNICODE and convert between UTF-8 and UTF-16 (Windows' native Unicode encoding).
2024-09-10win32: convert command line to UTF-8 (refs #5)Robin Haberkorn2-17/+31
* Should prevent data loss due to system locale conversions when parsing command line arguments. * Should also fix passing Unicode arguments to munged macros and therefore opening files via ~/.teco_ini. * The entire option parsing is based on GStrv (null-terminated string lists) now, also on UNIX.
2024-09-09define G_DISABLE_ASSERT unless --enable-debug is specifiedRobin Haberkorn1-1/+1
* turns out that glib's g_assert() does not depend on NDEBUG like Standard C's assert() * this disables assertions in release builds and should speed up things slightly
2024-09-09<f,tXq>: fixed for very large character rangesRobin Haberkorn1-3/+7
* use SCI_GETTEXTRANGEFULL instead of deprecated SCI_GETTEXTRANGE
2024-09-09symbols-extract.tes works in 8-bit mode now (refs #5)Robin Haberkorn2-3/+3
* significantly speeds up build time * Scintilla and Lexilla headers and symbols are all-ASCII anyway. * We should probably have a look at the quicksort implementation in string.tes, as it can probably be optimized in UTF-8 documents as well.
2024-09-09teco_glyphs2bytes() and teco_bytes2glyphs() renamed to ↵Robin Haberkorn5-26/+26
teco_interface_glyphs2bytes() and teco_interface_bytes2glyphs() (refs #5) * for consistency with all the other teco_view wrappers in interface.h
2024-09-09added raw ANSI mode to facilitate 8-bit clean editing (refs #5)Robin Haberkorn13-104/+143
* When enabled with bit 2 in the ED flags (0,4ED), all registers and buffers will get the raw ANSI encoding (as if 0EE had been called on them). You can still manually change the encoding, eg. by calling 65001EE afterwards. * Also the ANSI mode sets up character representations for all bytes >= 0x80. This is currently done only depending on the ED flag, not when setting 0EE. * Since setting 16,4ED for 8-bit clean editing in a macro can be tricky - the default unnamed buffer will still be at UTF-8 and at least a bunch of environment registers as well - we added the command line option `--8bit` (short `-8`) which configures the ED flags very early on. As another advantage you can mung the profile in 8-bit mode as well when using SciTECO as a sort of interactive hex editor. * Disable UTF-8 checks in 8-bit clean mode (sample.teco_ini).
2024-09-09Xq and ]q inherit the document encoding from the source document (refs #5)Robin Haberkorn14-112/+177
* ^Uq however always sets an UTF8 register as the source is supposed to be a SciTECO macro which is always UTF-8. * :^Uq preserves the register's encoding * teco_doc_set_string() now also sets the encoding * instead of trying to restore the encoding in teco_doc_undo_set_string(), we now swap out the document in a teco_doc_t and pass it to an undo token. * The get_codepage() Q-Reg method has been removed as the same can now be done with teco_doc_get_string() and the get_string() method.
2024-09-09n^Uq now checks the input codepoints for validity (refs #5)Robin Haberkorn1-1/+5
* <nI> and ^EUq do the same
2024-09-09Gtk: ignore the keyboard layout whereever possible (refs #5)Robin Haberkorn2-23/+89
* Eg. when typing with a Russian layout, CTRL+I will always insert ^I. * Works with all of the start-state command Ex, Fx, ^x commands and string building constructs. This is exactly where process_edit_cmd_cb() case folds case-insensitive characters. The corresponding state therefore sets an is_case_insensitive flag now. * Does not yet work with anything embedded into Q-Register specifications. This could only be realized with a new state callback (is_case_insensitive()?) that chains to the Q-Register and string building states recursively. * Also it doesn't work with Ё on my Russian phonetic layout, probably because the ANSI caret on that same key is considered dead and not returned by gdk_keyval_to_unicode(). Perhaps we should directly wheck the keyval values? * Whenever a non-ANSI key is pressed in an allowed state, we try to check all other keyvals that could be produced by the same hardware keycode, ie. we check all groups (keyboard layouts).
2024-09-09leave some comments on what to do when converting the parser to Unicode ↵Robin Haberkorn2-1/+21
(refs #5)
2024-09-09search patterns are now expected to be in UTF-8 and the document's encoding ↵Robin Haberkorn1-21/+31
is taken into account (refs #5) * ^Nx and ^EMx constructs work with Unicode glyphs now, even though the main SciTECO parser is still not Unicode-based. (We translate only complete patterns, although they could have incomplete Unicode sequences at their end.) * case-insensitive searching now works with Unicode glyphs
2024-09-09the ^EUq string building escape now respects the encoding (can insert bytes ↵Robin Haberkorn10-16/+98
or codepoints) (refs #5) * This is trickier than it sounds because there isn't one single place to consult. It depends on the context. If the string argument relates to buffer contents - as in <I>, <S>, <FR> etc. - the buffer's encoding is consulted. If it goes into a register (EU), the register's encoding is consulted. Everything else (O, EN, EC, ES...) expects only Unicode codepoints. * This is communicated through a new field teco_machine_stringbuilding_t::codepage which must be set in the states' initial callback. * Seems overkill just for ^EUq, but it can be used for context-sensitive processing of all the other string building constructs as well. * ^V and ^W cannot be supported for Unicode characters for the time being without an Unicode-aware parser
2024-09-09<I> command evaluates input codepoints (refs #5)Robin Haberkorn1-10/+18
2024-09-09conditionals now check for Unicode codepoints (refs #5)Robin Haberkorn1-7/+7
* This will naturally work with both ASCII characters and various non-English scripts. * Unfortunately, it cannot work with the other non-ANSI single-byte codepages. * If we'd like to support scripts working with all sorts of codepoints, we'd have to introduce a new command for translating individual codepoints from the current codepage (as reported by EE) to Unicode.
2024-09-09glob patterns fully support Unicode now (refs #5)Robin Haberkorn1-13/+16
* The ASCII compiler would try to escape ("\") all bytes of a multibyte UTF-8 glyph. * The new implementation escapes only metacharacters and passes down all non-ANSI glyphs unchanged. On the downside, this will work only with PCREs.
2024-09-09:EL can be used to perform codepage conversions now (refs #5)Robin Haberkorn2-35/+231
* I decoded the Scintilla charset values into codepages, at least those used on Gtk. * make sure that the line character index is not allocated or released too often, as it is actually internally reference counted, which could result in it missing when we really need it. * The line character index still appears to be released whenever the document pointer changes, which will happen after using a different Q-Register. This could be a performance bottleneck (FIXME).
2024-09-09avoid redunancies between teco_qreg_plain_get_character() and ↵Robin Haberkorn6-48/+54
teco_state_start_get() (refs #5)
2024-09-09reserve at most 4 bytes for UTF-8 encoded characters (refs #5)Robin Haberkorn3-3/+4
There is a widespread myth that they could take up to 6 bytes.
2024-09-09implemented <EE> and <^E> commands for configuring encodings and translating ↵Robin Haberkorn2-1/+129
between glyph and byte offsets (refs #5) * ^E is heavily overloaded and can also be used to check whether a given index is valid (as it is the same that most movement commands to internally). Besides that, it is mainly useful for interfacing with Scintilla messages. * EE takes a code page or 0 for ANSI/ASCII. Currently all documents and new registers are UTF-8. There will have to be some kind of codepage inheritance and a single-byte-only mode.
2024-09-09Unicode support for the Q-Register commands (refs #5)Robin Haberkorn10-145/+274
* this required adding several Q-Register vtable methods * it should still be investigated whether the repeated calling of SCI_ALLOCATELINECHARACTERINDEX causes any overhead.
2024-09-09allow Unicode characters in command line arguments (refs #5)Robin Haberkorn2-4/+8
* the locale must be initialized very early before g_option_context_parse() * will allow UTF-8 characters in the test suite
2024-09-09Glyph to byte offset mapping is now using the line character index (refs #5)Robin Haberkorn7-68/+130
* This works reasonably well unless lines are exceedingly long (as on a line we always count characters). The following test case is still slow (on Unicode buffers): 10000<@I/XX/> <%a-1:J;> While the following is now also fast: 10000<@I/X^J/> <%a-1:J;> * Commands with relative character offsets (C, R, A, D) have a special optimization where they always count characters beginning at dot, as long as the argument is now exceedingly large. This means they are fast even on exceedingly long lines. * The remaining commands (search, EC/EG, Xq) now accept glyph indexes.
2024-09-09implemented Unicode support for rubin/rubout and a number of commands (WIP) ↵Robin Haberkorn5-44/+148
(refs #5) certain test cases are still way too slow: 10000<@I/X^J/> 20000<R> or 10000<@I/X^J/> 20000<%a-1J> SCI_ALLOCATELINECHARACTERINDEX does not help much here. It probably speeds up only SCI_LINEFROMINDEXPOSITION and SCI_INDEXPOSITIONFROMLINE.
2024-09-09input and displaying of Unicode characters is now possible (refs #5)Robin Haberkorn6-27/+73
* All non-ASCII characters are inserted as Unicode. On Curses, this also requires a properly set up locale. * We still do not need any widechar Curses, as waddch() handles multibyte characters on ncurses. We will see whether there is any Curses variant that strictly requires wadd_wch(). If this will be an exception, we might keep both widechar and non-widechar support. * By convention gsize is used exclusively for byte sizes. Character offsets or lengths use int or long.
2024-08-28fixed retrieval of characters with codes larger than 127 - always return ↵Robin Haberkorn2-5/+7
unsigned integer * SCI_GETCHARAT is internally casted to `char`, which may be signed. Characters > 127 therefore become negative and stay so when casted to sptr_t. We therefore cast it back to guchar (unsigned char). * The same is true whenever returning a string's character to SciTECO (teco_int_t) as our string type is `gchar *`. * <^^x> now also works for those characters. Eventually, the parser will probably become UTF8-aware and this will have to be done differently.
2024-08-23fully support out of tree buildsRobin Haberkorn1-3/+2
* You no longer have to copy contrib/scintilla, contrib/scinterm and contrib/lexilla manually to the build directory. * It turns out, that Scintilla/Lexilla was supporting this since 2016. Scintilla allows pointing to a source directory (srdir) and Lexilla to a binary directory (DIR_O). * For Scinterm I opened a pull request in order to add srcdir/basedir variables: https://github.com/orbitalquark/scinterm/pull/21 * `make distcheck` is therefore now also fixed. * The FreeBSD package is now allowed to build out of source. I haven't tested it yet. * See also https://github.com/ScintillaOrg/lexilla/issues/266
2024-02-08fixed expressions like `1,(2)` or `(1),(2)`: they are reported as two ↵Robin Haberkorn1-0/+3
numbers now * Instead of TECO_OP_NEW, there should perhaps simply be a flag of whether `,` was used.
2024-02-06fixed the power (^*) operator: did not handle corner cases and was inefficientRobin Haberkorn1-1/+22
* in fact, with a negative exponent the previous naive implementation would even hang indefinitely! * Now uses the squaring algorithm. This is slightly longer but significantly more efficient. * added test cases
2024-02-06avoid Groff warnings due to `\` escapesRobin Haberkorn1-2/+2
* It's generally a bad idea to pass backslashes as a glyph in macro arguments, even as `\\` since this could easily be interpreted as an escape. * Instead we now always use `\[rs]`.
2024-02-06use bool instead of guint for 1-bit fieldsRobin Haberkorn2-7/+10
* gboolean cannot be used since it is a signed type * bool is still more readable, even though we mostly use glib typedefs. * AFAIK the glib types are deprecated, so sooner or later we will switch to stdint/stdbool types anyway.
2024-02-03GTK: allow disabling client-side decorations by setting $GTK_CSD=0Robin Haberkorn1-7/+2
* This is the same variable used by gtk3-nocsd, but we will now work even without preloading any libraries. Also, it turns out that gtk3-nocsd does not ship as a FreeBSD port and hasn't been updated in a long time. * Setting this in .teco_ini wouldn't have been easy since the teco_interface_init() is called before any TECO code. Also, you might not even want disable this globally but depending on the window manager. * Therefore, you are advised to `export GTK_CSD=0` in ~/.xsession. * The --no-csd command line option is kept for the time being, but probably serves no more purpose.