sciteco/src/string-utils.h, branch v2.5.2

updated copyright to 2026

2026-01-01T06:59:49+00:00

teco_string_t is now passed by value like a scalar if the callee isn't expected to modify it

2025-12-28T19:57:31+00:00

* When passing a struct that should not be modified, I usually use a const pointer.
* Strings however are small 2-word objects and they are often now already passed via separate
  `gchar*` and gsize parameters. So it is consistent to pass teco_string_t by value as well.
  A teco_string_t will usually fit into registers just like a pointer.
* It's now obvious which function just _uses_ and which function _modifies_ a string.
  There is also no chance to pass a NULL pointer to those functions.

fixed clicking the "(Unnamed)" buffer in 0EB popups

2025-12-23T12:54:17+00:00

* When constructing the list of popup items, the unnamed buffer is stored as the empty string
  instead of a prerendered "(Unnamed)".
  Using the empty string simplifies autocompletions, which will actually have to insert nothing
  at all (in addition to terminating the string).
* Since unnamed buffers are now special in the popup list, we can render them with special
  icons as well.
  Currently, only on Curses we use a file symbol with a question mark.
  There doesn't appear to be a fitting standard Freedesktop icon to use on GTK and there
  isn't even any fitting standard emblem to lay over the default file icon.

updated copyright to 2025

2025-01-12T23:39:34+00:00

teco_string_get_coord() returns character offsets now (refs #5)

2024-09-12T14:42:08+00:00

* This is used for error messages (TECO macro stackframes),
  so it's important to display columns in characters.
* Program counters are in bytes and therefore everywhere gsize.
  This is by glib convention.

fixed searches in single-byte encoded documents

2024-09-11T14:14:27+00:00

* while code is guaranteed to be in valid UTF-8, this cannot be
  said about the result of string building.
* The search pattern can end up with invalid Unicode bytes even when
  searching on UTF-8 buffers, e.g. if ^EQq inserts garbage.
  There are currently no checks.
* When searching on a raw buffer, it must be possible to
  search for arbitrary bytes (^EUq).
  Since teco_pattern2regexp() was always expecting clean UTF-8 input,
  this would sometimes skip over too many bytes and could even crash.
* Instead, teco_pattern2regexp() now takes the  target codepage
  into account.

the SciTECO parser is Unicode-based now (refs #5)

2024-09-11T14:14:27+00:00

The following rules apply: * All SciTECO macros __must__ be in valid UTF-8, regardless of the the register's configured encoding. This is checked against before execution, so we can use glib's non-validating UTF-8 API afterwards. * Things will inevitably get slower as we have to validate all macros first and convert to gunichar for each and every character passed into the parser. As an optimization, it may make sense to have our own inlineable version of g_utf8_get_char() (TODO). Also, Unicode glyphs in syntactically significant positions may be case-folded - just like ASCII chars were. This is is of course slower than case folding ASCII. The impact of this should be measured and perhaps we should restrict case folding to a-z via teco_ascii_toupper(). * The language itself does not use any non-ANSI characters, so you don't have to use UTF-8 characters. * Wherever the parser expects a single character, it will now accept an arbitrary Unicode/UTF-8 glyph as well. In other words, you can call macros like M§ instead of having to write M[§]. You can also get the codepoint of any Unicode character with ^^x. Pressing an Unicode character in the start state or in Ex and Fx will now give a sane error message. * When pressing a key which produces a multi-byte UTF-8 sequence, the character gets translated back and forth multiple times: 1. It's converted to an UTF-8 string, either buffered or by IME methods (Gtk). On Curses we could directly get a wide char using wget_wch(), but it's not currently used, so we don't depend on widechar curses. 2. Parsed into gunichar for passing into the edit command callbacks. This also validates the codepoint - everything later on can assume valid codepoints and valid UTF-8 strings. 3. Once the edit command handling decides to insert the key into the command line, it is serialized back into an UTF-8 string as the command line macro has to be in UTF-8 (like all other macros). 4. The parser reads back gunichars without validation for passing into the parser callbacks. * Flickering in the Curses UI and Pango warnings in Gtk, due to incompletely inserted and displayed UTF-8 sequences, are now fixed.

win32: convert command line to UTF-8 (refs #5)

2024-09-10T09:43:01+00:00

* Should prevent data loss due to system locale conversions when parsing command line arguments. * Should also fix passing Unicode arguments to munged macros and therefore opening files via ~/.teco_ini. * The entire option parsing is based on GStrv (null-terminated string lists) now, also on UNIX.

input and displaying of Unicode characters is now possible (refs #5)

2024-09-09T16:16:07+00:00

* All non-ASCII characters are inserted as Unicode. On Curses, this also requires a properly set up locale. * We still do not need any widechar Curses, as waddch() handles multibyte characters on ncurses. We will see whether there is any Curses variant that strictly requires wadd_wch(). If this will be an exception, we might keep both widechar and non-widechar support. * By convention gsize is used exclusively for byte sizes. Character offsets or lengths use int or long.

updated copyright to 2024

2024-01-21T11:45:05+00:00