sciteco/src/string-utils.c, branch v2.5.2

fixed auto-completion of Unicode file names

2026-01-18T20:36:56+00:00

* teco_string_diff() could return a number of bytes in the middle of
  an Unicode sequence. It now also requires Unicode strings.
* Added a missing Unicode-validity check when replacing command lines (`{` and `}`).
  teco_cmdline_insert() should really be refactored, though (FIXME).
* Added test case

updated copyright to 2026

2026-01-01T06:59:49+00:00

teco_string_t is now passed by value like a scalar if the callee isn't expected to modify it

2025-12-28T19:57:31+00:00

* When passing a struct that should not be modified, I usually use a const pointer.
* Strings however are small 2-word objects and they are often now already passed via separate
  `gchar*` and gsize parameters. So it is consistent to pass teco_string_t by value as well.
  A teco_string_t will usually fit into registers just like a pointer.
* It's now obvious which function just _uses_ and which function _modifies_ a string.
  There is also no chance to pass a NULL pointer to those functions.

updated copyright to 2025

2025-01-12T23:39:34+00:00

teco_string_get_coord() returns character offsets now (refs #5)

2024-09-12T14:42:08+00:00

* This is used for error messages (TECO macro stackframes),
  so it's important to display columns in characters.
* Program counters are in bytes and therefore everywhere gsize.
  This is by glib convention.

the SciTECO parser is Unicode-based now (refs #5)

2024-09-11T14:14:27+00:00

The following rules apply:
 * All SciTECO macros __must__ be in valid UTF-8, regardless of the
   the register's configured encoding.
   This is checked against before execution, so we can use glib's non-validating
   UTF-8 API afterwards.
 * Things will inevitably get slower as we have to validate all macros first
   and convert to gunichar for each and every character passed into the parser.
   As an optimization, it may make sense to have our own inlineable version of
   g_utf8_get_char() (TODO).
   Also, Unicode glyphs in syntactically significant positions may be case-folded -
   just like ASCII chars were. This is is of course slower than case folding
   ASCII. The impact of this should be measured and perhaps we should restrict
   case folding to a-z via teco_ascii_toupper().
 * The language itself does not use any non-ANSI characters, so you don't have to
   use UTF-8 characters.
 * Wherever the parser expects a single character, it will now accept an arbitrary
   Unicode/UTF-8 glyph as well.
   In other words, you can call macros like M§ instead of having to write M[§].
   You can also get the codepoint of any Unicode character with ^^x.
   Pressing an Unicode character in the start state or in Ex and Fx will now
   give a sane error message.
 * When pressing a key which produces a multi-byte UTF-8 sequence, the character
   gets translated back and forth multiple times:
   1. It's converted to an UTF-8 string, either buffered or by IME methods (Gtk).
      On Curses we could directly get a wide char using wget_wch(), but it's
      not currently used, so we don't depend on widechar curses.
   2. Parsed into gunichar for passing into the edit command callbacks.
      This also validates the codepoint - everything later on can assume valid
      codepoints and valid UTF-8 strings.
   3. Once the edit command handling decides to insert the key into the command line,
      it is serialized back into an UTF-8 string as the command line macro has
      to be in UTF-8 (like all other macros).
   4. The parser reads back gunichars without validation for passing into
      the parser callbacks.
 * Flickering in the Curses UI and Pango warnings in Gtk, due to incompletely
   inserted and displayed UTF-8 sequences, are now fixed.

leave some comments on what to do when converting the parser to Unicode (refs #5)

2024-09-09T16:22:21+00:00

2024-01-21T11:45:05+00:00

2023-04-05T15:11:32+00:00

2022-06-21T01:41:16+00:00