<feed xmlns='http://www.w3.org/2005/Atom'>
<title>sciteco/src/string-utils.h, branch v2.5.2</title>
<subtitle>Scintilla-based Text Editor and COrrector</subtitle>
<link rel='alternate' type='text/html' href='https://git.fmsbw.de/sciteco/'/>
<entry>
<title>updated copyright to 2026</title>
<updated>2026-01-01T06:59:49+00:00</updated>
<author>
<name>Robin Haberkorn</name>
<email>rhaberkorn@fmsbw.de</email>
</author>
<published>2026-01-01T06:59:49+00:00</published>
<link rel='alternate' type='text/html' href='https://git.fmsbw.de/sciteco/commit/?id=c2feb2a6f71fc9adb20226fb3c2260c236e974e0'/>
<id>c2feb2a6f71fc9adb20226fb3c2260c236e974e0</id>
<content type='text'>
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
</pre>
</div>
</content>
</entry>
<entry>
<title>teco_string_t is now passed by value like a scalar if the callee isn't expected to modify it</title>
<updated>2025-12-28T19:57:31+00:00</updated>
<author>
<name>Robin Haberkorn</name>
<email>rhaberkorn@fmsbw.de</email>
</author>
<published>2025-12-28T15:23:22+00:00</published>
<link rel='alternate' type='text/html' href='https://git.fmsbw.de/sciteco/commit/?id=ea0a23645f03a42252ab1ce8df45ae4076ebae75'/>
<id>ea0a23645f03a42252ab1ce8df45ae4076ebae75</id>
<content type='text'>
* When passing a struct that should not be modified, I usually use a const pointer.
* Strings however are small 2-word objects and they are often now already passed via separate
  `gchar*` and gsize parameters. So it is consistent to pass teco_string_t by value as well.
  A teco_string_t will usually fit into registers just like a pointer.
* It's now obvious which function just _uses_ and which function _modifies_ a string.
  There is also no chance to pass a NULL pointer to those functions.
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
* When passing a struct that should not be modified, I usually use a const pointer.
* Strings however are small 2-word objects and they are often now already passed via separate
  `gchar*` and gsize parameters. So it is consistent to pass teco_string_t by value as well.
  A teco_string_t will usually fit into registers just like a pointer.
* It's now obvious which function just _uses_ and which function _modifies_ a string.
  There is also no chance to pass a NULL pointer to those functions.
</pre>
</div>
</content>
</entry>
<entry>
<title>fixed clicking the "(Unnamed)" buffer in 0EB popups</title>
<updated>2025-12-23T12:54:17+00:00</updated>
<author>
<name>Robin Haberkorn</name>
<email>rhaberkorn@fmsbw.de</email>
</author>
<published>2025-12-23T12:54:17+00:00</published>
<link rel='alternate' type='text/html' href='https://git.fmsbw.de/sciteco/commit/?id=0c89fb700957e411885e7e7835e15f441e8b5e84'/>
<id>0c89fb700957e411885e7e7835e15f441e8b5e84</id>
<content type='text'>
* When constructing the list of popup items, the unnamed buffer is stored as the empty string
  instead of a prerendered "(Unnamed)".
  Using the empty string simplifies autocompletions, which will actually have to insert nothing
  at all (in addition to terminating the string).
* Since unnamed buffers are now special in the popup list, we can render them with special
  icons as well.
  Currently, only on Curses we use a file symbol with a question mark.
  There doesn't appear to be a fitting standard Freedesktop icon to use on GTK and there
  isn't even any fitting standard emblem to lay over the default file icon.
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
* When constructing the list of popup items, the unnamed buffer is stored as the empty string
  instead of a prerendered "(Unnamed)".
  Using the empty string simplifies autocompletions, which will actually have to insert nothing
  at all (in addition to terminating the string).
* Since unnamed buffers are now special in the popup list, we can render them with special
  icons as well.
  Currently, only on Curses we use a file symbol with a question mark.
  There doesn't appear to be a fitting standard Freedesktop icon to use on GTK and there
  isn't even any fitting standard emblem to lay over the default file icon.
</pre>
</div>
</content>
</entry>
<entry>
<title>updated copyright to 2025</title>
<updated>2025-01-12T23:39:34+00:00</updated>
<author>
<name>Robin Haberkorn</name>
<email>robin.haberkorn@googlemail.com</email>
</author>
<published>2025-01-12T23:39:34+00:00</published>
<link rel='alternate' type='text/html' href='https://git.fmsbw.de/sciteco/commit/?id=d842eaee19e2723f845d4b8314a230cf68e82653'/>
<id>d842eaee19e2723f845d4b8314a230cf68e82653</id>
<content type='text'>
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
</pre>
</div>
</content>
</entry>
<entry>
<title>teco_string_get_coord() returns character offsets now (refs #5)</title>
<updated>2024-09-12T14:42:08+00:00</updated>
<author>
<name>Robin Haberkorn</name>
<email>robin.haberkorn@googlemail.com</email>
</author>
<published>2024-09-12T11:33:18+00:00</published>
<link rel='alternate' type='text/html' href='https://git.fmsbw.de/sciteco/commit/?id=73d574b71a10d4661ada20275cafde75aff6c1ba'/>
<id>73d574b71a10d4661ada20275cafde75aff6c1ba</id>
<content type='text'>
* This is used for error messages (TECO macro stackframes),
  so it's important to display columns in characters.
* Program counters are in bytes and therefore everywhere gsize.
  This is by glib convention.
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
* This is used for error messages (TECO macro stackframes),
  so it's important to display columns in characters.
* Program counters are in bytes and therefore everywhere gsize.
  This is by glib convention.
</pre>
</div>
</content>
</entry>
<entry>
<title>fixed searches in single-byte encoded documents</title>
<updated>2024-09-11T14:14:27+00:00</updated>
<author>
<name>Robin Haberkorn</name>
<email>robin.haberkorn@googlemail.com</email>
</author>
<published>2024-09-11T12:30:24+00:00</published>
<link rel='alternate' type='text/html' href='https://git.fmsbw.de/sciteco/commit/?id=2a050759ab621b87d0782cc8235907a1757b46cc'/>
<id>2a050759ab621b87d0782cc8235907a1757b46cc</id>
<content type='text'>
* while code is guaranteed to be in valid UTF-8, this cannot be
  said about the result of string building.
* The search pattern can end up with invalid Unicode bytes even when
  searching on UTF-8 buffers, e.g. if ^EQq inserts garbage.
  There are currently no checks.
* When searching on a raw buffer, it must be possible to
  search for arbitrary bytes (^EUq).
  Since teco_pattern2regexp() was always expecting clean UTF-8 input,
  this would sometimes skip over too many bytes and could even crash.
* Instead, teco_pattern2regexp() now takes the &lt;S&gt; target codepage
  into account.
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
* while code is guaranteed to be in valid UTF-8, this cannot be
  said about the result of string building.
* The search pattern can end up with invalid Unicode bytes even when
  searching on UTF-8 buffers, e.g. if ^EQq inserts garbage.
  There are currently no checks.
* When searching on a raw buffer, it must be possible to
  search for arbitrary bytes (^EUq).
  Since teco_pattern2regexp() was always expecting clean UTF-8 input,
  this would sometimes skip over too many bytes and could even crash.
* Instead, teco_pattern2regexp() now takes the &lt;S&gt; target codepage
  into account.
</pre>
</div>
</content>
</entry>
<entry>
<title>the SciTECO parser is Unicode-based now (refs #5)</title>
<updated>2024-09-11T14:14:27+00:00</updated>
<author>
<name>Robin Haberkorn</name>
<email>robin.haberkorn@googlemail.com</email>
</author>
<published>2024-09-11T10:21:42+00:00</published>
<link rel='alternate' type='text/html' href='https://git.fmsbw.de/sciteco/commit/?id=68578072bfaf6054a96bb6bcedfccb6e56a508fe'/>
<id>68578072bfaf6054a96bb6bcedfccb6e56a508fe</id>
<content type='text'>
The following rules apply:
 * All SciTECO macros __must__ be in valid UTF-8, regardless of the
   the register's configured encoding.
   This is checked against before execution, so we can use glib's non-validating
   UTF-8 API afterwards.
 * Things will inevitably get slower as we have to validate all macros first
   and convert to gunichar for each and every character passed into the parser.
   As an optimization, it may make sense to have our own inlineable version of
   g_utf8_get_char() (TODO).
   Also, Unicode glyphs in syntactically significant positions may be case-folded -
   just like ASCII chars were. This is is of course slower than case folding
   ASCII. The impact of this should be measured and perhaps we should restrict
   case folding to a-z via teco_ascii_toupper().
 * The language itself does not use any non-ANSI characters, so you don't have to
   use UTF-8 characters.
 * Wherever the parser expects a single character, it will now accept an arbitrary
   Unicode/UTF-8 glyph as well.
   In other words, you can call macros like M§ instead of having to write M[§].
   You can also get the codepoint of any Unicode character with ^^x.
   Pressing an Unicode character in the start state or in Ex and Fx will now
   give a sane error message.
 * When pressing a key which produces a multi-byte UTF-8 sequence, the character
   gets translated back and forth multiple times:
   1. It's converted to an UTF-8 string, either buffered or by IME methods (Gtk).
      On Curses we could directly get a wide char using wget_wch(), but it's
      not currently used, so we don't depend on widechar curses.
   2. Parsed into gunichar for passing into the edit command callbacks.
      This also validates the codepoint - everything later on can assume valid
      codepoints and valid UTF-8 strings.
   3. Once the edit command handling decides to insert the key into the command line,
      it is serialized back into an UTF-8 string as the command line macro has
      to be in UTF-8 (like all other macros).
   4. The parser reads back gunichars without validation for passing into
      the parser callbacks.
 * Flickering in the Curses UI and Pango warnings in Gtk, due to incompletely
   inserted and displayed UTF-8 sequences, are now fixed.
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
The following rules apply:
 * All SciTECO macros __must__ be in valid UTF-8, regardless of the
   the register's configured encoding.
   This is checked against before execution, so we can use glib's non-validating
   UTF-8 API afterwards.
 * Things will inevitably get slower as we have to validate all macros first
   and convert to gunichar for each and every character passed into the parser.
   As an optimization, it may make sense to have our own inlineable version of
   g_utf8_get_char() (TODO).
   Also, Unicode glyphs in syntactically significant positions may be case-folded -
   just like ASCII chars were. This is is of course slower than case folding
   ASCII. The impact of this should be measured and perhaps we should restrict
   case folding to a-z via teco_ascii_toupper().
 * The language itself does not use any non-ANSI characters, so you don't have to
   use UTF-8 characters.
 * Wherever the parser expects a single character, it will now accept an arbitrary
   Unicode/UTF-8 glyph as well.
   In other words, you can call macros like M§ instead of having to write M[§].
   You can also get the codepoint of any Unicode character with ^^x.
   Pressing an Unicode character in the start state or in Ex and Fx will now
   give a sane error message.
 * When pressing a key which produces a multi-byte UTF-8 sequence, the character
   gets translated back and forth multiple times:
   1. It's converted to an UTF-8 string, either buffered or by IME methods (Gtk).
      On Curses we could directly get a wide char using wget_wch(), but it's
      not currently used, so we don't depend on widechar curses.
   2. Parsed into gunichar for passing into the edit command callbacks.
      This also validates the codepoint - everything later on can assume valid
      codepoints and valid UTF-8 strings.
   3. Once the edit command handling decides to insert the key into the command line,
      it is serialized back into an UTF-8 string as the command line macro has
      to be in UTF-8 (like all other macros).
   4. The parser reads back gunichars without validation for passing into
      the parser callbacks.
 * Flickering in the Curses UI and Pango warnings in Gtk, due to incompletely
   inserted and displayed UTF-8 sequences, are now fixed.
</pre>
</div>
</content>
</entry>
<entry>
<title>win32: convert command line to UTF-8 (refs #5)</title>
<updated>2024-09-10T09:43:01+00:00</updated>
<author>
<name>Robin Haberkorn</name>
<email>robin.haberkorn@googlemail.com</email>
</author>
<published>2024-09-09T19:34:52+00:00</published>
<link rel='alternate' type='text/html' href='https://git.fmsbw.de/sciteco/commit/?id=4a4ffd9379ad247f0a303aa58d297c93dce0adcd'/>
<id>4a4ffd9379ad247f0a303aa58d297c93dce0adcd</id>
<content type='text'>
* Should prevent data loss due to system locale conversions
  when parsing command line arguments.
* Should also fix passing Unicode arguments to munged macros and
  therefore opening files via ~/.teco_ini.
* The entire option parsing is based on GStrv (null-terminated string lists)
  now, also on UNIX.
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
* Should prevent data loss due to system locale conversions
  when parsing command line arguments.
* Should also fix passing Unicode arguments to munged macros and
  therefore opening files via ~/.teco_ini.
* The entire option parsing is based on GStrv (null-terminated string lists)
  now, also on UNIX.
</pre>
</div>
</content>
</entry>
<entry>
<title>input and displaying of Unicode characters is now possible (refs #5)</title>
<updated>2024-09-09T16:16:07+00:00</updated>
<author>
<name>Robin Haberkorn</name>
<email>robin.haberkorn@googlemail.com</email>
</author>
<published>2024-08-28T10:59:05+00:00</published>
<link rel='alternate' type='text/html' href='https://git.fmsbw.de/sciteco/commit/?id=4c6b6814abfc9c022c6ea8d1e23097c2a774fde5'/>
<id>4c6b6814abfc9c022c6ea8d1e23097c2a774fde5</id>
<content type='text'>
* All non-ASCII characters are inserted as Unicode.
  On Curses, this also requires a properly set up locale.
* We still do not need any widechar Curses, as waddch() handles
  multibyte characters on ncurses.
  We will see whether there is any Curses variant that strictly requires
  wadd_wch().
  If this will be an exception, we might keep both widechar and non-widechar
  support.
* By convention gsize is used exclusively for byte sizes.
  Character offsets or lengths use int or long.
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
* All non-ASCII characters are inserted as Unicode.
  On Curses, this also requires a properly set up locale.
* We still do not need any widechar Curses, as waddch() handles
  multibyte characters on ncurses.
  We will see whether there is any Curses variant that strictly requires
  wadd_wch().
  If this will be an exception, we might keep both widechar and non-widechar
  support.
* By convention gsize is used exclusively for byte sizes.
  Character offsets or lengths use int or long.
</pre>
</div>
</content>
</entry>
<entry>
<title>updated copyright to 2024</title>
<updated>2024-01-21T11:45:05+00:00</updated>
<author>
<name>Robin Haberkorn</name>
<email>robin.haberkorn@googlemail.com</email>
</author>
<published>2024-01-21T11:07:29+00:00</published>
<link rel='alternate' type='text/html' href='https://git.fmsbw.de/sciteco/commit/?id=1cecf04656532e94e1fe9fe25460774324b2197c'/>
<id>1cecf04656532e94e1fe9fe25460774324b2197c</id>
<content type='text'>
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
</pre>
</div>
</content>
</entry>
</feed>
