<feed xmlns='http://www.w3.org/2005/Atom'>
<title>sciteco/src/parser.h, branch v2.5.2</title>
<subtitle>Scintilla-based Text Editor and COrrector</subtitle>
<link rel='alternate' type='text/html' href='https://git.fmsbw.de/sciteco/'/>
<entry>
<title>updated copyright to 2026</title>
<updated>2026-01-01T06:59:49+00:00</updated>
<author>
<name>Robin Haberkorn</name>
<email>rhaberkorn@fmsbw.de</email>
</author>
<published>2026-01-01T06:59:49+00:00</published>
<link rel='alternate' type='text/html' href='https://git.fmsbw.de/sciteco/commit/?id=c2feb2a6f71fc9adb20226fb3c2260c236e974e0'/>
<id>c2feb2a6f71fc9adb20226fb3c2260c236e974e0</id>
<content type='text'>
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
</pre>
</div>
</content>
</entry>
<entry>
<title>teco_string_t is now passed by value like a scalar if the callee isn't expected to modify it</title>
<updated>2025-12-28T19:57:31+00:00</updated>
<author>
<name>Robin Haberkorn</name>
<email>rhaberkorn@fmsbw.de</email>
</author>
<published>2025-12-28T15:23:22+00:00</published>
<link rel='alternate' type='text/html' href='https://git.fmsbw.de/sciteco/commit/?id=ea0a23645f03a42252ab1ce8df45ae4076ebae75'/>
<id>ea0a23645f03a42252ab1ce8df45ae4076ebae75</id>
<content type='text'>
* When passing a struct that should not be modified, I usually use a const pointer.
* Strings however are small 2-word objects and they are often now already passed via separate
  `gchar*` and gsize parameters. So it is consistent to pass teco_string_t by value as well.
  A teco_string_t will usually fit into registers just like a pointer.
* It's now obvious which function just _uses_ and which function _modifies_ a string.
  There is also no chance to pass a NULL pointer to those functions.
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
* When passing a struct that should not be modified, I usually use a const pointer.
* Strings however are small 2-word objects and they are often now already passed via separate
  `gchar*` and gsize parameters. So it is consistent to pass teco_string_t by value as well.
  A teco_string_t will usually fit into registers just like a pointer.
* It's now obvious which function just _uses_ and which function _modifies_ a string.
  There is also no chance to pass a NULL pointer to those functions.
</pre>
</div>
</content>
</entry>
<entry>
<title>TECO_DEFINE_STATE() no longer constructs callback names for mandatory callbacks, but tries to use static assertions</title>
<updated>2025-12-26T17:10:42+00:00</updated>
<author>
<name>Robin Haberkorn</name>
<email>rhaberkorn@fmsbw.de</email>
</author>
<published>2025-12-26T17:10:42+00:00</published>
<link rel='alternate' type='text/html' href='https://git.fmsbw.de/sciteco/commit/?id=c2114fa0af73b42bc1ef302f7511ef87690cc0b1'/>
<id>c2114fa0af73b42bc1ef302f7511ef87690cc0b1</id>
<content type='text'>
* Requiring state callbacks by generating their names (e.g. NAME##_input) has several disadvantages:
  * The callback is not explicitly referenced when the state is defined.
    So an unintroduced reader will see some static function, which is nowhere referenced and still
    doesn't cause "unused" warnings.
  * You cannot choose the name of function that implements the callback freely.
  * In "substates" you need to generate a callback function if you want to provide a default.
    You also need to provide dummy wrapper functions whenever you want to reuse some existing
    function as the implementation.
* Instead, we are now using static assertions to check whether certain callbacks have been
  implemented.
  Unfortunately, this does not work on all compilers. In particular GCC won't consider
  references to state objects fully constant (even though they are) and does not allow
  them in _Static_assert (G_STATIC_ASSERT). This could only be made to work in newer GCC
  with -std=c2x or -std=gnu23 in combination with constexpr.
  It does work on Clang, though.
  So I introduced TECO_ASSERT_SAFE() which also passes if the expression is *not* constant.
  These static assertions are not crucial - they do not check anything that can differ between
  systems. So we can always rely on the checks performed by FreeBSD CI for instance.
  Also, you will of course quickly notice missing callbacks at runtime - with and without
  additional runtime assertions.
* All mandatory callbacks must still be explicitly initialized in the TECO_DEFINE_STATE calls.
* After getting rid of generated callback implementations, the TECO_DEFINE_STATE macros
  can finally be qualified with `static`.
* The TECO_DECLARE_STATE() macro has been removed. It no longer abstracts anything
  and cannot be used to declare static teco_state_t anyway.
  Also TECO_DEFINE_UNDO_CALL() also doesn't have a DECLARE counterpart.
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
* Requiring state callbacks by generating their names (e.g. NAME##_input) has several disadvantages:
  * The callback is not explicitly referenced when the state is defined.
    So an unintroduced reader will see some static function, which is nowhere referenced and still
    doesn't cause "unused" warnings.
  * You cannot choose the name of function that implements the callback freely.
  * In "substates" you need to generate a callback function if you want to provide a default.
    You also need to provide dummy wrapper functions whenever you want to reuse some existing
    function as the implementation.
* Instead, we are now using static assertions to check whether certain callbacks have been
  implemented.
  Unfortunately, this does not work on all compilers. In particular GCC won't consider
  references to state objects fully constant (even though they are) and does not allow
  them in _Static_assert (G_STATIC_ASSERT). This could only be made to work in newer GCC
  with -std=c2x or -std=gnu23 in combination with constexpr.
  It does work on Clang, though.
  So I introduced TECO_ASSERT_SAFE() which also passes if the expression is *not* constant.
  These static assertions are not crucial - they do not check anything that can differ between
  systems. So we can always rely on the checks performed by FreeBSD CI for instance.
  Also, you will of course quickly notice missing callbacks at runtime - with and without
  additional runtime assertions.
* All mandatory callbacks must still be explicitly initialized in the TECO_DEFINE_STATE calls.
* After getting rid of generated callback implementations, the TECO_DEFINE_STATE macros
  can finally be qualified with `static`.
* The TECO_DECLARE_STATE() macro has been removed. It no longer abstracts anything
  and cannot be used to declare static teco_state_t anyway.
  Also TECO_DEFINE_UNDO_CALL() also doesn't have a DECLARE counterpart.
</pre>
</div>
</content>
</entry>
<entry>
<title>throw an error immediately after nEB if n != 0</title>
<updated>2025-10-06T21:58:11+00:00</updated>
<author>
<name>Robin Haberkorn</name>
<email>rhaberkorn@fmsbw.de</email>
</author>
<published>2025-10-06T21:58:11+00:00</published>
<link rel='alternate' type='text/html' href='https://git.fmsbw.de/sciteco/commit/?id=a62323baa454b4a6ac698067043c8d4cf50c6561'/>
<id>a62323baa454b4a6ac698067043c8d4cf50c6561</id>
<content type='text'>
* When typing nEBfilename$ (n != 0) you would find out that the construct is invalid
  only after typing out the entire command.
  We now throw an error immediately, ie. only Escape or string termination will be expected
  in interactive mode.
* In batch mode, nothing should have changed.
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
* When typing nEBfilename$ (n != 0) you would find out that the construct is invalid
  only after typing out the entire command.
  We now throw an error immediately, ie. only Escape or string termination will be expected
  in interactive mode.
* In batch mode, nothing should have changed.
</pre>
</div>
</content>
</entry>
<entry>
<title>fixed serious bug with certain alternative string termination chars in commands with multiple string arguments</title>
<updated>2025-08-02T10:16:16+00:00</updated>
<author>
<name>Robin Haberkorn</name>
<email>robin.haberkorn@googlemail.com</email>
</author>
<published>2025-08-02T10:16:16+00:00</published>
<link rel='alternate' type='text/html' href='https://git.fmsbw.de/sciteco/commit/?id=e46352bc614cf9777ca76deb47330fb408bc1a23'/>
<id>e46352bc614cf9777ca76deb47330fb408bc1a23</id>
<content type='text'>
* When `@`-modifying a command with several string arguments and choosing `{` as the alternative
  string termination character, the parser would get totally confused.
  Any sequence of `{` would be ignored and only the first non-`{` would become the termination character.
  Consequently you also couldn't choose a new terminator after the closing `}`.
  So even a documented code example from sciteco(7) wouldn't work.
  The same was true when using $ (escape) or ^A as the alternative termination character.
* We can now correctly parse e.g. `@FR{foo}{bar}` or `@FR$foo$bar$` (even though the
  latter one is quite pointless).
* has probably been broken forever (has been broken even before v2.0).
* Whitespace is now ignored in front of alternative termination characters as in TECO-64, so
  we can also write `@S /foo/` or even
  ```
  @^Um
  {
    !* blabla *!
  }
  ```
  I wanted to disallow whitespace termination characters, so the alternative would have been
  to throw an error.
  The new implementation at least adds some functionality.
  * Avoid redundancies when parsing no-op characters via teco_is_noop().
    I assume that this is inlined and drawn into any jump-table what would be
    generated for the switch-statement in teco_state_start_input().
 * Alternative termination characters are still case-folded, even if they are Unicode glyphs,
   so `@IЖfooж` would work and insert `foo`.
   This should perhaps be restricted to ANSI characters?
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
* When `@`-modifying a command with several string arguments and choosing `{` as the alternative
  string termination character, the parser would get totally confused.
  Any sequence of `{` would be ignored and only the first non-`{` would become the termination character.
  Consequently you also couldn't choose a new terminator after the closing `}`.
  So even a documented code example from sciteco(7) wouldn't work.
  The same was true when using $ (escape) or ^A as the alternative termination character.
* We can now correctly parse e.g. `@FR{foo}{bar}` or `@FR$foo$bar$` (even though the
  latter one is quite pointless).
* has probably been broken forever (has been broken even before v2.0).
* Whitespace is now ignored in front of alternative termination characters as in TECO-64, so
  we can also write `@S /foo/` or even
  ```
  @^Um
  {
    !* blabla *!
  }
  ```
  I wanted to disallow whitespace termination characters, so the alternative would have been
  to throw an error.
  The new implementation at least adds some functionality.
  * Avoid redundancies when parsing no-op characters via teco_is_noop().
    I assume that this is inlined and drawn into any jump-table what would be
    generated for the switch-statement in teco_state_start_input().
 * Alternative termination characters are still case-folded, even if they are Unicode glyphs,
   so `@IЖfooж` would work and insert `foo`.
   This should perhaps be restricted to ANSI characters?
</pre>
</div>
</content>
</entry>
<entry>
<title>fixed minor memory leaks of per-state data in teco_machine_main_t</title>
<updated>2025-07-17T21:34:56+00:00</updated>
<author>
<name>Robin Haberkorn</name>
<email>robin.haberkorn@googlemail.com</email>
</author>
<published>2025-07-17T21:34:56+00:00</published>
<link rel='alternate' type='text/html' href='https://git.fmsbw.de/sciteco/commit/?id=3a2583e918bcc805fe860252f8a520fc2f9b26ce'/>
<id>3a2583e918bcc805fe860252f8a520fc2f9b26ce</id>
<content type='text'>
* These were leaked e.g. in case of end-of-macro errors,
  but also in case of syntax highlighting (teco_lexer_style()).
  I considered to solve this by overwriting more of the end_of_macro_cb,
  but it didn't turn out to be trivial always.
* Considering that the union in teco_machine_main_t saved only 3 machine words
  of memory, I decided to sacrifice those for more robust memory management.
* teco_machine_qregspec_t cannot be directly embedded into teco_machine_main_t
  due to recursive dependencies with teco_machine_stringbuilding_t.
  It could now and should perhaps be allocated only once in teco_machine_main_init(),
  but it would require more refactoring.
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
* These were leaked e.g. in case of end-of-macro errors,
  but also in case of syntax highlighting (teco_lexer_style()).
  I considered to solve this by overwriting more of the end_of_macro_cb,
  but it didn't turn out to be trivial always.
* Considering that the union in teco_machine_main_t saved only 3 machine words
  of memory, I decided to sacrifice those for more robust memory management.
* teco_machine_qregspec_t cannot be directly embedded into teco_machine_main_t
  due to recursive dependencies with teco_machine_stringbuilding_t.
  It could now and should perhaps be allocated only once in teco_machine_main_init(),
  but it would require more refactoring.
</pre>
</div>
</content>
</entry>
<entry>
<title>implemented ^E&lt;code&gt; string building constructs for embedding bytes and codepoints in a strtoul()-like manner</title>
<updated>2025-07-03T14:04:01+00:00</updated>
<author>
<name>Robin Haberkorn</name>
<email>robin.haberkorn@googlemail.com</email>
</author>
<published>2025-07-03T13:21:37+00:00</published>
<link rel='alternate' type='text/html' href='https://git.fmsbw.de/sciteco/commit/?id=7bc7662f3cd1ceaf55e00f3d5f84e9772574afc8'/>
<id>7bc7662f3cd1ceaf55e00f3d5f84e9772574afc8</id>
<content type='text'>
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
</pre>
</div>
</content>
</entry>
<entry>
<title>new string building construct ^P disables all further string building magic</title>
<updated>2025-05-23T22:31:22+00:00</updated>
<author>
<name>Robin Haberkorn</name>
<email>robin.haberkorn@googlemail.com</email>
</author>
<published>2025-05-23T22:31:22+00:00</published>
<link rel='alternate' type='text/html' href='https://git.fmsbw.de/sciteco/commit/?id=4b266c9616f4eb359be71c44b9b2fa3373265bb0'/>
<id>4b266c9616f4eb359be71c44b9b2fa3373265bb0</id>
<content type='text'>
* Now, `I^P` can replace `EI`.
  EI is therefore now free to be repurposed as the new "mung file" command for improved TECO-11 compatibility.
* On the downside when inserting large blocks of TECO code, you will have to write something like
  `@I{^P !...! }`
* The construct is also useful when searching for carets as in `S^P^Q^`.
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
* Now, `I^P` can replace `EI`.
  EI is therefore now free to be repurposed as the new "mung file" command for improved TECO-11 compatibility.
* On the downside when inserting large blocks of TECO code, you will have to write something like
  `@I{^P !...! }`
* The construct is also useful when searching for carets as in `S^P^Q^`.
</pre>
</div>
</content>
</entry>
<entry>
<title>fixed undoing bitfields on Windows</title>
<updated>2025-04-12T22:33:43+00:00</updated>
<author>
<name>Robin Haberkorn</name>
<email>robin.haberkorn@googlemail.com</email>
</author>
<published>2025-04-12T21:40:37+00:00</published>
<link rel='alternate' type='text/html' href='https://git.fmsbw.de/sciteco/commit/?id=628c73d984fd7663607cc3fd9368f809855906fd'/>
<id>628c73d984fd7663607cc3fd9368f809855906fd</id>
<content type='text'>
* It turns out that `bool` (_Bool) in bitfields may cause
  padding to the next 32-bit word.
  This was only observed on MinGW.
  I am not entirely sure why, although the C standard does
  not guarantee much with regard to bitfield memory layout
  and there are 64-bit available due to passing anyway.
  Actually, they could also be layed out in a different order.
* I am now consistently using guint instead of `bool` in bitfields
  to prevent any potential surprises.
* The way that guint was aliased with bitfield structs
  for undoing teco_machine_main_t and teco_machine_qregspec_t flags
  was therefore insecure.
  It was not guaranteed that the __flags field really "captures"
  all of the bit field.
  Even with `guint v : 1` fields, this was not guaranteed.
  We would have required a static assertion for robustness.
  Alternatively, we could have declared a `gsize __flags` variable
  as well. This __should__ be safe since gsize should always be
  pointer sized and correspond to the platform's alignment.
  However, it's also not 100% guaranteed.
  Using classic ANSI C enums with bit operations to encode multiple
  fields and flags into a single integer also doesn't look very
  attractive.
* Instead, we now define scalar types with their own teco_undo_push()
  shortcuts for the bitfield structs.
  This is in one way simpler and much more robust, but on the other
  hand complicates access to the flag variables.
* It's a good question whether a `struct __attribute__((packed))` bitfield
  with guint fields would be a reliable replacement for flag enums, that
  are communicated with the "outside" (TECO) world.
  I am not going to risk it until GCC gives any guarantees, though.
  For the time being, bitfields are only used internally where
  the concrete memory layout (bit positions) is not crucial.
* This fixes the test suite and therefore probably CI and nightly
  builds on Windows.
* Test case: Rub out `@I//` or `@Xq` until before the `@`.
  The parser doesn't know that `@` is still set and allows
  all sorts of commands where `@` should be forbidden.
* It's unknown how long this has been broken on Windows - quite
  possibly since v2.0.
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
* It turns out that `bool` (_Bool) in bitfields may cause
  padding to the next 32-bit word.
  This was only observed on MinGW.
  I am not entirely sure why, although the C standard does
  not guarantee much with regard to bitfield memory layout
  and there are 64-bit available due to passing anyway.
  Actually, they could also be layed out in a different order.
* I am now consistently using guint instead of `bool` in bitfields
  to prevent any potential surprises.
* The way that guint was aliased with bitfield structs
  for undoing teco_machine_main_t and teco_machine_qregspec_t flags
  was therefore insecure.
  It was not guaranteed that the __flags field really "captures"
  all of the bit field.
  Even with `guint v : 1` fields, this was not guaranteed.
  We would have required a static assertion for robustness.
  Alternatively, we could have declared a `gsize __flags` variable
  as well. This __should__ be safe since gsize should always be
  pointer sized and correspond to the platform's alignment.
  However, it's also not 100% guaranteed.
  Using classic ANSI C enums with bit operations to encode multiple
  fields and flags into a single integer also doesn't look very
  attractive.
* Instead, we now define scalar types with their own teco_undo_push()
  shortcuts for the bitfield structs.
  This is in one way simpler and much more robust, but on the other
  hand complicates access to the flag variables.
* It's a good question whether a `struct __attribute__((packed))` bitfield
  with guint fields would be a reliable replacement for flag enums, that
  are communicated with the "outside" (TECO) world.
  I am not going to risk it until GCC gives any guarantees, though.
  For the time being, bitfields are only used internally where
  the concrete memory layout (bit positions) is not crucial.
* This fixes the test suite and therefore probably CI and nightly
  builds on Windows.
* Test case: Rub out `@I//` or `@Xq` until before the `@`.
  The parser doesn't know that `@` is still set and allows
  all sorts of commands where `@` should be forbidden.
* It's unknown how long this has been broken on Windows - quite
  possibly since v2.0.
</pre>
</div>
</content>
</entry>
<entry>
<title>tightened rules for specifying modifiers</title>
<updated>2025-04-08T21:33:40+00:00</updated>
<author>
<name>Robin Haberkorn</name>
<email>robin.haberkorn@googlemail.com</email>
</author>
<published>2025-04-08T20:26:38+00:00</published>
<link rel='alternate' type='text/html' href='https://git.fmsbw.de/sciteco/commit/?id=7c0e4fbb1d1f0d19d11c7417c55a305654ab1c83'/>
<id>7c0e4fbb1d1f0d19d11c7417c55a305654ab1c83</id>
<content type='text'>
* Instead of separate stand-alone commands, they are now allowed only immediately
  in front of the commands that accept them.
* The order is still insignificant if both `@` and `:` are accepted.
* The number of colon modifiers is now also checked.
  We basically get this for free.
* `@` has syntactic significance, so it could not be set conditionally anyway.
  Still, it was possible to provoke bugs were `@` was interpreted conditionally
  as in `@ 2&lt;I/foo/$&gt;`.
* Even when not causing bugs, a mistyped `@` would often influence the
  __next__ command, causing unexpected behavior, for instance when
  typing `@(233C)W`.
* While it was theoretically possible to set `:` conditionally, it could also
  be "passed through" accidentally to some command where it wasn't expected as in
  `:Ifoo$ C`.
  I do not know of any real useful application or idiom of a conditionally set `:`.
  If there would happen to be some kind of useful application, `:'` and `:|` could
  be re-allowed easily, though.
* I was condidering introducing a common parser state for modified commands,
  but that would have been tricky and introduce a lot of redundant command lists.
  So instead, we now simply everywhere check for excess modifiers.
  To simplify this task, teco_machine_main_transition_t now contains flags
  signaling whether the transition is allowed with `@` or `:` modifiers set.
  It currently only has to be checked in the start state, after `E` and `F`.
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
* Instead of separate stand-alone commands, they are now allowed only immediately
  in front of the commands that accept them.
* The order is still insignificant if both `@` and `:` are accepted.
* The number of colon modifiers is now also checked.
  We basically get this for free.
* `@` has syntactic significance, so it could not be set conditionally anyway.
  Still, it was possible to provoke bugs were `@` was interpreted conditionally
  as in `@ 2&lt;I/foo/$&gt;`.
* Even when not causing bugs, a mistyped `@` would often influence the
  __next__ command, causing unexpected behavior, for instance when
  typing `@(233C)W`.
* While it was theoretically possible to set `:` conditionally, it could also
  be "passed through" accidentally to some command where it wasn't expected as in
  `:Ifoo$ C`.
  I do not know of any real useful application or idiom of a conditionally set `:`.
  If there would happen to be some kind of useful application, `:'` and `:|` could
  be re-allowed easily, though.
* I was condidering introducing a common parser state for modified commands,
  but that would have been tricky and introduce a lot of redundant command lists.
  So instead, we now simply everywhere check for excess modifiers.
  To simplify this task, teco_machine_main_transition_t now contains flags
  signaling whether the transition is allowed with `@` or `:` modifiers set.
  It currently only has to be checked in the start state, after `E` and `F`.
</pre>
</div>
</content>
</entry>
</feed>
