sciteco/src, branch hsrex

fixup

2024-09-14T19:14:57+00:00

imported Henry Spencer's regex implementation from Tcl

2024-09-14T17:00:01+00:00

Source: github.com/garyhouston/hsrex

* This version should be a Thompson NFA, using backtracking only
  for backreferences, so it should be much safer than PCRE (GRegex).
  Search times should be linear and there should be no way to cause
  stack overflows (unless we would generate backreferences).
* Importing the lib makes sure we don't add another compile-time
  dependency. Also, we could implement our own regcomp() which
  translates directly from TECO patterns.
* This is still WIP and currently only works with the ASCII version.
  The widechar version does not define re_comp() and re_exec().
* Apparently we can't have an ASCII and widechar version at the same time,
  so we must build two libtool libraries and somehow mangle the names.
* Ideally the widechar version will also work with UTF-8 strings.
* An alternative might be to import the Gnulib regex module.
  How does it choose the encoding anyway?
* Or we could just use Oniguruma - but this would have to be a new
  external library dependency.

remaining types of program counters changed to gsize/gssize

2024-09-12T23:49:22+00:00

* This fixes F< to the beginning of the macro, which was broken in 73d574b71a10d4661ada20275cafde75aff6c1ba.
  teco_machine_main_t::macro_pc actually has to be signed as it is sometimes set to -1.
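
The general pitfall behind this fix can be sketched as follows. This is only an illustration, not SciTECO code: the function names are hypothetical, and `ptrdiff_t`/`size_t` stand in for glib's gssize/gsize so that the sketch builds without glib. Why macro_pc is set to -1 is not covered here.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical loop condition with a signed program counter:
 * a pc of -1 compares as "before the start" and the loop continues.
 */
static bool would_continue_signed(ptrdiff_t pc, ptrdiff_t len)
{
	return pc < len;
}

/*
 * The same condition with an unsigned program counter:
 * -1 wraps around to SIZE_MAX, so the comparison fails.
 */
static bool would_continue_unsigned(size_t pc, size_t len)
{
	return pc < len;
}
```

With a length of 10, `would_continue_signed(-1, 10)` holds, while `would_continue_unsigned((size_t)-1, 10)` does not.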

fixup abb5d23eba21a2aafda0346c0c5dd845561b2aa2: commandline glitches after errors

2024-09-12T23:26:43+00:00

* teco_cmdline.pc is not correct after an error occurred.
  Therefore start_pc is initialized with teco_cmdline.effective_len.

fixed up 68578072bfaf6054a96bb6bcedfccb6e56a508fe: negative numbers weren't parsed correctly

2024-09-12T23:07:31+00:00

function key macros have been reworked into a more generic key macro feature

2024-09-12T14:44:13+00:00

* ALL keypresses (the UTF-8 sequences resulting from key presses) can now be remapped.
* This is especially useful with Unicode support, as you might want to alias
  international characters to their corresponding Latin form in the start state,
  so you don't have to change keyboard layouts so often.
  This is done automatically in Gtk, where we have hardware key press information,
  but has to be done with key macros in Curses.
  There is a new key mask 4 (bit 3) for that purpose now.
* Also, you might want to define non-ANSI letters to perform special functions in
  the start state, where they wouldn't be accepted by the parser anyway.
  Suppose you have a macro M→, you could define
  @^U[^K→]{m→} 1^_U[^K→]
  This effectively "extends" the parser and allows you to call macro "→" with a single
  key press. See also #5.
* The register prefix has been changed from ^F (for function) to ^K (for key).
  This is the only thing you have to change in order to migrate existing
  function key macros.
* Key macros are enabled by default. There is no longer any way to disable
  function key handling in curses, as I never found any reason or need to disable it.
  Theoretically, the default ESCDELAY could turn out to be too small, so that function
  keys don't get through. I doubt that's possible except on extremely slow serial lines.
  Even then, you should increase ESCDELAY and, instead of disabling function keys,
  simply define an escape surrogate.
* The ED flag has been removed and its place is reserved for a future mouse support flag
  (which does make sense to disable in curses sometimes).
  fnkeys.tes is consequently also enabled by default in sample.teco_ini.
* Key macros are handled as a unit. If one character results in an error,
  the entire string is rubbed out.
  This fixes the "CLOSE" key on Gtk.
  It also makes sure that the original error message is preserved and not overwritten
  by some subsequent syntax error.
  It was never useful that we kept inserting characters after the first error.

teco_string_get_coord() returns character offsets now (refs #5)

2024-09-12T14:42:08+00:00

* This is used for error messages (TECO macro stackframes),
  so it's important to display columns in characters.
* Program counters are in bytes and therefore everywhere gsize.
  This is by glib convention.
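
A minimal sketch of the byte-to-character conversion this implies, assuming the string has already been validated as UTF-8. The helper name is hypothetical and this is not the actual teco_string_get_coord() implementation; `size_t` stands in for gsize.

```c
#include <stddef.h>

/*
 * Hypothetical helper: number of characters contained in the
 * first `pos` bytes of the valid UTF-8 string `str`.
 * Every byte that is not a continuation byte (10xxxxxx)
 * starts a new character.
 */
static size_t utf8_byte_to_char_offset(const char *str, size_t pos)
{
	size_t chars = 0;

	for (size_t i = 0; i < pos; i++)
		if (((unsigned char)str[i] & 0xC0) != 0x80)
			chars++;

	return chars;
}
```

For the macro text `M→x`, where `→` occupies 3 bytes, a program counter of 4 bytes corresponds to a character offset of 2.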

improved file name autocompletion

2024-09-11T14:14:27+00:00

* pressing ^W in FG now deletes the entire directory component as in EB
* commands without glob patterns (e.g. EW) can now autocomplete file names containing
  glob characters
* When the autocompletion contains a glob character in commands accepting
  glob patterns like EB or EN, we now escape the glob pattern.
  This already helps if the remaining file name can be autocompleted in one go.
  Unfortunately, this is still insufficient if we can only partially complete
  and the partial completion contains glob characters.
  For instance, if there are 2 files: `file?.txt` and `file?.foo`,
  completing after `f` will insert `ile[?].`.
  Pressing Tab a second time will then do nothing.
  To fully support these cases, we need a version of teco_file_auto_complete()
  accepting glob patterns.
  Perhaps we can simply append `*` to the given glob pattern.
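
The bracket escaping shown in the `ile[?].` example can be sketched like this. The helper name is hypothetical, and the set of escaped characters (`*`, `?` and `[`) is an assumption; the actual code may treat a different set as glob characters.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: returns a newly allocated copy of `name`
 * with every glob character wrapped in brackets, so that it
 * matches only itself. The caller must free() the result.
 * Returns NULL if the allocation fails.
 */
static char *glob_escape(const char *name)
{
	/* in the worst case, every byte expands to 3 bytes */
	char *ret = malloc(3 * strlen(name) + 1);
	char *p = ret;

	if (!ret)
		return NULL;

	for (; *name; name++) {
		if (strchr("*?[", *name)) {
			*p++ = '[';
			*p++ = *name;
			*p++ = ']';
		} else {
			*p++ = *name;
		}
	}
	*p = '\0';

	return ret;
}
```

Escaping the completion `ile?.` this way yields `ile[?].`, as in the example above.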

fixed searches in single-byte encoded documents

2024-09-11T14:14:27+00:00

* while code is guaranteed to be in valid UTF-8, this cannot be
  said about the result of string building.
* The search pattern can end up with invalid Unicode bytes even when
  searching on UTF-8 buffers, e.g. if ^EQq inserts garbage.
  There are currently no checks.
* When searching on a raw buffer, it must be possible to
  search for arbitrary bytes (^EUq).
  Since teco_pattern2regexp() was always expecting clean UTF-8 input,
  this would sometimes skip over too many bytes and could even crash.
* Instead, teco_pattern2regexp() now takes the target codepage
  into account.
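
One way to step through a pattern without skipping too many bytes is sketched below. This is an assumption about the general technique, not the actual teco_pattern2regexp() logic; the helper name and the `is_utf8` parameter are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: length in bytes of the character at str[0],
 * with `len` (> 0) bytes remaining in the pattern.
 * In single-byte codepages, every byte is a character.
 * In UTF-8, the length is derived from the lead byte, but never
 * exceeds the remaining bytes. Invalid lead bytes count as 1 byte,
 * so garbage in the pattern cannot cause an overrun.
 */
static size_t pattern_char_len(const char *str, size_t len, bool is_utf8)
{
	unsigned char c = (unsigned char)str[0];
	size_t n;

	if (!is_utf8 || c < 0x80)
		n = 1;
	else if ((c & 0xE0) == 0xC0)
		n = 2;
	else if ((c & 0xF0) == 0xE0)
		n = 3;
	else if ((c & 0xF8) == 0xF0)
		n = 4;
	else
		n = 1; /* stray continuation byte or invalid lead byte */

	return n <= len ? n : len;
}
```

The same three bytes of `→` are consumed as one character on a UTF-8 buffer, but as three separate characters on a raw buffer.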

the SciTECO parser is Unicode-based now (refs #5)

2024-09-11T14:14:27+00:00

The following rules apply:

* All SciTECO macros __must__ be in valid UTF-8, regardless of the
  register's configured encoding.
  This is checked before execution, so we can use glib's non-validating
  UTF-8 API afterwards.
* Things will inevitably get slower as we have to validate all macros first
  and convert to gunichar for each and every character passed into the parser.
  As an optimization, it may make sense to have our own inlineable version of
  g_utf8_get_char() (TODO).
  Also, Unicode glyphs in syntactically significant positions may be
  case-folded - just like ASCII chars were.
  This is of course slower than case folding ASCII.
  The impact of this should be measured and perhaps we should restrict
  case folding to a-z via teco_ascii_toupper().
* The language itself does not use any non-ANSI characters, so you don't
  have to use UTF-8 characters.
* Wherever the parser expects a single character, it will now accept an
  arbitrary Unicode/UTF-8 glyph as well.
  In other words, you can call macros like M§ instead of having to write M[§].
  You can also get the codepoint of any Unicode character with ^^x.
  Pressing a Unicode character in the start state or in Ex and Fx will now
  give a sane error message.
* When pressing a key which produces a multi-byte UTF-8 sequence, the character
  gets translated back and forth multiple times:
  1. It's converted to a UTF-8 string, either buffered or by IME methods (Gtk).
     On Curses we could directly get a wide char using wget_wch(), but it's
     not currently used, so we don't depend on widechar curses.
  2. Parsed into gunichar for passing into the edit command callbacks.
     This also validates the codepoint - everything later on can assume
     valid codepoints and valid UTF-8 strings.
  3. Once the edit command handling decides to insert the key into the
     command line, it is serialized back into a UTF-8 string as the
     command line macro has to be in UTF-8 (like all other macros).
  4. The parser reads back gunichars without validation for passing into
     the parser callbacks.
* Flickering in the Curses UI and Pango warnings in Gtk, due to incompletely
  inserted and displayed UTF-8 sequences, are now fixed.
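
The validate-once, then re-serialize round trip for key presses can be sketched in plain C as follows. glib provides g_utf8_get_char_validated() and g_unichar_to_utf8() for these steps; the helpers below are hypothetical stand-ins so that the sketch builds without glib, and they are not SciTECO's actual code.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical validating decoder: returns the codepoint of the
 * first character in str (len bytes) or -1 if it is truncated,
 * overlong, a surrogate or otherwise invalid.
 */
static int32_t utf8_decode(const char *str, size_t len)
{
	const unsigned char *s = (const unsigned char *)str;
	uint32_t cp, min;
	size_t n;

	if (len == 0)
		return -1;
	if (s[0] < 0x80)
		return s[0];
	else if ((s[0] & 0xE0) == 0xC0) { n = 2; cp = s[0] & 0x1F; min = 0x80; }
	else if ((s[0] & 0xF0) == 0xE0) { n = 3; cp = s[0] & 0x0F; min = 0x800; }
	else if ((s[0] & 0xF8) == 0xF0) { n = 4; cp = s[0] & 0x07; min = 0x10000; }
	else
		return -1;
	if (len < n)
		return -1;

	for (size_t i = 1; i < n; i++) {
		if ((s[i] & 0xC0) != 0x80)
			return -1;
		cp = (cp << 6) | (s[i] & 0x3F);
	}
	if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
		return -1;

	return (int32_t)cp;
}

/*
 * Hypothetical encoder: writes cp into buf (at least 4 bytes)
 * and returns the number of bytes written, or 0 if cp is invalid.
 */
static size_t utf8_encode(uint32_t cp, char *buf)
{
	if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
		return 0;
	if (cp < 0x80) {
		buf[0] = (char)cp;
		return 1;
	}
	if (cp < 0x800) {
		buf[0] = (char)(0xC0 | (cp >> 6));
		buf[1] = (char)(0x80 | (cp & 0x3F));
		return 2;
	}
	if (cp < 0x10000) {
		buf[0] = (char)(0xE0 | (cp >> 12));
		buf[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
		buf[2] = (char)(0x80 | (cp & 0x3F));
		return 3;
	}
	buf[0] = (char)(0xF0 | (cp >> 18));
	buf[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
	buf[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
	buf[3] = (char)(0x80 | (cp & 0x3F));
	return 4;
}
```

Decoding the two bytes of `§` gives U+00A7, and encoding U+00A7 gives the same two bytes back, which corresponds to steps 2 and 3 above.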