sciteco/tests, branch hsrex

remaining types of program counters changed to gsize/gssize

2024-09-12T23:49:22+00:00

* This fixes F< to the beginning of the macro, which was broken in 73d574b71a10d4661ada20275cafde75aff6c1ba.
  teco_machine_main_t::macro_pc actually has to be signed as it is sometimes set to -1.

fixed searches in single-byte encoded documents

2024-09-11T14:14:27+00:00

* While code is guaranteed to be in valid UTF-8, this cannot be
  said about the result of string building.
* The search pattern can end up with invalid Unicode bytes even when
  searching on UTF-8 buffers, e.g. if ^EQq inserts garbage.
  There are currently no checks.
* When searching on a raw buffer, it must be possible to
  search for arbitrary bytes (^EUq).
  Since teco_pattern2regexp() was always expecting clean UTF-8 input,
  this would sometimes skip over too many bytes and could even crash.
* Instead, teco_pattern2regexp() now takes the target codepage
  into account.
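
The difference between the two stepping strategies can be sketched as follows. This is only an illustration, not the actual teco_pattern2regexp() code: the `codepage` enum and the `char_len()`/`count_chars()` helpers are hypothetical names. In a single-byte codepage every byte is one character, while in UTF-8 the length comes from the lead byte and has to be clamped to the remaining input so that a stray byte cannot cause an overrun.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical codepage tags, for illustration only. */
enum codepage { CP_SINGLE_BYTE, CP_UTF8 };

/*
 * Length in bytes of the character starting at s[0].
 * Single-byte codepages: always 1.
 * UTF-8: derived from the lead byte, but never more than the
 * remaining input; unexpected lead bytes count as 1 byte.
 */
static size_t char_len(const unsigned char *s, size_t len, enum codepage cp)
{
    size_t n = 1;

    assert(len > 0);
    if (cp == CP_UTF8) {
        if ((s[0] & 0xE0) == 0xC0)
            n = 2;
        else if ((s[0] & 0xF0) == 0xE0)
            n = 3;
        else if ((s[0] & 0xF8) == 0xF0)
            n = 4;
    }
    return n < len ? n : len;
}

/* Number of characters in a pattern under the given codepage. */
static size_t count_chars(const unsigned char *s, size_t len, enum codepage cp)
{
    size_t count = 0;

    while (len > 0) {
        size_t n = char_len(s, len, cp);
        s += n;
        len -= n;
        count++;
    }
    return count;
}
```

With the Latin-1 bytes `E4 62 63`, stepping as if the input were UTF-8 treats `E4` as a 3-byte lead and swallows the two following bytes, whereas single-byte stepping sees three characters.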

the SciTECO parser is Unicode-based now (refs #5)

2024-09-11T14:14:27+00:00

The following rules apply:

* All SciTECO macros __must__ be in valid UTF-8, regardless of the
  register's configured encoding.
  This is checked before execution, so we can use glib's non-validating
  UTF-8 API afterwards.
* Things will inevitably get slower as we have to validate all macros first
  and convert to gunichar for each and every character passed into the parser.
  As an optimization, it may make sense to have our own inlineable version of
  g_utf8_get_char() (TODO).
  Also, Unicode glyphs in syntactically significant positions may be
  case-folded - just like ASCII chars were.
  This is of course slower than case folding ASCII.
  The impact of this should be measured and perhaps we should restrict
  case folding to a-z via teco_ascii_toupper().
* The language itself does not use any non-ANSI characters, so you don't have
  to use UTF-8 characters.
* Wherever the parser expects a single character, it will now accept an
  arbitrary Unicode/UTF-8 glyph as well.
  In other words, you can call macros like M§ instead of having to write M[§].
  You can also get the codepoint of any Unicode character with ^^x.
  Pressing a Unicode character in the start state or in Ex and Fx will now
  give a sane error message.
* When pressing a key which produces a multi-byte UTF-8 sequence, the character
  gets translated back and forth multiple times:
  1. It's converted to a UTF-8 string, either buffered or by IME methods (Gtk).
     On Curses we could directly get a wide char using wget_wch(), but it's
     not currently used, so we don't depend on widechar curses.
  2. Parsed into gunichar for passing into the edit command callbacks.
     This also validates the codepoint - everything later on can assume valid
     codepoints and valid UTF-8 strings.
  3. Once the edit command handling decides to insert the key into the command
     line, it is serialized back into a UTF-8 string as the command line macro
     has to be in UTF-8 (like all other macros).
  4. The parser reads back gunichars without validation for passing into
     the parser callbacks.
* Flickering in the Curses UI and Pango warnings in Gtk, due to incompletely
  inserted and displayed UTF-8 sequences, are now fixed.
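
The serialize/read-back part of that round trip (steps 3 and 4) can be sketched in plain C. This is a minimal illustration, not SciTECO's code: the real implementation relies on glib, and `utf8_encode()` and `utf8_decode_unchecked()` are hypothetical helper names. The decoder deliberately performs no validation, which is only safe because the input is assumed to have been validated beforehand.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Serialize a codepoint into buf (at least 4 bytes); returns bytes written. */
static size_t utf8_encode(uint32_t cp, unsigned char *buf)
{
    /* Only valid Unicode scalar values may be serialized. */
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));

    if (cp < 0x80) {
        buf[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        buf[0] = (unsigned char)(0xC0 | (cp >> 6));
        buf[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        buf[0] = (unsigned char)(0xE0 | (cp >> 12));
        buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        buf[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    buf[0] = (unsigned char)(0xF0 | (cp >> 18));
    buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    buf[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}

/*
 * Non-validating decode: s must point at well-formed UTF-8.
 * Stores the sequence length in *len and returns the codepoint.
 */
static uint32_t utf8_decode_unchecked(const unsigned char *s, size_t *len)
{
    if (s[0] < 0x80) {
        *len = 1;
        return s[0];
    }
    if ((s[0] & 0xE0) == 0xC0) {
        *len = 2;
        return ((uint32_t)(s[0] & 0x1F) << 6) | (uint32_t)(s[1] & 0x3F);
    }
    if ((s[0] & 0xF0) == 0xE0) {
        *len = 3;
        return ((uint32_t)(s[0] & 0x0F) << 12) |
               ((uint32_t)(s[1] & 0x3F) << 6) | (uint32_t)(s[2] & 0x3F);
    }
    *len = 4;
    return ((uint32_t)(s[0] & 0x07) << 18) | ((uint32_t)(s[1] & 0x3F) << 12) |
           ((uint32_t)(s[2] & 0x3F) << 6) | (uint32_t)(s[3] & 0x3F);
}
```

For example, § (U+00A7) serializes to the two bytes `C2 A7` and decodes back to the same codepoint.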

fixed win32 CI and nightly builds (refs #5)

2024-09-10T10:13:38+00:00

* The libtool wrapper binaries do not pass down UTF-8 strings correctly,
  so the Unicode tests failed under some circumstances.
* As we aren't actually linking against any locally-built shared libraries,
  we are passing --disable-shared to libtool which inhibits wrapper generation
  on win32 and fixes the test suite.
* Also use up to date autotools. This didn't fix anything, though.
* test suite: try writing a Unicode filename as well
  * There have been problems doing that on Win32 where UTF-8 was not correctly
    passed down from the command line and some Windows API calls were only
    working with ANSI filenames etc.

try a different value for LC_ALL on Mac OS to accept UTF-8 command lines (refs #5)

2024-09-09T18:28:14+00:00

testsuite: try different locale on Mac OS (refs #5)

2024-09-09T18:05:16+00:00

hopefully fixes the Unicode test cases on Mac OS

improved 8-bit cleanliness test cases and added Unicode test cases (refs #5)

2024-09-09T16:22:21+00:00

fixed retrieval of characters with codes larger than 127 - always return unsigned integer

2024-08-27T22:03:04+00:00

* SCI_GETCHARAT is internally cast to `char`, which may be signed.
  Characters > 127 therefore become negative and stay so when cast to sptr_t.
  We therefore cast it back to guchar (unsigned char).
* The same is true whenever returning a string's character to SciTECO
  (teco_int_t) as our string type is `gchar *`.
* <^^x> now also works for those characters.
  Eventually, the parser will probably become UTF-8-aware and this will
  have to be done differently.
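
The cast in question can be illustrated with a small sketch. The `int_value_t` typedef is a hypothetical stand-in for teco_int_t and `char_code_at()` is an invented helper, not a SciTECO function. Converting through `unsigned char` first prevents sign extension on platforms where plain `char` is signed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for SciTECO's integer type. */
typedef long long int_value_t;

/* Return the byte at index i as a non-negative code in the range 0..255. */
static int_value_t char_code_at(const char *s, size_t len, size_t i)
{
    assert(i < len);
    /*
     * Where plain char is signed, s[i] is negative for bytes > 127.
     * Casting to unsigned char first yields the intended byte value
     * before it is widened to the larger integer type.
     */
    return (int_value_t)(unsigned char)s[i];
}
```

With the single byte `E4`, the helper returns 228 regardless of whether `char` is signed on the target platform.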

fully support out of tree builds

2024-08-23T02:51:55+00:00

* You no longer have to copy contrib/scintilla, contrib/scinterm and
  contrib/lexilla manually to the build directory.
* It turns out that Scintilla/Lexilla has been supporting this since 2016.
  Scintilla allows pointing to a source directory (srcdir) and Lexilla to a
  binary directory (DIR_O).
* For Scinterm I opened a pull request in order to add srcdir/basedir variables:
  https://github.com/orbitalquark/scinterm/pull/21
* `make distcheck` is therefore now also fixed.
* The FreeBSD package is now allowed to build out of source.
  I haven't tested it yet.
* See also https://github.com/ScintillaOrg/lexilla/issues/266

fixed expressions like `1,(2)` or `(1),(2)`: they are reported as two numbers now

2024-02-08T01:45:54+00:00

* Instead of TECO_OP_NEW, there should perhaps simply be a flag indicating whether `,` was used.