sciteco - Scintilla-based Text Editor and COrrector

Age	Commit message	Author	Files	Lines
2024-09-11	fixed searches in single-byte encoded documents	Robin Haberkorn	4	-36/+59
	* while code is guaranteed to be in valid UTF-8, this cannot be said about the result of string building. * The search pattern can end up with invalid Unicode bytes even when searching on UTF-8 buffers, e.g. if ^EQq inserts garbage. There are currently no checks. * When searching on a raw buffer, it must be possible to search for arbitrary bytes (^EUq). Since teco_pattern2regexp() was always expecting clean UTF-8 input, this would sometimes skip over too many bytes and could even crash. * Instead, teco_pattern2regexp() now takes the <S> target codepage into account.
2024-09-11	the SciTECO parser is Unicode-based now (refs #5)	Robin Haberkorn	29	-202/+325
	The following rules apply: * All SciTECO macros __must__ be in valid UTF-8, regardless of the register's configured encoding. This is checked before execution, so we can use glib's non-validating UTF-8 API afterwards. * Things will inevitably get slower as we have to validate all macros first and convert to gunichar for each and every character passed into the parser. As an optimization, it may make sense to have our own inlineable version of g_utf8_get_char() (TODO). Also, Unicode glyphs in syntactically significant positions may be case-folded - just like ASCII chars were. This is of course slower than case folding ASCII. The impact of this should be measured and perhaps we should restrict case folding to a-z via teco_ascii_toupper(). * The language itself does not use any non-ANSI characters, so you don't have to use UTF-8 characters. * Wherever the parser expects a single character, it will now accept an arbitrary Unicode/UTF-8 glyph as well. In other words, you can call macros like M§ instead of having to write M[§]. You can also get the codepoint of any Unicode character with ^^x. Pressing a Unicode character in the start state or in Ex and Fx will now give a sane error message. * When pressing a key which produces a multi-byte UTF-8 sequence, the character gets translated back and forth multiple times: 1. It's converted to a UTF-8 string, either buffered or by IME methods (Gtk). On Curses we could directly get a wide char using wget_wch(), but it's not currently used, so we don't depend on widechar curses. 2. Parsed into gunichar for passing into the edit command callbacks. This also validates the codepoint - everything later on can assume valid codepoints and valid UTF-8 strings. 3. Once the edit command handling decides to insert the key into the command line, it is serialized back into a UTF-8 string as the command line macro has to be in UTF-8 (like all other macros). 4. The parser reads back gunichars without validation for passing into the parser callbacks. * Flickering in the Curses UI and Pango warnings in Gtk, due to incompletely inserted and displayed UTF-8 sequences, are now fixed.
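	The key round trip in steps 1 to 3 above can be sketched roughly as follows. This is a Python illustration only, not SciTECO's C code; parse_key() and serialize_key() are hypothetical names that stand in for the gunichar conversion and the serialization into the command line macro.

```python
def parse_key(utf8_key: bytes) -> int:
    """Step 2: decode a UTF-8 key sequence into one validated codepoint."""
    text = utf8_key.decode("utf-8")  # raises UnicodeDecodeError on invalid input
    if len(text) != 1:
        raise ValueError("expected exactly one character")
    return ord(text)


def serialize_key(codepoint: int) -> bytes:
    """Step 3: serialize the codepoint back to UTF-8 for the command line macro."""
    return chr(codepoint).encode("utf-8")


key = "§".encode("utf-8")               # step 1: the key arrives as a UTF-8 string
codepoint = parse_key(key)              # step 2: validated codepoint
macro_bytes = serialize_key(codepoint)  # step 3: back to UTF-8
assert macro_bytes == key
```

	Validation happens once, in the decoding step; everything after it can rely on the codepoint being valid, which matches the reasoning given for step 2.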
2024-09-10	fixed win32 CI and nightly builds (refs #5)	Robin Haberkorn	3	-8/+19
	* The libtool wrapper binaries do not pass down UTF-8 strings correctly, so the Unicode tests failed under some circumstances. * As we aren't actually linking against any locally-built shared libraries, we are passing --disable-shared to libtool which inhibits wrapper generation on win32 and fixes the test suite. * Also use up to date autotools. This didn't fix anything, though. * test suite: try writing a Unicode filename as well * There have been problems doing that on Win32 where UTF-8 was not correctly passed down from the command line and some Windows API calls were only working with ANSI filenames etc.
2024-09-10	win32: fixed opening and saving UTF-8 filenames (refs #5)	Robin Haberkorn	1	-5/+15
	* The default ANSI versions of the Win32 API calls worked only as long as we used the ANSI subset of UTF-8 in filenames. * There is g_win32_locale_filename_from_utf8(), but it's not guaranteed to derive a unique filename. * Therefore we define UNICODE and convert between UTF-8 and UTF-16 (Windows' native Unicode encoding).
2024-09-10	win32: convert command line to UTF-8 (refs #5)	Robin Haberkorn	2	-17/+31
	* Should prevent data loss due to system locale conversions when parsing command line arguments. * Should also fix passing Unicode arguments to munged macros and therefore opening files via ~/.teco_ini. * The entire option parsing is based on GStrv (null-terminated string lists) now, also on UNIX.
2024-09-10	fixed Mac OS nightly builds by installing an up-to-date Groff	Robin Haberkorn	1	-1/+2
	The Mac OS 12 Groff apparently does not accept `-K` for preconv.
2024-09-09	try a different value for LC_ALL on Mac OS to accept UTF-8 command lines ↵	Robin Haberkorn	1	-2/+1
	(refs #5)
2024-09-09	testsuite: try different locale on Mac OS (refs #5)	Robin Haberkorn	1	-1/+9
	hopefully fixes the Unicode test cases on Mac OS
2024-09-09	updated TODO	Robin Haberkorn	1	-10/+48

2024-09-09	disable unused Scintilla features at build time	Robin Haberkorn	1	-1/+3
	should slightly reduce binary size
2024-09-09	define G_DISABLE_ASSERT unless --enable-debug is specified	Robin Haberkorn	2	-1/+5
	* turns out that glib's g_assert() does not depend on NDEBUG like Standard C's assert() * this disables assertions in release builds and should speed up things slightly
2024-09-09	sample.teco_ini: fixed opening files with glob characters in their names	Robin Haberkorn	1	-1/+1

2024-09-09	grosciteco.tes manpage: fixed formatting of list of troff macros	Robin Haberkorn	1	-1/+0

2024-09-09	added an improvised lexer for styling Git commit messages	Robin Haberkorn	2	-1/+18
	It's not a real Lexilla lexer, but simply styles the document once in lexer.set.git in order to highlight comment lines.
2024-09-09	<f,tXq>: fixed for very large character ranges	Robin Haberkorn	1	-3/+7
	* use SCI_GETTEXTRANGEFULL instead of deprecated SCI_GETTEXTRANGE
2024-09-09	symbols-extract.tes works in 8-bit mode now (refs #5)	Robin Haberkorn	2	-3/+3
	* significantly speeds up build time * Scintilla and Lexilla headers and symbols are all-ASCII anyway. * We should probably have a look at the quicksort implementation in string.tes, as it can probably be optimized in UTF-8 documents as well.
2024-09-09	improved 8-bit cleanliness test cases and added Unicode test cases (refs #5)	Robin Haberkorn	2	-5/+27

2024-09-09	teco_glyphs2bytes() and teco_bytes2glyphs() renamed to ↵	Robin Haberkorn	5	-26/+26
	teco_interface_glyphs2bytes() and teco_interface_bytes2glyphs() (refs #5) * for consistency with all the other teco_view wrappers in interface.h
2024-09-09	added raw ANSI mode to facilitate 8-bit clean editing (refs #5)	Robin Haberkorn	17	-107/+158
	* When enabled with bit 2 in the ED flags (0,4ED), all registers and buffers will get the raw ANSI encoding (as if 0EE had been called on them). You can still manually change the encoding, e.g. by calling 65001EE afterwards. * Also the ANSI mode sets up character representations for all bytes >= 0x80. This is currently done only depending on the ED flag, not when setting 0EE. * Since setting 16,4ED for 8-bit clean editing in a macro can be tricky - the default unnamed buffer will still be at UTF-8 and at least a bunch of environment registers as well - we added the command line option `--8bit` (short `-8`) which configures the ED flags very early on. As another advantage you can mung the profile in 8-bit mode as well when using SciTECO as a sort of interactive hex editor. * Disable UTF-8 checks in 8-bit clean mode (sample.teco_ini).
2024-09-09	Xq and ]q inherit the document encoding from the source document (refs #5)	Robin Haberkorn	15	-112/+177
	* ^Uq however always sets a UTF-8 register as the source is supposed to be a SciTECO macro which is always UTF-8. * :^Uq preserves the register's encoding * teco_doc_set_string() now also sets the encoding * instead of trying to restore the encoding in teco_doc_undo_set_string(), we now swap out the document in a teco_doc_t and pass it to an undo token. * The get_codepage() Q-Reg method has been removed as the same can now be done with teco_doc_get_string() and the get_string() method.
2024-09-09	n^Uq now checks the input codepoints for validity (refs #5)	Robin Haberkorn	1	-1/+5
	* <nI> and ^EUq do the same
2024-09-09	updated README and sciteco(7) with information about Unicode support (refs #5)	Robin Haberkorn	2	-10/+32

2024-09-09	Gtk: ignore the keyboard layout wherever possible (refs #5)	Robin Haberkorn	2	-23/+89
	* E.g. when typing with a Russian layout, CTRL+I will always insert ^I. * Works with all of the start-state commands, Ex, Fx, ^x commands and string building constructs. This is exactly where process_edit_cmd_cb() case folds case-insensitive characters. The corresponding state therefore sets an is_case_insensitive flag now. * Does not yet work with anything embedded into Q-Register specifications. This could only be realized with a new state callback (is_case_insensitive()?) that chains to the Q-Register and string building states recursively. * Also it doesn't work with Ё on my Russian phonetic layout, probably because the ANSI caret on that same key is considered dead and not returned by gdk_keyval_to_unicode(). Perhaps we should directly check the keyval values? * Whenever a non-ANSI key is pressed in an allowed state, we try to check all other keyvals that could be produced by the same hardware keycode, i.e. we check all groups (keyboard layouts).
2024-09-09	leave some comments on what to do when converting the parser to Unicode ↵	Robin Haberkorn	2	-1/+21
	(refs #5)
2024-09-09	search patterns are now expected to be in UTF-8 and the document's encoding ↵	Robin Haberkorn	1	-21/+31
	is taken into account (refs #5) * ^Nx and ^EMx constructs work with Unicode glyphs now, even though the main SciTECO parser is still not Unicode-based. (We translate only complete patterns, although they could have incomplete Unicode sequences at their end.) * case-insensitive searching now works with Unicode glyphs
2024-09-09	the ^EUq string building escape now respects the encoding (can insert bytes ↵	Robin Haberkorn	11	-16/+104
	or codepoints) (refs #5) * This is trickier than it sounds because there isn't one single place to consult. It depends on the context. If the string argument relates to buffer contents - as in <I>, <S>, <FR> etc. - the buffer's encoding is consulted. If it goes into a register (EU), the register's encoding is consulted. Everything else (O, EN, EC, ES...) expects only Unicode codepoints. * This is communicated through a new field teco_machine_stringbuilding_t::codepage which must be set in the states' initial callback. * Seems overkill just for ^EUq, but it can be used for context-sensitive processing of all the other string building constructs as well. * ^V and ^W cannot be supported for Unicode characters for the time being without a Unicode-aware parser
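	The byte-versus-codepoint distinction described for ^EUq could look roughly like the sketch below. It is written in Python for illustration and is not the actual implementation; char_to_bytes() is a hypothetical helper, and 65001 is simply the UTF-8 value that appears with EE elsewhere in this log.

```python
UTF8_CODEPAGE = 65001  # UTF-8, as in "65001EE"


def char_to_bytes(n: int, codepage: int) -> bytes:
    """Encode the integer n for a target with the given codepage.

    Hypothetical helper: UTF-8 targets take n as a Unicode codepoint,
    all other (single-byte) targets take n as a raw byte value.
    """
    if codepage == UTF8_CODEPAGE:
        # Reject values outside the Unicode range and UTF-16 surrogates.
        if not 0 <= n <= 0x10FFFF or 0xD800 <= n <= 0xDFFF:
            raise ValueError("invalid Unicode codepoint")
        return chr(n).encode("utf-8")
    if not 0 <= n <= 0xFF:
        raise ValueError("invalid byte value")
    return bytes([n])
```

	With this shape, the caller only has to supply the codepage of whatever the string ends up in (buffer or register), which is the role the log assigns to the codepage field.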
2024-09-09	<I> command evaluates input codepoints (refs #5)	Robin Haberkorn	1	-10/+18

2024-09-09	conditionals now check for Unicode codepoints (refs #5)	Robin Haberkorn	2	-13/+13
	* This will naturally work with both ASCII characters and various non-English scripts. * Unfortunately, it cannot work with the other non-ANSI single-byte codepages. * If we'd like to support scripts working with all sorts of codepoints, we'd have to introduce a new command for translating individual codepoints from the current codepage (as reported by EE) to Unicode.
2024-09-09	glob patterns fully support Unicode now (refs #5)	Robin Haberkorn	1	-13/+16
	* The ASCII compiler would try to escape ("\") all bytes of a multibyte UTF-8 glyph. * The new implementation escapes only metacharacters and passes down all non-ANSI glyphs unchanged. On the downside, this will work only with PCREs.
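	A simplified sketch of the "escape only metacharacters" idea, in Python rather than the C/PCRE code the commit refers to. glob_to_regex() is a hypothetical name, and character classes and other glob features are deliberately left out.

```python
import re

# ASCII characters with a special meaning in regular expressions.
REGEX_META = set(r"\.^$|()[]{}*+?")


def glob_to_regex(pattern: str) -> str:
    """Translate * and ? and escape only regex metacharacters."""
    out = []
    for ch in pattern:
        if ch == "*":
            out.append(".*")
        elif ch == "?":
            out.append(".")
        elif ch in REGEX_META:
            out.append("\\" + ch)
        else:
            out.append(ch)  # non-ASCII glyphs pass through unchanged
    return "".join(out)


assert re.fullmatch(glob_to_regex("Ёлка*.txt"), "Ёлка1.txt")
```

	Because the loop works on whole characters instead of bytes, a multi-byte glyph is never split by a backslash, which is the failure the old byte-wise escaping ran into.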
2024-09-09	:EL can be used to perform codepage conversions now (refs #5)	Robin Haberkorn	2	-35/+231
	* I decoded the Scintilla charset values into codepages, at least those used on Gtk. * make sure that the line character index is not allocated or released too often, as it is actually internally reference counted, which could result in it missing when we really need it. * The line character index still appears to be released whenever the document pointer changes, which will happen after using a different Q-Register. This could be a performance bottleneck (FIXME).
2024-09-09	lexer.checkheader is Unicode-aware now (refs #5)	Robin Haberkorn	1	-1/+1

2024-09-09	avoid redundancies between teco_qreg_plain_get_character() and ↵	Robin Haberkorn	6	-48/+54
	teco_state_start_get() (refs #5)
2024-09-09	grosciteco: support Unicode (refs #5)	Robin Haberkorn	2	-28/+46
	* All manpages are processed with the "utf8" device and with preconv. Manpage sources can contain Unicode glyphs now. * grosciteco supports CuXXXX and N commands now * Lines are drawn with Unicode box characters now. This works at least with tbl and -Tutf8. It's probably still too simplistic for pic graphics. * The topic list at the top of .woman.tec contains byte offsets, so that we don't need glyphs2bytes conversion when looking up topics.
2024-09-09	reserve at most 4 bytes for UTF-8 encoded characters (refs #5)	Robin Haberkorn	3	-3/+4
	There is a widespread myth that they could take up to 6 bytes.
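	The 4-byte limit follows from the Unicode range ending at U+10FFFF (RFC 3629); only the older UTF-8 definition allowed sequences of up to 6 bytes. A quick Python check, where utf8_length() is just an illustrative helper:

```python
def utf8_length(codepoint: int) -> int:
    """Number of bytes needed to encode one codepoint as UTF-8."""
    return len(chr(codepoint).encode("utf-8"))


# Largest codepoint of each encoded length, up to the Unicode maximum U+10FFFF.
lengths = [utf8_length(cp) for cp in (0x7F, 0x7FF, 0xFFFF, 0x10FFFF)]
assert lengths == [1, 2, 3, 4]
```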
2024-09-09	Codepage guessing is done in .teco_ini (refs #5)	Robin Haberkorn	1	-0/+3
	* There isn't much we can do anyway. We can detect if it's Unicode and otherwise default to _some_ codepage. However, we do not even know which codepage should be preferred. * This is actually trivial to implement in pure SciTECO. Having it in the profile gives you the ability to customize the default non-UTF code page. E.g. if you are working a lot with KOI-8 documents, you could change <1EE> to <204EE>. * Since the Unicode validity check is a noticeable slowdown, we limit it to the first 1024 bytes. This speeds up startup significantly compared to checking all codepoints in every document.
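	The guessing strategy (validate a bounded prefix as UTF-8, otherwise fall back to a configurable codepage) might be sketched like this in Python. The real check lives in the SciTECO profile; the function names are hypothetical, and the return values 65001 and 1 merely mirror the EE arguments mentioned in this log.

```python
import codecs


def looks_like_utf8(data: bytes, limit: int = 1024) -> bool:
    """Validate only the first `limit` bytes as UTF-8."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        # final=False tolerates a multi-byte sequence cut off at the limit.
        decoder.decode(data[:limit], final=len(data) <= limit)
    except UnicodeDecodeError:
        return False
    return True


def guess_codepage(data: bytes, fallback: int = 1) -> int:
    """Return 65001 for UTF-8-looking data, else the chosen fallback."""
    return 65001 if looks_like_utf8(data) else fallback
```

	A document whose first invalid byte lies beyond the limit is still classified as UTF-8; that is the trade-off accepted in exchange for a faster startup.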
2024-09-09	implemented <EE> and <^E> commands for configuring encodings and translating ↵	Robin Haberkorn	4	-18/+146
	between glyph and byte offsets (refs #5) * ^E is heavily overloaded and can also be used to check whether a given index is valid (as it is the same that most movement commands do internally). Besides that, it is mainly useful for interfacing with Scintilla messages. * EE takes a code page or 0 for ANSI/ASCII. Currently all documents and new registers are UTF-8. There will have to be some kind of codepage inheritance and a single-byte-only mode.
2024-09-09	Unicode support for the Q-Register commands (refs #5)	Robin Haberkorn	10	-145/+274
	* this required adding several Q-Register vtable methods * it should still be investigated whether the repeated calling of SCI_ALLOCATELINECHARACTERINDEX causes any overhead.
2024-09-09	allow Unicode characters in command line arguments (refs #5)	Robin Haberkorn	2	-4/+8
	* the locale must be initialized very early before g_option_context_parse() * will allow UTF-8 characters in the test suite
2024-09-09	Glyph to byte offset mapping is now using the line character index (refs #5)	Robin Haberkorn	7	-68/+130
	* This works reasonably well unless lines are exceedingly long (as on a line we always count characters). The following test case is still slow (on Unicode buffers): 10000<@I/XX/> <%a-1:J;> While the following is now also fast: 10000<@I/X^J/> <%a-1:J;> * Commands with relative character offsets (C, R, A, D) have a special optimization where they always count characters beginning at dot, as long as the argument is not exceedingly large. This means they are fast even on exceedingly long lines. * The remaining commands (search, EC/EG, Xq) now accept glyph indexes.
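	Why a per-line index helps, and why very long lines stay slow, can be seen in a small model. The Python sketch below is not Scintilla's line character index; LineCharIndex is a hypothetical class that assumes a valid UTF-8 buffer with LF line ends.

```python
from bisect import bisect_right


class LineCharIndex:
    """Maps glyph offsets to byte offsets in a UTF-8 buffer, line by line."""

    def __init__(self, buf: bytes):
        self.buf = buf
        self.byte_starts = [0]  # byte offset where each line begins
        self.char_starts = [0]  # glyph offset where each line begins
        chars = 0
        for pos, byte in enumerate(buf):
            if byte & 0xC0 != 0x80:  # not a continuation byte: new glyph
                chars += 1
            if byte == 0x0A:         # line feed ends the line
                self.byte_starts.append(pos + 1)
                self.char_starts.append(chars)

    def glyphs2bytes(self, glyph: int) -> int:
        if glyph < 0:
            raise ValueError("negative glyph offset")
        # Binary search for the line, then count glyphs only within that line.
        line = bisect_right(self.char_starts, glyph) - 1
        pos = self.byte_starts[line]
        remaining = glyph - self.char_starts[line]
        while remaining > 0 and pos < len(self.buf):
            pos += 1
            while pos < len(self.buf) and self.buf[pos] & 0xC0 == 0x80:
                pos += 1
            remaining -= 1
        return pos  # offsets past the end clamp to the buffer length
```

	The line lookup is logarithmic, but the walk inside the line is linear in the line's length. A buffer made of one huge line therefore degrades to counting from the start, which matches the slow test case quoted above.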
2024-09-09	implemented Unicode support for rubin/rubout and a number of commands (WIP) ↵	Robin Haberkorn	5	-44/+148
	(refs #5) certain test cases are still way too slow: 10000<@I/X^J/> 20000<R> or 10000<@I/X^J/> 20000<%a-1J> SCI_ALLOCATELINECHARACTERINDEX does not help much here. It probably speeds up only SCI_LINEFROMINDEXPOSITION and SCI_INDEXPOSITIONFROMLINE.
2024-09-09	prefer libncursesw (widechar variant) (refs #5)	Robin Haberkorn	3	-4/+9
	* Some platforms like Ubuntu actually ship widechar and non-widechar versions of ncurses with different pkg-config files. Other platforms like FreeBSD will ship an "ncursesw" and "ncurses" pkg-config file but both point to the same wide-char library anyway. * Currently we are not using wide-char APIs to ensure maximum compatibility even with embedded systems where ncurses might be built without widechar support. But in order to handle Unicode correctly, we still need to link against the widechar version of ncurses (if available). * Compilation on platforms without a widechar ncurses is now handled by the common AC_CHECK_LIB() fallback (which might actually find a widechar version anyway if it just didn't install the pkg-config file). If necessary, we could also check for the "ncurses" package if "ncursesw" is not found. * This fixes Unicode display and input on Ubuntu.
2024-09-09	input and displaying of Unicode characters is now possible (refs #5)	Robin Haberkorn	6	-27/+73
	* All non-ASCII characters are inserted as Unicode. On Curses, this also requires a properly set up locale. * We still do not need any widechar Curses, as waddch() handles multibyte characters on ncurses. We will see whether there is any Curses variant that strictly requires wadd_wch(). If this will be an exception, we might keep both widechar and non-widechar support. * By convention gsize is used exclusively for byte sizes. Character offsets or lengths use int or long.
2024-08-28	fixed retrieval of characters with codes larger than 127 - always return ↵	Robin Haberkorn	3	-5/+10
	unsigned integer * SCI_GETCHARAT is internally cast to `char`, which may be signed. Characters > 127 therefore become negative and stay so when cast to sptr_t. We therefore cast it back to guchar (unsigned char). * The same is true whenever returning a string's character to SciTECO (teco_int_t) as our string type is `gchar *`. <^^x> now also works for those characters. Eventually, the parser will probably become UTF8-aware and this will have to be done differently.
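	The sign problem can be reproduced outside of C. The Python sketch below uses ctypes to model an 8-bit signed `char`; the helper names are hypothetical, and the mask stands in for the cast back to guchar.

```python
import ctypes


def as_signed_char(byte: int) -> int:
    """Value seen when a byte is stored in a signed 8-bit `char`."""
    return ctypes.c_int8(byte).value


def as_unsigned_char(value: int) -> int:
    """Cast back to an unsigned 8-bit value, like a cast to guchar."""
    return value & 0xFF


assert as_signed_char(0xE9) == -23       # byte 233 turns negative
assert as_unsigned_char(-23) == 0xE9     # and is recovered by the cast back
```

	Widening the negative value to a larger integer type keeps it negative, so the conversion to unsigned has to happen while the value is still 8 bits wide.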
2024-08-24	win32 CI: also set PDCURSES_CFLAGS	Robin Haberkorn	1	-1/+4
	Should fix `make distcheck`.
2024-08-23	hopefully fixed the Windows CI tests	Robin Haberkorn	1	-1/+2
	* `make distcheck` will try to build against libncurses, which is not installed. Therefore, I set DISTCHECK_CONFIGURE_FLAGS in order to force it to PDCurses.
2024-08-23	debian package: updated copyright to 2024	Robin Haberkorn	1	-1/+1

2024-08-23	Lexilla: the troff branch has been merged, so we point to the upstream ↵	Robin Haberkorn	1	-0/+0
	repository again
2024-08-23	fully support out of tree builds	Robin Haberkorn	12	-88/+57
	* You no longer have to copy contrib/scintilla, contrib/scinterm and contrib/lexilla manually to the build directory. * It turns out that Scintilla/Lexilla has been supporting this since 2016. Scintilla allows pointing to a source directory (srcdir) and Lexilla to a binary directory (DIR_O). * For Scinterm I opened a pull request in order to add srcdir/basedir variables: https://github.com/orbitalquark/scinterm/pull/21 * `make distcheck` is therefore now also fixed. * The FreeBSD package is now allowed to build out of source. I haven't tested it yet. * See also https://github.com/ScintillaOrg/lexilla/issues/266
2024-08-22	some updates on Scintilla/Lexilla out-of-tree builds	Robin Haberkorn	2	-0/+7

2024-08-22	bumped Lexilla submodule: it has just been rebased	Robin Haberkorn	1	-0/+0
	This should not change anything functionally.