the SciTECO parser is Unicode-based now (refs #5)

The following rules apply:

 * All SciTECO macros __must__ be in valid UTF-8, regardless of the
   register's configured encoding.
   This is checked before execution, so we can use glib's non-validating
   UTF-8 API afterwards.
 * Things will inevitably get slower as we have to validate all macros first
   and convert to gunichar for each and every character passed into the parser.
   As an optimization, it may make sense to have our own inlineable version of
   g_utf8_get_char() (TODO).
   Also, Unicode glyphs in syntactically significant positions may be
   case-folded - just like ASCII chars were.
   This is of course slower than case folding ASCII.
   The impact of this should be measured and perhaps we should restrict
   case folding to a-z via teco_ascii_toupper().
 * The language itself does not use any non-ANSI characters, so you don't
   have to use UTF-8 characters.
 * Wherever the parser expects a single character, it will now accept an
   arbitrary Unicode/UTF-8 glyph as well.
   In other words, you can call macros like M§ instead of having to write M[§].
   You can also get the codepoint of any Unicode character with ^^x.
   Pressing a Unicode character in the start state or in Ex and Fx will now
   give a sane error message.
 * When pressing a key which produces a multi-byte UTF-8 sequence, the
   character gets translated back and forth multiple times:
   1. It's converted to a UTF-8 string, either buffered or by IME methods (Gtk).
      On Curses we could directly get a wide char using wget_wch(), but it's
      not currently used, so we don't depend on widechar curses.
   2. Parsed into gunichar for passing into the edit command callbacks.
      This also validates the codepoint - everything later on can assume valid
      codepoints and valid UTF-8 strings.
   3. Once the edit command handling decides to insert the key into the
      command line, it is serialized back into a UTF-8 string as the command
      line macro has to be in UTF-8 (like all other macros).
   4. The parser reads back gunichars without validation for passing into
      the parser callbacks.
 * Flickering in the Curses UI and Pango warnings in Gtk, due to incompletely
   inserted and displayed UTF-8 sequences, are now fixed.
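
The "validate once, then decode without checks" rule and the key round trip (steps 2 to 4) can be illustrated with a small standalone sketch. It is not SciTECO code and deliberately avoids glib so it compiles on its own; all sketch_* names are hypothetical. In the real code base, g_utf8_validate(), g_utf8_get_char() and g_unichar_to_utf8() would presumably play these roles.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: strict UTF-8 validation, done once per macro. */
static bool
sketch_utf8_validate(const unsigned char *s, size_t len)
{
	size_t i = 0;

	while (i < len) {
		unsigned char c = s[i];
		uint32_t cp, min;
		size_t n;

		if (c < 0x80) { i++; continue; }
		else if ((c & 0xE0) == 0xC0) { n = 2; cp = c & 0x1F; min = 0x80; }
		else if ((c & 0xF0) == 0xE0) { n = 3; cp = c & 0x0F; min = 0x800; }
		else if ((c & 0xF8) == 0xF0) { n = 4; cp = c & 0x07; min = 0x10000; }
		else return false;	/* stray continuation or invalid lead byte */

		if (len - i < n)
			return false;	/* truncated sequence */
		for (size_t j = 1; j < n; j++) {
			if ((s[i+j] & 0xC0) != 0x80)
				return false;
			cp = (cp << 6) | (s[i+j] & 0x3F);
		}
		/* reject overlong forms, surrogates and out-of-range values */
		if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
			return false;
		i += n;
	}
	return true;
}

/*
 * Hypothetical helper: decode one character WITHOUT validation.
 * Only safe on input that passed sketch_utf8_validate().
 * Stores the sequence length in *n.
 */
static uint32_t
sketch_utf8_get_char(const unsigned char *s, size_t *n)
{
	unsigned char c = s[0];
	uint32_t cp;

	if (c < 0x80) { *n = 1; return c; }
	if (c < 0xE0) { *n = 2; cp = c & 0x1F; }
	else if (c < 0xF0) { *n = 3; cp = c & 0x0F; }
	else { *n = 4; cp = c & 0x07; }
	for (size_t j = 1; j < *n; j++)
		cp = (cp << 6) | (s[j] & 0x3F);
	return cp;
}

/*
 * Hypothetical helper: serialize a valid codepoint back into UTF-8.
 * out must have room for 4 bytes; returns the number of bytes written.
 */
static size_t
sketch_utf8_encode(uint32_t cp, unsigned char *out)
{
	assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));

	if (cp < 0x80) {
		out[0] = (unsigned char)cp;
		return 1;
	}
	if (cp < 0x800) {
		out[0] = (unsigned char)(0xC0 | (cp >> 6));
		out[1] = (unsigned char)(0x80 | (cp & 0x3F));
		return 2;
	}
	if (cp < 0x10000) {
		out[0] = (unsigned char)(0xE0 | (cp >> 12));
		out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
		out[2] = (unsigned char)(0x80 | (cp & 0x3F));
		return 3;
	}
	out[0] = (unsigned char)(0xF0 | (cp >> 18));
	out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
	out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
	out[3] = (unsigned char)(0x80 | (cp & 0x3F));
	return 4;
}
```

For example, the key § (U+00A7) would be serialized into the two bytes C2 A7 when inserted into the command line and decoded back to 0xA7 when the parser reads it. The decoder has no error path at all, which is what makes the up-front validation mandatory.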
author: Robin Haberkorn <robin.haberkorn@googlemail.com> 2024-09-11 12:21:42 +0200
committer: Robin Haberkorn <robin.haberkorn@googlemail.com> 2024-09-11 16:14:27 +0200
commit: 68578072bfaf6054a96bb6bcedfccb6e56a508fe (patch)
tree: b7916f665e77c698d2d0fda7cb9f3ac4356f502b /src/string-utils.c
parent: adc067ba745cebf2e2a2f9523bc14136ca1d2680 (diff)
download: sciteco-68578072bfaf6054a96bb6bcedfccb6e56a508fe.tar.gz
1 files changed, 26 insertions, 10 deletions
diff --git a/src/string-utils.c b/src/string-utils.c
index ac5835b..d9b12e0 100644
--- a/src/string-utils.c
+++ b/src/string-utils.c
@@ -78,7 +78,17 @@ teco_string_get_coord(const gchar *str, guint pos, guint *line, guint *column)
 	}
 }
 
-/** @memberof teco_string_t */
+/**
+ * Get the length of the prefix common to two strings.
+ * Works with UTF-8 and single-byte encodings.
+ *
+ * @param a Left string.
+ * @param b Right string.
+ * @param b_len Length of right string.
+ * @return Length of the common prefix in bytes.
+ *
+ * @memberof teco_string_t
+ */
 gsize
 teco_string_diff(const teco_string_t *a, const gchar *b, gsize b_len)
 {
@@ -92,14 +102,16 @@ teco_string_diff(const teco_string_t *a, const gchar *b, gsize b_len)
 }
 
 /**
- * Get the length of the prefix common to two strings
+ * Get the length of the prefix common to two UTF-8 strings
  * without considering case.
  *
- * @fixme This is currently only used for symbols and one/two letter
- * Q-Register names, which cannot be UTF-8.
- * If we rewrote this to perform Unicode case folding, we would
- * also have to check for character validity.
- * Once our parser is Unicode-aware, this is not necessary.
+ * The UTF-8 strings must be validated, which should be the case
+ * for help labels and short Q-Register names.
+ *
+ * @param a Left UTF-8 string.
+ * @param b Right UTF-8 string.
+ * @param b_len Length of right UTF-8 string.
+ * @return Length of the common prefix in bytes.
  *
  * @memberof teco_string_t
  */
@@ -108,9 +120,13 @@ teco_string_casediff(const teco_string_t *a, const gchar *b, gsize b_len)
 {
 	gsize len = 0;
 
-	while (len < a->len && len < b_len &&
-	       g_ascii_tolower(a->data[len]) == g_ascii_tolower(b[len]))
-		len++;
+	while (len < a->len && len < b_len) {
+		gunichar a_chr = g_utf8_get_char(a->data+len);
+		gunichar b_chr = g_utf8_get_char(b+len);
+		if (g_unichar_tolower(a_chr) != g_unichar_tolower(b_chr))
+			break;
+		len = g_utf8_next_char(b+len) - b;
+	}
 
 	return len;
 }
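
The commit message mentions restricting case folding to a-z as a possible optimization. A minimal standalone sketch of what such a variant of teco_string_casediff() might look like is given below. It is an assumption, not part of this patch: all sketch_* names are hypothetical and it does not use teco_string_t or glib. Like the patched function, it expects validated UTF-8 and only ever returns lengths that fall on character boundaries.

```c
#include <stddef.h>

/* Hypothetical stand-in for teco_ascii_toupper(): folds only a-z. */
static unsigned char
sketch_ascii_toupper(unsigned char c)
{
	return (c >= 'a' && c <= 'z') ? (unsigned char)(c - ('a' - 'A')) : c;
}

/* Length of a UTF-8 sequence, derived from its (valid) lead byte. */
static size_t
sketch_utf8_seq_len(unsigned char lead)
{
	if (lead < 0x80) return 1;
	if (lead < 0xE0) return 2;
	if (lead < 0xF0) return 3;
	return 4;
}

/*
 * Length in bytes of the common prefix of two validated UTF-8 strings,
 * ignoring case for ASCII letters only.
 * Non-ASCII characters must match byte for byte.
 */
static size_t
sketch_casediff_ascii(const char *a, size_t a_len,
                      const char *b, size_t b_len)
{
	size_t len = 0;

	while (len < a_len && len < b_len) {
		size_t n = sketch_utf8_seq_len((unsigned char)b[len]);
		size_t i;

		if (n > a_len - len || n > b_len - len)
			break;
		/* compare one whole character, never stopping inside it */
		for (i = 0; i < n; i++)
			if (sketch_ascii_toupper((unsigned char)a[len+i]) !=
			    sketch_ascii_toupper((unsigned char)b[len+i]))
				break;
		if (i < n)
			break;
		len += n;
	}

	return len;
}
```

Stepping by whole characters matters even without Unicode folding: ä (C3 A4) and å (C3 A5) share their first byte, and a plain byte-wise loop would report a common prefix of 1 byte, which splits a character. The trade-off is that Ä and ä no longer compare equal, unlike with g_unichar_tolower().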