From 4fe5bc6f3867096965270c90f2e1e5df77b8825f Mon Sep 17 00:00:00 2001
From: Robin Haberkorn <rhaberkorn@fmsbw.de>
Date: Sun, 28 Jun 2026 00:39:51 +0200
Subject: terex is the new regular expression engine now and replaces PCRE
 (GRegex)

* terex is based on Henry Spencer's regular expression engine for Tcl.
  It is a hybrid NFA/DFA design which has a better worst-case runtime than
  the backtracking PCRE. Memory usage is also limited and can no longer
  increase catastrophically.
* It should no longer be possible to crash SciTECO with pathological
  searches.
* Since it reliably supports partial matches (REG_EXPECT), we can
  now enable the new backwards-search algorithm by default.
  This used to be broken because of a glib bug, which I have already
  fixed. It would, however, take a long time until that fix ends up
  on the majority of glib installations.
* Regexp executions can still be quite slow if you are looking
  for a pattern at the end of a huge file, which can hang the editor,
  but this can now at least theoretically be solved by adding
  hooks into terex to poll for interruptions.
* We can now also get rid of the TECO-pattern-to-regexp translation
  step by directly generating terex tokens (TODO).
* Performance-wise, terex appears to be slower than PCRE for simple
  forward searches, even when linking everything with optimizations (FIXME).
* Having a stand-alone regular expression engine is also a huge
  step towards getting rid of glib.
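
The backwards-search idea mentioned above (scan in blocks from dot
upwards and use partial matches to detect block-boundary hits) can be
sketched roughly as follows. This is only an illustration, not the code
from this commit: it does not use the terex API at all. A literal
substring matcher stands in for the regexp engine, and the names
scan_result, scan_window() and search_backward() as well as the
block-size parameter are made up for this example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Result of scanning one window (hypothetical type, not part of terex). */
typedef struct {
	ptrdiff_t start; /* start of the last full match, or -1 */
	bool partial;    /* a candidate match ran into the window's end */
} scan_result;

/*
 * Stand-in for the regexp engine (an assumption for this sketch):
 * finds the last occurrence of the non-empty literal `needle` that lies
 * entirely within buf[from..to) and reports whether some candidate was
 * cut off by the end of the window (a "partial match").
 */
static scan_result
scan_window(const char *buf, size_t from, size_t to, const char *needle)
{
	size_t n = strlen(needle);
	scan_result r = { -1, false };

	for (size_t i = from; i < to; i++) {
		size_t avail = to - i;
		size_t len = avail < n ? avail : n;

		if (memcmp(buf + i, needle, len) != 0)
			continue;
		if (len == n)
			r.start = (ptrdiff_t)i; /* full match, keep the last one */
		else
			r.partial = true;       /* needle prefix hits window end */
	}
	return r;
}

/*
 * Search backwards from `dot` for the last match ending at or before
 * `dot`. Returns its start offset or -1. `block` must be > 0.
 */
static ptrdiff_t
search_backward(const char *buf, size_t dot, const char *needle, size_t block)
{
	size_t from = dot, to = dot;

	while (from > 0) {
		/* move the window one block upwards */
		from = from > block ? from - block : 0;

		scan_result r = scan_window(buf, from, to, needle);
		if (r.partial && to < dot) {
			/*
			 * The candidate crosses a block boundary:
			 * extend the scanned area downwards until dot.
			 */
			to = dot;
			r = scan_window(buf, from, to, needle);
		}
		if (r.start >= 0)
			return r.start;

		to = from; /* nothing found, continue above this block */
	}
	return -1;
}
```

For example, in the 25-byte buffer "abc needle xyz needle end",
search_backward(buf, 25, "needle", 4) returns 15, while a dot of 20
cuts the second occurrence in half, so the result is 4. The sketch
simply rescans the extended window; a real engine could presumably
resume from the partial match instead.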

See also: https://git.fmsbw.de/terex/about/
---
 TODO | 27 ---------------------------
 1 file changed, 27 deletions(-)

(limited to 'TODO')

diff --git a/TODO b/TODO
index 6588fe7..b2de61c 100644
--- a/TODO
+++ b/TODO
@@ -74,33 +74,6 @@ Known Bugs:
    and b) the file mode and ownership of re-created files can be preserved.
    We should fall back silently to an (inefficient) memory copy or temporary
    file strategy if this is detected.
- * All backward searches from the end of excessively large files can be very
-   slow, especially in UTF mode, since you are always producing
-   all matches over the entire document.
-   Perhaps scan in 4kb blocks from dot upwards, but with partial matches.
-   When getting partial matches, the match falls on a block boundary and
-   we can extended the scanned area downwards until dot.
-   This currently doesn't work with glib's regexp (PCRE) since
-   g_match_info_fetch_pos() handles partial matches like errors.
-   Here's an upstream merge request to fix that:
-   https://gitlab.gnome.org/GNOME/glib/-/merge_requests/5199
- * Crashes on large files: S^EM^X$ (regexp: (?:.)+)
-   Happens because the Glib regex engine is based on a recursive (backtracking)
-   Perl regex library and glib doesn't expose pcre_extra.
-   We could include `(*LIMIT_RECURSION=d)` in the pattern, though.
-   I can provoke the problem only on Ubuntu 20.04.
-   We can try g_regex_match_all_full() which will use a DFA, but
-   it doesn't capture subexpressions.
-   We need something based on a non-backtracking Thompson's NFA with Unicode (UTF-8), see
-   https://swtch.com/~rsc/regexp/
-   Basically only RE2 would check all the boxes.
-   RE2 doesn't have a native C API, so we would also have to import the
-   https://github.com/marcomaggi/cre2/ wrapper.
-   re2 should be an optional dependency, so we can still build against the
-   glib APIs.
-   Optionally, I could build a PCRE-compatible wrapper for Rust's regex crate.
-   It would also be possible to port one of Henry Spencer's engines (hxrex or its
-   PosgreSQL derivation or the version from Vim) to UTF-8 and add it as a submodule.
  * It is still possible to hang searches on huge files since a single match
    could still scan too much memory - e.g. try searching for a word that
    occurs only at the end of the huge file.
-- 
cgit v1.2.3