| author | Robin Haberkorn <rhaberkorn@fmsbw.de> | 2026-06-28 00:39:51 +0200 |
|---|---|---|
| committer | Robin Haberkorn <rhaberkorn@fmsbw.de> | 2026-06-28 00:39:51 +0200 |
| commit | 4fe5bc6f3867096965270c90f2e1e5df77b8825f (patch) | |
| tree | 07823673c598cf4289ea0ae769c32924e1fcce10 /TODO | |
| parent | c5cb45fab6d4a63a4fcff2cf7f6801dae2ac4db2 (diff) | |
terex is now the regular expression engine and replaces PCRE (GRegex)
* terex is based on Henry Spencer's regular expression engine for Tcl.
It is a hybrid NFA/DFA design which has better worst-case runtimes than
the backtracking PCRE. Memory usage is also limited and can no longer
increase catastrophically.
* It should no longer be possible to crash SciTECO with pathological
searches.
* Since it reliably supports partial matches (REG_EXPECT) we can
now enable the new backwards-search algorithm by default.
This used to be broken because of a glib bug, which I have already
fixed. It would, however, take a long time until that fix ends up
on the majority of glib installations.
* Regexp executions can still be quite slow if you are looking
for a pattern at the end of a huge file, which can hang the editor,
but this can now at least theoretically be solved by adding
hooks into terex to poll for interruptions.
* We can now also get rid of the TECO-pattern-to-regexp translation
step by directly generating terex tokens (TODO).
* Performance-wise, terex appears to be slower than PCRE for simple
forward searches even when linking everything with optimizations (FIXME).
* Having a stand-alone regular expression engine is also a huge
step towards getting rid of glib.
See also: https://git.fmsbw.de/terex/about/
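
The block-wise backward search that partial matches make possible can be sketched roughly as follows. This is only an illustration under loud assumptions: `match_at()` is a hypothetical literal-string matcher standing in for the regex engine, and `search_backwards()`, `NOT_FOUND` and the block size parameter are made-up names. None of this is terex or SciTECO API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NOT_FOUND ((size_t)-1)

enum match_result { NO_MATCH, PARTIAL_MATCH, FULL_MATCH };

/*
 * Hypothetical stand-in for the regex engine: tries to match the
 * literal `pat` at text[pos] without reading at or beyond `limit`.
 * Running into `limit` while everything matched so far is reported
 * as a partial match.
 */
static enum match_result
match_at(const char *text, size_t pos, size_t limit, const char *pat)
{
	size_t len = strlen(pat);

	for (size_t i = 0; i < len; i++) {
		if (pos + i >= limit)
			return PARTIAL_MATCH;
		if (text[pos + i] != pat[i])
			return NO_MATCH;
	}
	return FULL_MATCH;
}

/*
 * Returns the start of the last match that ends at or before `dot`,
 * or NOT_FOUND. The text is scanned in blocks from dot upwards,
 * so a match close to dot is found without touching the rest
 * of the document.
 */
static size_t
search_backwards(const char *text, size_t dot, const char *pat,
                 size_t block_size)
{
	assert(block_size > 0 && pat[0] != '\0');

	for (size_t block_end = dot; block_end > 0; ) {
		size_t block_start = block_end > block_size
					? block_end - block_size : 0;

		/* try candidates from the block's end towards its start */
		for (size_t pos = block_end; pos-- > block_start; ) {
			enum match_result r =
				match_at(text, pos, block_end, pat);

			/*
			 * A partial match straddles the block boundary:
			 * retry with the window extended downwards to dot.
			 */
			if (r == PARTIAL_MATCH)
				r = match_at(text, pos, dot, pat);
			if (r == FULL_MATCH)
				return pos;
		}
		block_end = block_start;
	}
	return NOT_FOUND;
}
```

With a real regex engine the per-block scan would be a single regexp execution rather than a loop over candidate positions, but the boundary handling is the same idea: only a partial match forces the scanned area to grow.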
Diffstat (limited to 'TODO')
| -rw-r--r-- | TODO | 27 |
1 files changed, 0 insertions, 27 deletions
@@ -74,33 +74,6 @@ Known Bugs:
     and b) the file mode and ownership of re-created files can be preserved.
     We should fall back silently to an (inefficient) memory copy or temporary
     file strategy if this is detected.
-  * All backward searches from the end of excessively large files can be very
-    slow, especially in UTF mode, since you are always producing
-    all matches over the entire document.
-    Perhaps scan in 4kb blocks from dot upwards, but with partial matches.
-    When getting partial matches, the match falls on a block boundary and
-    we can extended the scanned area downwards until dot.
-    This currently doesn't work with glib's regexp (PCRE) since
-    g_match_info_fetch_pos() handles partial matches like errors.
-    Here's an upstream merge request to fix that:
-    https://gitlab.gnome.org/GNOME/glib/-/merge_requests/5199
-  * Crashes on large files: S^EM^X$ (regexp: (?:.)+)
-    Happens because the Glib regex engine is based on a recursive (backtracking)
-    Perl regex library and glib doesn't expose pcre_extra.
-    We could include `(*LIMIT_RECURSION=d)` in the pattern, though.
-    I can provoke the problem only on Ubuntu 20.04.
-    We can try g_regex_match_all_full() which will use a DFA, but
-    it doesn't capture subexpressions.
-    We need something based on a non-backtracking Thompson's NFA with Unicode (UTF-8), see
-    https://swtch.com/~rsc/regexp/
-    Basically only RE2 would check all the boxes.
-    RE2 doesn't have a native C API, so we would also have to import the
-    https://github.com/marcomaggi/cre2/ wrapper.
-    re2 should be an optional dependency, so we can still build against the
-    glib APIs.
-    Optionally, I could build a PCRE-compatible wrapper for Rust's regex crate.
-    It would also be possible to port one of Henry Spencer's engines (hxrex or its
-    PosgreSQL derivation or the version from Vim) to UTF-8 and add it as a submodule.
   * It is still possible to hang searches on huge files since a single match
     could still scan too much memory - e.g.
     try searching for a word that occurs only at the end of the huge file.
