| author | Robin Haberkorn <rhaberkorn@fmsbw.de> | 2026-06-28 00:39:51 +0200 |
|---|---|---|
| committer | Robin Haberkorn <rhaberkorn@fmsbw.de> | 2026-06-28 00:39:51 +0200 |
| commit | 4fe5bc6f3867096965270c90f2e1e5df77b8825f (patch) | |
| tree | 07823673c598cf4289ea0ae769c32924e1fcce10 | |
| parent | c5cb45fab6d4a63a4fcff2cf7f6801dae2ac4db2 (diff) | |
terex is now the regular expression engine and replaces PCRE (GRegex)
* terex is based on Henry Spencer's regular expression engine for Tcl.
It is a hybrid NFA/DFA design which has better worst-case runtimes than
the backtracking PCRE. Memory usage is also limited and can no longer
increase catastrophically.
* It should no longer be possible to crash SciTECO with pathological
searches.
* Since it reliably supports partial matches (REG_EXPECT) we can
now enable the new backwards-search algorithm by default.
This used to be broken because of a glib bug, which I already
fixed. It would, however, take a long time until the fix ends up
on the majority of glib installations.
* Regexp executions can still be quite slow if you are looking
for a pattern at the end of a huge file, which can hang the editor,
but this can now at least theoretically be solved by adding
hooks into terex to poll for interruptions.
* We can now also get rid of the TECO-pattern-to-regexp translation
step by directly generating terex tokens (TODO).
* Performance-wise, terex appears to be slower than PCRE for simple
forward searches even when linking everything with optimizations (FIXME).
* Having a stand-alone regular expression engine is also a huge
step towards getting rid of glib.
See also: https://git.fmsbw.de/terex/about/
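
The backwards search mentioned above scans forward and remembers only the last N matches in a ring buffer (the `matched->matches` array in `teco_do_search_backwards()`). A minimal sketch of that idea follows. It uses the POSIX `regcomp()`/`regexec()` API as a stand-in rather than terex, `range_t` and `last_matches()` are hypothetical names, and the block-wise scanning with partial matches is deliberately left out.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical type: a match as absolute byte offsets [from, to). */
typedef struct {
	size_t from, to;
} range_t;

/*
 * Scan `text` from left to right and keep only the last `n` matches
 * in `ring`. Returns the total number of matches seen or -1 on error.
 * The k-th match from the end (k = 1..n) is ring[(total-k) % n],
 * provided that total >= k.
 * NOTE: `re` must have been compiled without REG_NOSUB.
 */
static int
last_matches(const regex_t *re, const char *text, range_t *ring, size_t n)
{
	size_t offset = 0;
	int total = 0;

	assert(n > 0);

	for (;;) {
		regmatch_t m;
		/* when resuming mid-string, `^` must not match at the resume position */
		int rc = regexec(re, text + offset, 1, &m, offset ? REG_NOTBOL : 0);
		if (rc == REG_NOMATCH)
			break;
		if (rc != 0)
			return -1;

		/* overwrite the oldest slot */
		ring[total % n].from = offset + m.rm_so;
		ring[total % n].to = offset + m.rm_eo;
		total++;

		if (m.rm_eo > m.rm_so) {
			offset += m.rm_eo;
		} else {
			/* empty match: step over one byte to guarantee progress */
			if (text[offset + m.rm_eo] == '\0')
				break;
			offset += m.rm_eo + 1;
		}
	}

	return total;
}
```

With a ring of size 2, searching for `BC` in `ABCD-BC-xBCx` visits three matches, and the second-to-last one is found at `ring[(total-2) % 2]`. Scanning the whole range this way is what becomes slow on huge files, which is presumably why the commit scans in blocks from the end instead.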
| -rw-r--r-- | .gitmodules | 3 |
| -rw-r--r-- | Makefile.am | 3 |
| -rw-r--r-- | TODO | 27 |
| -rw-r--r-- | configure.ac | 1 |
| m--------- | contrib/terex | 0 |
| -rw-r--r-- | debian/copyright | 29 |
| -rw-r--r-- | src/Makefile.am | 5 |
| -rw-r--r-- | src/core-commands.c | 3 |
| -rw-r--r-- | src/error.h | 1 |
| -rw-r--r-- | src/search.c | 254 |
| -rw-r--r-- | tests/testsuite.at | 36 |
11 files changed, 187 insertions, 175 deletions
diff --git a/.gitmodules b/.gitmodules
index af9fd68..d825212 100644
--- a/.gitmodules
+++ b/.gitmodules
@@ -8,3 +8,6 @@
 [submodule "lexilla"]
 	path = contrib/lexilla
 	url = https://github.com/ScintillaOrg/lexilla.git
+[submodule "terex"]
+	path = contrib/terex
+	url = git://git.fmsbw.de/terex
diff --git a/Makefile.am b/Makefile.am
index 8284bc1..e956878 100644
--- a/Makefile.am
+++ b/Makefile.am
@@ -5,7 +5,8 @@ ACLOCAL_AMFLAGS = -I m4
 if REPLACE_MALLOC
 MAYBE_DLMALLOC = contrib/dlmalloc
 endif
-SUBDIRS = lib $(MAYBE_DLMALLOC) contrib/rb3ptr src doc tests
+SUBDIRS = lib $(MAYBE_DLMALLOC) contrib/rb3ptr contrib/terex \
+          src doc tests
 
 dist_scitecodata_DATA = fallback.teco_ini
 
@@ -74,33 +74,6 @@ Known Bugs:
     and b) the file mode and ownership of re-created files can be preserved.
     We should fall back silently to an (inefficient) memory copy or temporary
     file strategy if this is detected.
-  * All backward searches from the end of excessively large files can be very
-    slow, especially in UTF mode, since you are always producing
-    all matches over the entire document.
-    Perhaps scan in 4kb blocks from dot upwards, but with partial matches.
-    When getting partial matches, the match falls on a block boundary and
-    we can extended the scanned area downwards until dot.
-    This currently doesn't work with glib's regexp (PCRE) since
-    g_match_info_fetch_pos() handles partial matches like errors.
-    Here's an upstream merge request to fix that:
-    https://gitlab.gnome.org/GNOME/glib/-/merge_requests/5199
-  * Crashes on large files: S^EM^X$ (regexp: (?:.)+)
-    Happens because the Glib regex engine is based on a recursive (backtracking)
-    Perl regex library and glib doesn't expose pcre_extra.
-    We could include `(*LIMIT_RECURSION=d)` in the pattern, though.
-    I can provoke the problem only on Ubuntu 20.04.
-    We can try g_regex_match_all_full() which will use a DFA, but
-    it doesn't capture subexpressions.
-    We need something based on a non-backtracking Thompson's NFA with Unicode (UTF-8), see
-    https://swtch.com/~rsc/regexp/
-    Basically only RE2 would check all the boxes.
-    RE2 doesn't have a native C API, so we would also have to import the
-    https://github.com/marcomaggi/cre2/ wrapper.
-    re2 should be an optional dependency, so we can still build against the
-    glib APIs.
-    Optionally, I could build a PCRE-compatible wrapper for Rust's regex crate.
-    It would also be possible to port one of Henry Spencer's engines (hxrex or its
-    PosgreSQL derivation or the version from Vim) to UTF-8 and add it as a submodule.
   * It is still possible to hang searches on huge files since a single match could still
     scan too much memory - e.g. try searching for a word that occurs only at the end
     of the huge file.
diff --git a/configure.ac b/configure.ac
index 43b4d1b..09d0804 100644
--- a/configure.ac
+++ b/configure.ac
@@ -545,6 +545,7 @@ AC_CONFIG_FILES([GNUmakefile:Makefile.in src/GNUmakefile:src/Makefile.in]
 	[src/interface-curses/GNUmakefile:src/interface-curses/Makefile.in]
 	[contrib/dlmalloc/GNUmakefile:contrib/dlmalloc/Makefile.in]
 	[contrib/rb3ptr/GNUmakefile:contrib/rb3ptr/Makefile.in]
+	[contrib/terex/GNUmakefile:contrib/terex/Makefile.in]
 	[lib/GNUmakefile:lib/Makefile.in]
 	[doc/GNUmakefile:doc/Makefile.in doc/Doxyfile]
 	[tests/GNUmakefile:tests/Makefile.in tests/atlocal])
diff --git a/contrib/terex b/contrib/terex
new file mode 160000
+Subproject fa3d463a4cd563f3c5f29331f48a0161bf58686
diff --git a/debian/copyright b/debian/copyright
index b8908ae..442c3a3 100644
--- a/debian/copyright
+++ b/debian/copyright
@@ -33,6 +33,35 @@ License: MIT
  IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
  CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
 
+Files: contrib/terex/*.c contrib/terex/*.h
+Copyright: Copyright (c) 1998, 1999 Henry Spencer. All rights reserved.
+License:
+ Copyright (c) 1998, 1999 Henry Spencer. All rights reserved.
+ .
+ Development of this software was funded, in part, by Cray Research Inc.,
+ UUNET Communications Services Inc., Sun Microsystems Inc., and Scriptics
+ Corporation, none of whom are responsible for the results. The author
+ thanks all of them.
+ .
+ Redistribution and use in source and binary forms -- with or without
+ modification -- are permitted for any purpose, provided that
+ redistributions in source form retain this entire copyright notice and
+ indicate the origin and nature of any modifications.
+ .
+ I'd appreciate being given credit for this package in the documentation of
+ software which uses it, but that is not a requirement.
+ .
+ THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES,
+ INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
+ AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL
+ HENRY SPENCER BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+ EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+ PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
+ OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
+ WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR
+ OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
+ ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
 Files: contrib/scintilla/* contrib/lexilla/*
 Copyright: Copyright 1998-2021 Neil Hodgson <neilh@scintilla.org>
 License: MIT-Hodgson
diff --git a/src/Makefile.am b/src/Makefile.am
index ff2e86b..8ac58c7 100644
--- a/src/Makefile.am
+++ b/src/Makefile.am
@@ -13,7 +13,7 @@ include $(top_srcdir)/contrib/scintilla.am
 # FIXME: Common flags should be in configure.ac
 AM_CFLAGS = -std=gnu11 -Wall -Wno-initializer-overrides -Wno-unused-value
 
-AM_CPPFLAGS += -I$(top_srcdir)/contrib/rb3ptr
+AM_CPPFLAGS += -I$(top_srcdir)/contrib/rb3ptr -I$(top_srcdir)/contrib/terex
 
 AM_LDFLAGS =
 if STATIC_EXECUTABLES
@@ -60,7 +60,8 @@ libsciteco_base_la_SOURCES = main.c sciteco.h list.h \
 # NOTE: We cannot link in Scintilla (static library) into
 # a libtool convenience library
 libsciteco_base_la_LIBADD = $(LIBSCITECO_INTERFACE) \
-                            $(top_builddir)/contrib/rb3ptr/librb3ptr.la
+                            $(top_builddir)/contrib/rb3ptr/librb3ptr.la \
+                            $(top_builddir)/contrib/terex/libterex.la
 if REPLACE_MALLOC
 libsciteco_base_la_LIBADD += $(top_builddir)/contrib/dlmalloc/libdlmalloc.la
 endif
diff --git a/src/core-commands.c b/src/core-commands.c
index 5ca508c..81d5869 100644
--- a/src/core-commands.c
+++ b/src/core-commands.c
@@ -2224,8 +2224,7 @@ teco_state_ecommand_flags(teco_machine_main_t *ctx, GError **error)
  * Only this setting guarantees leftmost longest matches that
  * are entirely symmetric to forward searches, but can be
  * unpractically slow on huge files.
- * The default is 0.
- * \# FIXME: Feature is currently broken!
+ * The default is 4kb.
  * .
  * .IP -1:
  * Type of the last mouse event (\fBread-only\fP).
diff --git a/src/error.h b/src/error.h
index 67de4aa..3d4334f 100644
--- a/src/error.h
+++ b/src/error.h
@@ -57,6 +57,7 @@ typedef enum {
 	TECO_ERROR_CLIPBOARD,
 	TECO_ERROR_WIN32,
 	TECO_ERROR_MODULE,
+	TECO_ERROR_REGEX,
 
 	/** Interrupt current operation */
 	TECO_ERROR_INTERRUPTED,
diff --git a/src/search.c b/src/search.c
index 90975c2..601cc55 100644
--- a/src/search.c
+++ b/src/search.c
@@ -25,6 +25,10 @@
 #include <glib.h>
 #include <glib/gprintf.h>
 
+/* should always be from contrib/terex */
+#include <regalone.h>
+#include <regex.h>
+
 #include "sciteco.h"
 #include "string-utils.h"
 #include "expressions.h"
@@ -57,6 +61,18 @@ TECO_DEFINE_UNDO_SCALAR(teco_search_parameters_t);
  */
 static teco_search_parameters_t teco_search_parameters;
 
+G_DEFINE_AUTO_CLEANUP_CLEAR_FUNC(regex_t, tere_free);
+
+/* not in error.h since we don't want to draw in the terex headers */
+static inline void
+teco_error_regex_set(GError **error, gint rc, const regex_t *re)
+{
+	gchar buf[1024];
+	tere_error(rc, re, buf, sizeof(buf));
+	g_set_error(error, TECO_ERROR, TECO_ERROR_REGEX,
+	            "Error executing regular expression: %s", buf);
+}
+
 /*$ "^X" "search mode"
  * mode^X -- Set or get search mode flag
  * -^X
@@ -551,24 +567,21 @@ TECO_DEFINE_UNDO_OBJECT_OWN(ranges, teco_range_t *, g_free);
 /**
  * Extract the ranges of the given GMatchInfo.
  *
- * @param match_info The result of g_regex_match().
+ * @param match_info The result of re_exec().
+ * @param count Number of matches (subpatterns).
  * @param offset The beginning of the match operation in bytes.
  *               Match results will be relative to this offset.
- * @param count Where to store the number of ranges (subpatterns).
  * @returns Ranges (subpatterns) in absolute byte positions.
  *          They \b must still be converted to glyph positions afterwards.
  */
 static teco_range_t *
-teco_get_ranges(const GMatchInfo *match_info, gsize offset, guint *count)
+teco_get_ranges(const regmatch_t *match_info, guint count, gsize offset)
 {
-	*count = g_match_info_get_match_count(match_info);
-	teco_range_t *ranges = g_new(teco_range_t, *count);
-
-	for (gint i = 0; i < *count; i++) {
-		gint from, to;
-		g_match_info_fetch_pos(match_info, i, &from, &to);
-		ranges[i].from = offset+MAX(from, 0);
-		ranges[i].to = offset+MAX(to, 0);
+	teco_range_t *ranges = g_new(teco_range_t, count);
+
+	for (gint i = 0; i < count; i++) {
+		ranges[i].from = offset+match_info[i].rm_so;
+		ranges[i].to = offset+match_info[i].rm_eo;
 	}
 
 	return ranges;
@@ -613,32 +626,26 @@ G_DEFINE_AUTOPTR_CLEANUP_FUNC(teco_matches_t, teco_matches_free);
  * @return FALSE if an error occurred
  */
 static gboolean
-teco_do_search_forward(GRegex *re, gsize from, gsize to, gint *count, GError **error)
+teco_do_search_forward(regex_t *re, gsize from, gsize to, gint *count, GError **error)
 {
-	g_autoptr(GMatchInfo) info = NULL;
 	/* NOTE: can return NULL pointer for completely new and empty documents */
 	const gchar *buffer = (const gchar *)teco_interface_ssm(SCI_GETRANGEPOINTER, from, to-from) ? : "";
-	GError *tmp_error = NULL;
+
+	g_assert(*count > 0);
 
 	/*
-	 * NOTE: The return boolean does NOT signal whether an error was generated.
+	 * FIXME: Repeated allocation could be avoided when scanning over buffer boundaries.
+	 * If it's worth it...
 	 */
-	g_regex_match_full(re, buffer, to-from, 0, 0, &info, &tmp_error);
-	if (tmp_error) {
-		g_propagate_error(error, tmp_error);
-		return FALSE;
-	}
+	g_autofree regmatch_t *info = g_new(regmatch_t, 1+re->re_nsub);
 
-	g_assert(*count > 0);
-	while (g_match_info_matches(info) && --(*count)) {
-		/*
-		 * NOTE: The return boolean does NOT signal whether an error was generated.
-		 */
-		g_match_info_next(info, &tmp_error);
-		if (tmp_error) {
-			g_propagate_error(error, tmp_error);
-			return FALSE;
-		}
+	static const gint eflags = REG_NOTEOL | REG_NOTBOL;
+
+	gint rc;
+	while ((rc = tere_exec(re, (const chr *)buffer, to-from, NULL,
+	                       1+re->re_nsub, info, eflags)) == REG_OKAY && --(*count)) {
+		buffer += info[0].rm_eo;
+		from += info[0].rm_eo;
 
 		/*
 		 * FIXME: A single pathological match could already be excessively slow.
@@ -649,22 +656,21 @@ teco_do_search_forward(GRegex *re, gsize from, gsize to, gint *count, GError **e
 		}
 	}
 
-	if (!*count) {
+	if (rc == REG_OKAY) {
 		/* successful */
-		teco_undo_guint(teco_ranges_count);
-		teco_undo_ranges_own(teco_ranges) = teco_get_ranges(info, from, &teco_ranges_count);
+		g_assert(*count == 0);
+		teco_undo_guint(teco_ranges_count) = 1+re->re_nsub;
+		teco_undo_ranges_own(teco_ranges) = teco_get_ranges(info, teco_ranges_count, from);
+	} else if (rc != REG_NOMATCH) {
+		teco_error_regex_set(error, rc, re);
+		return FALSE;
 	}
 
 	return TRUE;
 }
 
-/**
- * Block size for backwards scanning or 0
- *
- * @bug Block-wise matching is currently broken,
- * so we disable this by default - see below.
- */
-gsize teco_search_block_size = 0; //4*1024;
+/** block size for backwards scanning or 0 */
+gsize teco_search_block_size = 4*1024;
 
 /**
  * Search backwards, in blocks of teco_search_block_size
@@ -681,7 +687,7 @@ gsize teco_search_block_size = 0; //4*1024;
  * @see teco_do_search_forward
  */
 static gboolean
-teco_do_search_backwards(GRegex *re, gsize from, gsize to, gint *count, GError **error)
+teco_do_search_backwards(regex_t *re, gsize from, gsize to, gint *count, GError **error)
 {
 	/*
 	 * NOTE: can return NULL pointer for completely new and empty documents.
@@ -692,6 +698,7 @@ teco_do_search_backwards(GRegex *re, gsize from, gsize to, gint *count, GError *
 	const gchar *buffer = (const gchar *)teco_interface_ssm(SCI_GETRANGEPOINTER, from, to-from) ? : "";
 
 	g_assert(*count < 0);
+
 	guint matched_num = -*count;
 
 	gsize total_size = sizeof(teco_matches_t) + sizeof(teco_match_t[matched_num]);
@@ -713,98 +720,83 @@ teco_do_search_backwards(GRegex *re, gsize from, gsize to, gint *count, GError *
 	if (!teco_memory_check(total_size, error))
 		return FALSE;
 
+	/*
+	 * FIXME: The `matched` and `info` allocations are repeated when scanning
+	 * over buffer boundaries and could be avoided by sharing them between
+	 * teco_do_search() calls. If it's worth it...
+	 */
 	g_autoptr(teco_matches_t) matched = g_malloc0(total_size);
 	matched->count = matched_num;
 
+	g_autofree regmatch_t *info = g_new(regmatch_t, 1+re->re_nsub);
+
 	gint matched_total = 0;
 	gint i = 0; /* ring buffer pointer into the `matched->matches` array */
 
 	gsize to_block = to-from;
 	while (to_block > 0) {
-		g_autoptr(GMatchInfo) info = NULL;
-
 		gsize from_block = teco_search_block_size > 0
 					? MAX(0, to_block - teco_search_block_size) : 0;
 
-		/*
-		 * FIXME: DEC TECO search semantics could actually demand
-		 * allowing matches to extend beyond the [from,to] range.
-		 */
-		GRegexMatchFlags flags = to_block != to-from ? G_REGEX_MATCH_PARTIAL_HARD : 0;
+		/* how many bytes have been consumed in the current block */
+		gsize offset = 0;
 
-		GError *tmp_error = NULL;
+		static const gint eflags = REG_NOTEOL | REG_NOTBOL;
+		/* for partial matches - mandatory when using REG_EXPECT */
+		rm_detail_t details;
 
-		/*
-		 * NOTE: The return boolean does NOT signal whether an error was generated.
-		 * FIXME: Why isn't it possible to specify a start_position != 0?
-		 */
-		g_regex_match_full(re, buffer+from_block, to_block-from_block, 0,
-		                   flags, &info, &tmp_error);
-		if (tmp_error) {
-			g_propagate_error(error, tmp_error);
-			return FALSE;
-		}
+		gint rc;
 
 		for (;;) {
+			/*
+			 * FIXME: A single pathological match could already be excessively slow.
+			 */
 			if (G_UNLIKELY(teco_interface_is_interrupted())) {
 				teco_error_interrupted_set(error);
 				return FALSE;
 			}
 
-			if (g_match_info_matches(info)) {
-				g_free(matched->matches[i].ranges);
-				matched->matches[i].ranges = teco_get_ranges(info, from+from_block,
-				                                             &matched->matches[i].num_ranges);
-				i = ++matched_total % matched_num;
-			} else if (G_UNLIKELY(g_match_info_is_partial_match(info))) {
-				/*
-				 * Match may fall on the block boundary,
-				 * so retry matching the rest of the document.
-				 * This is the only case where we have to rescan
-				 * the same memory more than once.
-				 *
-				 * FIXME FIXME FIXME: We cannot retrieve the position here
-				 * since g_match_info_fetch_pos() treats partial matches as errors.
-				 * This is a confirmed glib bug and fast backwards searches
-				 * will continue to be broken until we switch to a custom regexp
-				 * engine.
-				 */
-				gint partial_start, partial_end;
-				G_GNUC_UNUSED gboolean rc;
-				rc = g_match_info_fetch_pos(info, 0, &partial_start, &partial_end);
-				//g_assert(rc == TRUE);
-				if (!rc)
-					/* make sure that test case fails */
-					abort();
-				g_assert(partial_end == to_block-from_block);
-
-				g_autoptr(GMatchInfo) partial_info = NULL;
-
-				g_regex_match_full(re, buffer+partial_start, to-from-partial_start, 0,
-				                   G_REGEX_MATCH_ANCHORED, &partial_info, &tmp_error);
-				if (tmp_error) {
-					g_propagate_error(error, tmp_error);
-					return FALSE;
-				}
+			rc = tere_exec(re, (const chr *)buffer+from_block+offset, to_block-from_block-offset,
+			               &details, 1+re->re_nsub, info, eflags);
+			if (rc != REG_OKAY)
+				break;
 
-				if (g_match_info_matches(partial_info)) {
-					g_free(matched->matches[i].ranges);
-					matched->matches[i].ranges = teco_get_ranges(partial_info, from+partial_start,
-					                                             &matched->matches[i].num_ranges);
-					i = ++matched_total % matched_num;
-				}
+			/* normal full match */
+			g_free(matched->matches[i].ranges);
+			matched->matches[i].num_ranges = 1+re->re_nsub;
+			matched->matches[i].ranges = teco_get_ranges(info, matched->matches[i].num_ranges,
+			                                             from+from_block+offset);
+			i = ++matched_total % matched_num;
 
-				/* there might still be other matches within the current block */
-			} else {
-				break;
-			}
+			offset += info[0].rm_eo;
+		}
+
+		if (rc != REG_NOMATCH) {
+			teco_error_regex_set(error, rc, re);
+			return FALSE;
+		}
 
+		if (G_UNLIKELY(to_block != to-from &&
+		               details.rm_extend.rm_eo == to_block-from_block-offset)) {
 			/*
-			 * NOTE: The return boolean does NOT signal whether an error was generated.
+			 * Match may fall on the block boundary,
+			 * so retry matching the rest of the document.
+			 * This is the only case where we have to rescan
+			 * the same memory more than once.
 			 */
-			g_match_info_next(info, &tmp_error);
-			if (tmp_error) {
-				g_propagate_error(error, tmp_error);
+			gsize partial_start = from_block+offset+details.rm_extend.rm_so;
+
+			rc = tere_exec(re, (const chr *)buffer+partial_start, to-from-partial_start,
+			               &details, 1+re->re_nsub, info, eflags | REG_ANCHORED);
+
+			if (rc == REG_OKAY) {
+				g_free(matched->matches[i].ranges);
+				matched->matches[i].num_ranges = 1+re->re_nsub;
+				matched->matches[i].ranges = teco_get_ranges(info, matched->matches[i].num_ranges,
+				                                             from+partial_start);
+				i = ++matched_total % matched_num;
+			} else if (rc != REG_NOMATCH) {
+				teco_error_regex_set(error, rc, re);
 				return FALSE;
 			}
 		}
@@ -831,6 +823,7 @@ teco_do_search_backwards(GRegex *re, gsize from, gsize to, gint *count, GError *
 		matched_num -= matched_total;
 		i = 0;
 
+		/* try previous block */
 		to_block = from_block;
 	}
 
@@ -840,7 +833,7 @@ teco_do_search_backwards(GRegex *re, gsize from, gsize to, gint *count, GError *
 }
 
 static gboolean
-teco_do_search(GRegex *re, gsize from, gsize to, gint *count, GError **error)
+teco_do_search(regex_t *re, gsize from, gsize to, gint *count, GError **error)
 {
 	gboolean rc = *count >= 0 ? teco_do_search_forward(re, from, to, count, error)
 	                          : teco_do_search_backwards(re, from, to, count, error);
@@ -871,8 +864,7 @@ teco_do_search(GRegex *re, gsize from, gsize to, gint *count, GError **error)
 static gboolean
 teco_state_search_process(teco_machine_main_t *ctx, teco_string_t str, gsize new_chars, GError **error)
 {
-	/* FIXME: Should G_REGEX_OPTIMIZE be added under certain circumstances? */
-	GRegexCompileFlags flags = G_REGEX_MULTILINE | G_REGEX_DOTALL;
+	gint cflags = REG_ADVANCED;
 
 	teco_qreg_t *reg = teco_qreg_table_find(ctx->qreg_table_locals, "\x18", 1); /* ^X */
 	g_assert(reg != NULL);
@@ -880,15 +872,24 @@ teco_state_search_process(teco_machine_main_t *ctx, teco_string_t str, gsize new
 	if (!reg->vtable->get_integer(reg, &search_mode, error))
 		return FALSE;
 	if (teco_is_failure(search_mode))
-		flags |= G_REGEX_CASELESS;
+		cflags |= REG_ICASE;
 
 	if (ctx->flags.modifier_colon == 2)
-		flags |= G_REGEX_ANCHORED;
+		cflags |= REG_BOSONLY; /* anchored */
+
+	gint count = teco_search_parameters.count;
+
+	/*
+	 * Backwards searches require partial match information.
+	 * Fortunately, it appears to be almost for free.
+	 */
+	if (count < 0)
+		cflags |= REG_EXPECT;
 
 	/* this is set in teco_state_search_initial() */
 	if (ctx->expectstring.machine.codepage != SC_CP_UTF8) {
 		/* single byte encoding */
-		flags |= G_REGEX_RAW;
+		cflags |= REG_RAW;
 	} else if (!teco_string_validate_utf8(str)) {
 		/*
 		 * While SciTECO code is always guaranteed to be in valid UTF-8,
@@ -913,7 +914,6 @@ teco_state_search_process(teco_machine_main_t *ctx, teco_string_t str, gsize new
 	g_autoptr(teco_machine_qregspec_t) qreg_machine;
 	qreg_machine = teco_machine_qregspec_new(TECO_QREG_REQUIRED, ctx->qreg_table_locals, FALSE);
 
-	g_autoptr(GRegex) re = NULL;
 	g_autofree gchar *re_pattern;
 	/* NOTE: teco_pattern2regexp() modifies str pointer */
 	re_pattern = teco_pattern2regexp(&str, qreg_machine,
@@ -923,13 +923,19 @@ teco_state_search_process(teco_machine_main_t *ctx, teco_string_t str, gsize new
 #ifdef DEBUG
 	g_printf("REGEXP: %s\n", re_pattern);
 #endif
+
+	g_auto(regex_t) re;
+	memset(&re, 0, sizeof(re));
+
 	if (!*re_pattern)
 		goto failure;
+
 	/*
-	 * FIXME: Should we propagate at least some of the errors?
+	 * FIXME: No need to escape null-chars in re_pattern.
+	 * Actually no need to generate a regexp for TECO patterns.
 	 */
-	re = g_regex_new(re_pattern, flags, 0, NULL);
-	if (!re)
+	gint rc = tere_comp(&re, (chr *)re_pattern, strlen(re_pattern), cflags);
+	if (rc != REG_OKAY)
 		goto failure;
 
 	if (!teco_qreg_current &&
@@ -938,9 +944,7 @@ teco_state_search_process(teco_machine_main_t *ctx, teco_string_t str, gsize new
 		teco_buffer_edit(teco_search_parameters.from_buffer);
 	}
 
-	gint count = teco_search_parameters.count;
-
-	if (!teco_do_search(re, teco_search_parameters.from, teco_search_parameters.to, &count, error))
+	if (!teco_do_search(&re, teco_search_parameters.from, teco_search_parameters.to, &count, error))
 		return FALSE;
 
 	if (teco_search_parameters.to_buffer && count) {
@@ -956,12 +960,12 @@ teco_state_search_process(teco_machine_main_t *ctx, teco_string_t str, gsize new
 				teco_buffer_edit(buffer);
 
 				if (buffer == teco_search_parameters.to_buffer) {
-					if (!teco_do_search(re, 0, teco_search_parameters.dot, &count, error))
+					if (!teco_do_search(&re, 0, teco_search_parameters.dot, &count, error))
 						return FALSE;
 					break;
 				}
 
-				if (!teco_do_search(re, 0, teco_interface_ssm(SCI_GETLENGTH, 0, 0),
+				if (!teco_do_search(&re, 0, teco_interface_ssm(SCI_GETLENGTH, 0, 0),
 				                    &count, error))
 					return FALSE;
 			} while (count);
@@ -972,14 +976,14 @@ teco_state_search_process(teco_machine_main_t *ctx, teco_string_t str, gsize new
 				teco_buffer_edit(buffer);
 
 				if (buffer == teco_search_parameters.to_buffer) {
-					if (!teco_do_search(re, teco_search_parameters.dot,
+					if (!teco_do_search(&re, teco_search_parameters.dot,
 					                    teco_interface_ssm(SCI_GETLENGTH, 0, 0),
 					                    &count, error))
 						return FALSE;
 					break;
 				}
 
-				if (!teco_do_search(re, 0, teco_interface_ssm(SCI_GETLENGTH, 0, 0),
+				if (!teco_do_search(&re, 0, teco_interface_ssm(SCI_GETLENGTH, 0, 0),
 				                    &count, error))
 					return FALSE;
 			} while (count);
diff --git a/tests/testsuite.at b/tests/testsuite.at
index fc8ab37..a97e0f8 100644
--- a/tests/testsuite.at
+++ b/tests/testsuite.at
@@ -519,6 +519,24 @@ AT_SETUP([Search accesses wrong Q-Register table])
 TE_CHECK([[@^U.#xx/123/ @^Um{:@S/^EG.#xx/$} :Mm Mm]], 1, ignore, ignore)
 AT_CLEANUP
 
+# NOTE: This used to be a bug in the old GRegex-based implementation,
+# which surfaced only with specific build options of Glib's
+# PCRE which was not predictable.
+# It segfaulted at least on Ubuntu 20.04 (libpcre3 v2:8.39).
+# It could fail because the memory limit is exceeed,
+# but not in this case since the match string isn't too large.
+AT_SETUP([Pattern matching overflow])
+# NOTE: Creating very long lines would currently be ineffective
+# at least in UTF-8 mode.
+TE_CHECK([[100000<@I"^J">J @S"^EM^X"]], 0, ignore, ignore)
+AT_CLEANUP
+
+AT_SETUP([Block-wise backwards search])
+# Failed when using GRegex (PCRE), which had broken support for partial matches.
+# This is not an issue with terex.
+TE_CHECK([[2,8EJ @I/ABCD/ -:@S/BC/"F(0/0)' .-3"N(0/0)' ^S+2"N(0/0)']], 0, ignore, ignore)
+AT_CLEANUP
+
 AT_SETUP([Invalid buffer ids])
 TE_CHECK([[42@EB//]], 1, ignore, ignore)
 TE_CHECK([[23@EW//]], 1, ignore, ignore)
@@ -659,24 +677,6 @@ TE_CHECK([[| (0/0) ']], 1, ignore, ignore)
 AT_XFAIL_IF(true)
 AT_CLEANUP
 
-# NOTE: This bug depends on specific build options of Glib's
-# PCRE which is not predictable.
-# It segfaults at least on Ubuntu 20.04 (libpcre3 v2:8.39).
-#AT_SETUP([Pattern matching overflow])
-## Should no longer dump core.
-## It could fail because the memory limit is exceeed,
-## but not in this case since the match string isn't too large.
-#TE_CHECK([[100000<@I"X">J @S"^EM^X"]], 0, ignore, ignore)
-#AT_XFAIL_IF(true)
-#AT_CLEANUP
-
-AT_SETUP([Block-wise backwards search])
-# Crashes are caused by a glib bug when a match falls on block boundaries.
-# See teco_do_search_backwards()
-TE_CHECK([[2,8EJ @I/ABCD/ -:@S/BC/"F(0/0)' .-3"N(0/0)' ^S+2"N(0/0)']], 0, ignore, ignore)
-AT_XFAIL_IF(true)
-AT_CLEANUP
-
 AT_SETUP([Backtracking in patterns])
 # ^ES should be greedy and posessive
 TE_CHECK([[@I/ /J :@S/^ES^X/"S(0/0)']], 0, ignore, ignore)
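
The recurring `rc != REG_NOMATCH` checks in `src/search.c` separate "pattern not found" from genuine engine errors, which `teco_error_regex_set()` then turns into a message. As a rough illustration of that three-way distinction, the sketch below uses POSIX `regcomp()`, `regexec()` and `regerror()` in place of the terex functions, whose exact signatures are not assumed here. `search_result_t` and `search_once()` are made-up names.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical result type: found, not found, or a real error. */
typedef enum {
	SEARCH_FOUND,
	SEARCH_NOT_FOUND,
	SEARCH_ERROR
} search_result_t;

/*
 * Compile `pattern` and match it once against `text`.
 * On SEARCH_ERROR, `errbuf` contains a human-readable message.
 */
static search_result_t
search_once(const char *pattern, const char *text, char *errbuf, size_t errlen)
{
	regex_t re;

	assert(errlen > 0);

	int rc = regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB);
	if (rc != 0) {
		/* compilation failed: `re` holds nothing that needs freeing */
		regerror(rc, &re, errbuf, errlen);
		return SEARCH_ERROR;
	}

	rc = regexec(&re, text, 0, NULL, 0);
	/* REG_NOMATCH is an ordinary outcome, everything else is an error */
	if (rc != 0 && rc != REG_NOMATCH)
		regerror(rc, &re, errbuf, errlen);
	regfree(&re);

	if (rc == 0)
		return SEARCH_FOUND;
	return rc == REG_NOMATCH ? SEARCH_NOT_FOUND : SEARCH_ERROR;
}
```

Treating "no match" as a normal result rather than an error mirrors how the search functions in this commit return `TRUE` without setting a `GError` when nothing is found.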
