Source: github.com/garyhouston/hsrex
* This version should be a Thompson NFA, using backtracking only
for backreferences, so it should be much safer than PCRE (GRegex).
Search times should be linear and there should be no way to cause
stack overflows (unless we generate backreferences).
* Importing (vendoring) the library means we don't add another compile-time
dependency. Also, we could implement our own regcomp() which
translates directly from TECO patterns.
* This is still WIP and currently only works with the ASCII version.
The widechar version does not define re_comp() and re_exec()
(see the sketch after this list).
* Apparently we can't have an ASCII and widechar version at the same time,
so we must build two libtool libraries and somehow mangle the names.
* Ideally the widechar version will also work with UTF-8 strings.
* An alternative might be to import the Gnulib regex module.
How does it choose the encoding anyway?
* Or we could just use Oniguruma - but this would have to be a new
external library dependency.
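For reference, the classic BSD interface that the bundled library exposes
(and that the widechar build lacks) looks roughly as follows. This is a
minimal sketch; the exact header declaring these functions varies between
systems.

    /* re_comp() compiles into a hidden global buffer and returns NULL on
     * success or an error string; re_exec() matches against that buffer
     * and returns 1 (match), 0 (no match) or -1 (error). */
    #include <stdio.h>

    char *re_comp(const char *regex);
    int re_exec(const char *string);

    int main(void)
    {
        char *err = re_comp("ab*c");
        if (err != NULL)
            fprintf(stderr, "re_comp: %s\n", err);
        else if (re_exec("xxabbbc") == 1)
            puts("match");
        return 0;
    }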
* There are patches on top of Scintilla, as before.
* Scinterm has been switched back to the upstream repository, which contains
unreleased commits - especially for out-of-tree builds.
* Lexilla hasn't been released since my troff lexer was merged.
* This fixes F< jumping to the beginning of the macro, which was broken in 73d574b71a10d4661ada20275cafde75aff6c1ba.
teco_machine_main_t::macro_pc actually has to be signed as it is sometimes set to -1.
errors
* teco_cmdline.pc is not correct after an error has occurred.
Therefore start_pc is initialized with teco_cmdline.effective_len.
parsed correctly
* ALL keypresses (the UTF-8 sequences resulting from key presses) can now be remapped.
* This is especially useful with Unicode support, as you might want to alias
international characters to their corresponding Latin form in the start state,
so you don't have to change keyboard layouts so often.
This is done automatically in Gtk, where we have hardware key press information,
but has to be done with key macros in Curses.
There is a new key mask 4 (bit 3) for that purpose now.
* Also, you might want to define non-ANSI letters to perform special functions in
the start state, where they wouldn't be accepted by the parser anyway.
Suppose you have a macro M→; you could define
@^U[^K→]{m→} 1^_U[^K→]
This effectively "extends" the parser and allows you to call macro "→" with a single
key press. See also #5.
* The register prefix has been changed from ^F (for function) to ^K (for key).
This is the only thing you have to change in order to migrate existing
function key macros.
* Key macros are enabled by default. There is no longer any way to disable
function key handling in Curses, as I never found any reason or need to disable it.
Theoretically, the default ESCDELAY could turn out to be too small, so that function
keys don't get through. I doubt that's possible except on extremely slow serial lines.
Even then, you'd rather increase ESCDELAY and, instead of disabling function keys,
simply define an escape surrogate.
* The corresponding ED flag has been removed and its place is reserved for a future
mouse support flag (which does make sense to disable in Curses sometimes).
fnkeys.tes is consequently also enabled by default in sample.teco_ini.
* Key macros are handled as a unit. If one character results in an error,
the entire string is rubbed out.
This fixes the "CLOSE" key on Gtk.
It also makes sure that the original error message is preserved and not overwritten
by some subsequent syntax error.
It was never useful that we kept inserting characters after the first error.
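A minimal sketch of the rub-out-on-error behaviour described in the last
point. All function names below are hypothetical; only
teco_cmdline.effective_len is taken from the actual code base.

    /* Feed a key macro character by character; on the first error,
     * rub out everything inserted so far, preserving the original error. */
    static gboolean
    teco_keymacro_execute(const gchar *macro, gsize len, GError **error)
    {
        gsize old_len = teco_cmdline.effective_len;

        for (gsize i = 0; i < len; i++) {
            if (!teco_cmdline_keypress(macro[i], error)) {
                while (teco_cmdline.effective_len > old_len)
                    teco_cmdline_rubout();
                return FALSE;
            }
        }
        return TRUE;
    }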
* This is used for error messages (TECO macro stackframes),
so it's important to display columns in characters.
* Program counters are in bytes and are therefore gsize everywhere,
by glib convention.
* Pressing ^W in FG now deletes the entire directory component, as in EB
* Commands without glob patterns (e.g. EW) can now autocomplete file names containing
glob patterns
* When the autocompletion contains a glob character in commands accepting
glob patterns like EB or EN, we now escape the glob pattern.
This already helps if the remaining file name can be autocompleted in one go.
Unfortunately, this is still insufficient if we can only partially complete
and the partial completion contains glob characters.
For instance, if there are 2 files, `file?.txt` and `file?.foo`,
completing after `f` will insert `ile[?].`.
Pressing Tab a second time will then do nothing.
To fully support these cases, we need a version of teco_file_auto_complete()
accepting glob patterns.
Perhaps we can simply append `*` to the given glob pattern.
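The escaping of completed file names might look like the following sketch;
teco_glob_escape() is a hypothetical helper, but it reproduces the `ile[?].`
behaviour described above by wrapping metacharacters in character classes.

    #include <glib.h>

    /* Escape glob metacharacters by bracketing them,
     * so `file?.txt` becomes `file[?].txt`. */
    static gchar *
    teco_glob_escape(const gchar *str)
    {
        GString *escaped = g_string_new(NULL);

        for (const gchar *p = str; *p != '\0'; p++) {
            if (*p == '?' || *p == '*' || *p == '[') {
                g_string_append_c(escaped, '[');
                g_string_append_c(escaped, *p);
                g_string_append_c(escaped, ']');
            } else {
                g_string_append_c(escaped, *p);
            }
        }
        return g_string_free(escaped, FALSE);
    }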
* While code is guaranteed to be valid UTF-8, the same cannot be
said about the result of string building.
* The search pattern can end up with invalid Unicode bytes even when
searching on UTF-8 buffers, e.g. if ^EQq inserts garbage.
There are currently no checks.
* When searching on a raw buffer, it must be possible to
search for arbitrary bytes (^EUq).
Since teco_pattern2regexp() was always expecting clean UTF-8 input,
this would sometimes skip over too many bytes and could even crash.
* Instead, teco_pattern2regexp() now takes the <S> target codepage
into account.
The following rules apply:
* All SciTECO macros __must__ be in valid UTF-8, regardless of the
register's configured encoding.
This is checked before execution, so we can use glib's non-validating
UTF-8 API afterwards.
* Things will inevitably get slower, as we have to validate all macros first
and convert each and every character to gunichar before passing it into the parser.
As an optimization, it may make sense to have our own inlineable version of
g_utf8_get_char() (TODO; see the sketch after this list).
Also, Unicode glyphs in syntactically significant positions may be case-folded -
just like ASCII chars were. This is of course slower than case folding
ASCII. The impact of this should be measured and perhaps we should restrict
case folding to a-z via teco_ascii_toupper().
* The language itself does not use any non-ANSI characters, so you don't have to
use UTF-8 characters.
* Wherever the parser expects a single character, it will now accept an arbitrary
Unicode/UTF-8 glyph as well.
In other words, you can call macros like M§ instead of having to write M[§].
You can also get the codepoint of any Unicode character with ^^x.
Pressing a Unicode character in the start state or in Ex and Fx will now
give a sane error message.
* When pressing a key which produces a multi-byte UTF-8 sequence, the character
gets translated back and forth multiple times:
1. It's converted to a UTF-8 string, either buffered or by IME methods (Gtk).
On Curses we could directly get a wide char using wget_wch(), but it's
not currently used, so we don't depend on widechar curses.
2. Parsed into gunichar for passing into the edit command callbacks.
This also validates the codepoint - everything later on can assume valid
codepoints and valid UTF-8 strings.
3. Once the edit command handling decides to insert the key into the command line,
it is serialized back into a UTF-8 string, as the command line macro has
to be in UTF-8 (like all other macros).
4. The parser reads back gunichars without validation for passing into
the parser callbacks.
* Flickering in the Curses UI and Pango warnings in Gtk, due to incompletely
inserted and displayed UTF-8 sequences, are now fixed.
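The inlineable g_utf8_get_char() replacement mentioned in the list above
could look like this sketch. It assumes its input has already been validated,
which holds since all macros are checked before execution.

    #include <glib.h>

    /* Decode one UTF-8 sequence into a codepoint without validation. */
    static inline gunichar
    teco_utf8_get_char(const gchar *p)
    {
        guchar c = *p;

        if (c < 0x80)   /* single-byte (ASCII) */
            return c;
        if (c < 0xE0)   /* 2-byte sequence */
            return ((c & 0x1F) << 6) | (p[1] & 0x3F);
        if (c < 0xF0)   /* 3-byte sequence */
            return ((c & 0x0F) << 12) | ((p[1] & 0x3F) << 6) | (p[2] & 0x3F);
        /* 4-byte sequence */
        return ((c & 0x07) << 18) | ((p[1] & 0x3F) << 12) |
               ((p[2] & 0x3F) << 6) | (p[3] & 0x3F);
    }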
* The libtool wrapper binaries do not pass down UTF-8 strings correctly,
so the Unicode tests failed under some circumstances.
* As we aren't actually linking against any locally-built shared libraries,
we are passing --disable-shared to libtool, which inhibits wrapper generation
on win32 and fixes the test suite.
* Also use up to date autotools. This didn't fix anything, though.
* test suite: try writing a Unicode filename as well
* There have been problems doing that on Win32 where UTF-8 was not
correctly passed down from the command line and some Windows API
calls were only working with ANSI filenames etc.
* The default ANSI versions of the Win32 API calls worked only as
long as we used the ANSI subset of UTF-8 in filenames.
* There is g_win32_locale_filename_from_utf8(), but it's not guaranteed
to derive a unique filename.
* Therefore we define UNICODE and convert between UTF-8 and UTF-16
(Windows' native Unicode encoding).
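The conversion is a simple round trip through glib. A sketch of opening a
file with the wide-character API (error handling omitted):

    #include <glib.h>
    #include <windows.h>

    HANDLE
    open_file_utf8(const gchar *filename_utf8)
    {
        /* UTF-8 -> UTF-16, as expected by the W-suffixed Win32 calls */
        gunichar2 *filename_utf16 =
            g_utf8_to_utf16(filename_utf8, -1, NULL, NULL, NULL);
        HANDLE handle = CreateFileW((LPCWSTR)filename_utf16,
                                    GENERIC_READ, FILE_SHARE_READ, NULL,
                                    OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
        g_free(filename_utf16);
        return handle;
    }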
* Should prevent data loss due to system locale conversions
when parsing command line arguments.
* Should also fix passing Unicode arguments to munged macros and
therefore opening files via ~/.teco_ini.
* The entire option parsing is based on GStrv (null-terminated string lists)
now, also on UNIX.
The Mac OS 12 Groff apparently does not accept `-K` for preconv.
(refs #5)
hopefully fixes the Unicode test cases on Mac OS
should slightly reduce binary size
* turns out that glib's g_assert() does not depend on NDEBUG like Standard C's assert()
* this disables assertions in release builds and should speed things up slightly
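In other words, the two assertion families are controlled by different
macros, so a release build needs both defines:

    /* Compile release builds with -DNDEBUG -DG_DISABLE_ASSERT */
    #include <assert.h>
    #include <glib.h>

    void check(int x)
    {
        assert(x > 0);      /* compiled out by NDEBUG */
        g_assert(x > 0);    /* compiled out by G_DISABLE_ASSERT */
    }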
It's not a real Lexilla lexer, but simply styles the document
once in lexer.set.git in order to highlight comment lines.
* use SCI_GETTEXTRANGEFULL instead of deprecated SCI_GETTEXTRANGE
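A sketch of the new call; Sci_TextRangeFull uses Sci_Position fields instead
of the long fields of the deprecated struct. teco_interface_ssm() is assumed
here as the usual Scintilla message wrapper.

    struct Sci_TextRangeFull range;
    gchar buffer[256];

    /* retrieve the first 255 bytes of the document */
    range.chrg.cpMin = 0;
    range.chrg.cpMax = sizeof(buffer) - 1;
    range.lpstrText = buffer;
    teco_interface_ssm(SCI_GETTEXTRANGEFULL, 0, (sptr_t)&range);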
* significantly speeds up build time
* Scintilla and Lexilla headers and symbols are all-ASCII anyway.
* We should probably have a look at the quicksort implementation
in string.tes, as it could likely be optimized for UTF-8 documents as well.
teco_interface_glyphs2bytes() and teco_interface_bytes2glyphs() (refs #5)
* for consistency with all the other teco_view wrappers in interface.h
* When enabled with bit 2 in the ED flags (0,4ED),
all registers and buffers will get the raw ANSI encoding (as if 0EE had been
called on them).
You can still manually change the encoding, e.g. by calling 65001EE afterwards.
* Also the ANSI mode sets up character representations for all bytes >= 0x80.
This is currently done only depending on the ED flag, not when setting 0EE.
* Since setting 16,4ED for 8-bit clean editing in a macro can be tricky -
the default unnamed buffer will still be in UTF-8, and at least a bunch
of environment registers as well - we added the command line option
`--8bit` (short `-8`), which configures the ED flags very early on.
As another advantage, you can mung the profile in 8-bit mode as well
when using SciTECO as a sort of interactive hex editor.
* Disable UTF-8 checks in 8-bit clean mode (sample.teco_ini).
* ^Uq however always sets a UTF-8 register, as the source
is supposed to be a SciTECO macro, which is always UTF-8.
* :^Uq preserves the register's encoding
* teco_doc_set_string() now also sets the encoding
* instead of trying to restore the encoding in teco_doc_undo_set_string(),
we now swap out the document in a teco_doc_t and pass it to an undo token.
* The get_codepage() Q-Reg method has been removed as the same
can now be done with teco_doc_get_string() and the get_string() method.
* <nI> and ^EUq do the same
* E.g. when typing with a Russian layout, CTRL+I will always insert ^I.
* Works with all of the start-state commands, the Ex, Fx and ^x commands,
and string building constructs.
This is exactly where process_edit_cmd_cb() case folds case-insensitive
characters.
The corresponding state therefore sets an is_case_insensitive flag now.
* Does not yet work with anything embedded into Q-Register specifications.
This could only be realized with a new state callback (is_case_insensitive()?)
that chains to the Q-Register and string building states recursively.
* Also it doesn't work with Ё on my Russian phonetic layout,
probably because the ANSI caret on that same key is considered dead
and not returned by gdk_keyval_to_unicode().
Perhaps we should directly check the keyval values?
* Whenever a non-ANSI key is pressed in an allowed state,
we try to check all other keyvals that could be produced by the same
hardware keycode, i.e. we check all groups (keyboard layouts).
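A sketch of that group scan using the GDK 3 keymap API (the real logic also
has to honour the shift level):

    #include <gdk/gdk.h>

    /* Return the first ANSI character produced by the given hardware
     * keycode in any group (keyboard layout), or 0 if there is none. */
    static gunichar
    find_ansi_keyval(GdkKeymap *keymap, guint hardware_keycode)
    {
        GdkKeymapKey *keys;
        guint *keyvals;
        gint n_entries;
        gunichar result = 0;

        if (!gdk_keymap_get_entries_for_keycode(keymap, hardware_keycode,
                                                &keys, &keyvals, &n_entries))
            return 0;

        for (gint i = 0; i < n_entries; i++) {
            guint32 chr = gdk_keyval_to_unicode(keyvals[i]);
            if (chr > 0 && chr < 0x80) {
                result = chr;
                break;
            }
        }

        g_free(keys);
        g_free(keyvals);
        return result;
    }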
(refs #5)
is taken into account (refs #5)
* ^Nx and ^EMx constructs work with Unicode glyphs now,
even though the main SciTECO parser is still not Unicode-based.
(We translate only complete patterns, although they could have incomplete Unicode sequences at their end.)
* case-insensitive searching now works with Unicode glyphs
or codepoints) (refs #5)
* This is trickier than it sounds because there isn't one single place to consult.
It depends on the context.
If the string argument relates to buffer contents - as in <I>, <S>, <FR> etc. -
the buffer's encoding is consulted.
If it goes into a register (EU), the register's encoding is consulted.
Everything else (O, EN, EC, ES...) expects only Unicode codepoints.
* This is communicated through a new field teco_machine_stringbuilding_t::codepage
which must be set in the states' initial callback.
* Seems overkill just for ^EUq, but it can be used for context-sensitive
processing of all the other string building constructs as well.
* For the time being, ^V and ^W cannot support Unicode characters without a Unicode-aware parser.
* This will naturally work with both ASCII characters and various
non-English scripts.
* Unfortunately, it cannot work with the other non-ANSI single-byte codepages.
* If we'd like to support scripts working with all sorts of codepoints,
we'd have to introduce a new command for translating individual codepoints
from the current codepage (as reported by EE) to Unicode.
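The folding itself boils down to something like the following sketch.
teco_ascii_toupper() exists in the code base, but its behaviour here is only
an assumption; the Unicode fallback uses glib.

    #include <glib.h>

    static inline gunichar
    teco_fold_case(gunichar chr)
    {
        /* fast ASCII path, analogous to teco_ascii_toupper() */
        if (chr < 0x80)
            return chr >= 'a' && chr <= 'z' ? chr - ('a' - 'A') : chr;
        /* works for any script with an upper/lower-case distinction */
        return g_unichar_toupper(chr);
    }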
* The ASCII compiler would try to escape ("\") all bytes of a multibyte
UTF-8 glyph.
* The new implementation escapes only metacharacters and passes down
all non-ANSI glyphs unchanged.
On the downside, this will work only with PCREs.
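A sketch of the fixed escaping strategy, escaping only ASCII metacharacters
and copying everything else (including UTF-8 continuation bytes) verbatim:

    #include <string.h>
    #include <glib.h>

    static void
    append_escaped_char(GString *regexp, gchar c)
    {
        /* bytes >= 0x80 never match this ASCII metacharacter set */
        if (c != '\0' && strchr("\\^$.[]()*+?{}|", c) != NULL)
            g_string_append_c(regexp, '\\');
        g_string_append_c(regexp, c);
    }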
* I decoded the Scintilla charset values into codepages, at least
those used on Gtk.
* make sure that the line character index is not allocated or released
too often; it is actually reference counted internally, so unbalanced calls
could result in it missing when we really need it (see the sketch below).
* The line character index still appears to be released whenever
the document pointer changes, which will happen after using
a different Q-Register.
This could be a performance bottleneck (FIXME).
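The allocate/release pair must therefore be balanced exactly, as in this
sketch (teco_interface_ssm() again assumed as the message wrapper):

    /* Each allocation increases an internal reference count... */
    teco_interface_ssm(SCI_ALLOCATELINECHARACTERINDEX,
                       SC_LINECHARACTERINDEX_UTF32, 0);
    /* ... use SCI_LINEFROMINDEXPOSITION / SCI_INDEXPOSITIONFROMLINE ... */

    /* ...and must be balanced by exactly one release. */
    teco_interface_ssm(SCI_RELEASELINECHARACTERINDEX,
                       SC_LINECHARACTERINDEX_UTF32, 0);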
teco_state_start_get() (refs #5)
* All manpages are processed with the "utf8" device and with preconv.
Manpage sources can contain Unicode glyphs now.
* grosciteco supports CuXXXX and N commands now
* Lines are drawn with Unicode box characters now.
This works at least with tbl and -Tutf8.
It's probably still too simplistic for pic graphics.
* The topic list at the top of .woman.tec contains byte offsets,
so that we don't need glyphs2bytes conversion when looking up topics.
There is a widespread myth that they could take up to 6 bytes;
in fact, RFC 3629 restricts UTF-8 sequences to at most 4 bytes.
* There isn't much we can do anyway.
We can detect if it's Unicode and otherwise default to _some_ codepage.
However, we do not even know which codepage should be preferred.
* This is actually trivial to implement in pure SciTECO.
Having it in the profile gives you the ability to customize the default non-UTF-8 code page.
E.g. if you are working a lot with KOI-8 documents, you could change <1EE> to <204EE>.
* Since the Unicode validity check is a noticeable slowdown,
we limit it to the first 1024 bytes.
This speeds up startup significantly compared to checking all codepoints in every document.
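The limited check can be done with g_utf8_validate(), taking care that a
multi-byte sequence split by the cutoff does not count as invalid. A sketch:

    #include <glib.h>

    static gboolean
    looks_like_utf8(const gchar *data, gsize len)
    {
        gsize check_len = MIN(len, 1024);
        const gchar *end;

        if (g_utf8_validate(data, check_len, &end))
            return TRUE;
        /* A sequence truncated by the cutoff is not an error:
         * the first invalid byte lies within 3 bytes of the cutoff
         * and more data follows. */
        return check_len < len && data + check_len - end < 4;
    }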
between glyph and byte offsets (refs #5)
* ^E is heavily overloaded and can also be used to check whether a given index is valid
(it does the same thing that most movement commands do internally).
Besides that, it is mainly useful for interfacing with Scintilla messages
(see the sketch below).
* EE takes a code page or 0 for ANSI/ASCII.
Currently all documents and new registers are UTF-8.
There will have to be some kind of codepage inheritance and a single-byte-only mode.
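The conversions map naturally onto Scintilla messages, roughly as follows
(a sketch; teco_interface_ssm() is assumed as the message wrapper):

    /* SCI_COUNTCHARACTERS counts whole characters between two byte
     * positions; SCI_POSITIONRELATIVE advances a byte position by a
     * character count. */
    static sptr_t
    bytes2glyphs(sptr_t byte_pos)
    {
        return teco_interface_ssm(SCI_COUNTCHARACTERS, 0, byte_pos);
    }

    static sptr_t
    glyphs2bytes(sptr_t glyphs)
    {
        return teco_interface_ssm(SCI_POSITIONRELATIVE, 0, glyphs);
    }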
* this required adding several Q-Register vtable methods
* it should still be investigated whether the repeated calling of
SCI_ALLOCATELINECHARACTERINDEX causes any overhead.