author Robin Haberkorn <robin.haberkorn@googlemail.com> 2024-09-06 22:30:56 +0200
committer Robin Haberkorn <robin.haberkorn@googlemail.com> 2024-09-09 18:22:21 +0200
commit 359e2571ab00234de27a13f88b85231b51dae48f
tree 48a1d32150bab47f0a6a31ef541c07e60a7f0d7c
parent 0e6e0590272c8ba2303af3682d29209f439177d9
download sciteco-359e2571ab00234de27a13f88b85231b51dae48f.tar.gz
updated README and sciteco(7) with information about Unicode support (refs #5)
-rw-r--r-- README 6
-rw-r--r-- doc/sciteco.7.template 36
2 files changed, 32 insertions, 10 deletions
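Illustrative note on two rules touched by this patch: control codes are derived by stripping the seventh bit of the upper-case letter's ASCII code, and UTF-8 documents need a translation between character and byte positions. The following is a minimal Python sketch of both ideas, not SciTECO's implementation. The function names are hypothetical, and the offset translation is a plain linear scan, whereas the patch says SciTECO uses heuristics.

```python
def control_code(letter: str) -> int:
    """Return the control code for CTRL+<letter>.

    Clears the seventh bit (value 0x40) of the upper-case ASCII code,
    so both 'a' and 'A' (65) map to 1, 'B' (66) to 2, and so on.
    """
    code = ord(letter.upper())
    if not 65 <= code <= 90:
        raise ValueError("expected a Latin letter A-Z")
    return code & ~0x40


def caret_echo(code: int) -> str:
    """Render a letter's control code as Caret plus the upper-case letter."""
    if not 1 <= code <= 26:
        raise ValueError("expected a control code derived from A-Z")
    return "^" + chr(code | 0x40)


def char_to_byte_offset(data: bytes, char_pos: int) -> int:
    """Translate a character position into a byte offset in UTF-8 data.

    Assumption: a naive linear scan that counts every byte which is
    not a UTF-8 continuation byte (10xxxxxx). This is O(n), which
    shows why the translation is not constant time in UTF-8, unlike
    in single-byte encodings where both positions are identical.
    """
    if char_pos < 0:
        raise IndexError("negative character position")
    seen = 0
    for offset, byte in enumerate(data):
        if (byte & 0xC0) != 0x80:  # ASCII byte or start of a sequence
            if seen == char_pos:
                return offset
            seen += 1
    if seen == char_pos:  # position just past the last character
        return len(data)
    raise IndexError("character position beyond end of document")
```

In a single-byte (ANSI) document the same lookup would simply return `char_pos`, which is the constant-time case the man page change describes.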
diff --git a/README b/README
index ea16ba0..ba26222 100644
--- a/README
+++ b/README
@@ -74,8 +74,10 @@ Features
* Munging: Macros may be munged, that is executed in batch mode. In other words, SciTECO
can be used for scripting.
By default, a profile is munged.
-* 8-bit clean: SciTECO can be used to edit binary files if automatic EOL conversion
- is turned off (`16,0ED`).
+* Full Unicode (UTF-8) support: The document is still represented as a random-accessible
+ codepoint sequence.
+* 8-bit clean: SciTECO can be used to edit binary files if the encoding is changed to
+ ANSI (`0EE`) and automatic EOL conversion is turned off (`16,0ED`).
* Self-documenting: An integrated indexed help system allows browsing formatted documentation
about commands, macros and concepts within SciTECO (`?` command).
Macro packages can be documented with the `tedoc` tool, generating man pages.
diff --git a/doc/sciteco.7.template b/doc/sciteco.7.template
index ca23c93..f344820 100644
--- a/doc/sciteco.7.template
+++ b/doc/sciteco.7.template
@@ -86,17 +86,22 @@ regular commands for command-line editing.
.SH KEY TRANSLATION
.
When the user presses a key or key-combination it is first translated
-to an ASCII character.
-All immediate editing commands and regular \*(ST commands operate on
+to an UTF-8 string.
+All immediate editing commands and regular \*(ST commands however operate on
a language based solely on
.B ASCII
-characters.
+codes, which is a subset of Unicode.
The rules for translating keys are as follows:
.RS
.IP 1. 4
Keys with a printable representation (letters, digits and special
-characters) are translated to their printable representation.
-Shift-combinations automatically result in upper-case letters.
+characters) are translated to their printable representation
+according to the current keyboard layout and modifier keys.
+On the Gtk UI, \*(ST tries to automatically take ANSI letter
+values in situations where the parser accepts only ANSI
+characters.
+\# On Curses, you might need key macros to achieve the same,
+\# but they are not yet implemented.
.IP 2.
.SCITECO_TOPIC ctrl
Control-combinations (e.g. CTRL+A) are translated to control
@@ -104,7 +109,9 @@ codes, that is a code smaller than 32.
The control code can be calculated by stripping the seventh bit
from the upper-case letter's ASCII code.
So for instance, the upper or lower case A (65) will be translated
-to code 1, B to code 2, ecetera.
+to code 1, B (66) to code 2, ecetera.
+\*(ST will always use latin letters regardless of the current
+keyboard layout.
\*(ST echos control codes as Caret followed by the corresponding
upper case letter, so you seldomly need to know a control codes
actual numeric code.
@@ -1068,11 +1075,24 @@ Every document has a current position called dot
(after the \(lq.\(rq command that returns it).
A document may contain any sequence of bytes but positions
refer to characters that might not correspond to individual
-bytes depending on the document's encoding.
+bytes depending on the document's encoding (see \fBEE\fP command).
+The \fB^E\fP command can be used to translate between byte
+and character/glyph positions.
Consequently when querying the code at a character position
or inserting characters by code, the code may be an Unicode
codepoint instead of byte-sized integer.
-Currently however, \*(ST will only handle ASCII files.
+.LP
+Currently, \*(ST supports UTF-8 and single-byte ANSI encodings,
+that can also be used for editing raw binary files.
+\# You can configure other single-byte code pages with EE,
+\# but there isn't yet any way to insert characters.
+UTF-8 is the default codepage for new buffers and Q-Registers.
+While navigation in documents with single-byte encodings
+takes place in constant time, \*(ST uses heuristics in
+UTF-8 documents for translating between byte and character
+offsets which are slower especially when \(lqjumping\(rq
+into very large lines.
+\# But there are optimizations for R, C and A...
.LP
.SCITECO_TOPIC "EOL translation"
To simplify working with files using different end of line