updated README and sciteco(7) with information about Unicode support (refs #5)

author: Robin Haberkorn <robin.haberkorn@googlemail.com> 2024-09-06 22:30:56 +0200
committer: Robin Haberkorn <robin.haberkorn@googlemail.com> 2024-09-09 18:22:21 +0200
commit: 359e2571ab00234de27a13f88b85231b51dae48f (patch)
tree: 48a1d32150bab47f0a6a31ef541c07e60a7f0d7c
parent: 0e6e0590272c8ba2303af3682d29209f439177d9 (diff)
download: sciteco-359e2571ab00234de27a13f88b85231b51dae48f.tar.gz
2 files changed, 32 insertions, 10 deletions
diff --git a/README b/README
index ea16ba0..ba26222 100644
--- a/README
+++ b/README
@@ -74,8 +74,10 @@ Features
 * Munging: Macros may be munged, that is executed in batch mode. In other words, SciTECO
   can be used for scripting.
   By default, a profile is munged.
-* 8-bit clean: SciTECO can be used to edit binary files if automatic EOL conversion
-  is turned off (`16,0ED`).
+* Full Unicode (UTF-8) support: The document is still represented as a random-accessible
+  codepoint sequence.
+* 8-bit clean: SciTECO can be used to edit binary files if the encoding is changed to
+  ANSI (`0EE`) and automatic EOL conversion is turned off (`16,0ED`).
 * Self-documenting: An integrated indexed help system allows browsing formatted documentation
   about commands, macros and concepts within SciTECO (`?` command).
   Macro packages can be documented with the `tedoc` tool, generating man pages.
diff --git a/doc/sciteco.7.template b/doc/sciteco.7.template
index ca23c93..f344820 100644
--- a/doc/sciteco.7.template
+++ b/doc/sciteco.7.template
@@ -86,17 +86,22 @@ regular commands for command-line editing.
 .SH KEY TRANSLATION
 .
 When the user presses a key or key-combination it is first translated
-to an ASCII character.
-All immediate editing commands and regular \*(ST commands operate on
+to an UTF-8 string.
+All immediate editing commands and regular \*(ST commands however operate on
 a language based solely on
 .B ASCII
-characters.
+codes, which is a subset of Unicode.
 The rules for translating keys are as follows:
 .RS
 .IP 1. 4
 Keys with a printable representation (letters, digits and special
-characters) are translated to their printable representation.
-Shift-combinations automatically result in upper-case letters.
+characters) are translated to their printable representation
+according to the current keyboard layout and modifier keys.
+On the Gtk UI, \*(ST tries to automatically take ANSI letter
+values in situations where the parser accepts only ANSI
+characters.
+\# On Curses, you might need key macros to achieve the same,
+\# but they are not yet implemented.
 .IP 2.
 .SCITECO_TOPIC ctrl
 Control-combinations (e.g. CTRL+A) are translated to control
@@ -104,7 +109,9 @@ codes, that is a code smaller than 32.
 The control code can be calculated by stripping the seventh bit
 from the upper-case letter's ASCII code.
 So for instance, the upper or lower case A (65) will be translated
-to code 1, B to code 2, ecetera.
+to code 1, B (66) to code 2, ecetera.
+\*(ST will always use latin letters regardless of the current
+keyboard layout.
 \*(ST echos control codes as Caret followed by the corresponding
 upper case letter, so you seldomly need to know a control codes
 actual numeric code.
@@ -1068,11 +1075,24 @@ Every document has a current position called dot
 (after the \(lq.\(rq command that returns it).
 A document may contain any sequence of bytes but positions
 refer to characters that might not correspond to individual
-bytes depending on the document's encoding.
+bytes depending on the document's encoding (see \fBEE\fP command).
+The \fB^E\fP command can be used to translate between byte
+and character/glyph positions.
 Consequently when querying the code at a character position
 or inserting characters by code, the code may be an Unicode
 codepoint instead of byte-sized integer.
-Currently however, \*(ST will only handle ASCII files.
+.LP
+Currently, \*(ST supports UTF-8 and single-byte ANSI encodings,
+that can also be used for editing raw binary files.
+\# You can configure other single-byte code pages with EE,
+\# but there isn't yet any way to insert characters.
+UTF-8 is the default codepage for new buffers and Q-Registers.
+While navigation in documents with single-byte encodings
+takes place in constant time, \*(ST uses heuristics in
+UTF-8 documents for translating between byte and character
+offsets which are slower especially when \(lqjumping\(rq
+into very large lines.
+\# But there are optimizations for R, C and A...
 .LP
 .SCITECO_TOPIC "EOL translation"
 To simplify working with files using different end of line
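
Review note on the KEY TRANSLATION hunk: the control-code rule described in the man page (strip the seventh bit from the upper-case letter's ASCII code) can be checked with a small Python sketch. This is only an illustration of the arithmetic, not code from SciTECO; the function names `control_code` and `caret_echo` are made up for this note, and the sketch is limited to the Latin letters A to Z.

```python
def control_code(letter: str) -> int:
    """Return the control code for CTRL+<letter>.

    The seventh bit (value 0x40) of the upper-case ASCII code is cleared,
    so 'A' (65) becomes 1, 'B' (66) becomes 2, and so on.
    """
    if len(letter) != 1 or not letter.isascii() or not letter.isalpha():
        raise ValueError("expected a single ASCII letter")
    return ord(letter.upper()) & ~0x40


def caret_echo(code: int) -> str:
    """Render a control code as a caret followed by the upper-case letter."""
    if not 1 <= code <= 26:
        raise ValueError("this sketch only covers codes derived from A-Z")
    return "^" + chr(code | 0x40)


if __name__ == "__main__":
    for key in "AaBz":
        code = control_code(key)
        print(f"CTRL+{key} -> code {code}, echoed as {caret_echo(code)}")
```

Upper and lower case give the same code because the letter is upper-cased before the bit is cleared, which matches the "upper or lower case A (65)" wording in the hunk.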
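
Review note on the buffer-position hunk: the difference between byte and character offsets in a UTF-8 document can be shown with a naive linear scan. This sketch does not reproduce the heuristics or optimizations the man page mentions; it only shows why the translation is not constant-time without extra bookkeeping. The helper names are hypothetical and the input is assumed to be valid UTF-8.

```python
def is_lead_byte(b: int) -> bool:
    """True unless b is a UTF-8 continuation byte (bit pattern 10xxxxxx)."""
    return (b & 0xC0) != 0x80


def byte_to_char_offset(data: bytes, byte_offset: int) -> int:
    """Count the characters that start before byte_offset."""
    return sum(1 for b in data[:byte_offset] if is_lead_byte(b))


def char_to_byte_offset(data: bytes, char_offset: int) -> int:
    """Return the byte offset at which character number char_offset starts."""
    seen = 0
    for i, b in enumerate(data):
        if is_lead_byte(b):
            if seen == char_offset:
                return i
            seen += 1
    if seen == char_offset:
        return len(data)  # position just past the last character
    raise IndexError("character offset out of range")


if __name__ == "__main__":
    text = "aäb€c".encode("utf-8")  # 1 + 2 + 1 + 3 + 1 = 8 bytes
    print(len(text), byte_to_char_offset(text, 3), char_to_byte_offset(text, 4))
```

Both helpers walk the buffer from the start, so their cost grows with the offset. In a single-byte encoding the two offsets are identical, which is why navigation there takes constant time.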