
AcBinary: Add ASCII string markers, doc optimizations

Enhanced string encoding with FixStrAscii/StringAscii markers for efficient ASCII handling, updated header flag base to 0xB0, and expanded documentation with marker-dispatch logic, performance results, and markerless schema lane plans.
Loretta 2026-05-04 14:36:16 +02:00
parent 7b94d81485
commit e139eca389
3 changed files with 138 additions and 10 deletions

File diff suppressed because one or more lines are too long


@@ -17,7 +17,7 @@ Complete wire format specification for the AcBinary serializer. Source of truth:
 ## Header Flags
-The flags byte uses `0x90` (144) as base with bit flags in the lower nibble:
+The flags byte uses `0xB0` (176) as base with bit flags in the lower nibble. (Moved from `0x90` / 144 to make codepoints 135-167 contiguous for the FixStrAscii / StringAscii string-marker block.)
 | Bit | Mask | Flag | Meaning |
 |-----|------|------|---------|
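The base move in this hunk can be sanity-checked with a little arithmetic. The sketch below is illustrative only and is not taken from the serializer: `make_flags_byte` and `split_flags_byte` are hypothetical helper names, and the flag nibble is treated as an opaque 4-bit value because the individual flag rows are outside this hunk.

```python
HEADER_FLAGS_BASE = 0xB0  # new base from this commit (176)
OLD_FLAGS_BASE = 0x90     # previous base (144)


def make_flags_byte(flag_bits: int) -> int:
    """Combine the base (upper nibble) with the flag bits (lower nibble)."""
    if not 0 <= flag_bits <= 0x0F:
        raise ValueError("flag bits must fit in the lower nibble")
    return HEADER_FLAGS_BASE | flag_bits


def split_flags_byte(value: int) -> int:
    """Return the flag nibble, rejecting bytes that lack the 0xB0 base."""
    if value & 0xF0 != HEADER_FLAGS_BASE:
        raise ValueError(f"not a header flags byte: {value:#04x}")
    return value & 0x0F


# Codepoint ranges named in the diff: FixStrAscii 135-166 plus StringAscii 167.
ascii_marker_block = set(range(135, 168))
old_flag_bytes = set(range(OLD_FLAGS_BASE, OLD_FLAGS_BASE + 16))        # 144-159
new_flag_bytes = set(range(HEADER_FLAGS_BASE, HEADER_FLAGS_BASE + 16))  # 176-191

collides_before = bool(old_flag_bytes & ascii_marker_block)
collides_after = bool(new_flag_bytes & ascii_marker_block)
```

With the old base, the sixteen possible flag bytes (144-159) fall inside the 135-167 block; with the new base they occupy 176-191, which sits above the reserved 168-175 range and below TinyInt (192-255).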
@@ -116,14 +116,17 @@ Second occurrence of a referenced polymorphic object uses plain `ObjectRef(65)`
 | 89 | Decimal | `[89] [16 bytes]` |
 | 90 | Char | `[90] [VarUInt]` |
-### Strings (91–94)
+### Strings (91–94, 167)
 | Code | Name | Wire format |
 |------|------|-------------|
-| 91 | String | `[91] [VarUInt byteLength] [UTF-8 bytes]` |
+| 91 | String | `[91] [VarUInt byteLength] [UTF-8 bytes]` — generic UTF-8 (any content) |
 | 92 | StringInterned | `[92] [VarUInt cacheIndex]` — 2nd+ occurrence |
 | 93 | StringEmpty | `[93]` — no payload |
 | 94 | StringInternFirst | `[94] [VarUInt cacheIndex] [VarUInt byteLength] [UTF-8 bytes]` — 1st occurrence |
+| 167 | StringAscii | `[167] [VarUInt byteLength] [ASCII bytes]` — pure ASCII (every byte < 0x80); reader byte→char widens, no UTF-8 decode |
+The writer detects ASCII via `bytesWritten == charLength` after a single-pass UTF-8 encode (every UTF-16 char < 0x80 produces exactly 1 UTF-8 byte; non-ASCII chars always produce 2-4 bytes), then emits `StringAscii` (167) or `String` (91) accordingly. The reader uses the marker as the ASCII-validity contract: `StringAscii` bypasses UTF-8 decode entirely.
 ### Date/Time (95–98)
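The `bytesWritten == charLength` test from this hunk can be tried in isolation. This is a minimal Python sketch, not the serializer's code: `utf16_length` and `is_ascii_by_length` are hypothetical names, and the UTF-16 code-unit count is assumed to be what the spec means by `charLength`.

```python
def utf16_length(text: str) -> int:
    """Count UTF-16 code units (assumed equivalent of charLength)."""
    return len(text.encode("utf-16-le")) // 2


def is_ascii_by_length(text: str) -> bool:
    """Detect pure ASCII by comparing encoded byte count with char count.

    Every char below 0x80 encodes to exactly one UTF-8 byte, while any other
    char contributes more bytes than code units, so the two counts are equal
    only when the whole string is ASCII.
    """
    bytes_written = len(text.encode("utf-8"))
    return bytes_written == utf16_length(text)
```

The comparison reuses the byte count that the encode step already produces, so no separate validation scan over the payload is needed.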
@@ -143,17 +146,33 @@ Second occurrence of a referenced polymorphic object uses plain `ObjectRef(65)`
 | 101 | NoMetadataHeader | Legacy: implies `RefHandling=true`, no metadata |
 | 102 | PropertySkip | `[102]` — marks skipped property (default/null value) |
-### FixStr (103–134)
+### FixStr (103–134) — short UTF-8 strings
-Short ASCII strings encoded in a single marker byte + raw bytes (no length prefix):
+Short strings (any UTF-8 content) encoded in a single marker byte + raw UTF-8 bytes (no length prefix):
 ```
-[FixStrBase + byteLength] [ASCII bytes]
+[FixStrBase + byteLength] [UTF-8 bytes]
 ```
-- Length range: 0–31 bytes (`FixStrBase=103`, `FixStrMax=134`)
+- Length range: 0–31 **bytes** (`FixStrBase=103`, `FixStrMax=134`)
 - Saves 1 byte vs `String` marker + VarUInt length
-- Falls back to `String(91)` if content is non-ASCII
+- Content semantics: UTF-8 (may contain multi-byte sequences for non-ASCII chars)
+- Reader dispatches via the (universal-)UTF-8 decode path
+### FixStrAscii (135–166) — short ASCII strings
+Short ASCII-only strings encoded in a single marker byte + raw ASCII bytes:
+```
+[FixStrAsciiBase + byteLength] [ASCII bytes]
+```
+- Length range: 0–31 **bytes** = chars (1:1 for ASCII) (`FixStrAsciiBase=135`, `FixStrAsciiMax=166`)
+- Same wire size as `FixStr` (1 marker byte + bytes), but the marker IS the ASCII-validity contract
+- Reader byte→char widens directly (`Encoding.Latin1.GetString` SIMD-accelerated path) — no UTF-8 decode, no run-time `Ascii.IsValid` scan
+- Writer chooses between `FixStrAscii` and `FixStr` post-encode via `bytesWritten == charLength`
+Codepoints **168–175** are reserved for future string-related markers (e.g., compressed / base64 / mixed-ASCII variants), keeping the 91–167 range a single contiguous string-marker block.
 ### TinyInt (192–255)
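Taken together, the string hunks describe a four-way marker choice on the write side and a marker-driven decode on the read side. The Python sketch below illustrates that dispatch under stated assumptions and is not the AcBinary implementation: `write_string`, `read_string` and the VarUInt helpers are hypothetical, VarUInt is assumed to be a LEB128-style encoding (7-bit groups, low group first), and interning (92/94) and `StringEmpty` (93) are left out.

```python
STRING, STRING_ASCII = 91, 167
FIXSTR_BASE, FIXSTR_MAX = 103, 134
FIXSTR_ASCII_BASE, FIXSTR_ASCII_MAX = 135, 166


def write_var_uint(value: int) -> bytes:
    # ASSUMPTION: LEB128-style layout; the diff does not define VarUInt.
    out = bytearray()
    while True:
        group = value & 0x7F
        value >>= 7
        if value:
            out.append(group | 0x80)  # more groups follow
        else:
            out.append(group)
            return bytes(out)


def read_var_uint(buf: bytes, pos: int) -> tuple:
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def write_string(text: str) -> bytes:
    payload = text.encode("utf-8")
    char_length = len(text.encode("utf-16-le")) // 2  # UTF-16 code units
    is_ascii = len(payload) == char_length            # bytesWritten == charLength
    if len(payload) <= FIXSTR_MAX - FIXSTR_BASE:      # 0..31 bytes: no length prefix
        base = FIXSTR_ASCII_BASE if is_ascii else FIXSTR_BASE
        return bytes([base + len(payload)]) + payload
    marker = STRING_ASCII if is_ascii else STRING
    return bytes([marker]) + write_var_uint(len(payload)) + payload


def read_string(buf: bytes, pos: int = 0) -> tuple:
    """Return (text, position after the string)."""
    marker = buf[pos]
    pos += 1
    if FIXSTR_BASE <= marker <= FIXSTR_MAX:
        length, ascii_only = marker - FIXSTR_BASE, False
    elif FIXSTR_ASCII_BASE <= marker <= FIXSTR_ASCII_MAX:
        length, ascii_only = marker - FIXSTR_ASCII_BASE, True
    elif marker in (STRING, STRING_ASCII):
        length, pos = read_var_uint(buf, pos)
        ascii_only = marker == STRING_ASCII
    else:
        raise ValueError(f"not a string marker: {marker}")
    raw = buf[pos:pos + length]
    # The ASCII markers are trusted: bytes are widened without UTF-8 decoding.
    text = raw.decode("latin-1" if ascii_only else "utf-8")
    return text, pos + length
```

For example, `write_string("hi")` yields the marker byte 137 (135 + 2) followed by the two ASCII bytes, while a two-byte non-ASCII payload such as `"é"` gets marker 105 (103 + 2). The `latin-1` decode plays the role the diff assigns to `Encoding.Latin1.GetString`: each byte maps directly to one char, which is only correct because the marker guarantees every byte is below 0x80.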

File diff suppressed because one or more lines are too long