AcBinary: Add ASCII string markers, doc optimizations

Enhanced string encoding with FixStrAscii/StringAscii markers for efficient ASCII handling, updated header flag base to 0xB0, and expanded documentation with marker-dispatch logic, performance results, and markerless schema lane plans.
parent 7b94d81485
commit e139eca389
|
|
@ -17,7 +17,7 @@ Complete wire format specification for the AcBinary serializer. Source of truth:
|
|||
|
||||
## Header Flags
|
||||
|
||||
The flags byte uses `0x90` (144) as base with bit flags in the lower nibble:
|
||||
The flags byte uses `0xB0` (176) as base with bit flags in the lower nibble, so header bytes occupy 176–191. (Moved from `0x90` / 144, whose 144–159 range would fall inside the contiguous 135–167 codepoint block used by the FixStrAscii / StringAscii string markers.)
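
The base-plus-nibble layout can be sketched as follows. This is a minimal illustration in Python, not the serializer's own code: the helper names `make_header` and `parse_header` are hypothetical, and the flag bits are passed as a raw integer because the individual flag names are not shown in this hunk.

```python
from typing import Optional

HEADER_FLAGS_BASE = 0xB0  # 176, as stated in the spec text
FLAG_MASK = 0x0F          # lower nibble carries the bit flags


def make_header(flags: int) -> int:
    """Combine the base with flag bits (hypothetical helper)."""
    if flags & ~FLAG_MASK:
        raise ValueError("flags must fit in the lower nibble")
    return HEADER_FLAGS_BASE | flags


def parse_header(byte: int) -> Optional[int]:
    """Return the flag bits if `byte` is a header flags byte, else None."""
    if (byte & 0xF0) != HEADER_FLAGS_BASE:
        return None
    return byte & FLAG_MASK
```

With this layout a header byte always falls in 176–191, so a byte from the old `0x90` range is no longer recognized as a header.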
|
||||
|
||||
| Bit | Mask | Flag | Meaning |
|
||||
|-----|------|------|---------|
|
||||
|
|
@ -116,14 +116,17 @@ Second occurrence of a referenced polymorphic object uses plain `ObjectRef(65)`
|
|||
| 89 | Decimal | `[89] [16 bytes]` |
|
||||
| 90 | Char | `[90] [VarUInt]` |
|
||||
|
||||
### Strings (91–94)
|
||||
### Strings (91–94, 167)
|
||||
|
||||
| Code | Name | Wire format |
|
||||
|------|------|-------------|
|
||||
| 91 | String | `[91] [VarUInt byteLength] [UTF-8 bytes]` |
|
||||
| 91 | String | `[91] [VarUInt byteLength] [UTF-8 bytes]` — generic UTF-8 (any content) |
|
||||
| 92 | StringInterned | `[92] [VarUInt cacheIndex]` — 2nd+ occurrence |
|
||||
| 93 | StringEmpty | `[93]` — no payload |
|
||||
| 94 | StringInternFirst | `[94] [VarUInt cacheIndex] [VarUInt byteLength] [UTF-8 bytes]` — 1st occurrence |
|
||||
| 167 | StringAscii | `[167] [VarUInt byteLength] [ASCII bytes]` — pure ASCII (every byte < 0x80); reader byte→char widens, no UTF-8 decode |
|
||||
|
||||
The writer detects ASCII via `bytesWritten == charLength` after a single-pass UTF-8 encode: every UTF-16 char below 0x80 produces exactly 1 UTF-8 byte, while non-ASCII content always produces at least 2 bytes per UTF-16 char (2–3 bytes for a BMP char, 4 bytes for a surrogate pair), so the two counts match only for pure ASCII. It then emits `StringAscii` (167) or `String` (91) accordingly. The reader treats the marker as the ASCII-validity contract, so `StringAscii` bypasses UTF-8 decode entirely.
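
The post-encode check described above can be sketched in Python. This is an illustration under stated assumptions, not the AcBinary writer: `write_long_string` and `utf16_length` are hypothetical names, the VarUInt is assumed to be a LEB128-style 7-bit encoding (the hunk does not define it), and the empty-string, FixStr, and interning paths are left out.

```python
STRING = 91
STRING_ASCII = 167


def encode_varuint(n: int) -> bytes:
    # ASSUMPTION: 7-bit groups, least significant first, high bit = "more follows".
    if n < 0:
        raise ValueError("VarUInt must be non-negative")
    out = bytearray()
    while True:
        low = n & 0x7F
        n >>= 7
        if n:
            out.append(low | 0x80)
        else:
            out.append(low)
            return bytes(out)


def utf16_length(s: str) -> int:
    # Mirrors a UTF-16 string length: counts code units, not code points.
    return len(s.encode("utf-16-le")) // 2


def write_long_string(s: str) -> bytes:
    payload = s.encode("utf-8")  # single encode pass
    # bytesWritten == charLength holds only when every char is ASCII.
    is_ascii = len(payload) == utf16_length(s)
    marker = STRING_ASCII if is_ascii else STRING
    return bytes([marker]) + encode_varuint(len(payload)) + payload
```

The comparison costs nothing beyond the encode itself, which is why no separate ASCII scan is needed on the write side.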
|
||||
|
||||
### Date/Time (95–98)
|
||||
|
||||
|
|
@ -143,17 +146,33 @@ Second occurrence of a referenced polymorphic object uses plain `ObjectRef(65)`
|
|||
| 101 | NoMetadataHeader | Legacy: implies `RefHandling=true`, no metadata |
|
||||
| 102 | PropertySkip | `[102]` — marks skipped property (default/null value) |
|
||||
|
||||
### FixStr (103–134)
|
||||
### FixStr (103–134) — short UTF-8 strings
|
||||
|
||||
Short ASCII strings encoded in a single marker byte + raw bytes (no length prefix):
|
||||
Short strings (any UTF-8 content) encoded in a single marker byte + raw UTF-8 bytes (no length prefix):
|
||||
|
||||
```
|
||||
[FixStrBase + byteLength] [ASCII bytes]
|
||||
[FixStrBase + byteLength] [UTF-8 bytes]
|
||||
```
|
||||
|
||||
- Length range: 0–31 bytes (`FixStrBase=103`, `FixStrMax=134`)
|
||||
- Length range: 0–31 **bytes** (`FixStrBase=103`, `FixStrMax=134`)
|
||||
- Saves 1 byte vs `String` marker + VarUInt length
|
||||
- Falls back to `String(91)` if content is non-ASCII
|
||||
- Content semantics: UTF-8 (may contain multi-byte sequences for non-ASCII chars)
|
||||
- Reader decodes via the general-purpose UTF-8 decode path
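
The marker arithmetic for FixStr can be illustrated with a short Python sketch. The function names are hypothetical, and whether a zero-length string is written as FixStr(103) or `StringEmpty` (93) is not decided here; the sketch only shows how the byte length is folded into the marker.

```python
from typing import Optional, Tuple

FIXSTR_BASE = 103
FIXSTR_MAX = 134
MAX_FIXSTR_BYTES = FIXSTR_MAX - FIXSTR_BASE  # 31


def try_write_fixstr(s: str) -> Optional[bytes]:
    """Return the FixStr encoding, or None if the UTF-8 form exceeds 31 bytes."""
    payload = s.encode("utf-8")
    if len(payload) > MAX_FIXSTR_BYTES:
        return None  # caller would fall back to a length-prefixed marker
    return bytes([FIXSTR_BASE + len(payload)]) + payload


def read_fixstr(buf: bytes, pos: int = 0) -> Tuple[str, int]:
    """Decode a FixStr at `pos`; return (text, position after the string)."""
    marker = buf[pos]
    if not FIXSTR_BASE <= marker <= FIXSTR_MAX:
        raise ValueError(f"not a FixStr marker: {marker}")
    end = pos + 1 + (marker - FIXSTR_BASE)
    if end > len(buf):
        raise ValueError("truncated FixStr payload")
    return buf[pos + 1:end].decode("utf-8"), end
```

Note that the limit applies to encoded bytes, so a 16-character string of two-byte characters (32 bytes) no longer fits.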
|
||||
|
||||
### FixStrAscii (135–166) — short ASCII strings
|
||||
|
||||
Short ASCII-only strings encoded in a single marker byte + raw ASCII bytes:
|
||||
|
||||
```
|
||||
[FixStrAsciiBase + byteLength] [ASCII bytes]
|
||||
```
|
||||
|
||||
- Length range: 0–31 **bytes** = chars (1:1 for ASCII) (`FixStrAsciiBase=135`, `FixStrAsciiMax=166`)
|
||||
- Same wire size as `FixStr` (1 marker byte + bytes), but the marker IS the ASCII-validity contract
|
||||
- Reader byte→char widens directly (`Encoding.Latin1.GetString` SIMD-accelerated path) — no UTF-8 decode, no run-time `Ascii.IsValid` scan
|
||||
- Writer chooses between `FixStrAscii` and `FixStr` post-encode via `bytesWritten == charLength`
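
Reader-side dispatch over the string markers listed in this document might look like the following Python sketch. It is an assumption-laden illustration: `read_string` is a hypothetical name, the VarUInt decoding assumes a LEB128-style format, the interned markers (92, 94) are omitted, and `latin-1` decoding stands in for the byte-to-char widening that the spec attributes to `Encoding.Latin1.GetString`.

```python
from typing import Tuple

STRING, STRING_EMPTY, STRING_ASCII = 91, 93, 167
FIXSTR_BASE, FIXSTR_MAX = 103, 134
FIXSTR_ASCII_BASE, FIXSTR_ASCII_MAX = 135, 166


def decode_varuint(buf: bytes, pos: int) -> Tuple[int, int]:
    # ASSUMPTION: 7-bit groups, least significant first, high bit = "more follows".
    result = shift = 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            return result, pos
        shift += 7


def read_string(buf: bytes, pos: int = 0) -> Tuple[str, int]:
    marker = buf[pos]
    pos += 1
    if marker == STRING_EMPTY:
        return "", pos
    if FIXSTR_BASE <= marker <= FIXSTR_MAX:
        n, ascii_only = marker - FIXSTR_BASE, False
    elif FIXSTR_ASCII_BASE <= marker <= FIXSTR_ASCII_MAX:
        n, ascii_only = marker - FIXSTR_ASCII_BASE, True
    elif marker in (STRING, STRING_ASCII):
        n, pos = decode_varuint(buf, pos)
        ascii_only = marker == STRING_ASCII
    else:
        raise ValueError(f"not a string marker: {marker}")
    raw = buf[pos:pos + n]
    if len(raw) != n:
        raise ValueError("truncated string payload")
    # The marker is the validity contract: ASCII payloads are widened 1:1,
    # everything else goes through a full UTF-8 decode.
    text = raw.decode("latin-1") if ascii_only else raw.decode("utf-8")
    return text, pos + n
```

The ASCII branch never inspects the payload bytes, which is the point of carrying the contract in the marker rather than re-validating on every read.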
|
||||
|
||||
Codepoints **168–175** are reserved for future string-related markers (e.g., compressed / base64 / mixed-ASCII variants), keeping the FixStr / FixStrAscii / StringAscii markers (103–167) and any future string variants in a single contiguous block.
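
As a sanity check on the upper codepoint layout, the ranges named in this document can be tested for overlap. The Python sketch below is illustrative only: the range table is transcribed from the headings and text here (with the header range derived from the `0xB0` base plus a 4-bit nibble), and `find_overlaps` is a hypothetical helper.

```python
from typing import Dict, List, Tuple

Range = Tuple[int, int]  # inclusive (first, last)

RANGES: Dict[str, Range] = {
    "FixStr": (103, 134),
    "FixStrAscii": (135, 166),
    "StringAscii": (167, 167),
    "ReservedString": (168, 175),
    "HeaderFlags": (0xB0, 0xBF),  # base 0xB0 + lower-nibble flags
    "TinyInt": (192, 255),
}


def overlaps(a: Range, b: Range) -> bool:
    return a[0] <= b[1] and b[0] <= a[1]


def find_overlaps(ranges: Dict[str, Range]) -> List[Tuple[str, str]]:
    """Return every pair of range names that share at least one codepoint."""
    names = list(ranges)
    return [
        (x, y)
        for i, x in enumerate(names)
        for y in names[i + 1:]
        if overlaps(ranges[x], ranges[y])
    ]


# The previous header base (0x90) for comparison.
OLD_RANGES = dict(RANGES, HeaderFlags=(0x90, 0x9F))
```

Running `find_overlaps` on `RANGES` reports no collisions, while `OLD_RANGES` collides with the FixStrAscii block, which matches the stated reason for moving the header base.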
|
||||
|
||||
### TinyInt (192–255)
|
||||
|
||||
|
|
|
|||