# AcBinary Wire Format
Complete wire format specification for the AcBinary serializer. Source of truth: [`AyCode.Core/Serializers/Binaries/BinaryTypeCode.cs`](../AyCode.Core/Serializers/Binaries/BinaryTypeCode.cs).
## Stream Layout
```
[version : 1 byte] [flags : 1 byte] [cacheCount : VarUInt?] [payload...]
```
- **version** — `FormatVersion = 1` (current).
- **flags** — See [Header Flags](#header-flags).
- **cacheCount** — Present only when `HeaderFlag_HasCacheCount` is set. Number of type wrapper slots used by the serializer.
## Header Flags
The flags byte uses `0x90` (144) as base with bit flags in the lower nibble:
| Bit | Mask | Flag | Meaning |
|-----|------|------|---------|
| 0 | `0x01` | Metadata | Property hash metadata included (cross-type deserialization) |
| 1 | `0x02` | RefHandling_OnlyId | Reference tracking for `IId` objects only |
| 2 | `0x04` | RefHandling_All | Reference tracking for all objects (always combined with bit 1) |
| 3 | `0x08` | HasCacheCount | VarUInt cache count follows the flags byte |
**Reference handling modes:** None = `0x00`, OnlyId = `0x02`, All = `0x06` (bits 1+2).
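As an illustration of the table above, the flags byte could be decoded roughly as follows. This is a minimal sketch, not the library's parser: the `HeaderFlags` and `RefMode` types are hypothetical names, and rejecting any byte whose upper nibble is not `0x90` is an assumption (legacy header markers such as 100 and 101 would need separate handling).

```csharp
using System;

// Hypothetical types for illustration; not part of AcBinary.
public enum RefMode { None, OnlyId, All }

public readonly record struct HeaderFlags(bool HasMetadata, RefMode RefHandling, bool HasCacheCount)
{
    private const byte FlagsBase = 0x90;

    public static HeaderFlags Parse(byte flags)
    {
        // Assumption: the upper nibble always carries the 0x90 base.
        if ((flags & 0xF0) != FlagsBase)
            throw new FormatException($"Unexpected flags base: 0x{flags:X2}");

        // Bits 1 and 2 select the reference handling mode.
        RefMode mode = (flags & 0x06) switch
        {
            0x00 => RefMode.None,
            0x02 => RefMode.OnlyId,
            0x06 => RefMode.All,
            _ => throw new FormatException("Bit 2 (RefHandling_All) set without bit 1"),
        };

        return new HeaderFlags(
            HasMetadata: (flags & 0x01) != 0,
            RefHandling: mode,
            HasCacheCount: (flags & 0x08) != 0);
    }
}
```

For example, `0x96` decodes to reference handling `All` with no metadata and no cache count, while `0x9B` decodes to metadata, `OnlyId`, and a cache count following the flags byte.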
## Variable-Length Encoding
### VarUInt (unsigned)
LEB128: 7 data bits per byte, MSB = continuation flag.
```
value < 128 → 1 byte [0xxxxxxx]
value < 16384 → 2 bytes [1xxxxxxx] [0xxxxxxx]
value < 2097152 → 3 bytes ...
(max 5 bytes for uint32)
```
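The byte layout above can be sketched as a small encoder and decoder. This is an illustrative implementation of standard LEB128 (least-significant group first), not the serializer's own code; the `VarUInt` class name and method signatures are hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration; standard unsigned LEB128.
public static class VarUInt
{
    public static byte[] Encode(uint value)
    {
        var bytes = new List<byte>(5);
        while (value >= 0x80)
        {
            // Low 7 bits plus the continuation flag in the MSB.
            bytes.Add((byte)((value & 0x7F) | 0x80));
            value >>= 7;
        }
        bytes.Add((byte)value); // Final byte, MSB clear.
        return bytes.ToArray();
    }

    public static uint Decode(ReadOnlySpan<byte> data, out int bytesRead)
    {
        uint result = 0;
        // A uint32 never needs more than 5 bytes.
        for (int i = 0; i < 5 && i < data.Length; i++)
        {
            byte b = data[i];
            result |= (uint)(b & 0x7F) << (7 * i);
            if ((b & 0x80) == 0)
            {
                bytesRead = i + 1;
                return result;
            }
        }
        throw new FormatException("Truncated or over-long VarUInt");
    }
}
```

With this sketch, `300` encodes to the two bytes `0xAC 0x02`, and `uint.MaxValue` occupies the full 5 bytes.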
### VarInt (signed)
ZigZag encoding maps signed to unsigned, then LEB128:
```
encode: (value << 1) ^ (value >> 31)
decode: (raw >> 1) ^ -(raw & 1)
```
Maps: `0 → 0`, `-1 → 1`, `1 → 2`, `-2 → 3`, etc.
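A minimal C# sketch of the 32-bit mapping follows; the `ZigZag` class name is hypothetical. Note that the `raw >> 1` in the decode formula has to be a logical (unsigned) shift, which is why the decoder takes a `uint`.

```csharp
// Hypothetical helper for illustration; 32-bit ZigZag mapping.
public static class ZigZag
{
    // value >> 31 is an arithmetic shift: 0 for non-negative input, -1 (all ones) for negative.
    public static uint Encode(int value) =>
        unchecked((uint)((value << 1) ^ (value >> 31)));

    // raw >> 1 is a logical shift because raw is unsigned.
    public static int Decode(uint raw) =>
        unchecked((int)(raw >> 1) ^ -(int)(raw & 1));
}
```

The result of `Encode` is then written as a VarUInt, so small magnitudes of either sign stay within one byte.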
### VarULong (unsigned 64-bit)
Same LEB128 encoding, max 10 bytes for uint64.
## Type Markers
All markers defined in `BinaryTypeCode.cs`. `SlotCount = 64`.
### FixObj (0-63)
Single-byte object type. The marker byte **is** the type slot index — no additional type identifier needed.
```
[FixObj(N)] [properties...]
```
**Slot allocation:** Slots 0-63 are reserved for runtime polymorphic types, assigned dynamically on first encounter during serialization. Source-generated (SGen) types receive slots starting at 64+ via `AllocateWrapperSlot()` (sequential, `Interlocked.Increment`). SGen slots are compile-time stable; runtime slots depend on serialization order.
### Complex Types (64-71)
| Code | Name | Wire format |
|------|------|-------------|
| 64 | Object | `[64] [VarUInt typeIndex] [properties...]` |
| 65 | ObjectRef | `[65] [VarUInt refCacheIndex]` |
| 66 | Array | `[66] [VarUInt count] [elements...]` |
| 67 | Dictionary | `[67] [VarUInt count] [key, value pairs...]` |
| 68 | ByteArray | `[68] [VarUInt length] [raw bytes]` |
| 69 | ObjectWithMetadata | `[69] [VarUInt typeIndex] [VarUInt hashCount] [hashes...] [properties...]` |
| 70 | ObjectRefFirst | `[70] [VarUInt refCacheIndex] [object body...]` |
| 71 | ObjectWithMetadataRefFirst | `[71] [VarUInt refCacheIndex] [metadata + properties...]` |
### Polymorphic Types (72-75)
Used when runtime type differs from declared property type and `UseMetadata=false`.
| Code | Name | Wire format |
|------|------|-------------|
| 72 | ObjectWithTypeName | `[72] [UTF8 typeName] [inner marker] [body...]` — prefix, inner Object/Array/Dict follows |
| 73 | ObjectWithTypeNameRefFirst | `[73] [UTF8 typeName] [VarUInt refCacheIndex] [properties...]` — combined, no inner marker |
| 74 | ObjectWithTypeIndex | `[74] [VarUInt typeIndex] [inner marker] [body...]` — prefix |
| 75 | ObjectWithTypeIndexRefFirst | `[75] [VarUInt typeIndex] [VarUInt refCacheIndex] [properties...]` — combined |
Second occurrence of a referenced polymorphic object uses plain `ObjectRef(65)` — no polymorphic prefix needed.
### Primitives (76-90)
| Code | Name | Wire format |
|------|------|-------------|
| 76 | Null | `[76]` — no payload |
| 77 | True | `[77]` — no payload |
| 78 | False | `[78]` — no payload |
| 79 | Int8 | `[79] [1 byte]` |
| 80 | UInt8 | `[80] [1 byte]` |
| 81 | Int16 | `[81] [VarInt]` |
| 82 | UInt16 | `[82] [VarUInt]` |
| 83 | Int32 | `[83] [VarInt]` |
| 84 | UInt32 | `[84] [VarUInt]` |
| 85 | Int64 | `[85] [VarLong]` |
| 86 | UInt64 | `[86] [VarULong]` |
| 87 | Float32 | `[87] [4 bytes IEEE 754]` |
| 88 | Float64 | `[88] [8 bytes IEEE 754]` |
| 89 | Decimal | `[89] [16 bytes]` |
| 90 | Char | `[90] [VarUInt]` |
### Strings (91-94)
| Code | Name | Wire format |
|------|------|-------------|
| 91 | String | `[91] [VarUInt byteLength] [UTF-8 bytes]` |
| 92 | StringInterned | `[92] [VarUInt cacheIndex]` — 2nd+ occurrence |
| 93 | StringEmpty | `[93]` — no payload |
| 94 | StringInternFirst | `[94] [VarUInt cacheIndex] [VarUInt byteLength] [UTF-8 bytes]` — 1st occurrence |
### Date/Time (95-98)
| Code | Name | Wire format |
|------|------|-------------|
| 95 | DateTime | `[95] [8 bytes ticks]` |
| 96 | DateTimeOffset | `[96] [8 bytes ticks] [VarInt offsetMinutes]` |
| 97 | TimeSpan | `[97] [VarLong ticks]` |
| 98 | Guid | `[98] [16 bytes]` |
### Other Markers
| Code | Name | Wire format |
|------|------|-------------|
| 99 | Enum | `[99] [VarInt underlyingValue]` |
| 100 | MetadataHeader | Legacy: implies `RefHandling=true` + metadata present |
| 101 | NoMetadataHeader | Legacy: implies `RefHandling=true`, no metadata |
| 102 | PropertySkip | `[102]` — marks skipped property (default/null value) |
### FixStr (103-134)
Short ASCII strings encoded in a single marker byte + raw bytes (no length prefix):
```
[FixStrBase + byteLength] [ASCII bytes]
```
- Length range: 0-31 bytes (`FixStrBase=103`, `FixStrMax=134`)
- Saves 1 byte vs `String` marker + VarUInt length
- Falls back to `String(91)` if content is non-ASCII
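The selection rule above might look like the following in code. This is a sketch under stated assumptions: the `FixStr` class and `TryEncode` method are hypothetical, and how the serializer chooses between a zero-length FixStr and `StringEmpty(93)` is not covered here.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration; not the serializer's own code.
public static class FixStr
{
    public const byte FixStrBase = 103;
    public const int MaxLength = 31; // FixStrMax (134) - FixStrBase (103)

    // Returns false when the caller should fall back to String(91).
    public static bool TryEncode(string s, out byte[] encoded)
    {
        encoded = Array.Empty<byte>();
        if (s.Length > MaxLength) return false;

        // For pure ASCII, char count equals byte count, so one check covers both.
        foreach (char c in s)
        {
            if (c > 0x7F) return false;
        }

        encoded = new byte[1 + s.Length];
        encoded[0] = (byte)(FixStrBase + s.Length); // The marker carries the length.
        Encoding.ASCII.GetBytes(s, 0, s.Length, encoded, 1);
        return true;
    }
}
```

For instance, `"abc"` becomes the four bytes `106, 97, 98, 99`: marker `103 + 3` followed by the raw ASCII.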
### TinyInt (192-255)
Single-byte integer encoding for small values:
```
value = marker - 192 - 16 (range: -16 to 47)
marker = value + 16 + 192 (64 values total)
```
Saves 2+ bytes vs `Int32(83)` + VarInt for frequently occurring small integers.
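The two formulas translate directly into code. The sketch below uses hypothetical names (`TinyInt`, `MarkerBase`, `Bias`) and only illustrates the arithmetic.

```csharp
// Hypothetical helper for illustration; mirrors the marker arithmetic above.
public static class TinyInt
{
    private const int MarkerBase = 192;
    private const int Bias = 16;
    private const int MaxValue = 47;

    public static bool TryEncode(int value, out byte marker)
    {
        if (value < -Bias || value > MaxValue)
        {
            marker = 0;
            return false; // Out of range: caller uses a regular integer marker.
        }
        marker = (byte)(value + Bias + MarkerBase);
        return true;
    }

    public static bool TryDecode(byte marker, out int value)
    {
        if (marker < MarkerBase)
        {
            value = 0;
            return false; // Not a TinyInt marker.
        }
        value = marker - MarkerBase - Bias;
        return true;
    }
}
```

So `-16` maps to marker `192`, `0` to `208`, and `47` to `255`.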
## Compact Encoding Selection
The serializer applies compact encodings automatically:
| Data | Condition | Encoding | Savings |
|------|-----------|----------|---------|
| Integer | -16 ≤ v ≤ 47 | TinyInt (1 byte) | 2-5 bytes |
| String | ≤31 bytes, ASCII | FixStr (1+N bytes) | 1 byte (no length prefix) |
| Object | type index < 64 | FixObj (1 byte) | 1-5 bytes (no VarUInt index) |
| String | empty | StringEmpty (1 byte) | 1+ bytes |
| Bool | | True/False (1 byte) | no payload |
## String Interning Protocol
Controls deduplication of repeated string values.
**Modes** (`StringInterningMode`):
- `None`: all strings inline, no overhead
- `Attribute`: only `[AcStringIntern]` properties interned (default)
- `All`: all strings within length limits interned
**Length limits:** `MinStringInternLength=4`, `MaxStringInternLength=64` (configurable).
**Wire protocol:**
1. Serializer pre-scans all eligible strings to build a plan (which strings repeat)
2. First occurrence: `[StringInternFirst(94)] [VarUInt cacheIndex] [VarUInt byteLength] [UTF-8 bytes]`
3. Subsequent: `[StringInterned(92)] [VarUInt cacheIndex]`
4. Single-occurrence strings: written as normal `String`/`FixStr` (no interning overhead)
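The four steps above can be sketched as a two-pass routine. This is a simplified model rather than the actual scan pass: it returns a textual description of each marker instead of bytes, the `InternPlan` class is hypothetical, and it assumes that cache indices are handed out sequentially from 0 and that the length limits are measured in characters.

```csharp
using System.Collections.Generic;

// Hypothetical model for illustration; describes markers instead of writing bytes.
public static class InternPlan
{
    public static List<string> Describe(IReadOnlyList<string> values, int minLen = 4, int maxLen = 64)
    {
        // Pass 1 (pre-scan): count occurrences of eligible strings.
        var counts = new Dictionary<string, int>();
        foreach (string s in values)
        {
            if (s.Length < minLen || s.Length > maxLen) continue;
            counts.TryGetValue(s, out int n);
            counts[s] = n + 1;
        }

        // Pass 2 (write): choose a marker for each value.
        var cache = new Dictionary<string, int>();
        var output = new List<string>();
        foreach (string s in values)
        {
            counts.TryGetValue(s, out int occurrences);
            if (occurrences < 2)
            {
                output.Add($"String \"{s}\""); // Unique or ineligible: written inline.
                continue;
            }
            if (cache.TryGetValue(s, out int index))
            {
                output.Add($"StringInterned(92) #{index}");
                continue;
            }
            index = cache.Count; // Assumption: sequential cache indices.
            cache.Add(s, index);
            output.Add($"StringInternFirst(94) #{index} \"{s}\"");
        }
        return output;
    }
}
```

Given the values `alpha, beta, alpha`, the sketch yields `StringInternFirst(94) #0 "alpha"`, then an inline `beta`, then `StringInterned(92) #0`.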
## Reference Tracking
Prevents infinite loops and preserves object identity for repeated references.
**Modes** (`ReferenceHandlingMode`):
- `None`: no tracking (fastest, use when graph is a tree)
- `OnlyId`: track only `IId` objects (matched by ID value)
- `All`: track all reference types (two-phase scan required)
**Two-phase process:**
1. **Scan pass** (`ScanPass.cs`): walks the object graph and detects multi-referenced objects and repeated strings. Builds a `WriteDuplicateEntry[]` array (the "write plan") containing `VisitIndex`, `CacheMapIndex`, `IsFirst`, and `Value` for each duplicate.
2. **Sort**: write plan entries are sorted by `VisitIndex` to match the write pass traversal order.
3. **Serialize pass**: consumes the sorted write plan via `TryConsumeWritePlanEntry()`. A cursor (`_nextWritePlanVisitIndex`) advances through the plan in O(1) per entry, with no dictionary lookups during serialization.
**Wire protocol:**
- First occurrence: `[ObjectRefFirst(70)] [VarUInt refCacheIndex] [object body...]`
- Subsequent: `[ObjectRef(65)] [VarUInt refCacheIndex]`
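To show why a sorted plan removes the need for dictionary lookups, here is a rough model of cursor-based consumption. The field names follow the description above, but the types, the `WritePlanCursor` class, and the `TryConsume` signature are assumptions and may differ from the real `TryConsumeWritePlanEntry()`.

```csharp
using System;

// Simplified stand-in for the real entry; the Value field is omitted.
public readonly record struct WriteDuplicateEntry(int VisitIndex, int CacheMapIndex, bool IsFirst);

// Hypothetical cursor for illustration.
public sealed class WritePlanCursor
{
    private readonly WriteDuplicateEntry[] _plan;
    private int _cursor;

    public WritePlanCursor(WriteDuplicateEntry[] plan)
    {
        // Sort a copy by VisitIndex so entries line up with write-pass traversal order.
        _plan = (WriteDuplicateEntry[])plan.Clone();
        Array.Sort(_plan, (a, b) => a.VisitIndex.CompareTo(b.VisitIndex));
    }

    // Called once per visited object, with visit indices in increasing order.
    public bool TryConsume(int visitIndex, out WriteDuplicateEntry entry)
    {
        if (_cursor < _plan.Length && _plan[_cursor].VisitIndex == visitIndex)
        {
            entry = _plan[_cursor++];
            return true; // Write ObjectRefFirst(70) if IsFirst, otherwise ObjectRef(65).
        }
        entry = default;
        return false; // Not a duplicate: write a regular object marker.
    }
}
```

Because the write pass visits objects in the same order as the scan pass, a single comparison against the next plan entry is enough to decide which marker to emit.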
## Property Ordering
Properties are serialized in a deterministic order defined by `TypeMetadataBase.GetUnfilteredProperties()`:
1. Walk the inheritance chain from **derived → base** (`currentType.BaseType` loop)
2. At each level, collect declared public instance properties
3. Sort **alphabetically** (`StringComparer.Ordinal`) within each level
4. Result: **base properties first, then derived, alphabetical within each level**
This order is stable across serializer/deserializer as long as the type hierarchy doesn't change.
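The ordering rules can be reproduced with plain reflection. The sketch below is an approximation of what `TypeMetadataBase.GetUnfilteredProperties()` is described as doing, not its actual code; `PropertyOrder`, `Animal`, and `Dog` are hypothetical names used only for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Reflection;

// Hypothetical helper for illustration.
public static class PropertyOrder
{
    public static List<string> GetOrderedNames(Type type)
    {
        var levels = new List<List<string>>();

        // Walk derived -> base, collecting only the properties declared at each level.
        for (Type? t = type; t != null && t != typeof(object); t = t.BaseType)
        {
            levels.Add(t
                .GetProperties(BindingFlags.Public | BindingFlags.Instance | BindingFlags.DeclaredOnly)
                .Select(p => p.Name)
                .OrderBy(n => n, StringComparer.Ordinal) // Alphabetical within the level.
                .ToList());
        }

        // Reverse so base-class properties come first.
        levels.Reverse();
        return levels.SelectMany(l => l).ToList();
    }
}

public class Animal
{
    public string Name { get; set; } = "";
    public int Age { get; set; }
}

public class Dog : Animal
{
    public string Breed { get; set; } = "";
    public bool Barks { get; set; }
}
```

For `Dog`, this produces `Age, Name, Barks, Breed`: the two `Animal` properties first, then the two `Dog` properties, each group sorted ordinally.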
### Cross-Type Deserialization (UseMetadata)
When `UseMetadata=true`, property name hashes (FNV-1a via `FnvHash.ComputeString`) are written per type, enabling schema evolution:
- **Serializer** writes property hashes in the metadata section (`ObjectWithMetadata(69)`)
- **Deserializer** builds an index mapping array (`GetIndexMapping()`) that maps source property indices to destination indices by matching FNV-1a hashes
- This allows deserialization even when source and destination types have different property sets or ordering
When `UseMetadata=false`, properties are matched by **positional index only**: source and destination must have identical property layouts.
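The hash-based matching can be illustrated as follows. This sketch assumes the standard 32-bit FNV-1a parameters applied to the UTF-8 bytes of the property name; `FnvHash.ComputeString` may hash different input (for example UTF-16 code units), so the concrete hash values here are not guaranteed to match the wire. `MetadataMapping` and `BuildIndexMapping` are hypothetical names.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration; not the library's FnvHash or GetIndexMapping().
public static class MetadataMapping
{
    // Standard 32-bit FNV-1a (offset basis 2166136261, prime 16777619).
    // Assumption: the input is the UTF-8 encoding of the property name.
    public static uint Fnv1a(string s)
    {
        uint hash = 2166136261;
        foreach (byte b in Encoding.UTF8.GetBytes(s))
        {
            hash ^= b;
            hash = unchecked(hash * 16777619);
        }
        return hash;
    }

    // mapping[i] is the destination index for source property i, or -1 if there is no match.
    public static int[] BuildIndexMapping(uint[] sourceHashes, string[] destinationProperties)
    {
        var mapping = new int[sourceHashes.Length];
        for (int i = 0; i < sourceHashes.Length; i++)
        {
            uint h = sourceHashes[i];
            mapping[i] = Array.FindIndex(destinationProperties, d => Fnv1a(d) == h);
        }
        return mapping;
    }
}
```

If the source wrote hashes for `Age, Name, Legacy` and the destination type declares `Name, Age`, the mapping is `[1, 0, -1]`: the first two values land in swapped positions and the third has no destination.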
## Configuration Options
Options defined in `AcBinarySerializerOptions` (inherits `AcSerializerOptions`). Each option controls which code paths execute and how the wire format changes.
### WireMode
| Value | Integers | Strings | Output size | Speed |
|-------|----------|---------|-------------|-------|
| `Compact` (default) | VarInt/VarUInt (1-5 bytes) | UTF-8 with speculative ASCII fast path | Smaller | Slightly slower |
| `Fast` | Fixed-width raw bytes (4/8 bytes) | UTF-16 memcpy (`charCount * 2` bytes) | Larger | Fastest encode/decode |
**Format difference for strings:**
- Compact: `[VarUInt byteLength] [UTF-8 bytes]`, written with a speculative ASCII pass (1 pass if all ASCII, rewind + UTF-8 fallback otherwise)
- Fast: `[VarUInt charCount] [raw UTF-16 bytes]`, a zero-encoding memcpy
**Code branch:** `context.FastWire` flag set at `context.Reset()`. Checked in `WriteStringUtf8()` and integer write methods. FixStr optimization is skipped in Fast mode (UTF-8 specific).
### ReferenceHandling
| Value | Tracked objects | Scan pass | Header flags | Wire markers |
|-------|----------------|-----------|--------------|-------------|
| `None` | Nothing | Skipped | `0x00` | Standard object markers only |
| `OnlyId` | `IId` objects only (by ID value) | Partial | `0x02` | `ObjectRefFirst(70)` + `ObjectRef(65)` |
| `All` (default) | All reference types | Full graph walk | `0x06` | `ObjectRefFirst(70)` + `ObjectRef(65)` |
**Format impact:** When enabled, multi-referenced objects are written once with `ObjectRefFirst(70) + VarUInt(refCacheIndex)` on first encounter, then replaced by `ObjectRef(65) + VarUInt(refCacheIndex)` on subsequent encounters. Header `HasCacheCount` flag is set and cache count written.
**Interaction with `ThrowOnCircularReference` (default: `true`):**
- `true` + ref handling enabled: all objects tracked for cycle detection, throws `InvalidOperationException` on circular reference
- `false` + ref handling enabled: only IId types tracked for deduplication, non-IId circular refs silently truncated at `MaxDepth`
### UseMetadata
| Value | Wire markers | Property matching | Overhead |
|-------|-------------|-------------------|----------|
| `false` (default) | `FixObj`/`Object` | Positional index only; types must match | None |
| `true` | `ObjectWithMetadata(69)` / `ObjectWithMetadataRefFirst(71)` | FNV-1a property name hashes | 4 bytes per property per type |
**Format impact:** When enabled, each type's first occurrence writes `[VarUInt hashCount] [FNV-1a hash × N]` before properties. Deserializer uses hashes to build source→destination index mapping, enabling cross-type deserialization (different property sets/ordering).
**Code branch:** `context.UseMetadata` controls whether `ObjectWithMetadata(69)` or plain `Object(64)` markers are used. When `false`, `IsDirectObjectWrite=true` allows source-generated writers to bypass `WriteObject` entirely and inline property writes.
**Related:** `CheckDuplicatePropName` (default: `true`) throws if FNV-1a hash collision detected between property names of the same type. Disable in production for performance.
### UseStringInterning
| Value | Eligible strings | Scan overhead | Wire markers |
|-------|-----------------|---------------|-------------|
| `None` | Nothing | None | `String(91)` / `FixStr` only |
| `Attribute` (default) | Properties with `[AcStringIntern(true)]` | Scans marked properties | `StringInternFirst(94)` + `StringInterned(92)` |
| `All` | All strings within length limits | Scans all strings | `StringInternFirst(94)` + `StringInterned(92)` |
**Length limits:** `MinStringInternLength` (default: 4) and `MaxStringInternLength` (default: 64, 0=unlimited). Strings outside this range are always written inline.
**Format impact:** Interned strings on first occurrence: `[StringInternFirst(94)] [VarUInt cacheIndex] [string data]`. Subsequent: `[StringInterned(92)] [VarUInt cacheIndex]` (1-2 bytes vs full string). Single-occurrence strings are never interned, so unique strings carry no overhead.
**Code branch:** `context.StringInternEligible` flag set per-property before `WriteString`. Scan pass builds a `WriteDuplicateEntry[]` plan; write pass consumes it via cursor.
### MaxDepth
| Value | Behavior |
|-------|----------|
| `255` (default) | Effectively unlimited nesting |
| `0` | Root level only; nested objects/collections written as `Null(76)` |
| `N` | Objects deeper than N levels written as `Null(76)` |
**Format impact:** Depth-exceeded values appear as `Null(76)` in the stream and are indistinguishable from actual null values. No special marker.
**Code branch:** Checked at entry of every object/collection write: `if (depth > MaxDepth) { WriteByte(Null); return; }`.
### UseCompression
| Value | Method | Granularity | Memory |
|-------|--------|-------------|--------|
| `None` (default) | No compression | | |
| `Block` | LZ4 single block | Entire payload | Full buffer in memory |
| `BlockArray` | LZ4 chunked | 64KB chunks | Streaming-friendly, lower peak memory |
**Format impact:** Compression is applied **post-serialization** as a transparent wrapper; the inner wire format is unchanged. Both modes are pure managed C# (WASM-compatible, no native dependencies).
**Code branch:** Applied in `AcBinarySerializer.Serialize()` after the serialization context produces the raw buffer: `if (UseCompression != None) Lz4.Compress(buffer, mode)`. Decompression is automatic on deserialize.
### PropertyFilter
Optional delegate `BinaryPropertyFilter?` (default: `null`). When set, invoked for each property to decide inclusion.
```
delegate bool BinaryPropertyFilter(in BinaryPropertyFilterContext context);
```
**BinaryPropertyFilterContext fields:** `DeclaringType`, `PropertyName`, `PropertyType`, `Instance` (null during metadata phase), `IsMetadataPhase`, `GetValue()` (lazy).
**Format impact:** Excluded properties are completely absent from the stream: no marker, no placeholder. The deserializer must use `UseMetadata=true` or an identical filter to correctly match property indices.
**Code branch:** `context.HasPropertyFilter` checked in `ShouldSerializeProperty()`. Called twice: once during metadata registration (`Instance=null`), once during write phase.
### PropertyMapper
Optional delegate `PropertyMapperDelegate?` (default: `null`) for cross-type deserialization property remapping.
```
delegate PropertyInfo? PropertyMapperDelegate(PropertyInfo sourceProperty, Type destinationType);
```
**Purpose:** Maps properties between different class hierarchies (renamed properties, external DTOs). Result is cached, so same-type operations (`Deserialize<T>`) incur zero overhead.
### WASM Options
| Option | Default | Purpose |
|--------|---------|---------|
| `IsWasm` | `OperatingSystem.IsBrowser()` | Auto-detect WASM environment |
| `UseStringCaching` | follows `IsWasm` | Cache short strings during deserialization to reduce GC pressure |
| `MaxCachedStringLength` | 64 | Max string length to cache |
**Format impact:** None — these are deserialization-only optimizations. When `UseStringCaching=true`, the deserializer maintains an intern cache for strings ≤ `MaxCachedStringLength` chars. Disabled automatically when `StringInternFirst` marker is encountered (interning takes precedence).
### Other Options
| Option | Type | Default | Purpose |
|--------|------|---------|---------|
| `UseGeneratedCode` | bool | `true` | Use source-generated writers/readers when available |
| `InitialBufferCapacity` | int | 4096 | Starting buffer size (bytes) for serialization output |
| `RemoveOrphanedItems` | bool | `false` | During `PopulateMerge`: remove destination collection items with no matching source ID |
| `UseAsync` | bool | `false` | Async context pool return via ThreadPool. Auto-disabled in WASM and when `ReferenceHandling=None` |
| `MaxContextPoolSize` | int | 8 | Max serialization contexts kept in pool |
## Presets
| Preset | WireMode | Metadata | StringInterning | RefHandling | MaxDepth | Compression | Other |
|--------|----------|----------|-----------------|-------------|----------|-------------|-------|
| `Default` | Compact | false | Attribute | All | 255 | None | — |
| `FastMode` | Compact | false | None | None | 255 | None | No scan pass |
| `ShallowCopy` | Compact | false | None | None | **0** | None | Root level only |
| `WasmOptimized` | Compact | false | Attribute | All | 255 | None | +StringCaching |
| `WithoutReferenceHandling` | Compact | false | Attribute | **None** | 255 | None | No reference scan; `[AcStringIntern]` properties are still scanned |
| `WithoutMetadata` | Compact | **false** | Attribute | All | 255 | None | — |
**Performance implication of presets:**
- `Default` / `WasmOptimized` — two-phase (scan + serialize) due to `ReferenceHandling=All`
- `FastMode` / `ShallowCopy` — single-phase (no scan pass) since both interning and refs are disabled
- The scan pass adds ~20-30% overhead; disable it when the object graph is a simple tree
## Option Interactions
Key interdependencies that affect which code branches execute:
| Combination | Effect |
|-------------|--------|
| `ReferenceHandling=None` + `UseStringInterning=None` | **No scan pass** — fastest path, single-phase serialization |
| `ReferenceHandling=All` + `UseMetadata=true` | Uses `ObjectWithMetadataRefFirst(71)` marker — combined ref + metadata |
| `UseMetadata=false` + `UseGeneratedCode=true` | `IsDirectObjectWrite=true` — generated code inlines property writes, bypasses `WriteObject` |
| `UseMetadata=true` + `PropertyFilter` set | Filter invoked twice (metadata phase + write phase); filter results must be stable |
| `WireMode=Fast` + `UseStringInterning!=None` | Interned strings still use the fast string path (UTF-16 for first occurrence, VarUInt index for subsequent) |
| `UseCompression!=None` + any other option | Compression is orthogonal — applied post-serialization, inner format unchanged |