Binary Output Writers

Output strategies for AcBinarySerializer. Generic over TOutput : struct, IBinaryOutputBase → compile-time specialization, zero virtual dispatch.

Buffer management, hot-path rules: BINARY_IMPLEMENTATION.md

IBinaryOutputBase

Cold-path contract between serializer context and output strategy:

void Initialize(out byte[] buffer, out int position, out int bufferEnd);
void Grow(ref byte[] buffer, ref int position, ref int bufferEnd, int needed);
int GetTotalPosition(int currentPosition);
void Reset();

Critical: All write methods (WriteByte, WriteVarUInt, WriteStringUtf8, etc.) live on BinarySerializationContext<TOutput> sealed class — NOT on the output. Output handles only buffer lifecycle. See Why Writes Are on the Context.
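A minimal sketch of this split, under stated assumptions: `SketchArrayOutput` and `SketchContext<TOutput>` are hypothetical stand-ins (not the library's types), the interface is restated from the signatures above, and the growth policy is plain doubling without pooling. It only illustrates that the output owns the buffer lifecycle while the write method lives on the sealed context and touches the context's own fields.

```csharp
using System;

// Restated from the contract above so the sketch compiles on its own.
public interface IBinaryOutputBase
{
    void Initialize(out byte[] buffer, out int position, out int bufferEnd);
    void Grow(ref byte[] buffer, ref int position, ref int bufferEnd, int needed);
    int GetTotalPosition(int currentPosition);
    void Reset();
}

// Hypothetical output: plain array, resized on Grow (no pooling).
public struct SketchArrayOutput : IBinaryOutputBase
{
    public void Initialize(out byte[] buffer, out int position, out int bufferEnd)
    {
        buffer = new byte[16];
        position = 0;
        bufferEnd = buffer.Length;
    }

    public void Grow(ref byte[] buffer, ref int position, ref int bufferEnd, int needed)
    {
        int newSize = Math.Max(buffer.Length * 2, position + needed);
        Array.Resize(ref buffer, newSize); // keeps the bytes written so far
        bufferEnd = buffer.Length;
    }

    public int GetTotalPosition(int currentPosition) => currentPosition;
    public void Reset() { }
}

// Hypothetical stand-in for BinarySerializationContext<TOutput>.
public sealed class SketchContext<TOutput> where TOutput : struct, IBinaryOutputBase
{
    private TOutput _output;
    private byte[] _buffer;
    private int _position;
    private int _bufferEnd;

    public SketchContext(TOutput output)
    {
        _output = output;
        _output.Initialize(out _buffer, out _position, out _bufferEnd);
    }

    // Hot path: only context fields are touched; the output is called on the cold path.
    public void WriteByte(byte value)
    {
        if (_position >= _bufferEnd)
            _output.Grow(ref _buffer, ref _position, ref _bufferEnd, 1);
        _buffer[_position++] = value;
    }

    public int TotalPosition => _output.GetTotalPosition(_position);
    public ReadOnlySpan<byte> Written => _buffer.AsSpan(0, _position);
}
```

Because `TOutput` is constrained to a struct, the calls to `Initialize` and `Grow` are resolved per instantiation rather than through interface dispatch.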

ArrayBinaryOutput

struct ArrayBinaryOutput : IBinaryOutputBase, IDisposable — fastest for byte[]/ReadOnlySpan<byte> result.

  • Initialize: provides pooled buffer, position=0
  • Grow: rent doubled buffer from ArrayPool, copy, return old
  • Pooling: ≤32KB kept across serializations (faster than pool round-trip); >32KB returned, next rent halved
  • Results: ToArray (allocate+memcpy), DetachResult (caller owns pooled buffer), AsSpan (zero-alloc view)
  • OutputInitialized flag: single instance per pooled context, reused
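The Grow step listed above (rent a doubled buffer, copy, return the old one) can be sketched as follows. `PooledGrowSketch` is a hypothetical helper, and the sizing rule (double, or larger if `needed` requires it) is an assumption; the 32KB retention policy is not modelled here.

```csharp
using System;
using System.Buffers;

// Hypothetical helper, not the library's ArrayBinaryOutput.
public static class PooledGrowSketch
{
    public static void Grow(ref byte[] buffer, ref int position, ref int bufferEnd, int needed)
    {
        // Assumed sizing rule: at least double, and at least enough for the pending write.
        int newSize = Math.Max(buffer.Length * 2, position + needed);
        byte[] bigger = ArrayPool<byte>.Shared.Rent(newSize);

        // Only the written prefix has to survive the move.
        Buffer.BlockCopy(buffer, 0, bigger, 0, position);
        ArrayPool<byte>.Shared.Return(buffer);

        buffer = bigger;
        bufferEnd = bigger.Length; // Rent may hand back more than requested
    }
}
```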

BufferWriterBinaryOutput

struct BufferWriterBinaryOutput : IBinaryOutputBase — writes directly to IBufferWriter<byte> (PipeWriter, ArrayBufferWriter). Zero-copy streaming.

Cached Chunk Pattern

Instead of GetSpan/Advance per write (interface dispatch), acquires large chunk once:

  1. GetMemory(chunkSize) → TryGetArray → backing byte[] + offset
  2. All writes: buffer[position++] — direct array indexing, zero dispatch
  3. Grow: Advance(bytesInChunk) → acquire next chunk
  4. Flush: commit final bytes via Advance
  5. Fallback: TryGetArray fails → rent temp buffer, copy on Grow/Flush
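The steps above can be sketched with the standard `IBufferWriter<byte>` and `MemoryMarshal.TryGetArray` APIs. `CachedChunkWriter` is a hypothetical class written for illustration; it is a class rather than a struct for brevity, and it omits the rented-buffer fallback from step 5.

```csharp
using System;
using System.Buffers;
using System.Runtime.InteropServices;

// Hypothetical illustration of the cached chunk idea; not the library's type.
public sealed class CachedChunkWriter
{
    private readonly IBufferWriter<byte> _writer;
    private readonly int _chunkSize;
    private byte[] _buffer = Array.Empty<byte>();
    private int _chunkStart; // where the current chunk begins inside _buffer
    private int _position;
    private int _bufferEnd;

    public CachedChunkWriter(IBufferWriter<byte> writer, int chunkSize)
    {
        _writer = writer;
        _chunkSize = Math.Max(chunkSize, 1);
        AcquireChunk(1);
    }

    private void AcquireChunk(int needed)
    {
        // Step 1: one GetMemory call per chunk, then unwrap the backing array.
        Memory<byte> memory = _writer.GetMemory(Math.Max(_chunkSize, needed));
        if (!MemoryMarshal.TryGetArray<byte>(memory, out ArraySegment<byte> segment))
            throw new NotSupportedException("Sketch omits the rented-buffer fallback.");
        _buffer = segment.Array ?? throw new InvalidOperationException("No backing array.");
        _chunkStart = segment.Offset;
        _position = segment.Offset;
        _bufferEnd = segment.Offset + segment.Count;
    }

    // Step 2: direct array indexing, no interface call per byte.
    public void WriteByte(byte value)
    {
        if (_position >= _bufferEnd) Grow(1);
        _buffer[_position++] = value;
    }

    // Step 3: commit what was written, then acquire the next chunk.
    private void Grow(int needed)
    {
        _writer.Advance(_position - _chunkStart);
        AcquireChunk(needed);
    }

    // Step 4: commit the final bytes and invalidate the cached chunk.
    public void Flush()
    {
        _writer.Advance(_position - _chunkStart);
        _chunkStart = _position;
        _bufferEnd = _position; // forces a fresh chunk on the next write
    }
}
```

With `ArrayBufferWriter<byte>` the returned memory is usually larger than the requested chunk size, so the number of Grow calls depends on the writer, not only on `chunkSize`.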

Two Usage Modes

Separate buffer states, never concurrent:

Context mode — serialization pipeline:

  • Buffer state on context: _buffer, _position, _bufferEnd
  • BWO invoked only via Initialize/Grow/Flush with out/ref params
  • Context write methods operate on own fields → max JIT optimization

Standalone mode — direct writes outside serializer (e.g. AcBinaryHubProtocol framing):

  • Buffer state on struct: _buffer, _position, _bufferEnd
  • Write methods: WriteByte, WriteVarUInt, WriteStringUtf8, WriteBytes, WriteRaw<T>
  • Position: total bytes (committed + pending)
  • Flush(): commit pending, finalize
  • FlushAndReset(): commit pending, invalidate chunk → IBufferWriter available for another writer

Context/standalone share only IBufferWriter ref and _committedBytes.

Known Limitations

Buffer-writer contract limitations: BINARY_ISSUES.md under Buffer Writer (BWO) category — struct copy semantics, init-reset tracking, ctor chunk acquisition, no-mode-mixing rule.

Chunk Size

Default 65536 (64KB), configurable via AcBinarySerializerOptions.BufferWriterChunkSize.

  • Memory-backed (ArrayBufferWriter): 64KB optimal — fewer Grow calls
  • Network-backed (PipeWriter): smaller (4096) aligns with transport segments. But 64KB default is safe — PipeWriter may return less than requested
  • Too-small default would cause excessive Grow; 64KB is never catastrophic

Why Writes Are on the Context

Key architectural decision in the output layer.

Attempted: write methods on output struct. Context calls Output.WriteByte(value). Result: measurably slower, even with struct + AggressiveInlining + generic constraint devirtualization.

Root cause: JIT generates better code for sealed class field access (this._buffer, this._position — fixed offsets) than generic struct field access (this.Output._buffer — extra address computation at lower optimization tiers).

Current: writes on BinarySerializationContext<TOutput> (sealed class, hot path). Output struct handles only Initialize/Grow/Flush (cold path).

Rule: Do NOT move write methods to output. Measure with full benchmark suite before proposing changes.

IBinaryInputBase (Read Side Mirror)

Deserialization mirrors the output pattern. IBinaryInputBase provides buffer lifecycle; all read methods live on BinaryDeserializationContext<TInput>.

void Initialize(out byte[] buffer, out int position, out int bufferLength);
bool TryAdvanceSegment(ref byte[] buffer, ref int position, ref int bufferLength, int needed);
void Release();
  • ArrayBinaryInput: single byte[], TryAdvanceSegment => false (JIT-eliminated), Release no-op.
  • SequenceBinaryInput: lazy TryGet iteration over ReadOnlySequence<byte>. Context _buffer points to segment backing byte[] (zero-copy). Cross-boundary: ArrayPool scratch, N-segment loop. Release returns scratch to pool.
  • PipeReaderBinaryInput: reads from PipeReader with on-demand data via ReadAsync. Same cross-boundary pattern as SequenceBinaryInput; when all segments in current ReadResult exhausted, calls AdvanceTo + ReadAsync().GetAwaiter().GetResult() for more data. Enables pipeline parallelism with AsyncPipeWriterOutput: deserializer processes chunks as they arrive, not after full payload. Release returns scratch + signals pipe consumption via AdvanceTo.
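A minimal sketch of the read-side contract, assuming a hypothetical `SegmentListInput` over a jagged array as a stand-in for the sequence-based input, and a hypothetical `SketchReadContext<TInput>` in place of `BinaryDeserializationContext<TInput>`. It reads single bytes only, so the cross-boundary scratch buffer is not needed and is left out.

```csharp
using System;
using System.IO;

// Restated from the contract above so the sketch compiles on its own.
public interface IBinaryInputBase
{
    void Initialize(out byte[] buffer, out int position, out int bufferLength);
    bool TryAdvanceSegment(ref byte[] buffer, ref int position, ref int bufferLength, int needed);
    void Release();
}

// Hypothetical multi-segment input; each inner array plays the role of one segment.
public struct SegmentListInput : IBinaryInputBase
{
    private readonly byte[][] _segments;
    private int _index;

    public SegmentListInput(byte[][] segments)
    {
        _segments = segments;
        _index = 0;
    }

    public void Initialize(out byte[] buffer, out int position, out int bufferLength)
    {
        _index = 0;
        buffer = _segments.Length > 0 ? _segments[0] : Array.Empty<byte>();
        position = 0;
        bufferLength = buffer.Length;
    }

    public bool TryAdvanceSegment(ref byte[] buffer, ref int position, ref int bufferLength, int needed)
    {
        while (_index + 1 < _segments.Length)
        {
            _index++;
            if (_segments[_index].Length == 0) continue; // skip empty segments
            buffer = _segments[_index];                  // point at the segment, no copy
            position = 0;
            bufferLength = buffer.Length;
            return true;
        }
        return false;
    }

    public void Release() { }
}

// Hypothetical stand-in for BinaryDeserializationContext<TInput>.
public sealed class SketchReadContext<TInput> where TInput : struct, IBinaryInputBase
{
    private TInput _input;
    private byte[] _buffer;
    private int _position;
    private int _bufferLength;

    public SketchReadContext(TInput input)
    {
        _input = input;
        _input.Initialize(out _buffer, out _position, out _bufferLength);
    }

    public byte ReadByte()
    {
        if (_position >= _bufferLength &&
            !_input.TryAdvanceSegment(ref _buffer, ref _position, ref _bufferLength, 1))
            throw new EndOfStreamException();
        return _buffer[_position++];
    }
}
```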

AsyncPipeWriterOutput

struct AsyncPipeWriterOutput : IBinaryOutputBase — writes to PipeWriter with per-chunk network flush and self-describing chunked framing. Each chunk is framed as [201][UINT16 size][data] — zero-copy for both intermediate and final chunks.

Chunked Protocol Framing

Each chunk carries a 3-byte header that is reserved up front and patched later (AcquireChunk skips 3 bytes; the header is written just before Advance):

  1. AcquireChunk: request chunkSize + 3 from PipeWriter, set position = offset + 3 (skip reserved header), force bufferEnd = offset + 3 + chunkSize
  2. Context writes serializer data into buffer[position..bufferEnd]
  3. Grow(): patch [201][UINT16 dataBytes] header, Advance(3 + dataBytes), FlushAsync().Forget()
  4. Flush(): same as Grow — patch header, Advance(3 + dataBytes). Zero-copy, no data copying. The protocol writes a single [202] byte after.
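The header patch in steps 3 and 4 can be sketched as below. `ChunkHeaderSketch` is a hypothetical helper, and the little-endian byte order of the UINT16 size is an assumption: the document does not state the byte order.

```csharp
using System;
using System.Buffers.Binary;

// Hypothetical helper illustrating reserve-then-patch; not the library's code.
public static class ChunkHeaderSketch
{
    public const byte ChunkData = 201;
    public const int HeaderSize = 3;

    // chunkOffset: where the reserved header starts; position: end of the written data.
    // Returns the byte count to pass to Advance (header + data).
    public static int PatchHeader(byte[] buffer, int chunkOffset, int position)
    {
        int dataBytes = position - (chunkOffset + HeaderSize);
        if (dataBytes < 0 || dataBytes > ushort.MaxValue)
            throw new ArgumentOutOfRangeException(nameof(position));

        buffer[chunkOffset] = ChunkData;
        // ASSUMPTION: little-endian size field.
        BinaryPrimitives.WriteUInt16LittleEndian(buffer.AsSpan(chunkOffset + 1, 2), (ushort)dataBytes);
        return HeaderSize + dataBytes;
    }
}
```

Since the data was written directly behind the reserved bytes, patching the header is the only work left at commit time, which is why no payload bytes are copied.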

Backpressure Modes

Constructor parameter flushPolicy of type FlushPolicy (default FlushPolicy.DoubleBuffered):

  • FlushPolicy.PerChunk: Grow() commits → flushes → awaits → acquires next chunk. Strictly bounded peak memory (~chunk_size × 1). No producer/flush parallelism — wall-clock = sum of (serialize + flush) per chunk. Auto-applied on Stream-backed PipeWriter regardless of policy. Recommended for memory-sensitive scenarios where payload size is unpredictable.
  • FlushPolicy.DoubleBuffered (default): Grow() is fire-and-forget for the previous flush; only blocks at the NEXT chunk's Grow if the previous flush hasn't completed. Peak memory ~chunk_size × 2 (current + previous overlapping). Maximum producer/flush parallelism with bounded memory — wall-clock = max(serialize, flush) × N_chunks. The recommended balanced default for typical streaming.
  • FlushPolicy.Coalesced: Grow() does not wait per-chunk. While a previous FlushAsync is in-flight, new chunks accumulate in the PipeWriter buffer (a per-window counter _unflushedBytes tracks the accumulation). When the window approaches the safety threshold (~64 KB), the producer waits for the in-flight flush, then fires one batched FlushAsync covering the entire window — and the window-counter resets. This produces ~64 KB-sized flush windows instead of per-chunk flushes (e.g. a 9.5 MB payload at 4 KB chunks fires ~150 batched flushes, not ~2 300 per-chunk flushes). Major throughput win on transports where each FlushAsync has non-trivial overhead (network sockets, Kestrel WebSocket, kernel TCP buffers). Per-window peak memory ~64 KB (vs chunk_size × 2 in DoubleBuffered); under heavy backpressure may fall back to an owned buffer, losing zero-copy for that chunk.
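The flush counts quoted for Coalesced mode follow from a ceiling division. A small sketch, assuming a hypothetical `FlushCountSketch` helper and reading "9.5 MB" as 9 500 000 bytes:

```csharp
using System;

// Hypothetical helper for the back-of-the-envelope flush count.
public static class FlushCountSketch
{
    // Number of flushes needed when each flush covers at most bytesPerFlush bytes.
    public static long Count(long payloadBytes, long bytesPerFlush)
    {
        if (payloadBytes < 0 || bytesPerFlush <= 0)
            throw new ArgumentOutOfRangeException();
        return (payloadBytes + bytesPerFlush - 1) / bytesPerFlush;
    }
}
```

`Count(9_500_000, 4096)` gives 2320 per-chunk flushes and `Count(9_500_000, 65536)` gives 145 batched flushes, in line with the rounded figures above.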

In all three modes, flush is only initiated when _lastFlush.IsCompleted — no overlapping FlushAsync calls.
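The no-overlap rule can be illustrated with a simplified gate. `FlushGate` is hypothetical, the `Func<Task>` delegate stands in for `PipeWriter.FlushAsync` (which returns `ValueTask<FlushResult>` in the real API), and the Coalesced branch ignores the ~64 KB window threshold described above.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical, simplified model of "start a flush only when the previous one is done".
public sealed class FlushGate
{
    private readonly Func<Task> _flushAsync; // stand-in for PipeWriter.FlushAsync
    private Task _lastFlush = Task.CompletedTask;

    public int FlushesStarted { get; private set; }

    public FlushGate(Func<Task> flushAsync) => _flushAsync = flushAsync;

    // Called after a chunk has been committed.
    // waitForPrevious = true models blocking on the previous flush;
    // waitForPrevious = false models letting chunks accumulate.
    public void OnChunkCommitted(bool waitForPrevious)
    {
        if (!_lastFlush.IsCompleted)
        {
            if (!waitForPrevious) return;        // a later flush will cover this chunk too
            _lastFlush.GetAwaiter().GetResult(); // block until the in-flight flush finishes
        }
        _lastFlush = _flushAsync();              // never overlaps a previous flush
        FlushesStarted++;
    }
}
```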

Migration note: FlushPolicy replaces the historical bool waitForFlush parameter. Mapping: old true → FlushPolicy.DoubleBuffered, old false → FlushPolicy.Coalesced. FlushPolicy.PerChunk is new as an explicit option: this behavior was previously only auto-applied on Stream-backed PipeWriter, and it can now be chosen explicitly on Pipe-based writers for strictly bounded peak memory.
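For call sites that still carry the old bool, the mapping can be wrapped in a small helper. This is a sketch: the `FlushPolicy` enum is redeclared locally as a stand-in so the snippet compiles on its own (member order is an assumption), and `FlushPolicyMigration` is a hypothetical name.

```csharp
// Local stand-in for the library's enum; member order is an assumption.
public enum FlushPolicy
{
    PerChunk,
    DoubleBuffered,
    Coalesced
}

// Hypothetical migration helper.
public static class FlushPolicyMigration
{
    // Maps the historical waitForFlush flag to the corresponding policy.
    public static FlushPolicy FromWaitForFlush(bool waitForFlush) =>
        waitForFlush ? FlushPolicy.DoubleBuffered : FlushPolicy.Coalesced;
}
```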

Two parallel-flush regimes (auto-detected)

Runtime check pipeWriter.GetType() splits flush behavior into two regimes — auto-detected at ctor via _serializeFlushAndAcquire = StreamPipeWriterType.IsInstanceOfType(pipeWriter). No caller intervention. Orthogonal to FlushPolicy and to the wire-format mode choice (Bytes / Segment / AsyncSegment).

True parallel (Pipe-based / parallel-capable PipeWriters): new Pipe().Writer, Kestrel transport output, custom parallel-capable impls. Grow() uses the FlushAsync().Forget() pattern: the serializer continues with the next chunk while the network async-flushes the previous one. Round-trip wall-clock = max(serialize, flush) × N_chunks; flush hides behind serialize time. Production-stable on SignalR / Kestrel; "minimally slower than raw byte[]" empirically.

Half parallel (StreamPipeWriter-backed transports): PipeWriter.Create(stream) for NamedPipe / FileStream / NetworkStream / MemoryStream / SslStream / etc. The BCL StreamPipeWriter._tailMemory = default reset on flush completion races against the parallel-acquire pattern, forcing FlushAsync().GetAwaiter().GetResult() after every commit. Kernel IO is strictly sequential. Managed-side parallelism (drain-task, deser-task, calling-thread) is still possible, but wall-clock = (serialize + flush) × N_chunks; flush cost accumulates per chunk.

| Regime | Detected on | Flush pattern in Grow() | Wall-clock formula | vs raw byte[] (NamedPipe, 1000 iter / 5000 warmup) |
| --- | --- | --- | --- | --- |
| True parallel | non-StreamPipeWriter | FlushAsync().Forget() then continue | max(ser, flush) × N | minimal; flush hides behind serialize |
| Half parallel | StreamPipeWriter (any stream) | SyncAwaitFlush(FlushAsync()) | (ser + flush) × N | Small 2KB: -2% / Med 7.5KB: +15% / Large 50KB: +34% / Repeated 10KB: +3% / Deep 11KB: +10% |
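The two wall-clock formulas can be compared with a tiny model. `WallClockModel` is a hypothetical helper, and the per-chunk timings used in the usage note are made-up illustration values, not measurements.

```csharp
using System;

// Hypothetical model of the two regimes; inputs are per-chunk costs in any time unit.
public static class WallClockModel
{
    // Flush overlaps with serialization of the next chunk.
    public static double TrueParallel(double serialize, double flush, int chunks) =>
        Math.Max(serialize, flush) * chunks;

    // Flush is awaited after every chunk.
    public static double HalfParallel(double serialize, double flush, int chunks) =>
        (serialize + flush) * chunks;
}
```

With an assumed 10 µs serialize and 4 µs flush per chunk over 12 chunks, the model gives 120 µs for true parallel and 168 µs for half parallel: the half-parallel penalty is the flush time multiplied by the chunk count.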

Allocation savings (chunked vs. raw single-shot byte[]) are regime-independent and payload-size-independent: ~30% fewer allocated KB per round-trip, because MemoryMarshal.TryGetArray direct-buffer-write into the PipeWriter's internal slab eliminates the intermediate pooled-byte[] rent that the raw path pays.

The half-parallel wall-clock cost is bounded by the BCL StreamPipeWriter design, not by AsyncPipeWriterOutput; the runtime detection is the optimal managed-layer response. Multiple sequential FlushAsync round-trips on the kernel IO are the dominant cost on Stream-backed transports, and that cost grows in proportion to payload size as the chunk count grows. JIT tier-1 warmup matters: ~5000 warmup iterations are needed to stabilize the async-state-machine code (vs ~500 for the raw path); see BINARY_TODO.md#accore-bin-t-t5j8.

Real-world latency-budget context. The +34% slowdown on Large-payload NamedPipe is +66 µs absolute. Typical IPC / HTTP / SignalR / frame-budget round-trips are 1-50 ms, so the absolute differential is about 0.13-6.6% of the round-trip (66 µs against 50 ms and against 1 ms respectively), usually within the noise floor. Choose chunked for: GC-pressure-sensitive hot paths (~30% allocation savings, no GC pressure from the serialize step), memory-bounded hosts (chunk-bounded peak vs. payload-bounded), GB-scale payloads (Bytes mode OOMs), and parallel-capable transports (true-parallel pipeline overlap). Avoid chunked when: the workload is wall-clock-throughput-bound on a Stream-backed transport AND the payload is Medium-to-Large AND GC pressure is tolerable.

Wire Format (per chunk)

CHUNK_DATA: [201][UINT16 size][data bytes]  — every chunk (self-describing, variable size)
CHUNK_END:  [202]                           — end signal (1 byte, no data)

Max chunk data size: 65535 bytes (UINT16 max).
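A round-trip sketch of this framing, useful for checking a reader against the format. `ChunkWireSketch` is a hypothetical helper that works on whole byte arrays rather than a PipeWriter, and the little-endian size field is an assumption, since the byte order is not stated here.

```csharp
using System;
using System.Buffers.Binary;
using System.IO;

// Hypothetical framing helper; not the library's reader or writer.
public static class ChunkWireSketch
{
    public const byte ChunkData = 201;
    public const byte ChunkEnd = 202;

    // Splits payload into [201][UINT16 size][data] chunks and appends [202].
    public static byte[] Frame(byte[] payload, int maxChunk = ushort.MaxValue)
    {
        if (maxChunk < 1 || maxChunk > ushort.MaxValue)
            throw new ArgumentOutOfRangeException(nameof(maxChunk));

        var wire = new MemoryStream();
        Span<byte> size = stackalloc byte[2];
        for (int offset = 0; offset < payload.Length; offset += maxChunk)
        {
            int count = Math.Min(maxChunk, payload.Length - offset);
            wire.WriteByte(ChunkData);
            BinaryPrimitives.WriteUInt16LittleEndian(size, (ushort)count); // ASSUMPTION: little-endian
            wire.Write(size);
            wire.Write(payload, offset, count);
        }
        wire.WriteByte(ChunkEnd);
        return wire.ToArray();
    }

    // Reassembles the payload; throws on an unknown tag or a truncated stream.
    public static byte[] Unframe(ReadOnlySpan<byte> wire)
    {
        var payload = new MemoryStream();
        int pos = 0;
        while (true)
        {
            if (pos >= wire.Length) throw new InvalidDataException("Missing CHUNK_END.");
            byte tag = wire[pos++];
            if (tag == ChunkEnd) return payload.ToArray();
            if (tag != ChunkData) throw new InvalidDataException("Unknown chunk tag.");
            if (wire.Length - pos < 2) throw new InvalidDataException("Truncated size field.");
            int size = BinaryPrimitives.ReadUInt16LittleEndian(wire.Slice(pos, 2));
            pos += 2;
            if (wire.Length - pos < size) throw new InvalidDataException("Truncated chunk data.");
            payload.Write(wire.Slice(pos, size));
            pos += size;
        }
    }
}
```

A 10-byte payload framed with `maxChunk: 4` yields three chunks (4 + 4 + 2 data bytes), so the wire size is 3 × 3 header bytes + 10 data bytes + 1 end byte = 20 bytes.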

Usage

Selected via BinaryProtocolMode.AsyncSegment in AcBinaryHubProtocol. The protocol's WriteMessageChunked sends CHUNK_START (standard SignalR framing), the serializer writes all chunks via AsyncPipeWriterOutput, the protocol writes [202].

AcBinarySerializer.Serialize(value, pipeWriter, options);
// All chunks already committed to PipeWriter. Protocol writes [202] and flushes.