Binary Output Writers
Output strategies for AcBinarySerializer. Generic over TOutput : struct, IBinaryOutputBase → compile-time specialization, zero virtual dispatch.
Buffer management, hot-path rules: BINARY_IMPLEMENTATION.md
IBinaryOutputBase
Cold-path contract between serializer context and output strategy:
void Initialize(out byte[] buffer, out int position, out int bufferEnd);
void Grow(ref byte[] buffer, ref int position, ref int bufferEnd, int needed);
int GetTotalPosition(int currentPosition);
void Reset();
Critical: All write methods (WriteByte, WriteVarUInt, WriteStringUtf8, etc.) live on BinarySerializationContext<TOutput> sealed class — NOT on the output. Output handles only buffer lifecycle. See Why Writes Are on the Context.
ArrayBinaryOutput
struct ArrayBinaryOutput : IBinaryOutputBase, IDisposable — fastest for byte[]/ReadOnlySpan<byte> result.
- Initialize: provides pooled buffer, position=0
- Grow: rent doubled buffer from ArrayPool, copy, return old
- Pooling: ≤32KB kept across serializations (faster than pool round-trip); >32KB returned, next rent halved
- Results: ToArray (allocate+memcpy), DetachResult (caller owns pooled buffer), AsSpan (zero-alloc view)
- OutputInitialized flag: single instance per pooled context, reused
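The Grow step listed above (rent doubled, copy, return old) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the library's code: PooledGrowSketch is a hypothetical name, and the 32KB retention policy is left out.

```csharp
using System;
using System.Buffers;

// Hypothetical helper, not part of the library: illustrates "rent doubled, copy, return old".
public static class PooledGrowSketch
{
    public static void Grow(ref byte[] buffer, ref int position, ref int bufferEnd, int needed)
    {
        // At least double, and always large enough for the pending write.
        int newSize = Math.Max(buffer.Length * 2, position + needed);
        byte[] bigger = ArrayPool<byte>.Shared.Rent(newSize);

        // Preserve the bytes written so far; position itself does not change.
        Buffer.BlockCopy(buffer, 0, bigger, 0, position);

        ArrayPool<byte>.Shared.Return(buffer);
        buffer = bigger;
        bufferEnd = bigger.Length; // Rent may hand back more than was requested
    }
}
```

The signature mirrors the Grow contract on IBinaryOutputBase so the caller's buffer, position, and bufferEnd stay in sync after the swap.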
BufferWriterBinaryOutput
struct BufferWriterBinaryOutput : IBinaryOutputBase — writes directly to IBufferWriter<byte> (PipeWriter, ArrayBufferWriter). Zero-copy streaming.
Cached Chunk Pattern
Instead of GetSpan/Advance per write (interface dispatch), acquires large chunk once:
- GetMemory(chunkSize) → TryGetArray → backing byte[] + offset
- All writes: buffer[position++] (direct array indexing, zero dispatch)
- Grow: Advance(bytesInChunk) → acquire next chunk
- Flush: commit final bytes via Advance
- Fallback: TryGetArray fails → rent temp buffer, copy on Grow/Flush
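The cached chunk pattern above might look like the following sketch. CachedChunkSketch and WriteRun are hypothetical names used only for illustration; the rented-buffer fallback is deliberately omitted, and the loop writes a dummy byte pattern in place of real serializer output.

```csharp
using System;
using System.Buffers;
using System.Runtime.InteropServices;

// Hypothetical sketch of the cached-chunk idea; not the library's BufferWriterBinaryOutput.
public static class CachedChunkSketch
{
    public static int WriteRun(IBufferWriter<byte> writer, int count, int chunkSize = 4096)
    {
        int written = 0;
        while (written < count)
        {
            // One interface call per chunk instead of one per byte.
            Memory<byte> memory = writer.GetMemory(chunkSize);
            if (!MemoryMarshal.TryGetArray<byte>(memory, out ArraySegment<byte> segment))
                throw new NotSupportedException("Sketch omits the rented-buffer fallback.");

            byte[] buffer = segment.Array ?? throw new InvalidOperationException("No backing array.");
            int start = segment.Offset;
            int position = start;
            int bufferEnd = start + segment.Count;

            // Hot loop: plain array indexing, no interface dispatch.
            while (position < bufferEnd && written < count)
            {
                buffer[position++] = (byte)written;
                written++;
            }

            // Commit what landed in this chunk; the next iteration acquires a new chunk.
            writer.Advance(position - start);
        }
        return written;
    }
}
```

With an ArrayBufferWriter<byte> as the target, every GetMemory call is array-backed, so the fallback branch is never taken in this sketch.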
Two Usage Modes
Separate buffer states, never concurrent:
Context mode — serialization pipeline:
- Buffer state on context: _buffer, _position, _bufferEnd
- BWO invoked only via Initialize/Grow/Flush with out/ref params
- Context write methods operate on own fields → max JIT optimization
Standalone mode — direct writes outside serializer (e.g. AcBinaryHubProtocol framing):
- Buffer state on struct: _buffer, _position, _bufferEnd
- Write methods: WriteByte, WriteVarUInt, WriteStringUtf8, WriteBytes, WriteRaw<T>
- Position: total bytes (committed + pending)
- Flush(): commit pending, finalize
- FlushAndReset(): commit pending, invalidate chunk → IBufferWriter available for another writer
Context/standalone share only IBufferWriter ref and _committedBytes.
Known Limitations
Buffer-writer contract limitations: BINARY_ISSUES.md under Buffer Writer (BWO) category — struct copy semantics, init-reset tracking, ctor chunk acquisition, no-mode-mixing rule.
Chunk Size
Default 65536 (64KB), configurable via AcBinarySerializerOptions.BufferWriterChunkSize.
- Memory-backed (ArrayBufferWriter): 64KB optimal, fewer Grow calls
- Network-backed (PipeWriter): smaller (4096) aligns with transport segments. But 64KB default is safe: PipeWriter may return less than requested
- Too-small default would cause excessive Grow; 64KB is never catastrophic
Why Writes Are on the Context
Key architectural decision in the output layer.
Attempted: write methods on output struct. Context calls Output.WriteByte(value).
Result: measurably slower, even with struct + AggressiveInlining + generic constraint devirtualization.
Root cause: JIT generates better code for sealed class field access (this._buffer, this._position — fixed offsets) than generic struct field access (this.Output._buffer — extra address computation at lower optimization tiers).
Current: writes on BinarySerializationContext<TOutput> (sealed class, hot path). Output struct handles only Initialize/Grow/Flush (cold path).
Rule: Do NOT move write methods to output. Measure with full benchmark suite before proposing changes.
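The split between hot-path writes on a sealed class and cold-path buffer lifecycle on a generic struct can be illustrated with a reduced sketch. All type names here (IOutputSketch, GrowableArrayOutput, ContextSketch) are hypothetical stand-ins, and the output uses Array.Resize purely to keep the example short.

```csharp
using System;

// Hypothetical, reduced version of the cold-path contract.
public interface IOutputSketch
{
    void Initialize(out byte[] buffer, out int position, out int bufferEnd);
    void Grow(ref byte[] buffer, ref int position, ref int bufferEnd, int needed);
}

// Output struct: buffer lifecycle only, no write methods.
public struct GrowableArrayOutput : IOutputSketch
{
    public void Initialize(out byte[] buffer, out int position, out int bufferEnd)
    {
        buffer = new byte[4];
        position = 0;
        bufferEnd = buffer.Length;
    }

    public void Grow(ref byte[] buffer, ref int position, ref int bufferEnd, int needed)
    {
        Array.Resize(ref buffer, Math.Max(buffer.Length * 2, position + needed));
        bufferEnd = buffer.Length;
    }
}

// Sealed context: owns the hot-path state and every write method.
public sealed class ContextSketch<TOutput> where TOutput : struct, IOutputSketch
{
    private TOutput _output;   // touched only on the cold path
    private byte[] _buffer;    // hot-path state lives directly on the sealed class
    private int _position;
    private int _bufferEnd;

    public ContextSketch(TOutput output)
    {
        _output = output;
        _output.Initialize(out _buffer, out _position, out _bufferEnd);
    }

    public int Position => _position;
    public ReadOnlySpan<byte> Written => _buffer.AsSpan(0, _position);

    public void WriteByte(byte value)
    {
        if (_position == _bufferEnd)
            _output.Grow(ref _buffer, ref _position, ref _bufferEnd, 1); // rare, cold path
        _buffer[_position++] = value; // common case reads only this._buffer / this._position
    }
}
```

Because TOutput is constrained to struct, the Grow call is resolved at compile time for each instantiation, while the common-case store never goes through the output at all.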
IBinaryInputBase (Read Side Mirror)
Deserialization mirrors the output pattern. IBinaryInputBase provides buffer lifecycle; all read methods live on BinaryDeserializationContext<TInput>.
void Initialize(out byte[] buffer, out int position, out int bufferLength);
bool TryAdvanceSegment(ref byte[] buffer, ref int position, ref int bufferLength, int needed);
void Release();
- ArrayBinaryInput: single byte[], TryAdvanceSegment => false (JIT-eliminated), Release no-op.
- SequenceBinaryInput: lazy TryGet iteration over ReadOnlySequence<byte>. Context _buffer points to segment backing byte[] (zero-copy). Cross-boundary: ArrayPool scratch, N-segment loop. Release returns scratch to pool.
- PipeReaderBinaryInput: reads from PipeReader with on-demand data via ReadAsync. Same cross-boundary pattern as SequenceBinaryInput; when all segments in current ReadResult exhausted, calls AdvanceTo + ReadAsync().GetAwaiter().GetResult() for more data. Enables pipeline parallelism with AsyncPipeWriterOutput: deserializer processes chunks as they arrive, not after full payload. Release returns scratch + signals pipe consumption via AdvanceTo.
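The lazy TryGet iteration that SequenceBinaryInput is described as using can be approximated as below. This sketch only walks the segments and exposes each backing array; the cross-boundary scratch handling is not shown. SegmentWalkSketch and its nested Segment type are hypothetical helpers introduced so a multi-segment sequence can be built for the example.

```csharp
using System;
using System.Buffers;
using System.Runtime.InteropServices;

// Hypothetical sketch of segment-by-segment access; not the library's SequenceBinaryInput.
public static class SegmentWalkSketch
{
    // Minimal segment type so a multi-segment sequence can be constructed for illustration.
    private sealed class Segment : ReadOnlySequenceSegment<byte>
    {
        public Segment(byte[] data, long runningIndex)
        {
            Memory = data;
            RunningIndex = runningIndex;
        }

        public Segment Append(byte[] data)
        {
            var next = new Segment(data, RunningIndex + Memory.Length);
            Next = next;
            return next;
        }
    }

    // Requires at least one part.
    public static ReadOnlySequence<byte> Build(params byte[][] parts)
    {
        var first = new Segment(parts[0], 0);
        var last = first;
        for (int i = 1; i < parts.Length; i++) last = last.Append(parts[i]);
        return new ReadOnlySequence<byte>(first, 0, last, last.Memory.Length);
    }

    public static int CountArrayBackedSegments(ReadOnlySequence<byte> sequence, out long totalBytes)
    {
        int segments = 0;
        totalBytes = 0;
        SequencePosition position = sequence.Start;

        // TryGet advances position to the next segment on each call.
        while (sequence.TryGet(ref position, out ReadOnlyMemory<byte> memory))
        {
            // Zero-copy access: read straight from the segment's backing array.
            if (MemoryMarshal.TryGetArray(memory, out ArraySegment<byte> backing))
            {
                segments++;
                totalBytes += backing.Count;
            }
        }
        return segments;
    }
}
```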
AsyncPipeWriterOutput
struct AsyncPipeWriterOutput : IBinaryOutputBase — writes to PipeWriter with per-chunk network flush and self-describing chunked framing. Each chunk is framed as [201][UINT16 size][data] — zero-copy for both intermediate and final chunks.
Chunked Protocol Framing
Each chunk reserves a 3-byte header up front (skip 3 bytes in AcquireChunk, patch before Advance):
- AcquireChunk: request chunkSize + 3 from PipeWriter, set position = offset + 3 (skip reserved header), force bufferEnd = offset + 3 + chunkSize
- Context writes serializer data into buffer[position..bufferEnd]
- Grow(): patch [201][UINT16 dataBytes] header, Advance(3 + dataBytes), FlushAsync().Forget()
- Flush(): same as Grow: patch header, Advance(3 + dataBytes). Zero-copy, no data copying. The protocol writes a single [202] byte after.
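Header reservation and patching can be sketched as follows. Two things are assumptions rather than facts from this document: the UINT16 byte order (taken here as little-endian, which the text does not specify) and the helper name ChunkFrameSketch. The payload copy stands in for the serializer writing in place.

```csharp
using System;
using System.Buffers;
using System.Buffers.Binary;

// Hypothetical sketch; the UINT16 byte order is ASSUMED little-endian (not stated in the doc).
public static class ChunkFrameSketch
{
    public const byte ChunkData = 201;
    public const int HeaderSize = 3;

    public static void WriteChunk(IBufferWriter<byte> writer, ReadOnlySpan<byte> payload)
    {
        if (payload.Length > ushort.MaxValue)
            throw new ArgumentOutOfRangeException(nameof(payload));

        // Reserve header + data in one request; data starts after the 3 reserved bytes.
        Span<byte> span = writer.GetSpan(HeaderSize + payload.Length);
        payload.CopyTo(span.Slice(HeaderSize)); // stands in for in-place serializer writes

        // Patch the header once the data length is known, then commit everything at once.
        span[0] = ChunkData;
        BinaryPrimitives.WriteUInt16LittleEndian(span.Slice(1, 2), (ushort)payload.Length);
        writer.Advance(HeaderSize + payload.Length);
    }
}
```

Patching after the data is written is what lets a chunk carry its true length without a second buffer or a copy.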
Backpressure Modes
Constructor parameter flushPolicy of type FlushPolicy (default FlushPolicy.DoubleBuffered):
- FlushPolicy.PerChunk: Grow() commits → flushes → awaits → acquires next chunk. Strictly bounded peak memory (~chunk_size × 1). No producer/flush parallelism: wall-clock = sum of (serialize + flush) per chunk. Auto-applied on Stream-backed PipeWriter regardless of policy. Recommended for memory-sensitive scenarios where payload size is unpredictable.
- FlushPolicy.DoubleBuffered (default): Grow() is fire-and-forget for the previous flush; only blocks at the NEXT chunk's Grow if the previous flush hasn't completed. Peak memory ~chunk_size × 2 (current + previous overlapping). Maximum producer/flush parallelism with bounded memory: wall-clock = max(serialize, flush) × N_chunks. The recommended balanced default for typical streaming.
- FlushPolicy.Coalesced: Grow() does not wait per-chunk. While a previous FlushAsync is in-flight, new chunks accumulate in the PipeWriter buffer (a per-window counter _unflushedBytes tracks the accumulation). When the window approaches the safety threshold (~64 KB), the producer waits for the in-flight flush, then fires one batched FlushAsync covering the entire window, and the window counter resets. This produces ~64 KB-sized flush windows instead of per-chunk flushes (e.g. a 9.5 MB payload at 4 KB chunks fires ~150 batched flushes, not ~2 300 per-chunk flushes). Major throughput win on transports where each FlushAsync has non-trivial overhead (network sockets, Kestrel WebSocket, kernel TCP buffers). Per-window peak memory ~64 KB (vs chunk_size × 2 in DoubleBuffered); under heavy backpressure may fall back to an owned buffer, losing zero-copy for that chunk.
In all three modes, flush is only initiated when _lastFlush.IsCompleted — no overlapping FlushAsync calls.
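The flush counts quoted for Coalesced mode can be checked with simple ceiling division. The sketch below assumes "9.5 MB" means 9.5 × 10^6 bytes and that each window is exactly 64 KiB; both are assumptions, and FlushCountSketch is a hypothetical name.

```csharp
using System;

// Back-of-envelope check of the flush counts; payload size and window size are assumptions.
public static class FlushCountSketch
{
    public static long FlushCount(long payloadBytes, int bytesPerFlush) =>
        (payloadBytes + bytesPerFlush - 1) / bytesPerFlush; // ceiling division
}
```

Under those assumptions, FlushCount(9_500_000, 4096) gives 2320 per-chunk flushes and FlushCount(9_500_000, 65536) gives 145 batched flushes, in line with the approximate figures above.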
Migration note: FlushPolicy replaces the historical bool waitForFlush parameter. Mapping: old true → FlushPolicy.DoubleBuffered, old false → FlushPolicy.Coalesced. FlushPolicy.PerChunk is a new capability: the behavior was previously only auto-applied on Stream-backed PipeWriter, and it can now be chosen explicitly on Pipe-based writers for strictly bounded peak memory.
Two parallel-flush regimes (auto-detected)
Runtime check pipeWriter.GetType() splits flush behavior into two regimes — auto-detected at ctor via _serializeFlushAndAcquire = StreamPipeWriterType.IsInstanceOfType(pipeWriter). No caller intervention. Orthogonal to FlushPolicy and to the wire-format mode choice (Bytes / Segment / AsyncSegment).
True parallel — Pipe-based / parallel-capable PipeWriters: new Pipe().Writer, Kestrel transport output, custom parallel-capable impls. Grow() uses FlushAsync().Forget() pattern: serializer continues with the next chunk while the network async-flushes the previous one. Round-trip wall-clock = max(serialize, flush) × N_chunks — flush hides behind serialize-time. Production-stable on SignalR / Kestrel; "minimally slower than raw byte[]" empirically.
Half parallel — StreamPipeWriter-backed transports: PipeWriter.Create(stream) for NamedPipe / FileStream / NetworkStream / MemoryStream / SslStream / etc. The BCL StreamPipeWriter._tailMemory = default reset on flush completion races against the parallel-acquire pattern, forcing FlushAsync().GetAwaiter().GetResult() after every commit. Kernel-IO is strictly sequential. Managed-side parallelism (drain-task, deser-task, calling-thread) still possible, but wall-clock = (serialize + flush) × N_chunks — flush accumulates per chunk.
| Regime | Detected on | Flush pattern in Grow() | Wall-clock formula | vs raw byte[] (NamedPipe, 1000 iter / 5000 warmup) |
|---|---|---|---|---|
| True parallel | non-StreamPipeWriter | FlushAsync().Forget() then continue | max(ser, flush) × N | minimal (flush hides behind serialize) |
| Half parallel | StreamPipeWriter (any stream) | SyncAwaitFlush(FlushAsync()) | (ser + flush) × N | Small 2KB: -2% / Med 7.5KB: +15% / Large 50KB: +34% / Repeated 10KB: +3% / Deep 11KB: +10% |
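The regime detection could be approximated as in the sketch below. StreamPipeWriter is internal to the BCL, so how the library obtains StreamPipeWriterType is not stated here; capturing it from a throwaway PipeWriter.Create(Stream.Null) instance is an assumption made for illustration, as is the FlushRegimeSketch name. The example needs the System.IO.Pipelines assembly.

```csharp
using System;
using System.IO;
using System.IO.Pipelines;

// Hypothetical sketch: the way the StreamPipeWriter type is captured is an ASSUMPTION.
public static class FlushRegimeSketch
{
    // PipeWriter.Create(Stream) returns the BCL's internal stream-backed writer.
    private static readonly Type StreamPipeWriterType =
        PipeWriter.Create(Stream.Null).GetType();

    // True → half-parallel regime (await each flush); false → true-parallel regime.
    public static bool IsStreamBacked(PipeWriter writer) =>
        StreamPipeWriterType.IsInstanceOfType(writer);
}
```

A writer from new Pipe().Writer reports false here, while PipeWriter.Create(new MemoryStream()) reports true, matching the two rows of the table.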
Allocation savings (chunked vs. raw single-shot byte[]) are regime-independent and payload-size-independent: ~30% fewer allocated KB per round-trip, because MemoryMarshal.TryGetArray direct-buffer-write into the PipeWriter's internal slab eliminates the intermediate pooled-byte[] rent that the raw path pays.
The half-parallel wall-clock cost is bounded by the BCL StreamPipeWriter design, not by AsyncPipeWriterOutput; the runtime detection is the optimal managed-layer response. Multiple sequential FlushAsync round-trips on the kernel IO are the dominant cost on Stream-backed transports, and that cost grows in proportion to payload size as the chunk count grows. JIT tier-1 warmup matters: ~5000 warmup iterations are needed to stabilize the async-state-machine code (vs ~500 for the raw path); see BINARY_TODO.md#accore-bin-t-t5j8.
Real-world latency-budget context. The +34% slowdown on Large-payload NamedPipe is +66 µs absolute. Typical IPC / HTTP / SignalR / frame-budget round-trips are 1-50 ms, so 66 µs adds roughly 0.13% (at 50 ms) to 6.6% (at 1 ms) to the round-trip, usually within the noise floor. Choose chunked for: GC-pressure-sensitive hot-paths (~30% alloc savings, zero GC pressure for the serialize step), memory-bounded hosts (chunk-bounded peak vs. payload-bounded), GB-scale payloads (Bytes mode OOMs), and parallel-capable transports (true-parallel pipeline overlap). Avoid chunked when: the workload is wall-clock-throughput-bound on a Stream-backed transport AND payload is Medium-to-Large AND GC-pressure is tolerable.
Wire Format (per chunk)
CHUNK_DATA: [201][UINT16 size][data bytes] — every chunk (self-describing, variable size)
CHUNK_END: [202] — end signal (1 byte, no data)
Max chunk data size: 65535 bytes (UINT16 max).
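A receiver for this framing might be sketched as below. As before, the UINT16 byte order is assumed to be little-endian because the text does not state it, and ChunkReaderSketch is a hypothetical name. The sketch works on a fully buffered span rather than a streaming reader.

```csharp
using System;
using System.Buffers.Binary;
using System.Collections.Generic;

// Hypothetical reader for the framing above; UINT16 byte order is ASSUMED little-endian.
public static class ChunkReaderSketch
{
    public static byte[] Reassemble(ReadOnlySpan<byte> wire)
    {
        var payload = new List<byte>();
        int i = 0;
        while (i < wire.Length)
        {
            byte tag = wire[i++];
            if (tag == 202) return payload.ToArray(); // CHUNK_END: 1 byte, no data
            if (tag != 201) throw new FormatException($"Unexpected tag {tag}");

            if (wire.Length - i < 2) throw new FormatException("Truncated chunk header");
            int size = BinaryPrimitives.ReadUInt16LittleEndian(wire.Slice(i, 2));
            i += 2;

            if (wire.Length - i < size) throw new FormatException("Truncated chunk data");
            payload.AddRange(wire.Slice(i, size).ToArray());
            i += size;
        }
        throw new FormatException("Missing CHUNK_END");
    }
}
```

Because every CHUNK_DATA frame carries its own length, the reader needs no knowledge of the writer's configured chunk size.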
Usage
Selected via BinaryProtocolMode.AsyncSegment in AcBinaryHubProtocol. The protocol's WriteMessageChunked sends CHUNK_START (standard SignalR framing), the serializer writes all chunks via AsyncPipeWriterOutput, the protocol writes [202].
AcBinarySerializer.Serialize(value, pipeWriter, options);
// All chunks already committed to PipeWriter. Protocol writes [202] and flushes.