AcBinarySerializer — TODO

This page covers planned work for the binary serializer core (format, SGen, options, deserialization context, buffer writer). Work specific to the streaming I/O layer (AsyncPipeReaderInput + AsyncPipeWriterOutput, multi-message wire framing, sliding-window buffer, producer-consumer synchronization) is tracked separately in BINARY_ASYNCPIPE_TODO.md.

Priority legend

P0 blocker · P1 important · P2 nice-to-have · P3 idea

ACCORE-BIN-T-P6M4: Universal hotpath optimization guardrails + follow-up backlog

Priority: P1 · Type: Performance

AcBinary is a universal serializer. Hotpath work must avoid benchmark-only overfitting.

For each performance TODO, validate on representative workload mixes (ASCII-heavy, mixed Latin, multi-byte UTF-8; small/medium/large/deep payloads) and evaluate throughput + latency + allocation + wire-size together.

Follow-up backlog (short):

Split oversized hot methods into inline-friendly dispatcher + cold helpers (writer/reader/populate).
Add direct fast branches for the most frequent markers before generic table-dispatch.
Reduce repeated EnsureAvailable checks by grouping fixed-width reads under one bounds check.
Extend VarUInt fast-path coverage for common 3-byte cases on metadata/index/cache-id routes.
Reorder populate/property-loop branches by runtime frequency (PropertySkip/Null/primitive fast-setters first).
Minimize pool/clear overhead by avoiding unnecessary aggressive array clearing in hot lifecycle paths.
Add early scan-pass short-circuit when options guarantee no ref/intern benefit.
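
The "one bounds check for several fixed-width reads" item can be pictured with a minimal sketch. `GroupedReader`, `ReadHeader`, and the three-field layout are hypothetical stand-ins, not the AcBinary input types or `EnsureAvailable` itself; the point is only that the explicit availability check runs once for the whole 14-byte group.

```csharp
using System;
using System.Buffers.Binary;

// Hypothetical reader used for illustration only; not the AcBinary input type.
public ref struct GroupedReader
{
    private readonly ReadOnlySpan<byte> _buffer;
    private int _position;

    public GroupedReader(ReadOnlySpan<byte> buffer)
    {
        _buffer = buffer;
        _position = 0;
    }

    public int Position => _position;

    // Reads int + long + short (4 + 8 + 2 = 14 bytes) behind a single availability check
    // instead of one check per field.
    public void ReadHeader(out int id, out long timestamp, out short flags)
    {
        const int Size = sizeof(int) + sizeof(long) + sizeof(short);
        if (_buffer.Length - _position < Size)
            throw new InvalidOperationException("Unexpected end of buffer.");

        ReadOnlySpan<byte> window = _buffer.Slice(_position, Size);
        id = BinaryPrimitives.ReadInt32LittleEndian(window);
        timestamp = BinaryPrimitives.ReadInt64LittleEndian(window.Slice(4));
        flags = BinaryPrimitives.ReadInt16LittleEndian(window.Slice(12));
        _position += Size;
    }
}
```

Whether the grouping pays off on a given marker route still has to be confirmed on the workload mixes listed above.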

ACCORE-BIN-T-K9M3: Hoist wire codec primitives to context instance methods (ser + deser, feature-aware SGen emit)

Priority: P2 · Type: Refactor + Performance · Related: ACCORE-BIN-T-P6M4 (hotpath guardrails), BINARY_ISSUES.md#accore-bin-i-t7k3 (polymorph compile-time guard)

Motivation

Wire codec logic is currently triplicated:

SGen-emit inlines marker decode/encode at every property emit site (StringInternFirstSmall, Object/ObjectRefFirst/Null/ObjectRef/FixObj-slot dispatch, etc.).
Runtime TypeReaderTable dispatches via static (ctx, _) => ReadXxx(ctx) lambdas to per-marker static helpers in AcBinaryDeserializer.
Cross-type populate (PopulateProperty fallback) repeats the same per-marker switch.

Result: bug-fix risk (three copies drift), ad-hoc divergence (the polymorph ObjectWithTypeName emit was missing on the SGen side for months — ACCORE-BIN-I-T7K3), larger generated assemblies, longer JIT time. A single instance method on the context is the natural single-source-of-truth for each wire primitive.

Pilot landed

ReadAndRegisterInternedStringSmall / Medium moved from static helpers on AcBinaryDeserializer to internal instance methods on BinaryDeserializationContext. All three call paths (TypeReaderTable lambdas, cross-type PopulateProperty switch, SGen-emit EmitReadProp case-body) now call context.ReadAndRegister...(). Generated case-body shrank from 12 lines to 3 per case — no perf regression ([AggressiveInlining] keeps the JIT/AOT inline footprint identical).
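
A rough sketch of what such a hoisted primitive might look like. `DemoDeserializationContext` is a hypothetical stand-in for `BinaryDeserializationContext`, and the wire layout (1-byte length prefix followed by UTF-8 bytes) is assumed purely for illustration; the real marker layout is not described on this page.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.CompilerServices;
using System.Text;

// Hypothetical stand-in for BinaryDeserializationContext. The layout
// (1-byte length prefix + UTF-8 bytes) is an assumption for this sketch.
public sealed class DemoDeserializationContext
{
    private readonly byte[] _buffer;
    private int _position;
    private readonly List<string> _internTable = new List<string>();

    public DemoDeserializationContext(byte[] buffer)
    {
        _buffer = buffer;
    }

    public IReadOnlyList<string> InternTable => _internTable;

    // Single source of truth for the primitive: every call path
    // (table lambda, populate switch, generated case-body) calls this method.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public string ReadAndRegisterInternedStringSmall()
    {
        int length = _buffer[_position++];
        string value = Encoding.UTF8.GetString(_buffer, _position, length);
        _position += length;
        _internTable.Add(value); // register so later references can resolve by index
        return value;
    }
}
```

With this shape, a generated case-body reduces to a single call such as `value = context.ReadAndRegisterInternedStringSmall();` plus the property assignment.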

Scope — both ser and deser

Phase A — Decode primitives (deser context)

ReadStringSmall / Medium / Big (H2Q6 non-ASCII tiers).
ReadPlainStringAscii (long ASCII tier).
ReadObject family — careful: this branches on targetType and on the writer's runtime polymorphic slot table, both of which are call-site-context-specific. May not be a clean hoist; see "Caveat" below.

Phase B — Encode primitives (ser context)

WriteStringWithDispatch, WriteStringInternFirstWithDispatch — already partly on the context, audit completeness.
Marker-write helpers (WriteObjectFullMarker*) — already on the context post-T7K3.
Audit: scan ser-side SGen-emit for any inline encode duplication that should move to the context.

Phase C — Feature-conditional SGen-emit

EmitReadProp (and the symmetric emit paths) must consult the per-type Enable*Feature flags to omit case-branches for disabled features. Today the SGen reader handles every marker regardless of the type's feature opt-outs — wasteful, and worse, it silently accepts markers the writer would never emit (instead of fail-fast):

| Disabled feature | Cases to skip in SGen reader emit |
| --- | --- |
| `EnableInternStringFeature = false` | `StringInterned`, `StringInternFirstSmall`, `StringInternFirstMedium` |
| `EnableRefHandlingFeature = false` | `ObjectRef`, `ObjectRefFirst`, `ObjectWithMetadataRefFirst` |
| `EnableMetadataFeature = false` | `ObjectWithMetadata`, `ObjectWithMetadataRefFirst` |
| `EnablePolymorphDetectFeature = false` | Already guarded by ACBIN002 (compile error if any `object` property remains on the type); symmetric here. |

After Phase C: leaner generated code per opt-out type AND wire-misuse (e.g. mixed writer/reader feature configurations) surfaces as explicit fail-fast in the default switch arm — same philosophy as ACBIN002.
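
The feature-to-marker mapping can be sketched as a small selection function. `DemoFeatures` and `DemoReaderEmit` are hypothetical names; a real generator would emit `case` bodies rather than return marker names, and the base cases shown (`Null`, `Object`) are only a placeholder for the always-present markers.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical flags mirroring the Enable*Feature options named above.
[Flags]
public enum DemoFeatures
{
    None = 0,
    InternString = 1,
    RefHandling = 2,
    Metadata = 4,
}

public static class DemoReaderEmit
{
    // Returns the marker cases a generated reader would contain for a type
    // with the given enabled features. Anything not listed falls through to
    // the default arm, which is where fail-fast would live.
    public static List<string> MarkerCasesFor(DemoFeatures enabled)
    {
        bool intern = (enabled & DemoFeatures.InternString) != 0;
        bool refs = (enabled & DemoFeatures.RefHandling) != 0;
        bool metadata = (enabled & DemoFeatures.Metadata) != 0;

        var cases = new List<string> { "Null", "Object" };
        if (intern)
        {
            cases.Add("StringInterned");
            cases.Add("StringInternFirstSmall");
            cases.Add("StringInternFirstMedium");
        }
        if (refs)
        {
            cases.Add("ObjectRef");
            cases.Add("ObjectRefFirst");
        }
        if (metadata) cases.Add("ObjectWithMetadata");
        // Needs both features: it appears in both rows of the table.
        if (refs && metadata) cases.Add("ObjectWithMetadataRefFirst");
        return cases;
    }
}
```

Note that `ObjectWithMetadataRefFirst` is dropped when either ref handling or metadata is disabled, matching its presence in two rows of the table.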

Perf guardrails (NON-NEGOTIABLE)

The hoisting MUST NOT regress SGen hot-path performance. The pilot iteration was a net positive (less IL → faster cold-start JIT, smaller native code, identical inline body); this property has to hold for every subsequent hoist.

Rules of thumb:

Every hoisted method MUST have [MethodImpl(MethodImplOptions.AggressiveInlining)].
Body must stay small (≤ ~30 IL instructions after compile) so the JIT/AOT actually inlines — verify via dotnet jit-dasm spot-check on representative callers.
Single-purpose; no if-branches across distinct call-site contexts (those stay inline at the call site where the context-specific constants are visible).
Benchmark verification before/after each hoist (Console.FullBenchmark).

JIT / NativeAOT outlook

Modern .NET JIT (≥7) and NativeAOT both honour AggressiveInlining for small bodies → the hoisted methods inline back into the caller at compile time → identical native code to the previous inline-emit. The IL is smaller (less SGen-emit per file), which gives:

Faster cold-start JIT (less IL to translate on first call per type).
Smaller assemblies on disk (NativeAOT publish size shrinks).
Smaller i-cache footprint per active hot type (since SGen-emit no longer balloons per property).

The generic <TInput> specialization remains: each ArrayBinaryInput / SequenceBinaryInput / AsyncPipeReaderInput still gets its own native body (TInput.IsTrustedSingleSegment constant-folds per specialization), so no overhead vs. the current state.

NativeAOT additionally prefers small, single-purpose methods: register-allocation (LSRA) is more effective, peephole / loop-unroll / dead-code passes run faster per method, and the published native image is denser. The previous "giant SGen-emitted ReadProperties body" pattern was actively hostile to AOT in this respect.

Caveat — where NOT to hoist

Not every inline emit is a candidate. If the inline body carries compile-time constants (typeof(TFoo) literal, direct Instance.ReadProperties call on a concrete generated reader class, nameof(prop) constant), hoisting forces those into runtime parameters: constant-folding opportunity lost AND a direct call may become virtual via interface dispatch. The Complex property dispatch (Object → new T + ReadProperties direct call) is in this category and should stay inline at the SGen emit site.

Decision per primitive: can it be expressed as a context method that takes only wire-bytes-relevant inputs (no targetType literal, no per-property setter callback)? If yes → hoist. If no → keep inline.

Acceptance

Phase A: all shared decode primitives reachable as instance methods on BinaryDeserializationContext. TypeReaderTable + cross-type populate + SGen-emit all call them. SGen-generated case-body for each affected marker is ≤ 3 lines.
Phase B: ser-side audit complete; any encode duplication closed by hoist or explicit "keep inline — see caveat" note in the SGen comment.
Phase C: SGen-emit reader honours Enable*Feature flags. Verified by spot-checking generated *.g.cs files: an EnableInternStringFeature=false type's reader does NOT contain StringInternFirstSmall / Medium / StringInterned cases.
Per-phase benchmark run (Console.FullBenchmark) confirms no hot-path regression (within noise floor).

ACCORE-BIN-T-S8P4: Replace JSON-in-Binary request parameters

Priority: P1 · Type: Refactor · Status: Closed (2026-04-26, landed in commits cdd54d3 2026-04-05 + 3b70070 2026-04-06) · Related: ../XCUT/XCUT_ISSUES.md#accore-xcut-i-x8q1 (canonical), AyCode.Services/docs/SIGNALR/SIGNALR_TODO.md

Migrate client→server request parameters from JSON-in-Binary envelope to direct Binary serialization (matching response path). Coordinated change across client, server, and all consuming projects. Do NOT attempt as side-effect of unrelated work.

Acceptance: SignalPostJsonDataMessage<T> replaced by a SignalPostBinaryDataMessage<T> (or equivalent); no JSON round-trip on the wire for request params; benchmarks confirm no regression.

Resolution

What: Length-prefixed, per-parameter binary format introduced via SignalRSerializationHelper.SerializeParametersToBinary / DeserializeParametersFromBinary; further unified into SignalParams (single byte[] carrying packed method parameters with SetParameterValues / GetParameterValues).
Where: AyCode.Services/SignalRs/AcSignalRClientBase.cs, AcWebSignalRHubBase.cs, ISignalParams.cs (server + client dispatch); IAcSignalRHubClient.cs (legacy wrappers).
Equivalent (not literal SignalPostBinaryDataMessage<T>): SignalParams was chosen over a 1:1 binary wrapper class — fewer indirections on the hot path, type-safe pack/unpack, and DataSerializerType field on SignalReceiveParams for response format indication.
Wire impact: No JSON round-trip on the wire for request params; this is a breaking change vs. previous JSON-in-Binary clients/servers (see commit message).
Legacy types: SignalPostJsonMessage, SignalPostJsonDataMessage<T>, SignalPostMessage<T>, ISignalPostMessage<T> all marked [Obsolete] in IAcSignalRHubClient.cs; deletion tracked separately in AyCode.Services/docs/SIGNALR/SIGNALR_TODO.md#accore-sig-t-s3n8 (gated on consumer migration).

ACCORE-BIN-T-Q2N7: Re-evaluate DiscountProductMapping SGen exclusion

Priority: P3 · Type: Investigation · Related: BINARY_ISSUES.md#accore-bin-i-f1w8

Investigate whether the new int Id shadowing pattern can be handled by SGen (via base-class introspection, property-setter lookup on the base) to eliminate the runtime compiled-expression fallback for this entity class.

ACCORE-BIN-T-W9F1: Generate `BinarySerializeTypeMetadata` / `BinaryDeserializeTypeMetadata` at compile time

Priority: P1 · Type: Performance · Related: BINARY_ISSUES.md#accore-bin-i-n6q3

Eliminate the dominant first-call cost (reflection + Expression.Compile in metadata ctor) for SGen types by emitting pre-built metadata from the source generator.

Design outline:

TypeMetadataBase / BinarySerializeTypeMetadata / BinaryDeserializeTypeMetadata get a second constructor that accepts pre-computed values (hashes, MinWriteSize, ComplexPropertyCount, flags, IsIId, IdAccessorType, etc.). No reflection executes in this ctor.
Source generator keeps its existing s_typeNameHash / s_propertyHashes static fields (hot-path access stays static, zero indirection) and passes the same references to the metadata — single source of truth, no duplicate computation.
ModuleInit registers both the writer/reader and the pre-built metadata into a GeneratedMetadataRegistry. GetWrapperSlow consults this registry first, falling back to the reflection-based MetadataFactory for runtime-only types.
Lazy RuntimeInit() pattern for Expression.Compile property accessors:
- TypeMetadataBase gets volatile bool _runtimeInitialized + internal void RuntimeInit() (idempotent, no lock needed).
- GetWrapperSlow calls metadata.RuntimeInit() only when wrapper.GeneratedWriter == null || !Options.UseGeneratedCode — SGen types skip it entirely (they never touch runtime accessors on their own metadata; non-SGen child types have their own metadata and run the factory path normally).
- Hybrid mode stays correct: an SGen type on the SGen path never uses its own property accessors; a non-SGen child type's metadata runs the reflection ctor as today.
volatile guards the flag; multiple contexts may race into RuntimeInit, second run is a no-op.
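
A minimal sketch of the lazy `RuntimeInit()` shape, assuming the accessor construction is idempotent (racing threads build equivalent results). `DemoTypeMetadata`, the accessor array, and the `InitRuns` counter are hypothetical; the lambda stands in for the real `Expression.Compile` work.

```csharp
using System;
using System.Threading;

// Hypothetical stand-in for TypeMetadataBase.
public class DemoTypeMetadata
{
    private volatile bool _runtimeInitialized;
    private Func<object, string>[] _accessors;

    // Demo-only counter so the no-op behaviour of a second call is observable.
    public int InitRuns;

    public int AccessorCount => _accessors == null ? 0 : _accessors.Length;

    // Idempotent and lock-free: two racing callers may both build accessors,
    // but the results are equivalent, so the last write is harmless.
    internal void RuntimeInit()
    {
        if (_runtimeInitialized) return;

        // Stand-in for the reflection + Expression.Compile work.
        var built = new Func<object, string>[] { o => o.ToString() };

        _accessors = built;                  // publish the data first
        Interlocked.Increment(ref InitRuns);
        _runtimeInitialized = true;          // volatile write: the flag becomes visible after the data
    }
}
```

A caller on the runtime path would invoke `metadata.RuntimeInit()` before touching accessors, while SGen types would never reach it.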

Thread safety: GlobalMetadataCache is ConcurrentDictionary; generated metadata is registered once at ModuleInit; wrapper construction is per-context and unchanged.

Acceptance:

Cold benchmark: first Serialize<T> of a fresh SGen type shows no reflection / Expression.Compile on the call stack.
Runtime fallback (UseGeneratedCode=false) still produces identical wire output and uses the full metadata accessors.
Deserialize side has parity (same approach for BinaryDeserializeTypeMetadata).
Existing tests pass; wire format unchanged.

ACCORE-BIN-T-T5J8: JIT Tier 1 warmup for generated hot methods

Priority: P2 · Type: Performance · Related: BINARY_ISSUES.md#accore-bin-i-n6q3

After ACCORE-BIN-T-W9F1 lands, JIT of generated WriteProperties / ScanObject / ScanForDuplicates becomes the dominant residual first-call cost for SGen types. Options to evaluate (benchmark before committing):

[MethodImpl(MethodImplOptions.AggressiveOptimization)] on the generated hot methods — skips Tier 0, compiles directly at Tier 1. Simple generator change. Trade-off: larger one-time JIT cost in exchange for eliminating the Tier 0→1 recompile step.
Background prewarm from ModuleInit: Task.Run(() => RuntimeHelpers.PrepareMethod(handle)) for each registered writer/reader method. Parallelizes JIT with app startup. Keep it opt-in (option flag) to avoid surprising consumers with extra startup threads.
ReadyToRun (R2R) in consuming projects' publish config — pre-compiles IL to native at publish time. External to SGen, complementary. Document as a recommended publish setting.
Code chunking (split generated methods exceeding a property threshold into sub-methods, e.g. WriteProperties_Part1 / _Part2) — measure first. Only beneficial for unusually large types (20+ properties / nested collections). Call overhead can offset gains; JIT inliner may already handle reasonably-sized methods well.
try / finally audit on hot path — On .NET 9 (project's minimum target), JIT silently refuses to inline any method containing an EH region (AggressiveInlining is ignored). [.NET 10 partially lifts this for same-module try-finally — see dotnet/runtime#112998, merged 2025-03-20 — but catch, cross-module, and P/Invoke-stub cases stay blocked. Until project's minimum runtime moves to .NET 10, treat EH as an absolute inlining barrier; even after the upgrade, several sub-cases keep the rule.] Audit scope:
- Hand-written bridges: WriteValueGenerated / WriteObjectGenerated / WriteStringGenerated / ScanValueGenerated and any helper called from generated WriteProperties for accidental try/finally / using blocks.
- SGen output template (AcBinarySourceGenerator.cs): generated WriteProperties / ScanObject / ScanForDuplicates / ReadObject / ReadProperties MUST stay straight-line. Future feature additions ([CustomSerializer] / [CustomDeserializer] hooks, OnSerializing / OnDeserialized callbacks, validation attributes, rented-buffer using blocks) are tempting candidates for try/catch/finally — emit them in separate cold helpers, never inline into the generated hot method. A single accidental try block in WriteProperties makes the whole generated method non-inlinable, killing the SGen Root Fast Path benefit.
- Resource cleanup (Pool/ArrayPool/Dispose) belongs in Serialize<T> entry-frame only, not in per-property helpers or generated hot methods. See BINARY_IMPLEMENTATION.md Rule #3 (Inlining barriers) and BINARY_SGEN.md (SGen Output Constraints).
stackalloc size discipline on hot path — On .NET 9, methods containing localloc (any C# stackalloc) historically blocked inlining. Modern .NET allows inlining only for fixed-size stackalloc ≤ 32 bytes outside loops (see dotnet/runtime#7113) — anything larger or loop-nested still blocks. Our typical scratch-buffer patterns (UTF-8 encoding scratch, ArrayPool fallbacks) sit far above 32 bytes (256+), so any helper containing such a stackalloc is non-inlinable. Combined with try/finally for ArrayPool.Return cleanup, the method is doubly non-inlinable on .NET 9. Plan accordingly: keep stackalloc-using helpers as deliberate cold call-frames, not as AggressiveInlining candidates.
Native AOT — out of scope for this TODO; separate architectural decision with deployment-model implications.
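
The background-prewarm option could look roughly like the sketch below. `DemoPrewarm` and `DemoWriter` are hypothetical; only `RuntimeHelpers.PrepareMethod` and `Task.Run` are real APIs. The sketch looks methods up by name, so it assumes no overloads (an overloaded name would make `GetMethod` throw), and it skips open generic methods, which need the instantiation-aware `PrepareMethod` overload.

```csharp
using System;
using System.Reflection;
using System.Runtime.CompilerServices;
using System.Threading.Tasks;

// Hypothetical generated type used only to have something to prewarm.
public static class DemoWriter
{
    public static int WriteProperties(int value) => value + 1;
}

public static class DemoPrewarm
{
    // Forces JIT compilation of the named methods on a background thread.
    public static Task PrewarmAsync(Type generatedType, params string[] methodNames)
    {
        return Task.Run(() =>
        {
            const BindingFlags Flags = BindingFlags.Public | BindingFlags.NonPublic |
                                       BindingFlags.Static | BindingFlags.Instance;
            foreach (string name in methodNames)
            {
                MethodInfo method = generatedType.GetMethod(name, Flags);
                if (method == null || method.IsAbstract || method.ContainsGenericParameters)
                    continue; // nothing to compile, or needs explicit instantiation
                RuntimeHelpers.PrepareMethod(method.MethodHandle);
            }
        });
    }
}
```

Which JIT tier the prepared code lands in is runtime-dependent, so the gain should be benchmarked rather than assumed.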

Acceptance:

Benchmark a realistic entity graph (≥ 3 referenced child types) and show first-call time within ~10% of steady-state after ACCORE-BIN-T-W9F1 + chosen mitigation(s).
Document which combination is recommended for SignalR hot-path workloads vs. batch serialization.

ACCORE-BIN-T-Z3K8: Replace `IId<T>` interface dependency with convention/attribute-based Id detection

Priority: P1 · Type: Refactor

The binary serializer currently detects Id-tracking properties via the IId<T> interface (AyCode.Interfaces). This couples the serializer to a framework-specific abstraction and forces consumer types to implement the interface for tracking participation. Move to a POCO-friendly detection scheme:

IdDetectionMode.Convention (default) — convention-based; any property named Id is treated as the tracking key. Zero-friction onboarding.
IdDetectionMode.Attribute — explicit; only properties marked with a serializer-native [Id] (or similar) attribute are tracked.
[IgnoreId] attribute — escape hatch in Convention mode to exclude an Id-named property from tracking when the developer wants explicit opt-out.

Implicit contract for Convention mode: within a single type, Id values must be unique. Whether Id semantically represents a primary key or a sequence number is irrelevant; the tracker keys by (Type, Id), so per-type uniqueness is the only requirement. Violating this invariant typically signals a domain-modelling problem, not a serializer bug. Design rationale discussed in conversation 2026-04-27.
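
The detection rules can be sketched with plain reflection. The attribute and enum names follow the working names in this entry but are not final; `DemoIdDetector` is hypothetical, and the sketch ignores the `new int Id` shadowing case tracked in ACCORE-BIN-T-Q2N7.

```csharp
using System;
using System.Reflection;

public enum IdDetectionMode { Convention, Attribute }

// Working names from this TODO; the final attribute names are undecided.
[AttributeUsage(AttributeTargets.Property)]
public sealed class IdAttribute : Attribute { }

[AttributeUsage(AttributeTargets.Property)]
public sealed class IgnoreIdAttribute : Attribute { }

public static class DemoIdDetector
{
    // Returns the tracking-key property for a type, or null when none qualifies.
    public static PropertyInfo FindIdProperty(Type type, IdDetectionMode mode)
    {
        foreach (PropertyInfo p in type.GetProperties(BindingFlags.Public | BindingFlags.Instance))
        {
            if (mode == IdDetectionMode.Attribute)
            {
                // Explicit mode: only [Id]-marked properties count.
                if (p.IsDefined(typeof(IdAttribute), false)) return p;
            }
            else if (p.Name == "Id" && !p.IsDefined(typeof(IgnoreIdAttribute), false))
            {
                // Convention mode: a property named Id, unless opted out.
                return p;
            }
        }
        return null;
    }
}

// Example shapes: a plain POCO, an attribute-tagged type, and an opt-out.
public class DemoPoco { public int Id { get; set; } public string Name { get; set; } }
public class DemoTagged { [Id] public int Key { get; set; } }
public class DemoOptedOut { [IgnoreId] public int Id { get; set; } }
```

In this sketch `DemoPoco` is tracked in Convention mode without referencing any interface, while `DemoTagged` is only picked up in Attribute mode.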

Acceptance:

Binary serializer no longer references IId<T> in any execution path (no interface checks, no where T : IId<TKey> constraints in the serializer surface).
Wire format unchanged.
Existing consumers using IId<T>-implementing types still work transparently in Convention mode (their Id property is detected via convention).
New consumers can use plain POCOs with no AyCode.Interfaces dependency.
IdDetectionMode exposed on AcBinaryOptions (or successor options class post-rebrand).
Default mode = Convention.

ACCORE-BIN-T-N7V1: Replace `[JsonIgnore]` dependency with serializer-native ignore attribute

Priority: P2 · Type: Refactor

Property exclusion from binary serialization currently relies on [JsonIgnore] (Newtonsoft.Json). This couples the binary serializer to a third-party JSON library's attribute and is conceptually wrong — a binary serializer should not consult a JSON-specific marker for its exclusion semantics.

Define a serializer-native ignore attribute (working name [BinaryIgnore]; final name TBD pending broader rebrand). For backward compatibility during transition, also continue recognizing [JsonIgnore] with a deprecation note.

Possible cross-cutting consideration: if Toon and other future serializers also need property-exclusion, a single shared attribute (e.g., [SerializerIgnore] in a common abstractions package) may be cleaner than per-serializer attributes. Decide before naming finalizes — this may belong in XCUT_TODO.md rather than purely BINARY scope.

Acceptance:

Native ignore attribute defined in the binary serializer's namespace (or shared abstractions package, pending the cross-cutting decision above).
Both native attribute and [JsonIgnore] recognized during a transitional period; native attribute takes precedence on conflict.
[JsonIgnore] recognition flagged for removal in a future major version (track in a follow-up cleanup TODO once consumer projects have migrated).
No new code dependency on Newtonsoft.Json for property-exclusion logic.

ACCORE-BIN-T-Y6R2: Implement projection serialization phase 1 (runtime path)

Priority: P1 · Type: Feature · Related: ../adr/0001-binary-projection-serialization.md (canonical)

Implement the phase 1 runtime path of source→target projection serialization per ADR 0001. See the ADR for full context, decision rationale, alternatives, consequences, and acceptance criteria.

Sibling rebrand-prep TODOs: ACCORE-BIN-T-Z3K8 (IId migration), ACCORE-BIN-T-N7V1 (JsonIgnore replacement).

ACCORE-BIN-T-K3W7: Rename `BufferWriterChunkSize` to reflect actual semantics

Priority: P3 · Type: Refactor · Breaking: Yes (public option API) · Streaming impact: see BINARY_ASYNCPIPE_TODO.md for the streaming-side companion considerations (chunk-on-wire vs internal-buffer semantics)

The property name BufferWriterChunkSize is misleading: across the three output paths it does NOT consistently represent a "chunk".

| Output path | What `BufferWriterChunkSize` actually controls | Wire-format chunk? |
| --- | --- | --- |
| `ArrayBinaryOutput` (Byte[] API) | Initial buffer capacity of the internal `byte[]` | No |
| `BufferWriterBinaryOutput` (IBufferWriter overload) | Internal buffer size: how much data accumulates before `Advance()` + new `GetMemory()` on the underlying writer | No |
| `AsyncPipeWriterOutput` (streaming) | Both internal buffer and wire-format chunk frame size for chunked framing | Yes (only here) |
| Receive side (`AsyncPipeReaderInput`) | Initial receive buffer = `BufferWriterChunkSize × 2` | No (just sizing hint) |

Only the streaming AsyncPipeWriterOutput path has a wire-format "chunk" concept (chunked framing for length-prefixed segments). On the other three paths in the table (75%) the property name reads as if the serializer were segmenting the payload, which is not what happens.

Possible directions (decide before implementing):

1. Single rename, semantic-neutral: BufferWriterChunkSize → BufferWriterBufferSize or BufferWriterPageSize. Minimal API surface change, single-property semantics preserved. Downside: still slightly off for the streaming path where there IS chunked framing.
2. Two-property split: InternalBufferSize (universal: how much data accumulates before Advance/Grow) + StreamingChunkSize (only meaningful for AsyncPipeWriterOutput; separate knob, defaults to InternalBufferSize). Cleanest semantics, most ceremony, slightly more options to document.
3. No rename, streaming-honest docs: keep BufferWriterChunkSize but document explicitly that on non-streaming paths the value is repurposed as buffer size. Cheapest change (docs only). Downside: doesn't fix the underlying confusion the property name causes.

Pick one before touching code. Option 2 is the most correct but adds API surface; Option 1 is the pragmatic middle.

Affected callers / docs to update on rename:

AcBinarySerializerOptions.cs (definition)
AcBinarySerializer.cs × 3 sites (ArrayBinaryOutput ctor, BufferWriterBinaryOutput ctor, AsyncPipeWriterOutput ctor)
AcBinaryDeserializer.cs × 1 site (receive-side initial capacity derivation)
AsyncPipeReaderInput.cs — XML doc cross-refs
BINARY_WRITERS.md, BINARY_TODO.md (this entry), BINARY_ISSUES.md (line 151 — already lists BufferWriterChunkSize among the struct-mutation issue's affected setters)
Consumer-side: AyCode.Services/SignalRs/AcBinaryHubProtocol.cs ctor mutates _options.BufferWriterChunkSize = options.BufferSize; — see BINARY_ISSUES.md#accore-bin-i-... (struct-mutation context). Coordinate the rename with the struct-mutation fix to avoid two cross-cutting churn waves on the same property.

Acceptance:

Property renamed (or split) per the chosen direction; all internal references updated.
XML docs reflect the actual semantics on each output path (initial capacity / advance threshold / chunk frame size — whichever applies).
Consumer-side usage in AcBinaryHubProtocol updated; if Option 2 is chosen, the protocol uses StreamingChunkSize (the streaming knob), not the universal one.
Wire format unchanged. Default values unchanged (65535 / equivalent).
Migration note in CHANGELOG / release notes since this is a breaking change to AcBinarySerializerOptions.

ACCORE-BIN-T-M4D2: Add `ReadOnlyMemory<byte>` / `Memory<byte>` deserialize overloads

Priority: P3 · Type: Feature

The public AcBinaryDeserializer.Deserialize surface accepts byte[] (with optional offset/length) and ReadOnlySequence<byte>, but not ReadOnlyMemory<byte> / Memory<byte>. Consumers that hold a ReadOnlyMemory<byte> (cached payloads, message-broker frames, in-memory pipe slices) must call .ToArray() to round-trip through byte[] — unnecessary copy + GC alloc.

Implementation:

Deserialize<T>(ReadOnlyMemory<byte> data, AcBinarySerializerOptions options) and the non-generic Type-based variant.
Body: MemoryMarshal.TryGetArray(data, out var seg) → array-backed path delegates to Deserialize<T>(seg.Array!, seg.Offset, seg.Count, options) (zero-copy). Non-array-backed fallback (rare — custom MemoryManager<T> with native memory) copies into a pooled byte[].
Memory<byte> overload trivially delegates to the ReadOnlyMemory<byte> one (Memory<byte> is implicitly convertible).
No new input-strategy struct needed — reuses existing ArrayBinaryInput.
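
The delegation described above can be sketched as follows. `DemoDeserializer` is hypothetical, and its `byte[]` entry point just sums bytes as a stand-in for the real deserialize call; `MemoryMarshal.TryGetArray` and `ArrayPool<byte>` are the real APIs.

```csharp
using System;
using System.Buffers;
using System.Runtime.InteropServices;

public static class DemoDeserializer
{
    // Stand-in for the existing byte[] (offset/length) entry point.
    public static int Deserialize(byte[] buffer, int offset, int count)
    {
        int sum = 0;
        for (int i = offset; i < offset + count; i++) sum += buffer[i];
        return sum;
    }

    public static int Deserialize(ReadOnlyMemory<byte> data)
    {
        // Array-backed memory: hand the underlying array straight through (zero-copy).
        if (MemoryMarshal.TryGetArray(data, out ArraySegment<byte> seg))
            return Deserialize(seg.Array, seg.Offset, seg.Count);

        // Rare fallback (e.g. native memory behind a custom MemoryManager<T>):
        // copy into a pooled array. The try/finally keeps this a cold path.
        byte[] rented = ArrayPool<byte>.Shared.Rent(data.Length);
        try
        {
            data.Span.CopyTo(rented);
            return Deserialize(rented, 0, data.Length);
        }
        finally
        {
            ArrayPool<byte>.Shared.Return(rented);
        }
    }

    // Memory<byte> converts implicitly to ReadOnlyMemory<byte>.
    public static int Deserialize(Memory<byte> data) => Deserialize((ReadOnlyMemory<byte>)data);
}
```

The rented array may be larger than `data.Length`, which is why the fallback passes the original length rather than `rented.Length`.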

Acceptance:

Both overloads compile and pass round-trip tests against byte[]-equivalent input.
Array-backed path measurably zero-alloc (BenchmarkDotNet allocation diagnoser).
Non-array-backed path documented as fallback (separate using var pooled = MemoryPool<byte>.Shared.Rent(...) style copy).
API doc-strings cross-reference the existing byte[] and ReadOnlySequence<byte> overloads.

ACCORE-BIN-T-S7X3: Add `ReadOnlySpan<byte>` deserialize overload

Priority: P2 · Type: Feature · Related: ACCORE-BIN-T-M4D2

The MemoryPack-style Deserialize<T>(ReadOnlySpan<byte>) API enables direct deserialization from stack-allocated buffers (stackalloc byte[256]), pinned native memory (fixed blocks), and ReadOnlyMemory<byte>.Span slices without round-tripping through a heap-allocated byte[]. The current AcBinary surface lacks this entry point.

Design tension: the existing IBinaryInputBase.Initialize(out byte[] buffer, ...) contract returns a byte[], and a ReadOnlySpan<byte> cannot be stored in a regular struct field, only in a ref struct field. Three implementation paths were considered; only the first is viable:

1. ref struct SpanBinaryInput + interface bump to support ref byte buffer / int length fields. Pure zero-copy from any span. Cost: BinaryDeserializationContext<TInput> and IBinaryInputBase need a parallel ref-struct-friendly track (the existing pooled context cannot hold a ref struct). Major surgery on the deser core.
2. MemoryMarshal.CreateReadOnlySpanFromNullTerminated-style hack: accept ReadOnlySpan<byte>, use Unsafe.AsRef/MemoryMarshal.GetReference to obtain a ref byte, then copy into a pooled byte[] before deserialization. Not zero-copy, defeats the purpose. Reject.
3. Pinned-buffer trampoline: accept ReadOnlySpan<byte>, allocate a Memory<byte> view via a MemoryManager<byte>-like wrapper, delegate to the ReadOnlyMemory<byte> overload. Awkward, allocates per call. Reject.

Recommendation: option (1) is the only correct path, but it's a substantial refactor — measure first whether real consumer demand justifies the surgery. The current byte[]-based pool-pattern outperforms MemoryPack on the dominant use-cases per existing benchmarks; this overload addresses an API-surface gap, not a perf gap.

Acceptance:

Deserialize<T>(ReadOnlySpan<byte> data, AcBinarySerializerOptions options) compiles and round-trips against byte[]-equivalent input.
Zero-alloc path verified for stackalloc-source spans (BenchmarkDotNet allocation diagnoser).
IBinaryInputBase (or successor interface) refactor preserves backward compatibility for existing ArrayBinaryInput / SequenceBinaryInput / AsyncPipeReaderInputAdapter consumers.
Doc-strings cross-reference the byte[] / ReadOnlyMemory<byte> (ACCORE-BIN-T-M4D2) / ReadOnlySequence<byte> overloads with use-case guidance.

ACCORE-BIN-T-T8K3: Add `SerializeAsync(Stream, T)` async overloads with mode-driven output strategy

Priority: P1 · Type: Feature · Related: ACCORE-BIN-T-N9G6 (Type-based coordination)

Mainstream serializers (System.Text.Json, MessagePack, MemoryPack) expose SerializeAsync(Stream, T) as a primary entry point for async file I/O, network response bodies, and log streaming. AcBinary's public API surface MUST include this overload regardless of what we do internally; consumers expect a Stream parameter and will not go looking for PipeWriter.Create(stream) workarounds. Its absence would block market entry.

Mode-driven output strategy — three lanes for three workload shapes

AcBinary already models the three output strategies in BinaryProtocolMode (AyCode.Services/SignalRs/BinaryProtocolMode.cs) for the SignalR side. The same three-lane shape applies to the public SerializeAsync(Stream) API. Promote the concept to AcBinary core scope (e.g. AcBinaryOutputMode in AyCode.Core/Serializers/Binaries/) and let the SignalR BinaryProtocolMode either alias it or migrate to it. Migration timing: the existing BinaryProtocolMode keeps shipping until the new public API is stabilized; both names live for one major version, then BinaryProtocolMode becomes a using-alias.

| Mode | Output strategy | Peak memory | Pipeline parallelism | Use when |
| --- | --- | --- | --- | --- |
| `Bytes` (default) | `Serialize(T) → byte[]` + `stream.WriteAsync(bytes)` | Full payload in `byte[]` (pooled) | No | Typical payloads (<10 MB), throughput-focus |
| `Segment` | `BufferWriterBinaryOutput` → `PipeWriter`, single closing flush | PipeWriter pause-threshold-bounded (~64 KB Kestrel default) | No | Mid-size payloads, zero-copy desired |
| `AsyncSegment` | `SerializeChunked(PipeWriter)`, per-chunk async flush | Chunk-size-bounded (~8 KB at default `BufferWriterChunkSize`) | Yes (on a parallel-capable PipeWriter: Kestrel / `Pipe`) | Very large payloads (>10 MB), memory-tight hosts, parallel-capable transport |
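
A minimal sketch of how a mode-driven `SerializeAsync(Stream, T)` entry point might dispatch. Only the `Bytes` lane is filled in; the `serialize` delegate is a hypothetical stand-in for the existing `Serialize(T) → byte[]` call, the sketch uses a plain (non-pooled) array, and `DemoStreamSerializer` is not the planned API surface.

```csharp
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

public enum AcBinaryOutputMode { Bytes = 0, Segment = 1, AsyncSegment = 2 }

public static class DemoStreamSerializer
{
    // `serialize` is a placeholder for the real byte[]-producing serializer call.
    public static async ValueTask SerializeAsync<T>(
        Stream stream,
        T value,
        Func<T, byte[]> serialize,
        AcBinaryOutputMode mode = AcBinaryOutputMode.Bytes,
        CancellationToken cancellationToken = default)
    {
        switch (mode)
        {
            case AcBinaryOutputMode.Bytes:
            {
                // Whole payload first, then one contiguous async write.
                byte[] payload = serialize(value);
                await stream.WriteAsync(payload.AsMemory(), cancellationToken).ConfigureAwait(false);
                break;
            }
            case AcBinaryOutputMode.Segment:
            case AcBinaryOutputMode.AsyncSegment:
                // The PipeWriter-backed lanes are out of scope for this sketch.
                throw new NotSupportedException("Mode " + mode + " is not part of this sketch.");
            default:
                throw new ArgumentOutOfRangeException(nameof(mode));
        }
    }
}
```

Keeping the mode as an explicit parameter (or option) leaves the choice with the caller, which is the positioning this entry argues for.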

Honest performance positioning vs. MemoryPack — three real axes

MemoryPack's SerializeAsync(Stream) is pseudo-streaming — serializes the entire payload into a pool-allocated linked-list buffer first (ReusableLinkedArrayBufferWriter), then writes the completed buffer to the stream in a single closing fence. Peak memory ≈ payload size; no pipeline parallelism. AcBinary's Bytes mode is architecturally similar (single pooled contiguous byte[] vs. MemoryPack's linked-list) — comparable peak-memory cost, often faster on the wire due to one contiguous WriteAsync call.

AcBinary's AsyncSegment mode is architecturally different in three real ways MemoryPack cannot match:

| Axis | `Bytes` mode (default) | `AsyncSegment` mode | MemoryPack `SerializeAsync` |
| --- | --- | --- | --- |
| Heap allocation per call | Pooled `byte[]` rent (peak ≈ payload size) | Truly zero: `ArrayPool` + pooled context + `MemoryMarshal.TryGetArray` direct-buffer-write into the transport's own `byte[]` | Pool-allocated linked-list buffer per call (peak ≈ payload size) |
| Peak managed memory | ≈ payload size | ≈ chunk size (`BufferWriterChunkSize`, e.g. 4-8 KB) | ≈ payload size |
| GC pressure | Touches GC pool on every call | Never touches GC for the serialize itself | Touches GC pool on every call |
| Pipeline parallelism | No | Yes on parallel-capable PipeWriter (Kestrel transport, `new Pipe()`) | No |
| GB-scale payload | OOM risk on memory-tight hosts | Works | OOM risk |

The AsyncSegment zero-alloc claim is literal, not "almost zero": AsyncPipeWriterOutput.AcquireChunk calls _pipeWriter.GetMemory(chunkSize) and uses MemoryMarshal.TryGetArray(memory, out segment) to obtain the transport's own internal byte[] — the serializer writes directly into it. With chunkSize aligned to the transport's internal buffer (e.g. NamedPipe-server pipe-buffer-size), one chunk is one kernel-level transfer; no managed-side double-fragmentation.

Throughput nuance — `AsyncSegment` cost on Stream-backed transports

AsyncSegment IS slightly slower than Bytes on StreamPipeWriter-backed transports (NamedPipe / FileStream / NetworkStream), but not for the reason that initially seems obvious:

The cost is NOT "managed-side double-fragmentation on top of OS-level fragmentation" — that's not what happens. MemoryMarshal.TryGetArray zero-copy direct-buffer-writes mean the managed chunking is the same chunking the kernel does anyway, not redundant.
The cost IS the per-chunk async-await round-trip (SyncAwaitFlush(_lastFlush) blocks until the kernel acknowledges the write), forced sequential by the StreamPipeWriter._tailMemory reset race (ACCORE-BIN-I-...). N async cycles vs 1 in Bytes mode.
Empirically the gap is roughly 1.2-1.5x on NamedPipe — not 2-5x. The dominant cost on these transports is the transport itself (Windows IRP / Linux FIFO syscall overhead), independent of the serializer mode.

When AsyncSegment wins outright:

GC-sensitive hot-paths (server hubs, real-time game tick loops, mobile UI thread, embedded targets): zero-alloc + zero-GC-pressure beats a 1.2x throughput edge every time.
Memory-tight hosts (mobile, WASM, container-trimmed, embedded): chunk-bounded peak memory is the only option.
GB-scale payloads: Bytes OOMs; AsyncSegment works.
Kestrel transport / parallel-capable Pipe: pipeline parallelism makes AsyncSegment faster than Bytes for medium-to-large payloads.

When Bytes wins outright:

Typical NuGet workload (small-to-medium payload, throughput priority, GC-tolerant): one async cycle vs N is the simpler, faster path.
MemoryStream (in-memory): one large byte[] copy decisively beats N managed chunks.

Marketing claim — three-way honest comparison

"AcBinary offers a real choice. Bytes mode for typical throughput-priority workloads (matches MemoryPack's pseudo-streaming, often faster on the wire). AsyncSegment mode for the workloads MemoryPack cannot serve: zero-alloc serialize for GC-sensitive hot-paths, chunk-bounded peak memory for tight-budget hosts, GB-scale payloads, and pipeline parallelism on parallel-capable transports. You pick the mode; MemoryPack picks for you."

This is honest — does not overclaim universal speed, does not hide the small AsyncSegment cost on Stream-backed transports, AND clearly surfaces the three differentiator axes (alloc / memory / parallelism) where AcBinary architecturally beats MemoryPack.

Implementation outline:

New enum AcBinaryOutputMode { Bytes = 0, Segment = 1, AsyncSegment = 2 } in AyCode.Core/Serializers/Binaries/. Default Bytes.
New mode field on AcBinarySerializerOptions: AcBinaryOutputMode OutputMode { get; set; } = AcBinaryOutputMode.Bytes;. (Note: subject to ACCORE-BIN-I-L8N5 thread-safety treatment — defensive copy / immutable refactor coordination.)
public static ValueTask SerializeAsync<T>(T value, Stream stream, AcBinarySerializerOptions? options = null, bool leaveOpen = false, CancellationToken ct = default):
- Switch on options.OutputMode:
  - Bytes → serialize into the pooled byte[] (length = bytes written); await stream.WriteAsync(buffer.AsMemory(0, length), ct); then ArrayPool<byte>.Shared.Return(buffer) in a finally block;
  - Segment → var pw = PipeWriter.Create(stream, new(leaveOpen: leaveOpen)); Serialize(value, pw, options); await pw.CompleteAsync();
  - AsyncSegment → var pw = PipeWriter.Create(stream, new(leaveOpen: leaveOpen)); SerializeChunked(value, pw, options); await pw.CompleteAsync();
public static ValueTask SerializeAsync(object? value, Type type, Stream stream, ...) — non-generic, same dispatch (coordinated with ACCORE-BIN-T-N9G6).
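leaveOpen parameter follows the BCL leaveOpen convention (StreamWriter, BinaryWriter, StreamPipeWriterOptions). System.Text.Json and MessagePack take no such parameter on their stream overloads and never close the caller's stream.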
The Bytes mode uses a pooled byte[] from ArrayBinaryOutput to keep alloc cost amortized.
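The mode dispatch outlined above could look roughly like the following sketch. Everything here is a stand-in: `OutputMode` mirrors only two of the three proposed AcBinaryOutputMode values, `StreamModeSketch` is hypothetical, and UTF-8 encoding of a string replaces the real serializer. AsyncSegment is omitted because it depends on AcBinary's own SerializeChunked. Disposing the stream in Bytes mode when leaveOpen is false is an assumption made to match what PipeWriter.Create does on completion.

```csharp
using System;
using System.Buffers;
using System.IO;
using System.IO.Pipelines;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

var target = new MemoryStream();
await StreamModeSketch.SerializeAsync("hello", target, OutputMode.Bytes, leaveOpen: true);
await StreamModeSketch.SerializeAsync("hello", target, OutputMode.Segment, leaveOpen: true);
Console.WriteLine(target.Length);

enum OutputMode { Bytes = 0, Segment = 1 }   // hypothetical subset of AcBinaryOutputMode

static class StreamModeSketch
{
    public static async ValueTask SerializeAsync(string value, Stream stream, OutputMode mode,
        bool leaveOpen = false, CancellationToken ct = default)
    {
        int max = Encoding.UTF8.GetMaxByteCount(value.Length);
        switch (mode)
        {
            case OutputMode.Bytes:
            {
                // One pooled contiguous buffer, one WriteAsync call.
                byte[] rented = ArrayPool<byte>.Shared.Rent(max);
                try
                {
                    int written = Encoding.UTF8.GetBytes(value, 0, value.Length, rented, 0);
                    await stream.WriteAsync(rented.AsMemory(0, written), ct);
                }
                finally { ArrayPool<byte>.Shared.Return(rented); }
                if (!leaveOpen) await stream.DisposeAsync();
                break;
            }
            case OutputMode.Segment:
            {
                // Write into the PipeWriter's buffer; completing the writer flushes to the stream.
                PipeWriter pw = PipeWriter.Create(stream, new StreamPipeWriterOptions(leaveOpen: leaveOpen));
                pw.Advance(Encoding.UTF8.GetBytes(value, pw.GetSpan(max)));
                await pw.FlushAsync(ct);
                await pw.CompleteAsync();
                break;
            }
            default:
                throw new ArgumentOutOfRangeException(nameof(mode));
        }
    }
}
```

Note that only the bytes actually written are sent in Bytes mode; a rented array is usually larger than the payload, so writing the whole array would corrupt the stream.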

SignalR migration coordination: the existing BinaryProtocolMode enum (in AyCode.Services) keeps shipping unchanged until the new public API is stabilized. After stabilization, BinaryProtocolMode becomes a deprecated alias of AcBinaryOutputMode, eventually removed in a major-bump. No SignalR-side churn during this TODO's implementation.

Acceptance:

SerializeAsync<T> round-trips against Deserialize<T>(byte[]) via MemoryStream in all three modes.
Cancellation propagates correctly (OperationCanceledException on cancelled token mid-stream).
Throughput matrix benchmark: 4 transports (MemoryStream, FileStream, NamedPipeStream, NetworkStream) × 3 modes × 3 payload sizes (small ~1 KB / medium ~100 KB / large ~10 MB). Results documented in Test_Benchmark_Results/Benchmark/SerializeAsync_Stream_Modes.LLM (or similar) and surfaced as a doc-string table for consumer guidance.
Memory-bounded benchmark: 100 MB payload to FileStream in AsyncSegment mode → peak managed-heap delta ≤ 1 MB throughout. Same payload in Bytes mode → peak ~100 MB (expected, documented).
API doc-string contains a "When to use which mode?" decision matrix; explicitly compares with MemoryPack's pseudo-streaming.
leaveOpen parameter behaves per the BCL leaveOpen convention (as in StreamWriter / StreamPipeWriterOptions) across all three modes.

ACCORE-BIN-T-D7K4: Add `DeserializeAsync(Stream, T)` async overloads with mode-driven input strategy

Priority: P1 · Type: Feature · Related: ACCORE-BIN-T-T8K3 (companion write-side overload), ACCORE-BIN-T-N9G6 (non-generic Type-based dispatch)

Companion to T8K3 on the receive side. Mainstream serializers (System.Text.Json, MessagePack, MemoryPack) all expose DeserializeAsync<T>(Stream), the symmetric counterpart of SerializeAsync(Stream, T). AcBinary's public API surface MUST include this overload for parity: consumers expect a Stream parameter on receive paths (file load, HTTP response body, network stream) and should not have to discover PipeReader.Create(stream) workarounds. Market-entry-blocking otherwise.

Implementation: zero new `IBinaryInputBase` impl needed

The existing receive-side primitives cover the full strategy space via BCL PipeReader.Create(stream):

Mode	Input strategy	Peak memory	Pipeline parallelism	Use when
`Bytes` (default)	`await stream.CopyToAsync(MemoryStream)` → `Deserialize<T>(byte[])` (existing overload)	Full payload as `byte[]` (pooled)	No	Typical payloads (<10 MB), throughput-focus
`Segment`	`await PipeReader.Create(stream).ReadAtLeastAsync(int.MaxValue)` → `Deserialize<T>(ReadOnlySequence<byte>)` (existing overload)	Full payload, held as pooled PipeReader segments (no contiguous `byte[]`)	No	Mid-size payloads, no full byte[] alloc desired
`AsyncSegment`	`AsyncPipeReaderInput` + `DrainFromAsync(PipeReader.Create(stream))` + `Deserialize<T>(input)` (existing overload)	Chunk-size-bounded (~8 KB)	Yes (producer drain Task in parallel with deser Task)	Very large payloads (>10 MB), memory-tight hosts

The AcBinaryOutputMode enum (introduced by T8K3) is symmetric: it controls the deser-input strategy as well, so the same enum value picks the matching read path. No new IBinaryInputBase implementation is needed. The trio of existing inputs (ArrayBinaryInput, SequenceBinaryInput, AsyncPipeReaderInput) already covers all three modes; the new overload is a thin shim that wraps the Stream and routes to the right existing overload.

Public API shape

public static ValueTask<T?> DeserializeAsync<T>(
    Stream stream,
    AcBinarySerializerOptions? options = null,
    bool leaveOpen = false,
    CancellationToken ct = default);

// Non-generic Type-based variant (coordinated with N9G6):
public static ValueTask<object?> DeserializeAsync(
    Stream stream,
    Type targetType,
    AcBinarySerializerOptions? options = null,
    bool leaveOpen = false,
    CancellationToken ct = default);

Implementation outline (per mode)

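// Bytes mode (default; simplest path, sub-LOH-friendly fast path):
public static async ValueTask<T?> DeserializeAsync_Bytes<T>(Stream stream, ..., CancellationToken ct)
{
    // Size hint: remaining bytes for seekable streams, otherwise a small starting buffer.
    var sizeHint = stream.CanSeek ? Math.Max(stream.Length - stream.Position, 1) : 4096;
    var rented = ArrayPool<byte>.Shared.Rent((int)Math.Min(sizeHint, Array.MaxLength));
    try
    {
        var totalRead = 0;
        int read;
        while ((read = await stream.ReadAsync(rented.AsMemory(totalRead), ct)) > 0)
        {
            totalRead += read;
            if (totalRead == rented.Length)
            {
                // Buffer is full: move to a larger rented array so the next ReadAsync is not
                // handed an empty Memory (a 0-byte read would be mistaken for end-of-stream).
                var larger = ArrayPool<byte>.Shared.Rent(checked(rented.Length * 2));
                Array.Copy(rented, larger, totalRead);
                ArrayPool<byte>.Shared.Return(rented);
                rented = larger;
            }
        }
        return Deserialize<T>(rented, 0, totalRead, options);
    }
    finally { ArrayPool<byte>.Shared.Return(rented); }
}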

// Segment mode (PipeReader.Create wrapping, then drain to ReadOnlySequence):
public static async ValueTask<T?> DeserializeAsync_Segment<T>(Stream stream, ..., CancellationToken ct)
{
    var pipeReader = PipeReader.Create(stream, new(leaveOpen: leaveOpen));
    try
    {
        var result = await pipeReader.ReadAtLeastAsync(int.MaxValue, ct);   // drain whole stream
        var seq = result.Buffer;
        var obj = Deserialize<T>(seq, options);
        pipeReader.AdvanceTo(seq.End);
        return obj;
    }
    finally
    {
        // Complete on every path (including cancellation) so pooled segments are released.
        await pipeReader.CompleteAsync();
    }
}

// AsyncSegment mode (chunked streaming pipeline, parallel drain + deser):
public static async ValueTask<T?> DeserializeAsync_AsyncSegment<T>(Stream stream, ..., CancellationToken ct)
{
    using var input = new AsyncPipeReaderInput(options.BufferWriterChunkSize * 2, multiMessage: false);
    var pipeReader = PipeReader.Create(stream, new(leaveOpen: leaveOpen));
    var deserTask = Task.Run(() => Deserialize<T>(input, options), ct);
    await input.DrainFromAsync(pipeReader, ct);
    await pipeReader.CompleteAsync();
    return await deserTask;
}

Honest performance positioning

Symmetric to T8K3's analysis:

Bytes mode: simplest, single contiguous byte[] (pooled) → Deserialize<T>(byte[]). Comparable to MemoryPack's DeserializeAsync (which does similar full-buffer-then-deser). Best for typical workloads.
Segment mode: zero-copy from PipeReader's natural ReadOnlySequence<byte> — no extra byte[] allocation. Best for mid-size payloads where allocation matters but pipeline overlap doesn't.
AsyncSegment mode: producer-drain Task and consumer-deser Task in parallel via AsyncPipeReaderInput. Wall-clock = max(network-drain, deser-CPU) + small overlap-cost. Best for large payloads + slow transports (network, mobile, satellite — where transit dominates and overlap pays).

Acceptance

DeserializeAsync<T> round-trips against SerializeAsync(Stream, T) (T8K3) via MemoryStream in all three modes.
Cancellation propagates correctly (OperationCanceledException on cancelled token mid-stream); partial-buffer state cleaned up; pooled byte[] returned even on cancellation.
Throughput matrix benchmark (mirror of T8K3): 4 transports (MemoryStream, FileStream, NamedPipeStream, NetworkStream) × 3 modes × 3 payload sizes. Results documented in Test_Benchmark_Results/Benchmark/DeserializeAsync_Stream_Modes.LLM.
Memory-bounded benchmark: 100 MB payload from FileStream in AsyncSegment mode → peak managed-heap delta ≤ 1 MB throughout. Same payload in Bytes mode → peak ~100 MB (expected, documented).
API doc-string contains a "When to use which mode?" decision matrix; cross-references T8K3's symmetric write-side guidance.
leaveOpen parameter behaves per the BCL leaveOpen convention (as in StreamReader / StreamPipeReaderOptions) across all three modes.

ACCORE-BIN-T-N9G6: Add non-generic `Type`-based `Serialize(object, Type, ...)` overloads

Priority: P2 · Type: Feature · Status: Closed (2026-05-04) · Related: ACCORE-BIN-T-T8K3

Resolution

Added in AcBinarySerializer.cs:

Serialize(object?, Type, opts) → byte[]
Serialize(object?, Type, IBufferWriter<byte>, opts) → int
SerializeChunked(object?, Type, PipeWriter, opts) → int
SerializeChunkedFramed(object?, Type, PipeWriter, opts) → int

AcBinaryDeserializer.cs already had Deserialize(byte[], Type, opts) / Deserialize(ReadOnlySequence<byte>, Type, opts) / Deserialize(AsyncPipeReaderInput, Type, opts) overloads — no new entries needed.

Layering note: PipeReader → AsyncPipeReaderInput drain-loop is the consumer's responsibility, not the binary serializer's. The serializer surface ends at AsyncPipeReaderInput; transport-specific draining (PipeReader, NamedPipe, SignalR state.Buffer.Write, etc.) lives in the consumer layer (e.g. AcBinaryInputFormatter, AcBinaryHubProtocol.TryParseChunkData).

Consumed by ASP.NET Core MVC formatter package (AyCode.Services/Mvc/) — AcBinaryInputFormatter, AcBinaryOutputFormatter, AddAcBinaryFormatters extension. Media type: application/vnd.acbinary. Drain-loop inlined in AcBinaryInputFormatter.ReadRequestBodyAsync.

Plugin frameworks, ASP.NET ModelBinding, DI middleware, and DataContractSerializer-style "generic-API container" use-cases need to serialize an object whose type is known only at runtime. Current AcBinary surface forces a reflection trampoline through the generic Serialize<T>:

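// Today's workaround (slow + noisy). The generic method definition has to be looked up
// by its generic parameter T, not by the concrete runtime type:
typeof(AcBinarySerializer)
    .GetMethod("Serialize", 1, new[] { Type.MakeGenericMethodParameter(0), typeof(AcBinarySerializerOptions) })!
    .MakeGenericMethod(type).Invoke(null, new object?[] { value, options });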

Implementation outline:

public static byte[] Serialize(object? value, Type type, AcBinarySerializerOptions? options = null)
public static int Serialize(object? value, Type type, IBufferWriter<byte> writer, AcBinarySerializerOptions? options = null)
public static int SerializeChunked(object? value, Type type, PipeWriter writer, AcBinarySerializerOptions? options = null) and Pipe overload
public static int SerializeChunkedFramed(object? value, Type type, PipeWriter writer, AcBinarySerializerOptions? options = null) and Pipe overload
public static ValueTask SerializeAsync(object? value, Type type, Stream stream, ...) — coordinated with ACCORE-BIN-T-T8K3
Internal dispatch: value.GetType() is the runtime type; the Type type parameter constrains the declared type for polymorphism handling (ObjectWithTypeName write decision).

Acceptance:

All non-generic overloads round-trip via the generic deserializer's Deserialize(byte[], Type) overload.
Plugin-style scenario: serialize IList<dynamic> of mixed-type elements → all elements correctly typed in the wire output.
API doc-strings call out the performance characteristics (slightly slower than generic due to runtime Type lookup but without the reflection trampoline cost).

ACCORE-BIN-T-R4P2: Expose low-level `ref Writer`-style API for custom formatters

Priority: P3 · Type: Feature

The MemoryPack-style Serialize<T>(ref MemoryPackWriter writer, in T value) low-level API enables:

Custom formatters that compose write primitives without the full Serialize entry-point overhead.
Nested-into-existing-stream scenarios where the caller already owns a writer-style cursor.
Test harnesses that exercise specific wire-format paths in isolation.

Today's BufferWriterBinaryOutput standalone-mode partly fills this gap — exposing WriteByte, WriteVarUInt, WriteStringUtf8, etc. — but it is not a ref struct, not a documented low-level public API for external custom formatters, and the relationship with BinarySerializationContext<TOutput> is unclear from the consumer's perspective.

Design tension (decide before implementing):

Promote BufferWriterBinaryOutput to documented public surface — add doc, examples, supported usage patterns. Cheapest, but the standalone-mode is currently a side-feature, not a primary API; documenting it commits to its current shape.
New ref struct AcBinaryWriter wrapper around BufferWriterBinaryOutput (or a dedicated impl) — explicit "this is the low-level writer" signal. More API surface but clearer mental model. Aesthetic alignment with MemoryPack.
Skip entirely — the IBufferWriter<byte> overload is already lower-level than most consumers need; custom formatters can write to an ArrayBufferWriter<byte> and use IBufferWriter-style primitives. This is what BufferWriterBinaryOutput already does internally.

Recommendation: option 3 is honest — the existing IBufferWriter<byte> overload covers the use case, and adding a ref struct AcBinaryWriter is mostly aesthetic alignment with MemoryPack. Re-evaluate when there's a concrete custom-formatter request that the current API can't accommodate.

Acceptance (if implemented):

AcBinaryWriter ref struct (or equivalent) compiles, supports the same write primitives as BufferWriterBinaryOutput standalone-mode.
At least one example custom formatter ships in tests (e.g., a Vector3 struct formatter).
Doc-string clearly distinguishes when to use the low-level writer vs. the high-level Serialize<T> entry-point.

ACCORE-BIN-T-U6Y8: Attribute-driven polymorphism via `[AcBinaryUnion]` + SGen (opt-in, AOT-friendly)

Priority: P1 (if AOT target required) / P2 (non-AOT only) · Type: Feature

Design philosophy alignment: AcBinary's market positioning is "JSON-style flexibility with MessagePack-class speed" — attributes are opt-in optimization, never required. The runtime polymorphism path (AQN-based, today's default) stays the default and continues to work for arbitrary unattributed types. This TODO adds a fast/AOT path alongside it, never replaces it.

AcBinary today handles polymorphism at runtime: the wire writes ObjectWithTypeName(72) + AQN string, and the deserializer calls Type.GetType(aqn) to resolve. This is flexible (no upfront declaration), but has three significant drawbacks for some consumers:

AOT-incompatible — Type.GetType(AQN) requires reflection metadata that the Native AOT trimmer strips by default. The runtime polymorphism path does not work at all under Native AOT. Hard blocker for AOT-targeting consumers (Blazor WASM, MAUI mobile, container-trimmed deployments).
Slower — AQN string parse + reflection lookup vs. a closed switch (tag) in code-gen.
Larger wire format — full AQN string (often 100+ bytes) vs. a single-byte tag.

Design — three coordinated pieces:

1. New 5th bool parameter on `[AcBinarySerializable]`: `EnablePolymorphismFeature`

Mirrors the existing EnableMetadataFeature / EnableIdTrackingFeature / EnableRefHandlingFeature / EnableInternStringFeature pattern. Per-type opt-out / opt-in via attribute parameter.

public AcBinarySerializableAttribute(
    bool enableMetadataFeature,
    bool enableIdTrackingFeature,
    bool enableRefHandlingFeature,
    bool enableInternStringFeature,
    bool enablePolymorphismFeature)   // ← NEW, default: true

Three behavior modes per type:

EnablePolymorphismFeature = false → disabled. SGen never emits polymorphism dispatch for this type; runtime path also short-circuits — runtime type ≠ declared type is silently treated as declared (or throws, decision TBD). Use for hot-path closed types where polymorphism is impossible-by-design and the perf/AOT cost is unwanted.
EnablePolymorphismFeature = true (default), no [AcBinaryUnion] → runtime options control. Behaves per AcBinarySerializerOptions.PolymorphismMode (Runtime/AQN today). This preserves the JSON-style flexibility for unattributed bases.
EnablePolymorphismFeature = true + [AcBinaryUnion(...)] declared → union-switch dispatch. SGen emits a closed switch (tag) dispatch using the declared subtype set. Fast + AOT-friendly. Overrides the options-level default for this type.

2. New `[AcBinaryUnion(byte tag, Type subtype)]` attribute

Multiple instances per base class / interface declare the closed polymorphism set:

[AcBinarySerializable]   // EnablePolymorphismFeature defaults to true
[AcBinaryUnion(0, typeof(Cat))]
[AcBinaryUnion(1, typeof(Dog))]
public abstract partial class Animal { ... }

SGen detects [AcBinaryUnion] on abstract / base type → emits the switch-based write/read dispatch instead of falling through to runtime AQN.

3. New `PolymorphismMode` enum on `AcBinarySerializerOptions`

Options-level default for unattributed polymorphism (i.e. the case where EnablePolymorphismFeature = true but no [AcBinaryUnion] is declared):

Runtime (today's default) — AQN-based. Flexible, AOT-incompatible.
Throw — fail fast on any polymorphic write that lacks a [AcBinaryUnion] attribute. AOT-friendly diagnostic mode for migration scenarios.

Note: there is no UnionAttribute-only mode — declaration is per-type via the attribute, not options-global. The options-level mode only governs the fallback when no [AcBinaryUnion] is present.

Wire-format addition:

New marker (e.g. UnionTagBase = <TBD>) + [byte tag][inner Object], parallel to existing ObjectWithTypeName(72). Slot number to be assigned avoiding clashes with existing 64–134 / 192–255 ranges.

Implementation outline:

AcBinarySerializableAttribute — new ctor parameter enablePolymorphismFeature, all existing ctors default it to true (backward compatible).
AcBinaryUnionAttribute: new attribute, AttributeUsage(AttributeTargets.Class | AttributeTargets.Interface, AllowMultiple = true).
Source generator — emit WriteUnion<TBase>(value, ctx, depth) and ReadUnion<TBase>(ctx, depth) static methods on the union-base type's generated writer/reader. Skipped entirely when EnablePolymorphismFeature = false.
Wire-format new marker + [byte tag][inner Object] body.
Runtime path: WriteValueNonPrimitive checks the wrapper's PolymorphismFeatureEnabled flag; when false, skips the value.GetType() != declaredType polymorphism branch entirely.
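To make the union-switch idea concrete, here is a hand-written sketch of the kind of dispatch the generator might emit. It is illustrative only: `UnionSketch`, the `Name` property, the use of BinaryWriter/BinaryReader instead of the real context types, and the marker value 0x90 are all assumptions (the actual marker slot is still TBD in this TODO).

```csharp
using System;
using System.IO;

var buffer = new MemoryStream();
UnionSketch.WriteUnion(new BinaryWriter(buffer), new Dog { Name = "Rex" });
buffer.Position = 0;
Animal roundTripped = UnionSketch.ReadUnion(new BinaryReader(buffer));
Console.WriteLine($"{roundTripped.GetType().Name}: {roundTripped.Name}");

[AttributeUsage(AttributeTargets.Class | AttributeTargets.Interface, AllowMultiple = true)]
sealed class AcBinaryUnionAttribute : Attribute
{
    public AcBinaryUnionAttribute(byte tag, Type subtype) { Tag = tag; Subtype = subtype; }
    public byte Tag { get; }
    public Type Subtype { get; }
}

[AcBinaryUnion(0, typeof(Cat))]
[AcBinaryUnion(1, typeof(Dog))]
abstract class Animal { public string Name { get; set; } = ""; }
sealed class Cat : Animal { }
sealed class Dog : Animal { }

static class UnionSketch
{
    private const byte UnionMarker = 0x90;   // hypothetical placeholder, not an assigned slot

    public static void WriteUnion(BinaryWriter writer, Animal value)
    {
        // Closed set: the tag comes from the declared attributes, no type name on the wire.
        byte tag = value switch
        {
            Cat => (byte)0,
            Dog => (byte)1,
            _ => throw new NotSupportedException($"{value.GetType()} is not a declared union member.")
        };
        writer.Write(UnionMarker);
        writer.Write(tag);
        writer.Write(value.Name);             // stands in for the [inner Object] body
    }

    public static Animal ReadUnion(BinaryReader reader)
    {
        if (reader.ReadByte() != UnionMarker) throw new InvalidDataException("Union marker expected.");
        byte tag = reader.ReadByte();
        Animal result;
        switch (tag)                          // closed switch: no Type.GetType, no reflection
        {
            case 0: result = new Cat(); break;
            case 1: result = new Dog(); break;
            default: throw new InvalidDataException($"Unknown union tag {tag}.");
        }
        result.Name = reader.ReadString();
        return result;
    }
}
```

Because the read side only constructs types named in the switch, nothing on this path needs reflection metadata, which is the property the AOT acceptance item relies on.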

Acceptance:

EnablePolymorphismFeature = false: SGen-emitted dispatch contains zero is-typeof / GetType branches; runtime path also short-circuits. Verify in JIT disassembly.
EnablePolymorphismFeature = true, no union: runtime AQN polymorphism works as today (full backward compat); preserved JSON-style flexibility for unattributed bases.
EnablePolymorphismFeature = true + [AcBinaryUnion]: AOT-test (Native AOT publish) compiles and round-trips a polymorphic graph — Type.GetType() is never invoked on this path.
Benchmark: union-switch polymorphism measurably faster than AQN polymorphism on deser side (typed switch vs. reflection lookup).
Wire format documented in BINARY_FORMAT.md; BINARY_FEATURES.md cross-references the attribute pattern; BINARY_OPTIONS.md documents PolymorphismMode. AcBinarySerializableAttribute doc-string explains all three behavior modes.

ACCORE-BIN-T-B7H4: Implement `AcBinarySerializerOptions` thread-safety fix

Priority: P2 · Type: Refactor · Related: BINARY_ISSUES.md#accore-bin-i-l8n5 (canonical issue)

The latent thread-safety problem documented in ACCORE-BIN-I-L8N5 — mutable set; properties on AcBinarySerializerOptions shared across concurrent serialize/deserialize calls — needs a fix before AcBinary ships as a NuGet package. The package cannot constrain how consumers scope their options instances; defensive contract is needed in the serializer itself.

Three candidate fix directions (decide before implementing):

Defensive copy on ingress — add AcBinarySerializerOptions Clone() method (member-wise copy). Every API entry point that retains an options instance clones it on entry. External mutation to the original becomes invisible to the holder.
- Pro: non-breaking. Existing consumer code unchanged. No major version bump required.
- Pro: API surface change limited to one new Clone() method.
- Con: per-call clone overhead (small, but non-zero). Cache keyed on options-identity becomes invalid for downstream code using reference equality.
- Con: doesn't fix the underlying mutability — internal code can still race-mutate the cloned snapshot if a method retains both the snapshot and modifies it concurrently.
Immutable record refactor — set; → init; on all configuration properties. Mutation requires with-expression which produces a new instance.
- Pro: type-system-strong guarantee. Race becomes a compile error, not a runtime corruption risk.
- Pro: zero runtime overhead (init-only is compile-time check; record class semantics are unchanged at runtime).
- Con: breaking change for any consumer doing opts.UseGeneratedCode = false after construction. Major version bump.
- Con: source-generator coordination needed if SGen emits options-builder code that mutates properties.
Read-only flag pattern (à la JsonSerializerOptions.MakeReadOnly()) — mutable by default, holder calls MakeReadOnly() on entry; subsequent property setters throw InvalidOperationException.
- Pro: BCL precedent. Microsoft made MakeReadOnly() public on JsonSerializerOptions in .NET 8 (dotnet/runtime#74431) for exactly this problem. Familiar pattern for consumers.
- Pro: minimal API surface change (one new method + IsReadOnly flag property).
- Pro: per-call overhead = single bool check per setter call. Negligible.
- Con: opt-in by the holder — if a custom consumer-side wrapper forgets to call MakeReadOnly(), the safety hole stays open for that wrapper's clients. Documentation-driven safety, not type-system-driven.
- Con: bypasses static-analysis tooling (the setter signature stays public; the throw is runtime). IDE doesn't surface "this property is currently read-only" in autocomplete.

Recommendation: Option 3 (MakeReadOnly pattern) is the BCL-precedent, lowest-friction migration path. Microsoft adopted it for JsonSerializerOptions in .NET 8 to solve the same problem; AcBinary should follow the same pattern for consistency with consumers' mental model and zero migration cost.
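For reference, the BCL behaviour that Option 3 mirrors can be observed directly. The sketch below assumes a .NET 8 or later target; it uses the `MakeReadOnly(bool populateMissingResolver)` overload because the parameterless overload throws when no TypeInfoResolver has been configured.

```csharp
using System;
using System.Text.Json;

var options = new JsonSerializerOptions();
options.MakeReadOnly(populateMissingResolver: true);   // fills in the default resolver, then freezes

bool setterThrew = false;
try
{
    options.WriteIndented = true;                       // mutation after the freeze point
}
catch (InvalidOperationException)
{
    setterThrew = true;
}

Console.WriteLine($"IsReadOnly={options.IsReadOnly}, setterThrew={setterThrew}");
```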

Coordination with the existing AcBinaryHubProtocol setter side-effect (the second risk surface in ACCORE-BIN-I-L8N5): the protocol ctor currently mutates the caller-provided options reference (_options.BufferWriterChunkSize = options.BufferSize). After the fix:

Option 1 (Clone): ctor mutates the cloned snapshot → no side-channel to the caller. Fix transparent.
Option 2 (Immutable): ctor cannot mutate; needs to construct a new options via with-expression. Breaking change in the ctor's options-handling.
Option 3 (MakeReadOnly): ctor mutates before calling MakeReadOnly() — same as today, but explicit "frozen" point afterwards. Caller-side mutation post-ctor is now a runtime throw.

Implementation outline (Option 3 path):

AcBinarySerializerOptions.IsReadOnly { get; } — public bool property.
AcBinarySerializerOptions.MakeReadOnly() — sets the flag; idempotent (no-op if already set).
All set; accessors guard: if (IsReadOnly) throw new InvalidOperationException("AcBinarySerializerOptions has been made read-only and can no longer be mutated. Construct a new options instance instead.");.
AcBinarySerializer.Serialize<T> entry (and all sibling entries — Deserialize<T>, SerializeChunked, etc.): options.MakeReadOnly() before any property read.
AcBinaryHubProtocol ctor: complete the BufferWriterChunkSize mutation before calling options.MakeReadOnly(). After ctor returns, the options instance is frozen for that protocol's lifetime.
Doc-string update on AcBinarySerializerOptions class header: explicit "thread-safety contract" section explaining the freeze-on-first-use semantics.
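The first three outline items could be sketched as follows. `SketchOptions` is a hypothetical stand-in for AcBinarySerializerOptions with a single property; the volatile flag is one possible choice for publishing the freeze across threads, not a decision made by this TODO.

```csharp
using System;

var opts = new SketchOptions { BufferWriterChunkSize = 8192 };   // mutable while being configured
opts.MakeReadOnly();
opts.MakeReadOnly();                                             // idempotent: second call is a no-op

bool threw = false;
try { opts.BufferWriterChunkSize = 4096; }
catch (InvalidOperationException) { threw = true; }
Console.WriteLine($"IsReadOnly={opts.IsReadOnly}, threw={threw}, chunk={opts.BufferWriterChunkSize}");

sealed class SketchOptions
{
    private volatile bool _isReadOnly;
    private int _bufferWriterChunkSize = 4096;

    public bool IsReadOnly => _isReadOnly;

    // One-way switch: there is deliberately no way to unfreeze an instance.
    public void MakeReadOnly() => _isReadOnly = true;

    public int BufferWriterChunkSize
    {
        get => _bufferWriterChunkSize;
        set { ThrowIfReadOnly(); _bufferWriterChunkSize = value; }
    }

    private void ThrowIfReadOnly()
    {
        if (_isReadOnly)
            throw new InvalidOperationException(
                "Options have been made read-only and can no longer be mutated. Construct a new options instance instead.");
    }
}
```

Every real setter would route through the same guard, so the per-call cost stays at one flag read.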

Acceptance:

Concurrent stress test (16 threads × 1000 iterations) on a shared AcBinarySerializerOptions instance with property-mutation-attempts mid-iteration — all mutations after MakeReadOnly() throw InvalidOperationException; no silent corruption observed.
Existing tests pass unchanged (the MakeReadOnly is opt-in for the serializer entries; tests that build options + use them once continue to work transparently).
BINARY_ISSUES.md#accore-bin-i-l8n5 Status updated to Closed (YYYY-MM-DD) with a ### Resolution sub-section pointing to this TODO + the implementing commit.
Doc-string on AcBinarySerializerOptions documents the freeze-on-first-use contract; BINARY_FEATURES.md or BINARY_OPTIONS.md cross-references the BCL-precedent (JsonSerializerOptions.MakeReadOnly).

ACCORE-BIN-T-F8N3: Switch source-generator type-name hashing from simple-name to fully-qualified-name

Priority: P3 · Type: Refactor · Related: ACCORE-BIN-T-I3P8 (override mechanism for residual collisions)

The source generator's ComputeFnvHash(typeSymbol.Name) uses the simple name only (e.g. "User", not "MyApp.A.User"). Cross-namespace types with the same simple name silently collide on s_typeNameHash. The hash is currently only consumed by the WireMode=Metadata inline metadata-write path (cross-version property compat) — the framework explicitly does NOT add wire-format type-id (per CLAUDE.md Rule #7: type-dispatch is consumer responsibility, see BINARY_ASYNCPIPE_ISSUES.md#accore-bin-i-t6v2). Within UseMetadata, the simple-name collision can still cause silent property-set mismatches between two types with the same short name in different namespaces — this TODO fixes that.

Change scope (AcBinarySourceGenerator.cs) — 4 call sites: ComputeFnvHash(typeSymbol.Name) → ComputeFnvHash(typeSymbol.ToDisplayString()):

Self type-name hash (~line 358)
Child type-name hash (~line 157)
Element type-name hash (~line 254)
Dict-value type-name hash (~line 311)

No runtime code changes; output regenerates with new constants on next build.
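The collision described above is easy to reproduce with a plain FNV-1a 32-bit function. This is a sketch under assumptions: the generator's ComputeFnvHash may hash UTF-16 chars rather than UTF-8 bytes, and `MyApp.B.User` is a made-up second namespace.

```csharp
using System;
using System.Text;

static uint Fnv1a32(string text)
{
    uint hash = 2166136261;                       // FNV-1a 32-bit offset basis
    foreach (byte b in Encoding.UTF8.GetBytes(text))
    {
        hash ^= b;
        hash *= 16777619;                         // FNV-1a 32-bit prime (wraps modulo 2^32)
    }
    return hash;
}

// Simple-name hashing: both types reduce to "User", so the hashes are identical.
uint simpleA = Fnv1a32("User");
uint simpleB = Fnv1a32("User");

// Fully-qualified hashing: the namespace takes part in the hash.
uint fqnA = Fnv1a32("MyApp.A.User");
uint fqnB = Fnv1a32("MyApp.B.User");

Console.WriteLine($"simple names collide: {simpleA == simpleB}");
Console.WriteLine($"FQNs collide: {fqnA == fqnB}");
```

Two same-length strings that differ in exactly one character can never collide under FNV-1a, because every step (XOR with a byte, multiplication by an odd prime) is a bijection on 32-bit values.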

Breaking change scope: any saved binary stream that uses WireMode=Metadata and was produced by an older version embeds the old simple-name hash; consumers reading those streams with the new hash compute would mismatch and throw. Pre-1.0: acceptable. Post-1.0 would require a WireMode=Metadata format-version bump.

Acceptance:

All *_GeneratedWriter.g.cs files regenerate with FQN-based s_typeNameHash values.
Existing tests pass (auto-regen propagates; no manual hash literals in tests).
Wire format identical for WireMode=Compact (no metadata embedded).
UseMetadata=true paths produce different hashes — explicitly tested via round-trip.

ACCORE-BIN-T-I3P8: `[AcBinaryTypeId(...)]` attribute — explicit type-id override

Priority: P3 · Type: Feature · Related: ACCORE-BIN-T-F8N3 (FQN base hash being overridden)

Once ACCORE-BIN-T-F8N3 reduces collision frequency by switching to FQN, residual FQN-hash collisions are still possible (32-bit hash space, birthday paradox). Currently the only consumer of s_typeNameHash is the WireMode=Metadata inline metadata-write path — a residual collision there causes a silent property-set mismatch.

[AcBinaryTypeId(0x12345)] attribute on a class:

Source generator emits s_typeNameHash = 0x12345 instead of computing FNV.
Two types with the same [AcBinaryTypeId(...)] value → compile-time / first-use error.

Useful for:

Resolving rare FQN-hash collisions deterministically (within WireMode=Metadata).
Pinning a stable type-id across class renames (wire-compat across versions in Metadata mode).
Future-proofing: if a Layer 1 consumer (hypothetically) builds a type-dispatch above AcBinary using s_typeNameHash, the same override mechanism applies.

Acceptance:

New attribute class shipped alongside [AcBinarySerializable].
Generator honours the override (emits explicit constant instead of FNV result).
Tests: rename a class with [AcBinaryTypeId] → s_typeNameHash unchanged.

ACCORE-BIN-T-X2M5: Evaluate xxHash3 vs FNV-1a for type-name hashes

Priority: P3 · Type: Investigation · Related: ACCORE-BIN-T-F8N3

FNV-1a is currently used for both s_typeNameHash and s_propertyHashes. For compile-time hashing, performance is irrelevant. For collision resistance:

FNV-1a 32-bit: ~50% collision at ~77K types (birthday paradox). Adequate for small/medium projects, marginal for large ones with many auto-generated types.
xxHash 32-bit (XXH32, or XXH3 truncated to 32 bits, since XXH3 itself only defines 64- and 128-bit outputs): comparable mathematical properties to FNV-1a (both non-cryptographic).
xxHash3 64-bit: dramatically better collision resistance (~50% at ~5B entries), at the cost of 8 wire bytes instead of 4.
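The two 50% figures follow from the standard birthday approximation, assuming the hash behaves like a uniform random function over $2^b$ values:

```latex
p(n) \approx 1 - e^{-n^2 / (2 \cdot 2^{b})}
\quad\Longrightarrow\quad
n_{50\%} \approx \sqrt{2 \ln 2 \cdot 2^{b}} \approx 1.1774 \cdot 2^{b/2}

b = 32:\quad n_{50\%} \approx 1.1774 \cdot 2^{16} \approx 7.7 \times 10^{4}
\qquad
b = 64:\quad n_{50\%} \approx 1.1774 \cdot 2^{32} \approx 5.06 \times 10^{9}
```

The same formula shows why switching between 32-bit algorithms does not help: only the output width $b$ moves the bound.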

Trigger: real collisions observed (1000+ types per assembly + cross-assembly aggregation), or community feedback indicating collision pain.

Investigation questions (no code change without a triggering pain signal):

Switch to xxHash3 32-bit (incremental improvement) — but doubles the change scope (touch property hashes too if uniformity desired).
Switch to xxHash3 64-bit (8 wire bytes instead of 4) — meaningful collision resistance, modest wire cost.
Stay on FNV-1a + force [AcBinaryTypeId] for collisions — minimal change, devops burden.

Investigation only — defer until pain signal arrives.

ACCORE-BIN-T-K9E4: `[RequiresDynamicCode]` + `[RequiresUnreferencedCode]` on Runtime-only methods

Priority: P3 · Type: Refactor · Related: BINARY_FEATURES.md#nativeaot-compatibility

The Runtime path (factories in AcSerializerCommon + wrapper-based deserialize fallback in AcBinaryDeserializer) currently works under NativeAOT thanks to DAMs propagation + RuntimeFeature.IsDynamicCodeSupported guards, but the trimmer still emits warnings for the well-known blind spots (polymorphism via obj.GetType(), nested-type chain via generic argument extraction). The library suppresses these with [UnconditionalSuppressMessage] and documented justification.

A complementary signal would be to mark the Runtime entry points (or the factories themselves) with [RequiresDynamicCode("AcBinary Runtime path uses Reflection.Emit / closed-generic instantiation; use [AcBinarySerializable] + SGen for NativeAOT.")] and [RequiresUnreferencedCode("...")]. Effect:

AOT publish in consumer's project surfaces a warning at the call site → consumer chooses SGen or accepts the Runtime cost
Mirrors the System.Text.Json reflection-mode pattern ([RequiresDynamicCode] on JsonSerializer.Serialize<T> overloads)
One-codebase, no NuGet split needed
Cheap implementation — attribute placement only

Coordination: [RequiresDynamicCode] is contagious; every caller must either propagate it or suppress with [UnconditionalSuppressMessage]. Scope:

Public Serialize<T> / Deserialize<T> entry points stay attribute-free (consumer-facing)
Runtime fallback methods get the attribute (contained inside the library)
The DAMs annotations we already have stay — they're orthogonal (one prevents trim, the other warns about JIT-only behavior)
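
A minimal sketch of the scope described above, assuming a hypothetical Runtime-only factory and a hypothetical attribute-free entry point that acts as the propagation barrier. The type and method names are illustrative only; the attributes and `RuntimeFeature.IsDynamicCodeSupported` are the standard .NET ones.

```csharp
#nullable enable
using System;
using System.Diagnostics.CodeAnalysis;
using System.Runtime.CompilerServices;

public static class RuntimePathAnnotationSketch
{
    // Runtime-only fallback: direct callers get IL3050 / IL2026 unless they
    // propagate the attributes or suppress the warnings.
    [RequiresDynamicCode("Runtime path sketch: may need code generated at run time; prefer SGen for NativeAOT.")]
    [RequiresUnreferencedCode("Runtime path sketch: reflects over members the trimmer cannot see statically.")]
    public static object CreateViaRuntimePath(Type type)
        => Activator.CreateInstance(type)
           ?? throw new InvalidOperationException($"Cannot create {type}.");

    // Propagation barrier: the consumer-facing entry point stays attribute-free
    // and suppresses locally, with a documented justification.
    [UnconditionalSuppressMessage("AOT", "IL3050",
        Justification = "Guarded by RuntimeFeature.IsDynamicCodeSupported.")]
    [UnconditionalSuppressMessage("Trimming", "IL2026",
        Justification = "Sketch only: a real barrier must show the members are preserved.")]
    public static T Create<T>(Func<T>? generatedFactory = null) where T : class
    {
        // Generated (SGen-style) factory wins when present.
        if (generatedFactory is not null) return generatedFactory();

        if (!RuntimeFeature.IsDynamicCodeSupported)
            throw new InvalidOperationException(
                $"{typeof(T)} has no generated factory and dynamic code is unavailable.");

        return (T)CreateViaRuntimePath(typeof(T));
    }
}
```

Where exactly the barrier sits decides whether the consumer ever sees the warning, so the suppression placement is the part of this sketch that needs the most care in the real code.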

Acceptance:

Consumer's AOT publish surfaces an IL2026/IL3050 warning when UseGeneratedCode=false is set or an unattributed type is deserialized
SGen path is warning-free
Library compiles 0 warnings (suppressions added at the propagation barrier)
BINARY_FEATURES.md NativeAOT Compatibility section updated to mention the explicit warning signal

ACCORE-BIN-T-A2J7: Optional `AyCode.Core.Aot` NuGet variant (SGen-only build)

Priority: P3 · Type: Feature · Related: BINARY_FEATURES.md#nativeaot-compatibility, ACCORE-BIN-T-K9E4

Binary-size-sensitive AOT consumers (Blazor WASM, MAUI mobile, embedded, container-trimmed) benefit from a smaller library variant that strips the Runtime fallback path entirely. Estimated savings: ~80-150 KB of native code (~25-60 KB compressed wire size for WASM publish).

Strippable code in the .Aot variant:

Component	LOC	Purpose	Removable in Aot?
`AcSerializerCommon.Create*` (7 factory methods + Expression-tree code)	~150	Runtime delegate compilation	✅ Yes
`TypeMetadataBase` runtime metadata path (`CompiledConstructor`, IdGetters via Expression.Compile)	~300	Reflection-based metadata	✅ Yes
`AcBinaryDeserializer` wrapper-based runtime fallback (`PopulateObjectPropertiesIndexed`, `ReadObjectCoreWithWrapper` non-SGen branches, `CreateInstance(type)` Activator-fallback)	~500	Runtime polymorphic dispatch	✅ Yes
Property accessor runtime delegate fields (`_dynamicGetter`, typed getter/setter caches outside SGen)	~150	Boxed property access	✅ Yes
`System.Linq.Expressions` transitive dependency	—	Expression-tree IL emission	✅ Yes (when nothing else in graph uses it)

Implementation sketch (avoid an #if forest via a file-level split):

AyCode.Core/Serializers/
  AcSerializerCommon.cs              // SGen-safe shared parts
  AcSerializerCommon.Runtime.cs      // 7 Create* factory methods only here
  AcBinaryDeserializer.cs            // SGen path
  AcBinaryDeserializer.Runtime.cs    // wrapper-based runtime fallback path
  TypeMetadataBase.cs                // SGen-safe metadata
  TypeMetadataBase.Runtime.cs        // Expression.Compile-based ctor + accessor wiring

Two .csproj files:

AyCode.Core.csproj — full package (current); includes all files
AyCode.Core.Aot.csproj — <Compile Remove="**/*.Runtime.cs" />; sets <PackageId>AyCode.Core.Aot</PackageId>; same version as full

Trade-offs:

✅ No #if directives in business code — physically separate file groups
✅ Source mostly shared via SDK include/exclude semantics
✅ DAMs annotations and trim-suppressions only land in the full package; .Aot variant is genuinely trim-clean by construction
✅ "Strict SGen" semantics in .Aot: a non-SGen type at deser time throws clearly instead of silently falling back. Marketing positioning: "guaranteed SGen path, no hidden slow lane".
⚠️ Two NuGet IDs, two changelogs, version sync (CI-automatable)
⚠️ Consumer must pick the right package — wrong choice = breaking switch later

Coordination:

Land ACCORE-BIN-T-K9E4 first ([RequiresDynamicCode] attributes) — if that pattern handles the consumer-side scenarios well, .Aot may not be needed
The current Runtime fallback code is already well-isolated (mostly in AcSerializerCommon factories + AcBinaryDeserializer wrapper-based methods), so the file-split refactor is mechanically straightforward
Marketing decision: is binary size a central pillar? If yes, .Aot is a NuGet differentiator; if not, K9E4 alone is enough

Acceptance:

AyCode.Core.Aot.csproj produces a NuGet ~25-60 KB smaller than AyCode.Core after compression
.Aot build emits zero IL/AOT trim warnings (no suppressions needed because the Runtime path code is physically removed)
Round-trip tests pass on .Aot for all SGen types
.Aot throws a clear InvalidOperationException (not MissingMethodException) when a non-[AcBinarySerializable] type is encountered at deser time
BINARY_FEATURES.md NativeAOT Compatibility section documents both packages and when to choose which

ACCORE-BIN-T-V4N2: Cross-tier SIMD UTF-8 transcoder paths (AVX-512BW + Vector128 + multi-byte transcoder)

Priority: P2 · Type: Performance · Related: EncodeUtf8SinglePass, DecodeUtf8SinglePass, CountUtf8Chars

Current SIMD hierarchy (post 2026-05-05 implementation):

AVX-512BW (64 byte/iter)   → Server, Intel 11th gen client, AMD Zen 4+
Vector256 / AVX2 (32 byte) → AVX2 host (Intel 12-14th gen, AMD Zen 3 and earlier)
Vector128 (16 byte/iter)    → Apple Silicon NEON, WASM SIMD, legacy SSE2
scalar (1 byte/iter)        → no-SIMD fall-back

JIT/AOT path-selection via [Intrinsic] IsSupported static booleans — non-supported tiers constant-folded to dead code per host. Cascading tail handlers: a higher tier's tail (< 64 byte AVX-512 → < 32 byte Vector256 → < 16 byte Vector128 → scalar) is processed by the next-lower tier on the same iteration. No regression on any host.
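
The cascading tail hand-off can be illustrated with a simplified count routine. This is a sketch under assumptions, not the library's `CountUtf8Chars`: it uses the cross-platform `Vector256` / `Vector128` APIs instead of the `Avx512BW` intrinsics, omits the 64-byte tier for brevity, and assumes valid UTF-8 input. `CascadingCountSketch` is a hypothetical name.

```csharp
using System;
using System.Numerics;
using System.Runtime.Intrinsics;

public static class CascadingCountSketch
{
    // Counts the UTF-16 code units that valid UTF-8 input decodes to:
    // one per non-continuation byte, plus one extra per 4-byte lead
    // (a 4-byte sequence becomes a surrogate pair).
    public static int CountUtf16Units(ReadOnlySpan<byte> src)
    {
        int i = 0, count = 0;

        if (Vector256.IsHardwareAccelerated)
        {
            // Continuation bytes 0x80..0xBF are <= -65 when viewed as sbyte.
            var contMax = Vector256.Create((sbyte)-65);
            var fourLead = Vector256.Create((byte)0xF0);
            for (; src.Length - i >= 32; i += 32)
            {
                var v = Vector256.Create(src.Slice(i, 32));
                uint starts = Vector256.GreaterThan(v.AsSByte(), contMax).ExtractMostSignificantBits();
                uint fours = Vector256.GreaterThanOrEqual(v, fourLead).ExtractMostSignificantBits();
                count += BitOperations.PopCount(starts) + BitOperations.PopCount(fours);
            }
        }

        // Next-lower tier: takes the < 32-byte tail, or all the work
        // on hosts without 256-bit acceleration.
        if (Vector128.IsHardwareAccelerated)
        {
            var contMax = Vector128.Create((sbyte)-65);
            var fourLead = Vector128.Create((byte)0xF0);
            for (; src.Length - i >= 16; i += 16)
            {
                var v = Vector128.Create(src.Slice(i, 16));
                uint starts = Vector128.GreaterThan(v.AsSByte(), contMax).ExtractMostSignificantBits();
                uint fours = Vector128.GreaterThanOrEqual(v, fourLead).ExtractMostSignificantBits();
                count += BitOperations.PopCount(starts) + BitOperations.PopCount(fours);
            }
        }

        // Scalar tail: fewer than 16 bytes remain on SIMD hosts.
        for (; i < src.Length; i++)
        {
            byte b = src[i];
            if ((b & 0xC0) != 0x80) count++;
            if (b >= 0xF0) count++;
        }
        return count;
    }
}
```

Because each tier is a plain `if` rather than an `else if`, the tail left by a wider tier falls through to the next one on the same call, which is the behavior the paragraph above describes.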

Implementation status:

Phase	Method	AVX-512BW	Vector256	Vector128	scalar
1	`CountUtf8Chars` (decode 1st pass)	✅ done	✅ existing	✅ done	✅ existing
2	`EncodeUtf8SinglePass` Phase 1 (ASCII narrow)	✅ done	✅ existing	✅ done	✅ existing
2.5	`DecodeUtf8SinglePass` scalar run-length decoder (multi-byte baseline)	—	—	—	⏳ TODO
3a	`DecodeUtf8SinglePass` multi-byte transcoder (Vector512)	⏳ TODO	bail-out only	bail-out only	✅ existing
3b	`DecodeUtf8SinglePass` multi-byte transcoder (Vector256)	—	🔍 deferred — see note	bail-out only	✅ existing
3c	`DecodeUtf8SinglePass` multi-byte transcoder (Vector128)	—	—	⏳ TODO	✅ existing

Note on Phase 3b (Vector256 / AVX2): deferred, not dropped. AVX2 lacks the AVX-512-class primitives that make the classify-mask-compress-widen pipeline efficient: CompareEqualMask producing a __mmask k-register, the full-width cross-lane byte permute vpermb (AVX-512 VBMI), and the mask-driven vpcompressb (AVX-512 VBMI2). On AVX2, vpshufb shuffles only within each 128-bit lane, so a cross-lane Vector256.Shuffle is emulated via two per-128-bit-lane vpshufb operations, which complicates leader-byte extraction for multi-byte sequences that span the lane boundary. The simdutf C++ project, the canonical reference for this algorithm class, implements only SSE4 (16-byte) and AVX-512 (64-byte) paths; it explicitly skips AVX2 because the implementation cost-benefit is unfavorable on this algorithm.

On AVX2 hosts, the Phase 3c (Vector128) transcoder runs as the primary multi-byte path AND as tail handler, covering AVX2 hosts at 16 byte/iter, which is already a significant win over the current scalar multi-byte branch. Phase 3b would require either:

Hand-rolling an AVX2-specific 32-byte algorithm with cross-lane permute workarounds (research-grade complexity, uncertain net win — could be SLOWER than the Vector128 path due to cross-lane shuffle latency)
Waiting for Avx10v1 / Avx10v2 to expose AVX-512BW-class primitives in 256-bit form (Intel's unified vector ISA — Avx10v1 already in .NET 9, Avx10v2 arrives with future Intel hardware)

Re-evaluation triggers: if benchmark on AVX2 hosts shows Phase 3c Vector128 path leaves > 10% Deser gap vs MemPack on multi-byte content; or if Avx10v1 256-bit primitives mature enough to make the algorithm tractable. Until then: Phase 3b stays in the TODO as a research / future-work item — not actively scheduled, but documented so a future contributor doesn't re-derive the AVX2 limitations.

Phase 3 is the remaining gap — UTF-8 multi-byte decode on every host class. ASCII path is already fast across all SIMD tiers (Vector256 + Vector128 prefix widen + Encoding.Latin1.GetString BCL fast path). The gap is on multi-byte UTF-8 content — Hungarian / Cyrillic / Greek (2-byte) and CJK BMP (3-byte) sequences — where the SIMD prefix bails out on the first non-ASCII byte and falls back to scalar bit-extract. The Repeated benchmark cell (Hungarian content) is the canonical witness; with all-Hungarian content (current bench data), Small / Repeated Deser cells trail MemPack by 6-14%.

Why all 3 SIMD tiers (not just AVX-512BW): the public NuGet package goal is that i18n payloads must be fast on every supported host (cloud server, desktop, mobile, Blazor WASM), not only on AVX-512-capable cloud servers. Our own scalar multi-byte branch is the bottleneck on all non-ASCII content regardless of host class. The BCL Encoding.UTF8 falls back to a similar scalar path on multi-byte content (with virtual dispatch + EncoderFallback overhead), so even where the BCL has its own SIMD 2-byte handler (.NET 9 PR #92580), our trust-input scalar wins on net; an in-house SIMD multi-byte path, however, would dominate on every host.

Phase 3 approach — in-house multi-byte transcoder, three SIMD widths. Single algorithm template (classify-mask-compress-widen pipeline) ported across Vector512 / Vector256 / Vector128 register widths. Algorithm designed and written in-house — no third-party port, no NuGet dependency:

Phase 3a — DecodeUtf8SinglePass Vector512 (AVX-512BW): 64-byte block fetch → classify each byte's UTF-8 sequence position via mask compares → byte-compression for length-resolution → widen to UTF-16 in two Vector256<ushort> lanes → store. ~3-5× speedup vs current scalar multi-byte branch on Hungarian / CJK content. Activates on AVX-512 hosts (cloud server, Intel 11th gen, AMD Zen 4+).
Phase 3b — DecodeUtf8SinglePass Vector256 (AVX2): same algorithm at 32-byte block. Smaller register space → fewer codepoints per iter, but ASCII bail-out gone → multi-byte content is now SIMD-handled. ~2-3× speedup. Activates on AVX2 hosts (Intel 12-14th gen, AMD Zen 3 and earlier).
Phase 3c — DecodeUtf8SinglePass Vector128 (NEON / SSE / WASM SIMD): same algorithm at 16-byte block. ~1.5-2× speedup. Activates on Apple Silicon / WASM / legacy x86 — covering the i18n production case for mobile (MAUI iOS / Android) and Blazor WASM.

The cascading tail-handler hierarchy (existing in Phase 1+2) carries over: AVX-512 → Vector256 → Vector128 → scalar tail. Each tier hands off the < N-byte tail to the next-lower tier.

No .NET 11 / multi-targeting needed. Avx512BW, Vector256, Vector128 intrinsics all available in .NET 9 (and .NET 8). Implementation lands on the current net9.0 target.

Hardware reach (2026). Per Wikipedia "CPUs with AVX-512":

✅ Intel server: Skylake-X (2017), Cascade Lake-X, Ice Lake-SP, Sapphire Rapids (2023+), Emerald Rapids, Granite Rapids — near-universal in cloud (Azure, AWS, GCP)
✅ Intel client 11th gen: Tiger Lake (mobile, 2020), Rocket Lake (desktop, 2021), Ice Lake (mobile) — pre-Alder Lake era still supports AVX-512
❌ Intel client 12-14th gen: Alder Lake / Raptor Lake / Meteor Lake / Core Ultra — AVX-512 disabled at firmware level (E-core blocking) → falls back to Vector256
✅ AMD Zen 4+: Ryzen 7000 (2022), Ryzen 9000 (2024), EPYC Genoa (2022), EPYC Turin (2024)
❌ AMD pre-Zen 4: Zen 3 and earlier → falls back to Vector256
❌ Apple Silicon / ARM: NEON only → uses Vector128 (16 byte/iter)
❌ Blazor WASM: only 128-bit SIMD per WASM SIMD spec → uses Vector128 (16 byte/iter)

The Vector128 path is the WASM and Apple Silicon target — without it both platforms fell back to scalar (1 byte/iter). With Phase 1+2 landed, WASM and Apple Silicon now run the UTF-8 hot path at 16 byte/iter (16× scalar speedup on the count + ASCII narrow operations).

Phase 2.5 — scalar run-length decoder (multi-byte baseline, pre-Phase 3 prototype) — TESTED & REVERTED 2026-05-07

Status update (2026-05-07): Phase 2.5 was implemented and tested in two configurations:

Full run-length (15:56:54 bench): both 2-byte and 3-byte tiers used inner do-while loops. Result: +13.0 pp Deser regression on the Hungarian-mixed Repeated cell. Hypothesis confirmed (foreseen pre-implementation): short Hungarian 2-byte runs (1-2 char average) make the run-detection overhead exceed the per-char payload; switch-jumptable per-char dispatch wins on this content shape.
Hybrid (post-15:56:54): 2-byte single decode, 3-byte run do-while only. Tested, but bench-noise instability made the signal unmeasurable. Reverted along with the V4N4 method-split (2026-05-07).

The optimization-value signal proved below the bench noise floor on the available hardware. The 3-byte do-while CJK-content win remains a theoretically valid target — but cannot be objectively validated without the ACCORE-BIN-T-C5R8 charset-parameterized benchmark workload (CJK option). Re-evaluate when CJK workload measurement becomes available.

Re-evaluable as of 2026-05-07 per ACCORE-BIN-T-D9X3 — bench stabilization removes the noise-floor that made the original signal unmeasurable; retest before any code change. (Charset bias remains — pair with ACCORE-BIN-T-C5R8 for CJK validation.)

Retested 2026-05-08 — REGRESSION CONFIRMED (Latin1Long charset, stabilized bench): adding the do-while inner loop on both 2-byte and 3-byte tiers in DecodeUtf8SinglePass produced +5-8pp Deser regression on every cell vs. the switch-jumptable baseline (Small +7.8pp, Medium +7.1pp, Large +5.5pp, Repeated +7.4pp, Deep +4.9pp). Reverted to switch-jumptable single-decode same day. The V4N2 entry's original prediction held: "Magyar mixed (KözösCímke, sötét — short alternating runs): 0-5% (run-detection overhead may eat the savings on short runs)" — Latin1Long suffix has 1-2 char average run length, well below the run-detection break-even point. Phase 2.5 is dead on Magyar mixed. CJK retest still untried, but Phase 2.5 is now obsoleted by ACCORE-BIN-T-K7M3 (the decoder hot path runs Utf8.ToUtf16 BCL static API, not DecodeUtf8SinglePass).

Below: original Phase 2.5 design notes preserved as documentation. Implementation details remain accurate even though the implementation was reverted.

Targets the DecodeUtf8SinglePass switch-jumptable per-char dispatch on multi-byte content. The current scalar path (jumptable) re-dispatches on every char; a run-length-aware scalar decoder instead runs a tight branchless inner loop over homogeneous runs (long ASCII run, long 2-byte Latin/Cyrillic run, long 3-byte CJK BMP run), with the existing single-codepoint scalar branch as the mixed-edge fallback.

Algorithm sketch:

while (s < src.Length)
{
    // 1) ASCII run (0xxxxxxx) — already handled by Phase 1 SIMD prefix; this is tail
    int asciiStart = s;
    while (s < src.Length && src[s] < 0x80) s++;
    if (s > asciiStart) { WriteAsciiRun(src.Slice(asciiStart, s-asciiStart), dst, ref d); continue; }

    // 2) 2-byte run (110xxxxx 10xxxxxx) — Hungarian / Cyrillic / Greek / Hebrew / Arabic
    int twoStart = s;
    while (s + 1 < src.Length && Is2ByteLead(src[s]) && IsCont(src[s+1])) s += 2;
    if (s > twoStart) { Decode2ByteRun(src.Slice(twoStart, s-twoStart), dst, ref d); continue; }

    // 3) 3-byte run (1110xxxx 10xxxxxx 10xxxxxx) — CJK BMP, other 3-byte BMP scripts
    int threeStart = s;
    while (s + 2 < src.Length && Is3ByteLead(src[s]) && IsCont(src[s+1]) && IsCont(src[s+2])) s += 3;
    if (s > threeStart) { Decode3ByteRun(src.Slice(threeStart, s-threeStart), dst, ref d); continue; }

    // 4) Mixed-edge fallback (typically 4-byte surrogate pair or single transition char)
    DecodeSingleCodePoint(src, ref s, dst, ref d);
}
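
The sketch above leaves its helpers undefined. A self-contained scalar version, written only to make the run structure concrete, could look like the following. It is not the reverted implementation: the helper bodies, the `RunLengthDecodeSketch` name and the no-progress guard are assumptions added for illustration, and the input is assumed to be valid UTF-8 with a destination at least `src.Length` chars long.

```csharp
using System;

public static class RunLengthDecodeSketch
{
    static bool IsCont(byte b) => (b & 0xC0) == 0x80;
    static bool Is2ByteLead(byte b) => (b & 0xE0) == 0xC0;
    static bool Is3ByteLead(byte b) => (b & 0xF0) == 0xE0;
    static bool Is4ByteLead(byte b) => (b & 0xF8) == 0xF0;

    // Trust-input decode; returns the number of UTF-16 code units written.
    public static int Decode(ReadOnlySpan<byte> src, Span<char> dst)
    {
        int s = 0, d = 0;
        while (s < src.Length)
        {
            int before = s;

            // 1) ASCII run
            while (s < src.Length && src[s] < 0x80) dst[d++] = (char)src[s++];

            // 2) 2-byte run (110xxxxx 10xxxxxx)
            while (s + 1 < src.Length && Is2ByteLead(src[s]) && IsCont(src[s + 1]))
            {
                dst[d++] = (char)(((src[s] & 0x1F) << 6) | (src[s + 1] & 0x3F));
                s += 2;
            }

            // 3) 3-byte run (1110xxxx 10xxxxxx 10xxxxxx)
            while (s + 2 < src.Length && Is3ByteLead(src[s]) && IsCont(src[s + 1]) && IsCont(src[s + 2]))
            {
                dst[d++] = (char)(((src[s] & 0x0F) << 12) | ((src[s + 1] & 0x3F) << 6) | (src[s + 2] & 0x3F));
                s += 3;
            }

            // 4) Single 4-byte sequence, emitted as a surrogate pair
            if (s + 3 < src.Length && Is4ByteLead(src[s]))
            {
                int cp = ((src[s] & 0x07) << 18) | ((src[s + 1] & 0x3F) << 12)
                       | ((src[s + 2] & 0x3F) << 6) | (src[s + 3] & 0x3F);
                cp -= 0x10000;
                dst[d++] = (char)(0xD800 + (cp >> 10));
                dst[d++] = (char)(0xDC00 + (cp & 0x3FF));
                s += 4;
            }

            // Guard added for the sketch: malformed or truncated input would
            // otherwise spin forever because no branch consumes it.
            if (s == before) throw new FormatException($"Invalid UTF-8 at byte offset {s}.");
        }
        return d;
    }
}
```

On short alternating runs every outer iteration pays for all four run checks, which is consistent with the regression measured on the Hungarian-mixed cells.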

Why Phase 2.5 (scalar baseline before the SIMD multi-byte Phase 3a-3c):

1-2h prototyping cost vs 6-10h Phase 3 SIMD work
A/B benchmark on Repeated cell decides whether the run-length structure already wins on Magyar mixed (KözösCímke pattern) — if it does, Phase 3 lifts further; if not, Phase 3 SIMD is the only win path
Documents the "switch-jumptable bottleneck on Hungarian benchmark" hypothesis without committing to the larger SIMD effort
The Decode2ByteRun / Decode3ByteRun scalar-batch implementations also serve as algorithm references for the Phase 3 SIMD versions (clear semantics first, optimize after)

Expected payoff (per content class, ratio vs current switch-jumptable):

Long CJK BMP (3-byte run, e.g. 你好世界 ×30): ~20-40% Deser improvement (long homogeneous run, biggest jumptable savings)
Long 2-byte run (árvíztűrő ×10+): ~5-15% improvement
Magyar mixed (KözösCímke, sötét — short alternating runs): 0-5% (run-detection overhead may eat the savings on short runs)
Long ASCII (≥32 byte): 0% (Phase 1 SIMD prefix already handles)
Emoji (4-byte): 0% (mixed-edge fallback unchanged)

Risk — the existing switch-jumptable JIT optimization is strong; Magyar mixed text (1-2 char runs) may not show net gain. Implementation must be isolated prototype first (alongside the live DecodeUtf8SinglePass, not replacing it), with A/B benchmark comparing the two before any switch.

Acceptance (Phase 2.5):

Repeated cell Compact Deser ratio ≤ 1.0 vs MemPack on AVX2 hosts (parity with current measurement, no regression)
Round-trip tests pass on all UTF-8 content classes (ASCII / 2-byte / 3-byte BMP / 4-byte surrogate-pair)
A/B benchmark shows ≥ 5% Deser improvement on Repeated OR ≥ 10% on Large cell — else Phase 2.5 stays in TODO as documented dead-end (negative result is also valuable: confirms the jumptable is fast enough, focus moves entirely to Phase 3)

Phase 3 implementation outline

Insert SIMD multi-byte branches at DecodeUtf8SinglePass entry, before the existing ASCII-prefix bail-out loops:

if (Avx512BW.IsSupported && len - i >= 64)            { Vector512MultiByteDecode(...) }
if (Vector256.IsHardwareAccelerated && len - i >= 32) { Vector256MultiByteDecode(...) }
if (Vector128.IsHardwareAccelerated && len - i >= 16) { Vector128MultiByteDecode(...) }
// existing scalar tail

Single algorithm template — classify-mask-compress-widen pipeline:
1. Block load (Vector512 / Vector256 / Vector128)
2. Classify each byte's UTF-8 sequence position via mask compares (start vs continuation, 1/2/3/4-byte sequence width)
3. Compute output char count via popcount on start-byte mask + extra-char mask for 4-byte sequences
4. Byte-compression for leader/continuation extraction (mask-driven PermuteVar / Shuffle)
5. Combine leader + continuations into codepoints (shift + OR)
6. Widen codepoints to UTF-16 chars (handle surrogate pairs for 4-byte sequences)
7. Store output, advance src/dst pointers
Block-boundary edge case: incomplete multi-byte sequence at block end → carry to next iter or hand off to lower tier / scalar tail
Trust-input semantics maintained — no validate-pass instructions (reader input is valid UTF-8 by writer contract)
Avx512BW.X64.IsSupported (64-bit-only intrinsics) checked separately if any code path requires the X64 sub-feature
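
Steps 2 and 3 and the block-boundary edge case can be pinned down with a scalar reference before any SIMD code exists. The sketch below is an assumption-laden illustration (hypothetical `BlockClassifySketch`, valid UTF-8 input, blocks of at most 64 bytes); a SIMD version would produce the same masks with compare instructions instead of a loop.

```csharp
using System;
using System.Numerics;

public static class BlockClassifySketch
{
    // Bit i of each mask describes block[i].
    public static (ulong Starts, ulong FourByteLeads) Classify(ReadOnlySpan<byte> block)
    {
        if (block.Length > 64)
            throw new ArgumentException("Block must be at most 64 bytes.", nameof(block));

        ulong starts = 0, fours = 0;
        for (int i = 0; i < block.Length; i++)
        {
            byte b = block[i];
            if ((b & 0xC0) != 0x80) starts |= 1UL << i; // not a continuation byte
            if (b >= 0xF0) fours |= 1UL << i;           // 4-byte lead: surrogate pair
        }
        return (starts, fours);
    }

    // Step 3: output length in UTF-16 code units for a block that ends
    // on a sequence boundary.
    public static int Utf16UnitCount(ulong starts, ulong fourByteLeads)
        => BitOperations.PopCount(starts) + BitOperations.PopCount(fourByteLeads);

    // Block-boundary edge case: length of the longest prefix that does not
    // cut a multi-byte sequence; the remainder is carried to the next block.
    public static int CompletePrefixLength(ReadOnlySpan<byte> block)
    {
        int n = block.Length;
        for (int back = 1; back <= 3 && back <= n; back++)
        {
            byte b = block[n - back];
            if ((b & 0xC0) == 0x80) continue; // continuation, keep looking for the lead
            int need = b < 0x80 ? 1 : (b & 0xE0) == 0xC0 ? 2 : (b & 0xF0) == 0xE0 ? 3 : 4;
            return need > back ? n - back : n;
        }
        return n; // three trailing continuations close a complete 4-byte sequence
    }
}
```

A caller would trim the block with `CompletePrefixLength` first and classify only that prefix, so the popcount never counts a lead whose continuation bytes live in the next block.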

Why P2

"i18n production deploy" perf gap on every host class — the public NuGet package contract requires fast multi-byte UTF-8 across cloud server, desktop, mobile, and Blazor WASM
No NuGet dependency, no third-party code, no wire-format change, additive — pure CPU optimization
Phase 1+2 delivered cross-tier ASCII / count SIMD coverage; Phase 3 closes the multi-byte CPU gap on all SIMD-capable hosts (not just AVX-512)
Single algorithm template ported across 3 register widths — code volume manageable

Acceptance

Repeated Deser ratio ≤ 0.7 vs MemPack on AVX-512 hosts (Phase 3a)
Repeated Deser ratio ≤ 0.8 vs MemPack on AVX2 hosts (Phase 3b)
Repeated Deser ratio ≤ 0.85 vs MemPack on Apple Silicon / WASM (Phase 3c)
Repeated Ser ratio ≤ 0.85 across all host classes
Round-trip tests pass on all UTF-8 content classes (ASCII / 2-byte / 3-byte BMP / 4-byte surrogate-pair)
BINARY_FEATURES.md documents the SIMD path selection across all four tiers

Trigger

Each SIMD width validated on a representative host before merge:
- Phase 3a: AVX-512 host (developer's local AMD Zen 4+ desktop, Intel 11th gen, or server-class machine)
- Phase 3b: AVX2 host (any modern x86 desktop / laptop without AVX-512)
- Phase 3c: Apple Silicon (macOS / iOS / Mac Catalyst) AND Blazor WASM browser runtime
Local dotnet test covers correctness; per-tier benchmarks measure the multi-byte speedup
Phase 1+2 (AVX-512BW + Vector128 in CountUtf8Chars + EncodeUtf8SinglePass Phase 1) landed 2026-05-05 — covered by existing round-trip tests, no regression on non-AVX-512 hosts (validated on AVX2-host bench)

ACCORE-BIN-T-H2Q6: Fixed-width dual-length string header (Small/Medium/Big) for 1-pass decode

Priority: P1 · Type: Wire-format + Performance · Status: Closed (2026-05-06) · Related: DecodeUtf8SinglePass, CountUtf8Chars, WriteStringWithDispatch, ReadStringUtf8

The current Compact string decode uses a two-pass flow for non-ASCII payloads (CountUtf8Chars + DecodeUtf8SinglePass). Planned direction: remove the VarUInt-based string-length path for the new string wire variant and carry both lengths in a fixed-width header, so that deserialize can allocate the target string immediately and decode in a single pass.

Planned format tiers

Small: packed uint16 (charLen:8 | utf8Len:8)
Medium: packed uint32 (charLen:16 | utf8Len:16)
Big: uint32 charLen + uint32 utf8Len

Writer picks the smallest fitting tier; reader dispatches by marker and reads fixed-width lengths (no VarUInt loop for string length metadata).
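
A minimal sketch of the tier dispatch, assuming little-endian fields in charLen-then-utf8Len order and the marker values from this entry's post-H2Q6 marker address space (91 / 94 / 103). `DualLengthHeaderSketch` is a hypothetical name and the real writer may pack the fields differently.

```csharp
using System;
using System.Buffers.Binary;

public static class DualLengthHeaderSketch
{
    // Marker values as listed in this entry's marker address space.
    public const byte StringSmall = 91, StringMedium = 94, StringBig = 103;

    // Writes [marker][charLen][utf8Len] in the smallest fitting tier.
    // Returns the header size in bytes.
    public static int Write(Span<byte> dst, uint charLen, uint utf8Len)
    {
        // Every UTF-16 code unit costs at least one UTF-8 byte,
        // so utf8Len alone decides the tier.
        if (charLen > utf8Len) throw new ArgumentOutOfRangeException(nameof(charLen));

        if (utf8Len <= byte.MaxValue)
        {
            dst[0] = StringSmall;
            dst[1] = (byte)charLen;
            dst[2] = (byte)utf8Len;
            return 3;
        }
        if (utf8Len <= ushort.MaxValue)
        {
            dst[0] = StringMedium;
            BinaryPrimitives.WriteUInt16LittleEndian(dst.Slice(1), (ushort)charLen);
            BinaryPrimitives.WriteUInt16LittleEndian(dst.Slice(3), (ushort)utf8Len);
            return 5;
        }
        dst[0] = StringBig;
        BinaryPrimitives.WriteUInt32LittleEndian(dst.Slice(1), charLen);
        BinaryPrimitives.WriteUInt32LittleEndian(dst.Slice(5), utf8Len);
        return 9;
    }

    // Dispatches on the marker and reads fixed-width lengths (no VarUInt loop).
    // Returns the header size in bytes.
    public static int Read(ReadOnlySpan<byte> src, out uint charLen, out uint utf8Len)
    {
        switch (src[0])
        {
            case StringSmall:
                charLen = src[1];
                utf8Len = src[2];
                return 3;
            case StringMedium:
                charLen = BinaryPrimitives.ReadUInt16LittleEndian(src.Slice(1));
                utf8Len = BinaryPrimitives.ReadUInt16LittleEndian(src.Slice(3));
                return 5;
            case StringBig:
                charLen = BinaryPrimitives.ReadUInt32LittleEndian(src.Slice(1));
                utf8Len = BinaryPrimitives.ReadUInt32LittleEndian(src.Slice(5));
                return 9;
            default:
                throw new FormatException($"Unknown string marker {src[0]}.");
        }
    }
}
```

With `charLen` known up front the reader can allocate the target string before touching the payload, which is what removes the counting pass.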

Why

Removes CountUtf8Chars pass on the new markers (1-pass decode path)
Keeps decode branch profile stable (fixed-size header reads)
Maintains range safety with explicit Big overflow path

Constraints captured from current benchmark context

Performance evaluation target is non-ASCII-heavy data (ASCII-shortcuts intentionally not primary)
Wire-format backward compatibility is not required for this development phase

Marker layout decision (2026-05-06)

After analysis on the new "all UTF-8 Magyar" benchmark baseline (2026-05-06_13-10-30.LLM — Compact +5-25% slower than MemPack on every cell):

Confirmed: the previous benchmark's Compact-vs-MemPack advantage was an artifact of ASCII property names hitting the FixStrAscii / Latin1-widen fast path; once string property values are also UTF-8 Magyar, the actual hot path (EncodeUtf8SinglePass + two-pass CountUtf8Chars + DecodeUtf8SinglePass) becomes the bottleneck.

Marker scope decision — clean split between ASCII fast path and non-ASCII tier dispatch:

UNCHANGED (kept as-is):

FixStrAscii (≤31 byte ASCII): compact 1-byte header + Latin1 widen, zero UTF-8 decode pipeline
StringAscii (>31 byte ASCII) — long ASCII fast path, Latin1 widen
StringInternRef — 2nd+ occurrence of interned string (no body, just cache index — not affected by 2-pass problem)
StringEmpty, Null — sentinel markers

REMOVED (replaced by H2Q6 tiers):

FixStr (32 marker values 103-134 — non-ASCII short) → replaced by StringSmall
String (1 marker value 91 — non-ASCII long with VarUInt utf8Len) → replaced by StringSmall / StringMedium / StringBig
StringInternFirst (1 marker value 94 — VarUInt utf8Len interning) → replaced by StringInternFirstSmall / StringInternFirstMedium

NEW markers (5 total):

StringSmall — non-ASCII, [marker:1][charLen:8][utf8Len:8][bytes], utf8Len ≤ 255
StringMedium — non-ASCII, [marker:1][charLen:16][utf8Len:16][bytes], utf8Len ≤ 65535
StringBig — non-ASCII, [marker:1][charLen:32][utf8Len:32][bytes], utf8Len > 65535
StringInternFirstSmall — [marker:1][cacheIdx:VarUInt][charLen:8][utf8Len:8][bytes]
StringInternFirstMedium — [marker:1][cacheIdx:VarUInt][charLen:16][utf8Len:16][bytes]

Trade-off justification:

Wire cost on short non-ASCII strings: +2 byte/string header (3 vs 1) → ~0.07-0.36% wire growth on Repeated cell (10 short Magyar strings × 2 byte / 28 KB)
CPU saving: CountUtf8Chars Pass 1 eliminated on every non-ASCII string decode → directly attacks the +25% Deser baseline gap
The 2-byte hybrid FixStr (non-ASCII) variant (1 byte marker + 1 byte charLen) was considered but rejected: marginal wire saving (-1 byte vs StringSmall) does not justify the +1 marker complexity given the tiny absolute wire impact on the Repeated cell. Cleaner to have ASCII-vs-non-ASCII at the marker level (FixStrAscii vs StringSmall/Medium/Big).

Interning tier sizing rationale:

MaxStringInternLength is byte-typed (AcBinarySerializerOptions.cs:125, default 64, absolute max 255 chars)
Worst-case: 255 char × 4 byte/char (emoji-only) = 1020 byte → fits in Medium tier (utf8Len ≤ 65535)
Realistic Magyar/CJK: 64 char × 2-3 byte = 128-192 byte → Small tier
Big tier never engages on the interning path — only Small + Medium needed (+2 markers, not +3)

Marker address space reservation (post-H2Q6)

The marker reorg frees 34 marker values (32 FixStr non-ASCII + String + StringInternFirst). After allocating 5 for H2Q6, 29 values remain free. Strategic reservation plan to prevent ad-hoc consumption and minimize future wire-format breaks:

Reserved range	Count	Future feature	Status
`StringSmall` / `StringMedium` / `StringBig`	3	H2Q6 Compact tiers	active (this entry)
`StringInternFirstSmall` / `StringInternFirstMedium`	2	H2Q6 interning tiers	active (this entry)
`FixArrayBase..FixArrayMax`	16	`ACCORE-BIN-T-L9Y3` (FixArray short-list count in marker)	reserved, future
Sentinel-length string tier markers	~5	`ACCORE-BIN-T-S5L8` (sentinel-length encoding)	reserved, future
Markerless schema lane	~4	`ACCORE-BIN-T-S2X9` (markerless schema lane opt-in)	reserved, future
`StringFastWire`	1	`ACCORE-BIN-T-F3W6` (dedicated FastWire string marker)	reserved, future
General reserve	3	unallocated	reserve

Wire-format version bump: v2 → v3 at H2Q6 landing. The reserved-but-unimplemented marker values are documented but not yet decoded — readers throw unknown marker if wire contains them. Future activation of FixArray / sentinel-length / markerless schema lane within the same v3 wire format is non-breaking for already-deployed v3 consumers (they reject unknown markers cleanly; producers opt in to emit them).

Acceptance

New string markers implemented for Small/Medium/Big tiers + InternFirstSmall/InternFirstMedium tiers
Deserialize path for these markers performs single-pass decode without CountUtf8Chars
29 freed marker values strategically reserved per the address-space reservation table; documented in BinaryTypeCode.cs with // Reserved for ACCORE-BIN-T-XXXX (future) comments
Wire-format version bump v2 → v3 documented in BINARY_FORMAT.md
Existing round-trip tests pass, plus new boundary tests for tier transitions (utf8Len = 254/255/256/65534/65535/65536) and interning tier transitions
Benchmark report includes before/after for Compact mode on non-ASCII dataset (Ser/Deser/RT + Size) vs the 2026-05-06_13-10-30.LLM baseline

Resolution

Landed 2026-05-06. End-to-end implementation: marker reorg + writer tier-dispatch + reader tier-readers + SGen template + skip path + interning path. Five new markers (StringSmall/Medium/Big/InternFirstSmall/InternFirstMedium) replacing the old String/StringInternFirst/FixStrBase..Max (32 + 1 + 1 = 34 marker values freed, 5 used; 29 reserved for future features per the address-space plan). Wire format version bumped v2 → v3.

Follow-up A-direction header pack-write/read optimization landed in the same window: Unsafe.WriteUnaligned<ushort> (Small) / <uint> (Medium) / <ulong> (Big) replace 2× byte / 2× ushort / 2× uint stores; reader uses single uint/ulong loads with bit-extract. Direct ref byte writes (no Span-shape overhead).

Tests: 222 pass / 13 pre-existing GuidIId failures (unchanged). 55/55 Utf8TranscoderTests pass.

Benchmark vs 2026-05-06_13-10-30.LLM baseline (2026-05-07_08-55-49.LLM, immediately post-H2Q6):

Compact-vs-MemPack Deser ratio improvement on baseline gap: -14 to -28 percentage points across cells
Deser: 4/5 cells now faster than MemPack (Small -6%, Medium -3%, Large -9%, Deep -7%); Repeated cell remaining +5% gap (V4N2 Phase 3 SIMD multi-byte transcoder targets this)
Wire size: 5/5 cells smaller than MemPack (-8% to -11%)
Ser: 1/5 win (Large -9%), 1/5 tie (Medium 0%), 3/5 minor lag (+2-7% Small/Repeated/Deep) — host-noise band

Bench evolution post-H2Q6 (subsequent micro-opts on the same H2Q6 base):

2026-05-07_09-39-09.LLM, Direction A header pack-write/read (Unsafe.WriteUnaligned ushort/uint/ulong): noise-level movement, structural improvement
2026-05-07_15-13-39.LLM, V4N4 Step 1+2 method-split (AggressiveInlining): regression (Small Ser +29.6 pp, Repeated Ser +8.9 pp) → over-aggressive inlining of WriteStringSmallFast causes code bloat / i-cache pressure
2026-05-07_15-29-21.LLM, V4N4 refined (NoInlining on SmallFast, no dispatcher hint, Reader split reverted): consolidated state:
- Ser: 5/5 cells at parity or better (Small -8.5%, Medium ≈, Large -8.5%, Repeated ≈, Deep ≈)
- Deser: 4/5 cells faster than MemPack (Medium -4.7%, Large -10.6%, Repeated -3.8%, Deep -10.1%); Small +10% remaining gap
- Wire: 5/5 cells -8% to -11% smaller (unchanged)
- Net: Compact now wins on 8/10 cells vs MemPack; only Small Deser retains a +10% gap (small absolute value, ~1 µs)

Critical algorithmic correctness lesson (from V4N3 follow-up GetUtf8ByteCount): the initial 4-popcount formula assumed lowSur == highSur per chunk. Fix: 5-popcount closed-form. Caught by surrogate-pair-split-across-chunk regression tests. Documented in Utf8Transcoder.

Marker address space (post-H2Q6, v3 wire):

91 → StringSmall (was String)
94 → StringMedium (was StringInternFirst)
103 → StringBig
104 → StringInternFirstSmall
105 → StringInternFirstMedium
106..134 reserved (29 values: 16 for L9Y3 FixArray, 5 for S5L8 sentinel-length, 4 for S2X9 markerless schema lane, 1 for F3W6 FastWire dedicated marker, 3 reserve)

Related follow-up TODO entries (now Open): O7G2 (overflow guard), S6F2 (shift-free Small fast path), W2C8 (WASM string-cache H2Q6 maximization).

ACCORE-BIN-T-S5L8: Sentinel-length encoding for strings (wire-size optimization, both modes)

Priority: P3 · Type: Wire-format optimization · Related: AcBinarySerializer.WriteString, AcBinaryDeserializer.ReadValue string dispatch

The leading string-marker byte (String / StringEmpty / Null) exists primarily to distinguish null vs empty vs non-empty before dispatching. For non-polymorphic, non-interned string properties the marker can be replaced by a single sentinel-length VarUInt:

[VarUInt sentinelLength] [content bytes if applicable]
   sentinelLength == 0    → null
   sentinelLength == 1    → empty string
   sentinelLength == N+1  → string of N bytes/chars, content follows

MemoryPack-style encoding pattern. Applies to both Compact (UTF-8) and FastWire (UTF-16 raw) modes; the content following the sentinel differs by mode.
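
The 0 / 1 / N+1 mapping can be sketched for the Compact (UTF-8) case as follows. This is an illustration under assumptions: the VarUInt here is a plain 7-bit little-endian varint, which may differ from AcBinary's actual VarUInt layout, and `SentinelStringSketch` is a hypothetical name.

```csharp
#nullable enable
using System;
using System.IO;
using System.Text;

public static class SentinelStringSketch
{
    // Assumed VarUInt layout: 7 payload bits per byte, high bit = "more follows".
    static void WriteVarUInt(Stream s, uint value)
    {
        while (value >= 0x80)
        {
            s.WriteByte((byte)((value & 0x7F) | 0x80));
            value >>= 7;
        }
        s.WriteByte((byte)value);
    }

    static uint ReadVarUInt(Stream s)
    {
        uint result = 0;
        int shift = 0;
        while (true)
        {
            int b = s.ReadByte();
            if (b < 0) throw new EndOfStreamException();
            result |= (uint)(b & 0x7F) << shift;
            if ((b & 0x80) == 0) return result;
            shift += 7;
            if (shift > 28) throw new FormatException("VarUInt too long.");
        }
    }

    // 0 = null, 1 = empty, N+1 = N content bytes follow.
    public static void Write(Stream s, string? value)
    {
        if (value is null) { WriteVarUInt(s, 0); return; }
        byte[] bytes = Encoding.UTF8.GetBytes(value);
        WriteVarUInt(s, (uint)bytes.Length + 1);
        s.Write(bytes, 0, bytes.Length);
    }

    public static string? Read(Stream s)
    {
        uint sentinel = ReadVarUInt(s);
        if (sentinel == 0) return null;
        if (sentinel == 1) return string.Empty;
        var bytes = new byte[sentinel - 1];
        s.ReadExactly(bytes); // Stream.ReadExactly requires .NET 7 or later
        return Encoding.UTF8.GetString(bytes);
    }
}
```

Null and empty each cost a single byte, and a non-empty string costs its length prefix plus content with no separate marker byte, which is where the per-string saving comes from.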

Per-mode impact

FastWire mode — wire layout today: [String marker][VarUInt charCount][UTF-16 raw bytes]. Sentinel saves 1 byte per non-null string.

TestData	Current FastWire wire	Estimated with sentinel	Δ
Small	3122 B	~3050 B	-2%
Medium	10905 B	~10500 B	-4%
Large	68603 B	~67000 B	-2%
Repeated	16244 B	~15700 B	-3%
Deep	15514 B	~14900 B	-4%

Closes the +1.7-8.1% FastWire wire gap vs MemoryPack to near zero or favorable while keeping AcBinary FastWire's +9-20% speed advantage.

Compact mode — wire layout today varies by length:

Short (≤31 byte): [FixStr+length][UTF-8 bytes] — already 1-byte marker, ties sentinel.
Long (>31 byte): [String marker][VarUInt byteCount][UTF-8 bytes] — sentinel saves 1 byte (the marker).

Compact gain: only on long strings (>31 byte UTF-8). Estimated −1 byte per long string. Workload-dependent: if most strings are short or use interning, gain is small. If many long mixed-content strings, meaningful saving.

Limitations (both modes)

Polymorphic object properties: marker needed for type discrimination. Sentinel encoding only applies when the property type is statically string or string?.
Interning incompatible: sentinel cannot express StringInternFirst / StringInterned markers (those carry cache-index semantics). Interned properties keep marker-based encoding. FastWire mode already disables interning by design (consistent); Compact mode needs per-property dispatch (interned → marker, non-interned → sentinel).
Compact-mode FixStr ties: short strings (≤31 byte UTF-8) gain nothing in Compact (FixStr is already 1-byte marker+length). The optimization wins only on long strings in Compact.

Implementation outline (rough — refine when implementing)

Writer: branch in WriteString on property metadata flags (IsString, IsNotInterned, IsNotPolymorphic). If sentinel-eligible, emit VarUInt sentinelLength + content. Else fall through to existing marker-based encoding.
Reader: matching branch in property reader. If sentinel-eligible (per property metadata), read VarUInt sentinelLength, dispatch on 0/1/N+1.
SGen: emit sentinel-encoding variant for non-polymorphic non-interned string typed properties; emit existing marker-encoding for the rest.
Wire format version bump OR header flag indicating sentinel-encoding-active. (Cross-version compat policy decided when implementing.)

Trigger

After D-2 / decoder optimization / marker-dispatch land (compact-mode focus completes)
When wire-size positioning becomes a primary pillar for NuGet release
Re-evaluate scope at implementation time — exact gain in Compact depends on consumer workload (long-string ratio, interning patterns)

Acceptance

FastWire mode: AcBinary wire ≤ MemoryPack on at least 4 of 5 test cells
Compact mode: long-string wire bytes -1 each, no regression on short or interned strings
Speed benchmark: no regression vs current encoding (essentially zero CPU cost — sentinel is shifted bookkeeping)
Cross-version compat: documented format version bump + clean fail on old reader / new wire mismatch
Polymorphic + interned property test cases pass unchanged (use existing marker-based encoding)

ACCORE-BIN-T-M3R7: ASCII marker-dispatch — writer detect + reader dedicated path

Priority: P2 · Type: Performance + wire optimization · Related: BinaryTypeCode.FixStrAsciiBase..StringAscii markers, WriteStringWithDispatch, ReadAsciiBytesAsString · Status: Closed (2026-05-04)

Ordering note: this is done AFTER THE ENCODER OPTIMIZATION (see ACCORE-BIN-T-E2F9). Rationale: the Vector256 ASCII narrow/widen paths of the custom encoder/decoder already handle ASCII bytes quickly on their own. ON TOP of that, marker-dispatch only saves the per-call dispatch overhead (no Ascii.IsValid scan, no decoder layer). It is a guaranteed win, but an additive one; for clean measurement it is better left until after the decoder/encoder work.

The FixStrAscii* (135-166) and StringAscii (167) markers are defined in BinaryTypeCode.cs with helper methods (IsAsciiString, IsFixStrAscii, EncodeFixStrAscii, DecodeFixStrAsciiLength). Encoding/decoding logic NOT yet implemented — currently both writer and reader use the universal String / FixStr markers.

Implementation

Writer: in WriteStringUtf8 / WriteFixStrDirect, after UTF-8 encoding (D-2 path), check bytesWritten == charLength (= ASCII iff equal). If ASCII, emit FixStrAscii (≤31 byte) or StringAscii (>31 byte). Else emit existing FixStr / String. Free detect — both numbers already computed by D-2.
Reader: in ReadStringUtf8 (or upstream marker dispatch), branch on marker. ASCII markers → dedicated byte→char widening path (no UTF-8 decode, no Ascii.IsValid scan, no decoder dispatch). Non-ASCII markers → existing custom UTF-8 decoder.
SGen: regenerate readers/writers to dispatch on the new markers.
Re-enable ASCII fast paths: uncomment writer FixStr dispatch in AcBinarySerializer.cs and reader Ascii.IsValid block in ReadStringUtf8 — these temporarily disabled blocks become the marker-aware paths (no IsValid scan needed since the marker is the contract).
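The writer-side detect and the reader-side widening path can be sketched as below. This is an illustration only: the marker values, the 1-byte length field and the `AsciiDispatchSketch` name are placeholders, not the real BinaryTypeCode layout. The sketch uses `Encoding.UTF8.GetBytes` as a stand-in for the D-2 encode step.

```csharp
using System;
using System.Text;

// Hypothetical sketch; marker values and layout are placeholders.
public static class AsciiDispatchSketch
{
    public const byte MarkerUtf8 = 0x01;   // stands in for FixStr / String
    public const byte MarkerAscii = 0x02;  // stands in for FixStrAscii / StringAscii

    // Layout: [marker][1-byte length][payload]; payloads above 255 bytes are rejected.
    public static int Write(Span<byte> dest, string value)
    {
        int bytesWritten = Encoding.UTF8.GetBytes(value.AsSpan(), dest.Slice(2));
        if (bytesWritten > byte.MaxValue)
            throw new ArgumentException("sketch supports at most 255 payload bytes");

        // UTF-8 spends exactly one byte per UTF-16 char only for U+0000..U+007F,
        // so equal counts mean the whole string is ASCII. No extra scan is needed.
        dest[0] = bytesWritten == value.Length ? MarkerAscii : MarkerUtf8;
        dest[1] = (byte)bytesWritten;
        return 2 + bytesWritten;
    }

    public static string Read(ReadOnlySpan<byte> src)
    {
        ReadOnlySpan<byte> payload = src.Slice(2, src[1]);
        return src[0] == MarkerAscii
            ? Encoding.Latin1.GetString(payload)  // plain byte -> char widening
            : Encoding.UTF8.GetString(payload);   // full UTF-8 decode
    }
}
```

The reader trusts the marker as the contract, which is why the ASCII branch can skip validation entirely.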

Wire format change

Format version bump (1 → 2). Old readers fail clean on new wire (version mismatch). New readers must reject old wire OR support backward read.

Acceptance

Repeated Strings (Hungarian content) Deser: AcBinary closes the ~10% gap vs MemoryPack
Pure ASCII tests (Small/Medium/Large/Deep): AcBinary Ser AND Deser ≥ MemoryPack
Wire size: minimum -25% vs MemoryPack across all test cells
SGen-generated code compiles and round-trips on all [AcBinarySerializable] types
Decision documented: backward-compat policy for v2 vs v1 wire

Resolution

End-to-end implementation landed (writer + reader + SGen + skip + populate). Key components:

Writer (AcBinarySerializer.BinarySerializationContext.WriteStringWithDispatch) — single-pass UTF-8 encode + ASCII detect via bytesWritten == charLength; emits one of 4 markers (FixStrAscii / FixStr / StringAscii / String). Split layout for hot path: charLength ≤ 31 encodes optimistically at savedPos+1 (FixStr position) → 0 shift on FixStr hit; charLength > 31 uses D-2 layout with backfill. The split avoids the post-encode left-shift that the unified layout introduced (regression seen in 12-42-32 bench).
Reader (AcBinaryDeserializer.BinaryDeserializationContext.ReadAsciiBytesAsString) — Encoding.Latin1.GetString (BCL SIMD-accelerated byte→char widen). Avoids the string.Create callback + scalar widen overhead — measurably better on Small Deser cell (closed the +20% MemPack-relative anomaly).
TypeReaderTable: StringAscii (167) + 32 × FixStrAscii (135-166) readers registered. IsFixStrAscii / StringAscii fast paths in PopulatePropertyWithMarker, ReadValue, SkipValue.
SGen (AcBinarySourceGenerator.EmitReadString) — regenerated readers branch on IsFixStr / IsFixStrAscii / case StringAscii per property.

Wire format version not bumped — the new markers occupy previously-unused codepoints (135-167); old wire (without ASCII markers) is forward-compatible (readers handle both String and StringAscii). v1 stays.

Acceptance (AOT bench 13-40-29, MemPack-relative ratios — JIT noise eliminated):

✅ AcBinary Ser AND Deser FASTER than MemPack on EVERY cell (5/5)
- Small: Ser -8%, Deser -23%
- Medium: Ser -17%, Deser -30%
- Large: Ser -28%, Deser -32%
- Repeated: Ser -4%, Deser -9%
- Deep: Ser -24%, Deser -22%
✅ Wire size advantage: 2043-50419 byte (vs MemPack 3070-64986) = -22% to -33% across cells
✅ Round-trip tests: 167 pass (13 pre-existing failures are IId-tracking, unrelated to M3R7)

JIT vs AOT note: earlier JIT-mode benchmarks (12-50-43 → 13-27-20 series) showed elevated ratios on Small/Repeated cells (1.0-1.2 range) that disappeared under AOT publish. The JIT-mode numbers reflect tier-up artifacts (inconsistent inlining of SGen-generated reader hot paths during the 1000-iteration measurement window), not a structural M3R7 property. AOT (NativeAOT / ILC) compiles deterministically with fixed inline decisions — the steady-state numbers above reflect the actual production performance.

ACCORE-BIN-T-E2F9: Custom UTF-8 encoder (writer-side, symmetric with custom decoder)

Priority: P1 · Type: Performance · Related: decoder optimization (AcBinaryDeserializer.BinaryDeserializationContext.Read.cs::DecodeUtf8SinglePass) · Status: Closed (2026-05-04)

Ordering note: this is done BEFORE MARKER-DISPATCH (see ACCORE-BIN-T-M3R7). Rationale: the custom encoder/decoder optimization is the "harder, less certain" win; it brings in the non-ASCII / mixed-content workloads (Repeated Strings, Hungarian). Marker-dispatch afterwards is only an additive cleanup of the dispatch overhead on the pure-ASCII path.

Replace Encoding.UTF8.GetBytes calls in WriteStringUtf8 / WriteStringUtf8Internal / WriteFixStrDirect (collectively the writer's UTF-8 encode path, post-D-2) with a hand-rolled SIMD encoder. Symmetric to the decoder optimization (V4N2 / Read.cs::DecodeUtf8SinglePass).

Layered structure (mirrors decoder)

Phase 1 (Vector256 ASCII narrow): 16 chars (Vector256) → 16 bytes (Vector128) via Vector256.Narrow. ASCII detect via (v & 0xFF80) == Vector256<ushort>.Zero (no bit above the low 7 set on any UTF-16 char; extracting only the most significant bit of each lane would miss U+0080..U+7FFF). Break on first non-ASCII char.
Phase 2 (DWORD ASCII batch): 4 chars at a time, OR-mask test, 4 bytes per iter when ASCII.
Phase 3 (scalar multi-byte encode): 1-byte (ASCII) / 2-byte (Latin extended) / 3-byte (BMP) / 4-byte (surrogate pair → supplementary plane) UTF-8 encoding via direct bit-extract. No fallback dispatch; input is trusted UTF-16 (string).
Use System.Text.Unicode.Utf8.FromUtf16 as fallback target for scalar correctness — or skip BCL entirely with manual bit-pack.
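The layered structure can be sketched in a reduced form: one vector ASCII phase followed by the scalar bit-extract phase. This is not the actual EncodeUtf8SinglePass. It assumes .NET 7+ for the `Vector128` helpers, uses two Vector128 halves instead of one Vector256, omits the DWORD phase, and `LayeredUtf8EncoderSketch` is a hypothetical name.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

// Hypothetical two-phase sketch; input is assumed to be well-formed UTF-16.
public static class LayeredUtf8EncoderSketch
{
    // dest must hold up to 3 bytes per input char. Returns bytes written.
    public static int Encode(ReadOnlySpan<char> source, Span<byte> dest)
    {
        ReadOnlySpan<ushort> src = MemoryMarshal.Cast<char, ushort>(source);
        Vector128<ushort> nonAsciiBits = Vector128.Create((ushort)0xFF80);
        int i = 0, o = 0;

        // Vector phase: 16 chars per iteration while every char is ASCII.
        while (Vector128.IsHardwareAccelerated && src.Length - i >= 16)
        {
            Vector128<ushort> lo = Vector128.Create<ushort>(src.Slice(i, 8));
            Vector128<ushort> hi = Vector128.Create<ushort>(src.Slice(i + 8, 8));
            // Compare the masked lanes against zero; testing only each lane's
            // top bit would let U+0080..U+7FFF through as "ASCII".
            if (((lo | hi) & nonAsciiBits) != Vector128<ushort>.Zero) break;
            Vector128.Narrow(lo, hi).CopyTo(dest.Slice(o, 16));
            i += 16;
            o += 16;
        }

        // Scalar phase: 1/2/3-byte BMP and 4-byte surrogate pairs via bit extraction.
        while (i < src.Length)
        {
            uint c = src[i++];
            if (c < 0x80)
            {
                dest[o++] = (byte)c;
            }
            else if (c < 0x800)
            {
                dest[o++] = (byte)(0xC0 | (c >> 6));
                dest[o++] = (byte)(0x80 | (c & 0x3F));
            }
            else if (c >= 0xD800 && c <= 0xDBFF && i < src.Length)
            {
                // Trusted input: the next unit is taken to be the low surrogate.
                uint cp = 0x10000 + ((c - 0xD800) << 10) + (src[i++] - 0xDC00u);
                dest[o++] = (byte)(0xF0 | (cp >> 18));
                dest[o++] = (byte)(0x80 | ((cp >> 12) & 0x3F));
                dest[o++] = (byte)(0x80 | ((cp >> 6) & 0x3F));
                dest[o++] = (byte)(0x80 | (cp & 0x3F));
            }
            else
            {
                dest[o++] = (byte)(0xE0 | (c >> 12));
                dest[o++] = (byte)(0x80 | ((c >> 6) & 0x3F));
                dest[o++] = (byte)(0x80 | (c & 0x3F));
            }
        }
        return o;
    }
}
```

For well-formed strings the output should match `Encoding.UTF8.GetBytes`, which is the "wire format unchanged" property the acceptance criteria ask for.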

Why

Encoding.UTF8.GetBytes carries virtual-dispatch + encoder-fallback overhead even with SIMD ASCII fast path internally. Custom encoder skips this. ~15-30% Ser improvement on ASCII content, ~5-10% on non-ASCII (multi-byte path stays scalar).

Trigger

NEXT — implementation order P1 before marker-dispatch (M3R7)
Re-evaluate if .NET 11 BCL UTF-8 GetBytes becomes faster (PR #120628 follow-up)

Acceptance

Writer-side benchmark: ≥15% Ser speedup on ASCII content (Small/Medium/Large/Deep), ≥5% on non-ASCII (Repeated)
Wire format unchanged (custom encoder produces same bytes as Encoding.UTF8)
Round-trip tests pass

Resolution

Implemented as EncodeUtf8SinglePass in AcBinarySerializer.BinarySerializationContext.cs — three-phase layered encoder (Vector256 ASCII narrow + DWORD ASCII batch + scalar 1/2/3-byte BMP & 4-byte surrogate-pair). Bypasses Encoding.UTF8.GetBytes virtual-dispatch + encoder-fallback overhead. Trusted-input path — no validation pass on writer side (the input is a .NET string with valid UTF-16 surrogate pairs by construction).

Used by WriteStringUtf8 (D-2 single-pass with VarUInt backfill) and WriteStringWithDispatch (M3R7 marker-dispatch path). Wire format unchanged — the encoder produces the same bytes as Encoding.UTF8.GetBytes.

Acceptance (per bench 12-50-43 → 13-27-20, MemPack-relative ratios on AcBinary Compact FastMode SGen):

✅ ASCII Ser ≥ MemPack on 4/5 cells (Small 0.94, Medium 0.80, Large 0.79, Deep 0.81)
⚠️ Repeated Ser ~1.04 (Hungarian, multi-byte path scalar) — see follow-up ACCORE-BIN-T-H7K3
✅ Round-trip tests pass (167 of 180; 13 pre-existing failures unrelated to encoder)

ACCORE-BIN-T-W7N5: Default-value omission policy — doc + optional opt-out

Priority: P2 · Type: Refactor + Documentation · Related: BINARY_ISSUES.md#accore-bin-i-d9y2 (canonical issue)

The serializer's PropertySkip (102) optimization saves 1 byte per default-valued property by omitting the full value from the wire — relying on the consumer-side type definition to have the same default(T). This is a latent correctness risk documented in ACCORE-BIN-I-D9Y2. This entry tracks the mitigation plan; full failure-mode analysis lives in the issue.

Decision tree (TBD when implementing)

Doc-only: position as a deliberate protobuf-style feature; consumer keeps type defaults stable across versions. Lowest cost, maximum benchmark wire-size advantage retained.
Option flag: AcBinarySerializerOptions.OmitDefaults boolean. Default true (preserves current behavior + benchmark numbers). false writes every property in full — opt-out for fragile-class-evolution scenarios.
Both: ship doc + flag. Default behavior unchanged; consumers who hit silent-corruption have an explicit opt-out.

Acceptance (when implementing)

BINARY_FEATURES.md adds a "Default-Value Omission" section documenting the semantic and the tradeoff (with cross-ref to ACCORE-BIN-I-D9Y2)
If flag added: round-trip tests covering both true and false; benchmark comparison table showing wire-size delta on ASCII / Hungarian / DTO-heavy workloads
Decision rationale recorded in LLM_PROTOCOL_DECISIONS.md (or a ### Resolution block on the issue) once implemented

ACCORE-BIN-T-H7K3: Hungarian / multi-byte content Ser optimization (Repeated Strings cell)

Priority: P3 · Type: Performance · Related: EncodeUtf8SinglePass Phase 3 (scalar multi-byte encode), ACCORE-BIN-T-E2F9 resolution · Status: Closed (2026-05-04), Won't Fix (JIT-only artifact)

The Repeated Strings benchmark (Hungarian content: "TermékNév_…", "RaklapKód_…") still shows AcBinary Ser ratio ~1.04 vs MemPack across multiple runs (12-50-43 / 13-21-27 / 13-27-20 series). All other ASCII-heavy cells (Small/Medium/Large/Deep) sit in the 0.79-0.94 ratio range — Repeated is the outlier.

The Phase 3 scalar multi-byte branch in EncodeUtf8SinglePass (1-byte ASCII / 2-byte Latin-extended / 3-byte BMP / 4-byte surrogate-pair) processes Hungarian diacritics (á, é, í, ő, ű, etc.) as 2-byte UTF-8 sequences via scalar bit-extract. MemPack's UTF-8 encoder appears to use a SIMD-accelerated mixed-content lane that processes 2-byte sequences in parallel.

Resolution

AOT bench 13-40-29: Repeated Ser ratio = 0.96 (AcBinary 14.50 µs vs MemPack 15.05 µs, AcBinary FASTER by 4%). Deser ratio 0.91 (also faster).

The 1.04+ ratio observed in JIT-mode benchmarks (12-50-43, 13-21-27, 13-27-20) was a JIT tier-up artifact — the SGen-generated writer's hot path (which calls EncodeUtf8SinglePass) didn't reliably tier up to fully-optimized code within the 1000-iteration measurement window, while MemPack's writer apparently warmed up faster. Under NativeAOT publish (-p:_IsPublishing=true) the issue disappears completely — both writers are deterministically optimized at compile time.

No structural problem in the Phase 3 scalar branch. The investigation directions (Vector256 mixed-content lane, BCL Utf8.FromUtf16 comparison) remain valid academic improvements but show no meaningful production-time win — closing as Won't Fix.

ACCORE-BIN-T-S2X9: Markerless schema lane — drop per-property type markers for fixed-shape primitives (SGen)

Priority: P2 · Type: Wire-format extension · Related: ACCORE-BIN-T-S5L8, ACCORE-BIN-T-W7N5

AcBinary is marker-driven: every value on the wire carries a 1-byte type code, so the reader can dispatch generically (handles polymorphism, null, intern markers, type-name lookup, etc.). MemPack is schema-driven: the SGen reader knows at compile time that "field 3 is int, field 4 is string" and reads values directly with no type code, no run-time dispatch.

For fixed-shape primitive properties (int, bool, double, Guid, DateTime, …) on [AcBinarySerializable] types, the per-property type marker is pure overhead — the SGen-generated reader already has compile-time knowledge of the property type, so the marker only confirms what is already known. Dropping it on this narrow class of properties is a clean wire+CPU win without losing any of the polymorphism / null / intern flexibility that the marker provides for variable-shape values.

Why P2 — `WireMode = Fast` wire-size parity (NuGet release narrative)

The WireMode = Fast lane currently produces +1.7% to +8.1% larger wire than MemPack across all benchmark cells (AOT bench 13-40-29: Small +52 byte, Medium +474, Large +3617, Repeated +1221, Deep +581). The gap is structural: UTF-16 raw-memcpy strings are 2 bytes/char fixed, while MemPack's UTF-8 is 1 byte/char on ASCII content. Touching the string-write path to fix this would either:

Lose the raw-memcpy guarantee (post-encode ASCII-detect + branchy dispatch — kills the FastWire CPU advantage), or
Add sentinel-encoding micro-savings (~3-5% wire) which don't close the structural gap.

Markerless schema lane is the only path to wire-size parity that preserves the FastWire raw-memcpy hot path. Per-primitive-property savings (1 byte for non-tiny int, Guid, DateTime, decimal, double, …) compound on DTO-heavy payloads. Estimated effect on benchmark cells:

| Cell | Current FastWire | MemPack | Estimated post-S2X9 FastWire | vs MemPack |
|---|---|---|---|---|
| Small (~70 primitive prop) | 3122 | 3070 | ~3050 | -0.7% ✅ |
| Medium (~600 primitive prop) | 10905 | 10431 | ~10300 | -1.3% ✅ |
| Large (~6000 primitive prop) | 68603 | 64986 | ~63500 | -2.3% ✅ |
| Deep (~700 primitive prop) | 15514 | 14933 | ~14800 | -0.9% ✅ |

The Repeated cell is harder to predict (string-dominated payload, fewer primitives) — likely smaller win, may not fully close the +8.1% gap. Acceptable: the Repeated cell is a string-interning stress test, not a typical DTO workload.

NuGet release narrative: "FastMode beats MemoryPack on both wire size AND throughput across all benchmark cells" — currently we have to qualify this with "throughput-only on Compact + i18n workloads"; S2X9 removes the qualifier. This is high-leverage for the public bench shootout.

Wire savings per property type

| Type | Current encoding | Markerless lane | Wire saved |
|---|---|---|---|
| `int` (TinyInt range −16..47) | TinyInt (1 byte) | VarInt (1 byte) | 0 |
| `int` (out-of-tiny) | `[Int32]` `[VarInt]` (2-6 bytes) | VarInt (1-5 bytes) | 1 byte |
| `bool` | `[True]` or `[False]` (1 byte) | 1 byte (0/1) | 0 |
| `Guid` | `[Guid]` `[16 bytes]` (17 bytes) | 16 bytes | 1 byte |
| `DateTime` | `[DateTime]` `[9 bytes]` (10 bytes) | 9 bytes | 1 byte |
| `DateTimeOffset` | `[DateTimeOffset]` `[10 bytes]` (11 bytes) | 10 bytes | 1 byte |
| `TimeSpan` | `[TimeSpan]` `[VarLong]` (2-9 bytes) | VarLong (1-9 bytes) | 1 byte |
| `decimal` | `[Decimal]` `[16 bytes]` (17 bytes) | 16 bytes | 1 byte |
| `double` | `[Float64]` `[8 bytes]` (9 bytes) | 8 bytes | 1 byte |

DTO-heavy payloads with many Guid / DateTime properties benefit the most — easily -10..-20% wire size on top of the existing -22..-33% advantage.

CPU savings

Reader-side: SGen-generated code drops the per-property ReadByte() + IsTinyInt / IsFixStr / switch-case dispatch for primitive properties — direct context.ReadInt32Unsafe() / ReadGuidUnsafe() / etc. calls. Writer-side: drops the WriteByte(typeCode) per primitive. Effect amplifies on payloads with many primitive properties (Small/Medium benchmark cells) — independent of any JIT-vs-AOT measurement variance.

Sketch — opt-in markerless lane, SGen-only

New wire format flag (header HeaderFlag_MarkerlessSchema = 0x10 or similar) → activates a property-positional lane.
SGen-generated writer for [AcBinarySerializable] types: per primitive property, emits raw value (no marker). For variable-shape properties (string, complex, nullable, polymorphic) the existing marker-driven path stays.
SGen-generated reader: per primitive property, calls context.ReadInt32Unsafe() / ReadGuidUnsafe() / etc. directly. Variable-shape properties keep the marker-read + dispatch.
Heuristic: a property is markerless-eligible if IsValueType && !IsNullable && type is in {int, bool, byte, short, long, float, double, DateTime, DateTimeOffset, Guid, TimeSpan, decimal}. Anything else (string, list, nested object, nullable) keeps the marker.
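The eligibility heuristic can be expressed as a small predicate. The sketch below uses runtime reflection purely for illustration; the source generator would presumably evaluate the same rule on compile-time type symbols. `MarkerlessEligibility` is a hypothetical name.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical runtime stand-in for the SGen-side eligibility check.
public static class MarkerlessEligibility
{
    private static readonly HashSet<Type> FixedShape = new()
    {
        typeof(int), typeof(bool), typeof(byte), typeof(short), typeof(long),
        typeof(float), typeof(double), typeof(DateTime), typeof(DateTimeOffset),
        typeof(Guid), typeof(TimeSpan), typeof(decimal),
    };

    // IsValueType && !IsNullable && type is in the fixed-shape set.
    public static bool IsEligible(Type propertyType) =>
        propertyType.IsValueType
        && Nullable.GetUnderlyingType(propertyType) is null
        && FixedShape.Contains(propertyType);
}
```

For example, `IsEligible(typeof(Guid))` is true, while `typeof(int?)`, `typeof(string)` and `typeof(List<int>)` all keep the marker-driven path.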

Decision points

Backward compatibility: header flag + version negotiation. Old readers see the flag set and either reject (clean fail) or fall back to marker-driven (if they support both lanes). Default false preserves current wire format.
Schema evolution fragility: the markerless lane is positional, so adding/removing/reordering primitive properties breaks readers compiled against an older schema. Document this clearly — opt-in is for stable schemas only (DTO-frozen API contracts, internal SignalR messages with synchronized client/server SGen). For evolving schemas, marker-driven default stays.
Coordination with ACCORE-BIN-T-S5L8 (sentinel-length strings): the two could share the "no-marker per-call" infrastructure — markerless string lane uses sentinel-length VarUInt (null/empty/short distinguished by length value).

Acceptance

Primary: WireMode = Fast AcBinary wire size ≤ MemPack across Small/Medium/Large/Deep AOT benchmark cells (AOT release-publish bench is the canonical measurement)
Wire size: ≥ -10% on DTO-heavy payloads (Guid/DateTime-rich) vs current marker-driven format
Round-trip on the markerless lane validated on representative DTO shapes (mixed primitive + string + nested object)
Schema-evolution fragility documented in BINARY_FEATURES.md (alongside the existing PropertySkip / default-omission caveat from ACCORE-BIN-I-D9Y2)
Opt-in flag with default false (preserves marker-driven default; consumers explicitly opt in for frozen-schema scenarios)

ACCORE-BIN-T-V4N3: Symmetric `GetUtf8ByteCount` API + writer-side BCL removal (cold path)

Priority: P3 · Type: Performance · Status: Superseded (2026-05-08, by ACCORE-BIN-T-K7M3) — landed Closed 2026-05-06; subsequent A/B against modern Utf8.FromUtf16 / Utf8.ToUtf16 showed the BCL modern API outperforms the custom transcoder on every benchmark cell, leading to full hot-path switch in K7M3 · Related: EncodeUtf8SinglePass, WriteStringUtf8Internal, PropertyMetadataBase.NameUtf8, ACCORE-BIN-T-K7M3 (hot-path BCL switch)

Symmetric byte-count helper for EncodeUtf8SinglePass, paired with writer-side BCL Encoding.UTF8.GetBytes / GetByteCount removal across all cold-path call sites. Utf8Transcoder.GetUtf8ByteCount(ReadOnlySpan<char>) SIMD impl (Vector512 / Vector256 / Vector128 / scalar tier hierarchy, 5-popcount closed-form aggregation handling chunk-split surrogate pairs correctly).

Implementation summary:

Utf8Transcoder.GetUtf8ByteCount SIMD impl with closed-form bytes = 3*N - ascii - c_lt_0x800 + highSur - 3*lowSur aggregation
Utf8TranscoderTests extended (29 new tests covering ASCII / Hungarian / CJK / emoji / boundary 0-64, plus surrogate-pair-split-across-SIMD-chunks regression coverage)
WriteStringUtf8Internal (BinarySerializationContext.cs:875) refactored from BCL two-pass to single-pass D-2 layout (worst-case length*4 allocate + EncodeUtf8SinglePass + VarUInt backfill); the 4× worst-case capacity is amortized by the buffer growth doubling strategy (Math.Max(buffer.Length*2, position+needed) + ArrayPool bucket-rounding to next power-of-2)
Cold path cleanup: AcBinarySerializer.AnalyzeStringInternCandidates (analysis log) and PropertyMetadataBase.NameUtf8 ctor-once init both migrated to Utf8Transcoder

Resolution

Landed 2026-05-06. All Utf8TranscoderTests pass (55/55). Binary test suite unchanged (222 pass / 13 pre-existing GuidIId failures, untouched).

Critical observation surfaced during the audit: WriteStringUtf8Internal has only one caller (WriteFixStrDirect), and WriteFixStrDirect itself is uncalled anywhere in the codebase — no core call site, no SourceGenerator template hit (verified against AcBinarySourceGenerator.cs line 706/724/1492/1514 — generator emits WriteStringGenerated and context.WriteStringUtf8 (the public 659-line method, not WriteStringUtf8Internal)), no test, no reflection path. The V4N3 implementation therefore landed cleanly but its hot-path benchmark impact is limited to the two cold-path init sites. Dead-code disposition tracked as ACCORE-BIN-T-V4N5.

Algorithmic correctness lesson — the initial 4-popcount formula (3*N - c_lt_0x80 - c_lt_0x800 - 2*highSur) was wrong on chunks where a surrogate pair straddles the SIMD chunk boundary (it implicitly assumed lowSur == highSur per chunk, which is true over the whole well-formed string but NOT per chunk). Fix: 5-popcount closed-form (3*N - ascii - c_lt_0x800 + highSur - 3*lowSur), with the scalar tail using the same per-char accounting model (i += 1 per char regardless of role; high → 4, low → 0, BMP → 3, two-byte → 2, ASCII → 1). Caught by GetUtf8ByteCount_MultipleEmojiBoundary_MatchesBcl and GetUtf8ByteCount_BoundaryAsciiToEmoji_MatchesBcl regression tests — exactly the prefixLen 1, 7 boundaries that exercise chunk-split surrogate pairs.
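The closed form can be checked with a scalar reference that uses plain counters in place of SIMD popcounts. This is an illustrative sketch, not the Utf8Transcoder implementation, and `Utf8ByteCountSketch` is a hypothetical name.

```csharp
using System;

// Hypothetical scalar reference for the closed-form byte count.
public static class Utf8ByteCountSketch
{
    public static int Count(ReadOnlySpan<char> s)
    {
        int n = s.Length, ascii = 0, lt800 = 0, highSur = 0, lowSur = 0;
        foreach (char c in s)
        {
            if (c < 0x80) ascii++;                           // also counted in lt800
            if (c < 0x800) lt800++;
            if (c >= 0xD800 && c <= 0xDBFF) highSur++;
            if (c >= 0xDC00 && c <= 0xDFFF) lowSur++;
        }
        // Per char: ASCII 3-1-1 = 1, two-byte 3-1 = 2, BMP 3,
        // high surrogate 3+1 = 4, low surrogate 3-3 = 0.
        return 3 * n - ascii - lt800 + highSur - 3 * lowSur;
    }
}
```

Because every char is accounted for on its own (high → 4, low → 0), the counts of two chunks can simply be added even when a surrogate pair is split between them.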

Superseded by `ACCORE-BIN-T-K7M3` (2026-05-08)

The V4N3 audit measured the custom transcoder against the legacy Encoding.UTF8.GetBytes API and won. It did NOT measure against the modern System.Text.Unicode.Utf8.FromUtf16 / Utf8.ToUtf16 static API (.NET 7+, used by MemoryPack source-gen). Once D9X3 stabilized the bench, a direct A/B revealed the BCL modern API outperforms the custom transcoder on every cell (Ser deficit -14 to -22pp, Deser flips from behind to ahead). All 8 hot-path call sites switched to BCL in K7M3. The Utf8Transcoder.cs file is fully commented out and preserved as a historical reference.

The V4N3 algorithmic correctness work (5-popcount surrogate-pair-split-across-chunks closed-form) remains a valid algorithmic contribution, but no longer load-bearing on the hot path.
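For reference, a minimal sketch of calling the span-based BCL transcoding API mentioned above. The `BclUtf8Sketch` wrapper and its buffer sizing are assumptions for illustration; the actual K7M3 call sites are not shown in this document.

```csharp
using System;
using System.Buffers;
using System.Text.Unicode;

// Hypothetical wrapper around the BCL span-based transcoder.
public static class BclUtf8Sketch
{
    // dest should hold up to 3 bytes per input char.
    public static int Encode(ReadOnlySpan<char> source, Span<byte> dest)
    {
        OperationStatus status = Utf8.FromUtf16(
            source, dest, out _, out int bytesWritten,
            replaceInvalidSequences: true, isFinalBlock: true);
        if (status != OperationStatus.Done)
            throw new InvalidOperationException(status.ToString());
        return bytesWritten;
    }

    public static string Decode(ReadOnlySpan<byte> source)
    {
        // UTF-16 never needs more code units than the UTF-8 input has bytes.
        char[] chars = new char[source.Length];
        OperationStatus status = Utf8.ToUtf16(source, chars, out _, out int charsWritten);
        if (status != OperationStatus.Done)
            throw new InvalidOperationException(status.ToString());
        return new string(chars, 0, charsWritten);
    }
}
```

Unlike `Encoding.UTF8.GetBytes`, these are static methods that report progress through an `OperationStatus`, so there is no encoder instance or fallback object involved in the call.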

ACCORE-BIN-T-V4N4: NativeAOT-specific inlining / codegen audit on hot UTF-8 path

Priority: P2 · Type: Performance · Status: Reverted (2026-05-07) — bench instability made the optimization signal unmeasurable · Related: EncodeUtf8SinglePass, DecodeUtf8SinglePass, WriteStringWithDispatch, Utf8Transcoder SIMD path

Hypothesis: NativeAOT (the benchmark target environment) does not match Tier 1 JIT optimization quality on the UTF-8 hot path, despite [MethodImpl(AggressiveInlining)] hints. Symptoms in 2026-05-05 / 2026-05-06 benchmarks:

Repeated cell: persistent 8-11% Compact deficit vs MemPack (Hungarian content + repeated string pattern)
Compact Ser/Deser cells: patchy results run-to-run (4-7/10 cell wins, 3-6 in noise/loss bands)
Per-method Compact speedups on the Medium/Large/Deep cells are consistent (-22% to -28% vs MemPack), which points to a JIT/AOT inlining difference on Repeated, where the WriteStringWithDispatch short lane is called many times on the 10× repeated strings

Suspect mechanisms (ranked by likelihood):

AOT inline budget. NativeAOT is more conservative than the Tier 1 JIT in respecting AggressiveInlining for large method bodies. EncodeUtf8SinglePass (~190 lines, 4 SIMD path + scalar), DecodeUtf8SinglePass (~120 lines), GetUtf8ByteCount (~120 lines) may exceed the AOT inline budget at hot call sites (WriteStringWithDispatch short-lane, ReadString decode callback). If the AOT compiler emits call <method> instead of inlining, every iteration of the Repeated 10-string loop pays the call overhead.
[Intrinsic] IsSupported constant folding. Avx512BW.IsSupported, Vector512.IsHardwareAccelerated, Vector256.IsHardwareAccelerated, Vector128.IsHardwareAccelerated should constant-fold per host on AOT. Verify via disasm — if any remain runtime checks, every iteration pays the branch cost (3 nested if-s in each Utf8Transcoder method).
Vector256.LessThan<ushort> unsigned compare emulation. No native pcmpltw_unsigned on AVX2; JIT/AOT lowers to pminuw + pcmpeqw. Cost amortized over many chars in long content but can dominate on short Hungarian runs (KözösCímke ~6 runs of 2-3 chars). Less likely if (1) holds, since the inlining hit dwarfs the per-instruction emulation cost.
Method size cascade. The Utf8Transcoder method bodies grew with the V4N3 GetUtf8ByteCount addition. Adjacent methods in the same source file may have lost inlining at SGen-generated callers due to AOT compilation-unit heuristics (file-locality affects inline cost models on some AOT codegen).

Investigation steps (no code changes — diagnostic phase first):

NativeAOT publish dump:

dotnet publish AyCode.Core.Serializers.Console -c Release -r win-x64 -p:PublishAot=true
dumpbin /disasm <output.exe> > disasm.txt

Locate EncodeUtf8SinglePass, DecodeUtf8SinglePass, GetUtf8ByteCount, CountUtf8Chars symbols in the disasm
Verify constant folding on IsSupported checks — no run-time CMP/JMP at the path-selector branches; the dead branches eliminated
Verify inlining at WriteStringWithDispatch / ReadString callers — if call <Utf8Transcoder.*> instructions remain, inlining failed
Method size inspection — large method bodies hint at inline-eligibility issues; large prologue/epilogue at hot call sites is a tell
Cross-compare with Tier 1 JIT disasm (run with DOTNET_TieredCompilation=0 + DOTNET_TC_QuickJit=0 to force Tier 1, dump the JIT-tier disasm via WinDbg or BenchmarkDotNet's [DisassemblyDiagnoser]) to confirm the gap is AOT-specific rather than algorithmic

Possible fixes (Open until disasm confirms which apply):

A. Method split — EncodeUtf8SinglePass → small dispatcher + per-tier inner methods (each Vector512 / Vector256 / Vector128 / scalar in its own AOT-inline-friendly small method). Same for DecodeUtf8SinglePass. The dispatcher stays small enough to inline at the hot call site; the dead-branch tier methods are never called on a given host.
B. [MethodImpl(NoInlining)] on cold tiers — paradox tactic that can REDUCE the hot-path code emitted at the call site by preventing the AOT from speculatively considering the dead branches as inlining candidates.
C. Per-target ISA build — if the benchmark environment has a fixed ISA (e.g. AVX2 baseline), use <IlcInstructionSet> in csproj to constant-fold the IsSupported checks at AOT compile time. Alternative: separate per-ISA AOT publish artifacts.
D. Manual hot-path inlining — for the Repeated cell, hand-inline EncodeUtf8SinglePass short-string lane into WriteStringWithDispatch FixStr path (≤31 byte case). Trades code-size for hot-path speed.
E. Algorithm change — if the AOT can't inline the SIMD bodies efficiently, a smaller scalar-only fast path for short strings (≤31 byte) bypassing the SIMD setup might be faster on AOT than on JIT (where Tier 1 is fine with the SIMD path inlined).
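Directions A, B and E share one shape: a small dispatcher with a scalar short-string lane, and a cold helper marked NoInlining. The sketch below illustrates that shape only. `SplitEncoderSketch` is a hypothetical name, the cold path uses `Encoding.UTF8.GetBytes` as a stand-in for the SIMD encoder, and whether the attributes are honored under NativeAOT has to be confirmed from the disasm rather than assumed.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Text;

// Hypothetical dispatcher + cold-helper split; not the AcBinary writer.
public static class SplitEncoderSketch
{
    // Small dispatcher: scalar ASCII lane for short inputs, everything else goes cold.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static int Encode(ReadOnlySpan<char> source, Span<byte> dest)
    {
        if (source.Length <= 31)
        {
            int i = 0;
            for (; i < source.Length; i++)
            {
                char c = source[i];
                if (c >= 0x80) break;      // non-ASCII: give up on the fast lane
                dest[i] = (byte)c;
            }
            if (i == source.Length) return i;
        }
        return EncodeCold(source, dest);   // restarts from the beginning
    }

    // Cold path kept out of line so it does not bloat every call site.
    [MethodImpl(MethodImplOptions.NoInlining)]
    private static int EncodeCold(ReadOnlySpan<char> source, Span<byte> dest)
        => Encoding.UTF8.GetBytes(source, dest);
}
```

The trade-off is code size at each call site versus call overhead, so a split like this only pays off if the inlined dispatcher stays small across all generic specializations and SGen call sites.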

Why P2

Repeated benchmark cell is the canonical witness for the i18n production deploy narrative — public NuGet release narrative depends on parity-or-better against MemPack across all cells (cloud / desktop / mobile / Blazor WASM)
AOT-specific tuning is high-leverage on the hot path — JIT-only optimizations will not match
Disasm validation is the prerequisite for any of the fix directions; without it, any change is speculative and risks reintroducing 2c-style regression

Acceptance

Disasm report confirms (or refutes) inlining + constant-fold hypotheses on the hot UTF-8 path
If hypotheses confirmed: the chosen fix delivers Repeated Compact Ser+Deser ratio ≤ 1.0 vs MemPack on the AOT benchmark target
No regression on Small / Medium / Large / Deep cells (or net positive)
Fix maintains cross-tier SIMD correctness (round-trip tests pass on all UTF-8 content classes); both Utf8TranscoderTests and the binary test suite stay green

Trigger

Pre-NuGet release: i18n claim cannot ship with an 8-11% gap on a representative cell
Disasm + bench correlation step before any code change (no speculative refactoring)

Resolution

Audit + targeted fix landed 2026-05-07.

Step 1: disasm analysis (disasm.txt, ~90 MB AOT-publish output):

✅ Avx512BW.IsSupported / Vector{N}.IsHardwareAccelerated are constant-folded: only 4 runtime checks remain in the entire binary (1 body + 3 call sites, all outside the Utf8Transcoder hot path). The AOT compiler eliminated the dead branches according to the target ISA.
✅ Reader tier-marker dispatch (ReadStringSmall/Medium/Big) was inlined into the TypeReaderTable lambda-class static init: 0 method-call overhead on the tiers.
⚠️ WriteStringWithDispatch was NOT inlined: 3 generic specializations (<ArrayBinaryOutput>, <AsyncPipeWriterOutput>, <BufferWriterBinaryOutput>), each with a separate method body, plus 14+ call <method> instructions in the <ArrayBinaryOutput> body (similar volume in the other 2 specializations). Method size is ~190 lines, which exceeds the AOT inline budget.
⚠️ ReadStringUtf8WithCharLen was NOT inlined: it has its own body and many call sites.
❓ → ✅ string.Create callback __DelegateCtor: the disasm shows a test static; jne skip ctor pattern, i.e. a cached static lambda with lazy init. 0 hot-path overhead (no per-call allocation).

Step 2: method-split experiment (15:13:39 bench):

Writer split: dispatcher ([AggressiveInlining]) + WriteStringSmallFast ([AggressiveInlining]) + WriteStringDispatchLong ([NoInlining]) + WriteStringFastWire ([NoInlining])
Reader split: dispatcher ([AggressiveInlining]) + ReadStringUtf8WithCharLenCore ([NoInlining])
Bench: regression. Small Ser +29.6 pp, Repeated Ser +8.9 pp, Small Deser +16.6 pp.
According to the disasm, the dispatcher + SmallFast were inlined (the body symbols disappeared), which caused code bloat: 3 generic specializations × ~30-50 SGen call sites × ~45 lines of inlined code = i-cache pressure on the Repeated cell hot loop. The reader-side dispatcher was NOT inlined (the [AggressiveInlining] hint had no effect), so the split only added +1 call instruction.

Step 3: refined fix (15:29:21 bench, Closed):

WriteStringWithDispatch dispatcher: NO inline hint (left to the compiler, which is more stable under AOT)
WriteStringSmallFast: [NoInlining] (the code bloat is gone; the call overhead remains, but it is a structurally dedicated method)
WriteStringDispatchLong + WriteStringFastWire: [NoInlining] cold path (kept)
ReadStringUtf8WithCharLen + ReadStringUtf8WithCharLenCore merged back into a single method (the split did not pay off; the +1 call is gone)

Bench (15:29:21) Compact vs MemPack ratios:

Ser: Small 0.915 (-8.5%), Medium 0.989 (≈), Large 0.915 (-8.5%), Repeated 1.019 (≈), Deep 0.981 (-1.9%) → 5/5 cells at parity or better
Deser: Small 1.101 (+10.1%), Medium 0.953 (-4.7%), Large 0.894 (-10.6%), Repeated 0.962 (-3.8%), Deep 0.899 (-10.1%) → 4/5 cell wins, only Small at +10%
Wire: 5/5 cells -8% to -11% smaller than MemPack

Lessons learned:

Under AOT, [AggressiveInlining] is not guaranteed: the writer dispatcher + SmallFast were inlined (code bloat), but the reader dispatcher was NOT (the hint had no effect). Leaving the decision to the compiler (no hint) is more stable.
Method split does not always win: overly aggressive inlining can cause code bloat (i-cache pressure), especially with many SGen call sites.
__DelegateCtor is cached: the string.Create callback is not a source of hot-path overhead.
Structure preserved: WriteStringDispatchLong and WriteStringFastWire remain separate cold methods (a basis for later targeted optimization).

Remaining gap: Small Deser +10%. This is small in absolute terms (~1 µs) and not a release blocker. The ReadStringUtf8WithCharLen body is sizeable (single method, ~15 lines + lambda state), at the edge of the AOT inline budget. It can be optimized further in the V4N2 or W2C8 sprint.

Reverted (2026-05-07)

The V4N4 method split has been reverted, both the 15:13:39 (AggressiveInlining) regression version and the 15:29:21 (NoInlining-on-SmallFast) refined version. Subsequent benchmark runs (15:29:21 → 15:56:54 → ...) showed drastic run-to-run variance on the same code: the AOT codegen file-locality / inline-cost model is sensitive to body-size changes in Utf8Transcoder.cs, and the noise floor masks the method split's presumed +1-3% Ser gain.

The revert restores the single-method WriteStringWithDispatch state (matches the 09:39:09 baseline). Retained elements:

A-direction packed-header stores (Unsafe.WriteUnaligned<ushort/uint/ulong> on the Small/Medium/Big tiers): an instruction-level optimization, unaffected by the AOT variance
Overflow guard (O7G2, ThrowStringTooLong): defensive, separate feature

The V4N4 audit conclusions remain valid (constant-fold OK, reader tier-readers inlined into the TypeReaderTable lambda-class static init, __DelegateCtor cached). The AOT inline-pressure analysis remains relevant documentation; only the method split as a fix was not measurably positive.

Lesson learned: bench-driven optimization can only be validated when the noise floor is below the expected signal. Under AOT the bench noise is significant (~5-15 pp run-to-run), which masks +1-3% perf claims. Profile-driven optimization (CPU profile + flame graph + code-cache miss measurement) would be the next step if inlining pressure remains a material gap.
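
The "noise floor below expected signal" rule can be turned into a quick sanity check before trusting a small delta. The sketch below is an assumption-laden illustration: the function name, the min-max spread as the noise estimate, and the sample numbers are all hypothetical, and a real analysis would use a proper statistical test.

```csharp
using System;
using System.Linq;

var baseline = new[] { 100.0, 110.0, 95.0 };  // hypothetical ns/op: same code, three runs
var candidate = new[] { 98.0, 99.0, 97.0 };   // hypothetical ns/op after a change
Console.WriteLine(IsSignalAboveNoise(baseline, candidate)); // False

// True only when the mean shift exceeds the baseline's own run-to-run spread.
static bool IsSignalAboveNoise(double[] baselineRuns, double[] candidateRuns)
{
    var baseMean = baselineRuns.Average();
    // Noise floor: relative min-max spread of repeated runs on unchanged code.
    var noiseFloor = (baselineRuns.Max() - baselineRuns.Min()) / baseMean;
    // Signal: relative shift of the candidate mean against the baseline mean.
    var signal = Math.Abs(candidateRuns.Average() - baseMean) / baseMean;
    return signal > noiseFloor;
}
```

With a ~15% spread on the baseline, a ~3.6% shift is not distinguishable from noise, which matches the situation described above.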

Re-evaluable as of 2026-05-07 per ACCORE-BIN-T-D9X3 — bench stabilization removes the noise-floor that made the original signal unmeasurable; retest before any code change.

Obsoleted (2026-05-08) by ACCORE-BIN-T-K7M3 — the writer hot path no longer calls the custom EncodeUtf8SinglePass at all (WriteStringWithDispatch was switched to Utf8.FromUtf16 BCL). The "AOT method-split / inlining audit" target (Utf8Transcoder body method-size in NativeAOT inline budget) is moot — the BCL Utf8.FromUtf16 is a single static method with its own AOT-friendly inline footprint, and the audit's hypothesis space (Vector256 IsSupported constant-fold, lambda delegate cache) was correct for the prior code but no longer applies. The V4N4 disasm methodology remains a valid technique for future investigations of generic specialization / inline failures, but the specific hot-path target it analyzed is gone.

ACCORE-BIN-T-J5L9: Remove dead `WriteFixStrDirect` / `WriteStringUtf8Internal` (audit-surfaced uncalled methods)

Priority: P3 · Type: Refactor / hygiene · Status: Closed (2026-05-06) · Related: BinarySerializationContext.cs

V4N3 audit surfaced two methods with no callers in the entire workspace:

WriteFixStrDirect(string) — public method, no call site (no core, no SourceGenerator template, no test, no reflection / Expression-compile)
WriteStringUtf8Internal(string) — private method called only from WriteFixStrDirect's non-ASCII fallback branch

The pair forms a closed dead loop (WriteFixStrDirect → WriteStringUtf8Internal), but no entry point reaches WriteFixStrDirect. The public-API WriteStringUtf8 (line 659) is the live equivalent and is called from the SourceGenerator template (polymorphism path: assembly-qualified type-name write). The hot-path string-write goes through WriteStringWithDispatch (line 734) which uses the M3R7 marker-dispatch — NOT through this dead pair.

Disposition options (decide pre-NuGet release)

Delete both methods — pure dead-code cleanup; reduces public surface, removes maintenance burden, simplifies onboarding. Functionality is fully covered by WriteStringWithDispatch (M3R7 marker-dispatch — emits FixStr / FixStrAscii directly with proper ASCII detection via bytesWritten == charLength after EncodeUtf8SinglePass).
Activate WriteFixStrDirect for property-name writes — SGen could emit WriteFixStrDirect(propName) instead of WriteStringWithDispatch(propName) for known-short, often-ASCII property names — saving the marker-dispatch overhead. Requires SGen template change + benchmark validation that the saving is real (likely marginal — property names are typically <31 char ASCII, so M3R7 already takes the FixStrAscii fast path with one byte-write to _buffer). The pre-encoded NameUtf8 byte[] on PropertyMetadataBase already provides a faster path (WriteFixStrBytes at line 853) which the SGen / runtime writer could use directly.
Defer — leave as-is, document as dead code, revisit when the codebase has another reason to touch this area.

Why P3

No correctness or perf impact in either direction (dead code is dead — no consumer affected)
Cleanup vs activation is a low-stakes choice; benchmark would decide if option 2 has real saving
Surfaced during V4N3 work, not blocking the NuGet release

Acceptance

Decision recorded (delete / activate / defer) with rationale
If "delete": grep across workspace confirms zero callers post-removal; binary test suite unchanged (still 222 pass / 13 pre-existing failures, 235 total)
If "activate": SGen template change + benchmark validation showing ≥ 2% Ser improvement on a representative cell (otherwise revert to "delete")
Documentation in BINARY_IMPLEMENTATION.md updated (or remove the old reference if both methods deleted)

Trigger

Pre-NuGet release housekeeping pass
Or: any future refactor that touches BinarySerializationContext string-write methods (then decide rather than leave the dead pair behind)

Resolution

Disposition: Delete (Option 1). Landed 2026-05-06 together with the H2Q6 marker reorg commit. Five dead methods removed in a single cleanup pass:

WriteFixStrDirect(string) — uncalled public method
WriteStringUtf8Internal(string) — uncalled private method (only called from WriteFixStrDirect)
WriteFixStr(string) — uncalled public method (audit surfaced; was originally listed as live)
WriteFixStrBytes(ReadOnlySpan<byte>) — uncalled public method (audit surfaced)
WritePreencodedPropertyName(ReadOnlySpan<byte>) — uncalled public method (audit surfaced)

All five had zero call sites across core, SourceGenerator template, tests, and reflection. The hot-path string write continues through WriteStringWithDispatch (M3R7 + H2Q6 marker dispatch) and WriteStringInternFirstWithDispatch (interning tier dispatch). Public surface reduced; binary test suite unchanged (222 pass / 13 pre-existing GuidIId failures).

ACCORE-BIN-T-L9Y3: FixArray marker tier — short-list count encoded in marker

Priority: P3 · Type: Wire-format optimization · Status: Open · Related: Array (66) marker, VarUInt itemCount, ACCORE-BIN-T-H2Q6 marker reservation

Analogous to FixStr: a short list count (0-15) is encoded in the marker, eliminating the VarUInt itemCount byte for typical DTO collections (Tags, Categories, Items, Properties, Variations, etc.; any list whose size statistically lands in the 0-15 range).

Wire format

Current: [Array marker:1][VarUInt itemCount][items] (header 2-6 bytes)
FixArray: [FixArrayBase + N marker:1][items] (header 1 byte; N = item count, 0-15)

Writer dispatch (in WriteArray / scan-pass list-writer equivalents):

itemCount ≤ 15 → FixArrayBase + itemCount marker (1 byte total header)
itemCount > 15 → existing Array marker + VarUInt count (2-6 byte total header)

Marker reservation

16 marker values pre-reserved in the post-H2Q6 marker layout (see ACCORE-BIN-T-H2Q6 "Marker address space reservation" table). The reservation guarantees that activating FixArray does NOT require another wire-format-version bump after H2Q6 lands at v3 — producers opt in to emit FixArray markers within the same v3 envelope, consumers extend their dispatch to decode them.

Activation steps when implementing:

Allocate FixArrayBase (16 contiguous values from the H2Q6-freed range)
Add IsFixArray(byte marker), DecodeFixArrayCount(byte marker), EncodeFixArray(int count) helpers in BinaryTypeCode.cs
Writer: branch in WriteArray and equivalent ScanPass list-writers, emit FixArray for count ≤ 15
Reader: extend marker dispatch in ReadValue / SkipValue / ReadArray
SGen: regenerate readers/writers with IsFixArray dispatch in the array-typed property paths
Round-trip tests for boundary itemCount values: 0, 1, 14, 15, 16, 17 (last tier transition)
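
The helper trio and the writer branch from the steps above could look roughly like this. It is a sketch under stated assumptions: FixArrayBase = 112 is a made-up placeholder (the real value would come from the H2Q6 reservation table), and the VarUInt encoding is assumed to be the common 7-bit continuation scheme, which may not match the project's actual format.

```csharp
using System;

const byte ArrayMarker = 66;      // existing Array marker, as quoted in this entry
const byte FixArrayBase = 112;    // HYPOTHETICAL placeholder, not the reserved value
const byte FixArrayMax = FixArrayBase + 15;

var header = new byte[6];
Console.WriteLine(WriteArrayHeader(header, 15)); // 1
Console.WriteLine(WriteArrayHeader(header, 16)); // 2

bool IsFixArray(byte marker) => marker >= FixArrayBase && marker <= FixArrayMax;

int DecodeFixArrayCount(byte marker) => marker - FixArrayBase;

byte EncodeFixArray(int count)
{
    if ((uint)count > 15) throw new ArgumentOutOfRangeException(nameof(count));
    return (byte)(FixArrayBase + count);
}

// Writes the list header and returns its length in bytes.
int WriteArrayHeader(Span<byte> dest, int itemCount)
{
    if (itemCount <= 15)
    {
        dest[0] = EncodeFixArray(itemCount);   // count lives in the marker itself
        return 1;
    }
    dest[0] = ArrayMarker;                      // existing path: marker + VarUInt count
    var n = 1;
    var v = (uint)itemCount;
    while (v >= 0x80) { dest[n++] = (byte)(v | 0x80); v >>= 7; }  // assumed 7-bit VarUInt
    dest[n++] = (byte)v;
    return n;
}
```

The boundary values 15 and 16 exercise the tier transition that the round-trip tests are meant to cover.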

Why P3

Wire saving: -1 byte per short list. Realistic per-cell estimates:
- Repeated (10 OrderItem, ~50 lists overall): ~50 bytes / 28 KB = ~0.18% wire reduction (marginal)
- Large (5×5×5×10 nested, ~6000 lists): ~6 KB / 118 KB = ~5% wire reduction ✓
- Medium: ~500 bytes / 21 KB = ~2.4% wire reduction
- Deep (2×4×4×8 nested): similar to Medium, ~2-3% wire reduction
CPU saving: marginal (~1-2 ns/list; the VarUInt short loop is replaced by a 1-byte marker decode). NOT a hot-path mover for the current Repeated-cell baseline gap.
Release-narrative value: complements the post-H2Q6 wire-size advantage, particularly on deep-nested structures (Large benchmark). Sharpens the "smallest AND fastest" claim once the CPU gap closes via V4N2 Phase 3 + V4N4.

Why not P2/P1 — and why not now

The current 2026-05-06_13-10-30.LLM baseline's primary problem is CPU (Compact +5-25% slower than MemPack on every cell), NOT wire size. FixArray addresses wire size, marginal CPU.
Activation after H2Q6 + V4N2 Phase 3 + V4N4 is the natural sequence: CPU gap closes first, then wire-saver features sharpen the release narrative.
The marker reservation lets us defer activation indefinitely without losing the address-space slot.

Acceptance

16 marker values allocated in BinaryTypeCode.cs (FixArrayBase..FixArrayMax) with IsFixArray, DecodeFixArrayCount, EncodeFixArray helpers
Writer + reader dispatch with boundary tests (count = 0, 1, 14, 15, 16, 17)
SGen-regenerated readers/writers correctly dispatch via IsFixArray for array-typed properties
Round-trip tests pass, no Ser/Deser regression vs current Array path
Wire-size benchmark: at least 2% reduction on Medium, at least 3% on Deep, at least 4% on Large; no regression on any cell
Documentation update in BINARY_FORMAT.md (new marker range + dispatch rules)

Trigger

After ACCORE-BIN-T-H2Q6 lands (marker reservation must be active first)
After CPU gap closes (V4N2 Phase 3 + V4N4) — wire-saver value clearer once "fast" is settled
Pre-NuGet release housekeeping for the wire-size narrative (along with S5L8 / S2X9 if their scope justifies)

Future extension (not part of this entry)

FixDict analog — same pattern for Dictionary marker (67) with kvCount 0-15. Worth considering only if a benchmark workload demonstrates dictionary-heavy structures; the current bench data (Order DTOs) does not. Defer until evidence.
FixArray 0-31: wider count range (32 markers). Marginal additional saving (lists of 16-31 elements are rare); would consume nearly all freed marker space, leaving no slack for S5L8/S2X9. Reject unless evidence warrants.

ACCORE-BIN-T-O7G2: Overflow guard on `charLength * 4` writer arithmetic + corrupted-wire `ReadStringBig`

Priority: P3 · Type: Defensive / safety · Status: Closed (2026-05-06) · Related: WriteStringWithDispatch, WriteStringInternFirstWithDispatch, ReadStringBig, BinaryTypeCode.MaxStringCharLength

Defensive guards covering two latent failure modes in the H2Q6 string serialization paths:

Writer overflow (silent zero corruption) — charLength * 4 overflows int when charLength > 0x1FFFFFFF (~537M). At exactly 0x40000000 chars the multiplication wraps to 0, causing:

EnsureCapacity(reserveHeader + 0) to silently succeed (no buffer growth)
EncodeUtf8SinglePass(value, emptySpan) to write 0 bytes, returning bytesWritten = 0
The H2Q6 tier choice picks Small (bytesWritten ≤ 255), writing [StringSmall][0][0] to the wire
The string content is lost silently — no exception, wire claims an empty string

Other overflow values (e.g. charLength = 600M → maxBytes becomes negative) eventually surface as ArgumentOutOfRangeException from Span.AsSpan(start, length), but the message ("length cannot be negative") is misleading and arrives after the buffer has already been partially mutated.

Reader corrupted wire (negative cast from oversized uint) — in ReadStringBig, the wire-side charLen:32 and utf8Len:32 are read as uint, then cast to int. Corrupted or maliciously-crafted payloads with values > Int32.MaxValue produce negative ints, leading to string.Create(negative, ...) exceptions or position-state desync — at best a misleading message, at worst a partial decode with wire-position shifted incorrectly.
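
Both failure modes come down to plain 32-bit wrap-around, which a few lines can reproduce. This is an illustrative sketch only; MaxUtf8Bytes and CastWireLength are hypothetical names standing in for the writer's size estimate and the reader's uint-to-int cast.

```csharp
using System;

Console.WriteLine(MaxUtf8Bytes(0x1FFFFFFF));       // 2147483644 (largest input that still fits)
Console.WriteLine(MaxUtf8Bytes(0x20000000));       // -2147483648 (first overflowing input)
Console.WriteLine(MaxUtf8Bytes(600_000_000));      // -1894967296
Console.WriteLine(MaxUtf8Bytes(0x40000000));       // 0 (the silent-corruption case)
Console.WriteLine(CastWireLength(3_000_000_000u)); // -1294967296

// Worst-case UTF-8 size estimate; unchecked makes the default wrap-around explicit.
static int MaxUtf8Bytes(int charLength) => unchecked(charLength * 4);

// A wire-side uint above Int32.MaxValue turns negative when cast to int.
static int CastWireLength(uint wireValue) => unchecked((int)wireValue);
```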

Resolution

Landed 2026-05-06 (this commit window).

Writer side — WriteStringWithDispatch and WriteStringInternFirstWithDispatch each gain one method-entry guard:

var charLength = value.Length;
if ((uint)charLength > BinaryTypeCode.MaxStringCharLength) ThrowStringTooLong(charLength);

A single unsigned compare catches the overflow band and is branch-predictor friendly (always false on realistic input). The throw helper is [MethodImpl(MethodImplOptions.NoInlining)] so the JIT/AOT keeps the throw site out of the inlined hot path. The same charLength value is reused across the FastWire and Compact branches, so no duplicate guard is needed.

Reader side — ReadStringBig gains a single bitwise-OR + sign-test:

var packed = context.ReadUInt64Unsafe();
var charLength = (int)(uint)packed;
var byteLength = (int)(uint)(packed >> 32);
if ((charLength | byteLength) < 0) ThrowCorruptedBigWire(charLength, byteLength);

The OR + sign-test catches negative casts: any wire-side uint > Int32.MaxValue produces a negative int after the cast, while the OR of two non-negative ints is non-negative, so one sign test covers both fields. Effectively one extra instruction, and branch-predictor friendly.
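
The two guards can be exercised in isolation. The sketch below copies their logic into free-standing functions with hypothetical names (IsTooLong, IsCorruptedBigHeader) that return a bool instead of calling the project's throw helpers.

```csharp
using System;

Console.WriteLine(IsTooLong(0x1FFFFFFF));                           // False
Console.WriteLine(IsTooLong(0x40000000));                           // True
Console.WriteLine(IsCorruptedBigHeader(0x8000_0000UL, out _, out _)); // True

// Writer guard: one unsigned compare rejects every length whose * 4 would overflow.
static bool IsTooLong(int charLength)
{
    const uint MaxStringCharLength = 0x1FFFFFFF; // value quoted in this entry
    return (uint)charLength > MaxStringCharLength;
}

// Reader guard: split the packed 64-bit header, then OR + sign-test both halves.
static bool IsCorruptedBigHeader(ulong packed, out int charLength, out int byteLength)
{
    charLength = unchecked((int)(uint)packed);          // low 32 bits
    byteLength = unchecked((int)(uint)(packed >> 32));  // high 32 bits
    return (charLength | byteLength) < 0;               // negative if either half is
}
```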

New constant: BinaryTypeCode.MaxStringCharLength = 0x1FFFFFFF (536_870_911 — largest charLength where charLength * 4 fits in int).

Hot-path cost: ~0% on realistic input. The writer pays a single unsigned compare; the reader Big tier pays a single OR + sign-test (Small/Medium readers are untouched since their wire values are bounded by byte / ushort types and cannot overflow). The NoInlining throw helpers keep the inlined caller body compact. Tests: 222 pass / 13 pre-existing failures, unchanged.

Why P3

No correctness impact for realistic inputs (the overflow band is far outside any real DTO scenario)
Defensive value: prevents silent data loss in the charLength = 1.07G zero-overflow edge case + provides clear error messages on out-of-range inputs
Security value: corrupted/malicious wire payloads on the reader Big tier path are now caught early instead of producing inconsistent position state
NuGet release professional-quality signal — explicit, defensive guards over silent-corruption paths

ACCORE-BIN-T-S6F2: Shift-free Small fast path in `WriteStringWithDispatch`

Priority: P3 · Type: Performance · Status: Reverted (2026-05-07, with V4N4 method-split) · Related: WriteStringWithDispatch, BinaryTypeCode.StringSmall, ACCORE-BIN-T-V4N4

The H2Q6 writer's post-encode tier choice runs a 3-way switch (bytesWritten ≤ 255 → StringSmall, ≤ 65535 → StringMedium, else StringBig) and a header-write switch (3 / 5 / 9 byte) for every non-ASCII string. On the Repeated benchmark cell (Hungarian content, dominated by ~10-15 char strings) 99%+ of writes resolve to StringSmall, and the outcome of the 3-way switch is already determined by charLength ≤ 63 alone (worst case charLength * 4 ≤ 252 ≤ 255 ⇒ Small tier guaranteed).

A specialized fast path for charLength ≤ 63 could eliminate:

The int actualHeader; byte tierMarker; runtime-resolved variables
The 3-way bytesWritten switch
The 3-way actualHeader header-write switch
The shift = reserveHeader - actualHeader compute (always 0 in this branch)

Sketch:

if (charLength <= 63)
{
    EnsureCapacity(3 + charLength * 4);
    var savedPos = _position;
    var encodeStart = savedPos + 3;
    var bytesWritten = Utf8Transcoder.EncodeUtf8SinglePass(value.AsSpan(), _buffer.AsSpan(encodeStart, charLength * 4));
    if (bytesWritten == charLength) { /* ASCII override — FixStrAscii inline */ }
    else
    {
        // StringSmall — 0 shift, inline header write (constant-folded)
        _buffer[savedPos] = BinaryTypeCode.StringSmall;
        Unsafe.WriteUnaligned<ushort>(ref _buffer[savedPos + 1],
            (ushort)(charLength | (bytesWritten << 8)));
        _position = savedPos + 3 + bytesWritten;
    }
    return;
}
// charLength > 63 → fall through to existing post-encode tier dispatch
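
The packed header store in the sketch can be checked on its own. The version below is a hedged illustration: WriteSmallHeader is a hypothetical helper, and it uses BinaryPrimitives.WriteUInt16LittleEndian to make the byte order explicit, which yields the same bytes as Unsafe.WriteUnaligned<ushort> on a little-endian host.

```csharp
using System;
using System.Buffers.Binary;

const byte StringSmall = 91; // marker value as quoted in the F3W6 entry

var header = new byte[3];
WriteSmallHeader(header, charLength: 10, bytesWritten: 14);
Console.WriteLine($"{header[0]} {header[1]} {header[2]}"); // 91 10 14

// Packs charLen into the low byte and utf8Len into the high byte of one 16-bit store.
void WriteSmallHeader(Span<byte> dest, int charLength, int bytesWritten)
{
    dest[0] = StringSmall;
    BinaryPrimitives.WriteUInt16LittleEndian(
        dest.Slice(1, 2),
        (ushort)(charLength | (bytesWritten << 8)));
}
```

Both fields must fit in one byte for this packing to be valid, which is what the charLength ≤ 63 bound guarantees (63 * 4 = 252 ≤ 255).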

Why P3

Repeated cell hot path benefit (~99% of writes on Hungarian content are charLength ≤ 63)
Estimated +1-3% Ser improvement on Repeated/Medium cells (where short non-ASCII strings dominate)
Constant-folded tier choice + inline header write — no branch overhead vs. the generic post-encode path
Trade-off: ~30 lines of duplicated specialized code; the generic post-encode path remains for charLength > 63 long-string scenarios

Acceptance

WriteStringWithDispatch Small fast path emits identical wire bytes as the generic path for charLength ≤ 63 (round-trip parity)
Benchmark on Repeated/Medium cells shows ≥ 1% Ser improvement vs. post-A-direction baseline (2026-05-07_09-39-09.LLM or later)
No regression on Large/Deep cells (long-string path untouched)
Round-trip tests pass on the boundary charLength = 63 and charLength = 64 cases

Trigger

After A-direction (header pack-write) bench result is conclusive
Pre-NuGet release if the Repeated cell Compact-vs-MemPack Ser ratio still has measurable headroom

Resolution

Implemented as part of ACCORE-BIN-T-V4N4 (2026-05-07): one member of the 4-method WriteStringWithDispatch split is WriteStringSmallFast, which is exactly the S6F2 fast path. The 0-shift non-ASCII branch is guaranteed (charLength ≤ 63 ⇒ bytesWritten ≤ 252 ≤ 255 ⇒ Small tier certain, reserveHeader = actualHeader = 3).

Inlining-strategy lesson (from the V4N4 disasm): WriteStringSmallFast was marked [NoInlining] in the final version. The [AggressiveInlining] experiment caused code bloat (3 generic specs × 30+ SGen call sites × inlined body = i-cache pressure on the Repeated cell hot loop, +29.6 pp Small Ser regression in the 15:13:39 bench). With [NoInlining] the S6F2 logic still applies (constant-folded tier choice, 0 shift), at the cost of only +1 call instruction.

Bench (15:29:21): Compact Ser is at parity or better vs MemPack on 5/5 cells (Small -8.5%, Medium -1.1%, Large -8.5%, Repeated +1.9%, Deep -1.9%). The expected S6F2 +1-3% Ser improvement materialized on the Small/Large cells; Repeated/Deep are near parity (the +1 call overhead offsets the fast-path gain on short Hungarian strings).

Re-evaluable as of 2026-05-07 per ACCORE-BIN-T-D9X3 — together with the parent V4N4 method-split, the Small fast path is re-testable now that bench stabilization removes the noise-floor; retest before any code change.

ACCORE-BIN-T-W2C8: Maximize the H2Q6 win on the WASM string-cache path (`ReadStringUtf8Cached` MISS path)

Priority: P2 (WASM target) / P3 (otherwise) · Type: Performance · Related: BinaryDeserializationContext.Read.cs::ReadStringUtf8Cached, ReadStringUtf8WithCharLen, Utf8Transcoder.DecodeUtf8SinglePass

H2Q6's primary win is 1-pass decode on the reader side: tier markers carry both charLen and utf8Len, so the reader allocates the target string with the known char count and decodes in a single pass via string.Create(charLength, ..., DecodeUtf8SinglePass). This eliminates the CountUtf8Chars Pass 1 — the headline V4N3/H2Q6 win.

The WASM string-cache path bypasses this win. When _useStringCaching is true (Blazor WASM target), ReadStringUtf8WithCharLen dispatches to ReadStringUtf8Cached(byteLength) for short strings. On a cache HIT, the cached instance is returned (zero decode, already optimal). On a cache MISS, the current ReadStringUtf8Cached falls back to Utf8NoBom.GetString(slice), the two-pass BCL UTF-8 decoder. The H2Q6 1-pass decode benefit is lost on every cache MISS.

Per-cell impact estimate on a WASM workload with hot-path strings (typical Blazor SignalR DTO traffic):

Cache HIT rate ~30-50% on repeated property names + tags + categories
Cache MISS rate ~50-70% on first occurrences + unique values
MISS path = Utf8NoBom.GetString BCL call (virtual dispatch + DecoderFallback overhead) instead of string.Create(charLength, ..., DecodeUtf8SinglePass)

Implementation outline

ReadStringUtf8Cached accepts both charLength and byteLength (or computes charLength from the cache check / decode result).
Cache HIT: cached.Length == charLength invariant check (UTF-16 char count, not UTF-8 byte count) + ASCII verification.
Cache MISS: replace Utf8NoBom.GetString(slice) with string.Create(charLength, (Buffer, Pos, Len), static (chars, state) => DecodeUtf8SinglePass(state.Buffer.AsSpan(state.Pos, state.Len), chars)).

Cross-check: the existing ComputeStringHashFull(slice) and VerifyAsciiUtf8Match(cached, slice) operate on the raw UTF-8 bytes — these stay unchanged. Only the MISS-side string materialization needs the H2Q6-aware refactor.
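
The MISS-side materialization could be sketched as below. This is not the project's code: DecodeKnownLength is a hypothetical name, and Encoding.UTF8.GetChars stands in for the project's DecodeUtf8SinglePass, which is not reproduced here. What the sketch does show is the shape of the call: the string is allocated once at the known char count and filled in place.

```csharp
using System;
using System.Text;

var payload = Encoding.UTF8.GetBytes("xxárvíz");   // two leading bytes to skip
var text = DecodeKnownLength(payload, pos: 2, byteLength: payload.Length - 2, charLength: 5);
Console.WriteLine(text == "árvíz"); // True

// Allocates the target string at its final length and decodes directly into it.
static string DecodeKnownLength(byte[] buffer, int pos, int byteLength, int charLength)
{
    return string.Create(
        charLength,
        (Buffer: buffer, Pos: pos, Len: byteLength),
        static (chars, state) =>
        {
            // Stand-in decoder (assumption): the BCL call replaces DecodeUtf8SinglePass here.
            Encoding.UTF8.GetChars(state.Buffer.AsSpan(state.Pos, state.Len), chars);
        });
}
```

The static lambda plus tuple state avoids a closure allocation per call, which is the same property the outline relies on.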

Why P2 (WASM-target) / P3 (otherwise)

The non-WASM benchmark host (x64) doesn't enable _useStringCaching by default, so this optimization is invisible on the current bench
On Blazor WASM, all interning + repeated-string-cached deserialization currently pays the BCL decode tax on cache MISS
Estimated +5-15% Deser improvement on WASM workloads with significant cache MISS rate
Direct extension of the H2Q6 win to the WASM execution profile

Acceptance

ReadStringUtf8Cached cache MISS path uses string.Create(charLength, ..., DecodeUtf8SinglePass) — no BCL Utf8NoBom.GetString on MISS
Round-trip tests pass on cached + uncached short-string scenarios across all UTF-8 content classes (ASCII / Hungarian / CJK / emoji)
WASM-target benchmark (Blazor profile) shows ≥ 5% Deser improvement vs. pre-W2C8 state on a representative hot-string-heavy DTO workload
Cache HIT path performance unchanged (already optimal — no decode)
Cache eviction / capacity behavior unchanged

Trigger

Pre-NuGet release if Blazor WASM is a primary supported scenario in the release narrative
Or: when a WASM-focused benchmark workload becomes the active perf measurement target

ACCORE-BIN-T-F3W6: Dedicated FastWire string marker (split mode-shared `StringSmall`)

Priority: P3 · Type: Performance · Related: WriteStringWithDispatch FastWire branch, ReadStringSmall FastWire branch, BinaryTypeCode.StringSmall, H2Q6 marker reservation

The H2Q6 marker layout currently shares StringSmall (=91) between Compact and FastWire modes:

Compact emits [91][charLen:8][utf8Len:8][UTF-8 bytes]
FastWire emits [91][VarUInt charCount][UTF-16 raw bytes]

The reader dispatches on context.FastWire inside ReadStringSmall. Correct (the deserializer's mode is fixed per operation), but the mode-shared marker forces runtime branching at hot points:

Writer: if (FastWire) at the top of WriteStringWithDispatch runs on every string write — runtime check on a path-dominant (Compact) call site
Reader: if (context.FastWire) inside ReadStringSmall runs on every short non-ASCII string deserialization — Compact-side waste
SGen template: every regenerated reader contains the FastWire-aware case StringSmall: block (more code per type, larger AOT binary)
JIT/AOT inlining: the larger WriteStringWithDispatch / ReadStringSmall method bodies may exceed inline budgets at hot call sites — particularly under NativeAOT

A dedicated StringFastWire marker (one value from the H2Q6-freed 106-134 range — proposed allocation: 131) splits the path:

Compact stays on StringSmall (=91) → ReadStringSmall becomes Compact-only (no if (FastWire) branch, smaller method body)
FastWire uses new StringFastWire → dedicated ReadStringFastWire reader, FastWire-only logic
Writer's FastWire branch emits StringFastWire instead of StringSmall

Wire format compatibility

The marker swap is internally consistent within the v3 envelope — producers that opt in to the dedicated FastWire marker emit it; readers expanded to handle both StringSmall and StringFastWire (transitional). Once all producers emit the dedicated marker, the old mode-shared dispatch in ReadStringSmall can be removed.

Why P3: "every small % counts"

Estimated +0.5-1% Ser (writer branch elimination on Compact path)
Estimated +0.5-1% Deser (reader smaller method body, better JIT/AOT inline-eligibility on Compact path; FastWire reader gets a tight dedicated path too)
Compounds with other micro-opts across the hot path — small percentages add up
Marker-space cost: 1 reserved value consumed (general-reserve count drops from 4 to 3 in the H2Q6 reservation table)
Risk: low — mechanical split; round-trip tested against both wire-format variants

Implementation outline

BinaryTypeCode.StringFastWire = 131 constant + helper updates (IsString range check + dispatch)
WriteStringWithDispatch FastWire branch emits StringFastWire (was StringSmall)
New ReadStringFastWire<TInput> static reader — [VarUInt charCount][UTF-16 bytes] decode, no Compact-mode branching
ReadStringSmall<TInput> simplified — Compact-only, drops if (context.FastWire) branch
TypeReaderTable[StringFastWire] registration
SkipValue case StringFastWire: — same skip layout as StringSmall FastWire branch (charCount VarUInt + 2 × charCount bytes)
SGen template EmitReadString — new case StringFastWire: block (FastWire-only branch); case StringSmall: simplified to Compact-only
Round-trip tests: separate FastWire and Compact wire format coverage
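
A dedicated-marker round trip might look like the sketch below. Assumptions are flagged in the comments: the marker value 131 is only the proposed allocation, the VarUInt is assumed to be a 7-bit continuation encoding, the UTF-16 payload is assumed little-endian (matching the host), and all method names are illustrative.

```csharp
using System;
using System.Runtime.InteropServices;

const byte StringFastWire = 131; // proposed allocation from this entry, not final

var wire = new byte[64];
var writePos = 0;
WriteStringFastWire(wire, ref writePos, "árvíz");

var readPos = 0;
var roundTripped = ReadString(wire, ref readPos);
Console.WriteLine(roundTripped == "árvíz" && readPos == writePos); // True

void WriteStringFastWire(byte[] dest, ref int pos, string value)
{
    dest[pos++] = StringFastWire;
    var count = (uint)value.Length;
    while (count >= 0x80) { dest[pos++] = (byte)(count | 0x80); count >>= 7; } // assumed VarUInt
    dest[pos++] = (byte)count;
    var raw = MemoryMarshal.AsBytes(value.AsSpan());   // raw UTF-16 code units
    raw.CopyTo(dest.AsSpan(pos));
    pos += raw.Length;
}

// Marker dispatch: the FastWire reader is reached without any mode check.
string ReadString(ReadOnlySpan<byte> src, ref int pos)
{
    var marker = src[pos++];
    return marker switch
    {
        StringFastWire => ReadStringFastWire(src, ref pos),
        _ => throw new NotSupportedException($"marker {marker} is not handled in this sketch"),
    };
}

string ReadStringFastWire(ReadOnlySpan<byte> src, ref int pos)
{
    uint charCount = 0;
    var shift = 0;
    byte b;
    do { b = src[pos++]; charCount |= (uint)(b & 0x7F) << shift; shift += 7; } while ((b & 0x80) != 0);
    var byteCount = checked((int)charCount * 2);       // 2 bytes per UTF-16 code unit
    var text = new string(MemoryMarshal.Cast<byte, char>(src.Slice(pos, byteCount)));
    pos += byteCount;
    return text;
}
```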

Acceptance

Round-trip parity on both Compact and FastWire wire formats (existing tests pass)
Benchmark on FastWire mode shows ≥ 0.5% improvement vs. mode-shared baseline
Compact mode shows no regression (likely marginal gain from simpler ReadStringSmall)
AOT-published binary shows reduced generated reader size per [AcBinarySerializable] type (one less case-block + branch)
Marker-space documented: BinaryTypeCode.cs reservation comment + H2Q6 entry's reservation table updated to reflect the F3W6 allocation

Trigger

Pre-NuGet release if every measurable percentage point on the Compact hot path matters for the "fastest" narrative
Or: when the Compact/FastWire branch profile shows up in a NativeAOT inlining audit (ACCORE-BIN-T-V4N4)

Roll-back fallback

If a future marker-space crunch arises (additional H2Q6 tiers, new compression markers, etc.), F3W6 can be reverted by switching the writer back to emitting StringSmall on FastWire and re-introducing the mode-shared dispatch in ReadStringSmall. The original design is correctness-equivalent; the dedicated marker is purely an optimization. If marker space becomes a problem, we take it out.

ACCORE-BIN-T-B1D5: BenchmarkDotNet release-quality measurement project

Priority: P2 · Type: Tooling / release-narrative · Status: Open · Related: AyCode.Core.Serializers.Console (existing custom bench), NuGet release-narrative

The current AyCode.Core.Serializers.Console is a hand-rolled microbenchmark — fast dev-iteration loop (30-90s per run, custom markdown output, internal TestDataSet structure). It serves the inner optimization cycle well, but is not industry-standard for the public NuGet release narrative.

A parallel BenchmarkDotNet-based project would close that gap:

Industry-standard credibility: BenchmarkDotNet is the canonical .NET benchmarking framework — MemoryPack, MessagePack, System.Text.Json all use it for their published numbers. AcBinary results expressed in BDN format are directly comparable to MemPack's own release notes.
Statistical rigor: outlier detection (Tukey's fences), interquartile range, confidence intervals, multi-process iteration runs. The current custom bench reports median-of-5; BDN reports the full distribution + variance band — the difference between "looks fast on my machine" and "demonstrably fast under controlled conditions".
NuGet release surface: BDN markdown tables drop straight into release notes / blog posts / NuGet README.md / BINARY_FEATURES.md "Performance vs MemoryPack" section. GitHub-friendly format, screenshot-friendly, reviewer-credible.
Diagnostic-plugin integration:
- [MemoryDiagnoser] — allocation per iteration (already a hot question for the Repeated cell)
- [EventPipeProfiler] — CPU profile collection during the bench run, exportable to speedscope flame-graph
- [DisassemblyDiagnoser] — per-method disasm dump, parallel to the manual dumpbin workflow used in V4N4
- [ThreadingDiagnoser] — context switches, lock contention (relevant if pool-contention shows up under load)
Multi-runtime / multi-job: a single project benchmarks against RuntimeMoniker.Net90 (JIT) and RuntimeMoniker.NativeAot90 simultaneously — same-shape table side-by-side.
CI integration potential: BDN result format is machine-readable (JSON/CSV), enabling regression detection on PR diffs (later sprint).
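
To make the outlier rule mentioned above concrete, here is a small stand-alone sketch of Tukey's fences. It is illustrative only: the quartiles use simple linear interpolation, which is an assumption and may differ from the estimator BenchmarkDotNet uses internally, and the sample timings are invented.

```csharp
using System;
using System.Linq;

var samples = new[] { 10.0, 11.0, 12.0, 11.0, 10.0, 50.0 }; // hypothetical timings
var kept = RemoveOutliers(samples);
Console.WriteLine(string.Join(" ", kept)); // 10 10 11 11 12

// Linear-interpolated quantile over an already sorted array (assumed estimator).
static double Quantile(double[] sorted, double q)
{
    var idx = q * (sorted.Length - 1);
    var lo = (int)Math.Floor(idx);
    var hi = (int)Math.Ceiling(idx);
    return sorted[lo] + (sorted[hi] - sorted[lo]) * (idx - lo);
}

// Keeps only samples inside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR].
static double[] RemoveOutliers(double[] values)
{
    var sorted = values.OrderBy(x => x).ToArray();
    var q1 = Quantile(sorted, 0.25);
    var q3 = Quantile(sorted, 0.75);
    var iqr = q3 - q1;
    var lower = q1 - 1.5 * iqr;
    var upper = q3 + 1.5 * iqr;
    return sorted.Where(x => x >= lower && x <= upper).ToArray();
}
```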

Implementation outline

New project: AyCode.Core.Serializers.Benchmark (or .Bdn) — separate csproj for clean BDN dependency isolation. AOT-publishable for the AOT job.
TestDataSet bridge: reuse the existing TestDataFactory / TestDataSet types from AyCode.Core.Tests.TestModels so the data-shape is identical to the custom bench.

Benchmark class skeleton:

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Jobs;
using MemoryPack;
// plus the project's own AcBinary and test-model namespaces

[MemoryDiagnoser]
[SimpleJob(RuntimeMoniker.Net90, baseline: true)]
[SimpleJob(RuntimeMoniker.NativeAot90)]
public class StringSerializationBenchmark
{
    [Params("Small", "Medium", "Large", "Repeated", "Deep")]
    public string DataSet { get; set; } = "Small";

    // Typed as TestOrder (the type the Deser benchmarks already target) so that
    // MemoryPack resolves a concrete formatter rather than one for object.
    // Assumption: TestDataFactory.Create returns a value castable to TestOrder.
    private TestOrder _data = null!;
    private byte[] _compactWire = null!;
    private byte[] _mempackWire = null!;

    [GlobalSetup]
    public void Setup()
    {
        _data = (TestOrder)TestDataFactory.Create(DataSet);
        // Pre-serialize once so the Deser benchmarks measure decoding only.
        _compactWire = AcBinarySerializer.Serialize(_data, AcBinarySerializerOptions.FastMode);
        _mempackWire = MemoryPackSerializer.Serialize(_data);
    }

    [Benchmark(Baseline = true)] public byte[] MemPack_Ser() => MemoryPackSerializer.Serialize(_data);
    [Benchmark] public byte[] AcBinary_Compact_Ser() => AcBinarySerializer.Serialize(_data, AcBinarySerializerOptions.FastMode);
    [Benchmark] public object? MemPack_Deser() => MemoryPackSerializer.Deserialize<TestOrder>(_mempackWire);
    [Benchmark] public object? AcBinary_Compact_Deser() => AcBinaryDeserializer.Deserialize<TestOrder>(_compactWire);
}

Multi-cell coverage: separate benchmark classes per workload-shape (StringSerializationBenchmark, ObjectGraphBenchmark, NestedDeepBenchmark) — clean grouping in BDN output.
NativeAOT-job config: <PublishAot>true</PublishAot> conditionally (mirroring Console project pattern); BDN's NativeAOT job auto-publishes the bench-runner.
Output: GitHub-flavored Markdown export → docs/BINARY/BENCHMARK_RESULTS.md (or similar), versioned in the repo.

Why P2 (pre-NuGet release)

NuGet release narrative ("AcBinary fastest AND smallest binary serializer for .NET i18n payloads") needs credible, industry-standard numbers. Custom bench → "trust me, my numbers"; BDN → "here are the variance bands and the methodology".
Direct comparison surface against MemPack's published BDN numbers (head-to-head on the same framework).
Diagnostic-plugin integration ([MemoryDiagnoser] + [EventPipeProfiler]) opens up further targeted optimization work without separate tooling.

Acceptance

New AyCode.Core.Serializers.Benchmark project compiles + runs cleanly on both JIT (net9.0) and NativeAOT
Reuses existing TestDataFactory / TestDataSet types — no test data duplication
Produces a markdown table per workload-shape covering: MemPack baseline + AcBinary Compact + (optionally) AcBinary FastWire, both Ser and Deser
BDN output saved to docs/BINARY/BENCHMARK_RESULTS.md (versioned per release)
README.md / BINARY_FEATURES.md references the BDN-measured performance claim with the methodology link

Trigger

Pre-NuGet release: when the optimization sprint cluster (V4N2 / W2C8 / etc.) settles and the perf state is release-stable
Or: when a credibility-sensitive presentation surface emerges (blog post, conference talk, GitHub README)

Coexistence with the custom bench

The custom Console bench is not replaced — it remains the dev-iteration tool (fast feedback loop, 30-90s runs, hand-tuned markdown for chat-paste). BDN is the release-grade bench (3-10 min runs, statistical rigor, NuGet release output). Different tools for different audiences.

ACCORE-BIN-T-C5R8: Charset-parameterized benchmark workload (ASCII / Hungarian / CJK / Cyrillic / Mixed)

Priority: P2 · Type: Tooling / release-narrative · Status: Closed (2026-05-07) · Related: BenchmarkTestDataProvider, AyCode.Core.Serializers.Console.Program.cs (Settings → Charset submenu), ACCORE-BIN-T-V4N2 (charset-specific optimization measurement target), ACCORE-BIN-T-D9X3 (bench stabilization preceding this work)

The current BenchmarkTestDataProvider hard-codes Hungarian (Latin extended 2-byte) content into the test DTOs. This produces a single workload-shape: Hungarian mixed text with short 1-2 char 2-byte runs. While Hungarian is a fine general-purpose i18n stress, it is only one production-content profile — and the optimization decisions ride on it implicitly (e.g. V4N2 Phase 2.5's 3-byte run do-while was deferred-on-2-byte-side because the Hungarian bench measured regression there, but its CJK-side value cannot be measured on the current data).

A charset-parameterized benchmark workload — selectable from the interactive menu — would:

Measure optimization value across realistic content profiles — what wins on CJK content may not win on Hungarian, and vice versa. Without explicit per-charset measurement, optimization decisions become Hungarian-biased.
Surface release-narrative numbers credibly — instead of "Compact beats MemPack on i18n payload" (single workload), claim "Compact vs MemPack: ASCII X%, Hungarian Y%, CJK Z%, Cyrillic W%, Mixed V%" — concrete numbers per content profile, NuGet-grade.
Enable workload-specific optimization audits — V4N2 Phase 3 SIMD multi-byte transcoder targets CJK 3-byte content; without a CJK workload measurement, Phase 3 acceptance criteria cannot be validated.

Implementation outline

1. `BenchmarkTestDataProvider` refactor

Hard-coded Hungarian strings (KözösCímke, sötét, magyar, hetenkénti, etc.) → ASCII baseline values (English equivalents: SharedTag, dark, hungarian, weekly).

New static LongStringSuffix field — charset-aware suffix appended to a subset of property values:

public static class CharsetSuffixes
{
    public const string AsciiOnly = "";  // baseline — pure-English ASCII content
    public const string Hungarian  = " árvíztűrő tükörfúrógép";
    public const string CjkBmp     = " 你好世界 こんにちは 안녕하세요";
    public const string Cyrillic   = " Привет мир дорогой друг";
    public const string Mixed      = " árvíz 你好 Привет 😀";
}

public static string LongStringSuffix { get; set; } = CharsetSuffixes.Hungarian;  // default

Property values use the suffix dynamically:

var description = "Product description" + LongStringSuffix;

The 5 charsets cover the realistic UTF-8 workload spectrum:

Pure ASCII — baseline; Phase 1 SIMD prefix widen + DWORD batch dominate; no multi-byte path engagement
Hungarian (Latin extended) — short 1-2 char 2-byte runs in mixed text; current default workload
CJK BMP — long homogeneous 3-byte runs; primary V4N2 Phase 2.5/3 win region
Cyrillic (Russian / etc.) — long 2-byte runs (different shape than Hungarian mixed); V4N2 Phase 2.5 may yet pay off here
Mixed (Hungarian + CJK + emoji) — full multi-tier coverage in one payload; surrogate-pair handling stress
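To make the difference between these profiles concrete, the following is a minimal sketch (not part of the harness; the `CharsetProfile` class and `WidthHistogram` method are hypothetical names) that counts how many code points of a suffix encode to 1, 2, 3, or 4 UTF-8 bytes. It relies only on the BCL `Rune` API.

```csharp
using System;
using System.Text;

public static class CharsetProfile
{
    // Returns a 4-slot histogram: index 0 = code points that take 1 UTF-8 byte,
    // index 1 = 2 bytes, index 2 = 3 bytes, index 3 = 4 bytes (surrogate pairs).
    public static int[] WidthHistogram(string text)
    {
        var histogram = new int[4];
        foreach (Rune rune in text.EnumerateRunes())
        {
            // Utf8SequenceLength is always in the range 1..4.
            histogram[rune.Utf8SequenceLength - 1]++;
        }
        return histogram;
    }
}
```

Running it over each suffix shows which encoder tier a profile exercises: a CJK suffix lands almost entirely in the 3-byte slot, while only the emoji in the Mixed suffix reaches the 4-byte slot.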

2. `Program.cs` interactive submenu

Before starting a benchmark run, prompt the user for charset choice:

Choose benchmark charset:
  1 — Pure ASCII (baseline)
  2 — Hungarian (Latin extended) [DEFAULT]
  3 — CJK BMP (Chinese / Japanese / Korean)
  4 — Cyrillic (Russian / etc.)
  5 — Mixed (Hungarian + CJK + emoji)

The choice → BenchmarkTestDataProvider.LongStringSuffix = ... before constructing test data.

3. Benchmark output header

The markdown output header should reflect the selected charset:

# AcBinary Benchmark Release 2026-05-07 16:00:00
Charset: CJK BMP | Iterations: 1000 | Warmup: 10000 | ...

This makes per-charset bench files self-documenting — file names + content both encode the workload profile.

4. Round-trip tests unaffected

Utf8TranscoderTests and other content-class unit tests (with their fixed Hungarian / CJK / emoji boundary inputs) are untouched — they remain fixed-content for regression coverage. Only the benchmark workload is charset-parameterized.

Why P2

Release-narrative: NuGet release credibility depends on measurable performance claims across realistic content profiles, not a single Hungarian-mixed workload
Optimization decision quality: V4N2 Phase 2.5 / Phase 3 / future SIMD multi-byte work cannot be objectively validated without a CJK workload — current decisions have implicit Hungarian-bias
Consumer reproducibility: external consumers can reproduce benchmark numbers on their own content profile (or contribute a new charset profile)

Acceptance

BenchmarkTestDataProvider refactored: ASCII baseline + LongStringSuffix static field with 5 predefined charset constants
Interactive menu in Program.cs lets the user choose charset 1-5 before benchmark run; the chosen charset is recorded in the markdown output header
Round-trip correctness verification still runs once-per-cell before warmup (existing Verified: round-trip ... line) — works on the active charset
All 5 charsets produce valid round-trip on all benchmark cells (Small / Medium / Large / Repeated / Deep)
Existing benchmark numbers (Hungarian-default) reproducible — choosing charset 2 from the menu yields the current 15:29:21-style results
New CJK charset (option 3) produces measurable numbers (one bench run per charset documented in Test_Benchmark_Results/)

Trigger

Pre-NuGet release: per-charset numbers needed for the public performance-claim table
Or: when V4N2 Phase 3 SIMD multi-byte transcoder work needs CJK-workload validation

Resolution

Landed 2026-05-07 (after ACCORE-BIN-T-D9X3 bench stabilization made sub-3% deltas measurable, which raised the value of charset-specific measurement). Implementation refined the original 5-charset proposal into a 6-charset list per user request (Latin1FixAscii + Latin1 short/long split for finer-grained Latin1 coverage):

1. BenchmarkTestDataProvider refactor ✅

New CharsetSuffixes static class with 6 const suffixes (one more than originally proposed):
- Latin1FixAscii = "" — empty suffix; baseline values stay short → FixStr fast-path stress (renamed from AsciiOnly per user request)
- Latin1Short = " árvíztűrő tükörfúrógép" (~24 char) — Hungarian short Latin1 mixed
- Latin1Long = " árvíztűrő tükörfúrógép a magyar betűzés tesztje" (~47 char) — NEW, exceeds the 32-char FixStr boundary on the suffix alone (user request)
- CjkBmp, Cyrillic, Mixed — as originally specified
LongStringSuffix default = CharsetSuffixes.Latin1Long (backward-compatible in spirit with the prior fixed Latin1 default)
All hard-coded Hungarian baseline values replaced with ASCII English equivalents:
- KözösCímke / IsmétlődőCímke / MélyCímke → SharedTag / RepeatedTag / DeepTag
- közösfelhasználó → shareduser (and variants); közös → shared; MélyKategória → DeepCategory
- sötét / világos → dark / light; magyar / német / francia → hungarian / german / french
- hetenkénti / naponkénti / havonkénti → weekly / daily / monthly
- Repeated cell long Hungarian baselines (TermékNév_IsmétlődőTesztAdat_árvíztűrőtükörfúrógép, RaklapKód_IsmétlődőTesztAdat_árvíztűrő) shortened to ASCII ProductName / PalletCode so the EnsureAllStringsBypassFixStr suffix-append actually applies (the prior >31-char baselines bypassed the suffix, leaving Repeated cell content fixed-Hungarian regardless of charset selection)
The only Latin1/non-ASCII characters remaining in the file are inside the CharsetSuffixes const definitions themselves (intentional — those define the per-charset content profiles)

2. Program.cs interactive submenu ✅

New [3] Charset entry in the existing Settings submenu (next to [1] Iteration and [2] WireMode) — chose nested submenu over a top-level prompt to keep the main menu uncluttered
ShowCharsetSettingsMenu lists the 6 charset constants with brief descriptions; selection sets BenchmarkTestDataProvider.LongStringSuffix and returns
GetCurrentCharsetName() helper resolves the active suffix back to its constant name (returns "Custom" when programmatically set to a non-const value)
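The reverse lookup described above can be sketched as follows. This is an illustration only: the `CharsetNameResolver` class and its table are hypothetical, and the real helper may be implemented differently (for example via a switch). The suffix values are copied from the constants listed in this entry.

```csharp
using System;

public static class CharsetNameResolver
{
    // Name/suffix pairs mirroring the six CharsetSuffixes constants.
    private static readonly (string Name, string Suffix)[] Known =
    {
        ("Latin1FixAscii", ""),
        ("Latin1Short", " árvíztűrő tükörfúrógép"),
        ("Latin1Long", " árvíztűrő tükörfúrógép a magyar betűzés tesztje"),
        ("CjkBmp", " 你好世界 こんにちは 안녕하세요"),
        ("Cyrillic", " Привет мир дорогой друг"),
        ("Mixed", " árvíz 你好 Привет 😀"),
    };

    // Resolves the active suffix back to its constant name;
    // any value that matches no constant is reported as "Custom".
    public static string Resolve(string activeSuffix)
    {
        foreach (var (name, suffix) in Known)
        {
            if (string.Equals(suffix, activeSuffix, StringComparison.Ordinal))
                return name;
        }
        return "Custom";
    }
}
```

Ordinal comparison is used so that culture-specific collation rules cannot make two different suffixes compare equal.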

3. Benchmark output header ✅

Charset: field added to 3 output locations:
- Console run header (interactive run line — Layer: ... | Charset: CjkBmp | Iterations: ...)
- .LLM markdown header (file-self-documenting)
- .log boxed banner (║ Charset: CjkBmp ║)

4. Round-trip tests unaffected ✅ — Utf8TranscoderTests and other content-class unit tests use their own fixed boundary inputs; not touched by this change. Round-trip verification in the bench harness continues to run once-per-cell pre-warmup (VerifyRoundTrip) on the active charset.

Acceptance status

✅ BenchmarkTestDataProvider refactored with ASCII baselines + LongStringSuffix field + 6 charset constants
✅ Interactive submenu lets the user choose charset 1-6; recorded in markdown output header (3 locations)
✅ Round-trip verification runs on the active charset (existing per-cell verify, charset-agnostic by design)
⚠️ "All 6 charsets produce valid round-trip on all benchmark cells" — design correctness implies this; not yet exercised on every (cell × charset) combination explicitly. Recommend running each charset once before declaring full validation.
❌ "Existing benchmark numbers (Hungarian-default) reproducible — choosing charset 2 yields the current 15:29:21-style results" — NOT met: the ASCII baseline refactor changes the numbers regardless of charset choice (shorter baselines + suffix-driven content vs. prior fixed Hungarian baselines). New Latin1Short ≠ prior fixed Hungarian default. This is intentional: the user explicitly chose a clean ASCII-baseline + charset-suffix design over preserving historical numerical comparability.
❌ "Choosing CJK produces measurable numbers documented in Test_Benchmark_Results/" — NOT done in this commit window; user has the menu and will run per-charset benches in a follow-up sprint.

Note on numerical incompatibility with prior runs

Existing bench files generated before this commit (e.g. Console.FullBenchmark_Release_2026-05-07_17-42-22.LLM and earlier) used the prior fixed Latin1 baseline values + 32-char Hungarian suffix. The new default (Latin1Long) uses ASCII baselines + 47-char Latin1Long suffix; the Repeated cell sees a more dramatic shift (its 52-char fixed Hungarian baseline → 11-char ASCII ProductName + 47-char suffix). Numerical comparison across the boundary is not meaningful; the Charset: header field documents the source charset for each new bench file.

Future extensions

Sentinel "real-world" charsets — synthetic mixes representing typical production payloads (e.g. EnglishWithEmoji for chat-app DTOs, ArabicHebrew for RTL-script regions). Add as new CharsetSuffixes constants when consumer demand surfaces.
Charset auto-rotate mode: a single benchmark run cycles through all 6 charsets, producing a 6-section markdown output. Useful for full release-narrative table generation in one pass.
BDN integration (per ACCORE-BIN-T-B1D5): charset becomes a [Params] axis in BenchmarkDotNet, producing a 5×6×N matrix (cells × charsets × engines) in the BDN output.

ACCORE-BIN-T-D9X3: Console benchmark stabilization (per-serializer warmup + GC isolate + pilot discard + min/max range + CPU pin + mode-aware JIT sleep)

Priority: P1 · Type: Tooling / measurement · Status: Closed (2026-05-07) · Related: AyCode.Core.Serializers.Console.Program.cs, ACCORE-BIN-T-V4N4, ACCORE-BIN-T-V4N2, ACCORE-BIN-T-S6F2, ACCORE-BIN-T-B1D5 (BDN release-grade variant)

The custom Console benchmark harness showed strong run-to-run variance: the user reported a +20pp / -10pp swing in the summary total between runs on identical code. 1-3% perf claims became unmeasurable on this noise floor; the V4N4 method-split and V4N2 Phase 2.5 attempts both fell into this band, leaving the question "does the regressed bench number reflect a code regression or measurement noise?" undecidable (see V4N4 Reverted section).

Diagnosis (sprint takeaway prior to this entry):

Warmup cache pollution — RunBenchmarksForTestData ran one warmup-all loop (every serializer × WarmupIterations) followed by one bench-all loop. By the time a given serializer was measured, its hot code and data lines had been evicted by the intervening serializers' warmup passes. MemPack and AcBinary hot paths share neither code nor data working sets — they actively evict each other.
GC pause leakage between samples — the Stopwatch-recorded sample loop had no explicit GC.Collect. A minor GC triggered inside sample N could promote into a Gen-2 pause inside sample N+1's timed window (1-5 ms spike).
Pilot sample contamination — the first sample after warmup absorbed residual JIT bookkeeping and cold-cache misses; on a 10-sample median this contributed 1-2 outliers that visibly stretched the min/max.
CPU migration / preemption — the Windows scheduler migrated the bench thread between cores between samples (L1/L2 cache evict on each migration); background work (Defender index, OS service threads) injected random preemption spikes.
JIT sleep not mode-aware — Thread.Sleep(JitSleep = 3000) waited 3 seconds before each cell for tiered-JIT drain. On AOT publish (PublishAot=true) there IS NO dynamic compilation — the 3 seconds were pure idle. Worse, the drain happened only globally (once before all cells), not per-serializer, so a tier-promotion mid-bench could still bleed in.
Range invisible — the .LLM markdown output showed only the median; the user could not tell whether a 5%-median-delta was inside or outside the inter-sample range for that row.

Resolution

Landed 2026-05-07 (16:00 — 17:00). Six stabilization steps in one commit window:

1. Per-serializer warmup separation (RunBenchmarksForTestData) — the warmup-loop and bench-loop merged into one per-serializer cycle: each serializer's warmup runs IMMEDIATELY before its own bench. The serializer's hot code/data is freshest in cache when the first sample times.

2. GC.Collect before every sample (RunTimed) — GC.Collect() + WaitForPendingFinalizers() + GC.Collect() triple-tap before each sample, OUTSIDE the Stopwatch window. Every sample starts from the same heap state; an ad-hoc Gen-2 pause from sample N can no longer bleed into sample N+1.

3. Pilot sample discard (RunTimed) — the loop runs samples + 1 times; the first (index 0) is discarded. The first sample post-warmup absorbs residual JIT/GC bookkeeping and cold cache; the recorded samples count remains 10 (median is the same data the user saw before, just sourced from "typical" sample-set, not from the post-warmup-first noisy point).

4. Min/max range in markdown output (SaveLlmResults, new FormatMicrosWithRange helper, new BenchmarkResult fields: SerializeTimeMinMs/MaxMs, DeserializeTimeMinMs/MaxMs, RoundTripTimeMinMs/MaxMs) — the .LLM output's Ser and Deser columns now render as 26.86 (24.50..29.10): median (min..max) µs/op. The reader sees at a glance whether a delta is above the row's noise floor.

5. CPU affinity + process priority (RunBenchmark) — ProcessorAffinity = 0x1 (CPU 0 pin) + PriorityClass = High for the benchmark phase, try/finally restores the original values. Eliminates inter-sample thread migration (L1/L2 cache evicts) and reduces background-task preemption. Platform-guarded: Windows / Linux only (CA1416 — ProcessorAffinity throws on macOS); locked-down hosts (group policy, container without CAP_SYS_NICE, etc.) catch + warning + bench continues with default scheduling.

6. Mode-aware JitSleep (property) — RuntimeFeature.IsDynamicCodeCompiled ? 250 : 0. JIT mode 250 ms (the .NET 9 tiered-JIT compile queue typically drains in <100 ms for the bench's hot path); AOT publish 0 ms. The 3000 ms blind wait is gone. The drain now happens per-serializer (Step 1) instead of once globally.
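Steps 2, 3, and 6 can be sketched together as below. This is a simplified illustration under stated assumptions, not the harness code: the `TimedRunner` class and the `(MedianMs, MinMs, MaxMs)` tuple shape are hypothetical, and the real `RunTimed` may differ in signature and bookkeeping.

```csharp
using System;
using System.Diagnostics;
using System.Runtime.CompilerServices;

public static class TimedRunner
{
    // Step 6: wait for the tiered JIT only when dynamic compilation exists.
    public static int JitSleepMs => RuntimeFeature.IsDynamicCodeCompiled ? 250 : 0;

    // Runs samples + 1 timed batches, discards the first (pilot) one,
    // and reports median / min / max of the rest in milliseconds.
    public static (double MedianMs, double MinMs, double MaxMs) RunTimed(Action batch, int samples)
    {
        if (samples < 1) throw new ArgumentOutOfRangeException(nameof(samples));

        var recorded = new double[samples];
        for (int i = 0; i <= samples; i++)
        {
            // Step 2: equalize the heap state OUTSIDE the timed window.
            GC.Collect();
            GC.WaitForPendingFinalizers();
            GC.Collect();

            var stopwatch = Stopwatch.StartNew();
            batch();
            stopwatch.Stop();

            // Step 3: index 0 is the pilot sample and is not recorded.
            if (i > 0) recorded[i - 1] = stopwatch.Elapsed.TotalMilliseconds;
        }

        Array.Sort(recorded);
        int mid = samples / 2;
        double median = samples % 2 == 1
            ? recorded[mid]
            : (recorded[mid - 1] + recorded[mid]) / 2.0;
        return (median, recorded[0], recorded[samples - 1]);
    }
}
```

The forced collections cost wall-clock time per sample but never enter the measurement, which is the point: total run time goes up, per-sample variance goes down.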

Bench result (3 consecutive runs, 2026-05-07 17:00:32 / 17:01:03 / 17:01:32, FastestByte mode, FastMode preset)

Cell	AcBinary Ser median, µs/op (3 runs)	Inter-run spread	Intra-cell range
Small	7.09 / 6.83 / 6.55	7.6%	~8% (noise floor: 1000 × ~6 µs measured)
Medium	18.74 / 18.90 / 19.22	2.6%	~10%
Large	140.20 / 141.67 / 141.02	1.0%	~3%
Repeated	26.52 / 26.25 / 26.28	0.3%	~6%
Deep Nested	23.44 / 23.17 / 22.70	3.2%	~7%

The previous +20pp / -10pp swing in the summary total shrank to 1-3pp on the medium/large cells. The Small cell remains noisy (~8% relative), but this is a physical floor: 1000 iter × ~6 µs/op ≈ 6 ms total batch, and on a window that short, Stopwatch resolution and OS spikes dominate in relative terms.

The (min..max) range is consistently 3-10% relative — a measurable signal floor: 1-3% perf-deltas no longer disappear into noise.

Lessons

Bench stabilization is a precondition for perf optimization, not a consequence. Optimization decisions (e.g. V4N4 method-split, V4N2 Phase 2.5) can only be derived from bench numbers if the noise floor < expected signal. Without that, the bench numbers mean nothing.
Cache pollution (warmup-all → bench-all flow) was the single largest noise source: per-serializer warmup separation alone removed ~10pp of variance.
Platform stabilization (CPU pin + high priority) combined with heap stabilization (GC.Collect + pilot discard) further tightened the range.
AOT and JIT have different stabilization needs: the 3000 ms blind sleep was idle time on AOT; mode-aware sleep pays the cost only when needed.
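The platform-stabilization part (CPU pin + high priority with restore and graceful fallback) could look roughly like the sketch below. The `BenchScheduling` class and `RunPinned` method are hypothetical names; the exception types caught are an assumption about what locked-down hosts throw, so treat the filter as illustrative.

```csharp
using System;
using System.ComponentModel;
using System.Diagnostics;

public static class BenchScheduling
{
    // Pins the process to CPU 0 and raises its priority while `benchmark` runs,
    // then restores whatever was successfully changed.
    public static void RunPinned(Action benchmark)
    {
        using Process process = Process.GetCurrentProcess();
        IntPtr? savedAffinity = null;
        ProcessPriorityClass? savedPriority = null;

        try
        {
            // ProcessorAffinity is only supported on Windows and Linux.
            if (OperatingSystem.IsWindows() || OperatingSystem.IsLinux())
            {
                IntPtr current = process.ProcessorAffinity;
                process.ProcessorAffinity = (IntPtr)0x1;
                savedAffinity = current;
            }

            ProcessPriorityClass currentPriority = process.PriorityClass;
            process.PriorityClass = ProcessPriorityClass.High;
            savedPriority = currentPriority;
        }
        catch (Exception ex) when (ex is Win32Exception or InvalidOperationException or NotSupportedException)
        {
            // Locked-down host: warn and continue with default scheduling.
            Console.Error.WriteLine($"warning: scheduling tweak skipped ({ex.Message})");
        }

        try
        {
            benchmark();
        }
        finally
        {
            try
            {
                if (savedPriority is ProcessPriorityClass priority)
                    process.PriorityClass = priority;
                if (savedAffinity is IntPtr affinity && (OperatingSystem.IsWindows() || OperatingSystem.IsLinux()))
                    process.ProcessorAffinity = affinity;
            }
            catch (Exception ex) when (ex is Win32Exception or InvalidOperationException or NotSupportedException)
            {
                Console.Error.WriteLine($"warning: could not restore scheduling ({ex.Message})");
            }
        }
    }
}
```

Each saved value is assigned only after the corresponding setter succeeds, so the restore path never tries to undo a change that was not applied.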

Re-evaluation list (entries currently Reverted or unmeasurable)

The stabilization opens a follow-up sprint: the Reverted (2026-05-07) entries are re-evaluable now that the noise floor < the expected 1-3% signal:

ACCORE-BIN-T-V4N4 — method-split (writer + reader hot path) is re-testable
ACCORE-BIN-T-V4N2 (Phase 2.5) — UTF-8 do-while runs (2-byte / 3-byte) per charset
ACCORE-BIN-T-S6F2 — Small fast path (was integrated into V4N4)

Per-entry re-evaluation is the next sprint's task, NOT part of this Closed entry.

Why P1

Blocked all sub-3% perf optimization work (every recent attempt fell into the noise band)
One-line user complaint ("the total swung between +20 and -10") summarized weeks of unproductive bench-driven investigation
One-time fixed cost; every future bench run benefits

Follow-up: adaptive iteration + CV reporting + per-cell A/B mode (2026-05-07, second commit window)

After the initial 6-step landing, three additional refinements were added in a second commit window the same day. The trigger was a Copilot-suggested noise-reduction list against the now-stable bench output:

1. Per-cell adaptive iteration: a fixed TestIterations = 1000 produced sample windows from 6 ms (Small cell @ ~6 µs/op) to 140 ms (Large cell @ 140 µs/op). The Small cell at 6 ms remained the dominant residual noise source (7.6% inter-run spread vs ≤3.2% on the other cells) because OS-level spikes (preempt + IRQ + scheduler tick) are absolute-time events; on a 6 ms sample window their relative contribution is huge.

Implementation:

New constant TargetSampleMs = 250 (per-sample wall-clock target)
New helper CalibrateIterations(Action, int targetMs) — runs a 100-iter probe post-warmup, computes iterPerMs, and rounds up to the nearest 1000. Floor 1000, ceiling 200_000.
RunBenchmarksForTestData calibrates Ser and Des INDEPENDENTLY per serializer (different per-op cost). RT-only rows (NamedPipe) get a single RT calibration.
New BenchmarkResult fields: SerializeIterations, DeserializeIterations, RoundTripIterations (per-row).
New helpers: ToPerOpMicros(double, int) (replaces 1-arg variant), SerPerOp(r) / DesPerOp(r) / RtPerOp(r) for per-op µs from the result.
All Average(r => r.*TimeMs) and OrderBy(r => r.RoundTripTimeMs) call-sites refactored to use per-op µs (iter-independent) — mixing batch-time across rows with different iter counts would be meaningless. ~20 call-sites total.
RT for in-mem rows synthesized so RtPerOp(r) == SerPerOp(r) + DesPerOp(r) regardless of serIter != desIter: RoundTripIterations = max(serIter, desIter), RoundTripTimeMs = rtPerOpMicros / 1000 * RoundTripIterations.

Expected impact: Small cell sample window 6 ms → ~240 ms; inter-run spread 7.6% → ~1-2% (matching the other cells). Total suite duration ~50 s → ~110-130 s.
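The calibration rule described above (100-iteration probe, round up to the nearest 1000, floor 1000, ceiling 200_000) can be sketched like this. The `IterationCalibrator` class and the `RoundAndClamp` split are hypothetical; the real `CalibrateIterations` helper may be structured differently.

```csharp
using System;
using System.Diagnostics;

public static class IterationCalibrator
{
    public const int FloorIterations = 1_000;
    public const int CeilingIterations = 200_000;
    private const int ProbeIterations = 100;

    // Times a short probe and scales it to the per-sample wall-clock target.
    public static int CalibrateIterations(Action op, int targetMs)
    {
        var stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < ProbeIterations; i++) op();
        stopwatch.Stop();

        double probeMs = stopwatch.Elapsed.TotalMilliseconds;
        if (probeMs <= 0) return CeilingIterations; // too fast to measure

        double iterPerMs = ProbeIterations / probeMs;
        return RoundAndClamp(iterPerMs * targetMs);
    }

    // Rounds up to the next multiple of 1000 and clamps to [floor, ceiling].
    public static int RoundAndClamp(double rawIterations)
    {
        if (double.IsNaN(rawIterations) || rawIterations <= FloorIterations) return FloorIterations;
        if (rawIterations >= CeilingIterations) return CeilingIterations;
        return (int)(Math.Ceiling(rawIterations / 1000.0) * 1000.0);
    }
}
```

Keeping the rounding in a pure function makes the policy testable without timing anything; only the probe itself is timing-dependent.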

2. CV (coefficient of variation) reporting + unstable-row marker — the median + (min..max) range surfaces shape but not a single-number stability metric. The CV (= stddev/mean) is the standard statistical measure; rows with CV > threshold are flagged with a ⚠️ suffix in the markdown output so a small inter-engine delta on a high-CV row is immediately obvious as noise-suspect.

Implementation:

New constant UnstableCVThreshold = 0.03 (3% — reasonable for stabilized in-memory benchmarks)
RunTimed return tuple extended: (median, min, max, stddev). Stddev computed over the (samples − pilot) population using Math.Sqrt(Math.Max(0, E[X²] - E[X]²)).
New BenchmarkResult fields: SerializeTimeStdDevMs, DeserializeTimeStdDevMs, RoundTripTimeStdDevMs.
FormatMicrosWithRange extended: 26.86 (24.50..29.10) stays the default; 26.86 (24.50..29.10) ⚠️5.2% appears when CV exceeds the threshold.
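A minimal sketch of the CV computation and the flagged output format follows. The `StabilityStats` class and its method names are hypothetical stand-ins for the harness helpers; the variance uses the same E[X²] - E[X]² population form with a clamp at zero against floating-point round-off.

```csharp
using System;
using System.Globalization;

public static class StabilityStats
{
    public const double UnstableCvThreshold = 0.03; // 3%

    // Population mean, standard deviation and coefficient of variation.
    public static (double Mean, double StdDev, double Cv) Compute(ReadOnlySpan<double> samples)
    {
        if (samples.IsEmpty) throw new ArgumentException("At least one sample is required.");

        double sum = 0, sumOfSquares = 0;
        foreach (double x in samples)
        {
            sum += x;
            sumOfSquares += x * x;
        }

        double mean = sum / samples.Length;
        // Clamp: round-off can push the difference slightly below zero.
        double variance = Math.Max(0, sumOfSquares / samples.Length - mean * mean);
        double stdDev = Math.Sqrt(variance);
        return (mean, stdDev, mean > 0 ? stdDev / mean : 0);
    }

    // "26.86 (24.50..29.10)" normally; a warning suffix with the CV in percent
    // is appended when the CV exceeds the threshold.
    public static string FormatWithRange(double median, double min, double max, double cv)
    {
        string text = string.Create(CultureInfo.InvariantCulture, $"{median:F2} ({min:F2}..{max:F2})");
        if (cv > UnstableCvThreshold)
            text += string.Create(CultureInfo.InvariantCulture, $" ⚠️{cv * 100:F1}%");
        return text;
    }
}
```

Formatting with the invariant culture keeps the decimal point stable across machines, which matters when bench files from different locales are compared.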

3. Per-cell A/B mini-suite filter — optimization-iteration loops often need only one specific cell (e.g. "tuning the Repeated cell for Hungarian charset"). The full 5-cell × 2-engine × 4-measurement suite is overkill for that.

Implementation:

FilterByLayer extended: new small / medium / large / repeated / deep modes — case-insensitive prefix match on TestDataSet.Name
TryParseCliArgs recognizes the new tokens: dotnet run -- repeated runs only the Repeated Strings cell
fastestbyte mode (existing — only AcBinary FastMode + MemoryPack head-to-head) is orthogonal and stacks: dotnet run -- repeated fastestbyte

Markdown output schema change

The ## Results table gains an Iter Ser/Des column at the right edge — visible verification that each row's batch landed near the TargetSampleMs window. RT-only rows show a single Iter value (the RT calibration count); in-mem rows show serIter / desIter.

Header line updated:

Before: Iterations: 1000 | Warmup: 10000 | Samples: 10 (median) | ...
After: Iterations: per-cell adaptive (target ~250 ms/sample) | Warmup: 10000 | Samples: 10 (median) + 1 pilot discarded | ... | UnstableCV threshold: 3%

ACCORE-BIN-T-K7M3: Hot-path UTF-8 transcoder switch — `Utf8Transcoder` → BCL `Utf8.FromUtf16` / `Utf8.ToUtf16`

Priority: P1 · Type: Performance · Status: Closed (2026-05-08) · Related: ACCORE-BIN-T-V4N3 (custom transcoder origin), ACCORE-BIN-T-V4N2 (Phase 3 SIMD multi-byte), ACCORE-BIN-T-V4N4 (Reverted method-split), ACCORE-BIN-T-D9X3 (bench stabilization that made the comparison measurable)

The custom Utf8Transcoder (V4N3) was originally implemented to bypass System.Text.Encoding.UTF8.GetBytes virtual-dispatch + EncoderFallback overhead. The V4N3 audit measured wins vs. the legacy Encoding.UTF8 API. What it did NOT measure: the modern System.Text.Unicode.Utf8.FromUtf16 / Utf8.ToUtf16 API (.NET 7+, tier-1 optimized, used by MemoryPack WriteUtf8 / ReadUtf8 paths internally). Once the bench stabilized (D9X3), a direct A/B comparison surfaced that the BCL modern API consistently outperforms the custom transcoder on the binary serializer's hot path.

Bench A/B (Latin1Long charset, FastMode SGen Compact)

Cell	Ser delta vs MemPack — custom (`EncodeUtf8SinglePass`)	Ser delta vs MemPack — BCL (`Utf8.FromUtf16`)	Improvement
Small	+28.5%	+7.3%	-21pp
Medium	+23.8%	+3.1%	-21pp
Large	+19.6%	+5.1%	-14pp
Repeated	+28.8%	+10.9%	-18pp
Deep	+23.1%	+0.6%	-22pp

Cell	Deser delta vs MemPack — custom (`DecodeUtf8SinglePass`)	Deser delta vs MemPack — BCL (`Utf8.ToUtf16`)	Improvement
Small	+17.6%	-1.2% (parity)	-19pp
Medium	+12.8%	-4.7% (AcBinary wins)	-17pp
Large	+4.9%	-10.3% (AcBinary wins)	-15pp
Repeated	+16.9%	-1.6% (parity)	-18pp
Deep	+7.0%	-9.0% (AcBinary wins)	-16pp

The Deser side flipped from "consistently behind" to "wins on 3 of 5 cells, parity on 2". The Ser side closed the deficit from +20-29% to 0-11%. Both sides show a measurable improvement on every cell.

Why the custom transcoder lost

The V4N3 implementation included a 4-tier SIMD ASCII prefix path (Vector512BW / Vector256 / Vector128 / scalar) plus a DWORD ASCII batch + scalar 4-branch multi-byte fallback. All correct, all SIMD-tuned. But:

Utf8.FromUtf16 is also SIMD-tuned in .NET 9 — the .NET team rewrote it on top of System.Text.Unicode.Utf8 primitives that share infrastructure with Ascii.IsValid / Latin1.GetString. AOT-publish-friendly, branch-friendly, no virtual dispatch (the Utf8 API is static, not via an Encoding instance with virtual-method-table).
The custom transcoder's ASCII prefix path bails out on first non-ASCII byte — on multi-byte content (Latin extended / Cyrillic / CJK) the SIMD path runs only for the leading ASCII span, then the entire remainder falls into per-char scalar 4-branch dispatch. The BCL Utf8.FromUtf16 SIMD-batches multi-byte content too (different algorithm — the BCL doesn't bail on first non-ASCII).
AOT inline budget: the custom transcoder's body grew with the V4N3 / V4N4 / V4N5 additions; in NativeAOT publish the call sites in WriteStringWithDispatch / ReadString* did NOT inline (V4N4 disasm audit confirmed). The BCL Utf8.FromUtf16 is a single static method with a tighter call-site footprint.

Resolution

Landed 2026-05-08. The 8 production hot-path call sites of Utf8Transcoder.* switched to BCL:

File / line	Before	After
`AcBinarySerializer.cs:120`	`Utf8Transcoder.GetUtf8ByteCount`	`Encoding.UTF8.GetByteCount`
`AcBinarySerializer.BinarySerializationContext.cs:694`	`Utf8Transcoder.EncodeUtf8SinglePass`	`Utf8.FromUtf16(...)`
`AcBinarySerializer.BinarySerializationContext.cs:784`	`Utf8Transcoder.EncodeUtf8SinglePass`	`Utf8.FromUtf16(...)`
`AcBinarySerializer.BinarySerializationContext.cs:901`	`Utf8Transcoder.EncodeUtf8SinglePass`	`Utf8.FromUtf16(...)`
`AcBinaryDeserializer.BinaryDeserializationContext.Read.cs:523`	`Utf8Transcoder.CountUtf8Chars`	`Encoding.UTF8.GetCharCount`
`AcBinaryDeserializer.BinaryDeserializationContext.Read.cs:527`	`Utf8Transcoder.DecodeUtf8SinglePass`	`Utf8.ToUtf16(...)`
`AcBinaryDeserializer.BinaryDeserializationContext.Read.cs:565`	`Utf8Transcoder.DecodeUtf8SinglePass`	`Utf8.ToUtf16(...)`
`PropertyMetadataBase.cs:104-109` (ctor-once)	`Utf8Transcoder.GetUtf8ByteCount` + `EncodeUtf8SinglePass` (two-pass)	`Encoding.UTF8.GetBytes(string)` (single-pass with exact-size byte[] return)

The count-only call sites (GetByteCount / GetCharCount) stay on the legacy Encoding.UTF8 API — System.Text.Unicode.Utf8 has no count-only equivalent (only FromUtf16 / ToUtf16 which encode + count combined). For pure count, the legacy API is the optimal tool (single SIMD-tuned scan, no encode/decode work).
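For readers unfamiliar with the static `System.Text.Unicode.Utf8` API, here is a minimal round-trip sketch. It is not the serializer's code: the `Utf8RoundTrip` class is hypothetical, and the pooled worst-case buffer is just one simple sizing strategy chosen for the illustration. It shows the encode call, the status check, and the count-then-decode split described above.

```csharp
using System;
using System.Buffers;
using System.Text;
using System.Text.Unicode;

public static class Utf8RoundTrip
{
    public static byte[] Encode(string value)
    {
        // Worst case: every UTF-16 code unit expands to 3 UTF-8 bytes.
        byte[] rented = ArrayPool<byte>.Shared.Rent(value.Length * 3);
        try
        {
            OperationStatus status = Utf8.FromUtf16(value, rented, out int charsRead, out int bytesWritten);
            if (status != OperationStatus.Done || charsRead != value.Length)
                throw new InvalidOperationException($"Encode stopped early: {status}");
            return rented.AsSpan(0, bytesWritten).ToArray();
        }
        finally
        {
            ArrayPool<byte>.Shared.Return(rented);
        }
    }

    public static string Decode(ReadOnlySpan<byte> utf8)
    {
        // Count-only pass stays on the Encoding API; Utf8 has no count-only call.
        int charCount = Encoding.UTF8.GetCharCount(utf8);
        char[] chars = new char[charCount];
        OperationStatus status = Utf8.ToUtf16(utf8, chars, out int bytesRead, out int charsWritten);
        if (status != OperationStatus.Done || bytesRead != utf8.Length)
            throw new InvalidOperationException($"Decode stopped early: {status}");
        return new string(chars, 0, charsWritten);
    }
}
```

Both calls return an `OperationStatus` rather than throwing, so a too-small destination shows up as `DestinationTooSmall` and must be checked explicitly.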

The Utf8Transcoder.cs file remains in the repo but fully commented out — the class definition is preserved as historical reference / future reactivation if a workload ever surfaces where it could win again. Utf8TranscoderTests.cs is not currently exercising live code.

Lesson — the V4N3 audit's blind spot

The V4N3 (custom transcoder) audit compared against legacy Encoding.UTF8.GetBytes and won. The audit did NOT compare against Utf8.FromUtf16 (the modern API, .NET 7+). On modern runtime the BCL has two UTF-8 transcoders: a legacy one (instance-method on Encoding, virtual dispatch) and a modern one (static Utf8.FromUtf16 / Utf8.ToUtf16). MemoryPack uses the modern one — that's what we should have been comparing against from the start.

Generalizable lesson: when measuring a custom implementation against a "BCL baseline", verify which BCL API is used by the actual competition (here: MemoryPack source-gen). The Encoding.UTF8.* instance API and System.Text.Unicode.Utf8 static API are different generations of the same logical operation; treating them as interchangeable hides the comparison's scope.

Why P1

Closed the FastMode Compact mode Ser deficit from +20-29% to ≤11% on every cell (Latin1Long benchmark)
Flipped the Deser side from a +5-18% deficit to AcBinary winning on 3 of 5 cells and parity on 2 (Latin1Long benchmark)
One-time fixed cost (replacing 8 production call sites); every future bench run benefits
Removed a load-bearing ~600-line custom SIMD module from the maintained surface area; future maintainers don't need to reason about Vector512BW / cross-lane shuffle / 5-popcount surrogate-pair correctness — the BCL handles it

Follow-up — `Utf8Transcoder.cs` cleanup

The file is fully commented out. Either:

Delete entirely (preferred for repo cleanliness) — Utf8TranscoderTests.cs then needs deletion or revival as a regression-only guard
Keep the comment-block as historical reference, with a header comment pointing to this entry

Decision deferred — the comment-block does no harm to build / runtime. Address when the next docs-archive sweep runs.

ACCORE-BIN-T-P3X7: Profile-driven Compact-mode Ser optimization roadmap (post-K7M3 hot-path analysis)

Priority: P2 · Type: Performance roadmap · Status: Open · Related: ACCORE-BIN-T-K7M3 (BCL UTF-8 transcoder switch, a prerequisite), ACCORE-BIN-T-D9X3 (bench stabilization), ACCORE-BIN-T-S2X9 (markerless schema lane: the primitive property marker has already been phased out in SGen), ACCORE-BIN-T-V4N4 (audit methodology reference)

A VS Performance Profiler session on 2026-05-08 (4 sec range, AcBinary FastMode Serialize, Latin1Long charset, FastWire mode) produced a concrete hot-path decomposition of the state after the K7M3 BCL switch. String encoding is no longer the bottleneck (Utf8.FromUtf16 is SIMD-tuned), so the remaining AcBinary-specific overhead can now be identified.

Profile session data (Self CPU%)

Self CPU%	Function	Category
39.77%	`System.Buffer._Memmove`	Shared with MemPack (UTF-16 raw + return-time `byte[]` copy), NOT AcBinary-specific
10.03%	`AcBinarySerializer.Serialize<T>`	Top-level (context-acquire, type lookup, return-alloc)
7.48%	`TestMeasurementPoint_GeneratedWriter.WriteProperties`	SGen template (smallest leaf type, ~12500 calls on the Large cell)
5.31%	`WriteStringWithDispatch`	String hot path
3.23%	`TestMeasurement_GeneratedWriter.WriteProperties`	SGen
1.66%	`WriteVarUIntMultiByteUnsafe`	VarUInt int-property encode
1.10%	`TestPallet_GeneratedWriter.WriteProperties`	SGen
0.39%	`TestOrderItem_GeneratedWriter.WriteProperties`	SGen
0.32%	`SharedUser_GeneratedWriter.WriteProperties`	SGen
0.05%	`ArrayBinaryOutput.Grow`	Buffer-grow (rare, minor issue)

Total SGen WriteProperties Self CPU: ~12.6%, the largest AcBinary-specific surface.

Line-level drill-down of AcBinarySerializer.Serialize<T> (AcBinarySerializer.cs:312-335):

WriteObject(value, wrapper, context, 0) Total: 28.05%: the entire serialization tree (SGen + Writer hot path)
context.Output.ToArray(context._buffer, context._position) Total: 47.37%: final byte[] alloc + content memcpy (the bulk of the 39.77% _Memmove Self)

MemPack comparison (for reference)

How MemPack's Serialize<T>(T value) works:

[ThreadStatic] writer state: no pool rental, no lock, no concurrent dictionary lookup
ReusableLinkedArrayBufferWriter: linked chunk list (4 KB → 8 KB → 16 KB, geometric); buffer-grow = adding a new chunk, with no memcpy of the old data
ToArrayAndReset(): at the end, alloc + chunks → byte[] memcpy (overhead shared with AcBinary)

AcBinary uses AcquireArrayOutputContext(options) pool rental + a linear byte[] grown via Array.Resize + Output.ToArray(...): two memcpy costs (grow + return), but the grow is rare.

Prioritized optimization ideas

A. SGen `WriteProperties`: ensure-capacity batching (expected: -1-3pp Ser, revised estimate)

Current SGen template per-property emit (every write performs its own ensure):

context.WriteVarInt(obj.Id);                    // ensure(5) + write(1-5)
context.WriteByte(BinaryTypeCode.Object);        // ensure(1) + write(1)
context.WriteVarInt((int)obj.Status);            // ensure(5) + write(1-5)
context.WriteRaw(obj.Weight);                     // ensure(8) + write(8)

Grouped ensure pattern:

context.EnsureCapacity(maxBytesForGroup);        // worst-case sum, 1× call
context.WriteVarIntUnsafe(obj.Id);                // no ensure (buffer write only)
context.WriteByteUnsafe(BinaryTypeCode.Object);   // no ensure
context.WriteVarIntUnsafe((int)obj.Status);
context.WriteRawUnsafe(obj.Weight);

The WriteProperties template in AcBinarySourceGenerator.cs needs to change:

Extract contiguous primitive groups from the property list (broken at Object/Collection properties: deep recursion, size not computable in advance)
Compute the worst-case size per group at compile time (primitive type sizes are fixed or have a known worst case)
A single EnsureCapacity(sum) + bulk *Unsafe writes

Required *Unsafe writers: WriteVarUIntUnsafe already exists. WriteByteUnsafe and WriteRawUnsafe<T> probably need to be added to BinarySerializationContext.

Estimate revision (2026-05-08): the original -4-6pp estimate was an upper bound. An inlined EnsureCapacity costs ~1-2 ns/call (on the hot path branch prediction is perfect; Grow is never reached). 10 properties × 1.5 ns = ~15 ns / object saved by batching. Latin1Long Large cell: 1250 instances × 13 ns = ~16 µs / 120 µs Ser ≈ ~13% upper bound, but only from reducing the number of ensure calls. The 12.6% SGen WriteProperties Self CPU is NOT only ensure checks; it also includes the HasPropertyFilter branch check, null-check + depth-check dispatch, the Unsafe.As<T> cast, etc. (see F). Ensure-batching on its own is realistically a 1-3pp Ser improvement.

Wire format unchanged, backward compatible, low risk. The effect is measurable on every cell (the TestOrder cell structure has ~100+ primitive properties per Object instance).
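A minimal sketch of the grouped-ensure idea, assuming a toy writer rather than the real BinarySerializationContext. `SketchWriter`, `BatchedEmitSketch`, and `WriteDoubleUnsafe` are hypothetical names, and the VarInt layout (ZigZag followed by 7-bit groups) is an assumption made only so the example is self-contained. It shows one capacity check covering a group whose worst-case size is a compile-time constant.

```csharp
using System;
using System.Buffers.Binary;

// Hypothetical stand-in for the real context; only the batching idea is illustrated.
public sealed class SketchWriter
{
    private byte[] _buffer = new byte[16];
    private int _position;

    // Counts capacity checks so the effect of batching is observable.
    public int EnsureCalls { get; private set; }

    public ReadOnlySpan<byte> Written => _buffer.AsSpan(0, _position);

    public void EnsureCapacity(int count)
    {
        EnsureCalls++;
        if (_buffer.Length - _position < count)
            Array.Resize(ref _buffer, Math.Max(_buffer.Length * 2, _position + count));
    }

    // "Unsafe" here only means "no capacity check"; the CLR still bounds-checks the array.
    public void WriteByteUnsafe(byte value) => _buffer[_position++] = value;

    // Assumed layout: ZigZag + 7-bit groups, at most 5 bytes for a 32-bit value.
    public void WriteVarIntUnsafe(int value)
    {
        uint v = unchecked((uint)((value << 1) ^ (value >> 31)));
        while (v >= 0x80)
        {
            _buffer[_position++] = unchecked((byte)(v | 0x80));
            v >>= 7;
        }
        _buffer[_position++] = (byte)v;
    }

    public void WriteDoubleUnsafe(double value)
    {
        long bits = BitConverter.DoubleToInt64Bits(value);
        BinaryPrimitives.WriteInt64LittleEndian(_buffer.AsSpan(_position, 8), bits);
        _position += 8;
    }
}

public static class BatchedEmitSketch
{
    // Worst case for the group: VarInt(5) + marker(1) + VarInt(5) + double(8).
    public const int MaxBytesForGroup = 5 + 1 + 5 + 8;

    public static void WriteGroup(SketchWriter w, int id, byte marker, int status, double weight)
    {
        w.EnsureCapacity(MaxBytesForGroup); // one check for the whole group
        w.WriteVarIntUnsafe(id);
        w.WriteByteUnsafe(marker);
        w.WriteVarIntUnsafe(status);
        w.WriteDoubleUnsafe(weight);
    }
}
```

In a generator, `MaxBytesForGroup` would be computed per contiguous primitive group and emitted as a literal; the group ends wherever an Object or Collection property appears.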

B. `WriteStringWithDispatch` Compact branch batch-write (expected: -1-2pp Ser)

The FastWire branch was already converted to a single ensure + direct write in K7M3 plus the 2026-05-08 batch-write fix. The Compact branch still uses the same 3-step pattern (post-encode tier-shift CopyTo when actualHeader < reserveHeader, plus a header write based on the tier). Batch-write can be applied to the Compact branch as well: a single EnsureCapacity with the worst-case tier + a direct header write after Utf8.FromUtf16.

C. Thread-static context (expected: -2-4pp Ser, LARGE refactor)

The MemPack [ThreadStatic] pattern could reduce the pool-rental overhead of AcquireArrayOutputContext(options). The current pool rental:

Pool dictionary lookup (possibly with a lock)
Context state init / reset on every call

Thread-static replacement:

Per-thread cached context, no lock
Context reset on every call stays the same, but the state allocation runs only once

Refactor considerations:

BinarySerializationContext state storage is not thread-safe by itself; both pool rental and thread-static guarantee single-thread use
The options parameter may affect the state-init logic; in a multi-options scenario the thread-static state must be reset
Concurrent serialize calls (multiple threads at once): each thread would have its own state; no cross-thread sharing is needed
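A minimal sketch of the thread-static pattern described above, under the assumption that a context can be fully reset from the options on each call. `SketchContext` and `ThreadStaticContextSketch` are hypothetical stand-ins, not the real BinarySerializationContext or its pool.

```csharp
using System;

// Hypothetical stand-in for the real serialization context.
public sealed class SketchContext
{
    public int ResetCount { get; private set; }
    public object Options { get; private set; }

    // Re-initializes per-call state; the object itself is reused.
    public void Reset(object options)
    {
        Options = options;
        ResetCount++;
    }
}

public static class ThreadStaticContextSketch
{
    // One cached context per thread: no lock and no dictionary lookup on acquire.
    [ThreadStatic] private static SketchContext t_context;

    public static SketchContext Acquire(object options)
    {
        var ctx = t_context ??= new SketchContext(); // allocated once per thread
        ctx.Reset(options);                          // still reset on every call
        return ctx;
    }
}
```

One design point this sketch does not handle is re-entrancy: if a Serialize call can nest on the same thread (for example from a custom formatter), the inner call would overwrite the outer call's state, so a real implementation would need an in-use flag or a fallback to a rented context.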

D. Linked-array buffer chunk strategy (small effect, LARGE refactor)

MemPack's ReusableLinkedArrayBufferWriter linked chunk list would replace the linear byte[] grow strategy. A buffer grow adds a new chunk (no memcpy of the old data).

According to the profile, ArrayBinaryOutput.Grow Self CPU is only 0.05%: buffer grow runs rarely, and the default capacity is large enough for the Large cell. Small effect, large refactor. Low priority.

F. SGen `HasPropertyFilter` lift-out to the start of the `WriteProperties` method (expected: -2-4pp Ser)

The current SGen template checks the property filter before every property emit:

public void WriteProperties<TOutput>(object value, ...)
{
    var obj = Unsafe.As<TestPallet>(value);

    if (context.HasPropertyFilter)                     // ← checked on EVERY property!
    {
        var fc_Category = new BinaryPropertyFilterContext(obj, ..., "Category", ...);
        if (!context.PropertyFilter!(in fc_Category)) {
            context.WriteByte(BinaryTypeCode.PropertySkip);
            goto skip_Category;
        }
    }
    if (obj.Category == null) context.WriteByte(BinaryTypeCode.PropertySkip);
    else if (depth > context.MaxDepth) context.WriteByte(BinaryTypeCode.Null);
    else { context.WriteByte(BinaryTypeCode.Object); ...WriteProperties... }
    skip_Category:;

    if (context.HasPropertyFilter) { /* same for Inspector */ }   // ← again!
    // ... repeated 10× across the property list
}

On the TestOrder benchmark workload the per-property HasPropertyFilter branch check is always false (the benchmark does not use a property filter). The check still runs on every property: it sits in the code cache and is well predicted, yet it still costs CPU cycles.

Optimization (two-path SGen code generation):

public void WriteProperties<TOutput>(object value, ..., int depth)
{
    var obj = Unsafe.As<TestPallet>(value);

    if (context.HasPropertyFilter)
    {
        WritePropertiesWithFilter(obj, context, depth);    // rare path: full per-property check
        return;
    }

    // Fast path — NO filter check anywhere
    if (obj.Category == null) context.WriteByte(BinaryTypeCode.PropertySkip);
    else if (depth > context.MaxDepth) context.WriteByte(BinaryTypeCode.Null);
    else { ... }
    // (no skip_Category goto — never needed)

    context.WriteVarInt(obj.Id);                       // primitive, no filter check
    // ... rest of properties without HasPropertyFilter check
}

// Separately emitted method for the rare path:
private static void WritePropertiesWithFilter<TOutput>(TestPallet obj, ..., int depth)
{
    // Full per-property filter-aware code (the current behavior)
}

AcBinarySourceGenerator.cs needs to change:

A single HasPropertyFilter check at the start of the WriteProperties method
Two different code paths are emitted:
- Fast path (default, no filter): no per-property if (context.HasPropertyFilter) check, no filter-context allocation + lambda call, no goto skip_X
- Slow path (filter aware, separate static method): the current behavior

Expected gain: the fast path eliminates ~10 checks / object × 1-2 ns / branch ≈ ~15-20 ns / object. Latin1Long Large cell: 1250 instances × 18 ns = ~22 µs / 120 µs Ser ≈ ~18% upper estimate; realistically a 2-4pp Ser improvement (tempered by the increased code bloat and the effect on JIT inlining).

Can be combined with A: together, A + F can target a 3-7pp improvement, a significant reduction of the 12.6% SGen WriteProperties Self CPU.

Wire format unchanged; code size grows slightly (two paths are generated for every type), but the fast path is easier for the JIT to inline.

G. SGen `WriteProperties` null/depth/object-ref combining (related to F)

For complex (Object) properties, the 3-way dispatch:

if (obj.X == null) context.WriteByte(BinaryTypeCode.PropertySkip);
else if (depth > context.MaxDepth) context.WriteByte(BinaryTypeCode.Null);
else { context.WriteByte(BinaryTypeCode.Object); X_GeneratedWriter.Instance.WriteProperties(...); }

This runs on every complex property. Possible optimization: turn the depth > MaxDepth check into a method-level branch (check once at the start of the method, then simplify the per-property branch). The effect is small, though, and MaxDepth is typically not reached (on most workloads depth < MaxDepth).

Low priority, combined with F.

E. `WriteVarUIntMultiByteUnsafe` (1.66% Self) → fix-int (expected: -1pp Ser, NOT recommended on its own)

WriteVarInt encoding (signed int property encode, ZigZag + VarUInt) is common in the SGen templates (Id, Status, TrayCount, etc.). The multi-byte branch accounts for 1.66% Self CPU.

Switching to fix-int (4 bytes) increases wire size (+3 bytes / property on small ints), which erodes the compactness advantage of the wire format. It is only worthwhile in the context of the ACCORE-BIN-T-S2X9 markerless lane, where the fix-int swap is compensated on the wire by the removal of the property marker.

Shared, NOT AcBinary-specific overhead (cannot be optimized)

The 39.77% Buffer._Memmove Self CPU and the 47.37% Output.ToArray() Total are the return-time byte[] alloc + content memcpy, which run on every byte[] Serialize(T) call. Both engines pay this cost (MemPack's ToArrayAndReset() also does alloc + memcpy from the chunks). It is unavoidable given the API contract (byte[] Serialize(T)).

Callers who need maximum performance should use the IBufferWriter<byte> overload (AcBinaryBufferWriterBenchmark vs MemoryPackBufferWriterBenchmark is the apples-to-apples comparison in the benchmark; both engines do the same thing).

Acceptance (per-section)

A (SGen ensure-batching): Latin1Long FastWire bench, AcBinary Ser delta vs MemPack improves by 1-3pp on every cell
F (HasPropertyFilter lift-out): Latin1Long Ser delta -2-4pp; with A + F together, SGen WriteProperties Self CPU ≤ 8% (currently ~12.6%)
G (null/depth/object-ref combining): small effect, combined with F
B (WriteStringWithDispatch Compact batch-write): Latin1Long Compact bench, AcBinary Ser delta vs MemPack ≤ +5% on every cell
C (Thread-static context): Serialize<T> Self CPU ≤ 6% (currently ~10%)
D (Linked-array): not a priority; buffer-grow Self CPU is already ≤ 0.05%
E (VarInt → fix-int): measure only in the context of the S2X9 markerless lane sprint

Order

A + F combined: a comprehensive refactor of the SGen WriteProperties template (ensure-batching + HasPropertyFilter lift-out + possibly G null/depth-combine). Together a ~3-7pp Ser improvement is expected on every cell. An isolated change confined to AcBinarySourceGenerator.cs; wire format unchanged.
B: ~1-2pp improvement, same pattern as the K7M3 FastWire batch-write
C: ~2-4pp, but a LARGE refactor (thread-safety and pool semantics need review)
D: low priority (small effect, large refactor)
E: only in the S2X9 context

Trigger

A + F → can be implemented right away; they should be combined within the SGen template (a single template rework is clearly better than separate refactor rounds). All further measurements depend on this.
B → after A+F; applies a similar pattern at another writer site
C → if the 10% Serialize Self CPU still dominates after A+F+B
D, E → optional, depending on the A/F/B/C results

ACCORE-BIN-T-Q5T2: Self-describing wire format with duplicated object markers + UTF-16 string marker (per-type/property encoding choice)

Priority: P2 · Type: Architecture / Performance · Status: Open · Related: ACCORE-BIN-T-P3X7 (profile-driven roadmap, small-data slowdown diagnosis), ACCORE-BIN-T-K7M3 (BCL UTF-8 transcoder, a prerequisite), ACCORE-BIN-T-S2X9 (markerless schema lane), ACCORE-BIN-T-V4N2 (UTF-8 SIMD)

This came up during the 2026-05-08 design session as a response to the small-data slowdown problem and to the widespread if (FastWire) / if (UseMetadata) runtime branches. Goal: move the wire mode out of the global header and allow per-object/per-property encoding freedom via attributes, while preserving SGen↔Runtime wire compatibility.

LLM Context (cold-start)

This context is enough for a fresh session to read the entry:

Wire model: AcBinary runs two parallel serialization paths: SGen (compile-time generated, for [AcBinarySerializable] types) and Runtime (reflection + Expression.Compile). Both produce and read the same wire (interop guarantee, see BINARY_SGEN.md "Hybrid Execution Model").

Markerless body: within an object scope, primitive properties (int, long, double, …) are written directly to the wire without a marker byte. The reader knows the order from the compile-time schema (SGen) or from the OrderedProperties metadata (Runtime). The wire starts with an object prefix (1-byte marker), followed by the markerless body.

Existing object-marker family (writers in AcBinarySerializer.BinarySerializationContext.cs + the reader-dispatch switch in AcBinaryDeserializer.cs):

Object: plain first occurrence
ObjectWithTypeName: polymorphic (runtimeType != declaredType)
ObjectFullMarkerIId / ObjectFullMarkerAll: RefHandling=IId|All first occurrence
ObjectRef / ObjectRefIId: subsequent occurrence (ID only; NOT duplicated, since no primitive properties surround it)

OPT-OUT pattern (current convention): by default SGen is flexible and generates every runtime branch (e.g. if (context.UseRefHandling)). A class attribute disables the feature → SGen omits the branch → a drastic optimization. Q5T2 extends this pattern to encoding choice.

Naming convention: PascalCase, suffix variants (Object → ObjectVarUInt, String → StringUtf16). NOT Object_NoZZ, NOT ObjVU.

Motivation

The current AcBinaryOptions.WireMode (FastMode vs Compact) is a payload-level global flag:

The code contains many if (FastWire) { ... } else { ... } branches (see WriteVarInt at line 514, WriteStringWithDispatch, WriteValueNonPrimitive, property writers)
The developer cannot optimize at a granular level (e.g. [NoZZ] on one hot type, default for the rest)
From a schema-evolution standpoint: if the server changes an attribute on a type, clients (even older versions) must be able to read the new wire without recompiling

According to the ACCORE-BIN-T-P3X7 profile-bench measurement, the small-data slowdown (Latin1Long Small +2.6%, Medium +1.5% AcBinary slowdown relative to MemPack) largely comes from VarUInt per-call overhead (ZigZag shift + multi-byte branch loop). With a type-level [IntEncoding=VarUInt] attribute the developer can switch non-negative properties to VarUInt without ZigZag → the ZigZag shift is eliminated, a measurable gain on small data.

Wire format design

5 new BinaryTypeCode markers (naming TBD: *VarUInt or *NoZZ suffix, to be finalized at implementation time):

| New marker | Purpose | Where it applies |
|---|---|---|
| `ObjectVarUInt` | Primitive int/long/enum values of the object scope in NoZigZag VarUInt encoding | plain object first occurrence |
| `ObjectWithTypeNameVarUInt` | NoZZ variant of the polymorphic first occurrence | when `runtimeType != declaredType` |
| `ObjectFullMarkerIIdVarUInt` | NoZZ variant of the `RefHandling=IId` first occurrence | first occurrence only; subsequent `ObjectRefIId` unchanged |
| `ObjectFullMarkerAllVarUInt` | NoZZ variant of the `RefHandling=All` first occurrence | first occurrence only; subsequent `ObjectRef` unchanged |
| `StringUtf16` | UTF-16 encoded string content (property-level) | anywhere a string property is emitted |

Wire example:

[ObjectVarUInt marker]                  ← scope-level: int properties use VarUInt-NoZZ
  WriteVarUInt(obj.Id)                   ← markerless body, encoding determined by the marker
  WriteVarUInt(obj.Status)
  [String marker] UTF-8(obj.Notes)        ← default UTF-8
  [StringUtf16 marker] UTF-16(obj.Name)   ← property-level override

Byte-level example (Order { Id=42, Status=3, Notes="ok" }, class-level IntEncoding=VarUInt):

Default ZigZag wire: [Object] [0x54] (VarInt 42 ZigZag: ((42<<1)^(42>>31))=84) [0x06] (VarInt 3 ZigZag: 6) [String] [0x02] 0x6F 0x6B
New VarUInt wire: [ObjectVarUInt] [0x2A] (VarUInt 42 raw: 0x2A) [0x03] (VarUInt 3 raw: 0x03) [String] [0x02] 0x6F 0x6B
Body order and byte count are unchanged; only the encoding rules differ. Strings are still marked the same way (UTF-8 default here). With a string-encoding override the layout is [StringUtf16] [char-count] [2-byte-per-char].

The wire around primitive properties stays markerless: the body encoding is determined by the object marker, not by a per-property byte. Wire bloat only occurs where a marker already exists today (object prefix, string marker).
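The byte values in the example above can be reproduced with a small sketch. It assumes VarUInt uses little-endian 7-bit groups with a continuation bit (the exact AcBinary layout is not stated in this entry); `VarIntSketch` and its methods are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

public static class VarIntSketch
{
    // ZigZag maps signed values to unsigned so that small magnitudes stay small.
    public static uint ZigZag(int value) =>
        unchecked((uint)((value << 1) ^ (value >> 31)));

    // Assumed layout: 7 data bits per byte, high bit set on all but the last byte.
    public static byte[] EncodeVarUInt(uint value)
    {
        var bytes = new List<byte>(5);
        while (value >= 0x80)
        {
            bytes.Add(unchecked((byte)(value | 0x80)));
            value >>= 7;
        }
        bytes.Add((byte)value);
        return bytes.ToArray();
    }
}
```

Under this assumption, `EncodeVarUInt(ZigZag(42))` yields the single byte 0x54 and `EncodeVarUInt(42)` yields 0x2A, matching the two wires above. The sizes only diverge for larger values: a non-negative value in the range 64..127 needs two bytes after ZigZag but one byte raw.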

Attribute design

Object-level (because the object marker is also object-level):

[AcBinarySerializable(IntEncoding = IntEncoding.VarUInt)]
public class Order { ... }

Property-level (strings only, because the string marker is also per-property):

public class Order {
    [AcBinaryEncoding(StringEncoding.Utf16)]
    public string CustomerName { get; set; }
}

New public API elements:

AcBinaryEncodingAttribute (target: Class | Property)
IntEncoding enum (Default = ZigZag VarInt, VarUInt = NoZigZag)
StringEncoding enum (Default = UTF-8, Utf16 = UTF-16)
AcBinaryOptions.IntEncoding and AcBinaryOptions.StringEncoding runtime fallback options

Encoding choice precedence (writer side)

Property attribute (strongest), e.g. [AcBinaryEncoding(StringEncoding.Utf16)]
Class attribute, e.g. [AcBinarySerializable(IntEncoding=VarUInt)]
AcBinaryOptions runtime option, e.g. options.StringEncoding = Utf16
Built-in default: ZigZag-VarInt + UTF-8
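The precedence chain above can be sketched as a null-coalescing lookup. This is only an illustration: the enums below are stand-ins mirroring the names proposed in this entry, `EncodingPrecedenceSketch.Resolve` is a hypothetical helper, and representing "not specified" as `null` is an assumption (the planned API may encode it differently).

```csharp
using System;

// Stand-ins for the proposed enums; the real definitions do not exist yet.
public enum IntEncoding { Default, VarUInt }
public enum StringEncoding { Default, Utf16 }

public static class EncodingPrecedenceSketch
{
    // First non-null source wins: property attribute, class attribute,
    // runtime option, then the built-in default.
    public static T Resolve<T>(T? propertyAttribute, T? classAttribute, T? runtimeOption, T builtInDefault)
        where T : struct
        => propertyAttribute ?? classAttribute ?? runtimeOption ?? builtInDefault;
}
```

For example, `Resolve<StringEncoding>(StringEncoding.Utf16, null, StringEncoding.Default, StringEncoding.Default)` returns `Utf16`, because the property-level value outranks the runtime option.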

Roles and paths

| Path | Encoding choice |
|---|---|
| SGen writer (with attribute) | Pinned at compile time, hard-coded marker + encoding emit (NO runtime branch); this is the existing OPT-OUT pattern (like disabling `RefHandling`/`Interning`) |
| SGen writer (no attribute) | Runtime branch on the `context.IntEncoding`/`context.StringEncoding` option; two paths are generated and the choice is made at runtime |
| SGen reader | Marker dispatch (NOT a hard-coded marker expectation; it decides at runtime whether `Object` or `ObjectVarUInt` arrived and reads accordingly) |
| Runtime writer (reflection-based) | Reflection attribute read + option fallback + default fallback; same precedence as for SGen |
| Runtime reader | Marker dispatch (universal; no attribute or option is used for the encoding decision, only the marker byte) |

⚠️ SGen reader marker dispatch is MANDATORY (NOT a hard-coded marker expectation). A concrete scenario it handles:

The server serializes Order in Runtime mode. The attribute on the Order class has changed since the previous server deploy (a new deploy introduced [IntEncoding=VarUInt]). The server's Runtime writer reads the new attribute via reflection → it emits the ObjectVarUInt marker to the wire.

An old client receives the payload without being recompiled. If the client's SGen reader read with a hard-coded Object-marker expectation → panic / mismatch.

With marker dispatch the client decodes either marker correctly, regardless of whether the attribute was present on the client-side compile-time Order type.

This provides the "server-side attribute-change doesn't break clients" guarantee.
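A minimal sketch of reader-side marker dispatch for the scenario above. The marker byte values, the `MarkerDispatchSketch` class, and the 7-bit VarUInt layout are all assumptions for illustration; the real reader also handles strings, references, and bounds hardening, which are omitted here.

```csharp
using System;
using System.IO;

public static class MarkerDispatchSketch
{
    public const byte MarkerObject = 0x10;         // hypothetical value
    public const byte MarkerObjectVarUInt = 0x11;  // hypothetical value

    // Reads an Order-like body { Id, Status } and picks the int decoding from the marker,
    // not from any compile-time attribute on the client side.
    public static (int Id, int Status) ReadOrder(ReadOnlySpan<byte> wire)
    {
        int pos = 0;
        byte marker = wire[pos++];
        bool zigZag = marker switch
        {
            MarkerObject => true,
            MarkerObjectVarUInt => false,
            _ => throw new InvalidDataException($"Unexpected marker 0x{marker:X2}")
        };

        int id = ReadInt(wire, ref pos, zigZag);
        int status = ReadInt(wire, ref pos, zigZag);
        return (id, status);
    }

    private static int ReadInt(ReadOnlySpan<byte> wire, ref int pos, bool zigZag)
    {
        uint raw = 0;
        int shift = 0;
        byte b;
        do
        {
            b = wire[pos++];
            raw |= (uint)(b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);

        // ZigZag decode when the scope marker says so; otherwise the value is already raw.
        return zigZag ? (int)(raw >> 1) ^ -(int)(raw & 1) : (int)raw;
    }
}
```

Both `{ 0x10, 0x54, 0x06 }` and `{ 0x11, 0x2A, 0x03 }` decode to Id = 42 and Status = 3, so a client compiled before the attribute change still reads the new payload.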

Compatibility guarantees

| Interaction | Result |
|---|---|
| SGen-write (NoZZ attr) → SGen-read | OK (marker-dispatch) |
| SGen-write (NoZZ attr) → Runtime-read | OK (marker-dispatch) |
| Runtime-write (option=NoZZ) → SGen-read | OK (marker-dispatch) |
| Runtime-write (option=NoZZ) → Runtime-read | OK (marker-dispatch) |
| Server-attribute-changed → old client (no recompile) | OK (the client only reads the marker) |
| Mixed payload (one object NoZZ, another default) | OK (every object marker is its own scope) |

Implementation steps

BinaryTypeCode const extension: 5 new byte values (range allocation: into the next free slots, following the organization of the existing enum). Update the wire-format spec in BINARY_FORMAT.md.
AcBinaryEncodingAttribute + IntEncoding + StringEncoding enums: new files in the AyCode.Core/Serializers/Binaries/ folder.
Add the AcBinaryOptions.IntEncoding + AcBinaryOptions.StringEncoding options (default = Default).
WriteStringUtf16 / ReadStringUtf16 context helpers: MemoryMarshal.Cast<char,byte> direct copy + length prefix (VarUInt char count).
Runtime writer reflection: BinarySerializeTypeMetadata cache with IntEncoding and per-property StringEncoding flags (based on attributes). Encoding emit follows the precedence.
SGen writer template: attribute processing in EmitWriteValue. If an attribute is present → compile-time hard-coded emit; if not → runtime-branch emit on the context option.
SGen reader template: EmitReadValue with marker dispatch (object-marker scope-encoding-mode tracking + string-marker per-property dispatch).
Runtime reader update: object-marker dispatch into the scope-encoding state (e.g. BinaryDeserializationContext.CurrentIntEncoding), string-marker per-property dispatch.
Cross-mode tests: every write-read combination (SGen↔SGen, SGen↔Runtime, Runtime↔SGen, Runtime↔Runtime) in every encoding combination (default, attr-only, option-only, attr+option, mixed payload).
Doc: BINARY_FORMAT.md wire-format spec, BINARY_OPTIONS.md new options, BINARY_SGEN.md precedence + roles table.
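The WriteStringUtf16 / ReadStringUtf16 step in the list above can be sketched as follows. This is a sketch under stated assumptions: `Utf16StringSketch` is a hypothetical class operating on plain spans rather than the real context, the marker byte is omitted, the VarUInt layout is assumed to be 7-bit groups, and the payload uses the platform's native byte order (little-endian on common .NET targets), which a real wire format would need to pin down.

```csharp
using System;
using System.Runtime.InteropServices;

public static class Utf16StringSketch
{
    // Assumed layout: [VarUInt char-count] [2 bytes per char, native byte order].
    // Returns the number of bytes written; the caller must supply enough space.
    public static int Write(string value, Span<byte> destination)
    {
        int pos = 0;
        uint count = (uint)value.Length;
        while (count >= 0x80)
        {
            destination[pos++] = unchecked((byte)(count | 0x80));
            count >>= 7;
        }
        destination[pos++] = (byte)count;

        // Reinterpret the UTF-16 chars as bytes and copy them directly, with no transcoding.
        ReadOnlySpan<byte> payload = MemoryMarshal.Cast<char, byte>(value.AsSpan());
        payload.CopyTo(destination.Slice(pos));
        return pos + payload.Length;
    }

    public static string Read(ReadOnlySpan<byte> source, out int bytesConsumed)
    {
        int pos = 0, shift = 0;
        uint count = 0;
        byte b;
        do
        {
            b = source[pos++];
            count |= (uint)(b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);

        int charCount = checked((int)count);
        int byteLength = checked(charCount * 2);

        // Copy into a char[] first so the result does not depend on source alignment.
        var chars = new char[charCount];
        source.Slice(pos, byteLength).CopyTo(MemoryMarshal.AsBytes(chars.AsSpan()));
        bytesConsumed = pos + byteLength;
        return new string(chars);
    }
}
```

Writing "ok" produces 5 bytes (one length byte plus two 2-byte chars), compared with 3 bytes for the UTF-8 form shown earlier, which is why the entry keeps UTF-16 as a per-property opt-in rather than a default.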

Acceptance

5 new BinaryTypeCode markers, naming convention documented
AcBinaryEncodingAttribute + 2 enums + 2 option extensions working
Round-trip tests green in every cross-mode combination
Wire bloat with the default encoding is 0 bytes (no new per-property marker)
Latin1Long Small bench: on a type with [IntEncoding=VarUInt], AcBinary slowdown ≤ MemPack +0.5pp (currently +2.6%)
BINARY_FORMAT.md/BINARY_OPTIONS.md/BINARY_SGEN.md in sync with the wire format and the attributes
The existing WireMode=Fast/Compact distinctions remain compatible (or are migrated to the new encoding attributes; a separate decision at implementation time)

Trigger / Order

Implementation should not start immediately: sections A+F of ACCORE-BIN-T-P3X7 (SGen ensure-batching + HasPropertyFilter lift-out) must be measured first. If A+F already brings the SGen WriteProperties Self CPU down to ≤ 8% and the small-data slowdown thereby drops to ≤ +1pp, this Q5T2 entry is demoted to low priority. If the small-data slowdown persists after A+F → implementing Q5T2 is worthwhile.

Other prerequisite: synchronize with ACCORE-BIN-T-W9F1 (compile-time metadata). The Runtime writer's reflection attribute read can be folded into the generated metadata, which also makes attribute-based encoding choice faster on the runtime path.

Open questions (to be decided at implementation time)

Marker naming: ObjectVarUInt (semantic, named after the encoding) or ObjectNoZZ (shorter)?
Should the IntEncoding parameter be added to [AcBinarySerializable], or should a separate [AcBinaryEncoding] attribute exist at object level too (leaving [AcBinarySerializable] unchanged)?
Future of AcBinaryOptions.WireMode: should the old Fast/Compact enum be migrated to the new IntEncoding/StringEncoding (BC break), or stay as a shortcut default?
