AcBinarySerializer — TODO
This page covers planned work for the binary serializer core (format, SGen, options, deserialization context, buffer writer). Work specific to the streaming I/O layer (AsyncPipeReaderInput + AsyncPipeWriterOutput, multi-message wire framing, sliding-window buffer, producer-consumer synchronization) is tracked separately in BINARY_ASYNCPIPE_TODO.md.
Priority legend
- P0 blocker · P1 important · P2 nice-to-have · P3 idea
ACCORE-BIN-T-P6M4: Universal hotpath optimization guardrails + follow-up backlog
Priority: P1 · Type: Performance
AcBinary is a universal serializer. Hotpath work must avoid benchmark-only overfitting.
For each performance TODO, validate on representative workload mixes (ASCII-heavy, mixed Latin, multi-byte UTF-8; small/medium/large/deep payloads) and evaluate throughput + latency + allocation + wire-size together.
Follow-up backlog (short):
- Split oversized hot methods into inline-friendly dispatcher + cold helpers (writer/reader/populate).
- Add direct fast branches for the most frequent markers before generic table-dispatch.
- Reduce repeated EnsureAvailable checks by grouping fixed-width reads under one bounds check.
- Extend VarUInt fast-path coverage for common 3-byte cases on metadata/index/cache-id routes.
- Reorder populate/property-loop branches by runtime frequency (PropertySkip/Null/primitive fast-setters first).
- Minimize pool/clear overhead by avoiding unnecessary aggressive array clearing in hot lifecycle paths.
- Add early scan-pass short-circuit when options guarantee no ref/intern benefit.
ACCORE-BIN-T-K9M3: Hoist wire codec primitives to context instance methods (ser + deser, feature-aware SGen emit)
Priority: P2 · Type: Refactor + Performance · Related: ACCORE-BIN-T-P6M4 (hotpath guardrails), BINARY_ISSUES.md#accore-bin-i-t7k3 (polymorph compile-time guard)
Motivation
Wire codec logic is currently triplicated:
- SGen-emit inlines marker decode/encode at every property emit site (StringInternFirstSmall, Object/ObjectRefFirst/Null/ObjectRef/FixObj-slot dispatch, etc.).
- Runtime TypeReaderTable dispatches via static (ctx, _) => ReadXxx(ctx) lambdas to per-marker static helpers in AcBinaryDeserializer.
- Cross-type populate (PopulateProperty fallback) repeats the same per-marker switch.
Result: bug-fix risk (three copies drift), ad-hoc divergence (the polymorph ObjectWithTypeName emit was missing on the SGen side for months — ACCORE-BIN-I-T7K3), larger generated assemblies, longer JIT time. A single instance method on the context is the natural single-source-of-truth for each wire primitive.
Pilot landed
ReadAndRegisterInternedStringSmall / Medium moved from static helpers on AcBinaryDeserializer to internal instance methods on BinaryDeserializationContext. All three call paths (TypeReaderTable lambdas, cross-type PopulateProperty switch, SGen-emit EmitReadProp case-body) now call context.ReadAndRegister...(). Generated case-body shrank from 12 lines to 3 per case — no perf regression ([AggressiveInlining] keeps the JIT/AOT inline footprint identical).
Scope — both ser and deser
Phase A — Decode primitives (deser context)
- ReadStringSmall/Medium/Big (H2Q6 non-ASCII tiers).
- ReadPlainStringAscii (long ASCII tier).
- ReadObject family. Careful: this branches on targetType and on the writer's runtime polymorphic slot table, both of which are call-site-context-specific. May not be a clean hoist; see "Caveat" below.
Phase B — Encode primitives (ser context)
- WriteStringWithDispatch, WriteStringInternFirstWithDispatch: already partly on the context, audit completeness.
- Marker-write helpers (WriteObjectFullMarker*): already on the context post-T7K3.
- Audit: scan ser-side SGen-emit for any inline encode duplication that should move to the context.
Phase C — Feature-conditional SGen-emit
EmitReadProp (and the symmetric emit paths) must consult the per-type Enable*Feature flags to omit case-branches for disabled features. Today the SGen reader handles every marker regardless of the type's feature opt-outs — wasteful, and worse, it silently accepts markers the writer would never emit (instead of fail-fast):
| Disabled feature | Cases to skip in SGen reader emit |
|---|---|
| EnableInternStringFeature = false | StringInterned, StringInternFirstSmall, StringInternFirstMedium |
| EnableRefHandlingFeature = false | ObjectRef, ObjectRefFirst, ObjectWithMetadataRefFirst |
| EnableMetadataFeature = false | ObjectWithMetadata, ObjectWithMetadataRefFirst |
| EnablePolymorphDetectFeature = false | Already guarded by ACBIN002 (compile error if any object property remains on the type); symmetric here. |
After Phase C: leaner generated code per opt-out type AND wire-misuse (e.g. mixed writer/reader feature configurations) surfaces as explicit fail-fast in the default switch arm — same philosophy as ACBIN002.
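The feature-to-marker mapping in the table can be sketched as a small emit-time helper. This is only an illustration under assumptions: `ReaderEmitSketch` and `MarkersToEmit` are hypothetical names, the real EmitReadProp signature is not shown in this document, and treating Null and Object as always-emitted markers is an assumption.

```csharp
using System.Collections.Generic;

// Hypothetical emit-time helper; not the real SGen code.
public static class ReaderEmitSketch
{
    // Returns the marker names for which the generated reader should emit a case branch.
    public static IReadOnlyList<string> MarkersToEmit(bool internString, bool refHandling, bool metadata)
    {
        // Assumption: these markers are emitted regardless of feature flags.
        var cases = new List<string> { "Null", "Object" };

        if (internString)
            cases.AddRange(new[] { "StringInterned", "StringInternFirstSmall", "StringInternFirstMedium" });

        if (refHandling)
            cases.AddRange(new[] { "ObjectRef", "ObjectRefFirst" });

        if (metadata)
            cases.Add("ObjectWithMetadata");

        // Listed under both the ref-handling and the metadata rows,
        // so it is emitted only when both features are enabled.
        if (refHandling && metadata)
            cases.Add("ObjectWithMetadataRefFirst");

        return cases;
    }
}
```

Any marker absent from the returned list would fall into the generated default switch arm, which is where the fail-fast behaviour described above would surface.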
Perf guardrails (NON-NEGOTIABLE)
The hoisting MUST NOT regress SGen hot-path performance. The pilot iteration was a net positive (less IL → faster cold-start JIT, smaller native code, identical inline body); this property has to hold for every subsequent hoist.
Rules of thumb:
- Every hoisted method MUST have [MethodImpl(MethodImplOptions.AggressiveInlining)].
- Body must stay small (≤ ~30 IL instructions after compile) so the JIT/AOT actually inlines; verify via dotnet jit-dasm spot-check on representative callers.
- Single-purpose; no if-branches across distinct call-site contexts (those stay inline at the call site where the context-specific constants are visible).
- Benchmark verification before/after each hoist (Console.FullBenchmark).
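A minimal sketch of what a method satisfying these rules might look like. `DemoContext` and `ReadInt32` are hypothetical stand-ins (not the real BinaryDeserializationContext API), and the fixed-width little-endian read is an assumed wire primitive chosen only because it is small and takes no call-site-specific inputs.

```csharp
using System;
using System.Buffers.Binary;
using System.Runtime.CompilerServices;

// Hypothetical stand-in for a deserialization context.
public sealed class DemoContext
{
    private readonly byte[] _buffer;
    private int _position;

    public DemoContext(byte[] buffer) => _buffer = buffer;

    // Small, single-purpose, wire-bytes-only: no targetType, no setter callback,
    // no try/finally, no stackalloc. AsSpan performs the bounds check.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public int ReadInt32()
    {
        int value = BinaryPrimitives.ReadInt32LittleEndian(_buffer.AsSpan(_position, 4));
        _position += 4;
        return value;
    }
}
```

Whether the JIT actually inlines a body like this still has to be confirmed with the disassembly spot-check and benchmark run listed in the rules.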
JIT / NativeAOT outlook
Modern .NET JIT (≥7) and NativeAOT both honour AggressiveInlining for small bodies → the hoisted methods inline back into the caller at compile time → identical native code to the previous inline-emit. The IL is smaller (less SGen-emit per file), which gives:
- Faster cold-start JIT (less IL to translate on first call per type).
- Smaller assemblies on disk (NativeAOT publish size shrinks).
- Smaller i-cache footprint per active hot type (since SGen-emit no longer balloons per property).
The generic <TInput> specialization remains: each ArrayBinaryInput / SequenceBinaryInput / AsyncPipeReaderInput still gets its own native body (TInput.IsTrustedSingleSegment constant-folds per specialization), so no overhead vs. the current state.
NativeAOT additionally prefers small, single-purpose methods: register-allocation (LSRA) is more effective, peephole / loop-unroll / dead-code passes run faster per method, and the published native image is denser. The previous "giant SGen-emitted ReadProperties body" pattern was actively hostile to AOT in this respect.
Caveat — where NOT to hoist
Not every inline emit is a candidate. If the inline body carries compile-time constants (typeof(TFoo) literal, direct Instance.ReadProperties call on a concrete generated reader class, nameof(prop) constant), hoisting forces those into runtime parameters: constant-folding opportunity lost AND a direct call may become virtual via interface dispatch. The Complex property dispatch (Object → new T + ReadProperties direct call) is in this category and should stay inline at the SGen emit site.
Decision per primitive: can it be expressed as a context method that takes only wire-bytes-relevant inputs (no targetType literal, no per-property setter callback)? If yes → hoist. If no → keep inline.
Acceptance
- Phase A: all shared decode primitives reachable as instance methods on BinaryDeserializationContext. TypeReaderTable + cross-type populate + SGen-emit all call them. SGen-generated case-body for each affected marker is ≤ 3 lines.
- Phase B: ser-side audit complete; any encode duplication closed by hoist or explicit "keep inline, see caveat" note in the SGen comment.
- Phase C: SGen-emit reader honours Enable*Feature flags. Verified by spot-checking generated *.g.cs files: an EnableInternStringFeature=false type's reader does NOT contain StringInternFirstSmall/Medium/StringInterned cases.
- Per-phase benchmark run (Console.FullBenchmark) confirms no hot-path regression (within noise floor).
ACCORE-BIN-T-S8P4: Replace JSON-in-Binary request parameters
Priority: P1 · Type: Refactor · Status: Closed (2026-04-26, landed in commits cdd54d3 2026-04-05 + 3b70070 2026-04-06) · Related: ../XCUT/XCUT_ISSUES.md#accore-xcut-i-x8q1 (canonical), AyCode.Services/docs/SIGNALR/SIGNALR_TODO.md
Migrate client→server request parameters from JSON-in-Binary envelope to direct Binary serialization (matching response path). Coordinated change across client, server, and all consuming projects. Do NOT attempt as side-effect of unrelated work.
Acceptance: SignalPostJsonDataMessage<T> replaced by a SignalPostBinaryDataMessage<T> (or equivalent); no JSON round-trip on the wire for request params; benchmarks confirm no regression.
Resolution
- What: Length-prefixed, per-parameter binary format introduced via SignalRSerializationHelper.SerializeParametersToBinary / DeserializeParametersFromBinary; further unified into SignalParams (single byte[] carrying packed method parameters with SetParameterValues / GetParameterValues).
- Where: AyCode.Services/SignalRs/AcSignalRClientBase.cs, AcWebSignalRHubBase.cs, ISignalParams.cs (server + client dispatch); IAcSignalRHubClient.cs (legacy wrappers).
- Equivalent (not literal SignalPostBinaryDataMessage<T>): SignalParams was chosen over a 1:1 binary wrapper class because it gives fewer indirections on the hot path, type-safe pack/unpack, and a DataSerializerType field on SignalReceiveParams for response format indication.
- Wire impact: No JSON round-trip on the wire for request params; this is a breaking change vs. previous JSON-in-Binary clients/servers (see commit message).
- Legacy types: SignalPostJsonMessage, SignalPostJsonDataMessage<T>, SignalPostMessage<T>, ISignalPostMessage<T> all marked [Obsolete] in IAcSignalRHubClient.cs; deletion tracked separately in AyCode.Services/docs/SIGNALR/SIGNALR_TODO.md#accore-sig-t-s3n8 (gated on consumer migration).
ACCORE-BIN-T-Q2N7: Re-evaluate DiscountProductMapping SGen exclusion
Priority: P3 · Type: Investigation · Related: BINARY_ISSUES.md#accore-bin-i-f1w8
Investigate whether the new int Id shadowing pattern can be handled by SGen (via base-class introspection, property-setter lookup on the base) to eliminate the runtime compiled-expression fallback for this entity class.
ACCORE-BIN-T-W9F1: Generate BinarySerializeTypeMetadata / BinaryDeserializeTypeMetadata at compile time
Priority: P1 · Type: Performance · Related: BINARY_ISSUES.md#accore-bin-i-n6q3
Eliminate the dominant first-call cost (reflection + Expression.Compile in metadata ctor) for SGen types by emitting pre-built metadata from the source generator.
Design outline:
- TypeMetadataBase / BinarySerializeTypeMetadata / BinaryDeserializeTypeMetadata get a second constructor that accepts pre-computed values (hashes, MinWriteSize, ComplexPropertyCount, flags, IsIId, IdAccessorType, etc.). No reflection executes in this ctor.
- Source generator keeps its existing s_typeNameHash / s_propertyHashes static fields (hot-path access stays static, zero indirection) and passes the same references to the metadata: single source of truth, no duplicate computation.
- ModuleInit registers both the writer/reader and the pre-built metadata into a GeneratedMetadataRegistry. GetWrapperSlow consults this registry first, falling back to the reflection-based MetadataFactory for runtime-only types.
- Lazy RuntimeInit() pattern for Expression.Compile property accessors:
  - TypeMetadataBase gets volatile bool _runtimeInitialized + internal void RuntimeInit() (idempotent, no lock needed).
  - GetWrapperSlow calls metadata.RuntimeInit() only when wrapper.GeneratedWriter == null || !Options.UseGeneratedCode. SGen types skip it entirely (they never touch runtime accessors on their own metadata; non-SGen child types have their own metadata and run the factory path normally).
  - Hybrid mode stays correct: an SGen type on the SGen path never uses its own property accessors; a non-SGen child type's metadata runs the reflection ctor as today.
  - volatile guards the flag; multiple contexts may race into RuntimeInit, second run is a no-op.
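The lazy RuntimeInit() idea can be sketched as follows. This is a minimal illustration, not the planned implementation: `TypeMetadataSketch` and `BuildAccessors` are hypothetical names, and the sketch assumes that building the accessors twice yields equivalent results, which is what makes a lock-free flag acceptable.

```csharp
using System;

// Hypothetical stand-in for TypeMetadataBase.
public abstract class TypeMetadataSketch
{
    // volatile: the flag write is not reordered before the accessor writes that precede it.
    private volatile bool _runtimeInitialized;

    internal void RuntimeInit()
    {
        // Fast exit once any thread has completed initialization.
        if (_runtimeInitialized) return;

        // Two racing threads may both reach this point; that is tolerated only
        // because BuildAccessors is assumed to be idempotent.
        BuildAccessors();
        _runtimeInitialized = true;
    }

    // Assumption: overwrites accessor fields with equivalent delegates on every run.
    protected abstract void BuildAccessors();
}
```

If accessor construction ever stops being idempotent, this pattern would need a lock or Lazy<T> instead of a bare flag.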
Thread safety: GlobalMetadataCache is ConcurrentDictionary; generated metadata is registered once at ModuleInit; wrapper construction is per-context and unchanged.
Acceptance:
- Cold benchmark: first Serialize<T> of a fresh SGen type shows no reflection / Expression.Compile on the call stack.
- Runtime fallback (UseGeneratedCode=false) still produces identical wire output and uses the full metadata accessors.
- Deserialize side has parity (same approach for BinaryDeserializeTypeMetadata).
- Existing tests pass; wire format unchanged.
ACCORE-BIN-T-T5J8: JIT Tier 1 warmup for generated hot methods
Priority: P2 · Type: Performance · Related: BINARY_ISSUES.md#accore-bin-i-n6q3
After ACCORE-BIN-T-W9F1 lands, JIT of generated WriteProperties / ScanObject / ScanForDuplicates becomes the dominant residual first-call cost for SGen types. Options to evaluate (benchmark before committing):
- [MethodImpl(MethodImplOptions.AggressiveOptimization)] on the generated hot methods: skips Tier 0, compiles directly at Tier 1. Simple generator change. Trade-off: larger one-time JIT cost in exchange for eliminating the Tier 0→1 recompile step.
- Background prewarm from ModuleInit: Task.Run(() => RuntimeHelpers.PrepareMethod(handle)) for each registered writer/reader method. Parallelizes JIT with app startup. Keep it opt-in (option flag) to avoid surprising consumers with extra startup threads.
- ReadyToRun (R2R) in consuming projects' publish config: pre-compiles IL to native at publish time. External to SGen, complementary. Document as a recommended publish setting.
- Code chunking (split generated methods exceeding a property threshold into sub-methods, e.g. WriteProperties_Part1 / _Part2): measure first. Only beneficial for unusually large types (20+ properties / nested collections). Call overhead can offset gains; JIT inliner may already handle reasonably-sized methods well.
- try/finally audit on hot path: On .NET 9 (project's minimum target), JIT silently refuses to inline any method containing an EH region (AggressiveInlining is ignored). [.NET 10 partially lifts this for same-module try-finally (see dotnet/runtime#112998, merged 2025-03-20), but catch, cross-module, and P/Invoke-stub cases stay blocked. Until project's minimum runtime moves to .NET 10, treat EH as an absolute inlining barrier; even after the upgrade, several sub-cases keep the rule.] Audit scope:
  - Hand-written bridges: check WriteValueGenerated / WriteObjectGenerated / WriteStringGenerated / ScanValueGenerated and any helper called from generated WriteProperties for accidental try/finally/using blocks.
  - SGen output template (AcBinarySourceGenerator.cs): generated WriteProperties / ScanObject / ScanForDuplicates / ReadObject / ReadProperties MUST stay straight-line. Future feature additions ([CustomSerializer] / [CustomDeserializer] hooks, OnSerializing / OnDeserialized callbacks, validation attributes, rented-buffer using blocks) are tempting candidates for try/catch/finally; emit them in separate cold helpers, never inline into the generated hot method. A single accidental try block in WriteProperties makes the whole generated method non-inlinable, killing the SGen Root Fast Path benefit.
  - Resource cleanup (Pool/ArrayPool/Dispose) belongs in Serialize<T> entry-frame only, not in per-property helpers or generated hot methods. See BINARY_IMPLEMENTATION.md Rule #3 (Inlining barriers) and BINARY_SGEN.md (SGen Output Constraints).
- stackalloc size discipline on hot path: On .NET 9, methods containing localloc (any C# stackalloc) historically blocked inlining. Modern .NET allows inlining only for fixed-size stackalloc ≤ 32 bytes outside loops (see dotnet/runtime#7113); anything larger or loop-nested still blocks. Our typical scratch-buffer patterns (UTF-8 encoding scratch, ArrayPool fallbacks) sit far above 32 bytes (256+), so any helper containing such a stackalloc is non-inlinable. Combined with try/finally for ArrayPool.Return cleanup, the method is doubly non-inlinable on .NET 9. Plan accordingly: keep stackalloc-using helpers as deliberate cold call-frames, not as AggressiveInlining candidates.
- Native AOT: out of scope for this TODO; separate architectural decision with deployment-model implications.
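The hot/cold split recommended by the try/finally and stackalloc items could look roughly like this. `HotColdSketch` and both method names are hypothetical; the point is only that the EH region and the pooled buffer live in a deliberately non-inlined helper, while the caller-facing method stays straight-line.

```csharp
using System;
using System.Buffers;
using System.Runtime.CompilerServices;
using System.Text;

public static class HotColdSketch
{
    // Hot path: straight-line, no EH region, no stackalloc, so it remains an inlining candidate.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static int CountUtf8Bytes(string value)
    {
        // Empty strings never need a scratch buffer.
        return value.Length == 0 ? 0 : CountUtf8BytesCold(value);
    }

    // Cold path: owns the rented buffer and the try/finally. Marked NoInlining
    // so it stays a separate call frame by design.
    [MethodImpl(MethodImplOptions.NoInlining)]
    private static int CountUtf8BytesCold(string value)
    {
        byte[] rented = ArrayPool<byte>.Shared.Rent(Encoding.UTF8.GetMaxByteCount(value.Length));
        try
        {
            // Encodes into the scratch buffer and returns the number of bytes written.
            return Encoding.UTF8.GetBytes(value, 0, value.Length, rented, 0);
        }
        finally
        {
            ArrayPool<byte>.Shared.Return(rented);
        }
    }
}
```

The encode-to-count body is only a placeholder workload; in the serializer the cold helper would hold whatever cleanup-bearing logic the audit finds.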
Acceptance:
- Benchmark a realistic entity graph (≥ 3 referenced child types) and show first-call time within ~10% of steady-state after ACCORE-BIN-T-W9F1 + chosen mitigation(s).
- Document which combination is recommended for SignalR hot-path workloads vs. batch serialization.
ACCORE-BIN-T-Z3K8: Replace IId<T> interface dependency with convention/attribute-based Id detection
Priority: P1 · Type: Refactor
The binary serializer currently detects Id-tracking properties via the IId<T> interface (AyCode.Interfaces). This couples the serializer to a framework-specific abstraction and forces consumer types to implement the interface for tracking participation. Move to a POCO-friendly detection scheme:
- IdDetectionMode.Convention (default): convention-based; any property named Id is treated as the tracking key. Zero-friction onboarding.
- IdDetectionMode.Attribute: explicit; only properties marked with a serializer-native [Id] (or similar) attribute are tracked.
- [IgnoreId] attribute: escape hatch in Convention mode to exclude an Id-named property from tracking when the developer wants explicit opt-out.
Implicit contract for Convention mode: within a single class, the Id property must be type-level unique. Whether it semantically represents a primary key or a sequence number is irrelevant — the tracker keys by (Type, Id), so per-type uniqueness is the only requirement. Violating this invariant typically signals a domain-modelling problem, not a serializer bug. Design rationale discussed in conversation 2026-04-27.
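A reflection-based sketch of the two detection modes, for illustration only. The attribute type names, `IdDetectionSketch`, and `FindIdProperty` are hypothetical (the document leaves the final attribute names open), and a source-generated implementation would resolve this at compile time instead.

```csharp
using System;
using System.Reflection;

public enum IdDetectionMode { Convention, Attribute }

// Hypothetical serializer-native attributes; final names are undecided.
[AttributeUsage(AttributeTargets.Property)]
public sealed class IdAttribute : Attribute { }

[AttributeUsage(AttributeTargets.Property)]
public sealed class IgnoreIdAttribute : Attribute { }

public static class IdDetectionSketch
{
    // Returns the first matching tracking-key property, or null when the type does not participate.
    public static PropertyInfo? FindIdProperty(Type type, IdDetectionMode mode)
    {
        foreach (PropertyInfo property in type.GetProperties(BindingFlags.Public | BindingFlags.Instance))
        {
            if (mode == IdDetectionMode.Attribute)
            {
                // Explicit mode: only [Id]-marked properties count.
                if (property.IsDefined(typeof(IdAttribute), inherit: true)) return property;
            }
            else if (property.Name == "Id" && !property.IsDefined(typeof(IgnoreIdAttribute), inherit: true))
            {
                // Convention mode: a property named Id, unless opted out via [IgnoreId].
                return property;
            }
        }

        return null;
    }
}
```

Note that this sketch does not use IId<T> anywhere, which is the decoupling the acceptance criteria ask for; shadowed Id properties (see ACCORE-BIN-T-Q2N7) would need extra handling that is not shown.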
Acceptance:
- Binary serializer no longer references IId<T> in any execution path (no interface checks, no where T : IId<TKey> constraints in the serializer surface).
- Wire format unchanged.
- Existing consumers using IId<T>-implementing types still work transparently in Convention mode (their Id property is detected via convention).
- New consumers can use plain POCOs with no AyCode.Interfaces dependency.
- IdDetectionMode exposed on AcBinaryOptions (or successor options class post-rebrand).
- Default mode = Convention.
ACCORE-BIN-T-N7V1: Replace [JsonIgnore] dependency with serializer-native ignore attribute
Priority: P2 · Type: Refactor
Property exclusion from binary serialization currently relies on [JsonIgnore] (Newtonsoft.Json). This couples the binary serializer to a third-party JSON library's attribute and is conceptually wrong — a binary serializer should not consult a JSON-specific marker for its exclusion semantics.
Define a serializer-native ignore attribute (working name [BinaryIgnore]; final name TBD pending broader rebrand). For backward compatibility during transition, also continue recognizing [JsonIgnore] with a deprecation note.
Possible cross-cutting consideration: if Toon and other future serializers also need property-exclusion, a single shared attribute (e.g., [SerializerIgnore] in a common abstractions package) may be cleaner than per-serializer attributes. Decide before naming finalizes — this may belong in XCUT_TODO.md rather than purely BINARY scope.
Acceptance:
- Native ignore attribute defined in the binary serializer's namespace (or shared abstractions package, pending the cross-cutting decision above).
- Both native attribute and [JsonIgnore] recognized during a transitional period; native attribute takes precedence on conflict.
- [JsonIgnore] recognition flagged for removal in a future major version (track in a follow-up cleanup TODO once consumer projects have migrated).
- No new code dependency on Newtonsoft.Json for property-exclusion logic.
ACCORE-BIN-T-Y6R2: Implement projection serialization phase 1 (runtime path)
Priority: P1 · Type: Feature · Related: ../adr/0001-binary-projection-serialization.md (canonical)
Implement the phase 1 runtime path of source→target projection serialization per ADR 0001. See the ADR for full context, decision rationale, alternatives, consequences, and acceptance criteria.
Sibling rebrand-prep TODOs: ACCORE-BIN-T-Z3K8 (IId migration), ACCORE-BIN-T-N7V1 (JsonIgnore replacement).
ACCORE-BIN-T-K3W7: Rename BufferWriterChunkSize to reflect actual semantics
Priority: P3 · Type: Refactor · Breaking: Yes (public option API) · Streaming impact: see BINARY_ASYNCPIPE_TODO.md for the streaming-side companion considerations (chunk-on-wire vs internal-buffer semantics)
The property name BufferWriterChunkSize is misleading: across the three output paths it does NOT consistently represent a "chunk".
| Output path | What BufferWriterChunkSize actually controls | Wire-format chunk? |
|---|---|---|
| ArrayBinaryOutput (Byte[] API) | Initial buffer capacity of the internal byte[] | No |
| BufferWriterBinaryOutput (IBufferWriter overload) | Internal buffer size: how much data accumulates before Advance() + new GetMemory() on the underlying writer | No |
| AsyncPipeWriterOutput (streaming) | Both internal buffer and wire-format chunk frame size for chunked framing | Yes (only here) |
| Receive side (AsyncPipeReaderInput) | Initial receive buffer = BufferWriterChunkSize × 2 | No (just sizing hint) |
Only the streaming AsyncPipeWriterOutput path has a wire-format "chunk" concept (chunked framing for length-prefixed segments). On the other 75% of paths the property name reads as if the serializer were segmenting the payload, which is not what happens.
Possible directions (decide before implementing):
1. Single rename, semantic-neutral: BufferWriterChunkSize → BufferWriterBufferSize or BufferWriterPageSize. Minimal API surface change, single-property semantics preserved. Downside: still slightly off for the streaming path where there IS chunked framing.
2. Two-property split: InternalBufferSize (universal: how much data accumulates before Advance/Grow) + StreamingChunkSize (only meaningful for AsyncPipeWriterOutput; separate knob, defaults to InternalBufferSize). Cleanest semantics, most ceremony, slightly more options to document.
3. No rename, streaming-honest: Keep as BufferWriterChunkSize but document explicitly that on non-streaming paths the value is repurposed as buffer size. Cheapest change (docs only). Downside: doesn't fix the underlying confusion the field name causes.
Pick one before touching code. Option 2 is the most correct but adds API surface; Option 1 is the pragmatic middle.
Affected callers / docs to update on rename:
- AcBinarySerializerOptions.cs (definition)
- AcBinarySerializer.cs × 3 sites (ArrayBinaryOutput ctor, BufferWriterBinaryOutput ctor, AsyncPipeWriterOutput ctor)
- AcBinaryDeserializer.cs × 1 site (receive-side initial capacity derivation)
- AsyncPipeReaderInput.cs: XML doc cross-refs
- BINARY_WRITERS.md, BINARY_TODO.md (this entry), BINARY_ISSUES.md (line 151 already lists BufferWriterChunkSize among the struct-mutation issue's affected setters)
- Consumer-side: AyCode.Services/SignalRs/AcBinaryHubProtocol.cs ctor mutates _options.BufferWriterChunkSize = options.BufferSize; see BINARY_ISSUES.md#accore-bin-i-... (struct-mutation context). Coordinate the rename with the struct-mutation fix to avoid two cross-cutting churn waves on the same property.
Acceptance:
- Property renamed (or split) per the chosen direction; all internal references updated.
- XML docs reflect the actual semantics on each output path (initial capacity / advance threshold / chunk frame size — whichever applies).
- Consumer-side usage in AcBinaryHubProtocol updated; if Option 2 is chosen, the protocol uses StreamingChunkSize (the streaming knob), not the universal one.
- Wire format unchanged. Default values unchanged (65535 / equivalent).
- Migration note in CHANGELOG / release notes since this is a breaking change to AcBinarySerializerOptions.
ACCORE-BIN-T-M4D2: Add ReadOnlyMemory<byte> / Memory<byte> deserialize overloads
Priority: P3 · Type: Feature
The public AcBinaryDeserializer.Deserialize surface accepts byte[] (with optional offset/length) and ReadOnlySequence<byte>, but not ReadOnlyMemory<byte> / Memory<byte>. Consumers that hold a ReadOnlyMemory<byte> (cached payloads, message-broker frames, in-memory pipe slices) must call .ToArray() to round-trip through byte[] — unnecessary copy + GC alloc.
Implementation:
- Deserialize<T>(ReadOnlyMemory<byte> data, AcBinarySerializerOptions options) and the non-generic Type-based variant.
- Body: MemoryMarshal.TryGetArray(data, out var seg) → array-backed path delegates to Deserialize<T>(seg.Array!, seg.Offset, seg.Count, options) (zero-copy). Non-array-backed fallback (rare: custom MemoryManager<T> with native memory) copies into a pooled byte[].
- Memory<byte> overload trivially delegates to the ReadOnlyMemory<byte> one (Memory<byte> is implicitly convertible).
- No new input-strategy struct needed; reuses existing ArrayBinaryInput.
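The delegation described above might be shaped like the following sketch. `MemoryOverloadSketch` and `DeserializeCore` are hypothetical; DeserializeCore merely stands in for the existing byte[]/offset/length entry point (here it just sums bytes so the example is self-contained), and the options parameter is omitted.

```csharp
using System;
using System.Buffers;
using System.Runtime.InteropServices;

public static class MemoryOverloadSketch
{
    // Hypothetical stand-in for the existing byte[]-based deserialize entry point.
    private static int DeserializeCore(byte[] buffer, int offset, int count)
    {
        int sum = 0;
        for (int i = offset; i < offset + count; i++) sum += buffer[i];
        return sum;
    }

    public static int Deserialize(ReadOnlyMemory<byte> data)
    {
        // Array-backed memory: hand the underlying array straight through, no copy.
        if (MemoryMarshal.TryGetArray(data, out ArraySegment<byte> segment))
        {
            return DeserializeCore(segment.Array!, segment.Offset, segment.Count);
        }

        // Rare fallback (e.g. native-memory MemoryManager<T>): copy into a pooled array.
        byte[] rented = ArrayPool<byte>.Shared.Rent(data.Length);
        try
        {
            data.Span.CopyTo(rented);
            return DeserializeCore(rented, 0, data.Length);
        }
        finally
        {
            ArrayPool<byte>.Shared.Return(rented);
        }
    }

    // Memory<byte> converts implicitly to ReadOnlyMemory<byte>; the cast makes the delegation explicit.
    public static int Deserialize(Memory<byte> data) => Deserialize((ReadOnlyMemory<byte>)data);
}
```

The try/finally sits in the public entry frame here, which matches the "resource cleanup belongs in the entry frame" rule stated elsewhere in this document.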
Acceptance:
- Both overloads compile and pass round-trip tests against byte[]-equivalent input.
- Array-backed path measurably zero-alloc (BenchmarkDotNet allocation diagnoser).
- Non-array-backed path documented as fallback (separate using var pooled = MemoryPool<byte>.Shared.Rent(...) style copy).
- API doc-strings cross-reference the existing byte[] and ReadOnlySequence<byte> overloads.
ACCORE-BIN-T-S7X3: Add ReadOnlySpan<byte> deserialize overload
Priority: P2 · Type: Feature · Related: ACCORE-BIN-T-M4D2
The MemoryPack-style Deserialize<T>(ReadOnlySpan<byte>) API enables direct deserialization from stack-allocated buffers (stackalloc byte[256]), pinned native memory (fixed blocks), and ReadOnlyMemory<byte>.Span slices without round-tripping through a heap-allocated byte[]. The current AcBinary surface lacks this entry point.
Design tension: the existing IBinaryInputBase.Initialize(out byte[] buffer, ...) contract returns a byte[], and a ReadOnlySpan<byte> cannot be stored in a regular struct field, only in a ref struct field. Three implementation paths to evaluate:
1. ref struct SpanBinaryInput + interface bump to support ref byte buffer / int length fields. Pure zero-copy from any span. Cost: BinaryDeserializationContext<TInput> and IBinaryInputBase need a parallel ref-struct-friendly track (the existing pooled context cannot hold a ref struct). Major surgery on the deser core.
2. MemoryMarshal.CreateReadOnlySpanFromNullTerminated-style hack: accept ReadOnlySpan<byte>, use Unsafe.AsRef / MemoryMarshal.GetReference to obtain a ref byte, then copy into a pooled byte[] before deserialization. Not zero-copy, defeats the purpose. Reject.
3. Pinned-buffer trampoline: accept ReadOnlySpan<byte>, allocate a Memory<byte> view via a MemoryManager<byte>-like wrapper, delegate to ReadOnlyMemory<byte> overload. Awkward, allocations per call. Reject.
Recommendation: option (1) is the only correct path, but it's a substantial refactor — measure first whether real consumer demand justifies the surgery. The current byte[]-based pool-pattern outperforms MemoryPack on the dominant use-cases per existing benchmarks; this overload addresses an API-surface gap, not a perf gap.
Acceptance:
- Deserialize<T>(ReadOnlySpan<byte> data, AcBinarySerializerOptions options) compiles and round-trips against byte[]-equivalent input.
- Zero-alloc path verified for stackalloc-source spans (BenchmarkDotNet allocation diagnoser).
- IBinaryInputBase (or successor interface) refactor preserves backward compatibility for existing ArrayBinaryInput / SequenceBinaryInput / AsyncPipeReaderInputAdapter consumers.
- Doc-strings cross-reference the byte[] / ReadOnlyMemory<byte> (ACCORE-BIN-T-M4D2) / ReadOnlySequence<byte> overloads with use-case guidance.
ACCORE-BIN-T-T8K3: Add SerializeAsync(Stream, T) async overloads with mode-driven output strategy
Priority: P1 · Type: Feature · Related: ACCORE-BIN-T-N9G6 (Type-based coordination)
Mainstream .NET serializers (System.Text.Json, MessagePack, MemoryPack) all expose SerializeAsync(Stream, T) as a primary entry point for async file I/O, network response bodies, and log streaming. AcBinary's public API surface MUST include this overload regardless of what we do internally; consumers expect a Stream parameter and should not have to discover PipeWriter.Create(stream) workarounds. Omitting it is market-entry-blocking.
Mode-driven output strategy — three lanes for three workload shapes
AcBinary already models the three output strategies in BinaryProtocolMode (AyCode.Services/SignalRs/BinaryProtocolMode.cs) for the SignalR side. The same three-lane shape applies to the public SerializeAsync(Stream) API. Promote the concept to AcBinary core scope (e.g. AcBinaryOutputMode in AyCode.Core/Serializers/Binaries/) and let the SignalR BinaryProtocolMode either alias it or migrate to it. Migration timing: the existing BinaryProtocolMode keeps shipping until the new public API is stabilized; both names live for one major version, then BinaryProtocolMode becomes a using-alias.
| Mode | Output strategy | Peak memory | Pipeline parallelism | Use when |
|---|---|---|---|---|
| Bytes (default) | Serialize(T) → byte[] + stream.WriteAsync(bytes) | Full payload in byte[] (pooled) | No | Typical payloads (<10 MB), throughput-focus |
| Segment | BufferWriterBinaryOutput → PipeWriter, single closing flush | PipeWriter pause-threshold-bounded (~64 KB Kestrel default) | No | Mid-size payloads, zero-copy desired |
| AsyncSegment | SerializeChunked(PipeWriter), per-chunk async flush | Chunk-size-bounded (~8 KB at default BufferWriterChunkSize) | Yes (on parallel-capable PipeWriter: Kestrel / Pipe) | Very large payloads (>10 MB), memory-tight hosts, parallel-capable transport |
Honest performance positioning vs. MemoryPack — three real axes
MemoryPack's SerializeAsync(Stream) is pseudo-streaming — serializes the entire payload into a pool-allocated linked-list buffer first (ReusableLinkedArrayBufferWriter), then writes the completed buffer to the stream in a single closing fence. Peak memory ≈ payload size; no pipeline parallelism. AcBinary's Bytes mode is architecturally similar (single pooled contiguous byte[] vs. MemoryPack's linked-list) — comparable peak-memory cost, often faster on the wire due to one contiguous WriteAsync call.
AcBinary's AsyncSegment mode is architecturally different in three real ways MemoryPack cannot match:
| Axis | Bytes mode (default) | AsyncSegment mode | MemoryPack SerializeAsync |
|---|---|---|---|
| Heap allocation per call | Pooled byte[] rent (peak ≈ payload size) | Truly zero: ArrayPool + pooled context + MemoryMarshal.TryGetArray direct-buffer-write into the transport's own byte[] | Pool-allocated linked-list buffer per call (peak ≈ payload size) |
| Peak managed memory | ≈ payload size | ≈ chunk size (BufferWriterChunkSize, e.g. 4-8 KB) | ≈ payload size |
| GC pressure | Touches GC pool on every call | Never touches GC for the serialize itself | Touches GC pool on every call |
| Pipeline parallelism | No | Yes on parallel-capable PipeWriter (Kestrel transport, new Pipe()) | No |
| GB-scale payload | OOM risk on memory-tight hosts | Works | OOM risk |
The AsyncSegment zero-alloc claim is literal, not "almost zero": AsyncPipeWriterOutput.AcquireChunk calls _pipeWriter.GetMemory(chunkSize) and uses MemoryMarshal.TryGetArray(memory, out segment) to obtain the transport's own internal byte[] — the serializer writes directly into it. With chunkSize aligned to the transport's internal buffer (e.g. NamedPipe-server pipe-buffer-size), one chunk is one kernel-level transfer; no managed-side double-fragmentation.
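The buffer-acquisition step described here can be illustrated with the public System.IO.Pipelines API. This is a sketch under assumptions, not AsyncPipeWriterOutput itself: `ChunkWriteSketch` and `WriteChunksAsync` are hypothetical, the payload is already-serialized bytes rather than live serializer output, and depending on the target framework a System.IO.Pipelines package reference may be needed.

```csharp
using System;
using System.IO.Pipelines;
using System.Runtime.InteropServices;
using System.Threading.Tasks;

public static class ChunkWriteSketch
{
    // Copies payload into the writer chunk by chunk, writing into the writer's own buffer.
    public static async Task WriteChunksAsync(PipeWriter writer, byte[] payload, int chunkSize)
    {
        int offset = 0;
        while (offset < payload.Length)
        {
            // Ask the transport for at least chunkSize bytes of its internal buffer.
            Memory<byte> memory = writer.GetMemory(chunkSize);
            int count = Math.Min(Math.Min(chunkSize, memory.Length), payload.Length - offset);

            if (MemoryMarshal.TryGetArray<byte>(memory, out ArraySegment<byte> segment))
            {
                // Array-backed: write straight into the transport's byte[].
                Buffer.BlockCopy(payload, offset, segment.Array!, segment.Offset, count);
            }
            else
            {
                // Non-array-backed memory: fall back to a span copy.
                payload.AsSpan(offset, count).CopyTo(memory.Span);
            }

            writer.Advance(count);
            offset += count;

            // One flush per chunk; this is the sequential async round-trip.
            await writer.FlushAsync();
        }
    }
}
```

In the real output path the serializer would presumably write fields directly into the acquired segment rather than copying from a finished payload; the copy here exists only to keep the example runnable on its own.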
Throughput nuance — AsyncSegment cost on Stream-backed transports
AsyncSegment IS slightly slower than Bytes on StreamPipeWriter-backed transports (NamedPipe / FileStream / NetworkStream), but not for the reason that initially seems obvious:
- The cost is NOT "managed-side double-fragmentation on top of OS-level fragmentation"; that's not what happens. MemoryMarshal.TryGetArray zero-copy direct-buffer-writes mean the managed chunking is the same chunking the kernel does anyway, not redundant.
- The cost IS the per-chunk async-await round-trip (SyncAwaitFlush(_lastFlush) blocks until the kernel acknowledges the write), forced sequential by the StreamPipeWriter._tailMemory reset race (ACCORE-BIN-I-...). N async cycles vs 1 in Bytes mode.
- Empirically the gap is roughly 1.2-1.5x on NamedPipe, not 2-5x. The dominant cost on these transports is the transport itself (Windows IRP / Linux FIFO syscall overhead), independent of the serializer mode.
When AsyncSegment wins outright:
- GC-sensitive hot-paths (server hubs, real-time game tick loops, mobile UI thread, embedded targets): zero-alloc + zero-GC-pressure beats a 1.2x throughput edge every time.
- Memory-tight hosts (mobile, WASM, container-trimmed, embedded): chunk-bounded peak memory is the only option.
- GB-scale payloads: Bytes OOMs; AsyncSegment works.
- Kestrel transport / parallel-capable Pipe: pipeline parallelism makes AsyncSegment faster than Bytes for medium-to-large payloads.
When Bytes wins outright:
- Typical NuGet workload (small-to-medium payload, throughput priority, GC-tolerant): one async cycle vs N is the simpler, faster path.
- MemoryStream (in-memory): one large byte[] copy decisively beats N managed chunks.
Marketing claim — three-way honest comparison
"AcBinary offers a real choice.
Bytes mode for typical throughput-priority workloads (matches MemoryPack's pseudo-streaming, often faster on the wire). AsyncSegment mode for the workloads MemoryPack cannot serve: zero-alloc serialize for GC-sensitive hot-paths, chunk-bounded peak memory for tight-budget hosts, GB-scale payloads, and pipeline parallelism on parallel-capable transports. You pick the mode; MemoryPack picks for you."
This is honest — does not overclaim universal speed, does not hide the small AsyncSegment cost on Stream-backed transports, AND clearly surfaces the three differentiator axes (alloc / memory / parallelism) where AcBinary architecturally beats MemoryPack.
Implementation outline:
- New enum AcBinaryOutputMode { Bytes = 0, Segment = 1, AsyncSegment = 2 } in AyCode.Core/Serializers/Binaries/. Default Bytes.
- New mode field on AcBinarySerializerOptions: AcBinaryOutputMode OutputMode { get; set; } = AcBinaryOutputMode.Bytes;. (Note: subject to ACCORE-BIN-I-L8N5 thread-safety treatment; defensive copy / immutable refactor coordination.)
- public static ValueTask SerializeAsync<T>(T value, Stream stream, AcBinarySerializerOptions? options = null, bool leaveOpen = false, CancellationToken ct = default):
  - Switch on options.OutputMode:
    - Bytes → var bytes = Serialize(value, options); await stream.WriteAsync(bytes, ct); ArrayPool<byte>.Shared.Return(bytes);
    - Segment → var pw = PipeWriter.Create(stream, new(leaveOpen: leaveOpen)); Serialize(value, pw, options); await pw.CompleteAsync();
    - AsyncSegment → var pw = PipeWriter.Create(stream, new(leaveOpen: leaveOpen)); SerializeChunked(value, pw, options); await pw.CompleteAsync();
- public static ValueTask SerializeAsync(object? value, Type type, Stream stream, ...): non-generic, same dispatch (coordinated with ACCORE-BIN-T-N9G6).
- leaveOpen parameter standard for stream-async serializers (System.Text.Json, MessagePack convention).
- The Bytes mode uses a pooled byte[] from ArrayBinaryOutput to keep alloc cost amortized.
SignalR migration coordination: the existing BinaryProtocolMode enum (in AyCode.Services) keeps shipping unchanged until the new public API is stabilized. After stabilization, BinaryProtocolMode becomes a deprecated alias of AcBinaryOutputMode, eventually removed in a major-bump. No SignalR-side churn during this TODO's implementation.
Acceptance:
- SerializeAsync<T> round-trips against Deserialize<T>(byte[]) via MemoryStream in all three modes.
- Cancellation propagates correctly (OperationCanceledException on cancelled token mid-stream).
- Throughput matrix benchmark: 4 transports (MemoryStream, FileStream, NamedPipeStream, NetworkStream) × 3 modes × 3 payload sizes (small ~1 KB / medium ~100 KB / large ~10 MB). Results documented in Test_Benchmark_Results/Benchmark/SerializeAsync_Stream_Modes.LLM (or similar) and surfaced as a doc-string table for consumer guidance.
- Memory-bounded benchmark: 100 MB payload to FileStream in AsyncSegment mode → peak managed-heap delta ≤ 1 MB throughout. Same payload in Bytes mode → peak ~100 MB (expected, documented).
- API doc-string contains a "When to use which mode?" decision matrix; explicitly compares with MemoryPack's pseudo-streaming.
- leaveOpen parameter behaves per the System.Text.Json / MessagePack convention across all three modes.
ACCORE-BIN-T-D7K4: Add DeserializeAsync(Stream, T) async overloads with mode-driven input strategy
Priority: P1 · Type: Feature · Related: ACCORE-BIN-T-T8K3 (companion write-side overload), ACCORE-BIN-T-N9G6 (non-generic Type-based dispatch)
Companion to T8K3 on the receive side. The mainstream serializers (System.Text.Json, MessagePack, MemoryPack) all expose DeserializeAsync<T>(Stream), the symmetric counterpart of SerializeAsync(Stream, T). AcBinary's public API surface MUST include this overload for parity; consumers expect a Stream parameter for receive paths (file load, HTTP response body, network stream) and should not have to hand-roll a PipeReader.Create(stream) workaround. Omitting it would block market entry.
Implementation: zero new IBinaryInputBase impl needed
The existing receive-side primitives cover the full strategy space via BCL PipeReader.Create(stream):
| Mode | Input strategy | Peak memory | Pipeline parallelism | Use when |
|---|---|---|---|---|
| Bytes (default) | await stream.CopyToAsync(MemoryStream) → Deserialize<T>(byte[]) (existing overload) | Full payload as byte[] (pooled) | No | Typical payloads (<10 MB), throughput-focus |
| Segment | await PipeReader.Create(stream).ReadAsync() → Deserialize<T>(ReadOnlySequence<byte>) (existing overload) | PipeReader pause-threshold-bounded (~64 KB) | No | Mid-size payloads, no full byte[] alloc desired |
| AsyncSegment | AsyncPipeReaderInput + DrainFromAsync(PipeReader.Create(stream)) + Deserialize<T>(input) (existing overload) | Chunk-size-bounded (~8 KB) | Yes (producer drain Task in parallel with deser Task) | Very large payloads (>10 MB), memory-tight hosts |
The AcBinaryOutputMode enum (introduced by T8K3) is symmetric — it controls deser-input strategy as well. The same enum value picks the matching read path. No new IBinaryInputBase implementation needed — the trio of existing inputs (ArrayBinaryInput, SequenceBinaryInput, AsyncPipeReaderInput) already cover all three modes; the new overload is a thin shim that wraps the Stream and routes to the right existing overload.
Public API shape
public static ValueTask<T?> DeserializeAsync<T>(
Stream stream,
AcBinarySerializerOptions? options = null,
bool leaveOpen = false,
CancellationToken ct = default);
// Non-generic Type-based variant (coordinated with N9G6):
public static ValueTask<object?> DeserializeAsync(
Stream stream,
Type targetType,
AcBinarySerializerOptions? options = null,
bool leaveOpen = false,
CancellationToken ct = default);
Implementation outline (per mode)
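// Bytes mode (default; simplest path, sub-LOH-friendly fast path):
public static async ValueTask<T?> DeserializeAsync_Bytes<T>(Stream stream, ..., CancellationToken ct)
{
    // Size the first rent from the remaining length when the stream is seekable; otherwise start at 4 KB.
    var initialSize = stream.CanSeek ? (int)Math.Clamp(stream.Length - stream.Position, 4096, int.MaxValue) : 4096;
    var rented = ArrayPool<byte>.Shared.Rent(initialSize);
    try
    {
        var totalRead = 0;
        int read;
        while ((read = await stream.ReadAsync(rented.AsMemory(totalRead), ct)) > 0)
        {
            totalRead += read;
            if (totalRead == rented.Length)
            {
                // Buffer full: move to a pooled array of twice the size. Without this the next
                // read would get an empty window, return 0, and look like end-of-stream.
                var bigger = ArrayPool<byte>.Shared.Rent(checked(rented.Length * 2));
                Buffer.BlockCopy(rented, 0, bigger, 0, totalRead);
                ArrayPool<byte>.Shared.Return(rented);
                rented = bigger;
            }
        }
        return Deserialize<T>(rented, 0, totalRead, options);
    }
    finally { ArrayPool<byte>.Shared.Return(rented); }
}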
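// Segment mode (PipeReader.Create wrapping, then drain to ReadOnlySequence):
public static async ValueTask<T?> DeserializeAsync_Segment<T>(Stream stream, ..., CancellationToken ct)
{
    var pipeReader = PipeReader.Create(stream, new(leaveOpen: leaveOpen));
    try
    {
        // Drain the whole stream: mark everything examined but nothing consumed,
        // so the reader keeps buffering until the stream reports completion.
        var result = await pipeReader.ReadAsync(ct);
        while (!result.IsCompleted)
        {
            pipeReader.AdvanceTo(result.Buffer.Start, result.Buffer.End);
            result = await pipeReader.ReadAsync(ct);
        }
        var seq = result.Buffer;
        var obj = Deserialize<T>(seq, options);
        pipeReader.AdvanceTo(seq.End);
        return obj;
    }
    finally { await pipeReader.CompleteAsync(); }
}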
// AsyncSegment mode (chunked streaming pipeline, parallel drain + deser):
public static async ValueTask<T?> DeserializeAsync_AsyncSegment<T>(Stream stream, ..., CancellationToken ct)
{
using var input = new AsyncPipeReaderInput(options.BufferWriterChunkSize * 2, multiMessage: false);
var pipeReader = PipeReader.Create(stream, new(leaveOpen: leaveOpen));
var deserTask = Task.Run(() => Deserialize<T>(input, options), ct);
await input.DrainFromAsync(pipeReader, ct);
await pipeReader.CompleteAsync();
return await deserTask;
}
Honest performance positioning
Symmetric to T8K3's analysis:
- Bytes mode: simplest, single contiguous byte[] (pooled) → Deserialize<T>(byte[]). Comparable to MemoryPack's DeserializeAsync (which does similar full-buffer-then-deser). Best for typical workloads.
- Segment mode: zero-copy from PipeReader's natural ReadOnlySequence<byte>, with no extra byte[] allocation. Best for mid-size payloads where allocation matters but pipeline overlap doesn't.
- AsyncSegment mode: producer-drain Task and consumer-deser Task in parallel via AsyncPipeReaderInput. Wall-clock = max(network-drain, deser-CPU) + small overlap-cost. Best for large payloads + slow transports (network, mobile, satellite, where transit dominates and overlap pays).
Acceptance
- DeserializeAsync<T> round-trips against SerializeAsync(Stream, T) (T8K3) via MemoryStream in all three modes.
- Cancellation propagates correctly (OperationCanceledException on cancelled token mid-stream); partial-buffer state cleaned up; pooled byte[] returned even on cancellation.
- Throughput matrix benchmark (mirror of T8K3): 4 transports (MemoryStream, FileStream, NamedPipeStream, NetworkStream) × 3 modes × 3 payload sizes. Results documented in Test_Benchmark_Results/Benchmark/DeserializeAsync_Stream_Modes.LLM.
- Memory-bounded benchmark: 100 MB payload from FileStream in AsyncSegment mode → peak managed-heap delta ≤ 1 MB throughout. Same payload in Bytes mode → peak ~100 MB (expected, documented).
- API doc-string contains a "When to use which mode?" decision matrix; cross-references T8K3's symmetric write-side guidance.
- leaveOpen parameter behaves per the System.Text.Json / MessagePack convention across all three modes.
ACCORE-BIN-T-N9G6: Add non-generic Type-based Serialize(object, Type, ...) overloads
Priority: P2 · Type: Feature · Status: Closed (2026-05-04) · Related: ACCORE-BIN-T-T8K3
Resolution
Added in AcBinarySerializer.cs:
- Serialize(object?, Type, opts) → byte[]
- Serialize(object?, Type, IBufferWriter<byte>, opts) → int
- SerializeChunked(object?, Type, PipeWriter, opts) → int
- SerializeChunkedFramed(object?, Type, PipeWriter, opts) → int
AcBinaryDeserializer.cs already had Deserialize(byte[], Type, opts) / Deserialize(ReadOnlySequence<byte>, Type, opts) / Deserialize(AsyncPipeReaderInput, Type, opts) overloads — no new entries needed.
Layering note: PipeReader → AsyncPipeReaderInput drain-loop is the consumer's responsibility, not the binary serializer's. The serializer surface ends at AsyncPipeReaderInput; transport-specific draining (PipeReader, NamedPipe, SignalR state.Buffer.Write, etc.) lives in the consumer layer (e.g. AcBinaryInputFormatter, AcBinaryHubProtocol.TryParseChunkData).
Consumed by ASP.NET Core MVC formatter package (AyCode.Services/Mvc/) — AcBinaryInputFormatter, AcBinaryOutputFormatter, AddAcBinaryFormatters extension. Media type: application/vnd.acbinary. Drain-loop inlined in AcBinaryInputFormatter.ReadRequestBodyAsync.
Plugin frameworks, ASP.NET ModelBinding, DI middleware, and DataContractSerializer-style "generic-API container" use-cases need to serialize an object whose type is known only at runtime. Current AcBinary surface forces a reflection trampoline through the generic Serialize<T>:
// Today's workaround (slow + noisy):
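// GetMethod(name, Type[]) cannot match the open generic parameter T, so the generic definition is located by shape (needs System.Linq).
typeof(AcBinarySerializer).GetMethods()
    .First(m => m.Name == "Serialize" && m.IsGenericMethodDefinition && m.GetParameters().Length == 2)
    .MakeGenericMethod(type).Invoke(null, new object?[] { value, options });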
Implementation outline:
- public static byte[] Serialize(object? value, Type type, AcBinarySerializerOptions? options = null)
- public static int Serialize(object? value, Type type, IBufferWriter<byte> writer, AcBinarySerializerOptions? options = null)
- public static int SerializeChunked(object? value, Type type, PipeWriter writer, AcBinarySerializerOptions? options = null) and Pipe overload
- public static int SerializeChunkedFramed(object? value, Type type, PipeWriter writer, AcBinarySerializerOptions? options = null) and Pipe overload
- public static ValueTask SerializeAsync(object? value, Type type, Stream stream, ...): coordinated with ACCORE-BIN-T-T8K3
- Internal dispatch: value.GetType() is the runtime type; the Type type parameter constrains the declared type for polymorphism handling (ObjectWithTypeName write decision).
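One common way to bridge a runtime Type to a generic entry point without paying the reflection cost on every call is to build a closed delegate once per type and cache it. The sketch below illustrates that general technique only; it is not taken from AcBinarySerializer.cs. NonGenericDispatchSketch, Describe, and DescribeBoxed are hypothetical stand-ins (Describe<T> plays the role of a generic serializer entry point).

```csharp
using System;
using System.Collections.Concurrent;
using System.Reflection;

// Hypothetical stand-in types for illustration; not the AcBinary implementation.
public static class NonGenericDispatchSketch
{
    // Plays the role of a generic entry point such as Serialize<T>.
    public static string Describe<T>(T value) => $"{typeof(T).Name}:{value}";

    // Open generic adapter that unboxes/casts and forwards to the generic entry point.
    private static string DescribeBoxed<T>(object? value) => Describe((T)value!);

    private static readonly MethodInfo s_openAdapter =
        typeof(NonGenericDispatchSketch).GetMethod(nameof(DescribeBoxed), BindingFlags.NonPublic | BindingFlags.Static)!;

    // One closed delegate per declared type; reflection runs only on the first call for each type.
    private static readonly ConcurrentDictionary<Type, Func<object?, string>> s_cache = new();

    // Non-generic overload: `type` is the declared type chosen by the caller.
    public static string Describe(object? value, Type type) =>
        s_cache.GetOrAdd(type, static t => s_openAdapter.MakeGenericMethod(t).CreateDelegate<Func<object?, string>>())(value);
}
```

After the first call for a given type, the cost is a dictionary lookup plus a delegate invocation. Note that MakeGenericMethod is itself a dynamic-code operation, so this particular pattern would not suit an AOT-only path.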
Acceptance:
- All non-generic overloads round-trip via the generic deserializer's Deserialize(byte[], Type) overload.
- Plugin-style scenario: serialize IList<dynamic> of mixed-type elements → all elements correctly typed in the wire output.
- API doc-strings call out the performance characteristics (slightly slower than generic due to runtime Type lookup but without the reflection trampoline cost).
ACCORE-BIN-T-R4P2: Expose low-level ref Writer-style API for custom formatters
Priority: P3 · Type: Feature
The MemoryPack-style Serialize<T>(ref MemoryPackWriter writer, in T value) low-level API enables:
- Custom formatters that compose write primitives without the full Serialize entry-point overhead.
- Nested-into-existing-stream scenarios where the caller already owns a writer-style cursor.
- Test harnesses that exercise specific wire-format paths in isolation.
Today's BufferWriterBinaryOutput standalone-mode partly fills this gap — exposing WriteByte, WriteVarUInt, WriteStringUtf8, etc. — but it is not a ref struct, not a documented low-level public API for external custom formatters, and the relationship with BinarySerializationContext<TOutput> is unclear from the consumer's perspective.
Design tension (decide before implementing):
1. Promote BufferWriterBinaryOutput to documented public surface: add doc, examples, supported usage patterns. Cheapest, but the standalone-mode is currently a side-feature, not a primary API; documenting it commits to its current shape.
2. New ref struct AcBinaryWriter wrapper around BufferWriterBinaryOutput (or a dedicated impl): explicit "this is the low-level writer" signal. More API surface but clearer mental model. Aesthetic alignment with MemoryPack.
3. Skip entirely: the IBufferWriter<byte> overload is already lower-level than most consumers need; custom formatters can write to an ArrayBufferWriter<byte> and use IBufferWriter-style primitives. This is what BufferWriterBinaryOutput already does internally.
Recommendation: option 3 is honest — the existing IBufferWriter<byte> overload covers the use case, and adding a ref struct AcBinaryWriter is mostly aesthetic alignment with MemoryPack. Re-evaluate when there's a concrete custom-formatter request that the current API can't accommodate.
Acceptance (if implemented):
- AcBinaryWriter ref struct (or equivalent) compiles, supports the same write primitives as BufferWriterBinaryOutput standalone-mode.
- At least one example custom formatter ships in tests (e.g., a Vector3 struct formatter).
- Doc-string clearly distinguishes when to use the low-level writer vs. the high-level Serialize<T> entry-point.
ACCORE-BIN-T-U6Y8: Attribute-driven polymorphism via [AcBinaryUnion] + SGen (opt-in, AOT-friendly)
Priority: P1 (if AOT target required) / P2 (non-AOT only) · Type: Feature
Design philosophy alignment: AcBinary's market positioning is "JSON-style flexibility with MessagePack-class speed" — attributes are opt-in optimization, never required. The runtime polymorphism path (AQN-based, today's default) stays the default and continues to work for arbitrary unattributed types. This TODO adds a fast/AOT path alongside it, never replaces it.
AcBinary today handles polymorphism at runtime: the wire writes ObjectWithTypeName(72) + AQN string, and the deserializer calls Type.GetType(aqn) to resolve. This is flexible (no upfront declaration), but has three significant drawbacks for some consumers:
- AOT-incompatible: Type.GetType(AQN) requires reflection metadata that the Native AOT trimmer strips by default. The runtime polymorphism path does not work at all under Native AOT. Hard blocker for AOT-targeting consumers (Blazor WASM, MAUI mobile, container-trimmed deployments).
- Slower: AQN string parse + reflection lookup vs. a closed switch (tag) in code-gen.
- Larger wire format: full AQN string (often 100+ bytes) vs. a single-byte tag.
Design — three coordinated pieces:
1. New 5th bool parameter on [AcBinarySerializable]: EnablePolymorphismFeature
Mirrors the existing EnableMetadataFeature / EnableIdTrackingFeature / EnableRefHandlingFeature / EnableInternStringFeature pattern. Per-type opt-out / opt-in via attribute parameter.
public AcBinarySerializableAttribute(
bool enableMetadataFeature,
bool enableIdTrackingFeature,
bool enableRefHandlingFeature,
bool enableInternStringFeature,
    bool enablePolymorphismFeature) // ← NEW, default: true
Three behavior modes per type:
- EnablePolymorphismFeature = false → disabled. SGen never emits polymorphism dispatch for this type; runtime path also short-circuits: runtime type ≠ declared type is silently treated as declared (or throws, decision TBD). Use for hot-path closed types where polymorphism is impossible-by-design and the perf/AOT cost is unwanted.
- EnablePolymorphismFeature = true (default), no [AcBinaryUnion] → runtime options control. Behaves per AcBinarySerializerOptions.PolymorphismMode (Runtime/AQN today). This preserves the JSON-style flexibility for unattributed bases.
- EnablePolymorphismFeature = true + [AcBinaryUnion(...)] declared → union-switch dispatch. SGen emits a closed switch (tag) dispatch using the declared subtype set. Fast + AOT-friendly. Overrides the options-level default for this type.
2. New [AcBinaryUnion(byte tag, Type subtype)] attribute
Multiple instances per base class / interface declare the closed polymorphism set:
[AcBinarySerializable] // EnablePolymorphismFeature defaults to true
[AcBinaryUnion(0, typeof(Cat))]
[AcBinaryUnion(1, typeof(Dog))]
public abstract partial class Animal { ... }
SGen detects [AcBinaryUnion] on abstract / base type → emits the switch-based write/read dispatch instead of falling through to runtime AQN.
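To make the closed switch (tag) idea concrete, here is a hand-written sketch of what a union dispatch for the Animal example could look like. It is an assumption about the shape only: the real generator output, the marker byte, and the inner object encoding are not specified here. AnimalUnionSketch and the Name field are hypothetical, and BinaryWriter/BinaryReader stand in for the AcBinary wire codec.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical stand-in types; a real union base would carry [AcBinaryUnion] attributes instead.
public abstract class Animal { public string Name = ""; }
public sealed class Cat : Animal { }
public sealed class Dog : Animal { }

public static class AnimalUnionSketch
{
    // Write side: map the runtime type to its declared tag, then write the body.
    public static void WriteUnion(BinaryWriter writer, Animal value)
    {
        switch (value)
        {
            case Cat: writer.Write((byte)0); break;
            case Dog: writer.Write((byte)1); break;
            default: throw new NotSupportedException($"{value.GetType().Name} is not a declared union member.");
        }
        writer.Write(value.Name); // placeholder for the inner object body
    }

    // Read side: a closed switch over the tag, with no Type.GetType lookup.
    public static Animal ReadUnion(BinaryReader reader)
    {
        byte tag = reader.ReadByte();
        Animal result = tag switch
        {
            0 => new Cat(),
            1 => new Dog(),
            _ => throw new InvalidDataException($"Unknown union tag {tag}.")
        };
        result.Name = reader.ReadString();
        return result;
    }

    // Small round-trip helper for demonstration.
    public static Animal RoundTrip(Animal value)
    {
        using var ms = new MemoryStream();
        using (var w = new BinaryWriter(ms, Encoding.UTF8, leaveOpen: true)) WriteUnion(w, value);
        ms.Position = 0;
        using var r = new BinaryReader(ms, Encoding.UTF8, leaveOpen: true);
        return ReadUnion(r);
    }
}
```

Because every subtype is named in source, the read path constructs instances directly and nothing depends on reflection metadata surviving trimming.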
3. New PolymorphismMode enum on AcBinarySerializerOptions
Options-level default for unattributed polymorphism (i.e. the case where EnablePolymorphismFeature = true but no [AcBinaryUnion] is declared):
- Runtime (today's default): AQN-based. Flexible, AOT-incompatible.
- Throw: fail fast on any polymorphic write that lacks a [AcBinaryUnion] attribute. AOT-friendly diagnostic mode for migration scenarios.
Note: there is no UnionAttribute-only mode — declaration is per-type via the attribute, not options-global. The options-level mode only governs the fallback when no [AcBinaryUnion] is present.
Wire-format addition:
New marker (e.g. UnionTagBase = <TBD>) + [byte tag][inner Object], parallel to existing ObjectWithTypeName(72). Slot number to be assigned avoiding clashes with existing 64–134 / 192–255 ranges.
Implementation outline:
- AcBinarySerializableAttribute: new ctor parameter enablePolymorphismFeature, all existing ctors default it to true (backward compatible).
- AcBinaryUnionAttribute: new attribute, AttributeUsage(AttributeTargets.Class | AttributeTargets.Interface, AllowMultiple = true).
- Source generator: emit WriteUnion<TBase>(value, ctx, depth) and ReadUnion<TBase>(ctx, depth) static methods on the union-base type's generated writer/reader. Skipped entirely when EnablePolymorphismFeature = false.
- Wire-format new marker + [byte tag][inner Object] body.
- Runtime path: WriteValueNonPrimitive checks the wrapper's PolymorphismFeatureEnabled flag; when false, skips the value.GetType() != declaredType polymorphism branch entirely.
Acceptance:
- EnablePolymorphismFeature = false: SGen-emitted dispatch contains zero is-typeof / GetType branches; runtime path also short-circuits. Verify in JIT disassembly.
- EnablePolymorphismFeature = true, no union: runtime AQN polymorphism works as today (full backward compat); preserved JSON-style flexibility for unattributed bases.
- EnablePolymorphismFeature = true + [AcBinaryUnion]: AOT-test (Native AOT publish) compiles and round-trips a polymorphic graph; Type.GetType() is never invoked on this path.
- Benchmark: union-switch polymorphism measurably faster than AQN polymorphism on deser side (typed switch vs. reflection lookup).
- Wire format documented in BINARY_FORMAT.md; BINARY_FEATURES.md cross-references the attribute pattern; BINARY_OPTIONS.md documents PolymorphismMode.
- AcBinarySerializableAttribute doc-string explains all three behavior modes.
ACCORE-BIN-T-B7H4: Implement AcBinarySerializerOptions thread-safety fix
Priority: P2 · Type: Refactor · Related: BINARY_ISSUES.md#accore-bin-i-l8n5 (canonical issue)
The latent thread-safety problem documented in ACCORE-BIN-I-L8N5 — mutable set; properties on AcBinarySerializerOptions shared across concurrent serialize/deserialize calls — needs a fix before AcBinary ships as a NuGet package. The package cannot constrain how consumers scope their options instances; defensive contract is needed in the serializer itself.
Three candidate fix directions (decide before implementing):
1. Defensive copy on ingress: add AcBinarySerializerOptions Clone() method (member-wise copy). Every API entry point that retains an options instance clones it on entry. External mutation to the original becomes invisible to the holder.
   - Pro: non-breaking. Existing consumer code unchanged. No major version bump required.
   - Pro: API surface change limited to one new Clone() method.
   - Con: per-call clone overhead (small, but non-zero). Cache keyed on options-identity becomes invalid for downstream code using reference equality.
   - Con: doesn't fix the underlying mutability; internal code can still race-mutate the cloned snapshot if a method retains both the snapshot and modifies it concurrently.
2. Immutable record refactor: set; → init; on all configuration properties. Mutation requires with-expression which produces a new instance.
   - Pro: type-system-strong guarantee. Race becomes a compile error, not a runtime corruption risk.
   - Pro: zero runtime overhead (init-only is compile-time check; record class semantics are unchanged at runtime).
   - Con: breaking change for any consumer doing opts.UseGeneratedCode = false after construction. Major version bump.
   - Con: source-generator coordination needed if SGen emits options-builder code that mutates properties.
3. Read-only flag pattern (à la JsonSerializerOptions.MakeReadOnly()): mutable by default, holder calls MakeReadOnly() on entry; subsequent property setters throw InvalidOperationException.
   - Pro: BCL-precedent. Microsoft exposed it on JsonSerializerOptions in .NET 8 (dotnet/runtime#74431) for exactly this problem. Familiar pattern for consumers.
   - Pro: minimal API surface change (one new method + IsReadOnly flag property).
   - Pro: per-call overhead = single bool check per setter call. Negligible.
   - Con: opt-in by the holder. If a custom consumer-side wrapper forgets to call MakeReadOnly(), the safety hole stays open for that wrapper's clients. Documentation-driven safety, not type-system-driven.
   - Con: bypasses static-analysis tooling (the setter signature stays public; the throw is runtime). IDE doesn't surface "this property is currently read-only" in autocomplete.
Recommendation: Option 3 (MakeReadOnly pattern) is the BCL-precedent, lowest-friction migration path. Microsoft exposed it on JsonSerializerOptions in .NET 8 to solve the same problem; AcBinary should follow the same pattern for consistency with consumers' mental model and zero migration cost.
Coordination with the existing AcBinaryHubProtocol setter side-effect (the second risk surface in ACCORE-BIN-I-L8N5): the protocol ctor currently mutates the caller-provided options reference (_options.BufferWriterChunkSize = options.BufferSize). After the fix:
- Option 1 (Clone): ctor mutates the cloned snapshot → no side-channel to the caller. Fix transparent.
- Option 2 (Immutable): ctor cannot mutate; needs to construct a new options via with-expression. Breaking change in the ctor's options-handling.
- Option 3 (MakeReadOnly): ctor mutates before calling MakeReadOnly(), same as today, but with an explicit "frozen" point afterwards. Caller-side mutation post-ctor is now a runtime throw.
Implementation outline (Option 3 path):
- AcBinarySerializerOptions.IsReadOnly { get; }: public bool property.
- AcBinarySerializerOptions.MakeReadOnly(): sets the flag; idempotent (no-op if already set).
- All set; accessors guard: if (IsReadOnly) throw new InvalidOperationException("AcBinarySerializerOptions has been made read-only and can no longer be mutated. Construct a new options instance instead.");.
- AcBinarySerializer.Serialize<T> entry (and all sibling entries: Deserialize<T>, SerializeChunked, etc.): options.MakeReadOnly() before any property read.
- AcBinaryHubProtocol ctor: complete the BufferWriterChunkSize mutation before calling options.MakeReadOnly(). After ctor returns, the options instance is frozen for that protocol's lifetime.
- Doc-string update on AcBinarySerializerOptions class header: explicit "thread-safety contract" section explaining the freeze-on-first-use semantics.
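The outline above can be sketched as a small freezable options class. This is a minimal illustration under stated assumptions, not the planned AcBinarySerializerOptions code: FreezableOptionsSketch is a hypothetical stand-in, only two properties are shown, and their default values are invented for the example.

```csharp
using System;

// Hypothetical stand-in for illustration; property set and defaults are assumptions.
public sealed class FreezableOptionsSketch
{
    private volatile bool _isReadOnly;
    private bool _useGeneratedCode = true;
    private int _bufferWriterChunkSize = 4096;

    public bool IsReadOnly => _isReadOnly;

    // Idempotent: calling it again on a frozen instance changes nothing.
    public void MakeReadOnly() => _isReadOnly = true;

    public bool UseGeneratedCode
    {
        get => _useGeneratedCode;
        set { ThrowIfReadOnly(); _useGeneratedCode = value; }
    }

    public int BufferWriterChunkSize
    {
        get => _bufferWriterChunkSize;
        set { ThrowIfReadOnly(); _bufferWriterChunkSize = value; }
    }

    // Every setter funnels through this guard.
    private void ThrowIfReadOnly()
    {
        if (_isReadOnly)
            throw new InvalidOperationException(
                "Options have been made read-only and can no longer be mutated. Construct a new options instance instead.");
    }
}
```

A holder such as a protocol constructor would finish its own adjustments (for example setting BufferWriterChunkSize) and then call MakeReadOnly(); any later setter call from another thread fails loudly instead of racing with in-flight serialization.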
Acceptance:
- Concurrent stress test (16 threads × 1000 iterations) on a shared AcBinarySerializerOptions instance with property-mutation-attempts mid-iteration: all mutations after MakeReadOnly() throw InvalidOperationException; no silent corruption observed.
- Existing tests pass unchanged (the MakeReadOnly is opt-in for the serializer entries; tests that build options + use them once continue to work transparently).
- BINARY_ISSUES.md#accore-bin-i-l8n5 Status updated to Closed (YYYY-MM-DD) with a ### Resolution sub-section pointing to this TODO + the implementing commit.
- Doc-string on AcBinarySerializerOptions documents the freeze-on-first-use contract; BINARY_FEATURES.md or BINARY_OPTIONS.md cross-references the BCL-precedent (JsonSerializerOptions.MakeReadOnly).
ACCORE-BIN-T-F8N3: Switch source-generator type-name hashing from simple-name to fully-qualified-name
Priority: P3 · Type: Refactor · Related: ACCORE-BIN-T-I3P8 (override mechanism for residual collisions)
The source generator's ComputeFnvHash(typeSymbol.Name) uses the simple name only (e.g. "User", not "MyApp.A.User"). Cross-namespace types with the same simple name silently collide on s_typeNameHash. The hash is currently only consumed by the WireMode=Metadata inline metadata-write path (cross-version property compat) — the framework explicitly does NOT add wire-format type-id (per CLAUDE.md Rule #7: type-dispatch is consumer responsibility, see BINARY_ASYNCPIPE_ISSUES.md#accore-bin-i-t6v2). Within UseMetadata, the simple-name collision can still cause silent property-set mismatches between two types with the same short name in different namespaces — this TODO fixes that.
Change scope (AcBinarySourceGenerator.cs) — 4 call sites: ComputeFnvHash(typeSymbol.Name) → ComputeFnvHash(typeSymbol.ToDisplayString()):
- Self type-name hash (~line 358)
- Child type-name hash (~line 157)
- Element type-name hash (~line 254)
- Dict-value type-name hash (~line 311)
No runtime code changes; output regenerates with new constants on next build.
Breaking change scope: any saved binary stream that uses WireMode=Metadata and was produced by an older version embeds the old simple-name hash; consumers reading those streams with the new hash compute would mismatch and throw. Pre-1.0: acceptable. Post-1.0 would require a WireMode=Metadata format-version bump.
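The collision can be reproduced with a plain FNV-1a 32-bit hash. The sketch below is an assumption about the hashing scheme, not a copy of the generator's ComputeFnvHash: it hashes the UTF-8 bytes of the name with the standard FNV-1a constants, and TypeNameHashSketch is a hypothetical name. The real generator may hash UTF-16 code units or differ in other details.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration; the real ComputeFnvHash may differ in input encoding.
public static class TypeNameHashSketch
{
    public static uint Fnv1a32(string text)
    {
        const uint offsetBasis = 2166136261; // standard FNV-1a 32-bit offset basis
        const uint prime = 16777619;         // standard FNV 32-bit prime
        uint hash = offsetBasis;
        foreach (byte b in Encoding.UTF8.GetBytes(text))
        {
            hash ^= b;                        // xor first ...
            hash = unchecked(hash * prime);   // ... then multiply (FNV-1a order)
        }
        return hash;
    }
}
```

Hashing the simple name gives Fnv1a32("User") for both MyApp.A.User and MyApp.B.User, so the two types share one hash by construction. Hashing the fully qualified strings "MyApp.A.User" and "MyApp.B.User" yields two different values, because each FNV-1a step is a bijection on the 32-bit state and the inputs differ in a single byte.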
Acceptance:
- All *_GeneratedWriter.g.cs files regenerate with FQN-based s_typeNameHash values.
- Existing tests pass (auto-regen propagates; no manual hash literals in tests).
- Wire format identical for WireMode=Compact (no metadata embedded).
- UseMetadata=true paths produce different hashes; explicitly tested via round-trip.
ACCORE-BIN-T-I3P8: [AcBinaryTypeId(...)] attribute — explicit type-id override
Priority: P3 · Type: Feature · Related: ACCORE-BIN-T-F8N3 (FQN base hash being overridden)
Once ACCORE-BIN-T-F8N3 reduces collision frequency by switching to FQN, residual FQN-hash collisions are still possible (32-bit hash space, birthday paradox). Currently the only consumer of s_typeNameHash is the WireMode=Metadata inline metadata-write path — a residual collision there causes a silent property-set mismatch.
[AcBinaryTypeId(0x12345)] attribute on a class:
- Source generator emits s_typeNameHash = 0x12345 instead of computing FNV.
- Two types with the same [AcBinaryTypeId(...)] value → compile-time / first-use error.
Useful for:
- Resolving rare FQN-hash collisions deterministically (within WireMode=Metadata).
- Pinning a stable type-id across class renames (wire-compat across versions in Metadata mode).
- Future-proofing: if a Layer 1 consumer (hypothetically) builds a type-dispatch above AcBinary using s_typeNameHash, the same override mechanism applies.
Acceptance:
- New attribute class shipped alongside [AcBinarySerializable].
- Generator honours the override (emits explicit constant instead of FNV result).
- Tests: rename a class with [AcBinaryTypeId] → s_typeNameHash unchanged.
ACCORE-BIN-T-X2M5: Evaluate xxHash3 vs FNV-1a for type-name hashes
Priority: P3 · Type: Investigation · Related: ACCORE-BIN-T-F8N3
FNV-1a is currently used for both s_typeNameHash and s_propertyHashes. For compile-time hashing, performance is irrelevant. For collision resistance:
- FNV-1a 32-bit: ~50% collision at ~77K types (birthday paradox). Adequate for small/medium projects, marginal for large ones with many auto-generated types.
- xxHash3 32-bit: comparable mathematical properties to FNV-1a (both non-cryptographic).
- xxHash3 64-bit: dramatically better collision resistance (~50% at ~5B entries), at the cost of 8 wire bytes instead of 4.
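The thresholds quoted above follow from the standard birthday approximation, assuming the hash behaves like a uniform random function over $2^{b}$ values (an idealization for any non-cryptographic hash). The number of distinct inputs at which the collision probability reaches 50% is roughly:

$$
n_{0.5} \approx \sqrt{2 \ln 2 \cdot 2^{b}} \approx 1.1774 \cdot 2^{b/2}
$$

$$
b = 32:\; n_{0.5} \approx 1.1774 \cdot 65{,}536 \approx 7.7 \times 10^{4}
\qquad
b = 64:\; n_{0.5} \approx 1.1774 \cdot 4{,}294{,}967{,}296 \approx 5.1 \times 10^{9}
$$

Under this model, moving from 32 to 64 bits multiplies the 50% threshold by $2^{16}$, while swapping one 32-bit hash for another leaves it unchanged.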
Trigger: real collisions observed (1000+ types per assembly + cross-assembly aggregation), or community feedback indicating collision pain.
Investigation questions (no code change without a triggering pain signal):
- Switch to xxHash3 32-bit (incremental improvement) — but doubles the change scope (touch property hashes too if uniformity desired).
- Switch to xxHash3 64-bit (8 wire bytes instead of 4) — meaningful collision resistance, modest wire cost.
- Stay on FNV-1a + force [AcBinaryTypeId] for collisions: minimal change, devops burden.
Investigation only — defer until pain signal arrives.
ACCORE-BIN-T-K9E4: [RequiresDynamicCode] + [RequiresUnreferencedCode] on Runtime-only methods
Priority: P3 · Type: Refactor · Related: BINARY_FEATURES.md#nativeaot-compatibility
The Runtime path (factories in AcSerializerCommon + wrapper-based deserialize fallback in AcBinaryDeserializer) currently works under NativeAOT thanks to DAMs propagation + RuntimeFeature.IsDynamicCodeSupported guards, but the trimmer still emits warnings for the well-known blind spots (polymorphism via obj.GetType(), nested-type chain via generic argument extraction). The library suppresses these with [UnconditionalSuppressMessage] and documented justification.
A complementary signal would be to mark the Runtime entry points (or the factories themselves) with [RequiresDynamicCode("AcBinary Runtime path uses Reflection.Emit / closed-generic instantiation; use [AcBinarySerializable] + SGen for NativeAOT.")] and [RequiresUnreferencedCode("...")]. Effect:
- AOT publish in consumer's project surfaces a warning at the call site → consumer chooses SGen or accepts the Runtime cost
- Mirrors the System.Text.Json reflection-mode pattern ([RequiresDynamicCode] on JsonSerializer.Serialize<T> overloads)
- One-codebase, no NuGet split needed
- Cheap implementation — attribute placement only
Coordination: [RequiresDynamicCode] is contagious; every caller must either propagate it or suppress with [UnconditionalSuppressMessage]. Scope:
- Public Serialize<T> / Deserialize<T> entry points stay attribute-free (consumer-facing)
- Runtime fallback methods get the attribute (contained inside the library)
- The DAMs annotations we already have stay — they're orthogonal (one prevents trim, the other warns about JIT-only behavior)
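The propagation barrier described above can be sketched as follows. This is a minimal illustration, not the library's code: the class and method names (SketchSerializer, Describe, DescribeGenerated, DescribeRuntime) are hypothetical, and only the attributes and RuntimeFeature.IsDynamicCodeSupported are real BCL members. The annotated fallback carries the warning, and the attribute-free public method suppresses it where the guard makes the call safe.

```csharp
using System;
using System.Diagnostics.CodeAnalysis;
using System.Runtime.CompilerServices;

// Hypothetical names throughout; only the attributes are real BCL types.
public static class SketchSerializer
{
    // Public entry point stays attribute-free: this is the propagation barrier,
    // so the suppressions live here and not on every caller.
    [UnconditionalSuppressMessage("AOT", "IL3050",
        Justification = "Runtime fallback is only taken when dynamic code is supported.")]
    [UnconditionalSuppressMessage("Trimming", "IL2026",
        Justification = "Runtime fallback is only taken for types preserved by DAMs annotations.")]
    public static string Describe<T>(T value, bool useGeneratedCode)
    {
        if (useGeneratedCode || !RuntimeFeature.IsDynamicCodeSupported)
            return DescribeGenerated(value);

        return DescribeRuntime(value);
    }

    // Stand-in for the SGen path: no reflection, no dynamic code.
    private static string DescribeGenerated<T>(T value) => "sgen:" + typeof(T).Name;

    // Stand-in for the Runtime fallback: annotated so unsuppressed callers get IL3050/IL2026.
    [RequiresDynamicCode("Runtime path uses closed-generic instantiation; prefer SGen for NativeAOT.")]
    [RequiresUnreferencedCode("Runtime path reflects over members the trimmer cannot see.")]
    private static string DescribeRuntime<T>(T value) =>
        "runtime:" + (value is null ? "null" : value.GetType().Name);
}
```

Moving the two suppressions off the public method and onto nothing at all would instead push the warning out to the consumer's call site, which is the other end of the trade-off discussed in this entry.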
Acceptance:
- Consumer's AOT publish surfaces an IL2026/IL3050 warning when UseGeneratedCode=false is set or an unattributed type is deserialized
- SGen path is warning-free
- Library compiles with 0 warnings (suppressions added at the propagation barrier)
- BINARY_FEATURES.md NativeAOT Compatibility section updated to mention the explicit warning signal
ACCORE-BIN-T-A2J7: Optional AyCode.Core.Aot NuGet variant (SGen-only build)
Priority: P3 · Type: Feature · Related: BINARY_FEATURES.md#nativeaot-compatibility, ACCORE-BIN-T-K9E4
Binary-size-sensitive AOT consumers (Blazor WASM, MAUI mobile, embedded, container-trimmed) benefit from a smaller library variant that strips the Runtime fallback path entirely. Estimated savings: ~80-150 KB of native code (~25-60 KB compressed wire size for WASM publish).
Strippable code in the .Aot variant:
| Component | LOC | Purpose | Removable in Aot? |
|---|---|---|---|
| AcSerializerCommon.Create* (7 factory methods + Expression-tree code) | ~150 | Runtime delegate compilation | ✅ Yes |
| TypeMetadataBase runtime metadata path (CompiledConstructor, IdGetters via Expression.Compile) | ~300 | Reflection-based metadata | ✅ Yes |
| AcBinaryDeserializer wrapper-based runtime fallback (PopulateObjectPropertiesIndexed, ReadObjectCoreWithWrapper non-SGen branches, CreateInstance(type) Activator-fallback) | ~500 | Runtime polymorphic dispatch | ✅ Yes |
| Property accessor runtime delegate fields (_dynamicGetter, typed getter/setter caches outside SGen) | ~150 | Boxed property access | ✅ Yes |
| System.Linq.Expressions transitive dependency | — | Expression-tree IL emission | ✅ Yes (when nothing else in graph uses it) |
Implementation sketch (avoid an #if forest via a file-level split):
AyCode.Core/Serializers/
AcSerializerCommon.cs // SGen-safe shared parts
AcSerializerCommon.Runtime.cs // 7 Create* factory methods only here
AcBinaryDeserializer.cs // SGen path
AcBinaryDeserializer.Runtime.cs // wrapper-based runtime fallback path
TypeMetadataBase.cs // SGen-safe metadata
TypeMetadataBase.Runtime.cs // Expression.Compile-based ctor + accessor wiring
Two .csproj files:
- AyCode.Core.csproj — full package (current); includes all files
- AyCode.Core.Aot.csproj — <Compile Remove="**/*.Runtime.cs" />; sets <PackageId>AyCode.Core.Aot</PackageId>; same version as full
Trade-offs:
- ✅ No #if directives in business code — physically separate file groups
- ✅ Source mostly shared via SDK include/exclude semantics
- ✅ DAMs annotations and trim-suppressions only land in the full package; the .Aot variant is genuinely trim-clean by construction
- ✅ "Strict SGen" semantics in .Aot: a non-SGen type at deser time throws clearly instead of silently falling back. Marketing positioning: "guaranteed SGen path, no hidden slow lane".
- ⚠️ Two NuGet IDs, two changelogs, version sync (CI-automatable)
- ⚠️ Consumer must pick the right package — wrong choice = breaking switch later
Coordination:
- Land ACCORE-BIN-T-K9E4 first ([RequiresDynamicCode] attributes) — if that pattern handles the consumer-side scenarios well, .Aot may not be needed
- The current Runtime fallback code is already well-isolated (mostly in AcSerializerCommon factories + AcBinaryDeserializer wrapper-based methods), so the file-split refactor is mechanically straightforward
- Marketing decision: is binary size a central pillar? If yes, .Aot is a NuGet differentiator; if not, K9E4 alone is enough
Acceptance:
- AyCode.Core.Aot.csproj produces a NuGet ~25-60 KB smaller than AyCode.Core after compression
- .Aot build emits zero IL/AOT trim warnings (no suppressions needed because the Runtime path code is physically removed)
- Round-trip tests pass on .Aot for all SGen types
- .Aot throws a clear InvalidOperationException (not MissingMethodException) when a non-[AcBinarySerializable] type is encountered at deser time
- BINARY_FEATURES.md NativeAOT Compatibility section documents both packages and when to choose which
ACCORE-BIN-T-V4N2: Cross-tier SIMD UTF-8 transcoder paths (AVX-512BW + Vector128 + multi-byte transcoder)
Priority: P2 · Type: Performance · Related: EncodeUtf8SinglePass, DecodeUtf8SinglePass, CountUtf8Chars
Current SIMD hierarchy (post 2026-05-05 implementation):
AVX-512BW (64 byte/iter) → Server, Intel 11th gen client, AMD Zen 4+
Vector256 / AVX2 (32 byte) → AVX2 host (Intel 12-14th gen, AMD Zen 3 and earlier)
Vector128 (16 byte/iter) → Apple Silicon NEON, WASM SIMD, legacy SSE2
scalar (1 byte/iter) → no-SIMD fall-back
JIT/AOT path-selection via [Intrinsic] IsSupported static booleans — non-supported tiers constant-folded to dead code per host. Cascading tail handlers: a higher tier's tail (< 64 byte AVX-512 → < 32 byte Vector256 → < 16 byte Vector128 → scalar) is processed by the next-lower tier on the same iteration. No regression on any host.
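The tier guard plus tail hand-off can be illustrated with a reduced two-tier version of a char-count pass. This is a sketch under stated assumptions, not the actual CountUtf8Chars: the class and method names are hypothetical, the input is assumed to be valid UTF-8 (trust-input), and only the Vector128 and scalar tiers are shown. Wider tiers would sit above the Vector128 loop behind their own IsSupported / IsHardwareAccelerated guards and leave their tails to it in the same way.

```csharp
using System;
using System.Numerics;
using System.Runtime.Intrinsics;

// Hypothetical sketch: counts the UTF-16 chars that a valid UTF-8 buffer decodes to.
public static class Utf8CountSketch
{
    public static int CountUtf16Chars(ReadOnlySpan<byte> src)
    {
        int i = 0, count = 0;

        // The guard is a JIT-time constant, so the branch is dead code on hosts without SIMD.
        if (Vector128.IsHardwareAccelerated)
        {
            Vector128<sbyte> lastContinuation = Vector128.Create((sbyte)-65); // 0xBF as signed
            Vector128<byte> fourByteLead = Vector128.Create((byte)0xF0);

            for (; src.Length - i >= Vector128<byte>.Count; i += Vector128<byte>.Count)
            {
                Vector128<byte> block = Vector128.Create(src.Slice(i, Vector128<byte>.Count));

                // Every byte that is not a continuation byte (0x80..0xBF) starts a code point.
                uint starts = Vector128.GreaterThan(block.AsSByte(), lastContinuation)
                                       .ExtractMostSignificantBits();
                // 4-byte leads (>= 0xF0) decode to a surrogate pair, i.e. one extra char.
                uint astral = Vector128.GreaterThanOrEqual(block, fourByteLead)
                                       .ExtractMostSignificantBits();

                count += BitOperations.PopCount(starts) + BitOperations.PopCount(astral);
            }
        }

        // Scalar tier: handles the < 16-byte tail, or the whole buffer on no-SIMD hosts.
        for (; i < src.Length; i++)
        {
            byte b = src[i];
            if ((b & 0xC0) != 0x80) count++;
            if (b >= 0xF0) count++;
        }

        return count;
    }
}
```

Because the count is per byte, a multi-byte sequence that straddles a block boundary needs no special handling in this pass; that edge case only appears in the decode phases.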
Implementation status:
| Phase | Method | AVX-512BW | Vector256 | Vector128 | scalar |
|---|---|---|---|---|---|
| 1 | CountUtf8Chars (decode 1st pass) | ✅ done | ✅ existing | ✅ done | ✅ existing |
| 2 | EncodeUtf8SinglePass Phase 1 (ASCII narrow) | ✅ done | ✅ existing | ✅ done | ✅ existing |
| 2.5 | DecodeUtf8SinglePass scalar run-length decoder (multi-byte baseline) | — | — | — | ⏳ TODO |
| 3a | DecodeUtf8SinglePass multi-byte transcoder (Vector512) | ⏳ TODO | bail-out only | bail-out only | ✅ existing |
| 3b | DecodeUtf8SinglePass multi-byte transcoder (Vector256) | — | 🔍 deferred — see note | bail-out only | ✅ existing |
| 3c | DecodeUtf8SinglePass multi-byte transcoder (Vector128) | — | — | ⏳ TODO | ✅ existing |
Note on Phase 3b (Vector256 / AVX2) — deferred, not dropped. AVX2 lacks the AVX-512BW primitives (CompareEqualMask producing a __mmask k-register, in-lane vpermb, mask-driven vpcompressb) that make the classify-mask-compress-widen pipeline efficient. The Vector256.Shuffle is cross-lane via two vpshufb (per-128-bit-lane), which complicates leader-byte extraction across multi-byte sequences spanning the lane boundary. The simdutf C++ project — the canonical reference for this algorithm class — implements only SSE4 (16-byte) and AVX-512 (64-byte) paths; it explicitly skips AVX2 because the implementation cost-benefit is unfavorable on this algorithm.
On AVX2 hosts, the Phase 3c (Vector128) transcoder runs as the primary multi-byte path AND as tail handler — covering AVX2 hosts with 16-byte/iter, which is already a significant win over the current scalar multi-byte branch. Phase 3b would require either:
- Hand-rolling an AVX2-specific 32-byte algorithm with cross-lane permute workarounds (research-grade complexity, uncertain net win — could be SLOWER than the Vector128 path due to cross-lane shuffle latency)
- Waiting for Avx10v1 / Avx10v2 to expose AVX-512BW-class primitives in 256-bit form (Intel's unified vector ISA — Avx10v1 already in .NET 9, Avx10v2 arrives with future Intel hardware)
Re-evaluation triggers: if benchmark on AVX2 hosts shows Phase 3c Vector128 path leaves > 10% Deser gap vs MemPack on multi-byte content; or if Avx10v1 256-bit primitives mature enough to make the algorithm tractable. Until then: Phase 3b stays in the TODO as a research / future-work item — not actively scheduled, but documented so a future contributor doesn't re-derive the AVX2 limitations.
Phase 3 is the remaining gap — UTF-8 multi-byte decode on every host class. ASCII path is already fast across all SIMD tiers (Vector256 + Vector128 prefix widen + Encoding.Latin1.GetString BCL fast path). The gap is on multi-byte UTF-8 content — Hungarian / Cyrillic / Greek (2-byte) and CJK BMP (3-byte) sequences — where the SIMD prefix bails out on the first non-ASCII byte and falls back to scalar bit-extract. The Repeated benchmark cell (Hungarian content) is the canonical witness; with all-Hungarian content (current bench data), Small / Repeated Deser cells trail MemPack by 6-14%.
Why all 3 SIMD tiers (not just AVX-512BW) — public NuGet package goal: i18n payloads must be fast on every supported host (cloud server, desktop, mobile, Blazor WASM), not only AVX-512-capable cloud servers. Our own scalar multi-byte branch is the bottleneck on all non-ASCII content regardless of host class. The BCL Encoding.UTF8 falls back to a similar scalar path on multi-byte content (with virtual dispatch + EncoderFallback overhead), so even where the BCL has its own SIMD 2-byte handler (.NET 9 PR #92580), our trust-input scalar wins on net — but an in-house SIMD multi-byte path would dominate on every host.
Phase 3 approach — in-house multi-byte transcoder, three SIMD widths. Single algorithm template (classify-mask-compress-widen pipeline) ported across Vector512 / Vector256 / Vector128 register widths. Algorithm designed and written in-house — no third-party port, no NuGet dependency:
- Phase 3a — DecodeUtf8SinglePass Vector512 (AVX-512BW): 64-byte block fetch → classify each byte's UTF-8 sequence position via mask compares → byte-compression for length-resolution → widen to UTF-16 in two Vector256<ushort> lanes → store. ~3-5× speedup vs current scalar multi-byte branch on Hungarian / CJK content. Activates on AVX-512 hosts (cloud server, Intel 11th gen, AMD Zen 4+).
- Phase 3b — DecodeUtf8SinglePass Vector256 (AVX2): same algorithm at 32-byte block. Smaller register space → fewer codepoints per iter, but ASCII bail-out gone → multi-byte content is now SIMD-handled. ~2-3× speedup. Activates on AVX2 hosts (Intel 12-14th gen, AMD Zen 3 and earlier).
- Phase 3c — DecodeUtf8SinglePass Vector128 (NEON / SSE / WASM SIMD): same algorithm at 16-byte block. ~1.5-2× speedup. Activates on Apple Silicon / WASM / legacy x86 — covering the i18n production case for mobile (MAUI iOS / Android) and Blazor WASM.
The cascading tail-handler hierarchy (existing in Phase 1+2) carries over: AVX-512 → Vector256 → Vector128 → scalar tail. Each tier hands off the < N-byte tail to the next-lower tier.
No .NET 11 / multi-targeting needed. Avx512BW, Vector256, Vector128 intrinsics all available in .NET 9 (and .NET 8). Implementation lands on the current net9.0 target.
Hardware reach (2026). Per Wikipedia "CPUs with AVX-512":
- ✅ Intel server: Skylake-X (2017), Cascade Lake-X, Ice Lake-SP, Sapphire Rapids (2023+), Emerald Rapids, Granite Rapids — near-universal in cloud (Azure, AWS, GCP)
- ✅ Intel client 11th gen: Tiger Lake (mobile, 2020), Rocket Lake (desktop, 2021), Ice Lake (mobile) — pre-Alder Lake era still supports AVX-512
- ❌ Intel client 12-14th gen: Alder Lake / Raptor Lake / Meteor Lake / Core Ultra — AVX-512 disabled at firmware level (E-core blocking) → falls back to Vector256
- ✅ AMD Zen 4+: Ryzen 7000 (2022), Ryzen 9000 (2024), EPYC Genoa (2022), EPYC Turin (2024)
- ❌ AMD pre-Zen 4: Zen 3 and earlier → falls back to Vector256
- ❌ Apple Silicon / ARM: NEON only → uses Vector128 (16 byte/iter)
- ❌ Blazor WASM: only 128-bit SIMD per WASM SIMD spec → uses Vector128 (16 byte/iter)
The Vector128 path is the WASM and Apple Silicon target — without it both platforms fell back to scalar (1 byte/iter). With Phase 1+2 landed, WASM and Apple Silicon now run the UTF-8 hot path at 16 byte/iter (16× scalar speedup on the count + ASCII narrow operations).
Phase 2.5 — scalar run-length decoder (multi-byte baseline, pre-Phase 3 prototype) — TESTED & REVERTED 2026-05-07
Status update (2026-05-07): Phase 2.5 was implemented and tested in two configurations:
- Full run-length (15:56:54 bench) — both 2-byte and 3-byte tiers used inner do-while loops. Result: +13.0 pp Deser regression on the Hungarian-mixed Repeated cell. Hypothesis confirmed (foreseen pre-implementation): short Magyar 2-byte runs (1-2 char average) make the run-detection overhead exceed the per-char payload; switch-jumptable per-char dispatch wins on this content shape.
- Hybrid (post-15:56:54) — 2-byte single decode, 3-byte run do-while only. Tested, but benchmark-noise instability left the signal unmeasurable. Reverted along with the V4N4 method-split (2026-05-07).
The optimization-value signal proved below the bench noise floor on the available hardware. The 3-byte do-while CJK-content win remains a theoretically valid target — but cannot be objectively validated without the ACCORE-BIN-T-C5R8 charset-parameterized benchmark workload (CJK option). Re-evaluate when CJK workload measurement becomes available.
Re-evaluable as of 2026-05-07 per ACCORE-BIN-T-D9X3 — bench stabilization removes the noise-floor that made the original signal unmeasurable; retest before any code change. (Charset bias remains — pair with ACCORE-BIN-T-C5R8 for CJK validation.)
Retested 2026-05-08 — REGRESSION CONFIRMED (Latin1Long charset, stabilized bench): adding the do-while inner loop on both 2-byte and 3-byte tiers in DecodeUtf8SinglePass produced +5-8pp Deser regression on every cell vs. the switch-jumptable baseline (Small +7.8pp, Medium +7.1pp, Large +5.5pp, Repeated +7.4pp, Deep +4.9pp). Reverted to switch-jumptable single-decode same day. The V4N2 entry's original prediction held: "Magyar mixed (KözösCímke, sötét — short alternating runs): 0-5% (run-detection overhead may eat the savings on short runs)" — Latin1Long suffix has 1-2 char average run length, well below the run-detection break-even point. Phase 2.5 is dead on Magyar mixed. CJK retest still untried, but Phase 2.5 is now obsoleted by ACCORE-BIN-T-K7M3 (the decoder hot path runs Utf8.ToUtf16 BCL static API, not DecodeUtf8SinglePass).
Below: original Phase 2.5 design notes preserved as documentation. Implementation details remain accurate even though the implementation was reverted.
Targets the DecodeUtf8SinglePass switch-jumptable per-char dispatch on multi-byte content. Current scalar Phase (jumptable) re-dispatches every char; a run-length-aware scalar decoder runs a tight branchless inner loop on homogeneous runs (long ASCII run, long 2-byte Latin/Cyrillic run, long 3-byte CJK BMP run), with the existing single-codepoint scalar branch as mixed-edge fallback.
Algorithm sketch:
while (s < src.Length)
{
// 1) ASCII run (0xxxxxxx) — already handled by Phase 1 SIMD prefix; this is tail
int asciiStart = s;
while (s < src.Length && src[s] < 0x80) s++;
if (s > asciiStart) { WriteAsciiRun(src.Slice(asciiStart, s-asciiStart), dst, ref d); continue; }
// 2) 2-byte run (110xxxxx 10xxxxxx) — Hungarian / Cyrillic / Greek / Hebrew / Arabic
int twoStart = s;
while (s + 1 < src.Length && Is2ByteLead(src[s]) && IsCont(src[s+1])) s += 2;
if (s > twoStart) { Decode2ByteRun(src.Slice(twoStart, s-twoStart), dst, ref d); continue; }
// 3) 3-byte run (1110xxxx 10xxxxxx 10xxxxxx) — CJK BMP, other 3-byte BMP scripts
int threeStart = s;
while (s + 2 < src.Length && Is3ByteLead(src[s]) && IsCont(src[s+1]) && IsCont(src[s+2])) s += 3;
if (s > threeStart) { Decode3ByteRun(src.Slice(threeStart, s-threeStart), dst, ref d); continue; }
// 4) Mixed-edge fallback (typically 4-byte surrogate pair or single transition char)
DecodeSingleCodePoint(src, ref s, dst, ref d);
}
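The helpers named in the sketch above are not spelled out in this entry. One possible shape for the predicates and the two run decoders is shown below, assuming dst is a Span<char> and that each run passed in contains only complete, valid sequences of a single width (which is what the run-detection loops guarantee). The class name and signatures are assumptions for illustration only.

```csharp
using System;

// Hypothetical helper shapes for the run-length sketch; trust-input, no validation.
public static class RunLengthHelpersSketch
{
    public static bool Is2ByteLead(byte b) => (b & 0xE0) == 0xC0; // 110xxxxx
    public static bool Is3ByteLead(byte b) => (b & 0xF0) == 0xE0; // 1110xxxx
    public static bool IsCont(byte b) => (b & 0xC0) == 0x80;      // 10xxxxxx

    // Decodes a run made only of 2-byte sequences: 5 payload bits + 6 payload bits per char.
    public static void Decode2ByteRun(ReadOnlySpan<byte> run, Span<char> dst, ref int d)
    {
        for (int i = 0; i + 1 < run.Length; i += 2)
            dst[d++] = (char)(((run[i] & 0x1F) << 6) | (run[i + 1] & 0x3F));
    }

    // Decodes a run made only of 3-byte sequences: 4 + 6 + 6 payload bits per char.
    public static void Decode3ByteRun(ReadOnlySpan<byte> run, Span<char> dst, ref int d)
    {
        for (int i = 0; i + 2 < run.Length; i += 3)
            dst[d++] = (char)(((run[i] & 0x0F) << 12)
                            | ((run[i + 1] & 0x3F) << 6)
                            | (run[i + 2] & 0x3F));
    }
}
```

The inner loops carry no per-char dispatch, which is where the expected saving on long homogeneous runs comes from; on 1-2 char runs the surrounding run detection dominates, consistent with the regression recorded above.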
Why P2.5 — scalar baseline before SIMD multi-byte (Phase 3a-3c):
- 1-2h prototyping cost vs 6-10h Phase 3 SIMD work
- A/B benchmark on Repeated cell decides whether the run-length structure already wins on Magyar mixed (KözösCímke pattern) — if it does, Phase 3 lifts further; if not, Phase 3 SIMD is the only win path
- Documents the "switch-jumptable bottleneck on Hungarian benchmark" hypothesis without committing to the larger SIMD effort
- The Decode2ByteRun / Decode3ByteRun scalar-batch implementations also serve as algorithm references for the Phase 3 SIMD versions (clear semantics first, optimize after)
Expected payoff (per content class, ratio vs current switch-jumptable):
- Long CJK BMP (3-byte run, e.g. 你好世界 ×30): ~20-40% Deser improvement (long homogeneous run, biggest jumptable savings)
- Long 2-byte run (árvíztűrő ×10+): ~5-15% improvement
- Magyar mixed (KözösCímke, sötét — short alternating runs): 0-5% (run-detection overhead may eat the savings on short runs)
- Long ASCII (≥32 byte): 0% (Phase 1 SIMD prefix already handles)
- Emoji (4-byte): 0% (mixed-edge fallback unchanged)
Risk — the existing switch-jumptable JIT optimization is strong; Magyar mixed text (1-2 char runs) may not show net gain. Implementation must be isolated prototype first (alongside the live DecodeUtf8SinglePass, not replacing it), with A/B benchmark comparing the two before any switch.
Acceptance (Phase 2.5):
- Repeated cell Compact Deser ratio ≤ 1.0 vs MemPack on AVX2 hosts (parity with current measurement, no regression)
- Round-trip tests pass on all UTF-8 content classes (ASCII / 2-byte / 3-byte BMP / 4-byte surrogate-pair)
- A/B benchmark shows ≥ 5% Deser improvement on Repeated OR ≥ 10% on Large cell — else Phase 2.5 stays in TODO as documented dead-end (negative result is also valuable: confirms the jumptable is fast enough, focus moves entirely to Phase 3)
Phase 3 implementation outline
- Insert SIMD multi-byte branches at DecodeUtf8SinglePass entry, before the existing ASCII-prefix bail-out loops:
  if (Avx512BW.IsSupported && byteCount >= 64) { Vector512MultiByteDecode(...) }
  if (Vector256.IsHardwareAccelerated && len-i >= 32) { Vector256MultiByteDecode(...) }
  if (Vector128.IsHardwareAccelerated && len-i >= 16) { Vector128MultiByteDecode(...) }
  // existing scalar tail
- Single algorithm template — classify-mask-compress-widen pipeline:
  - Block load (Vector512 / Vector256 / Vector128)
  - Classify each byte's UTF-8 sequence position via mask compares (start vs continuation, 1/2/3/4-byte sequence width)
  - Compute output char count via popcount on start-byte mask + extra-char mask for 4-byte sequences
  - Byte-compression for leader/continuation extraction (mask-driven PermuteVar / Shuffle)
  - Combine leader + continuations into codepoints (shift + OR)
  - Widen codepoints to UTF-16 chars (handle surrogate pairs for 4-byte sequences)
  - Store output, advance src/dst pointers
- Block-boundary edge case: incomplete multi-byte sequence at block end → carry to next iter or hand off to lower tier / scalar tail
- Trust-input semantics maintained — no validate-pass instructions (reader input is valid UTF-8 by writer contract)
- Avx512BW.X64.IsSupported (64-bit-only intrinsics) checked separately if any code path requires the X64 sub-feature
Why P2
- "i18n production deploy" perf gap on every host class — the public NuGet package contract requires fast multi-byte UTF-8 across cloud server, desktop, mobile, and Blazor WASM
- No NuGet dependency, no third-party code, no wire-format change, additive — pure CPU optimization
- Phase 1+2 delivered cross-tier ASCII / count SIMD coverage; Phase 3 closes the multi-byte CPU gap on all SIMD-capable hosts (not just AVX-512)
- Single algorithm template ported across 3 register widths — code volume manageable
Acceptance
- Repeated Deser ratio ≤ 0.7 vs MemPack on AVX-512 hosts (Phase 3a)
- Repeated Deser ratio ≤ 0.8 vs MemPack on AVX2 hosts (Phase 3b)
- Repeated Deser ratio ≤ 0.85 vs MemPack on Apple Silicon / WASM (Phase 3c)
- Repeated Ser ratio ≤ 0.85 across all host classes
- Round-trip tests pass on all UTF-8 content classes (ASCII / 2-byte / 3-byte BMP / 4-byte surrogate-pair)
- BINARY_FEATURES.md documents the SIMD path selection across all four tiers
Trigger
- Each SIMD width validated on a representative host before merge:
- Phase 3a: AVX-512 host (developer's local AMD Zen 4+ desktop, Intel 11th gen, or server-class machine)
- Phase 3b: AVX2 host (any modern x86 desktop / laptop without AVX-512)
- Phase 3c: Apple Silicon (macOS / iOS / Mac Catalyst) AND Blazor WASM browser runtime
- Local dotnet test covers correctness; per-tier benchmarks measure the multi-byte speedup
- Phase 1+2 (AVX-512BW + Vector128 in CountUtf8Chars + EncodeUtf8SinglePass Phase 1) landed 2026-05-05 — covered by existing round-trip tests, no regression on non-AVX-512 hosts (validated on AVX2-host bench)
ACCORE-BIN-T-H2Q6: Fixed-width dual-length string header (Small/Medium/Big) for 1-pass decode
Priority: P1 · Type: Wire-format + Performance · Status: Closed (2026-05-06) · Related: DecodeUtf8SinglePass, CountUtf8Chars, WriteStringWithDispatch, ReadStringUtf8
Current Compact string decode uses two-pass flow for non-ASCII payloads (CountUtf8Chars + DecodeUtf8SinglePass).
Planned direction: remove VarUInt-based string-length path for the new string wire variant, and carry both lengths in a fixed-width header so deserialize can allocate target string immediately and decode in a single pass.
Planned format tiers
- Small: packed uint16 (charLen:8 | utf8Len:8)
- Medium: packed uint32 (charLen:16 | utf8Len:16)
- Big: uint32 charLen + uint32 utf8Len
Writer picks the smallest fitting tier; reader dispatches by marker and reads fixed-width lengths (no VarUInt loop for string length metadata).
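A minimal sketch of the tier selection and fixed-width header pack/unpack follows. Several details are assumptions made for illustration and are not taken from the wire spec: charLen sits in the high half of the packed value, the header is little-endian, and the marker byte is left out (the caller is assumed to have dispatched on it already and passes the tier in). The names StringHeaderSketch, PickTier, WriteHeader and ReadHeader are hypothetical.

```csharp
using System;
using System.Buffers.Binary;

// Hypothetical sketch of the Small/Medium/Big length header; marker byte omitted.
public static class StringHeaderSketch
{
    public enum Tier { Small, Medium, Big }

    // utf8Len decides the tier; charLen <= utf8Len always holds, so it fits the same width.
    public static Tier PickTier(int utf8Len) =>
        utf8Len <= byte.MaxValue ? Tier.Small
        : utf8Len <= ushort.MaxValue ? Tier.Medium
        : Tier.Big;

    // Writes the length header and returns the number of bytes written (2, 4 or 8).
    public static int WriteHeader(Span<byte> dst, int charLen, int utf8Len)
    {
        switch (PickTier(utf8Len))
        {
            case Tier.Small:
                BinaryPrimitives.WriteUInt16LittleEndian(dst, (ushort)((charLen << 8) | utf8Len));
                return 2;
            case Tier.Medium:
                BinaryPrimitives.WriteUInt32LittleEndian(dst, ((uint)charLen << 16) | (uint)utf8Len);
                return 4;
            default:
                BinaryPrimitives.WriteUInt32LittleEndian(dst, (uint)charLen);
                BinaryPrimitives.WriteUInt32LittleEndian(dst.Slice(4), (uint)utf8Len);
                return 8;
        }
    }

    // Reads both lengths with fixed-width loads; no VarUInt loop involved.
    public static (int CharLen, int Utf8Len) ReadHeader(ReadOnlySpan<byte> src, Tier tier)
    {
        switch (tier)
        {
            case Tier.Small:
                ushort s = BinaryPrimitives.ReadUInt16LittleEndian(src);
                return (s >> 8, s & 0xFF);
            case Tier.Medium:
                uint m = BinaryPrimitives.ReadUInt32LittleEndian(src);
                return ((int)(m >> 16), (int)(m & 0xFFFF));
            default:
                return ((int)BinaryPrimitives.ReadUInt32LittleEndian(src),
                        (int)BinaryPrimitives.ReadUInt32LittleEndian(src.Slice(4)));
        }
    }
}
```

With charLen known up front, the reader can allocate the target string at its final length and decode straight into it, which is the single-pass property this entry is after.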
Why
- Removes CountUtf8Chars pass on the new markers (1-pass decode path)
- Keeps decode branch profile stable (fixed-size header reads)
- Maintains range safety with explicit Big overflow path
Constraints captured from current benchmark context
- Performance evaluation target is non-ASCII-heavy data (ASCII-shortcuts intentionally not primary)
- Wire-format backward compatibility is not required for this development phase
Marker layout decision (2026-05-06)
After analysis on the new "all UTF-8 Magyar" benchmark baseline (2026-05-06_13-10-30.LLM — Compact +5-25% slower than MemPack on every cell):
Confirmed: the previous benchmark's Compact-vs-MemPack advantage was an artifact of ASCII property names hitting the FixStrAscii / Latin1-widen fast path; once string property values are also UTF-8 Magyar, the actual hot path (EncodeUtf8SinglePass + two-pass CountUtf8Chars + DecodeUtf8SinglePass) becomes the bottleneck.
Marker scope decision — clean split between ASCII fast path and non-ASCII tier dispatch:
RETAINED (unchanged):
- FixStrAscii (≤31 byte ASCII) — compact 1-byte header + Latin1 widen, zero UTF-8 decode pipeline
- StringAscii (>31 byte ASCII) — long ASCII fast path, Latin1 widen
- StringInternRef — 2nd+ occurrence of interned string (no body, just cache index — not affected by 2-pass problem)
- StringEmpty, Null — sentinel markers
REMOVED (replaced by H2Q6 tiers):
- FixStr (32 marker values 103-134 — non-ASCII short) → replaced by StringSmall
- String (1 marker value 91 — non-ASCII long with VarUInt utf8Len) → replaced by StringSmall / StringMedium / StringBig
- StringInternFirst (1 marker value 94 — VarUInt utf8Len interning) → replaced by StringInternFirstSmall / StringInternFirstMedium
NEW markers (5 total):
- StringSmall — non-ASCII, [marker:1][charLen:8][utf8Len:8][bytes], utf8Len ≤ 255
- StringMedium — non-ASCII, [marker:1][charLen:16][utf8Len:16][bytes], utf8Len ≤ 65535
- StringBig — non-ASCII, [marker:1][charLen:32][utf8Len:32][bytes], utf8Len > 65535
- StringInternFirstSmall — [marker:1][cacheIdx:VarUInt][charLen:8][utf8Len:8][bytes]
- StringInternFirstMedium — [marker:1][cacheIdx:VarUInt][charLen:16][utf8Len:16][bytes]
Trade-off justification:
- Wire cost on short non-ASCII strings: +2 byte/string header (3 vs 1) → ~0.07-0.36% wire growth on Repeated cell (10 short Magyar string × 2 byte / 28 KB)
- CPU saving: CountUtf8Chars Pass 1 eliminated on every non-ASCII string decode → directly attacks the +25% Deser baseline gap
- The 2-byte hybrid FixStr (non-ASCII) variant (1 byte marker + 1 byte charLen) was considered but rejected: marginal wire saving (-1 byte vs StringSmall) does not justify the +1 marker complexity given the tiny absolute wire impact on the Repeated cell. Cleaner to have ASCII-vs-non-ASCII at the marker level (FixStrAscii vs StringSmall/Medium/Big).
Interning tier sizing rationale:
- MaxStringInternLength is byte-typed (AcBinarySerializerOptions.cs:125, default 64, absolute max 255 char)
- Realistic Magyar/CJK: 64 char × 2-3 byte = 128-192 byte → Small tier
- Big tier never engages on the interning path — only Small + Medium needed (+2 markers, not +3)
Marker address space reservation (post-H2Q6)
The marker reorg frees 34 marker values (32 FixStr non-ASCII + String + StringInternFirst). After allocating 5 for H2Q6, 29 values remain free. Strategic reservation plan to prevent ad-hoc consumption and minimize future wire-format breaks:
| Reserved range | Count | Future feature | Status |
|---|---|---|---|
| StringSmall / StringMedium / StringBig | 3 | H2Q6 Compact tiers | active (this entry) |
| StringInternFirstSmall / StringInternFirstMedium | 2 | H2Q6 interning tiers | active (this entry) |
| FixArrayBase..FixArrayMax | 16 | ACCORE-BIN-T-L9Y3 (FixArray short-list count in marker) | reserved, future |
| Sentinel-length string tier markers | ~5 | ACCORE-BIN-T-S5L8 (sentinel-length encoding) | reserved, future |
| Markerless schema lane | ~4 | ACCORE-BIN-T-S2X9 (markerless schema lane opt-in) | reserved, future |
| StringFastWire | 1 | ACCORE-BIN-T-F3W6 (dedicated FastWire string marker) | reserved, future |
| General reserve | 3 | unallocated | spare |
Wire-format version bump: v2 → v3 at H2Q6 landing. The reserved-but-unimplemented marker values are documented but not yet decoded — readers throw unknown marker if wire contains them. Future activation of FixArray / sentinel-length / markerless schema lane within the same v3 wire format is non-breaking for already-deployed v3 consumers (they reject unknown markers cleanly; producers opt in to emit them).
Acceptance
- New string markers implemented for Small/Medium/Big tiers + InternFirstSmall/InternFirstMedium tiers
- Deserialize path for these markers performs single-pass decode without CountUtf8Chars
- 29 freed marker values strategically reserved per the address-space reservation table; documented in BinaryTypeCode.cs with // Reserved for ACCORE-BIN-T-XXXX (future) comments
- Wire-format version bump v2 → v3 documented in BINARY_FORMAT.md
- Existing round-trip tests pass, plus new boundary tests for tier transitions (utf8Len = 254/255/256/65534/65535/65536) and interning tier transitions
- Benchmark report includes before/after for Compact mode on non-ASCII dataset (Ser/Deser/RT + Size) vs the 2026-05-06_13-10-30.LLM baseline
Resolution
Landed 2026-05-06. End-to-end implementation: marker reorg + writer tier-dispatch + reader tier-readers + SGen template + skip path + interning path. Five new markers (StringSmall/Medium/Big/InternFirstSmall/InternFirstMedium) replacing the old String/StringInternFirst/FixStrBase..Max (32 + 1 + 1 = 34 marker values freed, 5 used; 29 reserved for future features per the address-space plan). Wire format version bumped v2 → v3.
Follow-up A-direction header pack-write/read optimization landed in the same window: Unsafe.WriteUnaligned<ushort> (Small) / <uint> (Medium) / <ulong> (Big) replace 2× byte / 2× ushort / 2× uint stores; reader uses single uint/ulong loads with bit-extract. Direct ref byte writes (no Span-shape overhead).
Tests: 222 pass / 13 pre-existing GuidIId failures (unchanged). 55/55 Utf8TranscoderTests pass.
Benchmark vs 2026-05-06_13-10-30.LLM baseline (2026-05-07_08-55-49.LLM, immediately post-H2Q6):
- Compact-vs-MemPack Deser ratio improvement on baseline gap: -14 to -28 percentage points across cells
- Deser: 4/5 cells now faster than MemPack (Small -6%, Medium -3%, Large -9%, Deep -7%); Repeated cell remaining +5% gap (V4N2 Phase 3 SIMD multi-byte transcoder targets this)
- Wire size: 5/5 cells smaller than MemPack (-8% to -11%)
- Ser: 1/5 win (Large -9%), 1/5 tie (Medium 0%), 3/5 minor lag (+2-7% Small/Repeated/Deep) — host-noise band
Bench evolution post-H2Q6 (subsequent micro-opts on the same H2Q6 base):
- 2026-05-07_09-39-09.LLM — A-direction header pack-write/read (Unsafe.WriteUnaligned ushort/uint/ulong): noise-level movement, structural improvement
- 2026-05-07_15-13-39.LLM — V4N4 Step 1+2 method-split (AggressiveInlining): regression (Small Ser +29.6 pp, Repeated Ser +8.9 pp) → overly aggressive inlining of WriteStringSmallFast caused code bloat / i-cache pressure
- 2026-05-07_15-29-21.LLM — V4N4 refined (NoInlining on SmallFast, no dispatcher hint, Reader split reverted): consolidated state:
  - Ser: 5/5 cells at parity or better (Small -8.5%, Medium ≈, Large -8.5%, Repeated ≈, Deep ≈)
  - Deser: 4/5 cells faster than MemPack (Medium -4.7%, Large -10.6%, Repeated -3.8%, Deep -10.1%); Small +10% remaining gap
  - Wire: 5/5 cells -8% to -11% smaller (unchanged)
  - Net: Compact now wins 8/10 cells vs MemPack; only Small Deser keeps a +10% gap (small absolute value, ~1 µs)
Critical algorithmic correctness lesson (from V4N3 follow-up GetUtf8ByteCount): the initial 4-popcount formula assumed lowSur == highSur per chunk. Fix: 5-popcount closed-form. Caught by surrogate-pair-split-across-chunk regression tests. Documented in Utf8Transcoder.
Marker address space (post-H2Q6, v3 wire):
- 91 → StringSmall (was String)
- 94 → StringMedium (was StringInternFirst)
- 103 → StringBig
- 104 → StringInternFirstSmall
- 105 → StringInternFirstMedium
- 106..134 reserved (29 values: 16 for L9Y3 FixArray, 5 for S5L8 sentinel-length, 4 for S2X9 markerless schema lane, 1 for F3W6 FastWire dedicated marker, 3 reserve)
Related follow-up TODO entries (now Open): O7G2 (overflow guard), S6F2 (shift-free Small fast path), W2C8 (maximizing H2Q6 use in the WASM string cache).
ACCORE-BIN-T-S5L8: Sentinel-length encoding for strings (wire-size optimization, both modes)
Priority: P3 · Type: Wire-format optimization · Related: AcBinarySerializer.WriteString, AcBinaryDeserializer.ReadValue string dispatch
The leading string-marker byte (String / StringEmpty / Null) exists primarily to distinguish null vs empty vs non-empty before dispatching. For non-polymorphic, non-interned string properties the marker can be replaced by a single sentinel-length VarUInt:
[VarUInt sentinelLength] [content bytes if applicable]
sentinelLength == 0 → null
sentinelLength == 1 → empty string
sentinelLength == N+1 → string of N bytes/chars, content follows
MemoryPack-style encoding pattern. Applies to both Compact (UTF-8) and FastWire (UTF-16 raw) modes; the content following the sentinel differs by mode.
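The 0 / 1 / N+1 mapping can be sketched for the Compact (UTF-8) case as below. This is an illustration under assumptions, not the planned implementation: VarUInt is assumed to be a 7-bit little-endian continuation encoding (LEB128-style), N is taken to be the UTF-8 byte count, and the names SentinelStringSketch, Write and Read are hypothetical.

```csharp
using System;
using System.Text;

// Hypothetical sketch of sentinel-length string encoding (Compact / UTF-8 content).
public static class SentinelStringSketch
{
    // Writes [VarUInt sentinel][UTF-8 bytes] and returns the number of bytes written.
    public static int Write(Span<byte> dst, string? value)
    {
        if (value is null) return WriteVarUInt(dst, 0);            // 0 → null

        int byteCount = Encoding.UTF8.GetByteCount(value);
        int n = WriteVarUInt(dst, (uint)byteCount + 1);             // 1 → empty, N+1 → N bytes
        return n + Encoding.UTF8.GetBytes(value.AsSpan(), dst.Slice(n));
    }

    // Reads one sentinel-encoded string and reports how many bytes were consumed.
    public static string? Read(ReadOnlySpan<byte> src, out int consumed)
    {
        uint sentinel = ReadVarUInt(src, out consumed);
        if (sentinel == 0) return null;

        int byteCount = (int)(sentinel - 1);
        string result = Encoding.UTF8.GetString(src.Slice(consumed, byteCount));
        consumed += byteCount;
        return result;
    }

    // Assumed VarUInt shape: 7 payload bits per byte, high bit set on all but the last byte.
    private static int WriteVarUInt(Span<byte> dst, uint v)
    {
        int i = 0;
        while (v >= 0x80) { dst[i++] = (byte)(v | 0x80); v >>= 7; }
        dst[i++] = (byte)v;
        return i;
    }

    private static uint ReadVarUInt(ReadOnlySpan<byte> src, out int n)
    {
        uint result = 0;
        int shift = 0;
        n = 0;
        byte b;
        do
        {
            b = src[n++];
            result |= (uint)(b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return result;
    }
}
```

Null and empty each cost a single byte here, and a non-empty string costs its VarUInt plus content with no separate marker byte, which is the 1-byte saving the per-mode figures below are based on.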
Per-mode impact
FastWire mode — wire layout today: [String marker][VarUInt charCount][UTF-16 raw bytes]. Sentinel saves 1 byte per non-null string.
| TestData | Current FastWire wire | Estimated with sentinel | Δ |
|---|---|---|---|
| Small | 3122 B | ~3050 B | -2% |
| Medium | 10905 B | ~10500 B | -4% |
| Large | 68603 B | ~67000 B | -2% |
| Repeated | 16244 B | ~15700 B | -3% |
| Deep | 15514 B | ~14900 B | -4% |
Closes the +1.7-8.1% FastWire wire gap vs MemoryPack to near zero or favorable while keeping AcBinary FastWire's +9-20% speed advantage.
Compact mode — wire layout today varies by length:
- Short (≤31 byte): [FixStr+length][UTF-8 bytes] — already 1-byte marker, ties sentinel.
- Long (>31 byte): [String marker][VarUInt byteCount][UTF-8 bytes] — sentinel saves 1 byte (the marker).
Compact gain: only on long strings (>31 byte UTF-8). Estimated −1 byte per long string. Workload-dependent: if most strings are short or use interning, gain is small. If many long mixed-content strings, meaningful saving.
Limitations (both modes)
- Polymorphic object properties: marker needed for type discrimination. Sentinel encoding only applies when the property type is statically string or string?.
- Interning incompatible: sentinel cannot express StringInternFirst / StringInterned markers (those carry cache-index semantics). Interned properties keep marker-based encoding. FastWire mode already disables interning by design (consistent); Compact mode needs per-property dispatch (interned → marker, non-interned → sentinel).
- Compact-mode FixStr ties: short strings (≤31 byte UTF-8) gain nothing in Compact (FixStr is already 1-byte marker+length). The optimization wins only on long strings in Compact.
Implementation outline (rough — refine when implementing)
- Writer: branch in WriteString on property metadata flags (IsString, IsNotInterned, IsNotPolymorphic). If sentinel-eligible, emit VarUInt sentinelLength + content. Else fall through to existing marker-based encoding.
- Reader: matching branch in property reader. If sentinel-eligible (per property metadata), read VarUInt sentinelLength, dispatch on 0/1/N+1.
- SGen: emit sentinel-encoding variant for non-polymorphic non-interned string-typed properties; emit existing marker-encoding for the rest.
- Wire format version bump OR header flag indicating sentinel-encoding-active. (Cross-version compat policy decided when implementing.)
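A minimal sketch of the sentinel-length idea for the FastWire lane, under stated assumptions. The outline only says "dispatch on 0/1/N+1"; the mapping used here (0 = null, 1 = empty, N+1 = N chars) is one reading of that. VarUInt is assumed to be LEB128-style, and SentinelString with its helpers is a hypothetical name, not the AcBinary API.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;

// Round-trip null, empty, and a non-empty string through the sentinel lane.
var buffer = new List<byte>();
SentinelString.Write(buffer, null);
SentinelString.Write(buffer, "");
SentinelString.Write(buffer, "Raklap");

byte[] wire = buffer.ToArray();
int pos = 0;
Console.WriteLine(SentinelString.Read(wire, ref pos) is null);     // True
Console.WriteLine(SentinelString.Read(wire, ref pos) == "");       // True
Console.WriteLine(SentinelString.Read(wire, ref pos) == "Raklap"); // True

// Hypothetical helper: no type marker, the length prefix doubles as the null flag.
static class SentinelString
{
    public static void Write(List<byte> output, string? value)
    {
        if (value is null) { WriteVarUInt(output, 0); return; }
        WriteVarUInt(output, (uint)value.Length + 1);
        // FastWire-style payload: raw UTF-16 code units, no transcoding.
        output.AddRange(MemoryMarshal.AsBytes(value.AsSpan()).ToArray());
    }

    public static string? Read(ReadOnlySpan<byte> input, ref int pos)
    {
        uint sentinel = ReadVarUInt(input, ref pos);
        if (sentinel == 0) return null;          // 0   -> null
        if (sentinel == 1) return string.Empty;  // 1   -> empty
        int charCount = (int)(sentinel - 1);     // N+1 -> N chars
        var chars = MemoryMarshal.Cast<byte, char>(input.Slice(pos, charCount * 2));
        pos += charCount * 2;
        return new string(chars);
    }

    // Assumed LEB128-style VarUInt: 7 payload bits per byte, high bit = "more".
    static void WriteVarUInt(List<byte> output, uint value)
    {
        while (value >= 0x80) { output.Add((byte)(value | 0x80)); value >>= 7; }
        output.Add((byte)value);
    }

    static uint ReadVarUInt(ReadOnlySpan<byte> input, ref int pos)
    {
        uint result = 0;
        int shift = 0;
        while (true)
        {
            byte b = input[pos++];
            result |= (uint)(b & 0x7F) << shift;
            if ((b & 0x80) == 0) return result;
            shift += 7;
        }
    }
}
```

In this model null and empty each cost one byte and every non-null string drops the separate marker byte, which is the 1-byte saving the table above estimates.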
Trigger
- After D-2 / decoder optimization / marker-dispatch land (compact-mode focus completes)
- When wire-size positioning becomes a primary pillar for NuGet release
- Re-evaluate scope at implementation time — exact gain in Compact depends on consumer workload (long-string ratio, interning patterns)
Acceptance
- FastWire mode: AcBinary wire ≤ MemoryPack on at least 4 of 5 test cells
- Compact mode: long-string wire bytes -1 each, no regression on short or interned strings
- Speed benchmark: no regression vs current encoding (essentially zero CPU cost — sentinel is shifted bookkeeping)
- Cross-version compat: documented format version bump + clean fail on old reader / new wire mismatch
- Polymorphic + interned property test cases pass unchanged (use existing marker-based encoding)
ACCORE-BIN-T-M3R7: ASCII marker-dispatch — writer detect + reader dedicated path
Priority: P2 · Type: Performance + wire optimization · Related: BinaryTypeCode.FixStrAsciiBase..StringAscii markers, WriteStringWithDispatch, ReadAsciiBytesAsString
Status: Closed (2026-05-04)
Ordering note: do this AFTER THE ENCODER OPTIMIZATION (see ACCORE-BIN-T-E2F9). Rationale: the Vector256 ASCII narrow/widen paths of the custom encoder/decoder already handle ASCII bytes quickly on their own. On top of that, marker-dispatch only saves the per-call dispatch overhead (no Ascii.IsValid scan, no decoder layer). It is a guaranteed win, but an additive one; from a measurement standpoint it is cleaner to leave it until after the decoder/encoder work.
The FixStrAscii* (135-166) and StringAscii (167) markers are defined in BinaryTypeCode.cs with helper methods (IsAsciiString, IsFixStrAscii, EncodeFixStrAscii, DecodeFixStrAsciiLength). Encoding/decoding logic NOT yet implemented — currently both writer and reader use the universal String / FixStr markers.
Implementation
- Writer: in WriteStringUtf8 / WriteFixStrDirect, after UTF-8 encoding (D-2 path), check bytesWritten == charLength (ASCII iff equal). If ASCII, emit FixStrAscii (≤31 byte) or StringAscii (>31 byte). Else emit existing FixStr / String. The detect is free: both numbers are already computed by D-2.
- Reader: in ReadStringUtf8 (or upstream marker dispatch), branch on marker. ASCII markers → dedicated byte→char widening path (no UTF-8 decode, no Ascii.IsValid scan, no decoder dispatch). Non-ASCII markers → existing custom UTF-8 decoder.
- SGen: regenerate readers/writers to dispatch on the new markers.
- Re-enable ASCII fast paths: uncomment writer FixStr dispatch in AcBinarySerializer.cs and reader Ascii.IsValid block in ReadStringUtf8. These temporarily disabled blocks become the marker-aware paths (no IsValid scan needed since the marker is the contract).
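The writer-side detect and the reader-side widening path can be sketched as follows. This is an illustration only: MarkerKind and AsciiDispatch are hypothetical names, the numeric marker values are deliberately left out, and the BCL Encoding.UTF8 stands in for the custom encoder/decoder.

```csharp
using System;
using System.Text;

foreach (var s in new[] { "OrderId_42", "Term\u00E9kN\u00E9v_7" })
{
    var (marker, payload) = AsciiDispatch.Encode(s);
    string back = AsciiDispatch.Decode(marker, payload);
    Console.WriteLine($"{marker}, {payload.Length} bytes, round-trip ok: {back == s}");
}

// Hypothetical stand-ins for the four string markers named in this entry.
enum MarkerKind { FixStrAscii, FixStr, StringAscii, String }

static class AsciiDispatch
{
    // Writer: one UTF-8 encode pass. Every non-ASCII char needs more than one
    // byte, so byte count == char count holds exactly for pure-ASCII content.
    public static (MarkerKind Marker, byte[] Payload) Encode(string value)
    {
        byte[] payload = Encoding.UTF8.GetBytes(value);
        bool isAscii = payload.Length == value.Length;
        bool isShort = payload.Length <= 31;
        MarkerKind marker = isAscii
            ? (isShort ? MarkerKind.FixStrAscii : MarkerKind.StringAscii)
            : (isShort ? MarkerKind.FixStr : MarkerKind.String);
        return (marker, payload);
    }

    // Reader: the marker is the contract. ASCII markers take a plain
    // byte-to-char widen (Latin1 maps bytes 0x00-0x7F to the same chars);
    // everything else goes through a real UTF-8 decode.
    public static string Decode(MarkerKind marker, ReadOnlySpan<byte> payload)
        => marker is MarkerKind.FixStrAscii or MarkerKind.StringAscii
            ? Encoding.Latin1.GetString(payload)
            : Encoding.UTF8.GetString(payload);
}
```

The reader never re-validates the payload on the ASCII path; correctness depends on the writer only emitting an ASCII marker when the equality check held.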
Wire format change
- Format version bump (1 → 2). Old readers fail clean on new wire (version mismatch). New readers must reject old wire OR support backward read.
Acceptance
- Repeated Strings (Hungarian content) Deser: AcBinary closes the ~10% gap vs MemoryPack
- Pure ASCII tests (Small/Medium/Large/Deep): AcBinary Ser AND Deser ≥ MemoryPack
- Wire size: minimum -25% vs MemoryPack across all test cells
- SGen-generated code compiles and round-trips on all [AcBinarySerializable] types
- Decision documented: backward-compat policy for v2 vs v1 wire
Resolution
End-to-end implementation landed (writer + reader + SGen + skip + populate). Key components:
- Writer (AcBinarySerializer.BinarySerializationContext.WriteStringWithDispatch): single-pass UTF-8 encode + ASCII detect via bytesWritten == charLength; emits one of 4 markers (FixStrAscii / FixStr / StringAscii / String). Split layout for hot path: charLength ≤ 31 encodes optimistically at savedPos+1 (FixStr position) → 0 shift on FixStr hit; charLength > 31 uses D-2 layout with backfill. The split avoids the post-encode left-shift that the unified layout introduced (regression seen in 12-42-32 bench).
- Reader (AcBinaryDeserializer.BinaryDeserializationContext.ReadAsciiBytesAsString): Encoding.Latin1.GetString (BCL SIMD-accelerated byte→char widen). Avoids the string.Create callback + scalar widen overhead; measurably better on Small Deser cell (closed the +20% MemPack-relative anomaly).
- TypeReaderTable: StringAscii (167) + 32 × FixStrAscii (135-166) readers registered. IsFixStrAscii / StringAscii fast paths in PopulatePropertyWithMarker, ReadValue, SkipValue.
- SGen (AcBinarySourceGenerator.EmitReadString): regenerated readers branch on IsFixStr / IsFixStrAscii / case StringAscii per property.
Wire format version not bumped — the new markers occupy previously-unused codepoints (135-167); old wire (without ASCII markers) is forward-compatible (readers handle both String and StringAscii). v1 stays.
Acceptance (AOT bench 13-40-29, MemPack-relative ratios — JIT noise eliminated):
- ✅ AcBinary Ser AND Deser faster than MemPack on every cell (5/5)
- Small: Ser -8%, Deser -23%
- Medium: Ser -17%, Deser -30%
- Large: Ser -28%, Deser -32%
- Repeated: Ser -4%, Deser -9%
- Deep: Ser -24%, Deser -22%
- ✅ Wire size advantage: 2043-50419 byte (vs MemPack 3070-64986) = -22% to -33% across cells
- ✅ Round-trip tests: 167 pass (13 pre-existing failures are IId-tracking, unrelated to M3R7)
JIT vs AOT note: earlier JIT-mode benchmarks (12-50-43 → 13-27-20 series) showed elevated ratios on Small/Repeated cells (1.0-1.2 range) that disappeared under AOT publish. The JIT-mode numbers reflect tier-up artifacts (inconsistent inlining of SGen-generated reader hot paths during the 1000-iteration measurement window), not a structural M3R7 property. AOT (NativeAOT / ILC) compiles deterministically with fixed inline decisions — the steady-state numbers above reflect the actual production performance.
ACCORE-BIN-T-E2F9: Custom UTF-8 encoder (writer-side, symmetric with custom decoder)
Priority: P1 · Type: Performance · Related: decoder optimization (AcBinaryDeserializer.BinaryDeserializationContext.Read.cs::DecodeUtf8SinglePass)
Status: Closed (2026-05-04)
Ordering note: do this BEFORE MARKER-DISPATCH (see ACCORE-BIN-T-M3R7). Rationale: the custom encoder/decoder optimization is the "harder, less certain" win; it is what brings the non-ASCII / mixed-content workloads (Repeated Strings, Hungarian) up to par. Marker-dispatch afterwards is only additive cleanup of the dispatch overhead on the pure-ASCII path.
Replace Encoding.UTF8.GetBytes calls in WriteStringUtf8 / WriteStringUtf8Internal / WriteFixStrDirect (collectively the writer's UTF-8 encode path, post-D-2) with a hand-rolled SIMD encoder. Symmetric to the decoder optimization (V4N2 / Read.cs::DecodeUtf8SinglePass).
Layered structure (mirrors decoder)
- Phase 1 (Vector256 ASCII narrow): 16 chars (Vector256) → 16 bytes (Vector128) via Vector256.Narrow. ASCII detect via (v & 0xFF80) == Vector256<ushort>.Zero (any bit above the low 7 on a UTF-16 char means non-ASCII; ExtractMostSignificantBits on the masked vector would only see bit 15 of each char). Break on first non-ASCII char.
- Phase 2 (DWORD ASCII batch): 4 chars at a time, OR-mask test, 4 bytes per iter when ASCII.
- Phase 3 (scalar multi-byte encode): 1-byte (ASCII) / 2-byte (Latin extended) / 3-byte (BMP) / 4-byte (surrogate pair → supplementary plane) UTF-8 encoding via direct bit-extract. No fallback dispatch: input is trusted UTF-16 (string).
- Use System.Text.Unicode.Utf8.FromUtf16 as fallback target for scalar correctness, or skip BCL entirely with manual bit-pack.
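A reduced sketch of the layered idea, covering only a 4-char ASCII batch (in the spirit of Phase 2) and the scalar bit-extract (Phase 3); the Vector256 tier is omitted. Utf8Sketch is a hypothetical name, not the EncodeUtf8SinglePass implementation, and like the design above it assumes well-formed UTF-16 input.

```csharp
using System;
using System.Text;

// Compare the sketch with the BCL on mixed content (ASCII, Hungarian, euro sign, emoji).
string sample = "Raklap \u00E1rv\u00EDzt\u0171r\u0151 \u20AC \uD83D\uDE00 end";
byte[] dest = new byte[sample.Length * 3];
int written = Utf8Sketch.Encode(sample, dest);
Console.WriteLine(dest.AsSpan(0, written).SequenceEqual(Encoding.UTF8.GetBytes(sample))); // True

static class Utf8Sketch
{
    // dest must hold at least 3 bytes per input char (a surrogate pair is
    // 2 chars -> 4 bytes, so 3 per char is a safe upper bound).
    public static int Encode(ReadOnlySpan<char> src, Span<byte> dest)
    {
        int i = 0, o = 0;
        while (i < src.Length)
        {
            // ASCII batch: OR four chars together and test the non-ASCII bits once.
            if (i + 4 <= src.Length &&
                ((src[i] | src[i + 1] | src[i + 2] | src[i + 3]) & 0xFF80) == 0)
            {
                dest[o] = (byte)src[i];
                dest[o + 1] = (byte)src[i + 1];
                dest[o + 2] = (byte)src[i + 2];
                dest[o + 3] = (byte)src[i + 3];
                i += 4; o += 4;
                continue;
            }

            uint c = src[i++];
            if (c < 0x80)
            {
                dest[o++] = (byte)c;                               // 1 byte
            }
            else if (c < 0x800)
            {
                dest[o++] = (byte)(0xC0 | (c >> 6));               // 2 bytes
                dest[o++] = (byte)(0x80 | (c & 0x3F));
            }
            else if (c >= 0xD800 && c <= 0xDBFF && i < src.Length)
            {
                // High surrogate: combine with the following low surrogate.
                uint low = src[i++];
                uint cp = 0x10000 + ((c - 0xD800) << 10) + (low - 0xDC00);
                dest[o++] = (byte)(0xF0 | (cp >> 18));             // 4 bytes
                dest[o++] = (byte)(0x80 | ((cp >> 12) & 0x3F));
                dest[o++] = (byte)(0x80 | ((cp >> 6) & 0x3F));
                dest[o++] = (byte)(0x80 | (cp & 0x3F));
            }
            else
            {
                dest[o++] = (byte)(0xE0 | (c >> 12));              // 3 bytes
                dest[o++] = (byte)(0x80 | ((c >> 6) & 0x3F));
                dest[o++] = (byte)(0x80 | (c & 0x3F));
            }
        }
        return o;
    }
}
```

Unpaired surrogates are not handled here (the BCL would substitute U+FFFD); that is the "trusted input" trade-off described in Phase 3.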
Why
Encoding.UTF8.GetBytes carries virtual-dispatch + encoder-fallback overhead even with SIMD ASCII fast path internally. Custom encoder skips this. ~15-30% Ser improvement on ASCII content, ~5-10% on non-ASCII (multi-byte path stays scalar).
Trigger
- NEXT — implementation order P1 before marker-dispatch (M3R7)
- Re-evaluate if .NET 11 BCL UTF-8 GetBytes becomes faster (PR #120628 follow-up)
Acceptance
- Writer-side benchmark: ≥15% Ser speedup on ASCII content (Small/Medium/Large/Deep), ≥5% on non-ASCII (Repeated)
- Wire format unchanged (custom encoder produces same bytes as Encoding.UTF8)
- Round-trip tests pass
Resolution
Implemented as EncodeUtf8SinglePass in AcBinarySerializer.BinarySerializationContext.cs — three-phase layered encoder (Vector256 ASCII narrow + DWORD ASCII batch + scalar 1/2/3-byte BMP & 4-byte surrogate-pair). Bypasses Encoding.UTF8.GetBytes virtual-dispatch + encoder-fallback overhead. Trusted-input path — no validation pass on writer side (the input is a .NET string with valid UTF-16 surrogate pairs by construction).
Used by WriteStringUtf8 (D-2 single-pass with VarUInt backfill) and WriteStringWithDispatch (M3R7 marker-dispatch path). Wire format unchanged — the encoder produces the same bytes as Encoding.UTF8.GetBytes.
Acceptance (per bench 12-50-43 → 13-27-20, MemPack-relative ratios on AcBinary Compact FastMode SGen):
- ✅ ASCII Ser ≥ MemPack on 4/5 cells (Small 0.94, Medium 0.80, Large 0.79, Deep 0.81)
- ⚠️ Repeated Ser ~1.04 (Hungarian, multi-byte path scalar): see follow-up ACCORE-BIN-T-H7K3
- ✅ Round-trip tests pass (167 of 180; 13 pre-existing failures unrelated to encoder)
ACCORE-BIN-T-W7N5: Default-value omission policy — doc + optional opt-out
Priority: P2 · Type: Refactor + Documentation · Related: BINARY_ISSUES.md#accore-bin-i-d9y2 (canonical issue)
The serializer's PropertySkip (102) optimization saves 1 byte per default-valued property by omitting the full value from the wire — relying on the consumer-side type definition to have the same default(T). This is a latent correctness risk documented in ACCORE-BIN-I-D9Y2. This entry tracks the mitigation plan; full failure-mode analysis lives in the issue.
Decision tree (TBD when implementing)
- Doc-only: position as a deliberate protobuf-style feature; consumer keeps type defaults stable across versions. Lowest cost, maximum benchmark wire-size advantage retained.
- Option flag: AcBinarySerializerOptions.OmitDefaults boolean. Default true (preserves current behavior + benchmark numbers). false writes every property in full, as an opt-out for fragile-class-evolution scenarios.
- Both: ship doc + flag. Default behavior unchanged; consumers who hit silent-corruption have an explicit opt-out.
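A toy model of how the omission could go wrong and how the flag would sidestep it. Everything here is an assumption made for illustration: the wire is modeled as a nullable int (null meaning "property skipped"), the reader is assumed to leave skipped properties untouched, and ToyWire / ConsumerDto are hypothetical names. The actual failure modes are analyzed in ACCORE-BIN-I-D9Y2.

```csharp
using System;

// Producer sends Retries = 0, which equals default(int) and is therefore omitted.
int? omitted = ToyWire.Write(value: 0, omitDefaults: true);
var a = new ConsumerDto();
ToyWire.Read(omitted, a);
Console.WriteLine(a.Retries); // 3: the consumer-side initializer wins, not the sent 0

// With omission disabled the value travels in full and survives.
int? full = ToyWire.Write(value: 0, omitDefaults: false);
var b = new ConsumerDto();
ToyWire.Read(full, b);
Console.WriteLine(b.Retries); // 0

// Consumer-side type whose initializer differs from default(int).
class ConsumerDto
{
    public int Retries { get; set; } = 3;
}

static class ToyWire
{
    // null models "PropertySkip": the value itself is not on the wire.
    public static int? Write(int value, bool omitDefaults)
        => omitDefaults && value == default(int) ? (int?)null : value;

    // Assumed reader behavior: a skipped property is simply not assigned.
    public static void Read(int? wire, ConsumerDto target)
    {
        if (wire.HasValue) target.Retries = wire.Value;
    }
}
```

The second call is what OmitDefaults = false would buy in this model: one extra value on the wire in exchange for not depending on matching defaults on both sides.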
Acceptance (when implementing)
- BINARY_FEATURES.md adds a "Default-Value Omission" section documenting the semantic and the tradeoff (with cross-ref to ACCORE-BIN-I-D9Y2)
- If flag added: round-trip tests covering both true and false; benchmark comparison table showing wire-size delta on ASCII / Hungarian / DTO-heavy workloads
- Decision rationale recorded in LLM_PROTOCOL_DECISIONS.md (or a ### Resolution block on the issue) once implemented
ACCORE-BIN-T-H7K3: Hungarian / multi-byte content Ser optimization (Repeated Strings cell)
Priority: P3 · Type: Performance · Related: EncodeUtf8SinglePass Phase 3 (scalar multi-byte encode), ACCORE-BIN-T-E2F9 resolution
Status: Closed (2026-05-04) — Won't Fix (JIT-only artifact)
The Repeated Strings benchmark (Hungarian content: "TermékNév_…", "RaklapKód_…") still shows AcBinary Ser ratio ~1.04 vs MemPack across multiple runs (12-50-43 / 13-21-27 / 13-27-20 series). All other ASCII-heavy cells (Small/Medium/Large/Deep) sit in the 0.79-0.94 ratio range — Repeated is the outlier.
The Phase 3 scalar multi-byte branch in EncodeUtf8SinglePass (1-byte ASCII / 2-byte Latin-extended / 3-byte BMP / 4-byte surrogate-pair) processes Hungarian diacritics (á, é, í, ő, ű, etc.) as 2-byte UTF-8 sequences via scalar bit-extract. MemPack's UTF-8 encoder appears to use a SIMD-accelerated mixed-content lane that processes 2-byte sequences in parallel.
Resolution
AOT bench 13-40-29: Repeated Ser ratio = 0.96 (AcBinary 14.50 µs vs MemPack 15.05 µs, AcBinary faster by 4%). Deser ratio 0.91 (also faster).
The 1.04+ ratio observed in JIT-mode benchmarks (12-50-43, 13-21-27, 13-27-20) was a JIT tier-up artifact — the SGen-generated writer's hot path (which calls EncodeUtf8SinglePass) didn't reliably tier up to fully-optimized code within the 1000-iteration measurement window, while MemPack's writer apparently warmed up faster. Under NativeAOT publish (-p:_IsPublishing=true) the issue disappears completely — both writers are deterministically optimized at compile time.
No structural problem in the Phase 3 scalar branch. The investigation directions (Vector256 mixed-content lane, BCL Utf8.FromUtf16 comparison) remain valid academic improvements but show no meaningful production-time win — closing as Won't Fix.
ACCORE-BIN-T-S2X9: Markerless schema lane — drop per-property type markers for fixed-shape primitives (SGen)
Priority: P2 · Type: Wire-format extension · Related: ACCORE-BIN-T-S5L8, ACCORE-BIN-T-W7N5
AcBinary is marker-driven: every value on the wire carries a 1-byte type code, so the reader can dispatch generically (handles polymorphism, null, intern markers, type-name lookup, etc.). MemPack is schema-driven: the SGen reader knows at compile time that "field 3 is int, field 4 is string" and reads values directly with no type code, no run-time dispatch.
For fixed-shape primitive properties (int, bool, double, Guid, DateTime, …) on [AcBinarySerializable] types, the per-property type marker is pure overhead — the SGen-generated reader already has compile-time knowledge of the property type, so the marker only confirms what is already known. Dropping it on this narrow class of properties is a clean wire+CPU win without losing any of the polymorphism / null / intern flexibility that the marker provides for variable-shape values.
Why P2 — WireMode = Fast wire-size parity (NuGet release narrative)
The WireMode = Fast lane currently produces +1.7% to +8.1% larger wire than MemPack across all benchmark cells (AOT bench 13-40-29: Small +52 byte, Medium +474, Large +3617, Repeated +1221, Deep +581). The gap is structural: UTF-16 raw-memcpy strings are 2 bytes/char fixed, while MemPack's UTF-8 is 1 byte/char on ASCII content. Touching the string-write path to fix this would either:
- Lose the raw-memcpy guarantee (post-encode ASCII-detect + branchy dispatch — kills the FastWire CPU advantage), or
- Add sentinel-encoding micro-savings (~3-5% wire) which don't close the structural gap.
Markerless schema lane is the only path to wire-size parity that preserves the FastWire raw-memcpy hot path. Per-primitive-property savings (1 byte for non-tiny int, Guid, DateTime, decimal, double, …) compound on DTO-heavy payloads. Estimated effect on benchmark cells:
| Cell | Current FastWire | MemPack | Estimated post-S2X9 FastWire | vs MemPack |
|---|---|---|---|---|
| Small (~70 primitive prop) | 3122 | 3070 | ~3050 | -0.7% ✅ |
| Medium (~600 primitive prop) | 10905 | 10431 | ~10300 | -1.3% ✅ |
| Large (~6000 primitive prop) | 68603 | 64986 | ~63500 | -2.3% ✅ |
| Deep (~700 primitive prop) | 15514 | 14933 | ~14800 | -0.9% ✅ |
The Repeated cell is harder to predict (string-dominated payload, fewer primitives) — likely smaller win, may not fully close the +8.1% gap. Acceptable: the Repeated cell is a string-interning stress test, not a typical DTO workload.
NuGet release narrative: "FastMode beats MemoryPack on both wire size AND throughput across all benchmark cells" — currently we have to qualify this with "throughput-only on Compact + i18n workloads"; S2X9 removes the qualifier. This is high-leverage for the public bench shootout.
Wire savings per property type
| Type | Current encoding | Markerless lane | Wire saved |
|---|---|---|---|
| int (TinyInt range −16..47) | TinyInt (1 byte) | VarInt (1 byte) | 0 |
| int (out-of-tiny) | [Int32] [VarInt] (2-6 bytes) | VarInt (1-5 bytes) | 1 byte |
| bool | [True] or [False] (1 byte) | 1 byte (0/1) | 0 |
| Guid | [Guid] [16 bytes] (17 bytes) | 16 bytes | 1 byte |
| DateTime | [DateTime] [9 bytes] (10 bytes) | 9 bytes | 1 byte |
| DateTimeOffset | [DateTimeOffset] [10 bytes] (11 bytes) | 10 bytes | 1 byte |
| TimeSpan | [TimeSpan] [VarLong] (2-9 bytes) | VarLong (1-9 bytes) | 1 byte |
| decimal | [Decimal] [16 bytes] (17 bytes) | 16 bytes | 1 byte |
| double | [Float64] [8 bytes] (9 bytes) | 8 bytes | 1 byte |
DTO-heavy payloads with many Guid / DateTime properties benefit the most — easily -10..-20% wire size on top of the existing -22..-33% advantage.
CPU savings
Reader-side: SGen-generated code drops the per-property ReadByte() + IsTinyInt / IsFixStr / switch-case dispatch for primitive properties — direct context.ReadInt32Unsafe() / ReadGuidUnsafe() / etc. calls. Writer-side: drops the WriteByte(typeCode) per primitive. Effect amplifies on payloads with many primitive properties (Small/Medium benchmark cells) — independent of any JIT-vs-AOT measurement variance.
Sketch — opt-in markerless lane, SGen-only
- New wire format flag (header HeaderFlag_MarkerlessSchema = 0x10 or similar) → activates a property-positional lane.
- SGen-generated writer for [AcBinarySerializable] types: per primitive property, emits raw value (no marker). For variable-shape properties (string, complex, nullable, polymorphic) the existing marker-driven path stays.
- SGen-generated reader: per primitive property, calls context.ReadInt32Unsafe() / ReadGuidUnsafe() / etc. directly. Variable-shape properties keep the marker-read + dispatch.
- Heuristic: a property is markerless-eligible if IsValueType && !IsNullable && type is in {int, bool, byte, short, long, float, double, DateTime, DateTimeOffset, Guid, TimeSpan, decimal}. Anything else (string, list, nested object, nullable) keeps the marker.
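The eligibility heuristic from the last bullet can be written down directly. The sketch below uses System.Type reflection purely for illustration; a source generator would make the same decision on compile-time type symbols. MarkerlessLane is a hypothetical name.

```csharp
using System;
using System.Collections.Generic;

Console.WriteLine(MarkerlessLane.IsEligible(typeof(Guid)));     // True
Console.WriteLine(MarkerlessLane.IsEligible(typeof(int?)));     // False
Console.WriteLine(MarkerlessLane.IsEligible(typeof(string)));   // False
Console.WriteLine(MarkerlessLane.IsEligible(typeof(DayOfWeek))); // False

static class MarkerlessLane
{
    // The fixed-shape set listed in the heuristic above.
    static readonly HashSet<Type> FixedShape = new()
    {
        typeof(int), typeof(bool), typeof(byte), typeof(short), typeof(long),
        typeof(float), typeof(double), typeof(DateTime), typeof(DateTimeOffset),
        typeof(Guid), typeof(TimeSpan), typeof(decimal),
    };

    // Value type, not Nullable<T>, and a member of the fixed-shape set.
    // Anything else keeps its per-value marker.
    public static bool IsEligible(Type t)
        => t.IsValueType
           && Nullable.GetUnderlyingType(t) is null
           && FixedShape.Contains(t);
}
```

Enums and user-defined structs are value types but fall outside the set, so they stay on the marker-driven path in this sketch.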
Decision points
- Backward compatibility: header flag + version negotiation. Old readers see the flag set and either reject (clean fail) or fall back to marker-driven (if they support both lanes). Default false preserves current wire format.
- Schema evolution fragility: the markerless lane is positional, so adding/removing/reordering primitive properties breaks readers compiled against an older schema. Document this clearly: opt-in is for stable schemas only (DTO-frozen API contracts, internal SignalR messages with synchronized client/server SGen). For evolving schemas, marker-driven default stays.
- Coordination with ACCORE-BIN-T-S5L8 (sentinel-length strings): the two could share the "no-marker per-call" infrastructure; the markerless string lane uses sentinel-length VarUInt (null/empty/short distinguished by length value).
Acceptance
- Primary: WireMode = Fast AcBinary wire size ≤ MemPack across Small/Medium/Large/Deep AOT benchmark cells (AOT release-publish bench is the canonical measurement)
- Wire size: at least 10% smaller on DTO-heavy payloads (Guid/DateTime-rich) vs current marker-driven format
- Round-trip on the markerless lane validated on representative DTO shapes (mixed primitive + string + nested object)
- Schema-evolution fragility documented in BINARY_FEATURES.md (alongside the existing PropertySkip / default-omission caveat from ACCORE-BIN-I-D9Y2)
- Opt-in flag with default false (preserves marker-driven default; consumers explicitly opt in for frozen-schema scenarios)
ACCORE-BIN-T-V4N3: Symmetric GetUtf8ByteCount API + writer-side BCL bypass (cold path)
Priority: P3 · Type: Performance · Status: Superseded (2026-05-08, by ACCORE-BIN-T-K7M3) — landed Closed 2026-05-06; subsequent A/B against modern Utf8.FromUtf16 / Utf8.ToUtf16 showed the BCL modern API outperforms the custom transcoder on every benchmark cell, leading to full hot-path switch in K7M3 · Related: EncodeUtf8SinglePass, WriteStringUtf8Internal, PropertyMetadataBase.NameUtf8, ACCORE-BIN-T-K7M3 (hot-path BCL switch)
Symmetric byte-count helper for EncodeUtf8SinglePass, paired with writer-side BCL Encoding.UTF8.GetBytes / GetByteCount removal across all cold-path call sites. Utf8Transcoder.GetUtf8ByteCount(ReadOnlySpan<char>) SIMD impl (Vector512 / Vector256 / Vector128 / scalar tier hierarchy, 5-popcount closed-form aggregation handling chunk-split surrogate pairs correctly).
Implementation summary:
- Utf8Transcoder.GetUtf8ByteCount SIMD impl with closed-form bytes = 3*N - ascii - c_lt_0x800 + highSur - 3*lowSur aggregation
- Utf8TranscoderTests extended (29 new tests covering ASCII / Hungarian / CJK / emoji / boundary 0-64, plus surrogate-pair-split-across-SIMD-chunks regression coverage)
- WriteStringUtf8Internal (BinarySerializationContext.cs:875) refactored from BCL two-pass to single-pass D-2 layout (worst-case length*4 allocate + EncodeUtf8SinglePass + VarUInt backfill); the 4× worst-case capacity is amortized by the buffer growth doubling strategy (Math.Max(buffer.Length*2, position+needed) + ArrayPool bucket-rounding to next power-of-2)
- Cold path cleanup: AcBinarySerializer.AnalyzeStringInternCandidates (analysis log) and PropertyMetadataBase.NameUtf8 ctor-once init both migrated to Utf8Transcoder
Resolution
Landed 2026-05-06. All Utf8TranscoderTests pass (55/55). Binary test suite unchanged (222 pass / 13 pre-existing GuidIId failures, untouched).
Critical observation surfaced during the audit: WriteStringUtf8Internal has only one caller (WriteFixStrDirect), and WriteFixStrDirect itself is uncalled anywhere in the codebase — no core call site, no SourceGenerator template hit (verified against AcBinarySourceGenerator.cs line 706/724/1492/1514 — generator emits WriteStringGenerated and context.WriteStringUtf8 (the public 659-line method, not WriteStringUtf8Internal)), no test, no reflection path. The V4N3 implementation therefore landed cleanly but its hot-path benchmark impact is limited to the two cold-path init sites. Dead-code disposition tracked as ACCORE-BIN-T-V4N5.
Algorithmic correctness lesson — the initial 4-popcount formula (3*N - c_lt_0x80 - c_lt_0x800 - 2*highSur) was wrong on chunks where a surrogate pair straddles the SIMD chunk boundary (it implicitly assumed lowSur == highSur per chunk, which is true over the whole well-formed string but NOT per chunk). Fix: 5-popcount closed-form (3*N - ascii - c_lt_0x800 + highSur - 3*lowSur), with the scalar tail using the same per-char accounting model (i += 1 per char regardless of role; high → 4, low → 0, BMP → 3, two-byte → 2, ASCII → 1). Caught by GetUtf8ByteCount_MultipleEmojiBoundary_MatchesBcl and GetUtf8ByteCount_BoundaryAsciiToEmoji_MatchesBcl regression tests — exactly the prefixLen 1, 7 boundaries that exercise chunk-split surrogate pairs.
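The 5-count closed form can be checked with a scalar stand-in for the SIMD popcounts. ByteCountSketch is a hypothetical name; the sketch counts the five quantities with a plain loop and, like the original, assumes well-formed UTF-16 when comparing against the BCL.

```csharp
using System;
using System.Text;

// 'a' (1 byte) + 'é' (2) + '€' (3) + U+1F600 as a surrogate pair (4) = 10 bytes.
string s = "a\u00E9\u20AC\uD83D\uDE00";
Console.WriteLine(ByteCountSketch.Count(s));      // 10
Console.WriteLine(Encoding.UTF8.GetByteCount(s)); // 10

// Cut between the high and the low surrogate, as a SIMD chunk boundary might:
// the high surrogate contributes 4, the low surrogate contributes 0.
int split = ByteCountSketch.Count(s.AsSpan(0, 4)) + ByteCountSketch.Count(s.AsSpan(4));
Console.WriteLine(split);                         // 10

static class ByteCountSketch
{
    // bytes = 3*N - ascii - c_lt_0x800 + highSur - 3*lowSur
    //   ASCII char:      3 - 1 - 1 = 1
    //   two-byte char:   3 - 1     = 2
    //   other BMP char:  3
    //   high surrogate:  3 + 1     = 4
    //   low surrogate:   3 - 3     = 0
    public static int Count(ReadOnlySpan<char> chunk)
    {
        int ascii = 0, lt800 = 0, highSur = 0, lowSur = 0;
        foreach (char c in chunk)
        {
            if (c < 0x80) ascii++;
            if (c < 0x800) lt800++;
            if (c >= 0xD800 && c <= 0xDBFF) highSur++;
            if (c >= 0xDC00 && c <= 0xDFFF) lowSur++;
        }
        return 3 * chunk.Length - ascii - lt800 + highSur - 3 * lowSur;
    }
}
```

Because every char is accounted for on its own, per-chunk results can simply be added; no state has to be carried across a chunk boundary.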
Superseded by ACCORE-BIN-T-K7M3 (2026-05-08)
The V4N3 audit measured the custom transcoder against the legacy Encoding.UTF8.GetBytes API and won. Did NOT measure against the modern System.Text.Unicode.Utf8.FromUtf16 / Utf8.ToUtf16 static API (.NET 7+, used by MemoryPack source-gen). Once D9X3 stabilized the bench, a direct A/B revealed the BCL modern API outperforms the custom transcoder on every cell (Ser deficit -14 to -22pp, Deser flips from behind to ahead). All 8 hot-path call sites switched to BCL in K7M3. The Utf8Transcoder.cs file is fully commented out — preserved as historical reference.
The V4N3 algorithmic correctness work (5-popcount surrogate-pair-split-across-chunks closed-form) remains a valid algorithmic contribution, but no longer load-bearing on the hot path.
ACCORE-BIN-T-V4N4: NativeAOT-specific inlining / codegen audit on hot UTF-8 path
Priority: P2 · Type: Performance · Status: Reverted (2026-05-07) — bench instability made the optimization signal unmeasurable · Related: EncodeUtf8SinglePass, DecodeUtf8SinglePass, WriteStringWithDispatch, Utf8Transcoder SIMD path
Hypothesis: NativeAOT (the benchmark target environment) does not match Tier 1 JIT optimization quality on the UTF-8 hot path, despite [MethodImpl(AggressiveInlining)] hints. Symptoms in 2026-05-05 / 2026-05-06 benchmarks:
- Persistent 8-11% Compact deficit vs MemPack on the Repeated cell (Hungarian content + repeated string pattern)
- Patchy run-to-run results on the Compact Ser/Deser cells (4-7/10 cell wins, 3-6 noise/loss bands)
- Per-method Compact speedups are consistent on the Medium/Large/Deep cells (-22% to -28% vs MemPack), which points to a JIT/AOT inlining difference on Repeated: there the WriteStringWithDispatch short lane is called many times on the 10× repeated strings
Suspect mechanisms (ranked by likelihood):
1. AOT inline budget. NativeAOT is more conservative than the Tier 1 JIT in respecting AggressiveInlining for large method bodies. EncodeUtf8SinglePass (~190 lines, 4 SIMD path + scalar), DecodeUtf8SinglePass (~120 lines), GetUtf8ByteCount (~120 lines) may exceed the AOT inline budget at hot call sites (WriteStringWithDispatch short-lane, ReadString decode callback). If the AOT compiler emits call <method> instead of inlining, every iteration of the Repeated 10-string loop pays the call overhead.
2. [Intrinsic] IsSupported constant folding. Avx512BW.IsSupported, Vector512.IsHardwareAccelerated, Vector256.IsHardwareAccelerated, Vector128.IsHardwareAccelerated should constant-fold per host on AOT. Verify via disasm: if any remain runtime checks, every iteration pays the branch cost (3 nested if-s in each Utf8Transcoder method).
3. Vector256.LessThan<ushort> unsigned compare emulation. No native pcmpltw_unsigned on AVX2; JIT/AOT lowers to pminuw + pcmpeqw. Cost amortized over many chars in long content but can dominate on short Hungarian runs (KözösCímke ~6 runs of 2-3 chars). Less likely if (1) holds, since the inlining hit dwarfs the per-instruction emulation cost.
4. Method size cascade. The Utf8Transcoder method bodies grew with the V4N3 GetUtf8ByteCount addition. Adjacent methods in the same source file may have lost inlining at SGen-generated callers due to AOT compilation-unit heuristics (file-locality affects inline cost models on some AOT codegen).
Investigation steps (no code changes — diagnostic phase first):
- NativeAOT publish dump: dotnet publish AyCode.Core.Serializers.Console -c Release -r win-x64 -p:PublishAot=true, followed by dumpbin /disasm <output.exe> > disasm.txt
- Locate EncodeUtf8SinglePass, DecodeUtf8SinglePass, GetUtf8ByteCount, CountUtf8Chars symbols in the disasm
- Verify constant folding on IsSupported checks: no run-time CMP/JMP at the path-selector branches; the dead branches eliminated
- Verify inlining at WriteStringWithDispatch / ReadString callers: if call <Utf8Transcoder.*> instructions remain, inlining failed
- Method size inspection: large method bodies hint at inline-eligibility issues; large prologue/epilogue at hot call sites is a tell
- Cross-compare with Tier 1 JIT disasm (run with DOTNET_TieredCompilation=0 + DOTNET_TC_QuickJit=0 to force Tier 1, dump the JIT-tier disasm via WinDbg or BenchmarkDotNet's [DisassemblyDiagnoser]) to confirm the gap is AOT-specific rather than algorithmic
Possible fixes (Open until disasm confirms which apply):
- A. Method split: EncodeUtf8SinglePass → small dispatcher + per-tier inner methods (each Vector512 / Vector256 / Vector128 / scalar in its own AOT-inline-friendly small method). Same for DecodeUtf8SinglePass. The dispatcher stays small enough to inline at the hot call site; the dead-branch tier methods are never called on a given host.
- B. [MethodImpl(NoInlining)] on cold tiers: a paradoxical tactic that can REDUCE the hot-path code emitted at the call site by preventing the AOT from speculatively considering the dead branches as inlining candidates.
- C. Per-target ISA build: if the benchmark environment has a fixed ISA (e.g. AVX2 baseline), use <IlcInstructionSet> in csproj to constant-fold the IsSupported checks at AOT compile time. Alternative: separate per-ISA AOT publish artifacts.
- D. Manual hot-path inlining: for the Repeated cell, hand-inline EncodeUtf8SinglePass short-string lane into WriteStringWithDispatch FixStr path (≤31 byte case). Trades code-size for hot-path speed.
- E. Algorithm change: if the AOT can't inline the SIMD bodies efficiently, a smaller scalar-only fast path for short strings (≤31 byte) bypassing the SIMD setup might be faster on AOT than on JIT (where Tier 1 is fine with the SIMD path inlined).
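The shape of options A and B, reduced to a toy writer: a small dispatcher carrying an inlining hint, a short lane, and a long lane pinned out of line. SplitWriter and its length-prefixed layout are hypothetical, and Encoding.UTF8 stands in for the custom encoder. The attributes are hints to the compiler, so whether this layout helps is something only disassembly and benchmarks can settle, as the Resolution below records.

```csharp
using System;
using System.Buffers.Binary;
using System.Runtime.CompilerServices;
using System.Text;

var dest = new byte[256];
Console.WriteLine(SplitWriter.WriteString("short", dest));              // 6
Console.WriteLine(SplitWriter.WriteString(new string('x', 100), dest)); // 104

static class SplitWriter
{
    // Dispatcher: one compare plus a call, small enough to be an inlining candidate.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static int WriteString(string value, Span<byte> dest)
        => value.Length <= 31 ? WriteShort(value, dest) : WriteLong(value, dest);

    // Short lane: 1-byte length header + UTF-8 payload (hypothetical layout).
    // At most 31 chars means at most 93 bytes, so the length fits in one byte.
    static int WriteShort(string value, Span<byte> dest)
    {
        int bytes = Encoding.UTF8.GetBytes(value.AsSpan(), dest.Slice(1));
        dest[0] = (byte)bytes;
        return 1 + bytes;
    }

    // Long lane: kept out of line so its body is never duplicated at call sites.
    [MethodImpl(MethodImplOptions.NoInlining)]
    static int WriteLong(string value, Span<byte> dest)
    {
        int bytes = Encoding.UTF8.GetBytes(value.AsSpan(), dest.Slice(4));
        BinaryPrimitives.WriteInt32LittleEndian(dest, bytes);
        return 4 + bytes;
    }
}
```

With many generated call sites, every inlined copy of the dispatcher (and anything inlined into it) is multiplied by the number of sites, which is the code-size side of the trade-off.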
Why P2
- Repeated benchmark cell is the canonical witness for the i18n production deploy narrative — public NuGet release narrative depends on parity-or-better against MemPack across all cells (cloud / desktop / mobile / Blazor WASM)
- AOT-specific tuning is high-leverage on the hot path — JIT-only optimizations will not match
- Disasm validation is the prerequisite for any of the fix directions; without it, any change is speculative and risks reintroducing 2c-style regression
Acceptance
- Disasm report confirms (or refutes) inlining + constant-fold hypotheses on the hot UTF-8 path
- If hypotheses confirmed: the chosen fix delivers Repeated Compact Ser+Deser ratio ≤ 1.0 vs MemPack on the AOT benchmark target
- No regression on Small / Medium / Large / Deep cells (or net positive)
- Fix maintains cross-tier SIMD correctness (round-trip tests pass on all UTF-8 content classes); both Utf8TranscoderTests and the binary test suite stay green
Trigger
- Pre-NuGet release: i18n claim cannot ship with an 8-11% gap on a representative cell
- Disasm + bench correlation step before any code change (no speculative refactoring)
Resolution
Audit + targeted fix landed 2026-05-07.
Step 1: disasm analysis (disasm.txt, ~90 MB AOT-publish output):
- ✅ Avx512BW.IsSupported / Vector{N}.IsHardwareAccelerated constant-folded: only 4 runtime checks in the entire binary (1 body + 3 call sites, outside the Utf8Transcoder hot path). The AOT compiler eliminated the dead branches according to the target ISA.
- ✅ Reader tier-marker dispatch (ReadStringSmall/Medium/Big) was inlined into the TypeReaderTable lambda-class static init: 0 method-call overhead on the tiers.
- ⚠️ WriteStringWithDispatch was NOT inlined: 3 generic specializations (<ArrayBinaryOutput>, <AsyncPipeWriterOutput>, <BufferWriterBinaryOutput>), each with its own method body, plus 14+ call <method> instructions in the <ArrayBinaryOutput> body (similar volume in the other 2 specializations). Method size ~190 lines, which exceeds the AOT inline budget.
- ⚠️ ReadStringUtf8WithCharLen was NOT inlined: own body, many call sites.
- ❓ → ✅ string.Create callback __DelegateCtor: per the disasm, a test static; jne skip ctor pattern, i.e. a cached static lambda with lazy init. 0 hot-path overhead (no per-call allocation).
Step 2: method-split experiment (15:13:39 bench):
- Writer split: dispatcher ([AggressiveInlining]) + WriteStringSmallFast ([AggressiveInlining]) + WriteStringDispatchLong ([NoInlining]) + WriteStringFastWire ([NoInlining])
- Reader split: dispatcher ([AggressiveInlining]) + ReadStringUtf8WithCharLenCore ([NoInlining])
- Bench: regression. Small Ser +29.6 pp, Repeated Ser +8.9 pp, Small Deser +16.6 pp.
- Per the disasm, the dispatcher + SmallFast were inlined (body symbol disappeared), causing code bloat: 3 generic specs × ~30-50 SGen call sites × ~45 lines of inlined code = i-cache pressure on the Repeated cell hot loop. The reader-side dispatcher was NOT inlined ([AggressiveInlining] hint ineffective), which only added +1 call instruction.
Step 3: refined fix (15:29:21 bench, Closed):
- WriteStringWithDispatch dispatcher: NO inline hint (left to the compiler; more stable under AOT)
- WriteStringSmallFast: [NoInlining] (code bloat gone; the call overhead remains, but it is a structurally dedicated method)
- WriteStringDispatchLong + WriteStringFastWire: [NoInlining] cold path (preserved)
- ReadStringUtf8WithCharLen + ReadStringUtf8WithCharLenCore merged back into one method (the split did not pay off; the +1 call is gone)

Bench (15:29:21) Compact vs MemPack ratios:
- Ser: Small 0.915 (-8.5%), Medium 0.989 (≈), Large 0.915 (-8.5%), Repeated 1.019 (≈), Deep 0.981 (-1.9%) → 5/5 cells at parity or better
- Deser: Small 1.101 (+10.1%), Medium 0.953 (-4.7%), Large 0.894 (-10.6%), Repeated 0.962 (-3.8%), Deep 0.899 (-10.1%) → 4/5 cell wins, only Small at +10%
- Wire: 5/5 cells -8% to -11% smaller than MemPack
Lessons learned:
- Under AOT, [AggressiveInlining] is not guaranteed: the writer dispatcher + SmallFast were inlined (code bloat), but the reader dispatcher was NOT (hint ineffective). Leaving the decision to the compiler (no hint) is more stable.
- Method-split does not always win: over-aggressive inlining can cause code bloat (i-cache pressure), especially with many SGen call sites.
- The __DelegateCtor is cached: the string.Create callback is not a source of hot-path overhead.
- Structure preserved: WriteStringDispatchLong and WriteStringFastWire remain separate cold methods (a basis for later targeted optimization).

Remaining gap: Small Deser +10%, a small absolute value (~1 µs) and not a release blocker. The ReadStringUtf8WithCharLen body is sizeable (single method ~15 lines + lambda state), at the edge of the AOT inline budget. It can be optimized further in the V4N2 or W2C8 sprint.
Reverted (2026-05-07)
The V4N4 method-split has been reverted, both the 15:13:39 (AggressiveInlining) regression version and the 15:29:21 (NoInlining-on-SmallFast) refined version. Subsequent benchmark runs (15:29:21 → 15:56:54 → ...) showed drastic run-to-run variance on the same code: the AOT codegen's file-locality / inline-cost model is sensitive to body-size changes in Utf8Transcoder.cs, and the noise floor masks the method-split's presumed +1-3% Ser gain.
The revert restores the single-method WriteStringWithDispatch state (matches the 09:39:09 baseline). Retained elements:
- The A-direction packed-header stores (Unsafe.WriteUnaligned<ushort/uint/ulong> on the Small/Medium/Big tiers): instruction-level optimization, unaffected by the AOT variance
- Overflow guard (O7G2, ThrowStringTooLong): defensive, separate feature
The V4N4 audit's conclusions remain valid (constant-fold OK, reader tier-readers inlined into the TypeReaderTable lambda-class static init, __DelegateCtor cached). The AOT inline-pressure analysis remains relevant documentation; only the method-split as a fix was not measurably positive.
Lesson: bench-driven optimization can only be validated when the noise floor is smaller than the expected signal. Under AOT the bench noise is significant (~5-15 pp run-to-run), which masks +1-3% perf claims. Profile-driven optimization (CPU profile + flame graph + code-cache miss measurement) would be the next step if the inlining pressure remains a material gap.
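The noise-floor rule can be sketched as a small check. This is a minimal illustration, not part of the bench harness: NoiseFloorCheck and its methods are hypothetical names, and the noise floor is approximated here as the min-to-max spread of repeated runs of the same code, relative to their median.

```csharp
using System;
using System.Linq;

// Hypothetical helper for illustration; not part of the AcBinary bench harness.
public static class NoiseFloorCheck
{
    // Min-to-max spread of repeated runs of the SAME code, as a percentage of the median.
    public static double NoiseFloorPercent(double[] runs)
    {
        if (runs.Length < 2) throw new ArgumentException("need at least two runs", nameof(runs));
        var sorted = runs.OrderBy(x => x).ToArray();
        int mid = sorted.Length / 2;
        double median = sorted.Length % 2 == 1
            ? sorted[mid]
            : (sorted[mid - 1] + sorted[mid]) / 2.0;
        return (sorted[^1] - sorted[0]) / median * 100.0;
    }

    // A claimed gain is only measurable when it exceeds the noise floor.
    public static bool IsMeasurable(double expectedGainPercent, double[] runs)
        => expectedGainPercent > NoiseFloorPercent(runs);
}
```

With five runs spreading across roughly 15% of the median, a +3% claim fails the check; the same claim passes once the spread drops below 1%.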
Re-evaluable as of 2026-05-07 per ACCORE-BIN-T-D9X3 — bench stabilization removes the noise-floor that made the original signal unmeasurable; retest before any code change.
Obsoleted (2026-05-08) by ACCORE-BIN-T-K7M3 — the writer hot path no longer calls the custom EncodeUtf8SinglePass at all (WriteStringWithDispatch was switched to Utf8.FromUtf16 BCL). The "AOT method-split / inlining audit" target (Utf8Transcoder body method-size in NativeAOT inline budget) is moot — the BCL Utf8.FromUtf16 is a single static method with its own AOT-friendly inline footprint, and the audit's hypothesis space (Vector256 IsSupported constant-fold, lambda delegate cache) was correct for the prior code but no longer applies. The V4N4 disasm methodology remains a valid technique for future investigations of generic specialization / inline failures, but the specific hot-path target it analyzed is gone.
ACCORE-BIN-T-J5L9: Remove dead WriteFixStrDirect / WriteStringUtf8Internal (audit-surfaced uncalled methods)
Priority: P3 · Type: Refactor / hygiene · Status: Closed (2026-05-06) · Related: BinarySerializationContext.cs
V4N3 audit surfaced two methods with no callers in the entire workspace:
- WriteFixStrDirect(string): public method, no call site (no core, no SourceGenerator template, no test, no reflection / Expression-compile)
- WriteStringUtf8Internal(string): private method called only from WriteFixStrDirect's non-ASCII fallback branch
The pair forms a closed dead loop (WriteFixStrDirect → WriteStringUtf8Internal), but no entry point reaches WriteFixStrDirect. The public-API WriteStringUtf8 (line 659) is the live equivalent and is called from the SourceGenerator template (polymorphism path: assembly-qualified type-name write). The hot-path string-write goes through WriteStringWithDispatch (line 734) which uses the M3R7 marker-dispatch — NOT through this dead pair.
Disposition options (decide pre-NuGet release)
- Delete both methods: pure dead-code cleanup; reduces public surface, removes maintenance burden, simplifies onboarding. Functionality is fully covered by WriteStringWithDispatch (M3R7 marker-dispatch, which emits FixStr / FixStrAscii directly with proper ASCII detection via bytesWritten == charLength after EncodeUtf8SinglePass).
- Activate WriteFixStrDirect for property-name writes: SGen could emit WriteFixStrDirect(propName) instead of WriteStringWithDispatch(propName) for known-short, often-ASCII property names, saving the marker-dispatch overhead. Requires SGen template change + benchmark validation that the saving is real (likely marginal: property names are typically <31 char ASCII, so M3R7 already takes the FixStrAscii fast path with one byte-write to _buffer). The pre-encoded NameUtf8 byte[] on PropertyMetadataBase already provides a faster path (WriteFixStrBytes at line 853) which the SGen / runtime writer could use directly.
- Defer: leave as-is, document as dead code, revisit when the codebase has another reason to touch this area.
Why P3
- No correctness or perf impact in either direction (dead code is dead — no consumer affected)
- Cleanup vs activation is a low-stakes choice; benchmark would decide if option 2 has real saving
- Surfaced during V4N3 work, not blocking the NuGet release
Acceptance
- Decision recorded (delete / activate / defer) with rationale
- If "delete": grep across workspace confirms zero callers post-removal; binary test suite unchanged (still 235 pass / 13 pre-existing failures)
- If "activate": SGen template change + benchmark validation showing ≥ 2% Ser improvement on a representative cell (otherwise revert to "delete")
- Documentation in BINARY_IMPLEMENTATION.md updated (or the old reference removed if both methods are deleted)
Trigger
- Pre-NuGet release housekeeping pass
- Or: any future refactor that touches BinarySerializationContext string-write methods (then decide rather than leave the dead pair behind)
Resolution
Disposition: Delete (Option 1). Landed 2026-05-06 together with the H2Q6 marker reorg commit. Five dead methods removed in a single cleanup pass:
- WriteFixStrDirect(string): uncalled public method
- WriteStringUtf8Internal(string): uncalled private method (only called from WriteFixStrDirect)
- WriteFixStr(string): uncalled public method (audit surfaced; was originally listed as live)
- WriteFixStrBytes(ReadOnlySpan<byte>): uncalled public method (audit surfaced)
- WritePreencodedPropertyName(ReadOnlySpan<byte>): uncalled public method (audit surfaced)
All five had zero call sites across core, SourceGenerator template, tests, and reflection. The hot-path string write continues through WriteStringWithDispatch (M3R7 + H2Q6 marker dispatch) and WriteStringInternFirstWithDispatch (interning tier dispatch). Public surface reduced; binary test suite unchanged (222 pass / 13 pre-existing GuidIId failures).
ACCORE-BIN-T-L9Y3: FixArray marker tier — short-list count encoded in marker
Priority: P3 · Type: Wire-format optimization · Status: Open · Related: Array (66) marker, VarUInt itemCount, ACCORE-BIN-T-H2Q6 marker reservation
Analog to FixStr — short list count (0-15) encoded in marker, eliminating the VarUInt itemCount byte for typical DTO collections (Tags, Categories, Items, Properties, Variations, etc. — any list whose size statistically lands in the 0-15 range).
Wire format
Current: [Array marker:1][VarUInt itemCount][items] — header 2-6 byte
FixArray: [FixArrayBase + N marker:1][items] — header 1 byte (N = item count, 0-15)
Writer dispatch (in WriteArray / scan-pass list-writer equivalents):
- itemCount ≤ 15 → FixArrayBase + itemCount marker (1 byte total header)
- itemCount > 15 → existing Array marker + VarUInt count (2-6 byte total header)
Marker reservation
16 marker values pre-reserved in the post-H2Q6 marker layout (see ACCORE-BIN-T-H2Q6 "Marker address space reservation" table). The reservation guarantees that activating FixArray does NOT require another wire-format-version bump after H2Q6 lands at v3 — producers opt in to emit FixArray markers within the same v3 envelope, consumers extend their dispatch to decode them.
Activation steps when implementing:
- Allocate FixArrayBase (16 contiguous values from the H2Q6-freed range)
- Add IsFixArray(byte marker), DecodeFixArrayCount(byte marker), EncodeFixArray(int count) helpers in BinaryTypeCode.cs
- Writer: branch in WriteArray and equivalent ScanPass list-writers, emit FixArray for count ≤ 15
- Reader: extend marker dispatch in ReadValue / SkipValue / ReadArray
- SGen: regenerate readers/writers with IsFixArray dispatch in the array-typed property paths
- Round-trip tests for boundary itemCount values: 0, 1, 14, 15, 16, 17 (last tier transition)
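The three helpers named in the activation steps could look roughly like the sketch below. The helper names come from this entry; the FixArrayBase value (0x70) is only a placeholder assumption, since the real base has to come from the H2Q6 reservation table, and the class name FixArrayCodec is hypothetical.

```csharp
using System;

// Illustrative sketch; in the real code these helpers would live in BinaryTypeCode.cs.
public static class FixArrayCodec
{
    // ASSUMPTION: placeholder value. The actual base must be taken from the H2Q6 reservation table.
    public const byte FixArrayBase = 0x70;
    public const int MaxFixArrayCount = 15;
    public const byte FixArrayMax = FixArrayBase + MaxFixArrayCount;

    // One unsigned compare covers both range ends (markers below the base wrap to a large uint).
    public static bool IsFixArray(byte marker)
        => unchecked((uint)(marker - FixArrayBase)) <= MaxFixArrayCount;

    // Caller must have checked IsFixArray first.
    public static int DecodeFixArrayCount(byte marker) => marker - FixArrayBase;

    public static byte EncodeFixArray(int count)
    {
        if ((uint)count > MaxFixArrayCount) throw new ArgumentOutOfRangeException(nameof(count));
        return (byte)(FixArrayBase + count);
    }
}
```

A writer would call EncodeFixArray only when the count is 15 or less and fall back to the existing Array marker + VarUInt count otherwise, which is exactly the 15/16 boundary the round-trip tests target.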
Why P3
- Wire saving: -1 byte per short list. Realistic per-cell estimates:
- Repeated (10 OrderItem, ~50 list overall): ~50 byte / 28 KB = ~0.18% wire reduction (marginal)
- Large (5×5×5×10 nested, ~6000 list): ~6 KB / 118 KB = ~5% wire reduction ✓
- Medium: ~500 byte / 21 KB = ~2.4% wire reduction
- Deep (2×4×4×8 nested): similar to Medium, ~2-3% wire reduction
- CPU saving: marginal (~1-2 ns/list; VarUInt short-loop replaced by 1-byte marker decode). NOT a hot-path mover for the current Repeated-cell baseline gap.
- Release-narrative value: complements the post-H2Q6 wire-size advantage, particularly on deep-nested structures (Large benchmark). Sharpens the "smallest AND fastest" claim once the CPU gap closes via V4N2 Phase 3 + V4N4.
Why not P2/P1 — and why not now
- The current 2026-05-06_13-10-30.LLM baseline's primary problem is CPU (Compact +5-25% slower than MemPack on every cell), NOT wire size. FixArray addresses wire size, with only a marginal CPU effect.
- Activation after H2Q6 + V4N2 Phase 3 + V4N4 is the natural sequence: CPU gap closes first, then wire-saver features sharpen the release narrative.
- The marker reservation lets us defer activation indefinitely without losing the address-space slot.
Acceptance
- 16 marker values aligned in BinaryTypeCode.cs (FixArrayBase..FixArrayMax) with IsFixArray, DecodeFixArrayCount, EncodeFixArray helpers
- Writer + reader dispatch with boundary tests (count = 0, 1, 14, 15, 16, 17)
- SGen-regenerated readers/writers correctly dispatch via IsFixArray for array-typed properties
- Round-trip tests pass, no Ser/Deser regression vs current Array path
- Wire-size benchmark: ≥-2% on Medium, ≥-3% on Deep, ≥-4% on Large, no regression on any cell
- Documentation update in BINARY_FORMAT.md (new marker range + dispatch rules)
Trigger
- After ACCORE-BIN-T-H2Q6 lands (marker reservation must be active first)
- After CPU gap closes (V4N2 Phase 3 + V4N4): wire-saver value is clearer once "fast" is settled
- Pre-NuGet release housekeeping for the wire-size narrative (along with S5L8 / S2X9 if their scope justifies)
Future extension (not part of this entry)
- FixDict analog: same pattern for Dictionary marker (67) with kvCount 0-15. Worth considering only if a benchmark workload demonstrates dictionary-heavy structures; the current bench data (Order DTOs) does not. Defer until evidence.
- FixArray 0-31: wider count range (32 markers). Marginal additional saving (lists with 16-31 elements are rare); would consume nearly all freed marker space, leaving no slack for S5L8 / S2X9. Reject unless evidence warrants.
ACCORE-BIN-T-O7G2: Overflow guard on charLength * 4 writer arithmetic + corrupted-wire ReadStringBig
Priority: P3 · Type: Defensive / safety · Status: Closed (2026-05-06) · Related: WriteStringWithDispatch, WriteStringInternFirstWithDispatch, ReadStringBig, BinaryTypeCode.MaxStringCharLength
Defensive guards covering two latent failure modes in the H2Q6 string serialization paths:
Writer overflow (silent zero corruption) — charLength * 4 overflows int when charLength > 0x1FFFFFFF (~537M). At exactly 0x40000000 chars the multiplication wraps to 0, causing:
- EnsureCapacity(reserveHeader + 0) to silently succeed (no buffer growth)
- EncodeUtf8SinglePass(value, emptySpan) to write 0 bytes, returning bytesWritten = 0
- The H2Q6 tier choice picks Small (bytesWritten ≤ 255), writing [StringSmall][0][0] to the wire
- The string content is lost silently: no exception, wire claims an empty string
Other overflow values (e.g. charLength = 600M → maxBytes becomes negative) eventually surface as ArgumentOutOfRangeException from Span.AsSpan(start, length), but the message ("length cannot be negative") is misleading and arrives after the buffer has already been partially mutated.
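The wrap-around values quoted above can be reproduced in isolation. The sketch below only mirrors the arithmetic (`charLength * 4` in an unchecked context) and the shape of the guard; CharLengthOverflowDemo and its method names are hypothetical, not serializer API.

```csharp
using System;

// Illustrative only: reproduces the arithmetic, not the serializer's buffer logic.
public static class CharLengthOverflowDemo
{
    // Largest charLength for which charLength * 4 still fits in a signed 32-bit int.
    public const int MaxStringCharLength = 0x1FFFFFFF;

    // Unguarded arithmetic: 32-bit multiplication wraps silently when unchecked.
    public static int UncheckedMaxBytes(int charLength) => unchecked(charLength * 4);

    // Guard shape: a single unsigned compare (it also rejects negative lengths).
    public static bool IsTooLong(int charLength)
        => unchecked((uint)charLength) > MaxStringCharLength;
}
```

At the limit the product is 0x7FFFFFFC; one char more and it wraps to int.MinValue; at 0x40000000 it wraps to exactly 0, which is the silent-corruption case.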
Reader corrupted wire (negative cast from oversized uint) — in ReadStringBig, the wire-side charLen:32 and utf8Len:32 are read as uint, then cast to int. Corrupted or maliciously-crafted payloads with values > Int32.MaxValue produce negative ints, leading to string.Create(negative, ...) exceptions or position-state desync — at best a misleading message, at worst a partial decode with wire-position shifted incorrectly.
Resolution
Landed 2026-05-06 (this commit window).
Writer side — WriteStringWithDispatch and WriteStringInternFirstWithDispatch each gain one method-entry guard:
var charLength = value.Length;
if ((uint)charLength > BinaryTypeCode.MaxStringCharLength) ThrowStringTooLong(charLength);
A single unsigned compare catches the overflow band; predict-friendly (always false on realistic input). The throw helper is [MethodImpl(MethodImplOptions.NoInlining)] so the JIT/AOT keeps the throw site out of the inlined hot path. The same charLength value is reused across the FastWire and Compact branches — no duplicate guard.
Reader side — ReadStringBig gains a single bitwise-OR + sign-test:
var packed = context.ReadUInt64Unsafe();
var charLength = (int)(uint)packed;
var byteLength = (int)(uint)(packed >> 32);
if ((charLength | byteLength) < 0) ThrowCorruptedBigWire(charLength, byteLength);
The OR + sign-test catches negative casts (any wire-side uint > Int32.MaxValue produces a negative int after cast; OR of two positives is positive, sign-test cheap). One instruction effective; predict-friendly.
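As a standalone illustration of why the OR + sign-test is sufficient, the sketch below unpacks a 64-bit header the same way and reports whether either half went negative. BigHeaderDemo and TryUnpack are hypothetical names; the real reader throws through ThrowCorruptedBigWire instead of returning a flag.

```csharp
// Illustrative only: mirrors the unpack + sign-test, without the reader context.
public static class BigHeaderDemo
{
    // Low 32 bits carry charLength, high 32 bits carry byteLength.
    public static bool TryUnpack(ulong packed, out int charLength, out int byteLength)
    {
        charLength = unchecked((int)(uint)packed);
        byteLength = unchecked((int)(uint)(packed >> 32));
        // The sign bit of the OR is set if and only if at least one operand is negative.
        return (charLength | byteLength) >= 0;
    }
}
```

A wire value above Int32.MaxValue in either half sets bit 31 after the cast, so one combined test covers both fields.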
New constant: BinaryTypeCode.MaxStringCharLength = 0x1FFFFFFF (536_870_911 — largest charLength where charLength * 4 fits in int).
Hot-path cost: ~0% on realistic input — single unsigned compare on the writer, single OR + sign-test on the reader Big tier (Small/Medium readers untouched since their wire values are bounded by byte / ushort types and cannot overflow). Throw helpers NoInlining keep the inlined caller body compact. Tests 222 pass / 13 pre-existing failures unchanged.
Why P3
- No correctness impact for realistic inputs (the overflow band is far outside any real DTO scenario)
- Defensive value: prevents silent data loss in the charLength = 1.07G zero-overflow edge case + provides clear error messages on out-of-range inputs
- Security value: corrupted/malicious wire payloads on the reader Big tier path are now caught early instead of producing inconsistent position state
- NuGet release professional-quality signal — explicit, defensive guards over silent-corruption paths
ACCORE-BIN-T-S6F2: Shift-free Small fast path in WriteStringWithDispatch
Priority: P3 · Type: Performance · Status: Reverted (2026-05-07, with V4N4 method-split) · Related: WriteStringWithDispatch, BinaryTypeCode.StringSmall, ACCORE-BIN-T-V4N4
The H2Q6 writer's post-encode tier choice runs a 3-way switch (bytesWritten ≤ 255 → StringSmall, ≤ 65535 → StringMedium, else StringBig) and a header-write switch (3 / 5 / 9 byte) for every non-ASCII string. On the Repeated benchmark cell (Hungarian content, ~10-15 char strings dominant) 99%+ of writes resolve to StringSmall, and the outcome of the 3-way switch is already determined by charLength ≤ 63 alone (worst case charLength * 4 ≤ 252 ≤ 255 ⇒ Small tier guaranteed).
A specialized fast path for charLength ≤ 63 could eliminate:
- The int actualHeader; byte tierMarker; runtime-resolved variables
- The 3-way bytesWritten switch
- The 3-way actualHeader header-write switch
- The shift = reserveHeader - actualHeader compute (always 0 in this branch)
Sketch:
if (charLength <= 63)
{
EnsureCapacity(3 + charLength * 4);
var savedPos = _position;
var encodeStart = savedPos + 3;
var bytesWritten = Utf8Transcoder.EncodeUtf8SinglePass(value.AsSpan(), _buffer.AsSpan(encodeStart, charLength * 4));
if (bytesWritten == charLength) { /* ASCII override — FixStrAscii inline */ }
else
{
// StringSmall — 0 shift, inline header write (constant-folded)
_buffer[savedPos] = BinaryTypeCode.StringSmall;
Unsafe.WriteUnaligned<ushort>(ref _buffer[savedPos + 1],
(ushort)(charLength | (bytesWritten << 8)));
_position = savedPos + 3 + bytesWritten;
}
return;
}
// charLength > 63 → fall through to existing post-encode tier dispatch
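The packed header store in the sketch can be checked on its own. The example below is an assumption-labelled illustration: it uses BinaryPrimitives.WriteUInt16LittleEndian instead of Unsafe.WriteUnaligned so the byte order is explicit (Unsafe.WriteUnaligned writes in host byte order, which is little-endian on the benchmark hosts), it takes StringSmall = 91 from the F3W6 entry, and SmallHeaderDemo is a hypothetical name.

```csharp
using System;
using System.Buffers.Binary;

// Illustrative only: shows the byte layout produced by the packed ushort store.
public static class SmallHeaderDemo
{
    public const byte StringSmall = 91; // marker value as quoted in the F3W6 entry

    // Writes [StringSmall][charLen:8][utf8Len:8] and returns the header size (always 3).
    public static int WriteSmallHeader(Span<byte> dest, int charLength, int bytesWritten)
    {
        // charLength <= 63 implies bytesWritten <= 63 * 4 = 252, so both fit in one byte each.
        if ((uint)charLength > 63 || (uint)bytesWritten > 255)
            throw new ArgumentOutOfRangeException();
        dest[0] = StringSmall;
        BinaryPrimitives.WriteUInt16LittleEndian(
            dest.Slice(1), (ushort)(charLength | (bytesWritten << 8)));
        return 3;
    }
}
```

The low byte of the packed value lands first, so the wire reads marker, charLen, utf8Len in that order.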
Why P3
- Repeated cell hot path benefit (~99% of writes on Hungarian content are charLength ≤ 63)
- Estimated +1-3% Ser improvement on Repeated/Medium cells (where short non-ASCII strings dominate)
- Constant-folded tier choice + inline header write — no branch overhead vs. the generic post-encode path
- Trade-off: ~30 lines of duplicated specialized code; the generic post-encode path remains for charLength > 63 long-string scenarios
Acceptance
- WriteStringWithDispatch Small fast path emits wire bytes identical to the generic path for charLength ≤ 63 (round-trip parity)
- Benchmark on Repeated/Medium cells shows ≥ 1% Ser improvement vs. post-A-direction baseline (2026-05-07_09-39-09.LLM or later)
- No regression on Large/Deep cells (long-string path untouched)
- Round-trip tests pass on the boundary charLength = 63 and charLength = 64 cases
Trigger
- After A-direction (header pack-write) bench result is conclusive
- Pre-NuGet release if the Repeated cell Compact-vs-MemPack Ser ratio still has measurable headroom
Resolution
Implemented as part of ACCORE-BIN-T-V4N4 (2026-05-07): one member of the 4-method split of WriteStringWithDispatch is WriteStringSmallFast, which is exactly the fast path S6F2 describes. The 0-shift non-ASCII branch is guaranteed (charLength ≤ 63 ⇒ bytesWritten ≤ 252 ≤ 255 ⇒ Small tier certain, reserveHeader = actualHeader = 3).
Lesson on the inlining strategy (from the V4N4 disasm): WriteStringSmallFast was marked [NoInlining] in the final version; the [AggressiveInlining] experiment caused code bloat (3 generic specializations × 30+ SGen call sites × inlined body = i-cache pressure on the Repeated cell hot loop, +29.6 pp Ser regression on the 15:13:39 bench). With [NoInlining] the S6F2 logic still applies (constant-folded tier choice, 0 shift), at the cost of only +1 call instruction.
Bench (15:29:21): Compact Ser at parity or better vs MemPack on 5/5 cells (Small -8.5%, Medium -1.1%, Large -8.5%, Repeated +1.9%, Deep -1.9%). The +1-3% Ser improvement expected from S6F2 materialized on the Small/Large cells; Repeated/Deep are at parity (the +1 call overhead offsets the fast-path gain on short Hungarian strings).
Re-evaluable as of 2026-05-07 per ACCORE-BIN-T-D9X3 — together with the parent V4N4 method-split, the Small fast path is re-testable now that bench stabilization removes the noise-floor; retest before any code change.
ACCORE-BIN-T-W2C8: Maximize the H2Q6 win on the WASM string-cache (ReadStringUtf8Cached MISS path)
Priority: P2 (WASM target) / P3 (otherwise) · Type: Performance · Related: BinaryDeserializationContext.Read.cs::ReadStringUtf8Cached, ReadStringUtf8WithCharLen, Utf8Transcoder.DecodeUtf8SinglePass
H2Q6's primary win is 1-pass decode on the reader side: tier markers carry both charLen and utf8Len, so the reader allocates the target string with the known char count and decodes in a single pass via string.Create(charLength, ..., DecodeUtf8SinglePass). This eliminates the CountUtf8Chars Pass 1 — the headline V4N3/H2Q6 win.
The WASM string-cache path bypasses this win. When _useStringCaching is true (Blazor WASM target), ReadStringUtf8WithCharLen dispatches to ReadStringUtf8Cached(byteLength) for short strings. On cache HIT, the cached instance is returned (zero decode, already optimal). On cache MISS, the current ReadStringUtf8Cached falls back to Utf8NoBom.GetString(slice), the BCL two-pass UTF-8 decoder. The H2Q6 1-pass decode benefit is lost on every cache MISS.
Per-cell impact estimate on a WASM workload with hot-path strings (typical Blazor SignalR DTO traffic):
- Cache HIT rate ~30-50% on repeated property names + tags + categories
- Cache MISS rate ~50-70% on first occurrences + unique values
- MISS path = Utf8NoBom.GetString BCL call (virtual dispatch + DecoderFallback overhead) instead of string.Create(charLength, ..., DecodeUtf8SinglePass)
Implementation outline
ReadStringUtf8Cached accepts both charLength and byteLength (or just compute charLength from the cache check / decode result). Cache HIT: cached.Length == charLength invariant check (UTF-16 char count, not UTF-8 byte count) + ASCII verification. Cache MISS: replace Utf8NoBom.GetString(slice) with string.Create(charLength, (Buffer, Pos, Len), static (chars, state) => DecodeUtf8SinglePass(state.Buffer.AsSpan(state.Pos, state.Len), chars)).
Cross-check: the existing ComputeStringHashFull(slice) and VerifyAsciiUtf8Match(cached, slice) operate on the raw UTF-8 bytes — these stay unchanged. Only the MISS-side string materialization needs the H2Q6-aware refactor.
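A minimal sketch of the MISS-side materialization described above, under stated assumptions: the BCL Utf8.ToUtf16 stands in for the project's DecodeUtf8SinglePass (whose signature is not shown here), and CachedMissDecodeDemo / DecodeWithKnownCharLength are hypothetical names. The point is only the shape: the string is allocated once at the known char count and filled in a single decode pass.

```csharp
using System;
using System.Buffers;
using System.Text.Unicode;

// Illustrative only: Utf8.ToUtf16 is a stand-in for DecodeUtf8SinglePass.
public static class CachedMissDecodeDemo
{
    public static string DecodeWithKnownCharLength(byte[] buffer, int pos, int byteLength, int charLength)
    {
        // The state tuple avoids a closure allocation; the lambda is static.
        return string.Create(charLength, (Buffer: buffer, Pos: pos, Len: byteLength),
            static (chars, state) =>
            {
                var status = Utf8.ToUtf16(
                    state.Buffer.AsSpan(state.Pos, state.Len), chars, out _, out int written);
                // A mismatch means the wire-side charLength does not describe this payload.
                if (status != OperationStatus.Done || written != chars.Length)
                    throw new InvalidOperationException("charLength does not match the UTF-8 payload");
            });
    }
}
```

The hash and ASCII-verification steps keep working on the raw UTF-8 slice, so only this materialization step would change on the MISS path.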
Why P2 (WASM-target) / P3 (otherwise)
- The non-WASM benchmark host (x64) doesn't enable _useStringCaching by default, so this optimization is invisible on the current bench
- On Blazor WASM, all interning + repeated-string-cached deserialization currently pays the BCL decode tax on cache MISS
- Estimated +5-15% Deser improvement on WASM workloads with significant cache MISS rate
- Direct extension of the H2Q6 win to the WASM execution profile
Acceptance
- ReadStringUtf8Cached cache MISS path uses string.Create(charLength, ..., DecodeUtf8SinglePass); no BCL Utf8NoBom.GetString on MISS
- Round-trip tests pass on cached + uncached short-string scenarios across all UTF-8 content classes (ASCII / Hungarian / CJK / emoji)
- WASM-target benchmark (Blazor profile) shows ≥ 5% Deser improvement vs. pre-W2C8 state on a representative hot-string-heavy DTO workload
- Cache HIT path performance unchanged (already optimal — no decode)
- Cache eviction / capacity behavior unchanged
Trigger
- Pre-NuGet release if Blazor WASM is a primary supported scenario in the release narrative
- Or: when a WASM-focused benchmark workload becomes the active perf measurement target
ACCORE-BIN-T-F3W6: Dedicated FastWire string marker (split mode-shared StringSmall)
Priority: P3 · Type: Performance · Related: WriteStringWithDispatch FastWire branch, ReadStringSmall FastWire branch, BinaryTypeCode.StringSmall, H2Q6 marker reservation
The H2Q6 marker layout currently shares StringSmall (=91) between Compact and FastWire modes:
- Compact emits [91][charLen:8][utf8Len:8][UTF-8 bytes]
- FastWire emits [91][VarUInt charCount][UTF-16 raw bytes]
The reader dispatches on context.FastWire inside ReadStringSmall. Correct (the deserializer's mode is fixed per operation), but the mode-shared marker forces runtime branching at hot points:
- Writer: if (FastWire) at the top of WriteStringWithDispatch runs on every string write, a runtime check on a call site where the Compact path dominates
- Reader: if (context.FastWire) inside ReadStringSmall runs on every short non-ASCII string deserialization, which is wasted work on the Compact side
- SGen template: every regenerated reader contains the FastWire-aware case StringSmall: block (more code per type, larger AOT binary)
- JIT/AOT inlining: the larger WriteStringWithDispatch / ReadStringSmall method bodies may exceed inline budgets at hot call sites, particularly under NativeAOT
A dedicated StringFastWire marker (one value from the H2Q6-freed 106-134 range — proposed allocation: 131) splits the path:
- Compact stays on StringSmall (=91) → ReadStringSmall becomes Compact-only (no if (FastWire) branch, smaller method body)
- FastWire uses new StringFastWire → dedicated ReadStringFastWire reader, FastWire-only logic
- Writer's FastWire branch emits StringFastWire instead of StringSmall
Wire format compatibility
The marker swap is internally consistent within the v3 envelope: producers that opt in to the dedicated FastWire marker emit it, and readers are extended to handle both StringSmall and StringFastWire during the transition. Once all producers emit the dedicated marker, the old mode-shared dispatch in ReadStringSmall can be removed.
Why P3: "every small % counts"
- Estimated +0.5-1% Ser (writer branch elimination on Compact path)
- Estimated +0.5-1% Deser (reader smaller method body, better JIT/AOT inline-eligibility on Compact path; FastWire reader gets a tight dedicated path too)
- Compounds with other micro-opts across the hot path — small percentages add up
- Marker-space cost: 1 reserved value consumed (general-reserve count drops from 4 to 3 in the H2Q6 reservation table)
- Risk: low — mechanical split; round-trip tested against both wire-format variants
Implementation outline
- BinaryTypeCode.StringFastWire = 131 constant + helper updates (IsString range check + dispatch)
- WriteStringWithDispatch FastWire branch emits StringFastWire (was StringSmall)
- New ReadStringFastWire<TInput> static reader: [VarUInt charCount][UTF-16 bytes] decode, no Compact-mode branching
- ReadStringSmall<TInput> simplified: Compact-only, drops the if (context.FastWire) branch
- TypeReaderTable[StringFastWire] registration
- SkipValue case StringFastWire: same skip layout as the StringSmall FastWire branch (charCount VarUInt + 2 × charCount bytes)
- SGen template EmitReadString: new case StringFastWire: block (FastWire-only branch); case StringSmall: simplified to Compact-only
- Round-trip tests: separate FastWire and Compact wire format coverage
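The dedicated reader and skip layout from the outline can be sketched over a plain span. Several things here are assumptions rather than the project's code: VarUInt is assumed to be a 7-bit little-endian group encoding with a continuation bit (consistent with the 1-5 byte count implied elsewhere in this document, but not confirmed), the raw UTF-16 bytes are assumed to be in host byte order, and FastWireStringDemo with its span-based signatures is hypothetical (the real reader is generic over TInput).

```csharp
using System;
using System.Runtime.InteropServices;

// Illustrative only: span-based stand-in for the ReadStringFastWire / SkipValue pair.
public static class FastWireStringDemo
{
    // ASSUMPTION: VarUInt = 7-bit groups, least significant first, high bit = "more bytes follow".
    public static uint ReadVarUInt(ReadOnlySpan<byte> src, ref int pos)
    {
        uint result = 0;
        for (int shift = 0; shift < 35; shift += 7)
        {
            byte b = src[pos++];
            result |= (uint)(b & 0x7F) << shift;
            if ((b & 0x80) == 0) return result;
        }
        throw new FormatException("VarUInt longer than 5 bytes");
    }

    // Reads [VarUInt charCount][UTF-16 raw bytes]; pos points just past the marker byte.
    public static string ReadStringFastWire(ReadOnlySpan<byte> src, ref int pos)
    {
        int charCount = checked((int)ReadVarUInt(src, ref pos));
        var raw = src.Slice(pos, checked(charCount * 2));
        pos += raw.Length;
        // Reinterprets the bytes as chars (host byte order) and copies them into a new string.
        return new string(MemoryMarshal.Cast<byte, char>(raw));
    }

    // Skip layout: VarUInt charCount + 2 x charCount bytes, no decoding.
    public static void SkipStringFastWire(ReadOnlySpan<byte> src, ref int pos)
    {
        int charCount = checked((int)ReadVarUInt(src, ref pos));
        int byteCount = checked(charCount * 2);
        if (byteCount > src.Length - pos) throw new FormatException("truncated FastWire string");
        pos += byteCount;
    }
}
```

Neither method consults a mode flag, which is the property the dedicated marker is meant to buy for the Compact-side ReadStringSmall as well.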
Acceptance
- Round-trip parity on both Compact and FastWire wire formats (existing tests pass)
- Benchmark on FastWire mode shows ≥ 0.5% improvement vs. mode-shared baseline
- Compact mode shows no regression (likely marginal gain from simpler ReadStringSmall)
- AOT-published binary shows reduced generated reader size per [AcBinarySerializable] type (one less case-block + branch)
- Marker-space documented: BinaryTypeCode.cs reservation comment + H2Q6 entry's reservation table updated to reflect the F3W6 allocation
Trigger
- Pre-NuGet release if every measurable percentage point on the Compact hot path matters for the "fastest" narrative
- Or: when the Compact/FastWire branch profile shows up in a NativeAOT inlining audit (ACCORE-BIN-T-V4N4)
Roll-back fallback
If a future marker-space crunch arises (additional H2Q6 tiers, new compression markers, etc.), F3W6 can be reverted by switching the writer back to emitting StringSmall on FastWire and re-introducing the mode-shared dispatch in ReadStringSmall. The original design is correctness-equivalent; the dedicated marker is purely an optimization. If marker space becomes a problem, we take it out.
ACCORE-BIN-T-B1D5: BenchmarkDotNet release-quality measurement project
Priority: P2 · Type: Tooling / release-narrative · Status: Open · Related: AyCode.Core.Serializers.Console (existing custom bench), NuGet release-narrative
The current AyCode.Core.Serializers.Console is a hand-rolled microbenchmark — fast dev-iteration loop (30-90s per run, custom markdown output, internal TestDataSet structure). It serves the inner optimization cycle well, but is not industry-standard for the public NuGet release narrative.
A parallel BenchmarkDotNet-based project would close that gap:
- Industry-standard credibility: BenchmarkDotNet is the canonical .NET benchmarking framework — MemoryPack, MessagePack, System.Text.Json all use it for their published numbers. AcBinary results expressed in BDN format are directly comparable to MemPack's own release notes.
- Statistical rigor: outlier detection (Tukey's fences), interquartile range, confidence intervals, multi-process iteration runs. The current custom bench reports median-of-5; BDN reports the full distribution + variance band — the difference between "looks fast on my machine" and "demonstrably fast under controlled conditions".
- NuGet release surface: BDN markdown tables drop straight into release notes / blog posts / NuGet README.md / BINARY_FEATURES.md "Performance vs MemoryPack" section. GitHub-friendly format, screenshot-friendly, reviewer-credible.
- Diagnostic-plugin integration:
  - [MemoryDiagnoser]: allocation per iteration (already a hot question for the Repeated cell)
  - [EventPipeProfiler]: CPU profile collection during the bench run, exportable to speedscope flame-graph
  - [DisassemblyDiagnoser]: per-method disasm dump, parallel to the manual dumpbin workflow used in V4N4
  - [ThreadingDiagnoser]: context switches, lock contention (relevant if pool-contention shows up under load)
- Multi-runtime / multi-job: a single project benchmarks against RuntimeMoniker.Net90 (JIT) and RuntimeMoniker.NativeAot90 simultaneously, giving same-shape tables side by side.
- CI integration potential: BDN result format is machine-readable (JSON/CSV), enabling regression detection on PR diffs (later sprint).
Implementation outline
- New project: AyCode.Core.Serializers.Benchmark (or .Bdn), a separate csproj for clean BDN dependency isolation. AOT-publishable for the AOT job.
- TestDataSet bridge: reuse the existing TestDataFactory / TestDataSet types from AyCode.Core.Tests.TestModels so the data-shape is identical to the custom bench.
- Benchmark class skeleton:
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Jobs;
using MemoryPack;

[MemoryDiagnoser]
[SimpleJob(RuntimeMoniker.Net90, baseline: true)]
[SimpleJob(RuntimeMoniker.NativeAot90)]
public class StringSerializationBenchmark
{
    [Params("Small", "Medium", "Large", "Repeated", "Deep")]
    public string DataSet { get; set; } = "Small";

    private object _data = null!;
    private byte[] _compactWire = null!;
    private byte[] _mempackWire = null!;

    // Build the payload and both wire images once per DataSet value.
    [GlobalSetup]
    public void Setup()
    {
        _data = TestDataFactory.Create(DataSet);
        _compactWire = AcBinarySerializer.Serialize(_data, AcBinarySerializerOptions.FastMode);
        _mempackWire = MemoryPackSerializer.Serialize(_data);
    }

    [Benchmark(Baseline = true)]
    public byte[] MemPack_Ser() => MemoryPackSerializer.Serialize(_data);

    [Benchmark]
    public byte[] AcBinary_Compact_Ser() => AcBinarySerializer.Serialize(_data, AcBinarySerializerOptions.FastMode);

    [Benchmark]
    public object? MemPack_Deser() => MemoryPackSerializer.Deserialize<TestOrder>(_mempackWire);

    [Benchmark]
    public object? AcBinary_Compact_Deser() => AcBinaryDeserializer.Deserialize<TestOrder>(_compactWire);
}
- Multi-cell coverage: separate benchmark classes per workload-shape (StringSerializationBenchmark, ObjectGraphBenchmark, NestedDeepBenchmark) for clean grouping in BDN output.
- NativeAOT-job config: <PublishAot>true</PublishAot> conditionally (mirroring the Console project pattern); BDN's NativeAOT job auto-publishes the bench-runner.
- Output: GitHub-flavored Markdown export → docs/BINARY/BENCHMARK_RESULTS.md (or similar), versioned in the repo.
Why P2 (pre-NuGet release)
- NuGet release narrative ("AcBinary fastest AND smallest binary serializer for .NET i18n payloads") needs credible, industry-standard numbers. Custom bench → "trust me, my numbers"; BDN → "here are the variance bands and the methodology".
- Direct comparison surface against MemPack's published BDN numbers (head-to-head on the same framework).
- Diagnostic-plugin integration ([MemoryDiagnoser] + [EventPipeProfiler]) opens up further targeted optimization work without separate tooling.
Acceptance
- New AyCode.Core.Serializers.Benchmark project compiles + runs cleanly on both JIT (net9.0) and NativeAOT
- Reuses existing TestDataFactory / TestDataSet types, no test data duplication
- Produces a markdown table per workload-shape covering: MemPack baseline + AcBinary Compact + (optionally) AcBinary FastWire, both Ser and Deser
- BDN output saved to docs/BINARY/BENCHMARK_RESULTS.md (versioned per release)
- README.md / BINARY_FEATURES.md references the BDN-measured performance claim with the methodology link
Trigger
- Pre-NuGet release: when the optimization sprint cluster (V4N2 / W2C8 / etc.) settles and the perf state is release-stable
- Or: when a credibility-sensitive presentation surface emerges (blog post, conference talk, GitHub README)
Coexistence with the custom bench
The custom Console bench is not replaced — it remains the dev-iteration tool (fast feedback loop, 30-90s runs, hand-tuned markdown for chat-paste). BDN is the release-grade bench (3-10 min runs, statistical rigor, NuGet release output). Different tools for different audiences.
ACCORE-BIN-T-C5R8: Charset-parameterized benchmark workload (ASCII / Hungarian / CJK / Cyrillic / Mixed)
Priority: P2 · Type: Tooling / release-narrative · Status: Closed (2026-05-07) · Related: BenchmarkTestDataProvider, AyCode.Core.Serializers.Console.Program.cs (Settings → Charset submenu), ACCORE-BIN-T-V4N2 (charset-specific optimization measurement target), ACCORE-BIN-T-D9X3 (bench stabilization preceding this work)
The current BenchmarkTestDataProvider hard-codes Hungarian (Latin extended 2-byte) content into the test DTOs. This produces a single workload-shape: Hungarian mixed text with short 1-2 char 2-byte runs. While Hungarian is a fine general-purpose i18n stress, it is only one production-content profile — and the optimization decisions ride on it implicitly (e.g. V4N2 Phase 2.5's 3-byte run do-while was deferred-on-2-byte-side because the Hungarian bench measured regression there, but its CJK-side value cannot be measured on the current data).
A charset-parameterized benchmark workload — selectable from the interactive menu — would:
- Measure optimization value across realistic content profiles — what wins on CJK content may not win on Hungarian, and vice versa. Without explicit per-charset measurement, optimization decisions become Hungarian-biased.
- Surface release-narrative numbers credibly — instead of "Compact beats MemPack on i18n payload" (single workload), claim "Compact vs MemPack: ASCII X%, Hungarian Y%, CJK Z%, Cyrillic W%, Mixed V%" — concrete numbers per content profile, NuGet-grade.
- Enable workload-specific optimization audits — V4N2 Phase 3 SIMD multi-byte transcoder targets CJK 3-byte content; without a CJK workload measurement, Phase 3 acceptance criteria cannot be validated.
Implementation outline
1. BenchmarkTestDataProvider refactor
Hard-coded Hungarian strings (KözösCímke, sötét, magyar, hetenkénti, etc.) → ASCII baseline values (English equivalents: SharedTag, dark, hungarian, weekly).
New static LongStringSuffix field — charset-aware suffix appended to a subset of property values:
public static class CharsetSuffixes
{
public const string AsciiOnly = ""; // baseline — pure-English ASCII content
public const string Hungarian = " árvíztűrő tükörfúrógép";
public const string CjkBmp = " 你好世界 こんにちは 안녕하세요";
public const string Cyrillic = " Привет мир дорогой друг";
public const string Mixed = " árvíz 你好 Привет 😀";
}
public static string LongStringSuffix { get; set; } = CharsetSuffixes.Hungarian; // default
Property values use the suffix dynamically:
var description = "Product description" + LongStringSuffix;
The 5 charsets cover the realistic UTF-8 workload spectrum:
- Pure ASCII — baseline; Phase 1 SIMD prefix widen + DWORD batch dominate; no multi-byte path engagement
- Hungarian (Latin extended) — short 1-2 char 2-byte runs in mixed text; current default workload
- CJK BMP — long homogeneous 3-byte runs; primary V4N2 Phase 2.5/3 win region
- Cyrillic (Russian / etc.) — long 2-byte runs (different shape than Hungarian mixed); V4N2 Phase 2.5 may yet pay off here
- Mixed (Hungarian + CJK + emoji) — full multi-tier coverage in one payload; surrogate-pair handling stress
2. Program.cs interactive submenu
Before starting a benchmark run, prompt the user for charset choice:
Choose benchmark charset:
1 — Pure ASCII (baseline)
2 — Hungarian (Latin extended) [DEFAULT]
3 — CJK BMP (Chinese / Japanese / Korean)
4 — Cyrillic (Russian / etc.)
5 — Mixed (Hungarian + CJK + emoji)
The choice → BenchmarkTestDataProvider.LongStringSuffix = ... before constructing test data.
3. Benchmark output header
The markdown output header should reflect the selected charset:
# AcBinary Benchmark Release 2026-05-07 16:00:00
Charset: CJK BMP | Iterations: 1000 | Warmup: 10000 | ...
This makes per-charset bench files self-documenting — file names + content both encode the workload profile.
4. Round-trip tests unaffected
Utf8TranscoderTests and other content-class unit tests (with their fixed Hungarian / CJK / emoji boundary inputs) are untouched — they remain fixed-content for regression coverage. Only the benchmark workload is charset-parameterized.
Why P2
- Release-narrative: NuGet release credibility depends on measurable performance claims across realistic content profiles, not a single Hungarian-mixed workload
- Optimization decision quality: V4N2 Phase 2.5 / Phase 3 / future SIMD multi-byte work cannot be objectively validated without a CJK workload — current decisions have implicit Hungarian-bias
- Consumer reproducibility: external consumers can reproduce benchmark numbers on their own content profile (or contribute a new charset profile)
Acceptance
- BenchmarkTestDataProvider refactored: ASCII baseline + LongStringSuffix static field with 5 predefined charset constants
- Interactive menu in Program.cs lets the user choose charset 1-5 before benchmark run; the chosen charset is recorded in the markdown output header
- Round-trip correctness verification still runs once-per-cell before warmup (existing Verified: round-trip ... line) and works on the active charset
- All 5 charsets produce valid round-trip on all benchmark cells (Small / Medium / Large / Repeated / Deep)
- Existing benchmark numbers (Hungarian-default) reproducible: choosing charset 2 from the menu yields the current 15:29:21-style results
- New CJK charset (option 3) produces measurable numbers (one bench run per charset documented in Test_Benchmark_Results/)
Trigger
- Pre-NuGet release: per-charset numbers needed for the public performance-claim table
- Or: when V4N2 Phase 3 SIMD multi-byte transcoder work needs CJK-workload validation
Resolution
Landed 2026-05-07 (after ACCORE-BIN-T-D9X3 bench stabilization made sub-3% deltas measurable, which raised the value of charset-specific measurement). Implementation refined the original 5-charset proposal into a 6-charset list per user request (Latin1FixAscii + Latin1 short/long split for finer-grained Latin1 coverage):
1. BenchmarkTestDataProvider refactor ✅
- New CharsetSuffixes static class with 6 const suffixes (one more than originally proposed):
  - Latin1FixAscii = "": empty suffix; baseline values stay short → FixStr fast-path stress (renamed from AsciiOnly per user request)
  - Latin1Short = " árvíztűrő tükörfúrógép" (~24 char): Hungarian short Latin1 mixed
  - Latin1Long = " árvíztűrő tükörfúrógép a magyar betűzés tesztje" (~47 char): NEW, exceeds the 32-char FixStr boundary on the suffix alone (user request)
  - CjkBmp, Cyrillic, Mixed: as originally specified
- LongStringSuffix default = CharsetSuffixes.Latin1Long (backward-compatible in spirit with the prior fixed Latin1 default)
- All hard-coded Hungarian baseline values replaced with ASCII English equivalents:
  - KözösCímke / IsmétlődőCímke / MélyCímke → SharedTag / RepeatedTag / DeepTag
  - közösfelhasználó → shareduser (and variants); közös → shared; MélyKategória → DeepCategory
  - sötét / világos → dark / light; magyar / német / francia → hungarian / german / french
  - hetenkénti / naponkénti / havonkénti → weekly / daily / monthly
  - Repeated cell long Hungarian baselines (TermékNév_IsmétlődőTesztAdat_árvíztűrőtükörfúrógép, RaklapKód_IsmétlődőTesztAdat_árvíztűrő) shortened to ASCII ProductName / PalletCode so the EnsureAllStringsBypassFixStr suffix-append actually applies (the prior >31-char baselines bypassed the suffix, leaving Repeated cell content fixed-Hungarian regardless of charset selection)
- The only Latin1/non-ASCII characters remaining in the file are inside the CharsetSuffixes const definitions themselves (intentional: those define the per-charset content profiles)
2. Program.cs interactive submenu ✅
- New [3] Charset entry in the existing Settings submenu (next to [1] Iteration and [2] WireMode); chose a nested submenu over a top-level prompt to keep the main menu uncluttered
- ShowCharsetSettingsMenu lists the 6 charset constants with brief descriptions; selection sets BenchmarkTestDataProvider.LongStringSuffix and returns
- GetCurrentCharsetName() helper resolves the active suffix back to its constant name (returns "Custom" when programmatically set to a non-const value)
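The reverse lookup described for GetCurrentCharsetName() can be sketched as a constant-pattern switch. This is a minimal illustration, not the harness code: the constant values are copied from the notes above, while the CharsetNames class and its Resolve method are hypothetical names introduced here.

```csharp
using System;

// Constant values copied from the Resolution notes; the real class lives in the bench project.
public static class CharsetSuffixes
{
    public const string Latin1FixAscii = "";
    public const string Latin1Short = " árvíztűrő tükörfúrógép";
    public const string Latin1Long = " árvíztűrő tükörfúrógép a magyar betűzés tesztje";
    public const string CjkBmp = " 你好世界 こんにちは 안녕하세요";
    public const string Cyrillic = " Привет мир дорогой друг";
    public const string Mixed = " árvíz 你好 Привет 😀";
}

// Hypothetical stand-in for the GetCurrentCharsetName() helper.
public static class CharsetNames
{
    // Maps the active suffix back to the name of the matching constant.
    // Any value that is not one of the six constants is reported as "Custom".
    public static string Resolve(string suffix) => suffix switch
    {
        CharsetSuffixes.Latin1FixAscii => nameof(CharsetSuffixes.Latin1FixAscii),
        CharsetSuffixes.Latin1Short => nameof(CharsetSuffixes.Latin1Short),
        CharsetSuffixes.Latin1Long => nameof(CharsetSuffixes.Latin1Long),
        CharsetSuffixes.CjkBmp => nameof(CharsetSuffixes.CjkBmp),
        CharsetSuffixes.Cyrillic => nameof(CharsetSuffixes.Cyrillic),
        CharsetSuffixes.Mixed => nameof(CharsetSuffixes.Mixed),
        _ => "Custom",
    };
}
```

Matching on the constant values keeps the header field in sync with the constants without a separate "current charset" enum; the cost is that two constants with identical content could not be told apart.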
3. Benchmark output header ✅
- Charset: field added to 3 output locations:
  - Console run header (interactive run line: Layer: ... | Charset: CjkBmp | Iterations: ...)
  - .LLM markdown header (file-self-documenting)
  - .log boxed banner (║ Charset: CjkBmp ║)
4. Round-trip tests unaffected ✅ — Utf8TranscoderTests and other content-class unit tests use their own fixed boundary inputs; not touched by this change. Round-trip verification in the bench harness continues to run once-per-cell pre-warmup (VerifyRoundTrip) on the active charset.
Acceptance status
- ✅ BenchmarkTestDataProvider refactored with ASCII baselines + LongStringSuffix field + 6 charset constants
- ✅ Interactive submenu lets the user choose charset 1-6; recorded in markdown output header (3 locations)
- ✅ Round-trip verification runs on the active charset (existing per-cell verify, charset-agnostic by design)
- ⚠️ "All 6 charsets produce valid round-trip on all benchmark cells": design correctness implies this; not yet exercised on every (cell × charset) combination explicitly. Recommend running each charset once before declaring full validation.
- ❌ "Existing benchmark numbers (Hungarian-default) reproducible; choosing charset 2 yields the current 15:29:21-style results": NOT met. The ASCII baseline refactor changes the numbers regardless of charset choice (shorter baselines + suffix-driven content vs. prior fixed Hungarian baselines). New Latin1Short ≠ prior fixed Hungarian default. This is intentional: the user explicitly chose a clean ASCII-baseline + charset-suffix design over preserving historical numerical comparability.
- ❌ "Choosing CJK produces measurable numbers documented in Test_Benchmark_Results/": NOT done in this commit window; user has the menu and will run per-charset benches in a follow-up sprint.
Note on numerical incompatibility with prior runs
Existing bench files generated before this commit (e.g. Console.FullBenchmark_Release_2026-05-07_17-42-22.LLM and earlier) used the prior fixed Latin1 baseline values + 32-char Hungarian suffix. The new default (Latin1Long) uses ASCII baselines + 47-char Latin1Long suffix; the Repeated cell sees a more dramatic shift (its 52-char fixed Hungarian baseline → 11-char ASCII ProductName + 47-char suffix). Numerical comparison across the boundary is not meaningful; the Charset: header field documents the source charset for each new bench file.
Future extensions
- Sentinel "real-world" charsets: synthetic mixes representing typical production payloads (e.g. EnglishWithEmoji for chat-app DTOs, ArabicHebrew for RTL-script regions). Add as new CharsetSuffixes constants when consumer demand surfaces.
- Charset auto-rotate mode: a single benchmark run cycles through all 6 charsets, producing a 6-section markdown output. Useful for full release-narrative table generation in one pass.
- BDN integration (per ACCORE-BIN-T-B1D5): charset becomes a [Params] axis in BenchmarkDotNet, producing a 5×6×N matrix (cells × charsets × engines) in the BDN output.
ACCORE-BIN-T-D9X3: Console benchmark stabilization (per-serializer warmup + GC isolate + pilot discard + min/max range + CPU pin + mode-aware JIT sleep)
Priority: P1 · Type: Tooling / measurement · Status: Closed (2026-05-07) · Related: AyCode.Core.Serializers.Console.Program.cs, ACCORE-BIN-T-V4N4, ACCORE-BIN-T-V4N2, ACCORE-BIN-T-S6F2, ACCORE-BIN-T-B1D5 (BDN release-grade variant)
The custom Console benchmark harness showed strong run-to-run variance: a user-reported +20pp / -10pp spread in the summary total between runs on identical code. Perf claims of 1-3% became unmeasurable on this noise floor; the V4N4 method-split and V4N2 Phase 2.5 attempts both fell into this band, leaving the question "does the regressed bench number reflect a code regression or measurement noise?" undecidable (see V4N4 Reverted section).
Diagnosis (sprint takeaway prior to this entry):
- Warmup cache pollution: RunBenchmarksForTestData ran one warmup-all loop (every serializer × WarmupIterations) followed by one bench-all loop. By the time a given serializer was measured, its hot code and data lines had been evicted by the intervening serializers' warmup passes. MemPack and AcBinary hot paths share neither code nor data working sets, so they actively evict each other.
- GC pause leakage between samples: the Stopwatch-recorded sample loop had no explicit GC.Collect. A minor GC triggered inside sample N could promote into a Gen-2 pause inside sample N+1's timed window (1-5 ms spike).
- Pilot sample contamination: the first sample after warmup absorbed residual JIT bookkeeping and cold-cache misses; on a 10-sample median this contributed 1-2 outliers that visibly stretched the min/max.
- CPU migration / preemption: the Windows scheduler migrated the bench thread between cores between samples (L1/L2 cache evict on each migration); background work (Defender index, OS service threads) injected random preemption spikes.
- JIT sleep not mode-aware: Thread.Sleep(JitSleep = 3000) waited 3 seconds before each cell for tiered-JIT drain. On AOT publish (PublishAot=true) there is no dynamic compilation, so the 3 seconds were pure idle. Worse, the drain happened only globally (once before all cells), not per-serializer, so a tier-promotion mid-bench could still bleed in.
- Range invisible: the .LLM markdown output showed only the median; the user could not tell whether a 5%-median-delta was inside or outside the inter-sample range for that row.
Resolution
Landed 2026-05-07 (16:00 — 17:00). Six stabilization steps in one commit window:
1. Per-serializer warmup separation (RunBenchmarksForTestData) — the warmup-loop and bench-loop merged into one per-serializer cycle: each serializer's warmup runs IMMEDIATELY before its own bench. The serializer's hot code/data is freshest in cache when the first sample times.
2. GC.Collect before every sample (RunTimed) — GC.Collect() + WaitForPendingFinalizers() + GC.Collect() triple-tap before each sample, OUTSIDE the Stopwatch window. Every sample starts from the same heap state; an ad-hoc Gen-2 pause from sample N can no longer bleed into sample N+1.
3. Pilot sample discard (RunTimed) — the loop runs samples + 1 times; the first (index 0) is discarded. The first sample post-warmup absorbs residual JIT/GC bookkeeping and cold cache; the recorded samples count remains 10 (median is the same data the user saw before, just sourced from "typical" sample-set, not from the post-warmup-first noisy point).
4. Min/max range in markdown output (SaveLlmResults, new FormatMicrosWithRange helper, new BenchmarkResult fields: SerializeTimeMinMs/MaxMs, DeserializeTimeMinMs/MaxMs, RoundTripTimeMinMs/MaxMs) — the .LLM output's Ser and Deser columns now render as 26.86 (24.50..29.10): median (min..max) µs/op. The reader sees at a glance whether a delta is above the row's noise floor.
5. CPU affinity + process priority (RunBenchmark) — ProcessorAffinity = 0x1 (CPU 0 pin) + PriorityClass = High for the benchmark phase, try/finally restores the original values. Eliminates inter-sample thread migration (L1/L2 cache evicts) and reduces background-task preemption. Platform-guarded: Windows / Linux only (CA1416 — ProcessorAffinity throws on macOS); locked-down hosts (group policy, container without CAP_SYS_NICE, etc.) catch + warning + bench continues with default scheduling.
6. Mode-aware JitSleep (property) — RuntimeFeature.IsDynamicCodeCompiled ? 250 : 0. JIT mode 250 ms (the .NET 9 tiered-JIT compile queue typically drains in <100 ms for the bench's hot path); AOT publish 0 ms. The 3000 ms blind wait is gone. The drain now happens per-serializer (Step 1) instead of once globally.
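Steps 2, 3 and 6 above can be sketched together in a few lines. This is an illustrative reconstruction under stated assumptions, not the harness's actual RunTimed: the class name TimedRunSketch, the method signature and the returned tuple shape are hypothetical; only the GC triple-tap, the pilot discard and the RuntimeFeature.IsDynamicCodeCompiled check come from the notes.

```csharp
using System;
using System.Diagnostics;
using System.Runtime.CompilerServices;

public static class TimedRunSketch
{
    // Step 6: wait for the tiered-JIT queue only when a JIT actually exists.
    public static int JitSleepMs => RuntimeFeature.IsDynamicCodeCompiled ? 250 : 0;

    // Steps 2 + 3: run samples + 1 timed batches, discard the pilot, report median/min/max in ms.
    public static (double MedianMs, double MinMs, double MaxMs) RunTimed(Action batch, int samples)
    {
        if (samples < 1) throw new ArgumentOutOfRangeException(nameof(samples));

        var recorded = new double[samples];
        for (var i = 0; i <= samples; i++)
        {
            // GC triple-tap, outside the Stopwatch window, so every sample starts from the same heap state.
            GC.Collect();
            GC.WaitForPendingFinalizers();
            GC.Collect();

            var sw = Stopwatch.StartNew();
            batch();
            sw.Stop();

            if (i == 0) continue; // pilot sample: absorbs residual JIT/cold-cache cost, not recorded
            recorded[i - 1] = sw.Elapsed.TotalMilliseconds;
        }

        Array.Sort(recorded);
        var mid = samples / 2;
        var median = samples % 2 == 1
            ? recorded[mid]
            : (recorded[mid - 1] + recorded[mid]) / 2.0;
        return (median, recorded[0], recorded[samples - 1]);
    }
}
```

A caller would typically sleep for JitSleepMs after the serializer's own warmup and then call RunTimed with a delegate that executes the whole iteration batch once.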
Bench result (3 consecutive runs, 2026-05-07 17:00:32 / 17:01:03 / 17:01:32, FastestByte mode, FastMode preset)
| Cell | AcBinary Ser median (3 runs) | Inter-run spread | Intra-cell range |
|---|---|---|---|
| Small | 7.09 / 6.83 / 6.55 | 7.6% | ~8% (noise floor: 1000×6ns measured) |
| Medium | 18.74 / 18.90 / 19.22 | 2.6% | ~10% |
| Large | 140.20 / 141.67 / 141.02 | 1.0% | ~3% |
| Repeated | 26.52 / 26.25 / 26.28 | 0.3% | ~6% |
| Deep Nested | 23.44 / 23.17 / 22.70 | 3.2% | ~7% |
The previous +20pp / -10pp spread in the summary total shrank to 1-3pp on the medium/large cells. The Small cell remains noisy (~8% relative), but this is a physical floor: 1000 iter × 6 ns/op = 6 µs total batch; below this, Stopwatch resolution and OS spikes dominate the measurement.
The (min..max) range is consistently 3-10% relative — a measurable signal floor: 1-3% perf-deltas no longer disappear into noise.
Lessons
- Bench stabilization is a precondition for perf optimization, not a consequence. Optimization decisions (e.g. V4N4 method-split, V4N2 Phase 2.5) can only be derived from bench numbers if the noise floor < expected signal. Without that, the bench numbers mean nothing.
- Cache pollution (warmup-all → bench-all flow) was the single largest noise source: per-serializer warmup separation alone removed ~10pp of variance.
- Platform stabilization (CPU pin + high priority) combined with heap stabilization (GC.Collect + pilot discard) further tightened the range.
- AOT and JIT have different stabilization needs: the 3000 ms blind sleep was idle time on AOT; mode-aware sleep pays the cost only when needed.
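The platform stabilization from step 5 (CPU pin + high priority with restore and a soft failure path) might look roughly like the following. It is a sketch under assumptions: PinnedRunSketch and RunPinned are hypothetical names, and the real RunBenchmark may structure the guard and the warning differently.

```csharp
using System;
using System.Diagnostics;

public static class PinnedRunSketch
{
    // Pins the process to CPU 0 and raises its priority for the duration of benchmarkPhase,
    // then restores whatever was successfully changed.
    public static void RunPinned(Action benchmarkPhase)
    {
        using var process = Process.GetCurrentProcess();
        IntPtr? savedAffinity = null;
        ProcessPriorityClass? savedPriority = null;

        try
        {
            // ProcessorAffinity is only supported on Windows and Linux.
            if (OperatingSystem.IsWindows() || OperatingSystem.IsLinux())
            {
                var current = process.ProcessorAffinity;
                process.ProcessorAffinity = (IntPtr)0x1;
                savedAffinity = current;
            }

            var priority = process.PriorityClass;
            process.PriorityClass = ProcessPriorityClass.High;
            savedPriority = priority;
        }
        catch (Exception ex)
        {
            // Locked-down hosts: warn and continue with default scheduling.
            Console.Error.WriteLine($"Warning: could not pin CPU / raise priority: {ex.Message}");
        }

        try
        {
            benchmarkPhase();
        }
        finally
        {
            try
            {
                if (savedPriority.HasValue) process.PriorityClass = savedPriority.Value;
                if (savedAffinity.HasValue && (OperatingSystem.IsWindows() || OperatingSystem.IsLinux()))
                    process.ProcessorAffinity = savedAffinity.Value;
            }
            catch (Exception ex)
            {
                Console.Error.WriteLine($"Warning: could not restore scheduling settings: {ex.Message}");
            }
        }
    }
}
```

Recording the saved value only after the setter succeeds means the finally block never tries to "restore" a setting that was never changed.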
Re-evaluation list (entries currently Reverted or unmeasurable)
The stabilization opens a follow-up sprint: the Reverted (2026-05-07) entries are re-evaluable now that the noise floor < the expected 1-3% signal:
- ACCORE-BIN-T-V4N4: method-split (writer + reader hot path) is re-testable
- ACCORE-BIN-T-V4N2 (Phase 2.5): UTF-8 do-while runs (2-byte / 3-byte) per charset
- ACCORE-BIN-T-S6F2: Small fast path (was integrated into V4N4)
Per-entry re-evaluation is the next sprint's task, NOT part of this Closed entry.
Why P1
- Blocked all sub-3% perf optimization work (every recent attempt fell into the noise band)
- One-line user complaint ("the summary total fluctuated between +20 and -10") summarized weeks of unproductive bench-driven investigation
- One-time fixed cost; every future bench run benefits
Follow-up: adaptive iteration + CV reporting + per-cell A/B mode (2026-05-07, second commit window)
After the initial 6-step landing, three additional refinements were added in a second commit window the same day. The trigger was a Copilot-suggested noise-reduction list against the now-stable bench output:
1. Per-cell adaptive iteration: fixed TestIterations = 1000 produced sample windows from 6 ms (Small cell @ 6 µs/op) to 140 ms (Large cell @ 140 µs/op). The Small cell at 6 ms remained the dominant residual noise source (7.6% inter-run spread vs ≤3.2% on the other cells) because OS-level spikes (preempt + IRQ + scheduler tick) are absolute-time events; on a 6 ms sample window their relative contribution is huge.
Implementation:
- New constant TargetSampleMs = 250 (per-sample wall-clock target)
- New helper CalibrateIterations(Action, int targetMs): runs a 100-iter probe post-warmup, computes iterPerMs, and rounds up to the nearest 1000. Floor 1000, ceiling 200_000.
- RunBenchmarksForTestData calibrates Ser and Des INDEPENDENTLY per serializer (different per-op cost). RT-only rows (NamedPipe) get a single RT calibration.
- New BenchmarkResult fields: SerializeIterations, DeserializeIterations, RoundTripIterations (per-row).
- New helpers: ToPerOpMicros(double, int) (replaces 1-arg variant), SerPerOp(r) / DesPerOp(r) / RtPerOp(r) for per-op µs from the result.
- All Average(r => r.*TimeMs) and OrderBy(r => r.RoundTripTimeMs) call-sites refactored to use per-op µs (iter-independent); mixing batch-time across rows with different iter counts would be meaningless. ~20 call-sites total.
- RT for in-mem rows synthesized so RtPerOp(r) == SerPerOp(r) + DesPerOp(r) regardless of serIter != desIter: RoundTripIterations = max(serIter, desIter), RoundTripTimeMs = rtPerOpMicros / 1000 * RoundTripIterations.
Expected impact: Small cell sample window 6 ms → ~240 ms; inter-run spread 7.6% → ~1-2% (matching the other cells). Total suite duration ~50 s → ~110-130 s.
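The calibration rule (100-iter probe, iterPerMs, round up to the nearest 1000, clamp to 1000..200_000) can be sketched as below. The CalibrationSketch class and the IterationsFor split are hypothetical; separating the pure arithmetic from the timed probe is a choice made here for testability, not something the notes describe.

```csharp
using System;
using System.Diagnostics;

public static class CalibrationSketch
{
    public const int TargetSampleMs = 250;
    private const int ProbeIterations = 100;
    private const int MinIterations = 1_000;
    private const int MaxIterations = 200_000;

    // Pure arithmetic: probe duration (ms for 100 iterations) -> iteration count for one sample.
    public static int IterationsFor(double probeElapsedMs, int targetMs)
    {
        if (probeElapsedMs <= 0) return MaxIterations;      // too fast to time: use the ceiling

        var iterPerMs = ProbeIterations / probeElapsedMs;
        var raw = iterPerMs * targetMs;
        if (raw >= MaxIterations) return MaxIterations;     // also avoids overflow on tiny probes

        var rounded = (long)Math.Ceiling(raw / 1000.0) * 1000; // round UP to the nearest 1000
        return (int)Math.Clamp(rounded, MinIterations, MaxIterations);
    }

    // Timed probe; expected to run after the serializer's own warmup.
    public static int CalibrateIterations(Action op, int targetMs = TargetSampleMs)
    {
        var sw = Stopwatch.StartNew();
        for (var i = 0; i < ProbeIterations; i++) op();
        sw.Stop();
        return IterationsFor(sw.Elapsed.TotalMilliseconds, targetMs);
    }

    // Batch wall-clock time (ms) -> per-operation time (µs), independent of the iteration count.
    public static double ToPerOpMicros(double batchMs, int iterations) => batchMs * 1000.0 / iterations;
}
```

As a worked case with assumed numbers: a probe that takes 0.6 ms for 100 iterations (about 6 µs/op) gives 100 / 0.6 × 250 ≈ 41,667, which rounds up to 42,000 iterations, i.e. a sample window of roughly 252 ms.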
2. CV (coefficient of variation) reporting + unstable-row marker — the median + (min..max) range surfaces shape but not a single-number stability metric. The CV (= stddev/mean) is the standard statistical measure; rows with CV > threshold are flagged with a ⚠️ suffix in the markdown output so a small inter-engine delta on a high-CV row is immediately obvious as noise-suspect.
Implementation:
- New constant UnstableCVThreshold = 0.03 (3%, reasonable for stabilized in-memory benchmarks)
- RunTimed return tuple extended: (median, min, max, stddev). Stddev computed over the (samples − pilot) population using Math.Sqrt(Math.Max(0, E[X²] - E[X]²)).
- New BenchmarkResult fields: SerializeTimeStdDevMs, DeserializeTimeStdDevMs, RoundTripTimeStdDevMs.
- FormatMicrosWithRange extended: 26.86 (24.50..29.10) stays the default; 26.86 (24.50..29.10) ⚠️5.2% appears when CV exceeds the threshold.
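The CV computation and the unstable-row marker can be illustrated as follows. This is a sketch only: CvSketch and the parameter list of FormatMicrosWithRange are assumptions (the real helper takes BenchmarkResult fields); the population stddev formula and the 3% threshold are taken from the list above.

```csharp
using System;
using System.Globalization;

public static class CvSketch
{
    public const double UnstableCVThreshold = 0.03;

    // Population mean and stddev via sqrt(max(0, E[X^2] - E[X]^2)).
    // The Max(0, ...) guards against a tiny negative variance caused by rounding.
    public static (double Mean, double StdDev) MeanAndStdDev(ReadOnlySpan<double> samples)
    {
        double sum = 0, sumSq = 0;
        foreach (var x in samples)
        {
            sum += x;
            sumSq += x * x;
        }

        var mean = sum / samples.Length;
        var variance = Math.Max(0, sumSq / samples.Length - mean * mean);
        return (mean, Math.Sqrt(variance));
    }

    // Renders "median (min..max)" and appends the warning marker plus CV% when CV > threshold.
    public static string FormatMicrosWithRange(double median, double min, double max, double mean, double stdDev)
    {
        var ci = CultureInfo.InvariantCulture;
        var text = string.Format(ci, "{0:F2} ({1:F2}..{2:F2})", median, min, max);

        var cv = mean > 0 ? stdDev / mean : 0;
        return cv > UnstableCVThreshold
            ? text + " ⚠️" + (cv * 100).ToString("F1", ci) + "%"
            : text;
    }
}
```

The single-pass E[X²] - E[X]² form is adequate for ten samples of similar magnitude; for long or widely ranging series a two-pass or Welford-style computation would be numerically safer.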
3. Per-cell A/B mini-suite filter — optimization-iteration loops often need only one specific cell (e.g. "tuning the Repeated cell for Hungarian charset"). The full 5-cell × 2-engine × 4-measurement suite is overkill for that.
Implementation:
- FilterByLayer extended: new small / medium / large / repeated / deep modes (case-insensitive prefix match on TestDataSet.Name)
- TryParseCliArgs recognizes the new tokens: dotnet run -- repeated runs only the Repeated Strings cell
- fastestbyte mode (existing; only AcBinary FastMode + MemoryPack head-to-head) is orthogonal and stacks: dotnet run -- repeated fastestbyte
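The cell filter amounts to a case-insensitive prefix match on the cell name. The sketch below shows only that idea; the TestDataSet record is a hypothetical one-property stand-in, CellFilterSketch.FilterByCell is an illustrative name, and the cell names other than "Repeated Strings" are assumed.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical minimal stand-in for the harness type; only Name matters here.
public sealed record TestDataSet(string Name);

public static class CellFilterSketch
{
    private static readonly string[] CellTokens = { "small", "medium", "large", "repeated", "deep" };

    // Returns all cells when no cell token is present; otherwise only the cells whose
    // Name starts with the token (case-insensitive). Other tokens such as "fastestbyte" are ignored here.
    public static IReadOnlyList<TestDataSet> FilterByCell(IEnumerable<TestDataSet> cells, string[] args)
    {
        var token = args.FirstOrDefault(a => CellTokens.Contains(a, StringComparer.OrdinalIgnoreCase));
        if (token is null) return cells.ToList();

        return cells
            .Where(c => c.Name.StartsWith(token, StringComparison.OrdinalIgnoreCase))
            .ToList();
    }
}
```

Because unknown tokens are simply skipped, an engine filter like fastestbyte can be parsed independently and combined with the cell filter.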
Markdown output schema change
The ## Results table gains an Iter Ser/Des column at the right edge — visible verification that each row's batch landed near the TargetSampleMs window. RT-only rows show a single Iter value (the RT calibration count); in-mem rows show serIter / desIter.
Header line updated:
- Before: Iterations: 1000 | Warmup: 10000 | Samples: 10 (median) | ...
- After: Iterations: per-cell adaptive (target ~250 ms/sample) | Warmup: 10000 | Samples: 10 (median) + 1 pilot discarded | ... | UnstableCV threshold: 3%
ACCORE-BIN-T-K7M3: Hot-path UTF-8 transcoder switch — Utf8Transcoder → BCL Utf8.FromUtf16 / Utf8.ToUtf16
Priority: P1 · Type: Performance · Status: Closed (2026-05-08) · Related: ACCORE-BIN-T-V4N3 (custom transcoder origin), ACCORE-BIN-T-V4N2 (Phase 3 SIMD multi-byte), ACCORE-BIN-T-V4N4 (Reverted method-split), ACCORE-BIN-T-D9X3 (bench stabilization that made the comparison measurable)
The custom Utf8Transcoder (V4N3) was originally implemented to bypass System.Text.Encoding.UTF8.GetBytes virtual-dispatch + EncoderFallback overhead. The V4N3 audit measured wins vs. the legacy Encoding.UTF8 API. What it did NOT measure: the modern System.Text.Unicode.Utf8.FromUtf16 / Utf8.ToUtf16 API (.NET 7+, tier-1 optimized, used by MemoryPack WriteUtf8 / ReadUtf8 paths internally). Once the bench stabilized (D9X3), a direct A/B comparison surfaced that the BCL modern API consistently outperforms the custom transcoder on the binary serializer's hot path.
Bench A/B (Latin1Long charset, FastMode SGen Compact)
| Cell | Ser delta vs MemPack, custom (EncodeUtf8SinglePass) | Ser delta vs MemPack, BCL (Utf8.FromUtf16) | Improvement |
|---|---|---|---|
| Small | +28.5% | +7.3% | -21pp |
| Medium | +23.8% | +3.1% | -21pp |
| Large | +19.6% | +5.1% | -14pp |
| Repeated | +28.8% | +10.9% | -18pp |
| Deep | +23.1% | +0.6% | -22pp |
| Cell | Deser delta vs MemPack, custom (DecodeUtf8SinglePass) | Deser delta vs MemPack, BCL (Utf8.ToUtf16) | Improvement |
|---|---|---|---|
| Small | +17.6% | -1.2% (parity) | -19pp |
| Medium | +12.8% | -4.7% (AcBinary wins) | -17pp |
| Large | +4.9% | -10.3% (AcBinary wins) | -15pp |
| Repeated | +16.9% | -1.6% (parity) | -18pp |
| Deep | +7.0% | -9.0% (AcBinary wins) | -16pp |
The Deser side flipped from "consistently behind" to "wins on 3 of 5 cells, parity on 2". The Ser side closed the deficit from +20-29% to 0-11%. Both sides show a measurable improvement on every cell.
Why the custom transcoder lost
The V4N3 implementation included a 4-tier SIMD ASCII prefix path (Vector512BW / Vector256 / Vector128 / scalar) plus a DWORD ASCII batch + scalar 4-branch multi-byte fallback. All correct, all SIMD-tuned. But:
- Utf8.FromUtf16 is also SIMD-tuned in .NET 9: the .NET team rewrote it on top of System.Text.Unicode.Utf8 primitives that share infrastructure with Ascii.IsValid / Latin1.GetString. AOT-publish-friendly, branch-friendly, no virtual dispatch (the Utf8 API is static, not via an Encoding instance with virtual-method-table).
- The custom transcoder's ASCII prefix path bails out on first non-ASCII byte: on multi-byte content (Latin extended / Cyrillic / CJK) the SIMD path runs only for the leading ASCII span, then the entire remainder falls into per-char scalar 4-branch dispatch. The BCL Utf8.FromUtf16 SIMD-batches multi-byte content too (different algorithm; the BCL doesn't bail on first non-ASCII).
- AOT inline budget: the custom transcoder's body grew with the V4N3 / V4N4 / V4N5 additions; in NativeAOT publish the call sites in WriteStringWithDispatch / ReadString* did NOT inline (V4N4 disasm audit confirmed). The BCL Utf8.FromUtf16 is a single static method with a tighter call-site footprint.
Resolution
Landed 2026-05-08. The 8 production hot-path call sites of Utf8Transcoder.* switched to BCL:
| File / line | Before | After |
|---|---|---|
| AcBinarySerializer.cs:120 | Utf8Transcoder.GetUtf8ByteCount | Encoding.UTF8.GetByteCount |
| AcBinarySerializer.BinarySerializationContext.cs:694 | Utf8Transcoder.EncodeUtf8SinglePass | Utf8.FromUtf16(...) |
| AcBinarySerializer.BinarySerializationContext.cs:784 | Utf8Transcoder.EncodeUtf8SinglePass | Utf8.FromUtf16(...) |
| AcBinarySerializer.BinarySerializationContext.cs:901 | Utf8Transcoder.EncodeUtf8SinglePass | Utf8.FromUtf16(...) |
| AcBinaryDeserializer.BinaryDeserializationContext.Read.cs:523 | Utf8Transcoder.CountUtf8Chars | Encoding.UTF8.GetCharCount |
| AcBinaryDeserializer.BinaryDeserializationContext.Read.cs:527 | Utf8Transcoder.DecodeUtf8SinglePass | Utf8.ToUtf16(...) |
| AcBinaryDeserializer.BinaryDeserializationContext.Read.cs:565 | Utf8Transcoder.DecodeUtf8SinglePass | Utf8.ToUtf16(...) |
| PropertyMetadataBase.cs:104-109 (ctor-once) | Utf8Transcoder.GetUtf8ByteCount + EncodeUtf8SinglePass (two-pass) | Encoding.UTF8.GetBytes(string) (single-pass with exact-size byte[] return) |
The count-only call sites (GetByteCount / GetCharCount) stay on the legacy Encoding.UTF8 API — System.Text.Unicode.Utf8 has no count-only equivalent (only FromUtf16 / ToUtf16 which encode + count combined). For pure count, the legacy API is the optimal tool (single SIMD-tuned scan, no encode/decode work).
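For readers unfamiliar with the static API, the following sketch shows the two BCL calls and the count-only Encoding.UTF8 call side by side. It is a standalone illustration, not the serializer's call-site code: BclUtf8Sketch, Encode and Decode are hypothetical names, and the pooled worst-case buffer is an assumption made to keep the example self-contained.

```csharp
using System;
using System.Buffers;
using System.Text;
using System.Text.Unicode;

public static class BclUtf8Sketch
{
    // Encode with the static Utf8.FromUtf16 API into a rented worst-case buffer.
    public static byte[] Encode(string value)
    {
        // UTF-8 needs at most 3 bytes per UTF-16 code unit (a surrogate pair is 2 units -> 4 bytes).
        var rented = ArrayPool<byte>.Shared.Rent(value.Length * 3);
        try
        {
            var status = Utf8.FromUtf16(value, rented, out _, out var bytesWritten);
            if (status != OperationStatus.Done)
                throw new InvalidOperationException($"Unexpected transcoding status: {status}");
            return rented.AsSpan(0, bytesWritten).ToArray();
        }
        finally
        {
            ArrayPool<byte>.Shared.Return(rented);
        }
    }

    // Decode: count-only pass via Encoding.UTF8, then the static Utf8.ToUtf16 transcoder.
    public static string Decode(ReadOnlySpan<byte> utf8)
    {
        var charCount = Encoding.UTF8.GetCharCount(utf8);
        var chars = new char[charCount];
        var status = Utf8.ToUtf16(utf8, chars, out _, out var charsWritten);
        if (status != OperationStatus.Done)
            throw new InvalidOperationException($"Unexpected transcoding status: {status}");
        return new string(chars, 0, charsWritten);
    }
}
```

Both static methods return an OperationStatus instead of throwing, and by default replace invalid sequences with U+FFFD, so a hot path that reserves worst-case capacity up front only has to check for Done.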
The Utf8Transcoder.cs file remains in the repo but fully commented out — the class definition is preserved as historical reference / future reactivation if a workload ever surfaces where it could win again. Utf8TranscoderTests.cs is not currently exercising live code.
Lesson — the V4N3 audit's blind spot
The V4N3 (custom transcoder) audit compared against legacy Encoding.UTF8.GetBytes and won. The audit did NOT compare against Utf8.FromUtf16 (the modern API, .NET 7+). On modern runtime the BCL has two UTF-8 transcoders: a legacy one (instance-method on Encoding, virtual dispatch) and a modern one (static Utf8.FromUtf16 / Utf8.ToUtf16). MemoryPack uses the modern one — that's what we should have been comparing against from the start.
Generalizable lesson: when measuring a custom implementation against a "BCL baseline", verify which BCL API is used by the actual competition (here: MemoryPack source-gen). The Encoding.UTF8.* instance API and System.Text.Unicode.Utf8 static API are different generations of the same logical operation; treating them as interchangeable hides the comparison's scope.
Why P1
- Closed the FastMode Compact mode Ser deficit from +20-29% to ≤11% on every cell (Latin1Long benchmark)
- Flipped the Deser side from a +5-18% deficit to AcBinary winning on 3 of 5 cells, parity on 2 (Latin1Long benchmark)
- One-time fixed cost (replacing 8 production call sites); every future bench benefits
- Removed a load-bearing ~600-line custom SIMD module from the maintained surface area; future maintainers don't need to reason about Vector512BW / cross-lane shuffle / 5-popcount surrogate-pair correctness — the BCL handles it
Follow-up — Utf8Transcoder.cs cleanup
The file is fully commented out. Either:
- Delete entirely (preferred for repo cleanliness); Utf8TranscoderTests.cs then needs deletion or revival as a regression-only guard
- Keep the comment-block as historical reference, with a header comment pointing to this entry
Decision deferred — the comment-block does no harm to build / runtime. Address when the next docs-archive sweep runs.
ACCORE-BIN-T-P3X7: Profile-driven Compact-mode Ser optimization roadmap (post-K7M3 hot-path analysis)
Priority: P2 · Type: Performance roadmap · Status: Open · Related: ACCORE-BIN-T-K7M3 (BCL UTF-8 transcoder switch; prerequisite for this entry), ACCORE-BIN-T-D9X3 (bench stabilization), ACCORE-BIN-T-S2X9 (markerless schema lane; primitive property markers are already phased out in SGen), ACCORE-BIN-T-V4N4 (audit methodology reference)
A VS Performance Profiler session on 2026-05-08 (4 sec range, AcBinary FastMode Serialize, Latin1Long charset, FastWire mode) produced a concrete hot-path decomposition of the state after the K7M3 BCL switch. String encoding is no longer the bottleneck (Utf8.FromUtf16 is SIMD-tuned), so the remaining AcBinary-specific overhead can now be identified.
Profile session data (Self CPU%)
| Self CPU% | Function | Category |
|---|---|---|
| 39.77% | System.Buffer._Memmove | Shared with MemPack (UTF-16 raw + return-time byte[] copy); NOT AcBinary-specific |
| 10.03% | AcBinarySerializer.Serialize<T> | Top-level (context-acquire, type lookup, return-alloc) |
| 7.48% | TestMeasurementPoint_GeneratedWriter.WriteProperties | SGen template (smallest leaf type, ~12500 calls on the Large cell) |
| 5.31% | WriteStringWithDispatch | String hot path |
| 3.23% | TestMeasurement_GeneratedWriter.WriteProperties | SGen |
| 1.66% | WriteVarUIntMultiByteUnsafe | VarUInt int-property encode |
| 1.10% | TestPallet_GeneratedWriter.WriteProperties | SGen |
| 0.39% | TestOrderItem_GeneratedWriter.WriteProperties | SGen |
| 0.32% | SharedUser_GeneratedWriter.WriteProperties | SGen |
| 0.05% | ArrayBinaryOutput.Grow | Buffer grow (rare, minor issue) |
Total SGen WriteProperties Self CPU: ~12.6%, the largest AcBinary-specific surface.
Line-level drill-down of AcBinarySerializer.Serialize<T> (AcBinarySerializer.cs:312-335):
- WriteObject(value, wrapper, context, 0) Total: 28.05% (the whole serialization tree: SGen + Writer hot path)
- context.Output.ToArray(context._buffer, context._position) Total: 47.37% (final byte[] alloc + content memcpy; this accounts for most of the 39.77% _Memmove Self)
MemPack comparison (for reference)
How MemPack's Serialize<T>(T value) works:
- [ThreadStatic] writer-state: no pool rental, no lock, no concurrent dictionary lookup
- ReusableLinkedArrayBufferWriter: linked chunk list (4 KB → 8 KB → 16 KB, geometric); a buffer grow adds a new chunk, with no memcpy of the old data
- ToArrayAndReset(): at the end, alloc + chunks → byte[] memcpy (overhead shared with AcBinary)
AcBinary uses AcquireArrayOutputContext(options) pool rental + linear byte[] Array.Resize + Output.ToArray(...): two memcpy costs (grow + return), but the grow is rare.
Prioritized optimization ideas
A. SGen WriteProperties: ensure-capacity batching (expected: -1-3pp Ser, revised estimate)
Current SGen template per-property emit (a separate ensure for each write):
context.WriteVarInt(obj.Id); // ensure(5) + write(1-5)
context.WriteByte(BinaryTypeCode.Object); // ensure(1) + write(1)
context.WriteVarInt((int)obj.Status); // ensure(5) + write(1-5)
context.WriteRaw(obj.Weight); // ensure(8) + write(8)
Grouped ensure pattern:
context.EnsureCapacity(maxBytesForGroup); // worst-case sum, one call
context.WriteVarIntUnsafe(obj.Id); // no ensure (buffer write only)
context.WriteByteUnsafe(BinaryTypeCode.Object); // no ensure
context.WriteVarIntUnsafe((int)obj.Status);
context.WriteRawUnsafe(obj.Weight);
The WriteProperties template in AcBinarySourceGenerator.cs needs to change:
- Extract contiguous primitive groups from the property list (a group ends at an Object/Collection property: deep recursion, size not computable in advance)
- Compute the worst-case size per group at compile time (primitive type sizes are fixed or have a known worst case)
- Emit a single EnsureCapacity(sum) followed by bulk *Unsafe writes
Required *Unsafe writers: WriteVarUIntUnsafe already exists. WriteByteUnsafe and WriteRawUnsafe<T> probably need to be added to BinarySerializationContext.
Estimate revision (2026-05-08): the original -4-6pp estimate was an upper bound. An inlined EnsureCapacity costs ~1-2 ns/call (branch prediction is perfect on the hot path, which never reaches Grow). 10 properties × 1.5 ns = ~15 ns / object saved by batching. For the Latin1Long Large cell: 1250 instances × 13 ns = ~16 µs out of 120 µs Ser ≈ ~13% upper bound, and that counts only the reduction in the number of ensure calls. The 12.6% Self CPU of SGen WriteProperties is NOT only ensure checks; it also includes the HasPropertyFilter branch check, the null-check + depth-check dispatch, the Unsafe.As<T> cast, etc. (see F). On its own, ensure-batching realistically yields a 1-3pp Ser improvement.
Wire format unchanged, backward compatible, low risk. The effect should be measurable on every cell (the TestOrder cell structure has ~100+ primitive properties per Object instance).
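The grouped-ensure idea can be sketched with a toy writer. Everything below is illustrative: ToyWriter and EnsureBatchingSketch are hypothetical stand-ins, not the real BinarySerializationContext or generated code, and the VarInt layout (ZigZag + 7-bit groups) is an assumption consistent with the byte examples elsewhere in this document.

```csharp
using System;
using System.Buffers.Binary;

// Toy stand-in for the writer context; NOT the real BinarySerializationContext.
public sealed class ToyWriter
{
    private byte[] _buffer = new byte[16];
    private int _position;

    // Counts bounds checks so the two emit styles can be compared.
    public int EnsureCalls { get; private set; }

    public void EnsureCapacity(int additional)
    {
        EnsureCalls++;
        if (_buffer.Length - _position < additional)
            Array.Resize(ref _buffer, Math.Max(_buffer.Length * 2, _position + additional));
    }

    // Checked writers: one ensure per call (the current per-property emit).
    public void WriteVarInt(int value) { EnsureCapacity(5); WriteVarIntUnsafe(value); }
    public void WriteRaw(double value) { EnsureCapacity(8); WriteRawUnsafe(value); }

    // Unchecked writers: the caller must have reserved enough space.
    public void WriteVarIntUnsafe(int value)
    {
        uint v = (uint)((value << 1) ^ (value >> 31)); // ZigZag
        while (v >= 0x80) { _buffer[_position++] = (byte)(v | 0x80); v >>= 7; }
        _buffer[_position++] = (byte)v;
    }

    public void WriteRawUnsafe(double value)
    {
        BinaryPrimitives.WriteInt64LittleEndian(
            _buffer.AsSpan(_position), BitConverter.DoubleToInt64Bits(value));
        _position += 8;
    }

    public byte[] ToArray() => _buffer.AsSpan(0, _position).ToArray();
}

public static class EnsureBatchingSketch
{
    // Worst case for the group { int Id, int Status, double Weight }: 5 + 5 + 8 bytes.
    private const int MaxBytesForGroup = 5 + 5 + 8;

    public static byte[] WritePerProperty(int id, int status, double weight, out int ensureCalls)
    {
        var w = new ToyWriter();
        w.WriteVarInt(id);
        w.WriteVarInt(status);
        w.WriteRaw(weight);
        ensureCalls = w.EnsureCalls;
        return w.ToArray();
    }

    public static byte[] WriteGrouped(int id, int status, double weight, out int ensureCalls)
    {
        var w = new ToyWriter();
        w.EnsureCapacity(MaxBytesForGroup); // single bounds check for the whole group
        w.WriteVarIntUnsafe(id);
        w.WriteVarIntUnsafe(status);
        w.WriteRawUnsafe(weight);
        ensureCalls = w.EnsureCalls;
        return w.ToArray();
    }
}
```

Both methods produce identical bytes; only the number of bounds checks differs (three versus one in this toy group), which is the property the wire-compatibility claim relies on.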
B. WriteStringWithDispatch Compact branch batch-write (expected: -1-2pp Ser)
The FastWire branch was already converted to a single ensure + direct write in K7M3 plus the 2026-05-08 batch-write fix. The Compact branch follows the same 3-step pattern (a post-encode tier-shift CopyTo when actualHeader < reserveHeader, plus a header write based on the tier). Batch-write applies to the Compact branch as well: a single EnsureCapacity sized for the worst-case tier, then a direct header write after Utf8.FromUtf16.
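One possible reading of the Compact batch-write, as a minimal sketch. The wire layout used here ([VarUInt byte-count][UTF-8 bytes]) and the names CompactStringSketch and WriteString are assumptions for illustration; the real Compact header tiers are not specified in this entry. Utf8.FromUtf16 and Span.CopyTo are standard BCL calls.

```csharp
using System;
using System.Buffers;
using System.Text.Unicode;

public static class CompactStringSketch
{
    private static int VarUIntSize(uint v)
    {
        int n = 1;
        while (v >= 0x80) { v >>= 7; n++; }
        return n;
    }

    private static void WriteVarUInt(Span<byte> dest, uint v)
    {
        int i = 0;
        while (v >= 0x80) { dest[i++] = (byte)(v | 0x80); v >>= 7; }
        dest[i] = (byte)v;
    }

    // Writes [VarUInt byteCount][UTF-8 bytes] at buffer[position..] and returns the new position.
    public static int WriteString(ref byte[] buffer, int position, string value)
    {
        int maxBytes = value.Length * 3;                 // worst case: 3 UTF-8 bytes per UTF-16 code unit
        int reserveHeader = VarUIntSize((uint)maxBytes); // header size for the worst-case tier
        int needed = reserveHeader + maxBytes;

        // The single capacity check for header + payload.
        if (buffer.Length - position < needed)
            Array.Resize(ref buffer, Math.Max(buffer.Length * 2, position + needed));

        // Encode straight into the buffer, just after the reserved header.
        Span<byte> payload = buffer.AsSpan(position + reserveHeader, maxBytes);
        OperationStatus status = Utf8.FromUtf16(value, payload, out _, out int written);
        if (status != OperationStatus.Done)
            throw new InvalidOperationException("Unexpected transcoding status: " + status);

        // If the real length needs a smaller header, shift the payload left (CopyTo handles overlap).
        int actualHeader = VarUIntSize((uint)written);
        if (actualHeader < reserveHeader)
            payload.Slice(0, written).CopyTo(buffer.AsSpan(position + actualHeader));

        // Direct header write after encoding.
        WriteVarUInt(buffer.AsSpan(position), (uint)written);
        return position + actualHeader + written;
    }
}
```

For a 50-character ASCII string the reserved header is 2 bytes (worst case 150 bytes) while the actual header is 1 byte, so the shift path is taken once; no second capacity check is needed.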
C. Thread-static context (expected: -2-4pp Ser, LARGE refactor)
The MemPack [ThreadStatic] pattern could reduce the overhead of the AcquireArrayOutputContext(options) pool rental. The current pool rental involves:
- Pool dictionary lookup (possibly under a lock)
- Context-state init / reset on every call
Switching to thread-static:
- Per-thread cached context, no lock
- The context reset on every call stays the same, but the state allocation runs only once
Refactor considerations:
- BinarySerializationContext state storage is not thread-safe on its own; both pool rental and thread-static guarantee single-threaded use
- The options parameter may affect the state-init logic; in a multi-options scenario the thread-static state must be reset
- Concurrent serialize calls (several threads at once): each thread would have its own state, so no cross-thread sharing is needed
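The thread-static variant can be sketched as follows. ToyContext, ThreadStaticContextSketch and the optionsVersion parameter are hypothetical names used only to illustrate the shape; the real context carries far more state.

```csharp
using System;

// Hypothetical, heavily simplified context; the real one holds much more state.
public sealed class ToyContext
{
    public byte[] Buffer = new byte[256];
    public int Position;
    public int OptionsVersion;

    // Reset runs on every acquire, exactly as with pool rental.
    public void Reset(int optionsVersion)
    {
        Position = 0;
        OptionsVersion = optionsVersion;
    }
}

public static class ThreadStaticContextSketch
{
    // One cached context per thread: no lock, no dictionary lookup.
    [ThreadStatic] private static ToyContext t_context;

    public static ToyContext Acquire(int optionsVersion)
    {
        ToyContext ctx = t_context ??= new ToyContext(); // allocated once per thread
        ctx.Reset(optionsVersion);                       // re-initialized for the current options
        return ctx;
    }
}
```

One thing this sketch does not handle is a nested Serialize call on the same thread, which would reset the context while the outer call is still using it; a real implementation would need a guard for that case.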
D. Linked-array buffer chunk strategy (small effect, LARGE refactor)
MemPack's ReusableLinkedArrayBufferWriter uses a linked chunk list in place of the linear byte[] grow strategy. A buffer grow adds a new chunk (no memcpy of the old data).
According to the profile, ArrayBinaryOutput.Grow Self CPU is only 0.05%: buffer grows happen rarely, and the default capacity is large enough for the Large cell. Small effect, large refactor. Low priority.
F. SGen HasPropertyFilter lift-out to the top of the WriteProperties method (expected: -2-4pp Ser)
The current SGen template checks the property filter before every property emit:
public void WriteProperties<TOutput>(object value, ...)
{
    var obj = Unsafe.As<TestPallet>(value);
    if (context.HasPropertyFilter) // ← checked on EVERY property!
    {
        var fc_Category = new BinaryPropertyFilterContext(obj, ..., "Category", ...);
        if (!context.PropertyFilter!(in fc_Category)) {
            context.WriteByte(BinaryTypeCode.PropertySkip);
            goto skip_Category;
        }
    }
    if (obj.Category == null) context.WriteByte(BinaryTypeCode.PropertySkip);
    else if (depth > context.MaxDepth) context.WriteByte(BinaryTypeCode.Null);
    else { context.WriteByte(BinaryTypeCode.Object); ...WriteProperties... }
    skip_Category:;
    if (context.HasPropertyFilter) { /* same for Inspector */ } // ← again!
    // ... repeated 10× across the property list
}
On the TestOrder benchmark workload the per-property HasPropertyFilter branch check is always false (the benchmark uses no property filter). The check still runs on every property: it occupies the code cache and, even though the branch predicts well, it still costs CPU cycles.
Optimization: two-path SGen code generation:
public void WriteProperties<TOutput>(object value, ..., int depth)
{
    var obj = Unsafe.As<TestPallet>(value);
    if (context.HasPropertyFilter)
    {
        WritePropertiesWithFilter(obj, context, depth); // rare path: full per-property check
        return;
    }
    // Fast path: NO filter check anywhere
    if (obj.Category == null) context.WriteByte(BinaryTypeCode.PropertySkip);
    else if (depth > context.MaxDepth) context.WriteByte(BinaryTypeCode.Null);
    else { ... }
    // (no skip_Category goto: never needed)
    context.WriteVarInt(obj.Id); // primitive, no filter check
    // ... rest of properties without HasPropertyFilter check
}
// Separately emitted method for the rare path:
private static void WritePropertiesWithFilter<TOutput>(TestPallet obj, ..., int depth)
{
    // Full per-property filter-aware code (the current behavior)
}
AcBinarySourceGenerator.cs needs to change:
- A single HasPropertyFilter check at the top of the WriteProperties method
- Two different code paths are emitted:
  - Fast path (default, no filter): no per-property if (context.HasPropertyFilter) check, no filter-context allocation + lambda call, no goto skip_X
  - Slow path (filter-aware, separate static method): the current behavior
Expected gain: the fast path removes ~10 checks / object × 1-2 ns / branch ≈ ~15-20 ns / object. For the Latin1Long Large cell: 1250 instances × 18 ns = ~22 µs out of 120 µs Ser ≈ ~18% as an upper estimate; realistically a 2-4pp Ser improvement (tempered by the added code bloat and its effect on JIT inlining).
Can be combined with A: A + F together can target a 3-7pp improvement, a significant reduction of the 12.6% Self CPU of SGen WriteProperties.
Wire format unchanged; code size grows slightly (two paths are generated for every type), but the JIT can inline the fast path more easily.
G. SGen WriteProperties null/depth/object-ref combining (related to F)
For complex (Object) properties, the 3-way dispatch is:
if (obj.X == null) context.WriteByte(BinaryTypeCode.PropertySkip);
else if (depth > context.MaxDepth) context.WriteByte(BinaryTypeCode.Null);
else { context.WriteByte(BinaryTypeCode.Object); X_GeneratedWriter.Instance.WriteProperties(...); }
This runs on every complex property. Possible optimization: turn the depth > MaxDepth check into a method-level branch (check once at the top of the method, then simplify the per-property branch). The effect is small, though, and MaxDepth typically does not come into play (on most workloads depth < MaxDepth).
Low priority, combined with F.
E. WriteVarUIntMultiByteUnsafe (1.66% Self) → fixed-int (expected: -1pp Ser, NOT recommended on its own)
WriteVarInt (signed int property encoding, ZigZag + VarUInt) is common in the SGen templates (Id, Status, TrayCount, etc.). The multi-byte branch accounts for 1.66% Self CPU.
Switching to fixed-int (4 bytes) increases wire size (+3 bytes / property for small ints), which erodes the compactness advantage of the wire format. It is only worthwhile in the context of the ACCORE-BIN-T-S2X9 markerless lane, where removing the property marker compensates for the wire cost of the fixed-int switch.
Shared overhead, NOT AcBinary-specific (cannot be optimized)
The 39.77% Self CPU of Buffer._Memmove and the 47.37% Total of Output.ToArray() are the return-time byte[] allocation + content memcpy, which run on every byte[] Serialize(T) call. Both engines pay this cost (MemPack's ToArrayAndReset() also allocates and memcpys from the chunks). It is unavoidable given the API contract (byte[] Serialize(T)).
Callers who need maximum performance should use the IBufferWriter<byte> overload (AcBinaryBufferWriterBenchmark vs MemoryPackBufferWriterBenchmark is the apples-to-apples pair in the benchmark; both engines do the same thing there).
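Why the IBufferWriter<byte> route avoids the return-time copy can be seen in a small, generic sketch. WriteInt32s is a hypothetical stand-in for a serializer overload (the actual AcBinary overload signature is not shown in this document); IBufferWriter<byte>, ArrayBufferWriter<byte> and BinaryPrimitives are standard BCL types.

```csharp
using System;
using System.Buffers;
using System.Buffers.Binary;

public static class BufferWriterSketch
{
    // Hypothetical stand-in for a serializer overload that targets IBufferWriter<byte>.
    // It writes directly into memory owned by the caller's writer: no final byte[] alloc + memcpy.
    public static void WriteInt32s(IBufferWriter<byte> output, ReadOnlySpan<int> values)
    {
        int byteCount = values.Length * 4;
        Span<byte> span = output.GetSpan(byteCount); // at least byteCount bytes are available
        for (int i = 0; i < values.Length; i++)
            BinaryPrimitives.WriteInt32LittleEndian(span.Slice(i * 4), values[i]);
        output.Advance(byteCount);                   // commit the bytes that were written
    }
}
```

A caller can keep one ArrayBufferWriter<byte> alive, read the result through WrittenSpan, and call Clear() before the next message, so the output buffer is reused across calls instead of being allocated per call.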
Acceptance (per-section)
- A (SGen ensure-batching): Latin1Long FastWire bench, AcBinary Ser delta vs MemPack improves by 1-3pp on every cell
- F (HasPropertyFilter lift-out): Latin1Long Ser delta -2-4pp; with A + F together, SGen WriteProperties Self CPU ≤ 8% (currently ~12.6%)
- G (null/depth/object-ref combining): small effect, combined with F
- B (WriteStringWithDispatch Compact batch-write): Latin1Long Compact bench, AcBinary Ser delta vs MemPack ≤ +5% on every cell
- C (Thread-static context): Serialize<T> Self CPU ≤ 6% (currently ~10%)
- D (Linked-array): not a priority; buffer-grow Self CPU is already ≤ 0.05%
- E (VarInt → fixed-int): measure only in the context of the S2X9 markerless lane sprint
Order
- A + F combined: a comprehensive refactor of the SGen WriteProperties template (ensure-batching + HasPropertyFilter lift-out + possibly the G null/depth combine). Together, a ~3-7pp Ser improvement is expected on every cell. The change is isolated to AcBinarySourceGenerator.cs; wire format unchanged.
- B: ~1-2pp improvement, same pattern as the K7M3 FastWire batch-write
- C: ~2-4pp, but a LARGE refactor (thread safety, review of pool semantics)
- D: low priority (small effect, large refactor)
- E: only in the S2X9 context
Trigger
- A + F → implementable right away; combine them within the SGen template (a single template rework is clearly better than separate refactor rounds). Every further measurement depends on this.
- B → after A+F, applying a similar pattern at another writer site
- C → if the 10% Serialize Self CPU still dominates after A+F+B
- D, E → optional, depending on the A/F/B/C results
ACCORE-BIN-T-Q5T2: Self-describing wire format with duplicated object markers + UTF-16 string marker (per-type/property encoding choice)
Priority: P2 · Type: Architecture / Performance · Status: Open · Related: ACCORE-BIN-T-P3X7 (profile-driven roadmap, small-data slowdown diagnosis), ACCORE-BIN-T-K7M3 (BCL UTF-8 transcoder, a prerequisite), ACCORE-BIN-T-S2X9 (markerless schema lane), ACCORE-BIN-T-V4N2 (UTF-8 SIMD)
This came up during the 2026-05-08 design session as a response to the small-data slowdown problem and to the widespread if (FastWire) / if (UseMetadata) runtime branches. Goal: move the wire mode out of the global header and allow per-object/per-property encoding freedom via attributes, while preserving SGen↔Runtime wire compatibility.
LLM Context (cold-start)
This context is enough for reading in a fresh session:
Wire model: AcBinary runs two parallel serialization paths: SGen (compile-time generated, for [AcBinarySerializable] types) and Runtime (reflection + Expression.Compile). Both produce and read the same wire (interop guarantee, BINARY_SGEN.md "Hybrid Execution Model").
Markerless body: within an object scope, primitive properties (int, long, double, …) are written directly to the wire without a marker byte. The reader knows the order from the compile-time schema (SGen) or from the OrderedProperties metadata (Runtime). The wire starts with an object prefix (1-byte marker), followed by the markerless body.
Existing object-marker family (writers in AcBinarySerializer.BinarySerializationContext.cs + the reader-dispatch switch in AcBinaryDeserializer.cs):
- Object: plain first occurrence
- ObjectWithTypeName: polymorphic (runtimeType != declaredType)
- ObjectFullMarkerIId / ObjectFullMarkerAll: RefHandling=IId|All first occurrence
- ObjectRef / ObjectRefIId: subsequent occurrence (ID only, NOT duplicated, since there are no primitive properties around it)
OPT-OUT pattern (current convention): by default SGen is flexible and generates every runtime branch (e.g. if (context.UseRefHandling)). A class attribute disables the feature → SGen omits the branch → a drastic optimization. Q5T2 extends this pattern to encoding choice.
Naming convention: PascalCase, suffix variants (Object → ObjectVarUInt, String → StringUtf16). NOT Object_NoZZ, NOT ObjVU.
Motivation
The current AcBinaryOptions.WireMode (FastMode vs Compact) is a payload-level global flag:
- The code has many if (FastWire) { ... } else { ... } branches (see WriteVarInt line 514, WriteStringWithDispatch, WriteValueNonPrimitive, property writers)
- Developers cannot optimize at a granular level (e.g. [NoZZ] on one hot type, default for the rest)
- From a schema-evolution standpoint: if the server changes an attribute on a type, clients (even older versions) must be able to read the new wire without recompiling
According to the ACCORE-BIN-T-P3X7 profile-bench measurement, the small-data slowdown (Latin1Long Small +2.6%, Medium +1.5% AcBinary slowdown relative to MemPack) comes in large part from per-call VarUInt overhead (ZigZag shift + multi-byte branch loop). With a type-level [IntEncoding=VarUInt] attribute, the developer can switch non-negative properties to VarUInt-NoZigZag → the ZigZag shift drops out, a measurable gain on small data.
Wire format design
5 new BinaryTypeCode markers (naming TBD: *VarUInt or *NoZZ suffix, to be finalized at implementation time):
| New marker | Purpose | Where it applies |
|---|---|---|
| ObjectVarUInt | Primitive int/long/enum values in the object scope use NoZigZag VarUInt encoding | plain object first occurrence |
| ObjectWithTypeNameVarUInt | NoZZ variant of the polymorphic first occurrence | when runtimeType != declaredType |
| ObjectFullMarkerIIdVarUInt | NoZZ variant of the RefHandling=IId first occurrence | first occurrence only; subsequent ObjectRefIId unchanged |
| ObjectFullMarkerAllVarUInt | NoZZ variant of the RefHandling=All first occurrence | first occurrence only; subsequent ObjectRef unchanged |
| StringUtf16 | UTF-16 encoded string content (property-level) | anywhere a string property is emitted |
Wire example:
[ObjectVarUInt marker] ← scope-level: int properties use VarUInt-NoZZ
WriteVarUInt(obj.Id) ← markerless body, encoding determined by the marker
WriteVarUInt(obj.Status)
[String marker] UTF-8(obj.Notes) ← default UTF-8
[StringUtf16 marker] UTF-16(obj.Name) ← property-level override
Byte-level example (Order { Id=42, Status=3, Notes="ok" }, class-level IntEncoding=VarUInt):
- Default ZigZag wire: [Object][0x54](VarInt 42, ZigZag: ((42<<1)^(42>>31)) = 84)[0x06](VarInt 3, ZigZag: 6)[String][0x02] 0x6F 0x6B
- New VarUInt wire: [ObjectVarUInt][0x2A](VarUInt 42, raw: 0x2A)[0x03](VarUInt 3, raw: 0x03)[String][0x02] 0x6F 0x6B
- Body order and byte count are unchanged; only the encoding rules differ. Strings remain markered (UTF-8 default here). With a string-encoding override: [StringUtf16][char-count][2 bytes per char].
The wire around primitive properties stays markerless: the body encoding is determined by the object marker, not by a per-property byte. Wire bloat exists only where a marker already exists today (object prefix, string marker).
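The two integer encodings in the byte-level example can be reproduced with a short sketch. It assumes the usual 7-bit-group VarUInt layout (low group first, high bit as continuation flag), which matches the single-byte values shown above; VarIntEncodingSketch and its members are illustrative names, not existing AcBinary APIs.

```csharp
using System;
using System.Collections.Generic;

public static class VarIntEncodingSketch
{
    // VarUInt: 7 payload bits per byte, high bit set on every byte except the last.
    public static byte[] EncodeVarUInt(uint v)
    {
        var bytes = new List<byte>(5);
        while (v >= 0x80) { bytes.Add((byte)(v | 0x80)); v >>= 7; }
        bytes.Add((byte)v);
        return bytes.ToArray();
    }

    // ZigZag maps signed values to unsigned ones: 0, -1, 1, -2, ... become 0, 1, 2, 3, ...
    public static uint ZigZag(int value) => (uint)((value << 1) ^ (value >> 31));

    // Default encoding: ZigZag first, then VarUInt.
    public static byte[] EncodeZigZagVarInt(int value) => EncodeVarUInt(ZigZag(value));
}
```

Under this layout EncodeZigZagVarInt(42) yields the single byte 0x54 and EncodeVarUInt(42) yields 0x2A, as in the example. The byte count is identical here, but ZigZag halves the single-byte range: a non-negative value such as 64 takes two bytes with ZigZag (0x80 0x01) and one byte without it (0x40).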
Attribute design
Object-level (because the object marker is object-level too):
[AcBinarySerializable(IntEncoding = IntEncoding.VarUInt)]
public class Order { ... }
Property-level (strings only, because the string marker is per-property too):
public class Order {
[AcBinaryEncoding(StringEncoding.Utf16)]
public string CustomerName { get; set; }
}
New public API elements:
- AcBinaryEncodingAttribute (target: Class | Property)
- IntEncoding enum (Default = ZigZag VarInt, VarUInt = NoZigZag)
- StringEncoding enum (Default = UTF-8, Utf16 = UTF-16)
- AcBinaryOptions.IntEncoding and AcBinaryOptions.StringEncoding runtime fallback options
Encoding-choice precedence (writer side), strongest first
- Property attribute (strongest), e.g. [AcBinaryEncoding(StringEncoding.Utf16)]
- Class attribute, e.g. [AcBinarySerializable(IntEncoding=VarUInt)]
- AcBinaryOptions runtime option, e.g. options.StringEncoding = Utf16
- Built-in default: ZigZag-VarInt + UTF-8
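A minimal sketch of how the reflection-based writer might resolve this precedence. The enum and attribute shapes follow the names proposed in this entry, but none of them exist yet; ToyOptions, SampleOrder and EncodingResolver are purely illustrative, the class-level setting is modeled with [AcBinaryEncoding] (one of the options listed under the open questions), and treating Default as "not specified" is an assumption.

```csharp
using System;
using System.Reflection;

public enum IntEncoding { Default = 0, VarUInt = 1 }    // Default = ZigZag VarInt
public enum StringEncoding { Default = 0, Utf16 = 1 }   // Default = UTF-8

[AttributeUsage(AttributeTargets.Class | AttributeTargets.Property)]
public sealed class AcBinaryEncodingAttribute : Attribute
{
    public AcBinaryEncodingAttribute() { }
    public AcBinaryEncodingAttribute(StringEncoding stringEncoding) { StringEncoding = stringEncoding; }
    public IntEncoding IntEncoding { get; set; }
    public StringEncoding StringEncoding { get; set; }
}

// Hypothetical stand-in for the two proposed AcBinaryOptions members.
public sealed class ToyOptions
{
    public IntEncoding IntEncoding { get; set; }
    public StringEncoding StringEncoding { get; set; }
}

[AcBinaryEncoding(IntEncoding = IntEncoding.VarUInt)]
public class SampleOrder
{
    public int Id { get; set; }
    [AcBinaryEncoding(StringEncoding.Utf16)]
    public string CustomerName { get; set; } = "";
    public string Notes { get; set; } = "";
}

public static class EncodingResolver
{
    // Property attribute → class attribute → options → built-in default.
    public static StringEncoding ResolveString(PropertyInfo property, ToyOptions options)
    {
        var onProperty = property.GetCustomAttribute<AcBinaryEncodingAttribute>();
        if (onProperty != null && onProperty.StringEncoding != StringEncoding.Default)
            return onProperty.StringEncoding;

        var onClass = property.DeclaringType?.GetCustomAttribute<AcBinaryEncodingAttribute>();
        if (onClass != null && onClass.StringEncoding != StringEncoding.Default)
            return onClass.StringEncoding;

        return options.StringEncoding; // Default here means the built-in UTF-8
    }

    // Class attribute → options → built-in default (int encoding is object-scoped).
    public static IntEncoding ResolveInt(Type type, ToyOptions options)
    {
        var onClass = type.GetCustomAttribute<AcBinaryEncodingAttribute>();
        if (onClass != null && onClass.IntEncoding != IntEncoding.Default)
            return onClass.IntEncoding;

        return options.IntEncoding; // Default here means the built-in ZigZag VarInt
    }
}
```

In a real implementation the result would be cached per type and per property (for example in the type-metadata cache) rather than resolved through reflection on every write.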
Roles and paths
| Path | Encoding choice |
|---|---|
| SGen writer (with attribute) | Compile-time pinned, hard-coded marker + encoding emit (NO runtime branch); the existing OPT-OUT pattern (like the RefHandling/Interning disable) |
| SGen writer (no attribute) | Runtime branch on the context.IntEncoding/context.StringEncoding option; both paths are generated and the runtime decides |
| SGen reader | Marker-dispatch (NOT a hard-coded marker-expect; it decides at runtime whether Object or ObjectVarUInt arrived and reads accordingly) |
| Runtime writer (reflection-based) | Reflection attribute read + option fallback + default fallback; same precedence as for SGen |
| Runtime reader | Marker-dispatch (universal; no attribute or option is consulted for the encoding decision, only the marker byte) |
⚠️ SGen reader marker-dispatch is MANDATORY (NOT a hard-coded marker-expect). The concrete scenario it handles:
The server serializes Order in Runtime mode. The attribute on the Order class has changed since the previous server deploy (a new deploy added [IntEncoding=VarUInt]). The server's Runtime writer reads the new attribute via reflection → emits the ObjectVarUInt marker to the wire.
An old client receives the payload without recompiling. If the client's SGen reader read with a hard-coded Object marker-expect → panic / mismatch.
With marker-dispatch the client decodes either marker correctly, regardless of whether the attribute was present on the client's compile-time Order type.
This is what provides the "server-side attribute-change doesn't break clients" guarantee.
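The scenario above can be illustrated with a toy reader that switches on the marker instead of expecting a fixed one. The marker byte values (0x10, 0x11) are placeholders, since the real BinaryTypeCode values are not allocated yet; MarkerDispatchSketch and ReadOrder are hypothetical names, and only the two int properties of the Order example are read.

```csharp
using System;

public static class MarkerDispatchSketch
{
    // Placeholder marker values; the real BinaryTypeCode slots are not yet allocated.
    public const byte MarkerObject = 0x10;
    public const byte MarkerObjectVarUInt = 0x11;

    // Reads [marker][Id][Status]; the int encoding is chosen by the marker, not by compile-time knowledge.
    public static (int Id, int Status) ReadOrder(ReadOnlySpan<byte> wire)
    {
        int pos = 0;
        byte marker = wire[pos++];
        bool zigZag = marker switch
        {
            MarkerObject => true,          // default scope: ZigZag VarInt
            MarkerObjectVarUInt => false,  // NoZZ scope: raw VarUInt
            _ => throw new InvalidOperationException($"Unexpected marker 0x{marker:X2}")
        };

        int id = ReadInt(wire, ref pos, zigZag);
        int status = ReadInt(wire, ref pos, zigZag);
        return (id, status);
    }

    private static int ReadInt(ReadOnlySpan<byte> wire, ref int pos, bool zigZag)
    {
        uint raw = 0;
        int shift = 0;
        byte b;
        do
        {
            b = wire[pos++];
            raw |= (uint)(b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);

        // Undo ZigZag only when the scope marker says the writer applied it.
        return zigZag ? (int)(raw >> 1) ^ -(int)(raw & 1) : (int)raw;
    }
}
```

A client compiled before the attribute change decodes both [0x10, 0x54, 0x06] and [0x11, 0x2A, 0x03] to the same Order values, which is the behavior the compatibility guarantee depends on.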
Compatibility guarantees
| Interaction | Result |
|---|---|
| SGen-write (NoZZ attr) → SGen-read | OK (marker-dispatch) |
| SGen-write (NoZZ attr) → Runtime-read | OK (marker-dispatch) |
| Runtime-write (option=NoZZ) → SGen-read | OK (marker-dispatch) |
| Runtime-write (option=NoZZ) → Runtime-read | OK (marker-dispatch) |
| Server-attribute-changed → old client (no recompile) | OK: the client only reads the marker |
| Mixed payload (one object NoZZ, another default) | OK: every object marker is its own scope |
Implementation steps
- BinaryTypeCode const extension: 5 new byte values (range allocation: the next free slots, following the existing enum organization). Update the wire-format spec in BINARY_FORMAT.md.
- AcBinaryEncodingAttribute + IntEncoding + StringEncoding enums: new files in the AyCode.Core/Serializers/Binaries/ folder.
- Add the AcBinaryOptions.IntEncoding + AcBinaryOptions.StringEncoding options (default = Default).
- WriteStringUtf16 / ReadStringUtf16 context helpers: MemoryMarshal.Cast<char,byte> direct copy + length prefix (VarUInt char-count).
- Runtime writer reflection: BinarySerializeTypeMetadata cache with IntEncoding and per-property StringEncoding flags (based on attributes). Encoding emit follows the precedence.
- SGen writer template: attribute processing in EmitWriteValue. If an attribute is present → compile-time hard-coded emit; if not → runtime-branch emit on the context option.
- SGen reader template: EmitReadValue with marker-dispatch (object-marker scope-encoding-mode tracking + per-property string-marker dispatch).
- Runtime reader update: object-marker dispatch into the scope-encoding state (e.g. BinaryDeserializationContext.CurrentIntEncoding), plus per-property string-marker dispatch.
- Cross-mode tests: every write-read combination (SGen↔SGen, SGen↔Runtime, Runtime↔SGen, Runtime↔Runtime) in every encoding combination (default, attr-only, option-only, attr+option, mixed payload).
- Doc: BINARY_FORMAT.md wire-format spec, BINARY_OPTIONS.md new options, BINARY_SGEN.md precedence + roles table.
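The WriteStringUtf16 / ReadStringUtf16 step can be sketched as below. This is a standalone illustration, not the planned context helpers: Utf16StringSketch, Write and Read are hypothetical names, the StringUtf16 marker byte is left out, and the payload uses the host byte order (little-endian on common platforms), which a real wire spec would have to pin down.

```csharp
using System;
using System.Runtime.InteropServices;

public static class Utf16StringSketch
{
    // Layout: [VarUInt char-count][2 bytes per char]. No transcoding, just a reinterpreting copy.
    public static byte[] Write(string value)
    {
        ReadOnlySpan<byte> raw = MemoryMarshal.AsBytes(value.AsSpan());

        Span<byte> header = stackalloc byte[5];
        int h = 0;
        uint n = (uint)value.Length;
        while (n >= 0x80) { header[h++] = (byte)(n | 0x80); n >>= 7; }
        header[h++] = (byte)n;

        var result = new byte[h + raw.Length];
        header.Slice(0, h).CopyTo(result);
        raw.CopyTo(result.AsSpan(h));
        return result;
    }

    public static string Read(ReadOnlySpan<byte> wire)
    {
        // Decode the VarUInt char-count prefix.
        int pos = 0;
        uint count = 0;
        int shift = 0;
        byte b;
        do
        {
            b = wire[pos++];
            count |= (uint)(b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);

        // Reinterpret the payload bytes as chars and copy them into a new string.
        ReadOnlySpan<byte> payload = wire.Slice(pos, checked((int)count * 2));
        return new string(MemoryMarshal.Cast<byte, char>(payload));
    }
}
```

The trade-off is the one this entry describes: UTF-16 skips the transcoding step on both sides, at the price of two bytes per char on the wire even for ASCII-only content.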
Acceptance
- 5 new BinaryTypeCode markers, naming convention documented
- AcBinaryEncodingAttribute + 2 enums + 2 option extensions working
- Round-trip tests green in every cross-mode combination
- Wire bloat with default encoding is 0 bytes (no new per-property marker)
- Latin1Long Small bench: on an AcBinary [IntEncoding=VarUInt] type the slowdown vs MemPack is ≤ +0.5pp (currently +2.6%)
- BINARY_FORMAT.md / BINARY_OPTIONS.md / BINARY_SGEN.md in sync with the wire and attribute design
- The existing WireMode=Fast/Compact distinctions remain compatible (or migrate to the new encoding attributes; a separate decision at implementation time)
Trigger / Order
Implementation should not start immediately: sections A+F of ACCORE-BIN-T-P3X7 (SGen ensure-batching + HasPropertyFilter lift-out) must be measured first. If A+F already brings the SGen WriteProperties Self CPU down to ≤ 8% and the small-data slowdown consequently drops to ≤ +1pp, this Q5T2 entry moves to low priority. If the small-data slowdown persists after A+F → implementing Q5T2 is worthwhile.
Other prerequisite: synchronization with ACCORE-BIN-T-W9F1 (compile-time metadata). The Runtime writer's reflection-based attribute read can be folded into the generated metadata, which also speeds up attribute-based encoding choice on the runtime path.
Open questions (to be decided at implementation time)
- Marker naming: ObjectVarUInt (semantic, based on the encoding) or ObjectNoZZ (shorter)?
- Should the IntEncoding parameter be added to [AcBinarySerializable], or should a separate [AcBinaryEncoding] attribute also apply at object level (leaving [AcBinarySerializable] unchanged)?
- Future of AcBinaryOptions.WireMode: should the old Fast/Compact enum migrate to the new IntEncoding/StringEncoding (BC break), or remain as a shortcut default?