
AcBinary: H2Q6 string marker overhaul, 1-pass decode

- Replace FixStr/String with tiered StringSmall/Medium/Big markers for non-ASCII strings (v3 wire format)
- Split StringInternFirst into Small/Medium tiers for interned strings
- Remove all FixStr (non-ASCII) code; FixStrAscii path unchanged
- Writers select smallest tier post-encode; readers use new 1-pass decode helpers
- Update BinaryTypeCode.cs with new marker constants and reservation docs
- Update SGen and all string read/write/skip logic for new markers
- Document marker layout, optimization policy, and endianness caveat in BINARY_FEATURES.md, BINARY_ISSUES.md, BINARY_TODO.md
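
The tier selection in the list above can be sketched as follows. This is a minimal illustration, not the commit's writer: `WriteTiered` is a hypothetical helper, the marker values (91/94/103) are taken from the SGen doc comment in this commit, and the header is written explicitly little-endian via `BinaryPrimitives`.

```csharp
using System;
using System.Buffers.Binary;
using System.Text;

// Marker values as listed in the EmitReadString doc comment (assumed stable).
const byte StringSmall = 91;
const byte StringMedium = 94;
const byte StringBig = 103;

// Hypothetical helper: encode first, then pick the smallest tier that fits utf8Len.
byte[] WriteTiered(string s)
{
    byte[] utf8 = Encoding.UTF8.GetBytes(s);
    int charLen = s.Length;      // UTF-16 code units
    int utf8Len = utf8.Length;   // encoded byte count
    int headerSize;
    byte[] result;

    if (utf8Len <= byte.MaxValue)
    {
        headerSize = 3; // marker + charLen:8 + utf8Len:8
        result = new byte[headerSize + utf8Len];
        result[0] = StringSmall;
        result[1] = (byte)charLen;
        result[2] = (byte)utf8Len;
    }
    else if (utf8Len <= ushort.MaxValue)
    {
        headerSize = 5; // marker + charLen:16 + utf8Len:16
        result = new byte[headerSize + utf8Len];
        result[0] = StringMedium;
        BinaryPrimitives.WriteUInt16LittleEndian(result.AsSpan(1), (ushort)charLen);
        BinaryPrimitives.WriteUInt16LittleEndian(result.AsSpan(3), (ushort)utf8Len);
    }
    else
    {
        headerSize = 9; // marker + charLen:32 + utf8Len:32
        result = new byte[headerSize + utf8Len];
        result[0] = StringBig;
        BinaryPrimitives.WriteUInt32LittleEndian(result.AsSpan(1), (uint)charLen);
        BinaryPrimitives.WriteUInt32LittleEndian(result.AsSpan(5), (uint)utf8Len);
    }

    utf8.CopyTo(result, headerSize);
    return result;
}

byte[] small = WriteTiered("h\u00E9llo"); // 5 chars, 6 UTF-8 bytes
Console.WriteLine(string.Join(",", small[..3]));
```

Selecting the tier by `utf8Len` alone is sufficient because a string's UTF-16 length never exceeds its UTF-8 byte count, so `charLen` always fits in the same field width.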
Loretta 2026-05-07 09:52:10 +02:00
parent abee22b31a
commit fa48596dbf
9 changed files with 786 additions and 278 deletions


@ -1820,37 +1820,63 @@ public class AcBinarySourceGenerator : IIncrementalGenerator
} }
/// <summary> /// <summary>
/// Emits inline string read from type code. Handles all string wire formats: /// Emits inline string read from type code. Handles all H2Q6 (v3 wire format) string markers:
/// FixStr (UTF-8 short, 103-134), FixStrAscii (ASCII short, 135-166), String (UTF-8 long, 91), /// FixStrAscii (ASCII short, 135-166), StringAscii (ASCII long, 167),
/// StringAscii (ASCII long, 167), StringInterned, StringEmpty, StringInternFirst, Null. /// StringSmall/Medium/Big (non-ASCII tiers, 91/94/103),
/// FixStr/FixStrAscii are checked first as hot paths for short strings — ASCII variant /// StringInternFirstSmall/Medium (interning tiers, 104/105),
/// dispatches to <c>ReadAsciiBytesAsString</c> (byte→char widen, no UTF-8 decode). /// StringInterned (cache ref, 92), StringEmpty (93), Null.
///
/// FixStrAscii is checked first as the hot path for short ASCII property names; non-ASCII
/// tier markers carry both <c>charLen</c> and <c>utf8Len</c> in fixed-width headers (1-pass decode).
/// </summary> /// </summary>
private static void EmitReadString(StringBuilder sb, string a, string tc, string i) private static void EmitReadString(StringBuilder sb, string a, string tc, string i)
{ {
// FixStr is the hot path — most strings are short (1-31 bytes, encoded in the type code itself) // FixStrAscii is the hot path — most short strings (property names) are ASCII.
sb.AppendLine($"{i}if (BinaryTypeCode.IsFixStr({tc}))"); sb.AppendLine($"{i}if (BinaryTypeCode.IsFixStrAscii({tc}))");
sb.AppendLine($"{i}{{");
sb.AppendLine($"{i} var flen = BinaryTypeCode.DecodeFixStrLength({tc});");
sb.AppendLine($"{i} {a} = flen == 0 ? string.Empty : context.ReadStringUtf8(flen);");
sb.AppendLine($"{i}}}");
// FixStrAscii — ASCII short strings, byte→char widen path (skips UTF-8 decode).
sb.AppendLine($"{i}else if (BinaryTypeCode.IsFixStrAscii({tc}))");
sb.AppendLine($"{i}{{"); sb.AppendLine($"{i}{{");
sb.AppendLine($"{i} var falen = BinaryTypeCode.DecodeFixStrAsciiLength({tc});"); sb.AppendLine($"{i} var falen = BinaryTypeCode.DecodeFixStrAsciiLength({tc});");
sb.AppendLine($"{i} {a} = falen == 0 ? string.Empty : context.ReadAsciiBytesAsString(falen);"); sb.AppendLine($"{i} {a} = falen == 0 ? string.Empty : context.ReadAsciiBytesAsString(falen);");
sb.AppendLine($"{i}}}"); sb.AppendLine($"{i}}}");
// Switch gives O(1) dispatch via JIT jump table for the long markers. // Switch gives O(1) dispatch via JIT jump table for the remaining markers.
// StringInterned is the hot path for repeated interned strings.
sb.AppendLine($"{i}else switch ({tc})"); sb.AppendLine($"{i}else switch ({tc})");
sb.AppendLine($"{i}{{"); sb.AppendLine($"{i}{{");
sb.AppendLine($"{i} case BinaryTypeCode.StringInterned:"); sb.AppendLine($"{i} case BinaryTypeCode.StringInterned:");
sb.AppendLine($"{i} {a} = context.GetInternedString((int)context.ReadVarUInt());"); sb.AppendLine($"{i} {a} = context.GetInternedString((int)context.ReadVarUInt());");
sb.AppendLine($"{i} break;"); sb.AppendLine($"{i} break;");
sb.AppendLine($"{i} case BinaryTypeCode.String:"); // H2Q6 StringSmall — non-ASCII utf8Len ≤ 255 — wire: [charLen:8][utf8Len:8][bytes], 1-pass decode.
// FastWire mode shares the marker value (=91); reader dispatches by mode.
sb.AppendLine($"{i} case BinaryTypeCode.StringSmall:");
sb.AppendLine($"{i} {{"); sb.AppendLine($"{i} {{");
sb.AppendLine($"{i} var slen = (int)context.ReadVarUInt();"); sb.AppendLine($"{i} if (context.FastWire)");
sb.AppendLine($"{i} {a} = slen == 0 ? string.Empty : context.ReadStringUtf8(slen);"); sb.AppendLine($"{i} {{");
sb.AppendLine($"{i} var fwlen = (int)context.ReadVarUInt();");
sb.AppendLine($"{i} {a} = fwlen == 0 ? string.Empty : context.ReadStringUtf8(fwlen);");
sb.AppendLine($"{i} }}");
sb.AppendLine($"{i} else");
sb.AppendLine($"{i} {{");
sb.AppendLine($"{i} var sshdr = context.ReadTwoBytesUnsafe();");
sb.AppendLine($"{i} var sscharLen = (byte)sshdr;");
sb.AppendLine($"{i} var ssbyteLen = (byte)(sshdr >> 8);");
sb.AppendLine($"{i} {a} = ssbyteLen == 0 ? string.Empty : context.ReadStringUtf8WithCharLen(sscharLen, ssbyteLen);");
sb.AppendLine($"{i} }}");
sb.AppendLine($"{i} break;");
sb.AppendLine($"{i} }}");
// H2Q6 StringMedium — utf8Len ≤ 65535 — single uint read packs charLen:16 + utf8Len:16
sb.AppendLine($"{i} case BinaryTypeCode.StringMedium:");
sb.AppendLine($"{i} {{");
sb.AppendLine($"{i} var smpacked = context.ReadUInt32Unsafe();");
sb.AppendLine($"{i} var smcharLen = (ushort)smpacked;");
sb.AppendLine($"{i} var smbyteLen = (ushort)(smpacked >> 16);");
sb.AppendLine($"{i} {a} = smbyteLen == 0 ? string.Empty : context.ReadStringUtf8WithCharLen(smcharLen, smbyteLen);");
sb.AppendLine($"{i} break;");
sb.AppendLine($"{i} }}");
// H2Q6 StringBig — utf8Len > 65535 — single ulong read packs charLen:32 + utf8Len:32
sb.AppendLine($"{i} case BinaryTypeCode.StringBig:");
sb.AppendLine($"{i} {{");
sb.AppendLine($"{i} var sbpacked = context.ReadUInt64Unsafe();");
sb.AppendLine($"{i} var sbcharLen = (int)(uint)sbpacked;");
sb.AppendLine($"{i} var sbbyteLen = (int)(uint)(sbpacked >> 32);");
sb.AppendLine($"{i} {a} = sbbyteLen == 0 ? string.Empty : context.ReadStringUtf8WithCharLen(sbcharLen, sbbyteLen);");
sb.AppendLine($"{i} break;"); sb.AppendLine($"{i} break;");
sb.AppendLine($"{i} }}"); sb.AppendLine($"{i} }}");
sb.AppendLine($"{i} case BinaryTypeCode.StringAscii:"); sb.AppendLine($"{i} case BinaryTypeCode.StringAscii:");
@ -1859,14 +1885,30 @@ public class AcBinarySourceGenerator : IIncrementalGenerator
sb.AppendLine($"{i} {a} = salen == 0 ? string.Empty : context.ReadAsciiBytesAsString(salen);"); sb.AppendLine($"{i} {a} = salen == 0 ? string.Empty : context.ReadAsciiBytesAsString(salen);");
sb.AppendLine($"{i} break;"); sb.AppendLine($"{i} break;");
sb.AppendLine($"{i} }}"); sb.AppendLine($"{i} }}");
sb.AppendLine($"{i} case BinaryTypeCode.StringInternFirst:"); // H2Q6 interning — Small tier
sb.AppendLine($"{i} case BinaryTypeCode.StringInternFirstSmall:");
sb.AppendLine($"{i} {{"); sb.AppendLine($"{i} {{");
sb.AppendLine($"{i} context.DisableStringCaching();"); sb.AppendLine($"{i} context.DisableStringCaching();");
sb.AppendLine($"{i} var sci = (int)context.ReadVarUInt();"); sb.AppendLine($"{i} var iscIdx = (int)context.ReadVarUInt();");
sb.AppendLine($"{i} var slen2 = (int)context.ReadVarUInt();"); sb.AppendLine($"{i} var ishdr = context.ReadTwoBytesUnsafe();");
sb.AppendLine($"{i} var sv = slen2 == 0 ? string.Empty : context.ReadStringUtf8(slen2);"); sb.AppendLine($"{i} var ischarLen = (byte)ishdr;");
sb.AppendLine($"{i} context.RegisterInternedValueAt(sci, sv);"); sb.AppendLine($"{i} var isbyteLen = (byte)(ishdr >> 8);");
sb.AppendLine($"{i} {a} = sv;"); sb.AppendLine($"{i} var isv = isbyteLen == 0 ? string.Empty : context.ReadStringUtf8WithCharLen(ischarLen, isbyteLen);");
sb.AppendLine($"{i} context.RegisterInternedValueAt(iscIdx, isv);");
sb.AppendLine($"{i} {a} = isv;");
sb.AppendLine($"{i} break;");
sb.AppendLine($"{i} }}");
// H2Q6 interning — Medium tier — single uint header read
sb.AppendLine($"{i} case BinaryTypeCode.StringInternFirstMedium:");
sb.AppendLine($"{i} {{");
sb.AppendLine($"{i} context.DisableStringCaching();");
sb.AppendLine($"{i} var imcIdx = (int)context.ReadVarUInt();");
sb.AppendLine($"{i} var impacked = context.ReadUInt32Unsafe();");
sb.AppendLine($"{i} var imcharLen = (ushort)impacked;");
sb.AppendLine($"{i} var imbyteLen = (ushort)(impacked >> 16);");
sb.AppendLine($"{i} var imv = imbyteLen == 0 ? string.Empty : context.ReadStringUtf8WithCharLen(imcharLen, imbyteLen);");
sb.AppendLine($"{i} context.RegisterInternedValueAt(imcIdx, imv);");
sb.AppendLine($"{i} {a} = imv;");
sb.AppendLine($"{i} break;"); sb.AppendLine($"{i} break;");
sb.AppendLine($"{i} }}"); sb.AppendLine($"{i} }}");
sb.AppendLine($"{i} case BinaryTypeCode.Null:"); sb.AppendLine($"{i} case BinaryTypeCode.Null:");
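
The packed-header reads emitted by this generator can be illustrated in isolation. This is a minimal sketch, assuming the little-endian wire convention stated in this commit. `ReadSmallHeader`, `ReadMediumHeader` and `ReadBigHeader` are hypothetical names, and `BinaryPrimitives` is used here in place of the project's `Unsafe.ReadUnaligned` helpers so the result does not depend on host endianness.

```csharp
using System;
using System.Buffers.Binary;

// Small tier: [charLen:8][utf8Len:8], low byte = charLen, high byte = utf8Len.
(int CharLen, int Utf8Len) ReadSmallHeader(ReadOnlySpan<byte> h)
{
    ushort packed = BinaryPrimitives.ReadUInt16LittleEndian(h);
    return ((int)(byte)packed, (int)(byte)(packed >> 8));
}

// Medium tier: [charLen:16][utf8Len:16] fetched with one 4-byte read.
(int CharLen, int Utf8Len) ReadMediumHeader(ReadOnlySpan<byte> h)
{
    uint packed = BinaryPrimitives.ReadUInt32LittleEndian(h);
    return ((int)(ushort)packed, (int)(ushort)(packed >> 16));
}

// Big tier: [charLen:32][utf8Len:32] fetched with one 8-byte read.
(int CharLen, int Utf8Len) ReadBigHeader(ReadOnlySpan<byte> h)
{
    ulong packed = BinaryPrimitives.ReadUInt64LittleEndian(h);
    return ((int)(uint)packed, (int)(uint)(packed >> 32));
}

// charLen = 200 (0x00C8), utf8Len = 400 (0x0190), both little-endian.
var (chars, bytes) = ReadMediumHeader(new byte[] { 0xC8, 0x00, 0x90, 0x01 });
Console.WriteLine($"{chars} chars, {bytes} bytes");
```

The low half of the packed value is the first field on the wire only under a little-endian read, which is why the endianness caveat matters for the `Unsafe.ReadUnaligned` variants.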


@ -105,6 +105,47 @@ public static partial class AcBinaryDeserializer
return value; return value;
} }
/// <summary>
/// H2Q6 helper: reads 2 bytes as a native-endian <c>ushort</c> (on little-endian hosts: low byte = first byte, high byte = second).
/// Used by <c>StringSmall</c> / <c>StringInternFirstSmall</c> readers to grab <c>charLen:8 | utf8Len:8</c>
/// in a single 2-byte unaligned load + EnsureAvailable.
/// </summary>
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public ushort ReadTwoBytesUnsafe()
{
EnsureAvailable(2);
var value = Unsafe.ReadUnaligned<ushort>(ref _buffer[_position]);
_position += 2;
return value;
}
/// <summary>
/// Reads a 4-byte unsigned integer (little-endian on Intel/AMD, native-endian elsewhere — wire format
/// is little-endian by convention; on big-endian hosts this would need <c>BinaryPrimitives.ReverseEndianness</c>).
/// Used by H2Q6 <c>StringMedium</c> / <c>StringInternFirstMedium</c> readers to grab packed <c>charLen:16 | utf8Len:16</c> in a single load.
/// </summary>
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public uint ReadUInt32Unsafe()
{
EnsureAvailable(4);
var value = Unsafe.ReadUnaligned<uint>(ref _buffer[_position]);
_position += 4;
return value;
}
/// <summary>
/// Reads an 8-byte unsigned integer (little-endian on Intel/AMD, native-endian elsewhere).
/// Used by H2Q6 <c>StringBig</c> reader to grab packed <c>charLen:32 | utf8Len:32</c> in a single load.
/// </summary>
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public ulong ReadUInt64Unsafe()
{
EnsureAvailable(8);
var value = Unsafe.ReadUnaligned<ulong>(ref _buffer[_position]);
_position += 8;
return value;
}
[MethodImpl(MethodImplOptions.AggressiveInlining)] [MethodImpl(MethodImplOptions.AggressiveInlining)]
public char ReadCharUnsafe() public char ReadCharUnsafe()
{ {
@ -492,6 +533,41 @@ public static partial class AcBinaryDeserializer
}); });
} }
/// <summary>
/// H2Q6 1-pass UTF-8 string read — both <paramref name="charLength"/> and <paramref name="byteLength"/>
/// come from the wire (StringSmall/Medium/Big tier headers), eliminating the
/// <see cref="Utf8Transcoder.CountUtf8Chars"/> Pass 1.
/// </summary>
/// <remarks>
/// Wire context: tier markers (StringSmall/Medium/Big, StringInternFirstSmall/Medium) carry the
/// char count alongside the byte count, so this method can <see cref="string.Create{TState}"/>
/// directly with the known target capacity and decode in a single pass through the bytes.
///
/// <para>Compact mode only — FastWire mode never emits H2Q6 tier markers (its
/// <see cref="ReadStringUtf8"/> path handles UTF-16 raw memcpy).</para>
/// </remarks>
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public string ReadStringUtf8WithCharLen(int charLength, int byteLength)
{
if (byteLength == 0) return string.Empty;
EnsureAvailable(byteLength);
// WASM string-cache fast path — if cached, byte-cmp validates and returns the canonical instance
if (_useStringCaching && byteLength <= _maxCachedStringLength)
{
return ReadStringUtf8Cached(byteLength);
}
var pos = _position;
_position += byteLength;
return string.Create(charLength, (Buffer: _buffer, Pos: pos, Len: byteLength), static (chars, state) =>
{
Utf8Transcoder.DecodeUtf8SinglePass(state.Buffer.AsSpan(state.Pos, state.Len), chars);
});
}
private string ReadStringUtf8Cached(int length) private string ReadStringUtf8Cached(int length)
{ {
var slice = _buffer.AsSpan(_position, length); var slice = _buffer.AsSpan(_position, length);
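
The 1-pass idea behind `ReadStringUtf8WithCharLen` can be sketched without the project's internals. This is a minimal sketch under stated assumptions: `DecodeKnownLength` is a hypothetical name, and `Encoding.UTF8.GetChars` stands in for the project's `Utf8Transcoder.DecodeUtf8SinglePass`, whose implementation is not shown in this diff.

```csharp
using System;
using System.Text;

// Hypothetical stand-in: with charLen known from the wire header, the string is
// allocated at its final size and decoded in one pass, with no counting pass first.
string DecodeKnownLength(byte[] buffer, int pos, int byteLen, int charLen)
{
    if (byteLen == 0) return string.Empty;

    return string.Create(charLen, (Buffer: buffer, Pos: pos, Len: byteLen), static (chars, state) =>
    {
        // Decodes directly into the string's own character storage.
        Encoding.UTF8.GetChars(state.Buffer.AsSpan(state.Pos, state.Len), chars);
    });
}

byte[] payload = Encoding.UTF8.GetBytes("h\u00E9llo"); // 6 bytes, 5 chars
string decoded = DecodeKnownLength(payload, 0, payload.Length, 5);
Console.WriteLine(decoded.Length);
```

In this sketch the header is trusted: if `charLen` is smaller than the decoded length, `GetChars` throws because the destination span is too short, so a production reader would need its own policy for malformed input.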


@ -1,5 +1,6 @@
using System; using System;
using System.Buffers; using System.Buffers;
using System.Buffers.Binary;
using System.Collections; using System.Collections;
using System.Collections.Concurrent; using System.Collections.Concurrent;
using System.Collections.Frozen; using System.Collections.Frozen;
@ -96,10 +97,16 @@ public static partial class AcBinaryDeserializer
readers[BinaryTypeCode.Float64] = static (ctx, _, _) => ctx.ReadDoubleUnsafe(); readers[BinaryTypeCode.Float64] = static (ctx, _, _) => ctx.ReadDoubleUnsafe();
readers[BinaryTypeCode.Decimal] = static (ctx, _, _) => ctx.ReadDecimalUnsafe(); readers[BinaryTypeCode.Decimal] = static (ctx, _, _) => ctx.ReadDecimalUnsafe();
readers[BinaryTypeCode.Char] = static (ctx, _, _) => ctx.ReadCharUnsafe(); readers[BinaryTypeCode.Char] = static (ctx, _, _) => ctx.ReadCharUnsafe();
readers[BinaryTypeCode.String] = static (ctx, _, _) => ReadPlainString(ctx); // H2Q6 non-ASCII tier readers (Compact mode): fixed-width header [charLen][utf8Len] + 1-pass decode.
// FastWire mode dispatches the StringSmall (=91) marker through the same handler — see ReadStringSmall.
readers[BinaryTypeCode.StringSmall] = static (ctx, _, _) => ReadStringSmall(ctx);
readers[BinaryTypeCode.StringMedium] = static (ctx, _, _) => ReadStringMedium(ctx);
readers[BinaryTypeCode.StringBig] = static (ctx, _, _) => ReadStringBig(ctx);
readers[BinaryTypeCode.StringInterned] = static (ctx, _, _) => ctx.GetInternedString((int)ctx.ReadVarUInt()); readers[BinaryTypeCode.StringInterned] = static (ctx, _, _) => ctx.GetInternedString((int)ctx.ReadVarUInt());
readers[BinaryTypeCode.StringEmpty] = static (_, _, _) => string.Empty; readers[BinaryTypeCode.StringEmpty] = static (_, _, _) => string.Empty;
readers[BinaryTypeCode.StringInternFirst] = static (ctx, _, _) => ReadAndRegisterInternedString(ctx); // H2Q6 interning tier readers (Compact mode only — Big tier never engages on interning path)
readers[BinaryTypeCode.StringInternFirstSmall] = static (ctx, _, _) => ReadAndRegisterInternedStringSmall(ctx);
readers[BinaryTypeCode.StringInternFirstMedium] = static (ctx, _, _) => ReadAndRegisterInternedStringMedium(ctx);
readers[BinaryTypeCode.StringAscii] = static (ctx, _, _) => ReadPlainStringAscii(ctx); readers[BinaryTypeCode.StringAscii] = static (ctx, _, _) => ReadPlainStringAscii(ctx);
readers[BinaryTypeCode.DateTime] = static (ctx, _, _) => ctx.ReadDateTimeUnsafe(); readers[BinaryTypeCode.DateTime] = static (ctx, _, _) => ctx.ReadDateTimeUnsafe();
readers[BinaryTypeCode.DateTimeOffset] = static (ctx, _, _) => ctx.ReadDateTimeOffsetUnsafe(); readers[BinaryTypeCode.DateTimeOffset] = static (ctx, _, _) => ctx.ReadDateTimeOffsetUnsafe();
@ -119,12 +126,8 @@ public static partial class AcBinaryDeserializer
readers[BinaryTypeCode.Dictionary] = ReadDictionary; readers[BinaryTypeCode.Dictionary] = ReadDictionary;
readers[BinaryTypeCode.ByteArray] = static (ctx, _, _) => ReadByteArray(ctx); readers[BinaryTypeCode.ByteArray] = static (ctx, _, _) => ReadByteArray(ctx);
// Register FixStr readers // V4N5 cleanup (2026-05-06): FixStr (UTF-8 short non-ASCII, 103..134) range REMOVED.
for (var code = BinaryTypeCode.FixStrBase; code <= BinaryTypeCode.FixStrMax; code++) // Non-ASCII short strings now use StringSmall tier marker (registered above).
{
var length = BinaryTypeCode.DecodeFixStrLength(code);
readers[code] = CreateFixStrReader<TInput>(length);
}
// Register FixStrAscii readers (135..166) — pure-ASCII short-string fast path. // Register FixStrAscii readers (135..166) — pure-ASCII short-string fast path.
// The marker IS the validity contract — reader byte→char widens without UTF-8 decode. // The marker IS the validity contract — reader byte→char widens without UTF-8 decode.
@ -142,16 +145,8 @@ public static partial class AcBinaryDeserializer
} }
/// <summary> // V4N5 cleanup (2026-05-06): CreateFixStrReader removed — non-ASCII short strings now use
/// Creates a reader for FixStr with the given length. // StringSmall tier reader (see ReadStringSmall below).
/// </summary>
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static TypeReader<TInput> CreateFixStrReader<TInput>(int length) where TInput : struct, IBinaryInputBase
{
if (length == 0) return static (_, _, _) => string.Empty;
return (ctx, _, _) => ctx.ReadStringUtf8(length);
}
/// <summary> /// <summary>
/// Creates a reader for FixStrAscii with the given byte length (also char count, ASCII = 1:1). /// Creates a reader for FixStrAscii with the given byte length (also char count, ASCII = 1:1).
@ -1041,42 +1036,42 @@ public static partial class AcBinaryDeserializer
break; break;
case PropertyAccessorType.String: case PropertyAccessorType.String:
if (BinaryTypeCode.IsFixStr(typeCode)) // FixStrAscii is a range (135-166), can't go in switch — keep as range-check first.
{ // Hot path on ASCII property names; the marker carries the length, byte→char widen only.
var length = BinaryTypeCode.DecodeFixStrLength(typeCode);
propInfo.SetValue(target, length == 0 ? string.Empty : context.ReadStringUtf8(length));
return true;
}
if (BinaryTypeCode.IsFixStrAscii(typeCode)) if (BinaryTypeCode.IsFixStrAscii(typeCode))
{ {
var length = BinaryTypeCode.DecodeFixStrAsciiLength(typeCode); var length = BinaryTypeCode.DecodeFixStrAsciiLength(typeCode);
propInfo.SetValue(target, length == 0 ? string.Empty : context.ReadAsciiBytesAsString(length)); propInfo.SetValue(target, length == 0 ? string.Empty : context.ReadAsciiBytesAsString(length));
return true; return true;
} }
if (typeCode == BinaryTypeCode.String) // Single-value markers — switch lowers to a JIT/AOT jump table for O(1) dispatch
// (vs. sequential if-chain that branches per non-matching marker).
switch (typeCode)
{ {
propInfo.SetValue(target, ReadPlainString(context)); case BinaryTypeCode.StringSmall:
return true; propInfo.SetValue(target, ReadStringSmall(context));
} return true;
if (typeCode == BinaryTypeCode.StringAscii) case BinaryTypeCode.StringMedium:
{ propInfo.SetValue(target, ReadStringMedium(context));
propInfo.SetValue(target, ReadPlainStringAscii(context)); return true;
return true; case BinaryTypeCode.StringBig:
} propInfo.SetValue(target, ReadStringBig(context));
if (typeCode == BinaryTypeCode.StringEmpty) return true;
{ case BinaryTypeCode.StringAscii:
propInfo.SetValue(target, string.Empty); propInfo.SetValue(target, ReadPlainStringAscii(context));
return true; return true;
} case BinaryTypeCode.StringEmpty:
if (typeCode == BinaryTypeCode.StringInterned) propInfo.SetValue(target, string.Empty);
{ return true;
propInfo.SetValue(target, context.GetInternedString((int)context.ReadVarUInt())); case BinaryTypeCode.StringInterned:
return true; propInfo.SetValue(target, context.GetInternedString((int)context.ReadVarUInt()));
} return true;
if (typeCode == BinaryTypeCode.StringInternFirst) case BinaryTypeCode.StringInternFirstSmall:
{ propInfo.SetValue(target, ReadAndRegisterInternedStringSmall(context));
propInfo.SetValue(target, ReadAndRegisterInternedString(context)); return true;
return true; case BinaryTypeCode.StringInternFirstMedium:
propInfo.SetValue(target, ReadAndRegisterInternedStringMedium(context));
return true;
} }
break; break;
} }
@ -1115,19 +1110,13 @@ public static partial class AcBinaryDeserializer
// Handle null // Handle null
if (typeCode == BinaryTypeCode.Null) return null; if (typeCode == BinaryTypeCode.Null) return null;
// Handle FixStr (short strings with length in type code)
if (BinaryTypeCode.IsFixStr(typeCode))
{
var length = BinaryTypeCode.DecodeFixStrLength(typeCode);
return length == 0 ? string.Empty : context.ReadStringUtf8(length);
}
// Handle FixStrAscii (short ASCII strings — byte→char widen, no UTF-8 decode) // Handle FixStrAscii (short ASCII strings — byte→char widen, no UTF-8 decode)
if (BinaryTypeCode.IsFixStrAscii(typeCode)) if (BinaryTypeCode.IsFixStrAscii(typeCode))
{ {
var length = BinaryTypeCode.DecodeFixStrAsciiLength(typeCode); var length = BinaryTypeCode.DecodeFixStrAsciiLength(typeCode);
return length == 0 ? string.Empty : context.ReadAsciiBytesAsString(length); return length == 0 ? string.Empty : context.ReadAsciiBytesAsString(length);
} }
// H2Q6: non-ASCII short strings now use StringSmall tier (handled below via TypeReaderTable dispatch).
var reader = TypeReaderTable<TInput>.Readers[typeCode]; var reader = TypeReaderTable<TInput>.Readers[typeCode];
if (reader != null) if (reader != null)
@ -1141,7 +1130,11 @@ public static partial class AcBinaryDeserializer
} }
/// <summary> /// <summary>
/// Plain string read; does NOT register in the intern table. /// Body-only string read for marker-less paths (polymorphism: assembly-qualified type-name).
/// Wire format: <c>[VarUInt utf8Len][UTF-8 bytes]</c> — caller already consumed any marker.
/// Used by <c>ReadObjectWithTypeName</c> / <c>ReadObjectWithTypeNameRefFirst</c> after their
/// outer marker has been read; symmetric to <see cref="BinaryDeserializationContext{T}.WriteStringUtf8"/>
/// on the writer side.
/// </summary> /// </summary>
[MethodImpl(MethodImplOptions.AggressiveInlining)] [MethodImpl(MethodImplOptions.AggressiveInlining)]
private static string ReadPlainString<TInput>(BinaryDeserializationContext<TInput> context) private static string ReadPlainString<TInput>(BinaryDeserializationContext<TInput> context)
@ -1152,6 +1145,58 @@ public static partial class AcBinaryDeserializer
return context.ReadStringUtf8(length); return context.ReadStringUtf8(length);
} }
/// <summary>
/// H2Q6 StringSmall reader (Compact mode): wire <c>[charLen:8][utf8Len:8][UTF-8 bytes]</c> after the
/// marker has been consumed. 1-pass decode (no <c>CountUtf8Chars</c>). FastWire mode uses the same
/// marker (=91) but a different layout — handled via <see cref="BinaryDeserializationContext{T}.ReadStringUtf8"/>
/// when the deserializer is in FastWire mode.
/// </summary>
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static string ReadStringSmall<TInput>(BinaryDeserializationContext<TInput> context)
where TInput : struct, IBinaryInputBase
{
if (context.FastWire)
{
// Mode-shared marker: FastWire payload is [VarUInt charCount][UTF-16 raw bytes]
var charLenF = (int)context.ReadVarUInt();
return context.ReadStringUtf8(charLenF);
}
// Compact mode — H2Q6 StringSmall: [charLen:8][utf8Len:8][bytes]
var header = context.ReadTwoBytesUnsafe();
var charLength = (byte)header;
var byteLength = (byte)(header >> 8);
return context.ReadStringUtf8WithCharLen(charLength, byteLength);
}
/// <summary>
/// H2Q6 StringMedium reader: wire <c>[charLen:16 LE][utf8Len:16 LE][UTF-8 bytes]</c>. 1-pass decode.
/// Header read in a single uint load (vs 2 ushort loads).
/// </summary>
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static string ReadStringMedium<TInput>(BinaryDeserializationContext<TInput> context)
where TInput : struct, IBinaryInputBase
{
var packed = context.ReadUInt32Unsafe();
var charLength = (ushort)packed;
var byteLength = (ushort)(packed >> 16);
return context.ReadStringUtf8WithCharLen(charLength, byteLength);
}
/// <summary>
/// H2Q6 StringBig reader: wire <c>[charLen:32 LE][utf8Len:32 LE][UTF-8 bytes]</c>. 1-pass decode.
/// Header read in a single ulong load (vs 2 uint loads).
/// </summary>
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static string ReadStringBig<TInput>(BinaryDeserializationContext<TInput> context)
where TInput : struct, IBinaryInputBase
{
var packed = context.ReadUInt64Unsafe();
var charLength = (int)(uint)packed;
var byteLength = (int)(uint)(packed >> 32);
return context.ReadStringUtf8WithCharLen(charLength, byteLength);
}
/// <summary> /// <summary>
/// Reads a long ASCII string payload (after the <c>StringAscii</c> marker has been consumed). /// Reads a long ASCII string payload (after the <c>StringAscii</c> marker has been consumed).
/// Wire format: <c>[VarUInt byteCount][ASCII bytes]</c>. Byte→char widen, no UTF-8 decode. /// Wire format: <c>[VarUInt byteCount][ASCII bytes]</c>. Byte→char widen, no UTF-8 decode.
@ -1166,20 +1211,51 @@ public static partial class AcBinaryDeserializer
} }
/// <summary> /// <summary>
/// Read interned string (StringInternFirst marker) and register in cache at specified index. /// H2Q6 StringInternFirstSmall reader: wire <c>[cacheIdx:VarUInt][charLen:8][utf8Len:8][bytes]</c>
/// Wire format: [StringInternFirst][VarUInt cacheIndex][VarUInt length][UTF8 bytes] /// after the marker has been consumed. Registers the decoded string in the intern cache.
/// </summary> /// </summary>
[MethodImpl(MethodImplOptions.AggressiveInlining)] [MethodImpl(MethodImplOptions.AggressiveInlining)]
private static string ReadAndRegisterInternedString<TInput>(BinaryDeserializationContext<TInput> context) private static string ReadAndRegisterInternedStringSmall<TInput>(BinaryDeserializationContext<TInput> context)
where TInput : struct, IBinaryInputBase where TInput : struct, IBinaryInputBase
{ {
// First StringInternFirst marker proves payload uses string interning → // First interning marker proves payload uses string interning → plain String entries
// plain String entries appear only once, so _stringCache would never hit // appear only once, so _stringCache would never hit on them.
context.DisableStringCaching(); context.DisableStringCaching();
var cacheIndex = (int)context.ReadVarUInt(); var cacheIndex = (int)context.ReadVarUInt();
var length = (int)context.ReadVarUInt(); var header = context.ReadTwoBytesUnsafe();
if (length == 0) return string.Empty; var charLength = (byte)header;
var str = context.ReadStringUtf8(length); var byteLength = (byte)(header >> 8);
if (byteLength == 0)
{
context.RegisterInternedValueAt(cacheIndex, string.Empty);
return string.Empty;
}
var str = context.ReadStringUtf8WithCharLen(charLength, byteLength);
context.RegisterInternedValueAt(cacheIndex, str);
return str;
}
/// <summary>
/// H2Q6 StringInternFirstMedium reader: wire <c>[cacheIdx:VarUInt][charLen:16 LE][utf8Len:16 LE][bytes]</c>.
/// Registers the decoded string in the intern cache. (Big tier never engages on the interning path —
/// see <see cref="BinaryTypeCode"/> H2Q6 layout comment.)
/// </summary>
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static string ReadAndRegisterInternedStringMedium<TInput>(BinaryDeserializationContext<TInput> context)
where TInput : struct, IBinaryInputBase
{
context.DisableStringCaching();
var cacheIndex = (int)context.ReadVarUInt();
// Pack charLen:16 | utf8Len:16 read in a single uint load
var packed = context.ReadUInt32Unsafe();
var charLength = (ushort)packed;
var byteLength = (ushort)(packed >> 16);
if (byteLength == 0)
{
context.RegisterInternedValueAt(cacheIndex, string.Empty);
return string.Empty;
}
var str = context.ReadStringUtf8WithCharLen(charLength, byteLength);
context.RegisterInternedValueAt(cacheIndex, str); context.RegisterInternedValueAt(cacheIndex, str);
return str; return str;
} }
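
The first-occurrence contract used by the interning readers can be sketched with a plain dictionary. This is a minimal sketch, assuming only what the diff shows: `RegisterAt` and `Resolve` are hypothetical stand-ins for `RegisterInternedValueAt` and `GetInternedString`, and the real cache structure is not visible here.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical intern table keyed by the cache index carried on the wire.
var internTable = new Dictionary<int, string>();

// First occurrence (StringInternFirstSmall/Medium): always register,
// including the empty string, so the index stays resolvable.
void RegisterAt(int index, string value) => internTable[index] = value;

// Later occurrences (StringInterned): resolve purely by index.
string Resolve(int index) =>
    internTable.TryGetValue(index, out var s)
        ? s
        : throw new InvalidOperationException($"No interned string at index {index}");

RegisterAt(0, string.Empty);       // zero-length payload still gets a slot
RegisterAt(1, "\u00E1rv\u00EDz");  // non-ASCII value decoded from the tier payload
Console.WriteLine(Resolve(1).Length);
```

Registering the empty string matters because a later `StringInterned` reference to that index would otherwise have nothing to resolve; the new readers in this diff register before returning in the `byteLength == 0` branch.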
@ -2032,16 +2108,7 @@ public static partial class AcBinaryDeserializer
if (BinaryTypeCode.IsTinyInt(typeCode)) return; if (BinaryTypeCode.IsTinyInt(typeCode)) return;
// Handle FixStr (short strings) // Handle FixStrAscii (short ASCII strings — marker carries length, ASCII payload)
if (BinaryTypeCode.IsFixStr(typeCode))
{
var length = BinaryTypeCode.DecodeFixStrLength(typeCode);
if (length > 0)
context.Skip(length);
return;
}
// Handle FixStrAscii (short ASCII strings — same skip layout as FixStr, just different marker range)
if (BinaryTypeCode.IsFixStrAscii(typeCode)) if (BinaryTypeCode.IsFixStrAscii(typeCode))
{ {
var length = BinaryTypeCode.DecodeFixStrAsciiLength(typeCode); var length = BinaryTypeCode.DecodeFixStrAsciiLength(typeCode);
@ -2049,6 +2116,7 @@ public static partial class AcBinaryDeserializer
context.Skip(length); context.Skip(length);
return; return;
} }
// H2Q6: non-ASCII short strings now use StringSmall tier (handled in switch below).
switch (typeCode) switch (typeCode)
{ {
@ -2094,17 +2162,44 @@ public static partial class AcBinaryDeserializer
case BinaryTypeCode.Decimal: case BinaryTypeCode.Decimal:
context.Skip(16); context.Skip(16);
return; return;
case BinaryTypeCode.String:
case BinaryTypeCode.StringAscii: case BinaryTypeCode.StringAscii:
// Same skip layout: [VarUInt byteCount][bytes]. ASCII vs UTF-8 distinction is content-only. // Skip layout: [VarUInt byteCount][bytes]
SkipPlainString(context); SkipPlainString(context);
return; return;
case BinaryTypeCode.StringSmall:
// H2Q6 Small tier: [charLen:8][utf8Len:8][bytes] — skip 2 byte header + utf8Len bytes
{
var header = context.ReadTwoBytesUnsafe();
var utf8Len = (byte)(header >> 8);
if (utf8Len > 0) context.Skip(utf8Len);
}
return;
case BinaryTypeCode.StringMedium:
// H2Q6 Medium tier: [charLen:16][utf8Len:16][bytes] — single uint read
{
var packed = context.ReadUInt32Unsafe();
var utf8Len = (int)(packed >> 16);
if (utf8Len > 0) context.Skip(utf8Len);
}
return;
case BinaryTypeCode.StringBig:
// H2Q6 Big tier: [charLen:32][utf8Len:32][bytes] — single ulong read
{
var packed = context.ReadUInt64Unsafe();
var utf8Len = (int)(uint)(packed >> 32);
if (utf8Len > 0) context.Skip(utf8Len);
}
return;
case BinaryTypeCode.StringInterned: case BinaryTypeCode.StringInterned:
context.ReadVarUInt(); context.ReadVarUInt();
return; return;
case BinaryTypeCode.StringInternFirst: case BinaryTypeCode.StringInternFirstSmall:
// First occurrence - must register even when skipping // H2Q6 interning Small: [cacheIdx:VarUInt][charLen:8][utf8Len:8][bytes] — register decoded string
SkipAndRegisterInternedString(context); SkipAndRegisterInternedStringSmall(context);
return;
case BinaryTypeCode.StringInternFirstMedium:
// H2Q6 interning Medium: [cacheIdx:VarUInt][charLen:16][utf8Len:16][bytes]
SkipAndRegisterInternedStringMedium(context);
return; return;
case BinaryTypeCode.ByteArray: case BinaryTypeCode.ByteArray:
var byteLen = (int)context.ReadVarUInt(); var byteLen = (int)context.ReadVarUInt();
@ -2157,17 +2252,44 @@ public static partial class AcBinaryDeserializer
} }
/// <summary> /// <summary>
/// Skip an interned string (StringInternFirst) - must still read cacheIndex and register in cache. /// Skip an H2Q6 interning Small first-occurrence — still register in cache for subsequent ref reads.
/// Wire format: [StringInternFirst][VarUInt cacheIndex][VarUInt length][UTF8 bytes] /// Wire format: <c>[StringInternFirstSmall][cacheIdx:VarUInt][charLen:8][utf8Len:8][UTF-8 bytes]</c>
/// </summary> /// </summary>
[MethodImpl(MethodImplOptions.AggressiveInlining)] [MethodImpl(MethodImplOptions.AggressiveInlining)]
private static void SkipAndRegisterInternedString<TInput>(BinaryDeserializationContext<TInput> context) private static void SkipAndRegisterInternedStringSmall<TInput>(BinaryDeserializationContext<TInput> context)
where TInput : struct, IBinaryInputBase where TInput : struct, IBinaryInputBase
{ {
var cacheIndex = (int)context.ReadVarUInt(); var cacheIndex = (int)context.ReadVarUInt();
var byteLen = (int)context.ReadVarUInt(); var header = context.ReadTwoBytesUnsafe();
if (byteLen == 0) return; var charLen = (byte)header;
var str = context.ReadStringUtf8(byteLen); var byteLen = (byte)(header >> 8);
if (byteLen == 0)
{
context.RegisterInternedValueAt(cacheIndex, string.Empty);
return;
}
var str = context.ReadStringUtf8WithCharLen(charLen, byteLen);
context.RegisterInternedValueAt(cacheIndex, str);
}
/// <summary>
/// Skip an H2Q6 interning Medium first-occurrence — still register in cache for subsequent ref reads.
/// Wire format: <c>[StringInternFirstMedium][cacheIdx:VarUInt][charLen:16 LE][utf8Len:16 LE][UTF-8 bytes]</c>
/// </summary>
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static void SkipAndRegisterInternedStringMedium<TInput>(BinaryDeserializationContext<TInput> context)
where TInput : struct, IBinaryInputBase
{
var cacheIndex = (int)context.ReadVarUInt();
var packed = context.ReadUInt32Unsafe();
var charLen = (ushort)packed;
var byteLen = (ushort)(packed >> 16);
if (byteLen == 0)
{
context.RegisterInternedValueAt(cacheIndex, string.Empty);
return;
}
var str = context.ReadStringUtf8WithCharLen(charLen, byteLen);
context.RegisterInternedValueAt(cacheIndex, str); context.RegisterInternedValueAt(cacheIndex, str);
} }


@ -1,5 +1,6 @@
using System; using System;
using System.Buffers; using System.Buffers;
using System.Buffers.Binary;
using System.Collections.Concurrent; using System.Collections.Concurrent;
using System.Collections.Generic; using System.Collections.Generic;
using System.Numerics; using System.Numerics;
@ -715,30 +716,42 @@ public static partial class AcBinarySerializer
} }
/// <summary> /// <summary>
/// Writes a non-empty string with marker-dispatch: detects ASCII vs UTF-8 in-place from the /// Writes a non-empty string with marker-dispatch: detects ASCII vs non-ASCII in-place from
/// encoder's byte count and emits the appropriate wire marker (<c>FixStrAscii</c>, /// the encoder's byte count, then emits the appropriate wire marker:
/// <c>FixStr</c>, <c>StringAscii</c>, or <c>String</c>). The reader uses the marker as an /// <list type="bullet">
/// ASCII-validity contract — pure-ASCII payloads skip UTF-8 decode entirely (byte→char widen). /// <item>ASCII ≤ 31 byte → <c>FixStrAscii</c> (1-byte header, length in marker)</item>
/// <item>ASCII &gt; 31 byte → <c>StringAscii</c> (1+VarUInt header)</item>
/// <item>Non-ASCII utf8Len ≤ 255 → <c>StringSmall</c> (3-byte header: marker + charLen:8 + utf8Len:8)</item>
/// <item>Non-ASCII utf8Len ≤ 65535 → <c>StringMedium</c> (5-byte header: marker + charLen:16 + utf8Len:16)</item>
/// <item>Non-ASCII utf8Len &gt; 65535 → <c>StringBig</c> (9-byte header: marker + charLen:32 + utf8Len:32)</item>
/// </list>
/// </summary> /// </summary>
/// <remarks> /// <remarks>
/// Layout (Compact wire): <c>[marker: 1 byte][optional VarUInt byteCount][encoded bytes]</c> /// H2Q6 wire format v3 — non-ASCII tiers carry both <c>charLen</c> and <c>utf8Len</c> in the header,
/// — VarUInt is omitted for FixStr/FixStrAscii (length is encoded in the marker). /// enabling 1-pass deserialize (no <c>CountUtf8Chars</c> Pass 1). ASCII path unchanged from M3R7.
/// ///
/// ASCII detection is free: <c>bytesWritten == charLength</c> after a UTF-8 encode is a /// <para>Optimistic encode position is chosen by tier-prediction from <c>charLength</c>
/// necessary AND sufficient condition for the input being pure ASCII (every UTF-16 char /// (worst-case 4 byte/char): ≤ 63 char → Small (3-byte header reserved); ≤ 16383 char → Medium
/// &lt; 0x80 produces exactly 1 UTF-8 byte; non-ASCII chars always produce 2-4 bytes). /// (5-byte header reserved); else Big (9-byte). After encoding, <c>bytesWritten</c> determines
/// the actual tier and the body is left-shifted only if the actual header is smaller than
/// reserved (rare on Hungarian text: short Hungarian content stays in the Small tier with 0 shift).</para>
/// ///
/// Caller MUST guarantee non-empty input (<c>value.Length &gt; 0</c>) — empty strings are /// <para>FastWire mode: re-uses the <c>StringSmall</c> marker value (91) as a generic
/// handled by the higher-level <c>WriteString</c> via the <c>StringEmpty</c> marker. /// "string marker" — body layout differs (UTF-16 raw + VarUInt charCount) and the reader
/// dispatches by serializer mode, NOT by re-interpreting the marker. The 91 value is
/// mode-shared because the wire envelope is mode-tagged at the header level.</para>
///
/// <para>Caller MUST guarantee non-empty input (<c>value.Length &gt; 0</c>) — empty strings
/// are handled by the higher-level <c>WriteString</c> via the <c>StringEmpty</c> marker.</para>
/// </remarks> /// </remarks>
public void WriteStringWithDispatch(string value) public void WriteStringWithDispatch(string value)
{ {
if (FastWire) if (FastWire)
{ {
// FastWire: char count (VarUInt) + raw UTF-16 memcopy. ASCII detection adds no value // FastWire: [StringSmall marker][VarUInt charCount][UTF-16 raw bytes]
// here — the wire size is identical (2 bytes/char) and the read path is memcpy-based, // Marker value 91 is mode-shared (Compact StringSmall vs FastWire string marker);
// so the encoder/decoder UTF-8 cost (which the ASCII marker would skip) doesn't apply. // reader dispatches by deserializer mode, NOT by re-interpreting the marker.
WriteByte(BinaryTypeCode.String); WriteByte(BinaryTypeCode.StringSmall);
var charLenF = value.Length; var charLenF = value.Length;
var byteLenF = charLenF * 2; var byteLenF = charLenF * 2;
WriteVarUInt((uint)charLenF); WriteVarUInt((uint)charLenF);
@ -748,134 +761,186 @@ public static partial class AcBinarySerializer
return; return;
} }
// Compact mode — H2Q6 post-encode tier dispatch (wire-optimal).
//
// Two-step tier logic:
// 1. reserveHeader (from charLength, worst-case 4 byte/char): bounds the buffer allocation
// AND the encode position. Tight reserve (3/5/9) avoids large memmove on the hot path.
// 2. actualHeader (from bytesWritten after encode): chooses the smallest fitting tier.
// A mostly-ASCII string in the 64-16383 char band whose UTF-8 size is ≤ 255 bytes gets Small
// (3 byte header) even though reserve was Medium (5 byte); the body is left-shifted by 2 bytes to compact.
//
// Why post-encode tier choice (vs. pre-chosen): mostly-ASCII content (English description fields,
// log/error messages, URL paths) at 64+ char would otherwise pay +2 byte/string for the Medium
// header when Small fits. Production payloads mix Hungarian/CJK multi-byte strings with
// ASCII-dominated ones, so picking the smallest fitting tier matters across the realistic mix.
//
// ASCII override (bytesWritten == charLength) emits FixStrAscii / StringAscii with their own
// compact headers (1 byte / 1+VarUInt) — body shifted left from the encode position.
var charLength = value.Length; var charLength = value.Length;
var maxBytes = charLength * 4;
// Hot-path split: encode position is chosen to MINIMIZE post-encode shifts. int reserveHeader;
// if (charLength <= 63) reserveHeader = 3;
// • charLength ≤ 31 → MIGHT be FixStr (bytesWritten ≤ 31) or long String (multibyte else if (charLength <= 16383) reserveHeader = 5;
// expansion). Encode optimistically at savedPos+1 (FixStr position). FixStr hit ⇒ 0 shift, else reserveHeader = 9;
// only marker byte write. Long-fallback (rare, requires Hungarian/CJK chars in a
// short-char string AND post-expand size > 31) ⇒ shift bytes RIGHT by 1 (since the EnsureCapacity(reserveHeader + maxBytes);
// long lane needs 1 VarUInt byte after the marker; charLength ≤ 31 ⇒ maxBytes ≤ 124
// ⇒ VarUInt size = 1). var savedPos = _position;
// var encodeStart = savedPos + reserveHeader;
// • charLength > 31 → ALWAYS long String (bytesWritten ≥ charLength > 31). Use full var bytesWritten = Utf8Transcoder.EncodeUtf8SinglePass(value.AsSpan(), _buffer.AsSpan(encodeStart, maxBytes));
// D-2 layout [marker][reserveVarUInt][bytes], encode at savedPos+1+reserveVarUInt.
// Backfill compacts only when actual VarUInt size < reserved (rare). if (bytesWritten == charLength)
if (charLength <= BinaryTypeCode.FixStrMaxLength)
{ {
var maxBytesShort = charLength * 4; // ≤ 124, fits in 1-byte VarUInt // ASCII override — FixStrAscii (≤31) or StringAscii (>31) with compact header
EnsureCapacity(2 + maxBytesShort); // marker + 1-byte VarUInt + bytes (worst case) if (bytesWritten <= BinaryTypeCode.FixStrAsciiMaxLength)
var savedPosShort = _position;
var bytesWrittenShort = Utf8Transcoder.EncodeUtf8SinglePass(
value.AsSpan(),
_buffer.AsSpan(savedPosShort + 1, maxBytesShort));
var isAsciiShort = bytesWrittenShort == charLength;
if (bytesWrittenShort <= BinaryTypeCode.FixStrMaxLength)
{ {
// Hot path: FixStr hit → bytes already at savedPos+1, no shift. var shift = reserveHeader - 1;
_buffer[savedPosShort] = isAsciiShort _buffer.AsSpan(encodeStart, bytesWritten).CopyTo(_buffer.AsSpan(savedPos + 1, bytesWritten));
? BinaryTypeCode.EncodeFixStrAscii(bytesWrittenShort) _buffer[savedPos] = BinaryTypeCode.EncodeFixStrAscii(bytesWritten);
: BinaryTypeCode.EncodeFixStr(bytesWrittenShort); _position = savedPos + 1 + bytesWritten;
_position = savedPosShort + 1 + bytesWrittenShort;
} }
else else
{ {
// Cold: multibyte expansion pushed bytes > 31 → become long String/StringAscii. var actualVarUIntSize = VarUIntSize((uint)bytesWritten);
// Shift bytes right by 1 to insert the 1-byte VarUInt slot. var asciiHeader = 1 + actualVarUIntSize;
_buffer.AsSpan(savedPosShort + 1, bytesWrittenShort) var shift = reserveHeader - asciiHeader;
.CopyTo(_buffer.AsSpan(savedPosShort + 2, bytesWrittenShort)); if (shift > 0)
_buffer[savedPosShort] = isAsciiShort ? BinaryTypeCode.StringAscii : BinaryTypeCode.String; _buffer.AsSpan(encodeStart, bytesWritten).CopyTo(_buffer.AsSpan(encodeStart - shift, bytesWritten));
_position = savedPosShort + 1; _buffer[savedPos] = BinaryTypeCode.StringAscii;
WriteVarUIntUnsafe((uint)bytesWrittenShort); _position = savedPos + 1;
_position += bytesWrittenShort; WriteVarUIntUnsafe((uint)bytesWritten);
_position += bytesWritten;
} }
return;
}
// Long path: charLength > 31 ⇒ bytesWritten > 31 ⇒ always String / StringAscii.
// D-2 layout [marker:1][VarUInt slot:reserveVarUInt][bytes], encode at savedPos+1+reserveVarUInt.
var maxBytes = charLength * 4;
var reserveVarUInt = VarUIntSize((uint)maxBytes);
EnsureCapacity(1 + reserveVarUInt + maxBytes);
var savedPos = _position;
var encodeStart = savedPos + 1 + reserveVarUInt;
var bytesWritten = Utf8Transcoder.EncodeUtf8SinglePass(value.AsSpan(), _buffer.AsSpan(encodeStart, maxBytes));
var isAscii = bytesWritten == charLength;
_buffer[savedPos] = isAscii ? BinaryTypeCode.StringAscii : BinaryTypeCode.String;
var actualVarUIntSize = VarUIntSize((uint)bytesWritten);
if (actualVarUIntSize < reserveVarUInt)
{
var shift = reserveVarUInt - actualVarUIntSize;
_buffer.AsSpan(encodeStart, bytesWritten).CopyTo(_buffer.AsSpan(encodeStart - shift, bytesWritten));
}
_position = savedPos + 1;
WriteVarUIntUnsafe((uint)bytesWritten);
_position += bytesWritten;
}
public void WriteFixStr(string value)
{
var length = value.Length;
EnsureCapacity(1 + length);
_buffer[_position++] = BinaryTypeCode.EncodeFixStr(length);
Ascii.FromUtf16(value.AsSpan(), _buffer.AsSpan(_position, length), out _);
_position += length;
}
public void WriteFixStrDirect(string value)
{
var length = value.Length;
EnsureCapacity(1 + length);
var destSpan = _buffer.AsSpan(_position + 1, length);
var status = Ascii.FromUtf16(value.AsSpan(), destSpan, out var bytesWritten);
if (status == OperationStatus.Done && bytesWritten == length)
{
_buffer[_position] = BinaryTypeCode.EncodeFixStr(length);
_position += 1 + length;
} }
else else
{ {
_buffer[_position++] = BinaryTypeCode.String; // Non-ASCII — post-encode tier choice from bytesWritten (smallest fitting tier wins)
WriteStringUtf8Internal(value); int actualHeader;
byte tierMarker;
switch (bytesWritten)
{
case <= 255:
actualHeader = 3;
tierMarker = BinaryTypeCode.StringSmall;
break;
case <= 65535:
actualHeader = 5;
tierMarker = BinaryTypeCode.StringMedium;
break;
default:
actualHeader = 9;
tierMarker = BinaryTypeCode.StringBig;
break;
}
var shift = reserveHeader - actualHeader;
if (shift > 0)
_buffer.AsSpan(encodeStart, bytesWritten).CopyTo(_buffer.AsSpan(encodeStart - shift, bytesWritten));
_buffer[savedPos] = tierMarker;
switch (actualHeader)
{
case 3:
{
// Pack charLen:8 | utf8Len:8 → single ushort store (vs 2 byte-stores)
var packed = (ushort)(charLength | (bytesWritten << 8));
Unsafe.WriteUnaligned<ushort>(ref _buffer[savedPos + 1], packed);
break;
}
case 5:
{
// Pack charLen:16 | utf8Len:16 → single uint store, LE (vs 2 ushort-stores)
var packed = (uint)charLength | ((uint)bytesWritten << 16);
Unsafe.WriteUnaligned<uint>(ref _buffer[savedPos + 1], packed);
break;
}
default:
{
// Pack charLen:32 | utf8Len:32 → single ulong store, LE (vs 2 uint-stores)
var packed = (ulong)(uint)charLength | ((ulong)(uint)bytesWritten << 32);
Unsafe.WriteUnaligned<ulong>(ref _buffer[savedPos + 1], packed);
break;
}
}
_position = savedPos + actualHeader + bytesWritten;
} }
} }
[MethodImpl(MethodImplOptions.AggressiveInlining)] /// <summary>
public void WriteFixStrBytes(ReadOnlySpan<byte> utf8Bytes) /// Writes the first-occurrence body of an interned string with H2Q6 tier-marker dispatch.
/// Used by the runtime/SGen string-intern write path; subsequent occurrences use cache-index ref.
/// </summary>
/// <remarks>
/// Wire layout per tier:
/// <list type="bullet">
/// <item><c>StringInternFirstSmall</c>: <c>[marker:1][cacheIdx:VarUInt][charLen:8][utf8Len:8][bytes]</c> — utf8Len ≤ 255</item>
/// <item><c>StringInternFirstMedium</c>: <c>[marker:1][cacheIdx:VarUInt][charLen:16][utf8Len:16][bytes]</c> — utf8Len ≤ 65535</item>
/// </list>
///
/// <para>Big tier never engages: <c>MaxStringInternLength</c> is byte-typed in
/// <c>AcBinarySerializerOptions</c> (absolute max 255 char × 4 byte/char = 1020 byte, which fits in Medium).</para>
///
/// <para>Tier prediction by <c>charLength</c>: ≤ 63 char → Small (worst-case 252 byte ≤ 255);
/// &gt; 63 char → Medium. Body is left-shifted by 2 bytes only when a long mostly-ASCII interning
/// string drops back into Small tier (rare).</para>
/// </remarks>
public void WriteStringInternFirstWithDispatch(string value, int cacheMapIndex)
{ {
var length = utf8Bytes.Length; // Post-encode tier choice (wire-optimal): mostly-ASCII interning string in the 64+ char band
EnsureCapacity(1 + length); // emits Small tier (3 byte) when bytesWritten ≤ 255, instead of Medium (5 byte). Big tier
_buffer[_position++] = BinaryTypeCode.EncodeFixStr(length); // never engages — MaxStringInternLength byte-typed (max 255 char × 4 byte = 1020 byte fits in Medium).
utf8Bytes.CopyTo(_buffer.AsSpan(_position, length)); var charLength = value.Length;
_position += length; var maxBytes = charLength * 4;
} var cacheIdxSize = VarUIntSize((uint)cacheMapIndex);
public void WritePreencodedPropertyName(ReadOnlySpan<byte> utf8Name) // reserveHeader: charLength ≤ 63 → guaranteed Small (252 byte ≤ 255); else Medium-reserve.
{ var reserveHeader = charLength <= 63 ? 3 : 5;
WriteByte(BinaryTypeCode.String);
WriteVarUInt((uint)utf8Name.Length); EnsureCapacity(cacheIdxSize + reserveHeader + maxBytes);
WriteBytes(utf8Name);
} var savedPos = _position;
var encodeStart = savedPos + cacheIdxSize + reserveHeader;
[MethodImpl(MethodImplOptions.AggressiveInlining)] var bytesWritten = Utf8Transcoder.EncodeUtf8SinglePass(value.AsSpan(), _buffer.AsSpan(encodeStart, maxBytes));
private void WriteStringUtf8Internal(string value)
{ // Choose tier from actual bytesWritten (smallest fits)
var byteCount = Utf8NoBom.GetByteCount(value); var actualHeader = bytesWritten <= 255 ? 3 : 5;
WriteVarUInt((uint)byteCount); var tierMarker = actualHeader == 3 ? BinaryTypeCode.StringInternFirstSmall : BinaryTypeCode.StringInternFirstMedium;
EnsureCapacity(byteCount);
Utf8NoBom.GetBytes(value.AsSpan(), _buffer.AsSpan(_position, byteCount)); var shift = reserveHeader - actualHeader;
_position += byteCount; if (shift > 0)
_buffer.AsSpan(encodeStart, bytesWritten).CopyTo(_buffer.AsSpan(encodeStart - shift, bytesWritten));
// Write [marker][cacheIdx VarUInt][charLen + utf8Len header][bytes]
_buffer[savedPos] = tierMarker;
_position = savedPos + 1;
WriteVarUIntUnsafe((uint)cacheMapIndex);
if (actualHeader == 3)
{
// Pack charLen:8 | utf8Len:8 → single ushort store
var packed = (ushort)(charLength | (bytesWritten << 8));
Unsafe.WriteUnaligned<ushort>(ref _buffer[_position], packed);
_position += 2;
}
else
{
// Pack charLen:16 | utf8Len:16 → single uint store, LE
var packed = (uint)charLength | ((uint)bytesWritten << 16);
Unsafe.WriteUnaligned<uint>(ref _buffer[_position], packed);
_position += 4;
}
_position += bytesWritten;
} }
// ─────────────────────────────────────────────────────────────────
// V4N5 dead-code cleanup (2026-05-06): WriteFixStr, WriteFixStrDirect, WriteFixStrBytes,
// WritePreencodedPropertyName, and WriteStringUtf8Internal removed — these were unreachable
// (no core call site, no SourceGenerator template hit, no test, no reflection path).
// The hot-path string writes go through WriteStringWithDispatch (M3R7 + H2Q6 marker dispatch).
// ─────────────────────────────────────────────────────────────────
#endregion #endregion


@ -1427,10 +1427,11 @@ public static partial class AcBinarySerializer
ValidateWritePlanString(in planEntry, value); ValidateWritePlanString(in planEntry, value);
if (planEntry.IsFirst) if (planEntry.IsFirst)
{ {
// StringFirst: write interned string + cache index + data (Value holds the string) // H2Q6 v3 wire format — StringFirst with tier-marker dispatch (Small/Medium):
context.WriteByte(BinaryTypeCode.StringInternFirst); // [StringInternFirstSmall][cacheIdx:VarUInt][charLen:8][utf8Len:8][bytes] if utf8Len ≤ 255
context.WriteVarUInt((uint)planEntry.CacheMapIndex); // [StringInternFirstMedium][cacheIdx:VarUInt][charLen:16][utf8Len:16][bytes] if utf8Len ≤ 65535
context.WriteStringUtf8(planEntry.Value ?? value); // 1-pass decode: charLen carried in header, no CountUtf8Chars Pass 1.
context.WriteStringInternFirstWithDispatch(planEntry.Value ?? value, planEntry.CacheMapIndex);
} }
else else
{ {


@ -63,11 +63,30 @@ internal static class BinaryTypeCode
public const byte Decimal = SlotCount + 25; // 89 public const byte Decimal = SlotCount + 25; // 89
public const byte Char = SlotCount + 26; // 90 public const byte Char = SlotCount + 26; // 90
// ============================================================================
// String types — H2Q6 layout (post 2026-05-06 marker reorg, wire format v3)
// ============================================================================
//
// Non-ASCII strings use fixed-width header tier markers (NO VarUInt utf8Len),
// enabling 1-pass decode (CountUtf8Chars Pass 1 eliminated).
//
// Tier dispatch (writer chooses smallest fitting tier based on utf8Len):
// StringSmall — utf8Len ≤ 255 — header: 1 marker + 1 charLen + 1 utf8Len = 3 byte
// StringMedium — utf8Len ≤ 65535 — header: 1 marker + 2 charLen + 2 utf8Len = 5 byte
// StringBig — utf8Len > 65535 — header: 1 marker + 4 charLen + 4 utf8Len = 9 byte
//
// Interning tiers (writer chooses based on utf8Len; Big never engages because MaxStringInternLength
// is byte-typed: max 255 char × max 4 byte/char = 1020 byte fits in Medium):
// StringInternFirstSmall — utf8Len ≤ 255 — header: 1 + cacheIdx-VarUInt + 1 + 1
// StringInternFirstMedium — utf8Len ≤ 65535 — header: 1 + cacheIdx-VarUInt + 2 + 2
//
// ASCII strings continue to use FixStrAscii (135..166) and StringAscii (167) — unchanged from M3R7.
//
// String types (SlotCount + 27..30) // String types (SlotCount + 27..30)
public const byte String = SlotCount + 27; // 91 — Inline UTF8 string (non-interned) public const byte StringSmall = SlotCount + 27; // 91 — Non-ASCII tier 1: [marker:1][charLen:8][utf8Len:8][bytes], utf8Len ≤ 255
public const byte StringInterned = SlotCount + 28; // 92 — Reference to interned string by index (2+ occurrence) public const byte StringInterned = SlotCount + 28; // 92 — Reference to interned string by index (2+ occurrence) — UNCHANGED
public const byte StringEmpty = SlotCount + 29; // 93 — Empty string marker public const byte StringEmpty = SlotCount + 29; // 93 — Empty string marker — UNCHANGED
public const byte StringInternFirst = SlotCount + 30; // 94 — First occurrence of interned string public const byte StringMedium = SlotCount + 30; // 94 — Non-ASCII tier 2: [marker:1][charLen:16][utf8Len:16][bytes], utf8Len ≤ 65535
// Date/Time types (SlotCount + 31..34) // Date/Time types (SlotCount + 31..34)
public const byte DateTime = SlotCount + 31; // 95 public const byte DateTime = SlotCount + 31; // 95
@ -85,14 +104,29 @@ internal static class BinaryTypeCode
// Property skip marker (SlotCount + 38) // Property skip marker (SlotCount + 38)
public const byte PropertySkip = SlotCount + 38; // 102 — Marks a property with default/null value (skipped during serialization) public const byte PropertySkip = SlotCount + 38; // 102 — Marks a property with default/null value (skipped during serialization)
// FixStr range (UTF-8 short strings): 103..134 (32 values for byte lengths 0-31) // ============================================================================
// FixStr encoding: FixStrBase + byteLength // Reserved/Extended marker range (post 2026-05-06 H2Q6 marker reorg, v3 wire format)
// Saves 1 byte for short strings by combining type + length in single byte. // ============================================================================
// Content semantics: UTF-8 bytes (may be ASCII or multi-byte). The reader-side decoder dispatches //
// on content via the new ASCII variant range below — this range is the "universal short" / UTF-8 lane. // Range 103..134 (32 values) was previously the FixStr range (UTF-8 short, byteLength 0-31).
public const byte FixStrBase = SlotCount + 39; // 103 // FixStr (non-ASCII) markers REMOVED in H2Q6 — non-ASCII strings now use Small/Medium/Big tiers
public const byte FixStrMax = FixStrBase + 31; // 134 // for 1-pass decode (eliminated CountUtf8Chars Pass 1).
public const int FixStrMaxLength = 31; //
// CURRENT ALLOCATION (3 of 32 values in 103..134 used; StringSmall and StringMedium reuse 91 and 94):
public const byte StringBig = SlotCount + 39; // 103 — Non-ASCII tier 3: [marker:1][charLen:32][utf8Len:32][bytes], utf8Len > 65535
public const byte StringInternFirstSmall = SlotCount + 40; // 104 — Interning tier 1: [marker:1][cacheIdx:VarUInt][charLen:8][utf8Len:8][bytes]
public const byte StringInternFirstMedium = SlotCount + 41; // 105 — Interning tier 2: [marker:1][cacheIdx:VarUInt][charLen:16][utf8Len:16][bytes]
// RESERVED (29 values: 106..134): strategic future-feature reservation per BINARY_TODO.md V4N3 marker address space plan:
// 106..121 (16 values): ACCORE-BIN-T-L9Y3 — FixArray short-list count in marker (count 0-15)
// 122..126 (5 values): ACCORE-BIN-T-S5L8 — sentinel-length encoding tiers
// 127..130 (4 values): ACCORE-BIN-T-S2X9 — markerless schema lane opt-in
// 131..134 (4 values): general reserve
//
// Readers MUST throw "unknown marker" on any value in 106..134 until the corresponding feature
// activates within the v3 wire format envelope (no further wire-format break needed).
public const byte ReservedRangeMin = SlotCount + 42; // 106 — first reserved value (post-H2Q6 future-feature range)
public const byte ReservedRangeMax = SlotCount + 70; // 134 — last reserved value
// FixStrAscii range (ASCII-only short strings): 135..166 (32 values for byte lengths 0-31) // FixStrAscii range (ASCII-only short strings): 135..166 (32 values for byte lengths 0-31)
// FixStrAscii encoding: FixStrAsciiBase + byteLength // FixStrAscii encoding: FixStrAsciiBase + byteLength
@ -105,7 +139,7 @@ internal static class BinaryTypeCode
// Long ASCII string marker: 167 // Long ASCII string marker: 167
// Layout: [StringAscii] [VarUInt byteCount] [ASCII bytes] // Layout: [StringAscii] [VarUInt byteCount] [ASCII bytes]
// Counterpart to String (91) which is the universal/UTF-8 long-string marker. // Counterpart to StringSmall/Medium/Big — but ASCII content (charLen == byteCount, no UTF-8 decode).
// Reader fast-widens via byte→char without UTF-8 decode or IsValid scan. // Reader fast-widens via byte→char without UTF-8 decode or IsValid scan.
public const byte StringAscii = SlotCount + 103; // 167 public const byte StringAscii = SlotCount + 103; // 167
@ -139,41 +173,32 @@ internal static class BinaryTypeCode
public static bool IsReference(byte typeCode) => typeCode is StringInterned or ObjectRef; public static bool IsReference(byte typeCode) => typeCode is StringInterned or ObjectRef;
/// <summary> /// <summary>
/// Check if type code is any string-related marker — long inline (String / StringAscii), /// Check if type code is any string-related marker — H2Q6 non-ASCII tiers (Small/Medium/Big),
/// interning markers (StringInterned, StringInternFirst), empty marker, or any FixStr variant /// H2Q6 interning tiers (InternFirstSmall/Medium), interning ref (StringInterned), empty marker,
/// (UTF-8 or ASCII). Centralized predicate so adding/removing string markers requires updating /// or any ASCII variant (FixStrAscii / StringAscii). Centralized predicate so adding/removing
/// only this method, not every dispatch site. /// string markers requires updating only this method, not every dispatch site.
/// </summary> /// </summary>
[MethodImpl(MethodImplOptions.AggressiveInlining)] [MethodImpl(MethodImplOptions.AggressiveInlining)]
public static bool IsString(byte typeCode) public static bool IsString(byte typeCode)
=> (typeCode is >= String and <= StringInternFirst) // 91..94: String, StringInterned, StringEmpty, StringInternFirst => (typeCode is >= StringSmall and <= StringMedium) // 91..94: StringSmall, StringInterned, StringEmpty, StringMedium
|| (typeCode is >= FixStrBase and <= StringAscii); // 103..167: FixStr (UTF-8 short) + FixStrAscii (ASCII short) + StringAscii (ASCII long) || (typeCode is >= StringBig and <= StringInternFirstMedium) // 103..105: StringBig, StringInternFirstSmall, StringInternFirstMedium
|| (typeCode is >= FixStrAsciiBase and <= StringAscii); // 135..167: FixStrAscii + StringAscii
/// <summary> /// <summary>
/// Check if type code is a FixStr (UTF-8 short string with byte length encoded in type code). /// Check if type code is one of the H2Q6 non-ASCII string tier markers (StringSmall / StringMedium / StringBig).
/// Does NOT match FixStrAscii — use <see cref="IsFixStrAscii"/> for that, or <see cref="IsAsciiString"/> /// Excludes interning tier markers (use <see cref="IsStringInternFirst"/>) and ASCII markers (use <see cref="IsAsciiString"/>).
/// for the full ASCII-string range.
/// </summary> /// </summary>
[MethodImpl(MethodImplOptions.AggressiveInlining)] [MethodImpl(MethodImplOptions.AggressiveInlining)]
public static bool IsFixStr(byte typeCode) => typeCode is >= FixStrBase and <= FixStrMax; public static bool IsStringTier(byte typeCode)
=> typeCode == StringSmall || typeCode == StringMedium || typeCode == StringBig;
/// <summary> /// <summary>
/// Decode FixStr byte length from type code. /// Check if type code is a H2Q6 interning first-occurrence tier marker (StringInternFirstSmall / Medium).
/// (Big tier never engages on the interning path — see BinaryTypeCode header comment for rationale.)
/// </summary> /// </summary>
[MethodImpl(MethodImplOptions.AggressiveInlining)] [MethodImpl(MethodImplOptions.AggressiveInlining)]
public static int DecodeFixStrLength(byte typeCode) => typeCode - FixStrBase; public static bool IsStringInternFirst(byte typeCode)
=> typeCode == StringInternFirstSmall || typeCode == StringInternFirstMedium;
/// <summary>
/// Encode FixStr type code for given byte length (0-31). Caller asserts UTF-8 content semantics.
/// </summary>
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static byte EncodeFixStr(int byteLength) => (byte)(FixStrBase + byteLength);
/// <summary>
/// Check if byte length can be encoded as FixStr (UTF-8 short string, 0..31 bytes).
/// </summary>
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static bool CanEncodeAsFixStr(int byteLength) => byteLength is >= 0 and <= 31;
/// <summary> /// <summary>
/// Check if type code is any ASCII string marker — FixStrAscii (short) or StringAscii (long). /// Check if type code is any ASCII string marker — FixStrAscii (short) or StringAscii (long).


@ -2,6 +2,18 @@
Advanced serialization features on top of the wire format. Wire format: `BINARY_FORMAT.md` | Options/presets: `BINARY_OPTIONS.md` | Internal architecture: `BINARY_IMPLEMENTATION.md` | Source generation: `BINARY_SGEN.md`. Advanced serialization features on top of the wire format. Wire format: `BINARY_FORMAT.md` | Options/presets: `BINARY_OPTIONS.md` | Internal architecture: `BINARY_IMPLEMENTATION.md` | Source generation: `BINARY_SGEN.md`.
## Optimization Policy (LLM)
AcBinary is a **general-purpose serializer**, not a benchmark-only implementation.
When proposing or implementing performance work, optimize for broad real-world workloads and maintain balanced trade-offs across:
- mixed payload shapes (small/medium/large/deep)
- language distributions (ASCII-heavy, mixed Latin, multi-byte UTF-8 such as CJK)
- throughput, latency, allocation, and wire size
Do not accept a change solely because one benchmark cell improves. Any optimization should be validated across multiple representative scenarios and must avoid benchmark-specific overfitting.
## Compact Encoding Selection ## Compact Encoding Selection
The serializer applies compact encodings automatically: The serializer applies compact encodings automatically:

File diff suppressed because one or more lines are too long


@ -2,6 +2,12 @@
This page covers planned work for the **binary serializer core** (format, SGen, options, deserialization context, buffer writer). Work specific to the **streaming I/O layer** (`AsyncPipeReaderInput` + `AsyncPipeWriterOutput`, multi-message wire framing, sliding-window buffer, producer-consumer synchronization) is tracked separately in [`BINARY_ASYNCPIPE_TODO.md`](BINARY_ASYNCPIPE_TODO.md). This page covers planned work for the **binary serializer core** (format, SGen, options, deserialization context, buffer writer). Work specific to the **streaming I/O layer** (`AsyncPipeReaderInput` + `AsyncPipeWriterOutput`, multi-message wire framing, sliding-window buffer, producer-consumer synchronization) is tracked separately in [`BINARY_ASYNCPIPE_TODO.md`](BINARY_ASYNCPIPE_TODO.md).
## Optimization policy reminder (LLM)
AcBinary is a universal serializer. Performance TODO execution must avoid benchmark-only overfitting.
For each optimization item, validate gains on multiple representative workloads (ASCII-heavy, mixed Latin, multi-byte UTF-8; small/medium/large/deep payloads) and evaluate throughput + latency + allocation + wire-size together.
## Priority legend ## Priority legend
- **P0** blocker · **P1** important · **P2** nice-to-have · **P3** idea - **P0** blocker · **P1** important · **P2** nice-to-have · **P3** idea
@ -851,12 +857,66 @@ Writer picks the smallest fitting tier; reader dispatches by marker and reads fi
- Performance evaluation target is non-ASCII-heavy data (ASCII-shortcuts intentionally not primary) - Performance evaluation target is non-ASCII-heavy data (ASCII-shortcuts intentionally not primary)
- Wire-format backward compatibility is not required for this development phase - Wire-format backward compatibility is not required for this development phase
### Marker layout decision (2026-05-06)
After analysis of the new "all UTF-8 Magyar" benchmark baseline (`2026-05-06_13-10-30.LLM`; Compact is 5-25% slower than MemPack on every cell):
**Confirmed**: the previous benchmark's Compact-vs-MemPack advantage was an artifact of ASCII property names hitting the `FixStrAscii` / Latin1-widen fast path; once string property values are also UTF-8 Hungarian text, the actual hot path (`EncodeUtf8SinglePass` + two-pass `CountUtf8Chars` + `DecodeUtf8SinglePass`) becomes the bottleneck.
**Marker scope decision** — clean split between ASCII fast path and non-ASCII tier dispatch:
**Unchanged**:
- `FixStrAscii` (≤31 byte ASCII): compact 1-byte header + Latin1 widen, no UTF-8 decode pipeline
- `StringAscii` (>31 byte ASCII): long ASCII fast path, Latin1 widen
- `StringInterned`: 2nd+ occurrence of an interned string (no body, just a cache index, so the 2-pass problem does not apply)
- `StringEmpty`, `Null`: sentinel markers
**Removed (replaced by H2Q6 tiers)**:
- `FixStr` (32 marker values 103-134 — non-ASCII short) → replaced by `StringSmall`
- `String` (1 marker value 91 — non-ASCII long with VarUInt utf8Len) → replaced by `StringSmall` / `StringMedium` / `StringBig`
- `StringInternFirst` (1 marker value 94 — VarUInt utf8Len interning) → replaced by `StringInternFirstSmall` / `StringInternFirstMedium`
**New markers** (5 total):
- `StringSmall` — non-ASCII, `[marker:1][charLen:8][utf8Len:8][bytes]`, utf8Len ≤ 255
- `StringMedium` — non-ASCII, `[marker:1][charLen:16][utf8Len:16][bytes]`, utf8Len ≤ 65535
- `StringBig` — non-ASCII, `[marker:1][charLen:32][utf8Len:32][bytes]`, utf8Len > 65535
- `StringInternFirstSmall``[marker:1][cacheIdx:VarUInt][charLen:8][utf8Len:8][bytes]`
- `StringInternFirstMedium``[marker:1][cacheIdx:VarUInt][charLen:16][utf8Len:16][bytes]`
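A minimal sketch of how a writer could pick the smallest tier after encoding and emit the fixed-width header described above. The class and method names are hypothetical and are not the AcBinary writer; the marker values 91/94/103 are taken from the `EmitReadString` doc comment in this commit, and little-endian header fields are an assumption.

```csharp
using System;
using System.Buffers.Binary;
using System.Text;

// Hypothetical illustration only; not the project's serializer code.
public static class StringTierSketch
{
    public const byte StringSmall = 91;
    public const byte StringMedium = 94;
    public const byte StringBig = 103;

    // Picks the smallest tier whose utf8Len field can hold the encoded byte count.
    public static byte SelectTier(int utf8Len) =>
        utf8Len <= byte.MaxValue ? StringSmall
        : utf8Len <= ushort.MaxValue ? StringMedium
        : StringBig;

    // Produces [marker:1][charLen:N][utf8Len:N][bytes] for a non-ASCII string.
    public static byte[] Write(string value)
    {
        byte[] utf8 = Encoding.UTF8.GetBytes(value);   // encode first, then select the tier
        byte marker = SelectTier(utf8.Length);
        int width = marker == StringSmall ? 1 : marker == StringMedium ? 2 : 4;

        byte[] buffer = new byte[1 + 2 * width + utf8.Length];
        buffer[0] = marker;
        // charLen never exceeds utf8Len, so it always fits the same field width.
        WriteLength(buffer.AsSpan(1, width), value.Length, width);
        WriteLength(buffer.AsSpan(1 + width, width), utf8.Length, width);
        utf8.CopyTo(buffer, 1 + 2 * width);
        return buffer;
    }

    // Assumption: multi-byte header fields are little-endian.
    private static void WriteLength(Span<byte> dest, int length, int width)
    {
        switch (width)
        {
            case 1: dest[0] = (byte)length; break;
            case 2: BinaryPrimitives.WriteUInt16LittleEndian(dest, (ushort)length); break;
            default: BinaryPrimitives.WriteUInt32LittleEndian(dest, (uint)length); break;
        }
    }
}
```

Selecting the tier from `utf8Len` alone is sufficient in this sketch because every UTF-16 code unit contributes at least one UTF-8 byte, so `charLen` cannot overflow a field that holds `utf8Len`.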
**Trade-off justification**:
- Wire cost on short non-ASCII strings: +2 bytes of header per string (3 vs 1) → ~0.07-0.36% wire growth on the Repeated cell (10 short Hungarian strings × 2 bytes / 28 KB)
- CPU saving: `CountUtf8Chars` Pass 1 eliminated on every non-ASCII string decode → directly attacks the +25% Deser baseline gap
- The 2-byte hybrid `FixStr` (non-ASCII) variant (1 byte marker + 1 byte charLen) was considered but **rejected**: marginal wire saving (-1 byte vs StringSmall) does not justify the +1 marker complexity given the tiny absolute wire impact on the Repeated cell. Cleaner to have ASCII-vs-non-ASCII at the marker level (FixStrAscii vs StringSmall/Medium/Big).
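The CPU saving can be illustrated with a reader sketch for the `StringSmall` layout: because `charLen` is in the header, the destination can be sized before decoding and no counting pass over the UTF-8 bytes is needed. This is an assumption-laden illustration, not the project's reader: the class name is hypothetical, it uses the framework's `Encoding.UTF8.GetChars` in place of `DecodeUtf8SinglePass`, and the temporary `char[]` is a simplification that a real implementation would likely avoid.

```csharp
using System;
using System.Text;

// Hypothetical illustration only; not the project's deserializer code.
public static class SinglePassDecodeSketch
{
    private const byte StringSmall = 91; // value from the EmitReadString doc comment

    // Reads [marker:1][charLen:8][utf8Len:8][bytes] and returns the decoded string.
    public static string ReadStringSmall(ReadOnlySpan<byte> wire)
    {
        if (wire[0] != StringSmall)
            throw new InvalidOperationException("not a StringSmall marker");

        int charLen = wire[1];
        int utf8Len = wire[2];
        ReadOnlySpan<byte> body = wire.Slice(3, utf8Len);

        // The header already tells us how many chars to expect, so the buffer is
        // allocated at its final size and the bytes are walked exactly once.
        char[] chars = new char[charLen];
        int written = Encoding.UTF8.GetChars(body, chars);
        if (written != charLen)
            throw new InvalidOperationException("charLen does not match payload");

        return new string(chars);
    }
}
```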
**Interning tier sizing rationale**:
- `MaxStringInternLength` is `byte`-typed (`AcBinarySerializerOptions.cs:125`, default 64, absolute max 255 chars)
- Worst-case: 255 char × 4 byte/char (emoji-only) = 1020 byte → fits in Medium tier (utf8Len ≤ 65535)
- Realistic Hungarian/CJK: 64 char × 2-3 byte = 128-192 byte → Small tier
- **Big tier never engages on the interning path** — only Small + Medium needed (+2 markers, not +3)
### Marker address space reservation (post-H2Q6)
The marker reorg frees **34 marker values** (32 `FixStr` non-ASCII + `String` + `StringInternFirst`). After allocating 5 for H2Q6, **29 values remain free**. Strategic reservation plan to prevent ad-hoc consumption and minimize future wire-format breaks:
| Reserved range | Count | Future feature | Status |
|---|---|---|---|
| `StringSmall` / `StringMedium` / `StringBig` | 3 | H2Q6 Compact tiers | **active (this entry)** |
| `StringInternFirstSmall` / `StringInternFirstMedium` | 2 | H2Q6 interning tiers | **active (this entry)** |
| `FixArrayBase..FixArrayMax` | 16 | `ACCORE-BIN-T-L9Y3` (FixArray short-list count in marker) | reserved, future |
| Sentinel-length string tier markers | ~5 | `ACCORE-BIN-T-S5L8` (sentinel-length encoding) | reserved, future |
| Markerless schema lane | ~4 | `ACCORE-BIN-T-S2X9` (markerless schema lane opt-in) | reserved, future |
| General reserve | 4-8 | unallocated | spare |
**Wire-format version bump**: v2 → v3 at H2Q6 landing. The reserved-but-unimplemented marker values are documented but not yet decoded — readers throw `unknown marker` if wire contains them. Future activation of `FixArray` / sentinel-length / markerless schema lane within the **same v3 wire format** is non-breaking for already-deployed v3 consumers (they reject unknown markers cleanly; producers opt in to emit them).
### Acceptance
- New string markers implemented for Small/Medium/Big tiers + InternFirstSmall/InternFirstMedium tiers
- Deserialize path for these markers performs single-pass decode without `CountUtf8Chars`
- 29 freed marker values strategically reserved per the address-space reservation table; documented in `BinaryTypeCode.cs` with `// Reserved for ACCORE-BIN-T-XXXX (future)` comments
- Wire-format version bump v2 → v3 documented in `BINARY_FORMAT.md`
- Existing round-trip tests pass, plus new boundary tests for tier transitions (utf8Len = 254/255/256/65534/65535/65536) and interning tier transitions
- Benchmark report includes before/after for Compact mode on non-ASCII dataset (Ser/Deser/RT + Size) vs the `2026-05-06_13-10-30.LLM` baseline
## ACCORE-BIN-T-S5L8: Sentinel-length encoding for strings (wire-size optimization, both modes)
**Priority:** P3 · **Type:** Wire-format optimization · **Related:** `AcBinarySerializer.WriteString`, `AcBinaryDeserializer.ReadValue` string dispatch
@ -1211,3 +1271,66 @@ The pair forms a closed dead loop (`WriteFixStrDirect` → `WriteStringUtf8Inter
- Pre-NuGet release housekeeping pass
- Or: any future refactor that touches `BinarySerializationContext` string-write methods (then decide rather than leave the dead pair behind)
## ACCORE-BIN-T-L9Y3: FixArray marker tier — short-list count encoded in marker
**Priority:** P3 · **Type:** Wire-format optimization · **Status:** Open · **Related:** `Array` (66) marker, `VarUInt itemCount`, `ACCORE-BIN-T-H2Q6` marker reservation
Analogous to `FixStr`: a short list count (0-15) is encoded in the marker, eliminating the `VarUInt itemCount` byte for typical DTO collections (Tags, Categories, Items, Properties, Variations, etc.; any list whose size statistically lands in the 0-15 range).
### Wire format
**Current**: `[Array marker:1][VarUInt itemCount][items]` — header 2-6 byte
**FixArray**: `[FixArrayBase + N marker:1][items]` — header 1 byte (N = item count, 0-15)
Writer dispatch (in `WriteArray` / scan-pass list-writer equivalents):
- `itemCount ≤ 15` → `FixArrayBase + itemCount` marker (1 byte total header)
- `itemCount > 15` → existing `Array` marker + `VarUInt` count (2-6 byte total header)
### Marker reservation
**16 marker values** pre-reserved in the post-H2Q6 marker layout (see `ACCORE-BIN-T-H2Q6` "Marker address space reservation" table). The reservation guarantees that activating FixArray does NOT require another wire-format-version bump after H2Q6 lands at v3 — producers opt in to emit FixArray markers within the same v3 envelope, consumers extend their dispatch to decode them.
Activation steps when implementing:
1. Allocate `FixArrayBase` (16 contiguous values from the H2Q6-freed range)
2. Add `IsFixArray(byte marker)`, `DecodeFixArrayCount(byte marker)`, `EncodeFixArray(int count)` helpers in `BinaryTypeCode.cs`
3. Writer: branch in `WriteArray` and equivalent ScanPass list-writers, emit FixArray for `count ≤ 15`
4. Reader: extend marker dispatch in `ReadValue` / `SkipValue` / `ReadArray`
5. SGen: regenerate readers/writers with `IsFixArray` dispatch in the array-typed property paths
6. Round-trip tests for boundary `itemCount` values: 0, 1, 14, 15, 16, 17 (last tier transition)
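The helpers named in step 2 could look roughly like the sketch below. It is a hedged illustration, not the planned implementation: the entry does not assign `FixArrayBase` yet, so the value 106 here is only a placeholder chosen for the example, and the wrapping class name is hypothetical.

```csharp
using System;

// Hypothetical illustration only; BinaryTypeCode.cs may shape these differently.
public static class FixArraySketch
{
    // Placeholder: the real base value is not decided in this entry.
    public const byte FixArrayBase = 106;
    public const byte FixArrayMax = FixArrayBase + 15; // 16 contiguous marker values

    // True when the marker falls inside the 16-value FixArray range.
    public static bool IsFixArray(byte marker) =>
        marker >= FixArrayBase && marker <= FixArrayMax;

    // Recovers the item count (0-15) carried by a FixArray marker.
    public static int DecodeFixArrayCount(byte marker) => marker - FixArrayBase;

    // Builds the marker for a short list; longer lists must use Array + VarUInt.
    public static byte EncodeFixArray(int count)
    {
        if ((uint)count > 15)
            throw new ArgumentOutOfRangeException(nameof(count), "FixArray holds 0-15 items");
        return (byte)(FixArrayBase + count);
    }
}
```

A writer would call `EncodeFixArray` only after checking `count ≤ 15` and fall back to the existing `Array` marker otherwise, which matches the boundary values listed in step 6.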
### Why P3
- **Wire saving**: -1 byte per short list. Realistic per-cell estimates:
- **Repeated** (10 OrderItem, ~50 lists overall): ~50 byte / 28 KB = **~0.18%** wire reduction (marginal)
- **Large** (5×5×5×10 nested, ~6000 lists): ~6 KB / 118 KB = **~5%** wire reduction ✓
- **Medium**: ~500 byte / 21 KB = **~2.4%** wire reduction
- **Deep** (2×4×4×8 nested): similar to Medium, ~2-3% wire reduction
- **CPU saving**: marginal (~1-2 ns/list — `VarUInt` short-loop replaced by 1-byte marker decode). NOT a hot-path mover for the current Repeated-cell baseline gap.
- **Release-narrative value**: complements the post-H2Q6 wire-size advantage, particularly on deep-nested structures (Large benchmark). Sharpens the "smallest AND fastest" claim once the CPU gap closes via V4N2 Phase 3 + V4N4.
### Why not P2/P1 — and why not now
- The current `2026-05-06_13-10-30.LLM` baseline's primary problem is **CPU** (Compact +5-25% slower than MemPack on every cell), NOT wire size. FixArray addresses wire size, marginal CPU.
- Activation **after** H2Q6 + V4N2 Phase 3 + V4N4 is the natural sequence: CPU gap closes first, then wire-saver features sharpen the release narrative.
- The marker reservation lets us defer activation indefinitely without losing the address-space slot.
### Acceptance
- 16 marker values allocated in `BinaryTypeCode.cs` (`FixArrayBase..FixArrayMax`) with `IsFixArray`, `DecodeFixArrayCount`, `EncodeFixArray` helpers
- Writer + reader dispatch with boundary tests (count = 0, 1, 14, 15, 16, 17)
- SGen-regenerated readers/writers correctly dispatch via `IsFixArray` for array-typed properties
- Round-trip tests pass, no Ser/Deser regression vs current `Array` path
- Wire-size benchmark: at least 2% reduction on Medium, at least 3% on Deep, at least 4% on Large, no regression on any cell
- Documentation update in `BINARY_FORMAT.md` (new marker range + dispatch rules)
### Trigger
- After `ACCORE-BIN-T-H2Q6` lands (marker reservation must be active first)
- After CPU gap closes (V4N2 Phase 3 + V4N4) — wire-saver value clearer once "fast" is settled
- Pre-NuGet release housekeeping for the wire-size narrative (along with `S5L8` / `S2X9` if their scope justifies)
### Future extension (not part of this entry)
- **`FixDict` analog**: same pattern for the `Dictionary` marker (67) with `kvCount` 0-15. Worth considering only if a benchmark workload demonstrates dictionary-heavy structures; the current bench data (Order DTOs) does not. **Defer until evidence.**
- **`FixArray 0-31`**: wider count range (32 markers). Marginal additional saving (lists of 16-31 elements are rare); would consume nearly all freed marker space, leaving no slack for `S5L8`/`S2X9`. **Reject unless evidence warrants.**