Reference first
Scalar/reference operations remain the correctness anchor. Optimized paths should be admitted through parity evidence rather than labels alone.
Reference, dispatched, half-precision, and quantized managed CPU kernels with parity evidence.
Required for managed CPU math.
UAIX.LmRuntime.Kernels.Cpu
Reference, portable-vector, AVX2-aware, half-precision, and quantized CPU kernels with explicit dispatch and parity checks.
Scalar, Vector<T>, and intrinsic-ready CPU kernels for pure C# local LLM runtime inference.
dotnet add package UAIX.LmRuntime.Kernels.Cpu
<PackageReference Include="UAIX.LmRuntime.Kernels.Cpu" />
Version policy: The documentation deliberately omits UAIX.LmRuntime package version numbers. Resolve and pin versions through your normal dependency-management and lock-file process.
CpuKernelDispatcher reports requested tier, selected tier, operation, and rationale so fallback behavior is not hidden.
Quantized kernels depend on exact block layout, row shape, byte length, alignment, and destination bounds supplied by tensor/storage metadata.
These are the main public entry points. The generated reference below includes the documented public package surface.
CpuKernelDispatcher, CpuKernelTier, CpuKernelSelection, ReferenceCpuKernels, QuantizedCpuKernels, KQuantizedCpuKernels, Scalar16CpuKernels, QuantizedKernelParityRunner, ReferenceMatrixRowDispatcher
Examples use the documented public package surface. Paths, identities, runtime identifiers, device evidence, and application policy remain host inputs.
Request the best available implemented tier and retain the actual selection evidence.
using System;
using UAIX.LmRuntime.Kernels.Cpu;
ReadOnlySpan<float> left = [1.0f, 2.0f, 3.0f, 4.0f];
ReadOnlySpan<float> right = [0.5f, 0.25f, -1.0f, 2.0f];
// Request Auto and keep the selection record the dispatcher returns.
float value = CpuKernelDispatcher.DotFloat32(
    left,
    right,
    CpuKernelTier.Auto,
    out CpuKernelSelection selection);
Console.WriteLine($"Value: {value}");
// Print what was requested, what actually ran, and why.
Console.WriteLine(
    $"{selection.RequestedTier} -> {selection.SelectedTier}: {selection.Reason}");
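To make the idea of visible fallback concrete, here is a minimal sketch of how a selection record with an explicit rationale could be produced. It is not the package's implementation: `SketchTier`, `SketchSelection`, `TierSelectionSketch`, and the `portableVectorUsable` flag are hypothetical names introduced only for illustration, and the real dispatcher may apply different rules.

```csharp
using System;

/// <summary>Hypothetical tiers for this sketch only; not the package's CpuKernelTier.</summary>
public enum SketchTier { Auto, Scalar, PortableVector }

/// <summary>Hypothetical evidence record: what was asked for, what ran, and why.</summary>
public readonly record struct SketchSelection(
    SketchTier Requested,
    SketchTier Selected,
    string Reason);

public static class TierSelectionSketch
{
    /// <summary>
    /// Resolves a requested tier to a usable one and records the rationale,
    /// so a fallback to scalar is never silent.
    /// </summary>
    public static SketchSelection Select(SketchTier requested, bool portableVectorUsable)
    {
        switch (requested)
        {
            case SketchTier.Scalar:
                return new SketchSelection(requested, SketchTier.Scalar,
                    "Scalar was requested.");
            case SketchTier.PortableVector when portableVectorUsable:
                return new SketchSelection(requested, SketchTier.PortableVector,
                    "Portable vector was requested and is usable.");
            case SketchTier.PortableVector:
                return new SketchSelection(requested, SketchTier.Scalar,
                    "Portable vector is not usable; fell back to scalar.");
            case SketchTier.Auto when portableVectorUsable:
                return new SketchSelection(requested, SketchTier.PortableVector,
                    "Auto chose portable vector as the highest usable tier.");
            case SketchTier.Auto:
                return new SketchSelection(requested, SketchTier.Scalar,
                    "Auto chose scalar as the only usable tier.");
            default:
                throw new ArgumentOutOfRangeException(nameof(requested));
        }
    }
}
```

A host could pass `System.Numerics.Vector.IsHardwareAccelerated` as `portableVectorUsable`. Keeping the requested tier in the record is what lets a test distinguish "ran scalar because it was asked to" from "ran scalar because nothing else was usable".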
Provide row-major matrix values, explicit dimensions, and caller-owned output storage.
using System;
using UAIX.LmRuntime.Kernels.Cpu;
const int rowCount = 2;
const int columnCount = 3;
// Row-major storage: row 0 is the first three values, row 1 the next three.
ReadOnlySpan<float> matrix =
[
    1.0f, 2.0f, 3.0f,
    4.0f, 5.0f, 6.0f
];
ReadOnlySpan<float> vector = [0.5f, 1.0f, -0.5f];
// Caller-owned output: one value per matrix row.
Span<float> output = stackalloc float[rowCount];
CpuKernelDispatcher.MatVecFloat32(
    matrix,
    rowCount,
    columnCount,
    vector,
    output,
    CpuKernelTier.Auto,
    out CpuKernelSelection selection);
Console.WriteLine(
    $"{selection.RequestedTier} -> {selection.SelectedTier}: {selection.Reason}");
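For orientation, the row-major contract used above can be sketched as a plain scalar loop. This is an illustrative reference written for this guide, not the package's source; `ScalarMatVecSketch` and its validation rules are assumptions.

```csharp
using System;

public static class ScalarMatVecSketch
{
    /// <summary>Plain scalar dot product accumulated left to right.</summary>
    public static float Dot(ReadOnlySpan<float> left, ReadOnlySpan<float> right)
    {
        if (left.Length != right.Length)
        {
            throw new ArgumentException("Vectors must have equal length.");
        }

        float sum = 0f;
        for (int i = 0; i < left.Length; i++)
        {
            sum += left[i] * right[i];
        }

        return sum;
    }

    /// <summary>
    /// Row-major matrix-vector product: row r occupies
    /// matrix[r * columnCount .. (r + 1) * columnCount).
    /// </summary>
    public static void MatVec(
        ReadOnlySpan<float> matrix,
        int rowCount,
        int columnCount,
        ReadOnlySpan<float> vector,
        Span<float> output)
    {
        if (rowCount < 0 || columnCount < 0)
        {
            throw new ArgumentOutOfRangeException(nameof(rowCount), "Dimensions must be non-negative.");
        }

        if (matrix.Length != (long)rowCount * columnCount)
        {
            throw new ArgumentException("Matrix length must equal rowCount * columnCount.");
        }

        if (vector.Length != columnCount)
        {
            throw new ArgumentException("Vector length must equal columnCount.");
        }

        if (output.Length < rowCount)
        {
            throw new ArgumentException("Output must hold one value per row.");
        }

        for (int row = 0; row < rowCount; row++)
        {
            output[row] = Dot(matrix.Slice(row * columnCount, columnCount), vector);
        }
    }
}
```

With the 2 x 3 matrix and the vector from the example above, this loop writes 1.0 and 4.0 into the output. An optimized tier may sum in a different order, which is why comparisons against a scalar reference normally use a tolerance.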
Make tolerance and maximum deviation visible in a kernel-admission test.
using System;
using UAIX.LmRuntime.Kernels.Cpu;
ReadOnlySpan<float> reference = [1.0f, 2.0f, 3.0f];
ReadOnlySpan<float> actual = [1.0f, 2.000001f, 2.999999f];
// Compare the candidate output to the reference with an explicit absolute tolerance.
QuantizedKernelParityReport report =
    QuantizedKernelParityRunner.CompareAgainstReference(
        reference,
        actual,
        tolerance: 1e-5f);
if (!report.Passed)
{
    // Surface the observed deviation instead of a bare pass/fail flag.
    throw new InvalidOperationException(
        $"Kernel parity failed: max error {report.MaxAbsoluteError}");
}
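The check above reports a pass flag and a maximum absolute error. As a rough sketch of what such a comparison involves, the helper below computes both values by hand. `ParitySketch` is a hypothetical name, and the handling of NaN and of negative tolerances shown here is an assumption, not documented package behavior.

```csharp
using System;

public static class ParitySketch
{
    /// <summary>
    /// Compares two output vectors element by element and reports the largest
    /// absolute deviation together with a pass/fail verdict.
    /// </summary>
    public static (bool Passed, float MaxAbsoluteError) Compare(
        ReadOnlySpan<float> reference,
        ReadOnlySpan<float> actual,
        float tolerance)
    {
        if (reference.Length != actual.Length)
        {
            throw new ArgumentException("Both vectors must have the same length.");
        }

        // Written as a negated >= so that NaN tolerances are rejected too.
        if (!(tolerance >= 0f))
        {
            throw new ArgumentOutOfRangeException(nameof(tolerance));
        }

        float maxError = 0f;
        for (int i = 0; i < reference.Length; i++)
        {
            float error = MathF.Abs(reference[i] - actual[i]);

            // Assumption: a NaN difference fails immediately, because NaN
            // would otherwise slip past every ordered comparison.
            if (float.IsNaN(error))
            {
                return (false, float.NaN);
            }

            if (error > maxError)
            {
                maxError = error;
            }
        }

        return (maxError <= tolerance, maxError);
    }
}
```

Returning the maximum error alongside the verdict keeps the margin visible, so a kernel that passes at 1e-5 with an observed error of 9e-6 can be told apart from one that passes with an error of zero.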
Use the GGML tensor type and explicit dimensions to select the matching managed row implementation.
using System;
using UAIX.LmRuntime.Kernels.Cpu;
using UAIX.LmRuntime.Tensors;
public static class QuantizedMatrixExample
{
    /// <summary>
    /// Multiplies a Q4_0 matrix by a float activation vector through the reference row dispatcher.
    /// </summary>
    /// <param name="encodedMatrixBytes">The complete encoded matrix storage.</param>
    /// <param name="rowCount">The number of matrix rows.</param>
    /// <param name="columnCount">The number of matrix columns and activation values.</param>
    /// <param name="activations">The float activation vector.</param>
    /// <returns>A newly allocated output vector containing one value per row.</returns>
    public static float[] MultiplyQ4_0(
        ReadOnlySpan<byte> encodedMatrixBytes,
        int rowCount,
        int columnCount,
        ReadOnlySpan<float> activations)
    {
        var output = new float[rowCount];
        ReferenceMatrixRowDispatcher.MatVec(
            GgmlTensorType.Q4_0,
            encodedMatrixBytes,
            rowCount,
            columnCount,
            activations,
            output);
        return output;
    }
}
Boundary: The encoded buffer must contain complete blocks for the requested storage type and logical row shape.
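The complete-blocks requirement can be checked before any kernel runs. The sketch below computes the expected byte count for Q4_0 and Q8_0 storage, assuming the GGML layout of 32 values per block with a 2-byte half scale (18 bytes per Q4_0 block, 34 bytes per Q8_0 block) and assuming each row is stored as a contiguous run of whole blocks. `QuantizedStorageSketch` and its members are hypothetical; within the package, the documented `ReferenceMatrixRowDispatcher.Describe` method and its `RequiredByteCount` property are the authoritative source for this number.

```csharp
using System;

public static class QuantizedStorageSketch
{
    /// <summary>Logical values per Q4_0 or Q8_0 block (GGML layout assumption).</summary>
    public const int BlockElementCount = 32;

    /// <summary>2-byte half scale + 32 four-bit values packed into 16 bytes.</summary>
    public const int Q4_0BlockBytes = 2 + 16;

    /// <summary>2-byte half scale + 32 signed bytes.</summary>
    public const int Q8_0BlockBytes = 2 + 32;

    /// <summary>Computes the exact storage size for a block-quantized row-major matrix.</summary>
    public static long RequiredByteCount(int rowCount, int columnCount, int blockBytes)
    {
        if (rowCount <= 0 || columnCount <= 0 || blockBytes <= 0)
        {
            throw new ArgumentOutOfRangeException(nameof(rowCount), "Arguments must be positive.");
        }

        // A row that ends mid-block cannot be decoded, so reject it up front.
        if (columnCount % BlockElementCount != 0)
        {
            throw new ArgumentException("Each row must hold complete 32-value blocks.");
        }

        long blocksPerRow = columnCount / BlockElementCount;
        return checked(rowCount * blocksPerRow * blockBytes);
    }

    /// <summary>Throws when the encoded buffer is not exactly the required size.</summary>
    public static void EnsureExactLength(
        ReadOnlySpan<byte> encoded,
        int rowCount,
        int columnCount,
        int blockBytes)
    {
        long required = RequiredByteCount(rowCount, columnCount, blockBytes);
        if (encoded.Length != required)
        {
            throw new ArgumentException(
                $"Expected {required} bytes but received {encoded.Length}.");
        }
    }
}
```

For example, a 2 x 64 Q4_0 matrix needs 2 rows x 2 blocks x 18 bytes = 72 bytes under these assumptions. Validating the length at the boundary keeps a short buffer from surfacing later as an out-of-range read inside a kernel.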
Each type below lists its documented public fields, properties, constructors, methods, parameter descriptions, and return descriptions.
Q4_1Dequantizer (UAIX.LmRuntime.Kernels.Cpu)
1 member
Dequantizes Q4_1 blocks for scalar reference parity.
DequantizeBlock(System.ReadOnlySpan<byte>,System.Span<float>)
Dequantizes one Q4_1 block into destination floats.
Parameters: source, destination
Q5_0Dequantizer (UAIX.LmRuntime.Kernels.Cpu)
1 member
Dequantizes Q5_0 blocks for scalar reference parity.
DequantizeBlock(System.ReadOnlySpan<byte>,System.Span<float>)
Dequantizes one Q5_0 block into destination floats.
Parameters: source, destination
Q6_KDequantizer (UAIX.LmRuntime.Kernels.Cpu)
1 member
Dequantizes Q6_K blocks for scalar reference parity.
DequantizeBlock(System.ReadOnlySpan<byte>,System.Span<float>)
Dequantizes one exact GGML Q6_K block into destination floats.
Parameters: source, destination
FusedQuantizedDotProduct (UAIX.LmRuntime.Kernels.Cpu)
1 member
Provides fused dequantize-and-dot reference kernels.
Dot(UAIX.LmRuntime.Tensors.GgmlTensorType,System.ReadOnlySpan<byte>,System.ReadOnlySpan<float>)
Computes a dot product between a quantized block and float activations.
Parameters: type, block, activations
Returns: The float value computed by FusedQuantizedDotProduct.Dot. Range, finite-value, and overflow checks are completed before the value is returned.
QuantizedKernelParityReport (UAIX.LmRuntime.Kernels.Cpu)
2 members
Represents the result of comparing an optimized quantized kernel to a reference kernel.
Passed
Gets a value indicating whether outputs are within tolerance.
MaxAbsoluteError
Gets the maximum absolute error observed.
QuantizedKernelParityRunner (UAIX.LmRuntime.Kernels.Cpu)
1 member
Compares quantized kernels against scalar references.
CompareAgainstReference(System.ReadOnlySpan<float>,System.ReadOnlySpan<float>,float)
Compares two output vectors with an absolute tolerance.
Parameters: reference, actual, tolerance
Returns: The QuantizedKernelParityReport result produced by QuantizedKernelParityRunner.CompareAgainstReference. It is published only after all documented validation and ownership transitions succeed.
Avx2Float32Kernels (UAIX.LmRuntime.Kernels.Cpu)
2 members
Provides dedicated AVX2 float32 correctness kernels with scalar tails.
These kernels are selected only when AVX2 is explicitly requested and supported. Scalar implementations remain the numerical authority, and no throughput claim is implied until executed benchmark evidence exists.
Dot(System.ReadOnlySpan<float>,System.ReadOnlySpan<float>)
Computes a float32 dot product with AVX/FMA vector arithmetic and a scalar tail.
Parameters: left, right
Returns: The float value computed by Avx2Float32Kernels.Dot. Range, finite-value, and overflow checks are completed before the value is returned.
MatVec(System.ReadOnlySpan<float>,int,int,System.ReadOnlySpan<float>,System.Span<float>)
Computes a row-major float32 matrix-vector product by reusing the dedicated AVX2 dot kernel.
Parameters: matrix, rowCount, columnCount, vector, output
CpuKernelTier (UAIX.LmRuntime.Kernels.Cpu)
6 members
Identifies a managed CPU kernel implementation tier.
Auto
Selects the highest supported tier implemented for the requested operation.
Scalar
Uses the scalar correctness implementation.
PortableVector
Uses portable Vector<T> operations.
Avx2
Uses an AVX2 implementation when the operation provides one.
Avx512
Uses an AVX-512 implementation when the operation provides one.
AdvSimd
Uses an ARM64 AdvSimd implementation when the operation provides one.
CpuKernelSelection (UAIX.LmRuntime.Kernels.Cpu)
4 members
Describes the requested and selected CPU kernel tier for one operation.
RequestedTier
Gets the requested tier.
SelectedTier
Gets the selected tier.
Operation
Gets the operation name.
Reason
Gets the stable selection rationale.
CpuKernelDispatcher (UAIX.LmRuntime.Kernels.Cpu)
6 members
Dispatches correctness-first CPU kernels through explicitly selectable implementation tiers.
Scalar implementations remain the numerical authority. Portable and architecture-specific paths are additive and can always be bypassed by requesting CpuKernelTier.Scalar.
SelectFloat32DotTier(UAIX.LmRuntime.Kernels.Cpu.CpuKernelTier)
Selects the implemented tier for a float32 dot product.
Parameters: requestedTier
Returns: The CpuKernelSelection result produced by CpuKernelDispatcher.SelectFloat32DotTier. It is published only after all documented validation and ownership transitions succeed.
DotFloat32(System.ReadOnlySpan<float>,System.ReadOnlySpan<float>,UAIX.LmRuntime.Kernels.Cpu.CpuKernelTier,UAIX.LmRuntime.Kernels.Cpu.CpuKernelSelection&)
Computes a float32 dot product through the selected implementation tier.
Parameters: left, right, requestedTier, selection
Returns: The float value computed by CpuKernelDispatcher.DotFloat32. Range, finite-value, and overflow checks are completed before the value is returned.
MatVecFloat32(System.ReadOnlySpan<float>,int,int,System.ReadOnlySpan<float>,System.Span<float>,UAIX.LmRuntime.Kernels.Cpu.CpuKernelTier,UAIX.LmRuntime.Kernels.Cpu.CpuKernelSelection&)
Computes a row-major float32 matrix-vector product through the selected implementation tier.
Parameters: matrix, rowCount, columnCount, vector, output, requestedTier, selection
RmsNorm(System.ReadOnlySpan<float>,System.ReadOnlySpan<float>,System.Span<float>,float,UAIX.LmRuntime.Kernels.Cpu.CpuKernelTier,UAIX.LmRuntime.Kernels.Cpu.CpuKernelSelection&)
Applies RMS normalization using the selected scalar or portable-vector accumulation tier.
Parameters: input, weight, output, epsilon, requestedTier, selection
DotQ8_0(System.ReadOnlySpan<byte>,System.ReadOnlySpan<float>,UAIX.LmRuntime.Kernels.Cpu.CpuKernelTier,UAIX.LmRuntime.Kernels.Cpu.CpuKernelSelection&)
Computes a Q8_0 block dot product through a scalar or portable-vector correctness path.
Parameters: block, activations, requestedTier, selection
Returns: The float value computed by CpuKernelDispatcher.DotQ8_0. Range, finite-value, and overflow checks are completed before the value is returned.
DotQ4_0(System.ReadOnlySpan<byte>,System.ReadOnlySpan<float>,UAIX.LmRuntime.Kernels.Cpu.CpuKernelTier,UAIX.LmRuntime.Kernels.Cpu.CpuKernelSelection&)
Computes a Q4_0 block dot product through a scalar or portable-vector correctness path.
Parameters: block, activations, requestedTier, selection
Returns: The float value computed by CpuKernelDispatcher.DotQ4_0. Range, finite-value, and overflow checks are completed before the value is returned.
Q4KBlock (UAIX.LmRuntime.Kernels.Cpu)
4 members
Defines the audited packed GGML Q4_K block layout for 256 logical values.
Scale
Gets or sets the common little-endian IEEE half scale.
MinimumScale
Gets or sets the common little-endian IEEE half minimum scale.
ScaleMinimums
Stores eight packed 6-bit scales and eight packed 6-bit minimum factors.
QuantizedValues
Stores 256 four-bit quants in 128 bytes.
Q6KBlock (UAIX.LmRuntime.Kernels.Cpu)
4 members
Defines the audited packed GGML Q6_K block layout for 256 logical values.
LowBits
Stores the lower four bits for 256 quants.
HighBits
Stores the upper two bits for 256 quants.
Scales
Stores sixteen signed sub-block scales.
Scale
Gets or sets the common little-endian IEEE half scale.
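The field descriptions for the two K-quant blocks are enough to work out their physical sizes. The arithmetic below is a sketch with hypothetical names (`KQuantLayoutSketch` and its constants). It assumes each of the sixteen Q6_K sub-block scales occupies one byte, as in the GGML layout; the package's own `Q4KBlockByteCount` and `Q6KBlockByteCount` members remain the authoritative values.

```csharp
using System;

public static class KQuantLayoutSketch
{
    /// <summary>Logical values in one K-quant block.</summary>
    public const int BlockElementCount = 256;

    /// <summary>
    /// Q4_K: half scale (2) + half minimum scale (2)
    /// + sixteen 6-bit factors (16 * 6 / 8 = 12)
    /// + 256 four-bit quants (256 * 4 / 8 = 128).
    /// </summary>
    public const int Q4KBlockBytes =
        2 + 2 + (16 * 6 / 8) + (BlockElementCount * 4 / 8);

    /// <summary>
    /// Q6_K: lower four bits (256 * 4 / 8 = 128)
    /// + upper two bits (256 * 2 / 8 = 64)
    /// + sixteen one-byte sub-block scales (16, assumed int8)
    /// + half scale (2).
    /// </summary>
    public const int Q6KBlockBytes =
        (BlockElementCount * 4 / 8) + (BlockElementCount * 2 / 8) + 16 + 2;

    /// <summary>Average storage cost per logical value, in bits.</summary>
    public static double BitsPerValue(int blockBytes) =>
        blockBytes * 8.0 / BlockElementCount;
}
```

Under these assumptions a Q4_K block is 144 bytes (4.5 bits per value) and a Q6_K block is 210 bytes (6.5625 bits per value). The overhead above the nominal 4 or 6 bits is the cost of the per-block and per-sub-block scale fields.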
KQuantizedBlockLayout (UAIX.LmRuntime.Kernels.Cpu)
4 members
Describes one audited K-quantized block layout.
Format
Gets the format name.
ElementCount
Gets the logical element count.
ByteCount
Gets the physical byte count.
LayoutDescription
Gets the audited layout statement.
KQuantizedCpuKernels (UAIX.LmRuntime.Kernels.Cpu)
17 members
Provides correctness-first scalar GGML Q4_K and Q6_K block kernels.
These methods operate on one exact 256-element block and never materialize a complete model matrix. All scale fields are interpreted as little-endian IEEE half values because current direct K-quant execution is limited to little-endian GGUF storage.
BlockElementCount
Gets the number of logical values in one K-quant block.
Q4KBlockByteCount
Gets the exact Q4_K block byte count.
Q6KBlockByteCount
Gets the exact Q6_K block byte count.
Q4_KBlockBytes
Gets the historical Q4_K block-byte constant retained for source compatibility.
Q6_KBlockBytes
Gets the historical Q6_K block-byte constant retained for source compatibility.
Q4KLayout
Gets the audited Q4_K block layout.
Q6KLayout
Gets the audited Q6_K block layout.
DequantizeQ4K(System.ReadOnlySpan<byte>,System.Span<float>)
Dequantizes one exact Q4_K block into a caller-owned destination.
Parameters: block, destination
DotQ4K(System.ReadOnlySpan<byte>,System.ReadOnlySpan<float>)
Computes an allocation-free dot product for one exact Q4_K block.
Parameters: block, activations
Returns: The float value computed by KQuantizedCpuKernels.DotQ4K. Range, finite-value, and overflow checks are completed before the value is returned.
DequantizeQ6K(System.ReadOnlySpan<byte>,System.Span<float>)
Dequantizes one exact Q6_K block into a caller-owned destination.
Parameters: block, destination
DotQ6K(System.ReadOnlySpan<byte>,System.ReadOnlySpan<float>)
Computes an allocation-free dot product for one exact Q6_K block.
Parameters: block, activations
Returns: The float value computed by KQuantizedCpuKernels.DotQ6K. Range, finite-value, and overflow checks are completed before the value is returned.
DequantizeQ4_K(System.ReadOnlySpan<byte>,System.Span<float>)
Dequantizes one Q4_K block using the historical method name retained for source compatibility.
Parameters: block, destination
DequantizeQ6_K(System.ReadOnlySpan<byte>,System.Span<float>)
Dequantizes one Q6_K block using the historical method name retained for source compatibility.
Parameters: block, destination
MatVecQ4_K(System.ReadOnlySpan<byte>,int,int,System.ReadOnlySpan<float>,System.Span<float>)
Computes a Q4_K matrix-vector product using the historical method name retained for source compatibility.
Parameters: matrix, rowCount, columnCount, activations, output
MatVecQ6_K(System.ReadOnlySpan<byte>,int,int,System.ReadOnlySpan<float>,System.Span<float>)
Computes a Q6_K matrix-vector product using the historical method name retained for source compatibility.
Parameters: matrix, rowCount, columnCount, activations, output
MatVecQ4K(System.ReadOnlySpan<byte>,int,int,System.ReadOnlySpan<float>,System.Span<float>)
Computes a row-major Q4_K matrix-vector product without whole-matrix dequantization.
Parameters: matrix, rowCount, columnCount, activations, output
MatVecQ6K(System.ReadOnlySpan<byte>,int,int,System.ReadOnlySpan<float>,System.Span<float>)
Computes a row-major Q6_K matrix-vector product without whole-matrix dequantization.
Parameters: matrix, rowCount, columnCount, activations, output
QuantizedCpuKernels (UAIX.LmRuntime.Kernels.Cpu)
9 members
Provides allocation-free scalar correctness kernels for high-value GGML quantization formats.
Q4_0BlockBytes
Gets the byte length of a Q4_0 block.
Q8_0BlockBytes
Gets the byte length of a Q8_0 block.
BlockElementCount
Gets the logical element count in a Q4_0 or Q8_0 block.
DequantizeQ4_0(System.ReadOnlySpan<byte>,System.Span<float>)
Dequantizes one little-endian Q4_0 block into float32 values.
Parameters: block, destination
DequantizeQ8_0(System.ReadOnlySpan<byte>,System.Span<float>)
Dequantizes one little-endian Q8_0 block into float32 values.
Parameters: block, destination
DotQ4_0(System.ReadOnlySpan<byte>,System.ReadOnlySpan<float>)
Computes an allocation-free dequantize-and-dot operation for one Q4_0 block.
Parameters: block, activations
Returns: The float value computed by QuantizedCpuKernels.DotQ4_0. Range, finite-value, and overflow checks are completed before the value is returned.
DotQ8_0(System.ReadOnlySpan<byte>,System.ReadOnlySpan<float>)
Computes an allocation-free dequantize-and-dot operation for one Q8_0 block.
Parameters: block, activations
Returns: The float value computed by QuantizedCpuKernels.DotQ8_0. Range, finite-value, and overflow checks are completed before the value is returned.
MatVecQ4_0(System.ReadOnlySpan<byte>,int,int,System.ReadOnlySpan<float>,System.Span<float>)
Computes a row-major Q4_0 matrix-vector product without materializing full-precision rows.
Parameters: matrix, rowCount, columnCount, activations, output
MatVecQ8_0(System.ReadOnlySpan<byte>,int,int,System.ReadOnlySpan<float>,System.Span<float>)
Computes a row-major Q8_0 matrix-vector product without materializing full-precision rows.
Parameters: matrix, rowCount, columnCount, activations, output
ReferenceCpuKernels (UAIX.LmRuntime.Kernels.Cpu)
6 members
Provides scalar and portable CPU reference kernels for correctness anchoring.
SoftmaxInPlace(System.Span<float>)
Computes softmax probabilities for the supplied values in place using numerically stable normalization.
Parameters: values
ApplyRopeInPlace(System.Span<float>,System.ReadOnlySpan<float>,System.ReadOnlySpan<float>,int)
Applies RoPE rotation to one query or key vector in place using precomputed sine and cosine values.
Parameters: vector, cos, sin, ropeDimensions
Softmax(System.Span<float>)
Computes softmax probabilities for the supplied values using numerically stable normalization.
Parameters: values
ApplyRope(System.Span<float>,int,float)
Applies RoPE rotation using generated trigonometric tables for the supplied position.
Parameters: vector, position, theta
RmsNorm(System.ReadOnlySpan<float>,System.ReadOnlySpan<float>,System.Span<float>,float)
Applies RMS normalization using the shared vector math implementation.
Parameters: input, weight, output, epsilon
MatVec(System.ReadOnlySpan<float>,int,int,System.ReadOnlySpan<float>,System.Span<float>)
Computes a matrix-vector product for row-major float32 weights.
Parameters: matrix, rowCount, columnCount, vector, output
Q4_0Block (UAIX.LmRuntime.Kernels.Cpu)
2 members
Defines the exact packed Q4_0 block layout used by GGML storage.
Scale
Gets or sets the little-endian IEEE half scale field.
QuantizedValues
Stores 32 signed 4-bit values in 16 packed bytes.
Q8_0Block (UAIX.LmRuntime.Kernels.Cpu)
2 members
Defines the exact packed Q8_0 block layout used by GGML storage.
Scale
Gets or sets the little-endian IEEE half scale field.
QuantizedValues
Stores 32 signed 8-bit values.
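To show how a block layout like this turns into arithmetic, here is a minimal scalar sketch of Q8_0 dequantization and a fused dot product. It assumes the GGML Q8_0 layout of a little-endian IEEE half scale in bytes 0 and 1 followed by 32 signed bytes, with each value decoded as scale times quant. `Q8_0Sketch` is a hypothetical name and this is not the package's kernel source.

```csharp
using System;
using System.Buffers.Binary;

public static class Q8_0Sketch
{
    public const int BlockElementCount = 32;
    public const int BlockBytes = 2 + BlockElementCount;

    /// <summary>Reads the little-endian binary16 scale stored in bytes 0 and 1.</summary>
    private static float ReadScale(ReadOnlySpan<byte> block)
    {
        ushort bits = BinaryPrimitives.ReadUInt16LittleEndian(block);
        return (float)BitConverter.UInt16BitsToHalf(bits);
    }

    /// <summary>Expands one block into 32 float values: scale * signed quant.</summary>
    public static void Dequantize(ReadOnlySpan<byte> block, Span<float> destination)
    {
        if (block.Length != BlockBytes)
        {
            throw new ArgumentException("A Q8_0 block must be exactly 34 bytes.");
        }

        if (destination.Length < BlockElementCount)
        {
            throw new ArgumentException("Destination must hold 32 values.");
        }

        float scale = ReadScale(block);
        for (int i = 0; i < BlockElementCount; i++)
        {
            // Reinterpret the stored byte as a signed 8-bit quant.
            destination[i] = scale * unchecked((sbyte)block[2 + i]);
        }
    }

    /// <summary>Fused dot product: no intermediate float buffer is allocated.</summary>
    public static float Dot(ReadOnlySpan<byte> block, ReadOnlySpan<float> activations)
    {
        if (block.Length != BlockBytes)
        {
            throw new ArgumentException("A Q8_0 block must be exactly 34 bytes.");
        }

        if (activations.Length != BlockElementCount)
        {
            throw new ArgumentException("Activations must hold 32 values.");
        }

        float sum = 0f;
        for (int i = 0; i < BlockElementCount; i++)
        {
            sum += unchecked((sbyte)block[2 + i]) * activations[i];
        }

        // Applying the scale once at the end saves 31 multiplications.
        return ReadScale(block) * sum;
    }
}
```

Because the fused form applies the scale after accumulation, its result can differ in the last bits from dequantizing first and then summing, which is one reason parity is judged with a tolerance rather than exact equality. `BitConverter.UInt16BitsToHalf` requires .NET 6 or later.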
ReferenceMatrixStorageDescriptor (UAIX.LmRuntime.Kernels.Cpu)
4 members
Describes one supported scalar matrix storage layout.
GgmlType
Gets the GGML tensor type.
RowCount
Gets the logical row count.
ColumnCount
Gets the logical column count.
RequiredByteCount
Gets the exact required storage byte count.
ReferenceMatrixRowDispatcher (UAIX.LmRuntime.Kernels.Cpu)
2 members
Dispatches correctness-first matrix-vector operations for supported mapped scalar and quantized rows.
Describe(UAIX.LmRuntime.Tensors.GgmlTensorType,int,int)
Computes the exact storage byte count for a supported matrix.
Parameters: type, rowCount, columnCount
Returns: The ReferenceMatrixStorageDescriptor result produced by ReferenceMatrixRowDispatcher.Describe. It is published only after all documented validation and ownership transitions succeed.
MatVec(UAIX.LmRuntime.Tensors.GgmlTensorType,System.ReadOnlySpan<byte>,int,int,System.ReadOnlySpan<float>,System.Span<float>)
Computes a little-endian matrix-vector product without materializing a complete dequantized matrix.
Parameters: type, matrix, rowCount, columnCount, activations, output
Scalar16CpuKernels (UAIX.LmRuntime.Kernels.Cpu)
8 members
Provides correctness-first scalar F16 and BF16 decoding and matrix-vector kernels.
DecodeFloat16(System.ReadOnlySpan<byte>,bool)
Decodes one IEEE binary16 value with explicit byte order.
Parameters: bytes, bigEndian
Returns: The float value computed by Scalar16CpuKernels.DecodeFloat16. Range, finite-value, and overflow checks are completed before the value is returned.
DecodeFloat16(System.ReadOnlySpan<byte>,bool,bool)
Decodes one IEEE binary16 value with explicit byte order and non-finite policy.
Parameters: bytes, bigEndian, rejectNonFinite
Returns: The float value computed by Scalar16CpuKernels.DecodeFloat16. Range, finite-value, and overflow checks are completed before the value is returned.
DecodeBFloat16(System.ReadOnlySpan<byte>,bool)
Decodes one bfloat16 value with explicit byte order.
Parameters: bytes, bigEndian
Returns: The float value computed by Scalar16CpuKernels.DecodeBFloat16. Range, finite-value, and overflow checks are completed before the value is returned.
DecodeBFloat16(System.ReadOnlySpan<byte>,bool,bool)
Decodes one bfloat16 value with explicit byte order and non-finite policy.
Parameters: bytes, bigEndian, rejectNonFinite
Returns: The float value computed by Scalar16CpuKernels.DecodeBFloat16. Range, finite-value, and overflow checks are completed before the value is returned.
CopyFloat16(System.ReadOnlySpan<byte>,System.Span<float>,bool)
Copies F16 values into a caller-owned float32 destination.
Parameters: source, destination, bigEndian
CopyBFloat16(System.ReadOnlySpan<byte>,System.Span<float>,bool)
Copies BF16 values into a caller-owned float32 destination.
Parameters: source, destination, bigEndian
MatVecFloat16(System.ReadOnlySpan<byte>,int,int,System.ReadOnlySpan<float>,System.Span<float>,bool)
Computes a row-major F16 matrix-vector product without whole-matrix conversion.
Parameters: matrix, rowCount, columnCount, vector, output, bigEndian
MatVecBFloat16(System.ReadOnlySpan<byte>,int,int,System.ReadOnlySpan<float>,System.Span<float>,bool)
Computes a row-major BF16 matrix-vector product without whole-matrix conversion.
Parameters: matrix, rowCount, columnCount, vector, output, bigEndian
Auto selects the highest implemented and supported tier for that operation. Inspect CpuKernelSelection rather than inferring the result from the host CPU.
Representation in Tensors does not imply execution support in Kernels.Cpu; they are separate evidence levels. Check the specific dispatcher, dequantizer, or matrix source used by your model path.
Do not assume that overlapping input and output buffers are supported unless the member documentation explicitly permits it. Use separate caller-owned output storage for matrix and normalization operations.
The package describes pure managed CPU kernels. It does not provide GPU or native inference backends.