UAIX.LmRuntime.Kernels.Cpu

Reference, dispatched, half-precision, and quantized managed CPU kernels with parity evidence.

Required for managed CPU math

Reference, portable-vector, AVX2-aware, half-precision, and quantized CPU kernels with explicit dispatch and parity checks.

Overview

Scalar, Vector<T>, and intrinsic-ready CPU kernels for pure C# local LLM runtime inference.

Who should use it: Model executors, kernel developers, and test suites that need managed CPU math over float and quantized tensor storage.

Execution status: Managed scalar, portable-vector, intrinsic-aware, half-precision, and quantized CPU kernels are represented in the supplied source.

Install

.NET CLI

dotnet add package UAIX.LmRuntime.Kernels.Cpu

Project file

<PackageReference Include="UAIX.LmRuntime.Kernels.Cpu" />

Version policy: The documentation deliberately omits UAIX.LmRuntime package version numbers. Resolve and pin versions through your normal dependency-management and lock-file process.

Direct package dependencies

UAIX.LmRuntime.Tensors

Package role and boundaries

Required for managed CPU math

You need scalar correctness kernels or CPU dispatch for dot, matrix-vector, RMS normalization, softmax, and RoPE primitives.
You need Q4, Q5, Q6, Q8, or K-quantized dequantization/dot/matrix behavior represented by this package.
You need parity comparisons between reference and selected CPU paths.

Boundary

This package does not provide GPU execution or native acceleration.
Do not assume that a requested ISA tier was selected; inspect CpuKernelSelection for the actual tier and reason.

Reference first

Scalar/reference operations remain the correctness anchor. Optimized paths should be admitted through parity evidence rather than labels alone.

Dispatch is observable

CpuKernelDispatcher reports requested tier, selected tier, operation, and rationale so fallback behavior is not hidden.
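
The pattern of returning the selection as evidence alongside the result can be sketched in isolation. The following is a minimal illustration, not the package's dispatcher: SketchTier, SketchSelection, and DispatchSketch are hypothetical names, and the fallback policy (an unsupported tier falls back to scalar) is an assumption made for the example. Only the hardware probes, Avx2.IsSupported and Vector.IsHardwareAccelerated, are standard .NET APIs.

```csharp
using System.Numerics;
using System.Runtime.Intrinsics.X86;

// Hypothetical types for illustration only; they are not the package's types.
public enum SketchTier { Auto, Scalar, PortableVector, Avx2 }

public readonly record struct SketchSelection(
    SketchTier RequestedTier,
    SketchTier SelectedTier,
    string Operation,
    string Reason);

public static class DispatchSketch
{
    // Resolves a requested tier to one that can run on this machine and records why.
    // The fallback policy (unsupported tier -> scalar) is an assumption of this sketch.
    public static SketchSelection Select(SketchTier requested, string operation)
    {
        bool avx2 = Avx2.IsSupported;               // false on non-x86 hardware
        bool vector = Vector.IsHardwareAccelerated; // true when Vector<T> maps to SIMD

        (SketchTier selected, string reason) = requested switch
        {
            SketchTier.Scalar => (SketchTier.Scalar, "Scalar was requested."),
            SketchTier.Avx2 when avx2 => (SketchTier.Avx2, "AVX2 was requested and is supported."),
            SketchTier.Avx2 => (SketchTier.Scalar, "AVX2 was requested but is unsupported; using scalar."),
            SketchTier.PortableVector when vector => (SketchTier.PortableVector, "Portable vector was requested and is accelerated."),
            SketchTier.PortableVector => (SketchTier.Scalar, "Portable vector is not accelerated; using scalar."),
            _ when avx2 => (SketchTier.Avx2, "Auto chose AVX2, the highest supported tier."),
            _ when vector => (SketchTier.PortableVector, "Auto chose the portable vector tier."),
            _ => (SketchTier.Scalar, "Auto found no accelerated tier; using scalar."),
        };

        return new SketchSelection(requested, selected, operation, reason);
    }
}
```

Because the requested tier and the selected tier travel together, a test can assert on the selected tier directly instead of inferring it from timing or from the label that was requested.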

Storage-specific validation

Quantized kernels depend on exact block layout, row shape, byte length, alignment, and destination bounds supplied by tensor/storage metadata.

Key types

These are the main public entry points. The generated reference below includes the documented public package surface.

CpuKernelDispatcher
CpuKernelTier
CpuKernelSelection
ReferenceCpuKernels
QuantizedCpuKernels
KQuantizedCpuKernels
Scalar16CpuKernels
QuantizedKernelParityRunner
ReferenceMatrixRowDispatcher

Coding examples

Examples use the documented public package surface. Paths, identities, runtime identifiers, device evidence, and application policy remain host inputs.

Dispatch a float32 dot product

Request the best available implemented tier and retain the actual selection evidence.

CpuDotExample.cs

using UAIX.LmRuntime.Kernels.Cpu;

ReadOnlySpan<float> left = [1.0f, 2.0f, 3.0f, 4.0f];
ReadOnlySpan<float> right = [0.5f, 0.25f, -1.0f, 2.0f];

float value = CpuKernelDispatcher.DotFloat32(
    left,
    right,
    CpuKernelTier.Auto,
    out CpuKernelSelection selection);

Console.WriteLine($"Value: {value}");
Console.WriteLine(
    $"{selection.RequestedTier} -> {selection.SelectedTier}: {selection.Reason}");

Run a managed matrix-vector operation

Provide row-major matrix values, explicit dimensions, and caller-owned output storage.

CpuMatVecExample.cs

using UAIX.LmRuntime.Kernels.Cpu;

const int rowCount = 2;
const int columnCount = 3;

ReadOnlySpan<float> matrix =
[
    1.0f, 2.0f, 3.0f,
    4.0f, 5.0f, 6.0f
];

ReadOnlySpan<float> vector = [0.5f, 1.0f, -0.5f];
Span<float> output = stackalloc float[rowCount];

CpuKernelDispatcher.MatVecFloat32(
    matrix,
    rowCount,
    columnCount,
    vector,
    output,
    CpuKernelTier.Auto,
    out CpuKernelSelection selection);

Console.WriteLine(
    $"{selection.RequestedTier} -> {selection.SelectedTier}: {selection.Reason}");

Compare an optimized result with the reference

Make tolerance and maximum deviation visible in a kernel-admission test.

KernelParityExample.cs

using UAIX.LmRuntime.Kernels.Cpu;

ReadOnlySpan<float> reference = [1.0f, 2.0f, 3.0f];
ReadOnlySpan<float> actual = [1.0f, 2.000001f, 2.999999f];

QuantizedKernelParityReport report =
    QuantizedKernelParityRunner.CompareAgainstReference(
        reference,
        actual,
        tolerance: 1e-5f);

if (!report.Passed)
{
    throw new InvalidOperationException(
        $"Kernel parity failed: max error {report.MaxAbsoluteError}");
}

Dispatch a quantized matrix row layout

Use the GGML tensor type and explicit dimensions to select the matching managed row implementation.

QuantizedMatVecExample.cs

using UAIX.LmRuntime.Kernels.Cpu;
using UAIX.LmRuntime.Tensors;

public static class QuantizedMatrixExample
{
    /// <summary>
    /// Multiplies a Q4_0 matrix by a float activation vector through the reference row dispatcher.
    /// </summary>
    /// <param name="encodedMatrixBytes">The complete encoded matrix storage.</param>
    /// <param name="rowCount">The number of matrix rows.</param>
    /// <param name="columnCount">The number of matrix columns and activation values.</param>
    /// <param name="activations">The float activation vector.</param>
    /// <returns>A newly allocated output vector containing one value per row.</returns>
    public static float[] MultiplyQ4_0(
        ReadOnlySpan<byte> encodedMatrixBytes,
        int rowCount,
        int columnCount,
        ReadOnlySpan<float> activations)
    {
        var output = new float[rowCount];

        ReferenceMatrixRowDispatcher.MatVec(
            GgmlTensorType.Q4_0,
            encodedMatrixBytes,
            rowCount,
            columnCount,
            activations,
            output);

        return output;
    }
}

Boundary: The encoded buffer must contain complete blocks for the requested storage type and logical row shape.
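
A host can check the complete-block requirement before calling into a kernel. The sketch below is an assumption-laden illustration, not the package's validation: EncodedShapeSketch is a hypothetical helper, the constants are the standard GGML values (32 values per block, 18 bytes per Q4_0 block, 34 bytes per Q8_0 block) rather than values read from this package, and requiring an exact byte-length match is a choice made for the example.

```csharp
using System;

// Hypothetical helper; the constants are the usual GGML block sizes, assumed here.
public static class EncodedShapeSketch
{
    public const int BlockElementCount = 32;
    public const int Q4_0BlockBytes = 18; // 2-byte half scale + 16 bytes of packed nibbles
    public const int Q8_0BlockBytes = 34; // 2-byte half scale + 32 signed bytes

    // Returns the byte count of a row-major matrix whose rows are whole blocks.
    public static long RequiredByteCount(int rowCount, int columnCount, int blockBytes)
    {
        if (rowCount < 0)
            throw new ArgumentOutOfRangeException(nameof(rowCount));
        if (columnCount < 0 || columnCount % BlockElementCount != 0)
            throw new ArgumentException(
                "Column count must be a nonnegative multiple of the block element count.",
                nameof(columnCount));

        long blocksPerRow = columnCount / BlockElementCount;
        return checked(rowCount * blocksPerRow * blockBytes);
    }

    // Throws when the encoded buffer does not hold exactly the expected blocks.
    public static void EnsureCompleteBlocks(
        ReadOnlySpan<byte> encoded, int rowCount, int columnCount, int blockBytes)
    {
        long required = RequiredByteCount(rowCount, columnCount, blockBytes);
        if (encoded.Length != required)
            throw new ArgumentException(
                $"Expected {required} encoded bytes but received {encoded.Length}.",
                nameof(encoded));
    }
}
```

Under these assumptions a Q4_0 matrix with 2 rows and 64 columns needs 2 × 2 × 18 = 72 bytes.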

Generated API reference

Each type below lists its documented public fields, properties, constructors, methods, parameter descriptions, and return descriptions.

Q4_1Dequantizer (UAIX.LmRuntime.Kernels.Cpu, 1 member)

Dequantizes Q4_1 blocks for scalar reference parity.

Method DequantizeBlock(System.ReadOnlySpan<byte>,System.Span<float>)

Dequantizes one Q4_1 block into destination floats.

source: The source data consumed by the operation; caller-owned storage is not retained after the method returns.
destination: The caller-owned destination buffer that receives the result; required capacity is validated before any write occurs.

Q5_0Dequantizer (UAIX.LmRuntime.Kernels.Cpu, 1 member)

Dequantizes Q5_0 blocks for scalar reference parity.

Method DequantizeBlock(System.ReadOnlySpan<byte>,System.Span<float>)

Dequantizes one Q5_0 block into destination floats.

source: The source data consumed by the operation; caller-owned storage is not retained after the method returns.
destination: The caller-owned destination buffer that receives the result; required capacity is validated before any write occurs.

Q6_KDequantizer (UAIX.LmRuntime.Kernels.Cpu, 1 member)

Dequantizes Q6_K blocks for scalar reference parity.

Method DequantizeBlock(System.ReadOnlySpan<byte>,System.Span<float>)

Dequantizes one exact GGML Q6_K block into destination floats.

source: The source data consumed by the operation; caller-owned storage is not retained after the method returns.
destination: The caller-owned destination buffer that receives the result; required capacity is validated before any write occurs.

FusedQuantizedDotProduct (UAIX.LmRuntime.Kernels.Cpu, 1 member)

Provides fused dequantize-and-dot reference kernels.

Method Dot(UAIX.LmRuntime.Tensors.GgmlTensorType,System.ReadOnlySpan<byte>,System.ReadOnlySpan<float>)

Computes a dot product between a quantized block and float activations.

type: The declared GGML tensor or metadata type used to select the corresponding decoding and validation rules.
block: The block sequence used by this operation; its required length, ordering, and element bounds are validated before access.
activations: The activations sequence used by this operation; its required length, ordering, and element bounds are validated before access.

Returns: The float dot product of the quantized block and the float activations. Range, finite-value, and overflow checks are completed before the value is returned.

QuantizedKernelParityReport (UAIX.LmRuntime.Kernels.Cpu, 2 members)

Represents the result of comparing an optimized quantized kernel to a reference kernel.

Property Passed

Gets a value indicating whether outputs are within tolerance.

Property MaxAbsoluteError

Gets the maximum absolute error observed.

QuantizedKernelParityRunner (UAIX.LmRuntime.Kernels.Cpu, 1 member)

Compares quantized kernels against scalar references.

Method CompareAgainstReference(System.ReadOnlySpan<float>,System.ReadOnlySpan<float>,float)

Compares two output vectors with an absolute tolerance.

reference: The reference sequence used by this operation; its required length, ordering, and element bounds are validated before access.
actual: The actual sequence used by this operation; its required length, ordering, and element bounds are validated before access.
tolerance: The numeric tolerance consumed by CompareAgainstReference; it must satisfy the member's documented range, geometry, and finite-value requirements.

Returns: A QuantizedKernelParityReport describing whether the two output vectors agree within the absolute tolerance. It is published only after all documented validation and ownership transitions succeed.
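
An absolute-tolerance comparison of this kind reduces to tracking the largest element-wise difference. The following is a minimal standalone sketch of that idea, not the package's implementation: ParitySketch is a hypothetical name, and the handling of NaN (treated as a failure) and of mismatched lengths (rejected) are assumptions of the example.

```csharp
using System;

// Hypothetical helper illustrating an absolute-tolerance parity check.
public static class ParitySketch
{
    public static (bool Passed, float MaxAbsoluteError) Compare(
        ReadOnlySpan<float> reference, ReadOnlySpan<float> actual, float tolerance)
    {
        if (reference.Length != actual.Length)
            throw new ArgumentException("Both vectors must have the same length.");
        if (!(tolerance >= 0f)) // also rejects NaN
            throw new ArgumentOutOfRangeException(nameof(tolerance));

        float max = 0f;
        for (int i = 0; i < reference.Length; i++)
        {
            float error = MathF.Abs(reference[i] - actual[i]);
            if (float.IsNaN(error))
                return (false, float.NaN); // assumption: NaN never passes parity
            if (error > max)
                max = error;
        }

        return (max <= tolerance, max);
    }
}
```

Reporting the maximum error alongside the pass flag keeps a failing admission test informative: the message can state how far the optimized path drifted, not only that it drifted.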

Avx2Float32Kernels (UAIX.LmRuntime.Kernels.Cpu, 2 members)

Provides dedicated AVX2 float32 correctness kernels with scalar tails.

These kernels are selected only when AVX2 is explicitly requested and supported. Scalar implementations remain the numerical authority, and no throughput claim is implied until executed benchmark evidence exists.

Method Dot(System.ReadOnlySpan<float>,System.ReadOnlySpan<float>)

Computes a float32 dot product with AVX/FMA vector arithmetic and a scalar tail.

left: The left sequence used by this operation; its required length, ordering, and element bounds are validated before access.
right: The right sequence used by this operation; its required length, ordering, and element bounds are validated before access.

Returns: The float32 dot product computed with AVX/FMA vector arithmetic and a scalar tail. Range, finite-value, and overflow checks are completed before the value is returned.
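
The "vector body plus scalar tail" shape can be illustrated without AVX2 intrinsics. The sketch below uses the portable System.Numerics Vector<T> type instead, so it is an analogue of the structure described here rather than the package's AVX2 kernel; VectorDotSketch is a hypothetical name, and Vector.Sum assumes .NET 6 or later.

```csharp
using System;
using System.Numerics;

// Hypothetical portable analogue of a vectorized dot product with a scalar tail.
public static class VectorDotSketch
{
    public static float Dot(ReadOnlySpan<float> left, ReadOnlySpan<float> right)
    {
        if (left.Length != right.Length)
            throw new ArgumentException("Both vectors must have the same length.");

        int width = Vector<float>.Count;
        Vector<float> accumulator = Vector<float>.Zero;
        int i = 0;

        // Vector body: process full lanes while at least 'width' elements remain.
        for (; i <= left.Length - width; i += width)
        {
            accumulator += new Vector<float>(left.Slice(i, width))
                         * new Vector<float>(right.Slice(i, width));
        }

        // Horizontal sum of the lanes, then the scalar tail for leftover elements.
        float sum = Vector.Sum(accumulator);
        for (; i < left.Length; i++)
        {
            sum += left[i] * right[i];
        }

        return sum;
    }
}
```

A lane-wise accumulation adds the products in a different order than a left-to-right scalar loop, so the two results can differ in the last bits. That is one reason an optimized path is compared against the scalar reference with a tolerance rather than for exact equality.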

Method MatVec(System.ReadOnlySpan<float>,int,int,System.ReadOnlySpan<float>,System.Span<float>)

Computes a row-major float32 matrix-vector product by reusing the dedicated AVX2 dot kernel.

matrix: The matrix sequence used by this operation; its required length, ordering, and element bounds are validated before access.
rowCount: The row count used to bound this operation; it must be nonnegative and within the supported range.
columnCount: The column count used to bound this operation; it must be nonnegative and within the supported range.
vector: The vector sequence used by this operation; its required length, ordering, and element bounds are validated before access.
output: The caller-owned destination buffer that receives the result; required capacity is validated before any write occurs.

CpuKernelTier (UAIX.LmRuntime.Kernels.Cpu, 6 members)

Identifies a managed CPU kernel implementation tier.

Field Auto

Selects the highest supported tier implemented for the requested operation.

Field Scalar

Uses the scalar correctness implementation.

Field PortableVector

Uses portable Vector<T> operations.

Field Avx2

Uses an AVX2 implementation when the operation provides one.

Field Avx512

Uses an AVX-512 implementation when the operation provides one.

Field AdvSimd

Uses an ARM64 AdvSimd implementation when the operation provides one.

CpuKernelSelection (UAIX.LmRuntime.Kernels.Cpu, 4 members)

Describes the requested and selected CPU kernel tier for one operation.

Property RequestedTier

Gets the requested tier.

Property SelectedTier

Gets the selected tier.

Property Operation

Gets the operation name.

Property Reason

Gets the stable selection rationale.

CpuKernelDispatcher (UAIX.LmRuntime.Kernels.Cpu, 6 members)

Dispatches correctness-first CPU kernels through explicitly selectable implementation tiers.

Scalar implementations remain the numerical authority. Portable and architecture-specific paths are additive and can always be bypassed by requesting the Scalar tier.

Method SelectFloat32DotTier(UAIX.LmRuntime.Kernels.Cpu.CpuKernelTier)

Selects the implemented tier for a float32 dot product.

requestedTier: The requested tier containing caller-supplied values for this operation; all required fields are validated before processing.

Returns: A CpuKernelSelection describing the implemented tier selected for a float32 dot product. It is published only after all documented validation and ownership transitions succeed.

Method DotFloat32(System.ReadOnlySpan<float>,System.ReadOnlySpan<float>,UAIX.LmRuntime.Kernels.Cpu.CpuKernelTier,UAIX.LmRuntime.Kernels.Cpu.CpuKernelSelection&)

Computes a float32 dot product through the selected implementation tier.

left: The left sequence used by this operation; its required length, ordering, and element bounds are validated before access.
right: The right sequence used by this operation; its required length, ordering, and element bounds are validated before access.
requestedTier: The requested tier containing caller-supplied values for this operation; all required fields are validated before processing.
selection: When the method returns, contains the selection produced by the operation when successful; otherwise contains the type's default value.

Returns: The float32 dot product computed through the selected implementation tier. Range, finite-value, and overflow checks are completed before the value is returned.

Method MatVecFloat32(System.ReadOnlySpan<float>,int,int,System.ReadOnlySpan<float>,System.Span<float>,UAIX.LmRuntime.Kernels.Cpu.CpuKernelTier,UAIX.LmRuntime.Kernels.Cpu.CpuKernelSelection&)

Computes a row-major float32 matrix-vector product through the selected implementation tier.

matrix: The matrix sequence used by this operation; its required length, ordering, and element bounds are validated before access.
rowCount: The row count used to bound this operation; it must be nonnegative and within the supported range.
columnCount: The column count used to bound this operation; it must be nonnegative and within the supported range.
vector: The vector sequence used by this operation; its required length, ordering, and element bounds are validated before access.
output: The caller-owned destination buffer that receives the result; required capacity is validated before any write occurs.
requestedTier: The requested tier containing caller-supplied values for this operation; all required fields are validated before processing.
selection: When the method returns, contains the selection produced by the operation when successful; otherwise contains the type's default value.

Method RmsNorm(System.ReadOnlySpan<float>,System.ReadOnlySpan<float>,System.Span<float>,float,UAIX.LmRuntime.Kernels.Cpu.CpuKernelTier,UAIX.LmRuntime.Kernels.Cpu.CpuKernelSelection&)

Applies RMS normalization using the selected scalar or portable-vector accumulation tier.

input: The source data consumed by the operation; caller-owned storage is not retained after the method returns.
weight: The weight sequence used by this operation; its required length, ordering, and element bounds are validated before access.
output: The caller-owned destination buffer that receives the result; required capacity is validated before any write occurs.
epsilon: The positive normalization epsilon added to the mean-square term to avoid division by zero while preserving deterministic numerical behavior.
requestedTier: The requested tier containing caller-supplied values for this operation; all required fields are validated before processing.
selection: When the method returns, contains the selection produced by the operation when successful; otherwise contains the type's default value.

Method DotQ8_0(System.ReadOnlySpan<byte>,System.ReadOnlySpan<float>,UAIX.LmRuntime.Kernels.Cpu.CpuKernelTier,UAIX.LmRuntime.Kernels.Cpu.CpuKernelSelection&)

Computes a Q8_0 block dot product through a scalar or portable-vector correctness path.

block: The block sequence used by this operation; its required length, ordering, and element bounds are validated before access.
activations: The activations sequence used by this operation; its required length, ordering, and element bounds are validated before access.
requestedTier: The requested tier containing caller-supplied values for this operation; all required fields are validated before processing.
selection: When the method returns, contains the selection produced by the operation when successful; otherwise contains the type's default value.

Returns: The Q8_0 block dot product as a float, computed through a scalar or portable-vector correctness path. Range, finite-value, and overflow checks are completed before the value is returned.

Method DotQ4_0(System.ReadOnlySpan<byte>,System.ReadOnlySpan<float>,UAIX.LmRuntime.Kernels.Cpu.CpuKernelTier,UAIX.LmRuntime.Kernels.Cpu.CpuKernelSelection&)

Computes a Q4_0 block dot product through a scalar or portable-vector correctness path.

block: The block sequence used by this operation; its required length, ordering, and element bounds are validated before access.
activations: The activations sequence used by this operation; its required length, ordering, and element bounds are validated before access.
requestedTier: The requested tier containing caller-supplied values for this operation; all required fields are validated before processing.
selection: When the method returns, contains the selection produced by the operation when successful; otherwise contains the type's default value.

Returns: The Q4_0 block dot product as a float, computed through a scalar or portable-vector correctness path. Range, finite-value, and overflow checks are completed before the value is returned.

Q4KBlock (UAIX.LmRuntime.Kernels.Cpu, 4 members)

Defines the audited packed GGML Q4_K block layout for 256 logical values.

Field Scale

Gets or sets the common little-endian IEEE half scale.

Field MinimumScale

Gets or sets the common little-endian IEEE half minimum scale.

Field ScaleMinimums

Stores eight packed 6-bit scales and eight packed 6-bit minimum factors.

Field QuantizedValues

Stores 256 four-bit quants in 128 bytes.

Q6KBlock (UAIX.LmRuntime.Kernels.Cpu, 4 members)

Defines the audited packed GGML Q6_K block layout for 256 logical values.

Field LowBits

Stores the lower four bits for 256 quants.

Field HighBits

Stores the upper two bits for 256 quants.

Field Scales

Stores sixteen signed sub-block scales.

Field Scale

Gets or sets the common little-endian IEEE half scale.

KQuantizedBlockLayout (UAIX.LmRuntime.Kernels.Cpu, 4 members)

Describes one audited K-quantized block layout.

Property Format

Gets the format name.

Property ElementCount

Gets the logical element count.

Property ByteCount

Gets the physical byte count.

Property LayoutDescription

Gets the audited layout statement.

KQuantizedCpuKernels (UAIX.LmRuntime.Kernels.Cpu, 17 members)

Provides correctness-first scalar GGML Q4_K and Q6_K block kernels.

These methods operate on one exact 256-element block and never materialize a complete model matrix. All scale fields are interpreted as little-endian IEEE half values because current direct K-quant execution is limited to little-endian GGUF storage.

Field BlockElementCount

Gets the number of logical values in one K-quant block.

Field Q4KBlockByteCount

Gets the exact Q4_K block byte count.

Field Q6KBlockByteCount

Gets the exact Q6_K block byte count.

Field Q4_KBlockBytes

Gets the historical Q4_K block-byte constant retained for source compatibility.

Field Q6_KBlockBytes

Gets the historical Q6_K block-byte constant retained for source compatibility.

Property Q4KLayout

Gets the audited Q4_K block layout.

Property Q6KLayout

Gets the audited Q6_K block layout.

Method DequantizeQ4K(System.ReadOnlySpan<byte>,System.Span<float>)

Dequantizes one exact Q4_K block into a caller-owned destination.

block: The block sequence used by this operation; its required length, ordering, and element bounds are validated before access.
destination: The caller-owned destination buffer that receives the result; required capacity is validated before any write occurs.

Method DotQ4K(System.ReadOnlySpan<byte>,System.ReadOnlySpan<float>)

Computes an allocation-free dot product for one exact Q4_K block.

block: The block sequence used by this operation; its required length, ordering, and element bounds are validated before access.
activations: The activations sequence used by this operation; its required length, ordering, and element bounds are validated before access.

Returns: The float dot product for one exact Q4_K block, computed without allocation. Range, finite-value, and overflow checks are completed before the value is returned.

Method DequantizeQ6K(System.ReadOnlySpan<byte>,System.Span<float>)

Dequantizes one exact Q6_K block into a caller-owned destination.

block: The block sequence used by this operation; its required length, ordering, and element bounds are validated before access.
destination: The caller-owned destination buffer that receives the result; required capacity is validated before any write occurs.

Method DotQ6K(System.ReadOnlySpan<byte>,System.ReadOnlySpan<float>)

Computes an allocation-free dot product for one exact Q6_K block.

block: The block sequence used by this operation; its required length, ordering, and element bounds are validated before access.
activations: The activations sequence used by this operation; its required length, ordering, and element bounds are validated before access.

Returns: The float dot product for one exact Q6_K block, computed without allocation. Range, finite-value, and overflow checks are completed before the value is returned.

Method DequantizeQ4_K(System.ReadOnlySpan<byte>,System.Span<float>)

Dequantizes one Q4_K block using the historical method name retained for source compatibility.

block: The block sequence used by this operation; its required length, ordering, and element bounds are validated before access.
destination: The caller-owned destination buffer that receives the result; required capacity is validated before any write occurs.

Method DequantizeQ6_K(System.ReadOnlySpan<byte>,System.Span<float>)

Dequantizes one Q6_K block using the historical method name retained for source compatibility.

block: The block sequence used by this operation; its required length, ordering, and element bounds are validated before access.
destination: The caller-owned destination buffer that receives the result; required capacity is validated before any write occurs.

Method MatVecQ4_K(System.ReadOnlySpan<byte>,int,int,System.ReadOnlySpan<float>,System.Span<float>)

Computes a Q4_K matrix-vector product using the historical method name retained for source compatibility.

matrix: The matrix sequence used by this operation; its required length, ordering, and element bounds are validated before access.
rowCount: The row count used to bound this operation; it must be nonnegative and within the supported range.
columnCount: The column count used to bound this operation; it must be nonnegative and within the supported range.
activations: The activations sequence used by this operation; its required length, ordering, and element bounds are validated before access.
output: The caller-owned destination buffer that receives the result; required capacity is validated before any write occurs.

Method MatVecQ6_K(System.ReadOnlySpan<byte>,int,int,System.ReadOnlySpan<float>,System.Span<float>)

Computes a Q6_K matrix-vector product using the historical method name retained for source compatibility.

matrix: The matrix sequence used by this operation; its required length, ordering, and element bounds are validated before access.
rowCount: The row count used to bound this operation; it must be nonnegative and within the supported range.
columnCount: The column count used to bound this operation; it must be nonnegative and within the supported range.
activations: The activations sequence used by this operation; its required length, ordering, and element bounds are validated before access.
output: The caller-owned destination buffer that receives the result; required capacity is validated before any write occurs.

Method MatVecQ4K(System.ReadOnlySpan<byte>,int,int,System.ReadOnlySpan<float>,System.Span<float>)

Computes a row-major Q4_K matrix-vector product without whole-matrix dequantization.

matrix: The matrix sequence used by this operation; its required length, ordering, and element bounds are validated before access.
rowCount: The row count used to bound this operation; it must be nonnegative and within the supported range.
columnCount: The column count used to bound this operation; it must be nonnegative and within the supported range.
activations: The activations sequence used by this operation; its required length, ordering, and element bounds are validated before access.
output: The caller-owned destination buffer that receives the result; required capacity is validated before any write occurs.

Method MatVecQ6K(System.ReadOnlySpan<byte>,int,int,System.ReadOnlySpan<float>,System.Span<float>)

Computes a row-major Q6_K matrix-vector product without whole-matrix dequantization.

matrix: The matrix sequence used by this operation; its required length, ordering, and element bounds are validated before access.
rowCount: The row count used to bound this operation; it must be nonnegative and within the supported range.
columnCount: The column count used to bound this operation; it must be nonnegative and within the supported range.
activations: The activations sequence used by this operation; its required length, ordering, and element bounds are validated before access.
output: The caller-owned destination buffer that receives the result; required capacity is validated before any write occurs.

QuantizedCpuKernels (UAIX.LmRuntime.Kernels.Cpu, 9 members)

Provides allocation-free scalar correctness kernels for high-value GGML quantization formats.

Field Q4_0BlockBytes

Gets the byte length of a Q4_0 block.

Field Q8_0BlockBytes

Gets the byte length of a Q8_0 block.

Field BlockElementCount

Gets the logical element count in a Q4_0 or Q8_0 block.

Method DequantizeQ4_0(System.ReadOnlySpan<byte>,System.Span<float>)

Dequantizes one little-endian Q4_0 block into float32 values.

block: The block sequence used by this operation; its required length, ordering, and element bounds are validated before access.
destination: The destination buffer with room for 32 values.
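
For orientation, the standard GGML Q4_0 block can be decoded in a few lines. The sketch below assumes that layout (a 2-byte little-endian half scale followed by 16 bytes holding 32 four-bit quants, with low nibbles mapping to elements 0..15 and high nibbles to elements 16..31, each offset by 8); it is an independent illustration, not the package's source, and Q4_0Sketch is a hypothetical name. BinaryPrimitives.ReadHalfLittleEndian requires .NET 6 or later.

```csharp
using System;
using System.Buffers.Binary;

// Hypothetical decoder for one block in the standard GGML Q4_0 layout (assumed).
public static class Q4_0Sketch
{
    public const int BlockBytes = 18;        // 2-byte scale + 16 packed bytes
    public const int BlockElementCount = 32; // logical values per block

    public static void DequantizeBlock(ReadOnlySpan<byte> block, Span<float> destination)
    {
        if (block.Length != BlockBytes)
            throw new ArgumentException("A Q4_0 block is exactly 18 bytes.", nameof(block));
        if (destination.Length < BlockElementCount)
            throw new ArgumentException("Destination needs room for 32 values.", nameof(destination));

        // The shared scale is stored as a little-endian IEEE half.
        float scale = (float)BinaryPrimitives.ReadHalfLittleEndian(block);
        ReadOnlySpan<byte> quants = block.Slice(2, 16);

        for (int j = 0; j < 16; j++)
        {
            // Each nibble is an unsigned 0..15 value recentred to -8..7.
            destination[j] = ((quants[j] & 0x0F) - 8) * scale;
            destination[j + 16] = ((quants[j] >> 4) - 8) * scale;
        }
    }
}
```

Both checks run before the first write, so a rejected call leaves the destination untouched.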

Method DequantizeQ8_0(System.ReadOnlySpan<byte>,System.Span<float>)

Dequantizes one little-endian Q8_0 block into float32 values.

block: The block sequence used by this operation; its required length, ordering, and element bounds are validated before access.
destination: The destination buffer with room for 32 values.

Method DotQ4_0(System.ReadOnlySpan<byte>,System.ReadOnlySpan<float>)

Computes an allocation-free dequantize-and-dot operation for one Q4_0 block.

block: The block sequence used by this operation; its required length, ordering, and element bounds are validated before access.
activations: The activations sequence used by this operation; its required length, ordering, and element bounds are validated before access.

Returns: The float result of the allocation-free dequantize-and-dot operation for one Q4_0 block. Range, finite-value, and overflow checks are completed before the value is returned.

Method DotQ8_0(System.ReadOnlySpan<byte>,System.ReadOnlySpan<float>)

Computes an allocation-free dequantize-and-dot operation for one Q8_0 block.

block: The block sequence used by this operation; its required length, ordering, and element bounds are validated before access.
activations: The activations sequence used by this operation; its required length, ordering, and element bounds are validated before access.

Returns: The float result of the allocation-free dequantize-and-dot operation for one Q8_0 block. Range, finite-value, and overflow checks are completed before the value is returned.
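
A fused dequantize-and-dot avoids writing the 32 decoded floats anywhere: because every quant in a block shares one scale, the scale can be applied once to the accumulated sum. The sketch below assumes the standard GGML Q8_0 layout (a 2-byte little-endian half scale followed by 32 signed bytes) and is an independent illustration, not the package's kernel; Q8_0DotSketch is a hypothetical name.

```csharp
using System;
using System.Buffers.Binary;

// Hypothetical fused kernel for one block in the standard GGML Q8_0 layout (assumed).
public static class Q8_0DotSketch
{
    public const int BlockBytes = 34;        // 2-byte scale + 32 signed quants
    public const int BlockElementCount = 32; // logical values per block

    public static float Dot(ReadOnlySpan<byte> block, ReadOnlySpan<float> activations)
    {
        if (block.Length != BlockBytes)
            throw new ArgumentException("A Q8_0 block is exactly 34 bytes.", nameof(block));
        if (activations.Length != BlockElementCount)
            throw new ArgumentException("Expected 32 activations.", nameof(activations));

        float scale = (float)BinaryPrimitives.ReadHalfLittleEndian(block);

        // Accumulate quant * activation without materializing dequantized values.
        float sum = 0f;
        for (int i = 0; i < BlockElementCount; i++)
        {
            sbyte quant = unchecked((sbyte)block[2 + i]);
            sum += quant * activations[i];
        }

        // Apply the shared scale once, after the accumulation.
        return scale * sum;
    }
}
```

Applying the scale after the loop changes the rounding sequence compared with scaling each element first, which is another case where a parity tolerance, rather than bitwise equality, is the appropriate check.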

Method MatVecQ4_0(System.ReadOnlySpan<byte>,int,int,System.ReadOnlySpan<float>,System.Span<float>)

Computes a row-major Q4_0 matrix-vector product without materializing full-precision rows.

matrix: The matrix sequence used by this operation; its required length, ordering, and element bounds are validated before access.
rowCount: The row count used to bound this operation; it must be nonnegative and within the supported range.
columnCount: The number of logical columns; it must be divisible by 32.
activations: The activations sequence used by this operation; its required length, ordering, and element bounds are validated before access.
output: The caller-owned destination buffer that receives the result; required capacity is validated before any write occurs.

Method MatVecQ8_0(System.ReadOnlySpan<byte>,int,int,System.ReadOnlySpan<float>,System.Span<float>)

Computes a row-major Q8_0 matrix-vector product without materializing full-precision rows.

matrix: The matrix sequence used by this operation; its required length, ordering, and element bounds are validated before access.
rowCount: The row count used to bound this operation; it must be nonnegative and within the supported range.
columnCount: The number of logical columns; it must be divisible by 32.
activations: The activations sequence used by this operation; its required length, ordering, and element bounds are validated before access.
output: The caller-owned destination buffer that receives the result; required capacity is validated before any write occurs.

ReferenceCpuKernels (UAIX.LmRuntime.Kernels.Cpu, 6 members)

Provides scalar and portable CPU reference kernels for correctness anchoring.

Method SoftmaxInPlace(System.Span<float>)

Computes softmax probabilities for the supplied values in place using numerically stable normalization.

values: The values sequence used by this operation; its required length, ordering, and element bounds are validated before access.
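
"Numerically stable normalization" conventionally means subtracting the maximum before exponentiating, so that the largest exponent is zero and the exponentials cannot overflow. The following is a minimal sketch of that standard technique, not the package's source; SoftmaxSketch is a hypothetical name, and treating an empty span as a no-op is an assumption of the example.

```csharp
using System;

// Hypothetical scalar softmax using the max-subtraction technique.
public static class SoftmaxSketch
{
    public static void SoftmaxInPlace(Span<float> values)
    {
        if (values.IsEmpty)
            return; // assumption: nothing to normalize

        // Find the maximum so every exponent is <= 0.
        float max = values[0];
        for (int i = 1; i < values.Length; i++)
        {
            if (values[i] > max)
                max = values[i];
        }

        // Exponentiate shifted values and accumulate the normalizer.
        float sum = 0f;
        for (int i = 0; i < values.Length; i++)
        {
            values[i] = MathF.Exp(values[i] - max);
            sum += values[i];
        }

        // The maximum element contributes exp(0) = 1, so sum >= 1 for finite input.
        float inverse = 1f / sum;
        for (int i = 0; i < values.Length; i++)
        {
            values[i] *= inverse;
        }
    }
}
```

Shifting by the maximum does not change the result mathematically, because the common factor cancels between the numerator and the normalizer.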

Method ApplyRopeInPlace(System.Span<float>,System.ReadOnlySpan<float>,System.ReadOnlySpan<float>,int)

Applies RoPE rotation to one query or key vector in place using precomputed sine and cosine values.

vector: The vector sequence used by this operation; its required length, ordering, and element bounds are validated before access.
cos: The cos sequence used by this operation; its required length, ordering, and element bounds are validated before access.
sin: The sin sequence used by this operation; its required length, ordering, and element bounds are validated before access.
ropeDimensions: The even number of leading head dimensions transformed by rotary positional encoding.
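
The rotation itself is a 2-D rotation applied to each pair of components. The sketch below is an illustration under stated assumptions, not the package's source: it pairs adjacent components (2p, 2p+1) and expects one cosine and one sine per pair, whereas some model families pair component p with p + ropeDimensions/2 instead. RopeSketch is a hypothetical name.

```csharp
using System;

// Hypothetical RoPE rotation; the adjacent-pair convention is an assumption.
public static class RopeSketch
{
    public static void ApplyInPlace(
        Span<float> vector,
        ReadOnlySpan<float> cos,
        ReadOnlySpan<float> sin,
        int ropeDimensions)
    {
        if (ropeDimensions < 0 || ropeDimensions % 2 != 0 || ropeDimensions > vector.Length)
            throw new ArgumentOutOfRangeException(nameof(ropeDimensions));

        int pairCount = ropeDimensions / 2;
        if (cos.Length < pairCount || sin.Length < pairCount)
            throw new ArgumentException("Need one cosine and one sine value per rotated pair.");

        for (int p = 0; p < pairCount; p++)
        {
            float x0 = vector[2 * p];
            float x1 = vector[2 * p + 1];

            // Standard 2-D rotation by the angle whose cosine and sine were precomputed.
            vector[2 * p] = x0 * cos[p] - x1 * sin[p];
            vector[2 * p + 1] = x0 * sin[p] + x1 * cos[p];
        }
        // Components at index >= ropeDimensions are left unchanged.
    }
}
```

Rotating only the leading ropeDimensions components leaves the remainder of the head untouched, which matches the parameter description of an even number of leading dimensions.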

Method Softmax(System.Span<float>)

Computes softmax probabilities for the supplied values using numerically stable normalization.

values: The values sequence used by this operation; its required length, ordering, and element bounds are validated before access.

Method ApplyRope(System.Span<float>,int,float)

Applies RoPE rotation using generated trigonometric tables for the supplied position.

vector: The vector sequence used by this operation; its required length, ordering, and element bounds are validated before access.
position: The zero-based sequence or cache position addressed by the operation; it must lie within the allocated context and readable or writable range.
theta: The rotary angle in radians applied to the paired vector components at the addressed position.

Method RmsNorm(System.ReadOnlySpan<float>,System.ReadOnlySpan<float>,System.Span<float>,float)

Applies RMS normalization using the shared vector math implementation.

input: The source data consumed by the operation; caller-owned storage is not retained after the method returns.
weight: The weight sequence used by this operation; its required length, ordering, and element bounds are validated before access.
output: The caller-owned destination buffer that receives the result; required capacity is validated before any write occurs.
epsilon: The positive normalization epsilon added to the mean-square term to avoid division by zero while preserving deterministic numerical behavior.
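
For reference, RMS normalization is commonly computed as output[i] = input[i] * weight[i] / sqrt(mean(input²) + epsilon). The sketch below shows that formula as a standalone example; the class name `RmsNormSketch`, the double-precision accumulator, and the exception types are assumptions rather than details of the shared vector math implementation.

```csharp
using System;

public static class RmsNormSketch
{
    public static void RmsNorm(ReadOnlySpan<float> input, ReadOnlySpan<float> weight,
                               Span<float> output, float epsilon)
    {
        if (!(epsilon > 0f))
            throw new ArgumentException("epsilon must be positive.");
        if (weight.Length < input.Length || output.Length < input.Length)
            throw new ArgumentException("weight and output must cover every input element.");
        if (input.IsEmpty) return; // assumption: nothing to normalize

        // Accumulate in double to limit rounding error on long rows (a choice of this sketch).
        double sumOfSquares = 0.0;
        for (int i = 0; i < input.Length; i++)
            sumOfSquares += (double)input[i] * input[i];

        float scale = (float)(1.0 / Math.Sqrt(sumOfSquares / input.Length + epsilon));
        for (int i = 0; i < input.Length; i++)
            output[i] = input[i] * scale * weight[i];
    }
}
```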

Method MatVec(System.ReadOnlySpan<float>,int,int,System.ReadOnlySpan<float>,System.Span<float>)

Computes a matrix-vector product for row-major float32 weights.

matrix: The matrix sequence used by this operation; its required length, ordering, and element bounds are validated before access.
rowCount: The row count used to bound this operation; it must be nonnegative and within the supported range.
columnCount: The column count used to bound this operation; it must be nonnegative and within the supported range.
vector: The vector sequence used by this operation; its required length, ordering, and element bounds are validated before access.
output: The caller-owned destination buffer that receives the result; required capacity is validated before any write occurs.
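
The row-major contract can be illustrated with a short scalar sketch: row r occupies elements r * columnCount through (r + 1) * columnCount - 1 of the matrix span. This is a standalone example with a hypothetical class name, not the package source.

```csharp
using System;

public static class MatVecSketch
{
    public static void MatVec(ReadOnlySpan<float> matrix, int rowCount, int columnCount,
                              ReadOnlySpan<float> vector, Span<float> output)
    {
        if (rowCount < 0 || columnCount < 0)
            throw new ArgumentException("Row and column counts must be nonnegative.");
        if (matrix.Length < (long)rowCount * columnCount || vector.Length < columnCount || output.Length < rowCount)
            throw new ArgumentException("A buffer is too small for the declared shape.");

        for (int row = 0; row < rowCount; row++)
        {
            // Each output element is the dot product of one contiguous row with the vector.
            ReadOnlySpan<float> rowValues = matrix.Slice(row * columnCount, columnCount);
            float sum = 0f;
            for (int column = 0; column < columnCount; column++)
                sum += rowValues[column] * vector[column];
            output[row] = sum;
        }
    }
}
```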

Q4_0Block UAIX.LmRuntime.Kernels.Cpu 2 members

Defines the exact packed Q4_0 block layout used by GGML storage.

Field Scale

Gets or sets the little-endian IEEE half scale field.

Field QuantizedValues

Stores 32 signed 4-bit values in 16 packed bytes.
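
To make the packing concrete, here is a hedged sketch of decoding one Q4_0 block from raw bytes. It assumes the conventional GGML layout: a 2-byte little-endian half scale, then 16 bytes where the low nibble of byte j holds element j and the high nibble holds element j + 16, each stored with an offset of 8. The class name `Q4_0Sketch` is hypothetical and the code does not use the `Q4_0Block` type itself.

```csharp
using System;
using System.Buffers.Binary;

public static class Q4_0Sketch
{
    public const int BlockBytes = 18;  // 2-byte scale + 16 packed bytes (assumed layout)
    public const int BlockValues = 32;

    public static void DequantizeBlock(ReadOnlySpan<byte> block, Span<float> destination)
    {
        if (block.Length < BlockBytes || destination.Length < BlockValues)
            throw new ArgumentException("Block or destination is too small.");

        float scale = (float)BitConverter.UInt16BitsToHalf(
            BinaryPrimitives.ReadUInt16LittleEndian(block));

        for (int j = 0; j < 16; j++)
        {
            byte packed = block[2 + j];
            // Each nibble is 0..15 on disk; subtracting 8 yields the signed range -8..7.
            destination[j] = ((packed & 0x0F) - 8) * scale;
            destination[j + 16] = ((packed >> 4) - 8) * scale;
        }
    }
}
```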

Q8_0Block UAIX.LmRuntime.Kernels.Cpu 2 members

Defines the exact packed Q8_0 block layout used by GGML storage.

Field Scale

Gets or sets the little-endian IEEE half scale field.

Field QuantizedValues

Stores 32 signed 8-bit values.

ReferenceMatrixStorageDescriptor UAIX.LmRuntime.Kernels.Cpu 4 members

Describes one supported scalar matrix storage layout.

Property GgmlType

Gets the GGML tensor type.

Property RowCount

Gets the logical row count.

Property ColumnCount

Gets the logical column count.

Property RequiredByteCount

Gets the exact required storage byte count.

ReferenceMatrixRowDispatcher UAIX.LmRuntime.Kernels.Cpu 2 members

Dispatches correctness-first matrix-vector operations for supported mapped scalar and quantized rows.

Method Describe(UAIX.LmRuntime.Tensors.GgmlTensorType,int,int)

Computes the exact storage byte count for a supported matrix.

type: The declared GGML tensor or metadata type used to select the corresponding decoding and validation rules.
rowCount: The row count used to bound this operation; it must be nonnegative and within the supported range.
columnCount: The column count used to bound this operation; it must be nonnegative and within the supported range.

Returns: A ReferenceMatrixStorageDescriptor holding the exact storage byte count for the supported matrix. It is returned only after all documented validation succeeds.
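
The byte-count arithmetic behind such a descriptor can be sketched as follows. The enum `SketchTensorType` and the class `StorageSizeSketch` are hypothetical stand-ins (the real `GgmlTensorType` members are not shown here), and the per-block sizes assume the standard GGML layouts of 18 bytes per Q4_0 block and 34 bytes per Q8_0 block.

```csharp
using System;

// Hypothetical stand-in for the tensor types a dispatcher might support.
public enum SketchTensorType { F32, F16, BF16, Q4_0, Q8_0 }

public static class StorageSizeSketch
{
    public static long RequiredByteCount(SketchTensorType type, int rowCount, int columnCount)
    {
        if (rowCount < 0 || columnCount < 0)
            throw new ArgumentException("Row and column counts must be nonnegative.");

        switch (type)
        {
            case SketchTensorType.F32: return checked((long)rowCount * columnCount * 4);
            case SketchTensorType.F16:
            case SketchTensorType.BF16: return checked((long)rowCount * columnCount * 2);
            case SketchTensorType.Q4_0: return BlockBytes(rowCount, columnCount, 18);
            case SketchTensorType.Q8_0: return BlockBytes(rowCount, columnCount, 34);
            default: throw new NotSupportedException("Unsupported storage type: " + type);
        }
    }

    private static long BlockBytes(int rowCount, int columnCount, int bytesPerBlock)
    {
        // Block-quantized rows must hold a whole number of 32-value blocks.
        if (columnCount % 32 != 0)
            throw new ArgumentException("columnCount must be divisible by 32 for block-quantized rows.");
        return checked((long)rowCount * (columnCount / 32) * bytesPerBlock);
    }
}
```

Computing the size in `long` with `checked` arithmetic avoids silent overflow for large shapes.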

Method MatVec(UAIX.LmRuntime.Tensors.GgmlTensorType,System.ReadOnlySpan<byte>,int,int,System.ReadOnlySpan<float>,System.Span<float>)

Computes a matrix-vector product over little-endian encoded matrix storage without materializing a complete dequantized matrix.

type: The declared GGML tensor or metadata type used to select the corresponding decoding and validation rules.
matrix: The matrix sequence used by this operation; its required length, ordering, and element bounds are validated before access.
rowCount: The row count used to bound this operation; it must be nonnegative and within the supported range.
columnCount: The column count used to bound this operation; it must be nonnegative and within the supported range.
activations: The activations sequence used by this operation; its required length, ordering, and element bounds are validated before access.
output: The caller-owned destination buffer that receives the result; required capacity is validated before any write occurs.

Scalar16CpuKernels UAIX.LmRuntime.Kernels.Cpu 8 members

Provides correctness-first scalar F16 and BF16 decoding and matrix-vector kernels.

Method DecodeFloat16(System.ReadOnlySpan<byte>,bool)

Decodes one IEEE binary16 value with explicit byte order.

bytes: The bytes sequence used by this operation; its required length, ordering, and element bounds are validated before access.
bigEndian: True to decode source bytes in big-endian order; false to decode the little-endian representation used by the local GGUF artifact.

Returns: The decoded IEEE binary16 value as a float. Range, finite-value, and overflow checks are completed before the value is returned.

Method DecodeFloat16(System.ReadOnlySpan<byte>,bool,bool)

Decodes one IEEE binary16 value with explicit byte order and non-finite policy.

bytes: The bytes sequence used by this operation; its required length, ordering, and element bounds are validated before access.
bigEndian: True to decode source bytes in big-endian order; false to decode the little-endian representation used by the local GGUF artifact.
rejectNonFinite: Whether NaN and infinity are rejected as invalid model weights.

Returns: The decoded IEEE binary16 value as a float, subject to the requested non-finite policy. Range, finite-value, and overflow checks are completed before the value is returned.
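
A minimal sketch of binary16 decoding with explicit byte order and an optional non-finite check is shown below. It relies only on the .NET base library (.NET 6 or later for `BitConverter.UInt16BitsToHalf`); the class name `Float16Sketch` and the use of `InvalidDataException` are assumptions, since the exception type used by the package is not stated here.

```csharp
using System;
using System.Buffers.Binary;
using System.IO;

public static class Float16Sketch
{
    public static float DecodeFloat16(ReadOnlySpan<byte> bytes, bool bigEndian, bool rejectNonFinite)
    {
        if (bytes.Length < 2)
            throw new ArgumentException("Two bytes are required to decode one binary16 value.");

        // Pick the reader that matches the declared byte order of the source.
        ushort bits = bigEndian
            ? BinaryPrimitives.ReadUInt16BigEndian(bytes)
            : BinaryPrimitives.ReadUInt16LittleEndian(bytes);

        // Widening Half to float is exact; every binary16 value is representable in binary32.
        float value = (float)BitConverter.UInt16BitsToHalf(bits);

        if (rejectNonFinite && !float.IsFinite(value))
            throw new InvalidDataException("NaN or infinity found where a finite weight was expected.");
        return value;
    }
}
```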

Method DecodeBFloat16(System.ReadOnlySpan<byte>,bool)

Decodes one bfloat16 value with explicit byte order.

bytes: The bytes sequence used by this operation; its required length, ordering, and element bounds are validated before access.
bigEndian: True to decode source bytes in big-endian order; false to decode the little-endian representation used by the local GGUF artifact.

Returns: The decoded bfloat16 value as a float. Range, finite-value, and overflow checks are completed before the value is returned.

Method DecodeBFloat16(System.ReadOnlySpan<byte>,bool,bool)

Decodes one bfloat16 value with explicit byte order and non-finite policy.

bytes: The bytes sequence used by this operation; its required length, ordering, and element bounds are validated before access.
bigEndian: True to decode source bytes in big-endian order; false to decode the little-endian representation used by the local GGUF artifact.
rejectNonFinite: Whether NaN and infinity are rejected as invalid model weights.

Returns: The decoded bfloat16 value as a float, subject to the requested non-finite policy. Range, finite-value, and overflow checks are completed before the value is returned.
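
Because bfloat16 is the upper 16 bits of an IEEE binary32 value, decoding can be sketched as a 16-bit left shift followed by a bit reinterpretation. The example below is standalone; the class name `BFloat16Sketch` and the `InvalidDataException` choice are assumptions.

```csharp
using System;
using System.Buffers.Binary;
using System.IO;

public static class BFloat16Sketch
{
    public static float DecodeBFloat16(ReadOnlySpan<byte> bytes, bool bigEndian, bool rejectNonFinite)
    {
        if (bytes.Length < 2)
            throw new ArgumentException("Two bytes are required to decode one bfloat16 value.");

        ushort bits = bigEndian
            ? BinaryPrimitives.ReadUInt16BigEndian(bytes)
            : BinaryPrimitives.ReadUInt16LittleEndian(bytes);

        // Place the 16 stored bits in the high half of a 32-bit pattern; the low mantissa bits are zero.
        int singleBits = unchecked((int)((uint)bits << 16));
        float value = BitConverter.Int32BitsToSingle(singleBits);

        if (rejectNonFinite && !float.IsFinite(value))
            throw new InvalidDataException("NaN or infinity found where a finite weight was expected.");
        return value;
    }
}
```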

Method CopyFloat16(System.ReadOnlySpan<byte>,System.Span<float>,bool)

Copies F16 values into a caller-owned float32 destination.

source: The source data consumed by the operation; caller-owned storage is not retained after the method returns.
destination: The destination with one element per F16 value.
bigEndian: True to decode source bytes in big-endian order; false to decode the little-endian representation used by the local GGUF artifact.

Method CopyBFloat16(System.ReadOnlySpan<byte>,System.Span<float>,bool)

Copies BF16 values into a caller-owned float32 destination.

source: The source data consumed by the operation; caller-owned storage is not retained after the method returns.
destination: The destination with one element per BF16 value.
bigEndian: True to decode source bytes in big-endian order; false to decode the little-endian representation used by the local GGUF artifact.

Method MatVecFloat16(System.ReadOnlySpan<byte>,int,int,System.ReadOnlySpan<float>,System.Span<float>,bool)

Computes a row-major F16 matrix-vector product without whole-matrix conversion.

matrix: The matrix sequence used by this operation; its required length, ordering, and element bounds are validated before access.
rowCount: The row count used to bound this operation; it must be nonnegative and within the supported range.
columnCount: The column count used to bound this operation; it must be nonnegative and within the supported range.
vector: The vector sequence used by this operation; its required length, ordering, and element bounds are validated before access.
output: The caller-owned destination buffer that receives the result; required capacity is validated before any write occurs.
bigEndian: True to decode source bytes in big-endian order; false to decode the little-endian representation used by the local GGUF artifact.

Method MatVecBFloat16(System.ReadOnlySpan<byte>,int,int,System.ReadOnlySpan<float>,System.Span<float>,bool)

Computes a row-major BF16 matrix-vector product without whole-matrix conversion.

matrix: The matrix sequence used by this operation; its required length, ordering, and element bounds are validated before access.
rowCount: The row count used to bound this operation; it must be nonnegative and within the supported range.
columnCount: The column count used to bound this operation; it must be nonnegative and within the supported range.
vector: The vector sequence used by this operation; its required length, ordering, and element bounds are validated before access.
output: The caller-owned destination buffer that receives the result; required capacity is validated before any write occurs.
bigEndian: True to decode source bytes in big-endian order; false to decode the little-endian representation used by the local GGUF artifact.

Frequently asked questions

Does Auto always mean AVX2?

No. Auto selects the highest implemented and supported tier for that operation. Inspect CpuKernelSelection rather than inferring the result from the host CPU.

Are all GGML storage types executable?

No. Representation in Tensors and execution in Kernels.Cpu are separate evidence levels. Check the specific dispatcher, dequantizer, or matrix source used by your model path.

Can I pass overlapping input and output spans?

Do not assume overlap is supported unless the member documentation explicitly permits it. Use separate caller-owned output storage for matrix and normalization operations.
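
If you want to enforce this on the caller side, one option is a guard built on the standard `MemoryExtensions.Overlaps` extension method. The helper below is a hypothetical sketch, not part of this package.

```csharp
using System;

public static class OverlapGuardSketch
{
    // Throws when the destination shares any memory with the source.
    public static void EnsureDisjoint(ReadOnlySpan<float> input, Span<float> output)
    {
        if (output.Overlaps(input))
            throw new ArgumentException("Input and output spans must not overlap.");
    }
}
```

Call the guard before handing the spans to a matrix or normalization kernel when both might be sliced from the same backing array.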

Does this package use native libraries?

The package provides pure managed CPU kernels. It does not provide GPU or native inference backends.