UAIX.LmRuntime / Package guide

UAIX.LmRuntime.Tokenization

GGUF tokenizer metadata, tokenizer engines, chat rendering, special-token control, and parity tools.

Required For tokenizer and chat-template work

UAIX.LmRuntime.Tokenization

GGUF tokenizer metadata, tokenizer factories and engines, special-token handling, chat templates, truncation, safety, and parity tools.

Overview

Tokenizer implementations and chat template rendering for pure C# local LLM runtime packages.

Who should use it Applications and model loaders that need model-coupled encoding/decoding, chat rendering, token budgets, or tokenizer verification.
Execution status Managed tokenizer metadata, encoding, decoding, chat templates, truncation, special-token handling, and parity tools are represented in the supplied source.

Install

.NET CLI
dotnet add package UAIX.LmRuntime.Tokenization
Project file
<PackageReference Include="UAIX.LmRuntime.Tokenization" />

Version policy: The documentation deliberately omits UAIX.LmRuntime package version numbers. Resolve and pin versions through your normal dependency-management and lock-file process.

Direct package dependencies
UAIX.LmRuntime.Abstractions Guide NuGet ↗
UAIX.LmRuntime.Gguf Guide NuGet ↗

Package role and boundaries

Required For tokenizer and chat-template work

  • Creating a tokenizer from GGUF metadata.
  • Using SentencePiece BPE, GPT-2 BPE, RWKV-world, metadata-driven, or tokenizer.json adapter surfaces.
  • Controlling special tokens, streaming UTF-8 decode, chat-template validation, token-budget truncation, or golden-corpus parity.

Boundary

  • Assuming one tokenizer is interchangeable across model artifacts.
  • Executing arbitrary general-purpose Jinja templates or silently repairing unsupported metadata.

Tokenizer follows the artifact

Read and validate tokenizer metadata from the same GGUF model that supplies the weights. Vocabulary, merges, special IDs, pre-tokenizer, and chat-template behavior are model-coupled.

Configuration is explicit

Adding BOS/EOS tokens, parsing special tokens, removing or unparsing special tokens, whitespace cleanup, traces, and invalid UTF-16 policy are caller-visible choices.

Parity is testable

Golden records, fingerprints, reconciliation, and parity reports let maintainers compare behavior without treating a single successful prompt as broad compatibility proof.

Key types

These are the main public entry points. The generated reference below includes the documented public package surface.

Coding examples

Examples use the documented public package surface. Paths, identities, runtime identifiers, device evidence, and application policy remain host inputs.

Create a strict tokenizer from GGUF metadata

Use the common ITokenizer contract for straightforward encoding and decoding.

TokenizerFactoryExample.cs
using UAIX.LmRuntime.Abstractions;
using UAIX.LmRuntime.Gguf;
using UAIX.LmRuntime.Tokenization;

GgufModel model = GgufReader.Read(
    "models/model.gguf",
    new GgufParseOptions());

ITokenizer tokenizer =
    new GgufTokenizerFactory().CreateStrict(model);

IReadOnlyList<int> tokenIds = tokenizer.Encode(
    "Hello, local runtime.",
    addBos: true,
    addEos: false);

string roundTrip = tokenizer.Decode(tokenIds);

Console.WriteLine(string.Join(", ", tokenIds));
Console.WriteLine(roundTrip);

Request a detailed tokenization trace

Use the GGUF-specific interface when the caller needs explicit special-token and trace controls.

DetailedTokenizationExample.cs
using UAIX.LmRuntime.Gguf;
using UAIX.LmRuntime.Tokenization;

GgufModel model = GgufModel.Load(
    "models/model.gguf",
    new GgufParseOptions());

GgufTokenizerMetadata metadata =
    GgufTokenizerMetadataReader.ReadStrict(model);

var tokenizer = new MetadataDrivenGgufTokenizer(metadata);

TokenizationResult result = tokenizer.Encode(
    "Explain token boundaries.",
    new TokenizationOptions
    {
        AddSpecialTokens = true,
        ParseSpecialTokens = false,
        EmitTrace = true,
        InvalidUtf16Policy = InvalidUtf16Policy.Reject
    });

foreach (string traceLine in result.Trace)
{
    Console.WriteLine(traceLine);
}

string decoded = tokenizer.Decode(
    result.TokenIds,
    new DetokenizationOptions
    {
        RemoveSpecialTokens = true,
        UnparseSpecialTokens = false,
        CleanSpaces = false
    });

Render a bounded chat transcript

Use the deterministic role/content renderer when a general Jinja interpreter is neither required nor desired.

ChatTemplateExample.cs
using UAIX.LmRuntime.Contracts;
using UAIX.LmRuntime.Tokenization;

var messages = new[]
{
    LlmMessage.System("Answer in one paragraph."),
    LlmMessage.User("What is a tensor?")
};

string prompt = new ChatTemplateRenderer().Render(messages);
Console.WriteLine(prompt);

Truncate chat history to a token budget

Delegate counting to the active model tokenizer rather than using character length as a proxy.

TokenBudgetExample.cs
using UAIX.LmRuntime.Abstractions;
using UAIX.LmRuntime.Contracts;
using UAIX.LmRuntime.Tokenization;

/// <summary>
/// Truncates an ordered transcript to the supplied tokenizer budget.
/// </summary>
/// <param name="messages">The messages to fit into the budget.</param>
/// <param name="tokenizer">The tokenizer for the target model.</param>
/// <param name="maximumTokens">The maximum accepted token count.</param>
/// <returns>The retained ordered message sequence.</returns>
static IReadOnlyList<LlmMessage> FitTranscript(
    IReadOnlyList<LlmMessage> messages,
    ITokenizer tokenizer,
    int maximumTokens)
{
    return new TokenBudgetTruncator().TruncateMessages(
        messages,
        tokenizer,
        maximumTokens);
}

Decode partial UTF-8 token bytes safely

Keep incomplete UTF-8 sequences buffered across token boundaries and flush at stream completion.

StreamingDecodeExample.cs
using UAIX.LmRuntime.Tokenization;

var decoder = new StreamingUtf8TokenDecoder();

string first = decoder.Decode([0xE2, 0x82], flush: false);
string second = decoder.Decode([0xAC], flush: false);
string final = decoder.Decode(ReadOnlySpan<byte>.Empty, flush: true);

Console.Write(first);
Console.Write(second);
Console.Write(final);

Generated API reference

Expand a type to review its documented public fields, properties, constructors, methods, parameter descriptions, and return descriptions.

ChatTemplateRendererUAIX.LmRuntime.Tokenization 1 member

Renders a minimal safe chat template suitable for deterministic tests and initial GGUF tokenizer work.

Method Render(System.Collections.Generic.IEnumerable<UAIX.LmRuntime.Contracts.LlmMessage>)

Renders messages using a small role/content template rather than a general Jinja interpreter.

messages
The messages sequence used by this operation; its required length, ordering, and element bounds are validated before access.

Returns: The text produced by ChatTemplateRenderer.Render for this contract: Renders messages using a small role/content template rather than a general Jinja interpreter. The returned string is detached from mutable caller storage and is not persisted by the operation.

GgufTokenizerFingerprintUAIX.LmRuntime.Tokenization 1 member

Computes a deterministic SHA-256 identity for model-facing GGUF tokenizer metadata.

Method Create(UAIX.LmRuntime.Tokenization.GgufTokenizerMetadata)

Computes a canonical tokenizer fingerprint without treating decoded text as token-ID authority.

metadata
The metadata containing validated format or tokenizer metadata required by this operation.

Returns: The text produced by GgufTokenizerFingerprint.Create for this contract: Computes a canonical tokenizer fingerprint without treating decoded text as token-ID authority. The returned string is detached from mutable caller storage and is not persisted by the operation.

GgufTokenTypeUAIX.LmRuntime.Tokenization 6 members

Identifies the tokenizer token type stored in GGUF metadata.

Field Normal

Normal token text.

Field Unknown

Unknown token.

Field Control

Control token.

Field UserDefined

User-defined token.

Field Unused

Unused token slot.

Field Byte

Byte-fallback token.

GgufTokenUAIX.LmRuntime.Tokenization 4 members

Represents one GGUF vocabulary token.

Property TokenId

Gets the token identifier used by model embedding rows.

Property Text

Gets the raw token text from the GGUF vocabulary.

Property Score

Gets the tokenizer score associated with the token.

Property Type

Gets the token type associated with the token.

GgufSpecialTokenMapUAIX.LmRuntime.Tokenization 6 members

Represents special token identifiers resolved from GGUF metadata.

Property BosTokenId

Gets the beginning-of-sequence token identifier.

Property EosTokenId

Gets the end-of-sequence token identifier.

Property UnknownTokenId

Gets the unknown token identifier.

Property SeparatorTokenId

Gets the separator token identifier.

Property PaddingTokenId

Gets the padding token identifier.

Method EnumerateKnownTokenIds

Enumerates the known token identifiers in stable source order without exposing mutable internal storage.

Returns: An ordered sequence containing the known token identifiers as produced by the validated operation.

GgufTokenizerMetadataUAIX.LmRuntime.Tokenization 21 members

Captures tokenizer metadata loaded from a GGUF model.

Property TokenizerModel

Gets the tokenizer model name from GGUF metadata.

Property PreTokenizer

Gets the tokenizer pre-tokenizer name, when present.

Property Tokens

Gets the vocabulary tokens indexed by token identifier.

Property Merges

Gets the BPE merge rules from GGUF metadata.

Property AddedTokens

Gets the added tokens from GGUF metadata.

Property SourceScoreCount

Gets the source score-array length, or zero when the metadata key was absent.

Property SourceTokenTypeCount

Gets the source token-type-array length, or zero when the metadata key was absent.

Property ScoresPresent

Gets a value indicating whether tokenizer.ggml.scores was present.

Property TokenTypesPresent

Gets a value indicating whether tokenizer.ggml.token_type was present.

Property PrecompiledCharsMap

Gets the optional precompiled SentencePiece normalization character map.

Property SpecialTokens

Gets the special token identifiers.

Property AddBos

Gets whether model-defined BOS insertion is enabled.

Property AddEos

Gets whether model-defined EOS insertion is enabled.

Property AddSeparator

Gets whether model-defined separator insertion is enabled.

Property AddSpacePrefix

Gets whether a leading space prefix is added before text fragments.

Property EscapeWhitespaces

Gets whether whitespace characters are escaped using SentencePiece whitespace notation.

Property RemoveExtraWhitespaces

Gets whether tokenizer-specific extra whitespace removal is enabled.

Property CleanSpaces

Gets whether detokenization should clean spaces around punctuation.

Property ChatTemplate

Gets the chat template from GGUF metadata, when present.

Property HuggingFaceTokenizerJson

Gets the embedded Hugging Face tokenizer JSON, when present.

Property VocabularySize

Gets the effective vocabulary size from the token array.

TokenizationOptionsUAIX.LmRuntime.Tokenization 6 members

Describes tokenization behavior for one encode operation.

Property AddSpecialTokens

Gets whether model-defined special tokens should be added.

Property ParseSpecialTokens

Gets whether raw special-token text should be parsed as special tokens.

Property OverrideAddBos

Gets an optional override for BOS insertion.

Property OverrideAddEos

Gets an optional override for EOS insertion.

Property EmitTrace

Gets whether content-minimized trace data should be emitted for parity diagnostics.

Property InvalidUtf16Policy

Gets the policy for invalid UTF-16 surrogate sequences.

DetokenizationOptionsUAIX.LmRuntime.Tokenization 3 members

Describes detokenization behavior for one decode operation.

Property RemoveSpecialTokens

Gets whether special tokens should be removed from decoded text.

Property UnparseSpecialTokens

Gets whether special tokens should be emitted as their raw token text.

Property CleanSpaces

Gets whether tokenizer-specific space cleanup should be applied.

MetadataDrivenGgufTokenizerDetokenizationOptionsUAIX.LmRuntime.Tokenization 4 members

Provides a stable LocalEndpoint-facing name for metadata-driven GGUF detokenization controls.

This compatibility type mirrors without inheritance because the canonical options type is sealed. It allows integration code to use a descriptive contract while the tokenizer retains one canonical internal representation.

Property RemoveSpecialTokens

Gets whether special tokens should be removed from decoded text.

Property UnparseSpecialTokens

Gets whether special tokens should be emitted as their raw token text.

Property CleanSpaces

Gets whether tokenizer-specific space cleanup should be applied.

Method ToDetokenizationOptions

Creates the canonical detokenization options consumed by the tokenizer engine.

Returns: A new canonical options instance with the same behavior flags.

TokenizationResultUAIX.LmRuntime.Tokenization 2 members

Represents the output of a tokenizer encode operation.

Property TokenIds

Gets the emitted token identifiers.

Property Trace

Gets optional content-minimized events used for tokenizer parity diagnostics.

IGgufTokenizerUAIX.LmRuntime.Tokenization 3 members

Encodes and decodes text for a GGUF-backed model.

Property Metadata

Gets the tokenizer metadata used by this tokenizer.

Method Encode(string,UAIX.LmRuntime.Tokenization.TokenizationOptions)

Encodes the supplied text with the configured tokenizer and validated special-token policy.

text
The text processed by the configured encoding or normalization rules; it must satisfy the declared nullability contract.
options
The optional TokenizationOptions controlling Encode; null selects the documented defaults, supplied limits are validated before allocation, and the instance is not mutated.

Returns: The TokenizationResult result produced by IGgufTokenizer.Encode for this contract: Encodes the supplied text with the configured tokenizer and validated special-token policy. It is published only after all documented validation and ownership transitions succeed.

Method Decode(System.Collections.Generic.IReadOnlyList<int>,UAIX.LmRuntime.Tokenization.DetokenizationOptions)

Decodes the supplied token sequence with the configured tokenizer while preserving sequence order.

tokenIds
The ordered token identifiers to process; sequence order is preserved and each identifier is validated where required.
options
The optional DetokenizationOptions controlling Decode; null selects the documented defaults, supplied limits are validated before allocation, and the instance is not mutated.

Returns: The decoded text produced from the validated token sequence in the original sequence order.

GgufTokenizerMetadataValidationResultUAIX.LmRuntime.Tokenization 2 members

Represents tokenizer metadata validation output.

Property Diagnostics

Gets validation diagnostics.

Property IsValid

Gets a value indicating whether no diagnostics were emitted.

GgufTokenizerMetadataReaderUAIX.LmRuntime.Tokenization 2 members

Builds tokenizer metadata from a parsed GGUF artifact.

Method Read(UAIX.LmRuntime.Gguf.GgufModel)

Reads tokenizer metadata without throwing for validation failures.

model
The parsed GGUF model whose validated metadata and tensor catalog are consumed by this operation.

Returns: The GgufTokenizerMetadata result produced by GgufTokenizerMetadataReader.Read for this contract: Reads tokenizer metadata without throwing for validation failures. It is published only after all documented validation and ownership transitions succeed.

Method ReadStrict(UAIX.LmRuntime.Gguf.GgufModel)

Reads tokenizer metadata and throws when validation fails.

model
The parsed GGUF model whose validated metadata and tensor catalog are consumed by this operation.

Returns: The GgufTokenizerMetadata result produced by GgufTokenizerMetadataReader.ReadStrict for this contract: Reads tokenizer metadata and throws when validation fails. It is published only after all documented validation and ownership transitions succeed.

GgufTokenizerMetadataValidatorUAIX.LmRuntime.Tokenization 1 member

Validates GGUF tokenizer metadata before runtime use.

Method Validate(UAIX.LmRuntime.Tokenization.GgufTokenizerMetadata)

Validates the supplied metadata against the invariants required by GgufTokenizerMetadataValidator.

metadata
The metadata containing validated format or tokenizer metadata required by this operation.

Returns: The GgufTokenizerMetadataValidationResult result produced by GgufTokenizerMetadataValidator.Validate for this contract: Validates the supplied metadata against the invariants required by GgufTokenizerMetadataValidator. It is published only after all documented validation and ownership transitions succeed.

InvalidGgufTokenizerExceptionUAIX.LmRuntime.Tokenization 1 member

Thrown when GGUF tokenizer metadata is invalid.

Method InvalidGgufTokenizerException(string)

Initializes a new InvalidGgufTokenizerException instance with validated dependencies and operational bounds.

message
The display-safe diagnostic message describing the failure without embedding prompt text, generated text, credentials, or private file contents.
UnsupportedTokenizerExceptionUAIX.LmRuntime.Tokenization 1 member

Thrown when a GGUF tokenizer family is not supported.

Method UnsupportedTokenizerException(string)

Initializes a new UnsupportedTokenizerException instance with validated dependencies and operational bounds.

message
The display-safe diagnostic message describing the failure without embedding prompt text, generated text, credentials, or private file contents.
Gpt2BpeTokenizerEngineUAIX.LmRuntime.Tokenization 4 members

Implements GPT-2 byte-level BPE from GGUF vocabulary and merge metadata.

Method Gpt2BpeTokenizerEngine(UAIX.LmRuntime.Tokenization.GgufTokenizerMetadata)

Initializes a GPT-2 BPE engine from validated GGUF tokenizer metadata.

metadata
The metadata containing validated format or tokenizer metadata required by this operation.
Property Name
Method EncodeRaw(string,UAIX.LmRuntime.Tokenization.TokenizerFragmentContext,System.Collections.Generic.IList<int>,System.Collections.Generic.IList<string>)

Encodes the raw with the configured tokenizer and validated special-token policy.

text
The text to process using the configured encoding and normalization rules.
context
The context that supplies session-scoped identity and boundary state; it is validated before dependent work begins.
destination
The destination buffer that receives the produced values.
trace
The trace sequence used by this operation; its required length, ordering, and element bounds are validated before access.
Method Decode(System.Collections.Generic.IReadOnlyList<int>,UAIX.LmRuntime.Tokenization.DetokenizationOptions)

Decodes the supplied token sequence with the configured tokenizer while preserving sequence order.

tokenIds
The token identifiers to process in sequence order.
options
The optional DetokenizationOptions controlling Decode; null selects the documented defaults, supplied limits are validated before allocation, and the instance is not mutated.

Returns: The decoded text produced from the validated token sequence in the original sequence order.

MetadataDrivenGgufTokenizerUAIX.LmRuntime.Tokenization 12 members

Executes a GGUF tokenizer by combining special-token partitioning with a family-specific tokenizer engine.

Real GGUF execution never falls back to whitespace tokenization. Unsupported tokenizer families fail during construction so token identifiers cannot silently diverge from the model embedding table.

Method MetadataDrivenGgufTokenizer(UAIX.LmRuntime.Tokenization.GgufTokenizerMetadata)

Initializes a tokenizer from validated GGUF metadata.

metadata
The metadata containing validated format or tokenizer metadata required by this operation.
Method MetadataDrivenGgufTokenizer(UAIX.LmRuntime.Tokenization.GgufTokenizerMetadata,UAIX.LmRuntime.Tokenization.IGgufTokenizerEngine)

Initializes a tokenizer with an explicitly selected family engine.

metadata
The metadata containing validated format or tokenizer metadata required by this operation.
engine
The validated IGgufTokenizerEngine dependency consumed by MetadataDrivenGgufTokenizer; ownership and lifetime remain with the caller unless this member explicitly documents a transfer.
Property Name
Property Metadata
Method Tokenize(string)

Tokenizes the supplied text with the configured metadata and preserves deterministic token order.

text
The text to process using the configured encoding and normalization rules.

Returns: An ordered read-only collection of token text values produced by the configured tokenizer.

Method Encode(string,bool,bool)

Encodes the supplied text with the configured tokenizer and validated special-token policy.

text
The text to process using the configured encoding and normalization rules.
addBos
A value indicating whether add BOS applies to this operation.
addEos
A value indicating whether add EOS applies to this operation.

Returns: An ordered read-only collection of token identifiers produced by the configured tokenizer.

Method Encode(string,UAIX.LmRuntime.Tokenization.TokenizationOptions)

Encodes the supplied text with the configured tokenizer and validated special-token policy.

text
The text to process using the configured encoding and normalization rules.
options
The optional TokenizationOptions controlling Encode; null selects the documented defaults, supplied limits are validated before allocation, and the instance is not mutated.

Returns: The TokenizationResult result produced by MetadataDrivenGgufTokenizer.Encode for this contract: Encodes the supplied text with the configured tokenizer and validated special-token policy. It is published only after all documented validation and ownership transitions succeed.

Method Decode(System.Collections.Generic.IEnumerable<int>)

Decodes the supplied token sequence with the configured tokenizer while preserving sequence order.

tokenIds
The token identifiers to process in sequence order.

Returns: The decoded text produced from the validated token sequence in the original sequence order.

Method Decode(System.Collections.Generic.IReadOnlyList<int>,UAIX.LmRuntime.Tokenization.DetokenizationOptions)

Decodes the supplied token sequence with the configured tokenizer while preserving sequence order.

tokenIds
The token identifiers to process in sequence order.
options
The optional DetokenizationOptions controlling Decode; null selects the documented defaults, supplied limits are validated before allocation, and the instance is not mutated.

Returns: The decoded text produced from the validated token sequence in the original sequence order.

Method Decode(System.Collections.Generic.IReadOnlyList<int>,UAIX.LmRuntime.Tokenization.MetadataDrivenGgufTokenizerDetokenizationOptions)

Decodes model token identifiers using the stable metadata-driven compatibility options contract.

tokenIds
The token identifiers to process in sequence order.
options
The optional MetadataDrivenGgufTokenizerDetokenizationOptions controlling Decode; null selects the documented defaults, supplied limits are validated before allocation, and the instance is not mutated.

Returns: The decoded text produced from the validated token sequence in the original sequence order.

Method CountTokens(string)

Counts the tokens using the same deterministic rules as the corresponding processing operation.

text
The text to process using the configured encoding and normalization rules.

Returns: The int value computed by MetadataDrivenGgufTokenizer.CountTokens for this contract: Counts the tokens using the same deterministic rules as the corresponding processing operation. Range, finite-value, and overflow checks are completed before the value is returned.

Method CountTokens(System.Collections.Generic.IEnumerable<UAIX.LmRuntime.Contracts.LlmMessage>)

Counts the tokens using the same deterministic rules as the corresponding processing operation.

messages
The messages sequence used by this operation; its required length, ordering, and element bounds are validated before access.

Returns: The TokenCountResult result produced by MetadataDrivenGgufTokenizer.CountTokens for this contract: Counts the tokens using the same deterministic rules as the corresponding processing operation. It is published only after all documented validation and ownership transitions succeed.

SentencePieceBpeTokenizerEngineUAIX.LmRuntime.Tokenization 4 members

Implements the SentencePiece-BPE execution path used by LLaMA-style GGUF vocabularies.

Method SentencePieceBpeTokenizerEngine(UAIX.LmRuntime.Tokenization.GgufTokenizerMetadata)

Initializes the engine from validated GGUF tokenizer metadata.

metadata
The metadata containing validated format or tokenizer metadata required by this operation.
Property Name
Method EncodeRaw(string,UAIX.LmRuntime.Tokenization.TokenizerFragmentContext,System.Collections.Generic.IList<int>,System.Collections.Generic.IList<string>)

Encodes the raw with the configured tokenizer and validated special-token policy.

text
The text to process using the configured encoding and normalization rules.
context
The context that supplies session-scoped identity and boundary state; it is validated before dependent work begins.
destination
The destination buffer that receives the produced values.
trace
The trace sequence used by this operation; its required length, ordering, and element bounds are validated before access.
Method Decode(System.Collections.Generic.IReadOnlyList<int>,UAIX.LmRuntime.Tokenization.DetokenizationOptions)

Decodes the supplied token sequence with the configured tokenizer while preserving sequence order.

tokenIds
The token identifiers to process in sequence order.
options
The optional DetokenizationOptions controlling Decode; null selects the documented defaults, supplied limits are validated before allocation, and the instance is not mutated.

Returns: The decoded text produced from the validated token sequence in the original sequence order.

SpecialTokenFragmentKindUAIX.LmRuntime.Tokenization 2 members

Identifies the type of fragment emitted by special-token partitioning.

Field RawText

A raw text fragment that must be processed by the tokenizer engine.

Field Token

A pre-resolved token identifier fragment.

SpecialTokenFragmentUAIX.LmRuntime.Tokenization 7 members

Represents one fragment emitted by special-token partitioning.

Property Kind

Gets the fragment kind.

Property Text

Gets the raw text fragment.

Property TokenId

Gets the token identifier for token fragments.

Property Offset

Gets the character offset in the source text.

Property Length

Gets the fragment length in source text characters.

Method Raw(string,int)

Creates a raw-text fragment representing an unmodified source slice at the supplied offset.

text
The text processed by the configured encoding or normalization rules; it must satisfy the declared nullability contract.
offset
The zero-based offset into the relevant source or destination; range validation occurs before access.

Returns: The SpecialTokenFragment result produced by SpecialTokenFragment.Raw for this contract: Creates a raw-text fragment representing an unmodified source slice at the supplied offset. It is published only after all documented validation and ownership transitions succeed.

Method Token(int,string,int)

Creates a special-token fragment at the supplied source-text offset.

tokenId
The token identifier to process; it must fall within the validated vocabulary and operation-specific range.
text
The text processed by the configured encoding or normalization rules; it must satisfy the declared nullability contract.
offset
The zero-based offset into the relevant source or destination; range validation occurs before access.

Returns: The SpecialTokenFragment result produced by SpecialTokenFragment.Token for this contract: Creates a special-token fragment at the supplied source-text offset. It is published only after all documented validation and ownership transitions succeed.

SpecialTokenPartitionerUAIX.LmRuntime.Tokenization 1 member

Partitions raw text around tokenizer special tokens before normal tokenization.

Method Partition(string,UAIX.LmRuntime.Tokenization.GgufTokenizerMetadata,bool)

Partitions text around known special tokens using longest-token-first matching.

text
The text processed by the configured encoding or normalization rules; it must satisfy the declared nullability contract.
metadata
The metadata containing validated format or tokenizer metadata required by this operation.
parseSpecial
Whether control and unknown tokens should be parsed as special tokens.

Returns: An ordered read-only IReadOnlyList<SpecialTokenFragment> result from SpecialTokenPartitioner.Partition: Partitions text around known special tokens using longest-token-first matching. Mutable internal collection aliases are not exposed through the returned contract.

TokenizerFragmentContextUAIX.LmRuntime.Tokenization 2 members

Describes the position of one raw-text fragment within special-token partitioning.

Property IsFirstFragment

Gets a value indicating whether this is the first raw-text fragment in the input.

Property PreviousFragmentWasSpecial

Gets a value indicating whether the immediately preceding fragment was a special token.

IGgufTokenizerEngineUAIX.LmRuntime.Tokenization 3 members

Defines a family-specific tokenizer engine that operates after special-token partitioning.

Property Name

Gets the stable engine name.

Method EncodeRaw(string,UAIX.LmRuntime.Tokenization.TokenizerFragmentContext,System.Collections.Generic.IList<int>,System.Collections.Generic.IList<string>)

Encodes a raw-text fragment into model token identifiers.

text
The text processed by the configured encoding or normalization rules; it must satisfy the declared nullability contract.
context
The context that supplies session-scoped identity and boundary state; it is validated before dependent work begins.
destination
The caller-owned destination buffer that receives the result; required capacity is validated before any write occurs.
trace
The trace sequence used by this operation; its required length, ordering, and element bounds are validated before access.
Method Decode(System.Collections.Generic.IReadOnlyList<int>,UAIX.LmRuntime.Tokenization.DetokenizationOptions)

Decodes the supplied token sequence with the configured tokenizer while preserving sequence order.

tokenIds
The ordered token identifiers to process; sequence order is preserved and each identifier is validated where required.
options
The optional DetokenizationOptions controlling Decode; null selects the documented defaults, supplied limits are validated before allocation, and the instance is not mutated.

Returns: The decoded text produced from the validated token sequence in the original sequence order.

BpeMergeRuleUAIX.LmRuntime.Tokenization 3 members

Represents one parsed BPE merge rule.

Property Left

Gets the left symbol.

Property Right

Gets the right symbol.

Method TryParse(string,UAIX.LmRuntime.Tokenization.BpeMergeRule&)

Attempts to parse the Boolean result while reporting invalid input without a successful result.

text
The text processed by the configured encoding or normalization rules; it must satisfy the declared nullability contract.
rule
When the method returns, contains the rule produced by the operation when successful; otherwise contains the type's default value.

Returns: True when the rule contains two non-empty symbols.

GgufPreTokenizerRegistryUAIX.LmRuntime.Tokenization 2 members

Provides a conservative allow-list for tokenizer.ggml.pre values implemented by this build.

Method IsSupported(string)

Determines whether a pre-tokenizer identifier is supported.

name
The exact ordinal name used for catalog lookup, canonical hashing, or diagnostic labeling as defined by the containing member.

Returns: True when the identifier is absent or explicitly supported.

Method GetSupportedNames

Retrieves the supported names from the configured tokenizer after validating the requested access.

Returns: An ordered read-only IReadOnlyList<string> result from GgufPreTokenizerRegistry.GetSupportedNames: Retrieves the supported names from the configured tokenizer after validating the requested access. Mutable internal collection aliases are not exposed through the returned contract.

GgufTokenizerEngineFactoryUAIX.LmRuntime.Tokenization 1 member

Selects a concrete tokenizer engine from validated GGUF tokenizer metadata.

Method Create(UAIX.LmRuntime.Tokenization.GgufTokenizerMetadata)

Creates the GGUF tokenizer engine from the validated inputs required by GgufTokenizerEngineFactory.

metadata
The metadata containing validated format or tokenizer metadata required by this operation.

Returns: The concrete tokenizer engine, with ownership and disposal obligations defined by the returned type and the Create contract.

IGgufTokenizerFactoryUAIX.LmRuntime.Tokenization 1 member

Creates tokenizer instances from GGUF tokenizer metadata.

Method Create(UAIX.LmRuntime.Gguf.GgufModel)

Creates a tokenizer for a parsed GGUF model.

model
The parsed GGUF model whose validated metadata and tensor catalog are consumed by this operation.

Returns: The tokenizer selected from metadata, with ownership and disposal obligations defined by the returned type and the Create contract.

GgufTokenizerFactoryUAIX.LmRuntime.Tokenization 2 members

Creates strict, metadata-routed tokenizers for parsed GGUF artifacts.

Method Create(UAIX.LmRuntime.Gguf.GgufModel)

Creates the tokenizer from the validated inputs required by GgufTokenizerFactory.

model
The parsed GGUF model whose validated metadata and tensor catalog are consumed by this operation.

Returns: The ITokenizer result produced by GgufTokenizerFactory.Create for this contract: Creates the tokenizer from the validated inputs required by GgufTokenizerFactory. It is published only after all documented validation and ownership transitions succeed.

Method CreateStrict(UAIX.LmRuntime.Gguf.GgufModel)

Creates a tokenizer after strict GGUF tokenizer metadata validation.

model
The parsed GGUF model whose validated metadata and tensor catalog are consumed by this operation.

Returns: The selected tokenizer, with ownership and disposal obligations defined by the returned type and the CreateStrict contract.

SentencePieceGgufTokenizerUAIX.LmRuntime.Tokenization 10 members

Executes the SentencePiece-BPE tokenizer path used by LLaMA-style GGUF artifacts.

Method SentencePieceGgufTokenizer(UAIX.LmRuntime.Tokenization.GgufTokenizerMetadata)

Initializes the tokenizer from validated GGUF metadata.

metadata
The metadata containing validated format or tokenizer metadata required by this operation.
Property Name
Property Metadata
Method Tokenize(string)

Tokenizes the supplied text with the configured metadata and preserves deterministic token order.

text
The text to process using the configured encoding and normalization rules.

Returns: An ordered read-only collection of token text values produced by the configured tokenizer.

Method Encode(string,bool,bool)

Encodes the supplied text with the configured tokenizer and validated special-token policy.

text
The text to process using the configured encoding and normalization rules.
addBos
A value indicating whether add BOS applies to this operation.
addEos
A value indicating whether add EOS applies to this operation.

Returns: An ordered read-only collection of token identifiers produced by the configured tokenizer.

Method Encode(string,UAIX.LmRuntime.Tokenization.TokenizationOptions)

Encodes the supplied text with the configured tokenizer and validated special-token policy.

text
The text to process using the configured encoding and normalization rules.
options
The optional TokenizationOptions controlling Encode; null selects the documented defaults, supplied limits are validated before allocation, and the instance is not mutated.

Returns: The TokenizationResult result produced by SentencePieceGgufTokenizer.Encode for this contract: Encodes the supplied text with the configured tokenizer and validated special-token policy. It is published only after all documented validation and ownership transitions succeed.

Method Decode(System.Collections.Generic.IEnumerable<int>)

Decodes the supplied token sequence with the configured tokenizer while preserving sequence order.

tokenIds
The token identifiers to process in sequence order.

Returns: The decoded text produced from the validated token sequence in the original sequence order.

Method Decode(System.Collections.Generic.IReadOnlyList<int>,UAIX.LmRuntime.Tokenization.DetokenizationOptions)

Decodes the supplied token sequence with the configured tokenizer while preserving sequence order.

tokenIds
The token identifiers to process in sequence order.
options
The optional DetokenizationOptions controlling Decode; null selects the documented defaults, supplied limits are validated before allocation, and the instance is not mutated.

Returns: The decoded text produced from the validated token sequence in the original sequence order.

Method CountTokens(string)

Counts the tokens using the same deterministic rules as the corresponding processing operation.

text
The text to process using the configured encoding and normalization rules.

Returns: The int value computed by SentencePieceGgufTokenizer.CountTokens for this contract: Counts the tokens using the same deterministic rules as the corresponding processing operation. Range, finite-value, and overflow checks are completed before the value is returned.

Method CountTokens(System.Collections.Generic.IEnumerable<UAIX.LmRuntime.Contracts.LlmMessage>)

Counts the tokens using the same deterministic rules as the corresponding processing operation.

messages
The messages sequence used by this operation; its required length, ordering, and element bounds are validated before access.

Returns: The TokenCountResult result produced by SentencePieceGgufTokenizer.CountTokens for this contract: Counts the tokens using the same deterministic rules as the corresponding processing operation. It is published only after all documented validation and ownership transitions succeed.

Gpt2BpeTokenizerUAIX.LmRuntime.Tokenization 10 members

Executes the GPT-2 byte-level BPE tokenizer path from GGUF vocabulary and merge metadata.

Method Gpt2BpeTokenizer(UAIX.LmRuntime.Tokenization.GgufTokenizerMetadata)

Initializes the tokenizer from validated GGUF metadata.

metadata
The metadata containing validated format or tokenizer metadata required by this operation.
Property Name
Property Metadata
Method Tokenize(string)

Tokenizes the supplied text with the configured metadata and preserves deterministic token order.

text
The text to process using the configured encoding and normalization rules.

Returns: An ordered read-only collection of token text values produced by the configured tokenizer.

Method Encode(string,bool,bool)

Encodes the supplied text with the configured tokenizer and validated special-token policy.

text
The text to process using the configured encoding and normalization rules.
addBos
A value indicating whether add BOS applies to this operation.
addEos
A value indicating whether add EOS applies to this operation.

Returns: An ordered read-only collection of token identifiers produced by the configured tokenizer.

Method Encode(string,UAIX.LmRuntime.Tokenization.TokenizationOptions)

Encodes the supplied text with the configured tokenizer and validated special-token policy.

text
The text to process using the configured encoding and normalization rules.
options
The optional TokenizationOptions controlling Encode; null selects the documented defaults, supplied limits are validated before allocation, and the instance is not mutated.

Returns: The TokenizationResult result produced by Gpt2BpeTokenizer.Encode for this contract: Encodes the supplied text with the configured tokenizer and validated special-token policy. It is published only after all documented validation and ownership transitions succeed.

Method Decode(System.Collections.Generic.IEnumerable<int>)

Decodes the supplied token sequence with the configured tokenizer while preserving sequence order.

tokenIds
The token identifiers to process in sequence order.

Returns: The decoded text produced from the validated token sequence in the original sequence order.

Method Decode(System.Collections.Generic.IReadOnlyList<int>,UAIX.LmRuntime.Tokenization.DetokenizationOptions)

Decodes the supplied token sequence with the configured tokenizer while preserving sequence order.

tokenIds
The token identifiers to process in sequence order.
options
The optional DetokenizationOptions controlling Decode; null selects the documented defaults, supplied limits are validated before allocation, and the instance is not mutated.

Returns: The decoded text produced from the validated token sequence in the original sequence order.

Method CountTokens(string)

Counts the tokens using the same deterministic rules as the corresponding processing operation.

text
The text to process using the configured encoding and normalization rules.

Returns: The int value computed by Gpt2BpeTokenizer.CountTokens for this contract: Counts the tokens using the same deterministic rules as the corresponding processing operation. Range, finite-value, and overflow checks are completed before the value is returned.

Method CountTokens(System.Collections.Generic.IEnumerable<UAIX.LmRuntime.Contracts.LlmMessage>)

Counts the tokens using the same deterministic rules as the corresponding processing operation.

messages
The messages sequence used by this operation; its required length, ordering, and element bounds are validated before access.

Returns: The TokenCountResult result produced by Gpt2BpeTokenizer.CountTokens for this contract: Counts the tokens using the same deterministic rules as the corresponding processing operation. It is published only after all documented validation and ownership transitions succeed.

RwkvWorldTokenizerUAIX.LmRuntime.Tokenization 6 members

Marks the RWKV tokenizer family as an explicit unsupported boundary until a dedicated engine is implemented.

Property Name
Method Tokenize(string)

Tokenizes the supplied text with the configured metadata and preserves deterministic token order.

text
The text to process using the configured encoding and normalization rules.

Returns: An ordered read-only collection of token text values produced by the configured tokenizer.

Method Encode(string,bool,bool)

Encodes the supplied text with the configured tokenizer and validated special-token policy.

text
The text to process using the configured encoding and normalization rules.
addBos
A value indicating whether add BOS applies to this operation.
addEos
A value indicating whether add EOS applies to this operation.

Returns: An ordered read-only collection of token identifiers produced by the configured tokenizer.

Method Decode(System.Collections.Generic.IEnumerable<int>)

Decodes the supplied token sequence with the configured tokenizer while preserving sequence order.

tokenIds
The token identifiers to process in sequence order.

Returns: The decoded text produced from the validated token sequence in the original sequence order.

Method CountTokens(string)

Counts the tokens using the same deterministic rules as the corresponding processing operation.

text
The text to process using the configured encoding and normalization rules.

Returns: The int value computed by RwkvWorldTokenizer.CountTokens for this contract: Counts the tokens using the same deterministic rules as the corresponding processing operation. Range, finite-value, and overflow checks are completed before the value is returned.

Method CountTokens(System.Collections.Generic.IEnumerable<UAIX.LmRuntime.Contracts.LlmMessage>)

Counts the tokens using the same deterministic rules as the corresponding processing operation.

messages
The messages sequence used by this operation; its required length, ordering, and element bounds are validated before access.

Returns: The TokenCountResult result produced by RwkvWorldTokenizer.CountTokens for this contract: Counts the tokens using the same deterministic rules as the corresponding processing operation. It is published only after all documented validation and ownership transitions succeed.

HuggingFaceTokenizerJsonAdapterUAIX.LmRuntime.Tokenization 1 member

Provides an optional seam for embedded Hugging Face tokenizer JSON metadata.

Method Create(string)

Creates a tokenizer from embedded tokenizer JSON metadata when supported.

json
The json text consumed by HuggingFaceTokenizerJsonAdapter.Create; null, emptiness, length, encoding, identifier, or path rules are enforced as documented, and the value is not persisted by this operation.

Returns: A tokenizer instance, with ownership and disposal obligations defined by the returned type and the Create contract.

ChatTemplateConformanceSuiteUAIX.LmRuntime.Tokenization 1 member

Runs chat-template conformance checks against rendered message sequences.

Method RenderAndValidate(string,System.Collections.Generic.IReadOnlyList<UAIX.LmRuntime.Contracts.LlmMessage>)

Renders and validates a chat template against a message sequence.

template
The template text. The current safe subset ignores arbitrary code.
messages
The messages sequence used by this operation; its required length, ordering, and element bounds are validated before access.

Returns: The TokenizerParityReport result produced by ChatTemplateConformanceSuite.RenderAndValidate for this contract: Renders and validates a chat template against a message sequence. It is published only after all documented validation and ownership transitions succeed.

SpecialTokenMapUAIX.LmRuntime.Tokenization 4 members

Represents model special-token identities.

Property BeginningOfSequence

Gets the beginning-of-sequence token identifier.

Property EndOfSequence

Gets the end-of-sequence token identifier.

Property Padding

Gets the padding token identifier.

Property Unknown

Gets the unknown token identifier.

TokenBudgetTruncatorUAIX.LmRuntime.Tokenization 1 member

Truncates message sequences by token budget.

Method TruncateMessages(System.Collections.Generic.IEnumerable<UAIX.LmRuntime.Contracts.LlmMessage>,UAIX.LmRuntime.Abstractions.ITokenizer,int)

Truncates messages so the total token count does not exceed the budget.

messages
The messages sequence used by this operation; its required length, ordering, and element bounds are validated before access.
tokenizer
The validated ITokenizer dependency consumed by TruncateMessages; ownership and lifetime remain with the caller unless this member explicitly documents a transfer.
maxTokens
The numeric max tokens consumed by TruncateMessages; it must satisfy the member's documented range, geometry, and finite-value requirements.

Returns: An ordered read-only IReadOnlyList<LlmMessage> result from TokenBudgetTruncator.TruncateMessages: Truncates messages so the total token count does not exceed the budget. Mutable internal collection aliases are not exposed through the returned contract.

TokenizerGoldenCorpusUAIX.LmRuntime.Tokenization 1 member

Loads tokenizer golden corpora.

Method Load(string)

Loads ordered tokenizer golden record collection from a verified local source into TokenizerGoldenCorpus.

json
The json text consumed by TokenizerGoldenCorpus.Load; null, emptiness, length, encoding, identifier, or path rules are enforced as documented, and the value is not persisted by this operation.

Returns: An ordered read-only IReadOnlyList<TokenizerGoldenRecord> result from TokenizerGoldenCorpus.Load: Loads ordered tokenizer golden record collection from a verified local source into TokenizerGoldenCorpus. Mutable internal collection aliases are not exposed through the returned contract.

TokenizerGoldenRecordUAIX.LmRuntime.Tokenization 2 members

Represents one tokenizer golden record.

Property Text

Gets the source text.

Property ExpectedTokenIds

Gets expected token identifiers.

TokenizerParityReportUAIX.LmRuntime.Tokenization 1 member

Represents tokenizer parity diagnostics.

Property Mismatches

Gets tokenizer mismatches.

InvalidUtf16PolicyUAIX.LmRuntime.Tokenization 2 members

Defines how tokenizer entry points handle invalid UTF-16 surrogate sequences.

Field Reject

Rejects invalid UTF-16 before tokenizer-specific normalization or segmentation.

Field Replace

Replaces each invalid surrogate code unit with the Unicode replacement character.

TokenizerTextSafetyUAIX.LmRuntime.Tokenization 1 member

Validates and normalizes managed strings before tokenizer-specific processing.

Method NormalizeUtf16(string,UAIX.LmRuntime.Tokenization.InvalidUtf16Policy)

Validates a managed string and optionally replaces unpaired surrogate code units.

text
The text processed by the configured encoding or normalization rules; it must satisfy the declared nullability contract.
policy
The policy that define validation limits and execution behavior; required values are checked before use.

Returns: The original string when valid, or a normalized replacement string when requested.

StreamingUtf8TokenDecoderUAIX.LmRuntime.Tokenization 2 members

Incrementally decodes byte-token payloads without corrupting UTF-8 sequences split across token boundaries.

Method Decode(System.ReadOnlySpan<byte>,bool)

Decodes one byte fragment and retains incomplete UTF-8 state for the next call.

bytes
The bytes sequence used by this operation; its required length, ordering, and element bounds are validated before access.
flush
The flush flag selecting the documented branch of Decode; it does not grant authority beyond this in-memory operation.

Returns: The decoded text produced from the validated token sequence in the original sequence order.

Method Reset

Resets the requested state to its validated initial state without publishing partial state.

TokenizerVocabularyReconciliationResultUAIX.LmRuntime.Tokenization 3 members

Describes a consistency check between GGUF vocabulary order and embedded Hugging Face tokenizer JSON.

Property IsConsistent

Gets whether the embedded tokenizer JSON is absent or consistent with GGUF token identifiers.

Property EmbeddedJsonPresent

Gets whether embedded tokenizer JSON was present.

Property Diagnostics

Gets bounded deterministic diagnostics.

TokenizerVocabularyReconcilerUAIX.LmRuntime.Tokenization 1 member

Reconciles embedded Hugging Face vocabulary identifiers against authoritative GGUF token-array order.

Method Reconcile(UAIX.LmRuntime.Tokenization.GgufTokenizerMetadata)

Validates embedded tokenizer JSON without allowing it to reorder GGUF token identifiers.

metadata
The metadata containing validated format or tokenizer metadata required by this operation.

Returns: The TokenizerVocabularyReconciliationResult result produced by TokenizerVocabularyReconciler.Reconcile for this contract: Validates embedded tokenizer JSON without allowing it to reorder GGUF token identifiers. It is published only after all documented validation and ownership transitions succeed.

WhitespaceTokenizerUAIX.LmRuntime.Tokenization 6 members

Provides a deterministic tokenizer for tests, examples, and fallback token budgeting.

Property Name
Method Tokenize(string)

Tokenizes the supplied text with the configured metadata and preserves deterministic token order.

text
The text to process using the configured encoding and normalization rules.

Returns: An ordered read-only collection of token text values produced by the configured tokenizer.

Method Encode(string,bool,bool)

Encodes the supplied text with the configured tokenizer and validated special-token policy.

text
The text to process using the configured encoding and normalization rules.
addBos
A value indicating whether add BOS applies to this operation.
addEos
A value indicating whether add EOS applies to this operation.

Returns: An ordered read-only collection of token identifiers produced by the configured tokenizer.

Method Decode(System.Collections.Generic.IEnumerable<int>)

Decodes the supplied token sequence with the configured tokenizer while preserving sequence order.

tokenIds
The token identifiers to process in sequence order.

Returns: The decoded text produced from the validated token sequence in the original sequence order.

Method CountTokens(string)

Counts the tokens using the same deterministic rules as the corresponding processing operation.

text
The text to process using the configured encoding and normalization rules.

Returns: The int value computed by WhitespaceTokenizer.CountTokens for this contract: Counts the tokens using the same deterministic rules as the corresponding processing operation. Range, finite-value, and overflow checks are completed before the value is returned.

Method CountTokens(System.Collections.Generic.IEnumerable<UAIX.LmRuntime.Contracts.LlmMessage>)

Counts the tokens using the same deterministic rules as the corresponding processing operation.

messages
The messages sequence used by this operation; its required length, ordering, and element bounds are validated before access.

Returns: The TokenCountResult result produced by WhitespaceTokenizer.CountTokens for this contract: Counts the tokens using the same deterministic rules as the corresponding processing operation. It is published only after all documented validation and ownership transitions succeed.

Frequently asked questions

Should I use Create or CreateStrict?

Use the strict path when unsupported or inconsistent metadata must fail closed. Use the non-strict path only when its fallback behavior is understood and covered by your own compatibility tests.

Why can encoding and detokenization settings change output?

Special-token insertion, parsing, removal, whitespace cleanup, and invalid-text policy are part of the tokenizer contract. Set them explicitly at application boundaries.

Does ChatTemplateRenderer execute arbitrary templates from a model?

No. It is a small deterministic role/content renderer. Use the conformance surface to evaluate supported template behavior, and do not imply general Jinja compatibility.

Can I reuse one tokenizer for another model with a similar name?

Do not assume so. Validate vocabulary, merges, special token IDs, pre-tokenizer behavior, and fingerprints against the actual artifact.