Tokenizer follows the artifact
Read and validate tokenizer metadata from the same GGUF model that supplies the weights. Vocabulary, merges, special IDs, pre-tokenizer, and chat-template behavior are model-coupled.
UAIX.LmRuntime / Package guide
GGUF tokenizer metadata, tokenizer engines, chat rendering, special-token control, and parity tools.
Required For tokenizer and chat-template work
UAIX.LmRuntime.Tokenization
GGUF tokenizer metadata, tokenizer factories and engines, special-token handling, chat templates, truncation, safety, and parity tools.
Tokenizer implementations and chat template rendering for pure C# local LLM runtime packages.
dotnet add package UAIX.LmRuntime.Tokenization
<PackageReference Include="UAIX.LmRuntime.Tokenization" />
Version policy: The documentation deliberately omits UAIX.LmRuntime package version numbers. Resolve and pin versions through your normal dependency-management and lock-file process.
Read and validate tokenizer metadata from the same GGUF model that supplies the weights. Vocabulary, merges, special IDs, pre-tokenizer, and chat-template behavior are model-coupled.
Adding BOS/EOS tokens, parsing special tokens, removing or unparsing special tokens, whitespace cleanup, traces, and invalid UTF-16 policy are caller-visible choices.
Golden records, fingerprints, reconciliation, and parity reports let maintainers compare behavior without treating a single successful prompt as broad compatibility proof.
These are the main public entry points. The generated reference below includes the documented public package surface.
GgufTokenizerFactory GgufTokenizerMetadataReader MetadataDrivenGgufTokenizer IGgufTokenizer TokenizationOptions DetokenizationOptions ChatTemplateRenderer ChatTemplateConformanceSuite TokenBudgetTruncator StreamingUtf8TokenDecoder TokenizerGoldenCorpus Examples use the documented public package surface. Paths, identities, runtime identifiers, device evidence, and application policy remain host inputs.
Use the common ITokenizer contract for straightforward encoding and decoding.
using UAIX.LmRuntime.Abstractions;
using UAIX.LmRuntime.Gguf;
using UAIX.LmRuntime.Tokenization;
GgufModel model = GgufReader.Read(
"models/model.gguf",
new GgufParseOptions());
ITokenizer tokenizer =
new GgufTokenizerFactory().CreateStrict(model);
IReadOnlyList<int> tokenIds = tokenizer.Encode(
"Hello, local runtime.",
addBos: true,
addEos: false);
string roundTrip = tokenizer.Decode(tokenIds);
Console.WriteLine(string.Join(", ", tokenIds));
Console.WriteLine(roundTrip);
Use the GGUF-specific interface when the caller needs explicit special-token and trace controls.
using UAIX.LmRuntime.Gguf;
using UAIX.LmRuntime.Tokenization;
GgufModel model = GgufModel.Load(
"models/model.gguf",
new GgufParseOptions());
GgufTokenizerMetadata metadata =
GgufTokenizerMetadataReader.ReadStrict(model);
var tokenizer = new MetadataDrivenGgufTokenizer(metadata);
TokenizationResult result = tokenizer.Encode(
"Explain token boundaries.",
new TokenizationOptions
{
AddSpecialTokens = true,
ParseSpecialTokens = false,
EmitTrace = true,
InvalidUtf16Policy = InvalidUtf16Policy.Reject
});
foreach (string traceLine in result.Trace)
{
Console.WriteLine(traceLine);
}
string decoded = tokenizer.Decode(
result.TokenIds,
new DetokenizationOptions
{
RemoveSpecialTokens = true,
UnparseSpecialTokens = false,
CleanSpaces = false
});
Use the deterministic role/content renderer when a general Jinja interpreter is neither required nor desired.
using UAIX.LmRuntime.Contracts;
using UAIX.LmRuntime.Tokenization;
var messages = new[]
{
LlmMessage.System("Answer in one paragraph."),
LlmMessage.User("What is a tensor?")
};
string prompt = new ChatTemplateRenderer().Render(messages);
Console.WriteLine(prompt);
Delegate counting to the active model tokenizer rather than using character length as a proxy.
using UAIX.LmRuntime.Abstractions;
using UAIX.LmRuntime.Contracts;
using UAIX.LmRuntime.Tokenization;
/// <summary>
/// Truncates an ordered transcript to the supplied tokenizer budget.
/// </summary>
/// <param name="messages">The messages to fit into the budget.</param>
/// <param name="tokenizer">The tokenizer for the target model.</param>
/// <param name="maximumTokens">The maximum accepted token count.</param>
/// <returns>The retained ordered message sequence.</returns>
static IReadOnlyList<LlmMessage> FitTranscript(
IReadOnlyList<LlmMessage> messages,
ITokenizer tokenizer,
int maximumTokens)
{
return new TokenBudgetTruncator().TruncateMessages(
messages,
tokenizer,
maximumTokens);
}
Keep incomplete UTF-8 sequences buffered across token boundaries and flush at stream completion.
using UAIX.LmRuntime.Tokenization;
var decoder = new StreamingUtf8TokenDecoder();
string first = decoder.Decode([0xE2, 0x82], flush: false);
string second = decoder.Decode([0xAC], flush: false);
string final = decoder.Decode(ReadOnlySpan<byte>.Empty, flush: true);
Console.Write(first);
Console.Write(second);
Console.Write(final);
Expand a type to review its documented public fields, properties, constructors, methods, parameter descriptions, and return descriptions.
ChatTemplateRendererUAIX.LmRuntime.Tokenization
1 member
Renders a minimal safe chat template suitable for deterministic tests and initial GGUF tokenizer work.
Render(System.Collections.Generic.IEnumerable<UAIX.LmRuntime.Contracts.LlmMessage>)
Renders messages using a small role/content template rather than a general Jinja interpreter.
messagesReturns: The text produced by ChatTemplateRenderer.Render for this contract: Renders messages using a small role/content template rather than a general Jinja interpreter. The returned string is detached from mutable caller storage and is not persisted by the operation.
GgufTokenizerFingerprintUAIX.LmRuntime.Tokenization
1 member
Computes a deterministic SHA-256 identity for model-facing GGUF tokenizer metadata.
Create(UAIX.LmRuntime.Tokenization.GgufTokenizerMetadata)
Computes a canonical tokenizer fingerprint without treating decoded text as token-ID authority.
metadataReturns: The text produced by GgufTokenizerFingerprint.Create for this contract: Computes a canonical tokenizer fingerprint without treating decoded text as token-ID authority. The returned string is detached from mutable caller storage and is not persisted by the operation.
GgufTokenTypeUAIX.LmRuntime.Tokenization
6 members
Identifies the tokenizer token type stored in GGUF metadata.
Normal
Normal token text.
Unknown
Unknown token.
Control
Control token.
UserDefined
User-defined token.
Unused
Unused token slot.
Byte
Byte-fallback token.
GgufTokenUAIX.LmRuntime.Tokenization
4 members
Represents one GGUF vocabulary token.
TokenId
Gets the token identifier used by model embedding rows.
Text
Gets the raw token text from the GGUF vocabulary.
Score
Gets the tokenizer score associated with the token.
Type
Gets the token type associated with the token.
GgufSpecialTokenMapUAIX.LmRuntime.Tokenization
6 members
Represents special token identifiers resolved from GGUF metadata.
BosTokenId
Gets the beginning-of-sequence token identifier.
EosTokenId
Gets the end-of-sequence token identifier.
UnknownTokenId
Gets the unknown token identifier.
SeparatorTokenId
Gets the separator token identifier.
PaddingTokenId
Gets the padding token identifier.
EnumerateKnownTokenIds
Enumerates the known token identifiers in stable source order without exposing mutable internal storage.
Returns: An ordered sequence containing the known token identifiers as produced by the validated operation.
GgufTokenizerMetadataUAIX.LmRuntime.Tokenization
21 members
Captures tokenizer metadata loaded from a GGUF model.
TokenizerModel
Gets the tokenizer model name from GGUF metadata.
PreTokenizer
Gets the tokenizer pre-tokenizer name, when present.
Tokens
Gets the vocabulary tokens indexed by token identifier.
Merges
Gets the BPE merge rules from GGUF metadata.
AddedTokens
Gets the added tokens from GGUF metadata.
SourceScoreCount
Gets the source score-array length, or zero when the metadata key was absent.
SourceTokenTypeCount
Gets the source token-type-array length, or zero when the metadata key was absent.
ScoresPresent
Gets a value indicating whether tokenizer.ggml.scores was present.
TokenTypesPresent
Gets a value indicating whether tokenizer.ggml.token_type was present.
PrecompiledCharsMap
Gets the optional precompiled SentencePiece normalization character map.
SpecialTokens
Gets the special token identifiers.
AddBos
Gets whether model-defined BOS insertion is enabled.
AddEos
Gets whether model-defined EOS insertion is enabled.
AddSeparator
Gets whether model-defined separator insertion is enabled.
AddSpacePrefix
Gets whether a leading space prefix is added before text fragments.
EscapeWhitespaces
Gets whether whitespace characters are escaped using SentencePiece whitespace notation.
RemoveExtraWhitespaces
Gets whether tokenizer-specific extra whitespace removal is enabled.
CleanSpaces
Gets whether detokenization should clean spaces around punctuation.
ChatTemplate
Gets the chat template from GGUF metadata, when present.
HuggingFaceTokenizerJson
Gets the embedded Hugging Face tokenizer JSON, when present.
VocabularySize
Gets the effective vocabulary size from the token array.
TokenizationOptionsUAIX.LmRuntime.Tokenization
6 members
Describes tokenization behavior for one encode operation.
AddSpecialTokens
Gets whether model-defined special tokens should be added.
ParseSpecialTokens
Gets whether raw special-token text should be parsed as special tokens.
OverrideAddBos
Gets an optional override for BOS insertion.
OverrideAddEos
Gets an optional override for EOS insertion.
EmitTrace
Gets whether content-minimized trace data should be emitted for parity diagnostics.
InvalidUtf16Policy
Gets the policy for invalid UTF-16 surrogate sequences.
DetokenizationOptionsUAIX.LmRuntime.Tokenization
3 members
Describes detokenization behavior for one decode operation.
RemoveSpecialTokens
Gets whether special tokens should be removed from decoded text.
UnparseSpecialTokens
Gets whether special tokens should be emitted as their raw token text.
CleanSpaces
Gets whether tokenizer-specific space cleanup should be applied.
MetadataDrivenGgufTokenizerDetokenizationOptionsUAIX.LmRuntime.Tokenization
4 members
Provides a stable LocalEndpoint-facing name for metadata-driven GGUF detokenization controls.
This compatibility type mirrors without inheritance because the canonical options type is sealed. It allows integration code to use a descriptive contract while the tokenizer retains one canonical internal representation.
RemoveSpecialTokens
Gets whether special tokens should be removed from decoded text.
UnparseSpecialTokens
Gets whether special tokens should be emitted as their raw token text.
CleanSpaces
Gets whether tokenizer-specific space cleanup should be applied.
ToDetokenizationOptions
Creates the canonical detokenization options consumed by the tokenizer engine.
Returns: A new canonical options instance with the same behavior flags.
TokenizationResultUAIX.LmRuntime.Tokenization
2 members
Represents the output of a tokenizer encode operation.
TokenIds
Gets the emitted token identifiers.
Trace
Gets optional content-minimized events used for tokenizer parity diagnostics.
IGgufTokenizerUAIX.LmRuntime.Tokenization
3 members
Encodes and decodes text for a GGUF-backed model.
Metadata
Gets the tokenizer metadata used by this tokenizer.
Encode(string,UAIX.LmRuntime.Tokenization.TokenizationOptions)
Encodes the supplied text with the configured tokenizer and validated special-token policy.
textoptionsReturns: The TokenizationResult result produced by IGgufTokenizer.Encode for this contract: Encodes the supplied text with the configured tokenizer and validated special-token policy. It is published only after all documented validation and ownership transitions succeed.
Decode(System.Collections.Generic.IReadOnlyList<int>,UAIX.LmRuntime.Tokenization.DetokenizationOptions)
Decodes the supplied token sequence with the configured tokenizer while preserving sequence order.
tokenIdsoptionsReturns: The decoded text produced from the validated token sequence in the original sequence order.
GgufTokenizerMetadataValidationResultUAIX.LmRuntime.Tokenization
2 members
Represents tokenizer metadata validation output.
Diagnostics
Gets validation diagnostics.
IsValid
Gets a value indicating whether no diagnostics were emitted.
GgufTokenizerMetadataReaderUAIX.LmRuntime.Tokenization
2 members
Builds tokenizer metadata from a parsed GGUF artifact.
Read(UAIX.LmRuntime.Gguf.GgufModel)
Reads tokenizer metadata without throwing for validation failures.
modelReturns: The GgufTokenizerMetadata result produced by GgufTokenizerMetadataReader.Read for this contract: Reads tokenizer metadata without throwing for validation failures. It is published only after all documented validation and ownership transitions succeed.
ReadStrict(UAIX.LmRuntime.Gguf.GgufModel)
Reads tokenizer metadata and throws when validation fails.
modelReturns: The GgufTokenizerMetadata result produced by GgufTokenizerMetadataReader.ReadStrict for this contract: Reads tokenizer metadata and throws when validation fails. It is published only after all documented validation and ownership transitions succeed.
GgufTokenizerMetadataValidatorUAIX.LmRuntime.Tokenization
1 member
Validates GGUF tokenizer metadata before runtime use.
Validate(UAIX.LmRuntime.Tokenization.GgufTokenizerMetadata)
Validates the supplied metadata against the invariants required by GgufTokenizerMetadataValidator.
metadataReturns: The GgufTokenizerMetadataValidationResult result produced by GgufTokenizerMetadataValidator.Validate for this contract: Validates the supplied metadata against the invariants required by GgufTokenizerMetadataValidator. It is published only after all documented validation and ownership transitions succeed.
InvalidGgufTokenizerExceptionUAIX.LmRuntime.Tokenization
1 member
Thrown when GGUF tokenizer metadata is invalid.
InvalidGgufTokenizerException(string)
Initializes a new InvalidGgufTokenizerException instance with validated dependencies and operational bounds.
messageUnsupportedTokenizerExceptionUAIX.LmRuntime.Tokenization
1 member
Thrown when a GGUF tokenizer family is not supported.
UnsupportedTokenizerException(string)
Initializes a new UnsupportedTokenizerException instance with validated dependencies and operational bounds.
messageGpt2BpeTokenizerEngineUAIX.LmRuntime.Tokenization
4 members
Implements GPT-2 byte-level BPE from GGUF vocabulary and merge metadata.
Gpt2BpeTokenizerEngine(UAIX.LmRuntime.Tokenization.GgufTokenizerMetadata)
Initializes a GPT-2 BPE engine from validated GGUF tokenizer metadata.
metadataName
EncodeRaw(string,UAIX.LmRuntime.Tokenization.TokenizerFragmentContext,System.Collections.Generic.IList<int>,System.Collections.Generic.IList<string>)
Encodes the raw with the configured tokenizer and validated special-token policy.
textcontextdestinationtraceDecode(System.Collections.Generic.IReadOnlyList<int>,UAIX.LmRuntime.Tokenization.DetokenizationOptions)
Decodes the supplied token sequence with the configured tokenizer while preserving sequence order.
tokenIdsoptionsReturns: The decoded text produced from the validated token sequence in the original sequence order.
MetadataDrivenGgufTokenizerUAIX.LmRuntime.Tokenization
12 members
Executes a GGUF tokenizer by combining special-token partitioning with a family-specific tokenizer engine.
Real GGUF execution never falls back to whitespace tokenization. Unsupported tokenizer families fail during construction so token identifiers cannot silently diverge from the model embedding table.
MetadataDrivenGgufTokenizer(UAIX.LmRuntime.Tokenization.GgufTokenizerMetadata)
Initializes a tokenizer from validated GGUF metadata.
metadataMetadataDrivenGgufTokenizer(UAIX.LmRuntime.Tokenization.GgufTokenizerMetadata,UAIX.LmRuntime.Tokenization.IGgufTokenizerEngine)
Initializes a tokenizer with an explicitly selected family engine.
metadataengineName
Metadata
Tokenize(string)
Tokenizes the supplied text with the configured metadata and preserves deterministic token order.
textReturns: An ordered read-only collection of token text values produced by the configured tokenizer.
Encode(string,bool,bool)
Encodes the supplied text with the configured tokenizer and validated special-token policy.
textaddBosaddEosReturns: An ordered read-only collection of token identifiers produced by the configured tokenizer.
Encode(string,UAIX.LmRuntime.Tokenization.TokenizationOptions)
Encodes the supplied text with the configured tokenizer and validated special-token policy.
textoptionsReturns: The TokenizationResult result produced by MetadataDrivenGgufTokenizer.Encode for this contract: Encodes the supplied text with the configured tokenizer and validated special-token policy. It is published only after all documented validation and ownership transitions succeed.
Decode(System.Collections.Generic.IEnumerable<int>)
Decodes the supplied token sequence with the configured tokenizer while preserving sequence order.
tokenIdsReturns: The decoded text produced from the validated token sequence in the original sequence order.
Decode(System.Collections.Generic.IReadOnlyList<int>,UAIX.LmRuntime.Tokenization.DetokenizationOptions)
Decodes the supplied token sequence with the configured tokenizer while preserving sequence order.
tokenIdsoptionsReturns: The decoded text produced from the validated token sequence in the original sequence order.
Decode(System.Collections.Generic.IReadOnlyList<int>,UAIX.LmRuntime.Tokenization.MetadataDrivenGgufTokenizerDetokenizationOptions)
Decodes model token identifiers using the stable metadata-driven compatibility options contract.
tokenIdsoptionsReturns: The decoded text produced from the validated token sequence in the original sequence order.
CountTokens(string)
Counts the tokens using the same deterministic rules as the corresponding processing operation.
textReturns: The int value computed by MetadataDrivenGgufTokenizer.CountTokens for this contract: Counts the tokens using the same deterministic rules as the corresponding processing operation. Range, finite-value, and overflow checks are completed before the value is returned.
CountTokens(System.Collections.Generic.IEnumerable<UAIX.LmRuntime.Contracts.LlmMessage>)
Counts the tokens using the same deterministic rules as the corresponding processing operation.
messagesReturns: The TokenCountResult result produced by MetadataDrivenGgufTokenizer.CountTokens for this contract: Counts the tokens using the same deterministic rules as the corresponding processing operation. It is published only after all documented validation and ownership transitions succeed.
SentencePieceBpeTokenizerEngineUAIX.LmRuntime.Tokenization
4 members
Implements the SentencePiece-BPE execution path used by LLaMA-style GGUF vocabularies.
SentencePieceBpeTokenizerEngine(UAIX.LmRuntime.Tokenization.GgufTokenizerMetadata)
Initializes the engine from validated GGUF tokenizer metadata.
metadataName
EncodeRaw(string,UAIX.LmRuntime.Tokenization.TokenizerFragmentContext,System.Collections.Generic.IList<int>,System.Collections.Generic.IList<string>)
Encodes the raw with the configured tokenizer and validated special-token policy.
textcontextdestinationtraceDecode(System.Collections.Generic.IReadOnlyList<int>,UAIX.LmRuntime.Tokenization.DetokenizationOptions)
Decodes the supplied token sequence with the configured tokenizer while preserving sequence order.
tokenIdsoptionsReturns: The decoded text produced from the validated token sequence in the original sequence order.
SpecialTokenFragmentKindUAIX.LmRuntime.Tokenization
2 members
Identifies the type of fragment emitted by special-token partitioning.
RawText
A raw text fragment that must be processed by the tokenizer engine.
Token
A pre-resolved token identifier fragment.
SpecialTokenFragmentUAIX.LmRuntime.Tokenization
7 members
Represents one fragment emitted by special-token partitioning.
Kind
Gets the fragment kind.
Text
Gets the raw text fragment.
TokenId
Gets the token identifier for token fragments.
Offset
Gets the character offset in the source text.
Length
Gets the fragment length in source text characters.
Raw(string,int)
Creates a raw-text fragment representing an unmodified source slice at the supplied offset.
textoffsetReturns: The SpecialTokenFragment result produced by SpecialTokenFragment.Raw for this contract: Creates a raw-text fragment representing an unmodified source slice at the supplied offset. It is published only after all documented validation and ownership transitions succeed.
Token(int,string,int)
Creates a special-token fragment at the supplied source-text offset.
tokenIdtextoffsetReturns: The SpecialTokenFragment result produced by SpecialTokenFragment.Token for this contract: Creates a special-token fragment at the supplied source-text offset. It is published only after all documented validation and ownership transitions succeed.
SpecialTokenPartitionerUAIX.LmRuntime.Tokenization
1 member
Partitions raw text around tokenizer special tokens before normal tokenization.
Partition(string,UAIX.LmRuntime.Tokenization.GgufTokenizerMetadata,bool)
Partitions text around known special tokens using longest-token-first matching.
textmetadataparseSpecialReturns: An ordered read-only IReadOnlyList<SpecialTokenFragment> result from SpecialTokenPartitioner.Partition: Partitions text around known special tokens using longest-token-first matching. Mutable internal collection aliases are not exposed through the returned contract.
TokenizerFragmentContextUAIX.LmRuntime.Tokenization
2 members
Describes the position of one raw-text fragment within special-token partitioning.
IsFirstFragment
Gets a value indicating whether this is the first raw-text fragment in the input.
PreviousFragmentWasSpecial
Gets a value indicating whether the immediately preceding fragment was a special token.
IGgufTokenizerEngineUAIX.LmRuntime.Tokenization
3 members
Defines a family-specific tokenizer engine that operates after special-token partitioning.
Name
Gets the stable engine name.
EncodeRaw(string,UAIX.LmRuntime.Tokenization.TokenizerFragmentContext,System.Collections.Generic.IList<int>,System.Collections.Generic.IList<string>)
Encodes a raw-text fragment into model token identifiers.
textcontextdestinationtraceDecode(System.Collections.Generic.IReadOnlyList<int>,UAIX.LmRuntime.Tokenization.DetokenizationOptions)
Decodes the supplied token sequence with the configured tokenizer while preserving sequence order.
tokenIdsoptionsReturns: The decoded text produced from the validated token sequence in the original sequence order.
BpeMergeRuleUAIX.LmRuntime.Tokenization
3 members
Represents one parsed BPE merge rule.
Left
Gets the left symbol.
Right
Gets the right symbol.
TryParse(string,UAIX.LmRuntime.Tokenization.BpeMergeRule&)
Attempts to parse the Boolean result while reporting invalid input without a successful result.
textruleReturns: True when the rule contains two non-empty symbols.
GgufPreTokenizerRegistryUAIX.LmRuntime.Tokenization
2 members
Provides a conservative allow-list for tokenizer.ggml.pre values implemented by this build.
IsSupported(string)
Determines whether a pre-tokenizer identifier is supported.
nameReturns: True when the identifier is absent or explicitly supported.
GetSupportedNames
Retrieves the supported names from the configured tokenizer after validating the requested access.
Returns: An ordered read-only IReadOnlyList<string> result from GgufPreTokenizerRegistry.GetSupportedNames: Retrieves the supported names from the configured tokenizer after validating the requested access. Mutable internal collection aliases are not exposed through the returned contract.
GgufTokenizerEngineFactoryUAIX.LmRuntime.Tokenization
1 member
Selects a concrete tokenizer engine from validated GGUF tokenizer metadata.
Create(UAIX.LmRuntime.Tokenization.GgufTokenizerMetadata)
Creates the GGUF tokenizer engine from the validated inputs required by GgufTokenizerEngineFactory.
metadataReturns: The concrete tokenizer engine, with ownership and disposal obligations defined by the returned type and the Create contract.
IGgufTokenizerFactoryUAIX.LmRuntime.Tokenization
1 member
Creates tokenizer instances from GGUF tokenizer metadata.
Create(UAIX.LmRuntime.Gguf.GgufModel)
Creates a tokenizer for a parsed GGUF model.
modelReturns: The tokenizer selected from metadata, with ownership and disposal obligations defined by the returned type and the Create contract.
GgufTokenizerFactoryUAIX.LmRuntime.Tokenization
2 members
Creates strict, metadata-routed tokenizers for parsed GGUF artifacts.
Create(UAIX.LmRuntime.Gguf.GgufModel)
Creates the tokenizer from the validated inputs required by GgufTokenizerFactory.
modelReturns: The ITokenizer result produced by GgufTokenizerFactory.Create for this contract: Creates the tokenizer from the validated inputs required by GgufTokenizerFactory. It is published only after all documented validation and ownership transitions succeed.
CreateStrict(UAIX.LmRuntime.Gguf.GgufModel)
Creates a tokenizer after strict GGUF tokenizer metadata validation.
modelReturns: The selected tokenizer, with ownership and disposal obligations defined by the returned type and the CreateStrict contract.
SentencePieceGgufTokenizerUAIX.LmRuntime.Tokenization
10 members
Executes the SentencePiece-BPE tokenizer path used by LLaMA-style GGUF artifacts.
SentencePieceGgufTokenizer(UAIX.LmRuntime.Tokenization.GgufTokenizerMetadata)
Initializes the tokenizer from validated GGUF metadata.
metadataName
Metadata
Tokenize(string)
Tokenizes the supplied text with the configured metadata and preserves deterministic token order.
textReturns: An ordered read-only collection of token text values produced by the configured tokenizer.
Encode(string,bool,bool)
Encodes the supplied text with the configured tokenizer and validated special-token policy.
textaddBosaddEosReturns: An ordered read-only collection of token identifiers produced by the configured tokenizer.
Encode(string,UAIX.LmRuntime.Tokenization.TokenizationOptions)
Encodes the supplied text with the configured tokenizer and validated special-token policy.
textoptionsReturns: The TokenizationResult result produced by SentencePieceGgufTokenizer.Encode for this contract: Encodes the supplied text with the configured tokenizer and validated special-token policy. It is published only after all documented validation and ownership transitions succeed.
Decode(System.Collections.Generic.IEnumerable<int>)
Decodes the supplied token sequence with the configured tokenizer while preserving sequence order.
tokenIdsReturns: The decoded text produced from the validated token sequence in the original sequence order.
Decode(System.Collections.Generic.IReadOnlyList<int>,UAIX.LmRuntime.Tokenization.DetokenizationOptions)
Decodes the supplied token sequence with the configured tokenizer while preserving sequence order.
tokenIdsoptionsReturns: The decoded text produced from the validated token sequence in the original sequence order.
CountTokens(string)
Counts the tokens using the same deterministic rules as the corresponding processing operation.
textReturns: The int value computed by SentencePieceGgufTokenizer.CountTokens for this contract: Counts the tokens using the same deterministic rules as the corresponding processing operation. Range, finite-value, and overflow checks are completed before the value is returned.
CountTokens(System.Collections.Generic.IEnumerable<UAIX.LmRuntime.Contracts.LlmMessage>)
Counts the tokens using the same deterministic rules as the corresponding processing operation.
messagesReturns: The TokenCountResult result produced by SentencePieceGgufTokenizer.CountTokens for this contract: Counts the tokens using the same deterministic rules as the corresponding processing operation. It is published only after all documented validation and ownership transitions succeed.
Gpt2BpeTokenizerUAIX.LmRuntime.Tokenization
10 members
Executes the GPT-2 byte-level BPE tokenizer path from GGUF vocabulary and merge metadata.
Gpt2BpeTokenizer(UAIX.LmRuntime.Tokenization.GgufTokenizerMetadata)
Initializes the tokenizer from validated GGUF metadata.
metadataName
Metadata
Tokenize(string)
Tokenizes the supplied text with the configured metadata and preserves deterministic token order.
textReturns: An ordered read-only collection of token text values produced by the configured tokenizer.
Encode(string,bool,bool)
Encodes the supplied text with the configured tokenizer and validated special-token policy.
textaddBosaddEosReturns: An ordered read-only collection of token identifiers produced by the configured tokenizer.
Encode(string,UAIX.LmRuntime.Tokenization.TokenizationOptions)
Encodes the supplied text with the configured tokenizer and validated special-token policy.
textoptionsReturns: The TokenizationResult result produced by Gpt2BpeTokenizer.Encode for this contract: Encodes the supplied text with the configured tokenizer and validated special-token policy. It is published only after all documented validation and ownership transitions succeed.
Decode(System.Collections.Generic.IEnumerable<int>)
Decodes the supplied token sequence with the configured tokenizer while preserving sequence order.
tokenIdsReturns: The decoded text produced from the validated token sequence in the original sequence order.
Decode(System.Collections.Generic.IReadOnlyList<int>,UAIX.LmRuntime.Tokenization.DetokenizationOptions)
Decodes the supplied token sequence with the configured tokenizer while preserving sequence order.
tokenIdsoptionsReturns: The decoded text produced from the validated token sequence in the original sequence order.
CountTokens(string)
Counts the tokens using the same deterministic rules as the corresponding processing operation.
textReturns: The int value computed by Gpt2BpeTokenizer.CountTokens for this contract: Counts the tokens using the same deterministic rules as the corresponding processing operation. Range, finite-value, and overflow checks are completed before the value is returned.
CountTokens(System.Collections.Generic.IEnumerable<UAIX.LmRuntime.Contracts.LlmMessage>)
Counts the tokens using the same deterministic rules as the corresponding processing operation.
messagesReturns: The TokenCountResult result produced by Gpt2BpeTokenizer.CountTokens for this contract: Counts the tokens using the same deterministic rules as the corresponding processing operation. It is published only after all documented validation and ownership transitions succeed.
RwkvWorldTokenizerUAIX.LmRuntime.Tokenization
6 members
Marks the RWKV tokenizer family as an explicit unsupported boundary until a dedicated engine is implemented.
Name
Tokenize(string)
Tokenizes the supplied text with the configured metadata and preserves deterministic token order.
textReturns: An ordered read-only collection of token text values produced by the configured tokenizer.
Encode(string,bool,bool)
Encodes the supplied text with the configured tokenizer and validated special-token policy.
textaddBosaddEosReturns: An ordered read-only collection of token identifiers produced by the configured tokenizer.
Decode(System.Collections.Generic.IEnumerable<int>)
Decodes the supplied token sequence with the configured tokenizer while preserving sequence order.
tokenIdsReturns: The decoded text produced from the validated token sequence in the original sequence order.
CountTokens(string)
Counts the tokens using the same deterministic rules as the corresponding processing operation.
textReturns: The int value computed by RwkvWorldTokenizer.CountTokens for this contract: Counts the tokens using the same deterministic rules as the corresponding processing operation. Range, finite-value, and overflow checks are completed before the value is returned.
CountTokens(System.Collections.Generic.IEnumerable<UAIX.LmRuntime.Contracts.LlmMessage>)
Counts the tokens using the same deterministic rules as the corresponding processing operation.
messagesReturns: The TokenCountResult result produced by RwkvWorldTokenizer.CountTokens for this contract: Counts the tokens using the same deterministic rules as the corresponding processing operation. It is published only after all documented validation and ownership transitions succeed.
HuggingFaceTokenizerJsonAdapterUAIX.LmRuntime.Tokenization
1 member
Provides an optional seam for embedded Hugging Face tokenizer JSON metadata.
Create(string)
Creates a tokenizer from embedded tokenizer JSON metadata when supported.
jsonReturns: A tokenizer instance, with ownership and disposal obligations defined by the returned type and the Create contract.
ChatTemplateConformanceSuiteUAIX.LmRuntime.Tokenization
1 member
Runs chat-template conformance checks against rendered message sequences.
RenderAndValidate(string,System.Collections.Generic.IReadOnlyList<UAIX.LmRuntime.Contracts.LlmMessage>)
Renders and validates a chat template against a message sequence.
templatemessagesReturns: The TokenizerParityReport result produced by ChatTemplateConformanceSuite.RenderAndValidate for this contract: Renders and validates a chat template against a message sequence. It is published only after all documented validation and ownership transitions succeed.
SpecialTokenMapUAIX.LmRuntime.Tokenization
4 members
Represents model special-token identities.
BeginningOfSequence
Gets the beginning-of-sequence token identifier.
EndOfSequence
Gets the end-of-sequence token identifier.
Padding
Gets the padding token identifier.
Unknown
Gets the unknown token identifier.
TokenBudgetTruncatorUAIX.LmRuntime.Tokenization
1 member
Truncates message sequences by token budget.
TruncateMessages(System.Collections.Generic.IEnumerable<UAIX.LmRuntime.Contracts.LlmMessage>,UAIX.LmRuntime.Abstractions.ITokenizer,int)
Truncates messages so the total token count does not exceed the budget.
messagestokenizermaxTokensReturns: An ordered read-only IReadOnlyList<LlmMessage> result from TokenBudgetTruncator.TruncateMessages: Truncates messages so the total token count does not exceed the budget. Mutable internal collection aliases are not exposed through the returned contract.
TokenizerGoldenCorpusUAIX.LmRuntime.Tokenization
1 member
Loads tokenizer golden corpora.
Load(string)
Loads ordered tokenizer golden record collection from a verified local source into TokenizerGoldenCorpus.
jsonReturns: An ordered read-only IReadOnlyList<TokenizerGoldenRecord> result from TokenizerGoldenCorpus.Load: Loads ordered tokenizer golden record collection from a verified local source into TokenizerGoldenCorpus. Mutable internal collection aliases are not exposed through the returned contract.
TokenizerGoldenRecordUAIX.LmRuntime.Tokenization
2 members
Represents one tokenizer golden record.
Text
Gets the source text.
ExpectedTokenIds
Gets expected token identifiers.
TokenizerParityReportUAIX.LmRuntime.Tokenization
1 member
Represents tokenizer parity diagnostics.
Mismatches
Gets tokenizer mismatches.
InvalidUtf16PolicyUAIX.LmRuntime.Tokenization
2 members
Defines how tokenizer entry points handle invalid UTF-16 surrogate sequences.
Reject
Rejects invalid UTF-16 before tokenizer-specific normalization or segmentation.
Replace
Replaces each invalid surrogate code unit with the Unicode replacement character.
TokenizerTextSafetyUAIX.LmRuntime.Tokenization
1 member
Validates and normalizes managed strings before tokenizer-specific processing.
NormalizeUtf16(string,UAIX.LmRuntime.Tokenization.InvalidUtf16Policy)
Validates a managed string and optionally replaces unpaired surrogate code units.
textpolicyReturns: The original string when valid, or a normalized replacement string when requested.
StreamingUtf8TokenDecoderUAIX.LmRuntime.Tokenization
2 members
Incrementally decodes byte-token payloads without corrupting UTF-8 sequences split across token boundaries.
Decode(System.ReadOnlySpan<byte>,bool)
Decodes one byte fragment and retains incomplete UTF-8 state for the next call.
bytesflushReturns: The decoded text produced from the validated token sequence in the original sequence order.
Reset
Resets the requested state to its validated initial state without publishing partial state.
TokenizerVocabularyReconciliationResultUAIX.LmRuntime.Tokenization
3 members
Describes a consistency check between GGUF vocabulary order and embedded Hugging Face tokenizer JSON.
IsConsistent
Gets whether the embedded tokenizer JSON is absent or consistent with GGUF token identifiers.
EmbeddedJsonPresent
Gets whether embedded tokenizer JSON was present.
Diagnostics
Gets bounded deterministic diagnostics.
TokenizerVocabularyReconcilerUAIX.LmRuntime.Tokenization
1 member
Reconciles embedded Hugging Face vocabulary identifiers against authoritative GGUF token-array order.
Reconcile(UAIX.LmRuntime.Tokenization.GgufTokenizerMetadata)
Validates embedded tokenizer JSON without allowing it to reorder GGUF token identifiers.
metadataReturns: The TokenizerVocabularyReconciliationResult result produced by TokenizerVocabularyReconciler.Reconcile for this contract: Validates embedded tokenizer JSON without allowing it to reorder GGUF token identifiers. It is published only after all documented validation and ownership transitions succeed.
WhitespaceTokenizerUAIX.LmRuntime.Tokenization
6 members
Provides a deterministic tokenizer for tests, examples, and fallback token budgeting.
Name
Tokenize(string)
Tokenizes the supplied text with the configured metadata and preserves deterministic token order.
textReturns: An ordered read-only collection of token text values produced by the configured tokenizer.
Encode(string,bool,bool)
Encodes the supplied text with the configured tokenizer and validated special-token policy.
textaddBosaddEosReturns: An ordered read-only collection of token identifiers produced by the configured tokenizer.
Decode(System.Collections.Generic.IEnumerable<int>)
Decodes the supplied token sequence with the configured tokenizer while preserving sequence order.
tokenIdsReturns: The decoded text produced from the validated token sequence in the original sequence order.
CountTokens(string)
Counts the tokens using the same deterministic rules as the corresponding processing operation.
textReturns: The int value computed by WhitespaceTokenizer.CountTokens for this contract: Counts the tokens using the same deterministic rules as the corresponding processing operation. Range, finite-value, and overflow checks are completed before the value is returned.
CountTokens(System.Collections.Generic.IEnumerable<UAIX.LmRuntime.Contracts.LlmMessage>)
Counts the tokens using the same deterministic rules as the corresponding processing operation.
messagesReturns: The TokenCountResult result produced by WhitespaceTokenizer.CountTokens for this contract: Counts the tokens using the same deterministic rules as the corresponding processing operation. It is published only after all documented validation and ownership transitions succeed.
Use the strict path when unsupported or inconsistent metadata must fail closed. Use the non-strict path only when its fallback behavior is understood and covered by your own compatibility tests.
Special-token insertion, parsing, removal, whitespace cleanup, and invalid-text policy are part of the tokenizer contract. Set them explicitly at application boundaries.
No. It is a small deterministic role/content renderer. Use the conformance surface to evaluate supported template behavior, and do not imply general Jinja compatibility.
Do not assume so. Validate vocabulary, merges, special token IDs, pre-tokenizer behavior, and fingerprints against the actual artifact.