[FEATURE] LLM Token-Level Generation Supervision #370

Open
iwr-redmond opened this issue Feb 4, 2025 · 1 comment
Labels: 💡 feature request

iwr-redmond commented Feb 4, 2025

Feature Description

Rescued from #368:

You may wish to consider implementing one of the token-level supervision options for LlamaCPP to deliver superior adherence during structured generation. It's the difference between asking "pretty please" and guaranteeing a correctly structured response.

As currently implemented by @xsxszab in nexa_inference_text.py, generation simply fails if the model does not return valid JSON or does not follow the requested schema.
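
To illustrate the failure mode: without token-level supervision, structured generation reduces to prompting and hoping, roughly like this (a minimal sketch, not the actual SDK code; the function and parameter names are placeholders):

```python
import json

def generate_structured(llm, prompt: str, schema_instructions: str) -> dict:
    # The schema only appears in the prompt text; nothing constrains decoding.
    output = llm(f"{prompt}\n{schema_instructions}", max_tokens=256)
    text = output["choices"][0]["text"]
    # If the model ignores the instructions, this raises and generation fails.
    return json.loads(text)
```

A token-level logits processor removes that failure mode entirely, because invalid tokens can never be sampled in the first place.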

Options

LM Format Enforcer (Python)

LM Format Enforcer's llama-cpp-python integration code should be easy to adapt. The package is already used in Red Hat/IBM's enterprise-focused vLLM project (reference).

A demonstration workbook is available here. You may be able to run this workbook as-is by merely changing the imports, e.g.:

-from llama_cpp import LogitsProcessorList
+from nexa.gguf.llama import LogitsProcessorList

LLGuidance (upstream)

The LLGuidance Rust crate has recently been added to upstream llama.cpp.

Enabling this feature at compile time requires some fiddling with Rust, and some bug fixes are still being finalized (pull 11644). However, these are transitional problems, and adopting this approach would probably make it easier for end-users to use structured generation through the SDK.
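
For comparison, llama-cpp-python already exposes llama.cpp's native grammar-based token supervision, which LLGuidance is intended to extend and speed up. A minimal sketch of that existing path (model path and schema are placeholders):

```python
import json
from llama_cpp import Llama, LlamaGrammar

llm = Llama(model_path="model.gguf")  # placeholder path

schema = {
    "type": "object",
    "properties": {"answer": {"type": "string"}},
    "required": ["answer"],
}
# llama.cpp compiles the JSON schema into a GBNF grammar and masks
# non-conforming tokens at every decoding step.
grammar = LlamaGrammar.from_json_schema(json.dumps(schema))

output = llm("Reply in JSON: ", grammar=grammar, max_tokens=100)
print(json.loads(output["choices"][0]["text"]))  # guaranteed to parse
```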

iwr-redmond added the 💡 feature request label Feb 4, 2025
iwr-redmond changed the title [FEATURE] LM Format Enforcer integration → [FEATURE] LLM Token-Level Generation Supervision Feb 7, 2025

iwr-redmond commented Mar 12, 2025

Going through the linked workbook, implementing LM Format Enforcer for JSON schema inputs would probably look something like:

# nexa/constants.py
NEXA_RUN_COMPLETION_TEMPLATE_MAP = {
    "format_enforcer": "You MUST answer using the following JSON schema:",
    "octopus-v2": "Below is the query from the users, please call the correct function and generate the parameters to call the function.\n\nQuery: {input} \n\nResponse:",
    "octopus-v4": "<|system|>You are a router. Below is the query from the users, please call the correct function and generate the parameters to call the function.<|end|><|user|>{input}<|end|><|assistant|>",
}

# nexa/gguf/structure_utils.py
from typing import Optional

from lmformatenforcer import CharacterLevelParser
from lmformatenforcer.integrations.llamacpp import (
    build_llamacpp_logits_processor,
    build_token_enforcer_tokenizer_data,
)

from nexa.gguf.llama import Llama, LogitsProcessorList


class FormatEnforcer:
    """
    Character-level parser integration for llama.cpp.
    Source: https://github.com/noamgat/lm-format-enforcer
            samples/colab_llamacpppython_integration.ipynb
    """

    def __init__(self, downloaded_path: str):
        self.llm = Llama(model_path=downloaded_path)
        # The token enforcer tokenizer data only needs to be built once per model.
        self.tokenizer_data = build_token_enforcer_tokenizer_data(self.llm)

    def __call__(self, prompt: str,
                 character_level_parser: Optional[CharacterLevelParser] = None,
                 **llm_kwargs) -> str:
        """Generate text, constrained by the parser when one is supplied."""
        logits_processors: Optional[LogitsProcessorList] = None
        if character_level_parser:
            logits_processors = LogitsProcessorList(
                [build_llamacpp_logits_processor(self.tokenizer_data,
                                                 character_level_parser)]
            )
        output = self.llm(prompt, logits_processor=logits_processors, **llm_kwargs)
        return output['choices'][0]['text']


def pydantic_to_json(pydantic_model) -> str:
    """
    Converts a Pydantic model class to its JSON schema for processing.
    Not currently used.
    Based on: https://github.com/noamgat/lm-format-enforcer
              samples/colab_llamacpppython_integration.ipynb
    """
    return pydantic_model.schema_json()

This code would then be used to replace nexa_inference_text.py lines 361-400 with something like:

# nexa/gguf/nexa_inference_text.py

# top of file
from lmformatenforcer import JsonSchemaParser

from nexa.constants import NEXA_RUN_COMPLETION_TEMPLATE_MAP
from nexa.gguf.structure_utils import FormatEnforcer

# from line 361
        enforcer_instructions = NEXA_RUN_COMPLETION_TEMPLATE_MAP.get("format_enforcer")
        structured_prompt = f"{prompt} {enforcer_instructions} {json_schema}"

        params = {
            "temperature": self.params.get("temperature", 0.7),
            "max_tokens": self.params.get("max_new_tokens", 2048),
            "top_k": self.params.get("top_k", 50),
            "top_p": self.params.get("top_p", 1.0),
            "stop": self.stop_words,
            "logprobs": self.logprobs,
        }
        params.update(kwargs)
        # Perform structured inference. AnswerFormat is the user-supplied
        # Pydantic model (as in the workbook); self.downloaded_path is the
        # model path already stored on the instance.
        enforcer = FormatEnforcer(self.downloaded_path)
        structured_data = enforcer(structured_prompt,
                                   JsonSchemaParser(AnswerFormat.schema()),
                                   **params)

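For reference, AnswerFormat in the demonstration workbook is just a small Pydantic model. A matching definition and post-hoc validation might look like this (a sketch; the field names follow the workbook's example):

```python
from pydantic import BaseModel

# Example schema from the lm-format-enforcer demonstration workbook.
class AnswerFormat(BaseModel):
    first_name: str
    last_name: str
    year_of_birth: int
    num_seasons_in_nba: int

# Because decoding was token-supervised, the output is guaranteed to
# conform to the schema, so this should never raise.
answer = AnswerFormat.parse_raw(structured_data)
```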