docs: markdown with large language model evaluation recipe #46

Draft
wants to merge 1 commit into base: main

Conversation

@ChristianMurphy (Member) commented Nov 7, 2024

Initial checklist

  • I read the support docs
  • I read the contributing guide
  • I agree to follow the code of conduct
  • I searched issues and couldn’t find anything (or linked relevant results below)
  • If applicable, I’ve added docs and tests

Description of changes

This draft article is open for feedback to determine if it’s a fit for recipes and aligns with maintainer expectations.

It focuses on using model evaluations to generate valid Markdown with LLMs, prioritizing standards like CommonMark and GFM. It covers defining syntax rules, creating test cases, setting metrics, and refining prompts, with an emphasis on standards-compliant Markdown without custom workarounds.

Though it includes more Python than JavaScript, the recipe addresses developer needs around reliable Markdown output, especially for the math-syntax discussions in remarkjs/remark-math#39.

@wooorm (Member) left a comment

I imagine that running remark as a lint task in this repo will find some stuff in this document.
Particularly around word use. Line wrapping.

Using “math” as an example is especially interesting, but also especially error prone, as it isn’t really in GFM either. And the thing supported by github.com isn’t exactly what the micromark extension does either.

I worry about this article going into testing LLMs almost immediately. I assume this article is meant to be shared in discussions with people asking about how to do math. I think they will want an answer on how to change their prompt instead of an article on how to create a robust test framework.

Perhaps that isn’t the goal. What is the goal? Who do you want to reach? How do you want to help them?

Thinking about the audience, I do worry about Python too. Can this be done in JavaScript too?

Or, could this be split? One on how to improve prompts, one on how to test them?
Perhaps, even a recipe on how to adjust a prompt for math that works with micromark/remark/react-markdown/etc?

There’s some editorial stuff that I can take care of later,
such as sentence case for headings, wrapping the prose, etc.

Comment on lines +10 to +12
- remark
- use
- introduction

Member

sort this list

Member Author

Will do, I'm also thinking of adding more tags

tags:
- accuracy
- consistency
- fewshot
- gpt4
- latex
- llms
- markdown
- math
- parsing
- prompt
- promptfoo
- prompting
- refinement
- remark
- structure
- testing
- tools
- validation

title: Generating valid markdown with LLMs
---

# Generating Valid Markdown with Large Language Models: A Guide to Model Evaluations

Member

Title is a bit long, maybe something shorter?

I believe SEO and similar want not-too-long titles



Large language models (LLMs) are increasingly popular for generating Markdown across technical documentation, structured content, and prototyping. However, all LLMs have the potential to hallucinate or produce incorrect Markdown, especially when different standards are present in their training data. Markdown standards like CommonMark and GitHub-Flavored Markdown (GFM) are necessary for ensuring that content displays consistently across tools and platforms, from documentation generators to code repositories. When an LLM’s output deviates from these standards, content can appear incorrectly formatted, fail to render as expected, or disrupt workflows that rely on Markdown’s consistency.

Member

I would shy away from “standards” and prefer “flavors”; I think that better reflects the absence of specifications... the vagueness of it all. Similarly, the word “valid”.

This is also why I use markdown instead of Markdown.
Like how things such as googling become something detached from Google.
The capitalized version to me has a connotation with the OG Markdown.pl

Member Author

from “standards” and prefer “flavors”,

I reworked this in the outline below

@ChristianMurphy (Member, Author)

I imagine that running remark as a lint task in this repo will find some stuff in this document.
Particularly around word use. Line wrapping.

I imagine it will.
I'd prefer to get alignment on high level direction before dealing with the linter.

Using “math” as an example is especially interesting, but also especially error prone

I think we're on the same page; what do you mean by error prone here?
I chose it because it is error prone in the sense that it is semi-standard, and different LLMs use different markers.

Perhaps that isn’t the goal. What is the goal? Who do you want to reach?

Adopters using LLMs who are either:

  • wanting to ensure their markdown output is consistent and valid
  • having trouble generating valid markdown and looking for help

How do you want to help them?

Help them learn how to create a benchmark of how well an LLM/system prompt is performing

I think they will want an answer on how to change their prompt instead of an article on how to create a robust test framework.

And I need people to understand that, from a prompt perspective, there is no one-size-fits-all answer.
Some reasons:

  • Different LLMs have different training sets and different capabilities
    • This changes both what models output by default
    • and what kind of prompts are needed
  • Even within the same LLM model class (e.g. GPT-4o), different checkpoints/releases (e.g. gpt-4o-2024-05-13 vs gpt-4o-2024-08-06) can change how outputs and prompts behave
  • LLMs have limited ability to follow instructions; with too many instructions they start ignoring some.
    • Adopters also need to test the other functionality they expect, to make sure performance there is not hurt in the process of improving markdown adherence

There are some high-level recommendations for different models in https://www.promptingguide.ai/ which we could link to.
But the measurement piece is more important.

Can this be done in JavaScript too?

There are fewer options, and this is less common in the LLM/data-science communities, but yes.
https://www.promptfoo.dev/docs/configuration/expected-outputs/model-graded/llm-rubric/ could be used.
It would be a total rewrite of the examples though.

Or, could this be split? One on how to improve prompts, one on how to test them?

I'd be cautious; IMO prompt improvements should be guided by benchmarks/empirical measures.

@wooorm
Copy link
Member

wooorm commented Nov 11, 2024

And I need people to understand that, from a prompt perspective, there is no one-size-fits-all answer […]

These points you write down here, of what people need to get, I didn’t really get from the article. But it is likely that I don’t understand LLMs.

It seems to me a very long way to go from a react-markdown user with little knowledge of JavaScript and LLMs to writing a test suite in Python, and grasping why writing test suites in Python is a good idea.

LLM/data-science communities

I am not sure that community reads unifiedjs.com.
Or posts questions in our Discussions.
Would it perhaps be better to have this guide instead in a place frequented by them?

Quickly searching through the guide, there is no mention of mdast/micromark/unified, and remark appears only once, for a link to remark-math? Perhaps, as this guide talks about LLM + markdown more generally, it is more suited for another place?

I'd be cautious; IMO prompt improvements should be guided by benchmarks/empirical measures.

This too seems very far from the people who don’t know markdown/JS/LLMs and post questions in our places?

@ChristianMurphy (Member, Author) commented Nov 12, 2024

What do you think about this outline for a potential second version of the article?


Generating Reliable markdown with LLMs

Learning Objectives

By the end of this guide, you will be able to:

  1. Understand the Basics of markdown and Its Variations
    Learn about markdown’s purpose, its different flavors (like CommonMark, GFM, and MDX), and how these affect formatting consistency. Recognize when to specify custom syntax, such as math expressions, to get accurate markdown output from LLMs. Additionally, understand how tools like remark can be used to parse and transform markdown effectively.

  2. Recognize the Importance of Model and Prompt Selection
    Understand how different model versions (or checkpoints) affect output consistency and why choosing the right model is just as important as designing a clear prompt. Explore examples that show how prompts interact with specific model versions.

  3. Design Effective Prompts for markdown Generation
    Develop best practices for crafting prompts that focus on your primary goal while also specifying format requirements. Gain practical skills in using few-shot prompting, specifying markdown syntax, and structuring prompts to achieve clear, well-formatted output. Learn how remark can support consistent formatting when combined with well-crafted prompts.

  4. Test and Refine markdown Output Using Promptfoo
    Set up and run tests in Promptfoo to validate the quality and consistency of markdown output, including any specialized syntax like math. Understand how testing, in combination with remark, plays a role in producing reliable LLM outputs across markdown and other formats.

  5. Adapt Prompts for Consistency in Complex Outputs
    Build skills in iterative prompt refinement to ensure markdown outputs match your requirements, especially when dealing with structured content or technical elements. Using remark can also assist in refining and processing complex markdown outputs.

1. markdown Basics and Its Variations

  • Understand markdown’s Use Cases
    markdown is a lightweight text format widely used in documentation, READMEs, and web content. It’s simple to read and edit, making it ideal for content that needs structured formatting without complex tools.

  • Learn markdown Flavors and Extensions
    Discover the different "flavors" of markdown, such as CommonMark, GitHub-Flavored markdown (GFM), and MDX. Each flavor has unique rules and features, and some projects use extensions for special features like math syntax ($$...$$ for equations) or custom directives. When generating markdown with an LLM, specify the required format and any unique features to ensure the output aligns with your project’s needs.

  • Working with Math in markdown
    To effectively integrate math into your markdown content, you can use tools like remark, remark-math, and rehype-katex. These tools help parse and render mathematical expressions within markdown, ensuring well-formatted output. Later in the article, we will introduce a practical example related to math, showing how these tools work in conjunction to ensure consistency and accuracy.

    • remark is a markdown processor that can parse, transform, and serialize content, while remark-math adds support for LaTeX-style math notation.
    • rehype-katex is used to render these math expressions as HTML using KaTeX for a polished display.
These tools will be used in an example later to illustrate how math content can be handled effectively.

 ```javascript
 import { unified } from 'unified';
 import remarkParse from 'remark-parse';
 import remarkMath from 'remark-math';
 import remarkRehype from 'remark-rehype';
 import rehypeKatex from 'rehype-katex';
 import rehypeStringify from 'rehype-stringify';

 const processor = unified()
   .use(remarkParse)       // Parse markdown
   .use(remarkMath)        // Process math syntax
   .use(remarkRehype)      // Convert markdown to HTML
   .use(rehypeKatex)       // Render math with KaTeX
   .use(rehypeStringify);  // Serialize HTML
 ```

 This example sets up a processor that can take markdown input containing mathematical expressions and output HTML with properly formatted math.

 When generating markdown through LLMs, specifying this setup can help ensure the correct parsing and rendering of complex math content.
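
 For instance, a minimal sketch of feeding an LLM’s markdown output through this processor (assuming the `processor` from the snippet above):

 ```javascript
 // Hypothetical LLM output containing display math.
 const output = "Newton's Second Law:\n\n$$\nF = ma\n$$\n";

 // `process` parses the markdown, renders the math with KaTeX, and serializes HTML.
 const file = await processor.process(output);

 console.log(String(file)); // HTML with the equation rendered by KaTeX
 ```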

2. Understand Why Prompt Design and Model Choice Affect Output Consistency

  • Recognize the Impact of Model and Prompt Together
    The effectiveness of a prompt depends not only on its design but also on the specific model and checkpoint you’re using. Different versions of the same model family, such as OpenAI’s GPT-4, perform differently depending on the task, and slight variations between checkpoints can result in distinct responses.

  • Explore Model Checkpoint Variability
    Feedback from the OpenAI community illustrates how different versions, like GPT-4-turbo-2024-04-09, GPT-4-1106, and GPT-4o, vary in their performance across logical reasoning, coding, and general explanation tasks. For instance:

    • Logical and Mathematical Reasoning
      GPT-4-turbo-2024-04-09 is noted for better performance in logical and mathematical tasks, while GPT-4o can sometimes struggle with maintaining consistency in complex logic.

    • Coding Performance
      GPT-4o can generate longer code segments but may produce verbose outputs, while other versions, like GPT-4-1106, are preferred for efficient, shorter coding tasks.

    • Conceptual Clarity
      GPT-4o tends to produce confident yet incorrect responses (“hallucinations”) more frequently than GPT-4-1106, which performs better on tasks that require precise explanations.

    These examples show why model selection can be as important as prompt refinement. If you’re generating markdown or other structured outputs, model checkpoints may impact consistency in ways that prompt adjustments alone cannot address. For more insights, see the OpenAI Community review on GPT-4o vs. GPT-4-turbo vs. GPT-4-1106.

3. Write Effective Prompts to Achieve Your Goal

  • Focus on the Goal Over Format
    Prompts are generally aimed at achieving a specific goal, such as explaining a technical concept, supporting a learning task, or assisting a customer. Formatting, like markdown or JSON, is often used to support that goal. When designing a prompt, keep the main objective in focus, using format instructions as secondary guidance to clarify presentation.

  • Follow Best Practices for Structuring Prompts for markdown

    1. Define Clear Objectives
      State the main goal (e.g., “Explain Newton’s Laws with examples”) and the desired format (e.g., “Use markdown with headers, bullet points, and math syntax for equations”).

    2. Use Simple, Structured Prompts
      Organize your prompt logically and provide examples to guide the model. For instance, if you need markdown with math:

      Explain Newton's Second Law and use markdown with math syntax. Include the formula as `$$F = ma$$`.
      
    3. Provide Examples (Few-Shot Prompting)
      Including one or two examples of the output format helps the model produce consistent results, particularly for specialized syntax like math in markdown (see the sketch after this list).

    4. Refine the Prompt Based on Results
      Adjust prompts based on output. If the response doesn’t align with expectations, simplify or clarify the prompt in iterative steps.
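
As a concrete illustration of points 2 and 3 above, here is a minimal few-shot sketch. It assumes the OpenAI Node SDK (`openai`) and the `gpt-4o` model, but any chat-completion client and model would work similarly, and the message wording is only an example:

```javascript
import OpenAI from 'openai';

const client = new OpenAI();

const completion = await client.chat.completions.create({
  model: 'gpt-4o',
  messages: [
    {
      role: 'system',
      content:
        'You write GitHub-Flavored markdown. Put display math in `$$` fences on their own lines and inline math between single `$` signs.'
    },
    // One worked example (few-shot) showing the expected shape of the output.
    {role: 'user', content: 'Explain kinetic energy and include the formula.'},
    {
      role: 'assistant',
      content:
        '## Kinetic energy\n\nKinetic energy grows with mass and speed:\n\n$$\nE_k = \\frac{1}{2}mv^2\n$$\n'
    },
    // The actual request, phrased the same way as the example.
    {role: 'user', content: "Explain Newton's Second Law and include the formula."}
  ]
});

console.log(completion.choices[0].message.content);
```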

4. Test Output Quality with Promptfoo: A Step-by-Step Approach

Use Promptfoo to validate that the LLM-generated markdown (or any other structured format) meets both formatting and task-specific goals. Follow these steps to set up and refine your tests:

  1. Define Metrics for Output Quality
    Start by defining clear metrics for success. These could include:

    • Correct use of markdown syntax (e.g., headers, lists, math formatting with $$...$$)
    • Task-specific accuracy (e.g., correctness in explaining a concept)
    • Consistency in tone and style, if applicable
  2. Create Promptfoo Tests Based on Defined Metrics
    Set up Promptfoo tests to measure these metrics. For instance:

    • Test for markdown structure by verifying correct header levels, bullet points, or math syntax
    • Check whether the explanation follows a logical structure and is clearly written (a structure-check sketch in JavaScript follows this list)
  3. Design Mini Test Cases with Expected Outcomes
    Develop mini test cases that cover both “pass” and “fail” conditions based on your metrics. This will help create a confusion matrix that visually shows which cases the model handles well and where it needs improvement. For example:

    • Pass Case: Correctly formatted markdown with headers and math syntax (e.g., $$E = mc^2$$).
    • Fail Case: Missing headers, incorrect math formatting, or inconsistent explanation structure.
  4. Refine Metrics Based on Test Case Insights
    As you see which cases pass or fail, you may find certain metrics need refining to better capture your needs. Adjust metrics as needed to focus on what’s most relevant to your output goals.

  5. Generate Realistic Input Examples
    Create examples that represent realistic prompts users would input. If you’re generating educational content on physics, for instance, use a prompt like “Explain Newton’s Second Law with equations and examples.” Ensure your examples are varied enough to test the model’s response to different input types.

  6. Run Tests to Measure Model Output Against Metrics
    Use Promptfoo to test your prompt on the LLM output, measuring it against the defined metrics and evaluating areas of success and areas for improvement.

  7. Refine the Prompt Based on Metric Feedback
    Based on the test outcomes, adjust your prompt to address any areas that didn’t meet the metrics. For example, if markdown math syntax is inconsistent, clarify this in the prompt with specific examples or further simplify the prompt. Continue refining and retesting until the prompt meets all key metrics reliably.
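
The structure checks from steps 2 and 3 can also be written as a small script and reused across runs. Below is a minimal sketch using remark and remark-math; the `checkMarkdown` helper and its pass criteria are illustrative rather than part of any promptfoo API, and promptfoo assertions such as `llm-rubric` can be used alongside it:

```javascript
import {unified} from 'unified';
import remarkParse from 'remark-parse';
import remarkMath from 'remark-math';
import {visit} from 'unist-util-visit';

// Parse LLM output as markdown and check the structural metrics we care about:
// at least one heading and at least one math node (block or inline).
function checkMarkdown(output) {
  const tree = unified().use(remarkParse).use(remarkMath).parse(output);

  let hasHeading = false;
  let hasMath = false;

  visit(tree, (node) => {
    if (node.type === 'heading') hasHeading = true;
    if (node.type === 'math' || node.type === 'inlineMath') hasMath = true;
  });

  return {pass: hasHeading && hasMath, hasHeading, hasMath};
}

console.log(checkMarkdown('## Newton’s Second Law\n\n$$\nF = ma\n$$\n'));
// => { pass: true, hasHeading: true, hasMath: true }
```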

5. Example: Testing a Physics Explanation with Math in markdown

  • Define Your Example Goal
    Generate an explanation of Newton’s Second Law in markdown with structured headers and math expressions (e.g., $$F = ma$$). Use remark and remark-math to ensure correct parsing and rendering of the math content. One possible target output is sketched after these steps.

  • Testing with Promptfoo

    • Step 1: Create a prompt to explain Newton’s Second Law using markdown, specifying math syntax.
    • Step 2: Set up a Promptfoo test to verify correct markdown structure and the math equation format, utilizing remark and remark-math for accurate processing.
    • Step 3: Run the test, refine the prompt based on results, and repeat until the output meets the desired structure and mathematical clarity.
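
For reference, one possible target output for this example (illustrative wording, not a required answer) could look like:

```markdown
## Newton’s Second Law

The net force on an object equals its mass times its acceleration:

$$
F = ma
$$

For example, a net force of 10 N applied to a 2 kg cart accelerates it at $a = F/m = 5\ \text{m/s}^2$.
```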

6. Summary and Further Resources

  • Recap Key Points
    Prompt structure and model selection both influence output consistency. Testing with Promptfoo and processing output with remark help ensure the output supports the main task while meeting formatting requirements.

  • Further Learning
    Explore markdown syntax resources, Promptfoo documentation, remark plugins, and the Prompting Guide for additional prompt strategies.
