docs: markdown with large language model evaluation recipe #46

Draft
wants to merge 1 commit into base: main

Conversation

@ChristianMurphy (Member) commented Nov 7, 2024

Initial checklist

  • I read the support docs
  • I read the contributing guide
  • I agree to follow the code of conduct
  • I searched issues and couldn’t find anything (or linked relevant results below)
  • If applicable, I’ve added docs and tests

Description of changes

This draft article is open for feedback to determine if it’s a fit for recipes and aligns with maintainer expectations.

It focuses on using model evaluations to generate valid Markdown with LLMs, prioritizing standards like CommonMark and GFM. It covers defining syntax rules, creating test cases, setting metrics, and refining prompts, with an emphasis on standards-compliant Markdown without custom workarounds.

Though it includes more Python than JavaScript, the recipe addresses developer needs around reliable Markdown output, especially for the math-syntax discussions in remarkjs/remark-math#39.

@wooorm (Member) left a comment

I imagine that running remark as a lint task in this repo will find some stuff in this document.
Particularly around word use. Line wrapping.

Using “math” as an example is especially interesting, but also especially error prone, as it isn’t really in GFM either. And the thing supported by github.com isn’t exactly what the micromark extension does either.

I worry about this article going into testing LLMs almost immediately. I assume this article is meant to be shared in discussions with people asking about how to do math. I think they will want an answer on how to change their prompt instead of an article on how to create a robust test framework.

Perhaps that isn’t the goal. What is the goal? Who do you want to reach? How do you want to help them?

Thinking about the audience, I do worry about Python too. Can this be done in JavaScript too?

Or, could this be split? One on how to improve prompts, one on how to test them?
Perhaps, even a recipe on how to adjust a prompt for math that works with micromark/remark/react-markdown/etc?

There’s some editorial stuff that I can take care of later,
such as sentence case for headings, wrapping the prose, etc.

Comment on lines +10 to +12
- remark
- use
- introduction

Member

sort this list

Member Author

Will do, I'm also thinking of adding more tags

tags:
- accuracy
- consistency
- fewshot
- gpt4
- latex
- llms
- markdown
- math
- parsing
- prompt
- promptfoo
- prompting
- refinement
- remark
- structure
- testing
- tools
- validation

title: Generating valid markdown with LLMs
---

# Generating Valid Markdown with Large Language Models: A Guide to Model Evaluations

Member

Title is a bit long, maybe something shorter?

I believe SEO and similar want not-too-long titles



Large language models (LLMs) are increasingly popular for generating Markdown across technical documentation, structured content, and prototyping. However, all LLMs have the potential to hallucinate or produce incorrect Markdown, especially when different standards are present in their training data. Markdown standards like CommonMark and GitHub-Flavored Markdown (GFM) are necessary for ensuring that content displays consistently across tools and platforms, from documentation generators to code repositories. When an LLM’s output deviates from these standards, content can appear incorrectly formatted, fail to render as expected, or disrupt workflows that rely on Markdown’s consistency.

Member

I would shy away from “standards” and prefer “flavors”; I think that better reflects the absence of specifications... the vagueness of it all. Similarly, the word “valid”.

This is also why I use markdown instead of Markdown.
Like how things such as googling become something detached from Google.
The capitalized version to me has a connotation with the OG Markdown.pl

Member Author

from “standards” and prefer “flavors”,

I reworked this in the outline below

@ChristianMurphy (Member, Author)

I imagine that running remark as a lint task in this repo will find some stuff in this document.
Particularly around word use. Line wrapping.

I imagine it will.
I'd prefer to get alignment on high level direction before dealing with the linter.

Using “math” as an example is especially interesting, but also especially error prone

I think we're on the same page; what do you mean by error prone here?
I chose it because it is error prone in the sense that it is semi-standard, and different LLMs use different markers.

Perhaps that isn’t the goal. What is the goal? Who do you want to reach?

Adopters using LLMs who are either:

  • wanting to ensure their markdown output is consistent and valid
  • having trouble generating valid markdown and looking for help

How do you want to help them?

Help them learn how to create a benchmark of how well an LLM/system prompt is performing

I think they will want an answer on how to change their prompt instead of an article on how to create a robust test framework.

And I need people to understand that, from a prompt perspective, there is no one-size-fits-all answer.
Some reasons:

  • Different LLMs have different training sets and different capabilities
    • This changes both what models output by default
    • and what kind of prompts are needed
  • Even within the same LLM model class (e.g. GPT-4o), different checkpoints/releases (e.g. gpt-4o-2024-05-13 vs gpt-4o-2024-08-06) can change how outputs and prompts behave
  • LLMs have limited ability to follow instructions; with too many instructions they start ignoring some.
    • Adopters also need to test the other functionality they expect, to make sure performance there is not hurt in the process of improving markdown adherence

There are some high-level recommendations for different models in https://www.promptingguide.ai/ which we could link to.
But the measurement piece is more important.

Can this be done in JavaScript too?

There are fewer options, and this is less common in the LLM/data-science communities, but yes.
https://www.promptfoo.dev/docs/configuration/expected-outputs/model-graded/llm-rubric/ could be used.
It would be a total rewrite of the examples though.

Or, could this be split? One on how to improve prompts, one on how to test them?

I'd be cautious; IMO prompt improvements should be guided by benchmarks/empirical measures.

@wooorm
Copy link
Member

wooorm commented Nov 11, 2024

And I need people to understand that, from a prompt perspective, there is no one-size-fits-all answer […]

These points you write down here, of what people need to get, I didn’t really get from the article. But it is likely that I don’t understand LLMs.

It seems to me a very long way to go from a react-markdown user with little knowledge of JavaScript and LLMs to writing a test suite in Python, and grasping why writing test suites in Python is a good idea.

LLM/data-science communities

I am not sure that community reads unifiedjs.com.
Or posts questions in our Discussions.
Would it perhaps be better to have this guide instead in a place frequented by them?

Quickly searching through the guide, there is no mention of mdast/micromark/unified, and remark appears only once, for a link to remark-math? Perhaps, as this guide talks about LLM + markdown more generally, it is more suited for another place?

I'd be cautious; IMO prompt improvements should be guided by benchmarks/empirical measures.

This too seems very far from the people who don’t know markdown/JS/LLMs and post questions in our places?

@ChristianMurphy (Member, Author) commented Nov 12, 2024

What do you think about this outline for a potential second version of the article?


Generating Reliable markdown with LLMs

Learning Objectives

By the end of this guide, you will be able to:

  1. Understand the Basics of markdown and Its Variations
    Learn about markdown’s purpose, its different flavors (like CommonMark, GFM, and MDX), and how these affect formatting consistency. Recognize when to specify custom syntax, such as math expressions, to get accurate markdown output from LLMs. Additionally, understand how tools like remark can be used to parse and transform markdown effectively.

  2. Recognize the Importance of Model and Prompt Selection
    Understand how different model versions (or checkpoints) affect output consistency and why choosing the right model is just as important as designing a clear prompt. Explore examples that show how prompts interact with specific model versions.

  3. Design Effective Prompts for markdown Generation
    Develop best practices for crafting prompts that focus on your primary goal while also specifying format requirements. Gain practical skills in using few-shot prompting, specifying markdown syntax, and structuring prompts to achieve clear, well-formatted output. Learn how remark can support consistent formatting when combined with well-crafted prompts.

  4. Test and Refine markdown Output Using Promptfoo
    Set up and run tests in Promptfoo to validate the quality and consistency of markdown output, including any specialized syntax like math. Understand how testing, in combination with remark, plays a role in producing reliable LLM outputs across markdown and other formats.

  5. Adapt Prompts for Consistency in Complex Outputs
    Build skills in iterative prompt refinement to ensure markdown outputs match your requirements, especially when dealing with structured content or technical elements. Using remark can also assist in refining and processing complex markdown outputs.

1. markdown Basics and Its Variations

  • Understand markdown’s Use Cases
    markdown is a lightweight text format widely used in documentation, READMEs, and web content. It’s simple to read and edit, making it ideal for content that needs structured formatting without complex tools.

  • Learn markdown Flavors and Extensions
    Discover the different "flavors" of markdown, such as CommonMark, GitHub-Flavored markdown (GFM), and MDX. Each flavor has unique rules and features, and some projects use extensions for special features like math syntax ($$...$$ for equations) or custom directives. When generating markdown with an LLM, specify the required format and any unique features to ensure the output aligns with your project’s needs.

  • Working with Math in markdown
    To effectively integrate math into your markdown content, you can use tools like remark, remark-math, and rehype-katex. These tools help parse and render mathematical expressions within markdown, ensuring well-formatted output. Later in the article, we will introduce a practical example related to math, showing how these tools work in conjunction to ensure consistency and accuracy.

    • remark is a markdown processor that can parse, transform, and serialize content, while remark-math adds support for LaTeX-style math notation.
    • rehype-katex is used to render these math expressions as HTML using KaTeX for a polished display.
These tools will be used in an example later to illustrate how math content can be handled effectively.

 ```javascript
 import { unified } from 'unified';
 import remarkParse from 'remark-parse';
 import remarkMath from 'remark-math';
 import remarkRehype from 'remark-rehype';
 import rehypeKatex from 'rehype-katex';
 import rehypeStringify from 'rehype-stringify';

 const processor = unified()
   .use(remarkParse)       // Parse markdown
   .use(remarkMath)        // Process math syntax
   .use(remarkRehype)      // Convert markdown to HTML
   .use(rehypeKatex)       // Render math with KaTeX
   .use(rehypeStringify);  // Serialize HTML
 ```

 This example sets up a processor that can take markdown input containing mathematical expressions and output HTML with properly formatted math.

 When generating markdown through LLMs, specifying this setup can help ensure the correct parsing and rendering of complex math content.
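
 For instance, a minimal sketch of feeding an LLM’s markdown output through this processor (assuming the `processor` from the snippet above):

 ```javascript
 // Hypothetical LLM output containing display math.
 const output = "Newton's Second Law:\n\n$$\nF = ma\n$$\n";

 // `process` parses the markdown, renders the math with KaTeX, and serializes HTML.
 const file = await processor.process(output);

 console.log(String(file)); // HTML with the equation rendered by KaTeX
 ```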

2. Understand Why Prompt Design and Model Choice Affect Output Consistency

  • Recognize the Impact of Model and Prompt Together
    The effectiveness of a prompt depends not only on its design but also on the specific model and checkpoint you’re using. Different versions of the same model family, such as OpenAI’s GPT-4, perform differently depending on the task, and slight variations between checkpoints can result in distinct responses.

  • Explore Model Checkpoint Variability
    Feedback from the OpenAI community illustrates how different versions, like GPT-4-turbo-2024-04-09, GPT-4-1106, and GPT-4o, vary in their performance across logical reasoning, coding, and general explanation tasks. For instance:

    • Logical and Mathematical Reasoning
      GPT-4-turbo-2024-04-09 is noted for better performance in logical and mathematical tasks, while GPT-4o can sometimes struggle with maintaining consistency in complex logic.

    • Coding Performance
      GPT-4o can generate longer code segments but may produce verbose outputs, while other versions, like GPT-4-1106, are preferred for efficient, shorter coding tasks.

    • Conceptual Clarity
      GPT-4o tends to produce confident yet incorrect responses (“hallucinations”) more frequently than GPT-4-1106, which performs better on tasks that require precise explanations.

    These examples show why model selection can be as important as prompt refinement. If you’re generating markdown or other structured outputs, model checkpoints may impact consistency in ways that prompt adjustments alone cannot address. For more insights, see the OpenAI Community review on GPT-4o vs. GPT-4-turbo vs. GPT-4-1106.

3. Write Effective Prompts to Achieve Your Goal

  • Focus on the Goal Over Format
    Prompts are generally aimed at achieving a specific goal, such as explaining a technical concept, supporting a learning task, or assisting a customer. Formatting, like markdown or JSON, is often used to support that goal. When designing a prompt, keep the main objective in focus, using format instructions as secondary guidance to clarify presentation.

  • Follow Best Practices for Structuring Prompts for markdown

    1. Define Clear Objectives
      State the main goal (e.g., “Explain Newton’s Laws with examples”) and the desired format (e.g., “Use markdown with headers, bullet points, and math syntax for equations”).

    2. Use Simple, Structured Prompts
      Organize your prompt logically and provide examples to guide the model. For instance, if you need markdown with math:

      Explain Newton's Second Law and use markdown with math syntax. Include the formula as `$$F = ma$$`.
      
    3. Provide Examples (Few-Shot Prompting)
      Including one or two examples of the output format helps the model produce consistent results, particularly for specialized syntax like math in markdown (see the sketch after this list).

    4. Refine the Prompt Based on Results
      Adjust prompts based on output. If the response doesn’t align with expectations, simplify or clarify the prompt in iterative steps.
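
As a concrete illustration of points 2 and 3 above, here is a minimal few-shot sketch. It assumes the OpenAI Node SDK (`openai`) and the `gpt-4o` model, but any chat-completion client and model would work similarly, and the message wording is only an example:

```javascript
import OpenAI from 'openai';

const client = new OpenAI();

const completion = await client.chat.completions.create({
  model: 'gpt-4o',
  messages: [
    {
      role: 'system',
      content:
        'You write GitHub-Flavored markdown. Put display math in `$$` fences on their own lines and inline math between single `$` signs.'
    },
    // One worked example (few-shot) showing the expected shape of the output.
    {role: 'user', content: 'Explain kinetic energy and include the formula.'},
    {
      role: 'assistant',
      content:
        '## Kinetic energy\n\nKinetic energy grows with mass and speed:\n\n$$\nE_k = \\frac{1}{2}mv^2\n$$\n'
    },
    // The actual request, phrased the same way as the example.
    {role: 'user', content: "Explain Newton's Second Law and include the formula."}
  ]
});

console.log(completion.choices[0].message.content);
```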

4. Test Output Quality with Promptfoo: A Step-by-Step Approach

Use Promptfoo to validate that the LLM-generated markdown (or any other structured format) meets both formatting and task-specific goals. Follow these steps to set up and refine your tests:

  1. Define Metrics for Output Quality
    Start by defining clear metrics for success. These could include:

    • Correct use of markdown syntax (e.g., headers, lists, math formatting with $$...$$)
    • Task-specific accuracy (e.g., correctness in explaining a concept)
    • Consistency in tone and style, if applicable
  2. Create Promptfoo Tests Based on Defined Metrics
    Set up Promptfoo tests to measure these metrics. For instance:

    • Test for markdown structure by verifying correct header levels, bullet points, or math syntax
    • Check whether the explanation follows a logical structure and is clearly written (a structure-check sketch in JavaScript follows this list)
  3. Design Mini Test Cases with Expected Outcomes
    Develop mini test cases that cover both “pass” and “fail” conditions based on your metrics. This will help create a confusion matrix that visually shows which cases the model handles well and where it needs improvement. For example:

    • Pass Case: Correctly formatted markdown with headers and math syntax (e.g., $$E = mc^2$$).
    • Fail Case: Missing headers, incorrect math formatting, or inconsistent explanation structure.
  4. Refine Metrics Based on Test Case Insights
    As you see which cases pass or fail, you may find certain metrics need refining to better capture your needs. Adjust metrics as needed to focus on what’s most relevant to your output goals.

  5. Generate Realistic Input Examples
    Create examples that represent realistic prompts users would input. If you’re generating educational content on physics, for instance, use a prompt like “Explain Newton’s Second Law with equations and examples.” Ensure your examples are varied enough to test the model’s response to different input types.

  6. Run Tests to Measure Model Output Against Metrics
    Use Promptfoo to test your prompt on the LLM output, measuring it against the defined metrics and evaluating areas of success and areas for improvement.

  7. Refine the Prompt Based on Metric Feedback
    Based on the test outcomes, adjust your prompt to address any areas that didn’t meet the metrics. For example, if markdown math syntax is inconsistent, clarify this in the prompt with specific examples or further simplify the prompt. Continue refining and retesting until the prompt meets all key metrics reliably.
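
The structure checks from steps 2 and 3 can also be written as a small script and reused across runs. Below is a minimal sketch using remark and remark-math; the `checkMarkdown` helper and its pass criteria are illustrative rather than part of any promptfoo API, and promptfoo assertions such as `llm-rubric` can be used alongside it:

```javascript
import {unified} from 'unified';
import remarkParse from 'remark-parse';
import remarkMath from 'remark-math';
import {visit} from 'unist-util-visit';

// Parse LLM output as markdown and check the structural metrics we care about:
// at least one heading and at least one math node (block or inline).
function checkMarkdown(output) {
  const tree = unified().use(remarkParse).use(remarkMath).parse(output);

  let hasHeading = false;
  let hasMath = false;

  visit(tree, (node) => {
    if (node.type === 'heading') hasHeading = true;
    if (node.type === 'math' || node.type === 'inlineMath') hasMath = true;
  });

  return {pass: hasHeading && hasMath, hasHeading, hasMath};
}

console.log(checkMarkdown('## Newton’s Second Law\n\n$$\nF = ma\n$$\n'));
// => { pass: true, hasHeading: true, hasMath: true }
```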

5. Example: Testing a Physics Explanation with Math in markdown

  • Define Your Example Goal
    Generate an explanation of Newton’s Second Law in markdown with structured headers and math expressions (e.g., $$F = ma$$). Use remark and remark-math to ensure correct parsing and rendering of the math content. One possible target output is sketched after these steps.

  • Testing with Promptfoo

    • Step 1: Create a prompt to explain Newton’s Second Law using markdown, specifying math syntax.
    • Step 2: Set up a Promptfoo test to verify correct markdown structure and the math equation format, utilizing remark and remark-math for accurate processing.
    • Step 3: Run the test, refine the prompt based on results, and repeat until the output meets the desired structure and mathematical clarity.
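
For reference, one possible target output for this example (illustrative wording, not a required answer) could look like:

```markdown
## Newton’s Second Law

The net force on an object equals its mass times its acceleration:

$$
F = ma
$$

For example, a net force of 10 N applied to a 2 kg cart accelerates it at $a = F/m = 5\ \text{m/s}^2$.
```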

6. Summary and Further Resources

  • Recap Key Points
    Prompt structure and model selection both influence output consistency. Testing with Promptfoo and processing output with remark help ensure the output supports the main task while meeting formatting requirements.

  • Further Learning
    Explore markdown syntax resources, Promptfoo documentation, remark plugins, and the Prompting Guide for additional prompt strategies.
