docs: markdown with large language model evaluation recipe #46
Conversation
I imagine that running `remark` as a lint task in this repo will find some stuff in this document, particularly around word use and line wrapping.
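As a rough sketch of what such a lint task could look like with the remark API (the file name and preset choice here are assumptions, not what this repo actually configures):

```js
import {remark} from 'remark'
import remarkPresetLintConsistent from 'remark-preset-lint-consistent'
import {read} from 'to-vfile'
import {reporter} from 'vfile-reporter'

// Lint one document and print any warnings (inconsistent list
// markers, heading style, and so on).
const file = await remark()
  .use(remarkPresetLintConsistent)
  .process(await read('generating-valid-markdown-with-llms.md'))

console.error(reporter(file))
```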
Using “math” as an example is especially interesting, but also especially error prone, as it isn’t really in GFM either. And the thing supported by github.com isn’t exactly what the micromark extension does either.
I worry about this article going into testing LLMs almost immediately. I assume this article is meant to be shared in discussions with people asking about how to do math. I think they will want an answer on how to change their prompt instead of an article on how to create a robust test framework.
Perhaps that isn’t the goal. What is the goal? Who do you want to reach? How do you want to help them?
Thinking about the audience, I do worry about Python too. Can this be done in JavaScript too?
Or, could this be split? One on how to improve prompts, one on how to test them?
Perhaps even a recipe on how to adjust a prompt for math that works with micromark/remark/react-markdown/etc.?
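A cheap way to sanity-check such a prompt’s output would be to parse it with `remark-math` and confirm math nodes actually come out; a minimal sketch, assuming `$`-delimited math and a made-up sample completion:

```js
import {unified} from 'unified'
import remarkParse from 'remark-parse'
import remarkMath from 'remark-math'
import {visit} from 'unist-util-visit'

// Hypothetical LLM completion; the goal is to confirm the
// $-delimited math parses into math nodes rather than plain text.
const llmOutput = 'Energy: $E = mc^2$\n\n$$\n\\frac{d}{dx}e^x = e^x\n$$\n'

const tree = unified().use(remarkParse).use(remarkMath).parse(llmOutput)

let mathNodes = 0
visit(tree, ['math', 'inlineMath'], () => {
  mathNodes++
})

console.log(mathNodes > 0 ? 'math parsed' : 'no math nodes found')
```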
There’s some editorial stuff that I can take care of later, such as sentence case for headings, wrapping the prose, etc.
- remark
- use
- introduction
sort this list
Will do. I'm also thinking of adding more tags:
tags:
- accuracy
- consistency
- fewshot
- gpt4
- latex
- llms
- markdown
- math
- parsing
- prompt
- promptfoo
- prompting
- refinement
- remark
- structure
- testing
- tools
- validation
title: Generating valid markdown with LLMs
---

# Generating Valid Markdown with Large Language Models: A Guide to Model Evaluations
Title is a bit long, maybe something shorter? I believe SEO and similar want not-too-long titles.
Large language models (LLMs) are increasingly popular for generating Markdown across technical documentation, structured content, and prototyping. However, all LLMs have the potential to hallucinate or produce incorrect Markdown, especially when different standards are present in their training data. Markdown standards like CommonMark and GitHub-Flavored Markdown (GFM) are necessary for ensuring that content displays consistently across tools and platforms, from documentation generators to code repositories. When an LLM’s output deviates from these standards, content can appear incorrectly formatted, fail to render as expected, or disrupt workflows that rely on Markdown’s consistency.
I would stay away from “standards” and prefer “flavors”; I think that better reflects the absence of specifications... the vagueness of it all. Similarly, the word “valid”.

This is also why I use `markdown` instead of `Markdown`. Like how things such as `googling` become something detached from `Google`. The capitalized version to me has a connotation with the OG `Markdown.pl`.
> from “standards” and prefer “flavors”

I reworked this in the outline below.
I imagine it will.

I think we're on the same page; what do you mean by “error prone” here?

Adopters using LLMs. Help with knowing how to create a benchmark of how well an LLM/system prompt is performing; and I need people to understand that, from a prompt perspective, there is no one-size-fits-all answer.

There are some high-level recommendations for different models in https://www.promptingguide.ai/ which we could link to.

There are fewer options, and this is less common in the LLM/data-science communities, but yes.

I'd be cautious; IMO prompt improvements should be guided by benchmarks/empirical measures.
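For what it's worth, one cheap empirical measure along those lines could be lint-message counts per prompt variant; a minimal sketch with made-up sample completions (the preset and the scoring are assumptions, not an agreed methodology):

```js
import {remark} from 'remark'
import remarkPresetLintConsistent from 'remark-preset-lint-consistent'

// Hypothetical metric: fewer lint messages suggests cleaner markdown.
async function lintScore(markdown) {
  const file = await remark().use(remarkPresetLintConsistent).process(markdown)
  return file.messages.length
}

// Made-up completions collected from two prompt variants.
const outputs = {
  baseline: '#Heading\n*  item one\n* item two\n',
  revised: '# Heading\n\n* item one\n* item two\n'
}

for (const [variant, text] of Object.entries(outputs)) {
  console.log(variant, await lintScore(text))
}
```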
These points you write down here, of what people need to get, I didn’t really get from the article. But it is likely that I don’t understand LLMs. It seems to me very very far to go from a
I am not sure that community reads it. Quickly searching through the guide, there is no mention of mdast/micromark/unified, and remark appears only once, for a link to

This too seems very far from the people who don't know markdown/JS/LLMs that post questions in our places?
What do you think about this outline for a potential second version of the article?

Generating Reliable markdown with LLMs

Learning Objectives
By the end of this guide, you will be able to:

1. markdown Basics and Its Variations
2. Understand Why Prompt Design and Model Choice Affect Output Consistency
3. Write Effective Prompts to Achieve Your Goal
4. Test Output Quality with Promptfoo: A Step-by-Step Approach. Use Promptfoo to validate that the LLM-generated markdown (or any other structured format) meets both formatting and task-specific goals. Follow these steps to set up and refine your tests (see the config sketch after this outline):
5. Example: Testing a Physics Explanation with Math in markdown
6. Summary and Further Resources
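A hedged sketch of what step 4's Promptfoo setup might look like, following the documented `promptfooconfig.yaml` shape (the prompt wording, provider id, topic variable, and assertion value are all placeholder assumptions):

```yaml
prompts:
  - 'Explain {{topic}} in CommonMark markdown, using $-delimited math for formulas.'

providers:
  - openai:gpt-4

tests:
  - vars:
      topic: the photoelectric effect
    assert:
      # Cheap structural check: the completion should contain math delimiters.
      - type: contains
        value: '$'
```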
Description of changes
This draft article is open for feedback to determine if it’s a fit for recipes and aligns with maintainer expectations.
It focuses on using model evaluations to generate valid Markdown with LLMs, prioritizing standards like CommonMark and GFM. It covers defining syntax rules, creating test cases, setting metrics, and refining prompts, with an emphasis on standards-compliant Markdown without custom workarounds.
Though it includes more Python than JavaScript, the recipe addresses developer needs around reliable Markdown output, especially the math syntax discussions in remarkjs/remark-math#39.