This looks like a prompting/answer-extraction issue. I added the prompt from the llama evals (as `humaneval_instruct`) in the PR, but the score is still lower than the official one (0.5976 vs. 0.726).
I tried to evaluate HumanEval on meta-llama-3.1-instruct but got a score close to 0. I printed the model output and found
I think this may be due to `generation_kwargs.until` in the configuration. So what is the correct way to evaluate?
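A minimal sketch of why `until` stop strings could cause this, assuming the harness truncates each generation at the first occurrence of any stop string (the `apply_until` helper and the stop list below are illustrative, not the actual harness code). Base-model stops such as `"\ndef"` assume the model continues raw code; an instruct model that prefixes its answer with chat text followed by a full `def` block gets cut before any code survives, which would explain a near-zero score:

```python
# Hypothetical helper mimicking stop-string truncation after generation.
def apply_until(generation: str, until: list[str]) -> str:
    """Cut the generation at the earliest occurrence of any stop string."""
    cut = len(generation)
    for stop in until:
        idx = generation.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return generation[:cut]

# An instruct-style reply: chat preamble, then the solution as a new def.
reply = "Here is the solution:\n\ndef add(a, b):\n    return a + b\n"

# With base-model stops, truncation fires on the "\ndef" that *starts*
# the solution, so the extracted completion contains no code at all.
print(repr(apply_until(reply, ["\ndef", "\nclass"])))
# whereas a raw continuation (no leading "\ndef") passes through intact:
print(repr(apply_until("    return a + b\n", ["\ndef", "\nclass"])))
```

If this is what is happening, the fix would be either relaxing `until` for the instruct task or extracting the code block from the chat reply before truncation.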