I believe that the Pearson coefficient does not adequately represent the correlation between the two evaluation methods. Looking at the scatter plot, there are strong discrepancies both when gpt-4 assigns a score of 0 and when it assigns a score of 100: llama3 covers a much wider range of scores in both cases. A correlation based on ranks would probably describe the relationship better. For reference, scipy.stats.pearsonr reports PearsonRResult(statistic=0.8048901206421856, pvalue=6.122429296730889e-24).
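As a minimal sketch of the suggestion, rank-based coefficients (Spearman's rho, Kendall's tau) could be computed alongside Pearson's r with scipy.stats. The `gpt4_scores` and `llama3_scores` arrays below are hypothetical placeholders standing in for the actual score lists from the notebook.

```python
import numpy as np
from scipy import stats

# Hypothetical placeholder scores -- in the notebook these would be the
# gpt-4 and llama3 scores loaded from the evaluation results.
gpt4_scores = np.array([0, 0, 100, 55, 100, 80, 0, 100])
llama3_scores = np.array([20, 65, 40, 50, 95, 75, 10, 85])

# Pearson measures linear association between the raw scores.
pearson_r, pearson_p = stats.pearsonr(gpt4_scores, llama3_scores)

# Spearman and Kendall operate on ranks, so they are less sensitive to the
# wide spread of llama3 scores where gpt-4 assigns the extremes (0 and 100).
spearman_rho, spearman_p = stats.spearmanr(gpt4_scores, llama3_scores)
kendall_tau, kendall_p = stats.kendalltau(gpt4_scores, llama3_scores)

print(f"Pearson r:    {pearson_r:.3f} (p={pearson_p:.2e})")
print(f"Spearman rho: {spearman_rho:.3f} (p={spearman_p:.2e})")
print(f"Kendall tau:  {kendall_tau:.3f} (p={kendall_p:.2e})")
```

If the rank-based coefficients diverge noticeably from Pearson's r, that would support the point that the linear correlation is inflated or deflated by the behavior at the score extremes.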
Replies: 1 comment 1 reply
I updated it now here: https://github.com/rasbt/LLMs-from-scratch/blob/main/ch07/03_model-evaluation/scores/correlation-analysis.ipynb