Sort metrics table by combined performance score (CPS) with radar chart to dynamically adjust metric weights #223
Since its beginning, Matbench Discovery has used the F1 discovery score as the default metric for ranking models. Users could click other metric columns in the metrics table to sort by them, but the initial page load was always ranked by F1.
This PR defines a new combined performance score $\text{CPS} \in [0, 1]$, which is currently the weighted average $50\% \ \text{F1} + 40\% \ \kappa_\text{SRME} + 10\% \ \text{RMSD}$ but will be extended to additional metrics in the future. Over the course of this leaderboard's evolution, the CPS metric is meant to combine the highest-signal metrics into a single number that best reflects a model's overall utility across different simulation tasks.
Since user needs and opinions on what makes a good model differ widely across the community, we made it easy to dynamically adjust the weighting between different metrics with a single click. This lets the leaderboard prioritize the metrics that best align with whatever simulation task a user has in mind.
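As a rough illustration of how such a user-weighted combination can work, here is a minimal TypeScript sketch. The names (`MetricScores`, `combineScores`, `defaultWeights`) are hypothetical, not the actual `metrics.ts` implementation, and each score is assumed to already be normalized to $[0, 1]$ with higher = better:

```ts
// Minimal sketch of a user-weighted metric combination (hypothetical names,
// not the actual metrics.ts implementation).
type MetricScores = { F1: number; kappa_SRME: number; RMSD: number }

const defaultWeights: MetricScores = { F1: 0.5, kappa_SRME: 0.4, RMSD: 0.1 }

function combineScores(scores: MetricScores, weights = defaultWeights): number {
  // Re-normalize the weights so user-adjusted values always sum to 1,
  // keeping CPS in [0, 1] no matter how the radar chart knobs are dragged.
  const total = weights.F1 + weights.kappa_SRME + weights.RMSD
  return (
    (weights.F1 * scores.F1 +
      weights.kappa_SRME * scores.kappa_SRME +
      weights.RMSD * scores.RMSD) /
    total
  )
}
```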
Metric Combination Logic
RMSD can take values in $[0, \infty)$. The current worst model on this metric has $\text{RMSD} = 0.0227$. Models will hopefully get better, not worse, on this metric, so we pick $[0, 0.03]$ as the range from which RMSD values are normalized into $[0, 1]$. That is, a perfect model achieving $\text{RMSD} = 0$ maps to 1 before entering the CPS weighting function.

The F1 score is already a normalized metric in the range $[0, 1]$ with higher = better and requires no further normalization. It enters CPS directly.

For models where any of F1, $\kappa_\text{SRME}$ or RMSD are `NaN`, the CPS is also set to `NaN`, which gets sorted to the bottom of the ranking by default. Energy-only models which don't offer geometry optimization and force predictions for phonon modeling fall into this category, but they were already ranked at the bottom under the previous F1 ranking, so little changes for them. MLFFs that could deliver all 3 metrics but haven't yet submitted them are invited to do so; those are GNoME and MatterSim v1 5M.
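A minimal sketch of the RMSD normalization and `NaN` handling described above. The helper names are assumptions rather than the actual `metrics.ts` code, and the $\kappa_\text{SRME}$ argument is assumed to already be mapped onto $[0, 1]$ (its normalization is not spelled out here):

```ts
// Map RMSD from the assumed [0, 0.03] range onto [0, 1], inverted so a
// perfect RMSD of 0 maps to 1 and anything at or above 0.03 maps to 0.
function normalizeRmsd(rmsd: number, worst = 0.03): number {
  return Math.max(0, 1 - rmsd / worst)
}

// F1 is already in [0, 1] with higher = better and passes through unchanged.
// kappaScore is assumed to be κ_SRME already mapped onto [0, 1].
function cpsOrNaN(f1: number, kappaScore: number, rmsd: number): number {
  const scores = [f1, kappaScore, normalizeRmsd(rmsd)]
  // Any missing metric makes the combined score NaN, which is sorted to
  // the bottom of the default ranking.
  if (scores.some(Number.isNaN)) return NaN
  const weights = [0.5, 0.4, 0.1] // F1, κ_SRME, RMSD
  return scores.reduce((sum, score, idx) => sum + weights[idx] * score, 0)
}
```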
The current ranking for all models now looks like this:
Major changes:

- `RadarChart.svelte` for visualizing metric weights, with drag/drop knobs to adjust them
- `TableControls.svelte` component for managing filters and column visibility
- `metrics.ts` with new geometry optimization metrics and scaling functions for combining disparate metrics into a single score