What's new
Tasks
- LiveCodeBench by @plaguss in #548, #587, #518
- GPQA diamond by @lewtun in #534
- Humanity's last exam by @clefourrier in #520
- Olympiad Bench by @NathanHB in #521
- aime24, 25 and math500 by @NathanHB in #586
- French models evals by @mdiazmel in #505
Metrics
- Pass@k by @clefourrier in #519
- Extractive Match metric by @hynky1999 in #495, #503, #522, #535
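The Pass@k metric (#519) is commonly computed with the standard unbiased estimator: given n sampled generations of which c are correct, pass@k = 1 - C(n-c, k) / C(n, k). A minimal sketch of that estimator (the function name is illustrative, not lighteval's actual API):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn without replacement from n generations (c correct)
    is correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 4 generations of which 2 are correct, pass@1 is 0.5.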
Features
Better logging
- log model config by @NathanHB in #627
- Support custom results/details push to hub by @albertvillanova in #457
- Push details without converting fields to str by @NathanHB in #572
Inference providers
- adds inference providers support by @NathanHB in #616
Load details to be evaluated
- Load predictions from details files and continue evaluating from there by @JoelNiklaus in #488
sglang support
- Let lighteval support sglang by @Jayon02 in #552
Bug fixes and refactoring
- Tiny improvements to `endpoint_model.py`, `base_model.py`, ... by @sadra-barikbin in #219
- Update README.md by @NathanHB in #486
- Fix issue with encodings for together models. by @JoelNiklaus in #483
- Made litellm judge backend more robust. by @JoelNiklaus in #485
- Fix `T_co` import bug by @gucci-j in #484
- fix README link by @vxw3t8fhjsdkghvbdifuk in #500
- Fixed issue with o1 in litellm. by @JoelNiklaus in #493
- Hotfix for litellm judge by @JoelNiklaus in #490
- Made judge response processing more robust. by @JoelNiklaus in #491
- VLLM: Allows for max tokens to be set in model config file by @NathanHB in #547
- Bump up the latex2sympy2_extended version + more tests by @hynky1999 in #510
- Fixed bug importing url_to_fs from fsspec by @LoserCheems in #507
- Fix Ukrainian indices and confirmation word by @ayukh in #516
- Fix VLLM data-parallel by @hynky1999 in #541
- relax spacy import to relax dep by @clefourrier in #622
- vllm fix sampling params by @NathanHB in #625
- relax deps for tgi by @NathanHB in #626
- Bug fix extractive match by @hynky1999 in #540
- Fix loading of vllm model from files by @NathanHB in #533
- fix: broken URLs by @deep-diver in #550
- Fix `gpu_memory_utilisation` typo in vLLM by @tpoisonooo in #553
- allows better flexibility for litellm endpoints by @NathanHB in #549
- Translate task template to Catalan and Galician and fix typos by @mariagrandury in #506
- Relax upper bound on torch by @lewtun in #508
- Fix vLLM generation with sampling params by @lewtun in #578
- Make BLEURT lazy by @hynky1999 in #536
- Fixing backend error in main_sglang. by @TankNee in #597
- VLLM + Math-Verify fixes by @hynky1999 in #603
- raise exception when generation size is more than model length by @NathanHB in #571
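The "Make BLEURT lazy" fix (#536) follows a common pattern: defer loading a heavy scoring model until the metric is first used, so that merely importing the library stays fast. A generic sketch of that pattern (class and loader names are illustrative, not lighteval's implementation):

```python
class LazyMetric:
    """Defer an expensive model load until the first score() call."""

    def __init__(self, loader):
        self._loader = loader      # zero-arg callable that builds the model
        self._model = None         # nothing loaded at construction time

    @property
    def model(self):
        if self._model is None:    # first access triggers the load
            self._model = self._loader()
        return self._model

    def score(self, prediction: str, reference: str) -> float:
        return self.model(prediction, reference)
```

Constructing the metric is free; the loader runs exactly once, on the first `score()` call.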
Thanks
Huge thanks to Hyneck, Lewis, Ben, Agustín, Elie and everyone helping and giving feedback 💙
Significant community contributions
The following contributors have made significant changes to the library over the last release:
- @hynky1999
- Extractive Match metric (#495)
- Fix math extraction (#503)
- Bump up the latex2sympy2_extended version + more tests (#510)
- Math extraction - allow only trying the first match, more customizable latex extraction + bump deps (#522)
- add missing inits (#524)
- Sync Math-verify (#535)
- Make BLEURT lazy (#536)
- Bug fix extractive match (#540)
- Fix VLLM data-parallel (#541)
- VLLM + Math-Verify fixes (#603)
- @plaguss
- LiveCodeBench (#548, #587, #518)
- @Jayon02
- Let lighteval support sglang (#552)
- @NathanHB
- adds olympiad bench (#521)
- Fix loading of vllm model from files (#533)
- [VLLM] Allows for max tokens to be set in model config file (#547)
- allows better flexibility for litellm endpoints (#549)
- raise exception when generation size is more than model length (#571)
- Push details without converting fields to str (#572)
- adds aime24, 25 and math500 (#586)
- adds inference providers support (#616)
- vllm fix sampling params (#625)
- relax deps for tgi (#626)
- log model config (#627)