Releases: huggingface/optimum
v1.8.5: Patch release
Full Changelog: v1.8.4...v1.8.5
v1.8.4: Patch release
- Set onnx requirement by @echarlaix @regisss in #1037
Full Changelog: v1.8.3...v1.8.4
v1.8.3: Patch release
- Fix Stable Diffusion model ONNX export by @echarlaix in #1020
- Add optimum-neuron extra by @michaelbenayoun in #1021
Full Changelog: v1.8.2...v1.8.3
v1.8: extended BetterTransformer support, ONNX merged seq2seq models
Extended BetterTransformer support
Various improvements in the PyTorch BetterTransformer integration.
- [BT] add BetterTransformer support for ProphetNet by @hirotasoshu in #923
- Improve bettertransformer benchmark script by @fxmarty in #939
- Fix sdpa with batch size = 1, better benchmark by @fxmarty in #915
- Fix slow tests & sdpa dropout by @fxmarty in #974
- Remove getattr overhead in spda by @fxmarty in #934
- [BT] Improve docs by @younesbelkada in #944
ONNX merged seq2seq models
Instead of using two separate decoder_model.onnx and decoder_with_past_model.onnx models, a single decoder_model_merged.onnx can now be used for encoder-decoder models. This avoids duplicating weights between the without-past and with-past ONNX models.
By default, decoder_model_merged.onnx is used in the ORTModel integration when available. This can be disabled with the --no-post-process option in the ONNX export CLI, and with use_merged=False in the ORTModel.from_pretrained method.
Example:
optimum-cli export onnx --model t5-small t5_onnx
will give:
└── t5_onnx
├── config.json
├── decoder_model_merged.onnx
├── decoder_model.onnx
├── decoder_with_past_model.onnx
├── encoder_model.onnx
├── generation_config.json
├── special_tokens_map.json
├── spiece.model
├── tokenizer_config.json
└── tokenizer.json
decoder_model_merged.onnx alone is enough for inference. In case the exported model is to be used with another engine than ONNX Runtime through the Optimum integration, we strongly recommend inspecting the subgraphs with netron to understand their inputs and outputs.
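For instance, the exported folder can be loaded directly for inference through the ONNX Runtime integration. A minimal sketch, assuming the t5_onnx directory produced above; use_merged is shown explicitly here, although the merged decoder is picked up by default when present:
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5_onnx")
# use_merged=True selects decoder_model_merged.onnx (assumed to be the default when the file is present)
model = ORTModelForSeq2SeqLM.from_pretrained("t5_onnx", use_merged=True)

inputs = tokenizer("translate English to French: Hello, how are you?", return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))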
- Fix encoder-decoder ONNX merge by @fxmarty in #924
- Support the merge of decoder without/with past for encoder-decoder models in the ONNX export by @fxmarty in #926
- Support merged seq2seq models in ORTModel by @fxmarty in #930
New models in the ONNX export
Major bugfix
- Remove constant output in encoder-decoder ONNX models decoder with past by @fxmarty in #920
- Hash tensor data during deduplication by @VikParuchuri in #932
Potentially breaking changes
The TasksManager replaces legacy task names by the canonical ones used on the Hub and in transformers metadata:
- sequence-classification becomes text-classification
- causal-lm becomes text-generation
- seq2seq-lm becomes text2text-generation
- speech2seq-lm and audio-ctc become automatic-speech-recognition
- default becomes feature-extraction
- masked-lm becomes fill-mask
- vision2seq-lm becomes image-to-text
This should not break anything unless you rely on private methods and attributes of TasksManager.
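For example, an export that previously passed --task sequence-classification would now use the canonical task name; a hedged illustration of the CLI invocation:
optimum-cli export onnx --model distilbert-base-uncased-finetuned-sst-2-english --task text-classification distilbert_onnx/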
What's Changed
- Update ort trainer to transformers 4.27.2 by @JingyaHuang in #917
- Compute Loss inside the training step. by @AdamLouly in #686
- Fix ORTModel MRO for whisper by @fxmarty in #919
- add ORTStableDiffusionPipeline reference in documentation by @echarlaix in #890
- Fix decoder ONNX model loading from the Hub by @fxmarty in #929
- optimum-cli onnxruntime quantize / optimize output argument is now required by @michaelbenayoun in #927
- Register mechanism for the Optimum CLI by @michaelbenayoun in #928
- Ensure backward compatibility of ORTModel by @fxmarty in #933
- Update the README by @michaelbenayoun in #925
- Update README by @echarlaix in #941
- Update readme by @echarlaix in #942
- Remove GC from README by @michaelbenayoun in #943
- Add user and token for CI by @michaelbenayoun in #945
- Update README by @echarlaix in #946
- optimum-cli print the help of subcommands by @michaelbenayoun in #940
- Remove from_transformers references from the documentation by @fxmarty in #935
- Turn command import into optional by @JingyaHuang in #936
- Auto-set use_merged to False if use_cache is passed as False by @fxmarty in #954
- Raise error with use_cache=False, use_io_binding=True by @fxmarty in #955
- Add an ORT training notebook by @JingyaHuang in #959
- Fix issue with doc build sometimes failing silently in GH workflows by @regisss in #960
- Fix typos by @regisss in #963
- Disable tests upon transformers 4.28 release by @fxmarty in #976
New Contributors
- @hirotasoshu made their first contribution in #923
- @VikParuchuri made their first contribution in #932
Full Changelog: v1.7.3...v1.8.2
v1.7.3: Patch release for PyTorch 2.0 and transformers 4.27.0
This patch release fixes a few bugs with the PyTorch 2.0 release and includes a few new features as well.
Breaking change: constant outputs removed from ONNX encoder-decoder models
We removed some constant past key value outputs from encoder-decoder models in the ONNX export. Beware that this could potentially break your existing code, but we recommend using the newly exported models, as this removes unnecessary Identity nodes from the models.
- Remove constant outputs from decoder with past ONNX model for encoder-decoder architectures by @fxmarty in #872
torch.nn.functional.scaled_dot_product_attention support for decoders in BetterTransformer
PyTorch 2.0 introduces in beta torch.nn.functional.scaled_dot_product_attention, a fastpath for attention extending its accelerated transformer features. It is included in optimum.bettertransformer and can be used with the following architectures: Bart, Blenderbot, GPT2, GPT-J, M2M100, Marian, Mbart, OPT, Pegasus, T5.
Beware that this is still experimental and speedups have yet to be validated on all architectures.
PyTorch's scaled_dot_product_attention allows using flash attention and memory-efficient attention natively in PyTorch.
Usage is as follows:
from transformers import AutoTokenizer, AutoModelForCausalLM
from optimum.bettertransformer import BetterTransformer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model = BetterTransformer.transform(model) # modify transformers modeling to use native scaled_dot_product_attention
# do your inference or training here
model = BetterTransformer.reverse(model) # go back to using canonical transformers modeling
model.save_pretrained("gpt2_model")
Inference benchmark (on fp16):
Model | batch size | Input sequence length | Generated tokens | Latency eager (s) | Latency BT (s) | Speedup | Peak memory eager (MB) | Peak memory BT (MB) | Memory savings |
---|---|---|---|---|---|---|---|---|---|
gpt2 | 1 | 64 | 256 | 1.800 | 1.607 | 12.0% | 569.90 | 569.89 | 0% |
gpt2 | 64 | 64 | 256 | 2.159 | 1.617 | 33.5% | 2067.45 | 2093.80 | 0% |
opt-1.3b | 1 | 64 | 256 | 3.010 | 2.667 | 12.9% | 5408.238 | 5408.238 | 0% |
gpt-neox-20b | 1 | 64 | 256 | 10.869 | 9.937 | 9.4% | 83670.67 | 83673.53 | 0% |
Training benchmark (on fp16):
Model | batch size | Sequence length | time/epoch (eager, s) | time/epoch (BT, s) | Speedup | Peak memory eager (MB) | Peak memory BT (MB) | Memory savings |
---|---|---|---|---|---|---|---|---|
gpt2 | 8 | 1024 | 17.732 | 14.037 | 26.3% | 13291.16 | 10191.52 | 30.4% |
gpt2 | 32 | 1024 | 17.336 | 13.309 | 30.3% | 52834.83 | 38858.56 | 36.0% |
gpt2 | 64 | 1024 | OOM | 14.067 | / | OOM | 75600.08 | / |
Benchmarks can be reproduced using the inference script and training script:
python benchmark_bettertransformer.py --model-name gpt2 --use-half --use-cuda --is_decoder --num-batches 5 --max_token 256
python benchmark_bettertransformer.py --model-name gpt2 --use-half --use-cuda --is_decoder --num-batches 5 --max_token 256 --seqlen-stdev 0
- Add scaled_dot_product_attention support for decoder models by @fxmarty in #853
- Support scaled_dot_product_attention for t5 by @fxmarty in #856
- [BT] add decoder benchmark script by @younesbelkada in #857
- [BT] Fix bt benchmark by @younesbelkada in #858
- Fix pytorch version check in bettertransformer by @fxmarty in #862
- [BT] Add fp16 support by @younesbelkada in #859
- [BT] Add decoder training support by @younesbelkada in #860
- Bart support scaled_dot_product_attention by @fxmarty in #863
- [BT] add accelerate_test markers by @younesbelkada in #864
- Mbart, pegasus, blenderbot, marian, m2m_100 support scaled_dot_product_attention by @fxmarty in #865
- Add bettertransformer reverse transform by @fxmarty in #868
- Add bettertransformer training benchmark script by @fxmarty in #873
New architectures in the ONNX export
Three additional architectures are supported in the ONNX export: ImageGPT, RegNet, OPT.
- Adding ONNX support for ImageGPT by @adit299 in #819
- Add ONNX support for RegNet by @asrimanth in #833
- Adding support for Facebook's OPT models by @hivaze in #852
(WIP) TFLite export with quantization support
Continued progress in the TFLite export with quantization support. This is work in progress and not documented yet.
- Quantization with TFLite by @michaelbenayoun in #854
Bugfixes and improvements
- Update documentation by @echarlaix in #843
- Fix typo in documentation by @regisss in #848
- Remove redundant code by @mht-sharma in #841
- Update README by @echarlaix in #850
- Update documentation by @echarlaix in #855
- Remove iobinding ORTModelForCTC by @mht-sharma in #840
- Fix typo in documentation by @echarlaix in #861
- Fix causal-lm ONNX axis names by @fxmarty in #871
- add NNCF openvino notebook by @echarlaix in #875
- Remove positional-only parameters not support by python < v3.8 by @echarlaix in #881
- lazy import for task manager by @JingyaHuang in #844
- Remove onnx and ort dependencies on the TasksManager by @michaelbenayoun in #846
- Reactivate export & optimization tests for causal-lm models by @fxmarty in #885
- Fix ONNX export on transformers 4.27 release by @fxmarty in #884
- Do not use scaled_dot_product_attention for stable diffusion onnx export by @fxmarty in #888
- Fix loading of an ONNX stable diffusion model when config doesn't match by @echarlaix in #887
- Automatic framework detection in TasksManager for large models by @fxmarty in #883
- Fix WavLM onnx export upon torch 2.0 release by @fxmarty in #889
- Fix PushToHubMixin._create_repo according to transformers 4.27 release by @fxmarty in #892
- Fix stable diffusion framework detection by @fxmarty in #893
- Add donut CPU inference ORT by @mht-sharma in #761
- Fix check_model for large merged ONNX models by @fxmarty in #896
- Drop python 3.7 support by @fxmarty in #891
- Fix dummy label generator for vision tasks by @JingyaHuang in #900
- Add stable diffusion dummy object by @echarlaix in #899
- Automatic support for large ONNX models in ORTOptimizer by @fxmarty in #886
- Remove subprocess calls in ONNX export by @fxmarty in #897
- Registering mechanism for the TasksManager by @michaelbenayoun in https://github.com/huggingface/optimum/pull...
v1.7.1: Patch release
Temporarily fix a critical bug in BetterTransformer #849
Full Changelog: v1.7.0...v1.7.1
v1.7.0: ONNX export extension, TFLite export, single-ONNX decoding, ONNX Runtime extension for audio, vision tasks, stable diffusion
New models supported in the ONNX export
Additional architectures are supported in the ONNX export: PoolFormer, Pegasus, Audio Spectrogram Transformer, Hubert, SEW, Speech2Text, UniSpeech, UniSpeech-SAT, Wav2Vec2, Wav2Vec2-Conformer, WavLM, Data2Vec Audio, MPNet, stable diffusion VAE encoder, vision encoder decoder, Nystromformer, Splinter, GPT NeoX.
- Add PoolFormer support in exporters.onnx by @BakingBrains in #646
- Support pegasus exporters by @mht-sharma in #620
- Audio models support with optimum.exporters.onnx by @michaelbenayoun in #622
- Add MPNet ONNX export by @jplu in #691
- Add stable diffusion VAE encoder export by @echarlaix in #705
- Add vision encoder decoder model in exporters by @mht-sharma in #588
- Nystromformer ONNX export by @whr778 in #728
- Support Splinter exporters (#555) by @Allanbeddouk in #736
- Add gpt-neo-x support by @sidthekidder in #745
New models supported in BetterTransformer
A few additional architectures are supported in BetterTransformer: RoCBERT, RoFormer, Marian
- Add RoCBert support for Bettertransformer by @shogohida in #542
- Add better transformer support for RoFormer by @manish-p-gupta in #680
- added BetterTransformer support for Marian by @IlyasMoutawwakil in #808
Additional tasks supported in the ONNX Runtime integration
Additional tasks are supported through ORTModelForMaskedLM, ORTModelForVision2Seq, ORTModelForAudioClassification, ORTModelForCTC, ORTModelForAudioXVector, ORTModelForAudioFrameClassification and ORTStableDiffusionPipeline (see the sketch after the list below).
Reference: https://huggingface.co/docs/optimum/main/en/onnxruntime/package_reference/modeling_ort and https://huggingface.co/docs/optimum/main/en/onnxruntime/usage_guides/models#export-and-inference-of-stable-diffusion-models
- Add ORTModelForMaskedLM class by @JingyaHuang in #729
- Add ORTModelForVision2Seq for VisionEncoderDecoder models inference by @mht-sharma in #742
- Add ORTModelXXX for audio by @mht-sharma in #774
- Add stable diffusion onnx runtime pipeline by @echarlaix in #786
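As an illustration of the new stable diffusion support, a minimal sketch; the export keyword and the diffusers-style pipeline call are assumptions and may differ slightly depending on the version:
from optimum.onnxruntime import ORTStableDiffusionPipeline

# export the PyTorch pipeline to ONNX and run it with ONNX Runtime
pipeline = ORTStableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", export=True)
image = pipeline("a photo of an astronaut riding a horse").images[0]
image.save("astronaut.png")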
Support of the ONNX export from PyTorch on float16
In the ONNX export, it is possible to pass the options --fp16 --device cuda to export in float16 when a GPU is available, directly with the native torch.onnx.export.
Example: optimum-cli export onnx --model gpt2 --fp16 --device cuda gpt2_onnx/
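The resulting float16 model can then be loaded on GPU through the ONNX Runtime integration; a minimal sketch, assuming the gpt2_onnx folder from the command above and the CUDA execution provider:
from optimum.onnxruntime import ORTModelForCausalLM

# load the float16 export on GPU by selecting the CUDA execution provider
model = ORTModelForCausalLM.from_pretrained("gpt2_onnx", provider="CUDAExecutionProvider")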
TFLite export
TFLite export is now supported, with static shapes:
optimum-cli export tflite --help
optimum-cli export tflite --model bert-base-uncased --sequence_length 128 bert_tflite/
- exporters.tflite initial support by @michaelbenayoun in #716
- TFLite auto-encoder models by @michaelbenayoun in #757
- [TFLite Export] Adds support for ResNet by @sayakpaul in #813
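Once exported, the model can be run with the standard TensorFlow Lite interpreter. A minimal sketch, assuming the export above writes bert_tflite/model.tflite (the exact file name is an assumption):
import numpy as np
import tensorflow as tf

# load the exported TFLite model (file name assumed)
interpreter = tf.lite.Interpreter(model_path="bert_tflite/model.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# feed dummy inputs matching the static shapes baked in at export time
for detail in input_details:
    dummy = np.zeros(detail["shape"], dtype=detail["dtype"])
    interpreter.set_tensor(detail["index"], dummy)

interpreter.invoke()
print(interpreter.get_tensor(output_details[0]["index"]).shape)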
ONNX Runtime optimization and quantization directly in the CLI
- Add optimize and quantize command CLI by @jplu in #700
- Support ONNX Runtime optimizations in exporters.onnx by @fxmarty in #807
The ONNX export optionally applies ONNX Runtime optimizations directly during the export, by passing the --optimize option, from --optimize O1 up to --optimize O4:
optimum-cli export onnx --help
optimum-cli export onnx --model t5-small --optimize O3 t5small_onnx/
ONNX Runtime quantization is supported directly from the command line, using optimum-cli onnxruntime quantize:
optimum-cli onnxruntime quantize --help
optimum-cli onnxruntime quantize --onnx_model distilbert_onnx --avx512
ONNX Runtime optimization is supported directly from the command line, using optimum-cli onnxruntime optimize:
optimum-cli onnxruntime optimize --help
optimum-cli onnxruntime optimize --onnx_model distilbert_onnx -O3
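The same quantization can also be done from Python; a minimal sketch, assuming the ORTQuantizer / AutoQuantizationConfig API and the distilbert_onnx folder from the command above:
from optimum.onnxruntime import ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# dynamic quantization targeting AVX-512, mirroring the --avx512 CLI flag
qconfig = AutoQuantizationConfig.avx512(is_static=False, per_channel=False)
quantizer = ORTQuantizer.from_pretrained("distilbert_onnx")
quantizer.quantize(save_dir="distilbert_onnx_quantized", quantization_config=qconfig)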
ORTModelForCausalLM supports decoding with a single ONNX
Up to now, two ONNX files were used for decoders:
- One handling the first forward pass where no past key values have been cached yet - thus not taking them as input.
- One handling the following forward pass where past key values have been cached, thus taking them as input.
This release introduces support, in the ONNX export and in ORTModelForCausalLM, for a single ONNX handling both steps of the decoding. This reduces memory usage, as weights are no longer duplicated between two separate models during inference.
A single ONNX for decoders can be used by passing use_merged=True to ORTModelForCausalLM.from_pretrained, loading directly from a PyTorch model:
from optimum.onnxruntime import ORTModelForCausalLM
model = ORTModelForCausalLM.from_pretrained("gpt2", export=True, use_merged=True)
Alternatively, a single ONNX for decoders is the default behavior in the ONNX export, and the result can later be used for example with ORTModelForCausalLM. The command optimum-cli export onnx --model gpt2 gpt2_onnx/ will produce:
└── gpt2_onnx
├── config.json
├── decoder_model_merged.onnx
├── decoder_model.onnx
├── decoder_with_past_model.onnx
├── merges.txt
├── special_tokens_map.json
├── tokenizer_config.json
├── tokenizer.json
└── vocab.json
The decoder_model.onnx and decoder_with_past_model.onnx files are kept for backward compatibility, but during inference decoder_model_merged.onnx alone is enough, as shown in the sketch below.
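A minimal sketch of running generation from the exported folder, assuming the gpt2_onnx directory from the command above; the merged decoder is picked up by default when present:
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2_onnx")
model = ORTModelForCausalLM.from_pretrained("gpt2_onnx")  # decoder_model_merged.onnx is used when available

inputs = tokenizer("My name is", return_tensors="pt")
generated = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.batch_decode(generated))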
- Enable inference with a merged decoder in ORTModelForCausalLM by @JingyaHuang in #647
Single-file ORTModel accepts NumPy arrays
ORTModel accepts NumPy arrays as inputs, in addition to PyTorch tensors. This is only the case for models that use a single ONNX file.
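A minimal sketch of feeding NumPy inputs, assuming a single-ONNX model such as a sequence classification model:
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForSequenceClassification

model_id = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)

# return_tensors="np" yields NumPy arrays, passed directly to the ORTModel
inputs = tokenizer("NumPy inputs work out of the box", return_tensors="np")
outputs = model(**inputs)
print(outputs.logits)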
ORTOptimizer support for ORTModelForCausalLM
- ORTOptimizer support ORTModelForCausalLM by @fxmarty in #794
- Support IO Binding for merged decoder by @fxmarty in #797
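A minimal sketch of optimizing a decoder model with ORTOptimizer, assuming the OptimizationConfig API and optimization level 2:
from optimum.onnxruntime import ORTModelForCausalLM, ORTOptimizer
from optimum.onnxruntime.configuration import OptimizationConfig

model = ORTModelForCausalLM.from_pretrained("gpt2", export=True)
optimizer = ORTOptimizer.from_pretrained(model)

# graph fusions at level 2, written to a new folder
optimization_config = OptimizationConfig(optimization_level=2)
optimizer.optimize(save_dir="gpt2_onnx_optimized", optimization_config=optimization_config)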
Breaking changes
- In the ONNX export, exporting models in several ONNX files (encoder, decoder) is now the default behavior: #747. The old behavior is still accessible with --monolith.
- In decoders, reusing past key values is now the default in the ONNX export: #748. The old behavior is still accessible by explicitly passing, for example, --task causal-lm instead of --task causal-lm-with-past.
- BigBird support in the ONNX export is removed, due to the block_sparse attention type being written in pure numpy in Transformers, and hence not exportable to ONNX: #778
- The parameter from_transformers of ORTModel.from_pretrained will be deprecated in favor of export (see the sketch after this list).
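A minimal sketch of the new loading argument replacing from_transformers; presumably both spellings are accepted during the deprecation period:
from optimum.onnxruntime import ORTModelForSequenceClassification

model_id = "distilbert-base-uncased-finetuned-sst-2-english"

# previously: ORTModelForSequenceClassification.from_pretrained(model_id, from_transformers=True)
model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)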
Bugfixes and improvements
- Fix disable shape inference for optimization by @regisss in #652
- Fix uninformative message when passing use_cache=True to ORTModel and no ONNX with cache is available by @fxmarty in #650
- Fix provider options when several providers are passed by @fxmarty in #653
- Add TensorRT engine to ONNX Runtime GPU documentation by @fxmarty in #657
- Improve documentation around ONNX export by @fxmarty in #666
- minor updates on ONNX config guide by @mszsorondo in #662
- Fix FlaubertOnnxConfig by @michaelbenayoun in #669
- Use nvcr.io/nvidia/tensorrt image for GPU tests by @fxmarty in #660
- Better Transformer doc fix by @HamidShojanazeri in #670
- Add support for LongT5 optimization using ORT transformer optimizer script by @kunal-vaishnavi in #683
- Add test for missing execution providers error messages by @fxmarty in #659
- ONNX transformation to cast int64 constants to int32 when possible by @fxmarty in #655
- Add missing normalized configs by @fxmarty in #694
- Remove code duplication in ORTModel's load_model by @fxmarty in #695
- Test more architectures in ORTModel by @fxmarty in #675
- Avoid initializing unwanted attributes for ORTModel's having several inference sessions by @fxmarty in #696
- Fix the ORTQuantizer loading from specific file by @echarlaix in #701
- Add saving of diffusion model additional components ...
v1.6.4: Patch release
Bugfix
- Fix past key/value reuse in decoders following transformers 4.26.0 release and renaming: b9211d6
- ONNX Runtime 1.14 support: #772
Full Changelog: v1.6.3...v1.6.4
v1.6.3: Patch release
Fixes ORTTrainer for inference with the ONNX Runtime backend.
v1.6.2: Patch release
Hotfixes
Regressions
The export of speech-to-text architectures as a single ONNX file (handling both the encoding and decoding) fails due to a regression with the latest transformers version: #721
Full Changelog: v1.6.1...v1.6.2