Releases: huggingface/optimum
v1.8.5: Patch release
Full Changelog: v1.8.4...v1.8.5
v1.8.4: Patch release
- Set onnx requirement by @echarlaix @regisss in #1037
Full Changelog: v1.8.3...v1.8.4
v1.8.3: Patch release
- Fix Stable Diffusion model ONNX export by @echarlaix in #1020
- Add optimum-neuron extra by @michaelbenayoun in #1021
Full Changelog: v1.8.2...v1.8.3
v1.8: extended BetterTransformer support, ONNX merged seq2seq models
Extended BetterTransformer support
Various improvements in the PyTorch BetterTransformer integration.
- [BT] add BetterTransformer support for ProphetNet by @hirotasoshu in #923
- Improve bettertransformer benchmark script by @fxmarty in #939
- Fix sdpa with batch size = 1, better benchmark by @fxmarty in #915
- Fix slow tests & sdpa dropout by @fxmarty in #974
- Remove getattr overhead in spda by @fxmarty in #934
- [BT] Improve docs by @younesbelkada in #944
ONNX merged seq2seq models
Instead of using two separate decoder_model.onnx and decoder_with_past_model.onnx models, a single decoder_model_merged.onnx can now be used for encoder-decoder models. This avoids duplicating weights between the without-past and with-past ONNX models.
By default, decoder_model_merged.onnx is used in the ORTModel integration when available. This can be disabled with the --no-post-process option in the ONNX export CLI, and with use_merged=False in the ORTModel.from_pretrained method.
Example:
optimum-cli export onnx --model t5-small t5_onnx
will give:
└── t5_onnx
├── config.json
├── decoder_model_merged.onnx
├── decoder_model.onnx
├── decoder_with_past_model.onnx
├── encoder_model.onnx
├── generation_config.json
├── special_tokens_map.json
├── spiece.model
├── tokenizer_config.json
└── tokenizer.json
decoder_model_merged.onnx alone is enough for inference. In case the exported model is to be used with another engine than ONNX Runtime through the Optimum integration, we strongly recommend inspecting the subgraphs with netron to understand their inputs and outputs.
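For instance, the exported folder can be loaded directly for inference through the ONNX Runtime integration. A minimal sketch, assuming the t5_onnx directory produced above; use_merged is shown explicitly here, although the merged decoder is picked up by default when present:
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5_onnx")
# use_merged=True selects decoder_model_merged.onnx (assumed to be the default when the file is present)
model = ORTModelForSeq2SeqLM.from_pretrained("t5_onnx", use_merged=True)

inputs = tokenizer("translate English to French: Hello, how are you?", return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))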
- Fix encoder-decoder ONNX merge by @fxmarty in #924
- Support the merge of decoder without/with past for encoder-decoder models in the ONNX export by @fxmarty in #926
- Support merged seq2seq models in ORTModel by @fxmarty in #930
New models in the ONNX export
Major bugfix
- Remove constant output in encoder-decoder ONNX models decoder with past by @fxmarty in #920
- Hash tensor data during deduplication by @VikParuchuri in #932
Potentially breaking changes
The TasksManager replaces legacy task names by the canonical ones used on the Hub and in transformers metadata:
- sequence-classification becomes text-classification
- causal-lm becomes text-generation
- seq2seq-lm becomes text2text-generation
- speech2seq-lm and audio-ctc become automatic-speech-recognition
- default becomes feature-extraction
- masked-lm becomes fill-mask
- vision2seq-lm becomes image-to-text
This should not break anything unless you rely on private methods and attributes of TasksManager.
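For example, an export that previously passed --task sequence-classification would now use the canonical task name; a hedged illustration of the CLI invocation:
optimum-cli export onnx --model distilbert-base-uncased-finetuned-sst-2-english --task text-classification distilbert_onnx/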
What's Changed
- Update ort trainer to transformers 4.27.2 by @JingyaHuang in #917
- Compute Loss inside the training step. by @AdamLouly in #686
- Fix ORTModel MRO for whisper by @fxmarty in #919
- add ORTStableDiffusionPipeline reference in documentation by @echarlaix in #890
- Fix decoder ONNX model loading from the Hub by @fxmarty in #929
- optimum-cli onnxruntime quantize / optimize output argument is now required by @michaelbenayoun in #927
- Register mechanism for the Optimum CLI by @michaelbenayoun in #928
- Ensure backward compatibility of ORTModel by @fxmarty in #933
- Update the README by @michaelbenayoun in #925
- Update README by @echarlaix in #941
- Update readme by @echarlaix in #942
- Remove GC from README by @michaelbenayoun in #943
- Add user and token for CI by @michaelbenayoun in #945
- Update README by @echarlaix in #946
- optimum-cli print the help of subcommands by @michaelbenayoun in #940
- Remove from_transformers references from the documentation by @fxmarty in #935
- Turn command import into optional by @JingyaHuang in #936
- Auto-set use_merged to False if use_cache is passed as False by @fxmarty in #954
- Raise error with use_cache=False, use_io_binding=True by @fxmarty in #955
- Add an ORT training notebook by @JingyaHuang in #959
- Fix issue with doc build sometimes failing silently in GH workflows by @regisss in #960
- Fix typos by @regisss in #963
- Disable tests upon transformers 4.28 release by @fxmarty in #976
New Contributors
- @hirotasoshu made their first contribution in #923
- @VikParuchuri made their first contribution in #932
Full Changelog: v1.7.3...v1.8.2
v1.7.3: Patch release for PyTorch 2.0 and transformers 4.27.0
This patch release fixes a few bugs with the PyTorch 2.0 release and includes a few new features as well.
Breaking change: constant outputs removed from ONNX encoder-decoder models
We removed some constant past key value outputs from encoder-decoder models in the ONNX export. Beware that this could potentially break your existing code, but we recommend using the newly exported models, as this removes unnecessary Identity nodes from the models.
- Remove constant outputs from decoder with past ONNX model for encoder-decoder architectures by @fxmarty in #872
torch.nn.functional.scaled_dot_product_attention support for decoders in BetterTransformer
PyTorch 2.0 introduces in beta torch.nn.functional.scaled_dot_product_attention, a fastpath for attention extending its accelerated transformer features. It is included in optimum.bettertransformer and can be used with the following architectures: Bart, Blenderbot, GPT2, GPT-J, M2M100, Marian, Mbart, OPT, Pegasus, T5.
Beware that this is still experimental and speedups have yet to be validated on all architectures.
PyTorch's scaled_dot_product_attention allows using flash attention and memory-efficient attention natively in PyTorch.
Usage is as follows:
from transformers import AutoTokenizer, AutoModelForCausalLM
from optimum.bettertransformer import BetterTransformer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model = BetterTransformer.transform(model) # modify transformers modeling to use native scaled_dot_product_attention
# do your inference or training here
model = BetterTransformer.reverse(model) # go back to using canonical transformers modeling
model.save_pretrained("gpt2_model")
Inference benchmark (on fp16):
Model | batch size | Input sequence length | Generated tokens | Latency eager (s) | Latency BT (s) | Speedup | Peak memory eager (MB) | Peak memory BT (MB) | Memory savings |
---|---|---|---|---|---|---|---|---|---|
gpt2 | 1 | 64 | 256 | 1.800 | 1.607 | 12.0% | 569.90 | 569.89 | 0% |
gpt2 | 64 | 64 | 256 | 2.159 | 1.617 | 33.5% | 2067.45 | 2093.80 | 0% |
opt-1.3b | 1 | 64 | 256 | 3.010 | 2.667 | 12.9% | 5408.238 | 5408.238 | 0% |
gpt-neox-20b | 1 | 64 | 256 | 10.869 | 9.937 | 9.4% | 83670.67 | 83673.53 | 0% |
Training benchmark (on fp16):
Model | batch size | Sequence length | time/epoch (eager, s) | time/epoch (BT, s) | Speedup | Peak memory eager (MB) | Peak memory BT (MB) | Memory savings |
---|---|---|---|---|---|---|---|---|
gpt2 | 8 | 1024 | 17.732 | 14.037 | 26.3% | 13291.16 | 10191.52 | 30.4% |
gpt2 | 32 | 1024 | 17.336 | 13.309 | 30.3% | 52834.83 | 38858.56 | 36.0% |
gpt2 | 64 | 1024 | OOM | 14.067 | / | OOM | 75600.08 | / |
Benchmarks can be reproduced using the inference script and training script:
python benchmark_bettertransformer.py --model-name gpt2 --use-half --use-cuda --is_decoder --num-batches 5 --max_token 256
python benchmark_bettertransformer.py --model-name gpt2 --use-half --use-cuda --is_decoder --num-batches 5 --max_token 256 --seqlen-stdev 0
- Add scaled_dot_product_attention support for decoder models by @fxmarty in #853
- Support scaled_dot_product_attention for t5 by @fxmarty in #856
- [BT] add decoder benchmark script by @younesbelkada in #857
- [BT] Fix bt benchmark by @younesbelkada in #858
- Fix pytorch version check in bettertransformer by @fxmarty in #862
- [BT] Add fp16 support by @younesbelkada in #859
- [BT] Add decoder training support by @younesbelkada in #860
- Bart support scaled_dot_product_attention by @fxmarty in #863
- [BT] add accelerate_test markers by @younesbelkada in #864
- Mbart, pegasus, blenderbot, marian, m2m_100 support scaled_dot_product_attention by @fxmarty in #865
- Add bettertransformer reverse transform by @fxmarty in #868
- Add bettertransformer training benchmark script by @fxmarty in #873
New architectures in the ONNX export
Three additional architectures are supported in the ONNX export: ImageGPT, RegNet, OPT.
- Adding ONNX support for ImageGPT by @adit299 in #819
- Add ONNX support for RegNet by @asrimanth in #833
- Adding support for Facebook's OPT models by @hivaze in #852
(WIP) TFLite export with quantization support
Continued progress in the TFLite export with quantization support. This is work in progress and not documented yet.
- Quantization with TFLite by @michaelbenayoun in #854
Bugfixes and improvements
- Update documentation by @echarlaix in #843
- Fix typo in documentation by @regisss in #848
- Remove redundant code by @mht-sharma in #841
- Update README by @echarlaix in #850
- Update documentation by @echarlaix in #855
- Remove iobinding ORTModelForCTC by @mht-sharma in #840
- Fix typo in documentation by @echarlaix in #861
- Fix causal-lm ONNX axis names by @fxmarty in #871
- add NNCF openvino notebook by @echarlaix in #875
- Remove positional-only parameters not support by python < v3.8 by @echarlaix in #881
- lazy import for task manager by @JingyaHuang in #844
- Remove onnx and ort dependencies on the TasksManager by @michaelbenayoun in #846
- Reactivate export & optimization tests for causal-lm models by @fxmarty in #885
- Fix ONNX export on transformers 4.27 release by @fxmarty in #884
- Do not use scaled_dot_product_attention for stable diffusion onnx export by @fxmarty in #888
- Fix loading of an ONNX stable diffusion model when config doesn't match by @echarlaix in #887
- Automatic framework detection in TasksManager for large models by @fxmarty in #883
- Fix WavLM onnx export upon torch 2.0 release by @fxmarty in #889
- Fix PushToHubMixin._create_repo according to transformers 4.27 release by @fxmarty in #892
- Fix stable diffusion framework detection by @fxmarty in #893
- Add donut CPU inference ORT by @mht-sharma in #761
- Fix check_model for large merged ONNX models by @fxmarty in #896
- Drop python 3.7 support by @fxmarty in #891
- Fix dummy label generator for vision tasks by @JingyaHuang in #900
- Add stable diffusion dummy object by @echarlaix in #899
- Automatic support for large ONNX models in ORTOptimizer by @fxmarty in #886
- Remove subprocess calls in ONNX export by @fxmarty in #897
- Registering mechanism for the TasksManager by @michaelbenayoun in https://github.com/huggingface/optimum/pull...
v1.7.1: Patch release
Temporarily fix a critical bug in BetterTransformer #849
Full Changelog: v1.7.0...v1.7.1
v1.7.0: ONNX export extension, TFLite export, single-ONNX decoding, ONNX Runtime extension for audio, vision tasks, stable diffusion
New models supported in the ONNX export
Additional architectures are supported in the ONNX export: PoolFormer, Pegasus, Audio Spectrogram Transformer, Hubert, SEW, Speech2Text, UniSpeech, UniSpeech-SAT, Wav2Vec2, Wav2Vec2-Conformer, WavLM, Data2Vec Audio, MPNet, stable diffusion VAE encoder, vision encoder decoder, Nystromformer, Splinter, GPT NeoX.
- Add PoolFormer support in exporters.onnx by @BakingBrains in #646
- Support pegasus exporters by @mht-sharma in #620
- Audio models support with optimum.exporters.onnx by @michaelbenayoun in #622
- Add MPNet ONNX export by @jplu in #691
- Add stable diffusion VAE encoder export by @echarlaix in #705
- Add vision encoder decoder model in exporters by @mht-sharma in #588
- Nystromformer ONNX export by @whr778 in #728
- Support Splinter exporters (#555) by @Allanbeddouk in #736
- Add gpt-neo-x support by @sidthekidder in #745
New models supported in BetterTransformer
A few additional architectures are supported in BetterTransformer: RoCBERT, RoFormer, Marian
- Add RoCBert support for Bettertransformer by @shogohida in #542
- Add better transformer support for RoFormer by @manish-p-gupta in #680
- added BetterTransformer support for Marian by @IlyasMoutawwakil in #808
Additional tasks supported in the ONNX Runtime integration
Additional tasks are supported through ORTModelForMaskedLM, ORTModelForVision2Seq, ORTModelForAudioClassification, ORTModelForCTC, ORTModelForAudioXVector, ORTModelForAudioFrameClassification and ORTStableDiffusionPipeline (see the sketch after the list below).
Reference: https://huggingface.co/docs/optimum/main/en/onnxruntime/package_reference/modeling_ort and https://huggingface.co/docs/optimum/main/en/onnxruntime/usage_guides/models#export-and-inference-of-stable-diffusion-models
- Add ORTModelForMaskedLM class by @JingyaHuang in #729
- Add ORTModelForVision2Seq for VisionEncoderDecoder models inference by @mht-sharma in #742
- Add ORTModelXXX for audio by @mht-sharma in #774
- Add stable diffusion onnx runtime pipeline by @echarlaix in #786
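As an illustration of the new stable diffusion support, a minimal sketch; the export keyword and the diffusers-style pipeline call are assumptions and may differ slightly depending on the version:
from optimum.onnxruntime import ORTStableDiffusionPipeline

# export the PyTorch pipeline to ONNX and run it with ONNX Runtime
pipeline = ORTStableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", export=True)
image = pipeline("a photo of an astronaut riding a horse").images[0]
image.save("astronaut.png")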
Support of the ONNX export from PyTorch on float16
In the ONNX export, it is possible to pass the options --fp16 --device cuda to export in float16 when a GPU is available, directly with the native torch.onnx.export.
Example: optimum-cli export onnx --model gpt2 --fp16 --device cuda gpt2_onnx/
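The resulting float16 model can then be loaded on GPU through the ONNX Runtime integration; a minimal sketch, assuming the gpt2_onnx folder from the command above and the CUDA execution provider:
from optimum.onnxruntime import ORTModelForCausalLM

# load the float16 export on GPU by selecting the CUDA execution provider
model = ORTModelForCausalLM.from_pretrained("gpt2_onnx", provider="CUDAExecutionProvider")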
TFLite export
TFLite export is now supported, with static shapes:
optimum-cli export tflite --help
optimum-cli export tflite --model bert-base-uncased --sequence_length 128 bert_tflite/
- exporters.tflite initial support by @michaelbenayoun in #716
- TFLite auto-encoder models by @michaelbenayoun in #757
- [TFLite Export] Adds support for ResNet by @sayakpaul in #813
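Once exported, the model can be run with the standard TensorFlow Lite interpreter. A minimal sketch, assuming the export above writes bert_tflite/model.tflite (the exact file name is an assumption):
import numpy as np
import tensorflow as tf

# load the exported TFLite model (file name assumed)
interpreter = tf.lite.Interpreter(model_path="bert_tflite/model.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# feed dummy inputs matching the static shapes baked in at export time
for detail in input_details:
    dummy = np.zeros(detail["shape"], dtype=detail["dtype"])
    interpreter.set_tensor(detail["index"], dummy)

interpreter.invoke()
print(interpreter.get_tensor(output_details[0]["index"]).shape)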
ONNX Runtime optimization and quantization directly in the CLI
- Add optimize and quantize command CLI by @jplu in #700
- Support ONNX Runtime optimizations in exporters.onnx by @fxmarty in #807
The ONNX export optionally applies ONNX Runtime optimizations directly during the export, by passing the --optimize option, from --optimize O1 up to --optimize O4:
optimum-cli export onnx --help
optimum-cli export onnx --model t5-small --optimize O3 t5small_onnx/
ONNX Runtime quantization is supported directly from the command line, using optimum-cli onnxruntime quantize:
optimum-cli onnxruntime quantize --help
optimum-cli onnxruntime quantize --onnx_model distilbert_onnx --avx512
ONNX Runtime optimization is supported directly from the command line, using optimum-cli onnxruntime optimize:
optimum-cli onnxruntime optimize --help
optimum-cli onnxruntime optimize --onnx_model distilbert_onnx -O3
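The same quantization can also be done from Python; a minimal sketch, assuming the ORTQuantizer / AutoQuantizationConfig API and the distilbert_onnx folder from the command above:
from optimum.onnxruntime import ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# dynamic quantization targeting AVX-512, mirroring the --avx512 CLI flag
qconfig = AutoQuantizationConfig.avx512(is_static=False, per_channel=False)
quantizer = ORTQuantizer.from_pretrained("distilbert_onnx")
quantizer.quantize(save_dir="distilbert_onnx_quantized", quantization_config=qconfig)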
ORTModelForCausalLM supports decoding with a single ONNX
Up to now, two ONNX files were used for decoders:
- One handling the first forward pass where no past key values have been cached yet - thus not taking them as input.
- One handling the following forward pass where past key values have been cached, thus taking them as input.
This release introduces support, in the ONNX export and in ORTModelForCausalLM, for a single ONNX handling both steps of the decoding. This reduces memory usage, as weights are no longer duplicated between two separate models during inference.
A single ONNX for decoders can be used by passing use_merged=True to ORTModelForCausalLM.from_pretrained, loading directly from a PyTorch model:
from optimum.onnxruntime import ORTModelForCausalLM
model = ORTModelForCausalLM.from_pretrained("gpt2", export=True, use_merged=True)
Alternatively, a single ONNX for decoders is the default behavior in the ONNX export, and the result can later be used for example with ORTModelForCausalLM. The command optimum-cli export onnx --model gpt2 gpt2_onnx/ will produce:
└── gpt2_onnx
├── config.json
├── decoder_model_merged.onnx
├── decoder_model.onnx
├── decoder_with_past_model.onnx
├── merges.txt
├── special_tokens_map.json
├── tokenizer_config.json
├── tokenizer.json
└── vocab.json
The decoder_model.onnx and decoder_with_past_model.onnx files are kept for backward compatibility, but during inference decoder_model_merged.onnx alone is enough, as shown in the sketch below.
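A minimal sketch of running generation from the exported folder, assuming the gpt2_onnx directory from the command above; the merged decoder is picked up by default when present:
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2_onnx")
model = ORTModelForCausalLM.from_pretrained("gpt2_onnx")  # decoder_model_merged.onnx is used when available

inputs = tokenizer("My name is", return_tensors="pt")
generated = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.batch_decode(generated))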
- Enable inference with a merged decoder in ORTModelForCausalLM by @JingyaHuang in #647
Single-file ORTModel accepts NumPy arrays
ORTModel accepts NumPy arrays as inputs, in addition to PyTorch tensors. This is only the case for models that use a single ONNX file.
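A minimal sketch of feeding NumPy inputs, assuming a single-ONNX model such as a sequence classification model:
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForSequenceClassification

model_id = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)

# return_tensors="np" yields NumPy arrays, passed directly to the ORTModel
inputs = tokenizer("NumPy inputs work out of the box", return_tensors="np")
outputs = model(**inputs)
print(outputs.logits)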
ORTOptimizer support for ORTModelForCausalLM
- ORTOptimizer support ORTModelForCausalLM by @fxmarty in #794
- Support IO Binding for merged decoder by @fxmarty in #797
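A minimal sketch of optimizing a decoder model with ORTOptimizer, assuming the OptimizationConfig API and optimization level 2:
from optimum.onnxruntime import ORTModelForCausalLM, ORTOptimizer
from optimum.onnxruntime.configuration import OptimizationConfig

model = ORTModelForCausalLM.from_pretrained("gpt2", export=True)
optimizer = ORTOptimizer.from_pretrained(model)

# graph fusions at level 2, written to a new folder
optimization_config = OptimizationConfig(optimization_level=2)
optimizer.optimize(save_dir="gpt2_onnx_optimized", optimization_config=optimization_config)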
Breaking changes
- In the ONNX export, exporting models in several ONNX files (encoder, decoder) is now the default behavior: #747. The old behavior is still accessible with --monolith.
- In decoders, reusing past key values is now the default in the ONNX export: #748. The old behavior is still accessible by explicitly passing, for example, --task causal-lm instead of --task causal-lm-with-past.
- BigBird support in the ONNX export is removed, due to the block_sparse attention type being written in pure numpy in Transformers, and hence not exportable to ONNX: #778
- The parameter from_transformers of ORTModel.from_pretrained will be deprecated in favor of export (see the sketch after this list).
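A minimal sketch of the new loading argument replacing from_transformers; presumably both spellings are accepted during the deprecation period:
from optimum.onnxruntime import ORTModelForSequenceClassification

model_id = "distilbert-base-uncased-finetuned-sst-2-english"

# previously: ORTModelForSequenceClassification.from_pretrained(model_id, from_transformers=True)
model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)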
Bugfixes and improvements
- Fix disable shape inference for optimization by @regisss in #652
- Fix uninformative message when passing use_cache=True to ORTModel and no ONNX with cache is available by @fxmarty in #650
- Fix provider options when several providers are passed by @fxmarty in #653
- Add TensorRT engine to ONNX Runtime GPU documentation by @fxmarty in #657
- Improve documentation around ONNX export by @fxmarty in #666
- minor updates on ONNX config guide by @mszsorondo in #662
- Fix FlaubertOnnxConfig by @michaelbenayoun in #669
- Use nvcr.io/nvidia/tensorrt image for GPU tests by @fxmarty in #660
- Better Transformer doc fix by @HamidShojanazeri in #670
- Add support for LongT5 optimization using ORT transformer optimizer script by @kunal-vaishnavi in #683
- Add test for missing execution providers error messages by @fxmarty in #659
- ONNX transformation to cast int64 constants to int32 when possible by @fxmarty in #655
- Add missing normalized configs by @fxmarty in #694
- Remove code duplication in ORTModel's load_model by @fxmarty in #695
- Test more architectures in ORTModel by @fxmarty in #675
- Avoid initializing unwanted attributes for ORTModel's having several inference sessions by @fxmarty in #696
- Fix the ORTQuantizer loading from specific file by @echarlaix in #701
- Add saving of diffusion model additional components ...
v1.6.4: Patch release
Bugfix
- Fix past key/value reuse in decoders following transformers 4.26.0 release and renaming: b9211d6
- ONNX Runtime 1.14 support: #772
Full Changelog: v1.6.3...v1.6.4
v1.6.3: Patch release
Fixes ORTTrainer for inference with the ONNX Runtime backend.
v1.6.2: Patch release
Hotfixes
Regressions
The export of speech-to-text architectures as a single ONNX file (handling both the encoding and decoding) fails due to a regression with the latest transformers version: #721
Full Changelog: v1.6.1...v1.6.2