This is the official support for Phi-3.5-vision finetuning using Hugging Face libraries.
Please `cd` into the code directory `vision_finetuning` before running the following commands.
# create a new conda environment
conda create -n phi3v python=3.10
conda activate phi3v
# install pytorch
conda install pytorch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 pytorch-cuda=12.1 -c pytorch -c nvidia
# other libraries needed to run the example code
pip install -r requirements.txt
# (optional) flash attention -- Ampere+ GPUs (e.g., A100, H100)
pip install ninja
MAX_JOBS=32 pip install flash-attn==2.4.2 --no-build-isolation
# (optional) QLoRA -- Turing+ GPUs (e.g., RTX 8000)
pip install bitsandbytes==0.43.1
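After installation, you can optionally verify the environment with a quick check like the sketch below (illustrative only, not part of the repo; the `flash_attn` and `bitsandbytes` checks are only relevant if you installed those optional packages):

```python
# Minimal environment sanity check (illustrative).
import torch

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0),
          "| bf16 supported:", torch.cuda.is_bf16_supported())

# Optional extras -- only if you installed them above.
for pkg in ("flash_attn", "bitsandbytes"):
    try:
        __import__(pkg)
        print(f"{pkg}: OK")
    except ImportError:
        print(f"{pkg}: not installed (optional)")
```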
We provide two example finetuning scripts, one for DocVQA and one for hateful meme classification.
Minimal hardware tested: 4x RTX 8000 (48 GB of GPU memory each).
# minimal script on a mini-train split of DocVQA
torchrun --nproc_per_node=4 finetune_hf_trainer_docvqa.py
Phi-3.5-vision now officially supports multi-image inputs. Here is an example of finetuning on NLVR2:
torchrun --nproc_per_node=8 finetune_hf_trainer_nlvr2.py
Depending on the hardware, users may choose different finetuning strategies. We support full finetuning (with DeepSpeed ZeRO-2), optionally with frozen vision parameters, and LoRA (including 4-bit QLoRA). In general, we recommend full finetuning with flash attention and bf16 whenever possible.
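For orientation, the sketch below shows how a LoRA / 4-bit QLoRA setup is typically expressed with `peft` and `bitsandbytes`. The scripts configure this internally via `--use_lora` / `--use_qlora`; the target modules and hyperparameters here are assumptions for illustration, not the scripts' exact values.

```python
# Illustrative sketch only: a typical LoRA / 4-bit QLoRA setup with peft + bitsandbytes.
# The finetune_hf_trainer_*.py scripts handle this via --use_lora / --use_qlora.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

use_qlora = True  # 4-bit base weights + LoRA adapters (needs bitsandbytes, Turing+ GPU)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
) if use_qlora else None

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3.5-vision-instruct",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    quantization_config=bnb_config,
)

lora_config = LoraConfig(
    r=64,                                   # hypothetical rank; see the result tables below
    lora_alpha=128,                         # hypothetical alpha
    lora_dropout=0.05,
    target_modules=["qkv_proj", "o_proj"],  # assumed attention projections of the language model
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```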
We use a minimal video classification dataset (a subset of UCF-101) as an end-to-end example to demonstrate how to convert a custom dataset to the required format and finetune Phi-3.5-vision on it.
# convert data
python convert_ucf101.py --out_dir /path/to/converted_ucf101
# training
torchrun --nproc_per_node=4 finetune_hf_trainer_ucf101.py --data_dir /path/to/converted_ucf101
The converted data will look like this:
> tree --filelimit=10 /path/to/converted_ucf101
/path/to/converted_ucf101
├── images
│ ├── test
│ │ ├── ApplyEyeMakeup [48 entries exceeds filelimit, not opening dir]
│ │ ├── ApplyLipstick [32 entries exceeds filelimit, not opening dir]
│ │ ├── Archery [56 entries exceeds filelimit, not opening dir]
│ │ ├── BabyCrawling [72 entries exceeds filelimit, not opening dir]
│ │ ├── BalanceBeam [32 entries exceeds filelimit, not opening dir]
│ │ ├── BandMarching [72 entries exceeds filelimit, not opening dir]
│ │ ├── BaseballPitch [80 entries exceeds filelimit, not opening dir]
│ │ ├── Basketball [88 entries exceeds filelimit, not opening dir]
│ │ ├── BasketballDunk [48 entries exceeds filelimit, not opening dir]
│ │ └── BenchPress [72 entries exceeds filelimit, not opening dir]
│ ├── train
│ │ ├── ApplyEyeMakeup [240 entries exceeds filelimit, not opening dir]
│ │ ├── ApplyLipstick [240 entries exceeds filelimit, not opening dir]
│ │ ├── Archery [240 entries exceeds filelimit, not opening dir]
│ │ ├── BabyCrawling [240 entries exceeds filelimit, not opening dir]
│ │ ├── BalanceBeam [240 entries exceeds filelimit, not opening dir]
│ │ ├── BandMarching [240 entries exceeds filelimit, not opening dir]
│ │ ├── BaseballPitch [240 entries exceeds filelimit, not opening dir]
│ │ ├── Basketball [240 entries exceeds filelimit, not opening dir]
│ │ ├── BasketballDunk [240 entries exceeds filelimit, not opening dir]
│ │ └── BenchPress [240 entries exceeds filelimit, not opening dir]
│ └── val
│ ├── ApplyEyeMakeup [24 entries exceeds filelimit, not opening dir]
│ ├── ApplyLipstick [24 entries exceeds filelimit, not opening dir]
│ ├── Archery [24 entries exceeds filelimit, not opening dir]
│ ├── BabyCrawling [24 entries exceeds filelimit, not opening dir]
│ ├── BalanceBeam [24 entries exceeds filelimit, not opening dir]
│ ├── BandMarching [24 entries exceeds filelimit, not opening dir]
│ ├── BaseballPitch [24 entries exceeds filelimit, not opening dir]
│ ├── Basketball [24 entries exceeds filelimit, not opening dir]
│ ├── BasketballDunk [24 entries exceeds filelimit, not opening dir]
│ └── BenchPress [24 entries exceeds filelimit, not opening dir]
├── ucf101_test.jsonl
├── ucf101_train.jsonl
└── ucf101_val.jsonl
34 directories, 3 files
For the jsonl annotations, each line should be a dictionary like:
{"id": "val-0000000300", "source": "ucf101", "conversations": [{"images": ["val/BabyCrawling/v_BabyCrawling_g21_c04.0.jpg", "val/BabyCrawling/v_BabyCrawling_g21_c04.1.jpg", "val/BabyCrawling/v_BabyCrawling_g21_c04.2.jpg", "val/BabyCrawling/v_BabyCrawling_g21_c04.3.jpg", "val/BabyCrawling/v_BabyCrawling_g21_c04.4.jpg", "val/BabyCrawling/v_BabyCrawling_g21_c04.5.jpg", "val/BabyCrawling/v_BabyCrawling_g21_c04.6.jpg", "val/BabyCrawling/v_BabyCrawling_g21_c04.7.jpg"], "user": "Classify the video into one of the following classes: ApplyEyeMakeup, ApplyLipstick, Archery, BabyCrawling, BalanceBeam, BandMarching, BaseballPitch, Basketball, BasketballDunk, BenchPress.", "assistant": "BabyCrawling"}]}
{"id": "val-0000000301", "source": "ucf101", "conversations": [{"images": ["val/BabyCrawling/v_BabyCrawling_g09_c06.0.jpg", "val/BabyCrawling/v_BabyCrawling_g09_c06.1.jpg", "val/BabyCrawling/v_BabyCrawling_g09_c06.2.jpg", "val/BabyCrawling/v_BabyCrawling_g09_c06.3.jpg", "val/BabyCrawling/v_BabyCrawling_g09_c06.4.jpg", "val/BabyCrawling/v_BabyCrawling_g09_c06.5.jpg", "val/BabyCrawling/v_BabyCrawling_g09_c06.6.jpg", "val/BabyCrawling/v_BabyCrawling_g09_c06.7.jpg"], "user": "Classify the video into one of the following classes: ApplyEyeMakeup, ApplyLipstick, Archery, BabyCrawling, BalanceBeam, BandMarching, BaseballPitch, Basketball, BasketballDunk, BenchPress.", "assistant": "BabyCrawling"}]}
Note that `conversations` is a list, so multi-turn conversations are supported if such data is available.
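To convert your own data, it is enough to produce the image files plus jsonl lines in the schema shown above. A minimal sketch (standard library only; the class list, prompt wording, and frame paths are taken from the example above and are purely illustrative):

```python
# Illustrative sketch: emit one annotation line in the expected jsonl schema.
import json

CLASSES = ["ApplyEyeMakeup", "ApplyLipstick", "Archery", "BabyCrawling", "BalanceBeam",
           "BandMarching", "BaseballPitch", "Basketball", "BasketballDunk", "BenchPress"]

def make_entry(example_id: str, frame_paths: list[str], label: str, source: str = "ucf101") -> str:
    """frame_paths are image paths relative to the images/ directory."""
    prompt = ("Classify the video into one of the following classes: "
              + ", ".join(CLASSES) + ".")
    entry = {
        "id": example_id,
        "source": source,
        "conversations": [
            {"images": frame_paths, "user": prompt, "assistant": label},
        ],
    }
    return json.dumps(entry)

with open("ucf101_val.jsonl", "w") as f:
    frames = [f"val/BabyCrawling/v_BabyCrawling_g21_c04.{i}.jpg" for i in range(8)]
    f.write(make_entry("val-0000000300", frames, "BabyCrawling") + "\n")
```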
You'll need an Azure account with the Contributor role (or another role that includes Contributor access). If you don't have an Azure account, create a free account before you begin.
You can submit a request for a quota increase directly from My quotas. Follow the steps below to request an increase for a quota. For this example, you can select any adjustable quota in your subscription.
1. Sign in to the Azure portal.
2. Enter "quotas" into the search box, and then select Quotas.
3. On the Overview page, select a provider, such as Compute or AML.
   Note: For all providers other than Compute, you'll see a Request increase column instead of the Adjustable column described below. There, you can request an increase for a specific quota, or create a support request for the increase.
4. On the My quotas page, under Quota name, select the quota you want to increase. Make sure that the Adjustable column shows Yes for this quota.
5. Near the top of the page, select New Quota Request, then select Enter a new limit.
6. In the New Quota Request pane, enter a numerical value for your new quota limit, then select Submit.
Your request will be reviewed, and you'll be notified if the request can be fulfilled. This usually happens within a few minutes.
If your request isn't fulfilled, you'll see a link to create a support request. When you use this link, a support engineer will assist you with your increase request.
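If you prefer to inspect your current compute usage against the quota limits programmatically before filing a request, a sketch like the following can help. It assumes the `azure-identity` and `azure-mgmt-compute` packages and an already-authenticated credential; it is not required for the finetuning scripts themselves, and the subscription ID and region are placeholders.

```python
# Optional sketch: list current vCPU/GPU-family usage vs. quota limits for one region.
# Assumes: pip install azure-identity azure-mgmt-compute, plus a logged-in credential.
from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient

subscription_id = "<your-subscription-id>"  # placeholder
region = "eastus"                           # pick the region you plan to train in

client = ComputeManagementClient(DefaultAzureCredential(), subscription_id)
for usage in client.usage.list(region):
    if usage.limit > 0:
        print(f"{usage.name.localized_value}: {usage.current_value}/{usage.limit}")
```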
Here are some examples:
Full finetuning usually gives the best performance. You can use the following command to finetune Phi-3-V on hateful memes classification.
torchrun --nproc_per_node=8 --nnodes=<num_nodes> \
--master_addr=$MASTER_ADDR --master_port=$MASTER_PORT --node_rank=$NODE_RANK \
finetune_hf_trainer_hateful_memes.py \
--output_dir <output_dir> \
--batch_size 64 \
--use_flash_attention \
--bf16
It is still possible to fully finetune Phi-3-V on hateful memes classification. However, expect much lower throughput compared to A100 or H100 GPUs due to the lack of flash attention support. Accuracy could also be affected due to the lack of bf16 support (fp16 mixed-precision training is used instead).
torchrun --nproc_per_node=8 --nnodes=<num_nodes> \
--master_addr=$MASTER_ADDR --master_port=$MASTER_PORT --node_rank=$NODE_RANK \
finetune_hf_trainer_hateful_memes.py \
--output_dir <output_dir> \
--batch_size 64
LoRA might be your only choice. You can use the following command to finetune Phi-3-V on hateful memes classification.
torchrun --nproc_per_node=2 \
finetune_hf_trainer_hateful_memes.py \
--output_dir <output_dir> \
--batch_size 64 \
--use_lora
For Turing+ GPUs, QLoRA is supported:
torchrun --nproc_per_node=2 \
finetune_hf_trainer_hateful_memes.py \
--output_dir <output_dir> \
--batch_size 64 \
--use_lora \
--use_qlora
torchrun --nproc_per_node=4 \
finetune_hf_trainer_nlvr2.py \
--bf16 --use_flash_attention \
--batch_size 64 \
--output_dir <output_dir> \
--learning_rate <lr> \
--num_train_epochs <epochs>
Training method | Frozen vision model | data type | LoRA rank | LoRA alpha | batch size | learning rate | epochs | Accuracy |
---|---|---|---|---|---|---|---|---|
full-finetuning | | bf16 | - | - | 64 | 1e-5 | 3 | 89.40 |
full-finetuning | ✔ | bf16 | - | - | 64 | 2e-5 | 2 | 89.20 |

LoRA results coming soon.
The DocVQA and Hateful Memes results below are based on the previous version (Phi-3-vision). New results with Phi-3.5-vision will be added soon.
torchrun --nproc_per_node=4 \
finetune_hf_trainer_docvqa.py \
--full_train \
--bf16 --use_flash_attention \
--batch_size 64 \
--output_dir <output_dir> \
--learning_rate <lr> \
--num_train_epochs <epochs>
Training method | data type | LoRA rank | LoRA alpha | batch size | learning rate | epochs | ANLS |
---|---|---|---|---|---|---|---|
full-finetuning | bf16 | - | - | 64 | 5e-6 | 2 | 83.65 |
full-finetuning | fp16 | - | - | 64 | 5e-6 | 2 | 82.60 |
frozen image model | bf16 | - | - | 64 | 1e-4 | 2 | 79.19 |
frozen image model | fp16 | - | - | 64 | 1e-4 | 2 | 78.74 |
LoRA | bf16 | 32 | 16 | 64 | 2e-4 | 2 | 82.46 |
LoRA | fp16 | 32 | 16 | 64 | 2e-4 | 2 | 82.34 |
QLoRA | bf16 | 32 | 16 | 64 | 2e-4 | 2 | 81.85 |
QLoRA | fp16 | 32 | 16 | 64 | 2e-4 | 2 | 81.85 |
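ANLS (Average Normalized Levenshtein Similarity) is the standard DocVQA metric: each prediction is scored as 1 minus the normalized edit distance to the closest ground-truth answer, scores below the 0.5 threshold are clipped to 0, and the scores are averaged over questions. The sketch below is a self-contained illustration of that definition, not the evaluation code used by the script:

```python
# Illustrative ANLS computation (threshold 0.5, case-insensitive), standard library only.
def edit_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def anls(predictions: list[str], ground_truths: list[list[str]], tau: float = 0.5) -> float:
    scores = []
    for pred, answers in zip(predictions, ground_truths):
        best = 0.0
        for ans in answers:
            p, a = pred.strip().lower(), ans.strip().lower()
            nl = edit_distance(p, a) / max(len(p), len(a), 1)
            best = max(best, 1.0 - nl)
        scores.append(best if best >= tau else 0.0)
    return sum(scores) / max(len(scores), 1)

print(anls(["1998"], [["1998", "the year 1998"]]))  # -> 1.0
```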
torchrun --nproc_per_node=4 \
finetune_hf_trainer_hateful_memes.py \
--bf16 --use_flash_attention \
--batch_size 64 \
--output_dir <output_dir> \
--learning_rate <lr> \
--num_train_epochs <epochs>
Training method | data type | LoRA rank | LoRA alpha | batch size | learning rate | epochs | Accuracy |
---|---|---|---|---|---|---|---|
full-finetuning | bf16 | - | - | 64 | 5e-5 | 2 | 86.4 |
full-finetuning | fp16 | - | - | 64 | 5e-5 | 2 | 85.4 |
frozen image model | bf16 | - | - | 64 | 1e-4 | 3 | 79.4 |
frozen image model | fp16 | - | - | 64 | 1e-4 | 3 | 78.6 |
LoRA | bf16 | 128 | 256 | 64 | 2e-4 | 2 | 86.6 |
LoRA | fp16 | 128 | 256 | 64 | 2e-4 | 2 | 85.2 |
QLoRA | bf16 | 128 | 256 | 64 | 2e-4 | 2 | 84.0 |
QLoRA | fp16 | 128 | 256 | 64 | 2e-4 | 2 | 83.8 |
New benchmarking results with Phi-3.5-vision will be updated soon.
Speed benchmarking is performed on the DocVQA dataset. The average sequence length of this dataset is 2443.23 tokens (using `num_crops=16` for the image model).
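To see where this sequence length comes from, you can tokenize a single example with the processor at the same `num_crops` setting. The sketch below assumes the public `microsoft/Phi-3.5-vision-instruct` processor; the image path and question are placeholders:

```python
# Illustrative: measure the tokenized sequence length of one image + prompt at num_crops=16.
from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained(
    "microsoft/Phi-3.5-vision-instruct", trust_remote_code=True, num_crops=16
)

image = Image.open("/path/to/a_docvqa_page.png")  # placeholder path
messages = [{"role": "user", "content": "<|image_1|>\nWhat is the invoice number?"}]
prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

inputs = processor(prompt, [image], return_tensors="pt")
print("sequence length:", inputs["input_ids"].shape[1])
```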
Training method | # nodes | GPUs | flash attention | Effective batch size | Throughput (img/s) | Speedup | Peak GPU mem (GB) |
---|---|---|---|---|---|---|---|
full-finetuning | 1 | 8 | | 64 | 5.041 | 1x | ~42 |
full-finetuning | 1 | 8 | ✔ | 64 | 8.657 | 1.72x | ~36 |
full-finetuning | 2 | 16 | ✔ | 64 | 16.903 | 3.35x | ~29 |
full-finetuning | 4 | 32 | ✔ | 64 | 33.433 | 6.63x | ~26 |
frozen image model | 1 | 8 | | 64 | 17.578 | 3.49x | ~29 |
frozen image model | 1 | 8 | ✔ | 64 | 31.736 | 6.30x | ~27 |
LoRA | 1 | 8 | | 64 | 5.591 | 1.11x | ~50 |
LoRA | 1 | 8 | ✔ | 64 | 12.127 | 2.41x | ~16 |
QLoRA | 1 | 8 | | 64 | 4.831 | 0.96x | ~32 |
QLoRA | 1 | 8 | ✔ | 64 | 10.545 | 2.09x | ~10 |
Training method | # nodes | GPUs | flash attention | Effective batch size | Throughput (img/s) | Speedup | Peak GPU mem (GB) |
---|---|---|---|---|---|---|---|
full-finetuning | 1 | 8 | | 64 | 2.462 | 1x | ~32 |
full-finetuning | 2 | 16 | | 64 | 4.182 | 1.70x | ~32 |
full-finetuning | 4 | 32 | | 64 | 5.465 | 2.22x | ~32 |
frozen image model | 1 | 8 | | 64 | 8.942 | 3.63x | ~27 |
LoRA | 1 | 8 | | 64 | 2.807 | 1.14x | ~30 |
- Flash attention cannot be run with fp16 (bf16 is always recommended when available, and all GPUs that support flash attention also support bf16).
- Saving intermediate checkpoints and resuming training is not supported yet.