
# Segment Anything

## Datasets

Download SA-1B, COCO 2017, and the LVIS annotations.

To conduct box-prompted instance segmentation, you must first obtain the source_json_file of detected bounding boxes. Follow the instructions of ViTDet, YOLOv8, and GroundingDINO to generate the source_json_file, or download our pre-generated files.
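For reference, below is a minimal sketch of how a source_json_file could be inspected, assuming it follows the standard COCO detection-results format (a list of records with image_id, category_id, bbox in [x, y, w, h], and score); check the pre-generated files to confirm the exact schema.

```python
import json

# Load the detected boxes (assumed COCO detection-results format).
with open("coco/source_json_file/coco_vitdet.json") as f:
    detections = json.load(f)

# Each record is expected to carry an image id, a category id,
# a box in [x, y, w, h] pixel coordinates, and a confidence score.
print(len(detections), "detections")
print(detections[0])
```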

Expected directory structure:

```
coco
├── train2017
├── val2017
├── annotations
│   ├── instances_val2017.json
│   ├── lvis_v1_val.json
├── source_json_file
│   ├── coco_groundingdino.json
│   ├── coco_vitdet.json
│   ├── coco_yolov8.json
│   ├── lvis_vitdet.json
sam
├── images
├── masks
├── sa_images_ids.txt
```

## Pretrained Models

Latency/throughput is measured on NVIDIA Jetson AGX Orin and NVIDIA A100 GPUs with TensorRT, fp16. Data transfer time is included.

| Model | Resolution | COCO mAP | LVIS mAP | Params | MACs | Jetson Orin Latency (bs1) | A100 Throughput (bs16) | Checkpoint |
|---|---|---|---|---|---|---|---|---|
| EfficientViT-SAM-L0 | 512x512 | 45.7 | 41.8 | 34.8M | 35G | 8.2ms | 762 images/s | link |
| EfficientViT-SAM-L1 | 512x512 | 46.2 | 42.1 | 47.7M | 49G | 10.2ms | 638 images/s | link |
| EfficientViT-SAM-L2 | 512x512 | 46.6 | 42.7 | 61.3M | 69G | 12.9ms | 538 images/s | link |
| EfficientViT-SAM-XL0 | 1024x1024 | 47.5 | 43.9 | 117.0M | 185G | 22.5ms | 278 images/s | link |
| EfficientViT-SAM-XL1 | 1024x1024 | 47.8 | 44.4 | 203.3M | 322G | 37.2ms | 182 images/s | link |

Table 1: Summary of all EfficientViT-SAM variants. COCO mAP and LVIS mAP are measured using ViTDet's predicted bounding boxes as the prompt. End-to-end Jetson Orin latency and A100 throughput are measured with TensorRT, fp16.

## Usage

```python
# segment anything
from efficientvit.sam_model_zoo import create_sam_model

# build the model and load the pretrained checkpoint
efficientvit_sam = create_sam_model(
    name="xl1", weight_url="assets/checkpoints/sam/xl1.pt",
)
efficientvit_sam = efficientvit_sam.cuda().eval()

# point/box-prompted prediction
from efficientvit.models.efficientvit.sam import EfficientViTSamPredictor

efficientvit_sam_predictor = EfficientViTSamPredictor(efficientvit_sam)

# automatic mask generation (segment everything)
from efficientvit.models.efficientvit.sam import EfficientViTSamAutomaticMaskGenerator

efficientvit_mask_generator = EfficientViTSamAutomaticMaskGenerator(efficientvit_sam)
```
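Continuing from the snippet above, here is a minimal usage sketch. It assumes EfficientViTSamPredictor and EfficientViTSamAutomaticMaskGenerator mirror the original SAM SamPredictor / SamAutomaticMaskGenerator interfaces (set_image/predict and generate); the image path and prompt coordinates are placeholders.

```python
import numpy as np
from PIL import Image

# Load an RGB image as an HxWx3 uint8 array (path is a placeholder).
image = np.array(Image.open("example.jpg").convert("RGB"))

# Prompt-based prediction, assuming the SamPredictor-style interface:
# set_image() computes the image embedding once, predict() runs the prompt decoder.
efficientvit_sam_predictor.set_image(image)
masks, scores, _ = efficientvit_sam_predictor.predict(
    point_coords=np.array([[320, 240]]),  # (x, y) pixel coordinates
    point_labels=np.array([1]),           # 1 = foreground click
    multimask_output=True,
)

# Automatic "segment everything", assuming the SamAutomaticMaskGenerator-style
# interface: generate() returns a list of per-mask dicts (e.g. "segmentation", "area").
all_masks = efficientvit_mask_generator.generate(image)
```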

## Evaluation

Note: For LVIS evaluation, please manually install the lvis package (check this issue for more details).

### Box-Prompted Zero-Shot Instance Segmentation

#### Ground Truth Bounding Box

```bash
# COCO
torchrun --nproc_per_node=8 eval_sam_model.py --dataset coco --image_root coco/val2017 --annotation_json_file coco/annotations/instances_val2017.json --model xl1 --weight_url assets/checkpoints/sam/xl1.pt --prompt_type box
# expected results: all=79.927, large=83.748, medium=82.210, small=75.833

# LVIS
torchrun --nproc_per_node=8 eval_sam_model.py --dataset lvis --image_root coco --annotation_json_file coco/annotations/lvis_v1_val.json --model xl1 --weight_url assets/checkpoints/sam/xl1.pt --prompt_type box
# expected results: all=79.886, large=91.577, medium=88.447, small=74.412
```

#### Detected Bounding Box

```bash
# COCO
torchrun --nproc_per_node=8 eval_sam_model.py --dataset coco --image_root coco/val2017 --annotation_json_file coco/annotations/instances_val2017.json --model xl1 --weight_url assets/checkpoints/sam/xl1.pt --prompt_type box_from_detector --source_json_file coco/source_json_file/coco_vitdet.json
# expected results:
#  Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.478

# LVIS
torchrun --nproc_per_node=8 eval_sam_model.py --dataset lvis --image_root coco --annotation_json_file coco/annotations/lvis_v1_val.json --model xl1 --weight_url assets/checkpoints/sam/xl1.pt --prompt_type box_from_detector --source_json_file coco/source_json_file/lvis_vitdet.json
# expected results:
#  Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=300 catIds=all] = 0.444
```

### Point-Prompted Zero-Shot Instance Segmentation

```bash
# COCO
torchrun --nproc_per_node=8 eval_sam_model.py --dataset coco --image_root coco/val2017 --annotation_json_file coco/annotations/instances_val2017.json --model xl1 --weight_url assets/checkpoints/sam/xl1.pt --prompt_type point --num_click 1
# expected results: all=59.757, large=62.132, medium=63.837, small=55.029

# LVIS
torchrun --nproc_per_node=8 eval_sam_model.py --dataset lvis --image_root coco --annotation_json_file coco/annotations/lvis_v1_val.json --model xl1 --weight_url assets/checkpoints/sam/xl1.pt --prompt_type point --num_click 1
# expected results: all=56.624, large=72.442, medium=71.796, small=47.750
```

## Visualization

Please run demo_sam_model.py to visualize our segment anything models.

Example:

```bash
# segment everything
python demo_sam_model.py --model xl1 --mode all

# prompt with points
python demo_sam_model.py --model xl1 --mode point

# prompt with box
python demo_sam_model.py --model xl1 --mode box --box "[150,70,640,400]"
```

## Deployment

### ONNX Export

```bash
# Export Encoder
python deployment/sam/onnx/export_encoder.py --model xl1 --weight_url assets/checkpoints/sam/xl1.pt --output assets/export_models/sam/onnx/xl1_encoder.onnx

# Export Decoder
python deployment/sam/onnx/export_decoder.py --model xl1 --weight_url assets/checkpoints/sam/xl1.pt --output assets/export_models/sam/onnx/xl1_decoder.onnx --return-single-mask

# ONNX Inference
python -m deployment.sam.onnx.inference --model xl1 --encoder_model assets/export_models/sam/onnx/xl1_encoder.onnx --decoder_model assets/export_models/sam/onnx/xl1_decoder.onnx --mode point
```
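As a quick sanity check, the exported encoder can also be run directly with onnxruntime. The sketch below assumes the encoder takes a single float input named input_image of shape 1x3x1024x1024 (the name and shapes used in the TensorRT export commands below) and that the image has already been preprocessed accordingly; verify the actual input name with session.get_inputs() if the export differs.

```python
import numpy as np
import onnxruntime as ort

# Load the exported encoder (falls back to CPU if CUDA is unavailable).
session = ort.InferenceSession(
    "assets/export_models/sam/onnx/xl1_encoder.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# Dummy preprocessed input; replace with a real resized/normalized image tensor.
dummy = np.random.rand(1, 3, 1024, 1024).astype(np.float32)

# "input_image" matches the input name used in the trtexec shape flags below.
outputs = session.run(None, {"input_image": dummy})
print([o.shape for o in outputs])
```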

### TensorRT Export

```bash
# Export Encoder
trtexec --onnx=assets/export_models/sam/onnx/xl1_encoder.onnx --minShapes=input_image:1x3x1024x1024 --optShapes=input_image:4x3x1024x1024 --maxShapes=input_image:4x3x1024x1024 --saveEngine=assets/export_models/sam/tensorrt/xl1_encoder.engine

# Export Decoder
trtexec --onnx=assets/export_models/sam/onnx/xl1_decoder.onnx --minShapes=point_coords:1x1x2,point_labels:1x1 --optShapes=point_coords:16x2x2,point_labels:16x2 --maxShapes=point_coords:16x2x2,point_labels:16x2 --fp16 --saveEngine=assets/export_models/sam/tensorrt/xl1_decoder.engine

# TensorRT Inference
python -m deployment.sam.tensorrt.inference --model xl1 --encoder_engine assets/export_models/sam/tensorrt/xl1_encoder.engine --decoder_engine assets/export_models/sam/tensorrt/xl1_decoder.engine --mode point
```

## Training

Download the distilled models and place them under assets/distilled_checkpoints.

### Single-Node

We use torchrun to launch distributed training jobs.

```bash
torchrun --nproc_per_node=8 train_sam_model.py configs/sam/xl1.yaml --path .exp/sam/efficientvit_sam_xl1 --resume
```

### SLURM

```bash
bash slurm_run_sam.sh
```

Note: the training config should match the one used in the SLURM script.

## Citation

If EfficientViT or EfficientViT-SAM is useful or relevant to your research, please kindly recognize our contributions by citing our papers:

```bibtex
@article{cai2022efficientvit,
  title={Efficientvit: Enhanced linear attention for high-resolution low-computation visual recognition},
  author={Cai, Han and Gan, Chuang and Han, Song},
  journal={arXiv preprint arXiv:2205.14756},
  year={2022}
}

@article{zhang2024efficientvit,
  title={EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss},
  author={Zhang, Zhuoyang and Cai, Han and Han, Song},
  journal={arXiv preprint arXiv:2402.05008},
  year={2024}
}
```