The proposed VLC model addresses zero-shot object counting. The input image and text prompt are first encoded by the CLIP image and text encoders. A cross-modal encoder then learns joint representations, optimized by aligning text features with positive patch-level visual features to capture contextually relevant information. A refinement module enhances the visual features with an affine transformation and with Atrous Spatial Pyramid Pooling (ASPP), which captures multi-scale contextual features. A fusion module then adaptively combines the resulting feature sets, and the decoder generates a density map from which the object count is predicted.
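To make the refinement stage concrete, here is a minimal PyTorch sketch. The dilation rates, channel handling, and the choice to condition the affine parameters on the text embedding are illustrative assumptions, not the paper's exact design:

```python
import torch
import torch.nn as nn

class ASPP(nn.Module):
    """Atrous Spatial Pyramid Pooling: parallel dilated convolutions
    capture multi-scale context (dilation rates are assumed here)."""
    def __init__(self, in_ch, out_ch, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=r, dilation=r)
            for r in rates
        ])
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, kernel_size=1)

    def forward(self, x):
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))

class Refinement(nn.Module):
    """Refines patch-level visual features with a learned affine transform
    (here conditioned on the text embedding, an assumption) plus ASPP."""
    def __init__(self, dim, text_dim):
        super().__init__()
        self.to_scale = nn.Linear(text_dim, dim)
        self.to_shift = nn.Linear(text_dim, dim)
        self.aspp = ASPP(dim, dim)

    def forward(self, visual, text):
        # visual: (B, C, H, W) patch features; text: (B, text_dim) prompt embedding
        scale = self.to_scale(text)[:, :, None, None]
        shift = self.to_shift(text)[:, :, None, None]
        return self.aspp(visual * scale + shift)
```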
Essential datasets for the project:
- FSC147: Diverse object counting scenarios
- CARPK: Aerial vehicle counting, loaded through the Hub package (see the snippet below)
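Loading CARPK through Hub follows the pattern below; the dataset identifier and tensor names are assumptions, so verify them against the Activeloop catalog:

```python
import hub

# Dataset identifier and tensor names are assumptions; check the
# Activeloop catalog for the exact ones before use.
ds = hub.load("hub://activeloop/carpk-test")

image = ds.images[0].numpy()  # aerial image as an (H, W, 3) array
boxes = ds.boxes[0].numpy()   # one bounding box per vehicle
print("vehicle count:", len(boxes))
```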
Folder Structure

```
/
├── VLC/
└── FSC147/
    ├── gt/                          # Ground-truth data
    ├── image/                       # Image files
    ├── ImageClasses_FSC147.txt
    ├── Train_Test_Val_FSC_147.json
    └── annotation_FSC147_384.json
```
1. Install Core Packages

```bash
# PyTorch with CUDA 11.1
pip install torch==1.10.0+cu111 torchvision==0.11.0+cu111 torchaudio==0.10.0 -f https://download.pytorch.org/whl/torch_stable.html

# Project dependencies
pip install -r requirements.txt
pip install hub
```
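A quick way to confirm that the CUDA build installed correctly:

```python
import torch
import torchvision

# Versions should match the pip command above (1.10.0+cu111 / 0.11.0+cu111).
print(torch.__version__, torchvision.__version__)
print("CUDA available:", torch.cuda.is_available())
```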
2. Get Pretrained CLIP Weights and BPE File

Download the pretrained CLIP weight and the BPE file, then place them in the appropriate folders:

- CLIP weight: place it under the `pretrain` folder.
- BPE file: place it under the `tools/dataset` folder.
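As a rough sketch of how these files are typically consumed (the file names here are assumptions based on the public CLIP release; check the repo's config for the exact ones):

```python
import torch

# File names are assumptions from the public CLIP release, not
# confirmed by this repo; adjust to the files you downloaded.
CLIP_WEIGHT = "pretrain/ViT-B-16.pt"
BPE_FILE = "tools/dataset/bpe_simple_vocab_16e6.txt.gz"

# Official CLIP weights ship as a TorchScript archive; extract the
# plain state dict so a custom model definition can load it.
state_dict = torch.jit.load(CLIP_WEIGHT, map_location="cpu").state_dict()
print(f"{len(state_dict)} tensors loaded; tokenizer reads {BPE_FILE}")
```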
3. Train

```bash
bash scripts/train.sh FSC {gpu_id} {exp_number}
```

Configure options in `train.sh` before running.

4. Test

```bash
bash scripts/test.sh FSC {gpu_id} {exp_number}
```

Specify the weights to evaluate via `--ckpt_used` in `test.sh`.
| Dataset  | MAE   | RMSE   |
|----------|-------|--------|
| FSC-val  | 16.08 | 62.28  |
| FSC-test | 13.57 | 100.79 |
| CARPK    | 5.91  | 7.47   |
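For reference, MAE and RMSE are computed over per-image counts; a minimal implementation:

```python
import numpy as np

def mae_rmse(pred_counts, gt_counts):
    """Mean absolute error and root mean squared error between
    predicted and ground-truth object counts."""
    err = np.asarray(pred_counts, float) - np.asarray(gt_counts, float)
    return np.abs(err).mean(), np.sqrt((err ** 2).mean())
```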
Qualitative results on FSC-147 (a-f) and CARPK (g-h). GT denotes ground-truth count.