The proposed VLC model addresses zero-shot object counting. The input image and text prompt are first encoded by the CLIP image and text encoders. A cross-modal encoder then learns joint representations, optimized by aligning text features with positive patch-level visual features to capture contextually relevant information. A refinement module enhances the visual features with an affine transformation and with Atrous Spatial Pyramid Pooling (ASPP), which captures multi-scale contextual features. A fusion module then adaptively combines the resulting feature sets, and the decoder generates a density map from which the object count is predicted.
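To make the refinement stage concrete, here is a minimal PyTorch sketch. The dilation rates, channel handling, and the choice to condition the affine parameters on the text embedding are illustrative assumptions, not the paper's exact design:

```python
import torch
import torch.nn as nn

class ASPP(nn.Module):
    """Atrous Spatial Pyramid Pooling: parallel dilated convolutions
    capture multi-scale context (dilation rates are assumed here)."""
    def __init__(self, in_ch, out_ch, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=r, dilation=r)
            for r in rates
        ])
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, kernel_size=1)

    def forward(self, x):
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))

class Refinement(nn.Module):
    """Refines patch-level visual features with a learned affine transform
    (here conditioned on the text embedding, an assumption) plus ASPP."""
    def __init__(self, dim, text_dim):
        super().__init__()
        self.to_scale = nn.Linear(text_dim, dim)
        self.to_shift = nn.Linear(text_dim, dim)
        self.aspp = ASPP(dim, dim)

    def forward(self, visual, text):
        # visual: (B, C, H, W) patch features; text: (B, text_dim) prompt embedding
        scale = self.to_scale(text)[:, :, None, None]
        shift = self.to_shift(text)[:, :, None, None]
        return self.aspp(visual * scale + shift)
```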
Essential datasets for the project:
- FSC147: Diverse object counting scenarios
- CARPK: Aerial vehicle counting, loaded through the Hub package (see the snippet below)
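Loading CARPK through Hub follows the pattern below; the dataset identifier and tensor names are assumptions, so verify them against the Activeloop catalog:

```python
import hub

# Dataset identifier and tensor names are assumptions; check the
# Activeloop catalog for the exact ones before use.
ds = hub.load("hub://activeloop/carpk-test")

image = ds.images[0].numpy()  # aerial image as an (H, W, 3) array
boxes = ds.boxes[0].numpy()   # one bounding box per vehicle
print("vehicle count:", len(boxes))
```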
Folder Structure

```
/
├── VLC/
└── FSC147/
    ├── gt/                          # Ground-truth data
    ├── image/                       # Image files
    ├── ImageClasses_FSC147.txt
    ├── Train_Test_Val_FSC_147.json
    └── annotation_FSC147_384.json
```
1. Install Core Packages

```bash
# PyTorch with CUDA 11.1
pip install torch==1.10.0+cu111 torchvision==0.11.0+cu111 torchaudio==0.10.0 -f https://download.pytorch.org/whl/torch_stable.html

# Project dependencies
pip install -r requirements.txt
pip install hub
```
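A quick way to confirm that the CUDA build installed correctly:

```python
import torch
import torchvision

# Versions should match the pip command above (1.10.0+cu111 / 0.11.0+cu111).
print(torch.__version__, torchvision.__version__)
print("CUDA available:", torch.cuda.is_available())
```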
2. Get Pretrained CLIP Weights and BPE File

Download the pretrained CLIP weight and the BPE file, then place them in the appropriate folders:

- CLIP weight: place it under the `pretrain` folder.
- BPE file: place it under the `tools/dataset` folder.
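As a rough sketch of how these files are typically consumed (the file names here are assumptions based on the public CLIP release; check the repo's config for the exact ones):

```python
import torch

# File names are assumptions from the public CLIP release, not
# confirmed by this repo; adjust to the files you downloaded.
CLIP_WEIGHT = "pretrain/ViT-B-16.pt"
BPE_FILE = "tools/dataset/bpe_simple_vocab_16e6.txt.gz"

# Official CLIP weights ship as a TorchScript archive; extract the
# plain state dict so a custom model definition can load it.
state_dict = torch.jit.load(CLIP_WEIGHT, map_location="cpu").state_dict()
print(f"{len(state_dict)} tensors loaded; tokenizer reads {BPE_FILE}")
```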
3. Train

```bash
bash scripts/train.sh FSC {gpu_id} {exp_number}
```

Configure options in `train.sh` before running.

4. Test

```bash
bash scripts/test.sh FSC {gpu_id} {exp_number}
```

Specify the weights to evaluate via `--ckpt_used` in `test.sh`.
| Dataset  | MAE   | RMSE   |
|----------|-------|--------|
| FSC-val  | 16.08 | 62.28  |
| FSC-test | 13.57 | 100.79 |
| CARPK    | 5.91  | 7.47   |
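For reference, MAE and RMSE are computed over per-image counts; a minimal implementation:

```python
import numpy as np

def mae_rmse(pred_counts, gt_counts):
    """Mean absolute error and root mean squared error between
    predicted and ground-truth object counts."""
    err = np.asarray(pred_counts, float) - np.asarray(gt_counts, float)
    return np.abs(err).mean(), np.sqrt((err ** 2).mean())
```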
Qualitative results on FSC-147 (a-f) and CARPK (g-h). GT denotes ground-truth count.