Image generated by DALL·E circa 2023.
- Introduction
- Key Features
- Directory Structure
- Technical Overview
- Dataset
- Installation and Setup
- Configuration
- Training and Testing
- Output and Results
- Future Enhancements
- Contact
## Introduction

Recognising human actions in videos is a crucial task in Computer Vision and Machine Learning, with applications ranging from surveillance and human-computer interaction to sports analysis and autonomous systems. This repository offers a 3D Convolutional Neural Network (3DCNN) implemented in PyTorch Lightning to classify video-based actions. By capturing both spatial and temporal features, 3DCNNs are well-suited for tasks where motion and context over time are essential.
This project is part of my personal portfolio showcasing data science and deep learning skills, including data preparation, CNN architecture design, hyperparameter tuning, and experimentation with spatiotemporal data.
## Key Features

- 3D Convolutions: Learns spatial and temporal representations simultaneously.
- Modular Codebase: Separate modules for dataset loading, model construction, training, and testing.
- PyTorch Lightning: Simplifies training loops and experiment management.
- Configurable: Easy to customise hyperparameters via a single `config.ini` file.
- State-of-the-Art Dataset: Trained and tested on UCF101, a benchmark dataset for video action recognition.
## Directory Structure

Below is a high-level overview of the project’s organisation:

```
3DCNN/
├── images/                # Visual outputs (e.g., confusion matrices, sample frames)
├── src/                   # Source code
│   ├── config.ini         # Configuration file for training/testing
│   ├── datasets.py        # Dataset loading and preprocessing logic
│   ├── models.py          # Model architecture definition (3DCNN)
│   ├── pl_model.py        # PyTorch Lightning wrapper for modular training
│   ├── test_factory.py    # Model evaluation and testing scripts
│   ├── trainer_factory.py # Primary training workflow scripts
│   ├── utils.py           # Utility functions (logging, seeding, metrics, etc.)
│   └── video_trainer.py   # Main entry point for training the 3DCNN
└── README.md              # Project documentation
```
## Technical Overview

The network is defined in `models.py` as an `Example3DCNN` class. Key layers include:
- 3D Convolution layers with ReLU and Batch Normalisation to learn spatiotemporal features.
- 3D Max Pooling layers to reduce dimensionality and aggregate important features.
- Fully Connected layers for final classification into the desired action category.
```python
import torch.nn as nn

class Example3DCNN(nn.Module):
    def __init__(self):
        # ...
        # First 3D convolution: 3 input channels (RGB) -> 32 feature maps.
        self.conv1 = nn.Conv3d(3, 32, kernel_size=3, stride=1, padding=1)
        # ...
        # Final classification head.
        self.fc2 = nn.Linear(1024, 10)

    def forward(self, input):
        # ...
        x = self.fc2(x)
        return x
```
**Why 3D Convolutions?**
Traditional 2D convolutions only capture spatial features (height and width). By extending to 3D convolutions, we incorporate the time dimension (depth), allowing the network to detect how an action unfolds across consecutive frames.
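For a concrete picture of the extra time dimension, the short snippet below (illustrative only, not taken from the repository) pushes a dummy clip shaped `(batch, channels, frames, height, width)` through a single `nn.Conv3d`; the 3×3×3 kernel slides over time as well as space.

```python
import torch
import torch.nn as nn

# A dummy batch of 2 RGB clips, each with 16 frames of 112x112 pixels.
clips = torch.randn(2, 3, 16, 112, 112)  # (batch, channels, frames, height, width)

# One 3D convolution: the 3x3x3 kernel spans the two spatial axes and the temporal axis.
conv = nn.Conv3d(in_channels=3, out_channels=32, kernel_size=3, stride=1, padding=1)

out = conv(clips)
print(out.shape)  # torch.Size([2, 32, 16, 112, 112]) -- padding keeps all dimensions intact
```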
- Frame Extraction: Videos are read frame-by-frame using OpenCV, after which a subset of frames is selected or repeated to maintain a fixed length (e.g. 16 or 64 frames).
- Resizing and Normalisation: Frames are resized (e.g., 128×128) to ensure uniform input sizes and speed up training. Normalisation ensures stable training.
- Augmentations (Optional): Random cropping, flipping, or colour jitter can be applied to increase data diversity.
The logic is encapsulated in `datasets.py`. We load videos from the UCF101 dataset, select only the necessary frames, and transform them into tensors ready for training.
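As a rough sketch of that pipeline, assuming OpenCV and uniform frame sampling (function and argument names here are illustrative, not the exact ones used in `datasets.py`):

```python
import cv2
import numpy as np
import torch

def load_clip(video_path, num_frames=16, size=(128, 128)):
    """Read a video, uniformly sample a fixed number of frames, resize and normalise them."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        frames.append(cv2.resize(frame, size))
    cap.release()
    if not frames:
        raise ValueError(f"No frames decoded from {video_path}")

    # Uniform sampling; short videos end up with repeated frames, keeping a fixed clip length.
    idx = np.linspace(0, len(frames) - 1, num_frames).astype(int)
    clip = np.stack([frames[i] for i in idx]).astype(np.float32) / 255.0

    # (frames, H, W, C) -> (C, frames, H, W), the layout expected by nn.Conv3d.
    return torch.from_numpy(clip).permute(3, 0, 1, 2)
```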
## Dataset

We use the UCF101 Action Recognition Dataset, containing 13,320 videos across 101 action categories (e.g., CricketShot, Swimming, HandStandWalking).
- Splits: Typically divided into train, validation, and test subsets, e.g. `trainlist01.txt` and `testlist01.txt`.
- Frame Extraction: The script automatically extracts frames and normalises them to the designated size.
- Classes to Use: The configuration file (`config.ini`) allows restricting or specifying certain classes for partial training or quick tests.

If you plan to use a custom dataset, adapt the code in `datasets.py` accordingly.
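For reference, the official UCF101 split files list one relative video path per line, with the train lists also carrying a 1-indexed class label. A hedged parsing sketch (names chosen for illustration, not copied from `datasets.py`):

```python
from pathlib import Path

def read_split(split_file, data_dir, classes_to_use=None):
    """Parse a UCF101 split file into (video_path, class_name) pairs.

    Train lists look like 'ApplyEyeMakeup/v_ApplyEyeMakeup_g08_c01.avi 1';
    test lists omit the trailing label.
    """
    samples = []
    for line in Path(split_file).read_text().splitlines():
        if not line.strip():
            continue
        rel_path = line.split()[0]
        class_name = rel_path.split("/")[0]
        if classes_to_use and class_name not in classes_to_use:
            continue  # mirrors the classes_to_use restriction in config.ini
        samples.append((Path(data_dir) / rel_path, class_name))
    return samples
```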
## Installation and Setup

1. Clone the repository:

   ```bash
   git clone https://github.com/exponentialR/3DCNN.git
   cd 3DCNN
   ```

2. Create a virtual environment (recommended):

   ```bash
   python -m venv slyk-venv
   ```

3. Activate the virtual environment:

   - Windows:

     ```bash
     slyk-venv\Scripts\activate
     ```

   - Unix/MacOS:

     ```bash
     source slyk-venv/bin/activate
     ```

4. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```
## Configuration

The project uses a single file, `config.ini`, to control hyperparameters and paths. Some crucial fields include:
- `[hyperparameters]`
  - `use_valid`: Whether to use a validation split (`yes` or `no`).
  - `batch_size`: Batch size for training.
  - `num_gpus`: Number of GPUs to utilise.
  - `epoch`: Total training epochs.
  - `data_dir`: Path to your dataset (e.g., UCF101).
  - `classes_to_use`: Class indices to train on (subset of UCF101).
  - `lr`: Learning rate for the optimiser.
  - `num_workers`: Number of subprocesses for data loading.
- `[outputs]`
  - `resume_ckpt`: Path to a checkpoint for resuming training.
  - `output_model`: Destination path for saving the trained model.
Adjust these parameters according to your setup before running the training script.
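As an illustration of how these fields might be consumed downstream (the repository's own loading code may differ in the details), a minimal `configparser` sketch:

```python
from configparser import ConfigParser

cfg = ConfigParser()
cfg.read("config.ini")

hp = cfg["hyperparameters"]
use_valid = hp.getboolean("use_valid")     # interprets 'yes'/'no'
batch_size = hp.getint("batch_size")
num_gpus = hp.getint("num_gpus")
epochs = hp.getint("epoch")
data_dir = hp.get("data_dir")
classes_to_use = hp.get("classes_to_use")  # exact format (e.g. comma-separated indices) is assumed
lr = hp.getfloat("lr")
num_workers = hp.getint("num_workers")

out = cfg["outputs"]
resume_ckpt = out.get("resume_ckpt")
output_model = out.get("output_model")
```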
## Training and Testing

1. Training

   In the `src` directory, run:

   ```bash
   cd src
   python video_trainer.py --mode train
   ```

   This will:

   - Load your dataset from the location specified in `config.ini`.
   - Instantiate the 3DCNN model.
   - Perform training for the specified number of epochs, logging metrics (loss, accuracy) via PyTorch Lightning.
2. Testing

   Once training is completed, you can test using the same script:

   ```bash
   python video_trainer.py --mode test
   ```

   Ensure `resume_ckpt` in `config.ini` points to a valid checkpoint file (e.g., `EXPERIMENTAL3DCNN-14-0.0001-4.ckpt`).
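Both modes map naturally onto PyTorch Lightning's `Trainer`. The hedged sketch below shows the general pattern; argument names follow recent Lightning releases, and the model/datamodule objects stand in for the classes defined in `pl_model.py` and `datasets.py`.

```python
import pytorch_lightning as pl

def run(mode, model, datamodule, epochs, num_gpus, resume_ckpt=None):
    # Older Lightning versions used `gpus=`; recent ones use accelerator/devices.
    trainer = pl.Trainer(
        max_epochs=epochs,
        accelerator="gpu" if num_gpus > 0 else "cpu",
        devices=num_gpus if num_gpus > 0 else 1,
    )
    if mode == "train":
        trainer.fit(model, datamodule=datamodule, ckpt_path=resume_ckpt)
    else:
        trainer.test(model, datamodule=datamodule, ckpt_path=resume_ckpt)
```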
## Output and Results

During training and testing, logs and checkpoints will be saved in the `OUTPUT` directory (or as configured).
- Logs: TensorBoard logs for losses, accuracy, and other metrics.
- Model Checkpoints: Stored in the `OUTPUT` directory.
- Visualisations: Optionally, you can generate confusion matrices or sample predictions using your own scripts within the `images/` directory.
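If you want a quick confusion matrix from saved predictions, a hedged sketch using scikit-learn and matplotlib (not part of the repository) could look like this:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

def save_confusion_matrix(y_true, y_pred, class_names, out_path="images/confusion_matrix.png"):
    """Plot and save a confusion matrix from integer label sequences."""
    cm = confusion_matrix(y_true, y_pred)
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=class_names)
    disp.plot(xticks_rotation="vertical", colorbar=False)
    plt.tight_layout()
    plt.savefig(out_path, dpi=200)
    plt.close()
```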
To explore results, launch TensorBoard:
```bash
tensorboard --logdir=training-logs
```
This allows you to visualise training curves, learning rates, and track model improvements over epochs.
## Future Enhancements

- Data Augmentation: Incorporate more robust strategies like random temporal sampling or advanced geometric transformations.
- Advanced Architectures: Experiment with I3D (Inflated 3D ConvNet) or S3D models.
- Multi-Head Attention: Combine 3D convolutions with Transformers for long-sequence modelling.
- Hyperparameter Optimisation: Integrate libraries like Optuna for automatic hyperparameter tuning.
- Deployment: Convert the final model to TensorRT or ONNX for real-time inference on edge devices.
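On the deployment point, a heavily hedged starting sketch for ONNX export; the input shape, opset version, and weight-loading step are assumptions to adapt to your trained model.

```python
import torch
from models import Example3DCNN  # defined in src/models.py

model = Example3DCNN().eval()  # load your trained weights into `model` first (omitted here)
dummy_clip = torch.randn(1, 3, 16, 128, 128)  # assumed (batch, C, frames, H, W) input
torch.onnx.export(model, dummy_clip, "example3dcnn.onnx", opset_version=17)
```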
## Contact

For any queries or suggestions, feel free to reach out via:
- Email: [email protected]
- LinkedIn: Samuel Adebayo
- GitHub: Samuel A.
Happy coding and best of luck with your 3D Action Recognition tasks!
© 2025 Samuel Adebayo. This project is provided as-is without warranty of any kind.