MPI-based distributed downloading tool for retrieving data from diverse domains.
This MPI-based distributed downloader was initially designed to download all images from the
monthly GBIF occurrence snapshot. The overall setup is general enough that
it could serve as a functional tool beyond just our use; it should work on any list of URLs. We chose to
build this tool instead of using something like img2dataset to better avoid
overloading source servers (GBIF documents approximately 200M images across 545 servers) and to have more control over the
final dataset construction and metadata management (e.g., using HDF5, as discussed in issue #1).
- Install Miniconda
- Create a new conda environment:
  ```bash
  conda env create -f environment.yaml --solver=libmamba -y
  ```
- Install Python 3.10 or higher
- Install MPI. Any MPI implementation should work; tested with OpenMPI and Intel MPI. Installation instructions can be found on the official websites.
- Install the required package:
  - For general use:
    ```bash
    pip install git+https://github.com/Imageomics/distributed-downloader
    ```
  - For development:
    ```bash
    pip install -e .[dev]
    ```
`distributed-downloader` utilizes multiple nodes on a High Performance Computing (HPC) system (specifically, an HPC
with the Slurm workload manager) to download a collection of images specified in a given tab-delimited text file.
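The exact column layout of the input file depends on your export; as a minimal sketch (the `uuid` and `identifier` column names below are assumptions for illustration, not the tool's required schema), such a tab-delimited file can be parsed with the Python standard library:

```python
import csv
import io

# Hypothetical tab-delimited input; column names are illustrative only.
sample = (
    "uuid\tidentifier\n"
    "abc-123\thttps://example.org/image1.jpg\n"
    "def-456\thttps://example.org/image2.jpg\n"
)

# DictReader with a tab delimiter yields one dict per URL row.
reader = csv.DictReader(io.StringIO(sample), delimiter="\t")
rows = list(reader)
for row in rows:
    print(row["uuid"], row["identifier"])
```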
There is one manual step to get the downloader running as designed:
You need to call the function `download_images` from the package `distributed_downloader` with the `config_path` as an argument.
This will initialize the file structure in the output folder, partition the input file, profile the servers for their
possible download speed, and start downloading images. If downloading didn't finish, you can call the same function with
the same `config_path` argument to continue downloading.
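In code, that single manual step looks like the following sketch (the config file path is a placeholder — point it at your own YAML config):

```python
from distributed_downloader import download_images

# Placeholder path: replace with your own config file.
config_path = "path/to/config.yaml"

# Initializes the output file structure, partitions the input file,
# profiles the servers, and starts downloading. Calling it again with
# the same config_path resumes an unfinished download.
download_images(config_path)
```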
The downloader has two logging profiles:

- `INFO` - logs only the most important information, for example when a batch is started and finished. It also logs any error that occurred during download, image decoding, or writing a batch to the filesystem.
- `DEBUG` - logs all information, for example the start and finish of each downloaded image.
After downloading is finished, you can use the `tools` package to perform various operations on the downloaded images.
To do this, call the function `apply_tools` from the package `distributed_downloader` with the `config_path` and `tool_name` as arguments.
The following tools are available:

- `resize` - resizes images to a new size
- `image_verification` - verifies images by checking if they are corrupted
- `duplication_based` - removes duplicate images
- `size_based` - removes images that are too small
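For example, applying the `resize` tool could look like this sketch (the config path is a placeholder, and the exact signature of `apply_tools` may differ from this positional form):

```python
from distributed_downloader import apply_tools

# Placeholder config path; "resize" is one of the built-in tool names.
apply_tools("path/to/config.yaml", "resize")
```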
You can also add your own tool; the instructions are in the section below.
You can also add your own tool by creating 3 classes and registering them with the respective decorators.
- Each tool's output will be saved in a separate folder in `{config.output_structure.tools_folder}/{tool_name}`
- There are 3 steps in the tool pipeline: `filter`, `scheduler`, and `runner`:
  - `filter` - filters the images that should be processed by the tool and creates CSV files with them
  - `scheduler` - creates a schedule for processing the images with MPI
  - `runner` - processes the images using MPI
- Each step should be implemented in a separate class.
- The tool name should be the same across all classes.
- Each tool should inherit from the `ToolsBase` class.
- Each tool should have a `run` method that will be called by the main script.
- Each tool should be registered with a decorator from the respective package (e.g., `FilterRegister` from `filters`).
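The registration mechanism is, in spirit, a decorator-backed registry. The following is an illustrative sketch of that pattern — not the package's actual implementation; the class and decorator names mirror the description above, but the real signatures may differ:

```python
from typing import Callable, Dict, Type

class ToolsBase:
    """Minimal stand-in for the package's base class."""
    def __init__(self, config_path: str):
        self.config_path = config_path

    def run(self) -> None:
        raise NotImplementedError

# One registry per pipeline step; the real package exposes decorators
# such as FilterRegister from its filters module.
FILTER_REGISTRY: Dict[str, Type[ToolsBase]] = {}

def filter_register(tool_name: str) -> Callable[[Type[ToolsBase]], Type[ToolsBase]]:
    """Decorator that records a filter class under its tool name."""
    def decorator(cls: Type[ToolsBase]) -> Type[ToolsBase]:
        FILTER_REGISTRY[tool_name] = cls
        return cls
    return decorator

@filter_register("resize")
class ResizeFilter(ToolsBase):
    def run(self) -> None:
        # A real filter would select images and write CSV files.
        print(f"filtering for resize with {self.config_path}")

# The main script can then look up and run a filter by tool name.
FILTER_REGISTRY["resize"]("path/to/config.yaml").run()
```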
All scripts can expect the following custom environment variables; tool-specific variables are only initialized when the respective tool is called:
- General parameters:
  - `CONFIG_PATH`
  - `ACCOUNT`
  - `PATH_TO_INPUT`
  - `PATH_TO_OUTPUT`
  - `OUTPUT_URLS_FOLDER`
  - `OUTPUT_LOGS_FOLDER`
  - `OUTPUT_IMAGES_FOLDER`
  - `OUTPUT_SCHEDULES_FOLDER`
  - `OUTPUT_PROFILES_TABLE`
  - `OUTPUT_IGNORED_TABLE`
  - `OUTPUT_INNER_CHECKPOINT_FILE`
  - `OUTPUT_TOOLS_FOLDER`
- Specific to the downloader:
  - `DOWNLOADER_NUM_DOWNLOADS`
  - `DOWNLOADER_MAX_NODES`
  - `DOWNLOADER_WORKERS_PER_NODE`
  - `DOWNLOADER_CPU_PER_WORKER`
  - `DOWNLOADER_HEADER`
  - `DOWNLOADER_IMAGE_SIZE`
  - `DOWNLOADER_LOGGER_LEVEL`
  - `DOWNLOADER_BATCH_SIZE`
  - `DOWNLOADER_RATE_MULTIPLIER`
  - `DOWNLOADER_DEFAULT_RATE_LIMIT`
- Specific to the tools:
  - `TOOLS_NUM_WORKERS`
  - `TOOLS_MAX_NODES`
  - `TOOLS_WORKERS_PER_NODE`
  - `TOOLS_CPU_PER_WORKER`
  - `TOOLS_THRESHOLD_SIZE`
  - `TOOLS_NEW_RESIZE_SIZE`
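A script can read these variables with the standard library; for example (the values set here are illustrative — on a cluster they would be exported by the launcher scripts before the script runs):

```python
import os

# Illustrative values only; set here so the snippet is self-contained.
os.environ["CONFIG_PATH"] = "path/to/config.yaml"
os.environ["DOWNLOADER_BATCH_SIZE"] = "100"

config_path = os.environ["CONFIG_PATH"]
# Environment variables are strings, so numeric ones need conversion.
batch_size = int(os.environ["DOWNLOADER_BATCH_SIZE"])
print(config_path, batch_size)
```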