This repository contains a set of tools for creating and processing datasets, specifically focusing on Terraform repositories. The tools included are:
- RepositorySearcher
- Analyzer
- Cleaner
- Compressor
To produce the same result as "TerraDS", follow these steps:
- Run the RepositorySearcher with the provided queries. This creates the `dataset.sqlite` database.
- Run the Analyzer.
- Run the Cleaner.
- Run the Compressor.
- Delete the `RedistributableRepositories` and `_EFMigrationHistory` tables from the database.
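The final step above can be done with any SQLite client. As a minimal sketch, assuming the database sits at `dataset.sqlite` in the working directory (the helper name `drop_tables` is hypothetical and not part of this repository), Python's standard `sqlite3` module is enough:

```python
import sqlite3

# Table names are taken from the steps above; the default path is an assumption.
TABLES_TO_DROP = ("RedistributableRepositories", "_EFMigrationHistory")


def drop_tables(db_path="dataset.sqlite", tables=TABLES_TO_DROP):
    """Drop each listed table if it exists (hypothetical helper)."""
    conn = sqlite3.connect(db_path)
    try:
        for table in tables:
            # Identifiers cannot be bound as SQL parameters, so quote them explicitly.
            quoted = '"' + table.replace('"', '""') + '"'
            conn.execute(f"DROP TABLE IF EXISTS {quoted}")
        conn.commit()
    finally:
        conn.close()
```

Calling `drop_tables()` after the Compressor has finished removes both tables and leaves the rest of the database untouched.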
The RepositorySearcher can be used to search for repositories based on provided queries. While it can search for various types of repositories, the other tools in this repository are specifically focused on Terraform.
The Analyzer processes the overall metadata about repositories in `dataset.sqlite`. This means that all publicly available (e.g. permissively licensed) repositories are downloaded and analyzed.
The Cleaner removes unnecessary data from the `dataset.sqlite` database to prepare it for compression. It also removes non-Terraform files, empty directories, and repositories containing non-Terraform code.
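To illustrate the file-level part of this cleanup, here is a rough sketch and not the Cleaner's actual implementation. It assumes that a Terraform file is one whose name ends in `.tf` or `.tf.json`; the function name `prune_non_terraform` and that suffix list are illustrative assumptions.

```python
import os
from pathlib import Path

# Assumption: a file counts as Terraform if its name ends with one of these suffixes.
TERRAFORM_SUFFIXES = (".tf", ".tf.json")


def prune_non_terraform(root):
    """Delete non-Terraform files under root, then remove directories left empty.

    Returns the number of files deleted. Hypothetical helper for illustration.
    """
    root = Path(root)
    removed_files = 0
    # Walk bottom-up so children are handled before their parent directory.
    for dirpath, _dirnames, filenames in os.walk(root, topdown=False):
        for name in filenames:
            if not name.endswith(TERRAFORM_SUFFIXES):
                (Path(dirpath) / name).unlink()
                removed_files += 1
        current = Path(dirpath)
        # Keep the root itself; prune only directories that are now empty.
        if current != root and not any(current.iterdir()):
            current.rmdir()
    return removed_files
```

Walking bottom-up matters here: a directory that only contained non-Terraform files becomes empty after its files are deleted, and can then be removed before its parent is inspected.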
The Compressor compresses the cleaned data, making it easier to store and distribute.
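The archive format the Compressor produces is not stated here. Purely as an illustration of the idea, and assuming a gzip-compressed tar archive (the function name `archive_directory` is hypothetical), a cleaned directory could be packed like this:

```python
import tarfile
from pathlib import Path


def archive_directory(source_dir, archive_path):
    """Pack source_dir into a gzip-compressed tar archive and return its path.

    Assumption: tar + gzip is used only for illustration; the real Compressor
    may use a different format.
    """
    source_dir = Path(source_dir)
    archive_path = Path(archive_path)
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname stores entries relative to the directory name, not the full path.
        tar.add(source_dir, arcname=source_dir.name)
    return archive_path
```

The archive should be written outside `source_dir`, otherwise it would try to include itself.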
A `justfile` is provided to simplify the usage of these tools. Just is a command runner that allows you to define and run commands easily.
To use the `justfile`, follow these steps:
- Install `just` by following the instructions on the Just website.
- Prepare the `.env` file from `.env.dist`.
- Run the following commands in order:
just fetch-metadata
just download-repositories
just cleanup-repositories
just archive-repositories
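If you prefer to launch the four commands above from a script, the following sketch runs them in order and stops at the first failure. It assumes `just` is on your `PATH` and that the script is started from the repository root; the function name `run_pipeline` and its `run` parameter are hypothetical conveniences.

```python
import subprocess

# Recipe names and their order are taken from the list above.
RECIPES = (
    "fetch-metadata",
    "download-repositories",
    "cleanup-repositories",
    "archive-repositories",
)


def run_pipeline(recipes=RECIPES, run=subprocess.run):
    """Run each recipe with `just`, stopping at the first failure."""
    for recipe in recipes:
        # check=True raises CalledProcessError on a non-zero exit status,
        # so later recipes never run after a failed one.
        run(["just", recipe], check=True)
```

The `run` parameter exists only so the function can be exercised without invoking `just`, for example by passing a stub that records its arguments.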
This project is licensed under the terms of the Creative Commons Attribution 4.0 International License. For more details, see the LICENSE file.