Datasets Loader

The Datasets Loader is a tool that automatically processes datasets, generates mock data for them, and distributes the results across datasites. It streamlines the handling of static datasets, making them easily accessible for testing and development.

Features

- Automatic schema generation for CSV and JSON files
- Mock data creation based on the original data structure
- Distribution of processed datasets to the owner's datasite
- Preservation of data privacy by excluding original data from distribution

Project Structure

datasets_loader/
├── static_datasets/
│   ├── dataset_1/
│   │   ├── assets/
│   │   │   ├── file1.csv
│   │   │   ├── file2.json
│   │   │   └── contributors.csv
│   │   └── README.md
│   └── dataset_2/
│       └── ...
├── etl/
│   ├── csv/
│   │   └── main_csv.py
│   └── json/
│       └── main_json.py
├── main_1_day.py
└── README.md

How to Use

Adding a New Dataset:
- Create a new folder in static_datasets/ with your dataset name.
- Inside this folder, create an assets/ directory.
- For each asset in the dataset, create a numbered folder such as asset_0/.
- Place your original CSV or JSON files in the corresponding asset_X/ directory.
- Add a contributors.csv file in the assets/ directory.
- Create a README.md file in the dataset's root folder describing the dataset.
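
The steps above can be scripted. The following is a minimal sketch, not part of the loader itself: `scaffold_dataset` is a hypothetical helper, and it assumes the asset_X/ folders live inside assets/. It only creates empty placeholders; the data files, contributors.csv content, and README.md text still have to be added by hand.

```python
from pathlib import Path


def scaffold_dataset(root: Path, name: str, n_assets: int = 1) -> Path:
    """Create the empty folder layout for a new dataset (hypothetical helper)."""
    dataset_dir = root / "static_datasets" / name
    assets_dir = dataset_dir / "assets"
    assets_dir.mkdir(parents=True, exist_ok=True)
    # One asset_X/ folder per dataset asset; assumed to sit inside assets/.
    for i in range(n_assets):
        (assets_dir / f"asset_{i}").mkdir(exist_ok=True)
    # Empty placeholders to be filled in manually afterwards.
    (assets_dir / "contributors.csv").touch()
    (dataset_dir / "README.md").touch()
    return dataset_dir
```

For example, `scaffold_dataset(Path("datasets_loader"), "my_dataset", n_assets=2)` would create asset_0/ and asset_1/ under static_datasets/my_dataset/assets/.
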

Running the Loader:
- The loader runs automatically as part of the daily task.
- To manually trigger the process, run:
```
python main_1_day.py
```

What the Loader Does:
- Generates schema files (*_schema.txt) for each data file.
- Creates mock data files (*_mock.csv or *_mock.json).
- Copies processed files (excluding original data) to the owner's datasite.
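
As a rough illustration of the first two steps for a CSV file, the sketch below infers a simple type per column, writes a *_schema.txt file, and generates random mock rows. It is not the code in etl/csv/main_csv.py: the function names, the `column: type` schema format, and the random value ranges are all assumptions made for this example.

```python
import csv
import random
from pathlib import Path


def infer_type(values):
    """Return 'int', 'float', or 'str' for a column of raw CSV strings."""
    if not values:
        return "str"
    for name, cast in (("int", int), ("float", float)):
        try:
            for v in values:
                cast(v)
            return name
        except (ValueError, TypeError):
            continue
    return "str"


def make_schema_and_mock(src: Path, n_rows: int = 5, seed: int = 0):
    """Write <stem>_schema.txt and <stem>_mock.csv next to src (illustrative only)."""
    with src.open(newline="") as f:
        reader = csv.DictReader(f)
        columns = list(reader.fieldnames or [])
        rows = list(reader)
    types = {c: infer_type([r[c] for r in rows]) for c in columns}

    # Assumed schema format: one "column: type" pair per line.
    schema_path = src.with_name(f"{src.stem}_schema.txt")
    schema_path.write_text("".join(f"{c}: {t}\n" for c, t in types.items()))

    # Mock values depend only on the inferred type, never on the original values.
    rng = random.Random(seed)
    fake = {
        "int": lambda: rng.randint(0, 100),
        "float": lambda: round(rng.uniform(0, 100), 2),
        "str": lambda: f"value_{rng.randint(0, 999)}",
    }
    mock_path = src.with_name(f"{src.stem}_mock.csv")
    with mock_path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(columns)
        for _ in range(n_rows):
            writer.writerow([fake[types[c]]() for c in columns])
    return schema_path, mock_path
```

Generating mock values from the inferred types alone keeps the column structure while ensuring no original value can leak into the mock file.
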

Accessing Processed Data:
- After processing, find the mock data and schema files in:
```
<owner_datasite>/datasets/mock/<dataset_name>/
```
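
To check what was actually produced, a short helper like the one below can list every file under that folder. This is only a sketch: `list_processed_files` is a hypothetical name, and the `owner_datasite` argument stands in for wherever the owner's datasite lives on your machine.

```python
from pathlib import Path


def list_processed_files(owner_datasite: Path, dataset_name: str) -> list[str]:
    """Return sorted relative paths of all files under datasets/mock/<dataset_name>/."""
    mock_dir = owner_datasite / "datasets" / "mock" / dataset_name
    if not mock_dir.is_dir():
        return []  # nothing processed yet, or the dataset name is wrong
    return sorted(
        p.relative_to(mock_dir).as_posix()
        for p in mock_dir.rglob("*")
        if p.is_file()
    )
```

An empty result usually means the loader has not run yet for that dataset.
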

Updating Existing Datasets:
- Replace or modify files in the static_datasets/ directory.
- Re-run the loader to update processed files.

Best Practices

- Keep original data files in their native format (CSV or JSON).
- Regularly update the contributors.csv files to credit all contributors.
- Provide comprehensive information in each dataset's README.md file.
- Set REPLACE_ALL = True in main_1_day.py to regenerate all files (use cautiously).
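
One plausible reading of the REPLACE_ALL flag is sketched below: existing outputs are left alone unless the flag forces a rebuild. Only the flag name comes from main_1_day.py; the `needs_regeneration` helper and its logic are assumptions for illustration and may not match the actual script.

```python
from pathlib import Path

REPLACE_ALL = False  # flag name from main_1_day.py; the logic below is assumed


def needs_regeneration(output: Path, replace_all: bool = REPLACE_ALL) -> bool:
    """Regenerate when forced, or when the output file does not exist yet."""
    return replace_all or not output.exists()
```

Under this reading, setting the flag to True overwrites every schema and mock file, which is why it should be used cautiously.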

Privacy and Security

- Original data files are never copied to the owner's datasite.
- Only mock data, schema files, and metadata are distributed.
- Ensure you have the right to use and share the data you add to the loader.
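
The exclusion rule can be pictured as an allow-list applied during the copy step. The sketch below is a hedged illustration, not the loader's actual code: the helper names are hypothetical, and treating README.md and contributors.csv as the "metadata" is an assumption.

```python
import shutil
from pathlib import Path

# Assumption: these two files are the metadata that gets distributed.
METADATA_FILES = {"README.md", "contributors.csv"}


def is_distributable(path: Path) -> bool:
    """Allow schema files, mock files, and metadata; reject everything else."""
    return (
        path.stem.endswith("_schema")
        or path.stem.endswith("_mock")
        or path.name in METADATA_FILES
    )


def copy_processed(dataset_dir: Path, dest_dir: Path) -> list[Path]:
    """Copy only distributable files, preserving the relative folder layout."""
    copied = []
    for src in sorted(dataset_dir.rglob("*")):
        if src.is_file() and is_distributable(src):
            dest = dest_dir / src.relative_to(dataset_dir)
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dest)
            copied.append(dest)
    return copied
```

An allow-list fails safe: a file that matches none of the known patterns is skipped, so an unexpected original file is never copied by accident.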

Troubleshooting

- If schema or mock files are not generating, check file permissions and formats.
- For issues with data distribution, verify the client_config.json file is correctly set up.
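
A quick first check on client_config.json is whether it exists and parses as JSON at all. The sketch below does only that; `check_config` is a hypothetical helper, and because the required keys are not documented here, it does not validate any of them. The expectation of a top-level JSON object is also an assumption.

```python
import json
from pathlib import Path


def check_config(path: Path) -> list[str]:
    """Return a list of problems found; an empty list means the basic checks passed."""
    if not path.is_file():
        return [f"{path} does not exist or is not a file"]
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError) as exc:
        # ValueError covers json.JSONDecodeError and bad encodings.
        return [f"{path} could not be read as JSON: {exc}"]
    if not isinstance(data, dict):
        return [f"{path} should contain a JSON object at the top level"]
    return []
```

Calling `check_config(Path("client_config.json"))` and printing the result narrows down whether the problem is a missing file, a permissions issue, or malformed JSON.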

Contributing

To contribute to the Datasets Loader project:

- Fork the repository.
- Create a new branch for your feature or bug fix.
- Submit a pull request with a clear description of your changes.

For any questions or support, please contact the development team.
