Reddit Data Cleaner for Language Model Fine-Tuning

This repository contains a script that cleans Reddit data to prepare it for fine-tuning large language models (LLMs). It is adapted from the work of Sentdex and tailored specifically to Reddit text data.

Description

This script prepares a JSON file for language model fine-tuning, with an emphasis on conversational threads sourced from a given subreddit post. Following Sentdex's methodology, it handles HTML tags, URLs, emojis, and special characters, and it keeps conversational threads together so that replies retain their context. It also standardizes text formatting and removes extraneous noise. The script can be adapted to different training requirements.
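To illustrate what "conversational threads" can mean in practice, here is a minimal sketch of pairing each comment with the comment it replies to. It is not the repository's actual code. It assumes each record carries the `id`, `parent_id`, and `body` fields found in common Reddit comment dumps, where a `t1_` prefix on `parent_id` points at another comment and `t3_` points at the post itself. The function name `build_reply_pairs` and the `prompt`/`response` keys are hypothetical and chosen only for this example.

```python
from typing import Dict, Iterable, List


def build_reply_pairs(comments: Iterable[dict]) -> List[Dict[str, str]]:
    """Pair each reply with its parent comment (hypothetical helper)."""
    comments = list(comments)
    # Index comments by id so a parent can be looked up directly.
    by_id = {c["id"]: c for c in comments}

    pairs = []
    for comment in comments:
        parent_id = comment.get("parent_id", "")
        # Assumption: "t1_" marks a reply to a comment, "t3_" a reply to the post.
        if not parent_id.startswith("t1_"):
            continue
        parent = by_id.get(parent_id[3:])
        if parent is None:
            # The parent is missing from this dump, so no pair can be formed.
            continue
        # "prompt" and "response" are illustrative key names, not a fixed schema.
        pairs.append({"prompt": parent["body"], "response": comment["body"]})
    return pairs
```

Top-level comments and replies whose parent is missing are skipped, so every emitted pair has both sides of the exchange.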

Key Features

Preprocessing script tailored specifically for Reddit data.
Handles HTML tags, URLs, emojis, and special characters.
Standardizes text formatting and removes noise.
Aims to improve data quality for fine-tuning language models.
Easily adaptable and customizable for different LLM training requirements.
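As a rough illustration of the kind of cleaning listed above, the sketch below strips HTML tags, URLs, and emojis and then normalizes whitespace, using only the Python standard library. It is an assumption about how such a step could look, not the script shipped in this repository. The name `clean_text`, the choice to remove (rather than replace) each item, and the emoji ranges are all hypothetical, and the emoji pattern covers only the most common code-point blocks.

```python
import html
import re

# All patterns below are illustrative assumptions, not the repository's rules.
TAG_RE = re.compile(r"<[^>]+>")                 # HTML tags such as <p> or </a>
URL_RE = re.compile(r"https?://\S+|www\.\S+")   # bare links
# Rough emoji coverage: pictographs, misc symbols, dingbats, variation selector.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF\uFE0F]")
WHITESPACE_RE = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Return a noise-reduced version of a Reddit comment body."""
    text = TAG_RE.sub(" ", text)       # drop HTML tags first
    text = html.unescape(text)         # then decode entities like &amp;
    text = URL_RE.sub(" ", text)       # remove links
    text = EMOJI_RE.sub("", text)      # remove emojis in the ranges above
    # Collapse runs of spaces, tabs, and newlines into single spaces.
    return WHITESPACE_RE.sub(" ", text).strip()
```

Tags are stripped before entities are decoded so that an escaped `&gt;` quote marker is not mistaken for part of a tag.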

Usage

Clone or download the repository.
Place your raw Reddit data file (e.g., subreddit_comments.zst) into the data/zst_file directory.
Open a terminal and navigate to the repository directory.
Run the cleaning script by executing the following command:
```
sh fineTuneDataset.sh data/zst_file/subreddit_comments.zst
```
The JSON generated in 'data/complete_json' can then be used to fine-tune a large language model.
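Before fine-tuning, it can help to check that the generated file loads and contains the number of entries you expect. The sketch below makes no claim about the output schema. It assumes only that the file is a single JSON document, and the helper name `summarize_json` is hypothetical.

```python
import json
from pathlib import Path
from typing import Tuple


def summarize_json(path: str) -> Tuple[str, int]:
    """Return the top-level type name and entry count of a JSON file."""
    with Path(path).open(encoding="utf-8") as handle:
        data = json.load(handle)
    # Lists and dicts report their length; any other value counts as one entry.
    size = len(data) if isinstance(data, (list, dict)) else 1
    return type(data).__name__, size
```

Call it with the path of whichever file the script wrote under 'data/complete_json'. The exact file name depends on your input and is not assumed here.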

Contributions

Contributions and feedback are welcome! If you encounter any issues or have suggestions for improvements, feel free to open an issue or submit a pull request.

Disclaimer

Please note that while this script aims to improve the quality of Reddit data for language model fine-tuning, it may not cover all possible preprocessing requirements. Users are encouraged to review and adapt the script as needed for their specific datasets and use cases.

Acknowledgements

This project is inspired by the work of Sentdex.

Happy cleaning and fine-tuning!

Repository: ergosumdre/RedditPrep4LLM
