# MaskLLM with C4 Dataset

## Dataset Download

Download the C4 subset 00000-00019:

```bash
python scripts/data/download_c4.py
```

Output:

```
assets/data
├── ...
└── en
    ├── c4-train.00000-of-01024.json
    ├── c4-train.00001-of-01024.json
    ├── c4-train.00002-of-01024.json
    ├── c4-train.00003-of-01024.json
    ├── c4-train.00004-of-01024.json
    ├── c4-train.00005-of-01024.json
    ├── c4-train.00006-of-01024.json
    ├── c4-train.00007-of-01024.json
    ├── c4-train.00008-of-01024.json
    ├── c4-train.00009-of-01024.json
    ├── c4-train.00010-of-01024.json
    ├── c4-train.00011-of-01024.json
    ├── c4-train.00012-of-01024.json
    ├── c4-train.00013-of-01024.json
    ├── c4-train.00014-of-01024.json
    ...
```
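The shards above follow C4's fixed naming scheme, `c4-train.XXXXX-of-01024.json`. A minimal sketch of enumerating the first 20 shard names (00000-00019), e.g. to verify that the download completed (the helper name is ours, not part of the repo):

```python
# Sketch: enumerate the first 20 C4 training shard names, matching the
# files that scripts/data/download_c4.py is expected to fetch.
def c4_shard_names(n_shards: int = 20, total: int = 1024) -> list[str]:
    return [f"c4-train.{i:05d}-of-{total:05d}.json" for i in range(n_shards)]

names = c4_shard_names()
print(names[0])   # c4-train.00000-of-01024.json
print(names[-1])  # c4-train.00019-of-01024.json
```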

## Requirements

We use `pytorch:24.01-py3` as the base image. Please make sure you have Docker installed.

Install additional packages:

```bash
pip install nltk sentencepiece
```

## Pre-processing for LLaMA-2

```bash
bash scripts/data/prepare_c4_megatron_llama2.sh
```

Output:

```
assets/data/preprocessed/
├── llama2_00000_text_document.bin
├── llama2_00000_text_document.idx
├── llama2_00001_text_document.bin
├── llama2_00001_text_document.idx
├── llama2_00002_text_document.bin
├── llama2_00002_text_document.idx
├── llama2_00003_text_document.bin
├── llama2_00003_text_document.idx
├── llama2_00004_text_document.bin
├── llama2_00004_text_document.idx
...
```

To use this with Megatron-LM, we provide a blending file, `assets/c4-blend.sh`, for training.
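The exact contents of `assets/c4-blend.sh` are repo-specific, but a Megatron-LM data blend is typically a whitespace-separated list of `weight path-prefix` pairs, where each prefix is the path shared by a `.bin`/`.idx` pair. A hypothetical sketch of assembling such a blend string with uniform weights (the helper name and the equal weighting are our assumptions):

```python
# Sketch: build a Megatron-style data blend string from preprocessed
# dataset prefixes (the path before .bin/.idx). Weights here are
# uniform; the actual assets/c4-blend.sh may weight shards differently.
def build_blend(prefixes: list[str]) -> str:
    weight = 1.0 / len(prefixes)
    parts = []
    for prefix in prefixes:
        parts.append(f"{weight:g}")
        parts.append(prefix)
    return " ".join(parts)

prefixes = [
    f"assets/data/preprocessed/llama2_{i:05d}_text_document" for i in range(2)
]
print(build_blend(prefixes))
# 0.5 assets/data/preprocessed/llama2_00000_text_document 0.5 assets/data/preprocessed/llama2_00001_text_document
```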

## Pre-processing for LLaMA-3

Preprocessing for LLaMA-3 closely resembles that for LLaMA-2, albeit with a modified script. Notably, LLaMA-3 employs a new tokenizer, and `tokenizer.model` is no longer used; instead, the new `tokenizer.json` is loaded with `AutoTokenizer`. Accordingly, the script accepts a folder path, `--tokenizer-model ./assets/checkpoints/llama3_8b_hf`, from which to load the new tokenizer.
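The difference between the two tokenizer formats can be sketched as a simple check on the checkpoint folder: LLaMA-2 ships a sentencepiece `tokenizer.model` file, while LLaMA-3 ships a `tokenizer.json` intended for `AutoTokenizer.from_pretrained`. The helper below is illustrative only, not part of the repo:

```python
import tempfile
from pathlib import Path

# Sketch: decide how a tokenizer should be loaded from a checkpoint
# folder. `tokenizer.json` (LLaMA-3) is loaded via transformers'
# AutoTokenizer; `tokenizer.model` (LLaMA-2) is a sentencepiece model.
def tokenizer_kind(ckpt_dir: str) -> str:
    d = Path(ckpt_dir)
    if (d / "tokenizer.json").exists():
        return "huggingface"    # pass the folder to AutoTokenizer.from_pretrained
    if (d / "tokenizer.model").exists():
        return "sentencepiece"  # load the file with sentencepiece directly
    raise FileNotFoundError(f"no tokenizer artifact in {ckpt_dir}")

# Usage with a throwaway folder standing in for the real checkpoint dir:
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "tokenizer.json").touch()
    print(tokenizer_kind(tmp))  # huggingface
```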

```bash
bash scripts/data/prepare_c4_megatron_llama3.sh
```

Output:

```
assets/data/preprocessed_llama3/
├── llama3_00000_text_document.bin
├── llama3_00000_text_document.idx
├── llama3_00001_text_document.bin
├── llama3_00001_text_document.idx
├── llama3_00002_text_document.bin
├── llama3_00002_text_document.idx
├── llama3_00003_text_document.bin
├── llama3_00003_text_document.idx
├── llama3_00004_text_document.bin
├── llama3_00004_text_document.idx
...
```

The blending file can also be found at `assets/c4-blend-llama3.sh`.