Download the C4 subset 00000-00019.
python scripts/data/download_c4.py
Output
assets/data
├── ...
└── en
    ├── c4-train.00000-of-01024.json
    ├── c4-train.00001-of-01024.json
    ├── c4-train.00002-of-01024.json
    ├── c4-train.00003-of-01024.json
    ├── c4-train.00004-of-01024.json
    ├── c4-train.00005-of-01024.json
    ├── c4-train.00006-of-01024.json
    ├── c4-train.00007-of-01024.json
    ├── c4-train.00008-of-01024.json
    ├── c4-train.00009-of-01024.json
    ├── c4-train.00010-of-01024.json
    ├── c4-train.00011-of-01024.json
    ├── c4-train.00012-of-01024.json
    ├── c4-train.00013-of-01024.json
    └── c4-train.00014-of-01024.json
...
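Before preprocessing, it can help to confirm that every shard in the requested range was actually downloaded. The following is a minimal sketch, not part of the repository: the helper names `expected_shards` and `missing_shards` are hypothetical, and the directory `assets/data/en` and the range 0 to 19 are assumptions taken from the text and tree above.

```python
from pathlib import Path


def expected_shards(first, last, total=1024):
    """Return C4 train shard file names for indices first..last (inclusive)."""
    return [f"c4-train.{i:05d}-of-{total:05d}.json" for i in range(first, last + 1)]


def missing_shards(data_dir, first, last):
    """List expected shard names that are not present in data_dir."""
    root = Path(data_dir)
    return [name for name in expected_shards(first, last)
            if not (root / name).is_file()]


if __name__ == "__main__":
    # Assumed location and range; adjust if the download script writes elsewhere.
    missing = missing_shards("assets/data/en", 0, 19)
    print("missing shards:", missing or "none")
```

An empty result means all shards in the range are present; otherwise, rerunning the download script is the likely fix.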
We use pytorch:24.01-py3 as the base image. Please make sure Docker is installed.
Install additional packages:
pip install nltk sentencepiece
bash scripts/data/prepare_c4_megatron_llama2.py
assets/data/preprocessed/
├── llama2_00000_text_document.bin
├── llama2_00000_text_document.idx
├── llama2_00001_text_document.bin
├── llama2_00001_text_document.idx
├── llama2_00002_text_document.bin
├── llama2_00002_text_document.idx
├── llama2_00003_text_document.bin
├── llama2_00003_text_document.idx
├── llama2_00004_text_document.bin
├── llama2_00004_text_document.idx
...
To use the preprocessed data for training in Megatron-LM, we provide a blending file, assets/c4-blend.sh.
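As an illustration of what a blend looks like, the sketch below collects the preprocessed prefixes and joins them into the `weight prefix weight prefix ...` form that Megatron-LM's `--data-path` argument accepts. The contents of assets/c4-blend.sh are not shown here, so this is an assumption about its general shape rather than a copy of it; the helper names `dataset_prefixes` and `blend_argument` and the equal weighting are hypothetical.

```python
from pathlib import Path


def dataset_prefixes(preprocessed_dir):
    """Return sorted path prefixes that have both a .bin and a .idx file."""
    root = Path(preprocessed_dir)
    prefixes = []
    for bin_file in sorted(root.glob("*.bin")):
        # Megatron-LM needs the index file next to the binary file.
        if bin_file.with_suffix(".idx").is_file():
            prefixes.append(str(bin_file.with_suffix("")))
    return prefixes


def blend_argument(prefixes, weights=None):
    """Build a 'weight prefix weight prefix ...' string (equal weights by default)."""
    if not prefixes:
        raise ValueError("no dataset prefixes given")
    if weights is None:
        weights = [1.0 / len(prefixes)] * len(prefixes)
    if len(weights) != len(prefixes):
        raise ValueError("weights and prefixes must have the same length")
    return " ".join(f"{w:.4f} {p}" for w, p in zip(weights, prefixes))


if __name__ == "__main__":
    # Assumed output directory, taken from the tree above.
    found = dataset_prefixes("assets/data/preprocessed")
    if found:
        print(blend_argument(found))
```

Skipping any `.bin` file without a matching `.idx` file keeps an interrupted preprocessing run from producing a blend that fails at load time.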
The preprocessing for LLaMA-3 closely resembles that of LLaMA-2, but uses a modified script. LLaMA-3 employs a new tokenizer, so tokenizer.model is no longer used; instead, the new tokenizer.json is loaded with AutoTokenizer. The script therefore accepts a folder name, --tokenizer-model ./assets/checkpoints/llama3_8b_hf, from which the new tokenizer is loaded.
bash scripts/data/prepare_c4_megatron_llama3.py
assets/data/preprocessed_llama3/
├── llama3_00000_text_document.bin
├── llama3_00000_text_document.idx
├── llama3_00001_text_document.bin
├── llama3_00001_text_document.idx
├── llama3_00002_text_document.bin
├── llama3_00002_text_document.idx
├── llama3_00003_text_document.bin
├── llama3_00003_text_document.idx
├── llama3_00004_text_document.bin
├── llama3_00004_text_document.idx
...
The blending file can also be found at assets/c4-blend-llama3.sh.
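To make the tokenizer difference between the two preprocessing paths concrete, the sketch below checks which tokenizer format a given path provides and, for the tokenizer.json case, loads it with `AutoTokenizer`. It is an illustration only and not taken from the preprocessing scripts: the helper names `tokenizer_kind` and `load_hf_tokenizer` are hypothetical, and the folder path in the usage example is the one passed to --tokenizer-model above.

```python
from pathlib import Path


def tokenizer_kind(path):
    """Guess the tokenizer format from what exists on disk."""
    p = Path(path)
    if p.is_file() and p.name == "tokenizer.model":
        return "sentencepiece"  # LLaMA-2 style: a single SentencePiece model file
    if p.is_dir() and (p / "tokenizer.json").is_file():
        return "hf-json"        # LLaMA-3 style: a folder containing tokenizer.json
    if p.is_dir() and (p / "tokenizer.model").is_file():
        return "sentencepiece"
    return "unknown"


def load_hf_tokenizer(folder):
    """Load a tokenizer.json-based tokenizer; requires the transformers package."""
    # Imported lazily so tokenizer_kind() works without transformers installed.
    from transformers import AutoTokenizer
    return AutoTokenizer.from_pretrained(folder)


if __name__ == "__main__":
    folder = "./assets/checkpoints/llama3_8b_hf"
    kind = tokenizer_kind(folder)
    print("tokenizer format:", kind)
    if kind == "hf-json":
        tok = load_hf_tokenizer(folder)
        print("vocabulary size:", len(tok))
```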