Code and models from the blog post Scaling Laws for Language Transfer Learning
Building upon work from Scaling Laws for Transfer (Hernandez et al., 2021), my experiments explore the relationship between English pre-training and fine-tuning on non-English languages, aiming to answer the question: how much does pre-training on English help when transferring to different languages as we vary the dataset size and model size?
This repo contains the code for:
- Reproducing pre-trained decoder-only transformers using hyperparameters from Scaling Laws for Neural Language Models, but trained on OpenWebText2 instead of WebText
- Reproducing language transfer experiments that fine-tune the pre-trained English models on Chinese, Spanish, and German text (a rough sketch of this setup follows the list)
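The repo's own training scripts are the source of truth; as a minimal illustration of the transfer setup, the sketch below fine-tunes an English-pre-trained decoder-only model on a non-English corpus with Hugging Face Transformers. The checkpoint name, corpus file, and hyperparameters are placeholders, not the values used in these experiments.

```python
# Hypothetical sketch of the transfer setup: fine-tune an English-pre-trained
# decoder-only model on a non-English corpus. Checkpoint, corpus, and
# hyperparameters below are placeholders, not the experiment's actual values.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # placeholder English checkpoint
tokenizer.pad_token = tokenizer.eos_token           # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Placeholder fine-tuning corpus: one document per line of plain text.
raw = load_dataset("text", data_files={"train": "spanish_corpus.txt"})["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

tokenized = raw.map(tokenize, batched=True, remove_columns=raw.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ft-spanish",
                           per_device_train_batch_size=8,
                           num_train_epochs=1,
                           learning_rate=5e-5),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```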
All English pre-trained models were trained for 26 billion tokens with no repeats:
- x6small 3.3M non-embedding parameters
- x5small 16M non-embedding parameters
- x4small 39M non-embedding parameters
- x3small 51M non-embedding parameters
- x2small 70M non-embedding parameters
- small 124M non-embedding parameters
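For orientation, these sizes are non-embedding parameter counts; the usual approximation from Scaling Laws for Neural Language Models is N ~ 12 * n_layer * d_model^2 (assuming a feed-forward width of 4 * d_model and attention dimension equal to d_model). The helper below is just that back-of-the-envelope formula; the layer count and width shown are illustrative, not the exact configurations of the models above.

```python
def approx_non_embedding_params(n_layer: int, d_model: int) -> int:
    """Back-of-the-envelope count of non-embedding transformer parameters:
    N ~= 12 * n_layer * d_model**2 (assumes d_ff = 4*d_model, attn dim = d_model)."""
    return 12 * n_layer * d_model ** 2

# Illustrative check with a hypothetical 12-layer, d_model=768 configuration.
print(f"{approx_non_embedding_params(12, 768) / 1e6:.0f}M")  # -> 85M
```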
Datasets used for each language (English for pre-training, the others for fine-tuning):
- English: OpenWebText2
- German: OSCAR
- Spanish: OSCAR
- Chinese: Community QA (webtext2019zh)
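One axis of the experiments is the size of the fine-tuning dataset. The repo's actual preprocessing may differ, but as a hypothetical illustration, the helper below trims a raw text corpus down to a target token budget with a byte-level BPE tokenizer; the tokenizer choice and file name are assumptions.

```python
# Hypothetical helper for the dataset-size axis: keep documents from a corpus
# until a target token budget is reached. Tokenizer and file name are placeholders.
from transformers import GPT2TokenizerFast

def subsample_to_token_budget(texts, budget_tokens: int):
    """Yield documents until roughly `budget_tokens` BPE tokens have been collected."""
    tok = GPT2TokenizerFast.from_pretrained("gpt2")  # placeholder tokenizer
    total = 0
    for text in texts:
        n = len(tok(text)["input_ids"])
        if total + n > budget_tokens:
            break
        total += n
        yield text

# Example: build a ~1M-token Spanish subset from a plain-text file (placeholder path).
with open("oscar_es_sample.txt", encoding="utf-8") as f:
    subset = list(subsample_to_token_budget(f, budget_tokens=1_000_000))
```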