Updated files structure in 4. Sharing models and tokenizers - Sharing pretrained models #736

Open: wants to merge 8 commits into base: main
46 changes: 24 additions & 22 deletions chapters/de/chapter4/3.mdx
@@ -445,10 +445,10 @@ ls

{#if fw === 'pt'}
```bash
config.json pytorch_model.bin README.md sentencepiece.bpe.model special_tokens_map.json tokenizer_config.json tokenizer.json
added_tokens.json config.json model.safetensors sentencepiece.bpe.model special_tokens_map.json tokenizer_config.json tokenizer.json
```

Wenn du dir die Dateigrößen anschaust (z.B. mit `ls -lh`), solltest du sehen, dass die Modell-Statedict Datei (*pytorch_model.bin*) der einzige Ausreißer ist mit über 400 MB.
Wenn du dir die Dateigrößen anschaust (z.B. mit `ls -lh`), solltest du sehen, dass die Modell-Statedict Datei (*model.safetensors*) der einzige Ausreißer ist mit über 400 MB.

{:else}
```bash
@@ -460,7 +460,7 @@ Wenn du dir die Dateigrößen anschaust (z.B. mit `ls -lh`), solltest du sehen,
{/if}

<Tip>
✏️ Wenn ein Repository mittels der Webinterface kreiert wird, wird die *.gitattributes* Datei automatisch gesetzt, um bestimmte Dateiendungen wie *.bin* und *.h5* als große Dateien zu betrachten, sodass git-lfs sie tracken kann, ohne dass du weiteres konfigurieren musst.
✏️ Wenn ein Repository mittels der Webinterface kreiert wird, wird die *.gitattributes* Datei automatisch gesetzt, um bestimmte Dateiendungen wie *.safetensors* und *.h5* als große Dateien zu betrachten, sodass git-lfs sie tracken kann, ohne dass du weiteres konfigurieren musst.
</Tip>

Nun können wir weitermachen und so arbeiten wie wir es mit normalen Git Repositories machen. Wir können die Dateien stagen mit dem Git-Befehl `git add`:
@@ -483,12 +483,13 @@ Your branch is up to date with 'origin/main'.
Changes to be committed:
(use "git restore --staged <file>..." to unstage)
modified: .gitattributes
new file: added_tokens.json
new file: config.json
new file: pytorch_model.bin
new file: model.safetensors
new file: sentencepiece.bpe.model
new file: special_tokens_map.json
new file: tokenizer.json
new file: tokenizer_config.json
new file: tokenizer.json
```
{:else}
```bash
@@ -521,12 +522,13 @@ Objects to be pushed to origin/main:

Objects to be committed:

config.json (Git: bc20ff2)
pytorch_model.bin (LFS: 35686c2)
added_tokens.json (Git: 43734cd)
config.json (Git: acfd093)
model.safetensors (LFS: 2785d2e)
sentencepiece.bpe.model (LFS: 988bc5a)
special_tokens_map.json (Git: cb23931)
tokenizer.json (Git: 851ff3e)
tokenizer_config.json (Git: f0f7783)
special_tokens_map.json (Git: b547935)
tokenizer.json (Git: 18d0f7a)
tokenizer_config.json (Git: c49982e)

Objects not staged for commit:

@@ -567,11 +569,11 @@ git commit -m "First model version"

{#if fw === 'pt'}
```bash
[main b08aab1] First model version
7 files changed, 29027 insertions(+)
6 files changed, 36 insertions(+)
[main c2ec5c9] First model version
7 files changed, 128351 insertions(+)
create mode 100644 added_tokens.json
create mode 100644 config.json
create mode 100644 pytorch_model.bin
create mode 100644 model.safetensors
create mode 100644 sentencepiece.bpe.model
create mode 100644 special_tokens_map.json
create mode 100644 tokenizer.json
@@ -597,15 +599,15 @@ git push
```

```bash
Uploading LFS objects: 100% (1/1), 433 MB | 1.3 MB/s, done.
Enumerating objects: 11, done.
Counting objects: 100% (11/11), done.
Delta compression using up to 12 threads
Compressing objects: 100% (9/9), done.
Writing objects: 100% (9/9), 288.27 KiB | 6.27 MiB/s, done.
Total 9 (delta 1), reused 0 (delta 0), pack-reused 0
Uploading LFS objects: 100% (2/2), 444 MB | 86 MB/s, done.
Enumerating objects: 10, done.
Counting objects: 100% (10/10), done.
Delta compression using up to 2 threads
Compressing objects: 100% (8/8), done.
Writing objects: 100% (9/9), 592.02 KiB | 6.30 MiB/s, done.
Total 9 (delta 0), reused 0 (delta 0), pack-reused 0
To https://huggingface.co/lysandre/dummy
891b41d..b08aab1 main -> main
70fd9db..c2ec5c9 main -> main
```

{#if fw === 'pt'}
46 changes: 24 additions & 22 deletions chapters/en/chapter4/3.mdx
@@ -451,10 +451,10 @@ ls

{#if fw === 'pt'}
```bash
config.json pytorch_model.bin README.md sentencepiece.bpe.model special_tokens_map.json tokenizer_config.json tokenizer.json
added_tokens.json config.json model.safetensors sentencepiece.bpe.model special_tokens_map.json tokenizer_config.json tokenizer.json
```

If you look at the file sizes (for example, with `ls -lh`), you should see that the model state dict file (*pytorch_model.bin*) is the only outlier, at more than 400 MB.
If you look at the file sizes (for example, with `ls -lh`), you should see that the model state dict file (*model.safetensors*) is the only outlier, at more than 400 MB.
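
As a quick way to spot such outliers from the command line, here is a minimal sketch. The helper name `list_large_files` and the 10 MiB default threshold are illustrative choices of ours, not part of the course material, and the `-maxdepth` option and `M` size suffix assume GNU or BSD `find`:

```shell
# Hypothetical helper (not from the course): print regular files directly
# inside a directory whose size exceeds a threshold given in MiB.
# Usage: list_large_files <dir> <threshold_mib>
list_large_files() {
    dir="${1:-.}"
    threshold_mib="${2:-10}"
    # -size +NM matches files larger than N MiB (sizes are rounded up to whole MiB);
    # -maxdepth 1 keeps the search out of the .git directory.
    find "$dir" -maxdepth 1 -type f -size +"${threshold_mib}"M
}
```

Running `list_large_files . 100` in the cloned repository should print only the model weights file, provided it really is the only file above 100 MiB.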

{:else}
```bash
@@ -466,7 +466,7 @@ If you look at the file sizes (for example, with `ls -lh`), you should see that
{/if}

<Tip>
✏️ When creating the repository from the web interface, the *.gitattributes* file is automatically set up to consider files with certain extensions, such as *.bin* and *.h5*, as large files, and git-lfs will track them with no necessary setup on your side.
✏️ When creating the repository from the web interface, the *.gitattributes* file is automatically set up to consider files with certain extensions, such as *.safetensors* and *.h5*, as large files, and git-lfs will track them with no necessary setup on your side.
</Tip>
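
To confirm that a given file will be handled by git-lfs, one option is to ask Git which `filter` attribute applies to its path. The sketch below assumes it is run inside a Git repository; the function name `is_lfs_tracked` is hypothetical. For an extension that the default *.gitattributes* does not cover, `git lfs track "*.ext"` adds the corresponding rule.

```shell
# Hypothetical helper (not from the course): succeed when the path's
# "filter" attribute resolves to "lfs" according to .gitattributes.
# Usage: is_lfs_tracked <path>
is_lfs_tracked() {
    # git check-attr prints "<path>: filter: <value>"; the path does not
    # need to exist yet, only the attribute rules are consulted.
    git check-attr filter -- "$1" | grep -q ': filter: lfs$'
}

# Example usage (file name taken from the listing in this section):
# if is_lfs_tracked model.safetensors; then echo "tracked by git-lfs"; fi
```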

We can now go ahead and proceed like we would usually do with traditional Git repositories. We can add all the files to Git's staging environment using the `git add` command:
@@ -489,12 +489,13 @@ Your branch is up to date with 'origin/main'.
Changes to be committed:
(use "git restore --staged <file>..." to unstage)
modified: .gitattributes
new file: added_tokens.json
new file: config.json
new file: pytorch_model.bin
new file: model.safetensors
new file: sentencepiece.bpe.model
new file: special_tokens_map.json
new file: tokenizer.json
new file: tokenizer_config.json
new file: tokenizer.json
```
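
The same staging information is also available in a script-friendly form. A minimal sketch, assuming it is run inside the repository after `git add`; the helper name `staged_files` is our own:

```shell
# Hypothetical helper (not from the course): print the staged paths,
# one per line, without the surrounding status text.
staged_files() {
    # --cached compares the index against HEAD (or against an empty tree
    # before the first commit); --name-only prints only the paths.
    git diff --cached --name-only
}
```

This can be handy for checking, before committing, that no unexpected file ended up in the index.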
{:else}
```bash
@@ -527,12 +528,13 @@ Objects to be pushed to origin/main:

Objects to be committed:

config.json (Git: bc20ff2)
pytorch_model.bin (LFS: 35686c2)
added_tokens.json (Git: 43734cd)
config.json (Git: acfd093)
model.safetensors (LFS: 2785d2e)
sentencepiece.bpe.model (LFS: 988bc5a)
special_tokens_map.json (Git: cb23931)
tokenizer.json (Git: 851ff3e)
tokenizer_config.json (Git: f0f7783)
special_tokens_map.json (Git: b547935)
tokenizer.json (Git: 18d0f7a)
tokenizer_config.json (Git: c49982e)

Objects not staged for commit:

@@ -573,11 +575,11 @@ git commit -m "First model version"

{#if fw === 'pt'}
```bash
[main b08aab1] First model version
7 files changed, 29027 insertions(+)
6 files changed, 36 insertions(+)
[main c2ec5c9] First model version
7 files changed, 128351 insertions(+)
create mode 100644 added_tokens.json
create mode 100644 config.json
create mode 100644 pytorch_model.bin
create mode 100644 model.safetensors
create mode 100644 sentencepiece.bpe.model
create mode 100644 special_tokens_map.json
create mode 100644 tokenizer.json
@@ -603,15 +605,15 @@ git push
```

```bash
Uploading LFS objects: 100% (1/1), 433 MB | 1.3 MB/s, done.
Enumerating objects: 11, done.
Counting objects: 100% (11/11), done.
Delta compression using up to 12 threads
Compressing objects: 100% (9/9), done.
Writing objects: 100% (9/9), 288.27 KiB | 6.27 MiB/s, done.
Total 9 (delta 1), reused 0 (delta 0), pack-reused 0
Uploading LFS objects: 100% (2/2), 444 MB | 86 MB/s, done.
Enumerating objects: 10, done.
Counting objects: 100% (10/10), done.
Delta compression using up to 2 threads
Compressing objects: 100% (8/8), done.
Writing objects: 100% (9/9), 592.02 KiB | 6.30 MiB/s, done.
Total 9 (delta 0), reused 0 (delta 0), pack-reused 0
To https://huggingface.co/lysandre/dummy
891b41d..b08aab1 main -> main
70fd9db..c2ec5c9 main -> main
```

{#if fw === 'pt'}
44 changes: 23 additions & 21 deletions chapters/fr/chapter4/3.mdx
@@ -450,10 +450,10 @@ ls

{#if fw === 'pt'}
```bash
config.json pytorch_model.bin README.md sentencepiece.bpe.model special_tokens_map.json tokenizer_config.json tokenizer.json
added_tokens.json config.json model.safetensors sentencepiece.bpe.model special_tokens_map.json tokenizer_config.json tokenizer.json
```

Si vous regardez la taille des fichiers (par exemple, avec `ls -lh`), vous devriez voir que le fichier d'état du modèle (*pytorch_model.bin*) est la seule exception, avec plus de 400 Mo.
Si vous regardez la taille des fichiers (par exemple, avec `ls -lh`), vous devriez voir que le fichier d'état du modèle (*model.safetensors*) est la seule exception, avec plus de 400 Mo.

{:else}
```bash
@@ -488,12 +488,13 @@ Your branch is up to date with 'origin/main'.
Changes to be committed:
(use "git restore --staged <file>..." to unstage)
modified: .gitattributes
new file: added_tokens.json
new file: config.json
new file: pytorch_model.bin
new file: model.safetensors
new file: sentencepiece.bpe.model
new file: special_tokens_map.json
new file: tokenizer.json
new file: tokenizer_config.json
new file: tokenizer.json
```
{:else}
```bash
@@ -526,12 +527,13 @@ Objects to be pushed to origin/main:

Objects to be committed:

config.json (Git: bc20ff2)
pytorch_model.bin (LFS: 35686c2)
added_tokens.json (Git: 43734cd)
config.json (Git: acfd093)
model.safetensors (LFS: 2785d2e)
sentencepiece.bpe.model (LFS: 988bc5a)
special_tokens_map.json (Git: cb23931)
tokenizer.json (Git: 851ff3e)
tokenizer_config.json (Git: f0f7783)
special_tokens_map.json (Git: b547935)
tokenizer.json (Git: 18d0f7a)
tokenizer_config.json (Git: c49982e)

Objects not staged for commit:

@@ -572,11 +574,11 @@ git commit -m "First model version"

{#if fw === 'pt'}
```bash
[main b08aab1] First model version
7 files changed, 29027 insertions(+)
6 files changed, 36 insertions(+)
[main c2ec5c9] First model version
7 files changed, 128351 insertions(+)
create mode 100644 added_tokens.json
create mode 100644 config.json
create mode 100644 pytorch_model.bin
create mode 100644 model.safetensors
create mode 100644 sentencepiece.bpe.model
create mode 100644 special_tokens_map.json
create mode 100644 tokenizer.json
@@ -602,15 +604,15 @@ git push
```

```bash
Uploading LFS objects: 100% (1/1), 433 MB | 1.3 MB/s, done.
Enumerating objects: 11, done.
Counting objects: 100% (11/11), done.
Delta compression using up to 12 threads
Compressing objects: 100% (9/9), done.
Writing objects: 100% (9/9), 288.27 KiB | 6.27 MiB/s, done.
Total 9 (delta 1), reused 0 (delta 0), pack-reused 0
Uploading LFS objects: 100% (2/2), 444 MB | 86 MB/s, done.
Enumerating objects: 10, done.
Counting objects: 100% (10/10), done.
Delta compression using up to 2 threads
Compressing objects: 100% (8/8), done.
Writing objects: 100% (9/9), 592.02 KiB | 6.30 MiB/s, done.
Total 9 (delta 0), reused 0 (delta 0), pack-reused 0
To https://huggingface.co/lysandre/dummy
891b41d..b08aab1 main -> main
70fd9db..c2ec5c9 main -> main
```

{#if fw === 'pt'}
48 changes: 25 additions & 23 deletions chapters/it/chapter4/3.mdx
@@ -452,10 +452,10 @@ ls

{#if fw === 'pt'}
```bash
config.json pytorch_model.bin README.md sentencepiece.bpe.model special_tokens_map.json tokenizer_config.json tokenizer.json
added_tokens.json config.json model.safetensors sentencepiece.bpe.model special_tokens_map.json tokenizer_config.json tokenizer.json
```

Guardando le dimensioni dei file (ad esempio con `ls -lh`), possiamo vedere che il file contenente lo stato del modello (model state dict file) (*pytorch_model.bin*) è l'unico file anomalo, occupando più di 400 MB.
Guardando le dimensioni dei file (ad esempio con `ls -lh`), possiamo vedere che il file contenente lo stato del modello (model state dict file) (*model.safetensors*) è l'unico file anomalo, occupando più di 400 MB.

{:else}
```bash
@@ -467,8 +467,8 @@ Guardando le dimensioni dei file (ad esempio con `ls -lh`), possiamo vedere che
{/if}

<Tip>
✏️ When creating the repository from the web interface, the *.gitattributes* file is automatically set up to consider files with certain extensions, such as *.bin* and *.h5*, as large files, and git-lfs will track them with no necessary setup on your side.
✏️ Creando il reposiotry dall'interfaccia web, il file *.gitattributes* viene automaticamente configurato per considerare file con alcune estensioni, come *.bin* e *.h5*, come file grandi, e git-lfs li traccerà senza necessità di configurazione da parte dell'utente.
✏️ When creating the repository from the web interface, the *.gitattributes* file is automatically set up to consider files with certain extensions, such as *.safetensors* and *.h5*, as large files, and git-lfs will track them with no necessary setup on your side.
✏️ Creando il repository dall'interfaccia web, il file *.gitattributes* viene automaticamente configurato per considerare file con alcune estensioni, come *.safetensors* e *.h5*, come file grandi, e git-lfs li traccerà senza necessità di configurazione da parte dell'utente.
</Tip>

Possiamo quindi procedere come faremo per un repository Git tradizionale. Possiamo aggiungere tutti i file all'ambiente di staging di Git con il comando `git add`:
@@ -491,12 +491,13 @@ Your branch is up to date with 'origin/main'.
Changes to be committed:
(use "git restore --staged <file>..." to unstage)
modified: .gitattributes
new file: added_tokens.json
new file: config.json
new file: pytorch_model.bin
new file: model.safetensors
new file: sentencepiece.bpe.model
new file: special_tokens_map.json
new file: tokenizer.json
new file: tokenizer_config.json
new file: tokenizer.json
```
{:else}
```bash
@@ -529,12 +530,13 @@ Objects to be pushed to origin/main:

Objects to be committed:

config.json (Git: bc20ff2)
pytorch_model.bin (LFS: 35686c2)
added_tokens.json (Git: 43734cd)
config.json (Git: acfd093)
model.safetensors (LFS: 2785d2e)
sentencepiece.bpe.model (LFS: 988bc5a)
special_tokens_map.json (Git: cb23931)
tokenizer.json (Git: 851ff3e)
tokenizer_config.json (Git: f0f7783)
special_tokens_map.json (Git: b547935)
tokenizer.json (Git: 18d0f7a)
tokenizer_config.json (Git: c49982e)

Objects not staged for commit:

@@ -575,11 +577,11 @@ git commit -m "First model version"

{#if fw === 'pt'}
```bash
[main b08aab1] First model version
7 files changed, 29027 insertions(+)
6 files changed, 36 insertions(+)
[main c2ec5c9] First model version
7 files changed, 128351 insertions(+)
create mode 100644 added_tokens.json
create mode 100644 config.json
create mode 100644 pytorch_model.bin
create mode 100644 model.safetensors
create mode 100644 sentencepiece.bpe.model
create mode 100644 special_tokens_map.json
create mode 100644 tokenizer.json
@@ -605,15 +607,15 @@ git push
```

```bash
Uploading LFS objects: 100% (1/1), 433 MB | 1.3 MB/s, done.
Enumerating objects: 11, done.
Counting objects: 100% (11/11), done.
Delta compression using up to 12 threads
Compressing objects: 100% (9/9), done.
Writing objects: 100% (9/9), 288.27 KiB | 6.27 MiB/s, done.
Total 9 (delta 1), reused 0 (delta 0), pack-reused 0
Uploading LFS objects: 100% (2/2), 444 MB | 86 MB/s, done.
Enumerating objects: 10, done.
Counting objects: 100% (10/10), done.
Delta compression using up to 2 threads
Compressing objects: 100% (8/8), done.
Writing objects: 100% (9/9), 592.02 KiB | 6.30 MiB/s, done.
Total 9 (delta 0), reused 0 (delta 0), pack-reused 0
To https://huggingface.co/lysandre/dummy
891b41d..b08aab1 main -> main
70fd9db..c2ec5c9 main -> main
```

{#if fw === 'pt'}