Commit

Move ML experiments to the model repository
davidfischer committed Dec 15, 2022
1 parent 305f0df commit 7ceedd0
Showing 12 changed files with 1 addition and 1,935 deletions.
3 changes: 0 additions & 3 deletions machine_learning_experiments/.gitignore

This file was deleted.

67 changes: 1 addition & 66 deletions machine_learning_experiments/README.md
@@ -1,66 +1 @@
# Machine Learning for Ads!

This project uses [spaCy](https://spacy.io) to classify page text for ad targeting.

## Quickstart

This will generate our training data and then build and train the model.

    # Generate training and test set from the categorized data (Yaml file)
    python scripts/generate-training-test-sets.py -o assets/train.json -f assets/test.json assets/categorized-data.yml
    python -m spacy project run all . --vars.train=train --vars.dev=test --vars.name=ethicalads_topics --vars.version=`date "+%Y%m%d_%H_%M_%S"`


### Running the analyzer

After installing the analyzer (it's installed in staging already),
you can run it against an arbitrary URL to see how that page was classified.

    ADSERVER_ANALYZER_BACKEND=adserver.analyzer.backends.EthicalAdsTopicsBackend ./manage.py runmodel https://example.com


## 📋 project.yml

The [`project.yml`](project.yml) defines the data assets required by the
project, as well as the available commands and workflows. For details, see the
[spaCy projects documentation](https://spacy.io/usage/projects).

For training with a GPU, some modifications to the `project.yml` are needed.
Specifically, set the `gpu_id` (to 0 usually) and the `config` to `gpu-efficiency.cfg`.

### ⏯ Commands

The following commands are defined by the project. They
can be executed using [`spacy project run [name]`](https://spacy.io/api/cli#project-run).
Commands are only re-run if their inputs have changed.

| Command | Description |
| --- | --- |
| `preprocess` | Convert the data to spaCy's binary format |
| `train` | Train a text classification model |
| `evaluate` | Evaluate the model and export metrics |
| `package` | Build the actual Python package for the model to install |

### ⏭ Workflows

The following workflows are defined by the project. They
can be executed using [`spacy project run [name]`](https://spacy.io/api/cli#project-run)
and will run the specified commands in order. Commands are only re-run if their
inputs have changed.

| Workflow | Steps |
| --- | --- |
| `all` | `preprocess` → `train` → `evaluate` |

## 📚 Data

Our data consists of hand-labeled URLs located in ``assets/categorized-data.yml``.
This file maps each URL to a topic.
We then download the content from those URLs and split it into training and validation sets with ``scripts/generate-training-test-sets.py``.
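
The split step can be illustrated with a minimal sketch. This is not the code in ``scripts/generate-training-test-sets.py``, which is not shown here; the function name `split_examples`, the `test_fraction` and `seed` parameters, and the record layout (`url`, `topic` keys) are all assumptions made for illustration.

```python
import random


def split_examples(examples, test_fraction=0.2, seed=42):
    """Shuffle labeled examples and split them into (train, test) lists.

    Hypothetical helper: the real script may split differently.
    """
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(examples)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    # Always hold out at least one example for the test set.
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Illustrative records only; the URLs and topic names are made up.
labeled = [
    {"url": "https://example.com/a", "topic": "topic-a"},
    {"url": "https://example.com/b", "topic": "topic-b"},
    {"url": "https://example.com/c", "topic": "topic-a"},
    {"url": "https://example.com/d", "topic": "topic-b"},
    {"url": "https://example.com/e", "topic": "topic-a"},
]

train, test = split_examples(labeled)
```

Fixing the seed makes the split repeatable, which helps when comparing metrics between training runs on the same labeled data.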

## Deployment

We currently upload a zip file of the Python model package,
and our deployment scripts install it into a baked build image.

The deployment code lives in our closed-source ``ethicalads-ops`` repo.
Our ML model for ads has moved to a separate repo [ethicalads-model](https://github.com/readthedocs/ethicalads-model).
6 changes: 0 additions & 6 deletions machine_learning_experiments/assets/.gitattributes

This file was deleted.
