Commit
Move ML experiments to the model repository
These now exist in https://github.com/readthedocs/ethicalads-model
1 parent 305f0df, commit 7ceedd0
Showing 12 changed files with 1 addition and 1,935 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,66 +1 @@ | ||
# Machine Learning for Ads! | ||
|
||
This project uses [spaCy](https://spacy.io) to classify page text for ad targeting. | ||
|
||
## Quickstart | ||
|
||
This will generate our training data and then build and train the model. | ||
|
||
# Generate training and test sets from the categorized data (YAML file) | ||
python scripts/generate-training-test-sets.py -o assets/train.json -f assets/test.json assets/categorized-data.yml | ||
python -m spacy project run all . --vars.train=train --vars.dev=test --vars.name=ethicalads_topics --vars.version=`date "+%Y%m%d_%H_%M_%S"` | ||
|
||
|
||
### Running the analyzer | ||
|
||
After installing the analyzer (it's installed in staging already), | ||
you can run it against an arbitrary URL to see how that page was classified. | ||
|
||
ADSERVER_ANALYZER_BACKEND=adserver.analyzer.backends.EthicalAdsTopicsBackend ./manage.py runmodel https://example.com | ||
|
||
|
||
## 📋 project.yml | ||
|
||
The [`project.yml`](project.yml) defines the data assets required by the | ||
project, as well as the available commands and workflows. For details, see the | ||
[spaCy projects documentation](https://spacy.io/usage/projects). | ||
|
||
For training with a GPU, some modifications to the `project.yml` are needed. | ||
Specifically, set the `gpu_id` (to 0 usually) and the `config` to `gpu-efficiency.cfg`. | ||
|
||
### ⏯ Commands | ||
|
||
The following commands are defined by the project. They | ||
can be executed using [`spacy project run [name]`](https://spacy.io/api/cli#project-run). | ||
Commands are only re-run if their inputs have changed. | ||
|
||
| Command | Description | | ||
| --- | --- | | ||
| `preprocess` | Convert the data to spaCy's binary format | | ||
| `train` | Train a text classification model | | ||
| `evaluate` | Evaluate the model and export metrics | | ||
| `package` | Build the actual Python package for the model to install | | ||
|
||
### ⏭ Workflows | ||
|
||
The following workflows are defined by the project. They | ||
can be executed using [`spacy project run [name]`](https://spacy.io/api/cli#project-run) | ||
and will run the specified commands in order. Commands are only re-run if their | ||
inputs have changed. | ||
|
||
| Workflow | Steps | | ||
| --- | --- | | ||
| `all` | `preprocess` → `train` → `evaluate` | | ||
|
||
## 📚 Data | ||
|
||
Our data consists of hand-labeled URLs located in ``assets/categorized-data.yml``. | ||
This file maps each URL to a topic. | ||
We then download the content from those URLs and split it into training and validation sets with ``scripts/generate-training-test-sets.py``. | ||
|
||
## Deployment | ||
|
||
We currently upload a zipfile of the Python model | ||
and install it into a baked build image from our deployment scripts. | ||
|
||
The custom deployment code lives in our closed-source ``ethicalads-ops`` repo. | ||
Our ML model for ads has moved to a separate repo [ethicalads-model](https://github.com/readthedocs/ethicalads-model). |
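
The Data section of the removed README describes mapping hand-labeled URLs to topics and splitting them into training and held-out sets. The following is a minimal sketch of that kind of split, not the actual logic of `scripts/generate-training-test-sets.py`, which is not shown in this commit. The function name `split_labeled_urls`, the `test_fraction` and `seed` parameters, and the sample URLs and topics are all hypothetical.

```python
import random


def split_labeled_urls(labeled, test_fraction=0.2, seed=0):
    """Shuffle (url, topic) pairs and split them into train and test lists.

    Hypothetical helper for illustration; not taken from the repository.
    """
    # Sort first so the result depends only on the seed, not on dict order.
    items = sorted(labeled.items())
    random.Random(seed).shuffle(items)

    # Hold out at least one example whenever there is any data at all.
    n_test = max(1, int(len(items) * test_fraction)) if items else 0
    return items[n_test:], items[:n_test]


# Hypothetical sample data; the real layout of categorized-data.yml may differ.
labeled = {
    "https://example.com/a": "devops",
    "https://example.com/b": "frontend",
    "https://example.com/c": "backend",
    "https://example.com/d": "datascience",
    "https://example.com/e": "security",
}

train, test = split_labeled_urls(labeled)
print(len(train), "training examples,", len(test), "test examples")
```

Fixing the seed keeps the split reproducible between runs, which makes evaluation metrics comparable across model versions.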