diff --git a/README.md b/README.md
index 14d90256d..1aa95d2a0 100644
--- a/README.md
+++ b/README.md
@@ -43,7 +43,6 @@ bundle exec rake
 - [Schemas](docs/schemas.md): how to work with schemas and the document types
 - [Popularity information](docs/popularity.md): Search API uses Google Analytics data to improve search results.
 - [Publishing document finders](docs/publishing-finders.md): Information about publishing finders using rake tasks
-- [Learning to rank](docs/learning-to-rank.md): Guidance on how to run the ranking model locally
 
 ## Licence
 
diff --git a/docs/how-search-works.md b/docs/how-search-works.md
index 942f7e227..a23e1c375 100644
--- a/docs/how-search-works.md
+++ b/docs/how-search-works.md
@@ -19,17 +19,6 @@ stack don't need to know how to construct Elasticsearch queries.
 See the [relevancy documentation](relevancy.md) to learn more about how
 Search API determines how relevant a document is to a query.
 
-### Reranking
-
-Once Search API has retrieved a selection of relevant documents from
-Elasticsearch, the results are re-ranked by a machine learning model.
-
-This process ensures that we show the most relevant documents at the top
-of the search results.
-
-See the [learning to rank documentation](learning-to-rank.md) to learn
-more about the reranking model.
-
 ## Evaluating search quality
 
 To ensure Search API returns good quality results, we use a combination of
diff --git a/docs/learning-to-rank.md b/docs/learning-to-rank.md
deleted file mode 100644
index 1bcaf6419..000000000
--- a/docs/learning-to-rank.md
+++ /dev/null
@@ -1,164 +0,0 @@
-Learning to Rank
-================
-
-We use a machine learning approach to improve search result relevance,
-using the [TensorFlow Ranking][] module. This doc covers how to use it,
-and what additional work is required.
-
-ADR-010 and ADR-011 cover the architectural decisions.
-
-[TensorFlow Ranking]: https://github.com/tensorflow/ranking
-
-
-Running it locally
-------------------
-
-### Set up
-
-TensorFlow is written in Python 3, so you will need some libraries
-installed. The simplest way to do this is using `virtualenv`:
-
-```sh
-pip3 install virtualenv
-virtualenv venv -p python3
-source venv/bin/activate
-pip install -r ltr/scripts/requirements-freeze.txt
-```
-
-This adjusts your shell's environment to use a local Python package
-database in the `venv` directory. If you close the shell, you can run
-`source venv/bin/activate` again to bring everything back.
-
-
-### Using LTR
-
-**Set the `ENABLE_LTR` environment variable to "true", or all of this is disabled.**
-
-There are several rake tasks for training and serving a TensorFlow
-model in the `learn_to_rank` namespace.
-
-The `learn_to_rank:generate_relevancy_judgements` task needs the
-`GOOGLE_PRIVATE_KEY` and `GOOGLE_CLIENT_EMAIL` environment variables
-set. Values for these can be found in [govuk-secrets][]. The task is
-run regularly and the generated `judgements.csv` file is available in:
-
-- `govuk-integration-search-relevancy`
-- `govuk-staging-search-relevancy`
-- `govuk-production-search-relevancy`
-
-In the future we will store more things in these buckets, like the
-trained models.
-
-Assuming you have a `judgements.csv` file, you can generate a dataset
-for training the model:
-
-```sh
-bundle exec rake learn_to_rank:generate_training_dataset[judgements.csv]
-```
-
-This task needs to be run with access to Elasticsearch. If you're
-using govuk-docker the full command will be:
-
-```sh
-govuk-docker run -e ENABLE_LTR=true search-api-lite bundle exec rake 'learn_to_rank:generate_training_dataset[judgements.csv]'
-```
-
-Once you have the training dataset you can train and serve a model:
-
-```sh
-bundle exec rake learn_to_rank:reranker:train
-bundle exec rake learn_to_rank:reranker:serve
-```
-
-These tasks do not need access to Elasticsearch.
-
-You now have a docker container running and responding to requests
-inside the govuk-docker network at `reranker:8501`. You can start
-search-api with the `ENABLE_LTR` environment variable set:
-
-```sh
-govuk-docker run -e ENABLE_LTR=true search-api-app
-```
-
-If you query search-api then results will be re-ranked when you order by
-relevance. If this doesn't happen, check you're running search-api with
-`ENABLE_LTR` set.
-
-You can disable re-ranking with the parameter `ab_tests=relevance:disable`.
-
-The `learn_to_rank:reranker:evaluate` task can be used to compare
-queries without needing to manually search for things. It uses the
-same `judgements.csv` file.
-
-[govuk-secrets]: https://github.com/alphagov/govuk-secrets
-
-
-Running it in production
-------------------------
-
-In production the model training and deployment are automated through
-Jenkins, with the deployed model hosted in [Amazon SageMaker][].
-The Jenkins job executes the script `ltr/jenkins/start.sh` and
-runs on the [Deploy Jenkins][].
-
-The Jenkins job has four tasks, one for each environment, which:
-
-1. Spin up an EC2 instance and start an SSH session
-
-2. Generate datasets to train a new model. It does this by running the Search
-   API application locally in a container on the EC2 instance and calling the
-   relevant rake tasks.
-
-3. Call Amazon SageMaker's training API to create a new model from
-   that training data, and store the model artefact in S3. This happens from
-   the EC2 instance.
-
-4. Call Amazon SageMaker's deployment API to deploy the new model,
-   removing the old model configuration (but leaving the artefact in
-   S3). This happens from the EC2 instance.
-
-The Jenkins job for each environment is triggered automatically at 10pm on
-Sundays.
-
-All artefacts are stored in the relevancy S3 bucket: training data is
-under `data/<timestamp>/` and model data under `model/<timestamp>-<model
-name>`. Files are removed by a lifecycle policy
-after 7 days.
-
-[Deploy Jenkins]: https://deploy.integration.publishing.service.gov.uk/job/search-api-learn-to-rank/
-[Amazon SageMaker]: https://aws.amazon.com/sagemaker/
-
-Reranking
----------
-
-Reranking happens when `ENABLE_LTR=true` is set. The model is found
-by trying these options in order, going for the first one which
-succeeds:
-
-1. If `TENSORFLOW_SAGEMAKER_ENDPOINT` is set, [Amazon SageMaker][] is
-   used. It's assumed that search-api is running under a role which
-   has permissions to invoke the endpoint.
-
-2. If `TENSORFLOW_SERVING_IP` is set, `http://<TENSORFLOW_SERVING_IP>:8501` is used.
-
-3. If `RANK_ENV` is `development`, `http://reranker:8501` is used.
-
-4. `http://0.0.0.0:8501` is used.
-
-When reranking is working, search-api results get three additional
-fields:
-
-- `model_score`: the score assigned by TensorFlow
-- `combined_score`: the score used for the final ranking
-- `original_rank`: how Elasticsearch ranked the result
-
-We may remove `combined_score` in the future, as it's just the same as
-`model_score`.
-
-
-Further work
------
-
-- Investigate window sizes for reranking (top-k)
-- Reduce the performance impact of reranking
-- Update the process for improving search relevance
diff --git a/docs/new-indexing-process.md b/docs/new-indexing-process.md
index 9d30284ff..31f3a851f 100644
--- a/docs/new-indexing-process.md
+++ b/docs/new-indexing-process.md
@@ -23,11 +23,6 @@ Example PRs:
 - [Prepare for moving to rummager](https://github.com/alphagov/calendars/pull/160/files)
 - [Ensure we pass the description text to publishing API](https://github.com/alphagov/calendars/pull/162/files)
 
-## Add the format to the list in `lib/learn_to_rank/format_enums.rb`
-
-We take format into account in our machine learning, which means we
-need a mapping from formats to unique numbers.
-
 ## Update the presenter to handle the new format
 
 You'll need to update the elasticsearch presenter in Search API so that it handles any fields which are not yet used by other formats in the govuk index.
diff --git a/docs/relevancy.md b/docs/relevancy.md
index 8f575d3bc..4baeda14c 100644
--- a/docs/relevancy.md
+++ b/docs/relevancy.md
@@ -36,15 +36,6 @@ a `combined_score` on every document. The `combined_score` is used for
 ranking results and represents how relevant we think a result is to
 your query.
 
-## What impacts relevancy?
-
-Once Search API has [retrieved](#what-impacts-document-retrieval) the
-top scoring documents from the search indexes, it ranks the results
-in order of relevance using a pre-trained model.
-
-See the [learning to rank](learning-to-rank.md) documentation for
-more details.
-
 ## What impacts document retrieval?
 
 Out of the box, Elasticsearch comes with a decent scoring algorithm.
@@ -102,13 +93,13 @@ field and its number of page views in the `vc_14` field.
 This is an implementation of [this
 curve](https://solr.apache.org/guide/7_7/function-queries.html#recip-function),
 and is applied to documents of the "announcement" type in the [booster.rb][]
-file. It serves to increase the score of new documents and decrease 
+file. It serves to increase the score of new documents and decrease
 the score of old documents.
 
 Only documents of `search_format_types` 'announcement' are affected by
 recency boosting.
 
-The curve was chosen so that it only applies the boost temporarily (2 
+The curve was chosen so that it only applies the boost temporarily (2
 months moderate decay then a rapid decay after that).
 
 #### Properties
diff --git a/docs/search-quality-metrics.md b/docs/search-quality-metrics.md
index 1fc40a9d3..dc7412919 100644
--- a/docs/search-quality-metrics.md
+++ b/docs/search-quality-metrics.md
@@ -15,13 +15,9 @@ click on something that isn't what they were looking for. But this serves
 our needs in the absence of a more sophisticated way of measuring user
 success following a search.
 
-We also measure nDCG before and after re-ranking over time, to
-tell us how search is performing against relevance judgements.
-
 ## Offline metrics
 
-Our main offline metric is nDCG. We measure this before and after
-re-ranking by our [learning to rank model](learning-to-rank.md).
+Our main offline metric is nDCG. We use Elasticsearch's [Ranking Evaluation API](ranking_evaluation_api) to assess the quality of results retrieved from Elasticsearch prior