Tools, ideas, and data.
Semantics: EQUELLA objects are items with attachments. Invenio objects are records with files. EQUELLA has taxonomies; Invenio has vocabularies. We use these terms consistently so it's clear what format an object is in (e.g. python migrate/record.py item.json > record.json converts an item into a record).
poetry install # get dependencies
poetry shell # enter venv
python -m spacy download en_core_web_lg # download spacy model for Named Entity Recognition
pytest -v migrate/tests.py # run tests
Migrate scripts that create records require an INVENIO_TOKEN or TOKEN variable in our environment or .env file. To create a token: sign in as an admin and go to Applications > Personal access tokens.
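As a minimal sketch of how a script might resolve the token from the process environment: get_token is a hypothetical helper, not a function from the migrate package, and the precedence of INVENIO_TOKEN over TOKEN is an assumption, since the scripts' actual order is not stated here.

```python
import os


def get_token(env=None):
    """Return the first non-empty token found in the environment.

    Assumption: INVENIO_TOKEN is preferred over TOKEN when both are set.
    """
    env = os.environ if env is None else env
    for name in ("INVENIO_TOKEN", "TOKEN"):
        value = env.get(name, "").strip()
        if value:
            return value
    raise RuntimeError("Set INVENIO_TOKEN or TOKEN in the environment or .env file")
```

Failing early with a clear message is one way to avoid a less obvious authorization error later in the run.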
Invenio uses vocabularies to represent a number of fixtures beyond just subject headings, like names, description types, and creator roles. They're stored under the app_data directory and loaded when an instance is initialized. Many of our controlled lists in contribution wizards and EQUELLA taxonomies will be mapped to vocabularies.
The taxos dir contains exported EQUELLA taxonomies and tools for working with them. The vocab dir contains YAML files for Invenio vocabularies.
We create two subject vocabularies: one for Library of Congress subjects with URIs from one of their authorities and one for CCA local subjects not present in any LC authority.
Download our subjects sheet and run python migrate/mk_subjects.py data/subjects.csv to create the YAML vocabularies in the vocab dir (lc.yaml and cca_local.yaml) as well as migrate/subjects_map.json, which is used to convert the text of VAULT subject terms into Invenio identifiers or ID-less keyword subjects.
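The lookup described above might be sketched as follows. This assumes subjects_map.json maps each term's text to an identifier string, with unmapped terms becoming keywords; the real file's structure may differ. The function name and sample identifier are hypothetical.

```python
def to_invenio_subjects(terms, subjects_map):
    """Convert subject term text into Invenio subject entries.

    Terms found in the map become {"id": ...} entries that reference a
    vocabulary; everything else becomes an ID-less {"subject": ...} keyword.
    """
    subjects = []
    for term in terms:
        text = term.strip()
        vocab_id = subjects_map.get(text)
        if vocab_id:
            subjects.append({"id": vocab_id})
        else:
            subjects.append({"subject": text})
    return subjects


# Hypothetical map contents, for illustration only
example_map = {"Painting": "example-lc-id"}
print(to_invenio_subjects(["Painting", "Student work"], example_map))
```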
Copy the YAML vocabularies into the app_data/vocabularies directory of our Invenio instance. The site needs to be rebuilt to load the changes (invenio-cli services destroy and then invenio-cli services setup again). Eventually (Invenio v12) there will be a CLI command to alter vocabularies without rebuilding the site.
- migrate/record.py: Converts EQUELLA item JSON into Invenio record JSON
- migrate/api.py: Converts an item and POSTs it to Invenio to create a record
- migrate/import.py: Imports an item directory (created by the export tool) with its attachments to Invenio
To use these scripts, we must create a personal access token for an administrator account in Invenio:
- Sign in as an admin
- Go to Applications > Personal access tokens
- Create one; its name and the user:email scope (as of v12) do not matter
- Copy it to clipboard and Save
- Paste in .env and/or set it as an env var, e.g. set -x INVENIO_TOKEN xyz in fish
Below, we migrate a VAULT item to an Invenio record and post it to Invenio.
set -x INVENIO_TOKEN your_token_here
poetry run python migrate/api.py items/item.json # example output below
HTTP 201
https://127.0.0.1:5000/api/records/k7qk8-fqq15/draft
HTTP 202
{"id": "k7qk8-fqq15", "created": "2024-05-31T15:26:17.972009+00:00", ...
https://127.0.0.1:5000/records/k7qk8-fqq15
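The two responses in the output above (201, then 202) are consistent with creating a draft and then publishing it through the InvenioRDM REST API. The sketch below only builds the two requests without sending them. The helper names are hypothetical, and whether migrate/api.py is structured this way is an assumption.

```python
import json
import urllib.request


def build_create_request(base_url, token, record):
    """Build the POST to /api/records that creates a draft."""
    return urllib.request.Request(
        f"{base_url}/api/records",
        data=json.dumps(record).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def build_publish_request(base_url, token, record_id):
    """Build the POST that publishes an existing draft."""
    return urllib.request.Request(
        f"{base_url}/api/records/{record_id}/draft/actions/publish",
        headers={"Authorization": f"Bearer {token}"},
        method="POST",
    )
```

Each request can be passed to urllib.request.urlopen. A local development instance served over HTTPS with a self-signed certificate will likely also need a suitable SSL context.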
You can sometimes trip over yourself because Poetry automatically loads the .env file in the project root, which might contain an outdated personal access token. If API calls fail with 403 errors, check that the TOKEN and/or INVENIO_TOKEN environment variables are set correctly.
Rerunning the script with the same input creates additional records; it doesn't update existing ones.
We could write scripts to directly take an item from EQUELLA using its API, perform a metadata crosswalk, and post it to Invenio. Alternatively, we could work with local copies of items, perhaps created by the equella_scripts collection export tool.
We need to load the necessary fixtures, including user accounts, before adding to Invenio. For instance, the item owner needs to already be in Invenio before we can add them as owner of a record. If we attempt to load a record with a subject id that doesn't exist yet, we get a 500 error.
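One way to catch a missing subject before it causes a 500 error is a local pre-flight check. This is a sketch under stated assumptions: missing_subject_ids is a hypothetical helper, and known_ids stands in for whatever set of identifiers we have loaded into the instance (for example, ids gathered from the YAML files in the vocab dir).

```python
def missing_subject_ids(record, known_ids):
    """Return subject ids used by a record that are absent from known_ids.

    Keyword subjects ({"subject": ...}) carry no id and are ignored.
    """
    subjects = record.get("metadata", {}).get("subjects", [])
    used = {s["id"] for s in subjects if "id" in s}
    return sorted(used - set(known_ids))


# Hypothetical record and vocabulary ids, for illustration only
record = {"metadata": {"subjects": [{"id": "a"}, {"id": "b"}, {"subject": "keyword"}]}}
print(missing_subject_ids(record, {"a"}))
```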
We download metadata for all items using equella-cli and a script like this:
#!/usr/bin/env fish
set total (eq search -l 1 | jq '.available')
set length 50 # can only download up to 50 at a time
set pages (math floor $total / $length)
for i in (seq 0 $pages)
    set start (math $i \* $length)
    echo "Downloading items $start to" (math $start + $length)
    # NOTE: no attachment info, use "--info all" for both attachments & metadata
    eq search -l $length --info metadata --start $start > json/$i.json
end
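For reference, the paging arithmetic can be expressed in Python. This sketch only computes the start offsets and does not call equella-cli. The function name is hypothetical, and it uses ceiling division so that a total that is an exact multiple of the page size does not produce a trailing empty page.

```python
import math


def page_starts(total, length=50):
    """Return the start offset of each page needed to cover `total` items."""
    return [page * length for page in range(math.ceil(total / length))]


print(page_starts(120))  # offsets for three pages of up to 50 items
```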
We can use the item.metadata XML of existing VAULT items for testing. Generally, run poetry run python migrate/record.py items/item.json | jq to see the JSON Invenio record. See our crosswalk diagrams.
Schemas:
It's likely our schema is outdated/inaccurate in places.
How to map a field:
- Add a brief description to the mermaid diagram in docs/crosswalk.html
- Write a test in tests.py with your input XML and expected record output
- Write a Record method in migrate.py & use it in the Record::get() dict
- Run tests, optionally run a record migration as described above
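The test and method steps above might look roughly like this. Everything here is a hypothetical stand-in: the Record class below is not the project's class, and the mods/titleInfo/title path is an assumed XML layout chosen only for illustration.

```python
import xml.etree.ElementTree as ET


class Record:
    """Hypothetical stand-in for the project's Record class."""

    def __init__(self, xml_text):
        self.xml = ET.fromstring(xml_text)

    def title(self):
        # Assumed mapping: read the title text from mods/titleInfo/title
        return (self.xml.findtext("mods/titleInfo/title") or "").strip()

    def get(self):
        # Each mapped field's method is used in the dict returned here
        return {"metadata": {"title": self.title()}}


def test_title():
    xml = "<xml><mods><titleInfo><title> Test Title </title></titleInfo></mods></xml>"
    assert Record(xml).get()["metadata"]["title"] == "Test Title"
```

Writing the test first pins down the expected record output before the mapping method exists.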