This project aims to get old video files from CDS, programmatically extract their metadata, convert the metadata to a new format, and upload both the video and its metadata to CDS-Videos.
It was implemented as a local webserver with three main pages. The first page lets you convert a record's metadata from MARCXML to JSON. Once you have the JSON metadata, the second page lets you either upload the old record to the new platform or simply download the converted JSON metadata to your system. The third and last page shows the progress of your files being uploaded and warns you if something went wrong with any of the uploads.
Create a Python environment, with pyenv or conda for example, and run:
(your_environment)$ git clone https://github.com/Luizerko/cds-video-transfer
(your_environment)$ cd cds-video-transfer
(your_environment)$ pip install -r requirements.txt
Now install cds-dojson by running:
(your_environment)$ cd ..
(your_environment)$ git clone https://github.com/CERNDocumentServer/cds-dojson
(your_environment)$ cd cds-dojson
(your_environment)$ pip install -e .[tests]
(your_environment)$ cd ../cds-video-transfer
(your_environment)$ pip install -e ../cds-dojson
For the last part, change the default configuration of cds-dojson to make sure it works with Flask. Do that by changing line 108 of `cds-dojson/cds_dojson/overdo.py` from `if HAS_FLASK:` to `if not HAS_FLASK:`. Now you should have a working environment.
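If you have GNU sed, the same edit can be made from the command line (the path assumes the directory layout from the steps above):

(your_environment)$ sed -i '108s/if HAS_FLASK:/if not HAS_FLASK:/' ../cds-dojson/cds_dojson/overdo.py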
For managing the dependencies, the pip-tools package was used. If you want to add new dependencies to the repo, edit the `requirements.in` file and then run:
(your_environment)$ pip install pip-tools
(your_environment)$ pip-compile requirements.in
The `requirements.txt` file will be automatically updated. This dependency manager was used because it takes care of all the version conflicts generated by the sub-dependencies.
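For example, to add a hypothetical dependency (the package name below is purely illustrative), append it to `requirements.in` and then rerun pip-compile as above:

(your_environment)$ echo "requests" >> requirements.in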
If you want to migrate your videos, you need authorization from the CDS team, which means you need an access token to interact with the platform programmatically. Please get your access token and save it right outside the cds-video-transfer folder in a file named `access_token`.
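From inside the cds-video-transfer folder, that could be done like this (replace the placeholder with your actual token):

(your_environment)$ echo "YOUR_ACCESS_TOKEN" > ../access_token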
Also, since there were too many problems with the tags (legacy information, inconsistency and redundancy, for example), the project is still in an experimental phase. This means that videos still need to be migrated to CDS-Videos as soon as all the decisions about tags have been taken and cds-dojson has been properly updated. It also means that you need a local instance of CDS-Videos running to test the migration process - or you need to change the code appropriately to test it on sandbox/production.
Start by activating your Python environment, creating the database if you don't already have one, and then running the project locally:
(your_environment)$ python3 init_db.py
(your_environment)$ python3 video_extractor.py
When you have the webserver running locally, open your browser and go to `localhost:5555`. You'll find a plug-and-play website ready to transfer old video records from CDS to CDS-Videos.
- Single Record: Put your record ID in the 'Record' section and press submit to convert its MARCXML metadata to JSON metadata.
- Multiple Records or Queries: Indicate the record IDs as a comma-separated list (`first_number,second_number,third_number` for three records, for example), or search for a query like 'physics'. If your query fetches more than 10 videos, they will be migrated in chunks of 10 records.
- All Records: Migrate all the records from the Digital Memory Project in chunks of 10 records.
After you're done migrating all the records you want, remember to generate a file to update the CDS records properly, marking migrated records as `migrated_2023` using tag `980__b`:
(your_environment)$ python3 updating_cds.py
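The generated file marks each migrated record with an entry along these lines (a sketch based on standard MARCXML conventions for tag 980, subfield b; the exact structure produced by updating_cds.py may differ):

```xml
<datafield tag="980" ind1=" " ind2=" ">
  <subfield code="b">migrated_2023</subfield>
</datafield>
```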
Inside the `moving_images_data` folder, one can also find some preprocessed files:

- `update_cds` -> Generated MARCXML code to update CDS records.
- `migration_database.db` -> Database that stores the migration state for each processed record.
- `moving_images_<number>.xml` -> Processed MARCXML files with records from a query to CDS. They are numbered because of the pagination of requests, since the maximum number of records fetched by a query is 200.
- `<recid>.xml` -> Individual processed MARCXML from a specific record.
- `missing_tags` -> All tags that were found in the records' MARCXML files for a query, but were not processed.
- `missing_tags_examples` -> All the missing tags for each individual queried record. This file is primarily used to find examples of the missing tags.
- `missing_tags_values` -> All the unique values for each individual missing tag of the queried records.
- `moving_images_fails` -> All the failures and their errors when generating the JSON file for each individual queried record.
- `moving_images_json` -> All the generated JSON files for each individual queried record.
- `persistent_data` -> Folder with similar `moving_images_<number>.xml`, `missing_tags`, `missing_tags_examples`, `missing_tags_values`, `moving_images_fails` and `moving_images_json` files, but for the whole Digital Memory Project and non-migrated files.
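To inspect the migration state stored in `migration_database.db`, a minimal Python sketch like the one below can help. The table and column names are not documented here, so the snippet only lists the tables; check `init_db.py` for the actual schema:

```python
import sqlite3

# Open the migration state database (path assumes you are in the repo root).
conn = sqlite3.connect("moving_images_data/migration_database.db")

# List the tables so you can discover the schema defined in init_db.py.
for (table,) in conn.execute("SELECT name FROM sqlite_master WHERE type='table'"):
    print(table)

conn.close()
```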