Move ESA to K8s #2062

Closed
artntek opened this issue Feb 5, 2025 · 6 comments

artntek commented Feb 5, 2025

Tracking progress for moving https://data.esa.org/ from mn-ucsb-2.dataone.org to k8s prod cluster.

Add any notes to this issue, and follow checklist in sub-issue #2063

artntek commented Feb 5, 2025

Notes on rsync

  • Ceph is not mounted on the ESA host (mn-ucsb-2.dataone.org), so I'm rsyncing across hosts to datateam using the brooke login (see commands below).
  • Therefore, before running rsync on mn-ucsb-2, need to log into datateam and chown -R brooke:brooke the destination ceph directories /mnt/ceph/repos/esa/metacat and /mnt/ceph/repos/esa/postgresql.
  • After completing rsync, need to log into datateam again and chown those directories back to 59997 and 59996.

Commands

$ time sudo rsync -aHAX /var/esa/data/ [email protected]:/mnt/ceph/repos/esa/metacat/data/
real	1m14.854s
user	0m0.159s
sys	0m0.104s

brooke@mn-ucsb-2:~$ time sudo rsync -aHAX /var/esa/documents/ [email protected]:/mnt/ceph/repos/esa/metacat/documents/
real	0m5.177s
user	0m0.144s
sys	0m0.081s

brooke@mn-ucsb-2:~$ time sudo rsync -aHAX /var/esa/logs/ [email protected]:/mnt/ceph/repos/esa/metacat/logs/
real	0m2.261s
user	0m0.114s
sys	0m0.052s

brooke@mn-ucsb-2:~$ time sudo rsync -aHAX /var/lib/postgresql/14/ [email protected]:/mnt/ceph/repos/esa/postgresql/14/
real	1m8.735s
user	0m10.161s
sys	0m24.387s
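
For anyone repeating this on another host, the chown / rsync / chown-back order from the notes above can be sketched as a small plan builder. This is only an illustration, not something that was run for this migration: `build_plan`, the `remote` argument and the `restore_owner` mapping are hypothetical names, and the notes do not say which of the two IDs belongs to which directory, so the caller has to supply that.

```python
import shlex

# Source -> destination pairs, taken from the rsync commands above.
SYNC_PAIRS = [
    ("/var/esa/data/", "/mnt/ceph/repos/esa/metacat/data/"),
    ("/var/esa/documents/", "/mnt/ceph/repos/esa/metacat/documents/"),
    ("/var/esa/logs/", "/mnt/ceph/repos/esa/metacat/logs/"),
    ("/var/lib/postgresql/14/", "/mnt/ceph/repos/esa/postgresql/14/"),
]

# Destination directories whose ownership is changed around the transfer.
DEST_DIRS = ["/mnt/ceph/repos/esa/metacat", "/mnt/ceph/repos/esa/postgresql"]


def build_plan(remote, restore_owner):
    """Return (host, command) steps in the order they should be run.

    remote        -- hypothetical "user@host" rsync target (placeholder).
    restore_owner -- hypothetical mapping of destination dir -> owner to
                     chown back to once the transfer has finished.
    """
    plan = []
    # 1. On datateam: make the destinations writable by the login user.
    for dest in DEST_DIRS:
        plan.append(("datateam", f"chown -R brooke:brooke {shlex.quote(dest)}"))
    # 2. On mn-ucsb-2: push each tree with the same flags as above.
    for src, dst in SYNC_PAIRS:
        target = shlex.quote(f"{remote}:{dst}")
        plan.append(("mn-ucsb-2", f"sudo rsync -aHAX {shlex.quote(src)} {target}"))
    # 3. On datateam: hand ownership back to the original owners.
    for dest in DEST_DIRS:
        owner = shlex.quote(str(restore_owner[dest]))
        plan.append(("datateam", f"chown -R {owner} {shlex.quote(dest)}"))
    return plan
```

The function only builds strings, so the plan can be printed and reviewed before anything is executed by hand.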

artntek commented Feb 7, 2025

Indexer startup issue: on a fresh install, HashStore is not yet initialized, and the indexers come up before metacat. The indexer's HashStore lib therefore tries to initialize the store itself, but fails because it doesn't have write access. This should self-resolve once the metacat pod is up and running and has initialized HashStore:

dataone-indexer 20250207-14:44:57: [ERROR]: Dataone-indexer cannot initialize the Storage class since HashStoreFactory - Error creating 'FileHashStore' instance: java.nio.file.FileSystemException: /var/metacat/hashstore: Read-only file system [org.dataone.indexer.storage.Storage:<clinit>:28]
	at org.dataone.cn.indexer.IndexWorker.<init>(IndexWorker.java:225) [dataone-index-worker-3.1.1-shaded.jar:?]
org.dataone.hashstore.exceptions.HashStoreFactoryException: HashStoreFactory - Error creating 'FileHashStore' instance: java.nio.file.FileSystemException: /var/metacat/hashstore: Read-only file system
	at org.dataone.cn.indexer.IndexWorker.<init>(IndexWorker.java:209) [dataone-index-worker-3.1.1-shaded.jar:?]
	at org.dataone.hashstore.HashStoreFactory.getHashStore(HashStoreFactory.java:84) ~[dataone-index-worker-3.1.1-shaded.jar:?]
	at org.dataone.indexer.storage.Storage.<init>(Storage.java:61) ~[dataone-index-worker-3.1.1-shaded.jar:?]
	at org.dataone.cn.indexer.IndexWorker.main(IndexWorker.java:103) [dataone-index-worker-3.1.1-shaded.jar:?]
	at org.dataone.indexer.storage.Storage.<clinit>(Storage.java:26) [dataone-index-worker-3.1.1-shaded.jar:?]
	at org.dataone.cn.indexer.object.ObjectManager.<clinit>(ObjectManager.java:57) [dataone-index-worker-3.1.1-shaded.jar:?]
	at org.dataone.cn.indexer.IndexWorker.<init>(IndexWorker.java:225) [dataone-index-worker-3.1.1-shaded.jar:?]
	at org.dataone.cn.indexer.IndexWorker.<init>(IndexWorker.java:209) [dataone-index-worker-3.1.1-shaded.jar:?]
	at org.dataone.cn.indexer.IndexWorker.main(IndexWorker.java:103) [dataone-index-worker-3.1.1-shaded.jar:?]

artntek commented Feb 7, 2025

Database related issues

Metacat startup error: the log complains that database "metacat" does not exist. The database name is "esa", not "metacat"; postgresql.auth.database is correctly set to esa, and the configmap contains the correct value:

database.connectionURI=jdbc:postgresql://metacatesa-postgresql-hl/esa

Next step - check debug output for correct props init

Error:

 [edu.ucsb.nceas.metacat.startup.StartupRequirementsChecker:abort:351]
org.postgresql.util.PSQLException: FATAL: database "metacat" does not exist
	at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2733) ~[postgresql-42.7.4.jar:42.7.4]
	at org.postgresql.core.v3.QueryExecutorImpl.readStartupMessages(QueryExecutorImpl.java:2845) ~[postgresql-42.7.4.jar:42.7.4]
	at org.postgresql.core.v3.QueryExecutorImpl.<init>(QueryExecutorImpl.java:176) ~[postgresql-42.7.4.jar:42.7.4]
	at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:323) ~[postgresql-42.7.4.jar:42.7.4]
	at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:54) ~[postgresql-42.7.4.jar:42.7.4]
	at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:273) ~[postgresql-42.7.4.jar:42.7.4]
	at org.postgresql.Driver.makeConnection(Driver.java:446) ~[postgresql-42.7.4.jar:42.7.4]
	at org.postgresql.Driver.connect(Driver.java:298) ~[postgresql-42.7.4.jar:42.7.4]
	at java.sql.DriverManager.getConnection(Unknown Source) ~[java.sql:?]
	at java.sql.DriverManager.getConnection(Unknown Source) ~[java.sql:?]
	at edu.ucsb.nceas.metacat.database.DBConnection.openConnection(DBConnection.java:372) ~[metacat.jar:?]
	at edu.ucsb.nceas.metacat.database.DBConnection.openConnection(DBConnection.java:345) ~[metacat.jar:?]
	at edu.ucsb.nceas.metacat.database.DBConnection.<init>(DBConnection.java:83) ~[metacat.jar:?]
	at edu.ucsb.nceas.metacat.database.DBConnectionPool.initialDBConnectionPool(DBConnectionPool.java:187) ~[metacat.jar:?]
	at edu.ucsb.nceas.metacat.database.DBConnectionPool.<init>(DBConnectionPool.java:156) ~[metacat.jar:?]
	at edu.ucsb.nceas.metacat.database.DBConnectionPool.getInstance(DBConnectionPool.java:134) ~[metacat.jar:?]
	at edu.ucsb.nceas.metacat.startup.MetacatInitializer.initAfterMetacatConfig(MetacatInitializer.java:156) ~[metacat.jar:?]
	at edu.ucsb.nceas.metacat.startup.MetacatInitializer.contextInitialized(MetacatInitializer.java:103) [metacat.jar:?]

This was because pg_hba.conf didn't have the right permissions (it expected the db name to be metacat instead of esa). Fixed, and metacat can now connect.

Metacat ran the 2.19.0 -> 2.19.1 DB script and the 2.19.1 -> 3.0.0 script successfully, but then failed on the 3.0.0 -> 3.1.0 script:

metacat 20250210-17:21:51: [ERROR]: initializeContainerisedDBConfiguration(): error getting
metacat version (3.1.0) or database version (2.19.0). Error was: DBAdmin.upgradeDatabase -
SQL error when running upgrade scripts: ERROR: relation "db_version_id_seq" does not exist
[edu.ucsb.nceas.metacat.startup.K8sAdminInitializer:initK8sDBConfig:109]

Discovered this is because, in ESA, the default value for db_version_id refers to the sequence (db_version_id_seq) by its name as text, instead of treating it as a reference to the sequence itself:

esa=> \d db_version;
                                             Table "public.db_version"
    Column     |            Type             | Collation | Nullable |                   Default
---------------+-----------------------------+-----------+----------+----------------------------------------------
 db_version_id | bigint                      |           | not null | nextval('db_version_id_seq'::text::regclass)

(note the ::text in nextval('db_version_id_seq'::text::regclass)). Therefore, when the sequence is renamed, this default remains unchanged and still points at the old name. For comparison, here is the same query in GOA:

evos=> \d db_version;
                                          Table "public.db_version"
    Column     |            Type             | Collation | Nullable |                Default
---------------+-----------------------------+-----------+----------+----------------------------------------
 db_version_id | bigint                      |           | not null | nextval('db_version_id_seq'::regclass)

(note nextval('db_version_id_seq'::regclass))

Fixed this by doing:

ALTER TABLE db_version 
ALTER COLUMN db_version_id 
SET DEFAULT nextval('db_version_id_seq'::regclass);

and then the conversions ran as expected
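
In case other tables or deployments have the same problem, here is a minimal sketch of how the affected defaults could be spotted before running the upgrade. It assumes the defaults are read from the standard `information_schema.columns` view and that they are rendered the same way as in the `\d` output above; `FIND_DEFAULTS_SQL`, `resolves_by_name` and `flag_rows` are hypothetical names, and the query was not run as part of this migration.

```python
# Hypothetical query for listing sequence-backed column defaults.
FIND_DEFAULTS_SQL = """
SELECT table_name, column_name, column_default
FROM information_schema.columns
WHERE table_schema = 'public'
  AND column_default LIKE 'nextval(%'
"""


def resolves_by_name(column_default):
    """True if the default holds the sequence name as text.

    A default containing '::text::regclass' looks the sequence up by
    name each time it is evaluated, so it does not follow a rename.
    """
    return "::text::regclass" in column_default


def flag_rows(rows):
    """Return (table, column) pairs whose default needs the ALTER TABLE fix.

    rows -- iterable of (table_name, column_name, column_default) tuples,
            shaped like the result of FIND_DEFAULTS_SQL.
    """
    return [(table, column) for table, column, default in rows
            if default is not None and resolves_by_name(default)]
```

Fed with the two defaults shown above, `flag_rows` reports the ESA form (`::text::regclass`) and skips the GOA form (`::regclass`).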

artntek commented Feb 7, 2025

MetacatUI

startup error - can't mount PVC due to permissions:

  Warning  FailedMount  27s    kubelet            MountVolume.MountDevice failed for volume "cephfs-metacatesa-metacatui-theme" : rpc error: code = Internal desc = an error (exit status 32) occurred while running mount args: [-t ceph 10.0.3.131:6789,10.0.3.132:6789,10.0.3.133:6789:/volumes/k8ssubvolgroup/k8ssubvol/58cda964-ce10-4ff9-8242-983da0fd0da3/repos/esa/metacatui /var/lib/kubelet/plugins/kubernetes.io/csi/pv/cephfs-metacatesa-metacatui-theme/globalmount -o name=pdg-subvol-user,secretfile=/tmp/csi/keys/keyfile-620537753,mds_namespace=cephfs,_netdev] stderr: mount error 13 = Permission denied
  Warning  FailedMount  19s    kubelet            Unable to attach or mount volumes: unmounted volumes=[metacatesa-mcui-custom-theme-files], unattached volumes=[metacatesa-mcui-source-files kube-api-access-w6fpj metacatesa-mcui-custom-theme-files metacatesa-mcui-config-js metacatesa-mcui-config-all]: timed out waiting for the condition

Solved: the cause was an incorrect rootPath: in the PV definition, and then an incorrect subPath.

Note - installed the new ESA custom theme by doing a git clone of the NCEAS/metacatui-themes repo in the /mnt/ceph/repos/esa/metacatui directory, sharing the metacatui directory as the root of a PV, and setting the subPath of the PVC definition to point at metacatui-themes/src/esa/js/themes/esa.

artntek added this to the 3.1.0-deployment milestone Feb 10, 2025
artntek self-assigned this Feb 10, 2025

artntek commented Feb 11, 2025

hashstore conversion errors

201 failures, with this error:

metacat 20250212-16:59:13: [ERROR]: Cannot move the object esa.34.1 to hashstore since null [edu.ucsb.nceas.metacat.admin.upgrade.HashStoreUpgrader:convert:541]
org.dataone.exceptions.MarshallingException: null
	at org.dataone.service.util.TypeMarshaller.marshalTypeToOutputStream(TypeMarshaller.java:232) ~[d1_common_java-2.3.0.jar:?]
	at org.dataone.service.util.TypeMarshaller.marshalTypeToOutputStream(TypeMarshaller.java:202) ~[d1_common_java-2.3.0.jar:?]
	at edu.ucsb.nceas.metacat.admin.upgrade.HashStoreUpgrader.convertSystemMetadata(HashStoreUpgrader.java:491) ~[metacat.jar:?]
	at edu.ucsb.nceas.metacat.admin.upgrade.HashStoreUpgrader.convert(HashStoreUpgrader.java:519) ~[metacat.jar:?]
	at edu.ucsb.nceas.metacat.admin.upgrade.HashStoreUpgrader.lambda$upgrade$0(HashStoreUpgrader.java:258) ~[metacat.jar:?]
[...]
Caused by: javax.xml.bind.MarshalException
	at com.sun.xml.bind.v2.runtime.MarshallerImpl.write(MarshallerImpl.java:301) ~[jaxb-runtime-2.3.2.jar:2.3.2]
	at com.sun.xml.bind.v2.runtime.MarshallerImpl.marshal(MarshallerImpl.java:226) ~[jaxb-runtime-2.3.2.jar:2.3.2]
	at javax.xml.bind.helpers.AbstractMarshallerImpl.marshal(AbstractMarshallerImpl.java:80) ~[jakarta.xml.bind-api-2.3.2.jar:2.3.2]
	at org.dataone.service.util.TypeMarshaller.marshalTypeToOutputStream(TypeMarshaller.java:229) ~[d1_common_java-2.3.0.jar:?]
	... 9 more
Caused by: org.xml.sax.SAXParseException: cvc-pattern-valid: Value '' is not facet-valid with respect to pattern '[\s]*[\S][\s\S]*' for type 'NonEmptyString'.
	at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source) ~[xercesImpl-2.12.2.jar:?]
[...]
	at javax.xml.bind.helpers.AbstractMarshallerImpl.marshal(AbstractMarshallerImpl.java:80) ~[jakarta.xml.bind-api-2.3.2.jar:2.3.2]
	at org.dataone.service.util.TypeMarshaller.marshalTypeToOutputStream(TypeMarshaller.java:229) ~[d1_common_java-2.3.0.jar:?]
	... 9 more

With lots of help from Jing, we checked:

esa=> select * from systemmetadata where guid='esa.34.1';
-- 1 record; looked fine - nothing missing

esa=> select * from xml_access where guid='esa.34.1';
-- 3 records; looked fine - nothing missing

Tried getting the system metadata from the URL, on the original VM host:
https://data.esa.org/esa/d1/mn/v2/meta/esa.34.1

...which showed an error:

[Image: screenshot of the error]

...so then we checked the smreplicationpolicy table:

esa=> \x
Expanded display is on.
esa=> select * from smreplicationpolicy where guid='esa.34.1';
-[ RECORD 1 ]-------------
guid        | esa.34.1
member_node | urn:node:KNB
policy      | preferred
policy_id   | 449
-[ RECORD 2 ]-------------
guid        | esa.34.1
member_node |
policy      | blocked
policy_id   | 695

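This is the problem: for RECORD 2, the member_node is blank, which matches the empty value ('') that the NonEmptyString pattern rejects in the stack trace above.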
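
To make the link to the stack trace concrete, here is a minimal sketch that applies the pattern from the SAXParseException to the two member_node values. It is only an approximation: Python's `\s` covers more whitespace characters than XML Schema's, schema patterns are implicitly anchored (hence `fullmatch`), and `is_non_empty_string` is a hypothetical helper, not part of metacat or d1_common_java.

```python
import re

# Pattern copied from the SAXParseException message above.
NON_EMPTY_STRING = re.compile(r"[\s]*[\S][\s\S]*")


def is_non_empty_string(value):
    """Approximate the schema's NonEmptyString check with a full match."""
    return NON_EMPTY_STRING.fullmatch(value) is not None
```

With this, `is_non_empty_string("urn:node:KNB")` is True, while `is_non_empty_string("")` is False, consistent with a blank member_node being rejected when the system metadata is marshalled.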

esa=> select count(*) from smreplicationpolicy where member_node='';
count | 201
-- 201 conversion errors, and 201 blank fields!

esa=> select * from smreplicationpolicy where policy='blocked' and not member_node='';
(0 rows)
-- every blocked entry has a blank member_node; none has a node id

esa=> select distinct member_node from smreplicationpolicy where policy='preferred';
-[ RECORD 1 ]-------------
member_node | urn:node:KNB

...and there were no restrictions set in metacat.properties

# The default replication policy
dataone.replicationpolicy.default.numreplicas=0
dataone.replicationpolicy.default.preferredNodeList=
dataone.replicationpolicy.default.blockedNodeList=

...so we deleted the troublesome entries:

esa=> delete from smreplicationpolicy where member_node='' and policy='blocked';
DELETE 201
esa=> COMMIT;

Finally, set the status back to 'pending':

esa=> update version_history set storage_upgrade_status='pending' where status='1';
UPDATE 1
esa=> COMMIT;

...and restarted the pod. It converted those 201 with no problems.

System metadata from the URL works fine on the new k8s host: https://esa-prod.test.dataone.org/esa/d1/mn/v2/meta/esa.34.1

artntek commented Feb 12, 2025

Final step: deployed and all set up to point at the prod CN. Nick sent an email to ESA to ask them to change the DNS to point to k8s. When that happens, it should switch over seamlessly, and we can take down the old version.

artntek moved this to In Progress in Metacat Releases Feb 13, 2025
artntek moved this from In Progress to Done in Metacat Releases Feb 13, 2025
artntek closed this as completed Feb 13, 2025