Set up kg-ontoml to call NEAT to train classifiers and get metrics (AUC, precision, recall) #16

Closed
justaddcoffee opened this issue Mar 25, 2022 · 4 comments · Fixed by #19

@justaddcoffee

Per our conversation today with @GuoJing @caufieldjh in the OntoML meeting, we'd like to set up kg-ontoml to call NEAT and train classifiers (logistic regression, random forest, and MLP).

For each graph we want to run the learning task on, we will write a NEAT.yaml and upload an embedding file; the existing kg-hub-scheduler will then train three classifiers (logistic regression, random forest, and an MLP whose layers will not change across experiments) and emit metrics such as validation AUC, precision, and recall. The NEAT.yaml should have no graph block, an embedding block pointing to an embedding file that already exists (a sketch of that block follows the classifier example below), and a classifier block that looks roughly like this:

classifier:
  edge_method: Average # one of EdgeTransformer.methods: Hadamard, Sum, Average, L1, AbsoluteL1, L2, or alternatively a lambda
  classifiers:  # a list of classifiers to be trained
    - type: neural network
      model:
        outfile: "model_mlp_test_yaml.h5"
        classifier_history_file_name: "mlp_classifier_history.json"
        type: tensorflow.keras.models.Sequential
        layers:
          - type: tensorflow.keras.layers.Input
            parameters:
              shape: 868   # must match embedding_size up above
          - type: tensorflow.keras.layers.Dense
            parameters:
              units: 128
              activation: relu
          - type: tensorflow.keras.layers.Dense
            parameters:
              units: 32
              activation: relu
              # TODO: fix this:
              # activity_regularizer: tensorflow.keras.regularizers.l1_l2(l1=1e-5, l2=1e-4)
          - type: tensorflow.keras.layers.Dropout
            parameters:
              rate: 0.5
          - type: tensorflow.keras.layers.Dense
            parameters:
              units: 16
              activation: relu
          - type: tensorflow.keras.layers.Dense
            parameters:
              units: 1
              activation: sigmoid
      model_compile:
        loss: binary_crossentropy
        optimizer: nadam
        metrics:  # these can be tensorflow objects or a string that tensorflow understands, e.g. 'accuracy'
          - type: tensorflow.keras.metrics.AUC
            parameters:
              curve: PR
              name: auprc
          - type: tensorflow.keras.metrics.AUC
            parameters:
              curve: ROC
              name: auroc
          - type: tensorflow.keras.metrics.Recall
            parameters:
              name: Recall
          - type: tensorflow.keras.metrics.Precision
            parameters:
              name: Precision
          - type: accuracy
      model_fit:
        parameters:
          batch_size: 4096
          epochs: 5  # typically much higher
          callbacks:
            - type: tensorflow.keras.callbacks.EarlyStopping
              parameters:
                monitor: val_loss
                patience: 5
                min_delta: 0.001  # min improvement to be considered progress
            - type: tensorflow.keras.callbacks.ReduceLROnPlateau
    - type: Decision Tree
      model:
        outfile: "model_decision_tree_test_yaml.h5"
        type: sklearn.tree.DecisionTreeClassifier
        parameters:
          max_depth: 30
          random_state: 42
    - type: Random Forest
      model:
        outfile: "model_random_forest_test_yaml.h5"
        type: sklearn.ensemble.RandomForestClassifier
        parameters:
          n_estimators: 500
          max_depth: 30
          n_jobs: 8  # cpu count
          random_state: 42
    - type: Logistic Regression
      model:
        outfile: "model_lr_test_yaml.h5"
        type: sklearn.linear_model.LogisticRegression
        parameters:
          random_state: 42
          max_iter: 1000
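
For reference, here is a minimal sketch of the embedding block described above. The embedding_file_name key is taken from the discussion below; the block name and exact nesting are assumptions (check the NEAT docs), and the filename is hypothetical:

embedding:
  # hypothetical pre-built embedding file; its dimensionality must match
  # the Input shape (868) in the classifier block above
  embedding_file_name: "existing_embedding.tsv"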

Guojing also has a GNN set up that will likely do well on this learning task. For this, he will also produce embeddings, which we can run through the above NEAT pipeline to assess how it does with the HP-MP task. (Guojing will also investigate using the GNN directly on this HP-MP task, without making embeddings, but this won't be a part of the feature described in this ticket.)

@caufieldjh

See also #5

@caufieldjh

For purposes of uploading an embedding, this may mean allowing embedding_file_name to be a URL, much as with the graph, so that NEAT can move it from S3 to gcloud.
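
For instance, something like the following (a hypothetical sketch: the bucket URL is invented, and the block shape is assumed to match the embedding sketch above):

embedding:
  # hypothetical S3 URL for NEAT to fetch and stage on gcloud
  embedding_file_name: "https://example-bucket.s3.amazonaws.com/existing_embedding.tsv"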

@caufieldjh commented Mar 30, 2022

Will probably need to take care of this too:
Knowledge-Graph-Hub/neat-ml#43
(for purposes of parsing whether a URL refers to a compressed file in general)
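
Presumably the case to handle looks like the following (hypothetical URL; deciding from the extension alone whether decompression is needed is what neat-ml#43 covers):

embedding:
  # hypothetical compressed remote file; the .gz extension is the only hint
  # that NEAT must decompress before use
  embedding_file_name: "https://example-bucket.s3.amazonaws.com/existing_embedding.tsv.gz"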

@caufieldjh linked a pull request Mar 30, 2022 that will close this issue
@caufieldjh

Blocked until Knowledge-Graph-Hub/neat-ml#64 is resolved. We can provide a graph and dummy positive/negative graphs, but this appears to lead to other errors.
