
RFC: Reflect build-errors back out again #777

Open
epa095 opened this issue Dec 10, 2019 · 2 comments
Labels: need elaboration (Issues that need further elaboration)

Comments


epa095 commented Dec 10, 2019

Problem
We build some models, and we fail others. But to figure out why a model failed you must find the Argo workflow containing that model and look at it. Often the exit code is enough; other times you must look at the log of the pod. How can we expose this information out of the cluster? Thoughts?

Thoughts

  1. I think this information should first be gathered and exposed in some reasonable way inside the k8s cluster, and then we can simply expose it over HTTP using a dumb server, instead of building a smart HTTP server that knows a lot about the internal k8s setup. Agree?
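For concreteness, a minimal sketch of what such a "dumb" server could look like: it knows nothing about k8s and just serves a status map as JSON. All names and the shape of the status dict are hypothetical; in the real setup some small sync loop would populate it from the cluster.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Hypothetical: in practice this dict would be filled in by something
# that watches the cluster; here it is hard-coded for illustration.
MODEL_STATUSES = {
    "model-a": {"status": "Succeeded", "exit_code": 0},
    "model-b": {"status": "Failed", "exit_code": 137},
}

class StatusHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Serve the whole status map as JSON; no k8s knowledge needed here.
        body = json.dumps(MODEL_STATUSES).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        # Silence per-request logging for this sketch.
        pass

def serve(port=0):
    """Start the server on a background thread; port 0 picks a free port."""
    server = ThreadingHTTPServer(("127.0.0.1", port), StatusHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

The point of the sketch is only the split of responsibilities: all cluster-specific logic stays inside the cluster, and the HTTP layer is trivially simple.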

If we agree on the above point, where in k8s should this information be?

  1. We can create a model object for failed models as well, containing the status (Failed) and the exit code of the container. This means that kubectl get models doesn't give working models, but rather desired models, and it can be filtered on the status. gordo-controller can still write some summary statistics into the gordo (e.g. number of failed models per exit code), but "the truth" is in the models.
  2. We can create a failed-model object. But this seems quite weird compared to how other k8s objects are handled.
  3. We can store the information about failed models back into (and only in) the gordo. So either we write the status/exit code directly back into the config dictionary, or, maybe better, add another map (for example in the status field) from model name to exit code / status?
    The gordo then functions as a kind of log. Problem with this: the gordo is already pressed for size, and this will increase it a bit.
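To make option 3 concrete, this is roughly what a status map written back into the Gordo object could look like. Everything here is made up for illustration (the apiVersion, kind, and field names are assumptions, not the actual Gordo CRD schema):

```yaml
# Hypothetical sketch only: not the real Gordo CRD.
apiVersion: example.com/v1
kind: Gordo
metadata:
  name: my-project
status:
  model-statuses:
    model-a:
      status: Succeeded
      exit-code: 0
    model-b:
      status: Failed
      exit-code: 137
```

One nice property of putting this in the status field is that it stays out of the user-supplied spec, which is the part that is already pressed for size.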

I guess a core question is: does kubectl get models give desired models or successful models?
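Under the "desired models" reading (option 1), finding the failures is just a filter on the status field. A small sketch of that filtering, with a hard-coded model list standing in for what the API server would return (names and exit codes are hypothetical):

```python
def failed_models(models):
    """Return (name, exit_code) for every model whose status is Failed."""
    return [
        (m["name"], m["exit_code"])
        for m in models
        if m["status"] == "Failed"
    ]

# Stand-in for the output of listing all desired models.
models = [
    {"name": "model-a", "status": "Succeeded", "exit_code": 0},
    {"name": "model-b", "status": "Failed", "exit_code": 137},
    {"name": "model-c", "status": "Failed", "exit_code": 1},
]

print(failed_models(models))  # [('model-b', 137), ('model-c', 1)]
```

The same filter could of course be done directly with kubectl's output formatting instead of client-side code.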

@ryanjdillon
Contributor

If we were to use another service for aggregating logs, I found these while poking around: fluentd and ELK on Kubernetes. I like the idea of having a Kibana dashboard with all the essential details on broadcast.

As for the core question, I like the idea of getting desired models and then grepping, etc.

@flikka added the "need elaboration" label Jan 30, 2020

flikka commented Jan 30, 2020

Maybe @milesgranger has thoughts on this.
