
RFC: Reflect build-errors back out again #777

Open
epa095 opened this issue Dec 10, 2019 · 2 comments
Labels: need elaboration (Issues that need further elaboration)

Comments


epa095 commented Dec 10, 2019

Problem
We build some models, and we fail others. But to figure out why a model failed you must find the Argo workflow containing that model and look at it. Often the exit code is enough; other times you must look at the log of the pod. How can we expose this information out of the cluster? Thoughts?

Thoughts

  1. I think this information should first be gathered and exposed in some reasonable way inside the k8s cluster, and then we can simply expose it over HTTP using a dumb server, instead of building a smart HTTP server that knows a lot about the internal k8s setup. Agree?
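For concreteness, a minimal sketch of what such a "dumb" server could look like: it knows nothing about k8s and just serves a status map as JSON. All names and the shape of the status dict are hypothetical; in the real setup some small sync loop would populate it from the cluster.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Hypothetical: in practice this dict would be filled in by something
# that watches the cluster; here it is hard-coded for illustration.
MODEL_STATUSES = {
    "model-a": {"status": "Succeeded", "exit_code": 0},
    "model-b": {"status": "Failed", "exit_code": 137},
}

class StatusHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Serve the whole status map as JSON; no k8s knowledge needed here.
        body = json.dumps(MODEL_STATUSES).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        # Silence per-request logging for this sketch.
        pass

def serve(port=0):
    """Start the server on a background thread; port 0 picks a free port."""
    server = ThreadingHTTPServer(("127.0.0.1", port), StatusHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

The point of the sketch is only the split of responsibilities: all cluster-specific logic stays inside the cluster, and the HTTP layer is trivially simple.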

If we agree on the above point, where in k8s should this information be?

  1. We can create a model object for failed models as well, containing the status (Failed) and the exit code of the container. This means that kubectl get models doesn't give working models, but rather desired models, and it can be filtered on the status. gordo-controller can still write some summary statistics into the gordo (e.g. number of failed models per exit code), but "the truth" is in the models.
  2. We can create a failed-model object. But this seems quite weird compared to how other k8s objects are handled.
  3. We can store the information about failed models back into (and only in) the gordo. So either we write the status/exit code directly back into the config dictionary, or, maybe better, add another map (for example in the status field) from model name to exit code / status?
    The gordo then functions as a kind of log. Problem with this: the gordo is already pressed for size, and this will increase it a bit.
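To make option 3 concrete, this is roughly what a status map written back into the Gordo object could look like. Everything here is made up for illustration (the apiVersion, kind, and field names are assumptions, not the actual Gordo CRD schema):

```yaml
# Hypothetical sketch only: not the real Gordo CRD.
apiVersion: example.com/v1
kind: Gordo
metadata:
  name: my-project
status:
  model-statuses:
    model-a:
      status: Succeeded
      exit-code: 0
    model-b:
      status: Failed
      exit-code: 137
```

One nice property of putting this in the status field is that it stays out of the user-supplied spec, which is the part that is already pressed for size.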

I guess a core question is: does kubectl get models give desired models or successful models?
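Under the "desired models" reading (option 1), finding the failures is just a filter on the status field. A small sketch of that filtering, with a hard-coded model list standing in for what the API server would return (names and exit codes are hypothetical):

```python
def failed_models(models):
    """Return (name, exit_code) for every model whose status is Failed."""
    return [
        (m["name"], m["exit_code"])
        for m in models
        if m["status"] == "Failed"
    ]

# Stand-in for the output of listing all desired models.
models = [
    {"name": "model-a", "status": "Succeeded", "exit_code": 0},
    {"name": "model-b", "status": "Failed", "exit_code": 137},
    {"name": "model-c", "status": "Failed", "exit_code": 1},
]

print(failed_models(models))  # [('model-b', 137), ('model-c', 1)]
```

The same filter could of course be done directly with kubectl's output formatting instead of client-side code.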

@ryanjdillon
Contributor

If we were to use another service for aggregating logs, I found these while poking around: fluentd and ELK on Kubernetes. I like the idea of having a Kibana dashboard with all the essential details on broadcast.

As for the core question, I like the idea of getting desired models and then grepping, etc.

@flikka added the "need elaboration" label Jan 30, 2020

flikka commented Jan 30, 2020

Maybe @milesgranger has thoughts on this.
