Problem
We build some models; others fail. But to figure out why a model failed you must find the argo workflow containing that model and look at it. Often the exit code is enough; other times you must look at the log of the pod. How can we expose this information out of the cluster? Thoughts?
Thoughts
I think this information should be gathered and exposed in some reasonable way inside the k8s cluster first, and then we can simply expose it over HTTP using a dumb server, instead of building a smart HTTP server which knows a lot about the internal k8s setup. Agree?
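For illustration, here is a minimal sketch of what such a dumb server could look like, assuming the failure information ends up on some custom resource that can be listed from inside the cluster. The group/version/plural and field names below are placeholders, not the actual gordo API:

```python
# Minimal sketch of the "dumb" HTTP server idea: it only reads whatever is
# already stored in the cluster and serves it as JSON, with no knowledge of
# argo or the build internals. All resource names here are hypothetical.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

from kubernetes import client, config

config.load_incluster_config()  # assumes the server runs inside the cluster
api = client.CustomObjectsApi()


class ModelStatusHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Hypothetical group/version/plural for the Model custom resource.
        models = api.list_namespaced_custom_object(
            group="example.com", version="v1", namespace="default", plural="models"
        )
        summary = [
            {
                "name": m["metadata"]["name"],
                "status": m.get("status", {}).get("phase"),
                "exit_code": m.get("status", {}).get("exitCode"),
            }
            for m in models.get("items", [])
        ]
        body = json.dumps(summary).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)


HTTPServer(("", 8080), ModelStatusHandler).serve_forever()
```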
If we agree on the above point, where in k8s should this information be?
We can create a Model object for failed models as well, containing the status (Failed) and the exit code of the container. This means that kubectl get models doesn't give working models, but rather desired models, and it can be filtered on the status. gordo-controller can still write some summary statistics into the gordo (e.g. the number of failed models per exit code), but "the truth" is in the models.
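As a sketch of that summary statistic, assuming the Model object carries something like status.phase and status.exitCode (both names are assumptions, not the actual schema):

```python
# Rough sketch: count failed models per exit code from a list of Model objects,
# e.g. the .items of `kubectl get models -o json`. Field names are assumptions.
from collections import Counter


def failed_models_per_exit_code(models):
    return Counter(
        m["status"].get("exitCode")
        for m in models
        if m.get("status", {}).get("phase") == "Failed"
    )
```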
We can create a failed-model object. But this seems quite weird compared to how other k8s objects are handled.
We can store the information about failed models back into (and only in) the gordo. So either we write the status/exit code directly back into the config dictionary, or maybe better: add another map (for example in the status field) from model name to exit code / status?
Then the gordo functions as a kind of log. Problems with this: the gordo is already pressed for size, and this will increase it a bit.
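A rough sketch of what writing that map back could look like with the Python kubernetes client, assuming the gordo CRD has a status subresource (the group/version/plural and field names are again placeholders for illustration):

```python
# Sketch of the third option: patch a model-name -> {phase, exitCode} map into
# the gordo's status field instead of creating per-model objects.
from kubernetes import client, config

config.load_incluster_config()
api = client.CustomObjectsApi()


def record_model_result(gordo_name, namespace, model_name, phase, exit_code):
    patch = {"status": {"models": {model_name: {"phase": phase, "exitCode": exit_code}}}}
    api.patch_namespaced_custom_object_status(
        group="example.com",
        version="v1",
        namespace=namespace,
        plural="gordos",
        name=gordo_name,
        body=patch,
    )
```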
I guess a core question is: Does kubectl get models give desired models or successful models?
If we were to use another service for aggregating logs, I found these while poking around: fluentd and ELK on Kubernetes. I like the idea of having a Kibana dashboard with all the essential deets on broadcast.
As for the core question, I like the idea of getting desired models and then grepping, etc.