run started to error out #150
@yarikoptic The error does not appear to have anything to do with specific PRs; rather, it's tied to specific (broken?) workflows which have an empty path.
@yarikoptic For filling out the GitHub ticket, I need to know when this started.
hm, fun. Well - the dates are in the logs above pretty much since we run those cron jobs quite regularly. In both cases it started on Oct 01. First email failing for dandi-cli:
for datalad:
Ticket filed with GitHub. I associated it with the "con" organization, so you might(?) be able to see it: https://support.github.com/ticket/personal/0/1818408
Unfortunately I can't see it, but that's ok I guess. Is there anything we need to fix in those workflows triggering this issue and/or any way tinuous could work around it ATM?
@yarikoptic We could make tinuous skip workflows with an empty path, but I'm not entirely sure whether that can legitimately happen.
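For reference, a minimal sketch of what such a skip could look like when walking the raw REST listing (the function name and logging are illustrative, not the actual tinuous code):

```python
import logging
import requests

log = logging.getLogger(__name__)


def iter_usable_workflows(owner: str, repo: str, session: requests.Session):
    """Yield workflow records, skipping (with a warning) any whose "path" is empty."""
    r = session.get(
        f"https://api.github.com/repos/{owner}/{repo}/actions/workflows?per_page=100"
    )
    r.raise_for_status()
    for wf in r.json()["workflows"]:
        if not wf.get("path"):
            log.warning("Skipping workflow %r with empty path", wf.get("name"))
            continue
        yield wf
```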
ah gotcha - so you suspect that the workflow information returned by github is buggy... ok, let's wait a day for a possible response to that github support ticket you filed.
FTR: it seems it has started to error out also for datalad-extensions. Let's hope the github folks figure it out and solve it soonish. edit: and also for heudiconv, for which I tried to add con/tinuous
if the issue is due to github duplicating workflow records in its listing, where the 2nd copy doesn't have a path, couldn't we detect duplicate incomplete entries and just skip them? (after some
It's not OK to assume that. The "name" is just whatever's filled in for the "name" key at the top of the workflow file, and it's perfectly possible for two or more workflows to have the same name.
Using the below script, I get a 500 when trying to list the runs for the first datalad/datalad-extensions workflow (which has a unique name and a

```python
#!/usr/bin/env python3
import click
from ghrepo import GHRepo
import requests


@click.command()
@click.argument("repo", type=GHRepo.parse)
def main(repo):
    with requests.Session() as s:
        r = s.get(f"{repo.api_url}/actions/workflows?per_page=100")
        r.raise_for_status()
        for workflow in r.json()["workflows"]:
            print("Name:", repr(workflow["name"]))
            print("Path:", repr(workflow["path"]))
            r = s.get(workflow["url"] + "/runs")
            if not r.ok:
                print("ERROR:", r.status_code)
            else:
                print("Runs:", r.json()["total_count"])
            print()


if __name__ == "__main__":
    main()
```
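For reference: since `GHRepo.parse` accepts the `OWNER/NAME` form, saving the above as e.g. `list_workflow_runs.py` (a hypothetical filename) and running `python3 list_workflow_runs.py datalad/datalad-extensions` prints the name, path, and either the run count or the error status for each listed workflow.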
interesting, so -- it 500s on an extension which no longer exists although it existed a long time ago:
well beyond the date we care about. So, for each one of those 500s we just need to establish (and ideally cache, but we can maybe tolerate querying each time for now) the date when it was removed from the repo. Is there a github API to provide commits for a path? (I couldn't find one quickly)
@yarikoptic I don't see anything obvious. Note that my script is able to retrieve runs for
wouldn't that mean that we potentially could lose some logs? e.g. if
also -- I checked the mail archives - it seems that the "Extensions" workflow was returned before even after its removal, e.g.
or
so it seems that they should be returned.
hey -- we do have all the needed information, kinda:
so we have
@yarikoptic Detecting whether a 500 error occurred isn't straightforward. We've configured PyGithub to retry all 500 errors (as well as some other 5xx errors) up to 12 times, and only if all attempts fail (after about 12 and a half minutes) is an exception raised. Unfortunately, the
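For context, a minimal sketch of the kind of retry configuration being described, assuming PyGithub's `Github()` is given a urllib3 `Retry` object via its `retry=` argument (the parameter values here are illustrative, not necessarily the exact ones tinuous uses):

```python
from github import Github
from urllib3.util.retry import Retry

# Retry 500 (and some other 5xx) responses up to 12 times with backoff;
# only after all attempts fail does an exception propagate to the caller.
retrier = Retry(
    total=12,
    backoff_factor=1.25,
    status_forcelist=[500, 502, 503, 504],
)
gh = Github("<token>", retry=retrier)
```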
We are talking about https://github.com/con/tinuous/blob/master/src/tinuous/github.py#L66 which uses
Just check for 500 or for something else? If we only check for 500s, I don't see how that'd be different from not retrying on 500. If we also check details of the workflow at that point, how would
only for 500. It will be different in that we will perform more checks for that 500 - e.g. that the date is still "of interest", etc.
We will have the request URL - wouldn't it carry the org/repo/workflow path as part of that endpoint which is 500ing now?
Assuming that
We also need the
I envisioned that increment to be overloaded as:

```python
def increment(self, *args, **kwargs):
    ret = super().increment(*args, **kwargs)
    if do_checks_for_the_url_say_it_is_500_to_skip:
        raise SkipThisWorkflow()
    return ret
```

so it would just perform as the usual Retry otherwise, and raise our ad-hoc exception if it runs into our use case
we know what workflow we are talking about, so we could perform a (possibly duplicate) request to get any information we want to get that
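Putting the last few comments together, a hedged sketch of how that could fit into a Retry subclass; the exception name, the subclass, and the "empty path means broken" heuristic are all hypothetical, only `Retry.increment()` itself is real urllib3 API:

```python
import requests
from urllib3.util.retry import Retry


class SkipThisWorkflow(Exception):
    """Ad-hoc signal: stop retrying and skip this workflow's runs."""


def workflow_looks_broken(runs_url: str) -> bool:
    # The "possibly duplicate" request: re-fetch the workflow record itself
    # (the runs URL minus the trailing "/runs") and treat an empty "path"
    # as the marker of a broken listing entry.
    base = "https://api.github.com" if runs_url.startswith("/") else ""
    r = requests.get(base + runs_url[: -len("/runs")])
    return r.ok and not r.json().get("path")


class WorkflowAwareRetry(Retry):
    def increment(self, method=None, url=None, response=None, error=None, **kwargs):
        ret = super().increment(
            method=method, url=url, response=response, error=error, **kwargs
        )
        if (
            response is not None
            and response.status == 500
            and url is not None
            and "/actions/workflows/" in url
            and url.endswith("/runs")
            and workflow_looks_broken(url)
        ):
            raise SkipThisWorkflow(url)
        return ret
```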
The
so it is in our direct power to access it, although possibly breaking the beauty of the design, right? The ugly workaround to achieve that would be to populate some global var or env var accessible to the entire process/all async jobs.
Let me describe the rationale: I am just trying to get all the logs flowing again. I do hope that the github team will fix the underlying issue, but as we have no estimate on when/if that would happen, I do not want us to hold our breath.
@yarikoptic I think I have a way to mostly-robustly identify if a workflow run request failed due to a 500, but it still involves waiting to retry all 12 times for 12.5 minutes. Would that be acceptable?
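For illustration, one way such detection might look, assuming the exhausted retries surface as a `requests.exceptions.RetryError` wrapping urllib3's `MaxRetryError` (a sketch of the general mechanism, not necessarily how tinuous implements it):

```python
import requests
from urllib3.exceptions import MaxRetryError


def is_exhausted_500(exc: Exception) -> bool:
    # After all retries fail on bad statuses, requests raises RetryError whose
    # first argument is urllib3's MaxRetryError; its .reason describes the last
    # failure, e.g. "too many 500 error responses".
    if isinstance(exc, requests.exceptions.RetryError) and exc.args:
        inner = exc.args[0]
        if isinstance(inner, MaxRetryError):
            return "500" in str(inner.reason)
    return False
```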
I think so, unless it would bring us into some other rate limiting or similar problem.
Warn on & skip workflow runs for certain "broken" GitHub workflows
@yarikoptic GitHub support has replied to my ticket, saying they've fixed the underlying issue. The script I posted above no longer shows any problematic workflows for dandi/dandi-cli, datalad/datalad, or datalad/datalad-extensions.
cool -- great to see things fixed up! Shouldn't we then revert the skipping added in https://github.com/con/tinuous/pull/151/files#diff-ae556ebe88a885b67c7348cdd53576ce30426c59a5d439d58919251d8865fbb0R116 ?
@yarikoptic I don't see a need to remove it. |
Then, if github gets broken again, wouldn't we -- as the only force in the universe capable of detecting such cases -- miss that and not report it to github to get it fixed again?
It is ok to doubt, but there is no empirical support for that, and we did have to report an issue (so there is some indirect evidence that the issue might have been unknown)
con/tinuous is typically run by a cron job with
true, but we would trigger GitHub to fix it, hopefully faster this time. Without this churn we might just keep accumulating missing logs until someone mentions the warning or does some archaeological expedition. In the ideal case -- this error does not re-emerge and we just have assurance that we are not missing any logs. So I would prefer to revert the
happens consistently on datalad, seemingly always on the same PR
and started to fail on dandi-cli
since it is so consistent on the same PR -- maybe it is indeed some github issue??? Should be investigated and possibly filed with github. I do not think I saw a similar problem from any other cron job -- only from datalad and dandi-cli.