Optimise 404 responses #362

Closed

tleb opened this issue Dec 19, 2024 · 2 comments


tleb commented Dec 19, 2024

For some reason, almost all of our responses that take more than two seconds are 404s. I have investigated quite a bit and haven't yet found an explanation.

Some stats from two weeks' worth of data (see the log-parsing sketch after the numbers for how they can be derived):

requests:                     29_012_453
>2s requests:                  2_317_542
>2s 404 /ident/ requests:      1_593_969, 69%
>2s non-404 /ident/ requests:      3_750
>2s 200 requests:                     54
>2s non-404 requests:              4_993
>2s 404 requests:              2_312_549, 99.8%
>2s 404 /source/ requests:       718_466, 31%
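
For reference, a minimal sketch of how such numbers could be tallied from an Apache access log. The log layout (combined LogFormat with %D, the request duration in microseconds, appended as the last field) and the field positions are assumptions, not the actual production configuration:

```python
#!/usr/bin/env python3
# Sketch: tally slow (>2 s) requests by status and endpoint from an Apache
# access log read on stdin. Field positions assume a combined LogFormat with
# %D (request duration in microseconds) appended as the last field; adjust
# them to whatever the real LogFormat is.
import sys
from collections import Counter

SLOW_US = 2_000_000  # 2 seconds, expressed in microseconds
counts = Counter()

for line in sys.stdin:
    fields = line.split()
    try:
        path = fields[6]               # request path, from the "%r" field
        status = fields[8]             # final HTTP status code
        duration_us = int(fields[-1])  # %D appended at the end of the line
    except (IndexError, ValueError):
        continue                       # skip malformed lines

    counts["requests"] += 1
    if duration_us <= SLOW_US:
        continue

    counts[">2s requests"] += 1
    counts[f">2s {status} requests"] += 1
    if status != "404":
        counts[">2s non-404 requests"] += 1
    if "/ident/" in path:
        kind = "404" if status == "404" else "non-404"
        counts[f">2s {kind} /ident/ requests"] += 1
    elif "/source/" in path:
        counts[f">2s {status} /source/ requests"] += 1

for key, value in counts.most_common():
    print(f"{key:35} {value:>12_}")
```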

Clearly something is wrong. Pretty graphs agree (log scale, note the 2s spike for 404s which isn't present for 200s):

[Figure: timings-200]

[Figure: 404-timings]
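
For completeness, a sketch of how per-status, log-scale timing histograms like these can be regenerated from the same (assumed) log layout as the parsing sketch above; the binning is arbitrary and this is not what produced the graphs in this comment:

```python
# Sketch: plot a log-scale response-time histogram per status class, reading
# an access log on stdin with the same assumed field layout as above.
import sys
import matplotlib.pyplot as plt

durations = {"200": [], "404": []}
for line in sys.stdin:
    fields = line.split()
    try:
        status = fields[8]
        duration_s = int(fields[-1]) / 1_000_000  # %D is in microseconds
    except (IndexError, ValueError):
        continue
    if status in durations:
        durations[status].append(duration_s)

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
for ax, (status, values) in zip(axes, durations.items()):
    ax.hist(values, bins=200)
    ax.set_yscale("log")              # log scale, as in the graphs above
    ax.set_title(f"{status} responses")
    ax.set_xlabel("response time (s)")
plt.tight_layout()
plt.savefig("timings.png")
```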

I say "optimise 404s", but I think the issue really concerns all requests that aren't 200. The distinction is subtle because almost all requests that are not 200 are 404s.


tleb commented Dec 19, 2024

I tried to find code paths that could explain this, but couldn't. Most of the slow requests are 404s on identifiers, and those do not go through the get_project_error_page() function. The /source/ 404 requests, however, do go through that function.

The most likely scenario at the moment: querying Git or the databases for non-existing values hits a worst-case timing, maybe with some sort of timeout.
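
One way to test that hypothesis is to time the same kind of lookup for a value that exists and one that doesn't, directly against the Git repository. A minimal sketch; the repository path and the sample refs below are placeholders, not values from this issue:

```python
# Sketch: compare lookup latency for an existing vs. a non-existing value, to
# test the "misses hit a worst-case path or timeout" hypothesis.
import subprocess
import time

REPO = "/srv/elixir-data/linux/repo"  # placeholder path

def time_git_lookup(ref: str) -> float:
    """Return the wall-clock time of a `git cat-file -e` existence check."""
    start = time.perf_counter()
    subprocess.run(
        ["git", "-C", REPO, "cat-file", "-e", ref],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        check=False,  # a missing object exits non-zero, which is expected here
    )
    return time.perf_counter() - start

for ref in ("v6.6:Makefile", "v6.6:does/not/exist.c"):
    print(f"{ref:30} {time_git_lookup(ref) * 1000:.1f} ms")
```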


I also tried reproducing locally, by downloading the Linux databases from production and running Elixir in a container, but I couldn't reproduce the slow responses.


Also, we only log wall-clock duration. It could be that those threads hang there doing nothing. In that case, avoiding those response times would make close to no difference to the server load (which is what we want to reduce).
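
One way to tell the two cases apart would be to record CPU time next to wall-clock time around the handler: a large gap means the thread was mostly waiting rather than working. A minimal sketch with a hypothetical handle_request() stand-in, not the actual Elixir code:

```python
# Sketch: log wall-clock and CPU time side by side, so "slow because blocked"
# can be told apart from "slow because busy".
import time

def timed(handler):
    def wrapper(*args, **kwargs):
        wall_start = time.perf_counter()
        cpu_start = time.thread_time()   # CPU time of the current thread only
        try:
            return handler(*args, **kwargs)
        finally:
            wall_ms = (time.perf_counter() - wall_start) * 1000
            cpu_ms = (time.thread_time() - cpu_start) * 1000
            # A large wall/CPU gap means the thread mostly sat idle (I/O,
            # locks, timeouts): it adds latency but very little server load.
            print(f"wall={wall_ms:.1f} ms cpu={cpu_ms:.1f} ms")
    return wrapper

@timed
def handle_request(path):  # hypothetical stand-in for the real handler
    time.sleep(0.5)        # simulate a blocked thread: wall time grows, CPU time doesn't
    return 404

handle_request("/linux/v6.6/ident/NOT_A_REAL_IDENT")
```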


Also, it could be that the user agents themselves are slow. That would explain the weird distribution. TODO: how to debunk that hypothesis? Can Apache log other things than response wall-clock time?
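
On the Apache side, mod_log_config's %D logs the time taken to serve the request in microseconds, and mod_logio can additionally log %^FB, the delay until the first byte of the response headers is written (Apache 2.4.13+, requires LogIOTrackTTFB). Comparing the two per request would show whether the time is spent server-side or on slow clients. A sketch, not the production configuration:

```apache
# Sketch of an extended access-log format, not the production configuration.
# %D   = time to serve the request, in microseconds (mod_log_config)
# %^FB = time until the first byte of the response headers (mod_logio)
LogIOTrackTTFB On
LogFormat "%h %l %u %t \"%r\" %>s %O \"%{Referer}i\" \"%{User-Agent}i\" %D %^FB" timing
CustomLog ${APACHE_LOG_DIR}/access.log timing
```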


tleb commented Feb 25, 2025

This was all a wrong lead. Oops. I don't remember why I was mistaken, though.

tleb closed this as completed Feb 25, 2025