Optimise 404 responses #362

Closed

tleb opened this issue Dec 19, 2024 · 2 comments


tleb commented Dec 19, 2024

For some reason, almost all of our responses that take more than two seconds are 404s. I have investigated quite a bit and haven't yet found an explanation.

Some stats from two weeks' worth of data (see the log-parsing sketch after the numbers for how they can be derived):

requests:                     29_012_453
>2s requests:                  2_317_542
>2s 404 /ident/ requests:      1_593_969, 69%
>2s non-404 /ident/ requests:      3_750
>2s 200 requests:                     54
>2s non-404 requests:              4_993
>2s 404 requests:              2_312_549, 99.8%
>2s 404 /source/ requests:       718_466, 31%
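
For reference, a minimal sketch of how such numbers could be tallied from an Apache access log. The log layout (combined LogFormat with %D, the request duration in microseconds, appended as the last field) and the field positions are assumptions, not the actual production configuration:

```python
#!/usr/bin/env python3
# Sketch: tally slow (>2 s) requests by status and endpoint from an Apache
# access log read on stdin. Field positions assume a combined LogFormat with
# %D (request duration in microseconds) appended as the last field; adjust
# them to whatever the real LogFormat is.
import sys
from collections import Counter

SLOW_US = 2_000_000  # 2 seconds, expressed in microseconds
counts = Counter()

for line in sys.stdin:
    fields = line.split()
    try:
        path = fields[6]               # request path, from the "%r" field
        status = fields[8]             # final HTTP status code
        duration_us = int(fields[-1])  # %D appended at the end of the line
    except (IndexError, ValueError):
        continue                       # skip malformed lines

    counts["requests"] += 1
    if duration_us <= SLOW_US:
        continue

    counts[">2s requests"] += 1
    counts[f">2s {status} requests"] += 1
    if status != "404":
        counts[">2s non-404 requests"] += 1
    if "/ident/" in path:
        kind = "404" if status == "404" else "non-404"
        counts[f">2s {kind} /ident/ requests"] += 1
    elif "/source/" in path:
        counts[f">2s {status} /source/ requests"] += 1

for key, value in counts.most_common():
    print(f"{key:35} {value:>12_}")
```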

Clearly something is wrong. Pretty graphs agree (log scale, note the 2s spike for 404s which isn't present for 200s):

[Figure: timings-200]

[Figure: 404-timings]
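
For completeness, a sketch of how per-status, log-scale timing histograms like these can be regenerated from the same (assumed) log layout as the parsing sketch above; the binning is arbitrary and this is not what produced the graphs in this comment:

```python
# Sketch: plot a log-scale response-time histogram per status class, reading
# an access log on stdin with the same assumed field layout as above.
import sys
import matplotlib.pyplot as plt

durations = {"200": [], "404": []}
for line in sys.stdin:
    fields = line.split()
    try:
        status = fields[8]
        duration_s = int(fields[-1]) / 1_000_000  # %D is in microseconds
    except (IndexError, ValueError):
        continue
    if status in durations:
        durations[status].append(duration_s)

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
for ax, (status, values) in zip(axes, durations.items()):
    ax.hist(values, bins=200)
    ax.set_yscale("log")              # log scale, as in the graphs above
    ax.set_title(f"{status} responses")
    ax.set_xlabel("response time (s)")
plt.tight_layout()
plt.savefig("timings.png")
```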

I say "optimise 404s", but I think the issue really concerns all requests that aren't 200. The distinction is subtle because almost all requests that are not 200 are 404s.


tleb commented Dec 19, 2024

I tried to find code paths that could explain this, but couldn't. Most of the slow requests are 404s on identifiers, and those do not go through the get_project_error_page() function. The /source/ 404 requests, however, do go through that function.

The most likely scenario at the moment: querying Git or the databases for non-existing values hits a worst-case timing, maybe with some sort of timeout.
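
One way to test that hypothesis is to time the same kind of lookup for a value that exists and one that doesn't, directly against the Git repository. A minimal sketch; the repository path and the sample refs below are placeholders, not values from this issue:

```python
# Sketch: compare lookup latency for an existing vs. a non-existing value, to
# test the "misses hit a worst-case path or timeout" hypothesis.
import subprocess
import time

REPO = "/srv/elixir-data/linux/repo"  # placeholder path

def time_git_lookup(ref: str) -> float:
    """Return the wall-clock time of a `git cat-file -e` existence check."""
    start = time.perf_counter()
    subprocess.run(
        ["git", "-C", REPO, "cat-file", "-e", ref],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        check=False,  # a missing object exits non-zero, which is expected here
    )
    return time.perf_counter() - start

for ref in ("v6.6:Makefile", "v6.6:does/not/exist.c"):
    print(f"{ref:30} {time_git_lookup(ref) * 1000:.1f} ms")
```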


I also tried reproducing locally, by downloading the Linux databases from production and running Elixir in a container, but I couldn't reproduce the slow responses.


Also, we only log wall-clock duration. It could be that those threads hang there doing nothing. In that case, avoiding those response times would make close to no difference to the server load (which is what we want to reduce).
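
One way to tell the two cases apart would be to record CPU time next to wall-clock time around the handler: a large gap means the thread was mostly waiting rather than working. A minimal sketch with a hypothetical handle_request() stand-in, not the actual Elixir code:

```python
# Sketch: log wall-clock and CPU time side by side, so "slow because blocked"
# can be told apart from "slow because busy".
import time

def timed(handler):
    def wrapper(*args, **kwargs):
        wall_start = time.perf_counter()
        cpu_start = time.thread_time()   # CPU time of the current thread only
        try:
            return handler(*args, **kwargs)
        finally:
            wall_ms = (time.perf_counter() - wall_start) * 1000
            cpu_ms = (time.thread_time() - cpu_start) * 1000
            # A large wall/CPU gap means the thread mostly sat idle (I/O,
            # locks, timeouts): it adds latency but very little server load.
            print(f"wall={wall_ms:.1f} ms cpu={cpu_ms:.1f} ms")
    return wrapper

@timed
def handle_request(path):  # hypothetical stand-in for the real handler
    time.sleep(0.5)        # simulate a blocked thread: wall time grows, CPU time doesn't
    return 404

handle_request("/linux/v6.6/ident/NOT_A_REAL_IDENT")
```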


Also, it could be that the user agents themselves are slow. That would explain the weird distribution. TODO: how to debunk that hypothesis? Can Apache log other things than response wall-clock time?
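
On the Apache side, mod_log_config's %D logs the time taken to serve the request in microseconds, and mod_logio can additionally log %^FB, the delay until the first byte of the response headers is written (Apache 2.4.13+, requires LogIOTrackTTFB). Comparing the two per request would show whether the time is spent server-side or on slow clients. A sketch, not the production configuration:

```apache
# Sketch of an extended access-log format, not the production configuration.
# %D   = time to serve the request, in microseconds (mod_log_config)
# %^FB = time until the first byte of the response headers (mod_logio)
LogIOTrackTTFB On
LogFormat "%h %l %u %t \"%r\" %>s %O \"%{Referer}i\" \"%{User-Agent}i\" %D %^FB" timing
CustomLog ${APACHE_LOG_DIR}/access.log timing
```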


tleb commented Feb 25, 2025

This was all a wrong lead. Oops. I don't remember why I was mistaken, though.

tleb closed this as completed Feb 25, 2025