
Performance improvements #41

Open · cameron1729 wants to merge 6 commits into master from performance-improvements
Conversation

@cameron1729 (Contributor) commented on Apr 1, 2022

This branch:

  • Changes the way duplicate file hashes are cached: we now progressively add them to the cache instead of fetching them all up front
  • Caches the rewritten img tags so we don't hit the DB every time the same tag is encountered

I've done some profiling locally on a test course with a bunch of large images and compared the before/after in a few different scenarios. My test was simply loading the course index page. The results are for the "normal" case once caches have finished building. When duplicate detection was turned on, minduplicates was set to 2.
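For illustration, a minimal sketch of what the second change (caching rewritten img tags) could look like; the cache area name, key derivation and helper names are assumptions for the example, not necessarily what the patch does:

```php
// Sketch: cache the fully rewritten img tag, keyed by a hash of the original
// tag, so repeat page views skip both the DB lookups and the rewriting work.
function process_img_tag_cached(string $originalimgtag, callable $rewrite): string {
    $cache = \cache::make('filter_imageopt', 'processed_tags'); // Hypothetical area name.
    $key = 'imgtag_' . sha1($originalimgtag);

    $cached = $cache->get($key);
    if ($cached !== false) {
        return $cached;
    }

    // First time we see this tag: do the (DB-touching) rewrite and remember it.
    $processed = $rewrite($originalimgtag);
    $cache->set($key, $processed);
    return $processed;
}
```

The point is simply that once a tag has been rewritten, later page views can reuse the cached result without touching the DB.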

No plugin (base to compare against)

  • Load time: 336 requests | 17.84MB | 7.39s finish (2.46 DOM, 6.67 load)
  • DB queries: 46

master branch

Duplicate detection off:

  • Load time: 324 requests | 4.61MB | 6.61s finish (2.28 DOM, 6.01 load)
  • DB queries: 121

Duplicate detection on:

  • Load time: 317 requests | 3.68MB | 6.48s finish (2.26 DOM, 5.80 load)
  • DB queries: 114

performance-improvements branch

Duplicate detection off:

  • Load time: 334 requests | 5.91MB | 6.37s finish (2.18 DOM, 5.89 load)
  • DB queries: 46

Duplicate detection on:

  • Load time: 318 requests | 4.6MB | 7.62s finish (2.35 DOM, 6.74 load)
  • DB queries: 46

Resolves #40

Review thread on this part of the diff (the last line is the one being discussed):

```php
}

return in_array($contenthash, $publicfilescache->get($key));
$publicfilescache->set($key, $publicfiles);
```
Collaborator commented:
  1. Shouldn't this only be set if it has changed?

  2. Isn't this going to create a single giant record with every file hash in it? If so, this won't scale at all.

We need to put a lot of thought into the granularity of the cache keys, and I would suggest that the grain size is either at the level of a course or the level of a URL. Maybe using the context could also work.

cameron1729 (Contributor, author) replied:
I've updated the patch to progressively cache the hashes as individual keys. Note that it looks a bit weird due to a bug I found where you can't retrieve boolean false from cachestore_file.

The cache keys are now publicfiles_[contenthash]_[minduplicates] and their values are true/false. We could still do it with an array, but I'm not sure if that has much value since we hit the cache once per image request anyway.
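For context, a rough sketch of the per-hash check described above; storing integers instead of booleans is one possible workaround for the cachestore_file issue, and the function name and exact duplicate criterion are assumptions rather than the actual patch:

```php
// Sketch: check (and progressively populate) the per-hash public files cache.
// Integers 1/0 are stored because boolean false cannot be retrieved reliably
// from cachestore_file, which makes a cache miss look the same as false.
function is_public_file_sketch(string $contenthash): bool {
    global $DB;

    $minduplicates = get_config('filter_imageopt', 'minduplicates');
    $publicfilescache = \cache::make('filter_imageopt', 'public_files');
    $key = 'publicfiles' . '_' . $contenthash . '_' . $minduplicates;

    $cached = $publicfilescache->get($key);
    if ($cached !== false) {
        return (bool)$cached;
    }

    // One small query per not-yet-cached hash, instead of one huge query up front.
    $count = $DB->count_records('files', ['contenthash' => $contenthash]);
    $ispublic = $count >= $minduplicates;

    $publicfilescache->set($key, $ispublic ? 1 : 0);
    return $ispublic;
}
```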

@cameron1729 force-pushed the performance-improvements branch 2 times, most recently from 6eccdb7 to 513b13e on April 7, 2022 07:13
Commit message:

On large sites, the files table can be enormous and it is impractical to query the entire table for duplicates at once.

Instead, this patch modifies the public files check to query for instances of a file individually (as required). Each time, the content hash of that file is added to the cache and marked as either public or private.
@cameron1729 force-pushed the performance-improvements branch from 513b13e to 5a265ef on April 11, 2022 08:29
Review thread on this part of the diff:

```diff
 global $DB;

 $minduplicates = get_config('filter_imageopt', 'minduplicates');
 $publicfilescache = \cache::make('filter_imageopt', 'public_files');
-$key = 'public_files_' . $minduplicates;
+$key = 'publicfiles' . '_' . $contenthash . '_' . $minduplicates;
```
Owner commented:
So now there's a cache entry for every image that passes through the filter. We also have a DB count for every file that hasn't already been added to the cache. If there are a lot of images on a page then we will have a lot of hits to the DB until the cache is populated. I think I can live with this since it's only happening once.

I'm wondering if we can have a scheduled task or something to populate the cache so that it's fast before people access the page. The scheduled task could just make one DB query to get all the records like it used to, but since it's not running per image it should be performant. @brendanheywood please let me know your thoughts on this.

@cameron1729 sorry for the late review on this - I have been on holiday.
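A rough sketch of the kind of cache-warming task being suggested here; the class name, lang string and SQL are assumptions, not an agreed design:

```php
namespace filter_imageopt\task;

// Sketch: warm the public files cache with one bulk query instead of
// per-image lookups. Viability on very large files tables is discussed below.
class warm_public_files_cache extends \core\task\scheduled_task {

    public function get_name() {
        // Assumes a corresponding lang string would be added.
        return get_string('taskwarmpublicfiles', 'filter_imageopt');
    }

    public function execute() {
        global $DB;

        $minduplicates = get_config('filter_imageopt', 'minduplicates');
        $publicfilescache = \cache::make('filter_imageopt', 'public_files');

        $sql = "SELECT contenthash, COUNT(1) AS duplicates
                  FROM {files}
              GROUP BY contenthash
                HAVING COUNT(1) >= :minduplicates";

        $rs = $DB->get_recordset_sql($sql, ['minduplicates' => $minduplicates]);
        foreach ($rs as $record) {
            $key = 'publicfiles' . '_' . $record->contenthash . '_' . $minduplicates;
            $publicfilescache->set($key, 1);
        }
        $rs->close();
    }
}
```

The task would also need an entry in db/tasks.php to be scheduled.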

cameron1729 (Contributor, author) replied:
Hey Thomas, no worries. You're right about there being potentially many DB hits before the cache is all built.

Other parts of this PR reduce the number of DB hits per image. I suspect that, due to caching the processed tags, the net number of DB queries is now lower, even with the one query added per image for duplicate detection. In the numbers above, with duplicate detection disabled, the number of DB queries went from 46 to 121 when the vanilla plugin was enabled. With the processed tags cache (and duplicate detection still off) that goes back down to 46. But I'd need to run the tests again to be sure the net number of queries is still the same or less overall.

I considered the scheduled task idea, but the files table on the site where we're testing this has over one hundred million rows. The original SQL query takes about 6 minutes to finish (even when minduplicates is set to something like 500) and the resulting array can be enormous if minduplicates is set to something low.

We could possibly do something smart in the task like cache an array of hashes per context id, and process more recently accessed courses first, but the current patch sort of does that already since it adds items to the cache based on what is actually being accessed (i.e., a course that is accessed frequently will have its images processed and put in the cache, whereas a course that is never accessed will not).

I guess there are pros and cons to both approaches, and I'm not sure which is best. Happy to be steered.

Collaborator replied:
I need to put some in-depth thought into this and do some real-world-shaped load testing. But my knee-jerk reaction is that the granularity of the cache keys should be roughly one per page view. So, no idea if this is viable, but could we rejig things so that we build up a structure containing a map of multiple old URLs to new URLs, stored at a fairly chunky level? I'm thinking that if you have a course page with 10 mod_label instances on it, they all end up under one key. We'd need to do some level of URL processing so that the key includes the important parameters and excludes the unimportant ones, and that logic might be funky. Also open to other suggestions, like using the page's context id as the key.

Perhaps a crazier idea: we could also set limits on how much slower the page is allowed to be while generating this stuff, i.e. if we have a huge page with 1000 images, the filter could be smart enough to stop processing after 50 lookups and save its results to the cache. Then over 20 page views the page would be fully warmed up.
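A hedged sketch of what such a per-request budget might look like; the limit, return convention and names are made up for illustration:

```php
// Sketch: cap the number of uncached public-file lookups per page view so the
// cache warms up over several requests rather than slowing a single one down.
// Returns null when the budget is exhausted, meaning "skip optimisation for now".
function is_public_file_budgeted(string $contenthash, callable $dblookup): ?bool {
    static $lookupsthisrequest = 0;
    $maxlookupsperrequest = 50; // Illustrative limit only.

    $minduplicates = get_config('filter_imageopt', 'minduplicates');
    $publicfilescache = \cache::make('filter_imageopt', 'public_files');
    $key = 'publicfiles' . '_' . $contenthash . '_' . $minduplicates;

    $cached = $publicfilescache->get($key);
    if ($cached !== false) {
        return (bool)$cached;
    }

    if ($lookupsthisrequest >= $maxlookupsperrequest) {
        return null;
    }

    $lookupsthisrequest++;
    $ispublic = (bool)$dblookup($contenthash, $minduplicates);
    $publicfilescache->set($key, $ispublic ? 1 : 0);
    return $ispublic;
}
```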

cameron1729 (Contributor, author) replied:
The way this branch currently works is already kind of storing a mapping between old URLs and new URLs. Once a tag has been processed, the result is cached and then used when we encounter that URL again; the public files cache won't be used again for that URL after that point. Currently the key is generated from the input img tag and the value is the URL of the processed image, but I'm pretty sure that can be changed so that the key is a contextid (or something similar) and the value is a mapping between raw URLs and processed URLs.
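As a purely hypothetical illustration of that alternative shape (the cache area name and helpers do not exist in the plugin):

```php
// Sketch: one cache entry per context holding a map of original image URL
// to processed URL, instead of one entry per individual img tag.
function get_processed_url_map(int $contextid): array {
    $cache = \cache::make('filter_imageopt', 'processed_tags'); // Hypothetical area.
    $map = $cache->get('context_' . $contextid);
    return $map === false ? [] : $map;
}

function remember_processed_url(int $contextid, string $originalurl, string $processedurl): void {
    $cache = \cache::make('filter_imageopt', 'processed_tags');
    $map = get_processed_url_map($contextid);
    $map[$originalurl] = $processedurl;
    $cache->set('context_' . $contextid, $map);
}
```

One thing to weigh up with a chunkier grain is that concurrent requests can race on the read-modify-write of the map, so some staleness would have to be acceptable.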

Before processing an img tag, we need to know if the source image is public or private. I think we could remove the public files cache completely and the end result would be exactly the same, but having it means we use fewer DB queries while we're filling up the processed tags cache. If it doesn't seem worth it, we could remove it.

What you mention in the second paragraph is kind of already happening (though that may only be because of the loadonvisible setting). In my testing it took something like 4 or 5 refreshes for the cache to be fully populated.

Successfully merging this pull request may close issue #40: Plugin needs some optimisation to be used in large sites.