Cache various statistics to improve performance #204
An implementation that resolves #201.
Describe your changes
This is a rather large-looking PR (only because I am basing it off of #198 for ease of merging!), and is a sort of first attempt (which is why I am marking it as a draft) at caching the various statistics properties as described in #201.
TL;DR: since at each iteration we only move one point in the dataset (basically mapping $(x_i, y_i) \mapsto (x'_i, y'_i)$), why not compute the new statistics ($\langle X' \rangle$, $\text{Var}(X')$, and $r(X', Y')$) in terms of the old ones ($\langle X \rangle$, $\text{Var}(X)$, and $r(X, Y)$) plus the old ($(x_i, y_i)$) and new ($(x'_i, y'_i)$) points? This is essentially what this PR does.
Details
Mathematical derivation
The mean is shifted as:

$$\langle X' \rangle = \langle X \rangle + \frac{\delta_x}{N},$$

where $N$ is the number of points in the dataset and $\delta_x = x'_i - x_i$ (same for $Y$), the variance (or the standard deviation if you want) is shifted as:

$$\text{Var}(X') = \text{Var}(X) + \frac{(x'_i)^2 - x_i^2}{N} - \left( \langle X' \rangle^2 - \langle X \rangle^2 \right),$$

while the correlation coefficient is shifted as:

$$r(X', Y') = \frac{\langle X' Y' \rangle - \langle X' \rangle \langle Y' \rangle}{\sqrt{\text{Var}(X') \, \text{Var}(Y')}}, \qquad \langle X' Y' \rangle = \langle X Y \rangle + \frac{x'_i \, y'_i - x_i \, y_i}{N},$$

where the only new quantity to keep track of is $\langle X Y \rangle$ (the rest can be obtained from the two equations above).
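Each update is O(1) instead of an O(N) recomputation. As a quick numerical sanity check of the shifted formulas (a standalone sketch, not this PR's code; all names here are illustrative):

```python
# Sketch: verify the shifted-statistics formulas against a full recomputation.
import math
import random

def mean(v):
    return sum(v) / len(v)

def var(v):
    # population variance
    m = mean(v)
    return sum((e - m) ** 2 for e in v) / len(v)

def corrcoef(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return cov / math.sqrt(var(x) * var(y))

random.seed(42)
n = 100
x = [random.random() for _ in range(n)]
y = [random.random() for _ in range(n)]

# cached statistics of the original data
mean_x, mean_y = mean(x), mean(y)
var_x, var_y = var(x), var(y)
mean_xy = sum(a * b for a, b in zip(x, y)) / n

# move a single point: (x_i, y_i) -> (x_i + dx, y_i + dy)
i, dx, dy = 7, 0.05, -0.03
xi_new, yi_new = x[i] + dx, y[i] + dy

# shifted statistics from the cached ones (the formulas above)
mean_x_new = mean_x + dx / n
mean_y_new = mean_y + dy / n
var_x_new = var_x + (xi_new**2 - x[i]**2) / n - (mean_x_new**2 - mean_x**2)
var_y_new = var_y + (yi_new**2 - y[i]**2) / n - (mean_y_new**2 - mean_y**2)
mean_xy_new = mean_xy + (xi_new * yi_new - x[i] * y[i]) / n
r_new = (mean_xy_new - mean_x_new * mean_y_new) / math.sqrt(var_x_new * var_y_new)

# ...and from a full recomputation on the perturbed data, for comparison
x2, y2 = list(x), list(y)
x2[i], y2[i] = xi_new, yi_new
assert math.isclose(mean_x_new, mean(x2))
assert math.isclose(var_x_new, var(x2))
assert math.isclose(r_new, corrcoef(x2, y2))
```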
Workflow
With the math out of the way, this is how the new way of computing the statistics works:

1. we take in the `(x, y)` dataset, and make an instance of `Statistics` (subject to change for its somewhat ambiguous naming) from it
2. `Statistics` has a method, `perturb`, which takes in 3 arguments: the row (index) we wish to perturb, and the 2 values of the perturbations (in the `x` and `y` directions), and returns an instance of the new `SummaryStatistics` (i.e. the one from the perturbed data)
3. if `_is_close_enough` returns `True`, we call the `perturb` method again with `update=True`, which actually updates the data in the `Statistics` instance

Note that there's a bunch of new methods (`shifted_mean`, `shifted_var`, `shifted_stdev`, `shifted_corrcoef`), which we may or may not want to make private (or refactor so they accept different arguments), as I didn't put too much thought into the overall API design (again, hence the "draft" status).
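To make the workflow concrete, here is a minimal sketch of what such a class could look like (illustrative only; the class and method names follow the description above, and the PR's actual `Statistics`/`SummaryStatistics` implementations may differ):

```python
# Illustrative sketch of the perturb workflow described above; not the PR's
# actual implementation.
import math
from typing import NamedTuple

class SummaryStatistics(NamedTuple):
    x_mean: float
    y_mean: float
    x_stdev: float
    y_stdev: float
    correlation: float

class Statistics:
    """Caches the running sums needed to shift the statistics in O(1)."""

    def __init__(self, x, y):
        self._x, self._y = list(x), list(y)
        self._n = len(self._x)
        self._sum_x = sum(self._x)
        self._sum_y = sum(self._y)
        self._sum_xx = sum(v * v for v in self._x)
        self._sum_yy = sum(v * v for v in self._y)
        self._sum_xy = sum(a * b for a, b in zip(self._x, self._y))

    def perturb(self, row, dx, dy, update=False):
        """Return the SummaryStatistics after moving one point.

        With update=True, also commit the move to the cached sums.
        """
        xi, yi = self._x[row], self._y[row]
        xn, yn = xi + dx, yi + dy
        # shift each cached sum by the contribution of the moved point only
        sum_x = self._sum_x + dx
        sum_y = self._sum_y + dy
        sum_xx = self._sum_xx + xn * xn - xi * xi
        sum_yy = self._sum_yy + yn * yn - yi * yi
        sum_xy = self._sum_xy + xn * yn - xi * yi
        if update:
            self._x[row], self._y[row] = xn, yn
            self._sum_x, self._sum_y = sum_x, sum_y
            self._sum_xx, self._sum_yy, self._sum_xy = sum_xx, sum_yy, sum_xy
        n = self._n
        mean_x, mean_y = sum_x / n, sum_y / n
        var_x = sum_xx / n - mean_x * mean_x
        var_y = sum_yy / n - mean_y * mean_y
        r = (sum_xy / n - mean_x * mean_y) / math.sqrt(var_x * var_y)
        return SummaryStatistics(
            mean_x, mean_y, math.sqrt(var_x), math.sqrt(var_y), r
        )
```

A caller would preview a move with `stats.perturb(row, dx, dy)`, and, once `_is_close_enough` accepts it, commit it by repeating the call with `update=True`.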
Performance

Now, onto the performance results! Note that the benchmarks were done with #201 already in there, so the absolute numbers differ from the ones on `main`, but the reduction in the number of function calls is evident. I tested everything using `python -m cProfile -m data_morph --seed 42 --start-shape panda --target-shape [shape]`.

All in all, we use 26.1M fewer function calls across all shapes in question, which results in about 35% faster performance (computed as `((t_old - t_new) / t_old) * 100`).
Discussion items

Some possible points of discussion:

- the overall API design (should we make the `shifted_*` functions private? Could `perturb` do without an `update` argument? etc.)
- the `shifted_*` functions are a bit repetitive, and could take a while to run (I was initially very skeptical of the numerical stability, but it appears to be fine, though it could just be that I haven't tried enough scenarios to hit an instability)

Feedback is welcome :)
Checklist