IPPM graph refactor #391
Conversation
kymata/ippm/denoising_strategies.py (outdated)

```python
should_merge_hemis: bool = False,
should_exclude_insignificant: bool = True,
should_shuffle: bool = True,
should_normalise: bool,
```
It might be worthwhile to add some default values for these parameters. It comes down to which parameters you think have high variance across runs and which ones will remain constant. The constant ones can be moved to optional/default values.
I think `should_normalise`, `should_cluster_only_latency` and `should_max_pool` will be False for most runs unless someone wants to experiment. `should_shuffle` will be True. As for the threshold for significance, I'm not sure, although estimating it from the data seems better than an absolute threshold (`exclude_points_above_n_sigma` > `exclude_logp_vals_above`).
I'd left the existing default values in the subclasses, but it sounds like it makes logical sense for them to be set in the base class too.
For now I've left the defaults for the `should_exclude_above_n_sigma` params as 5, which is what they were originally, but we can revisit that.
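To illustrate the suggestion, here is a hypothetical base-class signature with the defaults discussed above (parameter names are taken from this thread and may not match the final code):

```python
from abc import ABC


class DenoisingStrategy(ABC):
    """Superclass for unsupervised clustering (denoising) strategies."""

    def __init__(
        self,
        should_normalise: bool = False,              # rarely changed, per discussion
        should_cluster_only_latency: bool = False,   # rarely changed
        should_max_pool: bool = False,               # rarely changed
        should_shuffle: bool = True,                 # almost always True
        should_exclude_insignificant: bool = True,
        exclude_points_above_n_sigma: int = 5,       # original default of 5; revisit later
    ):
        self._should_normalise = should_normalise
        self._should_cluster_only_latency = should_cluster_only_latency
        self._should_max_pool = should_max_pool
        self._should_shuffle = should_shuffle
        self._should_exclude_insignificant = should_exclude_insignificant
        self._exclude_points_above_n_sigma = exclude_points_above_n_sigma
```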
@neukym Yeah, I saw this; it's related to the non-passing tests. Something trivial, I think, but I've not tracked down what yet.
Here's a record of where I got to before signing off this evening. @anirudh1666, perhaps this will help as you look at the tests; I'll pick up again here next week. I've been stepping through the demo notebook.

In particular, here's how the full "Kymata2023Q3" dataset looks when denoised on main:

My plan was to simplify this comparison by filtering down to one transform at the top of the notebook using:

```python
expression_set: HexelExpressionSet = KymataMirror2023Q3Dataset().to_expressionset()
expression_set = expression_set[[
    "TVL loudness (short-term)",
]]
```

Now here's something interesting. On main, this looks as expected:

So when filtering down to one function, the branch works just like main. Now moving up to two functions:

```python
expression_set: HexelExpressionSet = KymataMirror2023Q3Dataset().to_expressionset()
expression_set = expression_set[[
    "TVL loudness (instantaneous)",
    "TVL loudness (short-term)",
]]
```

So somehow the denoising is changing when adding multiple functions, on branch. The next comment is one difference I've found, but it needs some discussion/thinking.
**TL;DR**

**How are points filtered for significance?**

On main, points are filtered for significance using a threshold set in the strategy init:

```python
class DenoisingStrategy(ABC):
    """
    Superclass for unsupervised clustering algorithms.
    Strategies should conform to this interface.
    """
    def __init__(self, ..., should_exclude_insignificant: bool = True, ...):
        ...
        self._threshold_for_significance = (
            DenoisingStrategy._estimate_threshold_for_significance(
                normal_dist_threshold  # defaults to 5; i.e. 5-sigma thresholding
            )
        )
```

Here's the entirety of `_estimate_threshold_for_significance`:

```python
@staticmethod
def _estimate_threshold_for_significance(x: float) -> float:
    def __calculate_bonferroni_corrected_alpha(one_minus_probability):
        return 1 - pow(
            (1 - one_minus_probability), (1 / (2 * TIMEPOINTS * NUMBER_OF_HEXELS))
        )

    probability_X_gt_x = 1 - NormalDist(mu=0, sigma=1).cdf(x)
    threshold_for_significance = __calculate_bonferroni_corrected_alpha(
        probability_X_gt_x
    )
    return threshold_for_significance
```

A quick note here: this is called "bonferroni" because the code was copied from an old version of the codebase, before @ollieparish rightly pointed out that this is in fact Šidák correction, not Bonferroni correction.

**Questions about this correction for @neukym**

Is this correct? Is this correcting a probability, or one-minus-probability as suggested in the variable name? It looks to me like […]

As another point of comparison, on main currently there's also a snippet in `plot.py`:

```python
sidak_corrected_alpha = 1 - (
    (1 - alpha)
    ** np.longdouble(1 / (2 * len(expression_set.latencies) * n_channels * len(show_only)))
)
```

(This code was recently touched by @young-x-skyee, but just to change the float to a longdouble.) So here it looks like the correction is applied to the alpha level, not 1-p. On branch, this is refactored into a single function in one place, assuming that it's working on p.

**How much correction to apply**

Of particular note, however, is that the number of timepoints and hexels are stored in constants:

```python
TIMEPOINTS = 201
NUMBER_OF_HEXELS = 200000
```

(These were previously stored in vars in a few places, so I just put them in one place.) My understanding is that the denominator in the exponent (everything after the `1 /`) should be the total number of comparisons being corrected for. (Bonus question for @neukym: what's that 2 doing there? If that's 2 hemispheres, I don't think we need it if we just count all the hexels. Thoughts?)

It would be nice if these values were derived from the data rather than constants, and I've tried to do this on branch. However, if this is true then I think the constants are wrong...

```python
len(expression_set.latencies)  # 201 - that's correct
len(expression_set.hexels_left) + len(expression_set.hexels_right)  # 20484
```

So 20,484 is out from the `NUMBER_OF_HEXELS = 200000` constant by roughly a factor of 10.

Another point is also raised comparing to the Šidák correction in the expression plotting code: this includes the number of transforms in the correction factor (that's the `len(show_only)` term in the snippet above):

```python
def get_threshold(es: ExpressionSet) -> float:
    """Get Šidák n-sigma threshold"""
    n_comparisons = len(es.transforms) * len(es.latencies) + get_n_channels(es)
    return p_to_logp(sidak_correct(p_threshold_for_sigmas(exclude_points_above_n_sigma),
                                   n_comparisons=n_comparisons))
```

```python
def sidak_correct(alpha: float, n_comparisons: int) -> float:
    """Applies Šidák correction to an alpha threshold."""
    return 1 - (
        (1 - alpha)
        ** longdouble(1 / (2 * n_comparisons))
    )
```

(Just converting to logp values here because that's how probability is stored in expression sets.) But then my question is: am I right to do this? On main, the IPPM code doesn't adjust the threshold depending on how many transforms are in play - but should it? Or should we treat each transform entirely independently? On main, about 86% of the points are removed using this threshold before denoising even begins, so it makes a big difference to what's going on.

**Conclusion**

The branch code is obviously doing something wrong, but given that by some argument the threshold should change depending on the number of transforms included in the IPPM, and that this interacts with the behaviour observed on branch, I think we should resolve this before continuing.
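As an illustrative aside (my own numbers, using only the counts quoted above; not code from either branch), here is roughly how much the choice of comparison count moves the corrected threshold:

```python
import math
from statistics import NormalDist


def sidak_correct(alpha: float, n_comparisons: int) -> float:
    """Per-comparison alpha keeping the family-wise rate at `alpha`.
    expm1/log1p avoid the cancellation that motivates longdouble elsewhere."""
    return -math.expm1(math.log1p(-alpha) / n_comparisons)


# One-tailed 5-sigma alpha, as in _estimate_threshold_for_significance
alpha_5_sigma = 1 - NormalDist(mu=0, sigma=1).cdf(5)

n_constants = 2 * 201 * 200_000  # 2 * TIMEPOINTS * NUMBER_OF_HEXELS = 80,400,000
n_derived = 201 * 20_484         # latencies * (hexels_left + hexels_right) = 4,117,284

for label, n in [("constants", n_constants), ("derived from data", n_derived)]:
    print(f"{label}: n = {n:,}, corrected alpha = {sidak_correct(alpha_5_sigma, n):.2e}")

# The corrected alpha scales roughly as alpha / n, so the constants-based count
# makes the per-point threshold about 20x stricter than the data-derived count;
# folding in a number-of-transforms factor would tighten it further still.
```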
(And of course I hasten to add that if there is an error on main, it's entirely something which @neukym and I should have picked up in review — that's on us!)
Yes I agree.
Yes, you are right, the 2 is there for the number of hemispheres.
Good catch. The correct one is in `plot.py`:
…although as you point out the […]

where […]

Yes, I think it should - we take account of this in main's […]
Yes, you are right that the following logic is correct: […]

Yes, I think it should. To standardise things, perhaps we should add this function to […]

Where […]
This then standardises everything into a single function, and it forces everyone to explicitly put in the number of time points, channels and transforms. (I'm worried that if we don't make this function explicit, there's a risk that one of us will forget to put in the number of transforms, as has been happening so far.)

**The problem**

I think I am right in saying the denoiser uses this threshold to do an initial filter, before it starts clustering. If we were to use the above logic, this would mean that the denoised expression dataset is specific to the scope of some specific set of transforms in `show_only`, which doesn't seem quite right. We perhaps need to chat about this - I hadn't thought of this situation - that the denoising procedure would be reliant on the value of `show_only`.

One safe hack would be not to include n_transforms in the denoiser, so for that, in […]. We then reapply […] in the IPPM plotting step? This would ensure that we only see the clusters that are significant at this higher threshold. Bit hacky, though. The other, cleaner, solution is that denoising simply isn't callable unless you are plotting an IPPM (in which case […]).
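A minimal sketch of what such a shared, explicit helper might look like (hypothetical name and home, not code from the PR; it assumes `n_channels` already counts both hemispheres' hexels, so no separate factor of 2 is applied):

```python
from numpy import longdouble


def sidak_corrected_alpha(
    alpha: float,
    n_timepoints: int,
    n_channels: int,
    n_transforms: int,
) -> float:
    """Šidák-correct `alpha` across all comparisons made.

    Every caller must state the number of timepoints, channels and transforms
    explicitly, so the transforms term cannot be silently forgotten.
    """
    n_comparisons = n_timepoints * n_channels * n_transforms
    return 1 - (1 - alpha) ** longdouble(1 / n_comparisons)


# e.g. an (approximately) 5-sigma threshold for the full hexel dataset, scoped to 2 transforms:
# sidak_corrected_alpha(alpha=2.87e-7, n_timepoints=201, n_channels=20_484, n_transforms=2)
```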
@caiw @neukym Would the Sidak correction change the recommendations of the paper?

**Response to other questions**

I agree with your corrections and Andy's responses. Putting this in a central location to make it consistent across the Kymata pipeline also gets a +1 from me.

**The problem highlighted by Andy**

I just want to confirm my understanding of this first. So, you are saying that when we generate the expression plot for all of the transforms, we have a threshold. But when we do denoising, the threshold for significance changes based on the subset size. This intuitively feels wrong because selecting a subset of the results should not change the results? If my understanding is correct, I think it is better to use the threshold for all of the transforms even for the subset. I don't like the hacky solution because the plotting doesn't accurately represent what is input into the denoiser. We can quickly discuss this if necessary.

**Update on DenoisingStrategies integration/unit tests and denoising bug**

I've been tinkering with the new denoising strategies and ExpressionSets this past week. Unfortunately, I think it might take some time to fix these bugs and write the tests, but I will continue working on it whenever I have some time. So far, I've been able to create a simplistic integration test that passes successfully and identified a few areas of concern.

There may be some problems in `ExpressionSet._clear_points`. For example, it does not clear the insignificant points, which is why the denoised time series has the small black spikes. The fix for this is to tag points to retain and remove the rest, rather than tag the points to remove and remove them (the points to remove never include the insignificant points, so they aren't removed); see the sketch below. There are other problems too: for example, shuffling the points changes the coordinates in the sparse array, but we don't reflect this in `_clear_points`, as it assumes the points are ordered the same. I think there are some other problems as well from doing experiments, but I haven't found the root cause for them all yet. At the moment, however, the code I have written is mainly for my own understanding/prototyping rather than to commit, so I don't have anything to commit atm. I will continue to look into this, with the following deliverables: […]

I will make my changes on `ippm-graph-refactor-anirudh` to avoid clashing with Cai's changes.

**PS** (Unasked-for programming principles, but they may help with refactoring/implementing features in the future)

The most useful (also funniest) guide I have found for software engineering is the following: https://grugbrain.dev/. Personally, I think we could have done this refactoring a bit better by doing it in smaller increments and testing after each step. Debugging becomes really difficult with large changes and, in the long run, takes more time than doing it in smaller increments. The saying for this is "Don't outrun your headlights". I would have also preferred starting with a simpler version of ExpressionSet that is correct and works, then iteratively optimising and generalising it, making sure the tests/demo_ippm are working after each iteration. As the saying goes, "Premature optimisation is the root of all evil". Another advantage of starting simple then optimising is that the simpler version is more flexible and amenable to change based on feedback. The most optimised code makes the strongest assumptions about the context it will be operating in, which can make it brittle. Of course, you may have valid reasons for your decisions, which is fine as well, in which case you can always disregard these tips. In the link provided, the grug developer explains this in the "Refactoring" section.
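A minimal sketch of the retain-mask idea (hypothetical data layout, not the actual `ExpressionSet` internals):

```python
import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical sparse layout: one logp value per (latency index, channel index)
logps = np.array([-50.0, -2.0, -30.0, -1.0])
lat_idx = np.array([0, 1, 2, 3])
chan_idx = np.array([5, 6, 7, 8])

logp_threshold = -10.0                          # significance cut-off (logp scale)
kept_by_denoiser = np.array([True, True, True, False])

# Tag the points to RETAIN and rebuild from those alone; anything not retained
# (including merely-insignificant points) is dropped, rather than trying to
# enumerate every point that must be removed.
retain = kept_by_denoiser & (logps <= logp_threshold)
cleared = coo_matrix(
    (logps[retain], (lat_idx[retain], chan_idx[retain])), shape=(201, 20_484)
)
print(cleared.nnz)  # 2 points survive
```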
Hey, I've not been able to get to this properly today for half-term childcare reasons, but I have some thoughts. I'll get back to it properly next week.

**Alpha-level correction**

@neukym said:

> […]
Great, I'll use this, and I've added some tests for specific values, using an online Sidak calculator to verify the correct results.

@caiw said:

> […]
@neukym said:

> […]
We could, but I think we should push this to another PR. In particular, we describe Sidak correction in the plot axis name for expression plots, so I'm slightly unsure about also setting that choice elsewhere. And if some reviewer at some point demands Bonferroni correction, or we decide that one correction or another makes more sense, I wonder if we should keep it a bit more procedural. I don't feel strongly on this point.

@anirudh1666 said:

> […]
I know what you mean, but it's all about controlling the chance that we find false signal in the noise. The more things we test, the more likely something is to appear significant by chance. The alpha level is precisely the level of "false discoveries" we are willing to accept. So the more transforms we let into a given "omnibus test" (testing multiple things all at the same time), the more likely one of them is to pass a threshold by chance, and the more we have to restrict ("correct") the threshold for significance to control the overall false-discovery rate. I think each transform we add (and every latency and hexel) is like an extra jellybean colour, which is why we have to control the alpha level so aggressively. @neukym is the expert on the stats comparisons made in the kymata gridsearch, and under certain circumstances we can control the FDR without correcting a threshold in this brute-force way, but doing the correction is the safest/most conservative way. Unless @neukym would argue the model comparison we do is somehow automatically controlling the familywise error rate (I've not thought about this in full), I think we should do full correction for all comparisons.
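To put rough numbers on the jellybean point (a toy calculation, not from the thread):

```python
# 20 independent tests at a per-test alpha of 0.05:
fwer_uncorrected = 1 - (1 - 0.05) ** 20          # ~0.64: something "significant"
                                                 # appears by chance most of the time
alpha_sidak = 1 - (1 - 0.05) ** (1 / 20)         # ~0.00256 per test
fwer_corrected = 1 - (1 - alpha_sidak) ** 20     # back to 0.05, by construction
# Every extra transform (and latency, and hexel) multiplies the number of tests,
# which is why the corrected threshold has to be so aggressive.
```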
@neukym said:

> […]

So I've changed my mind on this. We're using an alpha threshold to apply an initial filter to the points passed to the denoiser, but we're not actually using it for a formal omnibus NHST, so in that sense we don't need to control the FWER in a strict sense like we do when presenting an expression plot (where the coloured points are RESULTS, so we need to control the FWER in those results). So long as the significance of the nodes in the final IPPM survives a fully corrected NHST, I think we are fine applying an independent threshold to each transform going into the denoiser. To be clear, my recommendation is the same as @neukym's:

> […]
If this happens within the IPPM-building step, it can be documented in code, and the thresholds should* be quick to calculate and apply. (*As @anirudh1666 points out, there is currently a bug in this, but that's easily fixable.)

@neukym said:

> […]
I don't favour this exactly — there are other reasons we may want to denoise multiple transforms in future which don't directly result in IPPMs (e.g. see what Tianyi is doing with thousands of related transforms from ANN nodes). As previously discussed, I think we should formalise denoising as a process on `ExpressionSet`s.

@anirudh1666 said:

> […]
It's the other way: on […]
Yeah, that makes sense to me — good point.

**Refactoring**

[…]
Anyway, long comment finally posted now that I have internet access again. Thanks @neukym and @anirudh1666 for the thoughtful comments!
Thanks both - great discussion.
Yes, completely agree.
**Main changes**

- `ExpressionSet`s #404: Try to avoid creating new datastructures which look similar to existing ones, where existing ones can be used.
- `SensorExpressionSet`s #402 for free.
- `plot.expression_plot` and `ippm.stem_plot` #400 for free: by […] `ExpressionSet`s, and making it so that any `ExpressionSet` can produce an IPPM graph, we kind of get this one for free too.
- General compartmentalisation of functionality. […] `networkx.DiGraph`.
- When `self.var = self.func(...)`, and `self.func` itself mutates `self.var` before returning a value which replaces the mutation, it is not always clear which assignments to `self.var` will end up "sticking" without following out to the calling code (see the sketch at the end of this description).

**Note**

I'M SORRY that this is such a big diff. I'm aware it makes it hard to review. If I had had more time I'd have written a shorter letter, etc.

Also I'm sorry that I have renamed some classes/files in a way which prevents Github from showing diffs properly. In particular, `builder.py` and `IPPMBuilder` are more or less replaced by `grapy.py` and `IPPMGraph`. `data_tools.py` was somewhat a collection of miscellaneous functionality which has been moved into other classes.

**Little fixes**

A few issues I created and fixed in the course of the refactor.

- `None` supplied as the first or second element #420, as part of a merge-conflict resolution.

**Unblocks**

Issues removed from scope, but which can now be tackled, hopefully more easily.