
Refactoring MSD, smoothness and dynamic range calculations and unnecessary script cleanup #219

Open · wants to merge 77 commits into base: main
Conversation

@edyoshikun (Contributor) commented Jan 8, 2025

This PR refactors and adds the calculations for:

  • Mean square displacement (a sketch follows this list)
  • Slope calculation for smoothness
  • Normalization strategies for the embeddings prior to analysis (decided to use raw values)
  • Cleaning up unnecessary/obsolete scripts
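
A minimal sketch of the first two calculations, assuming a (T, D) array of per-timepoint positions or embeddings; the helper names are illustrative, not the PR's actual implementation:

import numpy as np

def mean_square_displacement(track: np.ndarray) -> np.ndarray:
    """MSD(tau) = mean over t of ||x[t + tau] - x[t]||^2, for tau = 1 .. T-1."""
    T = track.shape[0]
    return np.array(
        [np.mean(np.sum((track[tau:] - track[:-tau]) ** 2, axis=1)) for tau in range(1, T)]
    )

def msd_slope(msd: np.ndarray) -> float:
    """Smoothness proxy: slope of MSD versus lag on a log-log scale."""
    taus = np.arange(1, len(msd) + 1)
    slope, _ = np.polyfit(np.log(taus), np.log(msd), 1)
    return float(slope)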

@edyoshikun changed the title from "Refactoring MSD, smoothness and dynamic range calculations" to "Refactoring MSD, smoothness and dynamic range calculations and unnecessary script cleanup" on Jan 10, 2025
edyoshikun and others added 15 commits February 8, 2025 20:28
* caching dataloader

* caching data module

* black

* ruff

* Bump torch to 2.4.1 (#174)

* update torch >2.4.1

* black

* ruff

* adding timeout to ram_dataloader

* bandaid to cached dataloader

* fixing the dataloader using torch collate_fn

* replacing dictionary with single array

* loading prior to epoch 0

* Revert "replacing dictionary with single array"

This reverts commit 8c13f49.

* using multiprocessing manager

* add sharded distributed sampler

* add example script for ddp caching

* format and lint

* adding the custom distributed sampler to hcs_ram.py

* adding sampler to val train dataloader

* fix divisibility of the last shard

* hcs_ram format and lint

* data module that only crops and does not collate

* wip: execute transforms on the GPU

* path for if not ddp

* fix randomness in inversion transform

* add option to pop the normalization metadata

* move gpu transform definition back to data module

* add tiled crop transform for validation

* add stack channel transform for gpu augmentation

* fix typing

* collate before sending to gpu

* inherit gpu transforms for livecell dataset

* update fcmae engine to apply per-dataset augmentations

* format and lint hcs_ram

* fix abc type hint

* update docstring style

* disable grad for validation transforms

* improve sample image logging in fcmae

* fix dataset length when batch size is larger than the dataset

* fix docstring

* add option to disable normalization metadata

* inherit gpu transform for ctmc

* remove duplicate method override

* update docstring for ctmc

* allow skipping caching for large datasets

* make the fcmae module compatible with image translation

* remove prototype implementation

* fix import path

* Arbitrary prediction time transforms (#209)

* fix spelling in docstring and comment

* add batched zoom transform for tta

* add standalone lightning module for arbitrary TTA

* fix composition of different zoom factors

* add docstrings

* fix typo in docstring

---------

Co-authored-by: Eduardo Hirata-Miyasaki <[email protected]>
* fix terminology

* fix task description

VisCy is not set up to do supervised image classification
…module (#225)

* fixing Triplet Dataset always returning the negative sample #224

* Update viscy/data/triplet.py

Co-authored-by: Ziwen Liu <[email protected]>

* adding comment to tripletdataset

---------

Co-authored-by: Ziwen Liu <[email protected]>
@edyoshikun (Contributor, PR author) left a comment


@Soorya19Pradeep this is a good start. We should focus the tests on ensuring that the functions handle a known input and behave as expected. By this I mean: if we throw in a known array with known features, like you did using random.uniform() or similar, and run the individual methods of the class like CellFeatures().compute_intensity_features(), do we get the expected mean, std, min, max, kurtosis, etc.? We can use pytest.approx() against the expected values. Most of the tests in tests/data/test_data.py check that we return the right shapes and numbers, because that is crucial for feeding the data into the model. Here, we are more interested in the functionality and behavior of each function.

I left individual comments and am happy to chat about each one.
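
For instance, a known-input sketch along these lines (assuming the method and key names quoted in this thread; adjust to the actual CellFeatures API):

import numpy as np
import pytest
from viscy.representation.evaluation.feature import CellFeatures

def test_intensity_features_known_input():
    # Constant image of 3.0 inside a full mask: every statistic is known up front.
    image = np.full((16, 16), 3.0, dtype=np.float32)
    mask = np.ones((16, 16), dtype=np.uint8)
    features = CellFeatures(image, mask).compute_intensity_features()
    assert features["mean_intensity"] == pytest.approx(3.0)
    assert features["std_dev"] == pytest.approx(0.0)
    assert features["min_intensity"] == pytest.approx(3.0)
    assert features["max_intensity"] == pytest.approx(3.0)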

"""
Create an image with a small range of values
"""
return np.random.uniform(min_val, max_val, size).astype(np.float32)
Since we are using random numbers, I suggest we add a seed. Any number is fine.
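
For example (a sketch, reusing the fixture's min_val, max_val, and size; the Generator API below avoids mutating NumPy's global state, unlike np.random.seed):

import numpy as np

rng = np.random.default_rng(42)  # seeded, reproducible generator
image = rng.uniform(min_val, max_val, size).astype(np.float32)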


I might be wrong, but since this is using uniform rather than randint, the type is already a float.

"""
Create an image with a large range of values
"""
return np.random.uniform(min_val, max_val, size).astype(np.int32)

You probably want to use randint instead of uniform here.


Or random_integers in this case, since you do have min/max values.
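
Note that np.random.random_integers is deprecated in recent NumPy; a sketch with the current Generator API, reusing the fixture's min_val, max_val, and size:

import numpy as np

rng = np.random.default_rng(42)
# Draws integers in [min_val, max_val] directly; no float draw plus int cast.
image = rng.integers(min_val, max_val, size=size, endpoint=True, dtype=np.int32)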


import numpy as np
from viscy.representation.evaluation.feature import CellFeatures


Suggested change:

@pytest.fixture
def random_seed(self):
    np.random.seed(42)

assert np.isnan(cell_features[feature].values[0]), f"{feature} should be nan for constant image"

for feature in positive_features:
    assert abs(cell_features[feature].values[0]) > 0, f"{feature} should be positive for constant image"

This one needs to be more precise: abs() always returns a non-negative number, so this passes for any nonzero value, not just positive ones.

float_features = list(set(positive_features) - set(integer_features))

for feature in positive_features:
    assert abs(cell_features[feature].values[0]) >= 0, f"{feature} should be positive"

This one as well; change it, since abs() returns non-negative values and >= 0 is therefore always true.
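
A sketch of what a sharper assertion could look like, assuming a hypothetical expected dict of precomputed reference values:

# Compare against a precomputed reference instead of a vacuous abs() check.
assert cell_features[feature].values[0] == pytest.approx(expected[feature])
# Or, when only the sign is meaningful, drop abs() so positivity is actually tested:
assert cell_features[feature].values[0] > 0, f"{feature} should be positive"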

binary_image = self.create_binary_image()

features = CellFeatures(constant_image, binary_image)
cell_features = features.compute_all_features()

Since we have a nice structure in the code where you can do CellFeatures().compute_intensity_features(), I suggest we split this up and write one test function for each case. That way the tests are more focused and easier to debug, and we can isolate issues. In principle, if the individual compute_*() methods pass, then compute_all_features() should too, since it calls all of them in sequence.

CF = CellFeatures(image, mask)
features = CF.compute_intensity_features()

assert features["mean_intensity"] == value
assert features["std_dev"] == 0
assert features["min_intensity"] == value
assert features["max_intensity"] == value

I haven't written many tests myself with pytest, but we can also use either hypothesis or pytest.mark.parametrize. I've used the latter to build arrays with known input values so we can reuse the test functions, like this:

@pytest.mark.parametrize("value", [0, 1, 255, -100, 1e5])
def test_constant_image_intensity_features(self, value, binary_mask):
    """Test intensity features with constant-value images."""
    image = np.full((100, 100), value, dtype=np.float32)
    cf = CellFeatures(image, binary_mask)

    features = cf.compute_intensity_features()

    assert features["mean_intensity"] == value
    assert features["std_dev"] == 0
    assert features["min_intensity"] == value
    assert features["max_intensity"] == value
    assert features["kurtosis"] == 0  # constant distribution has no kurtosis
    assert features["skewness"] == 0  # constant distribution has no skewness
