Add get_jump_image_iter, fix tqdm #44
Conversation
-from jump_portrait.utils import batch_processing, parallel
+from jump_portrait.utils import batch_processing, parallel, try_function
+from typing import List
This can be replaced by just 'list' since Python 3.9 IIRC, so there is no need for the extra import.
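For reference, since Python 3.9 (PEP 585) the builtin `list` supports subscripting directly, so the annotation works without `typing.List`. A minimal illustration (the function below is hypothetical, not project code):

```python
# Since Python 3.9 (PEP 585), builtin collections support subscripting,
# so `typing.List` is unnecessary for annotations.
# `normalize_channels` is a hypothetical example, not project code.

def normalize_channels(channel: list[str]) -> list[str]:
    return [ch.upper() for ch in channel]

print(normalize_channels(["dna", "er"]))  # ['DNA', 'ER']
```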
Load jump image associated to metadata in a threaded fashion.
----------
Indentation issue, see https://github.com/scikit-image/scikit-image/blob/v0.24.0/skimage/measure/_moments.py#L376-L405 for an example.
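As a sketch of the layout the linked scikit-image example follows (numpydoc conventions: the section header and its dashes share one indentation level, parameter names sit at that level, and descriptions are indented one level deeper). The function name and fields below are hypothetical:

```python
# Hypothetical docstring sketch following numpydoc conventions,
# as in the linked scikit-image example.

def load_jump_images(metadata, channel):
    """Load JUMP images associated with metadata in a threaded fashion.

    Parameters
    ----------
    metadata : pl.DataFrame
        Table with one row per image location.
    channel : list of strings
        Channels to fetch, e.g. ['DNA', 'ER'].

    Returns
    -------
    list
        The loaded images.
    """
    return []  # placeholder body for illustration
```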
@@ -72,7 +75,7 @@ def parallel(
    jobs = len(iterable)
    slices = slice_iterable(iterable, jobs)
    result = Parallel(n_jobs=jobs, timeout=timeout)(
-        delayed(func)(chunk, idx, *args, **kwargs)
+        delayed(func)(chunk, idx, print_progress, *args, **kwargs)
print_progress is a bit too verbose. Let us rename it to "verbose" to follow conventions.
-for item in item_list:
-    # pbar.set_description(f"Processing {item}")
+for item in tqdm(item_list, position=0, leave=True,
+                 disable=not print_progress,
I think 'leave' may cause trouble. We should test it on the command line, by running it in a script, and alongside notebooks (which it did not support originally).
I will try both and let you know. From what I remember it allows it to work in both notebooks and scripts.
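One way to get consistent behaviour in both contexts is `tqdm.auto`, which picks the notebook widget under Jupyter and the plain console bar elsewhere. A sketch only; the `verbose` flag name follows the rename suggested earlier in this review, and the loop body is a placeholder:

```python
# `tqdm.auto` selects the notebook widget under Jupyter and the
# console bar otherwise. Sketch; `process` is a hypothetical example.
from tqdm.auto import tqdm

def process(item_list, verbose=True):
    results = []
    for item in tqdm(item_list, position=0, leave=True, disable=not verbose):
        results.append(item * 2)  # placeholder work
    return results
```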
channel(List[str]): list of channel desired
    Must be in ['DNA', 'ER', 'AGP', 'Mito', 'RNA']
site(List[str]): list of site desired
    For compound, must be in ['1' - '6']
    For ORF, CRISPR, must be in ['1' - '9']
correction(str): Must be 'Illum' or 'Orig'
The docstrings should be easy for humans to read, so syntax like List[str] is not ideal; replace it with 'list of strings'.
print_progress=print_progress)

img_list = sorted(img_list, key=lambda x: len(x))
fail_success = {k: list(g) for k, g in groupby(img_list, key=lambda x: len(x))}
Nice use of groupby
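For readers following along, the pattern sorts the results by tuple length and then groups on it, which separates failed calls (parameters only) from successes (parameters plus result). A toy illustration with made-up tuples; note `groupby` requires the input to be sorted on the same key:

```python
from itertools import groupby

# Successful calls return (params..., result); failures return only the
# params, so tuple length distinguishes the two groups. Toy data below.
img_list = [("a", 1, "ok"), ("b", 2), ("c", 3, "ok")]

img_list = sorted(img_list, key=len)  # groupby needs sorted input
fail_success = {k: list(g) for k, g in groupby(img_list, key=len)}

failed = fail_success.get(2, [])     # length-2 tuples: params only
succeeded = fail_success.get(3, [])  # length-3 tuples: params + result
```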
@@ -119,6 +120,52 @@ def get_jump_image(
    return result


def get_jump_image_iter(metadata: pl.DataFrame, channel: List[str],
rename get_jump_image_iter to get_jump_image_batch
@@ -84,7 +85,7 @@ def get_jump_image(
    Site identifier (also called foci), default is 1.
correction : str
    Whether or not to use corrected data. It does not by default.
-apply_illum : bool
+apply_correction : bool
    When correction=="Illum" apply Illum correction on original image.
(I forgot to update the argument description.) Please make it "When apply_correction==...."
img_list = sorted(img_list, key=lambda x: len(x))
fail_success = {k: list(g) for k, g in groupby(img_list, key=lambda x: len(x))}
if len(fail_success) == 1:
This is the same as 'if len(fail_success):'
There are a few things necessary:
- format the files with ruff
- make docstrings human-legible
- replace print_progress with verbose
- run the tests so we know things work (we should automate this at some point)
- fix the issues indicated in the per-line comments
- the batched image function will need some refactoring, as I specified in the comment under that function. The general idea is that we should not test by default whether the input yielded images or not. This messes up the order, and silent errors are hard to debug. I suggest giving users the option to ignore errors and using that to decide whether or not to wrap the get_jump_image function in a try-except block.
My main concern with the try-except wrapper around the batcher is that the interface differs from the normal get_image... functions. On the other hand, it makes sense if we are batching a ton of images.
My solution is to pass "ignore_errors" as an argument (False by default) and then wrap the function in a try-except (or not) based on that argument. This changes the shape of the output, so the user must be conscious of making that decision.
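A minimal sketch of that design, with a stand-in `fetch` function in place of get_jump_image (names and failure condition are hypothetical): wrap the call in try-except only when the caller opts in, so errors stay loud by default.

```python
# Hypothetical sketch of the ignore_errors proposal. `fetch` stands in
# for get_jump_image and fails on an invalid site.

def fetch(plate, site):
    if site < 0:
        raise ValueError("bad site")
    return f"img_{plate}_{site}"

def get_jump_image_batch(iterable, ignore_errors=False):
    results = []
    for params in iterable:
        if ignore_errors:
            try:
                results.append((*params, fetch(*params)))
            except Exception:
                # Output shape differs on failure: params only, no image,
                # so the caller must opt in consciously.
                results.append(params)
        else:
            # Default: let the first failure propagate, preserving order.
            results.append((*params, fetch(*params)))
    return results
```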
iterable = [(*metadata.row(i), ch, s, correction)
            for i in range(metadata.shape[0]) for s in site for ch in channel]
Replace triple-nested loop with itertools.product
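A sketch of that suggestion, with toy metadata rows in place of `metadata.row(i)` (the row contents and `"Orig"` correction are made up); `product(rows, site, channel)` reproduces the same nesting order as the original comprehension:

```python
from itertools import product

# Toy stand-ins for metadata.row(i), site, and channel.
metadata_rows = [("src", "batch", "plate", "well")]
site = ["1", "2"]
channel = ["DNA", "ER"]

# product iterates rows outermost, then site, then channel, matching
# the nesting of the original comprehension.
iterable = [(*row, ch, s, "Orig")
            for row, s, ch in product(metadata_rows, site, channel)]
```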
If it success, return a tuple of function parameters + its results
If it fails, return the function parameters
'''
# This assume parameters are packed in a tuple
"This assumes"
features = pl.DataFrame(img_success,
                        schema=["Metadata_Source", "Metadata_Batch", "Metadata_Plate", "Metadata_Well",
                                "channel", "site", "correction",
                                "img"])
Do not put numpy arrays inside DataFrames. If you want to return a set of data + metadata, return them as tuples. Normally I would suggest doing so by stacking all the images, but I know they don't all have the same size, so let us use a tuple of (image, meta) pairs. A DataFrame is overkill for this. Just specify what metadata is included in the output in a comment at the end of the function.
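A sketch of the suggested output shape, with made-up metadata values; each element is an (image, metadata) pair, which tolerates differently sized arrays without forcing them into a table:

```python
import numpy as np

# Hypothetical sketch: (image, metadata) pairs instead of a DataFrame.
# Metadata fields per pair (illustrative):
# (source, batch, plate, well, channel, site, correction).
img_success = [
    (np.zeros((4, 4)), ("s1", "b1", "p1", "A01", "DNA", "1", "Orig")),
    (np.zeros((8, 8)), ("s1", "b1", "p1", "A01", "ER", "1", "Orig")),
]

# Differently sized images coexist without any stacking constraint.
shapes = [img.shape for img, meta in img_success]
```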
My main motivation for using a DataFrame is that it allows me to do some grouping if I need to, or to order things easily. But understood, I will remove it.
-from jump_portrait.utils import batch_processing, parallel
+from jump_portrait.utils import batch_processing, parallel, try_function
+from typing import List
+from itertools import groupby
Sort imports. Ruff should fix it automatically
@@ -373,3 +362,11 @@ def get_gene_images(
    )

    return images

metadata_pre = get_item_location_info("MYT1")
I think these were not supposed to be here. These are the lines for testing from the readme, right?
Ready for review @afermg. But no rush, please take your time 🙏
Looks good in general. I'm still not terribly convinced about the try-except blocks but it seems like they are necessary for now to get this through. We should think about how to fix the fringe cases where we get the exceptions (see #45).
When running tests I get this warning, though it runs fine:
test/unit_test.py::test_get_jump_image_batch[Illum-channel0-site0]
/home/amunoz/projects/monorepo/libs/jump_portrait/.venv/lib/python3.11/site-packages/joblib/externals/loky/backend/fork_exec.py:38: RuntimeWarning: Using fork() can cause Polars to deadlock in the child process.
In addition, using fork() with Python in general is a recipe for mysterious
deadlocks and crashes.
The most likely reason you are seeing this error is because you are using the
multiprocessing module on Linux, which uses fork() by default. This will be
fixed in Python 3.14. Until then, you want to use the "spawn" context instead.
See https://docs.pola.rs/user-guide/misc/multiprocessing/ for details.
All tests run fine. I will approve, but we may have to rework some minor details soon :)
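The Polars page linked in the warning recommends the "spawn" start method over Linux's default fork(). A minimal sketch of that recommendation; whether it can be threaded through joblib/loky in this codebase is untested here:

```python
# The linked Polars docs recommend "spawn" over Linux's default fork().
# Sketch only; how to apply this through joblib/loky is not covered.
import multiprocessing as mp

ctx = mp.get_context("spawn")
# Use ctx.Pool(...) / ctx.Process(...) instead of the module-level
# mp.Pool(...) so child processes are spawned rather than forked.
```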
1) Motivation:
2) Solution:
A) Description of the Solution
Which Input?
Metadata information is often stored in a pl.DataFrame (for instance, the output of get_item_location_info). Naturally, then, get_jump_image_iter takes as input a pl.DataFrame containing exactly this information, in the following order (consistent with get_jump_image):
Which Output?
A polars DataFrame storing all the metadata (including channel, site and correction) plus the array containing the image information.
get_jump_image proved to fail and raise the following error:
This seems to be an issue with the data itself. When the work fails, the tuple of inputs leading to the failure is stored in work_fail.
B) Subsequent modification to enable Solution
The function try_function has been created:
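The original snippet is elided in this capture; a minimal sketch of such a wrapper, under the assumption (stated in the diff's own comment) that parameters arrive packed in a tuple, might look like:

```python
from functools import wraps

# Hypothetical sketch of try_function: on success return the packed
# parameters plus the result, on failure return the parameters alone.
def try_function(f):
    @wraps(f)
    def wrapper(item, *args, **kwargs):
        # This assumes parameters are packed in a tuple.
        try:
            return (*item, f(*item, *args, **kwargs))
        except Exception:
            return item
    return wrapper

@try_function
def divide(a, b):
    return a / b
```

Downstream, the length of each returned tuple then tells successes (params + result) apart from failures (params only), which is what the sorted/groupby step relies on.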
tqdm position used to be enforced using this parameter:
It was not behaving as desired in Jupyter notebooks, so I suggest using
It prints every tqdm bar on the same line, but whenever one worker is done, the remaining bars are updated on the next line.
This is not ideal, but this solution still makes it possible to see both where the workers are in their process and which worker is done.
Other solutions exist on the web, but they only make it possible to see which worker is done. In my opinion that is not as useful, since workers go at relatively the same pace; if the work of each worker is tremendous, it will take a long time before there is any update anyway.
parallel and batch_processing have been modified accordingly to support the print_progress variable.
3) Test
The function has been tested using the following code:
4) Question