Add get_jump_image_iter, fix tqdm #44
Conversation
-from jump_portrait.utils import batch_processing, parallel
+from jump_portrait.utils import batch_processing, parallel, try_function
+from typing import List
This can be replaced by just 'list' since Python 3.9 IIRC, so there is no need for the extra import.
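For reference, since Python 3.9 (PEP 585) the builtin `list` supports subscripting directly, so the annotation works without `typing.List`. A minimal illustration (the function below is hypothetical, not project code):

```python
# Since Python 3.9 (PEP 585), builtin collections support subscripting,
# so `typing.List` is unnecessary for annotations.
# `normalize_channels` is a hypothetical example, not project code.

def normalize_channels(channel: list[str]) -> list[str]:
    return [ch.upper() for ch in channel]

print(normalize_channels(["dna", "er"]))  # ['DNA', 'ER']
```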
Load jump image associated to metadata in a threaded fashion.
----------
Indentation issue, see https://github.com/scikit-image/scikit-image/blob/v0.24.0/skimage/measure/_moments.py#L376-L405 for an example.
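As a sketch of the layout the linked scikit-image example follows (numpydoc conventions: the section header and its dashes share one indentation level, parameter names sit at that level, and descriptions are indented one level deeper). The function name and fields below are hypothetical:

```python
# Hypothetical docstring sketch following numpydoc conventions,
# as in the linked scikit-image example.

def load_jump_images(metadata, channel):
    """Load JUMP images associated with metadata in a threaded fashion.

    Parameters
    ----------
    metadata : pl.DataFrame
        Table with one row per image location.
    channel : list of strings
        Channels to fetch, e.g. ['DNA', 'ER'].

    Returns
    -------
    list
        The loaded images.
    """
    return []  # placeholder body for illustration
```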
@@ -72,7 +75,7 @@ def parallel(
    jobs = len(iterable)
    slices = slice_iterable(iterable, jobs)
    result = Parallel(n_jobs=jobs, timeout=timeout)(
-        delayed(func)(chunk, idx, *args, **kwargs)
+        delayed(func)(chunk, idx, print_progress, *args, **kwargs)
print_progress is a bit too verbose. Let us rename it to "verbose" to follow conventions.
-for item in item_list:
-    # pbar.set_description(f"Processing {item}")
+for item in tqdm(item_list, position=0, leave=True,
+                 disable=not print_progress,
I think 'leave' may cause trouble. We should test it on the command line, by running it in a script, and alongside notebooks (which it did not support originally).
I will try both and let you know. From what I remember it allows it to work in both notebooks and scripts.
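One way to get consistent behaviour in both contexts is `tqdm.auto`, which picks the notebook widget under Jupyter and the plain console bar elsewhere. A sketch only; the `verbose` flag name follows the rename suggested earlier in this review, and the loop body is a placeholder:

```python
# `tqdm.auto` selects the notebook widget under Jupyter and the
# console bar otherwise. Sketch; `process` is a hypothetical example.
from tqdm.auto import tqdm

def process(item_list, verbose=True):
    results = []
    for item in tqdm(item_list, position=0, leave=True, disable=not verbose):
        results.append(item * 2)  # placeholder work
    return results
```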
channel(List[str]): list of channel desired
    Must be in ['DNA', 'ER', 'AGP', 'Mito', 'RNA']
site(List[str]): list of site desired
    For compound, must be in ['1' - '6']
    For ORF, CRISPR, must be in ['1' - '9']
correction(str): Must be 'Illum' or 'Orig'
The docstrings should be easy for humans to read, so syntax like List[str] is not ideal; replace it with 'list of strings'.
print_progress=print_progress)

img_list = sorted(img_list, key=lambda x: len(x))
fail_success = {k: list(g) for k, g in groupby(img_list, key=lambda x: len(x))}
Nice use of groupby
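For readers following along, the pattern sorts the results by tuple length and then groups on it, which separates failed calls (parameters only) from successes (parameters plus result). A toy illustration with made-up tuples; note `groupby` requires the input to be sorted on the same key:

```python
from itertools import groupby

# Successful calls return (params..., result); failures return only the
# params, so tuple length distinguishes the two groups. Toy data below.
img_list = [("a", 1, "ok"), ("b", 2), ("c", 3, "ok")]

img_list = sorted(img_list, key=len)  # groupby needs sorted input
fail_success = {k: list(g) for k, g in groupby(img_list, key=len)}

failed = fail_success.get(2, [])     # length-2 tuples: params only
succeeded = fail_success.get(3, [])  # length-3 tuples: params + result
```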
@@ -119,6 +120,52 @@ def get_jump_image(
    return result


def get_jump_image_iter(metadata: pl.DataFrame, channel: List[str],
rename get_jump_image_iter to get_jump_image_batch
@@ -84,7 +85,7 @@ def get_jump_image(
    Site identifier (also called foci), default is 1.
correction : str
    Whether or not to use corrected data. It does not by default.
-apply_illum : bool
+apply_correction : bool
    When correction=="Illum" apply Illum correction on original image.
(I forgot to update the argument description.) Please make it "When apply_correction==...."
img_list = sorted(img_list, key=lambda x: len(x))
fail_success = {k: list(g) for k, g in groupby(img_list, key=lambda x: len(x))}
if len(fail_success) == 1:
This is the same as 'if len(fail_success):'
There are a few things necessary:
- format the files with ruff
- make docstrings human-legible
- replace print_progress with verbose
- run the tests so we know things work (we should automate this at some point)
- fix the issues indicated in the per-line comments
- the batched image function will need some refactoring, as I specified in the comment under that function. The general idea is that we should not test by default whether the input yielded images or not. This messes up the order, and silent errors are hard to debug. I suggest giving users the option to ignore errors and using that to decide whether or not to wrap the get_jump_image function in a try-except block.
My main concern with the try-except wrapper around the batcher is that the interface differs from the normal get_image... functions. On the other hand, it makes sense if we are batching a ton of images.
My solution is to pass "ignore_errors" as an argument (False by default) and then wrap the function in a try-except (or not) based on that argument. This changes the shape of the output, so the user must be conscious of making that decision.
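A minimal sketch of that design, with a stand-in `fetch` function in place of get_jump_image (names and failure condition are hypothetical): wrap the call in try-except only when the caller opts in, so errors stay loud by default.

```python
# Hypothetical sketch of the ignore_errors proposal. `fetch` stands in
# for get_jump_image and fails on an invalid site.

def fetch(plate, site):
    if site < 0:
        raise ValueError("bad site")
    return f"img_{plate}_{site}"

def get_jump_image_batch(iterable, ignore_errors=False):
    results = []
    for params in iterable:
        if ignore_errors:
            try:
                results.append((*params, fetch(*params)))
            except Exception:
                # Output shape differs on failure: params only, no image,
                # so the caller must opt in consciously.
                results.append(params)
        else:
            # Default: let the first failure propagate, preserving order.
            results.append((*params, fetch(*params)))
    return results
```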
iterable = [(*metadata.row(i), ch, s, correction)
            for i in range(metadata.shape[0]) for s in site for ch in channel]
Replace triple-nested loop with itertools.product
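A sketch of that suggestion, with toy metadata rows in place of `metadata.row(i)` (the row contents and `"Orig"` correction are made up); `product(rows, site, channel)` reproduces the same nesting order as the original comprehension:

```python
from itertools import product

# Toy stand-ins for metadata.row(i), site, and channel.
metadata_rows = [("src", "batch", "plate", "well")]
site = ["1", "2"]
channel = ["DNA", "ER"]

# product iterates rows outermost, then site, then channel, matching
# the nesting of the original comprehension.
iterable = [(*row, ch, s, "Orig")
            for row, s, ch in product(metadata_rows, site, channel)]
```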
If it success, return a tuple of function parameters + its results
If it fails, return the function parameters
'''
# This assume parameters are packed in a tuple
"This assumes"
features = pl.DataFrame(img_success,
                        schema=["Metadata_Source", "Metadata_Batch", "Metadata_Plate", "Metadata_Well",
                                "channel", "site", "correction",
                                "img"])
Do not put numpy arrays inside DataFrames. If you want to return a set of data + metadata, return them as tuples. Normally I would suggest doing so by stacking all the images, but I know they don't all have the same size, so let us use a tuple of (image, meta) pairs. A DataFrame is overkill for this. Just specify what metadata is included in the output in a comment at the end of the function.
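A sketch of the suggested output shape, with made-up metadata values; each element is an (image, metadata) pair, which tolerates differently sized arrays without forcing them into a table:

```python
import numpy as np

# Hypothetical sketch: (image, metadata) pairs instead of a DataFrame.
# Metadata fields per pair (illustrative):
# (source, batch, plate, well, channel, site, correction).
img_success = [
    (np.zeros((4, 4)), ("s1", "b1", "p1", "A01", "DNA", "1", "Orig")),
    (np.zeros((8, 8)), ("s1", "b1", "p1", "A01", "ER", "1", "Orig")),
]

# Differently sized images coexist without any stacking constraint.
shapes = [img.shape for img, meta in img_success]
```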
My main motivation for using a DataFrame is that it allows me to do some grouping if I need to, or to order things easily. But understood, I will remove it.
-from jump_portrait.utils import batch_processing, parallel
+from jump_portrait.utils import batch_processing, parallel, try_function
+from typing import List
+from itertools import groupby
Sort imports. Ruff should fix it automatically
@@ -373,3 +362,11 @@ def get_gene_images(
    )

    return images

metadata_pre = get_item_location_info("MYT1")
I think these were not supposed to be here. These are the lines for testing from the readme, right?
Ready for review @afermg. But no rush, please take your time 🙏
Looks good in general. I'm still not terribly convinced about the try-except blocks but it seems like they are necessary for now to get this through. We should think about how to fix the fringe cases where we get the exceptions (see #45).
When running tests I get this warning, though it runs fine:
test/unit_test.py::test_get_jump_image_batch[Illum-channel0-site0]
/home/amunoz/projects/monorepo/libs/jump_portrait/.venv/lib/python3.11/site-packages/joblib/externals/loky/backend/fork_exec.py:38: RuntimeWarning: Using fork() can cause Polars to deadlock in the child process.
In addition, using fork() with Python in general is a recipe for mysterious
deadlocks and crashes.
The most likely reason you are seeing this error is because you are using the
multiprocessing module on Linux, which uses fork() by default. This will be
fixed in Python 3.14. Until then, you want to use the "spawn" context instead.
See https://docs.pola.rs/user-guide/misc/multiprocessing/ for details.
All tests run fine. I will approve, but we may have to rework some minor details soon :)
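The Polars page linked in the warning recommends the "spawn" start method over Linux's default fork(). A minimal sketch of that recommendation; whether it can be threaded through joblib/loky in this codebase is untested here:

```python
# The linked Polars docs recommend "spawn" over Linux's default fork().
# Sketch only; how to apply this through joblib/loky is not covered.
import multiprocessing as mp

ctx = mp.get_context("spawn")
# Use ctx.Pool(...) / ctx.Process(...) instead of the module-level
# mp.Pool(...) so child processes are spawned rather than forked.
```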
1) Motivation:
2) Solution:
A) Description of the Solution
Which Input?
Metadata information is often stored in a pl.DataFrame (for instance, the output of get_item_location_info). Naturally, then, get_jump_image_iter takes as input a pl.DataFrame containing exactly this information, in the following order (consistent with get_jump_image):
Which Output?
A polars DataFrame storing all the metadata (including channel, site and correction) plus the array containing the image information.
get_jump_image proved to fail and raise the following error:
This seems to be an issue with the data itself. When the work fails, the tuple of inputs leading to the failure is stored in work_fail.
B) Subsequent modification to enable Solution
The function try_function has been created:
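The original snippet is elided in this capture; a minimal sketch of such a wrapper, under the assumption (stated in the diff's own comment) that parameters arrive packed in a tuple, might look like:

```python
from functools import wraps

# Hypothetical sketch of try_function: on success return the packed
# parameters plus the result, on failure return the parameters alone.
def try_function(f):
    @wraps(f)
    def wrapper(item, *args, **kwargs):
        # This assumes parameters are packed in a tuple.
        try:
            return (*item, f(*item, *args, **kwargs))
        except Exception:
            return item
    return wrapper

@try_function
def divide(a, b):
    return a / b
```

Downstream, the length of each returned tuple then tells successes (params + result) apart from failures (params only), which is what the sorted/groupby step relies on.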
tqdm position used to be enforced using this parameter:
It was not behaving as desired in Jupyter notebooks, so I suggest using
It prints every tqdm bar on the same line, but whenever one worker is done, the remaining bars are updated on the next line.
This is not ideal, but this solution still makes it possible to see both where the workers are in their process and which worker is done.
Other solutions exist on the web, but they only make it possible to see which worker is done. In my opinion that is not as useful, since workers go at relatively the same pace; if the work of each worker is tremendous, it will take a long time before there is any update anyway.
parallel and batch_processing have been modified accordingly to support the print_progress variable.
3) Test
The function has been tested using the following code:
4) Question