Sbachmei/mic 5753/embarrassingly parallel basic step #153

Merged

Conversation

stevebachmeier (Collaborator):

Embarrassingly parallel basic steps (milestone)

Description

Changes and notes

This implements the milestone as written, i.e. step 3 (already a LoopStep)
is not set up to loop an EmbarrassinglyParallelStep. It's important to note
that this is not a fully fleshed-out feature:

  1. It doesn't handle multiple input slots.
  2. It doesn't handle multiple output slots.
  3. It doesn't handle multiple checkpoints.

These will be implemented in follow-up PRs.

Verification and Testing

@@ -87,6 +87,16 @@
        default=False,
        help="Do not save the results in a timestamped sub-directory of ``--output-dir``.",
    ),
    click.option(
stevebachmeier (Collaborator, Author):

Adds -vvv and --pdb options to easylink generate-dag
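For context, a hedged sketch of what such click options typically look like; the exact option wiring and help text here are assumptions, not the PR's code:

```python
import click


@click.command()
@click.option(
    "-v", "--verbose",
    count=True,
    help="Increase logging verbosity, e.g. -vvv for maximum detail.",
)
@click.option(
    "--pdb", "with_debugger",
    is_flag=True,
    help="Drop into the python debugger if an error occurs.",
)
def generate_dag(verbose: int, with_debugger: bool) -> None:
    """Generate an image of the pipeline DAG."""
```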

@@ -45,6 +45,8 @@ class PipelineGraph(ImplementationGraph):
    ----------
    config
        The :class:`~easylink.configuration.Config` object.
    freeze
stevebachmeier (Collaborator, Author):

Added this for testing purposes.
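A minimal sketch of how a freeze flag like this tends to be used in tests; the constructor call and graph API here are assumptions (PipelineGraph appears to build on networkx, whose frozen graphs raise on mutation):

```python
# Hypothetical test usage; assumes freeze defaults to True in normal runs.
graph = PipelineGraph(default_config, freeze=False)
graph.add_node("extra_node")  # mutation is only possible on an unfrozen graph
```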

from loguru import logger


def concatenate_datasets(input_files: list[str], output_filepath: str) -> None:
stevebachmeier (Collaborator, Author):

This is likely to change; we just needed something to drop in for now.
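A plausible minimal body for this placeholder aggregator, assuming parquet chunks; this is a sketch, not the PR's actual implementation:

```python
import pandas as pd
from loguru import logger


def concatenate_datasets(input_files: list[str], output_filepath: str) -> None:
    """Naively aggregate processed chunks by concatenating them into one file."""
    logger.info(f"Concatenating {len(input_files)} files into {output_filepath}")
    df = pd.concat([pd.read_parquet(path) for path in input_files], ignore_index=True)
    df.to_parquet(output_filepath)
```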

from loguru import logger


def split_data_by_size(
stevebachmeier (Collaborator, Author):

This is likely to change; we just needed something to drop in for now.
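Likewise, a hedged sketch of a size-based splitter; the signature, chunk layout, and parquet assumption are all illustrative:

```python
import math
from pathlib import Path

import pandas as pd
from loguru import logger


def split_data_by_size(
    input_files: list[str], output_dir: str, desired_chunk_size_gb: float = 1.0
) -> None:
    """Naively split input data into roughly equally sized chunk sub-directories."""
    df = pd.concat([pd.read_parquet(path) for path in input_files])
    total_size_gb = sum(Path(path).stat().st_size for path in input_files) / 1e9
    num_chunks = max(1, math.ceil(total_size_gb / desired_chunk_size_gb))
    logger.info(f"Splitting {len(df)} rows into {num_chunks} chunks")
    rows_per_chunk = math.ceil(len(df) / num_chunks)
    for i in range(num_chunks):
        chunk_dir = Path(output_dir) / f"chunk_{i}"
        chunk_dir.mkdir(parents=True, exist_ok=True)
        df.iloc[i * rows_per_chunk : (i + 1) * rows_per_chunk].to_parquet(
            chunk_dir / "result.parquet"
        )
```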

@@ -172,48 +174,91 @@ def test_update_slot_filepaths(default_config: Config, test_dir: str) -> None:
    assert edge_attrs["filepaths"] == expected_filepaths[(source, sink)]


-def test_get_input_slots(default_config: Config, test_dir: str) -> None:
+def test_get_io_slots(default_config: Config, test_dir: str) -> None:
stevebachmeier (Collaborator, Author):

get_input_slots became get_io_slots because we now need output slot data just like input slot data.

@zmbc (Collaborator) left a comment:

As usual I only skimmed the tests. Left some detailed comments but overall this looks great and has turned out much nicer than I feared!

I left a few comments to this effect, but I think we should fail aggressively when we hit situations we haven't implemented yet, to ensure things don't silently appear to work while actually broken.

run:
    splitter_utils.{self.input_slots[input_slot_to_split]["splitter"].__name__}(
        input_files=list(input.files),
        output_dir=list(output)[0],
Collaborator:

Suggested change:
-        output_dir=list(output)[0],
+        output_dir=output.output_dir,

Can you access these by name?
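For reference, Snakemake outputs declared with a keyword can be accessed by name in a run block; a minimal illustrative rule (the rule name, paths, and splitter import are assumptions):

```python
# Illustrative Snakefile fragment; assumes `import splitter_utils` at the top.
rule split_step:
    input:
        files=["input/dataset.parquet"],
    output:
        output_dir=directory("intermediate/split_step/chunks"),
    run:
        splitter_utils.split_data_by_size(
            input_files=list(input.files),
            output_dir=output.output_dir,  # named access instead of list(output)[0]
        )
```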

        rule = self._define_aggregator_rule()
        return aggregator + rule

    def _define_aggregator_function(self):
Collaborator:

I find this name confusing. I think it would be better to call this an input function and link to those Snakemake docs.
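For context, a hedged sketch of the Snakemake input-function pattern being referenced; the checkpoint name and file layout are assumptions:

```python
# Illustrative input function; Snakemake injects `checkpoints`, `expand`,
# and `glob_wildcards` into the Snakefile namespace.
def aggregate_input(wildcards):
    # .get() raises until the checkpoint has actually completed, which is
    # what defers DAG re-evaluation to runtime.
    checkpoint_dir = checkpoints.split_step.get(**wildcards).output.output_dir
    return expand(
        checkpoint_dir + "/{chunk}/result.parquet",
        chunk=glob_wildcards(checkpoint_dir + "/{chunk}/result.parquet").chunk,
    )
```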

    checkpoint_output = glob.glob(f"{{{checkpoint_name}.get(**wildcards).output.output_dir}}/*/")
else:
    output, _ = {checkpoint_name}.rule.expand_output(wildcards)
    raise IncompleteCheckpointException({checkpoint_name}.rule, checkpoint_target(output[0]))
Collaborator:

I have no idea where to put the comment (inside the string or outside) but we should have a comment pointing to the place in the Snakemake codebase that this was inspired by

stevebachmeier (Collaborator, Author):

Added it to the Notes section of the AggregationRule docstring

Collaborator:

That's a good addition, but I meant how we knew what to use for the second argument; I think the relevant link is https://github.com/snakemake/snakemake/blob/04f89d330dd94baa51f41bc796392f85bccbd231/snakemake/checkpoints.py#L42

Comment on lines 425 to 429
if os.path.exists(checkpoint_file):
    checkpoint_output = glob.glob(f"{{{checkpoint_name}.get(**wildcards).output.output_dir}}/*/")
else:
    output, _ = {checkpoint_name}.rule.expand_output(wildcards)
    raise IncompleteCheckpointException({checkpoint_name}.rule, checkpoint_target(output[0]))
Collaborator:

You could simplify the control flow a bit like this:

Suggested change:
-if os.path.exists(checkpoint_file):
-    checkpoint_output = glob.glob(f"{{{checkpoint_name}.get(**wildcards).output.output_dir}}/*/")
-else:
-    output, _ = {checkpoint_name}.rule.expand_output(wildcards)
-    raise IncompleteCheckpointException({checkpoint_name}.rule, checkpoint_target(output[0]))
+if not os.path.exists(checkpoint_file):
+    output, _ = {checkpoint_name}.rule.expand_output(wildcards)
+    raise IncompleteCheckpointException({checkpoint_name}.rule, checkpoint_target(output[0]))
+checkpoint_output = glob.glob(f"{{{checkpoint_name}.get(**wildcards).output.output_dir}}/*/")

@albrja (Contributor) left a comment:

This implementation is pretty in the weeds, but I tried to review at least some of it, which is why my comments are pretty nitpicky about docstrings.

@@ -45,6 +45,10 @@ def __init__(
        implementation_config: LayeredConfigTree,
        input_slots: Iterable["InputSlot"] = (),
        output_slots: Iterable["OutputSlot"] = (),
        is_embarrassingly_parallel: bool = False,
        # wildcards: Sequence[
albrja (Contributor):

Is this supposed to be commented out?

        )
        return input_slot_attrs, output_slot_attrs

    def get_whether_embarrassingly_parallel(self, node: str) -> bool:
albrja (Contributor):

Nit: This method doesn't really seem necessary but probably makes the formatting in the other method cleaner

stevebachmeier (Collaborator, Author):

@albrja

get_whether_embarrassingly_parallel relies on a node_name arg and is used in pipeline.Pipeline._write_implementation_rules() to determine whether a given node is embarrassingly parallel or not.

any_embarrassingly_parallel is really more of a property (I'll make it one) that relies on get_whether_embarrassingly_parallel but belongs to the PipelineGraph as a whole, i.e. whether any node is embarrassingly parallel.

I do agree that making this its own property is probably somewhat unnecessary indirection since it's only used once; I just kinda followed suit w/ the spark_is_required method (I'll change that to a property as well) right above these two for consistency.
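A minimal sketch of the property described above; the body is illustrative, not the merged code:

```python
@property
def any_embarrassingly_parallel(self) -> bool:
    """Whether any node in this PipelineGraph is embarrassingly parallel."""
    return any(
        self.get_whether_embarrassingly_parallel(node) for node in self.nodes
    )
```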

@@ -56,7 +66,7 @@ class TargetRule(Rule):
    requires_spark: bool
    """Whether or not this rule requires a Spark environment to run."""

-    def _build_rule(self) -> str:
+    def build_rule(self) -> str:
albrja (Contributor):

Why did you unprivate this?

stevebachmeier (Collaborator, Author):

My thought was that this, being a Rule's abstract method, is by definition intended to be called by other classes (the concrete instances of Rule). In other languages that enforce privacy, you literally couldn't mark an abstract method as private (because if it were private, how could you then define it in the subclass?).

Really doesn't matter much though. Just trying to get a handle on _s.
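A minimal sketch of the pattern under discussion; the TargetRule body here is illustrative:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Rule(ABC):
    @abstractmethod
    def build_rule(self) -> str:
        """Build this rule's Snakemake definition; implemented by concrete Rules."""


@dataclass
class TargetRule(Rule):
    def build_rule(self) -> str:
        return "rule all:\n    input: 'results/final.parquet'"
```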


When running an :class:`~easylink.implementation.Implementation` in an embarrassingly
parallel way, we do not know until runtime how many parallel jobs there will
be (e.g. we don't know a priori how many chunks a large incoming dataset will be split into).
albrja (Contributor):

priority?

stevebachmeier (Collaborator, Author):

a priori, i.e. "beforehand". I'll change it though; no reason to get fancy here.


@dataclass
class AggregationRule(Rule):
"""A :class:`Rule` that aggregates the processed chunks of output data.
albrja (Contributor):

It might be helpful if there were some term specifically for the output of an embarrassingly parallel step. I feel like this sentence and the next one are slightly redundant, in that both just specify that it is the output of an embarrassingly parallel step.

stevebachmeier (Collaborator, Author):

I suppose they're a little bit redundant, but I think the context that it's only really needed for EmbarrassinglyParallelSteps is pretty important (it just didn't fit in the one-line part of the docstring).

I do agree it might be better to call it something like ChunkAggregationRule, but that seems a bit of a mouthful. I won't even consider EmbarrassinglyParallelProcessedChunkAggregationRule 😆

@stevebachmeier stevebachmeier marked this pull request as draft February 25, 2025 15:41
@stevebachmeier stevebachmeier marked this pull request as ready for review February 25, 2025 23:44
@zmbc (Collaborator) left a comment:

Made a few minor comments but the updates look good overall 👍

# output_paths = ",".join(self.output)
# wildcards_subdir = "/".join([f"{{wildcards.{wc}}}" for wc in self.wildcards])
# and then in shell cmd: export DUMMY_CONTAINER_OUTPUT_PATHS={output_paths}/{wildcards_subdir}
# TODO [MIC-5877]: handle multiple wildcards, e.g.
Collaborator:

Is that really the right ticket? https://jira.ihme.washington.edu/browse/MIC-5877

If so, could use more explanation of why

Comment on lines 213 to 216
if len(self.output) > 1:
    raise NotImplementedError(
        "FIXME [MIC-5883] Multiple output files not yet supported"
    )
Collaborator:

Doesn't this need to be inside the if self.is_embarrassingly_parallel? If so, can we add a test that would fail due to this?
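A hedged sketch of the kind of test being requested; make_rule is a hypothetical helper, and the real constructor arguments will differ:

```python
import pytest


def test_multiple_outputs_only_guarded_when_embarrassingly_parallel():
    # Should raise: multiple outputs aren't supported for embarrassingly
    # parallel rules yet.
    ep_rule = make_rule(
        output=["a.parquet", "b.parquet"], is_embarrassingly_parallel=True
    )
    with pytest.raises(NotImplementedError, match="Multiple output files"):
        ep_rule.build_rule()

    # Should NOT raise: the guard shouldn't apply to non-parallel rules.
    plain_rule = make_rule(
        output=["a.parquet", "b.parquet"], is_embarrassingly_parallel=False
    )
    plain_rule.build_rule()
```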

"""
aggregator = self._define_aggregator_function()
aggregator = self._define_input_function()
Collaborator:

nit: could change this variable name, e.g. to input_function_definition

    # FIXME: handle multiple filepaths
    def _define_input_function(self):
        """Builds the `input function <https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#input-functions>`_."""
        if len(self.output_slot["filepaths"]) > 1:
Collaborator:

This feels like a slightly different thing; multiple files vs multiple slots?

stevebachmeier (Collaborator, Author) commented on Feb 26, 2025:

We only pass a single slot at a time into the AggregationRule.

@stevebachmeier stevebachmeier merged commit 664a4a5 into main Feb 26, 2025
12 checks passed
@stevebachmeier stevebachmeier deleted the sbachmei/mic-5753/embarrassingly-parallel-basic-step branch February 26, 2025 17:12