
SNOW-1883796: Support appending of same size Column to Dataframe #2869

Open
rhaffar opened this issue Jan 15, 2025 · 2 comments
Labels: feature (New feature or request)

Comments


rhaffar commented Jan 15, 2025

What is the current behavior?

Apologies if I'm just missing something in the docs, but currently there seems to be no way to append a Column to a DataFrame.

What is the desired behavior?

To have some method of appending Column objects to DataFrames. The main motive for me is to get something approaching parallelism when generating multiple columns via with_column operations. If we can append columns to DataFrames, we can generate the new columns from a DataFrame separately, in parallel, leaving only the appending of the new columns to be done sequentially. This can be done with joins as well, but I imagine directly appending a column should be much less expensive. The only consideration is that row order would have to be maintained.
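For illustration, here is a hypothetical sketch of what such an API could look like (the append method is an invented name for this example; nothing like it exists in Snowpark today):

    # compute a new column independently of the target DataFrame
    halved = df.select(df["A"] / 2)
    # hypothetical, requested API: append the same-size column by position
    df = df.append("A_halved", halved)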

How would this improve snowflake-snowpark-python?

  • Enable general use cases for users who want to append columns to a DataFrame.
  • Provide a way to significantly improve the performance of generating multiple new columns from an existing DataFrame, without users having to implement any parallelization themselves.
@rhaffar rhaffar added the feature New feature or request label Jan 15, 2025
@github-actions github-actions bot changed the title Support appending of same size Column to Dataframe SNOW-1883796: Support appending of same size Column to Dataframe Jan 15, 2025
sfc-gh-aalam (Contributor) commented

@rhaffar would you mind sharing a code example of what you would like to see supported? I don't completely understand why with_column is not sufficient for your use case.

rhaffar (Author) commented Jan 15, 2025

@sfc-gh-aalam No problem!

Here is an arbitrary example of the solution I've currently implemented, given a target DataFrame and a set of columns that need to be transformed:

# Sequentially derive a transformed copy of each column in col_set.
for col in col_set:
    new_col_name = col + "_transformed"
    self.target_dataframe = transform_col(
        self.target_dataframe,
        col,
        new_col_name,
    )

def transform_col(target_dataframe, col, new_col_name):
    # Add the new column, derived from the source column, to the DataFrame.
    return target_dataframe.with_column(new_col_name, target_dataframe[col] / 2)

Given something like this, say I have 200 columns to transform: I cannot compute the new columns concurrently (even though all the source columns are already in the original target_dataframe); I must compute them entirely sequentially, since a naive multi-threaded version of this loop would introduce race conditions on self.target_dataframe. My idea was to instead do something like the following:

def transform_dataframe(self, col):
    new_col_name = col + "_transformed"
    # Compute the new column independently of the shared DataFrame.
    new_col = transform_col(
        self.target_dataframe,
        col,
        new_col_name,
    )
    # Only the append itself needs to be serialized.
    with self.lock:
        # Hypothetical API: append a same-size column to the DataFrame.
        self.target_dataframe = self.target_dataframe.append(new_col_name, new_col)

def transform_col(target_dataframe, col, new_col_name):
    return target_dataframe.select(target_dataframe[col] / 2)

Here the transform_dataframe method would be called with one thread per column in col_set, in some threadpool. This way, the calculation of the target columns could be done in parallel, and only the updating of target_dataframe via appending would be sequential. There may just be some misunderstanding on my end regarding how computations are performed in Snowpark, but my understanding is that, as of now, repeated with_column calls like in my original solution will not result in concurrent calculation of the target columns, so I'm requesting this as a more performant alternative. Alternatively, allowing with_column calls to be made lazily, with some way for Snowpark to evaluate them in parallel, would also be nice, but I'm assuming that is quite difficult.
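For completeness, here is a minimal sketch of the driver I have in mind, assuming the hypothetical append method above existed. ColumnTransformer and run are invented names for this example, and transform_col is the version from the snippet above:

    import threading
    from concurrent.futures import ThreadPoolExecutor

    class ColumnTransformer:
        def __init__(self, target_dataframe, col_set):
            self.target_dataframe = target_dataframe
            self.col_set = col_set
            self.lock = threading.Lock()

        def transform_dataframe(self, col):
            new_col_name = col + "_transformed"
            # The expensive part: computed concurrently across threads.
            new_col = transform_col(self.target_dataframe, col, new_col_name)
            with self.lock:
                # Hypothetical API: only the append is serialized.
                self.target_dataframe = self.target_dataframe.append(new_col_name, new_col)

        def run(self):
            # One task per column in col_set; list() forces the lazy map to run.
            with ThreadPoolExecutor() as pool:
                list(pool.map(self.transform_dataframe, self.col_set))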
