
SNOW-1883796: Support appending of same size Column to Dataframe #2869

Open
rhaffar opened this issue Jan 15, 2025 · 2 comments
Labels: feature (New feature or request)

Comments


rhaffar commented Jan 15, 2025

What is the current behavior?

Apologies if I'm just missing something in the docs, but currently there seems to be no way to append a Column to a DataFrame.

What is the desired behavior?

To have some method of appending Column objects to DataFrames. The main motive for me is to get something approaching parallelism when generating multiple columns via with_column operations. If we can append columns to DataFrames, we can generate the new columns from a DataFrame separately, in parallel, leaving only the appending of the new columns to be done sequentially. This can be done with joins as well, but I imagine directly appending a column should be much less expensive. The only consideration is that row order would have to be maintained.
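For illustration, here is a hypothetical sketch of what such an API could look like (the append method is an invented name for this example; nothing like it exists in Snowpark today):

    # compute a new column independently of the target DataFrame
    halved = df.select(df["A"] / 2)
    # hypothetical, requested API: append the same-size column by position
    df = df.append("A_halved", halved)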

How would this improve snowflake-snowpark-python?

  • Enable general use cases for users who want to append columns to a DataFrame.
  • Provide a way to significantly improve the performance of generating multiple new columns from an existing DataFrame, without users having to implement any parallelization themselves.
@rhaffar rhaffar added the feature New feature or request label Jan 15, 2025
@github-actions github-actions bot changed the title Support appending of same size Column to Dataframe SNOW-1883796: Support appending of same size Column to Dataframe Jan 15, 2025
sfc-gh-aalam (Contributor) commented

@rhaffar would you mind sharing a code example of what you would like to see supported? I don't completely understand why with_column is not sufficient for your use case.

rhaffar (Author) commented Jan 15, 2025

@sfc-gh-aalam No problem!

Here is an arbitrary example of the solution I've currently implemented, given a target DataFrame and a set of columns that need to be transformed:

# Sequentially derive a transformed copy of each column in col_set.
for col in col_set:
    new_col_name = col + "_transformed"
    self.target_dataframe = transform_col(
        self.target_dataframe,
        col,
        new_col_name,
    )

def transform_col(target_dataframe, col, new_col_name):
    # Add the new column, derived from the source column, to the DataFrame.
    return target_dataframe.with_column(new_col_name, target_dataframe[col] / 2)

Given something like this, say I have 200 columns to transform: I cannot compute the new columns concurrently (even though all the source columns are already in the original target_dataframe); I must compute them entirely sequentially, since a naive multi-threaded version of this loop would introduce race conditions on self.target_dataframe. My idea was to instead do something like the following:

def transform_dataframe(self, col):
    new_col_name = col + "_transformed"
    # Compute the new column independently of the shared DataFrame.
    new_col = transform_col(
        self.target_dataframe,
        col,
        new_col_name,
    )
    # Only the append itself needs to be serialized.
    with self.lock:
        # Hypothetical API: append a same-size column to the DataFrame.
        self.target_dataframe = self.target_dataframe.append(new_col_name, new_col)

def transform_col(target_dataframe, col, new_col_name):
    return target_dataframe.select(target_dataframe[col] / 2)

Here the transform_dataframe method would be called with one thread per column in col_set, in some threadpool. This way, the calculation of the target columns could be done in parallel, and only the updating of target_dataframe via appending would be sequential. There may just be some misunderstanding on my end regarding how computations are performed in Snowpark, but my understanding is that, as of now, repeated with_column calls like in my original solution will not result in concurrent calculation of the target columns, so I'm requesting this as a more performant alternative. Alternatively, allowing with_column calls to be made lazily, with some way for Snowpark to evaluate them in parallel, would also be nice, but I'm assuming that is quite difficult.
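For completeness, here is a minimal sketch of the driver I have in mind, assuming the hypothetical append method above existed. ColumnTransformer and run are invented names for this example, and transform_col is the version from the snippet above:

    import threading
    from concurrent.futures import ThreadPoolExecutor

    class ColumnTransformer:
        def __init__(self, target_dataframe, col_set):
            self.target_dataframe = target_dataframe
            self.col_set = col_set
            self.lock = threading.Lock()

        def transform_dataframe(self, col):
            new_col_name = col + "_transformed"
            # The expensive part: computed concurrently across threads.
            new_col = transform_col(self.target_dataframe, col, new_col_name)
            with self.lock:
                # Hypothetical API: only the append is serialized.
                self.target_dataframe = self.target_dataframe.append(new_col_name, new_col)

        def run(self):
            # One task per column in col_set; list() forces the lazy map to run.
            with ThreadPoolExecutor() as pool:
                list(pool.map(self.transform_dataframe, self.col_set))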
