[SPARK-51316][PYTHON] Allow Arrow batches in bytes instead of number of rows #50080

Closed
HyukjinKwon wants to merge 3 commits into apache:master from HyukjinKwon:bytes-arrow

Conversation

HyukjinKwon (Member) commented Feb 25, 2025

What changes were proposed in this pull request?

This PR allows limiting the size of Arrow batches by bytes instead of by number of rows.

Why are the changes needed?

We enabled `spark.sql.execution.pythonUDF.arrow.enabled` by default, and we should make sure users won't hit out-of-memory (OOM) errors.

Does this PR introduce any user-facing change?

Yes. Arrow batches are now capped at 256MB by default, and users can configure this limit.
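
For reference, the limit is exposed as the SQL conf shown in the manual test further down; a minimal, hedged example of tuning it from a Scala session (the value here is arbitrary, not a recommendation):

```scala
// Illustrative only: lower the per-batch byte cap from the 256MB default.
// The conf key is the one exercised in the manual test below; "64m" is an arbitrary value.
spark.conf.set("spark.sql.execution.arrow.maxBytesPerBatch", "64m")
```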

How was this patch tested?

Tested by changing the default value to 1KB, and added a unit test. Also manually tested as below:

```python
from pyspark.sql.functions import pandas_udf
import pandas as pd

# spark.conf.set("spark.sql.execution.arrow.maxBytesPerBatch", "1K")
# spark.conf.set("spark.sql.execution.arrow.maxBytesPerBatch", "2K")
# spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "1")
# spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "10")


@pandas_udf("long")
def func(s: pd.Series) -> pd.Series:
    return s


a = spark.range(100000).select(func("id")).collect()
```

Was this patch authored or co-authored using generative AI tooling?

No.

HyukjinKwon marked this pull request as draft on February 25, 2025 12:39
HyukjinKwon changed the title from "[DO-NOT-MERGE] Allow Arrow batches in bytes instead of number of rows" to "[SPARK-51316][PYTHON] Allow Arrow batches in bytes instead of number of rows" on Feb 26, 2025
HyukjinKwon marked this pull request as ready for review on February 26, 2025 05:53
HyukjinKwon requested a review from ueshin on February 26, 2025 05:53
```diff
@@ -112,6 +112,16 @@ class ArrowWriter(val root: VectorSchemaRoot, fields: Array[ArrowFieldWriter]) {
     count += 1
   }

+  def sizeInBytes(): Int = {
+    var i = 0
+    var bytes = 0
```
Member commented:

This represents the size of a single row and should work for primitive types. But what if we have a string or binary type, which can vary in size?

HyukjinKwon (author) replied:

This actually represents the size of the whole Arrow batch. ArrowFieldWriter is in charge of writing a single column, and here we get the size of all columns. Since we're getting the size of the buffer used by each individual ArrowFieldWriter, it should work regardless of the specific types.
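
To make that concrete, here is a minimal sketch of what such a whole-batch size computation can look like inside ArrowWriter (whose constructor, per the diff header above, takes fields: Array[ArrowFieldWriter]). Only the first three lines appear in the visible diff; the summation loop is an assumption based on this explanation, not the literal patch:

```scala
// Sketch only: sum the buffer sizes currently used by each column's ArrowFieldWriter.
def sizeInBytes(): Int = {
  var i = 0
  var bytes = 0
  while (i < fields.length) {
    bytes += fields(i).getSizeInBytes()  // per-column buffer size (see getSizeInBytes below)
    i += 1
  }
  bytes
}
```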

dbtsai (Member) commented Feb 26, 2025:

cc @viirya

```diff
     // DO NOT use iter.grouped(). See BatchIterator.
-    val batchIter =
-      if (batchSize > 0) new BatchIterator(inputIter, batchSize) else Iterator(inputIter)
+    val batchIter = Iterator(inputIter)
```
Member commented:

Hmm, I don't see that the new iterator is "batched" by ArrowRRunner. So is it just removed?

Member commented:

ArrowPythonWithNamedArgumentRunner, for example, now batches rows internally with BatchedPythonArrowInput, but I don't see anything like that on ArrowRRunner either. Is that intentional?

HyukjinKwon (author) replied:

Yeah, I intentionally did not touch the R path because SparkR is deprecated. So I made this trait and used it only for the Scalar Python UDF cases.
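
For context, the batching described here amounts to: keep writing input rows into the ArrowWriter until either the record-count limit or the byte-size limit is reached, then flush that batch to the Python worker. Below is a simplified, hypothetical sketch of that loop; the method name writeBatch and its signature are illustrative rather than the actual BatchedPythonArrowInput code, and only the limit checks mirror the underBatchSizeLimit diff shown further down:

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.execution.arrow.ArrowWriter

// Hypothetical sketch of size-bounded batching for Scalar Python UDFs.
def writeBatch(
    inputIter: Iterator[InternalRow],
    arrowWriter: ArrowWriter,
    maxRecordsPerBatch: Int,
    maxBytesPerBatch: Long): Unit = {
  var numRowsInBatch = 0
  def underRecordLimit: Boolean =
    maxRecordsPerBatch <= 0 || numRowsInBatch < maxRecordsPerBatch
  def underBatchSizeLimit: Boolean =
    maxBytesPerBatch == Int.MaxValue || arrowWriter.sizeInBytes() < maxBytesPerBatch
  while (inputIter.hasNext && underRecordLimit && underBatchSizeLimit) {
    arrowWriter.write(inputIter.next())  // append one row across all field writers
    numRowsInBatch += 1
  }
  arrowWriter.finish()  // finalize value counts so the batch can be flushed to the worker
}
```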

Comment on lines +163 to +168
```scala
def getSizeInBytes(): Int = {
  valueVector.setValueCount(count)
  // Before calling getBufferSizeFor, we need to call
  // `setValueCount`, see https://github.com/apache/arrow/pull/9187#issuecomment-763362710
  valueVector.getBufferSizeFor(count)
}
```
viirya (Member) commented Feb 26, 2025:

Hmm, based on the API doc https://arrow.apache.org/docs/java/vector.html:

> After this step, the vector enters an immutable state. In other words, we should no longer mutate it. (Unless we reuse the vector by allocating it again. This will be discussed shortly.)

A Java Arrow field vector should not be modified after this method has been called, but I think this patch will call getSizeInBytes while values are still being inserted into the vector.

It might cause an unexpected error.

HyukjinKwon (author) replied:

Yeah, this has been discussed a lot, e.g., in apache/arrow#9187, but it has been in production for many years without an issue, so I assume this is fine.

```scala
var numRowsInBatch: Int = 0

def underBatchSizeLimit: Boolean =
  (maxBytesPerBatch == Int.MaxValue) || (arrowWriter.sizeInBytes() < maxBytesPerBatch)
```
viirya (Member) commented Feb 26, 2025.
HyukjinKwon (author) replied:

Sounds good. Let me follow up on this separately, if you don't mind.

Member replied:

Okay

```python
@classmethod
def setUpClass(cls):
    MapInArrowTests.setUpClass()
    # Set it to a small odd value to exercise batching logic for all test cases
    cls.spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "3")
```
Member commented:

I don't get this comment. Do you mean you set maxRecordsPerBatch to 3 so that it is hit earlier than maxBytesPerBatch?

HyukjinKwon (author) replied:

Ah, I meant for both. Actually, I think I should reduce the byte size some more so that both the number of records and the byte limit can be exercised.

HyukjinKwon (author) commented:

Merged to master and branch-4.0.

HyukjinKwon added a commit that referenced this pull request on Feb 27, 2025: [SPARK-51316][PYTHON] Allow Arrow batches in bytes instead of number of rows

Closes #50080 from HyukjinKwon/bytes-arrow.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit 53fc763)
Signed-off-by: Hyukjin Kwon <[email protected]>