[SPARK-51316][PYTHON] Allow Arrow batches in bytes instead of number of rows #50080
Conversation
@@ -112,6 +112,16 @@ class ArrowWriter(val root: VectorSchemaRoot, fields: Array[ArrowFieldWriter]) {
      count += 1
    }

    def sizeInBytes(): Int = {
      var i = 0
      var bytes = 0
This represents the size of a single row and should work for primitive types. But what if we have a string or binary type, which can vary in size?
This actually represents the size of the whole Arrow batch. ArrowFieldWriter is in charge of writing a single column, and here we get the size of all columns. Since we're getting the size of the buffers used in each individual ArrowFieldWriter, it should work regardless of the specific type.

cc @viirya
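For readers following along, here is a minimal, self-contained sketch of the accumulation being described, assuming each per-column writer can report the bytes its buffers currently use. The `FieldSizeReporter` trait and `batchSizeInBytes` name are illustrative stand-ins, not the actual Spark API:

```scala
// Hedged sketch: sum the per-column buffer sizes reported by each field
// writer to get the size of the whole in-progress Arrow batch.
trait FieldSizeReporter {
  def getSizeInBytes(): Int
}

def batchSizeInBytes(fields: Array[FieldSizeReporter]): Int = {
  var i = 0
  var bytes = 0
  while (i < fields.length) {
    bytes += fields(i).getSizeInBytes() // bytes used by this column so far
    i += 1
  }
  bytes
}
```

Because every column contributes its own buffer size, variable-width types such as string or binary are accounted for by whatever their buffers actually hold at that point.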
-    // DO NOT use iter.grouped(). See BatchIterator.
-    val batchIter =
-      if (batchSize > 0) new BatchIterator(inputIter, batchSize) else Iterator(inputIter)
+    val batchIter = Iterator(inputIter)
Hmm, I don't see the new iterator being "batched" by ArrowRRunner. So it is just removed?
Like ArrowPythonWithNamedArgumentRunner, which now batches rows internally with BatchedPythonArrowInput, but I don't see the same thing in ArrowRRunner. Is that intentional?
Yeah, I intentionally did not touch the R path because SparkR is deprecated. So I made this a trait and used it only for the Scalar Python UDF cases.
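As an aside on the "DO NOT use iter.grouped()" comment in the removed lines: `Iterator.grouped()` materialises each group into a `Seq` before handing it over, whereas a BatchIterator-style wrapper can expose each batch as a lazy sub-iterator over the same underlying input, so rows are only pulled as they are written out. A minimal stand-in (not Spark's actual BatchIterator) might look like this, with the usual caveat that each sub-iterator must be exhausted before asking for the next one:

```scala
// Hedged sketch: yield the input as lazy sub-iterators of at most batchSize
// elements, instead of materialising each group like iter.grouped() does.
class LazyBatchIterator[T](input: Iterator[T], batchSize: Int)
  extends Iterator[Iterator[T]] {
  require(batchSize > 0, "batchSize must be positive")

  override def hasNext: Boolean = input.hasNext

  override def next(): Iterator[T] = new Iterator[T] {
    private var consumed = 0
    override def hasNext: Boolean = consumed < batchSize && input.hasNext
    override def next(): T = {
      consumed += 1
      input.next() // pull rows lazily from the shared underlying iterator
    }
  }
}
```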
    def getSizeInBytes(): Int = {
      valueVector.setValueCount(count)
      // Before calling getBufferSizeFor, we need to call
      // `setValueCount`, see https://github.com/apache/arrow/pull/9187#issuecomment-763362710
      valueVector.getBufferSizeFor(count)
    }
Hmm, based on the API doc https://arrow.apache.org/docs/java/vector.html:

> After this step, the vector enters an immutable state. In other words, we should no longer mutate it. (Unless we reuse the vector by allocating it again. This will be discussed shortly.)

A Java Arrow field vector should not be modified after this method has been called, but I think this patch will call getSizeInBytes while inserting values into a vector. It might cause an unexpected error.
Yeah, this has been discussed a lot, e.g., apache/arrow#9187, but it has been in production for many years without an issue, so I assume this is fine.
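To make the concern concrete, here is a small, hedged illustration of the pattern under discussion, written against the Arrow Java API directly rather than the Spark code (REPL-style Scala, assuming arrow-vector and arrow-memory are on the classpath):

```scala
import org.apache.arrow.memory.RootAllocator
import org.apache.arrow.vector.IntVector

val allocator = new RootAllocator(Long.MaxValue)
val vector = new IntVector("v", allocator)
vector.allocateNew()

// Write a few values.
(0 until 10).foreach(i => vector.setSafe(i, i))

// setValueCount must be called before getBufferSizeFor returns a meaningful
// size; this is the step the Arrow docs describe as making the vector immutable.
vector.setValueCount(10)
println(s"bytes after 10 values: ${vector.getBufferSizeFor(10)}")

// The pattern in this patch keeps writing after the size check, which the
// docs discourage but which, as noted above, has worked in practice for years.
(10 until 20).foreach(i => vector.setSafe(i, i))
vector.setValueCount(20)
println(s"bytes after 20 values: ${vector.getBufferSizeFor(20)}")

vector.close()
allocator.close()
```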
    var numRowsInBatch: Int = 0

    def underBatchSizeLimit: Boolean =
      (maxBytesPerBatch == Int.MaxValue) || (arrowWriter.sizeInBytes() < maxBytesPerBatch)
Due to the API issue https://github.com/apache/spark/pull/50080/files#r1971063958, maybe we can call ArrowWriter.bytesWritten (https://arrow.apache.org/docs/dev/java/reference/org/apache/arrow/vector/ipc/ArrowWriter.html#bytesWritten())?
Sounds good. Let me follow up on this separately if you don't mind.
Okay
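Putting the pieces together, here is a hedged sketch of the batching policy this PR introduces: rows keep being appended while both the record limit and the byte limit hold, and a batch is cut as soon as either is exceeded. The names `writeRow`, `cutBatch`, and `sizeInBytes` below are illustrative stand-ins, not the actual Spark internals:

```scala
// Hedged sketch (not the actual Spark code): batch rows into Arrow record
// batches bounded by both a row count and a byte size.
def writeBatches(
    rows: Iterator[Array[Any]],
    maxRecordsPerBatch: Int,
    maxBytesPerBatch: Int,
    writeRow: Array[Any] => Unit,   // appends one row to the current batch
    sizeInBytes: () => Int,         // size of the in-progress batch in bytes
    cutBatch: () => Unit): Unit = { // flushes the current batch downstream
  var numRowsInBatch = 0

  def underRecordLimit: Boolean =
    maxRecordsPerBatch <= 0 || numRowsInBatch < maxRecordsPerBatch

  // Mirrors underBatchSizeLimit in the diff above: the byte check is skipped
  // entirely when the limit is left at Int.MaxValue (effectively unlimited).
  def underByteLimit: Boolean =
    maxBytesPerBatch == Int.MaxValue || sizeInBytes() < maxBytesPerBatch

  while (rows.hasNext) {
    writeRow(rows.next())
    numRowsInBatch += 1
    if (!underRecordLimit || !underByteLimit) {
      cutBatch()
      numRowsInBatch = 0
    }
  }
  if (numRowsInBatch > 0) cutBatch() // flush the trailing partial batch
}
```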
    @classmethod
    def setUpClass(cls):
        MapInArrowTests.setUpClass()
        # Set it to a small odd value to exercise batching logic for all test cases
I don't get this comment. So you mean you set maxRecordsPerBatch to 3 so that it is hit earlier than maxBytesPerBatch?
Ah, I meant both. Actually, I think I should reduce the byte size some more so that both the record limit and the byte limit get exercised.
Merged to master and branch-4.0.
### What changes were proposed in this pull request?
This PR allows Arrow batches to be limited by size in bytes instead of by number of rows.

### Why are the changes needed?
We enabled `spark.sql.execution.pythonUDF.arrow.enabled` by default, and we should make sure users won't hit OOM.

### Does this PR introduce _any_ user-facing change?
Yes. The Arrow batches are now capped at 256MB in bytes by default, and users can configure this.

### How was this patch tested?
Tested by changing the default value to 1KB, and added a unit test. Also manually tested as below:

```python
from pyspark.sql.functions import pandas_udf
import pandas as pd

# spark.conf.set("spark.sql.execution.arrow.maxBytesPerBatch", "1K")
# spark.conf.set("spark.sql.execution.arrow.maxBytesPerBatch", "2K")
# spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "1")
# spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "10")

@pandas_udf("long")
def func(s: pd.Series) -> pd.Series:
    return s

a = spark.range(100000).select(func("id")).collect()
```

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #50080 from HyukjinKwon/bytes-arrow.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit 53fc763)