
[SPARK-50994][CORE] Perform RDD conversion under tracked execution #49678

Closed
wants to merge 4 commits

Conversation

Contributor

@BOOTMGR BOOTMGR commented Jan 26, 2025

What changes were proposed in this pull request?

  • A new lazy variable materializedRdd is introduced which actually holds the RDD after it is created (by executing the plan).
  • Dataset#rdd is wrapped within withNewRDDExecutionId, which takes care of important setup tasks, such as updating Spark properties in SparkContext's thread-locals, before executing the SparkPlan to fetch data.
  • Dataset#rdd now acts like other RDD operations such as reduce or foreachPartition: it operates on materializedRdd under a new execution id (initialising it if not done yet). A rough sketch of this shape follows this list.
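
As a rough, hedged sketch of the shape described above (hypothetical names: TrackedRddHolder, buildPlanRdd, and withTrackedExecution stand in for Dataset, the plan execution, and withNewRDDExecutionId; this is not the actual Spark source), the RDD produced by executing the plan sits behind a lazy val, and every RDD entry point goes through a wrapper that copies the relevant properties into the SparkContext thread-locals before touching it:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical stand-in for Dataset: holds the plan-produced RDD lazily and
// exposes it only through a tracked execution wrapper.
class TrackedRddHolder[T](sc: SparkContext, buildPlanRdd: () => RDD[T]) {

  // Built at most once, the first time an RDD-based operation is invoked.
  private lazy val materializedRdd: RDD[T] = buildPlanRdd()

  // Stand-in for withNewRDDExecutionId: push properties into the SparkContext
  // thread-locals so they reach the TaskContext of the job that materializes
  // the RDD, run the body, then clean up.
  private def withTrackedExecution[U](localProps: Map[String, String])(body: => U): U = {
    localProps.foreach { case (k, v) => sc.setLocalProperty(k, v) }
    try body finally localProps.keys.foreach(k => sc.setLocalProperty(k, null))
  }

  // .rdd becomes just another consumer of materializedRdd, like reduce or
  // foreachPartition.
  def rdd(localProps: Map[String, String]): RDD[T] =
    withTrackedExecution(localProps) { materializedRdd }
}
```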

Why are the changes needed?

When a Dataset is converted into an RDD, it executes the SparkPlan without any execution context. This leads to:

  1. No tracking is available on the Spark UI for the stages needed to build the RDD.
  2. Spark properties that are local to the thread may not be set in the RDD execution context, so they are not sent with the TaskContext, even though some operations, such as reading Parquet files, depend on them (e.g. case sensitivity). A small illustration of this plumbing follows below.
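
As a small, self-contained illustration of the property plumbing involved (plain Spark APIs and a made-up property key, not code from this PR): a value set via setLocalProperty on the thread that submits a job is visible through TaskContext in that job's tasks, which is exactly the propagation that is missing when the plan is executed outside a tracked execution.

```scala
import org.apache.spark.TaskContext
import org.apache.spark.sql.SparkSession

object LocalPropertyDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("local-props").getOrCreate()
    val sc = spark.sparkContext

    // Thread-local property set on the submitting (driver) thread...
    sc.setLocalProperty("demo.case.sensitive", "true")

    // ...is shipped with the job and readable from each task's TaskContext.
    val seen = sc.parallelize(1 to 4, 2)
      .map(_ => TaskContext.get().getLocalProperty("demo.case.sensitive"))
      .distinct()
      .collect()

    println(seen.mkString(", ")) // prints: true
    spark.stop()
  }
}
```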

Test scenario:

test("SPARK-50994: RDD conversion is performed with execution context") {
    withSQLConf(SQLConf.CASE_SENSITIVE.key -> "true") {
      withSQLConf(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key -> "false") {
        withTempDir(dir => {
          val dummyDF = Seq((1, 1.0), (2, 2.0), (3, 3.0), (1, 1.0)).toDF("a", "A")
          dummyDF.write.format("parquet").mode("overwrite").save(dir.getCanonicalPath)

          val df = spark.read.parquet(dir.getCanonicalPath)
          val encoder = ExpressionEncoder(df.schema)
          val deduplicated = df.dropDuplicates(Array("a"))
          val df2 = deduplicated.flatMap(row => Seq(row))(encoder).rdd

          val output = spark.createDataFrame(df2, df.schema)
          checkAnswer(output, Seq(Row(1, 1.0), Row(2, 2.0), Row(3, 3.0)))
        })
      }
    }
  }

In the above scenario,

  • The call to .rdd triggers execution, which performs a shuffle after reading the Parquet files
  • However, while reading the Parquet files, spark.sql.caseSensitive is not set (even though it is passed during session creation); the parquet-mr reader reads this flag from SQLConf
  • This leads to an unexpected, wrong result from dropDuplicates, as it may drop duplicates by either a or A; the expectation is to drop duplicates by column a only
  • This behaviour does not apply to the vectorized Parquet reader, because it reads the case-sensitivity flag from the Hadoop configuration; hence it is disabled in the test

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing test cases, plus a new test case added for this specific scenario

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the SQL label Jan 26, 2025
@BOOTMGR BOOTMGR changed the title Perform RDD conversion under tracked execution SPARK-50994: Perform RDD conversion under tracked execution Jan 26, 2025
@BOOTMGR BOOTMGR changed the title SPARK-50994: Perform RDD conversion under tracked execution [SPARK-50994][SQL] Perform RDD conversion under tracked execution Jan 26, 2025
Correct, because `checkAnswer` in the test case calls `rdd.count()`, which is now a tracked operation, so the Spark event listener is invoked for it
@BOOTMGR BOOTMGR changed the title [SPARK-50994][SQL] Perform RDD conversion under tracked execution [SPARK-50994][SQL][WIP] Perform RDD conversion under tracked execution Jan 26, 2025
Contributor Author

BOOTMGR commented Jan 26, 2025

Marking WIP, this would require some more work around event listeners and observable due to exposure of RDD stages.

`materializedRdd` is the actual holder, initialized on demand by operations like `.rdd`, `foreachPartition`, etc.
@BOOTMGR BOOTMGR changed the title [SPARK-50994][SQL][WIP] Perform RDD conversion under tracked execution [SPARK-50994][SQL] Perform RDD conversion under tracked execution Jan 27, 2025
Contributor Author

BOOTMGR commented Jan 27, 2025

Ready for review

@BOOTMGR BOOTMGR changed the title [SPARK-50994][SQL] Perform RDD conversion under tracked execution [SPARK-50994][CORE] Perform RDD conversion under tracked execution Feb 5, 2025
Contributor Author

BOOTMGR commented Feb 5, 2025

@dongjoon-hyun / @HyukjinKwon seeking your attention.

@@ -2721,6 +2721,25 @@ class DataFrameSuite extends QueryTest
parameters = Map("name" -> ".whatever")
)
}

test("SPARK-50994: RDD conversion is performed with execution context") {
withSQLConf(SQLConf.CASE_SENSITIVE.key -> "true") {
Contributor

This is an internal conf and ideally shouldn't be set by users. Do we have other examples like JSON/CSV scan with some configs?

Contributor Author

I haven't seen any additional failures linked to this behaviour. The only instance we encountered was in production, where legacy requirements forced us to work with a case-sensitive schema.

Contributor

This fix is also included in #48325, shall we take some tests from it?

Contributor Author

Sure thing. The test cases in #48325 are more meaningful.

@bersprockets do you mind if I paste your test case here?
@cloud-fan I will still keep the test case added here. Let me know if you think otherwise.

Contributor

@BOOTMGR paste away!

Contributor Author

@cloud-fan I took a close look at #48325 and I see that it takes a stab at a bigger problem: SQLConf is not propagated when the actual execution of the RDD happens (when the iterator is consumed), because that is triggered on demand by the user. This PR only ensures that the RDD gets the correct SQLConf when it is computed, not during iterator traversal.

I followed the conversation there, and I agree with you that all SQLConf accesses should happen during RDD computation (by storing the configs locally) rather than when the iterator is consumed; a minimal sketch of that pattern is below. I also agree with @bersprockets's view that fixing it everywhere would be troublesome and there is no guarantee for future additions. I believe that change needs some bigger considerations, like how we see interoperability between Dataset and RDD. I am ready to volunteer there.
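
A minimal sketch of that pattern, under the assumption that the method runs on the driver under a tracked execution (hypothetical helper, not a proposal for the actual operators): the thread-local SQLConf is read eagerly while the RDD is built, and the captured value travels with the closure, so nothing depends on the thread-local being set when the iterator is finally consumed on executors.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.internal.SQLConf

object ConfCapture {
  // Assumed to be called on the driver, where the thread-local SQLConf is set.
  def buildLoweredRdd(input: RDD[String]): RDD[String] = {
    // Read the conf eagerly, at RDD-construction time.
    val caseSensitive = SQLConf.get.caseSensitiveAnalysis

    input.mapPartitions { iter =>
      // The captured Boolean is serialized with the closure; nothing here
      // touches SQLConf.get when the iterator runs on an executor.
      iter.map(s => if (caseSensitive) s else s.toLowerCase)
    }
  }
}
```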

However, I feel this change should ship independently because:

  1. We need the correct configs set when the RDD computation happens. This is needed regardless of [SPARK-47193][SQL] Ensure SQL conf is propagated to executors when actions are called on RDD returned by Dataset#rdd #48325, which can come later.
  2. We need tracking on the Spark UI for stages submitted during RDD computation. For example, Snowflake's official Spark connector internally converts a DataFrame to an RDD to serialise it into CSV format; because of this issue, none of the dependent stages are shown on the Spark UI.

Let me know what you think.

Contributor

I'm +1 to ship this fix first as it's straightforward. I was only asking to take some tests from #48325 so that we don't need to set an internal non-user-facing conf to reproduce the bug.

Contributor Author

Understood. I could reproduce the same issue with spark.sql.legacy.timeParserPolicy, but that is also an internal conf. The other scenarios mentioned there are caused by the iterator issue discussed above.
Please let me know if I can look into anything in particular.

Contributor

I see, let's merge it as it is.

cloud-fan
cloud-fan approved these changes Feb 18, 2025
@cloud-fan
Contributor

thanks, merging to master/4.0!

@cloud-fan cloud-fan closed this in 07e6a06 Feb 27, 2025
cloud-fan pushed a commit that referenced this pull request Feb 27, 2025
Closes #49678 from BOOTMGR/SPARK-50994.

Authored-by: BOOTMGR <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 07e6a06)
Signed-off-by: Wenchen Fan <[email protected]>
Pajaraja pushed a commit to Pajaraja/spark that referenced this pull request Mar 6, 2025
@@ -2704,7 +2704,7 @@ class SQLQuerySuite extends SQLQuerySuiteBase with DisableAdaptiveExecutionSuite
checkAnswer(sql(s"SELECT id FROM $targetTable"),
Row(1) :: Row(2) :: Row(3) :: Nil)
spark.sparkContext.listenerBus.waitUntilEmpty()
assert(commands.size == 3)
assert(commands.size == 4)
Contributor

@LuciferYang LuciferYang Mar 17, 2025


@BOOTMGR After this change, this test has shown a tendency to become flaky. I noticed its failure in the Maven daily test, but it seemed stable before (or maybe I just didn't encounter the issue before). Could you investigate this problem?

also cc @cloud-fan

- SPARK-25271: Hive ctas commands should use data source if it is convertible *** FAILED ***
  List(org.apache.spark.sql.execution.SparkPlanInfo@cab1821f, org.apache.spark.sql.execution.SparkPlanInfo@4cf80e6, org.apache.spark.sql.execution.SparkPlanInfo@39acc973, org.apache.spark.sql.execution.SparkPlanInfo@fcface5, org.apache.spark.sql.execution.SparkPlanInfo@8316aebc) had size 5 instead of expected size 4 (SQLQuerySuite.scala:2707)


Contributor

were you able to reproduce it locally?

Contributor Author

I ran this test multiple times locally but it never failed. I also triggered the test case execution with some debug logs on CI twice, but it did not fail there either.

This change adds one extra execution stage (which was not tracked earlier) due to the RDD mapping needed by the ColumnarToRow transition. I will check if that code path has any dynamic behaviour, but most likely it should not, since all the parameters and data are always the same.

It could also be some other change impacting execution, so I'll do some more runs today to find which extra node is getting added.

Contributor

@LuciferYang LuciferYang Mar 18, 2025


I haven't found a way to reproduce it locally yet. If it's difficult to reproduce, we can set it aside for now and investigate it later when there's an easier way to reproduce it.

kazemaksOG pushed a commit to kazemaksOG/spark-custom-scheduler that referenced this pull request Mar 27, 2025