[BUG] spark 3.5.0 shim spark-shell is broken in spark-rapids 23.10 and 23.12 #9498

Closed
abellina opened this issue Oct 20, 2023 · 8 comments · Fixed by #9500

abellina commented Oct 20, 2023

A user reported an issue trying to launch spark-shell with Spark 3.5.0 and 23.12. I have reproed the issue and confirmed I also see it with 23.10. User report: NVIDIA/spark-rapids-ml#453 (comment)

Here are the tests I ran:

  • pyspark shell: works
  • spark-submit with java wordcount example: works
  • spark-shell: doesn't work

Here is my local repro. I used JDK 17 like the user, but running with JDK 8 also reproes it:

23/10/20 14:07:20 INFO ShimLoader: Loading shim for Spark version: 3.5.0
23/10/20 14:07:20 INFO ShimLoader: Complete Spark build info: 3.5.0, https://github.com/apache/spark, HEAD, ce5ddad990373636e94071e7cef2f31021add07b, 2023-09-09T01:53:20Z
23/10/20 14:07:20 INFO ShimLoader: findURLClassLoader hit the Boostrap classloader org.apache.spark.executor.ExecutorClassLoader@7ffce33c, failed to find a mutable classloader!
23/10/20 14:07:20 WARN ShimLoader: Found an unexpected context classloader org.apache.spark.executor.ExecutorClassLoader@7ffce33c. We will try to recover from this, but it may cause class loading problems.
23/10/20 14:07:25 INFO RapidsPluginUtils: RAPIDS Accelerator build: {version=23.10.0-SNAPSHOT, user=abellina, url=https://github.com/NVIDIA/spark-rapids.git, date=2023-10-20T13:58:56Z, revision=1baa350ec2289eac84287676f2b76b7e4e82013d, cudf_version=23.10.0, branch=branch-23.10}
23/10/20 14:07:25 INFO RapidsPluginUtils: RAPIDS Accelerator JNI build: {version=23.10.0, user=, url=https://github.com/NVIDIA/spark-rapids-jni.git, date=2023-10-12T02:48:23Z, revision=e5fb14eb4bd4087be9b5a7e960edb27fc76ffc2d, branch=HEAD}
23/10/20 14:07:25 INFO RapidsPluginUtils: cudf build: {version=23.10.0, user=, url=https://github.com/rapidsai/cudf.git, date=2023-10-12T02:48:23Z, revision=9f0c2f452f1cf318c3f7fe2c6f7e07fc513fc335, branch=HEAD}
23/10/20 14:07:25 WARN RapidsPluginUtils: RAPIDS Accelerator 23.10.0-SNAPSHOT using cudf 23.10.0.
23/10/20 14:07:25 ERROR RapidsExecutorPlugin: Exception in the executor plugin, shutting down!
java.lang.ExceptionInInitializerError
        at com.nvidia.spark.rapids.RapidsExecutorPlugin.init(Plugin.scala:363)
        at org.apache.spark.internal.plugin.ExecutorPluginContainer.$anonfun$executorPlugins$1(PluginContainer.scala:125)
        at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:293)
        at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
        at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
        at scala.collection.TraversableLike.flatMap(TraversableLike.scala:293)
        at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:290)
        at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108)
        at org.apache.spark.internal.plugin.ExecutorPluginContainer.<init>(PluginContainer.scala:113)
        at org.apache.spark.internal.plugin.PluginContainer$.apply(PluginContainer.scala:211)
        at org.apache.spark.internal.plugin.PluginContainer$.apply(PluginContainer.scala:199)
        at org.apache.spark.executor.Executor.$anonfun$plugins$1(Executor.scala:337)
        at org.apache.spark.util.Utils$.withContextClassLoader(Utils.scala:178)
        at org.apache.spark.executor.Executor.<init>(Executor.scala:337)
        at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:174)
        at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
        at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
        at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
        at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
        at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.ClassNotFoundException: com.nvidia.spark.rapids.OptimizerPlugin
        at org.apache.spark.executor.ExecutorClassLoader.findClass(ExecutorClassLoader.scala:124)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:348)
        at org.apache.spark.util.SparkClassUtils.classForName(SparkClassUtils.scala:41)
        at org.apache.spark.util.SparkClassUtils.classForName$(SparkClassUtils.scala:36)
abellina added the bug and ? - Needs Triage labels on Oct 20, 2023
abellina commented:

This issue seems to be isolated to spark-shell, so if we had something like #9497 in place we could have caught this before the user did.

abellina commented:

There is a workaround for this (disabling parallel worlds), but I see log messages in the executor log that would concern me as a user, so I filed #9499.
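
(For reference, and as an assumption on my part about which knob "disabling parallel worlds" maps to: I believe this is the internal spark.rapids.force.caller.classloader config, i.e. adding something like

  --conf spark.rapids.force.caller.classloader=false

to the spark-shell command. Treat that as a sketch, not a recommendation, given #9499.)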

gerashegalov commented:

Can you add the exact command/config for the repro, @abellina?

The following works for me:

JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64 \
  ~/dist/spark-3.5.0-bin-hadoop3/bin/spark-shell \
  --jars rapids-4-spark_2.12-23.10.0-cuda11.jar \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.rapids.sql.explain=ALL

abellina commented:

Sure, sorry, I should have mentioned this was for standalone mode. Your repro is local mode, which I assume has its own host of issues/differences:

export RAPIDS_PLUGIN_JAR="$HOME/rapids-4-spark_2.12-23.12.0-SNAPSHOT-cuda11.jar"

export SPARK_HOST=127.0.0.1
export SPARK_PORT=7077
export SPARK_MASTER=spark://$SPARK_HOST:$SPARK_PORT
export SPARK_CONF_DIR=${SPARK_HOME}/conf

$SPARK_HOME/bin/spark-shell \
--master $SPARK_MASTER \
--conf spark.plugins=com.nvidia.spark.SQLPlugin \
--conf spark.rapids.sql.explain=ALL \
--conf spark.cores.max=12 \
--conf spark.executor.cores=12 \
--jars $RAPIDS_PLUGIN_JAR


tgravescs commented Oct 20, 2023

This was caused by apache/spark@1486835 in 3.5.0. It changed the package of ExecutorClassLoader, and our ShimLoader is explicitly looking for the old package name:

  case replCl if replCl.getClass.getName == "org.apache.spark.repl.ExecutorClassLoader" =>

We also need to check for the new class name, org.apache.spark.executor.ExecutorClassLoader.
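
A rough sketch of what I mean (not the exact patch; getParentLoader below is a hypothetical stand-in for the reflective parentLoader lookup ShimLoader already does):

import java.net.URLClassLoader

// Both the pre-3.5 and 3.5+ names for the REPL executor classloader.
val executorClassLoaderNames = Set(
  "org.apache.spark.repl.ExecutorClassLoader",     // Spark <= 3.4.x
  "org.apache.spark.executor.ExecutorClassLoader") // Spark 3.5.0+, after apache/spark@1486835

// Hypothetical stand-in for ShimLoader's reflective access to the
// ExecutorClassLoader's parentLoader val.
def getParentLoader(cl: ClassLoader): ClassLoader =
  cl.getClass.getMethod("parentLoader").invoke(cl).asInstanceOf[ClassLoader]

// Walk up the classloader chain until we reach a URLClassLoader we can mutate.
@scala.annotation.tailrec
def findURLClassLoader(cl: ClassLoader): Option[ClassLoader] = cl match {
  case null => None
  case urlCl: URLClassLoader => Some(urlCl) // found a mutable classloader
  case replCl if executorClassLoaderNames(replCl.getClass.getName) =>
    // ExecutorClassLoader is not a URLClassLoader; keep walking via its parent.
    findURLClassLoader(getParentLoader(replCl))
  case other => findURLClassLoader(other.getParent)
}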


tgravescs commented Oct 20, 2023

After making the change described above, it works and finds the mutable classloader from the ExecutorClassLoader:


23/10/20 11:27:34 INFO ShimLoader: findURLClassLoader found org.apache.spark.executor.ExecutorClassLoader@346a6b16, trying parentLoader=org.apache.spark.util.ParentClassLoader@545f384f
23/10/20 11:27:34 INFO ShimLoader: findURLClassLoader found an immutable org.apache.spark.util.ParentClassLoader@545f384f, trying parent=org.apache.spark.util.MutableURLClassLoader@4fa34591
23/10/20 11:27:34 INFO ShimLoader: findURLClassLoader found a URLClassLoader org.apache.spark.util.MutableURLClassLoader@4fa34591

tgravescs removed the ? - Needs Triage label on Oct 20, 2023
gerashegalov commented:

The minimum repro that can be used for the test is:

~/dist/spark-3.5.0-bin-hadoop3/bin/spark-shell \
  --master local-cluster[1,1,1024] \
  --jars rapids-4-spark_2.12-23.10.0-cuda11.jar \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin --conf spark.rapids.sql.explain=ALL
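
Once the shell comes up, any trivial action that actually runs on the executor is enough to tell the two states apart, since the failure above happens at executor plugin init. For example, at the scala> prompt:

// Runs a job on the local-cluster executor; with the bug, the executor
// dies at plugin init and this never completes.
spark.range(100).selectExpr("sum(id)").show()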

gerashegalov commented:

Fixed by #9500.
