Commit 52dfa3d

wengh authored and allisonwang-db committed
[SPARK-51566][PYTHON] Python UDF traceback improvement
### Motivation

Currently, when a Python UDF raises an error, the traceback only includes the file name and the line number, but not the content of that specific line. This differs from local code tracebacks, which do show the line content. See the following examples.

Error inside a UDF:

![image](https://github.com/user-attachments/assets/03d6dc6a-c821-4601-9074-95377d03f80c)

Local error. Notice that IPython additionally includes more lines around the line where the error happens, with links to the notebook cell, making it even easier to understand.

![image](https://github.com/user-attachments/assets/54886061-b90d-4449-a500-a81ee30c19db)

### What changes were proposed in this pull request?

This PR changes `convert_exception` to detect Python tracebacks in the JVM error message, parse them back into a traceback object, and include them in the converted exception. This way, these frames are included in the exception traceback as if they were part of the call stack. If we fail to parse the traceback, we silently ignore the failure, keeping the original behavior. In either case, the original error message, with the traceback in string form, is always included in the exception so that no information is lost.

This PR also introduces [`tblib`](https://github.com/ionelmc/python-tblib), a lightweight library that can parse a traceback from a string. The library is included as a source file, with modifications so that it preserves the original line content when the file is no longer available.

Example of the improved traceback as displayed in IPython. Notice that the Python worker frames are appended to the exception so that IPython's ultratb shows the code around the error line.

```py
---------------------------------------------------------------------------
PythonException                           Traceback (most recent call last)
Cell In[3], line 8
      4 udf(returnType=StringType())
      5 def foo(value):
      6     1 / 0
----> 8 spark.range(1).toDF("value").select(foo("value")).show()

File ~/personal/test-spark/.venv/lib/python3.9/site-packages/pyspark/sql/classic/dataframe.py:285, in DataFrame.show(self, n, truncate, vertical)
    284 def show(self, n: int = 20, truncate: Union[bool, int] = True, vertical: bool = False) -> None:
--> 285     print(self._show_string(n, truncate, vertical))

...

File ~/personal/test-spark/.venv/lib/python3.9/site-packages/pyspark/errors/exceptions/captured.py:271, in capture_sql_exception.<locals>.deco(*a, **kw)
    267 converted = convert_exception(e.java_exception)
    268 if not isinstance(converted, UnknownException):
    269     # Hide where the exception came from that shows a non-Pythonic
    270     # JVM exception message.
--> 271     raise converted from None
    272 else:
    273     raise

Cell In[3], line 6, in foo()
      4 udf(returnType=StringType())
      5 def foo(value):
----> 6     1 / 0

PythonException: An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
  File "/var/folders/jm/2m7c_fjj75v7gqlr5px0j8780000gp/T/ipykernel_6171/1200677507.py", line 6, in foo
ZeroDivisionError: division by zero
```

### Why are the changes needed?

To improve debuggability of Python UDFs (and UDTFs, Data Sources, etc.) by recovering the traceback when calling the UDF using PySpark.

### Does this PR introduce _any_ user-facing change?

Yes. Exceptions converted from the JVM will include additional traceback frames if the JVM error message includes a Python traceback.

### How was this patch tested?

Unit tests and end-to-end tests in `python/pyspark/errors/tests/test_traceback.py`.

### Was this patch authored or co-authored using generative AI tooling?

No.

### What are the risks?

1. This change will break anything that assumes the frames of a converted exception don't include UDF frames. This seems unlikely, though.
2. If a Python source file is present on both the worker and the client, but with different content, the traceback may show the wrong line content for frames in that file.
3. If the traceback string format changes in a future Python version, the parsing will break. A recent example is [PEP 657 in Python 3.11](https://docs.python.org/3/whatsnew/3.11.html#pep-657-fine-grained-error-locations-in-tracebacks), which introduced fine-grained error locations in tracebacks. If similar changes happen in the future, the unit tests should catch them. In the worst case, parsing will fail and we will fall back to the original traceback string.
4. If a Python source file name contains a line break, parsing will return an incorrect traceback.

### Why do the worker tracebacks not include the line content?

When Python turns a traceback into a string, it looks up the line content by file name and line number using the `linecache` module. This module keeps an in-memory cache from file name to lines, and on a cache miss it reads the file from module globals or from the file system. IPython adds the cell content to the cache when the cell is executed.

When we run a UDF, the function is pickled and sent to the driver JVM. The pickle includes neither the source code nor the line cache, so `linecache` generally cannot find the source code when the traceback is generated on the Python worker, unless the client code lives in a `.py` file on the same host as the worker.
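As an aside (not part of the PR), here is a minimal, self-contained sketch of the `linecache` behavior described above. The file name `<hypothetical-cell>` and the function `foo` are made up for illustration; populating `linecache.cache` plays the role that IPython plays for notebook cells, and that the vendored `tblib` modification plays for recovered frames.

```py
import linecache
import traceback

# Code compiled from a string under a file name that does not exist on disk,
# similar to a UDF defined in a notebook cell and unpickled on a worker.
fake_filename = "<hypothetical-cell>"  # made-up name for illustration
source = "def foo(value):\n    return 1 / value\n"
namespace = {}
exec(compile(source, fake_filename, "exec"), namespace)

try:
    namespace["foo"](0)
except ZeroDivisionError:
    # Without a linecache entry, the formatted traceback has the file name and
    # line number for foo, but no source line -- just like on the Python worker.
    print(traceback.format_exc())

    # Pre-populating linecache restores the line content for the same traceback.
    lines = source.splitlines(keepends=True)
    linecache.cache[fake_filename] = (len(source), None, lines, fake_filename)
    print(traceback.format_exc())
```

In this PR, calling `populate_linecache()` on the parsed traceback serves the same purpose on the client side.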
### What alternative solutions did we consider?

This solution may be brittle since it depends on the specific traceback string format, and it introduces a new third-party dependency. Here are the alternatives and why we didn't choose them.

#### Pass the linecache to the Python worker

If we could pass the linecache content to the worker, the traceback generated on the worker would include the line content. To do this, we would need to find which files are part of the pickle and pass their content to the worker. This is not feasible because it could add significant overhead when calling UDFs, and the user would still not benefit from rich tracebacks like the one provided by IPython.

#### Send a serialized traceback from the Python worker to the JVM

To avoid the brittle string parsing, we could send the traceback object from the worker to the JVM and then deserialize it in the Python client. Unfortunately, Python tracebacks are not pickleable, so we would still need a third-party library (`tblib` again) to serialize the traceback. This would also require many more changes across the codebase.

Closes #50313 from wengh/python-udf-traceback.

Authored-by: Haoyu Weng <[email protected]>
Signed-off-by: Allison Wang <[email protected]>
1 parent c8f5020 commit 52dfa3d

File tree

11 files changed: +731 −1


LICENSE-binary (+2)

@@ -445,6 +445,8 @@ jline:jline
 org.jodd:jodd-core
 pl.edu.icm:JLargeArrays

+python/pyspark/errors/exceptions/tblib.py
+

 BSD 3-Clause
 ------------

dev/.rat-excludes (+1)

@@ -48,6 +48,7 @@ jquery.mustache.js
 pyspark-coverage-site/*
 cloudpickle/*
 join.py
+tblib.py
 SparkILoop.scala
 sbt
 sbt-launch-lib.bash

dev/sparktestsupport/modules.py (+2)

@@ -1466,6 +1466,8 @@ def __hash__(self):
     python_test_goals=[
         # unittests
         "pyspark.errors.tests.test_errors",
+        "pyspark.errors.tests.test_traceback",
+        "pyspark.errors.tests.connect.test_parity_traceback",
     ],
 )

python/packaging/client/setup.py (+1)

@@ -68,6 +68,7 @@
 test_packages = []
 if "SPARK_TESTING" in os.environ:
     test_packages = [
+        "pyspark.errors.tests.connect",
         "pyspark.tests",  # for Memory profiler parity tests
         "pyspark.resource.tests",
         "pyspark.sql.tests",

python/pyspark/errors/exceptions/base.py (+32 −1)

@@ -17,8 +17,9 @@
 import warnings
 from abc import ABC, abstractmethod
 from enum import Enum
-from typing import Dict, Optional, cast, Iterable, TYPE_CHECKING, List
+from typing import Dict, Optional, TypeVar, cast, Iterable, TYPE_CHECKING, List

+from pyspark.errors.exceptions.tblib import Traceback
 from pyspark.errors.utils import ErrorClassesReader
 from pyspark.logger import PySparkLogger
 from pickle import PicklingError
@@ -27,6 +28,9 @@
     from pyspark.sql.types import Row


+T = TypeVar("T", bound="PySparkException")
+
+
 class PySparkException(Exception):
     """
     Base Exception for handling errors generated from PySpark.
@@ -449,3 +453,30 @@ def summary(self) -> str:
         Summary of the exception cause.
         """
         ...
+
+
+def recover_python_exception(e: T) -> T:
+    """
+    Recover Python exception stack trace.
+
+    Many JVM exception types may wrap Python exceptions. For example:
+    - UDFs can cause PythonException
+    - UDTFs and Data Sources can cause AnalysisException
+    """
+    python_exception_header = "Traceback (most recent call last):"
+    try:
+        message = str(e)
+        start = message.find(python_exception_header)
+        if start == -1:
+            # No Python exception found
+            return e
+
+        # The message contains a Python exception. Parse it to use it as the exception's traceback.
+        # This allows richer error messages, for example showing line content in Python UDF.
+        python_exception_string = message[start:]
+        tb = Traceback.from_string(python_exception_string)
+        tb.populate_linecache()
+        return e.with_traceback(tb.as_traceback())
+    except BaseException:
+        # Parsing the stacktrace is best effort.
+        return e
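
For context, the parse-and-reattach idea used by `recover_python_exception` above can be sketched standalone with the upstream `tblib` package (`pip install tblib`). This is an illustrative sketch, not code from the PR, and the error text is made up; the vendored copy adds `populate_linecache()` on top of the `from_string`/`as_traceback` calls shown here.

```py
import traceback

from tblib import Traceback  # upstream package; the PR vendors a modified copy

# A made-up worker error message that embeds a Python traceback as text.
remote_message = (
    "Traceback (most recent call last):\n"
    '  File "udf.py", line 2, in foo\n'
    "ZeroDivisionError: division by zero\n"
)

tb = Traceback.from_string(remote_message)  # parse the text back into a traceback
exc = RuntimeError("An exception was thrown from the Python worker.")
try:
    # Attach the parsed frames so they appear as part of the local call stack.
    raise exc.with_traceback(tb.as_traceback())
except RuntimeError:
    traceback.print_exc()  # now includes the 'File "udf.py", line 2, in foo' frame
```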

python/pyspark/errors/exceptions/captured.py (+6)

@@ -37,6 +37,7 @@
     UnknownException as BaseUnknownException,
     QueryContext as BaseQueryContext,
     QueryContextType,
+    recover_python_exception,
 )

 if TYPE_CHECKING:
@@ -185,6 +186,11 @@ def getQueryContext(self) -> List[BaseQueryContext]:


 def convert_exception(e: "Py4JJavaError") -> CapturedException:
+    converted = _convert_exception(e)
+    return recover_python_exception(converted)
+
+
+def _convert_exception(e: "Py4JJavaError") -> CapturedException:
     from pyspark import SparkContext
     from py4j.java_gateway import is_instance_of

python/pyspark/errors/exceptions/connect.py (+11)

@@ -39,6 +39,7 @@
     StreamingPythonRunnerInitializationException as BaseStreamingPythonRunnerInitException,
     PickleException as BasePickleException,
     UnknownException as BaseUnknownException,
+    recover_python_exception,
 )

 if TYPE_CHECKING:
@@ -56,6 +57,16 @@ def convert_exception(
     truncated_message: str,
     resp: Optional[pb2.FetchErrorDetailsResponse],
     display_server_stacktrace: bool = False,
+) -> SparkConnectException:
+    converted = _convert_exception(info, truncated_message, resp, display_server_stacktrace)
+    return recover_python_exception(converted)
+
+
+def _convert_exception(
+    info: "ErrorInfo",
+    truncated_message: str,
+    resp: Optional[pb2.FetchErrorDetailsResponse],
+    display_server_stacktrace: bool = False,
 ) -> SparkConnectException:
     raw_classes = info.metadata.get("classes")
     classes: List[str] = json.loads(raw_classes) if raw_classes else []
