PySpark and Ozone integration #6299

dino-chiio · 2024-02-29T16:57:27Z

dino-chiio
Feb 29, 2024

I am trying to use Apache Ozone with PySpark. The documentation states that "Frameworks like Apache Spark, YARN and Hive work against Ozone without needing any change", in the context of the ofs and o3fs file systems.

I want to deploy Ozone in a separate pod inside Kubernetes, and have PySpark applications connect to that pod to read and write data.

Could you share some setup information and code samples for reading and writing data in Ozone from PySpark, please?

jojochuang · 2024-02-29T18:51:00Z

jojochuang
Feb 29, 2024
Collaborator

Assuming you're not in a Kerberized environment, and you have a csv file uploaded to ofs://ozone1708496417/vol1/bucket1/abc.csv that looks like this:

Name, Age, City
John, 30, New York
Alice, 25, Los Angeles
Bob, 35, Chicago
Emily, 28, San Francisco

Here's a script, abc.py, that reads that CSV file with PySpark:

#!/usr/bin/env python

from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder \
    .appName("Read from Ozone") \
    .getOrCreate()

# Define the Ozone file path: ofs://<service id>/<volume>/<bucket>/<key>
ozone_path = "ofs://ozone1708496417/vol1/bucket1/abc.csv"

# Read the CSV file from Ozone into a DataFrame
df = spark.read.csv(ozone_path, header=True, inferSchema=True)

# Show the first few rows of the DataFrame
df.show()

# Stop the SparkSession
spark.stop()

And you can execute it with:

spark-submit abc.py
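The original question also asked about writing. The following is a minimal sketch of the write direction, not something posted in this thread: it assumes the same non-Kerberized setup, that the Ozone client jar and configuration are already visible to Spark, and that the volume and bucket exist. The script name write_abc.py, the helper names, and the output key are hypothetical.

```python
import sys


def ofs_path(service_id: str, volume: str, bucket: str, key: str) -> str:
    """Build a URI of the form ofs://<service id>/<volume>/<bucket>/<key>."""
    return f"ofs://{service_id}/{volume}/{bucket}/{key.lstrip('/')}"


def write_and_read_back(spark, target: str) -> None:
    """Write a small DataFrame to `target` as CSV, then read it back."""
    rows = [("John", 30, "New York"), ("Alice", 25, "Los Angeles")]
    df = spark.createDataFrame(rows, ["Name", "Age", "City"])

    # mode("overwrite") replaces any existing output under `target`.
    df.write.mode("overwrite").csv(target, header=True)

    # Spark writes a directory of part files; reading the directory loads them all.
    spark.read.csv(target, header=True, inferSchema=True).show()


def main(target: str) -> None:
    # Imported here so the helpers above stay importable without PySpark.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("Write to Ozone").getOrCreate()
    try:
        write_and_read_back(spark, target)
    finally:
        spark.stop()


if __name__ == "__main__":
    if len(sys.argv) == 5:
        # Arguments: <service id> <volume> <bucket> <key>
        main(ofs_path(*sys.argv[1:]))
    else:
        print("usage: spark-submit write_abc.py <service id> <volume> <bucket> <key>")
```

With the names used above, this would be run as: spark-submit write_abc.py ozone1708496417 vol1 bucket1 abc_out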

2 replies

jojochuang Feb 29, 2024
Collaborator

If the environment is Kerberized, you need to set the spark.yarn.access.hadoopFileSystems property so that Spark knows where to request delegation tokens from. For example:

spark-submit --conf "spark.yarn.access.hadoopFileSystems=ofs://ozone1709186636" abc.py
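As a sketch of how the property value relates to the data path: the value is a file system root (scheme plus service ID), which can be derived from any fully qualified path the job touches. Presumably it should name the same Ozone service as the paths in the script; the two service IDs shown in this thread differ, and the thread does not say why. The helper names below are hypothetical. Spark 3.0 and later also document this setting under the name spark.kerberos.access.hadoopFileSystems, which can be passed as `prop`.

```python
from urllib.parse import urlsplit


def filesystem_root(path: str) -> str:
    """Return scheme://authority for a path such as ofs://svc/vol/bucket/key."""
    parts = urlsplit(path)
    if not parts.scheme or not parts.netloc:
        raise ValueError(f"expected a fully qualified URI, got {path!r}")
    return f"{parts.scheme}://{parts.netloc}"


def spark_submit_args(script: str, data_path: str,
                      prop: str = "spark.yarn.access.hadoopFileSystems") -> list:
    """Build a spark-submit argument list that sets `prop` to the file system root."""
    return ["spark-submit", "--conf", f"{prop}={filesystem_root(data_path)}", script]


# Reproduces the command line from the reply above.
print(" ".join(spark_submit_args("abc.py", "ofs://ozone1709186636/vol1/bucket1/abc.csv")))
```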

dino-chiio Mar 1, 2024
Author

Is the value ofs://ozone1709186636 the same service as the one used inside the code, or is it a different service?

kerneltime · 2024-02-29T18:53:26Z

kerneltime
Feb 29, 2024
Collaborator

To use Ozone in any application that uses HDFS, you need to add the shaded file system jar to the class path, provide an ozone-site.xml configured for your Ozone deployment, and change the URL to ofs://<service name>/<path>. In short: give the application the Ozone client code, provide the config, and change the URL.
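A rough sketch of those three steps from the PySpark side follows. It swaps in Spark properties for ozone-site.xml, which is an assumption and not the method described above: spark.jars and the spark.hadoop.* prefix are standard Spark settings, fs.ofs.impl and ozone.om.address are Ozone configuration keys, and the jar path and Ozone Manager address are hypothetical placeholders for your deployment. Under spark-submit the driver JVM is already running when the script starts, so passing the jar with --jars may be needed instead of setting spark.jars in code.

```python
def ozone_spark_conf(shaded_jar: str, om_address: str) -> dict:
    """Spark properties that expose the Ozone client and its config to Spark."""
    return {
        # 1. Client code: the shaded Ozone file system jar.
        "spark.jars": shaded_jar,
        # 2. Config: spark.hadoop.* entries are copied into the Hadoop configuration.
        "spark.hadoop.fs.ofs.impl": "org.apache.hadoop.fs.ozone.RootedOzoneFileSystem",
        "spark.hadoop.ozone.om.address": om_address,
    }


def build_session(app_name: str, conf: dict):
    """Create a SparkSession with the given properties applied."""
    # Imported here so ozone_spark_conf() can be used without PySpark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# Hypothetical values; replace with the jar location and OM address of your cluster.
conf = ozone_spark_conf("/opt/jars/ozone-filesystem-hadoop3.jar", "om:9862")
for key, value in conf.items():
    print(f"{key}={value}")

# 3. URL: paths then take the form ofs://<service name>/<path>, for example
# spark = build_session("Read from Ozone", conf)
# spark.read.csv("ofs://om/vol1/bucket1/abc.csv", header=True).show()
```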

2 replies

dino-chiio Mar 1, 2024
Author

Do you mean the class path inside the Ozone pod or the one in the application pod?
I am following the installation steps in this discussion.

kerneltime Mar 4, 2024
Collaborator

The application (Spark) needs the shaded fat jar (one single jar with all the client code) in the class path.

dino-chiio · 2024-03-01T14:24:46Z

dino-chiio
Mar 1, 2024
Author

Dear @kerneltime and @jojochuang, could you please help me with this error?

I have set up the Ozone environment following this discussion.

My Ozone cluster is running on minikube, set up with the instructions at https://ozone.apache.org/docs/1.3.0/start/minikube.html
My customized PySpark image is based on this discussion.
I use SparkOperator to deploy the above PySpark image as a pod.
The .yaml file is below.
When I create the pod, the following error appears inside it, related to org.apache.hadoop.fs.ozone.RootedOzoneFileSystem:

spark.txt

py4j.protocol.Py4JJavaError: An error occurred while calling o68.text.
: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.ozone.RootedOzoneFileSystem not found
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2595)
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3269)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3301)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:361)
    at org.apache.spark.sql.execution.datasources.DataSource.planForWritingFileFormat(DataSource.scala:470)
    at org.apache.spark.sql.execution.datasources.DataSource.planForWriting(DataSource.scala:572)
    at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:438)
    at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:415)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:293)
    at org.apache.spark.sql.DataFrameWriter.text(DataFrameWriter.scala:923)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.base/java.lang.reflect.Method.invoke(Unknown Source)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.base/java.lang.Thread.run(Unknown Source)
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.ozone.RootedOzoneFileSystem not found
    at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2499)
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2593)
    ... 24 more

1 reply

kerneltime Mar 4, 2024
Collaborator

Which Ozone jar did you place in the class path?
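One way to check this, sketched under the assumption that the jars in the Spark image can be inspected with Python: a jar is a zip archive, so it is possible to test whether it actually contains the class named in the ClassNotFoundException. The function name and the script name in the usage note are hypothetical.

```python
import sys
import zipfile

# Class named in the ClassNotFoundException above.
OFS_CLASS = "org.apache.hadoop.fs.ozone.RootedOzoneFileSystem"


def jar_contains_class(jar_path: str, class_name: str = OFS_CLASS) -> bool:
    """Return True if the jar has an entry for the given fully qualified class."""
    entry = class_name.replace(".", "/") + ".class"
    with zipfile.ZipFile(jar_path) as jar:
        return entry in jar.namelist()


if __name__ == "__main__":
    # Check every jar path given on the command line.
    for path in sys.argv[1:]:
        status = "contains" if jar_contains_class(path) else "is missing"
        print(f"{path} {status} {OFS_CLASS}")
```

It could be run inside the application pod against the jars on Spark's class path, for example: python check_jar.py $SPARK_HOME/jars/*.jar. If no jar contains the class, the shaded Ozone file system jar has not reached the application.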
