[FEA]Support pyspark.ml.evaluation.{BinaryClassificationEvaluator, MulticlassClassificationEvaluator} #64

Open
viadea opened this issue Feb 18, 2022 · 0 comments

I wish we could support pyspark.ml.evaluation.{BinaryClassificationEvaluator, MulticlassClassificationEvaluator}.

Take the example from https://stackoverflow.com/questions/60772315/how-to-evaluate-a-classifier-with-pyspark-2-4-5 :

from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator

# Create both evaluators
evaluatorMulti = MulticlassClassificationEvaluator(labelCol="target", predictionCol="prediction")
evaluator = BinaryClassificationEvaluator(labelCol="target", rawPredictionCol="prediction", metricName='areaUnderROC')

# Make predictions
predictionAndTarget = model.transform(df).select("target", "prediction")

# Get metrics
acc = evaluatorMulti.evaluate(predictionAndTarget, {evaluatorMulti.metricName: "accuracy"})
f1 = evaluatorMulti.evaluate(predictionAndTarget, {evaluatorMulti.metricName: "f1"})
weightedPrecision = evaluatorMulti.evaluate(predictionAndTarget, {evaluatorMulti.metricName: "weightedPrecision"})
weightedRecall = evaluatorMulti.evaluate(predictionAndTarget, {evaluatorMulti.metricName: "weightedRecall"})
auc = evaluator.evaluate(predictionAndTarget)
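
For reference, the multiclass metrics requested above can be written out in plain Python. This is only a sketch of the usual weighted definitions (each class weighted by its share of the true labels), not Spark's implementation; the function name `multiclass_metrics` and the returned dictionary keys are made up for illustration.

```python
from collections import Counter


def multiclass_metrics(pairs):
    """Compute accuracy and label-weighted precision/recall/F1.

    `pairs` is an iterable of (label, prediction) tuples.
    Hypothetical helper for illustration; not part of pyspark.
    """
    pairs = list(pairs)
    total = len(pairs)
    if total == 0:
        raise ValueError("need at least one (label, prediction) pair")

    label_counts = Counter(label for label, _ in pairs)
    pred_counts = Counter(pred for _, pred in pairs)
    true_pos = Counter(label for label, pred in pairs if label == pred)

    weighted_precision = weighted_recall = weighted_f1 = 0.0
    for label, n in label_counts.items():
        weight = n / total  # share of rows whose true label is `label`
        predicted = pred_counts[label]
        precision = true_pos[label] / predicted if predicted else 0.0
        recall = true_pos[label] / n
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        weighted_precision += weight * precision
        weighted_recall += weight * recall
        weighted_f1 += weight * f1

    return {
        "accuracy": sum(true_pos.values()) / total,
        "weightedPrecision": weighted_precision,
        "weightedRecall": weighted_recall,
        "f1": weighted_f1,
    }
```

Each of these values needs only per-class counts of labels, predictions, and matches, so in principle they reduce to grouped counts over the two columns.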

These evaluators appear to use RDD APIs internally, which produces many "cannot run on GPU" messages, such as:

! <DeserializeToObjectExec> cannot run on GPU because not all expressions can be replaced; GPU does not currently support the operator class org.apache.spark.sql.execution.DeserializeToObjectExec
  ! <CreateExternalRow> createexternalrow(prediction#327, label#322, 1.0#400, newInstance(class org.apache.spark.ml.linalg.VectorUDT).deserialize, StructField(prediction,DoubleType,true), StructField(label,DoubleType,true), StructField(1.0,DoubleType,false), StructField(probability,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true)) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.objects.CreateExternalRow
    @Expression <AttributeReference> prediction#327 could run on GPU
    @Expression <AttributeReference> label#322 could run on GPU
    @Expression <AttributeReference> 1.0#400 could run on GPU
    ! <Invoke> newInstance(class org.apache.spark.ml.linalg.VectorUDT).deserialize cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.objects.Invoke
      ! <NewInstance> newInstance(class org.apache.spark.ml.linalg.VectorUDT) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.objects.NewInstance
      !Expression <AttributeReference> probability#326 cannot run on GPU because expression AttributeReference probability#326 produces an unsupported type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7
  !Expression <AttributeReference> obj#406 cannot run on GPU because expression AttributeReference obj#406 produces an unsupported type ObjectType(interface org.apache.spark.sql.Row)
  !Exec <ProjectExec> cannot run on GPU because unsupported data types in input: org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 [probability#326]; not all expressions can be replaced; unsupported data types in output: org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 [probability#326]
    @Expression <AttributeReference> prediction#327 could run on GPU
    @Expression <AttributeReference> label#322 could run on GPU
    @Expression <Alias> 1.0 AS 1.0#400 could run on GPU
      @Expression <Literal> 1.0 could run on GPU
    !Expression <AttributeReference> probability#326 cannot run on GPU because expression AttributeReference probability#326 produces an unsupported type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7
    !Exec <FileSourceScanExec> cannot run on GPU because unsupported data types org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 [probability] in read for Parquet; unsupported data types in output: org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 [probability#326]
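
To see at a glance which operators are blocking, output like the above can be tallied with a short script. This is a rough sketch that assumes the line shape seen in this log (a leading `!` or `@`, an optional `Exec`/`Expression` tag, then the operator name in angle brackets); the helper name `unsupported_operators` is hypothetical.

```python
import re
from collections import Counter

# Matches lines such as:
#   "! <DeserializeToObjectExec> cannot run on GPU because ..."
#   "  !Expression <AttributeReference> probability#326 cannot run on GPU ..."
#   "    @Expression <AttributeReference> prediction#327 could run on GPU"
_LINE = re.compile(r"^\s*(?P<flag>[!@])\s*(?:Exec|Expression)?\s*<(?P<op>\w+)>")


def unsupported_operators(explain_text):
    """Count operator names on lines flagged with '!' (cannot run on GPU)."""
    counts = Counter()
    for line in explain_text.splitlines():
        match = _LINE.match(line)
        if match and match.group("flag") == "!":
            counts[match.group("op")] += 1
    return counts
```

Calling `unsupported_operators(log_text).most_common()` on the captured text lists the flagged operators by frequency.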