
[SPARK-48834][SQL] Disable variant input/output to python scalar UDFs, UDTFs, UDAFs during query compilation #47253

Closed · 8 commits

Conversation

@richardc-db (Contributor) commented Jul 8, 2024

What changes were proposed in this pull request?

Throws an exception if a variant is the input/output type to/from a Python UDF, UDAF, or UDTF.

Why are the changes needed?

Currently, variant input/output types to scalar UDFs either fail during execution or return a `net.razorvine.pickle.objects.ClassDictConstructor` to the user's Python code. For a better UX, we should fail during query compilation instead, and we should block returning `ClassDictConstructor` to user code because we eventually want to return `VariantVal`s to user code and don't want users depending on the leaked pickle object.
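The check described above can be sketched as a planning-time walk over the UDF's schema. This is a minimal, self-contained illustration, not Spark's actual implementation: the JSON-style type tree and the function names (`contains_variant`, `check_udf_schema`) are invented for this sketch.

```python
# Hypothetical sketch of a planning-time variant check (illustrative names;
# not Spark's actual code). A UDF's input/output schema is rejected if a
# variant type appears anywhere, including nested in struct/array/map types.

def contains_variant(dt: dict) -> bool:
    """Recursively walk a simplified JSON-style data type tree."""
    kind = dt["type"]
    if kind == "variant":
        return True
    if kind == "struct":
        return any(contains_variant(f["type"]) for f in dt["fields"])
    if kind == "array":
        return contains_variant(dt["element"])
    if kind == "map":
        return contains_variant(dt["key"]) or contains_variant(dt["value"])
    return False

def check_udf_schema(dt: dict) -> None:
    # Fail during query compilation, before any execution happens.
    if contains_variant(dt):
        raise TypeError("UDFs do not support VARIANT as an input/output data type")

# A struct with a nested variant field, like STRUCT<a: INT, v: VARIANT>.
schema = {"type": "struct", "fields": [
    {"name": "a", "type": {"type": "integer"}},
    {"name": "v", "type": {"type": "variant"}},
]}
```

The key design point from the PR is that this runs during analysis, so the user sees a clear error before any executor work or pickling happens.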

Does this PR introduce any user-facing change?

Yes: attempting to use variants in Python UDFs now throws an exception rather than returning a `ClassDictConstructor` as before. We want to make this change now because we eventually want to return `VariantVal`s to user code and do not want users relying on the current behavior.

How was this patch tested?

Added unit tests.

Was this patch authored or co-authored using generative AI tooling?

No.

@allisonwang-db (Contributor) left a comment:

How about Python UDF/UDTFs?

@richardc-db changed the title [SPARK-48834][SQL] Disable variant input/output to scalar UDFs during query compilation Jul 10, 2024
python/pyspark/sql/tests/pandas/test_pandas_udf_scalar.py
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf

# Intentionally uses SparkSession so one implementation can be shared with/without
# Spark Connect.
schema = (
    SparkSession.active().range(0).select(udf(lambda x: x, returnType=ddl)("id")).schema
)
assert len(schema) == 1
return schema[0].dataType
return _parse_datatype_string(ddl)
```
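The snippet above derives the type by running a no-op UDF over an empty range; since variant output is now blocked at planning time, the changed code parses variant DDL directly instead. The dispatch can be sketched roughly as follows — all function bodies here are illustrative stubs (`schema_via_udf`, `parse_datatype_string`, `from_ddl` are stand-ins, not PySpark's real implementation):

```python
# Simplified sketch of the fromDDL fallback discussed in this thread
# (illustrative stubs; PySpark's actual implementation differs).

def schema_via_udf(ddl: str) -> str:
    # Stand-in for the original path: run a no-op UDF just so the analyzer
    # resolves the output schema. With this PR, that path is blocked for
    # variant types at query compilation.
    if "variant" in ddl.lower():
        raise ValueError("UNSUPPORTED_UDF_OUTPUT_TYPE: variant")
    return f"parsed-by-udf:{ddl}"

def parse_datatype_string(ddl: str) -> str:
    # Stand-in for the direct parser (_parse_datatype_string in PySpark).
    return f"parsed-directly:{ddl}"

def from_ddl(ddl: str) -> str:
    # Prefer the analyzer round-trip; fall back to direct parsing when the
    # UDF route is blocked.
    try:
        return schema_via_udf(ddl)
    except ValueError:
        return parse_datatype_string(ddl)
```

The point of the fallback is that `fromDDL("v variant")` keeps working even though a UDF can no longer return a variant.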
A reviewer (Contributor) commented:

Why is this change needed?

@richardc-db (Author) commented Jul 10, 2024:

If we want to do something like fromDDL("v variant"), fromDDL actually calls a UDF just to get the output schema. However, this PR disables variant-output UDFs during planning, so we have to get the data type another way.

I discussed this a bit with @HyukjinKwon offline last night; he suggested that we instead block variant ser/de. However, I took a second look today and, unless I'm missing something, I believe this route is preferable:

  • This will fail earlier, during planning rather than execution.
  • Blocking during ser/de may require more code changes, because pandas UDFs don't seem to share the same codepath as non-pandas UDFs.
  • _parse_datatype_string was also made to work with Spark Connect (which was @HyukjinKwon's initial concern) in this PR.
A reviewer (Contributor) commented:

Can we make sure the behavior of _parse_datatype_string is the same as the original fromDDL? My concern is that this might introduce an unintentional behavior change in a public API.
What's the error message if we do fromDDL("a variant") without this change?

@richardc-db (Author) replied:

Yeah, makes sense. The function comment mentions that this was added for Spark 4.0.0 (which won't be GA until 2025, IIUC), so I thought the risk of a breaking change to a public API would be a bit lower.

The error message without changing fromDDL, for something like "a int, v variant", would be:

org.apache.spark.sql.AnalysisException: [DATATYPE_MISMATCH.UNSUPPORTED_UDF_OUTPUT_TYPE] Cannot resolve "<lambda>(id)" due to data type mismatch: UDFs do not support 'STRUCT<a: INT, v: VARIANT>' as an output data type. SQLSTATE: 42K09

which is pretty confusing for fromDDL, in my opinion (i.e., why does fromDDL need to call a UDF in the first place?).

It seems that the UDF returnType implementation directly calls _parse_datatype_string here.

I added an additional test verifying that the old UDF behavior matches the new behavior.

@allisonwang-db (Contributor) left a comment:

Thanks for adding the tests!

@HyukjinKwon (Member) commented:

Merged to master.

jingz-db pushed a commit to jingz-db/spark that referenced this pull request Jul 22, 2024
…, UDTFs, UDAFs during query compilation


Closes apache#47253 from richardc-db/variant_scalar_udfs.

Authored-by: Richard Chen <r.chen@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>