Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Customizable Nullability for UDAF #11274

Open
jayzhan211 opened this issue Jul 5, 2024 · 4 comments
Open

Customizable Nullability for UDAF #11274

jayzhan211 opened this issue Jul 5, 2024 · 4 comments
Labels
enhancement New feature or request

Comments

@jayzhan211
Copy link
Contributor

jayzhan211 commented Jul 5, 2024

Is your feature request related to a problem or challenge?

Follow on #11093 and also there is another issue #11256 expect this

Describe the solution you'd like

There are some functions always returns non-null result, like count, array_agg, min/max and more.
We can optimize the query based on the nullability of the function, so it would be helpful if we could define nullability for each UDAF

Describe alternatives you've considered

No response

Additional context

No response

@jayzhan211 jayzhan211 added the enhancement New feature or request label Jul 5, 2024
@jayzhan211 jayzhan211 changed the title Customizable Nullability for UDFs and UDAFs Jul 5, 2024
@findepi
Copy link
Member

findepi commented Jul 5, 2024

To address this we could extend the pattern here

pub fn return_type(
&self,
input_expr_types: &[DataType],
input_expr_nullable: &[bool],
) -> Result<DataType> {

ie now we have return_type(input_types, nullability) -> return_type
we could have return_type(input_types, nullability) -> (return_type, nullability)

this is an API for builtin functions, we would need similar API for UDAFs

fn return_type(&self, arg_types: &[DataType]) -> Result<DataType>;

Obviously, this would be a breaking change: both method signature and return type would change.
We can avoid the breaking change by making nullability a separate method, but it would be good to first determine the ideal end-state, and only then think how to get there.

@findepi
Copy link
Member

findepi commented Jul 5, 2024

Just thinking more about this. SQL spec

If no row qualifies, then the result of COUNT is 0 (zero), and the result of any other aggregate function is the null value.

and this is also about array_agg.

So it might be that the count is the only aggregate function that really returns non-null values, and maybe we don't need to complicate UDAF API for this.

@jayzhan211
Copy link
Contributor Author

jayzhan211 commented Jul 6, 2024

I agree that we don't need non-null for array_agg, min/max, but we may need it for count variant like approx_count_distinct or other user-defined function that expect non-null, so able to define nullability still seems like a good idea.

btw, our array_agg always return empty array not null, we should consider change the behaviour to follow other sql database to return null or keep it as it is 🤔 . DuckDB and Postgres also returns null for array_agg.

@jayzhan211
Copy link
Contributor Author

jayzhan211 commented Jul 6, 2024

The nullability of a function can differ from the final result, which is crucial for optimization purposes. We can inform the optimizer that a function will always return a non-null value, allowing for optimizations based on this assumption. However, the final result can still be null if no rows qualify, resulting in an empty plan and thus a null value

#11093 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
2 participants