-
Notifications
You must be signed in to change notification settings - Fork 28k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-47690][SQL] Enable hash aggregation support for all collations (StringType) #46640
base: master
Are you sure you want to change the base?
Conversation
// This rewrite rule is used to enabled hash aggregation on collated string columns. However, | ||
// hash aggregation is currently only supported for grouping aggregations - this means that no | ||
// string type can be found in the aggregate expressions, so we avoid rewrite in this case. | ||
!aggregate.aggregateExpressions.exists(e => e.dataType.isInstanceOf[StringType]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that you should check if hash aggregation is supported at all, regardless of StringType
.
If we are going to end up doing merge agg there is no need to insert collation_key.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I just try to use supportsHashAggregate
here, I might find that the aggregate does not support hash aggregation before the rewrite, but will support it after the rewrite (as a result of this, the rewrite rule will never actually execute)
However, we perform this check before doing the plan rewrite, so the point of this check is to verify that the current Aggregate is only a grouping aggregate with respect to StringType (i.e. StringType is not found in aggregateExpressions).
Any ideas on how to make this better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe you can call supportsHashAggregate
by just passing agg keys and empty seq for group by?
What changes were proposed in this pull request?
Enable collation support for hash aggregation on StringType, for aggregates where aggregate expressions don't include a non-binary collation expression. Note: support for complex types will be added separately.
CollationKey
CollationKey
is a unary expression that transformsStringType
toBinaryType
Why are the changes needed?
Improve GROUP BY performance for collated strings.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
RewriteGroupByCollation
inCollationSuite
Was this patch authored or co-authored using generative AI tooling?
No.