Add a combine batch_matmul pass #5791

Merged (2 commits, Jun 17, 2020)
Conversation

t-vi (Contributor) commented Jun 12, 2020

Contrary to what you might expect, this doesn't share as much code with the combine dense pass as it does with the combine 2d conv pass. This is because it concatenates the "output feature" dimensions, just like the 2d conv pass concatenates output channels, whereas combine dense stacks the various matmul arguments.

I'm not sure if there is a deeper reason not to concatenate for dense, too, but maybe there is.
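The concatenation strategy the PR describes can be illustrated in plain numpy (this is a hypothetical sketch, not TVM code; it assumes TVM's batch_matmul semantics of computing A @ B^T per batch, with B laid out as (batch, out_features, k)):

```python
import numpy as np

# Illustrative sketch: two parallel batch_matmuls sharing the same LHS,
# combined by concatenating the RHS tensors along the output-feature axis.
rng = np.random.default_rng(0)
a = rng.standard_normal((4, 8, 16))    # shared LHS: (batch, m, k)
b1 = rng.standard_normal((4, 5, 16))   # first RHS: 5 output features
b2 = rng.standard_normal((4, 7, 16))   # second RHS: 7 output features

def batch_matmul(a, b):
    # mimics relay.nn.batch_matmul: out[i] = a[i] @ b[i].T
    return np.einsum("bmk,bnk->bmn", a, b)

# two separate ops
out1 = batch_matmul(a, b1)
out2 = batch_matmul(a, b2)

# combined: concatenate, run one batch_matmul, split the result
b_comb = np.concatenate([b1, b2], axis=1)   # (4, 12, 16)
out = batch_matmul(a, b_comb)               # (4, 8, 12)
c1, c2 = out[:, :, :5], out[:, :, 5:]

assert np.allclose(out1, c1) and np.allclose(out2, c2)
```

One kernel launch replaces several, which is where the speedup for transformer-style models comes from.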

t-vi closed this Jun 12, 2020
t-vi reopened this Jun 14, 2020
comaniac (Contributor) commented
This might be off topic, but I think it's possible to use the pattern language to replace all the "combine parallel XX op" passes. Maybe we could create an issue to track it.

cc @mbrookhart @tqchen

mbrookhart (Contributor) commented

@t-vi Interesting PR! Thank you for submitting it! I presume the use case for this is in Transformer-like models? Do you see a perf benefit from the rewrite?

> This might be off the topic. I think it's possible to use the pattern language to replace all "combine parallel XX op" passes. Maybe we could create an issue to check it.
>
> cc @mbrookhart @tqchen

I would imagine yes; it would be pretty easy to implement this as a pattern and a rewrite. A pattern-based solution might have one complication, though: I don't think we currently have a way to match something with 2, 3, or 4 branches in the same pattern, so it would require a number of separate patterns. I'll think about that.

t-vi (Contributor, Author) commented Jun 15, 2020

@mbrookhart Yes, my use case is transformers. The PyTorch frontend translates the matmul used in HuggingFace Transformers' BERT into batch_matmul. The speedup is 1.5x-2x-ish on ROCm (gfx906), and there is also some speedup on a GTX 1080 Ti, even though it currently hits a reshape right after batch_matmul. I don't quite reach the speed of ONNX Runtime yet.
I'm currently preparing a detailed writeup (and that's the theme of my recent PRs: tunable batch_matmul, this, and support for integers and other non-float32 types in the PyTorch frontend).

I imagine it would be cool to move the pass to pattern matching. I would expect it would replace the code shared by the combine passes of batch_matmul and conv2d (and, to some extent, the dense combiner) rather than the parts that are separate. I have been wondering about the efficiency of the dense combiner, by the way: the code comments mention BERT as a use case, but it is unclear to me whether dense -> batch_matmul with a "duplicated" (possibly stride-0) input is better than dense -> dense with non-contiguous results (the columns would still be contiguous; only the rows would be interleaved between the ops). But then I haven't looked much at how TVM deals with strides (which is relatively significant, because self-attention typically has some reshapes that would be nice to fuse).
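The contrast between the two combining strategies discussed here (stacking, as in the dense combiner, versus output-feature concatenation, as in this PR and the conv2d combiner) can be sketched in plain numpy. This is an illustrative toy, not TVM code; the shapes and names are invented for the example:

```python
import numpy as np

# Two parallel dense ops sharing one input x; weights are (units, in_features).
rng = np.random.default_rng(1)
x = rng.standard_normal((8, 16))
w1 = rng.standard_normal((5, 16))
w2 = rng.standard_normal((5, 16))

ref1, ref2 = x @ w1.T, x @ w2.T

# Strategy A (stacking, like the dense combiner): stack the weights into a
# batch and run one batch_matmul with the input reused for every batch entry.
w_stacked = np.stack([w1, w2])                  # (2, 5, 16)
out_a = np.einsum("mk,bnk->bmn", x, w_stacked)  # (2, 8, 5)
assert np.allclose(out_a[0], ref1) and np.allclose(out_a[1], ref2)

# Strategy B (concatenation, like this PR and the conv2d combiner):
# concatenate along the output-feature axis, run one dense, split the result.
w_cat = np.concatenate([w1, w2], axis=0)        # (10, 16)
out_b = x @ w_cat.T                             # (8, 10)
assert np.allclose(out_b[:, :5], ref1) and np.allclose(out_b[:, 5:], ref2)
```

Note that stacking requires the parallel ops to have equal output widths, while concatenation does not; on the other hand, concatenation interleaves the outputs, which is the non-contiguity question raised above.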

mbrookhart (Contributor) left a review:
LGTM

tqchen (Member) commented Jun 17, 2020

cc @vinx13 please also help to take a look

Review comments on python/tvm/relay/transform/transform.py (outdated, resolved)
vinx13 merged commit 052ea4d into apache:master Jun 17, 2020
trevor-m pushed a commit to trevor-m/tvm that referenced this pull request Jun 30, 2020
* Add a combine batch_matmul pass

Contrary to what you might expect, this doesn't share as much code with
the combine dense pass as it does with the combine 2d conv pass.
This is because it concatenates the "output feature" dimensions.

* fix docstring
zhiics pushed a commit to neo-ai/tvm that referenced this pull request Jul 2, 2020
5 participants