[CPU] Add conversion for unsupported BF16 ops via target-specific stage #27
Conversation
# TTCIR -> Optimized TTCIR
pm = ir.pass_manager(mod.context)
pm.enable_debug()
if self.cpu_arch == "x86_64" and "avx512bf16" not in self.cpu_features:
I am assuming that for CPUs with avx512bf16 you can generate vector instructions which handle bf16 natively. My question is about leveraging instructions with mixed precision, i.e. vdpbf16ps, which takes two bf16 operands and accumulates into an fp32. IIUC, that is not currently supported through MLIR::vector::contract, and you may have to jump through similar dtype-conversion hoops, partly for lowering purposes. Is my understanding accurate? Since we are looking into GEM[MV], this is quite relevant.
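For reference, a minimal numpy sketch of the vdpbf16ps semantics (pairwise bf16 products accumulated into fp32). This is my own illustration of the instruction's behavior, not backend code, and the helper name is made up:

```python
import numpy as np

def dpbf16ps(acc_f32, a_bf16, b_bf16):
    """Emulate vdpbf16ps: acc[i] += a[2i]*b[2i] + a[2i+1]*b[2i+1].

    a_bf16/b_bf16 are uint16 arrays holding BF16 bit patterns and are
    twice as long as acc_f32.
    """
    # Widen BF16 to FP32 by placing the 16 bits in the high half.
    a = (a_bf16.astype(np.uint32) << 16).view(np.float32)
    b = (b_bf16.astype(np.uint32) << 16).view(np.float32)
    prod = a * b
    # Accumulate adjacent pairs into the fp32 accumulator.
    return acc_f32 + prod[0::2] + prod[1::2]
```

This ignores the instruction's exact handling of denormals and NaNs, but it shows why the op doesn't map directly onto a plain elementwise vector contract.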
Yes, in all cases where existing lowering doesn't give us an expected result (or doesn't work at all), we can handle it through such passes. If we can do lowering to lower-level vector operations or target vector operations, then I think we should go this way. Otherwise, we would have to lower directly to LLVM intrinsics.
Unfortunately, I don't see vdpbf16ps exposed through any of the vector dialects (there is only tdpbf16ps), so we should either add such an operation (e.g. to a new triton::cpu::x86 dialect) or lower directly to the LLVM intrinsic in the appropriate pipeline.
Just a general comment: for now, I think this hard-coded check (explicitly checking avx512bf16) is okay. Maybe once we want to support ARM or other extended ISAs, we can have a nicer check method? I still don't think we need a separate third-party directory; a single cpu backend is still better. I think we can abstract away some target information into something like TargetInfo.cpp, as in the GPU backends.
We are likely to have more and more divergent compilation flows for different targets, so I think we should introduce target-specific pipelines in compile.py. Something similar to the existing TargetInfo might also be useful, but at a lower level.
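As a hedged illustration of the kind of lower-level target information such a pipeline could condition on, here is a minimal feature-flag parser over /proc/cpuinfo-style text. The helper name is hypothetical, not the actual backend code:

```python
def parse_cpu_features(cpuinfo_text):
    """Return the set of feature flags from /proc/cpuinfo-style text.

    Assumes the Linux x86 layout; a real TargetInfo would also have to
    cover other OSes and architectures.
    """
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            _, _, flags = line.partition(":")
            return set(flags.split())
    return set()

# A pipeline could then schedule target-specific passes conditionally, e.g.:
# if arch == "x86_64" and "avx512bf16" not in features: add BF16 rewrites.
```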
Force-pushed from 825ca0f to b2675ef
Thanks so much for the work!
I have some questions and suggestions (naming, again). And I wish you could add a test case.
third_party/cpu/backend/compiler.py
Outdated
@@ -86,6 +89,19 @@ def make_ttcir(mod, metadata, opt):
    metadata["cluster_dims"] = (opt.cluster_dims[0], opt.cluster_dims[1], opt.cluster_dims[2])
    return mod

def make_ottcir(self, mod, metadata, opt):
In the future, do you think this "optimized" TTCIR will have more "optimizations"? For me, at least for now, this PR for BF16 doesn't look like an optimization. Isn't it (supporting unsupported BF16) more of a transformation?
So, until we have stronger reasons for a separate pass, what about just having this code in make_ttcir? Actually, make_ttir can also be seen as a make_ottir, so I do understand your intention.
Both amd and nvidia do similar transformations in make_ttgir. Again, it doesn't mean their approaches are correct, but just for consistency?
These transformations are done as an optimization for targets with no native BF16 support. LLVM is able to compile BF16 operations even with no BF16 target support, but the resulting code would be suboptimal. E.g., BF16->FP32 conversions are done using scalar intrinsic calls, which should be much slower than a vectorized transformation.
The CPU backend differs a lot from the existing backends. It uses more gradual lowering and supports multiple target architectures. I don't see why we should try to mimic the GPU backends.
My vision of compilation stages is the following:
- Transform TTIR to vectors using high-level target-independent vector operations.
- Perform target-specific transformations and optimizations. Higher-level operations are lowered to lower-level ones, including target-specific dialects.
- Lower to LLVM dialect.
We could merge the middle part into its neighbors, but I don't see what benefits it would give.
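To make the vectorization point concrete, here is a numpy sketch (my own, assuming round-to-nearest-even and ignoring NaN handling) of the bit-level FP32->BF16 truncation that lends itself to SIMD, unlike a per-element __truncsfbf2 call:

```python
import numpy as np

def fp32_to_bf16_rne(x):
    """Truncate float32 -> bfloat16 bit patterns, rounding to nearest even.

    Note: NaN inputs would need special casing; omitted for brevity.
    """
    bits = x.astype(np.float32).view(np.uint32)
    lsb = (bits >> 16) & 1            # LSB of the part we keep
    rounded = bits + 0x7FFF + lsb     # round to nearest, ties to even
    return (rounded >> 16).astype(np.uint16)
```

Every step is a plain elementwise integer op, so a compiler can emit straight-line vector code for a whole register of lanes at once.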
Thank you for giving more background!
I think "target-specific" TTCIR sounds much better than "optimized". Having the TTTCIR is a good compromise between two extremes: (1) having a whole new third-party backend like intel-cpu or arm-cpu; (2) a single pipeline with a thin layer like TargetInfo.
I was thinking that, as of now, SIMD within a register for ARM/x86 is still pretty much the same. This BF16 issue is more of a capability/feature of CPU architectures, not necessarily specific to Intel/AVX512? So, it could be easily abstracted away and handled in a target-independent way.
Yeah, I agree that having a target-specific stage (which allows us to have target-specific dialects) is valid.
@@ -1,2 +1,3 @@
add_subdirectory(TritonCPUOpt)
My 2 cents on the naming: what about using TritonCPUTransforms for better consistency? amd has TritonAMDGPUTransforms, which does transformations and optimizations. For me, I'd like to be as consistent as possible with the GPU backends. It doesn't mean that the GPU is always correct, though.
Sounds good
  }
};

struct VectorizeFpToFp
I understand ConvertUnsupportedOps, but I didn't fully understand VectorizeFpToFp. Maybe you could have a follow-up PR with some unit test cases for this PR?
I'm not sure whether I got it correctly. This is my understanding:
- ConvertUnsupportedOps introduces TruncFOp at the result.
- VectorizeFpToFp/Fp32ToBf16Conversion handles those TruncFOps?
If this is true, can we do this work in ConvertUnsupportedOps? Yes, it can still be a separate pass, but the Vectorize naming wasn't easily understandable?
Our CI already covers this code; these passes fix multiple BF16 tests on a platform with no AVX512BF16 and GCC version <13.
I see the name VectorizeFpToFp is confusing. The name comes from the fact that if this transformation is not applied, then LLVM would do the conversion element by element using a scalar intrinsic. So this pass produces vectorized code for FP conversions instead. What about DecomposeFpToFp? This pass is going to be used for FP8 conversion lowering, etc.
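As I read the discussion, ConvertUnsupportedOps follows the classic extend-compute-truncate rewrite. A minimal numpy sketch of that pattern (my own illustration; it uses plain bit truncation rather than the pass's actual rounding, and the function name is made up):

```python
import numpy as np

def bf16_binop_via_fp32(a_bf16, b_bf16, op):
    """Rewrite an unsupported BF16 binop as: extend -> fp32 op -> truncate.

    Inputs/outputs are uint16 BF16 bit patterns; the final truncation
    simply drops the low mantissa bits, with no rounding, for brevity.
    """
    a = (a_bf16.astype(np.uint32) << 16).view(np.float32)  # like ExtFOp
    b = (b_bf16.astype(np.uint32) << 16).view(np.float32)
    r = op(a, b)                                           # fp32 compute
    return (r.view(np.uint32) >> 16).astype(np.uint16)     # like TruncFOp
```

The TruncFOp introduced at the end is exactly what the separate FP32->BF16 conversion pass then rewrites into vectorized code.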
Thanks for the clarification! Glad that I read your intentions correctly! Yes, I was thinking just like you: without this conversion, LLVM would not be able to vectorize. So, let's call it "vectorize" fp-to-fp.
Yes, DecomposeFpToFp makes more sense.
Signed-off-by: Ilya Enkovich <[email protected]>
Force-pushed from b2675ef to 98b4a21
Thanks for providing more context. Having the target-specific TTCIR (TTTCIR) looks good! Oh, and thanks for removing the unused code, which I should've done myself.
(I updated the title, as this is a non-trivial design decision.)
…ge (triton-lang#27)
* Remove unused code.
* Fma is always allowed on CPU.
* Add unsupported op conversions for BF16 type.
Signed-off-by: Ilya Enkovich <[email protected]>
On machines with no native BF16 support, we encounter a link problem due to the missing __truncsfbf2 symbol, which LLVM uses to convert FP32 to BF16. Tests work on BF16-enabled machines and on those that have libgcc 13, where this symbol was added.
We could add a runtime with the required symbol to work on machines with older libgcc, but I chose to perform conversions in MLIR for better efficiency. LLVM wouldn't use vector instructions for the conversion. Also, it introduces conversions in tests where they are not really required (e.g. in test_where[bfloat16]).
This patch introduces a new pipeline, TritonCPUOpt, meant to decompose higher-level vector operations (or unsupported ones) into lower-level operations. I added passes to transform some basic operations on BF16 and to perform vectorized FP32->BF16 conversion. The passes are executed conditionally, only on x86 targets with no AVX512BF16 extension.
With this patch, all enabled tests pass on machines with and without AVX512BF16 support. Our CI workflow should finally pass with these changes.