[CPU] Add conversion for unsupported BF16 ops via target-specific stage #27

Merged
merged 3 commits into main from ienkovich/bf16-casts
Jun 25, 2024

Conversation

ienkovich
Collaborator

On machines with no native BF16 support, we encounter a link problem due to the missing __truncsfbf2 symbol, which LLVM uses to convert FP32 to BF16. Tests work on BF16-enabled machines and on those with libgcc 13, where this symbol was added.

We could add a runtime library providing the required symbol to support machines with older libgcc, but I chose to perform the conversions in MLIR for efficiency: LLVM wouldn't use vector instructions for the conversion. Also, it introduces conversions in tests where they aren't really required (e.g. in test_where[bfloat16]).

This patch introduces a new pipeline stage, TritonCPUOpt, meant to decompose higher-level (or unsupported) vector operations into lower-level ones. I added passes to transform some basic BF16 operations and to perform a vectorized FP32->BF16 conversion. The passes are executed conditionally, only on x86 targets without the AVX512BF16 extension.

With this patch, all enabled tests pass on machines with and without AVX512BF16 support. Our CI workflow should finally pass with these changes.
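
Roughly, the new stage is gated in the backend's Python pipeline as sketched below. This is not verbatim code from the PR; it assumes the surrounding compiler driver context (the `ir` and `cpu` modules), and the `cpu.passes.ttcpuir.*` binding names are placeholders, not the exact names added by this patch.

```python
# Sketch of the target-specific stage in the CPU backend's compiler driver.
# NOTE: the cpu.passes.ttcpuir.* binding names below are hypothetical.
def make_ottcir(self, mod, metadata, opt):
    # TTCIR -> Optimized (target-specific) TTCIR
    pm = ir.pass_manager(mod.context)
    pm.enable_debug()
    if self.cpu_arch == "x86_64" and "avx512bf16" not in self.cpu_features:
        # Decompose BF16 ops that the target cannot handle natively.
        cpu.passes.ttcpuir.add_convert_unsupported_ops(pm)   # hypothetical binding
        # Emit a vectorized FP32 -> BF16 conversion instead of per-element libcalls.
        cpu.passes.ttcpuir.add_decompose_fp_conversions(pm)  # hypothetical binding
    pm.run(mod)
    return mod
```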

@ienkovich ienkovich requested a review from minjang June 18, 2024 03:01
@ienkovich ienkovich requested a review from ptillet as a code owner June 18, 2024 03:01
@minjang minjang requested review from digantdesai and embg June 18, 2024 18:49
# TTCIR -> Optimized TTCIR
pm = ir.pass_manager(mod.context)
pm.enable_debug()
if self.cpu_arch == "x86_64" and "avx512bf16" not in self.cpu_features:
Collaborator

I am assuming that on CPUs with avx512bf16 you can generate vector instructions which handle bf16 natively. My question is about leveraging mixed-precision instructions, i.e. vdpbf16ps, which takes two bf16 operands and accumulates into an fp32. IIUC that is not currently supported through MLIR::vector::contract, and you may have to jump through similar dtype-conversion hoops partly for lowering purposes. Is my understanding accurate? Since we are looking into GEM[MV], this is quite relevant.
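
(For reference, a NumPy model of what vdpbf16ps computes: each fp32 accumulator lane adds the dot product of the two bf16 elements packed into the corresponding 32-bit lanes of the inputs. `dpbf16ps_model` is just an illustrative name; bf16 inputs are emulated as fp32, which represents every bf16 value exactly.)

```python
import numpy as np

def dpbf16ps_model(acc: np.ndarray, a: np.ndarray, b: np.ndarray) -> np.ndarray:
    # acc: (N,) fp32 accumulators; a, b: (2N,) bf16 inputs emulated in fp32.
    # Lane i becomes acc[i] + a[2i]*b[2i] + a[2i+1]*b[2i+1].
    prod = (a.astype(np.float32) * b.astype(np.float32)).reshape(-1, 2)
    return acc.astype(np.float32) + prod.sum(axis=1, dtype=np.float32)

# Example: one 512-bit register holds 16 fp32 accumulators and 32 bf16 inputs.
acc = np.zeros(16, dtype=np.float32)
a = np.ones(32, dtype=np.float32)       # stand-ins for bf16 values
b = np.full(32, 2.0, dtype=np.float32)
print(dpbf16ps_model(acc, a, b))        # -> 16 lanes of 4.0
```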

Collaborator Author

Yes, in all cases where the existing lowering doesn't give us the expected result (or doesn't work at all), we can handle it through such passes. If we can lower to lower-level vector operations or target-specific vector operations, then I think we should go that way. Otherwise, we would have to lower directly to LLVM intrinsics.

Unfortunately, I don't see vdpbf16ps exposed through any of the vector dialects (there is only tdpbf16ps), so we should either add such an operation (e.g. to a new triton::cpu::x86 dialect) or lower directly to an LLVM intrinsic in the appropriate pipeline.

Collaborator

Just a general comment: for now, I think this hard-coded check (explicitly checking avx512bf16) is okay. Maybe once we want to support ARM or other extended ISAs, we can have a nicer check method? I still don't think we need a separate third-party backend; a single cpu backend is still better. I think we can abstract away some target information into something like TargetInfo.cpp, as in the GPU backends.

Collaborator Author

We are likely to have increasingly divergent compilation flows for different targets, so I think we should introduce target-specific pipelines in compile.py. Something similar to the existing TargetInfo might also be useful, but at a lower level.

@ienkovich ienkovich force-pushed the ienkovich/bf16-casts branch from 825ca0f to b2675ef on June 21, 2024 20:52
Collaborator

@minjang minjang left a comment

Thanks so much for the work!

I have some questions and suggestions (naming, again). And I wish you could add a test case.

@@ -86,6 +89,19 @@ def make_ttcir(mod, metadata, opt):
metadata["cluster_dims"] = (opt.cluster_dims[0], opt.cluster_dims[1], opt.cluster_dims[2])
return mod

def make_ottcir(self, mod, metadata, opt):
Collaborator

In the future, do you think this "optimized" TTCIR will have more "optimizations"? To me, at least for now, this BF16 PR doesn't look like an optimization. Isn't it (supporting unsupported BF16) more like a transformation?

So, until we have stronger reasons to have a separate pass, what about just putting this code in make_ttcir? Actually, make_ttir can also be seen as a make_ottir. So I do understand your intention.

Both amd and nvidia do similar transformations in make_ttgir. Again, it doesn't mean their approach is correct, but just for consistency?

Collaborator Author

These transformations are done as an optimization for targets with no native BF16 support. LLVM is able to compile BF16 operations even without BF16 target support, but the resulting code would be suboptimal. E.g. BF16->FP32 conversions are done using scalar intrinsic calls, which should be much slower than a vectorized conversion.

The CPU backend differs a lot from the existing backends: it uses more gradual lowering and supports multiple target architectures. I don't see why we should try to mimic the GPU backends.

My vision of compilation stages is the following:

  1. Transform TTIR to vectors using high-level target-independent vector operations.
  2. Perform target-specific transformations and optimizations. Higher-level operations are lowered to lower-level ones, including target-specific dialects.
  3. Lower to LLVM dialect.

We could merge the middle part into its neighbors, but I don't see what benefits it would give.

Collaborator

Thank you for giving more background!

I think "target-specific" TTCIR sounds much better than optimizations. Having the TTTCIR is a good compromise of two extremes: (1) totally having a new third-party like intel-cpu or arm-cpu; (2) a single pipeline but with a thin layer like TargetInfo.


I was thinking that, as of now, SIMD within a register is still pretty much the same for ARM/x86. This BF16 issue is more a capability/feature of CPU architectures, not necessarily specific to Intel/AVX512? So it could be easily abstracted away and handled in a target-independent way.

Yeah, I agree that having a target-specific stage (which allows us to have target-specific dialects) is valid.

@@ -1,2 +1,3 @@
add_subdirectory(TritonCPUOpt)
Collaborator

My 2 cents on the naming: what about using TritonCPUTransforms for better consistency? amd has TritonAMDGPUTransforms, which does transformations and optimizations. I'd like to stay as consistent as possible with the GPU backends. That doesn't mean the GPU backends are always correct, though.

Collaborator Author

Sounds good

}
};

struct VectorizeFpToFp
Collaborator

I understand ConvertUnsupportedOps, but I didn't fully understand VectorizeFpToFp. Maybe a follow-up PR could add some unit test cases for this PR?

I'm not sure whether I got it right. This is my understanding:

  • ConvertUnsupportedOps introduces TruncFOp at the result.
  • This VectorizeFpToFp/Fp32ToBf16Conversion handles the above TruncFOps?

If this is true, can we do this work in ConvertUnsupportedOps? Yes, it can still be a separate pass, but the Vectorize naming wasn't easy to understand.

Collaborator Author

Our CI already covers this code; these passes fix multiple BF16 tests on a platform without AVX512BF16 and with a GCC version < 13.

I see that the name VectorizeFpToFp is confusing. The name comes from the fact that if this transformation is not applied, LLVM performs the conversion element by element using a scalar intrinsic, so this pass produces vectorized code for FP conversions instead. What about DecomposeFpToFp? This pass is also going to be used for FP8 conversion lowering, etc.
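
(For illustration, the elementwise computation being vectorized is the standard FP32->BF16 truncation with round-to-nearest-even. Below is a NumPy sketch of that bit manipulation applied to a whole vector at once; it is not the pass output itself, and it ignores corner cases such as NaN handling.)

```python
import numpy as np

def fp32_to_bf16_bits(x: np.ndarray) -> np.ndarray:
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    lsb = (bits >> 16) & 1                     # low bit of the retained mantissa
    rounded = bits + np.uint32(0x7FFF) + lsb   # round to nearest, ties to even
    return (rounded >> 16).astype(np.uint16)   # upper 16 bits = bf16 bit pattern

# Example: 1.0 -> 0x3f80, 3.14159 -> 0x4049 as bf16 bit patterns.
print([hex(v) for v in fp32_to_bf16_bits(np.array([1.0, 3.14159], np.float32))])
```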

Collaborator

Thanks for the clarification! Glad that I read your intentions correctly! Yes, I was thinking just like you: without this conversion, LLVM would not be able to vectorize. So, let's call it "vectorize" fp-to-fp.

Yes, DecomposeFpToFp makes more sense.

@ienkovich ienkovich force-pushed the ienkovich/bf16-casts branch from b2675ef to 98b4a21 on June 24, 2024 16:42
@minjang minjang changed the title [CPU] Add conversion for unsupported BF16 ops [CPU] Add conversion for unsupported BF16 ops via target-specific stage Jun 25, 2024
Collaborator

@minjang minjang left a comment

Thanks for providing more context. Having the target-specific TTCIR (TTTCIR) looks good! Oh, and thanks for removing the unused code, which I should have done myself.

(I updated the title as this is a non-trivial design decision.)

@minjang minjang merged commit f5b94d3 into main Jun 25, 2024
2 of 5 checks passed
Devjiu pushed a commit to Devjiu/triton-cpu that referenced this pull request Aug 13, 2024
…ge (triton-lang#27)

* Remove unused code.

Signed-off-by: Ilya Enkovich <[email protected]>

* Fma is always allowed on CPU.

Signed-off-by: Ilya Enkovich <[email protected]>

* Add unsupported op conversions for BF16 type.

Signed-off-by: Ilya Enkovich <[email protected]>

---------

Signed-off-by: Ilya Enkovich <[email protected]>
int3 pushed a commit that referenced this pull request Aug 29, 2024
…ge (#27)

minjang pushed a commit that referenced this pull request Sep 22, 2024
…ge (#27)
@minjang minjang deleted the ienkovich/bf16-casts branch September 25, 2024 05:08
minjang pushed a commit that referenced this pull request Oct 22, 2024
…ge (#27)

minjang pushed a commit that referenced this pull request Oct 24, 2024
…ge (#27)

int3 pushed a commit that referenced this pull request Dec 6, 2024
…ge (#27)

ienkovich added a commit that referenced this pull request Dec 6, 2024
…ge (#27)