[CPU] Add conversion for unsupported BF16 ops via target-specific stage #27

Merged
merged 3 commits into main from ienkovich/bf16-casts
Jun 25, 2024

Conversation

ienkovich
Collaborator

On machines with no native BF16 support, we encounter a link problem due to the missing __truncsfbf2 symbol, which LLVM uses to convert FP32 to BF16. Tests work on BF16-enabled machines and on those with libgcc 13, where this symbol was added.

We could add a runtime library providing the required symbol to support machines with older libgcc, but I chose to perform the conversions in MLIR for efficiency: LLVM wouldn't use vector instructions for the conversion. Also, it introduces conversions in tests where they aren't really required (e.g. in test_where[bfloat16]).

This patch introduces a new pipeline stage, TritonCPUOpt, meant to decompose higher-level (or unsupported) vector operations into lower-level ones. I added passes to transform some basic BF16 operations and to perform a vectorized FP32->BF16 conversion. The passes are executed conditionally, only on x86 targets without the AVX512BF16 extension.

With this patch, all enabled tests pass on machines with and without AVX512BF16 support. Our CI workflow should finally pass with these changes.
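
Roughly, the new stage is gated in the backend's Python pipeline as sketched below. This is not verbatim code from the PR; it assumes the surrounding compiler driver context (the `ir` and `cpu` modules), and the `cpu.passes.ttcpuir.*` binding names are placeholders, not the exact names added by this patch.

```python
# Sketch of the target-specific stage in the CPU backend's compiler driver.
# NOTE: the cpu.passes.ttcpuir.* binding names below are hypothetical.
def make_ottcir(self, mod, metadata, opt):
    # TTCIR -> Optimized (target-specific) TTCIR
    pm = ir.pass_manager(mod.context)
    pm.enable_debug()
    if self.cpu_arch == "x86_64" and "avx512bf16" not in self.cpu_features:
        # Decompose BF16 ops that the target cannot handle natively.
        cpu.passes.ttcpuir.add_convert_unsupported_ops(pm)   # hypothetical binding
        # Emit a vectorized FP32 -> BF16 conversion instead of per-element libcalls.
        cpu.passes.ttcpuir.add_decompose_fp_conversions(pm)  # hypothetical binding
    pm.run(mod)
    return mod
```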

@ienkovich ienkovich requested a review from minjang June 18, 2024 03:01
@ienkovich ienkovich requested a review from ptillet as a code owner June 18, 2024 03:01
@minjang minjang requested review from digantdesai and embg June 18, 2024 18:49
# TTCIR -> Optimized TTCIR
pm = ir.pass_manager(mod.context)
pm.enable_debug()
if self.cpu_arch == "x86_64" and "avx512bf16" not in self.cpu_features:
Collaborator

I am assuming that on CPUs with avx512bf16 you can generate vector instructions which handle bf16 natively. My question is about leveraging mixed-precision instructions, i.e. vdpbf16ps, which takes two bf16 operands and accumulates into an fp32. IIUC that is not currently supported through MLIR::vector::contract, and you may have to jump through similar dtype-conversion hoops partly for lowering purposes. Is my understanding accurate? Since we are looking into GEM[MV], this is quite relevant.
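
(For reference, a NumPy model of what vdpbf16ps computes: each fp32 accumulator lane adds the dot product of the two bf16 elements packed into the corresponding 32-bit lanes of the inputs. `dpbf16ps_model` is just an illustrative name; bf16 inputs are emulated as fp32, which represents every bf16 value exactly.)

```python
import numpy as np

def dpbf16ps_model(acc: np.ndarray, a: np.ndarray, b: np.ndarray) -> np.ndarray:
    # acc: (N,) fp32 accumulators; a, b: (2N,) bf16 inputs emulated in fp32.
    # Lane i becomes acc[i] + a[2i]*b[2i] + a[2i+1]*b[2i+1].
    prod = (a.astype(np.float32) * b.astype(np.float32)).reshape(-1, 2)
    return acc.astype(np.float32) + prod.sum(axis=1, dtype=np.float32)

# Example: one 512-bit register holds 16 fp32 accumulators and 32 bf16 inputs.
acc = np.zeros(16, dtype=np.float32)
a = np.ones(32, dtype=np.float32)       # stand-ins for bf16 values
b = np.full(32, 2.0, dtype=np.float32)
print(dpbf16ps_model(acc, a, b))        # -> 16 lanes of 4.0
```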

Collaborator Author

Yes, in all cases where the existing lowering doesn't give us the expected result (or doesn't work at all), we can handle it through such passes. If we can lower to lower-level vector operations or target-specific vector operations, then I think we should go that way. Otherwise, we would have to lower directly to LLVM intrinsics.

Unfortunately, I don't see vdpbf16ps exposed through any of the vector dialects (there is only tdpbf16ps), so we should either add such an operation (e.g. to a new triton::cpu::x86 dialect) or lower directly to an LLVM intrinsic in the appropriate pipeline.

Collaborator

Just a general comment: for now, I think this hard-coded check (explicitly checking avx512bf16) is okay. Maybe once we want to support ARM or other extended ISAs, we can have a nicer check method? I still don't think we need a separate third-party backend; a single cpu backend is still better. I think we can abstract away some target information into something like TargetInfo.cpp, as in the GPU backends.

Collaborator Author

We are likely to have increasingly divergent compilation flows for different targets, so I think we should introduce target-specific pipelines in compile.py. Something similar to the existing TargetInfo might also be useful, but at a lower level.

@ienkovich ienkovich force-pushed the ienkovich/bf16-casts branch from 825ca0f to b2675ef on June 21, 2024 20:52
Collaborator

@minjang minjang left a comment

Thanks so much for the work!

I have some questions and suggestions (naming, again). And I wish you could add a test case.

@@ -86,6 +89,19 @@ def make_ttcir(mod, metadata, opt):
metadata["cluster_dims"] = (opt.cluster_dims[0], opt.cluster_dims[1], opt.cluster_dims[2])
return mod

def make_ottcir(self, mod, metadata, opt):
Collaborator

In the future, do you think this "optimized" TTCIR will have more "optimizations"? To me, at least for now, this BF16 PR doesn't look like an optimization. Isn't it (supporting unsupported BF16) more like a transformation?

So, until we have stronger reasons to have a separate pass, what about just putting this code in make_ttcir? Actually, make_ttir can also be seen as a make_ottir. So I do understand your intention.

Both amd and nvidia do similar transformations in make_ttgir. Again, it doesn't mean their approach is correct, but just for consistency?

Collaborator Author

These transformations are done as an optimization for targets with no native BF16 support. LLVM is able to compile BF16 operations even without BF16 target support, but the resulting code would be suboptimal. E.g. BF16->FP32 conversions are done using scalar intrinsic calls, which should be much slower than a vectorized conversion.

The CPU backend differs a lot from the existing backends: it uses more gradual lowering and supports multiple target architectures. I don't see why we should try to mimic the GPU backends.

My vision of compilation stages is the following:

  1. Transform TTIR to vectors using high-level target-independent vector operations.
  2. Perform target-specific transformations and optimizations. Higher-level operations are lowered to lower-level ones, including target-specific dialects.
  3. Lower to LLVM dialect.

We could merge the middle part into its neighbors, but I don't see what benefits it would give.

Collaborator

Thank you for giving more background!

I think "target-specific" TTCIR sounds much better than optimizations. Having the TTTCIR is a good compromise of two extremes: (1) totally having a new third-party like intel-cpu or arm-cpu; (2) a single pipeline but with a thin layer like TargetInfo.


I was thinking that, as of now, SIMD within a register is still pretty much the same for ARM/x86. This BF16 issue is more a capability/feature of CPU architectures, not necessarily specific to Intel/AVX512? So it could be easily abstracted away and handled in a target-independent way.

Yeah, I agree that having a target-specific stage (which allows us to have target-specific dialects) is valid.

@@ -1,2 +1,3 @@
add_subdirectory(TritonCPUOpt)
Collaborator

My 2 cents on the naming: what about using TritonCPUTransforms for better consistency? amd has TritonAMDGPUTransforms, which does transformations and optimizations. I'd like to stay as consistent as possible with the GPU backends. That doesn't mean the GPU backends are always correct, though.

Collaborator Author

Sounds good

}
};

struct VectorizeFpToFp
Collaborator

I understand ConvertUnsupportedOps, but I didn't fully understand VectorizeFpToFp. Maybe a follow-up PR could add some unit test cases for this PR?

I'm not sure whether I got it right. This is my understanding:

  • ConvertUnsupportedOps introduces TruncFOp at the result.
  • This VectorizeFpToFp/Fp32ToBf16Conversion handles the above TruncFOps?

If this is true, can we do this work in ConvertUnsupportedOps? Yes, it can still be a separate pass, but the Vectorize naming wasn't easy to understand.

Collaborator Author

Our CI already covers this code; these passes fix multiple BF16 tests on a platform without AVX512BF16 and with a GCC version < 13.

I see that the name VectorizeFpToFp is confusing. The name comes from the fact that if this transformation is not applied, LLVM performs the conversion element by element using a scalar intrinsic, so this pass produces vectorized code for FP conversions instead. What about DecomposeFpToFp? This pass is also going to be used for FP8 conversion lowering, etc.
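
(For illustration, the elementwise computation being vectorized is the standard FP32->BF16 truncation with round-to-nearest-even. Below is a NumPy sketch of that bit manipulation applied to a whole vector at once; it is not the pass output itself, and it ignores corner cases such as NaN handling.)

```python
import numpy as np

def fp32_to_bf16_bits(x: np.ndarray) -> np.ndarray:
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    lsb = (bits >> 16) & 1                     # low bit of the retained mantissa
    rounded = bits + np.uint32(0x7FFF) + lsb   # round to nearest, ties to even
    return (rounded >> 16).astype(np.uint16)   # upper 16 bits = bf16 bit pattern

# Example: 1.0 -> 0x3f80, 3.14159 -> 0x4049 as bf16 bit patterns.
print([hex(v) for v in fp32_to_bf16_bits(np.array([1.0, 3.14159], np.float32))])
```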

Collaborator

Thanks for the clarification! Glad that I read your intentions correctly! Yes, I was thinking just like you: without this conversion, LLVM would not be able to vectorize. So, let's call it "vectorize" fp-to-fp.

Yes, DecomposeFpToFp makes more sense.

@ienkovich ienkovich force-pushed the ienkovich/bf16-casts branch from b2675ef to 98b4a21 on June 24, 2024 16:42
@minjang minjang changed the title [CPU] Add conversion for unsupported BF16 ops [CPU] Add conversion for unsupported BF16 ops via target-specific stage Jun 25, 2024
Collaborator

@minjang minjang left a comment

Thanks for providing more context. Having the target-specific TTCIR (TTTCIR) looks good! Oh, and thanks for removing the unused code, which I should have done myself.

(I updated the title as this is a non-trivial design decision.)

@minjang minjang merged commit f5b94d3 into main Jun 25, 2024
2 of 5 checks passed
Devjiu pushed a commit to Devjiu/triton-cpu that referenced this pull request Aug 13, 2024
…ge (triton-lang#27)

* Remove unused code.

Signed-off-by: Ilya Enkovich <[email protected]>

* Fma is always allowed on CPU.

Signed-off-by: Ilya Enkovich <[email protected]>

* Add unsupported op conversions for BF16 type.

Signed-off-by: Ilya Enkovich <[email protected]>

---------

Signed-off-by: Ilya Enkovich <[email protected]>
int3 pushed a commit that referenced this pull request Aug 29, 2024
…ge (#27)

minjang pushed a commit that referenced this pull request Sep 22, 2024
…ge (#27)
@minjang minjang deleted the ienkovich/bf16-casts branch September 25, 2024 05:08
minjang pushed a commit that referenced this pull request Oct 22, 2024
…ge (#27)

minjang pushed a commit that referenced this pull request Oct 24, 2024
…ge (#27)

int3 pushed a commit that referenced this pull request Dec 6, 2024
…ge (#27)

ienkovich added a commit that referenced this pull request Dec 6, 2024
…ge (#27)