
Revamp the generation of runtime division checks on ARM64 #111543

Draft: snickolls-arm wants to merge 1 commit into main

Conversation

snickolls-arm (Contributor)

Fixes #64795

This patch introduces a new compilation phase that walks the GenTrees looking for GT_DIV/GT_UDIV nodes on integral types and morphs the code to introduce the necessary conformance checks (overflow/divide-by-zero) early in the compilation pipeline. Currently these checks are added during the Emit phase, which means no optimizations run on the code that gets introduced.

The aim is to allow the compiler to make decisions about code placement and instruction selection for these checks. For example, on ARM64 this enables certain scenarios to choose the cbz instruction over cmp/beq, which can lead to more compact code. It also allows some of the comparisons in the checks to be hoisted out of loops.

The dotnet-issue-labeler bot added the area-CodeGen-coreclr label (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) on Jan 17, 2025.
The dotnet-policy-service bot added the community-contribution label (indicates that the PR has been added by a community member) on Jan 17, 2025.
@snickolls-arm (Contributor, Author)

@kunalspathak @a74nh

This is WIP. Rather than adding new node types, I've taken the approach of adding a pass that modifies the HIR.

The pass runs through all of the code in the function looking for GT_DIV/GT_UDIV nodes. On ARM64 we need to run this after morph so that we catch any GT_DIV nodes that might have been introduced by conversions such as the MOD to SUB-MUL-DIV transformation. If the pass encounters a GT_DIV node, it uses fgSplitBlockBeforeTree to ensure any side effects of the tree run before the runtime check. It then adds the runtime checks to the graph just after these side effects, but before the actual division occurs.
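
Roughly, the shape I have in mind looks like this (a sketch only, not the actual patch; the phase name is hypothetical and the splitting/insertion logic is abbreviated to comments):

PhaseStatus Compiler::fgExpandDivisionChecks() // hypothetical phase name
{
    bool modified = false;

    for (BasicBlock* const block : Blocks())
    {
        for (Statement* const stmt : block->Statements())
        {
            for (GenTree* const tree : stmt->TreeList())
            {
                if (tree->OperIs(GT_DIV, GT_UDIV) && varTypeIsIntegral(tree))
                {
                    // 1. Split the block before the division so any side effects
                    //    ordered before it still execute first.
                    // 2. Insert the conditional check block(s) and throw block(s)
                    //    (divide-by-zero, plus overflow for signed DIV) between the
                    //    split point and the block containing the division.
                    // 3. Continue from the block that now holds the division, since
                    //    the flow graph and statement were just rewritten.
                    modified = true;
                }
            }
        }
    }

    return modified ? PhaseStatus::MODIFIED_EVERYTHING : PhaseStatus::MODIFIED_NOTHING;
}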

The added HIR looks like this for the signed overflow check, for example. This is checking for (dividend < 0 && divisor == -1), which should throw an overflow exception.

------------ BB06 [0005] [???..???) -> BB07(0.01),BB05(0.99) (cond), preds={BB02} succs={BB05,BB07}

***** BB06 [0005]
STMT00007 ( ??? ... ??? )
               [000032] -----------                         *  JTRUE     void
               [000030] J----------                         \--*  EQ        int
               [000028] -----------                            +--*  AND       int
               [000025] -----------                            |  +--*  EQ        int
               [000022] -----------                            |  |  +--*  LCL_VAR   int    V01 arg1
               [000024] -----------                            |  |  \--*  CNS_INT   int    -1
               [000027] -----------                            |  \--*  LT        int
               [000023] -----------                            |     +--*  LCL_VAR   int    V03 loc0
               [000026] -----------                            |     \--*  CNS_INT   int    0
               [000029] -----------                            \--*  CNS_INT   int    1

------------ BB07 [0006] [???..???) (throw), preds={BB06} succs={}

***** BB07 [0006]
STMT00006 ( ??? ... ??? )
               [000031] --CXG------                         *  CALL help void   CORINFO_HELP_OVERFLOW
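
For reference, a condition tree of that shape can be built with the usual constructors. This is only a sketch: checkBlock, dividendUse and divisorUse are assumed to already exist as spilled copies, and (as the review below points out) the dividend test should really be against MinValue rather than < 0.

// divisor == -1
GenTree* divisorIsM1 = gtNewOperNode(GT_EQ, TYP_INT, divisorUse, gtNewIconNode(-1, TYP_INT));
// dividend < 0 (should be dividend == MinValue, see the review comments)
GenTree* dividendNeg = gtNewOperNode(GT_LT, TYP_INT, dividendUse, gtNewIconNode(0, TYP_INT));
GenTree* bothTrue    = gtNewOperNode(GT_AND, TYP_INT, divisorIsM1, dividendNeg);
GenTree* cond        = gtNewOperNode(GT_EQ, TYP_INT, bothTrue, gtNewIconNode(1, TYP_INT));
cond->gtFlags       |= GTF_RELOP_JMP_USED;
GenTree* jumpTree    = gtNewOperNode(GT_JTRUE, TYP_VOID, cond);
fgInsertStmtAtEnd(checkBlock, fgNewStmtFromTree(jumpTree));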

Here's the example @kunalspathak mentioned in #64795:

// See https://aka.ms/new-console-template for more information
using System;

namespace MyApp
{
    internal class Program
    {
        public static int issue2(int x, int y, int z)
        {
            int result = x;
            for (int i = 0; i < z; i++)
            {
                //result = x % y; <-- this hoists properly because both dividend and divisor are invariant.
                result = result % y;
            }
            return result;
        }

        static void Main(string[] args)
        {
            var rand = new Random(1234);
            Console.WriteLine(issue2(rand.Next(), rand.Next(), rand.Next()));
        }
    }
}

Before the change:

; Total bytes of code 80, prolog size 8, PerfScore 81.00, instruction count 24, allocated bytes for code 80 (MethodHash=3a9665a0) for method MyApp.Program:issue2(int,int,int):int (FullOpts)
; ============================================================

*************** After end code gen, before unwindEmit()
G_M39519_IG01:        ; func=00, offs=0x000000, size=0x0008, bbWeight=1, PerfScore 1.50, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, nogc <-- Prolog IG

IN0015: 000000      stp     fp, lr, [sp, #-0x10]!
IN0016: 000004      mov     fp, sp

G_M39519_IG02:        ; offs=0x000008, size=0x0008, bbWeight=1, PerfScore 1.50, gcrefRegs=0000 {}, byrefRegs=0000 {}, BB01 [0000], byref, isz

IN0001: 000008      cmp     w2, #0
IN0002: 00000C      ble     G_M39519_IG06

G_M39519_IG03:        ; offs=0x000010, size=0x0000, bbWeight=0.25, PerfScore 0.00, gcrefRegs=0000 {}, byrefRegs=0000 {}, BB02 [0005], byref, isz

IN0003: 000010      align   [0 bytes for IG04]
IN0004: 000010      align   [0 bytes]
IN0005: 000010      align   [0 bytes]
IN0006: 000010      align   [0 bytes]

G_M39519_IG04:        ; offs=0x000010, size=0x0018, bbWeight=4, PerfScore 18.00, gcrefRegs=0000 {}, byrefRegs=0000 {}, BB03 [0001], byref, isz

IN0007: 000010      cmp     w1, #0
IN0008: 000014      beq     G_M39519_IG07
IN0009: 000018      cmn     w1, #1
IN000a: 00001C      bne     G_M39519_IG05
IN000b: 000020      cmp     w0, #1
IN000c: 000024      bvs     G_M39519_IG08

G_M39519_IG05:        ; offs=0x000028, size=0x0010, bbWeight=4, PerfScore 58.00, gcrefRegs=0000 {}, byrefRegs=0000 {}, loop=IG04, BB03 [0001], byref, isz

IN000d: 000028      sdiv    w3, w0, w1
IN000e: 00002C      msub    w0, w3, w1, w0
IN000f: 000030      sub     w2, w2, #1
IN0010: 000034      cbnz    w2, G_M39519_IG04

G_M39519_IG06:        ; offs=0x000038, size=0x0008, bbWeight=1, PerfScore 2.00, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, epilog, nogc

IN0017: 000038      ldp     fp, lr, [sp], #0x10
IN0018: 00003C      ret     lr

G_M39519_IG07:        ; offs=0x000040, size=0x0008, bbWeight=0, PerfScore 0.00, gcVars=0000000000000000 {}, gcrefRegs=0000 {}, byrefRegs=0000 {}, BB06 [0007], gcvars, byref

IN0011: 000040      bl      CORINFO_HELP_THROWDIVZERO
IN0012: 000044      brk     #0

G_M39519_IG08:        ; offs=0x000048, size=0x0008, bbWeight=0, PerfScore 0.00, gcrefRegs=0000 {}, byrefRegs=0000 {}, BB07 [0008], byref

IN0013: 000048      bl      CORINFO_HELP_OVERFLOW
IN0014: 00004C      brk     #0

After the change:

; Total bytes of code 84, prolog size 8, PerfScore 79.25, instruction count 25, allocated bytes for code 84 (MethodHash=3a9665a0) for method MyApp.Program:issue2(int,int,int):int (FullOpts)
; ============================================================

*************** After end code gen, before unwindEmit()
G_M39519_IG01:        ; func=00, offs=0x000000, size=0x0008, bbWeight=1, PerfScore 1.50, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, nogc <-- Prolog IG

IN0016: 000000      stp     fp, lr, [sp, #-0x10]!
IN0017: 000004      mov     fp, sp

G_M39519_IG02:        ; offs=0x000008, size=0x0008, bbWeight=1, PerfScore 1.50, gcrefRegs=0000 {}, byrefRegs=0000 {}, BB01 [0000], byref, isz

IN0001: 000008      cmp     w2, #0
IN0002: 00000C      ble     G_M39519_IG05

G_M39519_IG03:        ; offs=0x000010, size=0x0008, bbWeight=0.25, PerfScore 0.25, gcrefRegs=0000 {}, byrefRegs=0000 {}, BB02 [0011], byref, isz

IN0003: 000010      cmn     w1, #1
IN0004: 000014      cset    x3, eq
IN0005: 000018      align   [0 bytes for IG04]
IN0006: 000018      align   [0 bytes]
IN0007: 000018      align   [0 bytes]
IN0008: 000018      align   [0 bytes]

G_M39519_IG04:        ; offs=0x000018, size=0x0024, bbWeight=4, PerfScore 74.00, gcrefRegs=0000 {}, byrefRegs=0000 {}, loop=IG04, BB03 [0001], BB04 [0004], BB05 [0007], byref, isz

IN0009: 000018      lsr     w4, w0, #31
IN000a: 00001C      and     w4, w3, w4
IN000b: 000020      cmp     w4, #1
IN000c: 000024      beq     G_M39519_IG07
IN000d: 000028      cbz     w1, G_M39519_IG06
IN000e: 00002C      sdiv    w4, w0, w1
IN000f: 000030      msub    w0, w4, w1, w0
IN0010: 000034      sub     w2, w2, #1
IN0011: 000038      cbnz    w2, G_M39519_IG04

G_M39519_IG05:        ; offs=0x00003C, size=0x0008, bbWeight=1, PerfScore 2.00, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, epilog, nogc

IN0018: 00003C      ldp     fp, lr, [sp], #0x10
IN0019: 000040      ret     lr

G_M39519_IG06:        ; offs=0x000044, size=0x0008, bbWeight=0, PerfScore 0.00, gcVars=0000000000000000 {}, gcrefRegs=0000 {}, byrefRegs=0000 {}, BB07 [0009], gcvars, byref

IN0012: 000044      bl      CORINFO_HELP_THROWDIVZERO
IN0013: 000048      brk     #0

G_M39519_IG07:        ; offs=0x00004C, size=0x0008, bbWeight=0, PerfScore 0.00, gcrefRegs=0000 {}, byrefRegs=0000 {}, BB06 [0006], byref

IN0014: 00004C      bl      CORINFO_HELP_OVERFLOW
IN0015: 000050      brk     #0

The main difference is at label IG04: rather than a fixed sequence of compare and branch instructions chosen at the emit stage, the compiler has decided to build a logical expression for the overflow check and to emit a cbz for the divide-by-zero check. The loop hoisting optimization has determined that the test for (divisor == -1) can be performed outside of the loop, saving an instruction inside the loop; this is computed in IG03. Building a logical expression instead of a branch sequence has also allowed the compiler to perform these checks with two compare-and-branch sequences instead of three.

The approach works well when:
• The trees containing GT_DIV don't have many side effects, since these have to be split out, which can result in spilling, especially in MinOpts.
• GT_DIV occurs in a loop, as some of the expression tree for the check can now be hoisted outside the loop.
• There are a lot of GT_DIV nodes in a function, as the compiler now seems to choose cbz more often than cmp/beq.

It seems to have an adverse effect on MinOpts though, because splitting the tree will often spill and there aren't any optimization passes running afterwards to clean up these spills.

At the moment I haven't focused on the efficiency of the pass itself, but I believe it could be improved. I could borrow the recursive traversals from the earlier morph phase to build a work-list of the places where checks need to be added. The pass could then be linear over a pre-built list of nodes rather than searching inside a loop. I would just have to be careful to update the locations of the nodes after any trees are split, but I think this should be possible.
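
For example, something along these lines could collect the work list in a single post-order walk per statement (a sketch; the struct and callback names are made up):

struct DivCheckWorkList
{
    ArrayStack<GenTree*>* nodes;
};

static Compiler::fgWalkResult CollectDivNodes(GenTree** use, Compiler::fgWalkData* data)
{
    GenTree* node = *use;
    if (node->OperIs(GT_DIV, GT_UDIV) && varTypeIsIntegral(node))
    {
        ((DivCheckWorkList*)data->pCallbackData)->nodes->Push(node);
    }
    return Compiler::WALK_CONTINUE;
}

// per statement:
//   ArrayStack<GenTree*> nodes(getAllocator(CMK_ArrayStack));
//   DivCheckWorkList wl{&nodes};
//   fgWalkTreePost(stmt->GetRootNodePointer(), CollectDivNodes, &wl);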

I've also had to make a temporary fix for a problem in the tree-splitting code where it wasn't correctly updating the node flags after splitting out side effects. After splitting the tree, I traverse it post-order to update all of the flags. There might be a more efficient way of doing this.
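
The fix-up currently looks roughly like this (a sketch; there may already be an existing helper such as gtUpdateStmtSideEffects that covers this):

static Compiler::fgWalkResult RefreshSideEffects(GenTree** use, Compiler::fgWalkData* data)
{
    // Recompute this node's side-effect flags from its (already-visited) operands.
    data->compiler->gtUpdateNodeSideEffects(*use);
    return Compiler::WALK_CONTINUE;
}

// after gtSplitTree, for the statement that was split:
//   fgWalkTreePost(stmt->GetRootNodePointer(), RefreshSideEffects);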

@snickolls-arm (Contributor, Author)

I think the build is failing in Release mode due to the use of GenTree::gtTreeID, so I'll need to look into having access to this, or some similar identifier, in all modes since it is part of the algorithm.

@kunalspathak (Member)

can you also eliminate the regressions?

[image: screenshot of regression diffs]

@jakobbotsch (Member)

I think the build is failing in Release mode due to the use of GenTree::gtTreeID, so I'll need to look into having access to this, or some similar identifier, in all modes since it is part of the algorithm.

What do you need this for? Increasing the size of GenTree is hard to justify. I do not think this transformation qualifies. Most likely you have other options.

@snickolls-arm (Contributor, Author)

What do you need this for? Increasing the size of GenTree is hard to justify. I do not think this transformation qualifies. Most likely you have other options.

I need a way of uniquely identifying a GenTree node so I can record that I've already visited the node and added the runtime checks. Is it possible to use the address of the node? There would need to be a guarantee that the address is unique within the function compilation.

can you also eliminate the regressions?

I think many of the regressions are caused by spilling in gtSplitTree, but there are also some individual regressions that I'd need to look into case-by-case.

My options for continuing this are:

  1. Look into making this transform produce the same code as before in MinOpts.
  2. Make this pass run only for FullOpts and revert to existing code for MinOpts.

I think it's sensible to allow the compiler to have a view of all of this code being added early on in the pipeline, but this might not make sense in the tiering model. So option 2 could be a good compromise. I'll need to look into individual regression cases for both options regardless of choice. I would be grateful for any opinions on this and the approach in general.

impCloneExpr(divisor, &divisorCopy, CHECK_SPILL_NONE, nullptr DEBUGARG("cloned for runtime check"));
impCloneExpr(dividend, &dividendCopy, CHECK_SPILL_NONE, nullptr DEBUGARG("cloned for runtime check"));

// (dividend < 0 && divisor == -1)
Member

Shouldn't this be dividend == MinValue and divisor == -1?

@snickolls-arm (Contributor, Author)

Yes, I misinterpreted the exception case and this likely explains some of the issues I'm seeing. Thanks.

code.block = divBlock;
code.stmt = divBlock->firstStmt();
code.tree = tree;
}
Member

Seems like something here should be setting GTF_DIV_MOD_NO_BY_ZERO and GTF_DIV_MOD_NO_OVERFLOW on the DIV node.

@snickolls-arm (Contributor, Author)

I'm relying on the settings of these flags from the morph stage, e.g. in fgMorphSmpOp (morph.cpp:8584). When I call GenTree::OperExceptions I'm implicitly checking these flags too. As I'm not doing any further processing of the types of the operands, I don't think I can make that decision in this pass, unless I'm missing something?
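
In other words, the pass only asks the node which exceptions it can still raise, roughly like this (a sketch; the exact ExceptionSetFlags enumerators are from memory and may need checking):

ExceptionSetFlags exSet = tree->OperExceptions(this);

if ((exSet & ExceptionSetFlags::DivideByZeroException) != ExceptionSetFlags::None)
{
    // insert the divisor == 0 check
}
if ((exSet & ExceptionSetFlags::ArithmeticException) != ExceptionSetFlags::None)
{
    // insert the MinValue / -1 overflow check
}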

@jakobbotsch (Member)

I need a way of uniquely identifying a GenTree node so I can record that I've already visited the node and added the runtime checks. Is it possible to use the address of the node? There would need to be a guarantee that the address is unique within the function compilation.

Yes, the address of nodes can be used; see e.g. NodeInternalRegisters for a hash table keyed on GenTree addresses (the JIT does not have a set type, so you could use a bool-valued hash table).
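
A minimal version of that bool-valued table would look something like this (sketch):

typedef JitHashTable<GenTree*, JitPtrKeyFuncs<GenTree>, bool> NodeSet;

NodeSet visited(getAllocator(CMK_Generic));

// mark a node as processed:
visited.Set(tree, true);

// test whether it has been seen:
bool alreadyVisited = visited.Lookup(tree);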

However, most likely there is no need for any form of "visited" check at all; instead you can shape your pass so that it visits all IR exactly once. See e.g. the various helper expansions in helperexpansion.cpp; those are shaped so that they visit all IR once while allowing for expansion of internal nodes into control flow.

I think many of the regressions are caused by spilling in gtSplitTree, but there are also some individual regressions that I'd need to look into case-by-case.

My options for continuing this are:

  1. Look into making this transform produce the same code as before in MinOpts.
  2. Make this pass run only for FullOpts and revert to existing code for MinOpts.

I think it's sensible to allow the compiler to have a view of all of this code being added early on in the pipeline, but this might not make sense in the tiering model. So option 2 could be a good compromise. I'll need to look into individual regression cases for both options regardless of choice. I would be grateful for any opinions on this and the approach in general.

I agree that (2) would be most reasonable.

You may want to experiment with some alternative, simpler and cheaper ways of accomplishing what this pass is doing. One thing that comes to mind is expanding the checks as QMARKs during import. That is, instead of creating a GT_DIV/GT_MOD node, you would create a shape like

QMARK(dividend == MinValue & divisor == -1, CALL CORINFO_HELP_OVERFLOW, QMARK(divisor == 0, CALL CORINFO_HELP_THROWDIVZERO, DIV(dividend, divisor)))

(marking the division node with GTF_DIV_MOD_NO_OVERFLOW | GTF_DIV_MOD_NO_BY_ZERO).
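
A rough sketch of the inner QMARK under those assumptions (illustrative only; divisorCopy stands for a safe copy of the divisor, the throwing arm is given the division's type via a COMMA so both colon arms match, and spilling the whole QMARK to a temp at import time is omitted):

GenTree* div = gtNewOperNode(GT_DIV, TYP_INT, dividend, divisor);
div->gtFlags |= (GTF_DIV_MOD_NO_OVERFLOW | GTF_DIV_MOD_NO_BY_ZERO);

// divisor == 0 ? (throw DivideByZeroException, 0) : dividend / divisor
GenTree*      zeroCond  = gtNewOperNode(GT_EQ, TYP_INT, divisorCopy, gtNewIconNode(0));
GenTree*      throwZero = gtNewOperNode(GT_COMMA, TYP_INT,
                                        gtNewHelperCallNode(CORINFO_HELP_THROWDIVZERO, TYP_VOID),
                                        gtNewIconNode(0));
GenTreeColon* colon     = new (this, GT_COLON) GenTreeColon(TYP_INT, throwZero, div);
GenTree*      zeroQmark = gtNewQmarkNode(TYP_INT, zeroCond, colon);

// The outer QMARK testing (dividend == MinValue && divisor == -1), with
// CORINFO_HELP_OVERFLOW in its 'then' arm, wraps zeroQmark the same way.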

Fixes dotnet#64795

This patch wraps GT_DIV/GT_UDIV nodes on integral types with GT_QMARK
trees that contain the necessary conformance checks (overflow/divide-by-zero)
when compiling with FullOpts enabled. Currently these checks are added
during the Emit phase; this remains the case for MinOpts.

The aim is to allow the compiler to make decisions about code placement
and instruction selection for these checks. For example, on ARM64 this
enables certain scenarios to choose the cbz instruction over cmp/beq,
which can lead to more compact code. It also allows some of the comparisons
in the checks to be hoisted out of loops.
@snickolls-arm (Contributor, Author) commented Jan 24, 2025

Hi @jakobbotsch,

Thanks for the help. I've updated the pull request now with the implementation introducing QMARK nodes at import. I think this is a much cleaner solution.

It's still not fully clear from the diffs exactly how much of the impact comes from the cbz instruction and how much is related to loop hoisting. I've just opened #111797; once this is rebased on that, it should reduce the number of diffs and show some examples of hoisting. If there aren't any examples of hoisting, then this might not show any positive impact until functions that perform DIV are inlined into user code.

It's much clearer now where the spills are occurring; quite a few are related to trying to clone a RETURN_EXPR tree in the divisor/dividend, as assertions fired for this when I replaced impCloneExpr with gtCloneExpr. So to reduce regressions, I would have to try to allow cloning of inline trees.

I'll also need to think about how to make sure MOD nodes are handled as well, because currently this only affects DIV nodes, and MOD nodes don't get morphed to SUB/MUL/DIV until later on.
