[WIP] Ifpack2: Compute residual update #13610
Changes look good to me. I'll do a run myself to make sure that hardcoding the block size and vector length doesn't hurt CUDA performance.
@seanofthemillers I see you have WIP in the PR title, are there other changes you plan to make? You will at least have to amend your commit with the git signoff message ( edit: instead of |
@seanofthemillers After some evaluation on H100, it seems that these changes cause a measurable regression for blocksize=7 and don't improve blocksize=13. I didn't test as many cases as you (just a 100^3 grid and those two block sizes), but I think this needs a guard so that the launch bounds/vector length change only applies to the Kokkos::HIP exec space.
Time in seconds for the overall BlockTriDiagonalSolver, with 100 repeats:
Block size | Before (develop) | After (rebased PR) |
---|---|---|
7 | 0.0443379 | 0.0776897 |
13 | 0.0694024 | 0.0688918 |
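For illustration, here is a minimal sketch of what guarding the hardcoded parameters by execution space could look like. It is not the code from this PR: `CudaLikeSpace`, `HipLikeSpace`, `TeamParams`, `residualTeamParams`, and the numeric values are all hypothetical stand-ins, chosen so the snippet compiles without Kokkos.

```cpp
#include <type_traits>

// Hypothetical stand-ins for execution spaces, so this sketch compiles
// without Kokkos. In Trilinos these would be the real Kokkos exec space types.
struct CudaLikeSpace {};
struct HipLikeSpace {};

// Team launch parameters. use_hardcoded == false means
// "let the runtime choose the team size and vector length".
struct TeamParams {
  bool use_hardcoded;
  int team_size;
  int vector_length;
};

// Assumed placeholder values, NOT the numbers used in the PR.
constexpr int kAssumedTeamSize = 64;
constexpr int kAssumedVectorLength = 8;

// Select launch parameters at compile time based on the execution space.
// Only the HIP-like space gets the hardcoded values; every other space
// keeps the default (automatic) behavior.
template <typename ExecSpace>
constexpr TeamParams residualTeamParams() {
  if constexpr (std::is_same_v<ExecSpace, HipLikeSpace>) {
    return TeamParams{true, kAssumedTeamSize, kAssumedVectorLength};
  } else {
    return TeamParams{false, 0, 0};
  }
}
```

In a real kernel, the stand-in types would presumably be replaced by the actual execution space type, and the non-hardcoded branch would map to letting Kokkos pick the values (for example via `Kokkos::AUTO`). Selecting at compile time keeps the CUDA path identical to develop, which is the case where the regression above was measured.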
(Originally written by Sean Miller for trilinos#13610) Use shared memory and TeamPolicy parameter tweaks to improve BTDS residual performance. Signed-off-by: Brian Kelley <[email protected]>
I'll open a new PR soon that improves on these changes some more. I've spotted some other issues since my review: a team barrier inside a TeamThreadRange, and modifications of captured variables inside parallel_fors. I also think the performance can be improved quite a bit without adding complexity. The other thing was this only applies tweaks to the |
@trilinos/ifpack2
Motivation
This is a work in progress to test changes to the compute residual kernel used by the TriDiag Solver.
@brian-kelley @vbrunini @tcfisher This is a draft of the changes we discussed. It should give a small performance uplift, though I think there is room for further improvement.
Using the Ifpack2 block tridiag benchmark: