Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Ifpack2: Compute residual update #13610

Conversation

seanofthemillers
Copy link
Contributor

@trilinos/ifpack2

Motivation

This is a work in progress to test changes to the compute residual kernel used by the TriDiag Solver.

@brian-kelley @vbrunini @tcfisher This is a draft of the changes we discussed. It should give a small performance uplift, though I think there is room for further improvement.

Using the Ifpack2 block tridiag benchmark:
image

Related Issues

  • Closes put-issue-number-here

Stakeholder Feedback

Testing

@seanofthemillers seanofthemillers requested a review from a team as a code owner November 16, 2024 00:19
@brian-kelley brian-kelley self-requested a review December 12, 2024 16:44
Copy link
Contributor

@brian-kelley brian-kelley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look good to me. I'll do a run myself to make sure that hardcoding the block size and vector length doesn't hurt cuda performance.

@brian-kelley
Copy link
Contributor

brian-kelley commented Dec 12, 2024

@seanofthemillers I see you have WIP in the PR title, are there other changes you plan to make? You will at least have to amend your commit with the git signoff message (git commit -s ...) to pass that check that we require now.

edit: instead of -s you can also just add the text to commit message, "Signed-off-by: $user <$email>" where user and email match what you use for github.

Copy link
Contributor

@brian-kelley brian-kelley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@seanofthemillers After some evaluation on H100, it seems that these changes do cause a measurable regression for blocksize=7, and don't improve blocksize=13. I didn't test as many cases as you (just 100^3 grid and those two block sizes) but I think what this needs is to guard the change with launch bounds/vector length to Kokkos::HIP exec space.

Time in seconds for overall BlockTriDiagonalSolver, with 100 repeats.

Before (develop) After (rebased PR)
block size 7 0.0443379 0.0776897
block size 13 0.0694024 0.0688918

brian-kelley added a commit to brian-kelley/Trilinos that referenced this pull request Jan 22, 2025
(Originally written by Sean Miller for trilinos#13610)
Use shared memory and TeamPolicy parameter tweaks to improve
BTDS residual performance.

Signed-off-by: Brian Kelley <[email protected]>
brian-kelley added a commit to brian-kelley/Trilinos that referenced this pull request Jan 22, 2025
(Originally written by Sean Miller for trilinos#13610)
Use shared memory and TeamPolicy parameter tweaks to improve
BTDS residual performance.

Signed-off-by: Brian Kelley <[email protected]>
@brian-kelley
Copy link
Contributor

brian-kelley commented Jan 23, 2025

I'll open a new PR soon that improves upon these changes some more. I spotted some other issues since my review - team barrier inside a TeamThreadRange, and modifications of captured variables inside parallel_fors. I also think the performance can be improved quite a bit without adding complexity.

The other thing was this only applies tweaks to the OverlapTag version of the kernel, not the AsyncTag. In solves with multiple MPI ranks, the AsyncTag is used by default. The kernels were very similar however, so I am merging them into one general version. That way optimizations will only need to be implemented in one place, and will benefit all cases.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants