Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove disabled SSE4.1 dot product #729

Merged
merged 1 commit into from
Jan 7, 2024

Conversation

argilo
Copy link
Member

@argilo argilo commented Dec 16, 2023

volk_32fc_x2_dot_prod_32fc_u_sse4_1 and volk_32fc_x2_dot_prod_32fc_a_sse4_1 were commented out in #411 because they fail on Windows, and were slower than the SSE3 versions. I had a look into this, and found the reason for the failure. Namely, 0x000000000000000080000000 is too large to fit into a long (which is 32 bits on Windows):

// static const __m128i neg = { 0x000000000000000080000000 };

This value is used as a hacky way to negate real1:

// real1 = _mm_xor_ps(real1, bit128_p(&neg)->float_vec);
// im0 = _mm_add_ps(im0, im1);
// real0 = _mm_add_ps(real0, real1);

Removing the negation and changing the last _mm_add_ps to _mm_sub_ps fixes the problem.

However, the SSE4.1 protokernels are still much slower than the SSE3 versions. I had a look into why, and found that the SSE4.1 protokernels are performing dot products inside the loop. The _mm_dp_ps instruction (new in SSE4.1) is quite slow (1.5 cycles per instruction). The SSE3 protokernels instead perform multiplication (_mm_mul_ps, 0.5 cycles per instruction) inside the loop, and leave the final dot product calculation until after the loop. Using _mm_dp_ps doesn't cut down on the number of additions that need to be performed, so I don't think there's a good reason to use it.

It seems that the poor performance of the SSE4.1 protokernels is inherent to their design, so there's no reason to keep them. So I propose to delete this commented-out code.

@jdemel
Copy link
Contributor

jdemel commented Dec 17, 2023

If we can fix the kernels, it might be interesting to keep them anyways. e.g. as a reference on how to potentially do things but then don't in this case. A case study.

However, we need to face the fact that most users won't run volk_profile. In this case they would potentially end up with the slower kernel.

I'm torn between these options.

@argilo
Copy link
Member Author

argilo commented Dec 17, 2023

I could go with the fix instead, but I think deletion is the better option because these seem like a bad design. Even if _mm_dp_ps was as fast as _mm_mul_ps, I don't see what would be gained by switching to it.

@jdemel jdemel merged commit e527309 into gnuradio:main Jan 7, 2024
32 checks passed
Alesha72003 pushed a commit to Alesha72003/volk that referenced this pull request May 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants