Remove disabled SSE4.1 dot product #729
Merged
`volk_32fc_x2_dot_prod_32fc_u_sse4_1` and `volk_32fc_x2_dot_prod_32fc_a_sse4_1` were commented out in #411 because they fail on Windows, and were slower than the SSE3 versions. I had a look into this, and found the reason for the failure. Namely, `0x000000000000000080000000` is too large to fit into a `long` (which is 32 bits on Windows):

volk/kernels/volk/volk_32fc_x2_dot_prod_32fc.h
Line 319 in af3399f

This value is used as a hacky way to negate `real1`:

volk/kernels/volk/volk_32fc_x2_dot_prod_32fc.h
Lines 365 to 368 in af3399f

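To illustrate the trick: XOR-ing a float's bit pattern with `0x80000000` flips its sign bit, so adding the masked value is the same as subtracting the original. Below is a minimal scalar sketch in portable C. The function names are hypothetical, and the real kernel applies the mask to whole SSE registers rather than to single floats. The mask is stored in a `uint32_t` here because 2147483648 does not fit in a signed 32-bit type.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: negate a float by flipping its IEEE 754 sign bit.
 * The mask lives in an unsigned 32-bit type, since 0x80000000 exceeds the
 * range of a signed 32-bit integer such as long on Windows. */
static float negate_via_sign_mask(float x)
{
    const uint32_t sign_mask = UINT32_C(0x80000000);
    uint32_t bits;

    assert(sizeof bits == sizeof x);
    memcpy(&bits, &x, sizeof bits); /* reinterpret without aliasing issues */
    bits ^= sign_mask;              /* flip only the sign bit */
    memcpy(&x, &bits, sizeof x);
    return x;
}

/* Shape of the old code: negate with the mask, then add. */
static float add_negated(float a, float b)
{
    return a + negate_via_sign_mask(b);
}

/* Shape without the mask: subtract directly. */
static float subtract_directly(float a, float b)
{
    return a - b;
}
```

For any pair of finite inputs, `add_negated(a, b)` and `subtract_directly(a, b)` return the same value, which is why the mask constant is not needed at all.
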
Removing the negation and changing the last `_mm_add_ps` to `_mm_sub_ps` fixes the problem.

However, the SSE4.1 protokernels are still much slower than the SSE3 versions. I had a look into why, and found that the SSE4.1 protokernels perform dot products inside the loop. The `_mm_dp_ps` instruction (new in SSE4.1) is quite slow (1.5 cycles per instruction). The SSE3 protokernels instead perform multiplication (`_mm_mul_ps`, 0.5 cycles per instruction) inside the loop, and leave the final dot product calculation until after the loop. Using `_mm_dp_ps` doesn't cut down on the number of additions that need to be performed, so I don't think there's a good reason to use it.

The poor performance of the SSE4.1 protokernels seems to be inherent to their design, so there's no reason to keep them. I therefore propose deleting this commented-out code.
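
The structural difference between the two designs can be sketched in portable scalar C. This is only an illustration under stated assumptions, not the kernel code: `cf32`, `LANES`, `mul_acc`, and both function names are hypothetical, the inner `k` loops stand in for the lanes of one SSE register, and the sketch assumes a plain (non-conjugated) complex multiply-accumulate. It shows where the reduction happens and does not reproduce the timing difference.

```c
#include <stddef.h>

/* Hypothetical stand-in for VOLK's complex float type. */
typedef struct { float re, im; } cf32;

/* Hypothetical chunk width: two complex floats fill one 128-bit register. */
enum { LANES = 2 };

/* Adds the complex product x*y to the accumulator (*re, *im). */
static void mul_acc(float *re, float *im, cf32 x, cf32 y)
{
    *re += x.re * y.re - x.im * y.im;
    *im += x.re * y.im + x.im * y.re;
}

/* SSE4.1-style shape: every iteration reduces its chunk to a single value. */
static cf32 dot_reduce_in_loop(const cf32 *a, const cf32 *b, size_t n)
{
    cf32 sum = { 0.0f, 0.0f };
    size_t i = 0;

    for (; i + LANES <= n; i += LANES) {
        cf32 chunk = { 0.0f, 0.0f };
        for (size_t k = 0; k < LANES; ++k)
            mul_acc(&chunk.re, &chunk.im, a[i + k], b[i + k]);
        sum.re += chunk.re; /* horizontal reduction paid on every pass */
        sum.im += chunk.im;
    }
    for (; i < n; ++i) /* leftover elements */
        mul_acc(&sum.re, &sum.im, a[i], b[i]);
    return sum;
}

/* SSE3-style shape: per-lane accumulators, reduced once after the loop. */
static cf32 dot_reduce_after_loop(const cf32 *a, const cf32 *b, size_t n)
{
    float acc_re[LANES] = { 0.0f };
    float acc_im[LANES] = { 0.0f };
    cf32 sum = { 0.0f, 0.0f };
    size_t i = 0;

    for (; i + LANES <= n; i += LANES)
        for (size_t k = 0; k < LANES; ++k)
            mul_acc(&acc_re[k], &acc_im[k], a[i + k], b[i + k]);

    for (size_t k = 0; k < LANES; ++k) { /* single reduction at the end */
        sum.re += acc_re[k];
        sum.im += acc_im[k];
    }
    for (; i < n; ++i) /* leftover elements */
        mul_acc(&sum.re, &sum.im, a[i], b[i]);
    return sum;
}
```

Both functions perform the same multiplications; the second only moves the cross-lane additions out of the loop, which is the property the SSE3 protokernels exploit.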