Remove disabled SSE4.1 dot product #729

argilo · 2023-12-16T17:27:01Z

volk_32fc_x2_dot_prod_32fc_u_sse4_1 and volk_32fc_x2_dot_prod_32fc_a_sse4_1 were commented out in #411 because they failed on Windows and were slower than the SSE3 versions. I looked into this and found the reason for the failure: 0x000000000000000080000000 is too large to fit into a long (which is 32 bits on Windows):

volk/kernels/volk/volk_32fc_x2_dot_prod_32fc.h

Line 319 in af3399f

// static const __m128i neg = { 0x000000000000000080000000 };

This value is used as a hacky way to negate real1:

volk/kernels/volk/volk_32fc_x2_dot_prod_32fc.h

Lines 365 to 368 in af3399f

    
           //     real1 = _mm_xor_ps(real1, bit128_p(&neg)->float_vec); 
        
           //     im0 = _mm_add_ps(im0, im1); 
        
           //     real0 = _mm_add_ps(real0, real1);

Removing the negation and changing the last _mm_add_ps to _mm_sub_ps fixes the problem.
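To illustrate why the two forms are interchangeable, here is a minimal portable C sketch. It is not the VOLK code: it works on scalar floats rather than SSE registers, it assumes IEEE 754 single-precision floats, and the helper names (`fits_in_int32`, `flip_sign_bit`, `add_negated`, `subtract`) are made up for this example. It checks that the mask value 0x80000000 lies outside the range of a signed 32-bit integer, and that XORing a float's bit pattern with that mask negates it, so adding the negated value gives the same result as subtracting.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns 1 if v is representable as a signed 32-bit
 * integer (the width of `long` on Windows, per the description above). */
static int fits_in_int32(uint64_t v)
{
    return v <= (uint64_t)INT32_MAX;
}

/* Hypothetical helper: negate x by flipping its sign bit, assuming IEEE 754
 * binary32. This is a scalar analogue of _mm_xor_ps with a sign mask. */
static float flip_sign_bit(float x)
{
    uint32_t bits;
    assert(sizeof bits == sizeof x);
    memcpy(&bits, &x, sizeof bits);
    bits ^= UINT32_C(0x80000000);
    memcpy(&x, &bits, sizeof bits);
    return x;
}

/* Shape of the disabled code: negate via the mask, then add. */
static float add_negated(float a, float b)
{
    return a + flip_sign_bit(b);
}

/* Shape of the fix: subtract directly, so no mask constant is needed. */
static float subtract(float a, float b)
{
    return a - b;
}
```

For any pair of finite floats, `add_negated(a, b)` and `subtract(a, b)` should return the same value, which is why dropping the XOR and switching to `_mm_sub_ps` avoids the problematic constant altogether.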

However, the SSE4.1 protokernels are still much slower than the SSE3 versions. I looked into why, and found that the SSE4.1 protokernels perform dot products inside the loop. The _mm_dp_ps instruction (new in SSE4.1) is quite slow (a throughput of 1.5 cycles per instruction). The SSE3 protokernels instead perform multiplication (_mm_mul_ps, 0.5 cycles per instruction) inside the loop, and leave the final dot product calculation until after the loop. Using _mm_dp_ps doesn't cut down on the number of additions that need to be performed, so I don't think there's a good reason to use it.
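The structural difference between the two designs can be sketched with a scalar model. This is a simplification, not the VOLK kernels: it computes a real-valued (not complex) dot product, uses a four-element array to stand in for one SSE register, and the function names are hypothetical. It only shows where the cross-lane reduction happens; it says nothing about actual timings.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 4 /* one SSE register holds four floats */

/* Structure A: reduce across lanes on every iteration. This models calling
 * _mm_dp_ps inside the loop. n must be a multiple of LANES. */
static float dot_reduce_each_iter(const float* x, const float* y, size_t n)
{
    assert(n % LANES == 0);
    float sum = 0.0f;
    for (size_t i = 0; i < n; i += LANES) {
        float dp = 0.0f;
        for (size_t k = 0; k < LANES; k++) {
            dp += x[i + k] * y[i + k]; /* multiply and horizontal add */
        }
        sum += dp;
    }
    return sum;
}

/* Structure B: keep one partial sum per lane inside the loop (models
 * _mm_mul_ps followed by _mm_add_ps) and do a single cross-lane reduction
 * after the loop. n must be a multiple of LANES. */
static float dot_reduce_after_loop(const float* x, const float* y, size_t n)
{
    assert(n % LANES == 0);
    float acc[LANES] = { 0.0f, 0.0f, 0.0f, 0.0f };
    for (size_t i = 0; i < n; i += LANES) {
        for (size_t k = 0; k < LANES; k++) {
            acc[k] += x[i + k] * y[i + k]; /* lane-wise, no reduction */
        }
    }
    return (acc[0] + acc[1]) + (acc[2] + acc[3]);
}
```

Both functions perform the same number of multiplications and a comparable number of additions; the difference is that structure B pays for the cross-lane reduction once instead of once per iteration. Because floating-point addition is not associative, the two orderings can differ in the last bits for general inputs.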

It seems that the poor performance of the SSE4.1 protokernels is inherent to their design, so I see no reason to keep them and propose deleting this commented-out code.

Signed-off-by: Clayton Smith <[email protected]>

jdemel · 2023-12-17T10:11:31Z

If we can fix the kernels, it might be interesting to keep them anyway, e.g. as a reference for an approach that looks plausible but doesn't pay off in this case. A case study.

However, we need to face the fact that most users won't run volk_profile. In this case they would potentially end up with the slower kernel.

I'm torn between these options.

argilo · 2023-12-17T13:27:05Z

I could go with the fix instead, but I think deletion is the better option because these seem like a bad design. Even if _mm_dp_ps were as fast as _mm_mul_ps, I don't see what would be gained by switching to it.

Remove disabled SSE4.1 dot product

40adedb

Signed-off-by: Clayton Smith <[email protected]>

jdemel merged commit e527309 into gnuradio:main Jan 7, 2024
32 checks passed

Alesha72003 pushed a commit to Alesha72003/volk that referenced this pull request May 15, 2024

Merge pull request gnuradio#729 from argilo/remove-dot-prod-sse41

3fcdded

Remove disabled SSE4.1 dot product
