Test removal of LoopVectorization.jl #83

Open
Moelf opened this issue Oct 25, 2024 · 8 comments · May be fixed by #84

@Moelf
Member

Moelf commented Oct 25, 2024

Since LV.jl is going away in Julia 1.12, maybe we should attempt to remove it.

I'm looking at https://github.com/graeme-a-stewart/JetReconstructionBenchmarks.jl/ and want to find a small set of benchmarks I can run to monitor performance when making changes. @graeme-a-stewart, can you point out a minimal set of things to run using that repo? It's not completely clear what to run (e.g. which example snippets).

@Moelf
Member Author

Moelf commented Oct 25, 2024

On my machine, the version without `@turbo` is faster...

# with @turbo
akako@frame16 ~/D/g/JetReconstructionBenchmarks.jl (main)> julia --project src/benchmark.jl --backend Julia --repeats 30  data/events-pp-13TeV-20GeV.hepmc3
1×5 DataFrame
 Row │ File_path                          File                          mean_particles  n_samples  time_per_event
     │ Any                                String                        Int64           Int64      Float64
─────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ data/events-pp-13TeV-20GeV.hepmc3  events-pp-13TeV-20GeV.hepmc3              -1         16         38.7031

# without @turbo
akako@frame16 ~/D/g/JetReconstructionBenchmarks.jl (main)> julia --project src/benchmark.jl --backend Julia --repeats 30  data/events-pp-13TeV-20GeV.hepmc3
Precompiling JetReconstruction...
  1 dependency successfully precompiled in 2 seconds. 117 already precompiled.
1×5 DataFrame
 Row │ File_path                          File                          mean_particles  n_samples  time_per_event
     │ Any                                String                        Int64           Int64      Float64
─────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ data/events-pp-13TeV-20GeV.hepmc3  events-pp-13TeV-20GeV.hepmc3              -1         16         29.7108

@Moelf
Member Author

Moelf commented Oct 25, 2024

I do see a difference in micro benchmark:

julia> using LoopVectorization, Chairmarks

julia> function fast_findmin(dij, n)
           # findmin(@inbounds @view dij[1:n])
           best = 1
           @inbounds dij_min = dij[1]
           @turbo for here in 2:n
               newmin = dij[here] < dij_min
               best = newmin ? here : best
               dij_min = newmin ? dij[here] : dij_min
           end
           dij_min, best
       end

julia> function setup_bench()
           N = rand(200:500)
           dij = abs.(rand(N))
           first_n = rand(1:N)
           dij, first_n
       end

julia> function basic_findmin(dij, n)
           # findmin(@inbounds @view dij[1:n])
           best = 1
           @inbounds dij_min = dij[1]
           for here in 2:n
               newmin = dij[here] < dij_min
               best = newmin ? here : best
               dij_min = newmin ? dij[here] : dij_min
           end
           dij_min, best
       end

julia> @be setup_bench fast_findmin(_...) evals=1
Benchmark: 24004 samples with 1 evaluation
 min    50.000 ns
 median 90.000 ns
 mean   92.675 ns
 max    681.000 ns

julia> @be setup_bench fast_findmin(_...) evals=1
Benchmark: 20405 samples with 1 evaluation
 min    50.000 ns
 median 90.000 ns
 mean   92.634 ns
 max    1.442 μs

julia> @be setup_bench basic_findmin(_...) evals=1
Benchmark: 7178 samples with 1 evaluation
 min    40.000 ns
 median 240.000 ns
 mean   255.817 ns
 max    782.000 ns

julia> @be setup_bench basic_findmin(_...) evals=1
Benchmark: 25737 samples with 1 evaluation
 min    40.000 ns
 median 241.000 ns
 mean   260.301 ns
 max    14.146 μs

@graeme-a-stewart
Member

Hi @Moelf - yes, this has been on my TODO as well, as we might lose LoopVectorization. My micro-benchmark/testcase is here.

Just running that now on my M2 Pro I get the following:

Julia findmin

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  171.583 μs …   6.069 ms  ┊ GC (min … max): 0.00% … 96.63%
 Time  (median):     176.500 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   181.996 μs ± 102.615 μs  ┊ GC (mean ± σ):  1.88% ±  3.33%

   ▄█▂  ▂▆▄▆                                                     
  ▇███▄▄█████▆▆▆▇▆▅▅▅▄▄▃▄▄▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  172 μs           Histogram: frequency by time          203 μs <

 Memory estimate: 70.34 KiB, allocs estimate: 2001.

Fast findmin with @turbo

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  47.333 μs …  5.323 ms  ┊ GC (min … max): 0.00% … 98.67%
 Time  (median):     48.250 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   50.801 μs ± 64.514 μs  ┊ GC (mean ± σ):  2.06% ±  1.69%

  ▁▆██▄▃▁▃▅▅▃▂▁▁▂▄▅▅▃▂▂▂▃▂▁                                   ▂
  ███████████████████████████▇█▇███▇█████▇▇█▇█▇▇▆▆▆▆▅▇▆▆▅▅▅▅▅ █
  47.3 μs      Histogram: log(frequency) by time      60.9 μs <

 Memory estimate: 31.28 KiB, allocs estimate: 1001.

Note that this test shrinks the array by 1 step each time, which correctly mocks up the jet reconstruction case - however, I see much the same result for a test against a fixed array size.

@Moelf
Member Author

Moelf commented Oct 25, 2024

yeah I can reproduce that, I also opened: https://discourse.julialang.org/t/faster-findmin-without-loopvectorization-jl/121742/3

@graeme-a-stewart
Member

I saw - thanks. I will pitch in there later.

@Moelf
Member Author

Moelf commented Oct 25, 2024

       function naive_findmin(dij, n)
           # two passes: a vectorisable reduction for the value,
           # then a cheap search for its index, both over dij[1:n]
           v = @view dij[1:n]
           x = @fastmath foldl(min, v)
           i = findfirst(==(x), v)::Int
           x, i
       end

looks quite promising
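
The reason it can compete: @fastmath relaxes the strict NaN/ordering semantics of min, so the foldl reduction is free to vectorise, and the index is recovered in a second, cheap pass. For a quick check in the same Chairmarks harness as above (a sketch reusing setup_bench and basic_findmin from the earlier comment; output omitted since it is machine-dependent):

julia> @be setup_bench naive_findmin(_...) evals=1

julia> @be setup_bench basic_findmin(_...) evals=1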

@Moelf linked a pull request (#84) on Oct 27, 2024 that will close this issue
@graeme-a-stewart
Member

Posted to Discourse, but also for here...

Thanks @jling for bringing this up. I have been very interested in the discussion.

Just to give a bit of context: in this problem the metric dij is calculated for a set of clusters (usually 300-500 at the beginning), and at each iteration we need to find its minimum value. Then two clusters are merged into one and some of the dij values are updated. The next minimum is found, a merger happens, and so on, until the process finishes.

This is why the realistic test is to take a random array of length, say, 400; find the minimum, then reduce to 399, 398, 397, ..., 1. Updating some of the array values at each iteration makes it even more realistic (it prevents unrealistic caching).

This is my wrapper to test any findmin implementation f on vector v:

function run_descent(v::DenseVector{Float64}, f::T; perturb = 0) where T
    # Ensure we do something with the calculation to prevent the
    # compiler from optimizing everything away!
    sum = 0.0
    for n in length(v):-1:2
        val, _ = f(v, n)
        sum += val
        # If one wants to perturb the array do it like this, which
        # is a proxy for changing values as the algorithm progresses.
        for _ in 1:min(perturb,n)
            v[rand(1:n)] = abs(rand())
        end
    end
    sum
end
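
For reference, a call to this harness might look like the following (a sketch, assuming BenchmarkTools; run_descent and basic_findmin as defined in this thread). The copy in setup plus evals=1 means each sample descends from the same starting array, even when perturb > 0 mutates it:

using BenchmarkTools

v = abs.(rand(400))
# each evaluation gets a fresh copy, so mutation by perturb does not leak
@benchmark run_descent(w, basic_findmin; perturb = 2) setup = (w = copy($v)) evals = 1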

I introduced the trick with LoopVectorization when I realised that the out-of-the-box Julia findmin was really slow. However, I am now realising that even a basic naive loop is a lot faster than findmin, so a lot of the speed-up may have nothing to do with vectorisation.

I wrapped up all of the suggestions so far into a script and this is what I get on two machines:

OS X 14.7, M2 Pro

Running the benchmark for an array of length 400
fast_findmin min: 25.333000000000002 μs; mean: 26.46452926525524 μs
basic_findmin min: 83.25 μs; mean: 85.91027592592592 μs
julia_findmin min: 360.791 μs; mean: 366.5392078431371 μs
naive_findmin min: 23.167 μs; mean: 36.22719819819824 μs
naive_findmin_reduce min: 24.875 μs; mean: 36.793294815987366 μs
fast_findmin_simd min: 42.291000000000004 μs; mean: 44.39575992348151 μs

Alma9, Ryzen 7 5700G

Running the benchmark for an array of length 400
fast_findmin min: 16.721 μs; mean: 16.934860248447162 μs
basic_findmin min: 22.782 μs; mean: 23.006807827395896 μs
julia_findmin min: 82.84500000000001 μs; mean: 88.04729608404968 μs
naive_findmin min: 17.352000000000004 μs; mean: 33.093611011154984 μs
naive_findmin_reduce min: 18.884999999999998 μs; mean: 34.14370736589271 μs
fast_findmin_simd min: 16.54 μs; mean: 18.151618091451382 μs

The platform differences are revealing: basic_findmin is quite bad on the M2 but performs well on x86_64, and Julia's findmin is particularly bad on the M2.

Nothing quite beats the LoopVectorization version fast_findmin, but naive_findmin is quite respectable on both platforms, though slower than basic_findmin on x86. fast_findmin_simd is good on x86 but not great on the M2 (which only has NEON, as opposed to AVX2).

fast_findmin_simd is also pretty complicated, so I would not like to have that in my package (though FindFirstFunctions.jl is terrifying!).

Microbenchmarks can be misleading so my next test is in the real code paths of JetReconstruction.jl.

@graeme-a-stewart
Member

@aoanla was kind enough to test the updated fast_findmin on an Ampere Altra A80-30.

So, on an Ampere Altra A80-30:
 
main branch:
 
(examples) pkg>
[sskipsey@wn-a23-010 examples]$ julia --project instrumented-jetreco.jl --algorithm=AntiKt -R 0.4 ../test/data/events.pp13TeV.hepmc3.gz -m 32
Processed 100 events 32 times
Average time per event 327.8093575000001 ± 13.52692474880304 μs
Lowest time per event 310.92226999999997 μs
 
(I did this a few more times and the statistics are about the same)
 
] git checkout exorcise_lv
 
simd branch
 
[sskipsey@wn-a23-010 examples]$ julia --project instrumented-jetreco.jl --algorithm=AntiKt -R 0.4 ../test/data/events.pp13TeV.hepmc3.gz -m 32
Processed 100 events 32 times
Average time per event 328.04179531249997 ± 12.247306169464444 μs
Lowest time per event 314.09708 μs
 
(I also did this a few more times and the statistics were consistent)
 
I didn't do any package re-precompilation between the checkout of the branch and the benchmark, but I don't think I should need to in this case?

Provisionally, this looks like there's no real regression _on Ampere Altra_ - this may be another Apple Silicon thing?

I see that Jerry is using a SIMD width of 8 in their implementation - I wonder if this is a bad width for Apple Silicon in particular. (I also note that, IIRC, LoopVectorization does both loop-unrolling *and* SIMD vectorisation, so the trade-off may be somewhere there?)

Sam
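
For anyone who wants to experiment with the width question, here is a minimal sketch of a chunked findmin with a tunable lane count, assuming SIMD.jl (an illustration of the width parameter Sam mentions, not the implementation in #84):

using SIMD

# Sketch only: track per-lane minima and their indices, then reduce.
# N is the SIMD width - 8 as in the PR; 4 may suit NEON-only chips
# such as Apple Silicon better.
function chunked_findmin(dij::Vector{Float64}, n::Int, ::Val{N}) where {N}
    lane = VecRange{N}(0)
    vmin = Vec{N,Float64}(Inf)
    vidx = Vec{N,Int}(0)
    iota = Vec{N,Int}(ntuple(identity, N))  # lane offsets 1..N
    i = 1
    @inbounds while i + N - 1 <= n
        v = dij[lane + i]            # contiguous N-wide load
        mask = v < vmin
        vmin = vifelse(mask, v, vmin)
        vidx = vifelse(mask, iota + (i - 1), vidx)
        i += N
    end
    # horizontal reduction across the N lanes
    dij_min, best = Inf, 0
    for j in 1:N
        if vmin[j] < dij_min
            dij_min, best = vmin[j], vidx[j]
        end
    end
    # scalar tail for the leftover elements
    @inbounds for k in i:n
        if dij[k] < dij_min
            dij_min, best = dij[k], k
        end
    end
    dij_min, best
end

# e.g. compare chunked_findmin(dij, n, Val(8)) against Val(4) on Apple Silicon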
