Test removal of LoopVectorization.jl #83

Open
Moelf opened this issue Oct 25, 2024 · 8 comments · May be fixed by #84

@Moelf
Member

Moelf commented Oct 25, 2024

Since LV.jl is going away in Julia 1.12, maybe we should attempt to remove it.

I'm looking at https://github.com/graeme-a-stewart/JetReconstructionBenchmarks.jl/ and want to find a small set of benchmarks I can run to monitor performance when making changes. @graeme-a-stewart, can you point out a minimal set of things to run using that repo? It's not completely clear what to run (e.g. which example snippets).

@Moelf
Member Author

Moelf commented Oct 25, 2024

On my machine, the version without `@turbo` is faster...

# with @turbo
akako@frame16 ~/D/g/JetReconstructionBenchmarks.jl (main)> julia --project src/benchmark.jl --backend Julia --repeats 30  data/events-pp-13TeV-20GeV.hepmc3
1×5 DataFrame
 Row │ File_path                          File                          mean_particles  n_samples  time_per_event
     │ Any                                String                        Int64           Int64      Float64
─────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ data/events-pp-13TeV-20GeV.hepmc3  events-pp-13TeV-20GeV.hepmc3              -1         16         38.7031

# without @turbo
akako@frame16 ~/D/g/JetReconstructionBenchmarks.jl (main)> julia --project src/benchmark.jl --backend Julia --repeats 30  data/events-pp-13TeV-20GeV.hepmc3
Precompiling JetReconstruction...
  1 dependency successfully precompiled in 2 seconds. 117 already precompiled.
1×5 DataFrame
 Row │ File_path                          File                          mean_particles  n_samples  time_per_event
     │ Any                                String                        Int64           Int64      Float64
─────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ data/events-pp-13TeV-20GeV.hepmc3  events-pp-13TeV-20GeV.hepmc3              -1         16         29.7108

@Moelf
Member Author

Moelf commented Oct 25, 2024

I do see a difference in micro benchmark:

julia> using LoopVectorization, Chairmarks

julia> function fast_findmin(dij, n)
           # findmin(@inbounds @view dij[1:n])
           best = 1
           @inbounds dij_min = dij[1]
           @turbo for here in 2:n
               newmin = dij[here] < dij_min
               best = newmin ? here : best
               dij_min = newmin ? dij[here] : dij_min
           end
           dij_min, best
       end

julia> function setup_bench()
           N = rand(200:500)
           dij = abs.(rand(N))
           first_n = rand(1:N)
           dij, first_n
       end

julia> function basic_findmin(dij, n)
           # findmin(@inbounds @view dij[1:n])
           best = 1
           @inbounds dij_min = dij[1]
           for here in 2:n
               newmin = dij[here] < dij_min
               best = newmin ? here : best
               dij_min = newmin ? dij[here] : dij_min
           end
           dij_min, best
       end

julia> @be setup_bench fast_findmin(_...) evals=1
Benchmark: 24004 samples with 1 evaluation
 min    50.000 ns
 median 90.000 ns
 mean   92.675 ns
 max    681.000 ns

julia> @be setup_bench fast_findmin(_...) evals=1
Benchmark: 20405 samples with 1 evaluation
 min    50.000 ns
 median 90.000 ns
 mean   92.634 ns
 max    1.442 μs

julia> @be setup_bench basic_findmin(_...) evals=1
Benchmark: 7178 samples with 1 evaluation
 min    40.000 ns
 median 240.000 ns
 mean   255.817 ns
 max    782.000 ns

julia> @be setup_bench basic_findmin(_...) evals=1
Benchmark: 25737 samples with 1 evaluation
 min    40.000 ns
 median 241.000 ns
 mean   260.301 ns
 max    14.146 μs

@graeme-a-stewart
Member

Hi @Moelf - yes, this has been on my TODO as well, as we might lose LoopVectorization. My micro-benchmark/testcase is here.

Just running that now on my M2 Pro I get the following:

Julia findmin

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  171.583 μs …   6.069 ms  ┊ GC (min … max): 0.00% … 96.63%
 Time  (median):     176.500 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   181.996 μs ± 102.615 μs  ┊ GC (mean ± σ):  1.88% ±  3.33%

   ▄█▂  ▂▆▄▆                                                     
  ▇███▄▄█████▆▆▆▇▆▅▅▅▄▄▃▄▄▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  172 μs           Histogram: frequency by time          203 μs <

 Memory estimate: 70.34 KiB, allocs estimate: 2001.

Fast findmin with @turbo

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  47.333 μs …  5.323 ms  ┊ GC (min … max): 0.00% … 98.67%
 Time  (median):     48.250 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   50.801 μs ± 64.514 μs  ┊ GC (mean ± σ):  2.06% ±  1.69%

  ▁▆██▄▃▁▃▅▅▃▂▁▁▂▄▅▅▃▂▂▂▃▂▁                                   ▂
  ███████████████████████████▇█▇███▇█████▇▇█▇█▇▇▆▆▆▆▅▇▆▆▅▅▅▅▅ █
  47.3 μs      Histogram: log(frequency) by time      60.9 μs <

 Memory estimate: 31.28 KiB, allocs estimate: 1001.

Note that this test shrinks the array by 1 step each time, which correctly mocks up the jet reconstruction case - however, I see much the same result for a test against a fixed array size.

@Moelf
Member Author

Moelf commented Oct 25, 2024

yeah I can reproduce that, I also opened: https://discourse.julialang.org/t/faster-findmin-without-loopvectorization-jl/121742/3

@graeme-a-stewart
Member

I saw - thanks. I will pitch in there later.

@Moelf
Member Author

Moelf commented Oct 25, 2024

       function naive_findmin(dij, n)
           # two passes: a vectorisable reduction for the value,
           # then a cheap search for its index, both over dij[1:n]
           v = @view dij[1:n]
           x = @fastmath foldl(min, v)
           i = findfirst(==(x), v)::Int
           x, i
       end

looks quite promising
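
The reason it can compete: @fastmath relaxes the strict NaN/ordering semantics of min, so the foldl reduction is free to vectorise, and the index is recovered in a second, cheap pass. For a quick check in the same Chairmarks harness as above (a sketch reusing setup_bench and basic_findmin from the earlier comment; output omitted since it is machine-dependent):

julia> @be setup_bench naive_findmin(_...) evals=1

julia> @be setup_bench basic_findmin(_...) evals=1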

@Moelf linked a pull request (#84) on Oct 27, 2024 that will close this issue
@graeme-a-stewart
Member

Posted to Discourse, but also for here...

Thanks @jling for bringing this up. I have been very interested in the discussion.

Just to give a bit of context: in this problem the metric dij is calculated for a set of clusters (usually 300-500 at the beginning), and at each iteration we need to find its minimum value. Then two clusters are merged into one and some of the dij values are updated. The next minimum is found, a merger happens, and so on, until the process finishes.

This is why the realistic test is to take a random array of length, say, 400; find the minimum, then reduce to 399, 398, 397, ..., 1. Updating some of the array values at each iteration makes it even more realistic (it prevents unrealistic caching).

This is my wrapper to test any findmin implementation f on vector v:

function run_descent(v::DenseVector{Float64}, f::T; perturb = 0) where T
    # Ensure we do something with the calculation to prevent the
    # compiler from optimizing everything away!
    sum = 0.0
    for n in length(v):-1:2
        val, _ = f(v, n)
        sum += val
        # If one wants to perturb the array do it like this, which
        # is a proxy for changing values as the algorithm progresses.
        for _ in 1:min(perturb,n)
            v[rand(1:n)] = abs(rand())
        end
    end
    sum
end
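
For reference, a call to this harness might look like the following (a sketch, assuming BenchmarkTools; run_descent and basic_findmin as defined in this thread). The copy in setup plus evals=1 means each sample descends from the same starting array, even when perturb > 0 mutates it:

using BenchmarkTools

v = abs.(rand(400))
# each evaluation gets a fresh copy, so mutation by perturb does not leak
@benchmark run_descent(w, basic_findmin; perturb = 2) setup = (w = copy($v)) evals = 1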

I introduced the trick with LoopVectorization when I realised that the out-of-the-box Julia findmin was really slow. However, I am now realising that even a basic naive loop is a lot faster than findmin, so a lot of the speed-up may have nothing to do with vectorisation.

I wrapped up all of the suggestions so far into a script and this is what I get on two machines:

OS X 14.7, M2 Pro

Running the benchmark for an array of length 400
fast_findmin min: 25.333000000000002 μs; mean: 26.46452926525524 μs
basic_findmin min: 83.25 μs; mean: 85.91027592592592 μs
julia_findmin min: 360.791 μs; mean: 366.5392078431371 μs
naive_findmin min: 23.167 μs; mean: 36.22719819819824 μs
naive_findmin_reduce min: 24.875 μs; mean: 36.793294815987366 μs
fast_findmin_simd min: 42.291000000000004 μs; mean: 44.39575992348151 μs

Alma9, Ryzen 7 5700G

Running the benchmark for an array of length 400
fast_findmin min: 16.721 μs; mean: 16.934860248447162 μs
basic_findmin min: 22.782 μs; mean: 23.006807827395896 μs
julia_findmin min: 82.84500000000001 μs; mean: 88.04729608404968 μs
naive_findmin min: 17.352000000000004 μs; mean: 33.093611011154984 μs
naive_findmin_reduce min: 18.884999999999998 μs; mean: 34.14370736589271 μs
fast_findmin_simd min: 16.54 μs; mean: 18.151618091451382 μs

The platform differences are revealing: basic_findmin is quite bad on the M2 but performs well on x86_64, and Julia's findmin is particularly bad on the M2.

Nothing quite beats the LoopVectorization version fast_findmin, but naive_findmin is quite respectable on both platforms, though slower than basic_findmin on x86. fast_findmin_simd is good on x86 but not great on the M2 (which only has NEON, as opposed to AVX2).

fast_findmin_simd is also pretty complicated, so I would not like to have that in my package (though FindFirstFunctions.jl is terrifying!).

Microbenchmarks can be misleading so my next test is in the real code paths of JetReconstruction.jl.

@graeme-a-stewart
Member

@aoanla was kind enough to test the updated fast_findmin on an Ampere Altra A80-30.

So, on an Ampere Altra A80-30:
 
main branch:
 
(examples) pkg>
[sskipsey@wn-a23-010 examples]$ julia --project instrumented-jetreco.jl --algorithm=AntiKt -R 0.4 ../test/data/events.pp13TeV.hepmc3.gz -m 32
Processed 100 events 32 times
Average time per event 327.8093575000001 ± 13.52692474880304 μs
Lowest time per event 310.92226999999997 μs
 
(I did this a few more times and the statistics are about the same)
 
] git checkout exorcise_lv
 
simd branch
 
[sskipsey@wn-a23-010 examples]$ julia --project instrumented-jetreco.jl --algorithm=AntiKt -R 0.4 ../test/data/events.pp13TeV.hepmc3.gz -m 32
Processed 100 events 32 times
Average time per event 328.04179531249997 ± 12.247306169464444 μs
Lowest time per event 314.09708 μs
 
(I also did this a few more times and the statistics were consistent)
 
I didn't do any package re-precompilation between the checkout of the branch and the benchmark, but I don't think I should need to in this case?

Provisionally, this looks like there's no real regression _on Ampere Altra_ - this may be another Apple Silicon thing?

I see that Jerry is using a SIMD width of 8 in their implementation - I wonder if this is a bad width for Apple Silicon in particular. (I also note that, IIRC, LoopVectorization does both loop-unrolling *and* SIMD vectorisation, so the trade-off may be somewhere there?)

Sam
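
For anyone who wants to experiment with the width question, here is a minimal sketch of a chunked findmin with a tunable lane count, assuming SIMD.jl (an illustration of the width parameter Sam mentions, not the implementation in #84):

using SIMD

# Sketch only: track per-lane minima and their indices, then reduce.
# N is the SIMD width - 8 as in the PR; 4 may suit NEON-only chips
# such as Apple Silicon better.
function chunked_findmin(dij::Vector{Float64}, n::Int, ::Val{N}) where {N}
    lane = VecRange{N}(0)
    vmin = Vec{N,Float64}(Inf)
    vidx = Vec{N,Int}(0)
    iota = Vec{N,Int}(ntuple(identity, N))  # lane offsets 1..N
    i = 1
    @inbounds while i + N - 1 <= n
        v = dij[lane + i]            # contiguous N-wide load
        mask = v < vmin
        vmin = vifelse(mask, v, vmin)
        vidx = vifelse(mask, iota + (i - 1), vidx)
        i += N
    end
    # horizontal reduction across the N lanes
    dij_min, best = Inf, 0
    for j in 1:N
        if vmin[j] < dij_min
            dij_min, best = vmin[j], vidx[j]
        end
    end
    # scalar tail for the leftover elements
    @inbounds for k in i:n
        if dij[k] < dij_min
            dij_min, best = dij[k], k
        end
    end
    dij_min, best
end

# e.g. compare chunked_findmin(dij, n, Val(8)) against Val(4) on Apple Silicon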
