Replies: 1 comment 1 reply
I'm going to go back to the current executor and see if I can try starting a system once when it's spawned. This won't allow as many systems to run as this approach does, but it might get a lot of the gains.
I was doing some experiments using a shared struct for controlling component archetype access in the executor. It ended up being a little slower than the current method. I'm making this discussion thread to record my results in case something in it becomes useful in the future.
The majority of the changes are in this file.
Hypothesis
My expectation going in was that this approach would have overhead from cloning the shared access struct and from the extra logic needed for coordination, but that we might be able to make up for it by starting systems while the `prepare_systems` function is still running. We can see in this trace that systems are not allowed to start until after all the systems have been prepared for running. If we could start systems during this window, we could improve parallelism and speed, as long as the extra coordination overhead was not more than the time saved by running during prepare.
Approach
The basic idea was to use channels and a shared conflict-access struct to coordinate between tasks when a system is allowed to run, rather than going through a central executor.
The parallel executor in bevy is responsible for not running a system if the currently running systems conflict with that system's read/write access. This is done through an access-conflict fixed bit set. This approach needs to be able to share the bit set between threads, so it puts the bit set behind an `Arc<Mutex<>>`. The mutex is necessary over something like an atomic because rebuilding the access when a system is removed from the pool of running systems requires looping over all the running systems, and systems cannot be allowed to read the access while it is in this intermediate invalid state. A minimal sketch of this shared conflict check is shown below.
Besides the shared access, systems need to coordinate when they're allowed to start based on their `before` and `after` dependencies. In main this is done by counting how many dependencies have run. In this approach we use channels, and each system waits for all of its dependencies to send a message saying they have run. This gets a little tricky when we're starting systems during prepare: if the receiving channel is not cloned before the finish is sent, it will not see the finish message. So we delay sending the finish until the sending channel has seen its expected number of dependants. This required another channel so as to avoid a tight loop; it only needs to check this number when a new system is spawned.
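To make the shared-access idea concrete, here is a minimal sketch of a conflict check behind an `Arc<Mutex<>>`. It is a stand-in under stated assumptions: it uses a plain `u64` mask instead of Bevy's fixed bit set, ignores the read/write distinction, and the names (`SharedAccess`, `try_start`, `rebuild`) are illustrative rather than the ones in the branch.

```rust
use std::sync::{Arc, Mutex};

// Hypothetical shared conflict tracker: bit i set means some running system
// is currently using component access i.
#[derive(Default, Clone)]
struct SharedAccess {
    active: Arc<Mutex<u64>>,
}

impl SharedAccess {
    // Try to reserve this system's access. Returns true (and records the
    // access) only if it doesn't conflict with the currently running systems.
    fn try_start(&self, system_mask: u64) -> bool {
        let mut active = self.active.lock().unwrap();
        if *active & system_mask == 0 {
            *active |= system_mask;
            true
        } else {
            false
        }
    }

    // Rebuild the access from the systems that are still running. Holding the
    // mutex keeps other tasks from observing the half-rebuilt state, which is
    // why a plain atomic is not enough.
    fn rebuild(&self, still_running: &[u64]) {
        let mut active = self.active.lock().unwrap();
        *active = still_running.iter().fold(0, |acc, mask| acc | mask);
    }
}

fn main() {
    let access = SharedAccess::default();
    let system_a = 0b0011; // uses accesses 0 and 1
    let system_b = 0b0010; // uses access 1, so it conflicts with system_a
    assert!(access.try_start(system_a));
    assert!(!access.try_start(system_b)); // blocked while system_a runs
    access.rebuild(&[]); // system_a finished; nothing left running
    assert!(access.try_start(system_b));
}
```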
Results
The changes do work, at least: we can see that systems do run during the `prepare_systems` function. When running Tracy on the many cubes example, I was seeing very similar frame times between main and my branch, but without Tracy I would see a performance loss from 54 fps to 52 fps. So it's in the same ballpark, but not quite the same.
I suspect that a lot of the lower performance is due to the cloning of the new channels and the shared access struct. This shows up in `prepare_systems` for the first stage taking 900 us vs 600 us on my machine. We make up for some of this by starting systems earlier, but it seems likely that we don't make up all of it. There is likely some overhead from the coordination too, but it's hard to say how much. That extra overhead would matter most in long chains of systems, which can limit how multithreaded things can be.
Overall a fun experiment, but probably not one I'm going to take further. There might be some more possible performance improvements, but I ran out of ones that would be relatively easy to do and felt like clear wins. Also, as the benchmarks below show, the contrived benches have some significant performance regressions, which suggests that coordinating through a mutex causes significant overhead when there are a lot of conflicting systems.
Ideas for Improvement
Some random ideas I had that I didn't pursue.
We could create a new type of channel that clones the messages currently in the channel when the channel is cloned. This would allow one of the channels to be removed, since dependants cloned later would still see that the system had finished (a sketch of this property follows below). Did this and saw a small improvement.
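For illustration, here is a small sketch of the property such a channel would provide: a handle cloned after a dependency has already reported finishing still observes that message. This is a stand-in built on a shared counter and a `Condvar` rather than an actual channel, and the names (`FinishSignal`, `send_finished`, `wait_for`) are hypothetical, not the branch's API.

```rust
use std::sync::{Arc, Condvar, Mutex};

// Hypothetical finish signal: the count of "finished" messages lives in shared
// state, so clones made after a send still observe it.
#[derive(Clone, Default)]
struct FinishSignal {
    inner: Arc<(Mutex<usize>, Condvar)>,
}

impl FinishSignal {
    // A dependency reports that it has finished running.
    fn send_finished(&self) {
        let (count, cvar) = &*self.inner;
        *count.lock().unwrap() += 1;
        cvar.notify_all();
    }

    // Block until at least `expected` dependencies have reported finishing.
    fn wait_for(&self, expected: usize) {
        let (count, cvar) = &*self.inner;
        let mut finished = count.lock().unwrap();
        while *finished < expected {
            finished = cvar.wait(finished).unwrap();
        }
    }
}

fn main() {
    let signal = FinishSignal::default();
    // The dependency finishes *before* the dependant clones its handle.
    signal.send_finished();
    // The late clone still sees the earlier message, which is the property
    // the custom channel was meant to provide.
    let late_clone = signal.clone();
    late_clone.wait_for(1); // returns immediately
}
```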
Notes
Appendix: Benchmarks
Tracy Frame Times
The screencaps here show the new branch as faster than main, but on my machine the two would flip back and forth a bit depending on how fast my computer wanted to run. I would say they were very similar when Tracy was enabled.
Empty Systems
As expected, the empty-system benchmarks have hilariously bad regressions percentage-wise. These might not have mattered if we had gained enough time with real systems.
Busy Systems
We see some small improvements and regressions in these tests. The changes
here are small enough that they could just be noise.
Contrived
Seeing some significant regressions here. The conflicts are probably causing
extra overhead as the futures need to be repolled.