[3.3.0] Performance regression #2634
I wonder if this is related to typelevel/fs2#2734. Would you mind trying with these versions? |
@armanbilge I was thinking about this change as well. |
@armanbilge not much of a difference 🤔 3.2-111-594f8fe
3.2-124-832c161
3.2.3
|
If you have bandwidth to bisect further, you can use the |
@armanbilge I will experiment with different versions during the day |
Thanks! |
Nothing really jumps out when looking at the diff between v3.2.2 and 594f8fe. Most likely culprits: |
Indeed, my project depends on fs2. Thank you for your help and sorry for wasting your time. |
Not a waste of time at all, thanks for reporting! It would be good to understand what's causing the slow down in cats-effect. My guess is that this is the overhead of tracing but that's a total guess. |
Cats Effect 3.0.0 did not have tracing. It came with 3.2.0. The latest |
Can you try the benchmark with the system property |
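(For reference: assuming this refers to the standard Cats Effect 3 tracing switch, the property in question is usually `cats.effect.tracing.mode`. A minimal sbt sketch for running with tracing disabled might look like this:)

```scala
// build.sbt — a sketch, assuming the standard CE3 system property name.
// Fork the benchmark JVM and turn tracing off entirely.
fork := true
javaOptions += "-Dcats.effect.tracing.mode=none"

// Equivalently, the property can be passed on the sbt command line:
//   sbt -Dcats.effect.tracing.mode=none <benchmark task>
```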
I'm running benchmarks on a different machine, so numbers differ from yesterday.
fs2 3.2.2
- cats-effect 3.2.9 without tracing
- cats-effect 3.2.9 with tracing
- cats-effect 3.3.0 without tracing
- cats-effect 3.3.0 with tracing
fs2 3.2.3
- cats-effect 3.2.9 without tracing
- cats-effect 3.2.9 with tracing
- cats-effect 3.3.0 without tracing
- cats-effect 3.3.0 with tracing
|
Thanks for the update!
Sorry, actually no. I think you want to compare fs2 3.2.2 with CE 3.2.9 and CE 3.3.0, both with tracing disabled. |
@armanbilge I've updated the comment above |
@iRevive Thanks! For the fs2 3.2.2 / CE 3.2.9, I assume that was with tracing? Can you run those with tracing disabled too? (same option) |
@armanbilge yeah, the difference is caused by tracing. I've updated my comment with extra benchmarks. Seems I covered all possible combinations. |
Thanks for all your efforts, and sorry for the confusion. I think these are the 4 combinations we need:
- fs2 3.2.2 + CE 3.2.9, tracing enabled
- fs2 3.2.2 + CE 3.2.9, tracing disabled
- fs2 3.2.2 + CE 3.3.0, tracing enabled
- fs2 3.2.2 + CE 3.3.0, tracing disabled
You've done one of the first two, but I'm not sure which (I assume with tracing enabled, because that's the default). Sorry to be a nuisance. Running CE 3.2.9 with tracing disabled helps to check whether the performance change was caused by changes in tracing or by changes in the runtime itself. |
@armanbilge done, see the updated comment https://github.com/typelevel/fs2/issues/2746#issuecomment-988800286 |
Awesome, thank you! 😁 let's have a look.. |
Looking at the time delta, this seems about in line with what I would expect for tracing code which is mostly compute bound: about a 25% difference. Fully compute bound would be about 30%, more than likely. The really interesting thing here is the GC time, though. The tracing benchmark hit fewer GC iterations but still took almost twice as long. I wonder if that means we could optimize tracing a bit further in practice by streamlining GC costs? Edit: actually that shift only happened in 3.3.0, so it's almost certainly being caused by the weak bag shenanigans. We're forcing the GC to work harder in order to avoid bogging down the critical path. |
|
Thanks to the tabular format, it's easier to see that I'm probably wrong on the compute-bound bit. Notice how tracing enabled/disabled makes very little difference on 3.2.9, but it matters a lot on 3.3. That suggests the fiber dump mechanism is primarily at fault. Absolutely fascinating. Can you open an issue on CE for this? I think this is a meaningful performance regression. |
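(For readers unfamiliar with the "weak bag" mentioned above: the fiber-dump feature needs a registry of live fibers that must not itself keep them alive, which is typically built on weak references, and sweeping out cleared references is exactly the kind of bookkeeping that shifts work onto the GC. The following is a conceptual Scala sketch of that idea, not Cats Effect's actual data structure:)

```scala
// Conceptual sketch only: weakly tracking live objects so they can be
// enumerated (e.g. for a fiber dump) without preventing their collection.
import java.lang.ref.WeakReference
import scala.collection.mutable

final class WeakBag[A <: AnyRef] {
  // The bag holds its elements weakly, so the GC may reclaim them at any time.
  private val entries = mutable.ArrayBuffer.empty[WeakReference[A]]

  def put(a: A): Unit = {
    entries += new WeakReference(a)
    // Periodically sweep out cleared references; this maintenance work
    // (plus the extra weak references) is what shows up as GC overhead.
    if (entries.size % 512 == 0)
      entries.filterInPlace(_.get() != null)
  }

  // Snapshot of everything still alive, e.g. to render a fiber dump.
  def snapshot(): List[A] =
    entries.iterator.flatMap(ref => Option(ref.get())).toList
}
```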
We can transfer this one actually. |
We seem to be hitting this issue particularly hard in Scala.js (dev environment), with a performance degradation of a whole order of magnitude. For example, an iteration over ~100 IO calls which took between 1 and 2 seconds in … So far, I've been unable to avoid the issue by setting … Any pointers for further exploring solutions would be welcome. |
@armanbilge Thank you for the links.
Chrome Version 96.0.4664.93 (Official Build) (x86_64)
Thank you! |
Good idea. I tried your suggestion and I'm actually not getting enhanced stacktraces. This is what I get when setting |
Aha, that is helpful. So now that we've confirmed tracing is off, you are definitely still seeing a significant performance regression? That is concerning! |
The … I would be really surprised if that's a 15x regression though; that doesn't quite seem right. |
Ah, @vasilmkd is right, my bad 😅 we need one more conditional there to check if tracing is disabled. Thanks, this is an easy fix! |
So I think that we're probably putting just a bit more strain than we should be on the GC. This doesn't show up as problematic in our benchmarks, since we don't stress the heap much, but in scenarios like the one in the OP it becomes a real issue. A 20% hit for a feature which is intended to be lossless in terms of performance is a bit too much, IMO. At the very least it would merit some more granular configurability, since the scenario here is clearly not sensitive to the performance hit of tracing itself, but is sensitive to the weak tracking.

I think a good action item here would be to try to replicate something like this in our own performance tests. If we can create a benchmark which reproduces the problem, we can iterate on the implementation more effectively. My guess is that we may have to back away from the weak thread-local striping, but maybe there's a way we can be creative and keep that optimization intact. |
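(Along the lines of "a benchmark which reproduces the problem": a heap-heavy JMH benchmark, with hypothetical names and sizes rather than an existing CE benchmark, could look roughly like this:)

```scala
// Hypothetical sketch of a JMH benchmark that pushes many small chunks
// through a stream, so GC-sensitive costs (tracing, weak tracking) surface.
package bench

import java.util.concurrent.TimeUnit
import cats.effect.IO
import cats.effect.unsafe.implicits.global
import fs2.{Chunk, Stream}
import org.openjdk.jmh.annotations._

@State(Scope.Thread)
@BenchmarkMode(Array(Mode.Throughput))
@OutputTimeUnit(TimeUnit.SECONDS)
class HeapPressureBenchmark {

  // Many small chunks, mimicking a socket read in small pieces.
  private val chunks: List[Chunk[Byte]] =
    List.fill(1024)(Chunk.array(Array.fill[Byte](256)(1.toByte)))

  @Benchmark
  def smallChunks(): Long =
    Stream
      .emits(chunks)
      .covary[IO]
      .unchunks            // back to a stream of individual bytes
      .chunkN(4096)        // regroup, allocating along the way
      .map(_.size.toLong)
      .compile
      .fold(0L)(_ + _)
      .unsafeRunSync()
}
```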
We can always replace the
|
@vasilmkd This would need some measurement, but if I correctly understand the way the GC works, I mostly brought up the thread locality because, if we don't leverage the GC to perform the cleanup, we have to find some other way to solve the problem of an |
I don't see how |
@iRevive If you could try the following snapshot, that would be amazing. Thank you. "io.vasilev" %% "cats-effect" % "3.3-87-1d1f0ff" |
That's what I got today:
|
What about the throughput results? The GC time is lower with the fix. |
The |
How does that compare to |
I did several measurements yesterday. I've increased the measurement time to decrease the margin of error for the score.
The GC time is lower with the fix. And the … From what I see, enabling tracing affects the throughput of the library by ~15% on … If we compare …
|
The newly-released 3.3.2 includes the fixes tested here. One thing I would really like to do is figure out why tracing-disabled mode has more overhead in this release than in previous ones. We'll leave this issue open to track that problem.

Regarding the increase in overhead with tracing enabled, I think we can definitely continue to iterate on that case, but I expect that the scenario described in the OP is always going to be sub-optimal. From what I can tell, @iRevive is describing an application which … Cats Effect is fundamentally optimized around the following assumption: the number of … Fwiw, the way to avoid this and recover the constant-factor k is to use …

Anyway, I'm making a few assumptions about the OP case, but I think I'm at least close to the mark. :-) The point I'm making is that we probably can't make the OP's scenario fast, just because the JVM is… the JVM. What we can do, though, is fix the tracing-disabled performance regression (which is still open) and ensure that the tracing-enabled performance overhead is kept to a minimum (which is what was done in 3.3.2). |
@djspiewak thank you for the details. According to the StreamDecoder L56, … On the other hand, StreamDecoder reevaluates the decoding part if the current bytes are not enough to parse the packet. I can assume this happens quite often, especially in the scenario I benchmarked. I will experiment with the given suggestion:
Perhaps a custom accumulating stage (before decoding) will do the trick. |
I think that's a good idea! This should also result in less backtracking in scodec. It's tricky though because you have to try to predict the "right" chunk size. Hard to give good advice here without seeing your problem space, but accumulating into larger chunks is definitely what I would try, particularly if it can be done on the sending side rather than on the receiving side. |
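(One way to experiment with that is sketched below; the regrouping operator, sizes, and names are assumptions for illustration, not something prescribed in this thread:)

```scala
// Sketch: accumulate socket bytes into larger chunks before decoding,
// so the decoder backtracks less often. Sizes and types are placeholders.
import cats.effect.IO
import scala.concurrent.duration._
import fs2.{Pipe, Stream}

object ChunkAccumulation {

  // Placeholder for whatever the protocol actually decodes into.
  final case class DecodedPacket(payload: Vector[Byte])

  def accumulated(
      socketReads: Stream[IO, Byte],          // e.g. socket.reads from fs2-io
      decode: Pipe[IO, Byte, DecodedPacket]   // e.g. StreamDecoder.many(codec).toPipeByte[IO]
  ): Stream[IO, DecodedPacket] =
    socketReads
      .groupWithin(16384, 10.millis) // emit once 16 KiB accumulate or 10 ms pass
      .unchunks                      // flatten back to bytes, now in bigger chunks
      .through(decode)
}
```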
@iRevive Have you tested the most recent release? |
@He-Pin not really, to be honest. I guess I can run some benchmarks this weekend to see the difference. |
Quite a lot of things have changed: internal implementation and optimization, testing platform (Apple Intel -> Apple M1), JVM (11 -> 17). But the pattern is still there.
Testing platform
VM version: JDK 17.0.3, OpenJDK 64-Bit Server VM, 17.0.3+7-LTS
Benchmarks
CE 3.2.9, FS2 3.2.2, tracing
CE 3.2.9, FS2 3.2.2, no tracing
CE 3.3.0, FS2 3.2.2, tracing
CE 3.3.0, FS2 3.2.2, no tracing
CE 3.5.1, FS2 3.9.1, tracing
CE 3.5.1, FS2 3.9.1, no tracing
Conclusions
Bonus point: comparison with Java driver
There is also an official driver, written in Java, that utilizes Netty under the hood. The numbers are the following:
The Scala driver is 13-18% slower, but the allocation rate is completely different. There are several reasons for that: the Scala driver utilizes immutable structures, refined types, extra levels of abstraction, and the FS2 NIO socket layer, while Netty is doing Netty things :) Also, a while ago I was experimenting with the Netty socket layer instead of the FS2 NIO one (mostly code adjusted from https://github.com/typelevel/fs2-netty), and the allocation rate dropped by 2-2.5x. The allocation rate was never an issue for the project (at least for my team), so I didn't progress far with that. |
@iRevive thanks for that update!
Interesting, I'm not sure what's significantly changed between those releases besides the integrated timers. Along those lines ...
If you have any extra bandwidth, it might be interesting to try out the JDK polling system (#3692 (comment)) and also the io_uring polling system (#3692 (comment))
The io_uring polling system gave Ember a significant speedup, would be cool if it makes a difference here. |
This feels like it has essentially quiesced, so I'm going to close it. Please feel free to poke if you feel it should be reopened. |
This issue was originally created in fs2, so version numbers below refer to fs2.
I've observed a performance degradation in some scenarios on version 3.2.3.
The throughput on small byte streams remained the same.
The throughput on bigger byte streams decreased by roughly 20%.
The memory allocation decreased by 10-15% on bigger byte streams.
See the benchmarks below for more details.
Stream usage
The project utilizes a TCP socket from the fs2-io module. I cannot share many details due to NDA, but the generalized usage of the socket is the following:
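(The original snippet didn't survive the transfer, so here is an illustrative reconstruction of that kind of pipeline, assuming fs2-io's `Network` API and a scodec-stream decoder; host, port, codec, and names are made up and not the project's actual code:)

```scala
// Illustrative only: a generalized fs2-io TCP client that decodes packets
// from the socket with a scodec StreamDecoder. All names here are hypothetical.
import cats.effect.{IO, IOApp}
import com.comcast.ip4s._
import fs2.Stream
import fs2.io.net.Network
import scodec.codecs._
import scodec.stream.StreamDecoder

object SocketPipeline extends IOApp.Simple {

  // Hypothetical wire format: a 32-bit length prefix followed by the payload.
  private val packet = variableSizeBytes(int32, bytes)

  val run: IO[Unit] =
    Stream
      .resource(Network[IO].client(SocketAddress(host"127.0.0.1", port"5555")))
      .flatMap { socket =>
        socket.reads                                          // Stream[IO, Byte]
          .through(StreamDecoder.many(packet).toPipeByte[IO]) // decode packets
          .evalMap(bv => IO.println(s"packet of ${bv.size} bytes"))
      }
      .compile
      .drain
}
```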
Consumed bytes per single invocation:
3.2.2
Operation average time:
Memory allocation:
3.2.3
Operation average time:
Memory allocation: