
linkerd viz stat-outbound reports incorrect latencies #13483

Open
kflynn opened this issue Dec 13, 2024 · 2 comments
kflynn commented Dec 13, 2024

What is the issue?

linkerd viz stat-outbound seems to report latencies that are many times larger than actual latencies.

How can it be reproduced?

Get a new cluster, then

linkerd install --crds | kubectl apply -f -
linkerd install | kubectl apply -f -
linkerd viz install | kubectl apply -f -
linkerd check

kubectl create ns faces
kubectl annotate ns/faces linkerd.io/inject=enabled

helm install faces -n faces \
     oci://ghcr.io/buoyantio/faces-chart --version 2.0.0-rc.2 \
     --set gui.serviceType=LoadBalancer \
     --set face.errorFraction=0 \
     --set backend.errorFraction=0 \
     --set backend.delayBuckets="1000"

kubectl rollout status -n faces deploy

At this point, we need access to the faces-gui Service in the faces namespace. Though I can access it directly via its external IP in my setup, I'll write this repro using kubectl port-forward.

kubectl port-forward -n faces svc/faces-gui 8080:80

Using --set backend.delayBuckets=1000 when installing Faces means that the backend workloads (smiley and color) will always delay every call 1000ms. We can verify this with curl:

curl 'http://localhost:8080/face/center/?row=2&col=2' | jq

This will take about a second, and its output will include a latency element that will be around 1000ms. We can check this a few times:

for i in 1 2 3 4 5 6 7 8 9 10; do
    curl -s 'http://localhost:8080/face/center/?row=2&col=2' | jq .latency
done

This will take about 10 seconds, and should show a stack of numbers around 1000.

Go ahead and open a web browser to http://localhost:8080 and you'll see the Faces GUI, which we'll use as a traffic generator. Then run

watch linkerd viz stat-outbound -n faces deploy/face

and you'll see something like this (after possibly giving it a chance to warm up):

NAME  SERVICE    ROUTE      TYPE  BACKEND    SUCCESS    RPS  LATENCY_P50  LATENCY_P95  LATENCY_P99  TIMEOUTS  RETRIES
face  color:80   [default]                   100.00%   8.00       5500ms       9550ms       9910ms     0.00%    0.00%
                 └─────────────►  color:80   100.00%   8.00       5500ms       9550ms       9910ms     0.00%
face  smiley:80  [default]                   100.00%   8.00       5500ms       9550ms       9910ms     0.00%    0.00%
                 └─────────────►  smiley:80  100.00%   8.00       5500ms       9550ms       9910ms     0.00%

Ignore the 100% success rate for color (it's a gRPC service and we have no GRPCRoutes at present) and look at the latencies. All of these numbers should be right at 1000ms, but they're not.

Even more interesting, if we switch the backend latencies to 100ms:

kubectl set env -n faces deploy/smiley DELAY_BUCKETS=100
kubectl set env -n faces deploy/color DELAY_BUCKETS=100
kubectl rollout status -n faces deploy

then the for loop above will run in about one second and show numbers right around 100, but linkerd viz stat-outbound -n faces deploy/face will (after it settles down) show things like this:

NAME  SERVICE    ROUTE      TYPE  BACKEND    SUCCESS    RPS  LATENCY_P50  LATENCY_P95  LATENCY_P99  TIMEOUTS  RETRIES
face  smiley:80  [default]                   100.00%   8.00        275ms        478ms        496ms     0.00%    0.00%
                 └─────────────►  smiley:80  100.00%   8.00        175ms        242ms        248ms     0.00%
face  color:80   [default]                   100.00%   8.00        275ms        478ms        496ms     0.00%    0.00%
                 └─────────────►  color:80   100.00%   8.00        175ms        242ms        248ms     0.00%

which is even weirder -- why the distinction between the different rows?

Logs, error output, etc

See above.

output of linkerd check -o short

:; linkerd check -o short
Status check results are √

Environment

macOS 15.1.1
k3d version 5.7.4
k3s version 1.30.4-k3s1
Linkerd edge-24.11.8
Faces 2.0.0-rc.2 😇

Possible solution

No response

Additional context

No response

Would you like to work on fixing this bug?

maybe

kflynn added the bug label Dec 13, 2024
adleong commented Dec 13, 2024

This is an artifact of the latency histogram buckets that Linkerd uses for these metrics.

https://github.com/linkerd/linkerd2-proxy/blob/main/linkerd/app/outbound/src/http/logical/policy/route/metrics.rs#L104

In this case, the relevant latency histogram bucket bounds are [1s, 10s]. When the Linkerd proxy observes a latency that is slightly larger than 1s, it records it into this bucket, where it becomes indistinguishable from any other value between 1s and 10s. Then, when performing latency quantile calculations, we get a value like 5.5s for p50 (which, as you point out, is much larger than the actual latency value).

These types of histogram artifacts can always occur when recorded latency values are near a bucket boundary, and the magnitude of the error depends on the size of the buckets. If Linkerd used finer-grained buckets here, the magnitude of the error would be smaller, but that would come at a cost in memory and timeseries storage.
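As a concrete check, here's a minimal Python sketch (mine, not Linkerd code) of the per-bucket linear interpolation that Prometheus-style quantile estimation performs. Assuming every ~1001ms observation lands in the single [1s, 10s] bucket, it reproduces the percentiles in the stat-outbound table above:

```python
# Minimal sketch of per-bucket linear interpolation for quantile estimation.
# Assumption: every observed latency falls in the single bucket [1s, 10s],
# so the estimator can only interpolate between the bucket bounds.

def bucket_quantile(q, lower_ms, upper_ms):
    """Estimate the q-th quantile when all samples sit in one bucket."""
    return lower_ms + q * (upper_ms - lower_ms)

for q in (0.50, 0.95, 0.99):
    print(f"p{int(q * 100)}: {bucket_quantile(q, 1000, 10000):.0f}ms")
# p50: 5500ms, p95: 9550ms, p99: 9910ms -- matching the table above
```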

Additionally, this test exhibited "worst case" behavior because the artificially chosen latency of 1s happens to sit right on a bucket boundary. Real-world latencies are less likely to be clustered directly on bucket boundaries. Repeating this experiment with backend.delayBuckets="913" gives more accurate (i.e. less inaccurate) percentiles:
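For what it's worth, the same interpolation sketch reproduces the numbers below too, if we assume the ~913ms observations all land in a [500ms, 1s] bucket (my inference from the reported percentiles, not something taken from the proxy source):

```python
# Same per-bucket linear interpolation, for the delayBuckets=913 repro.
# Assumption: the ~913ms latencies all fall in a [500ms, 1s] bucket
# (inferred from the reported percentiles, not from the proxy source).

def bucket_quantile(q, lower_ms, upper_ms):
    return lower_ms + q * (upper_ms - lower_ms)

for q in (0.50, 0.95, 0.99):
    print(f"p{int(q * 100)}: {bucket_quantile(q, 500, 1000):.0f}ms")
# p50: 750ms, p95: 975ms, p99: 995ms
```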

linkerd viz stat-outbound -n faces deploy/face 
NAME  SERVICE    ROUTE      TYPE  BACKEND    SUCCESS   RPS  LATENCY_P50  LATENCY_P95  LATENCY_P99  TIMEOUTS  RETRIES  
face  color:80   [default]                   100.00%  2.37        750ms        975ms        995ms     0.00%    0.00%  
                 └─────────────►  color:80   100.00%  2.37        750ms        975ms        995ms     0.00%           
face  smiley:80  [default]                   100.00%  2.37        750ms        975ms        995ms     0.00%    0.00%  
                 └─────────────►  smiley:80  100.00%  2.37        750ms        975ms        995ms     0.00% 

adleong commented Dec 13, 2024

For context on our bucket size choices, see linkerd/linkerd2-proxy#3086
