Support remote write v2 by converting request #6330

Open
wants to merge 1 commit into
base: master

Conversation

SungJin1212
Contributor

@SungJin1212 SungJin1212 commented Nov 11, 2024

This PR supports Prometheus remote write 2.0 by converting the v2 request to v1 at the API.

Which issue(s) this PR fixes:
Fixes #6324

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]
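As a rough illustration of what "converting the v2 request to v1" involves: Remote Write 2.0 interns strings in a per-request symbols table, and each series refers to its labels by index pairs, whereas v1 carries the label strings inline. The sketch below uses simplified, hypothetical stand-in types (`V2Request`, `V1TimeSeries`, `resolveLabels`, `convertV2ToV1`), not the real `writev2` or `cortexpb` types, and is not the code from this PR.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical, simplified stand-ins for the v1 and v2 wire types.
type Label struct{ Name, Value string }

type Sample struct {
	Value     float64
	Timestamp int64
}

type V1TimeSeries struct {
	Labels  []Label
	Samples []Sample
}

type V2TimeSeries struct {
	LabelsRefs []uint32 // alternating name/value indexes into Symbols
	Samples    []Sample
}

type V2Request struct {
	Symbols    []string
	Timeseries []V2TimeSeries
}

var errBadRefs = errors.New("invalid labels_refs")

// resolveLabels turns index pairs into inline label strings.
func resolveLabels(symbols []string, refs []uint32) ([]Label, error) {
	if len(refs)%2 != 0 {
		return nil, errBadRefs
	}
	labels := make([]Label, 0, len(refs)/2)
	for i := 0; i < len(refs); i += 2 {
		n, v := refs[i], refs[i+1]
		if int(n) >= len(symbols) || int(v) >= len(symbols) {
			return nil, errBadRefs
		}
		labels = append(labels, Label{Name: symbols[n], Value: symbols[v]})
	}
	return labels, nil
}

// convertV2ToV1 resolves every series of a v2 request into v1-style series.
func convertV2ToV1(req V2Request) ([]V1TimeSeries, error) {
	out := make([]V1TimeSeries, 0, len(req.Timeseries))
	for _, ts := range req.Timeseries {
		labels, err := resolveLabels(req.Symbols, ts.LabelsRefs)
		if err != nil {
			return nil, err
		}
		out = append(out, V1TimeSeries{Labels: labels, Samples: ts.Samples})
	}
	return out, nil
}

func main() {
	req := V2Request{
		Symbols: []string{"", "__name__", "up", "job", "cortex"},
		Timeseries: []V2TimeSeries{{
			LabelsRefs: []uint32{1, 2, 3, 4},
			Samples:    []Sample{{Value: 1, Timestamp: 1000}},
		}},
	}
	series, err := convertV2ToV1(req)
	fmt.Println(series, err)
}
```

A real conversion would also have to carry over histograms, exemplars, and metadata, which this sketch leaves out.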

@SungJin1212 SungJin1212 force-pushed the Add-remote-write-v2-api branch 6 times, most recently from 07bbbee to 83d0ba6 Compare November 11, 2024 11:25
@yeya24
Contributor

yeya24 commented Nov 14, 2024

Looks promising. Thanks!

FYI, we have prometheus/client_golang#1658, which exports a remote write handler. It is not a blocker for this PR, but we should keep it on our radar so we can switch to using the library.

@SungJin1212
Contributor Author

@yeya24
Thanks for letting me know.
Should we open an issue to track it?

@alanprot
Member

Maybe we can open an issue so that someone can try the client_golang handler even before it gets merged, and we can give feedback on the open PR. Changing that handler after it is merged will probably be more difficult, as it could potentially break all projects that are already using it.

@SungJin1212
Contributor Author

@alanprot
I added a comment here: #6324

@yeya24
Contributor

yeya24 commented Nov 17, 2024

I took a brief look at prometheus/client_golang#1658 and left some comments there. We have some Cortex-specific changes that might not make sense for Prometheus. I think we are OK to proceed with this PR first.

@SungJin1212
Contributor Author

@yeya24
Thanks. I read it, and it would be good if we could reuse its functions!

}
case config.RemoteWriteProtoMsgV2:
var req writev2.Request
err := util.ParseProtoReader(ctx, r.Body, int(r.ContentLength), maxRecvMsgSize, &req, util.RawSnappy)
Contributor

@alanprot @danielblando
I wonder if we want to introduce a feature flag to control the behavior for RW v2 requests. We can either ignore the request or convert it to v1, and in the future maybe just accept it as is.

Contributor

Let's add a feature flag for the purpose of rollout. If RW 2.0 conversion is enabled right away, then Ingesters need to be rolled out first because of the protocol change to return stats. If we want to roll out Ingesters and Distributors at the same time, things can go wrong without a feature flag.
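A minimal sketch of such a gate, assuming the v2 message is selected through the `proto` parameter of the `Content-Type` header as in the Remote Write 2.0 spec. The handler name `pushHandler`, the boolean parameter, the `/api/v1/push` path used in the helper, and the choice of `415 Unsupported Media Type` for the disabled case are assumptions for illustration, not the behavior implemented in this PR.

```go
package main

import (
	"fmt"
	"mime"
	"net/http"
	"net/http/httptest"
)

const v2ProtoMsg = "io.prometheus.write.v2.Request"

// isV2 reports whether the Content-Type header selects the v2 protobuf message.
func isV2(contentType string) bool {
	_, params, err := mime.ParseMediaType(contentType)
	if err != nil {
		return false
	}
	return params["proto"] == v2ProtoMsg
}

// pushHandler rejects v2 requests unless the (hypothetical) flag is enabled.
func pushHandler(remoteWrite2Enabled bool) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if isV2(r.Header.Get("Content-Type")) && !remoteWrite2Enabled {
			http.Error(w, "remote write 2.0 is disabled", http.StatusUnsupportedMediaType)
			return
		}
		// Parsing, conversion, and the actual push would happen here.
		w.WriteHeader(http.StatusNoContent)
	}
}

// statusFor sends one request through the handler and returns the status code.
func statusFor(enabled bool, contentType string) int {
	req := httptest.NewRequest(http.MethodPost, "/api/v1/push", nil)
	req.Header.Set("Content-Type", contentType)
	rec := httptest.NewRecorder()
	pushHandler(enabled).ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	v2 := "application/x-protobuf;proto=" + v2ProtoMsg
	fmt.Println(statusFor(false, v2), statusFor(true, v2))
}
```

With a gate like this, the flag can stay off until every Ingester runs a version that returns write stats.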

- return &cortexpb.WriteResponse{}, nil
+ writeResponse := &cortexpb.WriteResponse{
+ Samples: int64(succeededSamplesCount),
+ Histograms: int64(succeededSamplesCount), //TODO(Sungjin1212): Should we count histogram?
Contributor

How is this implemented in Prometheus? Why don't we count histograms here?

Contributor Author

I left a TODO since we are currently counting histograms the same way we count samples.
But Prometheus counts native histograms: https://github.com/prometheus/prometheus/blob/main/storage/remote/write_handler.go#L424.

Contributor Author

How about starting to count histograms when we introduce PushV2?

Contributor

I don't understand the concern here. There is nothing preventing us from doing it. We should count native histograms.

Contributor Author

Yes, I agree with counting native histograms by changing it to Histograms: int64(nativeHistogramCount).
My concern is that we are currently counting samples instead of native histograms:
https://github.com/cortexproject/cortex/blob/master/pkg/ingester/ingester.go#L1269

Contributor

I am fine with splitting them, but why does it need a separate PR? We can just add a new int64 variable to count succeeded histograms.

Contributor Author

@SungJin1212 SungJin1212 Nov 25, 2024

Some changes may be needed. For example, we currently track ingestionRate by calculating succeededSamplesCount + ingestedMetadata.
To preserve the existing behavior, we should change the calculation to succeededSamplesCount + succeededHistogramCount + ingestedMetadata.
Also, we can introduce new metrics like cortex_ingester_ingested_native_histograms_total and cortex_ingester_ingested_histograms_failures_total.
WDYT?

Contributor

@yeya24 yeya24 Nov 25, 2024

I am OK with adding the new metrics, but they are not blocking this PR, so they can be added either now or after this change.

If we don't add new metrics and just track succeeded histogram samples, it is a simple change.

Contributor Author

OK, I will open a PR soon!

Contributor Author

@yeya24
I opened a PR addressing it (#6370).
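The counting discussed in this thread can be sketched as follows: float samples and native histograms get separate success counters, and the ingestion-rate increment sums both plus metadata so the existing rate behavior is preserved. The types and function names (`writeStats`, `entry`, `countSucceeded`, `ingestionRateIncrement`) are hypothetical and only illustrate the idea. They are not taken from the Cortex ingester or from #6370.

```go
package main

import "fmt"

// writeStats is a hypothetical result of one push.
type writeStats struct {
	Samples    int64
	Histograms int64
}

// entry is a hypothetical append result: what was appended and whether it succeeded.
type entry struct {
	isHistogram bool
	appended    bool
}

// countSucceeded keeps separate counters for float samples and native histograms.
func countSucceeded(entries []entry) writeStats {
	var s writeStats
	for _, e := range entries {
		if !e.appended {
			continue // failed appends are not counted as written
		}
		if e.isHistogram {
			s.Histograms++
		} else {
			s.Samples++
		}
	}
	return s
}

// ingestionRateIncrement sums samples, histograms, and metadata so that the
// rate still covers everything that was ingested.
func ingestionRateIncrement(s writeStats, ingestedMetadata int64) int64 {
	return s.Samples + s.Histograms + ingestedMetadata
}

func main() {
	stats := countSucceeded([]entry{
		{isHistogram: false, appended: true},
		{isHistogram: true, appended: true},
		{isHistogram: true, appended: false},
	})
	fmt.Println(stats, ingestionRateIncrement(stats, 1))
}
```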

@@ -816,7 +824,7 @@ func (d *Distributor) doBatch(ctx context.Context, req *cortexpb.WriteRequest, s
}
}

- return d.send(localCtx, ingester, timeseries, metadata, req.Source)
+ return d.send(localCtx, ingester, timeseries, metadata, req.Source, stats)
Contributor

Do we need to sum the samples pulled from stats? Now I see we just overwrite stats for every request.

Contributor Author

If so, isn't there a good chance that the returned header value (X-Prometheus-Remote-Write-Samples-Written) would be a multiple of the number of samples in the write request?

Contributor

Yeah. With replication factor it is expected to have more samples. I think this is fine.
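To make the effect of summing concrete, here is a small sketch in which each replica adds its written counts to a shared accumulator instead of overwriting it. The names `writeStats` and `pushToReplicas` are hypothetical, and the goroutine body merely stands in for the per-ingester `d.send` call. It assumes every replica accepts all samples.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// writeStats is a hypothetical accumulator shared by all per-ingester sends
// of one write request.
type writeStats struct {
	samples    int64
	histograms int64
}

// add accumulates counts atomically so concurrent sends do not overwrite each other.
func (s *writeStats) add(samples, histograms int64) {
	atomic.AddInt64(&s.samples, samples)
	atomic.AddInt64(&s.histograms, histograms)
}

// pushToReplicas simulates sending the same batch to every replica and
// returns the summed counts.
func pushToReplicas(replicas int, samples, histograms int64) (int64, int64) {
	stats := &writeStats{}
	var wg sync.WaitGroup
	for i := 0; i < replicas; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Stand-in for d.send: each ingester reports what it wrote.
			stats.add(samples, histograms)
		}()
	}
	wg.Wait()
	return atomic.LoadInt64(&stats.samples), atomic.LoadInt64(&stats.histograms)
}

func main() {
	s, h := pushToReplicas(3, 10, 2)
	fmt.Println(s, h)
}
```

With three replicas and a request of 10 samples and 2 histograms, the summed counts are 30 and 6, which is the multiple discussed above.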

@SungJin1212 SungJin1212 force-pushed the Add-remote-write-v2-api branch from 83d0ba6 to 6ce5027 Compare November 27, 2024 01:57
@SungJin1212
Contributor Author

SungJin1212 commented Nov 27, 2024

@yeya24
I added the -distributor.remote-write2-enabled flag to configure whether the Distributor accepts PRW 2.0.
I also added the TestIngesterRollingUpdate e2e test, where the Distributor accepts PRW 2.0 and the Ingester uses the v1.18.1 image.
The result is that the PRW 2.0 push succeeds, but the response header (X-Prometheus-Remote-Write-xxx) values are all "0".
Is this the expected behavior?

@SungJin1212 SungJin1212 force-pushed the Add-remote-write-v2-api branch from 6ce5027 to 5344155 Compare November 28, 2024 09:28
@yeya24
Contributor

yeya24 commented Dec 3, 2024

The result is that the PRW 2.0 push succeeds, but the response header (X-Prometheus-Remote-Write-xxx) values are all "0".
Is this the expected behavior?

This doesn't sound like the right behavior. Is that what you got with this PR?

@SungJin1212
Contributor Author

SungJin1212 commented Dec 3, 2024

@yeya24
Yes, the test condition is that Ingester uses a v1.18.1 image and Distributor uses a PRW2.0-implemented one. The PRW2.0 push then gets that result.
If the Ingester and Distributor use the same images (PRW 2.0 implemented), we can get the expected response header.

@yeya24
Contributor

yeya24 commented Dec 3, 2024

Yes, the test condition is that Ingester uses a v1.18.1 image and Distributor uses a PRW2.0-implemented one. The PRW2.0 push then gets that result.

I see what you mean. That result is expected if you use an old Ingester version with a new Distributor version.
That's why we introduced the PRW 2.0 feature flag in the Distributor: to enable PRW 2.0 requests only if the backend Ingesters are running the newer version.
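The all-"0" headers follow naturally if the Distributor fills the response headers from stats returned by the Ingesters: an older Ingester returns an empty response, so every counter keeps its zero value. The sketch below assumes the three written-count header names from the Remote Write 2.0 spec. The `writeStats` type and `writtenHeaders` function are hypothetical and are not the code from this PR.

```go
package main

import (
	"fmt"
	"net/http"
	"strconv"
)

const (
	samplesWrittenHeader    = "X-Prometheus-Remote-Write-Samples-Written"
	histogramsWrittenHeader = "X-Prometheus-Remote-Write-Histograms-Written"
	exemplarsWrittenHeader  = "X-Prometheus-Remote-Write-Exemplars-Written"
)

// writeStats is a hypothetical summary of what the Ingesters reported.
type writeStats struct {
	Samples    int64
	Histograms int64
	Exemplars  int64
}

// writtenHeaders renders the stats as response headers.
func writtenHeaders(s writeStats) http.Header {
	h := http.Header{}
	h.Set(samplesWrittenHeader, strconv.FormatInt(s.Samples, 10))
	h.Set(histogramsWrittenHeader, strconv.FormatInt(s.Histograms, 10))
	h.Set(exemplarsWrittenHeader, strconv.FormatInt(s.Exemplars, 10))
	return h
}

func main() {
	// An Ingester that predates the stats change reports nothing,
	// so the zero-valued stats turn into "0" headers.
	old := writtenHeaders(writeStats{})
	fmt.Println(old.Get(samplesWrittenHeader))

	// An Ingester that returns stats yields non-zero headers.
	upgraded := writtenHeaders(writeStats{Samples: 10, Histograms: 2})
	fmt.Println(upgraded.Get(samplesWrittenHeader), upgraded.Get(histogramsWrittenHeader))
}
```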

@SungJin1212
Contributor Author

SungJin1212 commented Dec 3, 2024

@yeya24
Should I add a comment to -distributor.remote-write2-enabled so that users know to do a rolling update of the Ingesters first and then update the Distributors afterward?

@yeya24
Contributor

yeya24 commented Dec 3, 2024

We can mention it in the flag description, but I doubt users really look at it.
I prefer to create a dedicated doc/guide for users migrating to Prometheus 3.0.

@SungJin1212
Contributor Author

@yeya24
Yes, a guide doc would be better.

@SungJin1212 SungJin1212 force-pushed the Add-remote-write-v2-api branch from 5344155 to a9231e4 Compare December 18, 2024 11:02
@SungJin1212 SungJin1212 force-pushed the Add-remote-write-v2-api branch from a9231e4 to 8ec7204 Compare January 14, 2025 06:34
@CharlieTLe
Member

Hello @SungJin1212, thank you for opening this PR.

There is a release in progress. As such, please rebase your CHANGELOG entry on top of the master branch and move the CHANGELOG entry to the top under ## master / unreleased.

Thanks,
Charlie

Successfully merging this pull request may close these issues.

Prometheus Remote Write v2 Implementation