-
Notifications
You must be signed in to change notification settings - Fork 810
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support remote write v2 by converting request #6330
base: master
Are you sure you want to change the base?
Support remote write v2 by converting request #6330
Conversation
07bbbee
to
83d0ba6
Compare
Looks promising. Thanks! FYI we have prometheus/client_golang#1658 which exports remote write handler. Not a blocker for this PR but we should keep it on our radar to switch to use the library |
@yeya24 |
Maybe we can open a issue for someone give a try to use the client_golang handler even before it get merged so we can give feedback on the open PR. Changing that handler after is merged probably will be more difficult as it could potentially break all projects that are already using it. |
I took a breif look at prometheus/client_golang#1658. Left some comments there and we have some changes Cortex specific that might not make sense for Prometheus. I think we are ok to proceed with this PR first. |
@yeya24 |
pkg/util/push/push.go
Outdated
} | ||
case config.RemoteWriteProtoMsgV2: | ||
var req writev2.Request | ||
err := util.ParseProtoReader(ctx, r.Body, int(r.ContentLength), maxRecvMsgSize, &req, util.RawSnappy) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alanprot @danielblando
I wonder if we want to introduce a feature flag to control the behavior for RW v2 request. We can either ignore the request or convert to v1 and in the future maybe just accept as is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add a feature flag for the purpose of rollout. If RW 2.0 conversion is enabled right away, then Ingesters need to be rolled out first because of the protocol change to return stats. If we want to rollout Ingester and Distributor the same time then things can go wrong without a feature flag.
pkg/ingester/ingester.go
Outdated
return &cortexpb.WriteResponse{}, nil | ||
writeResponse := &cortexpb.WriteResponse{ | ||
Samples: int64(succeededSamplesCount), | ||
Histograms: int64(succeededSamplesCount), //TODO(Sungjin1212): Should we count histogram? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How is this implemented in Prometheus. Why we don't count histogram here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left TODO
since we are counting histograms just as we count the sample.
But, the Prometheus is counting native histogram https://github.com/prometheus/prometheus/blob/main/storage/remote/write_handler.go#L424.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about starting to count histogram
when we introduce PushV2
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand the concern here. There is nothing prevent us doing it. We should count native histograms
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I agree with counting native histograms by changing to Histograms: int64(nativeHistogramCount)
.
My concern is we are counting samples instead of native histograms
https://github.com/cortexproject/cortex/blob/master/pkg/ingester/ingester.go#L1269
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am fine to split but why it needs a separate PR? We can just add a new int64 variable to count succeeded histograms
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe some changes are needed like we are tracking ingestionRate by calculating succeededSamplesCount + ingestedMetadata
.
We should change the calculation to sustain existing behavior to succeededSamplesCount + succeededHistogramCount + ingestedMetadata
Also, we can introduce new metrics like cortex_ingester_ingested_native_histograms_total
and cortex_ingester_ingested_histograms_failures_total
.
WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am ok to add the new metrics. But they are not blocking this PR so can be done either now or after this change.
If we don't add new metrics just track succeeded histogram samples, it is a simple change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I will make PR soon!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pkg/distributor/distributor.go
Outdated
@@ -816,7 +824,7 @@ func (d *Distributor) doBatch(ctx context.Context, req *cortexpb.WriteRequest, s | |||
} | |||
} | |||
|
|||
return d.send(localCtx, ingester, timeseries, metadata, req.Source) | |||
return d.send(localCtx, ingester, timeseries, metadata, req.Source, stats) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to sum the samples pulled from stats
? Now I see we just overwrite stats for every request.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If so, isn't there a good chance the returned header value (X-Prometheus-Remote-Write-Samples-Written
) would be multiple of samples in a write request?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah. With replication factor it is expected to have more samples. I think this is fine.
pkg/util/push/push.go
Outdated
} | ||
case config.RemoteWriteProtoMsgV2: | ||
var req writev2.Request | ||
err := util.ParseProtoReader(ctx, r.Body, int(r.ContentLength), maxRecvMsgSize, &req, util.RawSnappy) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add a feature flag for the purpose of rollout. If RW 2.0 conversion is enabled right away, then Ingesters need to be rolled out first because of the protocol change to return stats. If we want to rollout Ingester and Distributor the same time then things can go wrong without a feature flag.
83d0ba6
to
6ce5027
Compare
@yeya24 |
6ce5027
to
5344155
Compare
This doesn't sound like the right behavior. Is that what you got with this PR? |
@yeya24 |
I see what you meant. Then it is expected to get that result if you use Ingester of old version and Distributor of new version. |
@yeya24 |
We can mention it in the flag description but I doubt users really look at it. |
@yeya24 |
5344155
to
a9231e4
Compare
Signed-off-by: SungJin1212 <[email protected]>
a9231e4
to
8ec7204
Compare
Hello @SungJin1212, thank you for opening this PR. There is a release in progress. As such, please rebase your CHANGELOG entry on top of the master branch and move the CHANGELOG entry to the top under Thanks, |
This PR supports Prometheus remote write 2.0 by converting the v2 request to v1 at the API.
Which issue(s) this PR fixes:
Fixes #6324
Checklist
CHANGELOG.md
updated - the order of entries should be[CHANGE]
,[FEATURE]
,[ENHANCEMENT]
,[BUGFIX]