
MGMT-19573: Track release stats in installercache #7156

Merged
merged 1 commit into openshift:master from the MGMT-19573 branch on Jan 10, 2025

Conversation

paul-maidment
Contributor

@paul-maidment paul-maidment commented Jan 2, 2025

To support the ephemeral storage improvement efforts in MGMT-13917, it is desirable to have some statistics from the installer cache:

ReleaseId                 The release being downloaded
Cached                    Was the release found in the cache (true) or did it need to be downloaded/extracted? (false)
StartTime                 The start time of the install cache request
EndTime                   The time at which the caller finished using the request
ExtractDuration           The time taken to extract the file in seconds (zero if no extraction took place)

This will be sent in a single metrics event; the event time of the event represents the time that the release was no longer needed by callers. All of the above should be sufficient to give visibility into the lifespan and concurrency of installer cache entries.

In addition, it was requested that we be able to enable/disable the recording of metrics for this feature, so I have implemented this through the metricsManager as a list of metrics that we would like to block.
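
For illustration, a minimal sketch of what such a stats payload could look like in Go (the names and types here are illustrative assumptions, not necessarily the final implementation in this PR):

package installercache

import "time"

// installerCacheReleaseStats is an illustrative sketch of the data gathered
// for a single installer cache request and sent as one metrics event when
// the caller is finished with the release.
type installerCacheReleaseStats struct {
	ReleaseID       string    // the release being downloaded
	Cached          bool      // true if served from the cache, false if it had to be downloaded/extracted
	StartTime       time.Time // when the installer cache request started
	EndTime         time.Time // when the caller finished using the release
	ExtractDuration float64   // extraction time in seconds (zero if no extraction took place)
}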

List all the issues related to this PR

  • New Feature
  • Enhancement
  • Bug fix
  • Tests
  • Documentation
  • CI/CD

What environments does this code impact?

  • Automation (CI, tools, etc)
  • Cloud
  • Operator Managed Deployments
  • None

How was this code tested?

  • assisted-test-infra environment
  • dev-scripts environment
  • Reviewer's test appreciated
  • Waiting for CI to do a full test run
  • Manual (Elaborate on how it was tested)
  • No tests needed

Checklist

  • Title and description added to both commit and PR.
  • Relevant issues have been associated (see CONTRIBUTING guide)
  • This change does not require a documentation update (docstring, docs, README, etc)
  • Does this change include unit-tests (note that code changes require unit-tests)

Reviewers Checklist

  • Are the title and description (in both PR and commit) meaningful and clear?
  • Is there a bug required (and linked) for this change?
  • Should this PR be backported?

@openshift-ci-robot

openshift-ci-robot commented Jan 2, 2025

@paul-maidment: This pull request references MGMT-19573 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.19.0" version, but no target version was set.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jan 2, 2025

@paul-maidment
Contributor Author

/cc @carbonin

@openshift-ci openshift-ci bot requested a review from carbonin January 2, 2025 16:51
@openshift-ci openshift-ci bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jan 2, 2025
@openshift-ci openshift-ci bot requested a review from tsorya January 2, 2025 16:53

openshift-ci bot commented Jan 2, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: paul-maidment

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 2, 2025

codecov bot commented Jan 2, 2025

Codecov Report

Attention: Patch coverage is 81.25000% with 6 lines in your changes missing coverage. Please review.

Project coverage is 67.73%. Comparing base (0e3133e) to head (5d29c9b).
Report is 11 commits behind head on master.

Files with missing lines                     Patch %   Lines
internal/ignition/installmanifests.go        40.00%    3 Missing ⚠️
internal/installercache/installercache.go    88.88%    3 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #7156      +/-   ##
==========================================
+ Coverage   67.58%   67.73%   +0.14%     
==========================================
  Files         296      296              
  Lines       40235    40290      +55     
==========================================
+ Hits        27193    27290      +97     
+ Misses      10589    10544      -45     
- Partials     2453     2456       +3     
Files with missing lines                     Coverage Δ
internal/ignition/installmanifests.go        56.08% <40.00%> (ø)
internal/installercache/installercache.go    77.31% <88.88%> (+9.78%) ⬆️

... and 11 files with indirect coverage changes

@paul-maidment paul-maidment force-pushed the MGMT-19573 branch 3 times, most recently from 163f461 to a242809 on January 3, 2025 09:35
@paul-maidment
Contributor Author

/retest

@paul-maidment paul-maidment force-pushed the MGMT-19573 branch 2 times, most recently from 7d19ddc to 2b09f04 on January 3, 2025 11:59
@paul-maidment
Contributor Author

/retest

@paul-maidment
Contributor Author

/test okd-scos-e2e-aws-ovn

@carbonin
Member

carbonin commented Jan 3, 2025

Are you intending to track this through events or a metric?

I could see something like installer cache misses and release usage duration as good metrics, but what you're doing here with timestamps feels more like an event.

If you're intending to target only events you can probably implement it as an event only rather than adding something to the metrics manager.

@rccrdpccl what do you think? Should these be events or actual prometheus metrics?
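
For context, a rough sketch of what the metrics-based alternative described above might look like with prometheus/client_golang (the metric names are hypothetical and not part of this PR):

package installercache

import "github.com/prometheus/client_golang/prometheus"

// Hypothetical Prometheus collectors for the metrics-based alternative:
// a counter for cache misses and a histogram for how long a release is in use.
var (
	cacheMisses = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "assisted_installer_cache_miss_total",
		Help: "Installer cache requests that required a download/extract.",
	})
	releaseUsageSeconds = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "assisted_installer_cache_release_usage_seconds",
		Help:    "Seconds between a release being requested and released by the caller.",
		Buckets: prometheus.ExponentialBuckets(1, 2, 12),
	})
)

func init() {
	// Register with the default registry; a real implementation would likely
	// use the service's existing registry instead.
	prometheus.MustRegister(cacheMisses, releaseUsageSeconds)
}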

@carbonin
Member

carbonin commented Jan 3, 2025

My personal preference would be to use actual metrics as they would be more usable through more tools (like grafana), but maybe we also do something clever there with events that I don't know about 🤷

@paul-maidment
Contributor Author

paul-maidment commented Jan 3, 2025

My personal preference would be to use actual metrics as they would be more usable through more tools (like grafana), but maybe we also do something clever there with events that I don't know about 🤷

I can understand this perspective of using multiple events instead of a single event with embedded metrics.

The current approach does have some benefits though...

  • No need for multiple events; all facts are known at a single point in time.
  • No need to track a "Request ID" or similar in each event. The single event encapsulates an entire release 'lifespan'.
    • Less complex to deal with when making queries
  • One single write to the events table, with no chance of inconsistency due to a pod restart or outage.
  • Easy to switch on and off via the metrics manager (as has already been done in this PR).
  • We can gather other metrics at the same time (extract duration, cached, release id); as such, the target is not only events but also some metrics.

In terms of the complexity of a query to calculate "concurrent releases":

  • the "pure events" approach needs an intermediate table to be calculated,
  • vs. the "metric events" approach, which only needs a JSON field to be extracted.

I feel that for our needs, the "metric events" approach is the simplest and easiest to work with.

I understand the point about usability via more tools; again, I would argue that for the "concurrent releases" report we would probably have to resort to SQL anyway, so the choice of "pure events" vs "metric event" becomes less relevant. I am only interested in "overlapping time intervals" in these cases anyway.

An additional metric to indicate "lifespan" of the request could be added to the metric event for more convenience.
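
To make the "overlapping time intervals" point concrete, a rough sketch (not code from this PR) of computing peak concurrency from the single-event [StartTime, EndTime] pairs:

package installercache

import (
	"sort"
	"time"
)

// maxConcurrentReleases returns the peak number of overlapping
// [start, end] intervals, i.e. how many releases were in use at once.
// Purely illustrative of the "overlapping time intervals" query logic.
func maxConcurrentReleases(starts, ends []time.Time) int {
	type edge struct {
		at    time.Time
		delta int
	}
	edges := make([]edge, 0, len(starts)+len(ends))
	for _, s := range starts {
		edges = append(edges, edge{s, +1})
	}
	for _, e := range ends {
		edges = append(edges, edge{e, -1})
	}
	sort.Slice(edges, func(i, j int) bool {
		if edges[i].at.Equal(edges[j].at) {
			return edges[i].delta < edges[j].delta // count an end before a start on ties
		}
		return edges[i].at.Before(edges[j].at)
	})
	current, peak := 0, 0
	for _, e := range edges {
		current += e.delta
		if current > peak {
			peak = current
		}
	}
	return peak
}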

@paul-maidment
Contributor Author

/test okd-scos-e2e-aws-ovn

@paul-maidment
Contributor Author

/test edge-e2e-ai-operator-ztp

@paul-maidment
Contributor Author

/test okd-scos-e2e-aws-ovn

1 similar comment
@paul-maidment
Contributor Author

/test okd-scos-e2e-aws-ovn

@rccrdpccl
Contributor

@rccrdpccl what do you think? Should these be events or actual prometheus metrics?

It really depends on what question we are trying to answer.
As I understand, we are trying to answer the following:

  • How many "releases" happen concurrently?
  • How long does a "release" normally take? What about if it's "cached"?

Answering the first question with Prometheus can get tricky: if it's a gauge, we might miss scraping the data due to the scraping interval; if it's a counter, it would get complicated to handle and represent. The second question would also be really awkward to answer through Prometheus. However, by querying events, both the implementation and the actual answer should be straightforward.

Another factor that we should consider when choosing between metrics and events is alerts: we can alert on metrics, but we cannot alert on events. In this case I think it's not relevant.

If you're intending to target only events you can probably implement it as an event only rather than adding something to the metrics manager.

100% agree. The metrics manager should serve as a facade for Prometheus metrics, and although it's coupled with the events handler, IMO it really shouldn't be, and we should try to avoid that approach.

In addition, it was requested that we are able to enable/disable the recording of metrics for this feature so I have implemented this. Implemented through the metricsManager as a list of metrics that we would like to block.

@paul-maidment why do we need to enable/disable this metric? I do not see the benefit in doing so

@paul-maidment
Contributor Author

@paul-maidment why do we need to enable/disable this metric? I do not see the benefit in doing so

In a 1:1 discussion around this PR, @gamli75 requested the ability to disable this metric.

@paul-maidment
Contributor Author

100% agree. Metrics manager should serve as a facade for Prometheus metrics, and although it's coupled with events handler, IMO it really shouldn't and we should try to avoid that approach.

I can understand this perspective. I will update this to use the events API directly.
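
A rough sketch of what emitting the stats through an events interface directly (rather than via the metrics manager) could look like; the interface below is an illustrative stand-in, not the actual assisted-service events API:

package installercache

import (
	"context"
	"time"
)

// releaseEventSender is an illustrative, narrow interface standing in for the
// real events API that the cache would call directly.
type releaseEventSender interface {
	SendReleaseEvent(ctx context.Context, name string, props map[string]interface{}) error
}

// emitReleaseStats is a hypothetical hook invoked when the caller no longer
// needs a release; it sends one event covering the whole lifespan.
func emitReleaseStats(ctx context.Context, sender releaseEventSender,
	releaseID string, cached bool, start, end time.Time, extractSeconds float64) error {
	return sender.SendReleaseEvent(ctx, "installer_cache_release", map[string]interface{}{
		"release_id":       releaseID,
		"cached":           cached,
		"start_time":       start,
		"end_time":         end,
		"extract_duration": extractSeconds,
	})
}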

@paul-maidment
Contributor Author

/retest

@paul-maidment
Contributor Author

paul-maidment commented Jan 8, 2025

To clarify, the 1000-events-per-week limit was derived from the following query against production Grafana:

SELECT count(*) FROM events WHERE name='cluster_prepare_installation_started' AND event_time >= NOW() - INTERVAL '7 days';

For the last 7 days, that count is 804

This represents every time that cluster preparation was started.
I would expect this count to be a little higher than the cluster installation count as we need to account for preparation failures etc.

The main point is that we have no more than 1000 events in a given week; this is low volume, so we are not at risk of an events explosion or anything of that nature.

@rccrdpccl
Contributor

/retest

@rccrdpccl
Contributor

/hold
/lgtm

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 9, 2025
@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 9, 2025
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Jan 10, 2025
@paul-maidment
Contributor Author

/unhold

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 10, 2025
@rccrdpccl
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 10, 2025
@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 5a89825 and 2 for PR HEAD a5489fd in total

To support Ephemeral storage improvement efforts in MGMT-13917
It is desirable to have some statistics from the installer cache

	ReleaseId                 The release being downloaded
	Cached                    Was the release found in the cache (true) or did it need to be downloaded/extracted? (false)
	StartTime                 The start time of the install cache request
	EndTime                   The time at which the caller finished using the request
	ExtractDuration           The time taken to extract the file in seconds (zero if no extraction took place)

This will be sent in a single metrics event; the event time of the event represents the time that the release was no longer needed by callers.
All of the above should be sufficient to give visibility into the lifespan and concurrency of installer cache entries.

In addition, it was requested that we be able to enable/disable the recording of metrics for this feature, so I have implemented this
through the metricsManager as a list of metrics that we would like to block.
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Jan 10, 2025
@rccrdpccl
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 10, 2025

openshift-ci bot commented Jan 10, 2025

@paul-maidment: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot bot merged commit 8699842 into openshift:master Jan 10, 2025
14 checks passed
@openshift-bot
Contributor

[ART PR BUILD NOTIFIER]

Distgit: ose-agent-installer-api-server
This PR has been included in build ose-agent-installer-api-server-container-v4.19.0-202501101641.p0.g8699842.assembly.stream.el9.
All builds following this will include this PR.
