
MGMT-19573: Track release stats in installercache #7156

Merged
merged 1 commit into openshift:master from the MGMT-19573 branch on Jan 10, 2025

Conversation

paul-maidment
Contributor

@paul-maidment paul-maidment commented Jan 2, 2025

To support the ephemeral storage improvement efforts in MGMT-13917, it is desirable to have some statistics from the installer cache:

ReleaseId                 The release being downloaded
Cached                    Was the release found in the cache (true) or did it need to be downloaded/extracted? (false)
StartTime                 The start time of the install cache request
EndTime                   The time at which the caller finished using the request
ExtractDuration           The time taken to extract the file in seconds (zero if no extraction took place)

This will be sent in a single metrics event; the event time of the event represents the time that the release was no longer needed by callers. All of the above should be sufficient to give visibility into the lifespan and concurrency of installer cache entries.

In addition, it was requested that we be able to enable/disable the recording of metrics for this feature, so I have implemented this through the metricsManager as a list of metrics that we would like to block.
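
For illustration, a minimal sketch of what such a stats payload could look like in Go (the names and types here are illustrative assumptions, not necessarily the final implementation in this PR):

package installercache

import "time"

// installerCacheReleaseStats is an illustrative sketch of the data gathered
// for a single installer cache request and sent as one metrics event when
// the caller is finished with the release.
type installerCacheReleaseStats struct {
	ReleaseID       string    // the release being downloaded
	Cached          bool      // true if served from the cache, false if it had to be downloaded/extracted
	StartTime       time.Time // when the installer cache request started
	EndTime         time.Time // when the caller finished using the release
	ExtractDuration float64   // extraction time in seconds (zero if no extraction took place)
}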

List all the issues related to this PR

  • New Feature
  • Enhancement
  • Bug fix
  • Tests
  • Documentation
  • CI/CD

What environments does this code impact?

  • Automation (CI, tools, etc)
  • Cloud
  • Operator Managed Deployments
  • None

How was this code tested?

  • assisted-test-infra environment
  • dev-scripts environment
  • Reviewer's test appreciated
  • Waiting for CI to do a full test run
  • Manual (Elaborate on how it was tested)
  • No tests needed

Checklist

  • Title and description added to both commit and PR.
  • Relevant issues have been associated (see CONTRIBUTING guide)
  • This change does not require a documentation update (docstring, docs, README, etc)
  • Does this change include unit-tests (note that code changes require unit-tests)

Reviewers Checklist

  • Are the title and description (in both PR and commit) meaningful and clear?
  • Is there a bug required (and linked) for this change?
  • Should this PR be backported?

@openshift-ci-robot

openshift-ci-robot commented Jan 2, 2025

@paul-maidment: This pull request references MGMT-19573 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.19.0" version, but no target version was set.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jan 2, 2025

@paul-maidment
Contributor Author

/cc @carbonin

@openshift-ci openshift-ci bot requested a review from carbonin January 2, 2025 16:51
@openshift-ci openshift-ci bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jan 2, 2025
@openshift-ci openshift-ci bot requested a review from tsorya January 2, 2025 16:53

openshift-ci bot commented Jan 2, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: paul-maidment

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 2, 2025

codecov bot commented Jan 2, 2025

Codecov Report

Attention: Patch coverage is 81.25000% with 6 lines in your changes missing coverage. Please review.

Project coverage is 67.73%. Comparing base (0e3133e) to head (5d29c9b).
Report is 11 commits behind head on master.

Files with missing lines                     Patch %   Lines
internal/ignition/installmanifests.go        40.00%    3 Missing ⚠️
internal/installercache/installercache.go    88.88%    3 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #7156      +/-   ##
==========================================
+ Coverage   67.58%   67.73%   +0.14%     
==========================================
  Files         296      296              
  Lines       40235    40290      +55     
==========================================
+ Hits        27193    27290      +97     
+ Misses      10589    10544      -45     
- Partials     2453     2456       +3     
Files with missing lines                     Coverage Δ
internal/ignition/installmanifests.go        56.08% <40.00%> (ø)
internal/installercache/installercache.go    77.31% <88.88%> (+9.78%) ⬆️

... and 11 files with indirect coverage changes

@paul-maidment paul-maidment force-pushed the MGMT-19573 branch 3 times, most recently from 163f461 to a242809 on January 3, 2025 09:35
@paul-maidment
Contributor Author

/retest

@paul-maidment paul-maidment force-pushed the MGMT-19573 branch 2 times, most recently from 7d19ddc to 2b09f04 on January 3, 2025 11:59
@paul-maidment
Contributor Author

/retest

@paul-maidment
Contributor Author

/test okd-scos-e2e-aws-ovn

@carbonin
Member

carbonin commented Jan 3, 2025

Are you intending to track this through events or a metric?

I could see something like installer cache misses and release usage duration as good metrics, but what you're doing here with timestamps feels more like an event.

If you're intending to target only events you can probably implement it as an event only rather than adding something to the metrics manager.

@rccrdpccl what do you think? Should these be events or actual prometheus metrics?
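
For context, a rough sketch of what the metrics-based alternative described above might look like with prometheus/client_golang (the metric names are hypothetical and not part of this PR):

package installercache

import "github.com/prometheus/client_golang/prometheus"

// Hypothetical Prometheus collectors for the metrics-based alternative:
// a counter for cache misses and a histogram for how long a release is in use.
var (
	cacheMisses = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "assisted_installer_cache_miss_total",
		Help: "Installer cache requests that required a download/extract.",
	})
	releaseUsageSeconds = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "assisted_installer_cache_release_usage_seconds",
		Help:    "Seconds between a release being requested and released by the caller.",
		Buckets: prometheus.ExponentialBuckets(1, 2, 12),
	})
)

func init() {
	// Register with the default registry; a real implementation would likely
	// use the service's existing registry instead.
	prometheus.MustRegister(cacheMisses, releaseUsageSeconds)
}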

@carbonin
Member

carbonin commented Jan 3, 2025

My personal preference would be to use actual metrics as they would be more usable through more tools (like grafana), but maybe we also do something clever there with events that I don't know about 🤷

@paul-maidment
Contributor Author

paul-maidment commented Jan 3, 2025

My personal preference would be to use actual metrics as they would be more usable through more tools (like grafana), but maybe we also do something clever there with events that I don't know about 🤷

I can understand this perspective of using multiple events instead of a single event with embedded metrics.

The current approach does have some benefits though...

  • No need for multiple events; all facts are known at a single point in time.
  • No need to track a "Request ID" or similar in each event. The single event encapsulates an entire release 'lifespan'.
    • Less complex to deal with when making queries
  • One single write to the events table, with no chance of inconsistency due to a pod restart or outage.
  • Easy to switch on and off via the metrics manager (as has already been done in this PR).
  • We can gather other metrics at the same time (extract duration, cached, release id); as such, the target is not only events but also some metrics.

In terms of the complexity of a query to calculate "concurrent releases":

  • the "pure events" approach needs an intermediate table to be calculated,
  • vs. the "metric events" approach, which only needs a JSON field to be extracted.

I feel that for our needs, the "metric events" approach is the simplest and easiest to work with.

I understand the point about usability via more tools; again, I would argue that for the "concurrent releases" report we would probably have to resort to SQL anyway, so the choice of "pure events" vs "metric event" becomes less relevant. I am only interested in "overlapping time intervals" in these cases anyway.

An additional metric to indicate "lifespan" of the request could be added to the metric event for more convenience.
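
To make the "overlapping time intervals" point concrete, a rough sketch (not code from this PR) of computing peak concurrency from the single-event [StartTime, EndTime] pairs:

package installercache

import (
	"sort"
	"time"
)

// maxConcurrentReleases returns the peak number of overlapping
// [start, end] intervals, i.e. how many releases were in use at once.
// Purely illustrative of the "overlapping time intervals" query logic.
func maxConcurrentReleases(starts, ends []time.Time) int {
	type edge struct {
		at    time.Time
		delta int
	}
	edges := make([]edge, 0, len(starts)+len(ends))
	for _, s := range starts {
		edges = append(edges, edge{s, +1})
	}
	for _, e := range ends {
		edges = append(edges, edge{e, -1})
	}
	sort.Slice(edges, func(i, j int) bool {
		if edges[i].at.Equal(edges[j].at) {
			return edges[i].delta < edges[j].delta // count an end before a start on ties
		}
		return edges[i].at.Before(edges[j].at)
	})
	current, peak := 0, 0
	for _, e := range edges {
		current += e.delta
		if current > peak {
			peak = current
		}
	}
	return peak
}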

@paul-maidment
Contributor Author

/test okd-scos-e2e-aws-ovn

@paul-maidment
Contributor Author

/test edge-e2e-ai-operator-ztp

@paul-maidment
Contributor Author

/test okd-scos-e2e-aws-ovn

1 similar comment
@paul-maidment
Contributor Author

/test okd-scos-e2e-aws-ovn

@rccrdpccl
Contributor

@rccrdpccl what do you think? Should these be events or actual prometheus metrics?

It really depends on what question we are trying to answer.
As I understand, we are trying to answer the following:

  • How many "releases" happen concurrently?
  • How long does a "release" normally take? What about if it's "cached"?

Answering the first question with Prometheus can get tricky: if it's a gauge, we might miss scraping the data due to the scraping interval; if it's a counter, it would get complicated to handle and represent. The second question would also be really awkward to answer through Prometheus. However, by querying events, both the implementation and the actual answer should be straightforward.

Another factor that we should consider when choosing between metrics and events is alerts: we can alert on metrics, but we cannot alert on events. In this case I think it's not relevant.

If you're intending to target only events you can probably implement it as an event only rather than adding something to the metrics manager.

100% agree. The metrics manager should serve as a facade for Prometheus metrics, and although it's coupled with the events handler, IMO it really shouldn't be, and we should try to avoid that approach.

In addition, it was requested that we are able to enable/disable the recording of metrics for this feature so I have implemented this. Implemented through the metricsManager as a list of metrics that we would like to block.

@paul-maidment why do we need to enable/disable this metric? I do not see the benefit in doing so

@paul-maidment
Contributor Author

@paul-maidment why do we need to enable/disable this metric? I do not see the benefit in doing so

In a 1:1 discussion around this PR, @gamli75 requested the ability to disable this metric.

@paul-maidment
Contributor Author

100% agree. Metrics manager should serve as a facade for Prometheus metrics, and although it's coupled with events handler, IMO it really shouldn't and we should try to avoid that approach.

I can understand this perspective. I will update this to use the events API directly.
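
A rough sketch of what emitting the stats through an events interface directly (rather than via the metrics manager) could look like; the interface below is an illustrative stand-in, not the actual assisted-service events API:

package installercache

import (
	"context"
	"time"
)

// releaseEventSender is an illustrative, narrow interface standing in for the
// real events API that the cache would call directly.
type releaseEventSender interface {
	SendReleaseEvent(ctx context.Context, name string, props map[string]interface{}) error
}

// emitReleaseStats is a hypothetical hook invoked when the caller no longer
// needs a release; it sends one event covering the whole lifespan.
func emitReleaseStats(ctx context.Context, sender releaseEventSender,
	releaseID string, cached bool, start, end time.Time, extractSeconds float64) error {
	return sender.SendReleaseEvent(ctx, "installer_cache_release", map[string]interface{}{
		"release_id":       releaseID,
		"cached":           cached,
		"start_time":       start,
		"end_time":         end,
		"extract_duration": extractSeconds,
	})
}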

@paul-maidment
Contributor Author

/retest

@paul-maidment
Contributor Author

paul-maidment commented Jan 8, 2025

To clarify, the 1000-events-per-week limit was derived from the following query against production Grafana:

SELECT count(*) FROM events WHERE name='cluster_prepare_installation_started' AND event_time >= NOW() - INTERVAL '7 days';

For the last 7 days, that count is 804

This represents every time that cluster preparation was started.
I would expect this count to be a little higher than the cluster installation count as we need to account for preparation failures etc.

The main point is that we have no more than 1000 events in a given week; this is low volume, so we are not at risk of an events explosion or anything of that nature.

@rccrdpccl
Contributor

/retest

@rccrdpccl
Contributor

/hold
/lgtm

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 9, 2025
@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 9, 2025
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Jan 10, 2025
@paul-maidment
Contributor Author

/unhold

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 10, 2025
@rccrdpccl
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 10, 2025
@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 5a89825 and 2 for PR HEAD a5489fd in total

To support Ephemeral storage improvement efforts in MGMT-13917
It is desirable to have some statistics from the installer cache

	ReleaseId                 The release being downloaded
	Cached                    Was the release found in the cache (true) or did it need to be downloaded/extracted? (false)
	StartTime                 The start time of the install cache request
	EndTime                   The time at which the caller finished using the request
	ExtractDuration           The time taken to extract the file in seconds (zero if no extraction took place)

This will be sent in a single metrics event; the event time of the event represents the time that the release was no longer needed by callers.
All of the above should be sufficient to give visibility into the lifespan and concurrency of installer cache entries.

In addition, it was requested that we be able to enable/disable the recording of metrics for this feature, so I have implemented this
through the metricsManager as a list of metrics that we would like to block.
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Jan 10, 2025
@rccrdpccl
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 10, 2025

openshift-ci bot commented Jan 10, 2025

@paul-maidment: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot bot merged commit 8699842 into openshift:master Jan 10, 2025
14 checks passed
@openshift-bot
Contributor

[ART PR BUILD NOTIFIER]

Distgit: ose-agent-installer-api-server
This PR has been included in build ose-agent-installer-api-server-container-v4.19.0-202501101641.p0.g8699842.assembly.stream.el9.
All builds following this will include this PR.
