USHIFT-5279: MicroShift telemetry enhancement #1742
Is it useful to mention RHEL Insights as a counterpoint in the MicroShift docs?
Is there something about this in the R4E docs?
Where do users see this telemetry data? Are there any increased system requirements, storage, or networking to use telemetry?
Taking "user" to mean an application admin using MicroShift: users can't see any of this, as the reports are intended to provide information to Red Hat. Users who want metrics on their systems should look into an observability stack instead (like Prometheus).
As for system requirements, the resource usage of this feature is negligible so no increase in any aspect. There is an estimation here.
Is this actually possible?
This is about which moment in the day (or rather, after how many hours/minutes/seconds since MicroShift started) you want MicroShift to send metrics to Red Hat. It will default to 24h.
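A minimal sketch of the interval-based schedule described above, assuming report times are counted from MicroShift's start time rather than from a wall-clock time. The function name and signature are hypothetical; only the 24h default comes from this thread.

```python
from datetime import datetime, timedelta

DEFAULT_INTERVAL = timedelta(hours=24)  # default mentioned in this thread


def next_report_time(started_at, now, interval=DEFAULT_INTERVAL):
    """Return the first report time strictly after `now`.

    Reports are assumed to be due at started_at + k * interval, k = 1, 2, ...
    Hypothetical helper, not MicroShift's actual implementation.
    """
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")
    elapsed = max(now - started_at, timedelta(0))
    completed = elapsed // interval  # whole intervals already elapsed
    return started_at + (completed + 1) * interval
```

With this model the time of day of each report depends only on when the process started, which matches the "every X hours/minutes/seconds" behaviour rather than a fixed daily slot.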
How would a user potentially apply this information to their use case?
Taking "user" to mean an application admin using MicroShift, the only metrics relevant to them might be those for resource usage, besides those of their own applications running on top. All metrics here relate to the system and MicroShift core components.
Per https://rhobs-handbook.netlify.app/products/openshiftmonitoring/telemetry.md/#requirements, is this required?
This line basically refers to the monitoring stack (Prometheus, Alertmanager, etc.). The linked document targets OCP systems and operators, and we have none of them. In this enhancement we use the same API on the backend, but the client is not the same.
It would be good to assess the potential impact on the Telemetry write endpoint (https://infogw.api.openshift.com/metrics/v1/receive).
You mean in bandwidth used per day? Number of connections?
@moadz would have more insights but my concerns would mainly be about network traffic and authentication requests.
Included an estimation of data sent per metrics payload/report.
The Telemetry server has an allow-list of metrics which comes from OCP and incoming samples which don't match the list are dropped. What's the strategy to keep MicroShift metrics in sync with OCP metrics?
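To make the drift risk concrete, here is a minimal sketch of allow-list filtering, assuming matching is done on the metric name only (the real server may match on more than the name). The function and the metric names in the usage note are hypothetical placeholders.

```python
def filter_samples(samples, allow_list):
    """Split samples into (kept, dropped) by metric name.

    `samples` is a list of dicts with a "name" key. Anything whose name is
    not on the allow-list is dropped, mirroring the behaviour described above.
    Hypothetical helper for illustration only.
    """
    allowed = frozenset(allow_list)
    kept, dropped = [], []
    for sample in samples:
        (kept if sample["name"] in allowed else dropped).append(sample)
    return kept, dropped
```

For example, with an allow-list of `["allowed_metric"]`, a sample named `unlisted_metric` ends up in `dropped`. A client that adds or renames a metric without a matching allow-list update would lose that metric silently, which is why the sync question matters.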
Given the characteristics of MicroShift we think the list included here is enough for our purposes, as there are many parts of OCP that are missing and we do not plan to adopt any of them.
Also, MicroShift is an RPM-based product, so we have other means of reporting what's installed and what isn't. These metrics help build a closer-to-reality picture of what a system is running.
Keeping MicroShift metrics in sync may be done by means of having CI clusters report their metrics and check our stats using Grafana as part of our CI duties.
Can you put that in the enhancement?
I expect the process to be automated; otherwise I suspect that it will drift fast.
Added info here.
This isn't how Prometheus/Thanos works: if you don't send metrics at least every 5m, they will be marked as stale and disappear from the query result. Working with intermittent metrics is very hard in practice, and not following the same principle as OCP Telemetry metrics would complicate usage.
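A simplified model of the staleness behaviour described above, assuming an instant query only considers samples from the last 5 minutes (the Prometheus default lookback). The function is a hypothetical illustration, not the query engine's actual logic.

```python
LOOKBACK_SECONDS = 5 * 60  # assumed default lookback window


def instant_value(samples, query_ts, lookback=LOOKBACK_SECONDS):
    """Return the newest sample value visible at `query_ts`, or None.

    `samples` is a list of (timestamp_seconds, value) pairs. A sample is
    visible only if it is at most `lookback` seconds older than the query.
    """
    visible = [(ts, v) for ts, v in samples if query_ts - lookback < ts <= query_ts]
    if not visible:
        return None  # the series has gone stale for this query
    return max(visible, key=lambda pair: pair[0])[1]
```

Under this model a series that reports once a day is visible for only a few minutes after each report and returns nothing for the rest of the day.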
What is the retention period for metrics in the Thanos server? It might be challenging for MicroShift deployments to send metrics at such a high cadence because of customer limitations (provided they are connected, which is also not a common use case).
Not sure what you mean by retention period. The first-level Telemetry backend keeps metrics for 15 days, they are then "copied" to another tier and kept virtually forever.
I was asking because of the querying. MicroShift's use cases are usually disconnected, and those that are connected might not have the best network, which is why we thought of reducing the amount of metrics we send. I understand the current queries you mention are based on OpenShift? We should have separate dashboards/aggregation layers because the product is different from OCP. Are the ones you are talking about in Tableau or in Grafana?
I'm talking about the Prometheus query API & Grafana (I'm not knowledgeable about Tableau).
Would that be an issue if we had different dashboards? Our primary interest is to know how many clusters there are out there, usage is a secondary need in this case. Having sparse metrics should not make a huge difference for our intentions.
Sparse metrics lead to gaps when querying. A workaround would be to use functions like `last_over_time(foo[24h])` (e.g. "return the last received datapoint in the last 24 hours"). From my personal experience, this is less than ideal:
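The gap problem and the range-window workaround can be compared with a small simulation. This is a sketch under simplifying assumptions: a query step "has data" if any sample falls within the given window before it. The function name and the once-a-day sample times in the usage note are hypothetical.

```python
def steps_with_data(sample_times, start, end, step, window):
    """Count evaluation steps in [start, end] that find at least one sample.

    A step at time t finds data if some sample time s satisfies
    t - window < s <= t. All values are in seconds.
    """
    count = 0
    t = start
    while t <= end:
        if any(t - window < s <= t for s in sample_times):
            count += 1
        t += step
    return count
```

With samples at 0 and 86400 seconds and hourly steps over one day, a 300-second window finds data at only 2 of the 25 steps, while a 86400-second window finds data at all 25. The wider window hides the gaps, but every value it returns may be up to a day old.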
Will this config be part of the MicroShift `config.yaml`, or do users need to configure it elsewhere?
Are there minimum network requirements for using telemetry with MicroShift?
Can it be scheduled for a specific time each day?
Is there any impact on the production cluster if the send fails? Or do we just see an error message somewhere?
This is part of MicroShift's configuration file.
We are now targeting sending this every X hours/minutes/seconds instead of specific times since we would like to have this at least once per day.
So will MicroShift by default now require a CSI? Or can these values be stored without a CSI because they are so small?
No, this data can be persisted either in memory and/or in the /var/lib/microshift directory. This is now expanded in this same section.
I'm not sure I understand correctly, but if the intention is to buffer samples locally and send them in a batch every 24h, it won't work against the actual Telemetry requirements: it can't ingest samples "too far" in the past (I think that the backfill window is 2h).
Backfilling data into Telemetry would be interesting but it isn't something that is supported right now.
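A rough sketch of why a 24h batch would mostly be rejected, assuming the 2h window mentioned above (the reviewer's own estimate, not a confirmed value) and assuming a sample is accepted when its age is at most that window. The function name is hypothetical.

```python
MAX_AGE_SECONDS = 2 * 60 * 60  # assumed backfill window from this thread


def split_by_age(sample_times, now, max_age=MAX_AGE_SECONDS):
    """Split sample timestamps into (accepted, rejected) by age at send time.

    A sample is accepted if now - timestamp <= max_age. Times are in seconds.
    """
    accepted = [ts for ts in sample_times if now - ts <= max_age]
    rejected = [ts for ts in sample_times if now - ts > max_age]
    return accepted, rejected
```

For samples buffered hourly over a day and sent half an hour after the last one, only the two most recent samples fall inside the window under these assumptions; the other 22 would be too old to ingest.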
So will a user toggle off and also have to use the OCP steps, or can we automate those steps after the opt-out toggle is selected?
Just using the configuration option to disable it (`status: Disabled`) will do. No need to follow OCP's opt-out procedure, which involves changing pull secrets.
OK, so it seems like we want to configure this at install, prior to first boot. Can we turn this on later and restart? (And vice versa, turn it off and restart.)
Yes, you can enable or disable this every time you restart MicroShift; it is not tied to the install/first boot.
At some point during development, you might want to write traffic to the Telemetry staging server.
Thanks! What is the endpoint for that?
The server is `infogw-proxy.api.stage.openshift.com`.
Thanks, I understand the `recent-stage` data source from Grafana targets this backend, right?
Yes, it should.
Or is this something we gather that an admin can then use when an edge device is about to be upgraded or decommissioned?
That would also be interesting, it would need to be connected though. Let me give it some thought and include it here.