USHIFT-5279: MicroShift telemetry enhancement #1742

File: `enhancements/microshift/microshift-telemetry.md` (311 additions)

---
title: microshift-telemetry
authors:
- pacevedom
reviewers:
- "ggiguash, MicroShift contributor"
- "pmtk, MicroShift contributor"
- "eslutsky, MicroShift contributor"
- "copejon, MicroShift contributor"
approvers:
- "jerpeter1, MicroShift principal engineer"
- "moadz, Telemetry principal engineer"
api-approvers:
- None
creation-date: 2025-01-27
last-updated: 2025-02-04
tracking-link:
- https://issues.redhat.com/browse/OCPSTRAT-1071
---

# MicroShift telemetry
## Summary
MicroShift clusters lack the remote health monitoring that OpenShift has.
Without remote health monitoring there is no visibility into where and how
MicroShift has been deployed or is running.

In order to enable visibility on the number of deployed systems and also their
usage patterns, this enhancement proposes the addition of the
[Telemetry API](https://rhobs-handbook.netlify.app/products/openshiftmonitoring/telemetry.md/)
to MicroShift.

## Motivation
MicroShift currently does not send any metrics from production deployments.
The absence of remote monitoring creates a blind spot on the deployment and
usage characteristics of MicroShift instances in production environments.

MicroShift runs as an application on top of R4E (RHEL for Edge). The use of
RHEL Insights may yield some information about MicroShift, but there are gaps:
RHEL Insights knows which packages are installed, but not whether MicroShift
is running or what its runtime metrics are.

In order to enhance the user experience and Red Hat's insights about production
MicroShift deployments, this enhancement proposes enabling the use of Telemetry
API to get data from connected MicroShift clusters.

### User Stories
As Red Hat, I want to enable MicroShift clusters to report back to me to get

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where do users see this telemetry data? Are there any increased system requirements, storage, or networking to use telemetry?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understanding user as an application admin using MicroShift, users cant see any of this as the reports are intended to provide information for Red Hat. Users who want to have metrics on their systems should look into an observability stack instead (like Prometheus).
As for system requirements, the resource usage of this feature is negligible so no increase in any aspect. There is an estimation here.

knowledge on usage patterns of live deployments.

As a MicroShift admin, I want to have the option to opt-out of telemetry.

As a MicroShift admin, I want to have the option to configure how often

telemetry data is sent.


### Goals


* Enable MicroShift connected deployments to send current information about
system capabilities and usage characteristics.

* Get better understanding of deployment and usage patterns on MicroShift.

### Non-Goals
* Have an in-cluster metrics service.



* Provide recommendations and/or analysis to customers using their data.

## Proposal
Introduce an automatic, opt-out mechanism for connected MicroShift clusters to
report their status to Red Hat. Each MicroShift cluster should send the
following metrics (at least) to the Telemetry API:
* Number of available cores/RAM/disk
* Average utilization of cores/RAM/disk (% of total)
* Number of namespaces
* Number of running pods
* Number of container images on disk
* Number of routes/ingress/services
* Number of CRDs
* OS version / Type (rpm / ostree)

These metrics should be sent at least once a day.

### Metrics details
Based on [Telemetry API](https://rhobs-handbook.netlify.app/products/openshiftmonitoring/telemetry.md/)
MicroShift will send metrics using Prometheus format. There are several ways
of using the API: [using the client to forward metrics](https://github.com/openshift/telemeter/tree/main?tab=readme-ov-file#upload-endpoint-receive-metrics-in-client_modelmetricfamily-format-from-telemeter-client-currently-used-by-cmo),
or [direct write requests](https://github.com/openshift/telemeter/tree/main?tab=readme-ov-file#metricsv1receive-endpoint-receive-metrics-in-prompbwriterequest-format-from-any-client).
Using the client is not an option because it requires a local Prometheus
instance, which is not viable in the typical resource-constrained MicroShift
deployment. Direct write requests take raw Prometheus data and require both
special labels (`_id` for the cluster id) and crafted authentication headers
(the cluster id and pull secret in a specific format in HTTP headers).
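A minimal sketch of building such a direct write request is shown below. The endpoint URL is the one discussed for production; the exact encoding of the authentication header is not specified in this document, so the bearer-token packing and the `buildTelemetryRequest` helper name are assumptions for illustration only.

```go
package main

import (
	"bytes"
	"encoding/base64"
	"fmt"
	"net/http"
)

// buildTelemetryRequest is an illustrative sketch: it crafts an HTTP POST
// to the Telemetry receive endpoint with an Authorization header derived
// from the cluster id and pull secret. The exact token format the backend
// expects is an assumption here and must be verified against the real API.
func buildTelemetryRequest(endpoint, clusterID, pullSecret string, body []byte) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodPost, endpoint, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	// Hypothetical encoding: cluster id + pull secret packed into a
	// base64 bearer token.
	token := base64.StdEncoding.EncodeToString(
		[]byte(fmt.Sprintf(`{"cluster_id":%q,"auth":%q}`, clusterID, pullSecret)))
	req.Header.Set("Authorization", "Bearer "+token)
	req.Header.Set("Content-Type", "application/x-protobuf")
	return req, nil
}

func main() {
	req, err := buildTelemetryRequest(
		"https://infogw.api.openshift.com/metrics/v1/receive",
		"example-cluster-id", "example-pull-secret", []byte("payload"))
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.Header.Get("Content-Type"))
}
```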

OpenShift uses the same API and the same backend, so we need a way to
distinguish MicroShift in the pool of metrics. For this we can use labels, as
this does not require applying for new supported metrics. MicroShift metrics
labels summary:
| Label | Values |
|:---|:---|
|_id|cluster id|
|label_kubernetes_io_arch|amd64, aarch64|
|resource|Used to specify K8s resource names: pods, namespaces, etc.|
|instance|ip address of the node|
|version|microshift version|
|ostree_commit|OStree commit id, if the system is deployed using ostree|

Metrics from MicroShift are already supported in the API because OpenShift is
using them. List follows:
* cluster:capacity_cpu_cores:sum. Number of allocated CPUs for MicroShift.
* cluster:capacity_memory_bytes:sum. Number of bytes of memory allocated for
MicroShift.
* cluster:cpu_usage_cores:sum. Usage of CPU in percentage.
* cluster:memory_usage_bytes:sum. Usage of memory in percentage.
* cluster:usage:resources:sum. Usage of k8s resources, in count. Number of
pods, namespaces, services, etc.
* cluster:usage:containers:sum. Number of active containers.
* instance:etcd_object_counts:sum. Number of objects etcd contains.

Combining metrics with their labels:
| Metric | Labels |
|:---|:---|
|cluster:capacity_cpu_cores:sum|_id, instance, ostree_commit, version|
|cluster:capacity_memory_bytes:sum|_id, instance, ostree_commit, version|
|cluster:cpu_usage_cores:sum|_id, instance, ostree_commit|
|cluster:memory_usage_bytes:sum|_id, instance, ostree_commit|
|instance:etcd_object_counts:sum|_id, instance, ostree_commit|
|cluster:usage:resources:sum|_id, instance, ostree_commit, resource|
|cluster:usage:containers:sum|_id, instance, ostree_commit|

Each metric, including a single sample and all labels, is under 250B in size.

In order to keep these metrics in sync with those of OpenShift we must have
the following:
* CI jobs deploying MicroShift must use Telemetry.
* CI jobs' Telemetry must use the same endpoints as production MicroShift
deployments.
* CI jobs' Telemetry data must be visible in a Grafana dashboard.
* E2E tests must be included to check for both sending and reception of
Telemetry data.

### Sending metrics
OpenShift is sending metrics through this API every 4.5 minutes. For MicroShift
deployments this might be a bit excessive due to resource usage, network
traffic and usage patterns.

MicroShift should send data once a day to minimize network traffic as it can be
deployed in constrained environments. To allow customizations a new
configuration option shall be added. This option will drive how often a
metrics payload should be sent, defaulting to 24h. On every MicroShift start
all metrics are sent; afterwards, the next send is scheduled according to the
configured interval.

As described above MicroShift will be using the [direct request](https://github.com/openshift/telemeter/tree/main?tab=readme-ov-file#metricsv1receive-endpoint-receive-metrics-in-prompbwriterequest-format-from-any-client)
endpoint.
Each metric must follow [Prometheus WriteRequest](https://github.com/prometheus/prometheus/blob/release-2.38/prompb/remote.proto#L22)
format.

### Sampling and batching metrics
Metrics sent by MicroShift can be categorized in:
* Static metrics. These are fixed throughout the execution of MicroShift, such
as memory or CPU.
* Dynamic metrics. These change and evolve throughout the execution of
MicroShift, such as resource usage or resources in the cluster.

If we assume default values, metrics are sent at least once a day in those
deployments that do not disable the functionality. While this is good enough
for static metrics, it provides a degraded view for dynamic ones. Having
once-per-day data on resource usage makes it virtually impossible to extract
any patterns out of it. For this reason it might be beneficial to sample
dynamic metrics more often and then batch them together with the static
metrics when sending them.

Sampling metrics means MicroShift needs to store values in between metrics

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So will MicroShift by default now require a CSI? Or can these values be stored without a CSI because they are so small?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this data can be persisted either in memory and/or in the /var/lib/microshift directory. This is now expanded in this same section.

reports. The dynamic metrics are not numerous enough to take a toll on resource
usage, but this interval must be configurable and default to a sensible value.

As seen before, each metric is under 250 B. Considering 7 metrics, with one of
them holding multiple samples (one for each resource type), a complete
metrics payload with static and dynamic metrics would be under 2 KB.
Using defaults (1 payload every 24h, with 1h samples for all dynamic metrics)
the daily total would be under 40 KB.
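A back-of-the-envelope check of those estimates follows. The split of 2 static versus 5 dynamic metrics is an assumption based on the categorization earlier in this section, not a figure stated by the document.

```go
package main

import "fmt"

func main() {
	const bytesPerMetric = 250

	// One sample of every metric in a single payload.
	total := 7
	single := total * bytesPerMetric
	fmt.Println(single) // 1750 bytes, consistent with the "under 2 KB" estimate

	// Assumed split: capacity metrics are static, the rest are dynamic and
	// sampled hourly over a 24h reporting window.
	static, dynamic, samplesPerDay := 2, 5, 24
	daily := static*bytesPerMetric + dynamic*samplesPerDay*bytesPerMetric
	fmt.Println(daily) // 30500 bytes, consistent with the "under 40 KB" estimate
}
```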
In order to not lose sampling data MicroShift should not rely on storing it
only in memory. The `/var/lib/microshift` directory should be used to store
temporary data about usage for sending it later.
Sampled metrics that have not yet been sent should be stored in JSON format
comprising:
* metric name.
* label keys and values.
* samples with timestamps.

This JSON file may be stored under `/var/lib/microshift/resources/telemetry/`.
Unsent metrics remain there until they are accepted by the Telemetry API; the
file is then removed to start over with the next batch.

### Sensitive data
There is no user or private data in any of the metrics MicroShift reports.

### Opting out
MicroShift may not always have the possibility of sending metrics. There may be
disconnected clusters, constrained environments where external traffic could be
audited, customers who simply do not want to share this information, etc. For
these reasons there must be an opt-out for this functionality, as OpenShift
already [allows](https://docs.openshift.com/container-platform/4.17/support/remote_health_monitoring/opting-out-of-remote-health-reporting.html).

Taking advantage of the configuration options that the feature requires, an

enable/disable toggle is provided.

### Workflow Description
**MicroShift** is the MicroShift main process.

1. MicroShift starts up.
2. MicroShift reads configuration. If telemetry is not enabled, finish here. If telemetry is enabled proceed to next step.

3. Collect all metrics and send them. Include dynamic metrics from sampling if available. Retry if failed.
4. Schedule the next send after `telemetry.sendingInterval`.
5. Every `telemetry.samplingInterval`, collect all dynamic metrics and store them locally.
6. When `telemetry.sendingInterval` elapses, go to step 3.

```mermaid
sequenceDiagram
participant MicroShift
participant Telemetry as Red Hat Telemetry
MicroShift ->> MicroShift: Start up. Read configuration
loop Send Report
MicroShift ->> MicroShift: Read dynamic metrics from the local file, if it exists
MicroShift ->> MicroShift: Collect static and dynamic metrics
loop Retries
MicroShift -->> Telemetry: Send WriteRequest with aggregated metrics
Telemetry ->> MicroShift: 200 OK
end
MicroShift ->> MicroShift: Schedule next send after telemetry.sendingInterval
loop Sampling
MicroShift ->> MicroShift: Collect dynamic metrics every telemetry.samplingInterval
MicroShift ->> MicroShift: Store dynamic metrics in the local file until the next report
end
end

```

### API Extensions
As described above, the feature needs to be configurable, as there could be different reasons why MicroShift admins would not want their clusters to connect to Red Hat Telemetry.
The following changes in the configuration file are proposed:
```yaml
telemetry:
  status: <Enabled|Disabled> # Defaults to Enabled
  sendingInterval: <Duration> # Defaults to 24h
  samplingInterval: <Duration> # Defaults to 1h
```

### Topology Considerations
#### Hypershift / Hosted Control Planes
N/A

#### Standalone Clusters
N/A

#### Single-node Deployments or MicroShift
Enhancement is intended for MicroShift only.

### Implementation Details/Notes/Constraints
N/A

### Risks and Mitigations
Using the Telemetry API requires connected clusters, and this might not always
be the case with MicroShift. For clusters that are not connected there is no
easy way of getting metrics. A possible mitigation would be to

store all metrics until it is possible to send them, but this may never happen
depending on the use cases and deployment types.
In such cases we simply assume that no metrics will be available.

### Drawbacks
N/A

## Test Plan
## Graduation Criteria
The feature is planned to be released as GA directly.

### Dev Preview -> Tech Preview
N/A

### Tech Preview -> GA
- Ability to utilize the enhancement end to end
- End user documentation completed and published
- Available by default
- End-to-end tests

### Removing a deprecated feature
N/A

## Upgrade / Downgrade Strategy
N/A

## Version Skew Strategy
N/A

## Operational Aspects of API Extensions
N/A

## Support Procedures
N/A

## Alternatives
* Insights API was also considered for this enhancement. The Insights API is
intended to analyze clusters and extract data to provide recommendations.
After a report is sent a series of pipelines take action to produce different
reports, which are visible through the [OpenShift console](https://console.redhat.com/openshift). This
is intended for connected clusters that need a deeper analysis on their data
to produce close to optimal configurations. The insights operator does not
work on MicroShift, and the visualization part, as well as the ingestion,
would require significant changes that are outside the scope of the MicroShift
team. The nature of a MicroShift deployment is not the same as OpenShift's,
therefore a recommendation/analysis engine may not be the best fit for the
purpose of this feature.