
[Metrics scraper] Metrics scraper pod overloads and crashes #8015

Closed

maciaszczykm opened this issue Jul 7, 2023 · 4 comments
Labels
kind/feature, lifecycle/rotten

Comments

@maciaszczykm
Member

maciaszczykm commented Jul 7, 2023

We have a cluster here that has 34 nodes and 2497 pods. The metrics scraper seemed to reach 5000m of CPU and 6.7G of memory before eventually crashing.

Dashboard version v2.0.5
Metrics scraper version v1.0.6
The metrics scraper produces roughly 500000 log lines per hour (about 140 requests per second), and they look like this:

Jan 27 20:07:24 dashboard-metrics-scraper-5cccbddcc-fpr6k dashboard-metrics-scraper 172.30.160.74 - - [27/Jan/2021:18:07:24 +0000] "GET /api/v1/dashboard/nodes//metrics/cpu/usage_rate HTTP/1.1" 200 874 "" "dashboard/v2.0.5"
Jan 27 20:07:24 dashboard-metrics-scraper-5cccbddcc-fpr6k dashboard-metrics-scraper 172.30.160.74 - - [27/Jan/2021:18:07:24 +0000] "GET /api/v1/dashboard/nodes//metrics/cpu/usage_rate HTTP/1.1" 200 875 "" "dashboard/v2.0.5"
Jan 27 20:07:24 dashboard-metrics-scraper-5cccbddcc-fpr6k dashboard-metrics-scraper 172.30.160.74 - - [27/Jan/2021:18:07:24 +0000] "GET /api/v1/dashboard/nodes//metrics/cpu/usage_rate HTTP/1.1" 200 878 "" "dashboard/v2.0.5"
Jan 27 20:07:24 dashboard-metrics-scraper-5cccbddcc-fpr6k dashboard-metrics-scraper 172.30.160.74 - - [27/Jan/2021:18:07:24 +0000] "GET /api/v1/dashboard/nodes//metrics/cpu/usage_rate HTTP/1.1" 200 888 "" "dashboard/v2.0.5"
Jan 27 20:07:24 dashboard-metrics-scraper-5cccbddcc-fpr6k dashboard-metrics-scraper 172.30.160.74 - - [27/Jan/2021:18:07:24 +0000] "GET /api/v1/dashboard/nodes//metrics/cpu/usage_rate HTTP/1.1" 200 892 "" "dashboard/v2.0.5"
Jan 27 20:07:24 dashboard-metrics-scraper-5cccbddcc-fpr6k dashboard-metrics-scraper 172.30.160.74 - - [27/Jan/2021:18:07:24 +0000] "GET /api/v1/dashboard/nodes//metrics/cpu/usage_rate HTTP/1.1" 200 891 "" "dashboard/v2.0.5"
It seems like it's handling the requests as it should; it's just getting overloaded and can't cope with the volume. I don't think adding a CPU and memory limit would help much, because hitting the limit would likely just make the pod keep crashing anyway.
This is about as much information as I have for this cluster. The user did delete the pod, but it came back, overloaded, and crashed again.
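For concreteness, such a cap could be applied to the scraper as sketched below. This is a hypothetical mitigation with illustrative values, not a recommendation: the deployment and namespace names assume the stock recommended.yaml, and, per the report above, a limit would likely just turn steady overload into a crash loop.

# Hypothetical mitigation sketch (illustrative values; deployment and
# namespace names assume the stock recommended.yaml). As noted above,
# hitting the memory limit would likely OOMKill the pod repeatedly
# rather than fix the underlying load.
kubectl -n kubernetes-dashboard set resources deployment dashboard-metrics-scraper \
  --requests=cpu=250m,memory=512Mi \
  --limits=cpu=1,memory=1Gi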

Opened by @Joseph-Goergen.

See kubernetes-sigs/dashboard-metrics-scraper#38 for more details.
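For anyone trying to reproduce the request pattern, the endpoint seen in the logs can be exercised directly. The sketch below is a hypothetical in-cluster call: the service name, namespace, and port are assumptions based on the stock manifests, not confirmed by this issue. Note that the logged URLs contain an empty node segment (nodes//).

# Hypothetical reproduction of one logged request (service name, namespace,
# and port 8000 are assumptions based on the stock manifests):
curl -s "http://dashboard-metrics-scraper.kubernetes-dashboard.svc:8000/api/v1/dashboard/nodes//metrics/cpu/usage_rate"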

@maciaszczykm added the kind/feature label Jul 7, 2023
@maciaszczykm changed the title from "Metrics scraper pod overloads and crashes" to "[Metrics scraper] Metrics scraper pod overloads and crashes" Jul 7, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Jan 24, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Feb 23, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot closed this as not planned Mar 24, 2024