Spotted today in our logs:
Jun 12 16:51:19 buildkite-agent-metrics buildkite-agent-metrics[5922]: 2019/06/12 16:51:19 [Collect] could not write metric [custom.googleapis.com/buildkite/bazel_testing/TotalAgentCount] value [24], rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric. {Metric: custom.googleapis.com/buildkite/bazel_testing/TotalAgentCount, Timestamps: {Youngest Existing: '2019/06/12-09:51:17.000', New: '2019/06/12-09:51:18.000'}}: timeSeries[0]
Jun 12 16:51:19 buildkite-agent-metrics buildkite-agent-metrics[5922]: [Collect] could not write metric [custom.googleapis.com/buildkite/bazel_testing/TotalAgentCount] value [24], rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric. {Metric: custom.googleapis.com/buildkite/bazel_testing/TotalAgentCount, Timestamps: {Youngest Existing: '2019/06/12-09:51:17.000', New: '2019/06/12-09:51:18.000'}}: timeSeries[0]
Jun 12 16:51:19 buildkite-agent-metrics systemd[1]: buildkite-agent-metrics@testing.service: Main process exited, code=exited, status=1/FAILURE
Jun 12 16:51:19 buildkite-agent-metrics systemd[1]: buildkite-agent-metrics@testing.service: Unit entered failed state.
Jun 12 16:51:19 buildkite-agent-metrics systemd[1]: buildkite-agent-metrics@testing.service: Failed with result 'exit-code'.
We can work around this by having systemd restart the daemon on failure. Note to others: it's important to set RestartSec=10 (or higher) in the systemd unit file; otherwise the daemon immediately crashes again, because Stackdriver rate-limits the updates:
Jun 12 16:51:19 buildkite-agent-metrics buildkite-agent-metrics[5928]: 2019/06/12 16:51:19 [Collect] could not write metric [custom.googleapis.com/buildkite/bazel_testing/BusyAgentCount] value [15], rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric. {Metric: custom.googleapis.com/buildkite/bazel_testing/BusyAgentCount, Timestamps: {Youngest Existing: '2019/06/12-09:51:18.000', New: '2019/06/12-09:51:19.000'}}: timeSeries[0]
Jun 12 16:51:19 buildkite-agent-metrics buildkite-agent-metrics[5928]: [Collect] could not write metric [custom.googleapis.com/buildkite/bazel_testing/BusyAgentCount] value [15], rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric. {Metric: custom.googleapis.com/buildkite/bazel_testing/BusyAgentCount, Timestamps: {Youngest Existing: '2019/06/12-09:51:18.000', New: '2019/06/12-09:51:19.000'}}: timeSeries[0]
Jun 12 16:51:19 buildkite-agent-metrics systemd[1]: buildkite-agent-metrics@testing.service: Main process exited, code=exited, status=1/FAILURE
Jun 12 16:51:19 buildkite-agent-metrics systemd[1]: buildkite-agent-metrics@testing.service: Unit entered failed state.
Jun 12 16:51:19 buildkite-agent-metrics systemd[1]: buildkite-agent-metrics@testing.service: Failed with result 'exit-code'.
(The daemon should probably also handle these rate limiting errors better.)
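
For anyone who wants to apply the restart workaround described above, a minimal sketch of the relevant settings follows. `Restart=` and `RestartSec=` are standard systemd service options; the drop-in path in the comment is an assumption about how the unit is installed, not something taken from our setup.

```ini
# Hypothetical drop-in location (adjust to your installation):
#   /etc/systemd/system/buildkite-agent-metrics@.service.d/restart.conf
[Service]
# Restart the daemon whenever it exits with a non-zero status.
Restart=on-failure
# Wait at least 10 seconds before restarting, so the first write after the
# restart is not rejected by Stackdriver for arriving too soon.
RestartSec=10
```

After adding or changing the file, run `systemctl daemon-reload` so systemd picks up the new settings.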