stackdriver: Crash when transient error or rate limiting happens. #89

@philwo

Spotted today in our logs:

Jun 12 16:51:19 buildkite-agent-metrics buildkite-agent-metrics[5922]: 2019/06/12 16:51:19 [Collect] could not write metric [custom.googleapis.com/buildkite/bazel_testing/TotalAgentCount] value [24], rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric. {Metric: custom.googleapis.com/buildkite/bazel_testing/TotalAgentCount, Timestamps: {Youngest Existing: '2019/06/12-09:51:17.000', New: '2019/06/12-09:51:18.000'}}: timeSeries[0]
Jun 12 16:51:19 buildkite-agent-metrics buildkite-agent-metrics[5922]: [Collect] could not write metric [custom.googleapis.com/buildkite/bazel_testing/TotalAgentCount] value [24], rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric. {Metric: custom.googleapis.com/buildkite/bazel_testing/TotalAgentCount, Timestamps: {Youngest Existing: '2019/06/12-09:51:17.000', New: '2019/06/12-09:51:18.000'}}: timeSeries[0]
Jun 12 16:51:19 buildkite-agent-metrics systemd[1]: buildkite-agent-metrics@testing.service: Main process exited, code=exited, status=1/FAILURE
Jun 12 16:51:19 buildkite-agent-metrics systemd[1]: buildkite-agent-metrics@testing.service: Unit entered failed state.
Jun 12 16:51:19 buildkite-agent-metrics systemd[1]: buildkite-agent-metrics@testing.service: Failed with result 'exit-code'.

We can work around this by having systemd restart the daemon on failure. Note to others: it's important to set RestartSec=10 (or higher) in the systemd unit file; otherwise the daemon will immediately crash again, because Stackdriver rate-limits the updates:

Jun 12 16:51:19 buildkite-agent-metrics buildkite-agent-metrics[5928]: 2019/06/12 16:51:19 [Collect] could not write metric [custom.googleapis.com/buildkite/bazel_testing/BusyAgentCount] value [15], rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric. {Metric: custom.googleapis.com/buildkite/bazel_testing/BusyAgentCount, Timestamps: {Youngest Existing: '2019/06/12-09:51:18.000', New: '2019/06/12-09:51:19.000'}}: timeSeries[0]
Jun 12 16:51:19 buildkite-agent-metrics buildkite-agent-metrics[5928]: [Collect] could not write metric [custom.googleapis.com/buildkite/bazel_testing/BusyAgentCount] value [15], rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric. {Metric: custom.googleapis.com/buildkite/bazel_testing/BusyAgentCount, Timestamps: {Youngest Existing: '2019/06/12-09:51:18.000', New: '2019/06/12-09:51:19.000'}}: timeSeries[0]
Jun 12 16:51:19 buildkite-agent-metrics systemd[1]: buildkite-agent-metrics@testing.service: Main process exited, code=exited, status=1/FAILURE
Jun 12 16:51:19 buildkite-agent-metrics systemd[1]: buildkite-agent-metrics@testing.service: Unit entered failed state.
Jun 12 16:51:19 buildkite-agent-metrics systemd[1]: buildkite-agent-metrics@testing.service: Failed with result 'exit-code'.

(The daemon should probably also handle these rate limiting errors better.)
