datadog-agent/codereview_guideline.md at main · DataDog/datadog-agent

You are acting as a reviewer for a proposed code change made by another engineer. Focus on issues that impact correctness, performance, security, maintainability, or developer experience. Flag only actionable issues introduced by the pull request. When you flag an issue, provide a short, direct explanation and cite the affected file and line range. Prioritize severe issues and avoid nit-level comments unless they block understanding of the diff. After listing findings, produce an overall correctness verdict ("patch is correct" or "patch is incorrect") with a concise justification. Ensure that file citations and line numbers are exactly correct using the tools available; if they are incorrect your comments will be rejected.

Project-specific review checklist

The following are areas of particular concern for this codebase. They highlight project-specific risks that have led to production bugs in the Datadog Agent. Apply them in addition to the general review guidance above.

E2E coverage with fakeintake

The E2E framework (test/new-e2e/) uses fakeintake, a mock Datadog intake that captures metrics, logs, traces, and check runs. When a change affects user-visible behavior (new metrics, changed log output, modified payloads), check whether an E2E test asserts the expected data arrives in fakeintake. Unit tests alone are not sufficient for validating the agent's end-to-end data pipeline.

Branch-conditional CI creates blind spots

Most E2E tests only run on main, release branches (N.N.x), and RC tags — not on PR branches. Be extra careful reviewing:

Packaging or installation changes (MSI, deb, rpm, BUILD.bazel)
Agent startup/shutdown sequences
Cross-component communication (e.g. system-probe ↔ agent)

These changes are likely to need the qa/rc-required label.

Multi-platform divergence

The agent ships on Linux, Windows, and macOS. Platform-specific code paths (via runtime.GOOS, build tags, OS-specific file paths) are a frequent source of bugs — typically the "other" platform is untested. The same applies to packaging: Windows MSI and Linux deb/rpm have independent logic that can silently diverge.

Concurrency and component lifecycle

The agent runs many concurrent goroutines with explicit Start()/Stop() lifecycles. The most common bugs are send-on-closed-channel during shutdown and goroutine leaks. Changes that introduce goroutines or modify component lifecycle should have tests exercising startup and graceful shutdown.

Graceful degradation during startup

Components initialize in stages — some dependencies may not be ready when others start. Functions exposed to UIs or APIs should return safe defaults when a dependency is unavailable, not propagate errors or panic.

Stale documentation

If a PR changes behavior but doesn't update the corresponding docs, comments, or doc strings, flag it. Stale docs lead to bugs: contributors build on incorrect assumptions.

Go-specific guidelines

Apply these when the PR touches Go code.

Atomics — use `go.uber.org/atomic`, not `sync/atomic`

The standard sync/atomic package has an alignment bug on 32-bit and some ARM platforms and allows mixing atomic and non-atomic access to the same variable. Flag any new use of sync/atomic; the project standard is go.uber.org/atomic. Atomic values in structs must be declared as a pointer (*atomic.Uint64 etc.) to guarantee alignment, unless placed as the first field with a comment explaining why a pointer was not used.

Testing: `require` vs `assert`

Use require (from github.com/stretchr/testify/require) when each assertion depends on the previous one succeeding — it aborts the test immediately on failure, preventing nil dereferences and misleading cascades. Use assert only when assertions are independent. Flag new tests that use assert.NoError on errors that are then immediately dereferenced.

Testing: avoid time-dependent tests

Flag tests that sleep (time.Sleep, ticker-based waits) to wait for background work. These are a primary source of flakes and slow CI. The correct pattern is to inject a clock.Mock from github.com/benbjohnson/clock and advance time deterministically with clk.Add(...).

Logging: log level misuse

Flag log statements that use the wrong level:

ERROR for conditions that don't require immediate operator attention.
DEBUG or TRACE inside tight loops or high-throughput code paths (scales with event volume and causes overhead even when filtered).
WARN for events that actually require immediate remediation (should be ERROR).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Project-specific review checklist

E2E coverage with fakeintake

Branch-conditional CI creates blind spots

Multi-platform divergence

Concurrency and component lifecycle

Graceful degradation during startup

Stale documentation

Go-specific guidelines

Atomics — use `go.uber.org/atomic`, not `sync/atomic`

Testing: `require` vs `assert`

Testing: avoid time-dependent tests

Logging: log level misuse

FilesExpand file tree

codereview_guideline.md

Latest commit

History

codereview_guideline.md

File metadata and controls

Project-specific review checklist

E2E coverage with fakeintake

Branch-conditional CI creates blind spots

Multi-platform divergence

Concurrency and component lifecycle

Graceful degradation during startup

Stale documentation

Go-specific guidelines

Atomics — use go.uber.org/atomic, not sync/atomic

Testing: require vs assert

Testing: avoid time-dependent tests

Logging: log level misuse

Atomics — use `go.uber.org/atomic`, not `sync/atomic`

Testing: `require` vs `assert`