Conversation

0x2b3bfa0
Member

It had to be cache invalidation. 🙈

# -- Vector arguments.
args:
- |-
while sleep 60; do find /data/vector/logs -type f -mtime +7 -delete; done &
Member Author

There are also /events and /metrics, and I forgot to clean those up, so we were running out of storage.
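
For illustration, a minimal sketch of extending the original loop to cover all three directories (assuming they sit alongside /data/vector/logs; not necessarily the approach taken in the diff below):

# -- Vector arguments.
args:
  - |-
    # Sketch: prune anything older than 7 days in all three data directories
    while sleep 60; do
      for dir in /data/vector/logs /data/vector/events /data/vector/metrics; do
        find "$dir" -type f -mtime +7 -delete
      done
    done &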

@0x2b3bfa0 0x2b3bfa0 requested review from shcheklein and removed request for shcheklein September 29, 2025 21:47
# HACK: assumes directory structure (single level nesting) and all files prefixed with timestamp
rm -f "$(find "$STORAGE" -type f | sort -t/ -k5 | tail -n1)"
done
done &
Member

are we getting notified if it exits with non-zero or is killed by the kernel?

Member Author

are we getting notified if it exits with non-zero

No; if you're asking specifically about rm -f, it will exit 0 even if there's no data to delete. 😅

or is killed by the kernel

If you mean SIGKILL from e.g. out of memory, no, we won't; the good news, though, is that the whole pod will be killed and that will be reflected in Kubernetes.

Member Author

Never mind, wrong answer: no, we aren't (and can't easily be) notified about failures here. What we can do is at least print logs.
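
For example, a sketch of what printing logs from the cleanup loop could look like (path taken from the snippet above):

# Sketch: log each pruned file and any non-zero exit from find
while sleep 60; do
  echo "cleanup: pruning files older than 7 days under /data/vector/logs"
  find /data/vector/logs -type f -mtime +7 -print -delete \
    || echo "cleanup: find exited with status $?" >&2
done &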

Member

no, I mean exits with non-zero for whatever reason. I don't want to analyze each command to see whether it can return non-zero or not; I assume it will exit with non-zero at some point for some reason, and we MUST handle this: make sure the pod restarts, make sure we have logs (for the command itself), make sure we are getting notified.

Member Author

Update: upon further thought, we shouldn't handle this specially; health checks should rely exclusively on Vector and, if it fails to write logs for whatever reason, report it.

Member

how / where are we getting notified?

Member

let's put a kill or exit on the 10th iteration to experiment
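
E.g. something along these lines (sketch, only for the experiment):

# Sketch: make the background cleanup loop die on its 10th iteration
i=0
while sleep 60; do
  i=$((i + 1))
  if [ "$i" -ge 10 ]; then
    echo "cleanup: simulated failure on iteration $i" >&2
    exit 1
  fi
  find /data/vector/logs -type f -mtime +7 -delete
done &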

Member Author

The whole cleanup script can fail, and I wouldn't worry too much about it; in the extreme case where the whole loop crashes (unlikely, as per the logic) and the pod remains active, the storage will fill and we'll notice.

Member

the storage will fill and we'll notice.

how?

Member

also, I would prefer to notice that the cleanup is failing ... otherwise this system will be losing logs, AFAIU

done
done &
exec /usr/local/bin/vector --config-dir /etc/vector/
Member

same question - what if it exits?

can we add logs to the whole entrypoint? Also, do set -uex so we can see what is happening in case it fails.
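
Something along these lines, as a sketch (paths taken from the snippets above):

#!/bin/sh
# Sketch: trace every command and fail fast on unset variables or errors
set -uex

while sleep 60; do
  find /data/vector/logs -type f -mtime +7 -delete
done &

# Replace the shell with Vector so signals reach it directly
exec /usr/local/bin/vector --config-dir /etc/vector/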

Member Author

same question - what if it exits?

If this exits, the whole pod will restart, and Kubernetes will be aware.

Member

and Kubernetes will be aware.

we also need to be aware

Member Author

@0x2b3bfa0 0x2b3bfa0 Sep 29, 2025

This boils down to a more general problem: we can't [ish] observe the observability tool. 😅

The only way I can think of to guarantee we notice an outage here is to monitor the health of Vector from the DataChain master pod health check, so that if something isn't working (i.e. a dead-whatever switch), we get a Slack notification.
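
As a rough sketch, assuming Vector's internal API is enabled and reachable from the master pod (the service name below is hypothetical), the health check could do something like:

# Sketch: fail the master pod health check if Vector's API does not report healthy
if ! curl --fail --silent --max-time 5 http://vector.logging.svc:8686/health > /dev/null; then
  echo "vector aggregator health check failed" >&2
  exit 1
fi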

Member

It's becoming less of an observability tool and more of a tool that customers rely on to deal with logs. Thus we must have a way to observe its health.

The only way I can think of to guarantee we notice an outage here is to monitor the health of Vector from the DataChain master pod health check, so that if something isn't working (i.e. a dead-whatever switch), we get a Slack notification.

yes, sounds good to me.

Member Author

Monitoring for Vector itself can't be addressed as part of this pull request, though. It has to be addressed as a follow-up.

Member

kk, that's fine ... let's do this though before we ship it to customers

also, I would really prefer to be notified if the cleanup loop is not working

Member Author

I get your point, but making sure that the cleanup script works reliably is neither easy nor elegant.

I would rather try to tackle this from the observability standpoint, and...

  1. Monitor the aggregator service to make sure it's running.
  2. Monitor storage usage metrics and alert when, e.g., less than 2% of space is free (see the sketch below).
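
For the second point, a sketch of the kind of check (the mount point and threshold are assumptions):

# Sketch: alert when the data volume has less than 2% free space
usage=$(df --output=pcent /data/vector | tail -n 1 | tr -dc '0-9')
if [ "$usage" -gt 98 ]; then
  echo "storage alert: /data/vector is ${usage}% full" >&2
  exit 1
fi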

@0x2b3bfa0 0x2b3bfa0 requested a review from shcheklein October 16, 2025 02:23
Member

@shcheklein shcheklein left a comment

there are still unresolved comments

we can't ship something that we are not monitoring; it was too painful for the team and for the customers :(

@0x2b3bfa0 0x2b3bfa0 requested a review from shcheklein October 16, 2025 02:36
@0x2b3bfa0
Member Author

there are still unresolved comments

True

we can't ship something that we are not monitoring; it was too painful for the team and for the customers :(

This pull request doesn't ship anything that wasn't already shipped, but fixes things that weren't already fixed. 😅

@0x2b3bfa0
Member Author

Merging this to unblock; I would have wanted to spend more time on it.

@0x2b3bfa0 0x2b3bfa0 merged commit b9c9f9f into main Oct 16, 2025
4 checks passed
@0x2b3bfa0 0x2b3bfa0 deleted the 0x2b3bfa0-patch-2 branch October 16, 2025 02:50
dreadatour added a commit that referenced this pull request Oct 17, 2025
@dreadatour dreadatour mentioned this pull request Oct 17, 2025