Conversation

0x2b3bfa0
Member

It had to be cache invalidation. 🙈

# -- Vector arguments.
args:
- |-
while sleep 60; do find /data/vector/logs -type f -mtime +7 -delete; done &
Member Author

There are also /events and /metrics, and I forgot to clean those up, so we were running out of storage.
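
For illustration, a minimal sketch of extending the original loop to cover all three directories (assuming they sit alongside /data/vector/logs; not necessarily the approach taken in the diff below):

# -- Vector arguments.
args:
  - |-
    # Sketch: prune anything older than 7 days in all three data directories
    while sleep 60; do
      for dir in /data/vector/logs /data/vector/events /data/vector/metrics; do
        find "$dir" -type f -mtime +7 -delete
      done
    done &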

@0x2b3bfa0 0x2b3bfa0 requested review from shcheklein and removed request for shcheklein September 29, 2025 21:47
# HACK: assumes directory structure (single level nesting) and all files prefixed with timestamp
rm -f "$(find "$STORAGE" -type f | sort -t/ -k5 | tail -n1)"
done
done &
Member

are we getting notified if it exits with non-zero or is killed by the kernel?

Member Author

are we getting notified if it exits with non-zero

No; if you're asking specifically about rm -f, it will exit 0 even if there's no data to delete. 😅

or is killed by the kernel

If you mean SIGKILL from e.g. out of memory, no, we won't; the good news, though, is that the whole pod will be killed and that will be reflected in Kubernetes.

Member Author

Never mind, wrong answer: no, we aren't (and can't easily be) notified about failures here. What we can do is at least print logs.
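
For example, a sketch of what printing logs from the cleanup loop could look like (path taken from the snippet above):

# Sketch: log each pruned file and any non-zero exit from find
while sleep 60; do
  echo "cleanup: pruning files older than 7 days under /data/vector/logs"
  find /data/vector/logs -type f -mtime +7 -print -delete \
    || echo "cleanup: find exited with status $?" >&2
done &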

Member

no, I mean exits with non-zero for whatever reason. I don't want to analyze each command to see whether it can return non-zero or not; I assume it will exit with non-zero at some point for some reason, and we MUST handle this: make sure the pod restarts, make sure we have logs (for the command itself), make sure we are getting notified.

Member Author

Update: upon further thought, we shouldn't handle this specially; health checks should rely exclusively on Vector and, if it fails to write logs for whatever reason, report it.

Member

how / where are we getting notified?

Member

let's put a kill or exit on the 10th iteration to experiment
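
E.g. something along these lines (sketch, only for the experiment):

# Sketch: make the background cleanup loop die on its 10th iteration
i=0
while sleep 60; do
  i=$((i + 1))
  if [ "$i" -ge 10 ]; then
    echo "cleanup: simulated failure on iteration $i" >&2
    exit 1
  fi
  find /data/vector/logs -type f -mtime +7 -delete
done &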

Member Author

The whole cleanup script can fail, and I wouldn't worry too much about it; in the extreme case where the whole loop crashes (unlikely, as per the logic) and the pod remains active, the storage will fill and we'll notice.

Member

the storage will fill and we'll notice.

how?

Member

also, I would prefer to notice that the cleanup is failing ... otherwise this system will be losing logs, AFAIU

done
done &
exec /usr/local/bin/vector --config-dir /etc/vector/
Member

same question - what if it exits?

can we add logs to the whole entrypoint? Also, do set -uex so we can see what is happening in case it fails.
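
Something along these lines, as a sketch (paths taken from the snippets above):

#!/bin/sh
# Sketch: trace every command and fail fast on unset variables or errors
set -uex

while sleep 60; do
  find /data/vector/logs -type f -mtime +7 -delete
done &

# Replace the shell with Vector so signals reach it directly
exec /usr/local/bin/vector --config-dir /etc/vector/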

Member Author

same question - what if it exits?

If this exits, the whole pod will restart, and Kubernetes will be aware.

Member

and Kubernetes will be aware.

we also need to be aware

Member Author

@0x2b3bfa0 0x2b3bfa0 Sep 29, 2025

This boils down to a more general problem: we can't [ish] observe the observability tool. 😅

The only way I can think of to guarantee we notice an outage here is to monitor the health of Vector from the DataChain master pod health check, so that if something isn't working (i.e. a dead-whatever switch), we get a Slack notification.
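
As a rough sketch, assuming Vector's internal API is enabled and reachable from the master pod (the service name below is hypothetical), the health check could do something like:

# Sketch: fail the master pod health check if Vector's API does not report healthy
if ! curl --fail --silent --max-time 5 http://vector.logging.svc:8686/health > /dev/null; then
  echo "vector aggregator health check failed" >&2
  exit 1
fi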

Member

It's becoming less of an observability tool and more of a tool that customers rely on to deal with logs. Thus we must have a way to observe its health.

The only way I can think of to guarantee we notice an outage here is to monitor the health of Vector from the DataChain master pod health check, so that if something isn't working (i.e. a dead-whatever switch), we get a Slack notification.

yes, sounds good to me.

Member Author

Monitoring for Vector itself can't be addressed as part of this pull request, though. It has to be addressed as a follow-up.

Member

kk, that's fine ... let's do this though before we ship it to customers

also, I would really prefer to be notified if the cleanup loop is not working

Member Author

I get your point, but making sure that the cleanup script works reliably is neither easy nor elegant.

I would rather try to tackle this from the observability standpoint, and...

  1. Monitor the aggregator service to make sure it's running.
  2. Monitor storage usage metrics and alert when, e.g., less than 2% of space is free (see the sketch below).
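
For the second point, a sketch of the kind of check (the mount point and threshold are assumptions):

# Sketch: alert when the data volume has less than 2% free space
usage=$(df --output=pcent /data/vector | tail -n 1 | tr -dc '0-9')
if [ "$usage" -gt 98 ]; then
  echo "storage alert: /data/vector is ${usage}% full" >&2
  exit 1
fi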

@0x2b3bfa0 0x2b3bfa0 requested a review from shcheklein October 16, 2025 02:23
Member

@shcheklein shcheklein left a comment

there are still unresolved comments

we can't ship something that we are not monitoring; it was too painful for the team and for the customers :(

@0x2b3bfa0 0x2b3bfa0 requested a review from shcheklein October 16, 2025 02:36
@0x2b3bfa0
Member Author

there are still unresolved comments

True

we can't ship something that we are not monitoring; it was too painful for the team and for the customers :(

This pull request doesn't ship anything that wasn't already shipped, but fixes things that weren't already fixed. 😅

@0x2b3bfa0
Member Author

Merging this to unblock; I would have wanted to spend more time on it.

@0x2b3bfa0 0x2b3bfa0 merged commit b9c9f9f into main Oct 16, 2025
4 checks passed
@0x2b3bfa0 0x2b3bfa0 deleted the 0x2b3bfa0-patch-2 branch October 16, 2025 02:50
dreadatour added a commit that referenced this pull request Oct 17, 2025
@dreadatour dreadatour mentioned this pull request Oct 17, 2025