Increase Vector storage and fix garbage collector #749
Conversation
```yaml
# -- Vector arguments.
args:
  - |-
    while sleep 60; do find /data/vector/logs -type f -mtime +7 -delete; done &
```
There are also `/events` and `/metrics`, and I forgot to clean those up, so we were running out of storage.
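A sketch of what the fix could look like, assuming the same 7-day retention should apply to all three directories (the `/data/vector/...` paths follow the diff and comment above):

```sh
# Cleanup loop covering all three Vector data directories, not just logs.
while sleep 60; do
  for dir in /data/vector/logs /data/vector/events /data/vector/metrics; do
    find "$dir" -type f -mtime +7 -delete
  done
done &
```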
```sh
    # HACK: assumes directory structure (single level nesting) and all files prefixed with timestamp
    rm -f "$(find "$STORAGE" -type f | sort -t/ -k5 | tail -n1)"
  done
done &
```
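As a hedged reading of the hack: `sort -t/ -k5` orders the found paths by their fifth slash-delimited component, which, under the assumed layout, is the timestamp-prefixed file name, so `tail -n1` picks the file that sorts last and hands it to `rm -f`.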
are we getting notified if it exits with non-zero or is killed by the kernel?
> are we getting notified if it exits with non-zero

No; if you're asking specifically about `rm -f`, it will exit 0 even if there's no data to delete. 😅

> or is killed by the kernel

If you mean SIGKILL from e.g. out-of-memory: no, we won't. The good news, though, is that the whole pod will be killed, and that will be reflected in Kubernetes.
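For instance, an OOM kill shows up in the pod's container status; a standard way to check (the pod name is a placeholder):

```sh
kubectl get pod <vector-pod> \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
# prints e.g. "OOMKilled" if the kernel killed the container
```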
Never mind, wrong answer: no, we aren't (and can't easily be) notified about failures here. What we can do is at least print logs.
no, I mean exits with non-zero for whatever reason. I don't want to analyze each command and see whether it can exit non-zero or not; I assume it will exit with non-zero at some point, for some reason, and we MUST handle this: make sure the pod restarts, make sure we have logs (for the command itself), and make sure we get notified.
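One hedged sketch of handling this: let the cleanup loop's exit take the whole container down, so the pod restarts and the failure is visible. Killing PID 1 assumes Vector ends up as PID 1 via the `exec` below and that it responds to SIGTERM:

```sh
(
  while sleep 60; do
    find /data/vector/logs -type f -mtime +7 -delete
  done
  # The loop should never exit; if it does, log it and take the pod down.
  echo "cleanup loop exited unexpectedly" >&2
  kill 1
) &
exec /usr/local/bin/vector --config-dir /etc/vector/
```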
Update: upon further thought, we shouldn't handle this specially; health checks should rely exclusively on Vector and, if it fails to write logs for whatever reason, report it.
how / where are we getting notified?
let's put a `kill` or `exit` on the 10th iteration to experiment
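A minimal version of that experiment, with a hypothetical `i` counter added to the loop from the diff:

```sh
i=0
while sleep 60; do
  i=$((i + 1))
  # Simulate a crash on the 10th iteration to verify we actually notice.
  [ "$i" -ge 10 ] && exit 1
  find /data/vector/logs -type f -mtime +7 -delete
done &
```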
The whole cleanup script can fail, and I wouldn't worry too much about it; in the extreme case where the whole loop crashes (unlikely, per the logic) and the pod remains active, the storage will fill and we'll notice.
> the storage will fill and we'll notice.

how?
also, I would prefer to notice that the cleanup is failing ... otherwise this system will be losing logs, AFAIU
```sh
  done
done &
exec /usr/local/bin/vector --config-dir /etc/vector/
```
same question - what if it exits?
can we add logs to the whole entrypoint? also do `set -uex` to see what is happening in case it fails
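What that could look like at the top of the entrypoint, as a sketch (note that `set -e` does not propagate into the backgrounded loop; it only guards the foreground commands):

```sh
# -u: error on unset variables; -e: exit on failure; -x: trace every command.
set -uex
while sleep 60; do
  find /data/vector/logs -type f -mtime +7 -delete
done &
exec /usr/local/bin/vector --config-dir /etc/vector/
```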
> same question - what if it exits?

If this exits, the whole pod will restart, and Kubernetes will be aware.
> and Kubernetes will be aware.

we also need to be aware
This boils down to a more general problem: we can't [ish] observe the observability tool. 😅
The only way I can think of to guarantee we notice an outage here is to monitor the health of Vector from the DataChain master pod health check; so if something isn't working (i.e. a dead-whatever switch), we get a Slack notification.
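A sketch of such a check from the DataChain master pod, assuming Vector's API is enabled and reachable at `vector:8686` (the service name and port are assumptions; `/health` is the Vector API's health endpoint):

```sh
if ! curl -fsS --max-time 5 http://vector:8686/health > /dev/null; then
  # Surface the failure so the existing alerting can turn it into a Slack ping.
  echo "Vector health check failed" >&2
  exit 1
fi
```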
It's becoming less of an observability tool and more of a tool that customers rely on to deal with logs. Thus we must have a way to observe its health.

> The only way I can think of to guarantee we notice an outage here is to monitor the health of Vector from the DataChain master pod health check; so if something isn't working (i.e. a dead-whatever switch), we get a Slack notification.

yes, sounds good to me.
Monitoring for Vector itself can't be addressed as part of this pull request, though. It has to be addressed as a follow-up.
kk, that's fine ... let's do this, though, before we ship it to customers.
also, I would really prefer to be notified if the cleanup loop is not working
I get your point, but making sure that the cleanup script works reliably is neither easy nor elegant.
I would rather try to tackle this from the observability standpoint, and...
- Monitor the aggregator service to make sure it's running.
- Monitor storage usage metrics for e.g. less than 2% free space (see the sketch after this list).
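A sketch of that storage check, assuming `/data/vector` is the volume to watch, GNU `df` is available, and a non-zero exit feeds whatever alerting is wired up:

```sh
# Alert when the Vector data volume has less than 2% free space.
used=$(df --output=pcent /data/vector | tail -n1 | tr -dc '0-9')
if [ "$used" -ge 98 ]; then
  echo "Vector storage nearly full: ${used}% used" >&2
  exit 1
fi
```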
there are still unresolved comments
we can't ship something we are not monitoring; it was too painful for the team and for the customers :(
This pull request doesn't ship anything that wasn't already shipped, but fixes things that weren't already fixed. 😅
Merging this to unblock; I would have wanted to spend more time on it.
This reverts commit b9c9f9f.
It had to be cache invalidation. 🙈