Recommendations to avoid model thrashing? #7937
Unanswered
TimothySeah asked this question in Q&A
I have a cluster of Triton servers. Each server loads a different model depending on the request it receives. However, because these models are large, there is a lot of "model thrashing": we waste time loading and unloading models that are too large to all fit in GPU memory at once. Is there a general/canonical solution to this? For example, is there an easy way to route requests requiring a specific model to pods that already have that model loaded? Thanks.
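A minimal sketch of the routing idea above, in case it helps frame the question: hash the model name to a stable pod index so every request for a given model lands on the same pod, which then keeps that model resident. This is not a built-in Triton feature; the pod addresses and the `pod_for_model` helper are hypothetical names for illustration.

```python
import hashlib

# Hypothetical Triton pod endpoints; replace with real cluster addresses.
TRITON_PODS = [
    "triton-0.triton.svc:8000",
    "triton-1.triton.svc:8000",
    "triton-2.triton.svc:8000",
]

def pod_for_model(model_name: str) -> str:
    """Deterministically map a model name to one pod.

    Every request for the same model hashes to the same pod, so that
    pod keeps the model loaded and the other pods never touch it.
    """
    digest = hashlib.sha256(model_name.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(TRITON_PODS)
    return TRITON_PODS[index]

# Example: all "resnet50" traffic goes to one pod, "bert" to another.
print(pod_for_model("resnet50"))
print(pod_for_model("bert"))
```

One caveat with plain modulo hashing: adding or removing a pod reshuffles most assignments and triggers a round of reloads. A consistent-hash ring would limit that churn to the models owned by the changed pod.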