Description
Support for using replicas to share repetitive pod configuration in kubernetes_scheduler was removed in f6907e8
The rationale is here
Unfortunately, in a large setup the job manifest can easily exceed the default etcd request size limit of 1.5 MiB, and submission fails with: etcdserver: request is too large
It's not always possible to raise max-request-bytes, e.g. on AWS EKS.
Currently both job-specific environment variables and TorchX's own environment variables contribute to exceeding this limit.
We would like to find a way to make replicas work again so that the job manifest stays small.
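To illustrate why the limit is hit, here is a rough sketch of how manifest size grows when every pod carries its own copy of the environment. The pod layout, names, image and variable values are hypothetical and much simpler than what kubernetes_scheduler actually emits. Only the 1,572,864-byte figure is the etcd default for max-request-bytes.

```python
import json

# etcd default for --max-request-bytes: 1.5 MiB.
ETCD_DEFAULT_MAX_REQUEST_BYTES = 1_572_864


def pod_entry(index, env):
    """Hypothetical, simplified per-pod entry; real manifests have more fields."""
    return {
        "name": f"trainer-{index}",
        "containers": [
            {
                "name": "main",
                "image": "example.com/trainer:latest",  # placeholder image
                "env": [{"name": k, "value": v} for k, v in env.items()],
            }
        ],
    }


def manifest_size(num_pods, env):
    """Serialized size in bytes when the env is repeated for every pod."""
    manifest = {"tasks": [pod_entry(i, env) for i in range(num_pods)]}
    return len(json.dumps(manifest).encode("utf-8"))


# 30 made-up variables with 40-character values.
env = {f"VAR_{i:02d}": "x" * 40 for i in range(30)}

for pods in (10, 2000):
    size = manifest_size(pods, env)
    print(pods, size, size > ETCD_DEFAULT_MAX_REQUEST_BYTES)
```

Because the environment is serialized once per pod, the size grows linearly with the number of pods, so a large enough job will cross the limit no matter how the individual variables are trimmed.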
Motivation/Background
Increase the maximum cluster size we can support on Kubernetes.
Detailed Proposal
For example, use a ConfigMap with per-node or per-role configuration, or the Downward API. This would take advantage of the fact that roles with many replicas share a large part of their configuration.
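A minimal sketch of this idea follows, using plain dictionaries rather than any TorchX or Kubernetes client API. The helper names, the shape of the task entry with a replicas count, and the variable names are assumptions for illustration. The envFrom/configMapRef and valueFrom/fieldRef fields are standard Kubernetes pod spec fields.

```python
import json


def build_shared_config(role_name, shared_env):
    """ConfigMap holding the variables common to every replica of a role."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": f"{role_name}-env"},
        "data": dict(shared_env),
    }


def build_pod_template(role_name, image):
    """One pod template per role; per-pod values come from the Downward API."""
    return {
        "metadata": {"labels": {"role": role_name}},
        "spec": {
            "containers": [
                {
                    "name": role_name,
                    "image": image,
                    # Shared variables are referenced, not inlined.
                    "envFrom": [{"configMapRef": {"name": f"{role_name}-env"}}],
                    # Values that differ per pod are resolved at runtime.
                    "env": [
                        {
                            "name": "POD_NAME",
                            "valueFrom": {"fieldRef": {"fieldPath": "metadata.name"}},
                        },
                        {
                            "name": "POD_IP",
                            "valueFrom": {"fieldRef": {"fieldPath": "status.podIP"}},
                        },
                    ],
                }
            ]
        },
    }


def build_role_task(role_name, image, replicas):
    """Hypothetical task entry: one template plus a replica count."""
    return {
        "name": role_name,
        "replicas": replicas,
        "template": build_pod_template(role_name, image),
    }


shared_env = {f"VAR_{i:02d}": "x" * 40 for i in range(30)}  # made-up values
config_map = build_shared_config("trainer", shared_env)
task = build_role_task("trainer", "example.com/trainer:latest", replicas=2000)
print(len(json.dumps(task)), len(json.dumps(config_map)))
```

With this layout the task entry stays the same size regardless of the replica count, and the shared variables are stored once. Two caveats apply: a ConfigMap has its own size limit (1 MiB of data), and this sketch does not cover whatever per-replica settings led to the original removal of replicas.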
Alternatives
Avoid environment variables and long names anywhere in the configuration. Even then, the maximum supported cluster size would on average be significantly smaller than with replicas.
Additional context/links