When using model with Pyspark on worker machine #103

Open
LovAsawa-Draup opened this issue Jan 17, 2025 · 0 comments
The current implementation of the OpusMT model loading within the EasyNMT library uses the following approach:
```python
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)
```
However, this approach does not allow specifying a custom cache directory for model storage. The problem arises when deploying the model in a distributed environment, such as the worker nodes of a Spark cluster. By default, models are downloaded to the Hugging Face cache directory (/home/.cache). While the master node typically has write permissions for this directory, worker nodes often lack write access to /home/.

As a result, when the model is initialized on worker nodes, they attempt to download the model to the same default location, leading to permission errors.
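One way to work around this without code changes is to point the Hugging Face cache at a writable location on every executor via environment variables. The sketch below assumes a Spark standalone/YARN deployment and a shared writable path `/mnt/shared/hf_cache` (a hypothetical path for illustration); `spark.executorEnv.[Name]` is Spark's standard mechanism for setting executor environment variables, and `HF_HOME` is the environment variable the Hugging Face libraries consult for their cache root.

```shell
# Hypothetical spark-submit invocation: redirect the Hugging Face
# cache on both driver and executors to a directory all nodes can write.
spark-submit \
  --conf "spark.executorEnv.HF_HOME=/mnt/shared/hf_cache" \
  --conf "spark.driverEnv.HF_HOME=/mnt/shared/hf_cache" \
  translate_job.py
```

This avoids touching the EasyNMT source, but it only works if the deployment allows setting executor environment variables.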

Proposed Solution:
To avoid permission issues and ensure proper model distribution across worker nodes, the cache directory should be explicitly set during model initialization. The cache_dir parameter can be passed directly to the from_pretrained() method, ensuring models are downloaded and cached in a specified directory accessible by all nodes.
