When using model with Pyspark on worker machine #103

Open
LovAsawa-Draup opened this issue Jan 17, 2025 · 0 comments
The current implementation of the OpusMT model loading within the EasyNMT library uses the following approach:
```python
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)
```
However, this approach does not allow specifying a custom cache directory for model storage. The problem arises when deploying the model in a distributed environment, such as the worker nodes of a Spark cluster. By default, models are downloaded to the Hugging Face cache directory (/home/.cache). While the master node typically has write permissions for this directory, worker nodes often lack write access to /home/.

As a result, when the model is initialized on worker nodes, they attempt to download the model to the same default location, leading to permission errors.
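One way to work around this without code changes is to point the Hugging Face cache at a writable location on every executor via environment variables. The sketch below assumes a Spark standalone/YARN deployment and a shared writable path `/mnt/shared/hf_cache` (a hypothetical path for illustration); `spark.executorEnv.[Name]` is Spark's standard mechanism for setting executor environment variables, and `HF_HOME` is the environment variable the Hugging Face libraries consult for their cache root.

```shell
# Hypothetical spark-submit invocation: redirect the Hugging Face
# cache on both driver and executors to a directory all nodes can write.
spark-submit \
  --conf "spark.executorEnv.HF_HOME=/mnt/shared/hf_cache" \
  --conf "spark.driverEnv.HF_HOME=/mnt/shared/hf_cache" \
  translate_job.py
```

This avoids touching the EasyNMT source, but it only works if the deployment allows setting executor environment variables.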

Proposed Solution:
To avoid permission issues and ensure proper model distribution across worker nodes, the cache directory should be explicitly set during model initialization. The cache_dir parameter can be passed directly to the from_pretrained() method, ensuring models are downloaded and cached in a specified directory accessible by all nodes.
