Skip to content

Conversation

@GeigerJ2
Copy link

@GeigerJ2 GeigerJ2 commented Mar 25, 2025

Working off of #108, and pair-coded with @agoscinski 🫶

Given that one intends to pass a RemoteData object that references a file (not a directory) on a remote machine, e.g., in the following minimal working example:

import tempfile
from aiida import orm, load_profile
from aiida_shell import launch_shell_job

load_profile()

computer = orm.load_computer("localhost")

with tempfile.TemporaryDirectory() as tmp_dir:

    tmp_file = f"{tmp_dir}/file.txt"
    with open(tmp_file, "w") as f:
        f.write("foo")

    node = orm.RemoteData(computer=computer)
    node.set_remote_path(tmp_file)
    node.store()

    launch_shell_job(
        "echo",
        arguments="{remote}",
        nodes={"remote": node},
        metadata={"options": {"use_symlinks": True}},
    )

This would fail silently, with the reason for it being the instructions that are generated here:

for key, node in remote_data_nodes.items():
if key in filenames:
instructions.append((computer.uuid, node.get_remote_path(), filenames[key]))
else:
instructions.append((computer.uuid, f'{node.get_remote_path()}/*', '.'))

As one can see, if filenames is not given, the remote_path of the RemoteData node is appended with /*, leading to an instruction such as:

[('a09341db-9681-4676-8c83-f37bb72756bb', '/tmp/tmpvvnssdqj/file.txt/*', '.')]

hence, not linking/copying anything.

With this PR, if the RemoteDatas path points to a file, and a corresponding target filename in the working directory is not given, the actual filename is set as the default in handle_remote_data_nodes, as well as process_arguments_and_nodes (to resolve the arguments placeholder in the final CLI command to the actual filename).

While debugging, we were also struggling with some errors during symlinking originating from SFTP in paramiko called via the SshTransport, but I currently fail to reproduce those in a clean environment with an SSH-configured Computer.

@sphuber
Copy link
Owner

sphuber commented Mar 26, 2025

The problem here is not really one with the implementation in aiida-shell but with the RemoteData or at least its usage in your example. The RemoteData class is intended to model a directory on a remote computer, not just a file. See also its docstring:

https://github.com/aiidateam/aiida-core/blob/660fec70ef43a64be7edae3c12f0a0fd5ef84349/src/aiida/orm/nodes/data/remote/base.py#L9

Just like FolderData is for a directory on the local computer. For a single local file there is SinglefileData, there just isn't (yet) an equivalent for remote files yet. If you create a RemoteData where its path is a file, my bet is that this would fail in aiida-core as well in a bunch of places as it assumes that the root path is directory and not a file. So I think your example here is malformed. If you want to copy a remote file, you have to wrap its parent directory in a RemoteData and add that as input.

@GeigerJ2
Copy link
Author

Hi @sphuber, thanks for chiming in! Indeed, I do agree that the root cause is the way RemoteData is set up. Overall, it's a bit confusing, because a few lines below, in the class docstring (rather than the file docstring you linked), it is mentioned that the class can also be used to store the link to a file, see here:
https://github.com/aiidateam/aiida-core/blob/660fec70ef43a64be7edae3c12f0a0fd5ef84349/src/aiida/orm/nodes/data/remote/base.py#L29
While in the end, this is apparently not really supported.

We considered both options that are currently available when registering the parent directory of our file of interest, however, we were not happy with either:

  • Symlinking/copying all the contained files of that directory to the working directory of the process can be confusing to the user, as the parent directory might contain many (unrelated) files.
  • Symlinking/copying the entire directory under a folder given via filenames forces us to modify the final call to the executable/command line script from <command> input-file which is what the user would typically provide in the input file to our tool to <command> filename-directory/input-file. Of course we could adapt our code in a way that this setup is done automatically but, apart from possible confusion, it might not even work as the script might expect certain files to be in its run directory. I'm not familiar enough with the specific scripts that will be used in our application, but I will check back with our collaborators.

We also thought about extending aiida-core with RemoteSinglefileData, and/or possibly also changing RemoteData to RemoteFolderData and deprecating RemoteData (see here and here). However, for the latter option, backwards compatibility is a concern (and might also require a migration?), and, overall, as the release cycle in aiida-core is more infrequent, we saw the changes in this PR as the only quick way to achieve our goal for the moment. I do imagine that in aiida-core setting up RemoteData with a file path would break, if it is typically assumed to be a directory, but for our use case, the setup of this PR currently works. Though, as the aiida-core 2.7 release draws closer, I think it could be feasible that we at least add RemoteSinglefileData, while keeping the behavior of RemoteData as is, while long-term, one could change to RemoteFolderData and RemoteSingleFileData for consistency with the classes for local files.

@GeigerJ2
Copy link
Author

Just another update here: What we are trying to achieve actually does work when we do specify the appropriate filenames. So the changes in this PR are probably not necessary after all.

Though, if one doesn't want to force users to provide the filenames argument in aiida-shell in that case, the considerations of this PR might still be useful. As constructing instances of RemoteData is possible and doesn't except when providing a path to a file, I think one can expect people will still use it that way (as we did for our use case). At the version of #108, things fail silently (see also here). One might say this is the fault of the user, as RemoteData is not intended to be used that way but I see it problematic that this behavior isn't enforced in the class.

@sphuber
Copy link
Owner

sphuber commented Mar 27, 2025

Thanks for the updates @GeigerJ2 . Could you please sketch the exact example that you are trying to run? That would make it easier to understand what the real problem is and if there really is a problem in aiida-shell or aiida-core. As you seem to say that you can solve it using filenames, that would seem to indicate that there is no problem right? Using filenames is not a workaround or anything. It is there for a reason in any case 😄

@GeigerJ2
Copy link
Author

GeigerJ2 commented Apr 14, 2025

Hi @sphuber, thanks for your feedback, and sorry for the delay in my response here. Based on the test_nodes_remote_data_filename test you added, one can illustrate the issue with the modified test below. The main change from the original test is that the RemoteData objects are constructed from the paths to the remote files, not their parent folders:

def test_nodes_remote_files_filename(generate_calc_job, generate_code, tmp_path, aiida_localhost):
    """Test the ``nodes`` and ``filenames`` inputs with ``RemoteData`` nodes."""
    remote_path_a = tmp_path / 'remote_a' / 'file_a.txt'
    remote_path_b = tmp_path / 'remote_b' / 'file_b.txt'
    remote_path_a.parent.mkdir()
    remote_path_b.parent.mkdir()
    remote_path_a.write_text('content a')
    remote_path_b.write_text('content b')

    remote_data_a = RemoteData(remote_path=str(remote_path_a.absolute()), computer=aiida_localhost)
    remote_data_b = RemoteData(remote_path=str(remote_path_b.absolute()), computer=aiida_localhost)

    inputs = {
        'code': generate_code(),
        'arguments': ['{remote_a}'],
        'nodes': {
            'remote_a': remote_data_a,
            'remote_b': remote_data_b,
        },
        'filenames': {'remote_a': 'target_remote'},
    }
    dirpath, calc_info = generate_calc_job('core.shell', inputs)

    code_info = calc_info.codes_info[0]
    assert code_info.cmdline_params == ['target_remote']

    assert calc_info.remote_symlink_list == []
    assert sorted(calc_info.remote_copy_list) == [
        (aiida_localhost.uuid, str(remote_path_a), 'target_remote'),
        (aiida_localhost.uuid, str(remote_path_b / '*'), '.'),
    ]

Here, the last assert passes with the following sorted remote_copy_list:

[('aa2a9b45-00b4-4018-933b-ae9ca1f8cede',
  '/tmp/pytest-of-geiger_j/pytest-28/test_nodes_remote_data_files_f0/remote_a/file_a.txt',
  'target_remote'),
 ('aa2a9b45-00b4-4018-933b-ae9ca1f8cede',
  '/tmp/pytest-of-geiger_j/pytest-28/test_nodes_remote_data_files_f0/remote_b/file_b.txt/*',
  '.')]

The first instruction in the remote_copy_list does make sense, and it's what we obtain for our use case when we pass filenames, linking the remote file to a specific filename in the target working directory, therefore resolving our original issue. However, the second instruction doesn't make sense and is not captured before the actual submission of the job (see also here: aiidateam/aiida-core#6803). Thus, the only way the user notices this "wrong" behavior, is when their calculation crashes due to the missing symlink/copied over input files, making it hard to trace back.

Now, if one argues that the remote_path of a RemoteData must point to folders, not files, then this would just be plain wrong construction of the class on the side of the user and not anything we'd need to consider here. However, this is not enforced anywhere in aiida-core, and there are even tests where apparent file paths are being used for construction, e.g.,
https://github.com/aiidateam/aiida-core/blob/660fec70ef43a64be7edae3c12f0a0fd5ef84349/tests/tools/visualization/test_graph.py#L60.
Ideally, one would improve that in aiida-core (see here), but instead, this PR provides explicit handling for the case that a RemoteData that points to a file and doesn't have an associated entry in the filenames dict is contained, such that it is automatically added.

I extended the test_nodes_remote_data_filename test to also capture this behavior, see here:

def test_nodes_remote_data_filename(generate_calc_job, generate_code, tmp_path, aiida_localhost):
    """Test the ``nodes`` and ``filenames`` inputs with ``RemoteData`` nodes."""
    remote_path_a = tmp_path / 'remote_a'
    remote_path_b = tmp_path / 'remote_b'
    remote_path_a.mkdir()
    remote_path_b.mkdir()
    (remote_path_a / 'file_a.txt').write_text('content a')
    (remote_path_b / 'file_b.txt').write_text('content b')


    remote_path_c = tmp_path / 'remote_c' / 'file_c.txt'
    remote_path_d = tmp_path / 'remote_d' / 'file_d.txt'
    remote_path_c.parent.mkdir()
    remote_path_d.parent.mkdir()
    remote_path_c.write_text('content c')
    remote_path_d.write_text('content d')


    remote_data_a = RemoteData(remote_path=str(remote_path_a.absolute()), computer=aiida_localhost)
    remote_data_b = RemoteData(remote_path=str(remote_path_b.absolute()), computer=aiida_localhost)
    remote_data_c = RemoteData(remote_path=str(remote_path_c.absolute()), computer=aiida_localhost)
    remote_data_d = RemoteData(remote_path=str(remote_path_d.absolute()), computer=aiida_localhost)


    inputs = {
        'code': generate_code(),
        'arguments': ['{remote_a}'],
        'nodes': {
            'remote_a': remote_data_a,
            'remote_b': remote_data_b,
            'remote_c': remote_data_c,
            'remote_d': remote_data_d,
        },
        'filenames': {
            'remote_a': 'target_remote',
            'remote_c': 'target_remote_file',
        },
    }
    dirpath, calc_info = generate_calc_job('core.shell', inputs)


    code_info = calc_info.codes_info[0]
    assert code_info.cmdline_params == ['target_remote']


    assert calc_info.remote_symlink_list == []
    assert sorted(calc_info.remote_copy_list) == [
        (aiida_localhost.uuid, str(remote_path_a), 'target_remote'),
        (aiida_localhost.uuid, str(remote_path_b / '*'), '.'),
        (aiida_localhost.uuid, str(remote_path_c), 'target_remote_file'),
        (aiida_localhost.uuid, str(remote_path_d), 'file_d.txt'),
    ]

At this point, I'd say it's up to you if it's a use case you would like to include, or you deem it irrelevant because RemoteData is used in a way it's not intended to be used. Setting filenames fixes it for our current application, but it seems more like a side-effect of how the "source path" is being handled, rather than the explicitly desired behavior. If you think it's not necessary, feel free to close the PR. Cheers!

@sphuber
Copy link
Owner

sphuber commented May 7, 2025

Thanks for the detailed example. The problem indeed seems to boil down to the fact that aiida-shell is implemented such that the remote path of a RemoteData is assumed to be a directory. By default, the content of that directory is written to the working directory of the job, but filenames can be used to write the entire directory to a particular subdirectory inside the working dir.

This assumption breaks down if RemoteData.remote_path points to a file. In your example, you can abuse filenames to get the desired result, but evidently it fails for other cases. This is one of the reasons why I am hesitant to accept the "hack" to also allow RemoteData nodes referencing files because it further muddies the water. But not only that, in order to distinguish, the implementation needs to check whether the path is a file or directory and that requires opening a transport. I really don't want to do this, because doing so goes outside of the protections of the transport pooling mechanism in AiiDA's engine and so carries a number of risks. In the current state, we shouldn't be opening transports in calculation and parser plugins.

That is why I really think this should first be fixed on the side of aiida-core. I think it makes perfect sense to complete the analogy of SinglefileData and FolderData and create the equivalent versions on the remote. Then it would be easy for aiida-shell to handle the RemoteSinglefileData case.

Finally, instead of this workaround here. Is there no way in your use case to just use the directory of the target file as the path of your RemoteData? I gather that this is problematic because that directory contains multiple files and you only have to get 1 and not the others. But surely you could perhaps try to make sure that whatever program writes that file, does so in a specific folder?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants