Skip to content

Forward args to _get_remote_config() and honour core/no_scm if present #10719

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Apr 30, 2025

Conversation

rgoya
Copy link
Contributor

@rgoya rgoya commented Apr 10, 2025

This is a proposed fix for #10608, the code here makes steps 9 and 10 described in the issue work.

Summary:

This change allows a user to access the dvc information in an environment that is disconnected from the original Git backend (e.g. in a deployed container, see #10608), by using something like:

dvc.api.get_url(path,
                repo,
                config={"core": {"no_scm": True}}
                )

Description:

Mainly, a call to dvc/repo/open_repo.py:open_repo(url, *args, **kwargs) may contain a parameter config in **kwargs. With this config a user might indicate they do not want to access the repo with Git support, by using config={"core": {"no_scm": True}}.

During the execution of dvc/repo/open_repo.py:open_repo(), there is a call to a function dvc/repo/open_repo.py:_get_remote_config() that returns the remote configuration({"core": {"remote"}}. This is then merged to the user provided config parameter before calling Repo(url, *args, **kwargs).

dvc/repo/open_repo.py:_get_remote_config(), in turn, does a quick Repo() call to get the remote configuration. However, it does not use any of the parameters requested via dvc/repo/open_repo.py:open_repo() and thus relies entirely on the contents of .dvc/config. This means that even if the user requested no SCM support, it will try to look for a Git repo if .dvc/config says so, and fail if it does not find it.

This PR modifies dvc/repo/open_repo.py:_get_remote_config() to receive *args, **kwargs and honour the request to use or ignore Git support when accessing the dvc repo.

Thank you for the contribution - we'll try to review it as soon as possible. 🙏

Copy link

codecov bot commented Apr 10, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 91.08%. Comparing base (2431ec6) to head (0712ad2).
Report is 45 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #10719      +/-   ##
==========================================
+ Coverage   90.68%   91.08%   +0.39%     
==========================================
  Files         504      504              
  Lines       39795    39970     +175     
  Branches     3141     3158      +17     
==========================================
+ Hits        36087    36405     +318     
+ Misses       3042     2938     -104     
+ Partials      666      627      -39     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Comment on lines 104 to 106
# It seems some tests might be passing a 'config' key that is not a dict
if not isinstance(user_config, dict):
user_config = {}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking into this, some tests send kwargs = {'config': None, ...; this safeguard protects against this.


if no_scm_flag is not None:
# Honour specific SCM treatment if requested in the call
repo = Repo(url, config={"core": {"no_scm": no_scm_flag}})
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIR, Repo(config=...) should just work.

Suggested change
repo = Repo(url, config={"core": {"no_scm": no_scm_flag}})
repo = Repo(url, config=kwargs.get("config"))

I don't want to specialize core.no_scm in any way or handle it.

Copy link
Contributor Author

@rgoya rgoya Apr 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that it doesn't feel ideal to handle core.no_scm itself. Your solution makes sense and it works for my specific use case, but it triggers other errors in the dvc test suite; which makes me think there are other non-core.no_scm configuration options that are being used that _get_remote_config() doesn't like.

(I went with the core.no_scm specific approach to highlight the need.)

These are the errors I get when using repo = Repo(url, config=kwargs.get("config")):

FAILED tests/func/test_import.py::test_import_no_hash[files1-expected_info_calls1] - dvc_data.index.index.DataIndexDirError: failed to load directory ('dir',)
FAILED tests/func/test_repo_index.py::test_data_index - dvc_data.index.index.DataIndexDirError: failed to load directory ('edir',)
FAILED tests/func/repro/test_repro_pull.py::test_repro_pulls_missing_import - dvc.exceptions.ReproductionError: failed to reproduce 'foo.dvc'
FAILED tests/func/test_data_cloud.py::test_pull_external_dvc_imports - FileNotFoundError: [Errno 2] No storage files available: 'foo'
FAILED tests/func/test_data_cloud.py::test_pull_external_dvc_imports_mixed - FileNotFoundError: [Errno 2] No storage files available: 'foo'
FAILED tests/func/repro/test_repro.py::test_repro_pulls_missing_import - dvc.exceptions.ReproductionError: failed to reproduce 'foo.dvc'
FAILED tests/func/test_import.py::test_import_dir - dvc_data.index.index.DataIndexDirError: failed to load directory ('dir',)
FAILED tests/func/test_import.py::test_import_file_from_dir - dvc_data.index.index.DataIndexDirError: failed to load directory ('dir',)
FAILED tests/func/test_import.py::test_import_file_from_dir_to_dir - dvc_data.index.index.DataIndexDirError: failed to load directory ('dir',)
FAILED tests/func/test_import.py::test_import_rev - FileNotFoundError: [Errno 2] No storage files available: 'foo'
FAILED tests/func/test_import.py::test_pull_imported_stage - dvc.exceptions.CheckoutError: Checkout failed for following targets:
FAILED tests/func/test_import.py::test_pull_import_no_download - FileNotFoundError: [Errno 2] No such file or directory: '/private/var/folders/_3/h8jc4f6d5gg6dwk7t6464_z80000gp/T/pytest-of-rodrigo.goya/pytest-17/popen-gw13/test_pull_import_no_download0/.dvc/cache/fs/local/e3501e821bcee8f40107794afbe767d1/.F0VDC8H_fGFnvnEcrt...
FAILED tests/func/test_import.py::test_pull_import_no_download_rev_lock - dvc.exceptions.DownloadError: 1 files failed to download
FAILED tests/func/test_import.py::test_pull_imported_directory_stage[dir] - dvc_data.index.index.DataIndexDirError: failed to load directory ('dir',)
FAILED tests/func/test_import.py::test_pull_imported_directory_stage[dir/] - dvc_data.index.index.DataIndexDirError: failed to load directory ('dir',)
FAILED tests/func/test_import.py::test_pull_wildcard_imported_directory_stage - dvc_data.index.index.DataIndexDirError: failed to load directory ('dir123',)
FAILED tests/func/test_update.py::test_update_import[True] - FileNotFoundError: [Errno 2] No storage files available: 'version'
FAILED tests/func/test_import.py::test_pull_non_workspace - FileNotFoundError: [Errno 2] No storage files available: 'foo'
FAILED tests/func/test_update.py::test_update_import_after_remote_updates_to_dvc - FileNotFoundError: [Errno 2] No storage files available: 'version'
FAILED tests/func/test_import.py::test_import_with_jobs - dvc_data.index.index.DataIndexDirError: failed to load directory ('dir1',)

@rgoya rgoya requested a review from skshetry April 10, 2025 07:51
@rgoya rgoya force-pushed the respect_no_scm_in_open_repo_call branch 2 times, most recently from 91cf291 to f598ecd Compare April 26, 2025 06:38
@rgoya rgoya force-pushed the respect_no_scm_in_open_repo_call branch from f598ecd to 652560a Compare April 26, 2025 07:03
@rgoya
Copy link
Contributor Author

rgoya commented Apr 26, 2025

AFAIR, Repo(config=...) should just work.
- repo = Repo(url, config={"core": {"no_scm": no_scm_flag}})
+ repo = Repo(url, config=kwargs.get("config"))
I don't want to specialize core.no_scm in any way or handle it.

@skshetry, Ok, I finally got some time to track down the errors.

I took your suggestion of using something closer to Repo(url, config=kwargs.get("config")).

I tracked the errors that were triggered by that call to the way that dvc manages imports. Mainly, it creates a DVCFileSystem for a remote repo using a local cache (see issue #9385 / fix #9423 ). This means that when _get_remote_config() is eventually called, it receives the url of the remote repo, but a kwargs["config"]["cache"] that points to the local cache.

Since the function of _get_remote_config() is to get the actual remote's cache location, in order for dvc import to work, _get_remote_config() needs to ignore anything sent in kwargs["config"]["cache"].

I modified the code to to the above, and added some comments to the code for future reference.

I also added a related test to the test suite.

Please let me know what you think.


Note: FYI, I've been seeing some flakiness in the test suite from what seems to be a race condition in the logs, where a message of DEBUG s3fs:utils.py:80 RC: discarding all clients sometimes sneaks in and fails the assert, e.g.:

>       assert first(caplog.messages) == expected_message
E       assert 'RC: discarding all clients' == "Output 'path... 'stage.dvc'."
E         
E         + RC: discarding all clients
E         - Output 'path'(stage: 'stage.dvc') is missing version info. Cache for it will not be collected. Use `dvc repro` to get your pipeline up to date.
E         - You can also use `dvc commit stage.dvc` to associate existing 'path' with stage: 'stage.dvc'.

tests/unit/output/test_output.py:106: AssertionError
----------------------------- Captured stderr call -----------------------------
WARNING: Output 'path'(stage: 'stage.dvc') is missing version info. Cache for it will not be collected. Use `dvc repro` to get your pipeline up to date.
You can also use `dvc commit stage.dvc` to associate existing 'path' with stage: 'stage.dvc'.
------------------------------ Captured log call -------------------------------
DEBUG    s3fs:utils.py:80 RC: discarding all clients
WARNING  dvc.output:output.py:1160 Output 'path'(stage: 'stage.dvc') is missing version info. Cache for it will not be collected. Use `dvc repro` to get your pipeline up to date.
You can also use `dvc commit stage.dvc` to associate existing 'path' with stage: 'stage.dvc'.

@skshetry
Copy link
Collaborator

Great research, @rgoya. I have taken a quick look at the PR and it looks good. Please give me some time (maybe this week) to look at this more closely.

Copy link
Collaborator

@skshetry skshetry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for contributing! The way RepoDependency sets the cache looks fragile and we should figure out a better solution in the future. Your current solution is practical and works well within the constraints.

I also appreciate the effort you’ve put into adding detailed comments and a very good test.

That said, I want to note that core.no_scm may not always work reliably in all scenarios. Configuration options like this are contextual, so they might not work on every cases.

I’m happy to approve this. Thank you for your time and effort in improving the project! 🙂

@skshetry skshetry merged commit 4bde0e7 into iterative:main Apr 30, 2025
36 of 41 checks passed
@rgoya
Copy link
Contributor Author

rgoya commented Apr 30, 2025

Thank you for contributing! The way RepoDependency sets the cache looks fragile and we should figure out a better solution in the future. Your current solution is practical and works well within the constraints.

I also appreciate the effort you’ve put into adding detailed comments and a very good test.

That said, I want to note that core.no_scm may not always work reliably in all scenarios. Configuration options like this are contextual, so they might not work on every cases.

I’m happy to approve this. Thank you for your time and effort in improving the project! 🙂

Thank you, @skshetry. It's been quite illustrative peeking into the dvc code. Thanks for the heads up on core.no_scm, I'll keep an eye out for that; if something breaks I might come back with another issue ;)

@rgoya
Copy link
Contributor Author

rgoya commented May 5, 2025

@skshetry, incidentally, do you have a defined release schedule? (Just wondering when this change would make it into my conda-environment)

@skshetry
Copy link
Collaborator

skshetry commented May 6, 2025

@skshetry, incidentally, do you have a defined release schedule? (Just wondering when this change would make it into my conda-environment)

Hopefully, by Wednesday. If not, then I'll release by the end of next week (I am out till early next week).

@rgoya
Copy link
Contributor Author

rgoya commented May 6, 2025

@skshetry, incidentally, do you have a defined release schedule? (Just wondering when this change would make it into my conda-environment)

Hopefully, by Wednesday. If not, then I'll release by the end of next week (I am out till early next week).

Alright! I'll keep an eye out for it. Thanks!

skshetry added a commit that referenced this pull request May 6, 2025
This `_get_remote_config()` should use `uninitialized=True` so that it supports
more broader kinds of broken/partially-initialized repositories, including some that have
`.dvc` directory missing, or `.git` directory missing.

This partially reverts #10719. #10608 is also fixed, and no longer requires `core.no_scm` to be passed.
This was already supported by `dvc.api.get_url` as it uses `uninitialized=True`, but this was not respected
in `_get_remote_config()` where it would fail before.

That said, this whole `open_repo`/`_get_remote_config` is terribly broken. For one, it is opening
a local repository, and forcing it's remote config to a repository opened with `Repo(rev=...)`,
where the config may be different.
skshetry added a commit that referenced this pull request May 6, 2025
This `_get_remote_config()` should use `uninitialized=True` so that it supports
more broader kinds of broken/partially-initialized repositories, including some that have
`.dvc` directory missing, or `.git` directory missing.

This partially reverts #10719. #10608 is also fixed, and no longer requires `core.no_scm` to be passed.
This was already supported by `dvc.api.get_url` as it uses `uninitialized=True`, but this was not respected
in `_get_remote_config()` where it would fail before.

That said, this whole `open_repo`/`_get_remote_config` is terribly broken. For one, it is opening
a local repository, and forcing it's remote config to a repository opened with `Repo(rev=...)`,
where the config may be different.
skshetry added a commit that referenced this pull request May 6, 2025
This `_get_remote_config()` should use `uninitialized=True` so that it supports
more broader kinds of broken/partially-initialized repositories, including some that have
`.dvc` directory missing, or `.git` directory missing.

This partially reverts #10719. #10608 is also fixed, and no longer requires `core.no_scm` to be passed.
This was already supported by `dvc.api.get_url` as it uses `uninitialized=True`, but this was not respected
in `_get_remote_config()` where it would fail before.

That said, this whole `open_repo`/`_get_remote_config` is terribly broken. For one, it is opening
a local repository, and forcing it's remote config to a repository opened with `Repo(rev=...)`,
where the config may be different.
skshetry added a commit that referenced this pull request May 6, 2025
This `_get_remote_config()` should use `uninitialized=True` so that it supports
more broader kinds of broken/partially-initialized repositories, including some that have
`.dvc` directory missing, or `.git` directory missing.

This partially reverts #10719. #10608 is also fixed, and no longer requires `core.no_scm` to be passed.
This was already supported by `dvc.api.get_url` as it uses `uninitialized=True`, but this was not respected
in `_get_remote_config()` where it would fail before.

That said, this whole `open_repo`/`_get_remote_config` is terribly broken. For one, it is opening
a local repository, and forcing it's remote config to a repository opened with `Repo(rev=...)`,
where the config may be different.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants