Make gix_path::env::shell() more robust and use it in gix-command #1862

Merged
merged 15 commits into from
Mar 20, 2025

Conversation

EliahKagan
Member

@EliahKagan EliahKagan commented Feb 23, 2025

Tasks since #1862 (comment):

  • Use the shim sh.exe when available
  • Fall back to the non-shim sh.exe
  • Test locally to ensure the failures still go away with the shim
  • Only use inferred (git root) if it looks like Git for Windows and looks safe
  • Fix unrelated test-fixtures-windows breakage (#1849, #1870) and rebase
  • Fix unrelated test-32bit breakage (#1874) and rebase
  • Retest locally after all nontrivial rebasing
  • Investigate local failures in #1864 with non-shim, in case relevant here

Outscoped:

(On all the above, see "What it means for this PR" in #1862 (comment).)

The original PR description follows.


This makes gix_path::env::shell() more likely to work correctly on Windows, then attempts to use it in gix_command::Prepare when running a command with a shell due to use_shell being set to true. Full details are in the commit messages, though the changes are also summarized here.

I think this is not ready to merge yet, due to failures that happen when running the tests locally from PowerShell, but that I cannot produce otherwise. I do not know what causes these failures, and I expect I may have difficulty getting this to a point where it would be ready to merge without input. Since the failures happen in a test that was fairly recently added, I figured you might just know what is going on.

I think shell() can and should be used for gix-command. But if not then the reason will be relevant to other potential uses of shell(). In that case, assuming they leave enough good uses for shell() that it is kept, I hope to discover and articulate those reasons in the documentation for shell(). But I think most likely there is a bug or wrong test expectation somewhere that accounts for the local test failure, and that either the current implementation of shell() with the modifications included here, or something much like it, is suitable for broad use.

Making gix_path::env::shell() more robust

This makes gix_path::env::shell() more likely to work correctly on Windows by:

  • Checking that the path to a sh.exe associated with Git for Windows actually exists (and is either a regular file or a symbolic link that, when fully dereferenced, gives a file), because if that is not the case, then the fallback of just looking up sh.exe using a PATH search is more likely to succeed.

    This change is needed to use shell() more widely without breaking things on systems where sh.exe is available and usable but cannot be found in the usual location where Git for Windows supplies it.

  • Using / rather than \ directory separators when appending usr, bin, and sh.exe components to the Git for Windows installation directory path obtained via git --exec-path and going up three components. This causes the path through which sh.exe is run to use all / separators, except in the unusual situation that GIT_EXEC_PATH has been set to a path with \ separators. (This is unusual because people don't usually set GIT_EXEC_PATH. It would probably be best, when setting it on Windows, to use / separators, but I do not actually know how often people do that.)

    I believe this is only a minor improvement, in view of this path not being automatically passed through to the shell, as described in #1840 (reply in thread). But this change also makes tests in gix-command and gix-credentials pass that would otherwise require more extensive modification related to \ escaping, when modifying gix-command to use shell().
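
As a rough illustration of the two changes above (this is not the code in gix-path): the helper names sh_path_from_install_dir and shell_or_fallback are hypothetical, and the installation directory is passed in directly rather than derived from git --exec-path.

```rust
use std::ffi::OsString;
use std::path::Path;

/// Hypothetical helper: append `usr/bin/sh.exe` to an installation directory
/// literally, with `/` separators, instead of using `Path::join` (which would
/// insert `\` on Windows).
fn sh_path_from_install_dir(install_dir: &str) -> OsString {
    // Drop any trailing separator so the result never contains `//`.
    let trimmed = install_dir.trim_end_matches(|c| c == '/' || c == '\\');
    let mut path = OsString::from(trimmed);
    path.push("/usr/bin/sh.exe");
    path
}

/// Hypothetical helper: use the candidate only if it exists as a file.
/// `Path::is_file` follows symbolic links, so a link that resolves to a file
/// is accepted. Otherwise fall back to a bare name for a `PATH` search.
fn shell_or_fallback(candidate: OsString) -> OsString {
    if Path::new(&candidate).is_file() {
        candidate
    } else {
        OsString::from("sh.exe")
    }
}
```

With this shape, a missing or broken installation path degrades to the old behavior of searching PATH for sh.exe instead of producing a path that cannot be run.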

Using gix_path::env::shell() in gix-command

Before the changes in this PR, gix-command uses the relative path sh.exe on Windows. This works a significant fraction of the time already. In particular, the problems with the relative path bash.exe that make it frequently unusable to find a non-WSL bash on Windows, described in #1359 (comment), do not apply to sh.exe, since there is no Microsoft-provided sh.exe related to WSL. However:

  • Such an sh.exe is not always the best choice when present.
  • There may be no sh.exe that can be found in PATH. For example, if the only sh.exe is supplied by Git for Windows, and neither of the Git for Windows directories that contain an sh.exe binary is in PATH, then before the changes in this PR, a shell will not be found unless shell_program() is called (every time a command with use_shell is run).

Therefore, this uses shell(), with the improvements described above, in gix-command.
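
Conceptually, the change amounts to swapping the program that the script is handed to. Here is a minimal sketch using only std::process::Command, not the gix_command API: shell_command is a hypothetical helper, and the -c <script> -- argument shape is an assumption taken from the debug output that appears later in this thread.

```rust
use std::ffi::OsStr;
use std::process::Command;

/// Hypothetical helper: prepare `shell -c <script> --` without spawning it.
/// Before this PR the `shell` argument would effectively be a bare name such
/// as `sh`; with this PR it would be the path that `shell()` reports.
fn shell_command(shell: &OsStr, script: &str) -> Command {
    let mut cmd = Command::new(shell);
    cmd.arg("-c").arg(script).arg("--");
    cmd
}
```

Only the program differs between the two cases; the script text and the remaining arguments are the same.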


As alluded to above, I think this is not ready yet. All tests pass on CI, as well as when run locally from Git Bash. But when running them locally in PowerShell, gix-merge::merge blob::platform::merge::with_external fails. The failure does not require GIX_TEST_IGNORE_ARCHIVES to be set, and this PR does not currently modify any test fixture scripts.

Windows failures that occur when tests are run from PowerShell but not from Git Bash have usually been due to #1359 and are thus not expected since #1712, where gix-testtools finds bash.exe associated with Git for Windows and uses it to run fixtures. But that fix didn't affect the way gix-command uses a shell, so it does not provide for the kind of shell invocations occurring here.

Still, it doesn't make sense to me why the failure happens, and why it happens only when I run the tests locally from PowerShell. On this system, the sh command found in PATH, including when accessed from PowerShell, is a shim for the same shell that, as of this PR, gix-path finds and gix-command uses. Furthermore, the change here makes gix-path find sh.exe more like the way gix-testtools finds bash.exe, both of which are Bash when provided by Git for Windows.

Furthermore, while most of my local testing has been with Git for Windows 2.48.1, I have verified that the test fails the same way on the same system with Git for Windows 2.47.1(2), and that the experiment described below to examine the details of the failure also gives the same results with that version. This is to say that it seems like this failure is not related, even conceptually, to #1849.

C:\Users\ek\source\repos\gitoxide [run-ci/consistent-sh ≡]> cargo nextest run -p gix-merge --no-fail-fast
    Finished `test` profile [unoptimized + debuginfo] target(s) in 0.48s
────────────
 Nextest run ID 5771b786-7fff-43d7-9721-b20970cbb558 with nextest profile: default
    Starting 20 tests across 2 binaries
     Running [ 00:00:00] 0/20: 0 running, 0 passed, 0 skipped
        PASS [   0.018s] gix-merge::merge blob::builtin_driver::text::fuzzed
        PASS [   0.022s] gix-merge::merge blob::builtin_driver::text::both_sides_same_changes_are_conflict_free
        PASS [   0.025s] gix-merge::merge blob::pipeline::without_transformation
        PASS [   0.031s] gix-merge::merge blob::pipeline::non_existing
        PASS [   0.035s] gix-merge::merge blob::pipeline::binary_below_large_file_threshold
        PASS [   0.038s] gix-merge::merge blob::builtin_driver::text::both_differ_partially_resolution_is_conflicting
        PASS [   0.040s] gix-merge::merge blob::builtin_driver::binary
        PASS [   0.047s] gix-merge::merge blob::pipeline::worktree_filter
        PASS [   0.054s] gix-merge::merge blob::pipeline::above_large_file_threshold
        PASS [   0.052s] gix-merge::merge blob::platform::merge::builtin_with_conflict
        PASS [   0.054s] gix-merge::merge blob::platform::merge::builtin_text_uses_binary_if_needed
        PASS [   0.058s] gix-merge::merge blob::platform::merge::one_buffer_too_large
        PASS [   0.070s] gix-merge::merge blob::platform::merge::same_binaries_do_not_count_as_conflicted
        PASS [   0.067s] gix-merge::merge blob::platform::prepare_merge::ancestor_and_current_and_other_do_not_exist
        PASS [   0.067s] gix-merge::merge blob::platform::prepare_merge::driver_selection
        PASS [   0.059s] gix-merge::merge blob::platform::set_resource::invalid_resource_types
        PASS [   0.200s] gix-merge::merge blob::platform::merge::missing_buffers_are_empty_buffers
        FAIL [   0.245s] gix-merge::merge blob::platform::merge::with_external
──── STDOUT:             gix-merge::merge blob::platform::merge::with_external

running 1 test
test blob::platform::merge::with_external ... FAILED

failures:

failures:
    blob::platform::merge::with_external

test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 19 filtered out; finished in 0.18s

──── STDERR:             gix-merge::merge blob::platform::merge::with_external
thread 'blob::platform::merge::with_external' panicked at gix-merge\tests\merge\blob\platform.rs:297:13:
b
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

        PASS [   0.330s] gix-merge::merge blob::builtin_driver::text::run_baseline
        PASS [   0.998s] gix-merge::merge tree::run_baseline
────────────
     Summary [   1.036s] 20 tests run: 19 passed, 1 failed, 0 skipped
        FAIL [   0.245s] gix-merge::merge blob::platform::merge::with_external
error: test run failed

The failing assertion is the first of the assertions in this fragment of the failing test:

let platform_ref = platform.prepare_merge(&db, Default::default())?;
let mut buf = Vec::new();
let res = platform_ref.merge(&mut buf, default_labels(), &Default::default())?;
assert_eq!(res, (Pick::Buffer, Resolution::Complete), "merge drivers always merge ");
let mut lines = cleaned_driver_lines(&buf)?;
for tmp_file in lines.by_ref().take(3) {
    assert!(tmp_file.contains_str(&b".tmp"[..]), "{tmp_file}");
}
let lines: Vec<_> = lines.collect();
assert_eq!(
    lines,
    [
        "7",
        "b",
        "ancestor label",
        "current label",
        "other label",
        "%F",
        "b",
        "theirs"
    ],
    "we handle word-splitting and definitely pick-up what's written into the %A buffer"
);

The first three of the cleaned driver lines are supposed to contain .tmp, but the first one is just b, which is expected to appear later. This somehow varies based on how the tests are run, and there are at least three environments for figuring out what's going on: CI, local in Git Bash, and local in PowerShell. It seems like they shouldn't differ, but they do.

Locally, it is possible to inspect things in a debugger, but this is harder on CI because while action-tmate can be used to get a reverse shell, on Windows runners the reverse shell's environment differs from the noninteractive environment in which the tests run. Fortunately, I can break the tests in a way that makes them always fail by panicking and printing all the lines:

diff --git a/gix-merge/tests/merge/blob/platform.rs b/gix-merge/tests/merge/blob/platform.rs
index 6230377b8..62882cc69 100644
--- a/gix-merge/tests/merge/blob/platform.rs
+++ b/gix-merge/tests/merge/blob/platform.rs
@@ -293,6 +293,11 @@ theirs
         let res = platform_ref.merge(&mut buf, default_labels(), &Default::default())?;
         assert_eq!(res, (Pick::Buffer, Resolution::Complete), "merge drivers always merge ");
         let mut lines = cleaned_driver_lines(&buf)?;
+        panic!(
+            "original: {:#?}\ncleaned: {:#?}",
+            buf.lines().map(|s| s.as_bstr()).collect::<Vec<_>>(),
+            lines.collect::<Vec<_>>()
+        );
         for tmp_file in lines.by_ref().take(3) {
             assert!(tmp_file.contains_str(&b".tmp"[..]), "{tmp_file}");
         }

To be clear, while all of the following are from messages printed in failing tests, the tests fail because of that change. All of them would otherwise pass, except the local run, in Windows, from PowerShell, shown last. Also, I do not know why this is happening. I am hoping you may have some insight into it.

Based off the main branch, on CI, in Ubuntu (would pass)

In the test-fast (ubuntu-latest) run, for comparison:

original: [
    "ours/home/runner/work/gitoxide/gitoxide/gix-merge/.tmpmK4t1a",
    "/home/runner/work/gitoxide/gitoxide/gix-merge/.tmpNFsS5l",
    "/home/runner/work/gitoxide/gitoxide/gix-merge/.tmp8rvAjw",
    "7",
    "b",
    "ancestor label",
    "current label",
    "other label",
    "%F",
    "b",
    "theirs",
]
cleaned: [
    "ours/home/runner/work/gitoxide/gitoxide/gix-merge/.tmpmK4t1a",
    "/.tmpNFsS5l",
    "/.tmp8rvAjw",
    "7",
    "b",
    "ancestor label",
    "current label",
    "other label",
    "%F",
    "b",
    "theirs",
]

Based off the main branch, on CI, in Windows (would pass)

In the test-fast (windows-latest) run:

original: [
    "oursD:agitoxidegitoxidegix-merge.tmpDYZgPn",
    "D:agitoxidegitoxidegix-merge.tmpOKiCuJ",
    "D:agitoxidegitoxidegix-merge.tmpOahQgz",
    "7",
    "b",
    "ancestor label",
    "current label",
    "other label",
    "%F",
    "b",
    "theirs",
]
cleaned: [
    "oursD:agitoxidegitoxidegix-merge.tmpDYZgPn",
    "D:agitoxidegitoxidegix-merge.tmpOKiCuJ",
    "D:agitoxidegitoxidegix-merge.tmpOahQgz",
    "7",
    "b",
    "ancestor label",
    "current label",
    "other label",
    "%F",
    "b",
    "theirs",
]

Note how backslashes are being treated as escape characters and removed, which seems like it is not intended, but this does not in and of itself cause a problem with this test: it happens even on the main branch, and even without the changes from this PR.
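
The effect can be modeled with a small sketch. This is a deliberately simplified model of how a POSIX shell treats a backslash in an unquoted word (the backslash is dropped and the following character is kept literally); it is not taken from gix-merge or from any shell's source, and remove_unquoted_backslashes is a hypothetical name.

```rust
/// Simplified model of backslash handling in an unquoted shell word:
/// each backslash is removed and the character after it is kept literally.
/// Quotes, expansions, and backslash-newline are intentionally not modeled.
fn remove_unquoted_backslashes(word: &str) -> String {
    let mut out = String::with_capacity(word.len());
    let mut chars = word.chars();
    while let Some(c) = chars.next() {
        if c == '\\' {
            match chars.next() {
                // Keep the escaped character, drop the backslash.
                Some(next) => out.push(next),
                // A lone trailing backslash is kept as-is in this model.
                None => out.push('\\'),
            }
        } else {
            out.push(c);
        }
    }
    out
}
```

Applied to a path of the form D:\a\gitoxide\gitoxide\gix-merge\.tmpDYZgPn, this model yields D:agitoxidegitoxidegix-merge.tmpDYZgPn, which matches the shape of the lines in the Windows output above.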

Based off the main branch, locally, in Windows, from Git Bash (would pass)

Via cargo nextest run -p gix-merge --no-fail-fast:

original: [
    "oursC:Userseksourcereposgitoxidegix-merge.tmpORUoAj",
    "C:Userseksourcereposgitoxidegix-merge.tmpuUGgiZ",
    "C:Userseksourcereposgitoxidegix-merge.tmpsYzC12",
    "7",
    "b",
    "ancestor label",
    "current label",
    "other label",
    "%F",
    "b",
    "theirs",
]
cleaned: [
    "oursC:Userseksourcereposgitoxidegix-merge.tmpORUoAj",
    "C:Userseksourcereposgitoxidegix-merge.tmpuUGgiZ",
    "C:Userseksourcereposgitoxidegix-merge.tmpsYzC12",
    "7",
    "b",
    "ancestor label",
    "current label",
    "other label",
    "%F",
    "b",
    "theirs",
]

Based off the main branch, locally, in Windows, from PowerShell (would pass)

Via cargo nextest run -p gix-merge --no-fail-fast:

original: [
    "oursC:Userseksourcereposgitoxidegix-merge.tmpPNy4Pa",
    "C:Userseksourcereposgitoxidegix-merge.tmp5Fkmzy",
    "C:Userseksourcereposgitoxidegix-merge.tmp9aFhwO",
    "7",
    "b",
    "ancestor label",
    "current label",
    "other label",
    "%F",
    "b",
    "theirs",
]
cleaned: [
    "oursC:Userseksourcereposgitoxidegix-merge.tmpPNy4Pa",
    "C:Userseksourcereposgitoxidegix-merge.tmp5Fkmzy",
    "C:Userseksourcereposgitoxidegix-merge.tmp9aFhwO",
    "7",
    "b",
    "ancestor label",
    "current label",
    "other label",
    "%F",
    "b",
    "theirs",
]

Based off this feature branch, on CI, in Ubuntu (would pass)

In the test-fast (ubuntu-latest) run, for comparison:

original: [
    "ours/home/runner/work/gitoxide/gitoxide/gix-merge/.tmpvKkoxy",
    "/home/runner/work/gitoxide/gitoxide/gix-merge/.tmpr402lE",
    "/home/runner/work/gitoxide/gitoxide/gix-merge/.tmp8bJjIe",
    "7",
    "b",
    "ancestor label",
    "current label",
    "other label",
    "%F",
    "b",
    "theirs",
]
cleaned: [
    "ours/home/runner/work/gitoxide/gitoxide/gix-merge/.tmpvKkoxy",
    "/.tmpr402lE",
    "/.tmp8bJjIe",
    "7",
    "b",
    "ancestor label",
    "current label",
    "other label",
    "%F",
    "b",
    "theirs",
]

That is the same as based off the main branch, which is very much as expected, because the changes on this feature branch do not produce a different value for gix_path::env::shell() on systems other than Windows.

Based off this feature branch, on CI, on Windows (would pass)

In the test-fast (windows-latest) run:

original: [
    "oursD:agitoxidegitoxidegix-merge.tmpyuHJtT",
    "D:agitoxidegitoxidegix-merge.tmpC2s55H",
    "D:agitoxidegitoxidegix-merge.tmp06rraO",
    "7",
    "b",
    "ancestor label",
    "current label",
    "other label",
    "%F",
    "b",
    "theirs",
]
cleaned: [
    "oursD:agitoxidegitoxidegix-merge.tmpyuHJtT",
    "D:agitoxidegitoxidegix-merge.tmpC2s55H",
    "D:agitoxidegitoxidegix-merge.tmp06rraO",
    "7",
    "b",
    "ancestor label",
    "current label",
    "other label",
    "%F",
    "b",
    "theirs",
]

This is similar to the main branch CI and local Git Bash runs shown above.

Based off this feature branch, locally, in Windows, from Git Bash (would pass)

Via cargo nextest run -p gix-merge --no-fail-fast:

original: [
    "oursC:Userseksourcereposgitoxidegix-merge.tmpUfxnE8",
    "C:Userseksourcereposgitoxidegix-merge.tmpiUdSfX",
    "C:Userseksourcereposgitoxidegix-merge.tmpmkVmRq",
    "7",
    "b",
    "ancestor label",
    "current label",
    "other label",
    "%F",
    "b",
    "theirs",
]
cleaned: [
    "oursC:Userseksourcereposgitoxidegix-merge.tmpUfxnE8",
    "C:Userseksourcereposgitoxidegix-merge.tmpiUdSfX",
    "C:Userseksourcereposgitoxidegix-merge.tmpmkVmRq",
    "7",
    "b",
    "ancestor label",
    "current label",
    "other label",
    "%F",
    "b",
    "theirs",
]

Based off this feature branch, locally, in Windows, from PowerShell (would fail)

Via cargo nextest run -p gix-merge --no-fail-fast:

original: [
    "b",
    "theirserseksourcereposgitoxidegix-merge.tmpTDKDff",
    "C:Userseksourcereposgitoxidegix-merge.tmp5WJaSb",
    "C:Userseksourcereposgitoxidegix-merge.tmpyQDnZt",
    "7",
    "b",
    "ancestor label",
    "current label",
    "other label",
    "%F",
]
cleaned: [
    "b",
    "theirserseksourcereposgitoxidegix-merge.tmpTDKDff",
    "C:Userseksourcereposgitoxidegix-merge.tmp5WJaSb",
    "C:Userseksourcereposgitoxidegix-merge.tmpyQDnZt",
    "7",
    "b",
    "ancestor label",
    "current label",
    "other label",
    "%F",
]

Note how b comes first, and the three lines with paths to files whose names contain .tmp follow it. This causes this assertion, which is run for the first three lines, to fail in the first iteration (i.e. for the first line):

assert!(tmp_file.contains_str(&b".tmp"[..]), "{tmp_file}");

But I do not understand why. I'm not very familiar with the details of gix-merge. I'm also not entirely clear on which kinds of shell transformations (from parsing % items into fields, from shell expansions on unquoted parameter expansion, and from the implicit effect of joining on spaces when passing multiple arguments arising from an unquoted expansion to echo) are intended to occur in the test command, and with what effects.

Is this likely to be caused by different environments across shells run in different ways? Or is it somehow due to the name the shell is called with, even though that does not become its $0 here and, furthermore, its $0 does not seem to be expanded? What could be going on?

@Byron
Member

Byron commented Feb 24, 2025

Thank you very much for tackling this!

I didn't look at the code of this PR but want to share some thoughts on the failure.

Here is the program that is run:

for arg in  %O %A %B %L %P %S %X %Y %F; do echo $arg >> \"%A\"; done; cat \"%O\" \"%B\" >> \"%A\""

The failing output has 10 lines, whereas a valid output has 11 lines. Probably that is related. My feeling is that one of the arguments in the for loop substitutions is empty. It would disappear then, and everything is offset.
If these were quoted, that shouldn't happen. Quoting them is probably a good idea anyway, just to make the test script more robust.

From there it should become more obvious which parameter comes out empty, and then what the problem really is.
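
To illustrate the mechanism I mean, here is a simplified model, not the real shell or gix-merge behavior: the substituted values are pasted into the script text and the for loop then sees whitespace-separated words, so an empty value contributes no word at all. The name words_seen_by_loop is hypothetical.

```rust
/// Simplified model of `for arg in <unquoted substitutions>`:
/// values are joined into the script text with spaces and then split on
/// whitespace, so an empty value produces no word and later items shift up.
fn words_seen_by_loop(substituted: &[&str]) -> Vec<String> {
    substituted
        .join(" ")
        .split_whitespace()
        .map(str::to_owned)
        .collect()
}
```

In this model, three values where the middle one is empty produce only two loop iterations, which is the kind of offset that would reduce the line count by one.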

I hope that helps.

@EliahKagan
Member Author

The failing output has 10 lines, whereas a valid output has 11 lines. Probably that is related.

Yes, in the failing run, the theirs line is absent, and one of the b lines is moved to the top.

@EliahKagan
Member Author

EliahKagan commented Feb 24, 2025

A couple of things I didn't mention in #1862 (comment):

My feeling is that one of the arguments in the for loop substitutions is empty. It would disappear then, and everything is offset.

It doesn't look like that specific mechanism accounts for the failure, because in the failing case a b line appears as the first line of the output, while otherwise b does not appear until the fifth line.

As an early step (before even doing most of the testing described above), I tried adding ' ' quotes around the format specifiers and " " quotes around the parameter expansion in the merge driver command. The effect was more splitting. For example, ancestor label became two lines, ancestor and label. I think part of what using unquoted parameter expansion is currently achieving for the echo arguments is splitting on newlines, which then effectively get replaced with spaces, because when echo is run with multiple arguments it joins them with spaces. I am very unclear on what the exact intended behavior of this merge driver test command is.

The broader issue for me examining this specific test is that I am familiar neither with the details of gix-merge and its tests nor (even though this is general Git knowledge) with the format expected for custom merge drivers.

Thank you very much for tackling this!

No problem. My goal that this is related to is to be able to fix some other bugs where we do not run scripts correctly, such as bugs in the special casing of scripts with shebangs when use_shell is false. Fixing such bugs runs the risk of producing worse problems if there are other bugs whose effects are worse than the effect of existing bugs. (For example, running a wrong command, instead of failing to run anything, would often be worse. Likewise, running a wrong command in circumstances that arise more readily could be worse than running a wrong command in circumstances that arise less readily.) It also runs the risk that I introduce new bugs that weren't present at all before.

But that also means I risk becoming separated from the original bug-fixing goals, which this is peripheral to, if I shift my focus too much onto this. If possible, I'd like to avoid becoming too embroiled in the details of the gix-merge tests. It might not be possible to avoid this, but there are two other approaches I may attempt first to try to figure out what is going on:

  • Write something to examine how the environment is changed before and after the changes here. Manual testing with similar but non-identical ways of running shell processes suggests that the environment changes are not very great on my local machine (which is where they would have to be significant in order to be the cause of the problems I am observing on that machine but not CI). But this has the advantage that (a) it might be readily adaptable into a regression test, and (b) if I can accurately state how the environments are different, then your much greater familiarity with the failing gix-merge test and the intended gix-merge behavior might make the problem obvious even if it is not obvious to me.
  • Investigate failures on a conceptually related feature branch where I am trying to improve how gix-testtools finds bash. I'm not sure yet because this could be the result of a mistake (and unfortunately, the full test suite takes quite some time to run with GIX_TEST_IGNORE_ARCHIVES=1 on my local Windows 10 development machine), but it looks like there are comparable failures--and many of them, though most may be from a single failing fixture script--when I try to make changes there that harmonize how gix-testtools runs bash with how this PR tries to make gix-path find sh. The changes I'm working on in gix-testtools seem much smaller than those here, yet I have far worse local test failures, so that's interesting. I'll try to give more information on that soon, possibly in another draft PR about that.

Edit: Reworded for clarity and reordered the sections.

@Byron
Member

Byron commented Feb 25, 2025

theirs is empty, even though it should be something that the caller knows.

I am very unclear on what the exact intended behavior of this merge driver test command is.

It attempts to make the arguments that the merge-driver was called with observable. I think there is no quoting because each of the arguments is known not to have spaces in it. However, the lack of quoting also makes empty arguments unobservable, a definite shortcoming of the current implementation.

The broader issue for me examining this specific test is that I am familiar neither with the details of gix-merge and its tests nor (even though this is general Git knowledge) with the format expected for custom merge drivers.

Here are the docs, but the short version is that it's a script into which various parameters can be substituted. This substitution is performed by the caller of the merge-driver. That caller does a bare string substitution, without any care for quotes or whitespace in general.
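
Roughly, and only as a sketch of the idea rather than the actual gix-merge code, a bare substitution looks like the following. The function name substitute and the sequential-replace strategy are assumptions made for illustration.

```rust
/// Hypothetical sketch of bare placeholder substitution: every occurrence of
/// a placeholder such as `%A` is replaced textually by its value, with no
/// quoting or escaping applied to the value.
fn substitute(template: &str, values: &[(&str, &str)]) -> String {
    let mut out = template.to_owned();
    for &(placeholder, value) in values {
        // Note: replacing sequentially means a value that itself contains a
        // later placeholder would be substituted again (an artifact of this
        // sketch, not a claim about the real implementation).
        out = out.replace(placeholder, value);
    }
    out
}
```

Because values are inserted verbatim, any backslashes, spaces, or quotes they contain are interpreted by the shell that later runs the script.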

I'd like to avoid becoming too embroiled in the details of the gix-merge tests.

I definitely wouldn't want you to suffer unnecessarily. The key for me to solve this certainly is to reproduce the error. For now, it doesn't make much sense to me either, unless… it's a substitution engine, what if one of the arguments that it substitutes also contains substitutable characters? Probably that's not what's happening here though, as % should be uncommon enough.

This is a patch that should tell exactly what's executed, and from there it would become more obvious what's really happening.

diff --git a/gix-merge/src/blob/platform/merge.rs b/gix-merge/src/blob/platform/merge.rs
index 058a30128..324742a00 100644
--- a/gix-merge/src/blob/platform/merge.rs
+++ b/gix-merge/src/blob/platform/merge.rs
@@ -211,7 +211,7 @@ pub(super) mod inner {
                     cmd.extend_from_slice(token);
                 }
 
-                Ok(merge::Command {
+                Ok(dbg!(merge::Command {
                     cmd: gix_command::prepare(gix_path::from_bstring(cmd))
                         .with_context(context)
                         .command_may_be_shell_script()
@@ -223,7 +223,7 @@ pub(super) mod inner {
                     current_path: ours_path,
                     ancestor: base_tmp,
                     other: theirs_tmp,
-                })
+                }))
             }
 
             /// Return the configured driver program for use with [`Self::prepare_external_driver()`], or `Err`

I hope that helps.

@EliahKagan
Member Author

EliahKagan commented Feb 25, 2025

Thanks--you're right that the actual arguments that are passed are important information. I had glossed over this because it did not look like they differed between the passing and failing environment, but that doesn't mean they aren't needed to know what's going on.

The results, with the patch you gave that reveals the Command instance with dbg!, when running the test on this feature branch from PowerShell where it fails, are:

C:\Users\ek\source\repos\gitoxide [run-ci/consistent-sh ≡ +0 ~1 -0 ~]> cargo nextest run -p gix-merge with_external
    Finished `test` profile [unoptimized + debuginfo] target(s) in 0.46s
────────────
 Nextest run ID 42b38560-812f-4bcd-9548-97bb529b272d with nextest profile: default
    Starting 1 test across 2 binaries (19 tests skipped)
     Running [ 00:00:00] 0/1: 0 running, 0 passed, 0 skipped
        FAIL [   0.158s] gix-merge::merge blob::platform::merge::with_external
──── STDOUT:             gix-merge::merge blob::platform::merge::with_external

running 1 test
test blob::platform::merge::with_external ... FAILED

failures:

failures:
    blob::platform::merge::with_external

test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 19 filtered out; finished in 0.15s

──── STDERR:             gix-merge::merge blob::platform::merge::with_external
[gix-merge\src\blob\platform\merge.rs:214:20] merge::Command
{
    cmd:
    gix_command::prepare(gix_path::from_bstring(cmd)).with_context(context).command_may_be_shell_script().stdin(Stdio::null()).stdout(Stdio::inherit()).stderr(Stdio::inherit()).into(),
    current: ours_tmp, current_path: ours_path, ancestor: base_tmp, other:
    theirs_tmp,
} = "C:/Users/ek/scoop/apps/git/2.48.1/usr/bin/sh.exe" "-c" "for arg in  C:\\Users\\ek\\source\\repos\\gitoxide\\gix-merge\\.tmp8ULmDC C:\\Users\\ek\\source\\repos\\gitoxide\\gix-merge\\.tmpmHJGGo C:\\Users\\ek\\source\\repos\\gitoxide\\gix-merge\\.tmpLByBCI 7 \'b\' \'ancestor label\' \'current label\' \'other label\' %F; do echo $arg >> \"C:\\Users\\ek\\source\\repos\\gitoxide\\gix-merge\\.tmpmHJGGo\"; done; cat \"C:\\Users\\ek\\source\\repos\\gitoxide\\gix-merge\\.tmp8ULmDC\" \"C:\\Users\\ek\\source\\repos\\gitoxide\\gix-merge\\.tmpLByBCI\" >> \"C:\\Users\\ek\\source\\repos\\gitoxide\\gix-merge\\.tmpmHJGGo\"" "--"

thread 'blob::platform::merge::with_external' panicked at gix-merge\tests\merge\blob\platform.rs:297:13:
b
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

  Cancelling due to test failure
────────────
     Summary [   0.167s] 1 test run: 0 passed, 1 failed, 19 skipped
        FAIL [   0.158s] gix-merge::merge blob::platform::merge::with_external
error: test run failed

That's a bit hard to read due to the way the Command sub-object formats, and it does not automatically show output in the case that the test passes. So I have also tested with this change instead:

diff --git a/gix-merge/src/blob/platform/merge.rs b/gix-merge/src/blob/platform/merge.rs
index 058a30128..39e27e51f 100644
--- a/gix-merge/src/blob/platform/merge.rs
+++ b/gix-merge/src/blob/platform/merge.rs
@@ -399,6 +399,7 @@ impl<'parent> PlatformRef<'parent> {
         match self.configured_driver() {
             Ok(driver) => {
                 let mut cmd = self.prepare_external_driver(driver.command.clone(), labels, context.clone())?;
+                panic!("program: {:?}\nargs: {:#?}", cmd.get_program(), cmd.get_args());
                 let status = cmd.status().map_err(|err| Error::SpawnExternalDriver {
                     cmd: format!("{:?}", cmd.cmd),
                     source: err,

Then, when run from PowerShell, the output included:

program: "C:/Users/ek/scoop/apps/git/2.48.1/usr/bin/sh.exe"
args: CommandArgs {
    inner: [
        Regular(
            "-c",
        ),
        Regular(
            "for arg in  C:\\Users\\ek\\source\\repos\\gitoxide\\gix-merge\\.tmpayn7BE C:\\Users\\ek\\source\\repos\\gitoxide\\gix-merge\\.tmpEvLyJo C:\\Users\\ek\\source\\repos\\gitoxide\\gix-merge\\.tmpCbHa1a 7 \'b\' \'ancestor label\' \'current label\' \'other label\' %F; do echo $arg >> \"C:\\Users\\ek\\source\\repos\\gitoxide\\gix-merge\\.tmpEvLyJo\"; done; cat \"C:\\Users\\ek\\source\\repos\\gitoxide\\gix-merge\\.tmpayn7BE\" \"C:\\Users\\ek\\source\\repos\\gitoxide\\gix-merge\\.tmpCbHa1a\" >> \"C:\\Users\\ek\\source\\repos\\gitoxide\\gix-merge\\.tmpEvLyJo\"",
        ),
        Regular(
            "--",
        ),
    ],
}

When run from Git Bash, where the test passes when allowed to proceed, the output included this instead:

program: "C:/Users/ek/scoop/apps/git/2.48.1/usr/bin/sh.exe"
args: CommandArgs {
    inner: [
        Regular(
            "-c",
        ),
        Regular(
            "for arg in  C:\\Users\\ek\\source\\repos\\gitoxide\\gix-merge\\.tmpfax7ph C:\\Users\\ek\\source\\repos\\gitoxide\\gix-merge\\.tmpcPJLm3 C:\\Users\\ek\\source\\repos\\gitoxide\\gix-merge\\.tmp5fMWd7 7 \'b\' \'ancestor label\' \'current label\' \'other label\' %F; do echo $arg >> \"C:\\Users\\ek\\source\\repos\\gitoxide\\gix-merge\\.tmpcPJLm3\"; done; cat \"C:\\Users\\ek\\source\\repos\\gitoxide\\gix-merge\\.tmpfax7ph\" \"C:\\Users\\ek\\source\\repos\\gitoxide\\gix-merge\\.tmp5fMWd7\" >> \"C:\\Users\\ek\\source\\repos\\gitoxide\\gix-merge\\.tmpcPJLm3\"",
        ),
        Regular(
            "--",
        ),
    ],
}

This is to say that only the randomly generated suffixes after .tmp in temporary filenames are different (and they would differ across separate identical runs, too).
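For anyone repeating this comparison, the same kind of inspection can be sketched with only the standard library, without spawning anything. This is a minimal illustration, not the gix-command code: `shell_command` and `describe` are hypothetical helper names, and the script is a placeholder.

```rust
use std::ffi::OsStr;
use std::process::Command;

/// Build (but do not run) a `<program> -c <script> --` command, mirroring the
/// shape of the invocations printed above. Hypothetical helper.
fn shell_command(program: &str, script: &str) -> Command {
    let mut cmd = Command::new(program);
    cmd.arg("-c").arg(script).arg("--");
    cmd
}

/// Collect the program and its arguments as strings for side-by-side comparison.
fn describe(cmd: &Command) -> (String, Vec<String>) {
    let program = cmd.get_program().to_string_lossy().into_owned();
    let args = cmd
        .get_args()
        .map(|arg: &OsStr| arg.to_string_lossy().into_owned())
        .collect();
    (program, args)
}

fn main() {
    // Placeholder script; the real one is the merge driver command shown above.
    let cmd = shell_command("sh", "echo hello");
    let (program, args) = describe(&cmd);
    println!("program: {:?}\nargs: {:#?}", program, args);
}
```

Because nothing is executed, output gathered this way can be compared across terminals without the test's side effects.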

On the main branch, in PowerShell, where the test also passes if allowed to proceed, the only difference besides those temporary files' names is the shell command:

program: "sh"
args: CommandArgs {
    inner: [
        Regular(
            "-c",
        ),
        Regular(
            "for arg in  C:\\Users\\ek\\source\\repos\\gitoxide\\gix-merge\\.tmpD7kDyR C:\\Users\\ek\\source\\repos\\gitoxide\\gix-merge\\.tmppnvU83 C:\\Users\\ek\\source\\repos\\gitoxide\\gix-merge\\.tmpkYgWkr 7 \'b\' \'ancestor label\' \'current label\' \'other label\' %F; do echo $arg >> \"C:\\Users\\ek\\source\\repos\\gitoxide\\gix-merge\\.tmppnvU83\"; done; cat \"C:\\Users\\ek\\source\\repos\\gitoxide\\gix-merge\\.tmpD7kDyR\" \"C:\\Users\\ek\\source\\repos\\gitoxide\\gix-merge\\.tmpkYgWkr\" >> \"C:\\Users\\ek\\source\\repos\\gitoxide\\gix-merge\\.tmppnvU83\"",
        ),
        Regular(
            "--",
        ),
    ],
}

State from previous runs in this experiment is not likely to be a contributing factor; I ran gix clean -xde in between.

On main, sh is used rather than a full path. It resolves to C:\Users\ek\scoop\shims\sh.exe on this system, which is a scoop shim that delegates to C:\Users\ek\scoop\apps\git\current\bin\sh.exe, which is the Git for Windows bin\sh.exe shim that delegates to the Git for Windows usr\bin\sh.exe. That latter distinction--that is, whether the Git for Windows sh.exe shim is involved--seems to be an essential element here, because the failure goes away under this change:

diff --git a/gix-path/src/env/mod.rs b/gix-path/src/env/mod.rs
index 78e2da294..cd6203df7 100644
--- a/gix-path/src/env/mod.rs
+++ b/gix-path/src/env/mod.rs
@@ -51,7 +51,7 @@ pub fn shell() -> &'static OsStr {
                     // more readable messages, append literally with `/` separators. The path from
                     // `git --exec-path` will already have all `/` separators (and no trailing `/`)
                     // unless it was explicitly overridden to an unusual value via `GIT_EXEC_PATH`.
-                    raw_path.push("/usr/bin/sh.exe");
+                    raw_path.push("/bin/sh.exe");
                     raw_path
                 })
                 .filter(|raw_path| {

But I am reluctant to apply that change without understanding what it is about the effect of the Git for Windows shim that is helping, to know that this shim really is preferable rather than the test happening to hinge on a quirk that is different. One reason I am reluctant is that, under that change, if git is provided by the Git for Windows SDK, its sh.exe would not be found. If the non-shim Git for Windows sh.exe is usable but just moderately less preferred, then we could check for the shim first and then fall back to the non-shim. But I would want to understand the situation better before jumping into that, in case really neither is preferred, or in case the non-shim executable does not work well enough for our needs to be worth using directly.

My guess is that the relevant effect of the shim is in changing the environment, but I am not certain it refrains entirely from adjusting arguments. (I think the shim does not adjust arguments directly. But even if that is right, msys-2.0.dll affects interpretation of arguments--that is, what ends up in the argv that a program that uses that DLL as its libc uses--and does so in a way that is affected by some environment variables.)

I'll try to modify the command to report its arguments and environment in a way that they cannot be confused with anything else, such as by writing them to a file. However, since this will necessarily involve modifying the command or its environment, it is not certain to be successful at revealing the cause.


Edit: A possible contributing factor is that I also have some bin directories associated with a regular MSYS2 installation (separate from Git for Windows) in my PATH on Windows.

When the shim is used, two directories belonging to the MSYS2 installation that is associated with the running shell, /mingw64/bin and /usr/bin, as well as my per-user bin at /c/Users/ek/bin (which I think is less important in this case), are prepended, to take precedence.

The reason I say /mingw64/bin and /usr/bin belong to the MSYS2 installation associated with the running shell is that they are Unix-style paths that are automatically treated that way and translated based on that when calling native external commands.

C:\Users\ek\scoop\apps\git\2.48.1\usr\bin\sh.exe -c 'echo "$PATH" | tr ":" "\n" | sort' > ubp.txt
C:\Users\ek\scoop\apps\git\2.48.1\bin\sh.exe -c 'echo "$PATH" | tr ":" "\n" | sort' > bp.txt
git --no-pager diff --no-index ubp.txt bp.txt

Where the output of that last command is:

diff --git a/ubp.txt b/bp.txt
index 7390dca90..7190bd5b3 100644
--- a/ubp.txt
+++ b/bp.txt
@@ -1,3 +1,6 @@
+/mingw64/bin
+/usr/bin
+/c/Users/ek/bin
 /c/Users/ek/scoop/apps/pwsh/7.5.0
 /c/Program Files (x86)/VMware/VMware Workstation/bin
 /c/Windows/system32

Note that this experiment is not equivalent to checking the $0, $@, and the environment variables in an actual merge driver command, and I am not certain if this is the cause. Even if it is, I am not clear on how it is the cause.
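The PATH comparison above can also be done on the unsorted values, which shows directly which entries were prepended. This is a rough sketch under the assumption that both values are `:`-separated MSYS-style paths; `prepended_entries` is a hypothetical helper, and the PATH values in `main` are shortened for illustration.

```rust
/// Return the entries that `with_shim` has in front of `without_shim`,
/// or `None` if `with_shim` does not end with all of `without_shim`.
fn prepended_entries<'a>(without_shim: &str, with_shim: &'a str) -> Option<Vec<&'a str>> {
    let before: Vec<&str> = without_shim.split(':').collect();
    let after: Vec<&'a str> = with_shim.split(':').collect();
    if after.len() < before.len() {
        return None;
    }
    // Split so that `rest` has exactly as many entries as the non-shim PATH.
    let (prefix, rest) = after.split_at(after.len() - before.len());
    (rest == before.as_slice()).then(|| prefix.to_vec())
}

fn main() {
    // Shortened stand-ins for the real PATH values.
    let without_shim = "/c/Windows/system32:/c/Windows";
    let with_shim = "/mingw64/bin:/usr/bin:/c/Users/ek/bin:/c/Windows/system32:/c/Windows";
    println!("{:?}", prepended_entries(without_shim, with_shim));
}
```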

@Byron
Member

Byron commented Feb 26, 2025

This is very strange and puzzling.
I understand that the way the shell is called is the same no matter what, so the reason for the difference is in how the shell scripted merge-driver is interpreted.

And that's particularly strange as there isn't even any argument to interpret, the values are quite literally baked in. All I can think of is poking around in the script, maybe by starting to execute it directly. It's pretty clear what is invoked in the working and non-working case, so that could be generalized to invocations that are repeatable on the command-line directly. From there it can be reduced and probed to see what exactly is causing the observable difference (assuming it reproduces in the first place). If it doesn't, we'd know that some environment variable is affecting it as well.

And I say this without even implying that you spend more time on this, I am just sharing thoughts.

@EliahKagan
Member Author

EliahKagan commented Feb 26, 2025

And I say this without even implying that you spend more time on this, I am just sharing thoughts.

Actually, that is no problem. What I wanted to avoid was becoming too distracted trying to understand the semantics of external merge drivers, how gix-merge uses them, and the claims about them that the test cases are asserting.

But it looks like the problem is due to unexpected behavior of sh/bash on Windows when called through a path obtained by gix_path::env::shell(), which in this case is triggered by the particular command run through gix-merge but which is unexpected even if one is aware of the details of how that is supposed to work. But that is what I was worried about being distracted from. So I am willing to spend more time on this--investigating just this kind of weirdness is one of the things I wanted to save time to be able to do.

Of course, I still hope I manage to come to a sufficient understanding sooner rather than later, since it is itself a fragment of the larger issue of how to make gix-command use shells more robustly.

I understand that the way the shell is called is the same no matter what, so the reason for the difference is in how the shell scripted merge-driver is interpreted.

Yes. I believe even more strongly, now, that the effect of the shim is the key. The effect is probably through its customization of environment variables, though I am not certain.

I continue to suspect that the presence of a separate MSYS2 installation with a bin directory in my PATH is a contributing factor. But when gitoxide uses a shell, it should work properly even in such a case, especially since most things work on this system, including running the gitoxide test suite from PowerShell since #1712.

With the shim sh.exe provided by Git for Windows (which the scoop shim delegates to, though here I am running it directly):

C:\Users\ek> C:\Users\ek\scoop\apps\git\2.48.1\bin\sh.exe -c 'type -a sh cygpath'
sh is /usr/bin/sh
sh is /c/Users/ek/scoop/shims/sh
sh is /c/msys64/usr/bin/sh
cygpath is /usr/bin/cygpath
cygpath is /c/msys64/usr/bin/cygpath

With the non-shim sh.exe provided by Git for Windows:

C:\Users\ek> C:\Users\ek\scoop\apps\git\2.48.1\usr\bin\sh.exe -c 'type -a sh cygpath'
sh is /c/Users/ek/scoop/shims/sh
sh is /c/msys64/usr/bin/sh
cygpath is /c/msys64/usr/bin/cygpath

So the PATH changes effected by the shim make the environment a lot more reasonable. Without them, even the cygpath executable--that I might otherwise use as part of checking some of the intended behavior--is from a different MSYS2 installation than the shell from which it would be run. There is an argument to be made that my environment is broken, such that the failures I am having locally are not actually a problem and could be ignored. I think that argument is weak, though.

If we are going to prefer an sh.exe provided by Git for Windows even if another one appears earlier in the PATH--as gix_path::env::shell() does--then it should work even if its associated bin directories are not already in PATH, or even if something else with some same-named binaries appears first. This situation on my system is like that, just with the scoop shim (which delegates to the Git for Windows shim without modifying the environment, i.e., it runs sh.exe with the environment modifications of the Git for Windows shim and no others) appearing even earlier in the path.

More broadly, as I understand it, the Git for Windows shim exists for two reasons. It delegates to the "real" executable. But it also modifies the environment. Shims don't have to modify the environment; for example, the scoop shim does not. Presumably these environment modifications are sometimes valuable. So unless we can know we don't need them, which is hard to do in the general case of gix_command::Prepare, we should probably prefer to call the shim, but fall back to the non-shim if it is absent (which probably does about as well as can be done in the Git for Windows SDK, which doesn't provide the shims).
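The "prefer the shim, fall back to the non-shim" idea could be sketched roughly as follows. This is only an illustration, not the actual `gix_path::env::shell()` implementation: `find_sh` and `CANDIDATES` are hypothetical names, the installation root is assumed to be already known, and no safety checks on the root are shown.

```rust
use std::path::{Path, PathBuf};

/// Candidate locations relative to a Git for Windows installation root,
/// in order of preference: the `bin` shim first, then the non-shim.
const CANDIDATES: [&str; 2] = ["bin/sh.exe", "usr/bin/sh.exe"];

/// Hypothetical helper: return the first candidate that exists as a file.
fn find_sh(git_root: &Path) -> Option<PathBuf> {
    CANDIDATES
        .iter()
        .map(|relative| git_root.join(relative))
        .find(|candidate| candidate.is_file())
}

fn main() {
    // The root here is only an example; a real caller would have to infer it.
    let root = Path::new(r"C:\Program Files\Git");
    match find_sh(root) {
        Some(path) => println!("would use {}", path.display()),
        None => println!("no sh.exe under {}; fall back to plain `sh`", root.display()),
    }
}
```

With this ordering, an installation that lacks the shims (as described for the Git for Windows SDK) still yields the non-shim `usr/bin/sh.exe`.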

This might be separate from the issue of whether to prefer one of the shims of git.exe over the "real" git.exe, because running that non-shim directly is deliberately supported by Git for Windows since git-for-windows/git#2506. Currently we usually find a shim for git.exe, since that's what's usually in the path, but if we don't, we use the non-shim found in an expected installation location. (This has been the case since #1419, and both before and after #1456, though more consistently afterwards. See also the "Simpler paths could be used" section of #1456, as well as this code comment and the part of #1758 (comment) about clangarm64 and shims.)

Switching to using the shell through its shim makes sense both for sh.exe here and for bash.exe in #1864 (where it would be switching back, i.e. undoing the part of that PR that stopped using the shim). So I am tempted to just make those changes and--after retesting, just in case--say that it is fixed. Although I think that will be the solution, I don't want to jump to it just yet, because:

  • Until I understand, or at least have some better idea of, what the effect of the shim is in the failures I've observed, I will not be certain that a change to preferring it is robust. It could be that some equally reasonable (or unreasonable) environment would be broken by preferring the shim, rather than fixed. I doubt that, but I don't know.
  • It may be that there are contributing factors that are important problems for gix-command in their own right, and that I am not otherwise aware of. I'd rather not pass up the opportunity to find out.
  • Knowing about what depends on the effect of the shim may shed light on whether there is anything further that needs to be documented as a caveat in gix-command, and what that would be.
  • Knowing the effect of the shim and its implications may make it possible to implement an optimization later that does the work of the shim and thus avoids the extra subprocess creation, which on Windows is somewhat expensive. I certainly do not consider that as a blocker for this, but I don't want to pass up an opportunity to gather information that would be essential to it.

Further investigation, so far

As for why I believe the effect of the shim is the cause, I tried this modification:

diff --git a/gix-merge/tests/merge/blob/platform.rs b/gix-merge/tests/merge/blob/platform.rs
index 6230377b8..2dc09ebe5 100644
--- a/gix-merge/tests/merge/blob/platform.rs
+++ b/gix-merge/tests/merge/blob/platform.rs
@@ -265,7 +265,7 @@ theirs
             [gix_merge::blob::Driver {
                 name: "b".into(),
                 command:
-                    "for arg in  %O %A %B %L %P %S %X %Y %F; do echo $arg >> \"%A\"; done; cat \"%O\" \"%B\" >> \"%A\""
+                    "(printf '[%q]\\n' \"$0\" \"$@\"; set -o posix; set) >args+env; for arg in  %O %A %B %L %P %S %X %Y %F; do echo $arg >> \"%A\"; done; cat \"%O\" \"%B\" >> \"%A\""
                         .into(),
                 ..Default::default()
             }],

That makes it create a file called args+env showing:

  • $0 for the shell, which is always -- (due to #1842) but this confirms it.
  • The positional parameters for the shell ($@). There are none, but this confirms it.
  • All environment variables. I produced the environment variables with set -o posix; set rather than a more typical approach of running printenv or env with no arguments, in order to avoid confounding effects arising from the possibility of calling a separate external command that might be looked up in a directory belonging to a different MSYS2 installation. (This is in a ( ) subshell, so set -o posix does not have a broader effect.)

I ran cargo nextest run -p gix-merge with_external on this feature branch, on the Windows 10 system where I observed the failures, with that change, twice:

  • With no further change.
  • With shell() modified to use the shim (already known to make the failure go away).

In the second run, the change to shell() so it uses the shim was:

diff --git a/gix-path/src/env/mod.rs b/gix-path/src/env/mod.rs
index 78e2da294..cd6203df7 100644
--- a/gix-path/src/env/mod.rs
+++ b/gix-path/src/env/mod.rs
@@ -51,7 +51,7 @@ pub fn shell() -> &'static OsStr {
                     // more readable messages, append literally with `/` separators. The path from
                     // `git --exec-path` will already have all `/` separators (and no trailing `/`)
                     // unless it was explicitly overridden to an unusual value via `GIT_EXEC_PATH`.
-                    raw_path.push("/usr/bin/sh.exe");
+                    raw_path.push("/bin/sh.exe");
                     raw_path
                 })
                 .filter(|raw_path| {

After each run, I moved the just-created args+env file from the gix-merge directory to the top-level directory of the working tree and renamed it descriptively, to no-shim.txt in the first run and with-shim.txt in the second run. The changed lines between them, obtained by running git --no-pager diff -U0 --no-index no-shim.txt with-shim.txt, are:

diff --git a/no-shim.txt b/with-shim.txt
index 3f51928c0..11cbfa69b 100644
--- a/no-shim.txt
+++ b/with-shim.txt
@@ -4 +4 @@ APPDATA='C:\Users\ek\AppData\Roaming'
-BASH=/usr/bin/sh
+BASH=/usr/bin/bash
@@ -10 +10 @@ BASH_CMDS=()
-BASH_EXECUTION_STRING='(printf '\''[%q]\n'\'' "$0" "$@"; set -o posix; set) >args+env; for arg in  C:\Users\ek\source\repos\gitoxide\gix-merge\.tmpLx3byu C:\Users\ek\source\repos\gitoxide\gix-merge\.tmp7p4vnw C:\Users\ek\source\repos\gitoxide\gix-merge\.tmpnxT3lk 7 '\''b'\'' '\''ancestor label'\'' '\''current label'\'' '\''other label'\'' %F; do echo $arg >> "C:\Users\ek\source\repos\gitoxide\gix-merge\.tmp7p4vnw"; done; cat "C:\Users\ek\source\repos\gitoxide\gix-merge\.tmpLx3byu" "C:\Users\ek\source\repos\gitoxide\gix-merge\.tmpnxT3lk" >> "C:\Users\ek\source\repos\gitoxide\gix-merge\.tmp7p4vnw"'
+BASH_EXECUTION_STRING='(printf '\''[%q]\n'\'' "$0" "$@"; set -o posix; set) >args+env; for arg in  C:\Users\ek\source\repos\gitoxide\gix-merge\.tmp5rjEOo C:\Users\ek\source\repos\gitoxide\gix-merge\.tmpaOs19F C:\Users\ek\source\repos\gitoxide\gix-merge\.tmp62jrFQ 7 '\''b'\'' '\''ancestor label'\'' '\''current label'\'' '\''other label'\'' %F; do echo $arg >> "C:\Users\ek\source\repos\gitoxide\gix-merge\.tmpaOs19F"; done; cat "C:\Users\ek\source\repos\gitoxide\gix-merge\.tmp5rjEOo" "C:\Users\ek\source\repos\gitoxide\gix-merge\.tmp62jrFQ" >> "C:\Users\ek\source\repos\gitoxide\gix-merge\.tmpaOs19F"'
@@ -39,0 +40 @@ EUID=197609
+EXEPATH='C:\Users\ek\scoop\apps\git\2.48.1\bin'
@@ -56,0 +58 @@ MACHTYPE=x86_64-pc-msys
+MSYSTEM=MINGW64
@@ -61 +63 @@ NEXTEST_PROFILE=default
-NEXTEST_RUN_ID=203cd7ac-794f-4c11-af4a-c96333d95745
+NEXTEST_RUN_ID=ab48dadf-b347-40f2-8770-a73e91c8eb7d
@@ -72 +74 @@ PAGER='less -F'
-PATH='/c/Users/ek/source/repos/gitoxide/target/debug/deps:/c/Users/ek/source/repos/gitoxide/target/debug:/c/Users/ek/.rustup/toolchains/stable-x86_64-pc-windows-msvc/lib/rustlib/x86_64-pc-windows-msvc/lib:/c/Users/ek/scoop/apps/pwsh/7.5.0:/c/Program Files (x86)/VMware/VMware Workstation/bin:/c/Windows/system32:/c/Windows:/c/Windows/System32/Wbem:/c/Windows/System32/WindowsPowerShell/v1.0:/c/Windows/System32/OpenSSH:/c/Program Files (x86)/Windows Kits/10/Windows Performance Toolkit:/c/Program Files/dotnet:/c/Users/ek/scoop/apps/vscodium/current/bin:/c/Users/ek/scoop/apps/nodejs-lts/current/bin:/c/Users/ek/scoop/apps/nodejs-lts/current:/c/Users/ek/scoop/apps/openjdk23/current/bin:/c/Users/ek/scoop/apps/ghostscript/current/lib:/c/Users/ek/scoop/apps/gpg/current/bin:/c/Users/ek/scoop/apps/mpv/current:/c/Users/ek/scoop/apps/maven/current/bin:/c/Users/ek/.cargo/bin:/c/Users/ek/bin:/c/Users/ek/.local/share/gem/ruby/3.0.0/bin:/c/Users/ek/scoop/shims:/c/Users/ek/AppData/Local/Packages/PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0/LocalCache/local-packages/Python312/Scripts:/c/Users/ek/AppData/Local/Microsoft/WindowsApps:/c/Users/ek/.dotnet/tools:/c/Users/ek/AppData/Local/Programs/Microsoft VS Code/bin:/c/msys64/ucrt64/bin:/c/msys64/ucrt64/sbin:/c/msys64/usr/bin:/c/Users/ek/AppData/Local/Programs/MiKTeX/miktex/bin/x64:/c/users/ek/.local/bin:/c/Users/ek/AppData/Local/JetBrains/Toolbox/scripts:/c/Program Files/IronPython 3.4:/c/Users/ek/AppData/Local/Programs/Microsoft VS Code Insiders/bin:/c/Users/ek/.dotnet/tools'
+PATH='/mingw64/bin:/usr/bin:/c/Users/ek/bin:/c/Users/ek/source/repos/gitoxide/target/debug/deps:/c/Users/ek/source/repos/gitoxide/target/debug:/c/Users/ek/.rustup/toolchains/stable-x86_64-pc-windows-msvc/lib/rustlib/x86_64-pc-windows-msvc/lib:/c/Users/ek/scoop/apps/pwsh/7.5.0:/c/Program Files (x86)/VMware/VMware Workstation/bin:/c/Windows/system32:/c/Windows:/c/Windows/System32/Wbem:/c/Windows/System32/WindowsPowerShell/v1.0:/c/Windows/System32/OpenSSH:/c/Program Files (x86)/Windows Kits/10/Windows Performance Toolkit:/c/Program Files/dotnet:/c/Users/ek/scoop/apps/vscodium/current/bin:/c/Users/ek/scoop/apps/nodejs-lts/current/bin:/c/Users/ek/scoop/apps/nodejs-lts/current:/c/Users/ek/scoop/apps/openjdk23/current/bin:/c/Users/ek/scoop/apps/ghostscript/current/lib:/c/Users/ek/scoop/apps/gpg/current/bin:/c/Users/ek/scoop/apps/mpv/current:/c/Users/ek/scoop/apps/maven/current/bin:/c/Users/ek/.cargo/bin:/c/Users/ek/bin:/c/Users/ek/.local/share/gem/ruby/3.0.0/bin:/c/Users/ek/scoop/shims:/c/Users/ek/AppData/Local/Packages/PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0/LocalCache/local-packages/Python312/Scripts:/c/Users/ek/AppData/Local/Microsoft/WindowsApps:/c/Users/ek/.dotnet/tools:/c/Users/ek/AppData/Local/Programs/Microsoft VS Code/bin:/c/msys64/ucrt64/bin:/c/msys64/ucrt64/sbin:/c/msys64/usr/bin:/c/Users/ek/AppData/Local/Programs/MiKTeX/miktex/bin/x64:/c/users/ek/.local/bin:/c/Users/ek/AppData/Local/JetBrains/Toolbox/scripts:/c/Program Files/IronPython 3.4:/c/Users/ek/AppData/Local/Programs/Microsoft VS Code Insiders/bin:/c/Users/ek/.dotnet/tools'
@@ -74,0 +77 @@ PIPESTATUS=([0]="0")
+PLINK_PROTOCOL=ssh

(The PATH line is one of the changed lines, as indicated by the leading - and +, even though GitHub is failing to highlight it as such for some reason.)

The changed lines that I think are very unlikely to contribute in any way to the difference in behavior are:

  • BASH_EXECUTION_STRING, which differs only in the names of the temporary files, which conform to the same template and are reasonable in both cases.
  • NEXTEST_RUN_ID, which differs in the expected way across separate cargo nextest runs. Both have the correct format, and the tested code, and test cases, do not plausibly use this variable.
  • PLINK_PROTOCOL, since no transport operations are involved here. (Also, while I do have PuTTY and plink installed, I am not using them with Git or gitoxide, which find an OpenSSH implementation.)

In contrast, the changed lines for environment variables that I think could plausibly make a difference are:

  • BASH, which changed from /usr/bin/sh to /usr/bin/bash when the shim was used. This is usually the value of argv[0] as it is seen by the shell. (bash creates a BASH variable whether it is running in POSIX mode or not.)

    I don't know whether the path it shows changing from /usr/bin/sh to /usr/bin/bash means that the sh.exe shim is actually running the non-shim bash.exe instead of the non-shim sh.exe. The simplest explanation is that it is actually running bash.exe, which would be surprising (there is a separate shim for bash.exe, after all).

    But whatever the filename of the executable that is loaded, the way bash is invoked affects whether it enters POSIX mode automatically. When invoked by the name sh, it automatically enters POSIX mode (after running commands from any startup files that it would have run commands from). This suggests that the shim may cause the shell not to run in POSIX mode. And that seems to be confirmed by the following:

    C:\Users\ek> C:\Users\ek\scoop\apps\git\2.48.1\usr\bin\sh.exe -c 'echo "$SHELLOPTS"'
    braceexpand:hashall:interactive-comments:posix
    C:\Users\ek> C:\Users\ek\scoop\apps\git\2.48.1\bin\sh.exe -c 'echo "$SHELLOPTS"'
    braceexpand:hashall:interactive-comments
    

    That is certainly interesting in its own right, and potentially important to adjacent work on gitoxide. It even suggests that defaulting to sh.exe on Windows, when it is thought to be a shim provided by Git for Windows, might be a reasonable way to run bash scripts in Windows while getting around the problem--described in gitpython-developers/GitPython#1791 and GitoxideLabs/gitoxide#1359 (comment)--that the bash.exe that is usually found is the one associated with WSL! (If it is reliably the case across different installations and versions, and not considered a bug in Git for Windows.)

    But I am not sure why, or if, that would make a difference for the command run in this gix-merge test.

  • EXEPATH, which was newly defined when the shim was used. I am not sure what effect this has. gix-path itself will try to use it to avoid having to run git config -l. But this is in a subprocess of the executable that uses gix-path and other gitoxide crates, so it does not have an effect on this.

    (This path is also different from the one that it usually has in a Git Bash shell and that the EXEPATH optimization in gix-path would need; typically the path would be one level up, i.e., it would not have the bin component. It would also typically be a Windows path. So it would usually be C:\Users\ek\scoop\apps\git\2.48.1. However, I believe that is a separate issue. Somehow the git-bash.exe program that runs Git Bash, typically in a mintty window, causes EXEPATH to be customized differently. The somewhat unusual value of EXEPATH is also set when I use Git Bash in Windows Terminal, where I made the profile because scoop does not automatically configure that. To the best of my knowledge, there is no hard guarantee about whether EXEPATH exists or what its value is. But even if the value EXEPATH takes on my system is considered wrong, it is not likely the cause of the test failure here, which happens in the case that EXEPATH is not set because it is run outside of a Git Bash environment and without the shim.)

  • MSYSTEM, which was newly defined when the shim was used. But I am not sure what effect that has, if any, once the shell is already running. But perhaps it already has had an effect on the shell itself due to being set beforehand in the shim?

  • PATH, which with the shim has /mingw64/bin:/usr/bin:/c/Users/ek/bin: prepended.

    To be clear, this PATH is automatically converted into a ;-separated PATH whose entries like /c/... are turned into entries like C:\ when a native Windows program is called. (I think that might technically always happen and it just is always converted back in the called program when that program uses msys-2.0.dll as its libc, but I am not sure.) So that the path is shown in this form in the shell is not itself a problem.

    However, how this conversion is done by msys-2.0.dll--and likewise other conversions, such as in command-line parsing--is affected by the environment, reflects the environment for which it is done, and I believe also differs in some ways across Cygwin-like environments of different kinds (beyond just the DLL that implements it being named differently).

    Those are other aspects of running MSYS2 programs, and of running things from MSYS2 programs that may or may not be MSYS2 programs, whose implications for gix-command I have been looking into already and concurrently. It applies to running shells like sh and bash, which are often MSYS2 programs, including if provided by Git for Windows (except they may have been MSYS1 programs in very old versions). It applies to the case where use_shell is true, but also the case where gix-command's own shebang interpretation would or should cause a shell like sh or bash to be used.

    I had hoped that this issue was separate from that. Manual experiments--both before opening this issue and now--made me think that was not a contributing factor here. The subsequently described experiment (see below) rules out some variations of this. But I think it is still a plausible cause.
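The POSIX-mode check from the BASH item above (looking for `posix` in `$SHELLOPTS`) is simple to express in code. This is a small sketch that relies on `SHELLOPTS` being a colon-separated list of enabled options; `is_posix_mode` is a hypothetical helper, and the two sample values are the ones shown above.

```rust
/// Report whether a `$SHELLOPTS` value lists the `posix` option.
fn is_posix_mode(shellopts: &str) -> bool {
    shellopts.split(':').any(|option| option == "posix")
}

fn main() {
    // The values printed above by the non-shim and shim `sh.exe`, respectively.
    let non_shim = "braceexpand:hashall:interactive-comments:posix";
    let with_shim = "braceexpand:hashall:interactive-comments";
    println!("non-shim posix: {}", is_posix_mode(non_shim));
    println!("with shim posix: {}", is_posix_mode(with_shim));
}
```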

The differences in BASH_EXECUTION_STRING and PATH were verified using this diff tool.

I did another experiment where I applied this change instead, tracing the shell commands and recording standard error including from external processes such as cat:

diff --git a/gix-merge/tests/merge/blob/platform.rs b/gix-merge/tests/merge/blob/platform.rs
index 6230377b8..050e97873 100644
--- a/gix-merge/tests/merge/blob/platform.rs
+++ b/gix-merge/tests/merge/blob/platform.rs
@@ -265,7 +265,7 @@ theirs
             [gix_merge::blob::Driver {
                 name: "b".into(),
                 command:
-                    "for arg in  %O %A %B %L %P %S %X %Y %F; do echo $arg >> \"%A\"; done; cat \"%O\" \"%B\" >> \"%A\""
+                    "exec 2>messages; set -x; for arg in  %O %A %B %L %P %S %X %Y %F; do echo $arg >> \"%A\"; done; cat \"%O\" \"%B\" >> \"%A\""
                         .into(),
                 ..Default::default()
             }],

Analogously to the above-described experiment, I ran the test with that change twice, once without the shim and once with the shim, moving and renaming the log file after each run.

The diffs showed only the change in temporary-file names. In neither run was anything in a different order. Nor were there any error messages, such as I would expect if the shell could not open a file for redirection or if cat could not open a file for read.

All I can think of is poking around in the script, maybe by starting to execute it directly.

Running modified versions of the script that write to different locations I can easily inspect within a test directory does not reveal anything. But I am not sure this captures what is relevant. I think the next step along these lines is to preserve the temporary files that the test uses and examine them.

@Byron
Member

Byron commented Feb 27, 2025

That's good to know.

Of course, I still hope I manage to come to a sufficient understanding sooner rather than later, since it is itself a fragment of the larger issue of how to make gix-command use shells more robustly.

This is incredibly valued work, and work that I couldn't even do myself, as I wouldn't last long. I guess that's another way to thank you for your incredible work!

Although I think that will be the solution, I don't want to jump to it just yet, because:

My initial intuition was to jump to it and be done, but the reasoning to not do that just yet is very sound. I found the "do the work of the shim and avoid calling it that way" very interesting. Understanding this better could at least lead to documentation that helps to one day implement it, even if it's not done in the initial implementation.

I mostly skimmed over the details that followed, but saw the details of the merge-driver runs. It's strange that it has the wrong output despite the calls being the same, and that it seems so elusive to find out why it does that. The caller is known, the input is known, the binary is known, the script itself is known, so what could possibly be causing the different result? I wonder if it can be executed directly while reproducing the issue, allowing it to be prodded and probed with impunity until it reveals its secret.

The diff-tool is very interesting by the way, I starred the repo and would hope to benefit from it one day.

@EliahKagan
Member Author

EliahKagan commented Mar 1, 2025

I have found that the cause of the problem is that, if...

  • a shell in one MSYS2 installation, that uses one msys-2.0.dll
  • runs a program from another MSYS2 installation, whose msys-2.0.dll differs,
  • and uses >> redirection to open a file as stdout for append

...then, instead of appending, the effect of >> is instead to overwrite at the beginning of the file! Note that this is not the same as acting like >, which truncates. This does not truncate.

In the failing test, the eight bytes b\ntheirs (where, by \n, I mean a newline character) should be appended to the file whose path is substituted for the %A placeholder, because >> is used. But those bytes are instead written to the beginning, over what was there before. That produces the observed output, which fails the test because it starts with the line b, which does not contain the substring .tmp.

Simplified, repeatable demonstration

Here is a boiled-down example, with more explicit (but otherwise simpler) commands, using two PortableGit installations:

C:\Users\ek> C:\Users\ek\opt\PortableGit-2.47.1.2\usr\bin\sh.exe -c 'printf abcdefg >long; printf ABC >short; /c/Users/ek/opt/PortableGit-2.48.1/usr/bin/cat short >>long; /usr/bin/cat long'
ABCdefg

It should produce abcdefgABC, but instead it produces ABCdefg. That there are no newlines, and that the overwritten letters are the capitalized versions of what they replace, is for clarity only--I have verified that neither of these illustrative elements is required. In addition, while that example uses cat to make it easier to compare to what is happening in the failing test, this behavior is not specific to cat. This has the same effect:

C:\Users\ek> C:\Users\ek\opt\PortableGit-2.47.1.2\usr\bin\sh.exe -c 'printf abcdefg >long; /c/Users/ek/opt/PortableGit-2.48.1/usr/bin/printf ABC >>long; /usr/bin/cat long'
ABCdefg

(bash users may be accustomed to printf being a builtin, but it is also available as an external command.)
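To make the difference between the expected and observed results concrete, here is a small sketch that models the two outcomes with ordinary file operations. It does not reproduce the bug or claim anything about what msys-2.0.dll does internally; it only shows that appending gives abcdefgABC while writing from offset 0 without truncating gives ABCdefg. The helper names and the temporary file name are made up for illustration.

```rust
use std::fs::{self, OpenOptions};
use std::io::{self, Seek, SeekFrom, Write};
use std::path::Path;

/// What `>>` is supposed to do: every write goes to the end of the file.
fn append(path: &Path, data: &[u8]) -> io::Result<()> {
    OpenOptions::new().append(true).open(path)?.write_all(data)
}

/// What was observed instead: write from offset 0, without truncating.
fn overwrite_at_start(path: &Path, data: &[u8]) -> io::Result<()> {
    let mut file = OpenOptions::new().write(true).open(path)?;
    file.seek(SeekFrom::Start(0))?;
    file.write_all(data)
}

/// Hypothetical helper: create the file with `initial`, apply `op`, return the contents.
fn run(
    path: &Path,
    initial: &str,
    op: fn(&Path, &[u8]) -> io::Result<()>,
    data: &str,
) -> io::Result<String> {
    fs::write(path, initial)?;
    op(path, data.as_bytes())?;
    let contents = fs::read_to_string(path)?;
    fs::remove_file(path)?;
    Ok(contents)
}

fn main() {
    let path = std::env::temp_dir().join(format!("append-demo-{}", std::process::id()));
    // Expected behavior of `>>`: prints abcdefgABC
    println!("{}", run(&path, "abcdefg", append, "ABC").unwrap());
    // Observed behavior: prints ABCdefg
    println!("{}", run(&path, "abcdefg", overwrite_at_start, "ABC").unwrap());
}
```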

When the outer sh is interactive, Ctrl+C has no effect in it afterwards, suggesting that signal handling is disrupted in it. But compared to the effect of >> effectively seeking to 0 and overwriting, disruption of signal handling is not something I find especially unintuitive.

This is distinct from "DLL hell"

This also does not arise due to executables using the wrong DLL, at least not in a straightforward way:

  • The Windows 10 development system where I observed the test failure has an environment that is unhygienic in the sense that a directory appears in my PATH that contains some MSYS2-related DLLs, including msys-2.0.dll. This directory is furthermore for a different MSYS2-related DLL than the one the outer shell associated with Git for Windows uses.

    (This would be irrelevant on a Unix-like system, but on Windows the directories in PATH are included among those that are searched when an executable that links to a DLL is loaded. PATH on Windows contributes to searches that, on a Unix-like system, would use PATH, but it also contributes to searches that, on a Unix-like system, would use LD_LIBRARY_PATH.)

  • But that is not required to produce this. The significance of the dissonant PATH was only that it found the cat executable of a different MSYS2 installation. As shown above, I can do it between two PortableGit installations of different versions where PATH is not modified. That was on a virtual machine completely isolated from my Windows 10 development machine. It also goes both ways, even on my Windows 10 development machine. Either MSYS2 installation can be the caller, calling the other one.

  • The problem occurs both when they are two different PortableGit versions and when one of them is PortableGit (or git installed through scoop, and presumably this would work with any Git for Windows installation method) and the other is a normal MSYS2 installation.

  • Moving the executable being called out of the foreign bin directory makes the problem go away, which seems to be due to causing it to use the wrong msys-2.0.dll, i.e. the same one as the caller, rather than the one it was built and installed to use. Placing it in a directory with a copy of the msys-2.0.dll it would ordinarily use brings back the problem.

That they use msys-2.0.dll at different versions (or at least different builds) is essential, however:

  • The problem does not happen for non-MSYS2 programs placed in a bin directory where they would cause the problem if they were MSYS2 programs.
  • The problem does not happen when there are two separate MSYS2 installations of the same version with the same msys-2.0.dll.

But it is not unprecedented

While this seems unlikely to be anticipated or desired, I am not certain that this is even considered a bug in MSYS2. As far as I can find, no MSYS2 documentation warns against a program in one MSYS2 installation calling a program in another MSYS2 installation. However, this is explicitly documented as unsupported in Cygwin (from which MSYS2 derives):

However, a clean separation requires that you don't try to run executables of one Cygwin installation from processes running in another Cygwin installation. This may or may not work, but the chances that the result is not what you expect are pretty high.

Though what I have observed seems to me much subtler, more surprising, and if unexpected more insidious in its effects, than some of the usual effects that are warned about:

If you get the error "shared region is corrupted" or "shared region version mismatch" it means you have multiple versions of cygwin1.dll running at the same time which conflict with each other.

Outer strace results

MSYS2 installations, including the MSYS2 installation that is provided with Git for Windows, come with an strace program. This knows about MSYS2 system call emulation and displays output in a way that makes sense. But it is actually a native Windows executable. It does not itself link to msys-2.0.dll or any other MSYS2 DLLs. As such, running the redirected command with strace keeps the problem from occurring, so strace cannot be used to observe calls in the executable that receives the opened file handle and whose writes overwrite bytes of the file.

But it can be used in the outer process, i.e. the shell that performs the redirection. When I do that, I do not see anything obviously wrong. It shows an open call with a flags argument of 0x209. Examining the relevant header file reveals that this is:

0x209 = 0x200 | 0x8 | 0x1 = _FCREAT | _FAPPEND | O_WRONLY

For comparison, when I use > redirection, the open call has a flags argument of 0x601, which is:

0x601 = 0x400 | 0x200 | 0x1 = _FTRUNC | _FCREAT | O_WRONLY
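The decomposition of those two flag values can be checked mechanically. The following is a small sketch that uses only the bit values quoted above (0x8, 0x200, 0x400). Treating the low two bits as an access-mode field is an assumption of this sketch, as are the constant names and the `describe` function.

```rust
// Bit values as quoted in the comment above; the constant names are illustrative.
const FAPPEND: u32 = 0x8;
const FCREAT: u32 = 0x200;
const FTRUNC: u32 = 0x400;
// Assumption: the low two bits hold the access mode rather than independent flags.
const ACCESS_MODE_MASK: u32 = 0x3;

/// Renders an `open` flags value as its access mode plus the named bits that are set.
fn describe(flags: u32) -> String {
    let mut parts = vec![format!("access mode {}", flags & ACCESS_MODE_MASK)];
    for (bit, name) in [(FAPPEND, "_FAPPEND"), (FCREAT, "_FCREAT"), (FTRUNC, "_FTRUNC")] {
        if flags & bit != 0 {
            parts.push(name.to_string());
        }
    }
    // Report anything this sketch does not know about instead of silently dropping it.
    let known = ACCESS_MODE_MASK | FAPPEND | FCREAT | FTRUNC;
    if flags & !known != 0 {
        parts.push(format!("unknown bits {:#x}", flags & !known));
    }
    parts.join(" | ")
}

fn main() {
    println!("0x209 = {}", describe(0x209));
    println!("0x601 = {}", describe(0x601));
}
```

Both values carry the same access mode and the create bit; they differ only in that 0x209 has the append bit where 0x601 has the truncate bit, which is what one would expect for >> versus >.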

MSYS2 does not normally allow weird seeking

One question on my mind was how far from normal this behavior of >> really is. I am unaccustomed to being able to seek back behind where I started when opening a file in append mode! But is that restriction actually followed on all platforms? If not, then this still would feel like an MSYS2 bug--and still definitely constitute a malfunction, since commands like cat and printf certainly do not attempt to seek backwards from where they started!--but the strangeness of the behavior would be decreased. This could be tested with a program like:

use std::fs::File;
use std::io::{self, Seek, SeekFrom, Write};
use std::os::unix::io::{AsRawFd, FromRawFd};

fn main() -> io::Result<()> {
    let handle = io::stdout().lock();
    let fd = handle.as_raw_fd();
    let mut out = unsafe { File::from_raw_fd(fd) };

    out.seek(SeekFrom::Start(0))?;
    write!(out, "ABC")?;
    out.flush()?;

    Ok(())
}

Except, Rust doesn't have an MSYS2 target. (It has MinGW toolchains, but it does not produce executables that use msys-2.0.dll for system call emulation and libc.) So:

#define _POSIX_C_SOURCE 1

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

_Noreturn static void die(const char *msg)
{
    perror(msg);
    exit(EXIT_FAILURE);
}

int main(void)
{
    if (lseek(fileno(stdout), 0, SEEK_SET) == -1)
        die("lseek failed");

    if (fputs("ABC", stdout) == EOF)
        die("fputs failed");

    if (fflush(stdout) != 0)
        die("fflush failed");
}

That C program on MSYS2, like both it and the Rust program on GNU/Linux, runs and reports no errors with > or >> redirection. But, as expected, redirecting >>outfile where outfile starts out as abcdefg results in new contents of abcdefgABC, not ABCdefg. MSYS2 does not allow seeking behind where one starts when opening in append mode. The overwriting behavior seen with >> is, in a sense, as weird on MSYS2 as it is on other systems.
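The same check can be made without any shell redirection, by opening the file in append mode within the program itself. This is a minimal sketch rather than a replacement for the programs above; the function name and the scratch file name are hypothetical.

```rust
use std::fs::{self, OpenOptions};
use std::io::{self, Seek, SeekFrom, Write};

/// Opens a file containing `abcdefg` in append mode, seeks to the start,
/// writes `ABC`, and returns the resulting contents.
fn seek_then_write_in_append_mode() -> io::Result<String> {
    // Hypothetical scratch file name, unique per process.
    let name = format!("append-seek-demo-{}", std::process::id());
    let path = std::env::temp_dir().join(name);
    fs::write(&path, "abcdefg")?;
    {
        let mut file = OpenOptions::new().append(true).open(&path)?;
        // The seek itself succeeds, but in append mode each write still goes to the end.
        file.seek(SeekFrom::Start(0))?;
        file.write_all(b"ABC")?;
    } // The file is closed here, before it is read back.
    let contents = fs::read_to_string(&path)?;
    fs::remove_file(&path)?;
    Ok(contents)
}

fn main() -> io::Result<()> {
    println!("{}", seek_then_write_in_append_mode()?);
    Ok(())
}
```

On GNU/Linux this prints `abcdefgABC`, consistent with the results described above: the seek is accepted, but it does not change where appended data lands.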

I will search further to see if this has been reported before and try to figure out if it is considered a bug in MSYS2, then try to open a bug for the behavior or a feature request (or pull request) to update the documentation, or both.

This was found by proceeding along the lines you suggested

I wonder if it can be executed directly while reproducing the issue, allowing it to be prodded and probed with impunity until it reveals its secret.

I was able to do this by making the change:

diff --git a/gix-merge/src/blob/platform/merge.rs b/gix-merge/src/blob/platform/merge.rs
index 058a30128..88aaf05ad 100644
--- a/gix-merge/src/blob/platform/merge.rs
+++ b/gix-merge/src/blob/platform/merge.rs
@@ -399,6 +399,8 @@ impl<'parent> PlatformRef<'parent> {
         match self.configured_driver() {
             Ok(driver) => {
                 let mut cmd = self.prepare_external_driver(driver.command.clone(), labels, context.clone())?;
+                let arg1 = cmd.cmd.get_args().nth(1).unwrap().to_str().unwrap().to_owned();
+                std::fs::write("arg1", arg1)?;
                 let status = cmd.status().map_err(|err| Error::SpawnExternalDriver {
                     cmd: format!("{:?}", cmd.cmd),
                     source: err,

I then debugged the test, with a breakpoint placed at the line immediately after the changed lines. I ran the test up to the breakpoint, then examined arg1 to make sure the problem was not with the way I had been attempting to look at it before, and to get the exact command for this unit test run with the exact paths it will use (including the temporary-file filenames that change each time):

for arg in  C:\Users\ek\source\repos\gitoxide\gix-merge\.tmpV0qmjs C:\Users\ek\source\repos\gitoxide\gix-merge\.tmphsSiEQ C:\Users\ek\source\repos\gitoxide\gix-merge\.tmp6oOm3k 7 'b' 'ancestor label' 'current label' 'other label' %F; do echo $arg >> "C:\Users\ek\source\repos\gitoxide\gix-merge\.tmphsSiEQ"; done; cat "C:\Users\ek\source\repos\gitoxide\gix-merge\.tmpV0qmjs" "C:\Users\ek\source\repos\gitoxide\gix-merge\.tmp6oOm3k" >> "C:\Users\ek\source\repos\gitoxide\gix-merge\.tmphsSiEQ"

Using vim -b in case some encoding oddity was the cause (it wasn't), I created two new files, splitting arg1 between the loop that performs the expansion and uses the echo builtin (arg1-part1), and the cat command with >> redirection (arg1-part2):

for arg in  C:\Users\ek\source\repos\gitoxide\gix-merge\.tmpV0qmjs C:\Users\ek\source\repos\gitoxide\gix-merge\.tmphsSiEQ C:\Users\ek\source\repos\gitoxide\gix-merge\.tmp6oOm3k 7 'b' 'ancestor label' 'current label' 'other label' %F; do echo $arg >> "C:\Users\ek\source\repos\gitoxide\gix-merge\.tmphsSiEQ"; done
cat "C:\Users\ek\source\repos\gitoxide\gix-merge\.tmpV0qmjs" "C:\Users\ek\source\repos\gitoxide\gix-merge\.tmp6oOm3k" >> "C:\Users\ek\source\repos\gitoxide\gix-merge\.tmphsSiEQ"

In PowerShell (where cat is an alias of Get-Content, not to be confused with cat in a shell script), I assigned the contents of those files to variables, verified that the split really didn't lose anything except the ; and space between the parts, and staged everything including the temporary files created by the test with the contents they were set up with before the external merge command runs:

$script = cat gix-merge/arg1
$script1 = cat gix-merge/arg1-part1
$script2 = cat gix-merge/arg1-part2
$script -eq "$script1; $script2"  # Printed `True`, as expected.
git add .

After that, git diff --staged -- gix-merge/.tmp* printed this, which subsequent git restore . would be able to bring back so long as I didn't stage again:

diff --git a/gix-merge/.tmp6oOm3k b/gix-merge/.tmp6oOm3k
new file mode 100644
index 000000000..228068dbe
--- /dev/null
+++ b/gix-merge/.tmp6oOm3k
@@ -0,0 +1 @@
+theirs
\ No newline at end of file
diff --git a/gix-merge/.tmpV0qmjs b/gix-merge/.tmpV0qmjs
new file mode 100644
index 000000000..617807982
--- /dev/null
+++ b/gix-merge/.tmpV0qmjs
@@ -0,0 +1 @@
+b
diff --git a/gix-merge/.tmphsSiEQ b/gix-merge/.tmphsSiEQ
new file mode 100644
index 000000000..424860eef
--- /dev/null
+++ b/gix-merge/.tmphsSiEQ
@@ -0,0 +1 @@
+ours
\ No newline at end of file

This let me do what the test does, but check the state before and after the cat command with >> redirection (i.e., before and after $script2).

Checking this semi-manual test procedure

First, as a control, with the shim that I know makes the test pass, running the first part of the command:

C:\Users\ek\scoop\apps\git\2.48.1\bin\sh.exe -c "$script1"
git diff
warning: in the working copy of 'gix-merge/.tmphsSiEQ', LF will be replaced by CRLF the next time Git touches it
diff --git a/gix-merge/.tmphsSiEQ b/gix-merge/.tmphsSiEQ
index 424860eef..bebbf0e54 100644
--- a/gix-merge/.tmphsSiEQ
+++ b/gix-merge/.tmphsSiEQ
@@ -1 +1,9 @@
-ours
\ No newline at end of file
+oursC:Userseksourcereposgitoxidegix-merge.tmpV0qmjs
+C:Userseksourcereposgitoxidegix-merge.tmphsSiEQ
+C:Userseksourcereposgitoxidegix-merge.tmp6oOm3k
+7
+b
+ancestor label
+current label
+other label
+%F

Still in the control, running the second part of the script:

C:\Users\ek\scoop\apps\git\2.48.1\bin\sh.exe -c "$script2"
git diff
warning: in the working copy of 'gix-merge/.tmphsSiEQ', LF will be replaced by CRLF the next time Git touches it
diff --git a/gix-merge/.tmphsSiEQ b/gix-merge/.tmphsSiEQ
index 424860eef..7ef80b2e1 100644
--- a/gix-merge/.tmphsSiEQ
+++ b/gix-merge/.tmphsSiEQ
@@ -1 +1,11 @@
-ours
\ No newline at end of file
+oursC:Userseksourcereposgitoxidegix-merge.tmpV0qmjs
+C:Userseksourcereposgitoxidegix-merge.tmphsSiEQ
+C:Userseksourcereposgitoxidegix-merge.tmp6oOm3k
+7
+b
+ancestor label
+current label
+other label
+%F
+b
+theirs
\ No newline at end of file

This confirmed that when the shim sh.exe is used, the expected effect in the test--which is known to work when the shim is used--happens. The b and theirs are appended. It also confirmed that this approach to testing is capable of observing the effects of the first and second halves of the original command, in $script1 and $script2.

Using this semi-manual procedure

I ran git restore ., then proceeded as before but with the non-shim sh.exe:

C:\Users\ek\scoop\apps\git\2.48.1\usr\bin\sh.exe -c "$script1"
git diff

That produced the same effect as with the shim:

warning: in the working copy of 'gix-merge/.tmphsSiEQ', LF will be replaced by CRLF the next time Git touches it
diff --git a/gix-merge/.tmphsSiEQ b/gix-merge/.tmphsSiEQ
index 424860eef..bebbf0e54 100644
--- a/gix-merge/.tmphsSiEQ
+++ b/gix-merge/.tmphsSiEQ
@@ -1 +1,9 @@
-ours
\ No newline at end of file
+oursC:Userseksourcereposgitoxidegix-merge.tmpV0qmjs
+C:Userseksourcereposgitoxidegix-merge.tmphsSiEQ
+C:Userseksourcereposgitoxidegix-merge.tmp6oOm3k
+7
+b
+ancestor label
+current label
+other label
+%F

But then the second half, of course, does not:

C:\Users\ek\scoop\apps\git\2.48.1\usr\bin\sh.exe -c "$script2"
git diff

This produced the effect that occurs when the test is run all the way through and fails:

warning: in the working copy of 'gix-merge/.tmphsSiEQ', LF will be replaced by CRLF the next time Git touches it
diff --git a/gix-merge/.tmphsSiEQ b/gix-merge/.tmphsSiEQ
index 424860eef..2501f1918 100644
--- a/gix-merge/.tmphsSiEQ
+++ b/gix-merge/.tmphsSiEQ
@@ -1 +1,10 @@
-ours
\ No newline at end of file
+b
+theirserseksourcereposgitoxidegix-merge.tmpV0qmjs
+C:Userseksourcereposgitoxidegix-merge.tmphsSiEQ
+C:Userseksourcereposgitoxidegix-merge.tmp6oOm3k
+7
+b
+ancestor label
+current label
+other label
+%F

But by breaking it down this way, observing the files after each major step, preserving the files, and being able to easily rerun everything, with and without variations... it was very clear that a >>-redirected command was really writing over the beginning of its output file! I had previously assumed this must not be happening--it was.

Which cat?

The cat the non-shim found was in my regular MSYS2 installation:

> C:\Users\ek\scoop\apps\git\2.48.1\usr\bin\sh.exe -c 'type cat'
cat is /c/msys64/usr/bin/cat

In contrast, with the shim:

> C:\Users\ek\scoop\apps\git\2.48.1\bin\sh.exe -c 'type cat'
cat is /usr/bin/cat

(This means C:\Users\ek\scoop\apps\git\2.48.1\usr\bin\cat.exe, a program in my Git for Windows MSYS2 installation on this system.)

This was the first piece of the puzzle I had. See the mention of the differing PATH in previous comments.

Which cat is used is enough to produce the difference. I repeated the above experiment, with both the shim and non-shim sh.exe, giving absolute paths for cat multiple times for different paths and ways of running the shell. Using cat from the different MSYS2 installation always caused the problem whether or not a shim was used.
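The difference between the two `type cat` results comes down to search order: the first directory that contains a matching executable wins. Here is a rough sketch of that lookup. The `resolve` function is hypothetical, the directory lists are illustrative rather than the actual PATH values on that system, and the existence check is passed in as a closure so the logic can be tried without those directories being present.

```rust
use std::path::{Path, PathBuf};

/// Returns the first `dir/name` for which `exists` reports true, mimicking
/// how a shell resolves an external command through a list of directories.
fn resolve(name: &str, search_dirs: &[&Path], exists: impl Fn(&Path) -> bool) -> Option<PathBuf> {
    search_dirs
        .iter()
        .map(|dir| dir.join(name))
        .find(|candidate| exists(candidate.as_path()))
}

fn main() {
    // Illustrative orderings only: both directories are assumed to contain `cat`.
    let foreign_first = [Path::new("/c/msys64/usr/bin"), Path::new("/usr/bin")];
    let own_first = [Path::new("/usr/bin"), Path::new("/c/msys64/usr/bin")];

    for dirs in [&foreign_first, &own_first] {
        match resolve("cat", dirs, |_| true) {
            Some(found) => println!("cat is {}", found.display()),
            None => println!("cat not found"),
        }
    }
}
```

With the foreign MSYS2 directory searched first, the lookup lands on the `cat` from the other installation, which is the situation the non-shim sh.exe was in.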

Making an independent demonstration

From that, it was easy to make:

C:\Users\ek\scoop\apps\git\2.48.1\usr\bin\sh.exe -c 'printf abcdefg >long; printf ABC >short; cat short >> long; cat long'
ABCdefg

Then I tested in different directions and on two other systems, including the Windows 11 virtual machine used for the clearer version of that command shown near the top of the comment.

(That command, shown in the very first code block of this comment, also works in the very hygienic environment tested there--where no shims are used and no relevant directories are ever in PATH--because it uses absolute paths for all external commands, including the final cat to display the file.)

What it means for this PR

Unless I find out that this is known, I should report it to MSYS2, and possibly even to Git for Windows, since it is with Git for Windows that I expect this kind of cross-MSYS2 interaction is most likely to occur. But that is not a blocker for this PR.

This shows one case of the problem with not prepending directories to PATH for the desired environment. Since using a shim is the easy way to do this in a way that is likely to be correct, I think the idea of changing it to use (git root)/bin/sh.exe instead of (git root)/usr/bin/sh.exe is the way to go, so long as it is okay to exacerbate #1868 in the ways described there.
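For illustration, prepending directories to PATH for a child process could look roughly like the following. This is a sketch of the general technique only, not of what the Git for Windows shim actually does internally. The `prepend_to_path` function and the directories in `main` are hypothetical.

```rust
use std::env;
use std::ffi::{OsStr, OsString};
use std::path::PathBuf;

/// Builds a PATH value in which `extra_dirs` come before everything in `current`.
fn prepend_to_path(extra_dirs: &[PathBuf], current: &OsStr) -> Result<OsString, env::JoinPathsError> {
    env::join_paths(extra_dirs.iter().cloned().chain(env::split_paths(current)))
}

fn main() -> Result<(), env::JoinPathsError> {
    // Hypothetical directories standing in for an installation's own `bin` directories.
    let extra = [PathBuf::from("/git/mingw64/bin"), PathBuf::from("/git/usr/bin")];
    let current = env::var_os("PATH").unwrap_or_default();
    let new_path = prepend_to_path(&extra, &current)?;

    // A caller would pass this to the child, e.g. `Command::new(shell).env("PATH", &new_path)`.
    println!("{}", new_path.to_string_lossy());
    Ok(())
}
```

Because the installation's own directories come first, tools such as cat would resolve to the ones built against the same msys-2.0.dll as the shell.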

Really, what I will want to do is to change it to prefer the shim but fall back to the non-shim, for the SDK-related reasons described earlier. I may also want to make it check more robustly that the paths it has found really do seem to be inside something like a Git for Windows installation (if not, then whatever we found may well be worse than just using the simple name sh.exe with the PATH search that entails), if this can be done in a way that is fast and still pretty simple. Beyond that, I am inclined to think that further refinements of the design may be beyond the scope of this PR.

However, before this is ready to merge, I will want to fix the new unrelated breakage on CI (#1849), which blocks important CI tests for this and #1864; retest this locally with the shim-related changes just described; and investigate the local failures in #1864 that also happen only when the non-shim is used, to make sure they are comparable or otherwise explicable (an example of an otherwise explicable cause is that it might even be using the wrong git), to rule out any other readily knowable problems that could apply to this PR too.

Edit (2025-03-15): I've added task and outscoped lists to the top of the PR description for the plan articulated here.

@EliahKagan EliahKagan force-pushed the run-ci/consistent-sh branch from 65f706b to adc2ce1 Compare March 2, 2025 10:57
EliahKagan added a commit to EliahKagan/gitoxide that referenced this pull request Mar 2, 2025
It now prefers the `(git root)/bin/sh.exe` shim, falling back to
the `(git root)/usr/bin/sh.exe` non-shim to support the Git for
Windows SDK which doesn't have the shim.

The reason to prefer the shim is that it sets environment
variables, including prepending `bin` directories that provide
tools one would expect to have when using it. Without this, common
POSIX commands may be unavailable, or different and incompatible
implementations of them may be found. In particular, if they are
found in a different MSYS2 installation whose `msys-2.0.dll` is of
a different version or otherwise a different build, then calling
them directly may produce strange behavior. See:

- https://cygwin.com/faq.html#faq.using.multiple-copies
- GitoxideLabs#1862 (comment)

Overall this makes things more robust than either preferring the
non-shim or just doing a path search for `sh` as was done before
that. But it exacerbates GitoxideLabs#1868 (as described there), so if the Git
for Windows `sh.exe` shim continues to work as it currently does,
then further improvements may be called for here.
EliahKagan added a commit to EliahKagan/gitoxide that referenced this pull request Mar 3, 2025
This makes a few changes to make `shell()` more robust:

1. Check the last two components of the path `git --exec-path`
   gave, to make sure they are `libexec/git-core`.

   (The check is done in such a way that the separator may be `/`
   or `\`, though a `\` separator here would be unexpected. We
   permit it because it may plausibly be present due to an
   overridden `GIT_EXEC_PATH` that breaks with Git's own behavior of
   using `/` but that is otherwise fully usable.)

   If the directory is not named `git-core`, or it is a top-level
   directory (no parent),  or its parent is not named `libexec`,
   then it is not reasonable to guess that this is in a directory
   where it would be safe to use `sh.exe` in the expected relative
   location. (Even if safe, such a layout does not suggest that a
   `sh.exe` found in it would be a better choice than the fallback of
   just doing a `PATH` search.)

2. Check the grandparent component (that `../..` would go to) of
   the path `git --exec-path` gave, to make sure it is a recognized
   name of a platform-specific `usr`-like directory that has been
   used in MSYS2.

   This is to avoid traversing up out of less common directory
   trees that have some different and shallower structure than
   found in a typical Git for Windows or MSYS2 installation.

3. Instead of using only the `(git root)/usr/bin/sh.exe` non-shim,
   prefer the `(git root)/bin/sh.exe` shim. If that is not found,
   fall back to the `(git root)/usr/bin/sh.exe` non-shim, mainly to
   support the Git for Windows SDK, which doesn't have the shim.

   The reason to prefer the shim is that it sets environment
   variables, including prepending `bin` directories that provide
   tools one would expect to have when using it. Without this,
   common POSIX commands may be unavailable, or different and
   incompatible implementations of them may be found.

   In particular, if they are found in a different MSYS2
   installation whose `msys-2.0.dll` is of a different version or
   otherwise a different build, then calling them directly may
   produce strange behavior. See:

   - https://cygwin.com/faq.html#faq.using.multiple-copies
   - GitoxideLabs#1862 (comment)

   This makes things more robust overall than either preferring the
   non-shim or just doing a path search for `sh` as was done before
   that. But it exacerbates GitoxideLabs#1868 (as described there), so if the
   Git for Windows `sh.exe` shim continues to work as it currently
   does, then further improvements may be called for here.
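The three checks described in that commit message can be sketched as follows. This is an illustration under stated assumptions, not the code in `gix-path`: the function names are hypothetical, the list of `usr`-like directory names is an assumed, non-authoritative set of MSYS2 environment prefixes, and the existence check is injected as a closure so the logic can be exercised without a real installation. Only `/` separators are handled here.

```rust
use std::path::{Path, PathBuf};

// Assumed names of `usr`-like prefix directories; illustrative, not authoritative.
const USR_LIKE: &[&str] = &["mingw64", "mingw32", "clangarm64", "clang64", "clang32", "ucrt64", "usr"];

/// Infers `(git root)` from the path that `git --exec-path` gave, or returns
/// `None` if the layout does not resemble `(git root)/<usr-like>/libexec/git-core`.
fn infer_git_root(exec_path: &Path) -> Option<&Path> {
    // Step 1: the last two components must be `libexec/git-core`.
    if !exec_path.ends_with("libexec/git-core") {
        return None;
    }
    // Step 2: the grandparent must be a recognized `usr`-like directory.
    let prefix = exec_path.ancestors().nth(2)?;
    let name = prefix.file_name()?.to_str()?;
    if !USR_LIKE.contains(&name) {
        return None;
    }
    prefix.parent()
}

/// Step 3: prefer the shim, fall back to the non-shim, else leave it to a `PATH` search.
fn pick_shell(exec_path: &Path, exists: impl Fn(&Path) -> bool) -> PathBuf {
    infer_git_root(exec_path)
        .and_then(|root| {
            ["bin/sh.exe", "usr/bin/sh.exe"]
                .iter()
                .map(|relative| root.join(relative))
                .find(|candidate| exists(candidate.as_path()))
        })
        .unwrap_or_else(|| PathBuf::from("sh.exe"))
}

fn main() {
    let exec_path = Path::new("C:/Program Files/Git/mingw64/libexec/git-core");
    let shell = pick_shell(exec_path, |candidate| candidate.is_file());
    println!("{}", shell.display());
}
```

Returning the bare name `sh.exe` when any check fails mirrors the fallback described above, where an unrecognized layout is treated as no better than an ordinary `PATH` search.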
@EliahKagan EliahKagan force-pushed the run-ci/consistent-sh branch 2 times, most recently from 352581d to ac34530 Compare March 9, 2025 00:10
EliahKagan added a commit to EliahKagan/gitoxide that referenced this pull request Mar 11, 2025
@EliahKagan EliahKagan force-pushed the run-ci/consistent-sh branch from ac34530 to 93f0804 Compare March 11, 2025 11:13
@EliahKagan
Member Author

I've added a task list (and outscoped list) to the PR description for the plan articulated in #1862 (comment), so it's clear what is done and what is left to do. I think this may be almost ready.

EliahKagan added a commit to EliahKagan/gitoxide that referenced this pull request Mar 14, 2025
@EliahKagan EliahKagan force-pushed the run-ci/consistent-sh branch from 93f0804 to 37c7b0e Compare March 14, 2025 05:39
@Byron
Member

Byron commented Mar 15, 2025

Something I wanted to just have mentioned, even though this might not be the best place.

Indeed, using the SHELL environment variable is, as predicted, a bad idea: despite usually yielding the login shell, it's not necessarily POSIX compatible, which is required by any kind of 'shellanigans' that gix-command performs. GitButler broke the moment someone used it with fish 😅.
Ultimately, GitButler settled for using SHELL but only if it's a known POSIX (-enough) compliant shell that gix-command is known to work with, and otherwise going for gix_path::env::shell().
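A minimal sketch of that kind of selection might look like the following. It is not GitButler's code: the allowlist of shell names is an assumption made for illustration, `choose_shell` is a hypothetical function, and the fallback parameter stands in for whatever `gix_path::env::shell()` would return.

```rust
use std::path::{Path, PathBuf};

// Assumed allowlist of POSIX(-enough) shell names; illustrative only.
const KNOWN_POSIX_ENOUGH: &[&str] = &["sh", "bash", "zsh", "dash", "ksh"];

/// Uses the value of `SHELL` only if its file name is on the allowlist,
/// otherwise returns `fallback`.
fn choose_shell(shell_var: Option<&str>, fallback: &Path) -> PathBuf {
    shell_var
        .map(Path::new)
        .filter(|shell| {
            shell
                .file_name()
                .and_then(|name| name.to_str())
                .map_or(false, |name| KNOWN_POSIX_ENOUGH.contains(&name))
        })
        .map(Path::to_path_buf)
        .unwrap_or_else(|| fallback.to_path_buf())
}

fn main() {
    let shell_var = std::env::var("SHELL").ok();
    // `sh` here is a placeholder for the fallback shell path.
    let shell = choose_shell(shell_var.as_deref(), Path::new("sh"));
    println!("{}", shell.display());
}
```

With this shape, a `SHELL` of `/usr/bin/fish` is ignored in favor of the fallback, while `/bin/bash` is used as-is.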

The idea is to have a higher chance of picking up the configuration that the user sees in terminals. In practice, that doesn't work much better, at least not for me, so I keep thinking that there should be another way of extracting the environment variables from an actual login shell instead.

If gix-command eventually learned that, it would definitely be useful for client-like GUI programs. Also, I am not advocating for implementing this here; if it lands, it will probably be in GitButler first and eventually find a more general implementation here.

EliahKagan added a commit to EliahKagan/gitoxide that referenced this pull request Mar 16, 2025
The change in the previous commit of switching to the non-shim
`bash.exe` in `(git root)/usr/bin` causes problems, because the
environment may not be correct for shell commands and scripts.
In particular, the `PATH` might not enable standard POSIX tools to
be found, or the tools that are found may interoperate incorrectly
with the shell. The latter caused failures in GitoxideLabs#1862 in an analogous
choice of `sh.exe`, which were addressed by preferring the shim
when available. See:

- GitoxideLabs#1862 (comment)

Here, 90 tests started to fail when the test suite was run locally
from PowerShell (i.e. not a Git Bash environment) on a Windows 10
system that, in addition to a full Git for Windows installation,
contains a separate non-GfW MSYS2 installation whose `bin`
directories are in `PATH` even in non-MSYS2 environments. The
failures were described, and most of them investigated, as follows:

- GitoxideLabs#1864 (comment)
- https://gist.github.com/EliahKagan/3c5eebd091e66d8c912fddbce0a064cd
- https://gist.github.com/EliahKagan/17066ad1f7b0aa98e4fdf4642abe1d93

Most failures, including all those that were unintuitive, were
directly or indirectly due to the `make_remote_repos.sh` fixture
script encountering the error:

    fatal: bad config line 10 in file ./config

This happened due to the same incorrect behavior of `>>`, when used
by a shell that links to one `msys-2.0.dll` running a program that
links to a different `msys-2.0.dll` of another version or build, as
caused the failure encountered with the non-shim in GitoxideLabs#1862.

(It may be that the handful of other failures are also caused by this
`>>` problem, but as of now that has not been examined.)

This commit temporarily instruments that fixture script so that,
when tests are run, the observations and analysis in the last gist
above can be confirmed. (These changes are also shown there.)
EliahKagan added a commit to EliahKagan/gitoxide that referenced this pull request Mar 16, 2025
The change in the previous commit of switching to the non-shim
`bash.exe` in `(git root)/usr/bin` causes problems, because the
environment may not be correct for shell commands and scripts.
In particular, the `PATH` might not enable standard POSIX tools to
be found, or the tools that are found may interoperate incorrectly
with the shell. The latter caused failures in GitoxideLabs#1862 in an analogous
choice of `sh.exe`, which were addressed by preferring the shim
when available. See:

- GitoxideLabs#1862 (comment)

Here, 90 tests started to fail when the test suite was run locally
from PowerShell (i.e. not a Git Bash environment) on a Windows 10
system that, in addition to a full Git for Windows installation,
contains a separate non-GfW MSYS2 installation whose `bin`
directories are in `PATH` even in non-MSYS2 environments. The
failures were described, and most of them investigated, as follows:

- GitoxideLabs#1864 (comment)
- https://gist.github.com/EliahKagan/3c5eebd091e66d8c912fddbce0a064cd
- https://gist.github.com/EliahKagan/17066ad1f7b0aa98e4fdf4642abe1d93

Most failures, including all those that were unintuitive, were
directly or indirectly due to the `make_remote_repos.sh` fixture
script encountering the error:

    fatal: bad config line 10 in file ./config

This happened due to the same incorrect behavior of `>>`, when used
by a shell that links to one `msys-2.0.dll` running a program that
links to a different `msys-2.0.dll` of another version or build, as
caused the failure encountered with the non-shim in GitoxideLabs#1862.

(It may be that the handful of other failures are also caused by
this `>>` problem, but as of now that has not been examined.)

This commit temporarily instruments that fixture script so that,
when tests are run, the observations and analysis in the last gist
above can be confirmed. (These changes are also shown there.)
EliahKagan added a commit to EliahKagan/gitoxide that referenced this pull request Mar 18, 2025
Now that Git for Windows 2.49.0 has a stable release, this changes
the upgrade step that was added to `test-fixtures-windows` in
4237e5a (GitoxideLabs#1870), so that it downloads an installer from the release
marked as "latest", rather than the release that has the newest
tag. The release marked "latest" is usually a stable release in
projects that have any stable releases, and in particular it is a
stable release in Git for Windows.

This is *not* needed to switch from the release candidate to the
stable release for 2.49.0. The download logic already in place
currently gets the stable release automatically, because it is the
newest tag.

Nonetheless, there are three reasons to prefer the "latest" tag to
get the stable release, now that the stable release is available.
In descending order of significance, they are:

- We upgrade to work around GitoxideLabs#1849, for which 2.49.0 is preferable
  to 2.48.1 (which the Windows runner images currently have).
  Continuing to take the newest tag will eventually take a
  pre-release for the next version. That would probably work, but
  it is not currently a goal.

  There is sometimes a delay between when a stable release of Git
  for Windows comes out and when the stable runner images are
  released with it. (Pre-release runner images exist, but they are
  not run on GitHub-hosted runners.) So even assuming this upgrade
  step is to be removed once it is no longer needed, it could
  easily end up remaining long enough for a new Git for Windows
  pre-release to come out.

- An update may potentially be released for an earlier minor
  version (y in x.y.z), in which case the tag for it would be
  newer and we would downgrade instead. Now that the release
  marked "latest" is usable here, we can use it and avoid that.

- If we decide to eventually deliberately test pre-releases, the
  step added in GitoxideLabs#1849 would probably not be usable in that form,
  because it could take either the next pre-release or a patch to
  an earlier release per the above points, and also for the separate
  reason that this CI job is not necessarily where we would want to
  test that. (As one example, there is currently no CI testing of
  the Git for Windows SDK, even though supporting it is an explicit
  goal discussed in GitoxideLabs#1758, GitoxideLabs#1761, GitoxideLabs#1862, and GitoxideLabs#1864. If that is
  added, it may be a more opportune way to test prereleases.)
EliahKagan added a commit to EliahKagan/gitoxide that referenced this pull request Mar 18, 2025
Now that Git for Windows 2.49.0 has a stable release, this changes
the upgrade step that was added to `test-fixtures-windows` in
4237e5a (GitoxideLabs#1870), so that it downloads an installer from the release
marked as "latest", rather than the release that has the newest
tag. The release marked "latest" is usually a stable release in
projects that have any stable releases, and in particular it is a
stable release in Git for Windows.

This is *not* needed to switch from the release candidate to the
stable release for 2.49.0. The download logic already in place
currently gets the stable release automatically, because it is the
newest tag.

Nonetheless, there are three reasons to prefer the "latest" tag to
get the stable release, now that the stable release is available.
In descending order of significance, they are:

- We upgrade to work around GitoxideLabs#1849, for which 2.49.0 is preferable
  to 2.48.1 (which the Windows runner images currently have).
  Continuing to take the newest tag will eventually take a
  pre-release for the next version. That would probably work, but
  it is not currently a goal.

  There is sometimes a delay between when a stable release of Git
  for Windows comes out and when the stable runner images are
  released with it. (Pre-release runner images exist, but they are
  not run on GitHub-hosted runners.) So even assuming this upgrade
  step is to be removed once it is no longer needed, it could
  easily end up remaining long enough for a new Git for Windows
  pre-release to come out.

- An update may potentially be released for an earlier minor
  version (y in x.y.z), in which case the tag for it would be
  newer and we would downgrade instead. Now that the release
  marked "latest" is usable here, we can use it and avoid that.

- If we decide to eventually deliberately test pre-releases, the
  step added in GitoxideLabs#1849 would probably not be usable in that form,
  because it could take either the next pre-release or a patch to
  an earlier release per the above points, and also for the separate
  reason that this CI job is not necessarily where we would want to
  test that.

  (As one example, there is currently no CI testing of the Git for
  Windows SDK, even though supporting it, in general and for
  running the test suite, is an explicit goal discussed in GitoxideLabs#1758,
  GitoxideLabs#1761, GitoxideLabs#1862, and GitoxideLabs#1864. If that is added, it may be a more
  opportune way to test prereleases.)
EliahKagan added a commit to EliahKagan/gitoxide that referenced this pull request Mar 18, 2025
This makes a few changes to make `shell()` more robust:

1. Check the last two components of the path `git --exec-path`
   gave, to make sure they are `libexec/git-core`.

   (The check is done in such a way that the separator may be `/`
   or `\`, though a `\` separator here would be unexpected. We
   permit it because it may plausibly be present due to an
   overridden `GIT_EXEC_PATH` that breaks with Git's own behavior
   of using `/` but that is otherwise fully usable.)

   If the directory is not named `git-core`, or it is a top-level
   directory (no parent), or its parent is not named `libexec`,
   then it is not reasonable to guess that this is in a directory
   where it would be safe to use `sh.exe` in the expected relative
   location. (Even if safe, such a layout does not suggest that a
   `sh.exe` found in it would be a better choice than the fallback
   of just doing a `PATH` search.)

2. Check the grandparent component (that `../..` would go to) of
   the path `git --exec-path` gave, to make sure it is a recognized
   name of a platform-specific `usr`-like directory that has been
   used in MSYS2.

   This is to avoid traversing up out of less common directory
   trees that have some different and shallower structure than
   found in a typical Git for Windows or MSYS2 installation.

3. Instead of using only the `(git root)/usr/bin/sh.exe` non-shim,
   prefer the `(git root)/bin/sh.exe` shim. If that is not found,
   fall back to the `(git root)/usr/bin/sh.exe` non-shim, mainly to
   support the Git for Windows SDK, which doesn't have the shim.

   The reason to prefer the shim is that it sets environment
   variables, including prepending `bin` directories that provide
   tools one would expect to have when using it. Without this,
   common POSIX commands may be unavailable, or different and
   incompatible implementations of them may be found.

   In particular, if they are found in a different MSYS2
   installation whose `msys-2.0.dll` is of a different version or
   otherwise a different build, then calling them directly may
   produce strange behavior. See:

   - https://cygwin.com/faq.html#faq.using.multiple-copies
   - GitoxideLabs#1862 (comment)

   This makes things more robust overall than either preferring the
   non-shim or just doing a path search for `sh` as was done before
   that. But it exacerbates GitoxideLabs#1868 (as described there), so if the
   Git for Windows `sh.exe` shim continues to work as it currently
   does, then further improvements may be called for here.
@EliahKagan EliahKagan force-pushed the run-ci/consistent-sh branch from 592cc5a to fc8e5f8 Compare March 18, 2025 15:25
EliahKagan added a commit to EliahKagan/gitoxide that referenced this pull request Mar 18, 2025
EliahKagan added a commit to EliahKagan/gitoxide that referenced this pull request Mar 19, 2025
@EliahKagan EliahKagan force-pushed the run-ci/consistent-sh branch from fc8e5f8 to 37d676e Compare March 19, 2025 06:18
EliahKagan added a commit to EliahKagan/gitoxide that referenced this pull request Mar 19, 2025
The change in the previous commit of switching to the non-shim
`bash.exe` in `(git root)/usr/bin` causes problems, because the
environment may not be correct for shell commands and scripts.
In particular, the `PATH` might not enable standard POSIX tools to
be found, or the tools that are found may interoperate incorrectly
with the shell. The latter caused failures in GitoxideLabs#1862 in an analogous
choice of `sh.exe`, which were addressed by preferring the shim
when available. See:

- GitoxideLabs#1862 (comment)

Here, 90 tests started to fail when the test suite was run locally
from PowerShell (i.e. not a Git Bash environment) on a Windows 10
system that, in addition to a full Git for Windows installation,
contains a separate non-GfW MSYS2 installation whose `bin`
directories are in `PATH` even in non-MSYS2 environments. The
failures were described, and most of them investigated, as follows:

- GitoxideLabs#1864 (comment)
- https://gist.github.com/EliahKagan/3c5eebd091e66d8c912fddbce0a064cd
- https://gist.github.com/EliahKagan/17066ad1f7b0aa98e4fdf4642abe1d93

Most failures, including all those that were unintuitive, were
directly or indirectly due to the `make_remote_repos.sh` fixture
script encountering the error:

    fatal: bad config line 10 in file ./config

This happened due to the same incorrect behavior of `>>`, when used
by a shell that links to one `msys-2.0.dll` running a program that
links to a different `msys-2.0.dll` of another version or build, as
caused the failure encountered with the non-shim in GitoxideLabs#1862.

(It may be that the handful of other failures are also caused by
this `>>` problem, but as of now that has not been examined.)

This commit temporarily instruments that fixture script so that,
when tests are run, the observations and analysis in the last gist
above can be confirmed. (These changes are also shown there.)
`gix_command::Prepare` previously used `sh` on Windows rather than
first checking for a usable `sh` implementation associated with a
Git for Windows installation. This changes it to use the `gix-path`
facility for finding what is likely the best 'sh' implementation
for POSIX shell scripts that will operate on Git repositories. This
increases consistency across how different crates find 'sh', and
brings the benefit of preferring the Git for Windows `sh.exe` on
Windows when it can be found reliably.
Because `gix-command` uses `gix_path::env::shell()` to find sh,
and `gix-credentials` uses `gix-command`.
This makes the path returned by `gix_path::env::shell()` on Windows
more usable by:

1. Adding components with `/` separators. While in principle a `\`
   should work, the path of the shell itself is used in shell
   scripts (script files and `sh -c` operands) that may not account
   for the presence of backslashes, and it is also harder to read
   paths with `\` in contexts where it appears escaped, which may
   include various messages from Rust code and shell scripts.

   The path before what we add will already use `/` and never `\`,
   unless `GIT_EXEC_PATH` has been set to a strange value, because
   it is based on `git --exec-path`, which by default gives a path
   with `/` separators. Thus, ensuring that the part we add uses `/`
   should be sufficient to get a path without `\` in all cases when
   it is clearly reasonable to do so. This therefore also usually
   increases stylistic consistency of the path, which is another
   factor that makes it more user-friendly in messages.

   This is needed to get tests to pass since changing `gix-command`
   to use `gix_path::env::shell()` on Windows, where a path is
   formatted in a way that sometimes quotes `\` characters. Their
   expectations could be adjusted, but it seems likely that various
   other software, much of which may otherwise be working, has
   similar expectations. Using `/` instead of `\` works whether `\`
   is expected to be displayed quoted or not.

2. Check that the path to the shell plausibly has a shell there,
   only using it if it is a file or a non-broken file symlink. When
   this is not the case, the fallback short name is used instead.

3. The fallback short name is changed from `sh` to `sh.exe`, since
   the `.exe` suffix is appended in other short names on Windows,
   such as `git.exe`, as well as being part of the filename
   component of the path we build for the shell when using the
   implementation provided as part of Git for Windows.

Those changes only affect Windows.

This also adds tests for (1) and (2) above, as well as for the
expectation that we get an absolute path, to make sure we don't
build a path that would be absolute on a Unix-like system but is
relative on Windows (a path that starts with just one `/` or `\`).

These tests are not Windows-specific, since all these expectations
should already hold on Unix-like systems, where currently we are
using the hard-coded path `/bin/sh`, which is an absolute path on
those systems. (Some Unix-like systems may technically not have
`/bin/sh` or it may not be the best path to use for a shell that
should be POSIX-compatible, but we are already relying on this,
and handling that better is outside the scope of the changes here.)
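
Points (1) and (2) above can be sketched roughly as follows. This is an illustration only, not the code in this PR: the names `shell_under`, `shell_or_fallback`, and `FALLBACK_SHELL` are hypothetical, and the root is assumed to already use `/` separators (as `git --exec-path` output normally does).

```rust
use std::path::PathBuf;

/// Fallback short name, left to be resolved by a `PATH` search.
const FALLBACK_SHELL: &str = "sh.exe";

/// Hypothetical helper: append `relative` to `root` with a `/` separator and
/// return the result only if a shell plausibly exists there.
fn shell_under(root: &str, relative: &str) -> Option<PathBuf> {
    // Format the string by hand so that no `\` is introduced by the joining step.
    let candidate = PathBuf::from(format!("{}/{}", root.trim_end_matches('/'), relative));
    // `is_file()` follows symlinks, so directories and broken symlinks are rejected.
    if candidate.is_file() {
        Some(candidate)
    } else {
        None
    }
}

/// Hypothetical wrapper: use the shell found under `root`, else the short name.
fn shell_or_fallback(root: &str, relative: &str) -> PathBuf {
    shell_under(root, relative).unwrap_or_else(|| PathBuf::from(FALLBACK_SHELL))
}
```

Building the suffix as a string rather than with `Path::join` is what keeps the added components free of `\` on Windows in this sketch.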
This makes a few changes to make `shell()` more robust:

1. Check the last two components of the path `git --exec-path`
   gave, to make sure they are `libexec/git-core`.

   (The check is done in such a way that the separator may be `/`
   or `\`, though a `\` separator here would be unexpected. We
   permit it because it may plausibly be present due to an
   overridden `GIT_EXEC_PATH` that breaks with Git's own behavior
   of using `/` but that is otherwise fully usable.)

   If the directory is not named `git-core`, or it is a top-level
   directory (no parent), or its parent is not named `libexec`,
   then it is not reasonable to guess that this is in a directory
   where it would be safe to use `sh.exe` in the expected relative
   location. (Even if safe, such a layout does not suggest that a
   `sh.exe` found in it would be a better choice than the fallback
   of just doing a `PATH` search.)

2. Check the grandparent component (that `../..` would go to) of
   the path `git --exec-path` gave, to make sure it is a recognized
   name of a platform-specific `usr`-like directory that has been
   used in MSYS2.

   This is to avoid traversing up out of less common directory
   trees that have some different and shallower structure than
   found in a typical Git for Windows or MSYS2 installation.

3. Instead of using only the `(git root)/usr/bin/sh.exe` non-shim,
   prefer the `(git root)/bin/sh.exe` shim. If that is not found,
   fall back to the `(git root)/usr/bin/sh.exe` non-shim, mainly to
   support the Git for Windows SDK, which doesn't have the shim.

   The reason to prefer the shim is that it sets environment
   variables, including prepending `bin` directories that provide
   tools one would expect to have when using it. Without this,
   common POSIX commands may be unavailable, or different and
   incompatible implementations of them may be found.

   In particular, if they are found in a different MSYS2
   installation whose `msys-2.0.dll` is of a different version or
   otherwise a different build, then calling them directly may
   produce strange behavior. See:

   - https://cygwin.com/faq.html#faq.using.multiple-copies
   - GitoxideLabs#1862 (comment)

   This makes things more robust overall than either preferring the
   non-shim or just doing a path search for `sh` as was done before
   that. But it exacerbates GitoxideLabs#1868 (as described there), so if the
   Git for Windows `sh.exe` shim continues to work as it currently
   does, then further improvements may be called for here.
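
The three checks above can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation in `gix-path`: all function names are hypothetical, the `PLATFORM_PREFIXES` list is an assumed example set, and the file-existence test is passed in as a closure so the logic can be exercised without a real installation.

```rust
/// Assumed example list of platform directory names; the real set may differ.
const PLATFORM_PREFIXES: &[&str] = &["mingw64", "mingw32", "clangarm64"];

/// Hypothetical helper: if the last component of `path` is `name`, return the parent.
/// Both `/` and `\` are accepted as separators.
fn strip_last_component<'a>(path: &'a str, name: &str) -> Option<&'a str> {
    let is_sep = |c: char| c == '/' || c == '\\';
    let path = path.trim_end_matches(is_sep);
    // No separator means there is no parent, so the check fails.
    let idx = path.rfind(is_sep)?;
    if &path[idx + 1..] == name {
        Some(&path[..idx])
    } else {
        None
    }
}

/// Infer `(git root)` from `git --exec-path` output of the form
/// `<root>/<platform>/libexec/git-core`, or give up.
fn root_from_exec_path(exec_path: &str) -> Option<&str> {
    let libexec = strip_last_component(exec_path, "git-core")?;
    let platform_dir = strip_last_component(libexec, "libexec")?;
    PLATFORM_PREFIXES
        .iter()
        .find_map(|prefix| strip_last_component(platform_dir, prefix))
        .filter(|root| !root.is_empty())
}

/// Prefer the `bin/sh.exe` shim, then fall back to the `usr/bin/sh.exe` non-shim.
fn find_shell(exec_path: &str, exists: impl Fn(&str) -> bool) -> Option<String> {
    let root = root_from_exec_path(exec_path)?;
    let candidates = vec![
        format!("{}/bin/sh.exe", root),
        format!("{}/usr/bin/sh.exe", root),
    ];
    candidates.into_iter().find(|candidate| exists(candidate))
}
```

When `find_shell` returns `None`, a caller would fall back to the short name and a `PATH` search, as described in the commit message.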
- Allowing `usr` as a `<platform>` prefix is more likely to produce
  desired than undesired behavior. But due to the ambiguity of the
  name `usr` on non-Unix systems, maybe this would lead to problems
  that are relevant to security. The concern is described and the
  `usr` prefix, which is never needed to find the shell in a Git
  for Windows installation, is no longer matched.

  Note that this only affects it as a path component that
  `libexec/git-core` is found initially to be in. We do still use

      <prefix>/libexec/git-core/../../../usr/bin/sh.exe

  if we don't find something we can plausibly use at:

      <prefix>/libexec/git-core/../../../bin/sh.exe

  The helper docstring explains why a security model under which
  this is reasonable does not necessarily entail that it is
  reasonable to allow a `<prefix>` of `usr`, *even though* in the
  path with `usr` that we form, it is `usr` in the `<prefix>`
  position.

  With that said, and even though it does not help find `sh` from
  Git for Windows, hopefully future research can establish that it
  is safe to treat `usr/libexec/git-core` the same as platform
  prefixes like `mingw64/libexec/git-core` and it can be enabled.

- Start refactoring by extracting and renaming the recently
  introduced helper constants and functions.

- Extract most of the `cfg!(windows)` branch in `shell()` itself,
  even with its helpers already extracted, to make it a helper
  function as well.

- Document the recently introduced helper constants.
The new `auxiliary` module within `gix_path::env` is a sibling of
the `git` module and is similarly an implementation detail only.

So far the only "auxiliary" program this module finds is `sh`.

The module is named `auxiliary` rather than `aux` because Windows
has problems with files named like `aux` or `aux.rs`, due to `AUX`
being a reserved device name.
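
As an illustration of why a name like `aux.rs` is a problem, here is a simplified check for file names whose stem is a reserved Windows device name. The function name is hypothetical and the check covers only the classic names (`CON`, `PRN`, `AUX`, `NUL`, `COM1` to `COM9`, `LPT1` to `LPT9`); it is a sketch, not something this PR adds.

```rust
/// Returns `true` if the part of `file_name` before the first `.` is one of
/// the classic reserved device names, compared case-insensitively.
fn has_reserved_device_stem(file_name: &str) -> bool {
    let stem = file_name.split('.').next().unwrap_or("");
    let upper = stem.to_ascii_uppercase();
    if matches!(upper.as_str(), "CON" | "PRN" | "AUX" | "NUL") {
        return true;
    }
    // `COM` or `LPT` followed by exactly one digit from 1 to 9.
    match upper.strip_prefix("COM").or_else(|| upper.strip_prefix("LPT")) {
        Some(rest) => matches!(rest.as_bytes(), [b'1'..=b'9']),
        None => false,
    }
}
```

Under this check, `aux.rs` is flagged while `auxiliary.rs` is not, which matches the naming choice described above.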
This also uses `once_cell::sync::Lazy` for the inferred Git for
Windows installation directory based on `git --exec-path` output,
though it is only used in one function that is itself only used
by a caller that itself caches, so it too is so far effectively
just a refactoring.
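
The caching pattern described here can be sketched with the standard library alone. Note the substitution: this uses `std::sync::OnceLock` rather than `once_cell::sync::Lazy`, so that the example has no dependencies. `infer_root`, `cached_root`, and the hard-coded path are hypothetical stand-ins for running `git --exec-path` and inferring the installation directory.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Counts how often the expensive step runs, for demonstration only.
static COMPUTATIONS: AtomicUsize = AtomicUsize::new(0);

/// Stand-in for the expensive inference; the returned path is a made-up example.
fn infer_root() -> Option<String> {
    COMPUTATIONS.fetch_add(1, Ordering::SeqCst);
    Some("C:/Program Files/Git".to_string())
}

/// Computes the value on first use and returns the cached result afterwards.
fn cached_root() -> Option<&'static str> {
    static ROOT: OnceLock<Option<String>> = OnceLock::new();
    ROOT.get_or_init(infer_root).as_deref()
}
```

Caching an `Option` means a failed inference is remembered too, so the expensive step is not retried on every call.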
That was just added in the previous commit.

`installation_config()` and `installation_config_prefix()` should
never use, as their only strategy, the technique implemented in the
newly introduced `git_for_windows_root()` helper, because if the
system-scope configuration file is present elsewhere, including
due to `GIT_CONFIG_SYSTEM` being set, then that *should* affect the
values returned by those functions (except on Apple Git and any
other systems where a separate higher "unknown" scope exists and is
typically nonempty).

Rather, some uses of `installation_config_prefix()`, such as to
find other files that are better looked for relative to installed
non-configuration files or the installation directory itself rather
than relative to its configuration file, might in the future be
suitable to look for using `git_for_windows_root()`.
They are now in a form where they take the name of the command to
be found and look for it in a way that would find many of the
binaries shipped with and reasonable to use with Git for Windows.

However, as noted in comments, further refinement as well as new
tests would be called for if using it for purposes other than to
find `sh`. For now it remains private within `gix_path::env` and is
only used to find `sh`, as an implementation detail of
`gix_path::env::shell()`.
It didn't mention how expanded use may require directories like
`mingw64/bin` to be added to searched locations, and also was
somewhat hard to read. This adds that and rephrases.
The tests are not limited to finding `sh`, but this may still not
be ready for use in finding other commands, for the commented
reasons. These tests are a step toward that, but they are mainly
to make sure the search works as expected both when the looked-for
command is and is not found in one of the searched `bin` dirs.

This is to say that, in the tests, the commands that can be found
but in `usr/bin` rather than the first choice of `bin` are to an
extent a stand-in for `sh` when searched for in an environment that
doesn't have `(git root)/bin` (like the Git for Windows SDK), and
the commands that cannot be found are to an extent a stand-in for
`sh` on systems where it cannot be found (such as due to `git` not
being installed or installed from MinGit or some other minimal or
nonstandard Git installation, or `GIT_EXEC_PATH` being defined and
having a value that cannot be used or that can be used but points
to a directory structure that does not have a usable `sh`).
Since the code under test is currently compiled on all platforms.

The current tests' assertions are not guaranteed to hold outside of
Windows systems. (It would be unusual to call the functions they
are testing outside of Windows. Those functions are themselves
mainly not marked to be conditionally compiled so that their
callers can use techniques like `if cfg!(windows)` that aren't
technically conditional compilation. Those techniques are
themselves valuable because they can sometimes be simpler or more
readable than conditional compilation or easier to avoid
false-positive diagnostics, and because they allow type checking to
occur even when building on other platforms, while still usually
being fast because "runtime" conditions that are `false` constants
still facilitate removal of dead code as an optimization.) So the
tests of those functions are likewise built on all targets, but
marked conditionally ignored on non-Windows platforms.
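
A minimal sketch of the two techniques contrasted above, assuming hypothetical function names (`windows_shell_name`, `shell_name`) that are not taken from this PR. The helper is compiled and type-checked on every target, its caller branches on `cfg!(windows)`, and the test of the helper is built everywhere but ignored outside Windows.

```rust
/// Hypothetical Windows-oriented helper. It is not conditionally compiled,
/// so it is type-checked even when building for other platforms.
fn windows_shell_name() -> &'static str {
    "sh.exe"
}

/// `cfg!(windows)` expands to a constant `true` or `false`, so the branch
/// that cannot be taken is still checked but can be removed as dead code.
fn shell_name() -> &'static str {
    if cfg!(windows) {
        windows_shell_name()
    } else {
        "/bin/sh"
    }
}

/// Built on all targets, but marked ignored when not on Windows.
#[test]
#[cfg_attr(not(windows), ignore)]
fn windows_helper_has_exe_suffix() {
    assert!(windows_shell_name().ends_with(".exe"));
}
```

Ignored tests still compile, so a type error in the helper or its test would be caught on any platform, and they can be run explicitly with `cargo test -- --ignored`.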
@EliahKagan EliahKagan force-pushed the run-ci/consistent-sh branch from 37d676e to b70cdb1 Compare March 20, 2025 02:29
@EliahKagan
Member Author

Regarding #1864 (comment) and various other discussion of this in #1864: I've rebased this after #1864 was merged, so there will be no confusion due to their commits being interleaved. (I had originally envisioned that this would come in before that PR, whose changes are largely based on some of the changes here. But either order works fine and should be equally clear.)

Member

@Byron Byron left a comment


Thanks so much for all your (fantastic) work on this!

Once again I think the code is very idiomatic and made to be maintainable, hence plentiful comments capture what the code alone would not. Given the subtlety of the matter, doing so definitely is the right way to go, while tests are used where possible.

Please note that I didn't try to understand everything in detail, and I at most skimmed the long reasoning, knowing that I wouldn't be able to meaningfully contribute in an armchair review without the time to actually run this and validate it on Windows. It's a luxury I can afford when you are the author though 😁🙏.

@Byron Byron merged commit 0ba3147 into GitoxideLabs:main Mar 20, 2025
21 checks passed
@EliahKagan EliahKagan deleted the run-ci/consistent-sh branch March 20, 2025 03:21