* v2.47.0-rc1 test failure on cygwin @ 2024-10-04 1:02 Ramsay Jones 2024-10-04 3:59 ` Patrick Steinhardt ` (2 more replies) 0 siblings, 3 replies; 19+ messages in thread From: Ramsay Jones @ 2024-10-04 1:02 UTC (permalink / raw) To: Patrick Steinhardt; +Cc: GIT Mailing-list, Junio C Hamano Hi Patrick, Just a quick heads up: t0610-reftable-basics.sh test 47 (ref transaction: many concurrent writers) fails on cygwin. The tail end of the debug output for this test looks like: .. ++ wait fatal: update_ref failed for ref 'refs/heads/branch-82': reftable: transaction prepare: I/O error fatal: update_ref failed for ref 'refs/heads/branch-2': cannot lock references fatal: update_ref failed for ref 'refs/heads/branch-3': cannot lock references fatal: update_ref failed for ref 'refs/heads/branch-27': cannot lock references fatal: update_ref failed for ref 'refs/heads/branch-10': cannot lock references fatal: update_ref failed for ref 'refs/heads/branch-21': cannot lock references fatal: update_ref failed for ref 'refs/heads/branch-14': cannot lock references fatal: update_ref failed for ref 'refs/heads/branch-9': cannot lock references fatal: update_ref failed for ref 'refs/heads/branch-19': cannot lock references fatal: update_ref failed for ref 'refs/heads/branch-33': cannot lock references fatal: update_ref failed for ref 'refs/heads/branch-28': cannot lock references fatal: update_ref failed for ref 'refs/heads/branch-29': cannot lock references fatal: update_ref failed for ref 'refs/heads/branch-25': cannot lock references fatal: update_ref failed for ref 'refs/heads/branch-53': cannot lock references fatal: update_ref failed for ref 'refs/heads/branch-41': cannot lock references fatal: update_ref failed for ref 'refs/heads/branch-37': cannot lock references fatal: update_ref failed for ref 'refs/heads/branch-36': cannot lock references fatal: update_ref failed for ref 'refs/heads/branch-50': cannot lock references fatal: update_ref failed for ref 'refs/heads/branch-62': cannot lock references fatal: update_ref failed for ref 'refs/heads/branch-48': cannot lock references fatal: update_ref failed for ref 'refs/heads/branch-49': cannot lock references fatal: update_ref failed for ref 'refs/heads/branch-59': cannot lock references fatal: update_ref failed for ref 'refs/heads/branch-67': cannot lock references fatal: update_ref failed for ref 'refs/heads/branch-81': cannot lock references fatal: update_ref failed for ref 'refs/heads/branch-72': cannot lock references fatal: update_ref failed for ref 'refs/heads/branch-70': cannot lock references fatal: update_ref failed for ref 'refs/heads/branch-54': cannot lock references fatal: update_ref failed for ref 'refs/heads/branch-71': cannot lock references fatal: update_ref failed for ref 'refs/heads/branch-74': cannot lock references fatal: update_ref failed for ref 'refs/heads/branch-95': cannot lock references fatal: update_ref failed for ref 'refs/heads/branch-87': cannot lock references fatal: update_ref failed for ref 'refs/heads/branch-92': cannot lock references fatal: update_ref failed for ref 'refs/heads/branch-64': cannot lock references fatal: update_ref failed for ref 'refs/heads/branch-94': cannot lock references fatal: update_ref failed for ref 'refs/heads/branch-100': cannot lock references ++ git for-each-ref --sort=v:refname ++ test_cmp expect actual ++ test 2 -ne 2 ++ eval 'diff -u' '"$@"' +++ diff -u expect actual --- expect 2024-10-03 23:21:01.284020500 +0000 +++ actual 2024-10-03 23:21:19.881283200 +0000 @@ -1,101 +1,66 @@ 68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-1 -68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-2 -68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-3 68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-4 68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-5 68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-6 68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-7 68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-8 -68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-9 -68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-10 68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-11 68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-12 68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-13 -68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-14 68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-15 68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-16 68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-17 68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-18 -68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-19 68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-20 -68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-21 68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-22 68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-23 68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-24 -68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-25 68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-26 -68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-27 -68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-28 -68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-29 68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-30 68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-31 68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-32 -68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-33 68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-34 68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-35 -68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-36 -68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-37 68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-38 68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-39 68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-40 -68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-41 68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-42 68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-43 68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-44 68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-45 68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-46 68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-47 -68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-48 -68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-49 -68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-50 68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-51 68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-52 -68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-53 -68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-54 68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-55 68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-56 68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-57 68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-58 -68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-59 68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-60 68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-61 -68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-62 68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-63 -68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-64 68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-65 68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-66 -68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-67 68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-68 68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-69 -68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-70 -68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-71 -68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-72 68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-73 -68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-74 68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-75 68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-76 68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-77 68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-78 68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-79 68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-80 -68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-81 -68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-82 68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-83 68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-84 68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-85 68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-86 -68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-87 68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-88 68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-89 68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-90 68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-91 -68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-92 68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-93 -68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-94 -68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-95 68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-96 68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-97 68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-98 68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-99 -68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-100 68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/main error: last command exited with $?=1 not ok 47 - ref transaction: many concurrent writers .. t0610-reftable-basics.sh passed on 'rc0', but this test (and the timeout facility) is new in 'rc1'. I tried simply increasing the timeout (10 fold), but that didn't change the result. (I didn't really expect it to - the 'reftable: transaction prepare: I/O error' does not look timing related!). Again, just a heads up. (I can't look at it until tomorrow now; any ideas?) ATB, Ramsay Jones ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: v2.47.0-rc1 test failure on cygwin 2024-10-04 1:02 v2.47.0-rc1 test failure on cygwin Ramsay Jones @ 2024-10-04 3:59 ` Patrick Steinhardt 2024-10-04 6:13 ` Patrick Steinhardt 2024-10-04 12:16 ` [PATCH] t0610: work around flaky test with concurrent writers Patrick Steinhardt 2024-10-04 15:32 ` [PATCH v2] " Patrick Steinhardt 2 siblings, 1 reply; 19+ messages in thread From: Patrick Steinhardt @ 2024-10-04 3:59 UTC (permalink / raw) To: Ramsay Jones; +Cc: GIT Mailing-list, Junio C Hamano On Fri, Oct 04, 2024 at 02:02:44AM +0100, Ramsay Jones wrote: > Hi Patrick, > > Just a quick heads up: t0610-reftable-basics.sh test 47 (ref transaction: many > concurrent writers) fails on cygwin. The tail end of the debug output for this > test looks like: > [snip] > > t0610-reftable-basics.sh passed on 'rc0', but this test (and the timeout facility) > is new in 'rc1'. I tried simply increasing the timeout (10 fold), but that didn't > change the result. (I didn't really expect it to - the 'reftable: transaction > prepare: I/O error' does not look timing related!). > > Again, just a heads up. (I can't look at it until tomorrow now; any ideas?) This failure is kind of known and discussed in [1]. Just to make it explicit: this test failure doesn't really surface a regression, the reftable code already failed for concurrent writes before. I fixed that and added the test that is now flaky, as the fix itself is seemingly only sufficient on Linux and macOS. I didn't yet have the time to look at whether I can fix it, but should finally find the time to do so today. Patrick [1]: <20240927040752.GA567671@coredump.intra.peff.net> ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: v2.47.0-rc1 test failure on cygwin 2024-10-04 3:59 ` Patrick Steinhardt @ 2024-10-04 6:13 ` Patrick Steinhardt 2024-10-04 9:13 ` Johannes Schindelin 0 siblings, 1 reply; 19+ messages in thread From: Patrick Steinhardt @ 2024-10-04 6:13 UTC (permalink / raw) To: Ramsay Jones; +Cc: GIT Mailing-list, Junio C Hamano On Fri, Oct 04, 2024 at 05:59:30AM +0200, Patrick Steinhardt wrote: > On Fri, Oct 04, 2024 at 02:02:44AM +0100, Ramsay Jones wrote: > > Hi Patrick, > > > > Just a quick heads up: t0610-reftable-basics.sh test 47 (ref transaction: many > > concurrent writers) fails on cygwin. The tail end of the debug output for this > > test looks like: > > > [snip] > > > > t0610-reftable-basics.sh passed on 'rc0', but this test (and the timeout facility) > > is new in 'rc1'. I tried simply increasing the timeout (10 fold), but that didn't > > change the result. (I didn't really expect it to - the 'reftable: transaction > > prepare: I/O error' does not look timing related!). > > > > Again, just a heads up. (I can't look at it until tomorrow now; any ideas?) > > This failure is kind of known and discussed in [1]. Just to make it > explicit: this test failure doesn't really surface a regression, the > reftable code already failed for concurrent writes before. I fixed that > and added the test that is now flaky, as the fix itself is seemingly > only sufficient on Linux and macOS. > > I didn't yet have the time to look at whether I can fix it, but should > finally find the time to do so today. Hm, interestingly enough I cannot reproduce the issue on Cygwin myself, but I can reproduce the issue with MinGW. And in fact, the logs you have sent all indicate that we cannot acquire the lock, there is no sign of I/O errors here. So I guess you're running into timeout issues. Does the following patch fix this for you? diff --git a/t/t0610-reftable-basics.sh b/t/t0610-reftable-basics.sh index 2d951c8ceb..b5cad805d4 100755 --- a/t/t0610-reftable-basics.sh +++ b/t/t0610-reftable-basics.sh @@ -455,10 +455,7 @@ test_expect_success 'ref transaction: many concurrent writers' ' git init repo && ( cd repo && - # Set a high timeout such that a busy CI machine will not abort - # early. 10 seconds should hopefully be ample of time to make - # this non-flaky. - git config set reftable.lockTimeout 10000 && + git config set reftable.lockTimeout -1 && test_commit --no-tag initial && head=$(git rev-parse HEAD) && The issue on Win32 is different: we cannot commit the "tables.list" lock via rename(3P) because the target file may be open for reading by a concurrent process. I guess that Cygwin has proper POSIX semantics for rename(3P) and thus doesn't hit the same issue. We already try to emulate POSIX semantics somewhat in `mingw_rename()` by using a retry-loop when we hit `ERROR_ACCESS_DENIED`, which is what we get when the target file is open in another process. But that seemingly isn't enough when there is a lot of contention around a file. So I'm currently investigating whether we can adopt something similar to what Cygwin is doing for Win32, as well. I assume that they use `FILE_RENAME_INFORMATION_EX` with `FILE_RENAME_POSIX_SEMANTICS`, which should give us what we're looking for. gh, well. Turns out the implementation of rename(3P) in Cygwin is 500 lines long. I guess this is a non-trivial problem :) But they of course have to handle a whole lot more cases than we have to. But my guess was correct: they do use `FILE_RENAME_POSIX_SEMANTICS`. The catch is that this flag only exists in Windows 10 and newer. But that should be a fine compromise. I'll try to wrap my head around how all of this works. Patrick ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: v2.47.0-rc1 test failure on cygwin 2024-10-04 6:13 ` Patrick Steinhardt @ 2024-10-04 9:13 ` Johannes Schindelin 2024-10-04 10:09 ` Patrick Steinhardt 0 siblings, 1 reply; 19+ messages in thread From: Johannes Schindelin @ 2024-10-04 9:13 UTC (permalink / raw) To: Patrick Steinhardt; +Cc: Ramsay Jones, GIT Mailing-list, Junio C Hamano Hi Patrick, On Fri, 4 Oct 2024, Patrick Steinhardt wrote: > On Fri, Oct 04, 2024 at 05:59:30AM +0200, Patrick Steinhardt wrote: > > On Fri, Oct 04, 2024 at 02:02:44AM +0100, Ramsay Jones wrote: > > > > > > Just a quick heads up: t0610-reftable-basics.sh test 47 (ref > > > transaction: many concurrent writers) fails on cygwin. The tail end > > > of the debug output for this test looks like: > > > > > [snip] > > > > > > t0610-reftable-basics.sh passed on 'rc0', but this test (and the > > > timeout facility) is new in 'rc1'. I tried simply increasing the > > > timeout (10 fold), but that didn't change the result. (I didn't > > > really expect it to - the 'reftable: transaction prepare: I/O error' > > > does not look timing related!). > > > > > > Again, just a heads up. (I can't look at it until tomorrow now; any > > > ideas?) > > > > This failure is kind of known and discussed in [1]. Just to make it > > explicit: this test failure doesn't really surface a regression, the > > reftable code already failed for concurrent writes before. I fixed > > that and added the test that is now flaky, as the fix itself is > > seemingly only sufficient on Linux and macOS. > > > > I didn't yet have the time to look at whether I can fix it, but should > > finally find the time to do so today. > > Hm, interestingly enough I cannot reproduce the issue on Cygwin myself, > but I can reproduce the issue with MinGW. And in fact, the logs you have > sent all indicate that we cannot acquire the lock, there is no sign of > I/O errors here. So I guess you're running into timeout issues. Does the > following patch fix this for you? > > diff --git a/t/t0610-reftable-basics.sh b/t/t0610-reftable-basics.sh > index 2d951c8ceb..b5cad805d4 100755 > --- a/t/t0610-reftable-basics.sh > +++ b/t/t0610-reftable-basics.sh > @@ -455,10 +455,7 @@ test_expect_success 'ref transaction: many concurrent writers' ' > git init repo && > ( > cd repo && > - # Set a high timeout such that a busy CI machine will not abort > - # early. 10 seconds should hopefully be ample of time to make > - # this non-flaky. > - git config set reftable.lockTimeout 10000 && > + git config set reftable.lockTimeout -1 && > test_commit --no-tag initial && > > head=$(git rev-parse HEAD) && That looks plausible to me, as this test is exercising the POSIX emulation layer of Cygwin much more than Git itself (and therefore I expect this test case to take a really long time on Cygwin). If it turns out that this works around the issue, I would propose to use something like this instead: # Cygwin needs a slightly longer timeout if have_prereq !CYGWIN then git config set reftable.lockTimeout 10000 else git config set reftable.lockTimeout 60000 fi && > The issue on Win32 is different: we cannot commit the "tables.list" lock > via rename(3P) because the target file may be open for reading by a > concurrent process. I guess that Cygwin has proper POSIX semantics for > rename(3P) and thus doesn't hit the same issue. Indeed, this is where the symptom lies. I worked around it in Git for Windows v2.47.0-rc1 via this patch: -- snipsnap -- diff --git a/compat/mingw.c b/compat/mingw.c index c1030e40f227..56db193d1360 100644 --- a/compat/mingw.c +++ b/compat/mingw.c @@ -784,6 +784,7 @@ int mingw_open (const char *filename, int oflags, ...) int fd, create = (oflags & (O_CREAT | O_EXCL)) == (O_CREAT | O_EXCL); wchar_t wfilename[MAX_LONG_PATH]; open_fn_t open_fn; + int tries = 0; va_start(args, oflags); mode = va_arg(args, int); @@ -813,7 +814,11 @@ int mingw_open (const char *filename, int oflags, ...) else if (xutftowcs_long_path(wfilename, filename) < 0) return -1; - fd = open_fn(wfilename, oflags, mode); + do { + fd = open_fn(wfilename, oflags, mode); + } while (fd < 0 && GetLastError() == ERROR_SHARING_VIOLATION && + retry_ask_yes_no(&tries, "Opening '%s' failed because another " + "application accessed it. Try again?", filename)); if ((oflags & O_CREAT) && fd >= 0 && are_wsl_compatible_mode_bits_enabled()) { _mode_t wsl_mode = S_IFREG | (mode&0777); ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: v2.47.0-rc1 test failure on cygwin 2024-10-04 9:13 ` Johannes Schindelin @ 2024-10-04 10:09 ` Patrick Steinhardt 2024-10-04 11:11 ` Johannes Schindelin 0 siblings, 1 reply; 19+ messages in thread From: Patrick Steinhardt @ 2024-10-04 10:09 UTC (permalink / raw) To: Johannes Schindelin; +Cc: Ramsay Jones, GIT Mailing-list, Junio C Hamano On Fri, Oct 04, 2024 at 11:13:34AM +0200, Johannes Schindelin wrote: > Hi Patrick, > > On Fri, 4 Oct 2024, Patrick Steinhardt wrote: > > > On Fri, Oct 04, 2024 at 05:59:30AM +0200, Patrick Steinhardt wrote: > > > On Fri, Oct 04, 2024 at 02:02:44AM +0100, Ramsay Jones wrote: > > > > > > > > Just a quick heads up: t0610-reftable-basics.sh test 47 (ref > > > > transaction: many concurrent writers) fails on cygwin. The tail end > > > > of the debug output for this test looks like: > > > > > > > [snip] > > > > > > > > t0610-reftable-basics.sh passed on 'rc0', but this test (and the > > > > timeout facility) is new in 'rc1'. I tried simply increasing the > > > > timeout (10 fold), but that didn't change the result. (I didn't > > > > really expect it to - the 'reftable: transaction prepare: I/O error' > > > > does not look timing related!). > > > > > > > > Again, just a heads up. (I can't look at it until tomorrow now; any > > > > ideas?) > > > > > > This failure is kind of known and discussed in [1]. Just to make it > > > explicit: this test failure doesn't really surface a regression, the > > > reftable code already failed for concurrent writes before. I fixed > > > that and added the test that is now flaky, as the fix itself is > > > seemingly only sufficient on Linux and macOS. > > > > > > I didn't yet have the time to look at whether I can fix it, but should > > > finally find the time to do so today. > > > > Hm, interestingly enough I cannot reproduce the issue on Cygwin myself, > > but I can reproduce the issue with MinGW. And in fact, the logs you have > > sent all indicate that we cannot acquire the lock, there is no sign of > > I/O errors here. So I guess you're running into timeout issues. Does the > > following patch fix this for you? > > > > diff --git a/t/t0610-reftable-basics.sh b/t/t0610-reftable-basics.sh > > index 2d951c8ceb..b5cad805d4 100755 > > --- a/t/t0610-reftable-basics.sh > > +++ b/t/t0610-reftable-basics.sh > > @@ -455,10 +455,7 @@ test_expect_success 'ref transaction: many concurrent writers' ' > > git init repo && > > ( > > cd repo && > > - # Set a high timeout such that a busy CI machine will not abort > > - # early. 10 seconds should hopefully be ample of time to make > > - # this non-flaky. > > - git config set reftable.lockTimeout 10000 && > > + git config set reftable.lockTimeout -1 && > > test_commit --no-tag initial && > > > > head=$(git rev-parse HEAD) && > > That looks plausible to me, as this test is exercising the POSIX emulation > layer of Cygwin much more than Git itself (and therefore I expect this > test case to take a really long time on Cygwin). If it turns out that this > works around the issue, I would propose to use something like this > instead: > > # Cygwin needs a slightly longer timeout > if have_prereq !CYGWIN > then > git config set reftable.lockTimeout 10000 > else > git config set reftable.lockTimeout 60000 > fi && We also saw that this creates an issue when running e.g. with the memory and leak sanitizers. So I'd just generally raise the timeout value to something way higher to avoid flakiness, like 5 minutes. > > The issue on Win32 is different: we cannot commit the "tables.list" lock > > via rename(3P) because the target file may be open for reading by a > > concurrent process. I guess that Cygwin has proper POSIX semantics for > > rename(3P) and thus doesn't hit the same issue. > > Indeed, this is where the symptom lies. I worked around it in Git for > Windows v2.47.0-rc1 via this patch: > > -- snipsnap -- > diff --git a/compat/mingw.c b/compat/mingw.c > index c1030e40f227..56db193d1360 100644 > --- a/compat/mingw.c > +++ b/compat/mingw.c > @@ -784,6 +784,7 @@ int mingw_open (const char *filename, int oflags, ...) > int fd, create = (oflags & (O_CREAT | O_EXCL)) == (O_CREAT | O_EXCL); > wchar_t wfilename[MAX_LONG_PATH]; > open_fn_t open_fn; > + int tries = 0; > > va_start(args, oflags); > mode = va_arg(args, int); > @@ -813,7 +814,11 @@ int mingw_open (const char *filename, int oflags, ...) > else if (xutftowcs_long_path(wfilename, filename) < 0) > return -1; > > - fd = open_fn(wfilename, oflags, mode); > + do { > + fd = open_fn(wfilename, oflags, mode); > + } while (fd < 0 && GetLastError() == ERROR_SHARING_VIOLATION && > + retry_ask_yes_no(&tries, "Opening '%s' failed because another " > + "application accessed it. Try again?", filename)); > > if ((oflags & O_CREAT) && fd >= 0 && are_wsl_compatible_mode_bits_enabled()) { > _mode_t wsl_mode = S_IFREG | (mode&0777); Wait, this is surprising to me as I saw the error happening when calling rename, not open. So why would retries in `open()` fix the symptom? I'm probably missing something. In any case I also tried something like the below patch (sorry, whitespace-broken). But unfortunately this still caused permission errors when the new path was held open by another process. I think for now I'd still lean into the direction of adding the !WINDOWS prerequisite to the test and increasing timeouts such that I can continue to investigate without time pressure. Patrick @ -2152,9 +2152,12 @@ int mingw_accept(int sockfd1, struct sockaddr *sa, socklen_t *sz) int mingw_rename(const char *pold, const char *pnew) { DWORD attrs, gle; - int tries = 0; + int tries = 0, wpnew_len, wpold_len; wchar_t wpold[MAX_PATH], wpnew[MAX_PATH]; - if (xutftowcs_path(wpold, pold) < 0 || xutftowcs_path(wpnew, pnew) < 0) + + wpold_len = xutftowcs_path(wpold, pold); + wpnew_len = xutftowcs_path(wpnew, pnew); + if (wpold_len < 0 || wpnew_len < 0) return -1; /* @@ -2166,6 +2169,58 @@ int mingw_rename(const char *pold, const char *pnew) if (errno != EEXIST) return -1; repeat: + { +#define FileRenameInfoEx 22 + enum { + FILE_RENAME_REPLACE_IF_EXISTS = 0x01, + FILE_RENAME_POSIX_SEMANTICS = 0x02, + FILE_RENAME_IGNORE_READONLY_ATTRIBUTE = 0x40, + }; + typedef struct FILE_RENAME_INFORMATION { + union { + BOOLEAN ReplaceIfExists; + ULONG Flags; + }; + HANDLE RootDirectory; + ULONG FileNameLength; + WCHAR FileName[1]; + } *PFILE_RENAME_INFORMATION; + size_t wpnew_size = wpnew_len * sizeof(wchar_t); + PFILE_RENAME_INFORMATION fri = NULL; + HANDLE handle = NULL; + BOOL success; + int ret; + + handle = CreateFileW(wpold, DELETE, + FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE, NULL, + OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL); + if (handle == INVALID_HANDLE_VALUE) { + errno = ENOENT; + ret = -1; + goto out; + } + + fri = xcalloc(1, sizeof(*fri) + wpnew_size); + fri->Flags = FILE_RENAME_REPLACE_IF_EXISTS | FILE_RENAME_POSIX_SEMANTICS | + FILE_RENAME_IGNORE_READONLY_ATTRIBUTE; + fri->FileNameLength = wpnew_len; + memcpy(fri->FileName, wpnew, wpnew_size); + + success = SetFileInformationByHandle(handle, FileRenameInfoEx, + fri, sizeof(*fri) + wpnew_size); + if (!success) { + errno = err_win_to_posix(GetLastError()); + ret = -1; + goto out; + } + + ret = 0; +out: + CloseHandle(handle); + free(fri); + return ret; + } + ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: v2.47.0-rc1 test failure on cygwin 2024-10-04 10:09 ` Patrick Steinhardt @ 2024-10-04 11:11 ` Johannes Schindelin 2024-10-04 11:32 ` Patrick Steinhardt 2024-10-04 16:09 ` Junio C Hamano 0 siblings, 2 replies; 19+ messages in thread From: Johannes Schindelin @ 2024-10-04 11:11 UTC (permalink / raw) To: Patrick Steinhardt; +Cc: Ramsay Jones, GIT Mailing-list, Junio C Hamano Hi Patrick, On Fri, 4 Oct 2024, Patrick Steinhardt wrote: > On Fri, Oct 04, 2024 at 11:13:34AM +0200, Johannes Schindelin wrote: > > > > On Fri, 4 Oct 2024, Patrick Steinhardt wrote: > > > > > diff --git a/t/t0610-reftable-basics.sh b/t/t0610-reftable-basics.sh > > > index 2d951c8ceb..b5cad805d4 100755 > > > --- a/t/t0610-reftable-basics.sh > > > +++ b/t/t0610-reftable-basics.sh > > > @@ -455,10 +455,7 @@ test_expect_success 'ref transaction: many concurrent writers' ' > > > git init repo && > > > ( > > > cd repo && > > > - # Set a high timeout such that a busy CI machine will not abort > > > - # early. 10 seconds should hopefully be ample of time to make > > > - # this non-flaky. > > > - git config set reftable.lockTimeout 10000 && > > > + git config set reftable.lockTimeout -1 && > > > test_commit --no-tag initial && > > > > > > head=$(git rev-parse HEAD) && > > > > That looks plausible to me, as this test is exercising the POSIX emulation > > layer of Cygwin much more than Git itself (and therefore I expect this > > test case to take a really long time on Cygwin). If it turns out that this > > works around the issue, I would propose to use something like this > > instead: > > > > # Cygwin needs a slightly longer timeout > > if have_prereq !CYGWIN > > then > > git config set reftable.lockTimeout 10000 > > else > > git config set reftable.lockTimeout 60000 > > fi && > > We also saw that this creates an issue when running e.g. with the memory > and leak sanitizers. So I'd just generally raise the timeout value to > something way higher to avoid flakiness, like 5 minutes. Sounds good! > > > The issue on Win32 is different: we cannot commit the "tables.list" lock > > > via rename(3P) because the target file may be open for reading by a > > > concurrent process. I guess that Cygwin has proper POSIX semantics for > > > rename(3P) and thus doesn't hit the same issue. > > > > Indeed, this is where the symptom lies. I worked around it in Git for > > Windows v2.47.0-rc1 via this patch: > > > > -- snipsnap -- > > diff --git a/compat/mingw.c b/compat/mingw.c > > index c1030e40f227..56db193d1360 100644 > > --- a/compat/mingw.c > > +++ b/compat/mingw.c > > @@ -784,6 +784,7 @@ int mingw_open (const char *filename, int oflags, ...) > > int fd, create = (oflags & (O_CREAT | O_EXCL)) == (O_CREAT | O_EXCL); > > wchar_t wfilename[MAX_LONG_PATH]; > > open_fn_t open_fn; > > + int tries = 0; > > > > va_start(args, oflags); > > mode = va_arg(args, int); > > @@ -813,7 +814,11 @@ int mingw_open (const char *filename, int oflags, ...) > > else if (xutftowcs_long_path(wfilename, filename) < 0) > > return -1; > > > > - fd = open_fn(wfilename, oflags, mode); > > + do { > > + fd = open_fn(wfilename, oflags, mode); > > + } while (fd < 0 && GetLastError() == ERROR_SHARING_VIOLATION && > > + retry_ask_yes_no(&tries, "Opening '%s' failed because another " > > + "application accessed it. Try again?", filename)); > > > > if ((oflags & O_CREAT) && fd >= 0 && are_wsl_compatible_mode_bits_enabled()) { > > _mode_t wsl_mode = S_IFREG | (mode&0777); > > Wait, this is surprising to me as I saw the error happening when calling > rename, not open. So why would retries in `open()` fix the symptom? I'm > probably missing something. I am sorry, I did not really read the bug report in detail, I only meant to unblock Git for Windows v2.47.0-rc1 and thought I had a fix in hand. It certainly fixed the failures on my local machine, but it unfortunately did not fix the problems in CI. I tried to debug in CI a bit, but it is a gnarly bug to investigate, what with plenty of processes the intentionally block each other, and no `gdb` to help. > In any case I also tried something like the below patch (sorry, > whitespace-broken). Oh, you reminded me that the `mingw_rename()` function looks _substantially_ different in Git for Windows than in Git. I did not have the time (or the strength of mind, more like) to upstream those changes yet. > But unfortunately this still caused permission errors when the new path > was held open by another process. Yes, this will _always_ be a problem, I think. The `FILE_RENAME_POSIX_SEMANTICS` as per its documentation should help, but if it does not in your tests it might actually not quite work as advertised (wouldn't be the first time I encounter such an issue). I tried to read through the code (it's a lot!) to figure out whether there is potentially any situation when the `tables.list` file is opened but not closed immediately, but couldn't find any. Do you know off-hand of any such scenario? > I think for now I'd still lean into the direction of adding the !WINDOWS > prerequisite to the test and increasing timeouts such that I can > continue to investigate without time pressure. Let me bang my head against this problem for a little while longer. You might be right, though, that this is a thing we cannot fix in time for v2.47.0, which would be sad. Ciao, Johannes > > Patrick > > @ -2152,9 +2152,12 @@ int mingw_accept(int sockfd1, struct sockaddr *sa, socklen_t *sz) > int mingw_rename(const char *pold, const char *pnew) > { > DWORD attrs, gle; > - int tries = 0; > + int tries = 0, wpnew_len, wpold_len; > wchar_t wpold[MAX_PATH], wpnew[MAX_PATH]; > - if (xutftowcs_path(wpold, pold) < 0 || xutftowcs_path(wpnew, pnew) < 0) > + > + wpold_len = xutftowcs_path(wpold, pold); > + wpnew_len = xutftowcs_path(wpnew, pnew); > + if (wpold_len < 0 || wpnew_len < 0) > return -1; > > /* > @@ -2166,6 +2169,58 @@ int mingw_rename(const char *pold, const char *pnew) > if (errno != EEXIST) > return -1; > repeat: > + { > +#define FileRenameInfoEx 22 > + enum { > + FILE_RENAME_REPLACE_IF_EXISTS = 0x01, > + FILE_RENAME_POSIX_SEMANTICS = 0x02, > + FILE_RENAME_IGNORE_READONLY_ATTRIBUTE = 0x40, > + }; > + typedef struct FILE_RENAME_INFORMATION { > + union { > + BOOLEAN ReplaceIfExists; > + ULONG Flags; > + }; > + HANDLE RootDirectory; > + ULONG FileNameLength; > + WCHAR FileName[1]; > + } *PFILE_RENAME_INFORMATION; > + size_t wpnew_size = wpnew_len * sizeof(wchar_t); > + PFILE_RENAME_INFORMATION fri = NULL; > + HANDLE handle = NULL; > + BOOL success; > + int ret; > + > + handle = CreateFileW(wpold, DELETE, > + FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE, NULL, > + OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL); > + if (handle == INVALID_HANDLE_VALUE) { > + errno = ENOENT; > + ret = -1; > + goto out; > + } > + > + fri = xcalloc(1, sizeof(*fri) + wpnew_size); > + fri->Flags = FILE_RENAME_REPLACE_IF_EXISTS | FILE_RENAME_POSIX_SEMANTICS | > + FILE_RENAME_IGNORE_READONLY_ATTRIBUTE; > + fri->FileNameLength = wpnew_len; > + memcpy(fri->FileName, wpnew, wpnew_size); > + > + success = SetFileInformationByHandle(handle, FileRenameInfoEx, > + fri, sizeof(*fri) + wpnew_size); > + if (!success) { > + errno = err_win_to_posix(GetLastError()); > + ret = -1; > + goto out; > + } > + > + ret = 0; > +out: > + CloseHandle(handle); > + free(fri); > + return ret; > + } > + > > ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: v2.47.0-rc1 test failure on cygwin 2024-10-04 11:11 ` Johannes Schindelin @ 2024-10-04 11:32 ` Patrick Steinhardt 2024-10-04 16:09 ` Junio C Hamano 1 sibling, 0 replies; 19+ messages in thread From: Patrick Steinhardt @ 2024-10-04 11:32 UTC (permalink / raw) To: Johannes Schindelin; +Cc: Ramsay Jones, GIT Mailing-list, Junio C Hamano On Fri, Oct 04, 2024 at 01:11:15PM +0200, Johannes Schindelin wrote: > Hi Patrick, > > > > The issue on Win32 is different: we cannot commit the "tables.list" lock > > > > via rename(3P) because the target file may be open for reading by a > > > > concurrent process. I guess that Cygwin has proper POSIX semantics for > > > > rename(3P) and thus doesn't hit the same issue. > > > > > > Indeed, this is where the symptom lies. I worked around it in Git for > > > Windows v2.47.0-rc1 via this patch: > > > > > > -- snipsnap -- > > > diff --git a/compat/mingw.c b/compat/mingw.c > > > index c1030e40f227..56db193d1360 100644 > > > --- a/compat/mingw.c > > > +++ b/compat/mingw.c > > > @@ -784,6 +784,7 @@ int mingw_open (const char *filename, int oflags, ...) > > > int fd, create = (oflags & (O_CREAT | O_EXCL)) == (O_CREAT | O_EXCL); > > > wchar_t wfilename[MAX_LONG_PATH]; > > > open_fn_t open_fn; > > > + int tries = 0; > > > > > > va_start(args, oflags); > > > mode = va_arg(args, int); > > > @@ -813,7 +814,11 @@ int mingw_open (const char *filename, int oflags, ...) > > > else if (xutftowcs_long_path(wfilename, filename) < 0) > > > return -1; > > > > > > - fd = open_fn(wfilename, oflags, mode); > > > + do { > > > + fd = open_fn(wfilename, oflags, mode); > > > + } while (fd < 0 && GetLastError() == ERROR_SHARING_VIOLATION && > > > + retry_ask_yes_no(&tries, "Opening '%s' failed because another " > > > + "application accessed it. Try again?", filename)); > > > > > > if ((oflags & O_CREAT) && fd >= 0 && are_wsl_compatible_mode_bits_enabled()) { > > > _mode_t wsl_mode = S_IFREG | (mode&0777); > > > > Wait, this is surprising to me as I saw the error happening when calling > > rename, not open. So why would retries in `open()` fix the symptom? I'm > > probably missing something. > > I am sorry, I did not really read the bug report in detail, I only meant > to unblock Git for Windows v2.47.0-rc1 and thought I had a fix in hand. It > certainly fixed the failures on my local machine, but it unfortunately did > not fix the problems in CI. Nothing to be sorry about. Quite on the contrary, thanks for chiming in here, I was hoping for your input. > I tried to debug in CI a bit, but it is a gnarly bug to investigate, what > with plenty of processes the intentionally block each other, and no `gdb` > to help. Yup, indeed it is. > > In any case I also tried something like the below patch (sorry, > > whitespace-broken). > > Oh, you reminded me that the `mingw_rename()` function looks > _substantially_ different in Git for Windows than in Git. I did not have > the time (or the strength of mind, more like) to upstream those changes > yet. > > > But unfortunately this still caused permission errors when the new path > > was held open by another process. > > Yes, this will _always_ be a problem, I think. The > `FILE_RENAME_POSIX_SEMANTICS` as per its documentation should help, but if > it does not in your tests it might actually not quite work as advertised > (wouldn't be the first time I encounter such an issue). It could of course be that the code I wrote is just plain wrong. I had to stub out the struct as well as the constants because those are not available on my Windows system, due to whatever reason. Not even when bumping _WIN32_WINNT to 0x0A00 (W10+). Maybe this is because I'm using Windows 10, but I thought that thish should have been available starting with Windows 10 RC1. > I tried to read through the code (it's a lot!) to figure out whether there > is potentially any situation when the `tables.list` file is opened but not > closed immediately, but couldn't find any. Do you know off-hand of any > such scenario? We indeed do keep the `tables.list` file descriptor open as part of our stat cache. But that only happens on non-Windows system. This all happens in `reftable_stack_reload_maybe_reuse()` and is documented quite extensively in there. > > I think for now I'd still lean into the direction of adding the !WINDOWS > > prerequisite to the test and increasing timeouts such that I can > > continue to investigate without time pressure. > > Let me bang my head against this problem for a little while longer. You > might be right, though, that this is a thing we cannot fix in time for > v2.47.0, which would be sad. A bit sad, yes, but as I mentioned this is at least not a regression. The test just demonstrates that the improvements I did in this area are not yet sufficient for all platforms and that I need to spend some more time on it. Or that the central "tables.list" mechanism is not a good fit for Windows, which would be a shame. I'll send the patch as a reply to this thread so that it can be picked up as required. Patrick ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: v2.47.0-rc1 test failure on cygwin 2024-10-04 11:11 ` Johannes Schindelin 2024-10-04 11:32 ` Patrick Steinhardt @ 2024-10-04 16:09 ` Junio C Hamano 2024-10-04 17:14 ` Patrick Steinhardt 1 sibling, 1 reply; 19+ messages in thread From: Junio C Hamano @ 2024-10-04 16:09 UTC (permalink / raw) To: Johannes Schindelin; +Cc: Patrick Steinhardt, Ramsay Jones, GIT Mailing-list Johannes Schindelin <Johannes.Schindelin@gmx.de> writes: >> But unfortunately this still caused permission errors when the new path >> was held open by another process. > > Yes, this will _always_ be a problem, I think. The > `FILE_RENAME_POSIX_SEMANTICS` as per its documentation should help, but if > it does not in your tests it might actually not quite work as advertised > (wouldn't be the first time I encounter such an issue). > > I tried to read through the code (it's a lot!) to figure out whether there > is potentially any situation when the `tables.list` file is opened but not > closed immediately, but couldn't find any. Do you know off-hand of any > such scenario? > >> I think for now I'd still lean into the direction of adding the !WINDOWS >> prerequisite to the test and increasing timeouts such that I can >> continue to investigate without time pressure. > > Let me bang my head against this problem for a little while longer. You > might be right, though, that this is a thing we cannot fix in time for > v2.47.0, which would be sad. If you folks think it would help stabilizing the tentative fix, I am open to the idea of delaying the 2.47 by a few days. Currently the 2.47-final is scheduled on the 7th (Monday), but we can do 2.47-rc2 on that day instead, and move the final to 10th (Thu) or 11th (Fri) [*]. Thanks, all, for working together. [Footnote] * All dates are US/Pacific, 10:00 am ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: v2.47.0-rc1 test failure on cygwin 2024-10-04 16:09 ` Junio C Hamano @ 2024-10-04 17:14 ` Patrick Steinhardt 2024-10-04 17:54 ` Junio C Hamano 0 siblings, 1 reply; 19+ messages in thread From: Patrick Steinhardt @ 2024-10-04 17:14 UTC (permalink / raw) To: Junio C Hamano; +Cc: Johannes Schindelin, Ramsay Jones, GIT Mailing-list On Fri, Oct 04, 2024 at 09:09:01AM -0700, Junio C Hamano wrote: > Johannes Schindelin <Johannes.Schindelin@gmx.de> writes: > > >> But unfortunately this still caused permission errors when the new path > >> was held open by another process. > > > > Yes, this will _always_ be a problem, I think. The > > `FILE_RENAME_POSIX_SEMANTICS` as per its documentation should help, but if > > it does not in your tests it might actually not quite work as advertised > > (wouldn't be the first time I encounter such an issue). > > > > I tried to read through the code (it's a lot!) to figure out whether there > > is potentially any situation when the `tables.list` file is opened but not > > closed immediately, but couldn't find any. Do you know off-hand of any > > such scenario? > > > >> I think for now I'd still lean into the direction of adding the !WINDOWS > >> prerequisite to the test and increasing timeouts such that I can > >> continue to investigate without time pressure. > > > > Let me bang my head against this problem for a little while longer. You > > might be right, though, that this is a thing we cannot fix in time for > > v2.47.0, which would be sad. > > If you folks think it would help stabilizing the tentative fix, I am > open to the idea of delaying the 2.47 by a few days. Currently the > 2.47-final is scheduled on the 7th (Monday), but we can do 2.47-rc2 > on that day instead, and move the final to 10th (Thu) or 11th (Fri) > [*]. > > Thanks, all, for working together. > > > [Footnote] > > * All dates are US/Pacific, 10:00 am Right now I don't yet have a good idea for how to fix the issue. I've been trying a bunch of different things that, according to Windows docs, should've made the renames work. But they didn't, and I don't really have an alternative right now. So I'll need to keep on thinking about this, and maybe get some more help from people familiar with Windows. So deferring Git 2.47 because of it probably does not make much sense unless somebody can up with a solution. Also, as noted already, this is not a regression, the behaviour actually improved even on Windows. Not to the degree I was hoping for, but at least a bit. So that's another reason why I don't think it is worth deferring over this. Thanks! Patrick ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: v2.47.0-rc1 test failure on cygwin 2024-10-04 17:14 ` Patrick Steinhardt @ 2024-10-04 17:54 ` Junio C Hamano 0 siblings, 0 replies; 19+ messages in thread From: Junio C Hamano @ 2024-10-04 17:54 UTC (permalink / raw) To: Patrick Steinhardt; +Cc: Johannes Schindelin, Ramsay Jones, GIT Mailing-list Patrick Steinhardt <ps@pks.im> writes: > Right now I don't yet have a good idea for how to fix the issue. I've > been trying a bunch of different things that, according to Windows docs, > should've made the renames work. But they didn't, and I don't really > have an alternative right now. So I'll need to keep on thinking about > this, and maybe get some more help from people familiar with Windows. > > So deferring Git 2.47 because of it probably does not make much sense > unless somebody can up with a solution. Also, as noted already, this is > not a regression, the behaviour actually improved even on Windows. Not > to the degree I was hoping for, but at least a bit. So that's another > reason why I don't think it is worth deferring over this. OK. I was thinking that we may need more time to cook the real fix on target (read: not Windows) platforms, but if that is not the case, then let's stick to the original schedule. Thanks. ^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH] t0610: work around flaky test with concurrent writers 2024-10-04 1:02 v2.47.0-rc1 test failure on cygwin Ramsay Jones 2024-10-04 3:59 ` Patrick Steinhardt @ 2024-10-04 12:16 ` Patrick Steinhardt 2024-10-04 14:47 ` Ramsay Jones 2024-10-04 16:22 ` Junio C Hamano 2024-10-04 15:32 ` [PATCH v2] " Patrick Steinhardt 2 siblings, 2 replies; 19+ messages in thread From: Patrick Steinhardt @ 2024-10-04 12:16 UTC (permalink / raw) To: git Cc: Ramsay Jones, Junio C Hamano, Johannes Schindelin, Jeff King, Josh Steadmon In 6241ce2170 (refs/reftable: reload locked stack when preparing transaction, 2024-09-24) we have introduced a new test that exercises how the reftable backend behaves with many concurrent writers all racing with each other. This test was introduced after a couple of fixes in this context that should make concurrent writes behave gracefully. As it turns out though, Windows systems do not yet handle concurrent writes properly, as we've got two reports for Cygwin and MinGW failing in this newly added test. The root cause of this is how we update the "tables.list" file: when writing a new stack of tables we first write the data into a lockfile and then rename that file into place. But Windows forbids us from doing that rename when the target path is open for reading by another process. And as the test races both readers and writers with each other we are quite likely to hit this edge case. Now the two reports are somewhat different from one another: - On Cygwin we hit timeouts because we fail to lock the "tables.list" file within 10 seconds. The renames themselves succeed even when the target file is open because Cygwin provides extensive compatibility logic to make them work even when the target file is open already. - On MinGW we hit I/O errors on rename. While we do have some retry logic in place to make the rename work in some cases, this is seemingly not sufficient when there is this much contention around the files. Neither of these cases is a regression: the logic didn't work before the mentioned commit, and after the commit it performs well on Linux, macOS and in Cygwin, and at least a bit better with MinGW. But the tests show that we need to put more thought into how to make this work properly on MinGW systems. The fact that Cygwin can work around this issue with better emulation of POSIX-style atomic renames shows that we can in theory make MinGW work better, as well. But doing so likely requires quite some fiddling with Windows internals, and Git v2.47 is about to be released in a couple days. This makes any potential fix quite risky as it would have to happen deep down in our rename(3P) implementation in "compat/mingw.c". Let's instead work around both issues by disabling the test on MinGW and by significantly increasing the locking timeout for Cygwin. This bumped timeout also helps when running with e.g. the address and memory sanitizers, which also tend to significantly extend the runtime of this test. This should be revisited after Git v2.47 is out. Signed-off-by: Patrick Steinhardt <ps@pks.im> --- This fix can be applied to remove some of the stress with the Git v2.47 release pending. If would of course be preferable to find an alternate fix that makes MinGW work as required, but if you take the 500 lines of code that is the rename(3P) implemenation of Cygwin as a hint you quickly figure out that this is a rather complex problem. Patrick t/t0610-reftable-basics.sh | 22 +++++++++++++++++----- 1 file changed, 17 insertions(+), 5 deletions(-) diff --git a/t/t0610-reftable-basics.sh b/t/t0610-reftable-basics.sh index 2d951c8ceb..86a746aff0 100755 --- a/t/t0610-reftable-basics.sh +++ b/t/t0610-reftable-basics.sh @@ -450,15 +450,27 @@ test_expect_success 'ref transaction: retry acquiring tables.list lock' ' ) ' -test_expect_success 'ref transaction: many concurrent writers' ' +# This test fails most of the time on Windows systems. The root cause is +# that Windows does not allow us to rename the "tables.list.lock" file into +# place when "tables.list" is open for reading by a concurrent process. +# +# The same issue does not happen on Cygwin because its implementation of +# rename(3P) is emulating POSIX-style renames, including renames over files +# that are open. +test_expect_success !MINGW 'ref transaction: many concurrent writers' ' test_when_finished "rm -rf repo" && git init repo && ( cd repo && - # Set a high timeout such that a busy CI machine will not abort - # early. 10 seconds should hopefully be ample of time to make - # this non-flaky. - git config set reftable.lockTimeout 10000 && + # Set a high timeout. While a couple of seconds should be + # plenty, using the address sanitizer will significantly slow + # us down here. Furthermore, Cygwin is also way slower due to + # the POSIX-style rename emulation. So we are aiming way higher + # than you would ever think is necessary just to keep us from + # flaking. We could also lock indefinitely by passing -1, but + # that could potentially block CI jobs indefinitely if there + # was a bug here. + git config set reftable.lockTimeout 300000 && test_commit --no-tag initial && head=$(git rev-parse HEAD) && base-commit: 111e864d69c84284441b083966c2065c2e9a4e78 -- 2.47.0.rc0.dirty ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [PATCH] t0610: work around flaky test with concurrent writers 2024-10-04 12:16 ` [PATCH] t0610: work around flaky test with concurrent writers Patrick Steinhardt @ 2024-10-04 14:47 ` Ramsay Jones 2024-10-04 15:26 ` Patrick Steinhardt 2024-10-04 16:32 ` Junio C Hamano 2024-10-04 16:22 ` Junio C Hamano 1 sibling, 2 replies; 19+ messages in thread From: Ramsay Jones @ 2024-10-04 14:47 UTC (permalink / raw) To: Patrick Steinhardt, git Cc: Junio C Hamano, Johannes Schindelin, Jeff King, Josh Steadmon Hi Patrick, On 04/10/2024 13:16, Patrick Steinhardt wrote: [snip] > Now the two reports are somewhat different from one another: > > - On Cygwin we hit timeouts because we fail to lock the "tables.list" > file within 10 seconds. The renames themselves succeed even when the > target file is open because Cygwin provides extensive compatibility > logic to make them work even when the target file is open already. Hmm, not so much for me! :( > > - On MinGW we hit I/O errors on rename. While we do have some retry > logic in place to make the rename work in some cases, this is > seemingly not sufficient when there is this much contention around > the files. I am seeing I/O errors. > > Neither of these cases is a regression: the logic didn't work before the > mentioned commit, and after the commit it performs well on Linux, macOS > and in Cygwin, and at least a bit better with MinGW. But the tests show > that we need to put more thought into how to make this work properly on > MinGW systems. OK > > The fact that Cygwin can work around this issue with better emulation of > POSIX-style atomic renames shows that we can in theory make MinGW work > better, as well. But doing so likely requires quite some fiddling with > Windows internals, and Git v2.47 is about to be released in a couple > days. This makes any potential fix quite risky as it would have to > happen deep down in our rename(3P) implementation in "compat/mingw.c". > > Let's instead work around both issues by disabling the test on MinGW > and by significantly increasing the locking timeout for Cygwin. This > bumped timeout also helps when running with e.g. the address and memory > sanitizers, which also tend to significantly extend the runtime of this > test. This doesn't work for me. > > This should be revisited after Git v2.47 is out. > > Signed-off-by: Patrick Steinhardt <ps@pks.im> > --- [PATCH snipped] As I said last night, the first thing I tried was increasing the timeout ten-fold (ie, to 100000), but that didn't change the outcome. I am also on windows 10 (on an *old* 4th gen core i5): $ uname -a CYGWIN_NT-10.0-19045 satellite 3.5.4-1.x86_64 2024-08-25 16:52 UTC x86_64 Cygwin $ I don't know why we appear to be seeing different behaviour. Since I had already edited this test to up the timeout, I increased it again to match the value in this (redacted) patch (ie I didn't actually apply your patch): $ git diff diff --git a/t/t0610-reftable-basics.sh b/t/t0610-reftable-basics.sh index 2d951c8ceb..7b6ded1c6a 100755 --- a/t/t0610-reftable-basics.sh +++ b/t/t0610-reftable-basics.sh @@ -458,7 +458,7 @@ test_expect_success 'ref transaction: many concurrent writers' ' # Set a high timeout such that a busy CI machine will not abort # early. 10 seconds should hopefully be ample of time to make # this non-flaky. - git config set reftable.lockTimeout 10000 && + git config set reftable.lockTimeout 300000 && test_commit --no-tag initial && head=$(git rev-parse HEAD) && $ However, apart from less debugging output, the result was essentially the same: ... ++ wait ++ git update-ref refs/heads/branch-100 HEAD fatal: update_ref failed for ref 'refs/heads/branch-97': reftable: transaction failure: I/O error fatal: update_ref failed for ref 'refs/heads/branch-87': reftable: transaction prepare: I/O error ++ git for-each-ref --sort=v:refname ++ test_cmp expect actual ++ test 2 -ne 2 ++ eval 'diff -u' '"$@"' +++ diff -u expect actual --- expect 2024-10-04 13:31:42.450969800 +0000 +++ actual 2024-10-04 13:32:07.097150600 +0000 @@ -84,7 +84,6 @@ 68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-84 68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-85 68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-86 -68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-87 68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-88 68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-89 68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-90 error: last command exited with $?=1 not ok 47 - ref transaction: many concurrent writers ... In an earlier email you suggested a timeout of -1, thus: $ git diff diff --git a/t/t0610-reftable-basics.sh b/t/t0610-reftable-basics.sh index 2d951c8ceb..9ab37a8d69 100755 --- a/t/t0610-reftable-basics.sh +++ b/t/t0610-reftable-basics.sh @@ -458,7 +458,7 @@ test_expect_success 'ref transaction: many concurrent writers' ' # Set a high timeout such that a busy CI machine will not abort # early. 10 seconds should hopefully be ample of time to make # this non-flaky. - git config set reftable.lockTimeout 10000 && + git config set reftable.lockTimeout -1 && test_commit --no-tag initial && head=$(git rev-parse HEAD) && $ The result looks familiar: ++ wait ++ git update-ref refs/heads/branch-100 HEAD fatal: update_ref failed for ref 'refs/heads/branch-46': reftable: transaction prepare: I/O error ++ git for-each-ref --sort=v:refname ++ test_cmp expect actual ++ test 2 -ne 2 ++ eval 'diff -u' '"$@"' +++ diff -u expect actual --- expect 2024-10-04 14:35:20.159564800 +0000 +++ actual 2024-10-04 14:35:43.304762500 +0000 @@ -43,7 +43,6 @@ 68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-43 68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-44 68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-45 -68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-46 68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-47 68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-48 68d032e9edd3481ac96382786ececc37ec28709e commit refs/heads/branch-49 error: last command exited with $?=1 not ok 47 - ref transaction: many concurrent writers Can you think of anything else to try? I would strongly suggest skipping this test on cygwin as well as MINGW. ATB, Ramsay Jones ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] t0610: work around flaky test with concurrent writers 2024-10-04 14:47 ` Ramsay Jones @ 2024-10-04 15:26 ` Patrick Steinhardt 2024-10-04 16:32 ` Junio C Hamano 1 sibling, 0 replies; 19+ messages in thread From: Patrick Steinhardt @ 2024-10-04 15:26 UTC (permalink / raw) To: Ramsay Jones Cc: git, Junio C Hamano, Johannes Schindelin, Jeff King, Josh Steadmon On Fri, Oct 04, 2024 at 03:47:01PM +0100, Ramsay Jones wrote: > On 04/10/2024 13:16, Patrick Steinhardt wrote: > [snip] > > > Now the two reports are somewhat different from one another: > > > > - On Cygwin we hit timeouts because we fail to lock the "tables.list" > > file within 10 seconds. The renames themselves succeed even when the > > target file is open because Cygwin provides extensive compatibility > > logic to make them work even when the target file is open already. > > Hmm, not so much for me! :( > > > > > - On MinGW we hit I/O errors on rename. While we do have some retry > > logic in place to make the rename work in some cases, this is > > seemingly not sufficient when there is this much contention around > > the files. > > I am seeing I/O errors. Interesting! I wonder why I don't see them. Maybe it's boiling down to timing again. [snip] > Can you think of anything else to try? Not really, no. I'd be curious whether Windows 11 has the same failure mode for Cygwin, but cannot test it myself. > I would strongly suggest skipping this test on cygwin as well as MINGW. Yup, I definitely agree. I was operating under the assumption that Cygwin works alright. Now that we know that it doesn't we should adapt the condition from "!MINGW" to "!WINDOWS". Patrick ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] t0610: work around flaky test with concurrent writers 2024-10-04 14:47 ` Ramsay Jones 2024-10-04 15:26 ` Patrick Steinhardt @ 2024-10-04 16:32 ` Junio C Hamano 1 sibling, 0 replies; 19+ messages in thread From: Junio C Hamano @ 2024-10-04 16:32 UTC (permalink / raw) To: Ramsay Jones Cc: Patrick Steinhardt, git, Johannes Schindelin, Jeff King, Josh Steadmon Ramsay Jones <ramsay@ramsayjones.plus.com> writes: > This doesn't work for me. > Can you think of anything else to try? > > I would strongly suggest skipping this test on cygwin as well as MINGW. There was some mention of "Win 10 or later" in the thread, IIRC, while explaining why Cygwin version of rename() works. Is it possible that the version of Windows and Cygwin you and Patrick used are different enough? In the meantime, I'll queue a SQUASH??? on top for preparing today's integration. This is a tangent, but there are places that use !MINGW,!CYGWIN and there are places that use !WINDOWS; I think they are equivalent prerequisites that we might want to straighten out their uses as a clean-up later. t/t0610-reftable-basics.sh | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git c/t/t0610-reftable-basics.sh w/t/t0610-reftable-basics.sh index 86a746aff0..e10c894cb0 100755 --- c/t/t0610-reftable-basics.sh +++ w/t/t0610-reftable-basics.sh @@ -454,10 +454,10 @@ test_expect_success 'ref transaction: retry acquiring tables.list lock' ' # that Windows does not allow us to rename the "tables.list.lock" file into # place when "tables.list" is open for reading by a concurrent process. # -# The same issue does not happen on Cygwin because its implementation of +# The same issue may not happen on Cygwin because its implementation of # rename(3P) is emulating POSIX-style renames, including renames over files -# that are open. -test_expect_success !MINGW 'ref transaction: many concurrent writers' ' +# that are open, but it probably depends on the version. +test_expect_success !WINDOWS 'ref transaction: many concurrent writers' ' test_when_finished "rm -rf repo" && git init repo && ( ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [PATCH] t0610: work around flaky test with concurrent writers 2024-10-04 12:16 ` [PATCH] t0610: work around flaky test with concurrent writers Patrick Steinhardt 2024-10-04 14:47 ` Ramsay Jones @ 2024-10-04 16:22 ` Junio C Hamano 1 sibling, 0 replies; 19+ messages in thread From: Junio C Hamano @ 2024-10-04 16:22 UTC (permalink / raw) To: Patrick Steinhardt Cc: git, Ramsay Jones, Johannes Schindelin, Jeff King, Josh Steadmon Patrick Steinhardt <ps@pks.im> writes: > This fix can be applied to remove some of the stress with the Git v2.47 > release pending. If would of course be preferable to find an alternate > fix that makes MinGW work as required, but if you take the 500 lines of > code that is the rename(3P) implemenation of Cygwin as a hint you > quickly figure out that this is a rather complex problem. Yup, this is good for 2.47 and I agree it is sensible to punt the real solution after the upcoming release. Thanks. ^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH v2] t0610: work around flaky test with concurrent writers 2024-10-04 1:02 v2.47.0-rc1 test failure on cygwin Ramsay Jones 2024-10-04 3:59 ` Patrick Steinhardt 2024-10-04 12:16 ` [PATCH] t0610: work around flaky test with concurrent writers Patrick Steinhardt @ 2024-10-04 15:32 ` Patrick Steinhardt 2024-10-04 16:32 ` Ramsay Jones ` (2 more replies) 2 siblings, 3 replies; 19+ messages in thread From: Patrick Steinhardt @ 2024-10-04 15:32 UTC (permalink / raw) To: git Cc: Ramsay Jones, Junio C Hamano, Johannes Schindelin, Jeff King, Josh Steadmon In 6241ce2170 (refs/reftable: reload locked stack when preparing transaction, 2024-09-24) we have introduced a new test that exercises how the reftable backend behaves with many concurrent writers all racing with each other. This test was introduced after a couple of fixes in this context that should make concurrent writes behave gracefully. As it turns out though, Windows systems do not yet handle concurrent writes properly, as we've got two reports for Cygwin and MinGW failing in this newly added test. The root cause of this is how we update the "tables.list" file: when writing a new stack of tables we first write the data into a lockfile and then rename that file into place. But Windows forbids us from doing that rename when the target path is open for reading by another process. And as the test races both readers and writers with each other we are quite likely to hit this edge case. This is not a regression: the logic didn't work before the mentioned commit, and after the commit it performs well on Linux and macOS, and the situation on Windows should have at least improved a bit. But the test shows that we need to put more thought into how to make this work properly there. Work around the issue by disabling the test on Windows for now. While at it, increase the locking timeout to address reported timeouts when using either the address or memory sanitizer, which also tend to significantly extend the runtime of this test. This should be revisited after Git v2.47 is out. Signed-off-by: Patrick Steinhardt <ps@pks.im> --- t/t0610-reftable-basics.sh | 17 ++++++++++++----- 1 file changed, 12 insertions(+), 5 deletions(-) diff --git a/t/t0610-reftable-basics.sh b/t/t0610-reftable-basics.sh index 2d951c8ceb..babec7993e 100755 --- a/t/t0610-reftable-basics.sh +++ b/t/t0610-reftable-basics.sh @@ -450,15 +450,22 @@ test_expect_success 'ref transaction: retry acquiring tables.list lock' ' ) ' -test_expect_success 'ref transaction: many concurrent writers' ' +# This test fails most of the time on Windows systems. The root cause is +# that Windows does not allow us to rename the "tables.list.lock" file into +# place when "tables.list" is open for reading by a concurrent process. +test_expect_success !WINDOWS 'ref transaction: many concurrent writers' ' test_when_finished "rm -rf repo" && git init repo && ( cd repo && - # Set a high timeout such that a busy CI machine will not abort - # early. 10 seconds should hopefully be ample of time to make - # this non-flaky. - git config set reftable.lockTimeout 10000 && + # Set a high timeout. While a couple of seconds should be + # plenty, using the address sanitizer will significantly slow + # us down here. So we are aiming way higher than you would ever + # think is necessary just to keep us from flaking. We could + # also lock indefinitely by passing -1, but that could + # potentially block CI jobs indefinitely if there was a bug + # here. + git config set reftable.lockTimeout 300000 && test_commit --no-tag initial && head=$(git rev-parse HEAD) && Interdiff against v1: diff --git a/t/t0610-reftable-basics.sh b/t/t0610-reftable-basics.sh index 86a746aff0..babec7993e 100755 --- a/t/t0610-reftable-basics.sh +++ b/t/t0610-reftable-basics.sh @@ -453,23 +453,18 @@ test_expect_success 'ref transaction: retry acquiring tables.list lock' ' # This test fails most of the time on Windows systems. The root cause is # that Windows does not allow us to rename the "tables.list.lock" file into # place when "tables.list" is open for reading by a concurrent process. -# -# The same issue does not happen on Cygwin because its implementation of -# rename(3P) is emulating POSIX-style renames, including renames over files -# that are open. -test_expect_success !MINGW 'ref transaction: many concurrent writers' ' +test_expect_success !WINDOWS 'ref transaction: many concurrent writers' ' test_when_finished "rm -rf repo" && git init repo && ( cd repo && # Set a high timeout. While a couple of seconds should be # plenty, using the address sanitizer will significantly slow - # us down here. Furthermore, Cygwin is also way slower due to - # the POSIX-style rename emulation. So we are aiming way higher - # than you would ever think is necessary just to keep us from - # flaking. We could also lock indefinitely by passing -1, but - # that could potentially block CI jobs indefinitely if there - # was a bug here. + # us down here. So we are aiming way higher than you would ever + # think is necessary just to keep us from flaking. We could + # also lock indefinitely by passing -1, but that could + # potentially block CI jobs indefinitely if there was a bug + # here. git config set reftable.lockTimeout 300000 && test_commit --no-tag initial && base-commit: 111e864d69c84284441b083966c2065c2e9a4e78 -- 2.47.0.rc0.dirty ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [PATCH v2] t0610: work around flaky test with concurrent writers 2024-10-04 15:32 ` [PATCH v2] " Patrick Steinhardt @ 2024-10-04 16:32 ` Ramsay Jones 2024-10-04 16:35 ` Junio C Hamano 2024-10-04 22:41 ` Jeff King 2 siblings, 0 replies; 19+ messages in thread From: Ramsay Jones @ 2024-10-04 16:32 UTC (permalink / raw) To: Patrick Steinhardt, git Cc: Junio C Hamano, Johannes Schindelin, Jeff King, Josh Steadmon Hi Patrick, On 04/10/2024 16:32, Patrick Steinhardt wrote: > In 6241ce2170 (refs/reftable: reload locked stack when preparing > transaction, 2024-09-24) we have introduced a new test that exercises > how the reftable backend behaves with many concurrent writers all racing > with each other. This test was introduced after a couple of fixes in > this context that should make concurrent writes behave gracefully. As it > turns out though, Windows systems do not yet handle concurrent writes > properly, as we've got two reports for Cygwin and MinGW failing in this > newly added test. > > The root cause of this is how we update the "tables.list" file: when > writing a new stack of tables we first write the data into a lockfile > and then rename that file into place. But Windows forbids us from doing > that rename when the target path is open for reading by another process. > And as the test races both readers and writers with each other we are > quite likely to hit this edge case. > > This is not a regression: the logic didn't work before the mentioned > commit, and after the commit it performs well on Linux and macOS, and > the situation on Windows should have at least improved a bit. But the > test shows that we need to put more thought into how to make this work > properly there. > > Work around the issue by disabling the test on Windows for now. While at > it, increase the locking timeout to address reported timeouts when using > either the address or memory sanitizer, which also tend to significantly > extend the runtime of this test. > > This should be revisited after Git v2.47 is out. > > Signed-off-by: Patrick Steinhardt <ps@pks.im> > --- > t/t0610-reftable-basics.sh | 17 ++++++++++++----- > 1 file changed, 12 insertions(+), 5 deletions(-) > > diff --git a/t/t0610-reftable-basics.sh b/t/t0610-reftable-basics.sh > index 2d951c8ceb..babec7993e 100755 > --- a/t/t0610-reftable-basics.sh > +++ b/t/t0610-reftable-basics.sh > @@ -450,15 +450,22 @@ test_expect_success 'ref transaction: retry acquiring tables.list lock' ' > ) > ' > > -test_expect_success 'ref transaction: many concurrent writers' ' > +# This test fails most of the time on Windows systems. The root cause is > +# that Windows does not allow us to rename the "tables.list.lock" file into > +# place when "tables.list" is open for reading by a concurrent process. > +test_expect_success !WINDOWS 'ref transaction: many concurrent writers' ' > test_when_finished "rm -rf repo" && > git init repo && > ( > cd repo && > - # Set a high timeout such that a busy CI machine will not abort > - # early. 10 seconds should hopefully be ample of time to make > - # this non-flaky. > - git config set reftable.lockTimeout 10000 && > + # Set a high timeout. While a couple of seconds should be > + # plenty, using the address sanitizer will significantly slow > + # us down here. So we are aiming way higher than you would ever > + # think is necessary just to keep us from flaking. We could > + # also lock indefinitely by passing -1, but that could > + # potentially block CI jobs indefinitely if there was a bug > + # here. > + git config set reftable.lockTimeout 300000 && > test_commit --no-tag initial && > > head=$(git rev-parse HEAD) && I did apply this patch and (unsurprising) it now passes just fine! :) ATB, Ramsay Jones ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v2] t0610: work around flaky test with concurrent writers 2024-10-04 15:32 ` [PATCH v2] " Patrick Steinhardt 2024-10-04 16:32 ` Ramsay Jones @ 2024-10-04 16:35 ` Junio C Hamano 2024-10-04 22:41 ` Jeff King 2 siblings, 0 replies; 19+ messages in thread From: Junio C Hamano @ 2024-10-04 16:35 UTC (permalink / raw) To: Patrick Steinhardt Cc: git, Ramsay Jones, Johannes Schindelin, Jeff King, Josh Steadmon Patrick Steinhardt <ps@pks.im> writes: > This should be revisited after Git v2.47 is out. > > Signed-off-by: Patrick Steinhardt <ps@pks.im> > --- > t/t0610-reftable-basics.sh | 17 ++++++++++++----- > 1 file changed, 12 insertions(+), 5 deletions(-) Ah, I spoke before I finished my inbox. Let's use this version. Thanks. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v2] t0610: work around flaky test with concurrent writers 2024-10-04 15:32 ` [PATCH v2] " Patrick Steinhardt 2024-10-04 16:32 ` Ramsay Jones 2024-10-04 16:35 ` Junio C Hamano @ 2024-10-04 22:41 ` Jeff King 2 siblings, 0 replies; 19+ messages in thread From: Jeff King @ 2024-10-04 22:41 UTC (permalink / raw) To: Patrick Steinhardt Cc: git, Ramsay Jones, Junio C Hamano, Johannes Schindelin, Josh Steadmon On Fri, Oct 04, 2024 at 05:32:16PM +0200, Patrick Steinhardt wrote: > Work around the issue by disabling the test on Windows for now. While at > it, increase the locking timeout to address reported timeouts when using > either the address or memory sanitizer, which also tend to significantly > extend the runtime of this test. > > This should be revisited after Git v2.47 is out. This sounds OK to me. In general, I worry about papering over bugs with a "we'll revisit it later" note. We may forget about things or never prioritize them. And the worst-case scenario here is that we might later decide to promote reftable to the default format, forgetting that concurrency has problems on Windows. But given that it's an area you've been actively working in, there's already been some discussion towards the real fix, and we still consider reftables somewhat experimental, I feel OK bumping it for now. Thanks for all the digging on this (from you and from other folks). -Peff ^ permalink raw reply [flat|nested] 19+ messages in thread
end of thread, other threads:[~2024-10-04 22:41 UTC | newest] Thread overview: 19+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2024-10-04 1:02 v2.47.0-rc1 test failure on cygwin Ramsay Jones 2024-10-04 3:59 ` Patrick Steinhardt 2024-10-04 6:13 ` Patrick Steinhardt 2024-10-04 9:13 ` Johannes Schindelin 2024-10-04 10:09 ` Patrick Steinhardt 2024-10-04 11:11 ` Johannes Schindelin 2024-10-04 11:32 ` Patrick Steinhardt 2024-10-04 16:09 ` Junio C Hamano 2024-10-04 17:14 ` Patrick Steinhardt 2024-10-04 17:54 ` Junio C Hamano 2024-10-04 12:16 ` [PATCH] t0610: work around flaky test with concurrent writers Patrick Steinhardt 2024-10-04 14:47 ` Ramsay Jones 2024-10-04 15:26 ` Patrick Steinhardt 2024-10-04 16:32 ` Junio C Hamano 2024-10-04 16:22 ` Junio C Hamano 2024-10-04 15:32 ` [PATCH v2] " Patrick Steinhardt 2024-10-04 16:32 ` Ramsay Jones 2024-10-04 16:35 ` Junio C Hamano 2024-10-04 22:41 ` Jeff King
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).