git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* v2.47.0-rc1 test failure on cygwin
@ 2024-10-04  1:02 Ramsay Jones
  2024-10-04  3:59 ` Patrick Steinhardt
                   ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Ramsay Jones @ 2024-10-04  1:02 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: GIT Mailing-list, Junio C Hamano

Hi Patrick,

Just a quick heads up: t0610-reftable-basics.sh test 47 (ref transaction: many
concurrent writers) fails on cygwin. The tail end of the debug output for this
test looks like:

..

++ wait
fatal: update_ref failed for ref 'refs/heads/branch-82': reftable: transaction prepare: I/O error
fatal: update_ref failed for ref 'refs/heads/branch-2': cannot lock references
fatal: update_ref failed for ref 'refs/heads/branch-3': cannot lock references
fatal: update_ref failed for ref 'refs/heads/branch-27': cannot lock references
fatal: update_ref failed for ref 'refs/heads/branch-10': cannot lock references
fatal: update_ref failed for ref 'refs/heads/branch-21': cannot lock references
fatal: update_ref failed for ref 'refs/heads/branch-14': cannot lock references
fatal: update_ref failed for ref 'refs/heads/branch-9': cannot lock references
fatal: update_ref failed for ref 'refs/heads/branch-19': cannot lock references
fatal: update_ref failed for ref 'refs/heads/branch-33': cannot lock references
fatal: update_ref failed for ref 'refs/heads/branch-28': cannot lock references
fatal: update_ref failed for ref 'refs/heads/branch-29': cannot lock references
fatal: update_ref failed for ref 'refs/heads/branch-25': cannot lock references
fatal: update_ref failed for ref 'refs/heads/branch-53': cannot lock references
fatal: update_ref failed for ref 'refs/heads/branch-41': cannot lock references
fatal: update_ref failed for ref 'refs/heads/branch-37': cannot lock references
fatal: update_ref failed for ref 'refs/heads/branch-36': cannot lock references
fatal: update_ref failed for ref 'refs/heads/branch-50': cannot lock references
fatal: update_ref failed for ref 'refs/heads/branch-62': cannot lock references
fatal: update_ref failed for ref 'refs/heads/branch-48': cannot lock references
fatal: update_ref failed for ref 'refs/heads/branch-49': cannot lock references
fatal: update_ref failed for ref 'refs/heads/branch-59': cannot lock references
fatal: update_ref failed for ref 'refs/heads/branch-67': cannot lock references
fatal: update_ref failed for ref 'refs/heads/branch-81': cannot lock references
fatal: update_ref failed for ref 'refs/heads/branch-72': cannot lock references
fatal: update_ref failed for ref 'refs/heads/branch-70': cannot lock references
fatal: update_ref failed for ref 'refs/heads/branch-54': cannot lock references
fatal: update_ref failed for ref 'refs/heads/branch-71': cannot lock references
fatal: update_ref failed for ref 'refs/heads/branch-74': cannot lock references
fatal: update_ref failed for ref 'refs/heads/branch-95': cannot lock references
fatal: update_ref failed for ref 'refs/heads/branch-87': cannot lock references
fatal: update_ref failed for ref 'refs/heads/branch-92': cannot lock references
fatal: update_ref failed for ref 'refs/heads/branch-64': cannot lock references
fatal: update_ref failed for ref 'refs/heads/branch-94': cannot lock references
fatal: update_ref failed for ref 'refs/heads/branch-100': cannot lock references
++ git for-each-ref --sort=v:refname
++ test_cmp expect actual
++ test 2 -ne 2
++ eval 'diff -u' '"$@"'
+++ diff -u expect actual
--- expect	2024-10-03 23:21:01.284020500 +0000
+++ actual	2024-10-03 23:21:19.881283200 +0000
@@ -1,101 +1,66 @@
 68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-1
-68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-2
-68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-3
 68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-4
 68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-5
 68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-6
 68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-7
 68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-8
-68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-9
-68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-10
 68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-11
 68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-12
 68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-13
-68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-14
 68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-15
 68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-16
 68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-17
 68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-18
-68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-19
 68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-20
-68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-21
 68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-22
 68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-23
 68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-24
-68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-25
 68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-26
-68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-27
-68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-28
-68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-29
 68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-30
 68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-31
 68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-32
-68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-33
 68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-34
 68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-35
-68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-36
-68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-37
 68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-38
 68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-39
 68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-40
-68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-41
 68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-42
 68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-43
 68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-44
 68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-45
 68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-46
 68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-47
-68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-48
-68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-49
-68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-50
 68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-51
 68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-52
-68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-53
-68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-54
 68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-55
 68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-56
 68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-57
 68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-58
-68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-59
 68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-60
 68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-61
-68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-62
 68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-63
-68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-64
 68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-65
 68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-66
-68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-67
 68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-68
 68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-69
-68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-70
-68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-71
-68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-72
 68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-73
-68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-74
 68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-75
 68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-76
 68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-77
 68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-78
 68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-79
 68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-80
-68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-81
-68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-82
 68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-83
 68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-84
 68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-85
 68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-86
-68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-87
 68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-88
 68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-89
 68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-90
 68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-91
-68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-92
 68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-93
-68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-94
-68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-95
 68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-96
 68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-97
 68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-98
 68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-99
-68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-100
 68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/main
error: last command exited with $?=1
not ok 47 - ref transaction: many concurrent writers

..

t0610-reftable-basics.sh passed on 'rc0', but this test (and the timeout facility)
is new in 'rc1'. I tried simply increasing the timeout (10 fold), but that didn't
change the result. (I didn't really expect it to - the 'reftable: transaction
prepare: I/O error' does not look timing related!).

Again, just a heads up. (I can't look at it until tomorrow now; any ideas?)

ATB,
Ramsay Jones



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: v2.47.0-rc1 test failure on cygwin
  2024-10-04  1:02 v2.47.0-rc1 test failure on cygwin Ramsay Jones
@ 2024-10-04  3:59 ` Patrick Steinhardt
  2024-10-04  6:13   ` Patrick Steinhardt
  2024-10-04 12:16 ` [PATCH] t0610: work around flaky test with concurrent writers Patrick Steinhardt
  2024-10-04 15:32 ` [PATCH v2] " Patrick Steinhardt
  2 siblings, 1 reply; 19+ messages in thread
From: Patrick Steinhardt @ 2024-10-04  3:59 UTC (permalink / raw)
  To: Ramsay Jones; +Cc: GIT Mailing-list, Junio C Hamano

On Fri, Oct 04, 2024 at 02:02:44AM +0100, Ramsay Jones wrote:
> Hi Patrick,
> 
> Just a quick heads up: t0610-reftable-basics.sh test 47 (ref transaction: many
> concurrent writers) fails on cygwin. The tail end of the debug output for this
> test looks like:
> 
[snip]
> 
> t0610-reftable-basics.sh passed on 'rc0', but this test (and the timeout facility)
> is new in 'rc1'. I tried simply increasing the timeout (10 fold), but that didn't
> change the result. (I didn't really expect it to - the 'reftable: transaction
> prepare: I/O error' does not look timing related!).
> 
> Again, just a heads up. (I can't look at it until tomorrow now; any ideas?)

This failure is kind of known and discussed in [1]. Just to make it
explicit: this test failure doesn't really surface a regression, the
reftable code already failed for concurrent writes before. I fixed that
and added the test that is now flaky, as the fix itself is seemingly
only sufficient on Linux and macOS.

I didn't yet have the time to look at whether I can fix it, but should
finally find the time to do so today.

Patrick

[1]: <20240927040752.GA567671@coredump.intra.peff.net>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: v2.47.0-rc1 test failure on cygwin
  2024-10-04  3:59 ` Patrick Steinhardt
@ 2024-10-04  6:13   ` Patrick Steinhardt
  2024-10-04  9:13     ` Johannes Schindelin
  0 siblings, 1 reply; 19+ messages in thread
From: Patrick Steinhardt @ 2024-10-04  6:13 UTC (permalink / raw)
  To: Ramsay Jones; +Cc: GIT Mailing-list, Junio C Hamano

On Fri, Oct 04, 2024 at 05:59:30AM +0200, Patrick Steinhardt wrote:
> On Fri, Oct 04, 2024 at 02:02:44AM +0100, Ramsay Jones wrote:
> > Hi Patrick,
> > 
> > Just a quick heads up: t0610-reftable-basics.sh test 47 (ref transaction: many
> > concurrent writers) fails on cygwin. The tail end of the debug output for this
> > test looks like:
> > 
> [snip]
> > 
> > t0610-reftable-basics.sh passed on 'rc0', but this test (and the timeout facility)
> > is new in 'rc1'. I tried simply increasing the timeout (10 fold), but that didn't
> > change the result. (I didn't really expect it to - the 'reftable: transaction
> > prepare: I/O error' does not look timing related!).
> > 
> > Again, just a heads up. (I can't look at it until tomorrow now; any ideas?)
> 
> This failure is kind of known and discussed in [1]. Just to make it
> explicit: this test failure doesn't really surface a regression, the
> reftable code already failed for concurrent writes before. I fixed that
> and added the test that is now flaky, as the fix itself is seemingly
> only sufficient on Linux and macOS.
> 
> I didn't yet have the time to look at whether I can fix it, but should
> finally find the time to do so today.

Hm, interestingly enough I cannot reproduce the issue on Cygwin myself,
but I can reproduce the issue with MinGW. And in fact, the logs you have
sent all indicate that we cannot acquire the lock, there is no sign of
I/O errors here. So I guess you're running into timeout issues. Does the
following patch fix this for you?

diff --git a/t/t0610-reftable-basics.sh b/t/t0610-reftable-basics.sh
index 2d951c8ceb..b5cad805d4 100755
--- a/t/t0610-reftable-basics.sh
+++ b/t/t0610-reftable-basics.sh
@@ -455,10 +455,7 @@ test_expect_success 'ref transaction: many concurrent writers' '
 	git init repo &&
 	(
 		cd repo &&
-		# Set a high timeout such that a busy CI machine will not abort
-		# early. 10 seconds should hopefully be ample of time to make
-		# this non-flaky.
-		git config set reftable.lockTimeout 10000 &&
+		git config set reftable.lockTimeout -1 &&
 		test_commit --no-tag initial &&
 
 		head=$(git rev-parse HEAD) &&

The issue on Win32 is different: we cannot commit the "tables.list" lock
via rename(3P) because the target file may be open for reading by a
concurrent process. I guess that Cygwin has proper POSIX semantics for
rename(3P) and thus doesn't hit the same issue.

We already try to emulate POSIX semantics somewhat in `mingw_rename()`
by using a retry-loop when we hit `ERROR_ACCESS_DENIED`, which is what
we get when the target file is open in another process. But that
seemingly isn't enough when there is a lot of contention around a file.
So I'm currently investigating whether we can adopt something similar to
what Cygwin is doing for Win32, as well. I assume that they use
`FILE_RENAME_INFORMATION_EX` with `FILE_RENAME_POSIX_SEMANTICS`, which
should give us what we're looking for.

gh, well. Turns out the implementation of rename(3P) in Cygwin is 500
lines long. I guess this is a non-trivial problem :) But they of course
have to handle a whole lot more cases than we have to. But my guess was
correct: they do use `FILE_RENAME_POSIX_SEMANTICS`. The catch is that
this flag only exists in Windows 10 and newer. But that should be a fine
compromise.

I'll try to wrap my head around how all of this works.

Patrick

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: v2.47.0-rc1 test failure on cygwin
  2024-10-04  6:13   ` Patrick Steinhardt
@ 2024-10-04  9:13     ` Johannes Schindelin
  2024-10-04 10:09       ` Patrick Steinhardt
  0 siblings, 1 reply; 19+ messages in thread
From: Johannes Schindelin @ 2024-10-04  9:13 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: Ramsay Jones, GIT Mailing-list, Junio C Hamano

Hi Patrick,

On Fri, 4 Oct 2024, Patrick Steinhardt wrote:

> On Fri, Oct 04, 2024 at 05:59:30AM +0200, Patrick Steinhardt wrote:
> > On Fri, Oct 04, 2024 at 02:02:44AM +0100, Ramsay Jones wrote:
> > >
> > > Just a quick heads up: t0610-reftable-basics.sh test 47 (ref
> > > transaction: many concurrent writers) fails on cygwin. The tail end
> > > of the debug output for this test looks like:
> > >
> > [snip]
> > >
> > > t0610-reftable-basics.sh passed on 'rc0', but this test (and the
> > > timeout facility) is new in 'rc1'. I tried simply increasing the
> > > timeout (10 fold), but that didn't change the result. (I didn't
> > > really expect it to - the 'reftable: transaction prepare: I/O error'
> > > does not look timing related!).
> > >
> > > Again, just a heads up. (I can't look at it until tomorrow now; any
> > > ideas?)
> >
> > This failure is kind of known and discussed in [1]. Just to make it
> > explicit: this test failure doesn't really surface a regression, the
> > reftable code already failed for concurrent writes before. I fixed
> > that and added the test that is now flaky, as the fix itself is
> > seemingly only sufficient on Linux and macOS.
> >
> > I didn't yet have the time to look at whether I can fix it, but should
> > finally find the time to do so today.
>
> Hm, interestingly enough I cannot reproduce the issue on Cygwin myself,
> but I can reproduce the issue with MinGW. And in fact, the logs you have
> sent all indicate that we cannot acquire the lock, there is no sign of
> I/O errors here. So I guess you're running into timeout issues. Does the
> following patch fix this for you?
>
> diff --git a/t/t0610-reftable-basics.sh b/t/t0610-reftable-basics.sh
> index 2d951c8ceb..b5cad805d4 100755
> --- a/t/t0610-reftable-basics.sh
> +++ b/t/t0610-reftable-basics.sh
> @@ -455,10 +455,7 @@ test_expect_success 'ref transaction: many concurrent writers' '
>  	git init repo &&
>  	(
>  		cd repo &&
> -		# Set a high timeout such that a busy CI machine will not abort
> -		# early. 10 seconds should hopefully be ample of time to make
> -		# this non-flaky.
> -		git config set reftable.lockTimeout 10000 &&
> +		git config set reftable.lockTimeout -1 &&
>  		test_commit --no-tag initial &&
>
>  		head=$(git rev-parse HEAD) &&

That looks plausible to me, as this test is exercising the POSIX emulation
layer of Cygwin much more than Git itself (and therefore I expect this
test case to take a really long time on Cygwin). If it turns out that this
works around the issue, I would propose to use something like this
instead:

		# Cygwin needs a slightly longer timeout
		if have_prereq !CYGWIN
		then
			git config set reftable.lockTimeout 10000
		else
			git config set reftable.lockTimeout 60000
		fi &&

> The issue on Win32 is different: we cannot commit the "tables.list" lock
> via rename(3P) because the target file may be open for reading by a
> concurrent process. I guess that Cygwin has proper POSIX semantics for
> rename(3P) and thus doesn't hit the same issue.

Indeed, this is where the symptom lies. I worked around it in Git for
Windows v2.47.0-rc1 via this patch:

-- snipsnap --
diff --git a/compat/mingw.c b/compat/mingw.c
index c1030e40f227..56db193d1360 100644
--- a/compat/mingw.c
+++ b/compat/mingw.c
@@ -784,6 +784,7 @@ int mingw_open (const char *filename, int oflags, ...)
 	int fd, create = (oflags & (O_CREAT | O_EXCL)) == (O_CREAT | O_EXCL);
 	wchar_t wfilename[MAX_LONG_PATH];
 	open_fn_t open_fn;
+	int tries = 0;

 	va_start(args, oflags);
 	mode = va_arg(args, int);
@@ -813,7 +814,11 @@ int mingw_open (const char *filename, int oflags, ...)
 	else if (xutftowcs_long_path(wfilename, filename) < 0)
 		return -1;

-	fd = open_fn(wfilename, oflags, mode);
+	do {
+		fd = open_fn(wfilename, oflags, mode);
+	} while (fd < 0 && GetLastError() == ERROR_SHARING_VIOLATION &&
+		 retry_ask_yes_no(&tries, "Opening '%s' failed because another "
+			"application accessed it. Try again?", filename));

 	if ((oflags & O_CREAT) && fd >= 0 && are_wsl_compatible_mode_bits_enabled()) {
 		_mode_t wsl_mode = S_IFREG | (mode&0777);

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: v2.47.0-rc1 test failure on cygwin
  2024-10-04  9:13     ` Johannes Schindelin
@ 2024-10-04 10:09       ` Patrick Steinhardt
  2024-10-04 11:11         ` Johannes Schindelin
  0 siblings, 1 reply; 19+ messages in thread
From: Patrick Steinhardt @ 2024-10-04 10:09 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Ramsay Jones, GIT Mailing-list, Junio C Hamano

On Fri, Oct 04, 2024 at 11:13:34AM +0200, Johannes Schindelin wrote:
> Hi Patrick,
> 
> On Fri, 4 Oct 2024, Patrick Steinhardt wrote:
> 
> > On Fri, Oct 04, 2024 at 05:59:30AM +0200, Patrick Steinhardt wrote:
> > > On Fri, Oct 04, 2024 at 02:02:44AM +0100, Ramsay Jones wrote:
> > > >
> > > > Just a quick heads up: t0610-reftable-basics.sh test 47 (ref
> > > > transaction: many concurrent writers) fails on cygwin. The tail end
> > > > of the debug output for this test looks like:
> > > >
> > > [snip]
> > > >
> > > > t0610-reftable-basics.sh passed on 'rc0', but this test (and the
> > > > timeout facility) is new in 'rc1'. I tried simply increasing the
> > > > timeout (10 fold), but that didn't change the result. (I didn't
> > > > really expect it to - the 'reftable: transaction prepare: I/O error'
> > > > does not look timing related!).
> > > >
> > > > Again, just a heads up. (I can't look at it until tomorrow now; any
> > > > ideas?)
> > >
> > > This failure is kind of known and discussed in [1]. Just to make it
> > > explicit: this test failure doesn't really surface a regression, the
> > > reftable code already failed for concurrent writes before. I fixed
> > > that and added the test that is now flaky, as the fix itself is
> > > seemingly only sufficient on Linux and macOS.
> > >
> > > I didn't yet have the time to look at whether I can fix it, but should
> > > finally find the time to do so today.
> >
> > Hm, interestingly enough I cannot reproduce the issue on Cygwin myself,
> > but I can reproduce the issue with MinGW. And in fact, the logs you have
> > sent all indicate that we cannot acquire the lock, there is no sign of
> > I/O errors here. So I guess you're running into timeout issues. Does the
> > following patch fix this for you?
> >
> > diff --git a/t/t0610-reftable-basics.sh b/t/t0610-reftable-basics.sh
> > index 2d951c8ceb..b5cad805d4 100755
> > --- a/t/t0610-reftable-basics.sh
> > +++ b/t/t0610-reftable-basics.sh
> > @@ -455,10 +455,7 @@ test_expect_success 'ref transaction: many concurrent writers' '
> >  	git init repo &&
> >  	(
> >  		cd repo &&
> > -		# Set a high timeout such that a busy CI machine will not abort
> > -		# early. 10 seconds should hopefully be ample of time to make
> > -		# this non-flaky.
> > -		git config set reftable.lockTimeout 10000 &&
> > +		git config set reftable.lockTimeout -1 &&
> >  		test_commit --no-tag initial &&
> >
> >  		head=$(git rev-parse HEAD) &&
> 
> That looks plausible to me, as this test is exercising the POSIX emulation
> layer of Cygwin much more than Git itself (and therefore I expect this
> test case to take a really long time on Cygwin). If it turns out that this
> works around the issue, I would propose to use something like this
> instead:
> 
> 		# Cygwin needs a slightly longer timeout
> 		if have_prereq !CYGWIN
> 		then
> 			git config set reftable.lockTimeout 10000
> 		else
> 			git config set reftable.lockTimeout 60000
> 		fi &&

We also saw that this creates an issue when running e.g. with the memory
and leak sanitizers. So I'd just generally raise the timeout value to
something way higher to avoid flakiness, like 5 minutes.

> > The issue on Win32 is different: we cannot commit the "tables.list" lock
> > via rename(3P) because the target file may be open for reading by a
> > concurrent process. I guess that Cygwin has proper POSIX semantics for
> > rename(3P) and thus doesn't hit the same issue.
> 
> Indeed, this is where the symptom lies. I worked around it in Git for
> Windows v2.47.0-rc1 via this patch:
> 
> -- snipsnap --
> diff --git a/compat/mingw.c b/compat/mingw.c
> index c1030e40f227..56db193d1360 100644
> --- a/compat/mingw.c
> +++ b/compat/mingw.c
> @@ -784,6 +784,7 @@ int mingw_open (const char *filename, int oflags, ...)
>  	int fd, create = (oflags & (O_CREAT | O_EXCL)) == (O_CREAT | O_EXCL);
>  	wchar_t wfilename[MAX_LONG_PATH];
>  	open_fn_t open_fn;
> +	int tries = 0;
> 
>  	va_start(args, oflags);
>  	mode = va_arg(args, int);
> @@ -813,7 +814,11 @@ int mingw_open (const char *filename, int oflags, ...)
>  	else if (xutftowcs_long_path(wfilename, filename) < 0)
>  		return -1;
> 
> -	fd = open_fn(wfilename, oflags, mode);
> +	do {
> +		fd = open_fn(wfilename, oflags, mode);
> +	} while (fd < 0 && GetLastError() == ERROR_SHARING_VIOLATION &&
> +		 retry_ask_yes_no(&tries, "Opening '%s' failed because another "
> +			"application accessed it. Try again?", filename));
> 
>  	if ((oflags & O_CREAT) && fd >= 0 && are_wsl_compatible_mode_bits_enabled()) {
>  		_mode_t wsl_mode = S_IFREG | (mode&0777);

Wait, this is surprising to me as I saw the error happening when calling
rename, not open. So why would retries in `open()` fix the symptom? I'm
probably missing something.

In any case I also tried something like the below patch (sorry,
whitespace-broken). But unfortunately this still caused permission
errors when the new path was held open by another process.

I think for now I'd still lean into the direction of adding the !WINDOWS
prerequisite to the test and increasing timeouts such that I can
continue to investigate without time pressure.

Patrick

@ -2152,9 +2152,12 @@ int mingw_accept(int sockfd1, struct sockaddr *sa, socklen_t *sz)
 int mingw_rename(const char *pold, const char *pnew)
 {
        DWORD attrs, gle;
-       int tries = 0;
+       int tries = 0, wpnew_len, wpold_len;
        wchar_t wpold[MAX_PATH], wpnew[MAX_PATH];
-       if (xutftowcs_path(wpold, pold) < 0 || xutftowcs_path(wpnew, pnew) < 0)
+
+       wpold_len = xutftowcs_path(wpold, pold);
+       wpnew_len = xutftowcs_path(wpnew, pnew);
+       if (wpold_len < 0 || wpnew_len < 0)
                return -1;
 
        /*
@@ -2166,6 +2169,58 @@ int mingw_rename(const char *pold, const char *pnew)
        if (errno != EEXIST)
                return -1;
 repeat:
+       {
+#define FileRenameInfoEx                      22
+               enum {
+                       FILE_RENAME_REPLACE_IF_EXISTS         = 0x01,
+                       FILE_RENAME_POSIX_SEMANTICS           = 0x02,
+                       FILE_RENAME_IGNORE_READONLY_ATTRIBUTE = 0x40,
+               };
+               typedef struct FILE_RENAME_INFORMATION {
+                       union {
+                               BOOLEAN ReplaceIfExists;
+                               ULONG Flags;
+                       };
+                       HANDLE RootDirectory;
+                       ULONG FileNameLength;
+                       WCHAR FileName[1];
+               } *PFILE_RENAME_INFORMATION;
+               size_t wpnew_size = wpnew_len * sizeof(wchar_t);
+               PFILE_RENAME_INFORMATION fri = NULL;
+               HANDLE handle = NULL;
+               BOOL success;
+               int ret;
+
+               handle = CreateFileW(wpold, DELETE,
+                                    FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE, NULL,
+                                    OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
+               if (handle == INVALID_HANDLE_VALUE) {
+                       errno = ENOENT;
+                       ret = -1;
+                       goto out;
+               }
+
+               fri = xcalloc(1, sizeof(*fri) + wpnew_size);
+               fri->Flags = FILE_RENAME_REPLACE_IF_EXISTS | FILE_RENAME_POSIX_SEMANTICS |
+                            FILE_RENAME_IGNORE_READONLY_ATTRIBUTE;
+               fri->FileNameLength = wpnew_len;
+               memcpy(fri->FileName, wpnew, wpnew_size);
+
+               success = SetFileInformationByHandle(handle, FileRenameInfoEx,
+                                                    fri, sizeof(*fri) + wpnew_size);
+               if (!success) {
+                       errno = err_win_to_posix(GetLastError());
+                       ret = -1;
+                       goto out;
+               }
+
+               ret = 0;
+out:
+               CloseHandle(handle);
+               free(fri);
+               return ret;
+       }
+

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: v2.47.0-rc1 test failure on cygwin
  2024-10-04 10:09       ` Patrick Steinhardt
@ 2024-10-04 11:11         ` Johannes Schindelin
  2024-10-04 11:32           ` Patrick Steinhardt
  2024-10-04 16:09           ` Junio C Hamano
  0 siblings, 2 replies; 19+ messages in thread
From: Johannes Schindelin @ 2024-10-04 11:11 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: Ramsay Jones, GIT Mailing-list, Junio C Hamano

Hi Patrick,

On Fri, 4 Oct 2024, Patrick Steinhardt wrote:

> On Fri, Oct 04, 2024 at 11:13:34AM +0200, Johannes Schindelin wrote:
> >
> > On Fri, 4 Oct 2024, Patrick Steinhardt wrote:
> >
> > > diff --git a/t/t0610-reftable-basics.sh b/t/t0610-reftable-basics.sh
> > > index 2d951c8ceb..b5cad805d4 100755
> > > --- a/t/t0610-reftable-basics.sh
> > > +++ b/t/t0610-reftable-basics.sh
> > > @@ -455,10 +455,7 @@ test_expect_success 'ref transaction: many concurrent writers' '
> > >  	git init repo &&
> > >  	(
> > >  		cd repo &&
> > > -		# Set a high timeout such that a busy CI machine will not abort
> > > -		# early. 10 seconds should hopefully be ample of time to make
> > > -		# this non-flaky.
> > > -		git config set reftable.lockTimeout 10000 &&
> > > +		git config set reftable.lockTimeout -1 &&
> > >  		test_commit --no-tag initial &&
> > >
> > >  		head=$(git rev-parse HEAD) &&
> >
> > That looks plausible to me, as this test is exercising the POSIX emulation
> > layer of Cygwin much more than Git itself (and therefore I expect this
> > test case to take a really long time on Cygwin). If it turns out that this
> > works around the issue, I would propose to use something like this
> > instead:
> >
> > 		# Cygwin needs a slightly longer timeout
> > 		if have_prereq !CYGWIN
> > 		then
> > 			git config set reftable.lockTimeout 10000
> > 		else
> > 			git config set reftable.lockTimeout 60000
> > 		fi &&
>
> We also saw that this creates an issue when running e.g. with the memory
> and leak sanitizers. So I'd just generally raise the timeout value to
> something way higher to avoid flakiness, like 5 minutes.

Sounds good!

> > > The issue on Win32 is different: we cannot commit the "tables.list" lock
> > > via rename(3P) because the target file may be open for reading by a
> > > concurrent process. I guess that Cygwin has proper POSIX semantics for
> > > rename(3P) and thus doesn't hit the same issue.
> >
> > Indeed, this is where the symptom lies. I worked around it in Git for
> > Windows v2.47.0-rc1 via this patch:
> >
> > -- snipsnap --
> > diff --git a/compat/mingw.c b/compat/mingw.c
> > index c1030e40f227..56db193d1360 100644
> > --- a/compat/mingw.c
> > +++ b/compat/mingw.c
> > @@ -784,6 +784,7 @@ int mingw_open (const char *filename, int oflags, ...)
> >  	int fd, create = (oflags & (O_CREAT | O_EXCL)) == (O_CREAT | O_EXCL);
> >  	wchar_t wfilename[MAX_LONG_PATH];
> >  	open_fn_t open_fn;
> > +	int tries = 0;
> >
> >  	va_start(args, oflags);
> >  	mode = va_arg(args, int);
> > @@ -813,7 +814,11 @@ int mingw_open (const char *filename, int oflags, ...)
> >  	else if (xutftowcs_long_path(wfilename, filename) < 0)
> >  		return -1;
> >
> > -	fd = open_fn(wfilename, oflags, mode);
> > +	do {
> > +		fd = open_fn(wfilename, oflags, mode);
> > +	} while (fd < 0 && GetLastError() == ERROR_SHARING_VIOLATION &&
> > +		 retry_ask_yes_no(&tries, "Opening '%s' failed because another "
> > +			"application accessed it. Try again?", filename));
> >
> >  	if ((oflags & O_CREAT) && fd >= 0 && are_wsl_compatible_mode_bits_enabled()) {
> >  		_mode_t wsl_mode = S_IFREG | (mode&0777);
>
> Wait, this is surprising to me as I saw the error happening when calling
> rename, not open. So why would retries in `open()` fix the symptom? I'm
> probably missing something.

I am sorry, I did not really read the bug report in detail, I only meant
to unblock Git for Windows v2.47.0-rc1 and thought I had a fix in hand. It
certainly fixed the failures on my local machine, but it unfortunately did
not fix the problems in CI.

I tried to debug in CI a bit, but it is a gnarly bug to investigate, what
with plenty of processes the intentionally block each other, and no `gdb`
to help.

> In any case I also tried something like the below patch (sorry,
> whitespace-broken).

Oh, you reminded me that the `mingw_rename()` function looks
_substantially_ different in Git for Windows than in Git. I did not have
the time (or the strength of mind, more like) to upstream those changes
yet.

> But unfortunately this still caused permission errors when the new path
> was held open by another process.

Yes, this will _always_ be a problem, I think. The
`FILE_RENAME_POSIX_SEMANTICS` as per its documentation should help, but if
it does not in your tests it might actually not quite work as advertised
(wouldn't be the first time I encounter such an issue).

I tried to read through the code (it's a lot!) to figure out whether there
is potentially any situation when the `tables.list` file is opened but not
closed immediately, but couldn't find any. Do you know off-hand of any
such scenario?

> I think for now I'd still lean into the direction of adding the !WINDOWS
> prerequisite to the test and increasing timeouts such that I can
> continue to investigate without time pressure.

Let me bang my head against this problem for a little while longer. You
might be right, though, that this is a thing we cannot fix in time for
v2.47.0, which would be sad.

Ciao,
Johannes

>
> Patrick
>
> @ -2152,9 +2152,12 @@ int mingw_accept(int sockfd1, struct sockaddr *sa, socklen_t *sz)
>  int mingw_rename(const char *pold, const char *pnew)
>  {
>         DWORD attrs, gle;
> -       int tries = 0;
> +       int tries = 0, wpnew_len, wpold_len;
>         wchar_t wpold[MAX_PATH], wpnew[MAX_PATH];
> -       if (xutftowcs_path(wpold, pold) < 0 || xutftowcs_path(wpnew, pnew) < 0)
> +
> +       wpold_len = xutftowcs_path(wpold, pold);
> +       wpnew_len = xutftowcs_path(wpnew, pnew);
> +       if (wpold_len < 0 || wpnew_len < 0)
>                 return -1;
>
>         /*
> @@ -2166,6 +2169,58 @@ int mingw_rename(const char *pold, const char *pnew)
>         if (errno != EEXIST)
>                 return -1;
>  repeat:
> +       {
> +#define FileRenameInfoEx                      22
> +               enum {
> +                       FILE_RENAME_REPLACE_IF_EXISTS         = 0x01,
> +                       FILE_RENAME_POSIX_SEMANTICS           = 0x02,
> +                       FILE_RENAME_IGNORE_READONLY_ATTRIBUTE = 0x40,
> +               };
> +               typedef struct FILE_RENAME_INFORMATION {
> +                       union {
> +                               BOOLEAN ReplaceIfExists;
> +                               ULONG Flags;
> +                       };
> +                       HANDLE RootDirectory;
> +                       ULONG FileNameLength;
> +                       WCHAR FileName[1];
> +               } *PFILE_RENAME_INFORMATION;
> +               size_t wpnew_size = wpnew_len * sizeof(wchar_t);
> +               PFILE_RENAME_INFORMATION fri = NULL;
> +               HANDLE handle = NULL;
> +               BOOL success;
> +               int ret;
> +
> +               handle = CreateFileW(wpold, DELETE,
> +                                    FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE, NULL,
> +                                    OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
> +               if (handle == INVALID_HANDLE_VALUE) {
> +                       errno = ENOENT;
> +                       ret = -1;
> +                       goto out;
> +               }
> +
> +               fri = xcalloc(1, sizeof(*fri) + wpnew_size);
> +               fri->Flags = FILE_RENAME_REPLACE_IF_EXISTS | FILE_RENAME_POSIX_SEMANTICS |
> +                            FILE_RENAME_IGNORE_READONLY_ATTRIBUTE;
> +               fri->FileNameLength = wpnew_len;
> +               memcpy(fri->FileName, wpnew, wpnew_size);
> +
> +               success = SetFileInformationByHandle(handle, FileRenameInfoEx,
> +                                                    fri, sizeof(*fri) + wpnew_size);
> +               if (!success) {
> +                       errno = err_win_to_posix(GetLastError());
> +                       ret = -1;
> +                       goto out;
> +               }
> +
> +               ret = 0;
> +out:
> +               CloseHandle(handle);
> +               free(fri);
> +               return ret;
> +       }
> +
>
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: v2.47.0-rc1 test failure on cygwin
  2024-10-04 11:11         ` Johannes Schindelin
@ 2024-10-04 11:32           ` Patrick Steinhardt
  2024-10-04 16:09           ` Junio C Hamano
  1 sibling, 0 replies; 19+ messages in thread
From: Patrick Steinhardt @ 2024-10-04 11:32 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Ramsay Jones, GIT Mailing-list, Junio C Hamano

On Fri, Oct 04, 2024 at 01:11:15PM +0200, Johannes Schindelin wrote:
> Hi Patrick,
> > > > The issue on Win32 is different: we cannot commit the "tables.list" lock
> > > > via rename(3P) because the target file may be open for reading by a
> > > > concurrent process. I guess that Cygwin has proper POSIX semantics for
> > > > rename(3P) and thus doesn't hit the same issue.
> > >
> > > Indeed, this is where the symptom lies. I worked around it in Git for
> > > Windows v2.47.0-rc1 via this patch:
> > >
> > > -- snipsnap --
> > > diff --git a/compat/mingw.c b/compat/mingw.c
> > > index c1030e40f227..56db193d1360 100644
> > > --- a/compat/mingw.c
> > > +++ b/compat/mingw.c
> > > @@ -784,6 +784,7 @@ int mingw_open (const char *filename, int oflags, ...)
> > >  	int fd, create = (oflags & (O_CREAT | O_EXCL)) == (O_CREAT | O_EXCL);
> > >  	wchar_t wfilename[MAX_LONG_PATH];
> > >  	open_fn_t open_fn;
> > > +	int tries = 0;
> > >
> > >  	va_start(args, oflags);
> > >  	mode = va_arg(args, int);
> > > @@ -813,7 +814,11 @@ int mingw_open (const char *filename, int oflags, ...)
> > >  	else if (xutftowcs_long_path(wfilename, filename) < 0)
> > >  		return -1;
> > >
> > > -	fd = open_fn(wfilename, oflags, mode);
> > > +	do {
> > > +		fd = open_fn(wfilename, oflags, mode);
> > > +	} while (fd < 0 && GetLastError() == ERROR_SHARING_VIOLATION &&
> > > +		 retry_ask_yes_no(&tries, "Opening '%s' failed because another "
> > > +			"application accessed it. Try again?", filename));
> > >
> > >  	if ((oflags & O_CREAT) && fd >= 0 && are_wsl_compatible_mode_bits_enabled()) {
> > >  		_mode_t wsl_mode = S_IFREG | (mode&0777);
> >
> > Wait, this is surprising to me as I saw the error happening when calling
> > rename, not open. So why would retries in `open()` fix the symptom? I'm
> > probably missing something.
> 
> I am sorry, I did not really read the bug report in detail, I only meant
> to unblock Git for Windows v2.47.0-rc1 and thought I had a fix in hand. It
> certainly fixed the failures on my local machine, but it unfortunately did
> not fix the problems in CI.

Nothing to be sorry about. Quite on the contrary, thanks for chiming in
here, I was hoping for your input.

> I tried to debug in CI a bit, but it is a gnarly bug to investigate, what
> with plenty of processes the intentionally block each other, and no `gdb`
> to help.

Yup, indeed it is.

> > In any case I also tried something like the below patch (sorry,
> > whitespace-broken).
> 
> Oh, you reminded me that the `mingw_rename()` function looks
> _substantially_ different in Git for Windows than in Git. I did not have
> the time (or the strength of mind, more like) to upstream those changes
> yet.
> 
> > But unfortunately this still caused permission errors when the new path
> > was held open by another process.
> 
> Yes, this will _always_ be a problem, I think. The
> `FILE_RENAME_POSIX_SEMANTICS` as per its documentation should help, but if
> it does not in your tests it might actually not quite work as advertised
> (wouldn't be the first time I encounter such an issue).

It could of course be that the code I wrote is just plain wrong. I had
to stub out the struct as well as the constants because those are not
available on my Windows system, due to whatever reason. Not even when
bumping _WIN32_WINNT to 0x0A00 (W10+). Maybe this is because I'm using
Windows 10, but I thought that thish should have been available starting
with Windows 10 RC1.

> I tried to read through the code (it's a lot!) to figure out whether there
> is potentially any situation when the `tables.list` file is opened but not
> closed immediately, but couldn't find any. Do you know off-hand of any
> such scenario?

We indeed do keep the `tables.list` file descriptor open as part of our
stat cache. But that only happens on non-Windows system. This all
happens in `reftable_stack_reload_maybe_reuse()` and is documented quite
extensively in there.

> > I think for now I'd still lean into the direction of adding the !WINDOWS
> > prerequisite to the test and increasing timeouts such that I can
> > continue to investigate without time pressure.
> 
> Let me bang my head against this problem for a little while longer. You
> might be right, though, that this is a thing we cannot fix in time for
> v2.47.0, which would be sad.

A bit sad, yes, but as I mentioned this is at least not a regression.
The test just demonstrates that the improvements I did in this area are
not yet sufficient for all platforms and that I need to spend some more
time on it. Or that the central "tables.list" mechanism is not a good
fit for Windows, which would be a shame.

I'll send the patch as a reply to this thread so that it can be picked
up as required.

Patrick

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH] t0610: work around flaky test with concurrent writers
  2024-10-04  1:02 v2.47.0-rc1 test failure on cygwin Ramsay Jones
  2024-10-04  3:59 ` Patrick Steinhardt
@ 2024-10-04 12:16 ` Patrick Steinhardt
  2024-10-04 14:47   ` Ramsay Jones
  2024-10-04 16:22   ` Junio C Hamano
  2024-10-04 15:32 ` [PATCH v2] " Patrick Steinhardt
  2 siblings, 2 replies; 19+ messages in thread
From: Patrick Steinhardt @ 2024-10-04 12:16 UTC (permalink / raw)
  To: git
  Cc: Ramsay Jones, Junio C Hamano, Johannes Schindelin, Jeff King,
	Josh Steadmon

In 6241ce2170 (refs/reftable: reload locked stack when preparing
transaction, 2024-09-24) we have introduced a new test that exercises
how the reftable backend behaves with many concurrent writers all racing
with each other. This test was introduced after a couple of fixes in
this context that should make concurrent writes behave gracefully. As it
turns out though, Windows systems do not yet handle concurrent writes
properly, as we've got two reports for Cygwin and MinGW failing in this
newly added test.

The root cause of this is how we update the "tables.list" file: when
writing a new stack of tables we first write the data into a lockfile
and then rename that file into place. But Windows forbids us from doing
that rename when the target path is open for reading by another process.
And as the test races both readers and writers with each other we are
quite likely to hit this edge case.

Now the two reports are somewhat different from one another:

  - On Cygwin we hit timeouts because we fail to lock the "tables.list"
    file within 10 seconds. The renames themselves succeed even when the
    target file is open because Cygwin provides extensive compatibility
    logic to make them work even when the target file is open already.

  - On MinGW we hit I/O errors on rename. While we do have some retry
    logic in place to make the rename work in some cases, this is
    seemingly not sufficient when there is this much contention around
    the files.

Neither of these cases is a regression: the logic didn't work before the
mentioned commit, and after the commit it performs well on Linux, macOS
and in Cygwin, and at least a bit better with MinGW. But the tests show
that we need to put more thought into how to make this work properly on
MinGW systems.

The fact that Cygwin can work around this issue with better emulation of
POSIX-style atomic renames shows that we can in theory make MinGW work
better, as well. But doing so likely requires quite some fiddling with
Windows internals, and Git v2.47 is about to be released in a couple
days. This makes any potential fix quite risky as it would have to
happen deep down in our rename(3P) implementation in "compat/mingw.c".

Let's instead work around both issues by disabling the test on MinGW
and by significantly increasing the locking timeout for Cygwin. This
bumped timeout also helps when running with e.g. the address and memory
sanitizers, which also tend to significantly extend the runtime of this
test.

This should be revisited after Git v2.47 is out.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---

This fix can be applied to remove some of the stress with the Git v2.47
release pending. If would of course be preferable to find an alternate
fix that makes MinGW work as required, but if you take the 500 lines of
code that is the rename(3P) implemenation of Cygwin as a hint you
quickly figure out that this is a rather complex problem.

Patrick

 t/t0610-reftable-basics.sh | 22 +++++++++++++++++-----
 1 file changed, 17 insertions(+), 5 deletions(-)

diff --git a/t/t0610-reftable-basics.sh b/t/t0610-reftable-basics.sh
index 2d951c8ceb..86a746aff0 100755
--- a/t/t0610-reftable-basics.sh
+++ b/t/t0610-reftable-basics.sh
@@ -450,15 +450,27 @@ test_expect_success 'ref transaction: retry acquiring tables.list lock' '
 	)
 '
 
-test_expect_success 'ref transaction: many concurrent writers' '
+# This test fails most of the time on Windows systems. The root cause is
+# that Windows does not allow us to rename the "tables.list.lock" file into
+# place when "tables.list" is open for reading by a concurrent process.
+#
+# The same issue does not happen on Cygwin because its implementation of
+# rename(3P) is emulating POSIX-style renames, including renames over files
+# that are open.
+test_expect_success !MINGW 'ref transaction: many concurrent writers' '
 	test_when_finished "rm -rf repo" &&
 	git init repo &&
 	(
 		cd repo &&
-		# Set a high timeout such that a busy CI machine will not abort
-		# early. 10 seconds should hopefully be ample of time to make
-		# this non-flaky.
-		git config set reftable.lockTimeout 10000 &&
+		# Set a high timeout. While a couple of seconds should be
+		# plenty, using the address sanitizer will significantly slow
+		# us down here. Furthermore, Cygwin is also way slower due to
+		# the POSIX-style rename emulation. So we are aiming way higher
+		# than you would ever think is necessary just to keep us from
+		# flaking. We could also lock indefinitely by passing -1, but
+		# that could potentially block CI jobs indefinitely if there
+		# was a bug here.
+		git config set reftable.lockTimeout 300000 &&
 		test_commit --no-tag initial &&
 
 		head=$(git rev-parse HEAD) &&

base-commit: 111e864d69c84284441b083966c2065c2e9a4e78
-- 
2.47.0.rc0.dirty


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH] t0610: work around flaky test with concurrent writers
  2024-10-04 12:16 ` [PATCH] t0610: work around flaky test with concurrent writers Patrick Steinhardt
@ 2024-10-04 14:47   ` Ramsay Jones
  2024-10-04 15:26     ` Patrick Steinhardt
  2024-10-04 16:32     ` Junio C Hamano
  2024-10-04 16:22   ` Junio C Hamano
  1 sibling, 2 replies; 19+ messages in thread
From: Ramsay Jones @ 2024-10-04 14:47 UTC (permalink / raw)
  To: Patrick Steinhardt, git
  Cc: Junio C Hamano, Johannes Schindelin, Jeff King, Josh Steadmon

Hi Patrick,

On 04/10/2024 13:16, Patrick Steinhardt wrote:
[snip]

> Now the two reports are somewhat different from one another:
> 
>   - On Cygwin we hit timeouts because we fail to lock the "tables.list"
>     file within 10 seconds. The renames themselves succeed even when the
>     target file is open because Cygwin provides extensive compatibility
>     logic to make them work even when the target file is open already.

Hmm, not so much for me! :(

> 
>   - On MinGW we hit I/O errors on rename. While we do have some retry
>     logic in place to make the rename work in some cases, this is
>     seemingly not sufficient when there is this much contention around
>     the files.

I am seeing I/O errors.

> 
> Neither of these cases is a regression: the logic didn't work before the
> mentioned commit, and after the commit it performs well on Linux, macOS
> and in Cygwin, and at least a bit better with MinGW. But the tests show
> that we need to put more thought into how to make this work properly on
> MinGW systems.

OK

> 
> The fact that Cygwin can work around this issue with better emulation of
> POSIX-style atomic renames shows that we can in theory make MinGW work
> better, as well. But doing so likely requires quite some fiddling with
> Windows internals, and Git v2.47 is about to be released in a couple
> days. This makes any potential fix quite risky as it would have to
> happen deep down in our rename(3P) implementation in "compat/mingw.c".
> 
> Let's instead work around both issues by disabling the test on MinGW
> and by significantly increasing the locking timeout for Cygwin. This
> bumped timeout also helps when running with e.g. the address and memory
> sanitizers, which also tend to significantly extend the runtime of this
> test.

This doesn't work for me.

> 
> This should be revisited after Git v2.47 is out.
> 
> Signed-off-by: Patrick Steinhardt <ps@pks.im>
> ---

[PATCH snipped]

As I said last night, the first thing I tried was increasing the timeout
ten-fold (ie, to 100000), but that didn't change the outcome.

I am also on windows 10 (on an *old* 4th gen core i5): 
 
  $ uname -a
  CYGWIN_NT-10.0-19045 satellite 3.5.4-1.x86_64 2024-08-25 16:52 UTC x86_64 Cygwin
  $ 

I don't know why we appear to be seeing different behaviour. Since I had
already edited this test to up the timeout, I increased it again to match
the value in this (redacted) patch (ie I didn't actually apply your patch):

  $ git diff
  diff --git a/t/t0610-reftable-basics.sh b/t/t0610-reftable-basics.sh
  index 2d951c8ceb..7b6ded1c6a 100755
  --- a/t/t0610-reftable-basics.sh
  +++ b/t/t0610-reftable-basics.sh
  @@ -458,7 +458,7 @@ test_expect_success 'ref transaction: many concurrent writers' '
                  # Set a high timeout such that a busy CI machine will not abort
                  # early. 10 seconds should hopefully be ample of time to make
                  # this non-flaky.
  -               git config set reftable.lockTimeout 10000 &&
  +               git config set reftable.lockTimeout 300000 &&
                  test_commit --no-tag initial &&
   
                  head=$(git rev-parse HEAD) &&
  $ 

However, apart from less debugging output, the result was essentially the
same:

  ...
  
  ++ wait
  ++ git update-ref refs/heads/branch-100 HEAD
  fatal: update_ref failed for ref 'refs/heads/branch-97': reftable: transaction failure: I/O error
  fatal: update_ref failed for ref 'refs/heads/branch-87': reftable: transaction prepare: I/O error
  ++ git for-each-ref --sort=v:refname
  ++ test_cmp expect actual
  ++ test 2 -ne 2
  ++ eval 'diff -u' '"$@"'
  +++ diff -u expect actual
  --- expect      2024-10-04 13:31:42.450969800 +0000
  +++ actual      2024-10-04 13:32:07.097150600 +0000
  @@ -84,7 +84,6 @@
   68d032e9edd3481ac96382786ececc37ec28709e commit        refs/heads/branch-84
   68d032e9edd3481ac96382786ececc37ec28709e commit        refs/heads/branch-85
   68d032e9edd3481ac96382786ececc37ec28709e commit        refs/heads/branch-86
  -68d032e9edd3481ac96382786ececc37ec28709e commit        refs/heads/branch-87
   68d032e9edd3481ac96382786ececc37ec28709e commit        refs/heads/branch-88
   68d032e9edd3481ac96382786ececc37ec28709e commit        refs/heads/branch-89
   68d032e9edd3481ac96382786ececc37ec28709e commit        refs/heads/branch-90
  error: last command exited with $?=1
  not ok 47 - ref transaction: many concurrent writers
 
  ...

In an earlier email you suggested a timeout of -1, thus:

  $ git diff
  diff --git a/t/t0610-reftable-basics.sh b/t/t0610-reftable-basics.sh
  index 2d951c8ceb..9ab37a8d69 100755
  --- a/t/t0610-reftable-basics.sh
  +++ b/t/t0610-reftable-basics.sh
  @@ -458,7 +458,7 @@ test_expect_success 'ref transaction: many concurrent writers' '
                  # Set a high timeout such that a busy CI machine will not abort
                  # early. 10 seconds should hopefully be ample of time to make
                  # this non-flaky.
  -               git config set reftable.lockTimeout 10000 &&
  +               git config set reftable.lockTimeout -1 &&
                  test_commit --no-tag initial &&
   
                  head=$(git rev-parse HEAD) &&
  $ 

The result looks familiar:
  
  ++ wait
  ++ git update-ref refs/heads/branch-100 HEAD
  fatal: update_ref failed for ref 'refs/heads/branch-46': reftable: transaction prepare: I/O error
  ++ git for-each-ref --sort=v:refname
  ++ test_cmp expect actual
  ++ test 2 -ne 2
  ++ eval 'diff -u' '"$@"'
  +++ diff -u expect actual
  --- expect	2024-10-04 14:35:20.159564800 +0000
  +++ actual	2024-10-04 14:35:43.304762500 +0000
  @@ -43,7 +43,6 @@
   68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-43
   68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-44
   68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-45
  -68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-46
   68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-47
   68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-48
   68d032e9edd3481ac96382786ececc37ec28709e commit	refs/heads/branch-49
  error: last command exited with $?=1
  not ok 47 - ref transaction: many concurrent writers

Can you think of anything else to try?
  
I would strongly suggest skipping this test on cygwin as well as MINGW.

ATB,
Ramsay Jones

 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] t0610: work around flaky test with concurrent writers
  2024-10-04 14:47   ` Ramsay Jones
@ 2024-10-04 15:26     ` Patrick Steinhardt
  2024-10-04 16:32     ` Junio C Hamano
  1 sibling, 0 replies; 19+ messages in thread
From: Patrick Steinhardt @ 2024-10-04 15:26 UTC (permalink / raw)
  To: Ramsay Jones
  Cc: git, Junio C Hamano, Johannes Schindelin, Jeff King,
	Josh Steadmon

On Fri, Oct 04, 2024 at 03:47:01PM +0100, Ramsay Jones wrote:
> On 04/10/2024 13:16, Patrick Steinhardt wrote:
> [snip]
> 
> > Now the two reports are somewhat different from one another:
> > 
> >   - On Cygwin we hit timeouts because we fail to lock the "tables.list"
> >     file within 10 seconds. The renames themselves succeed even when the
> >     target file is open because Cygwin provides extensive compatibility
> >     logic to make them work even when the target file is open already.
> 
> Hmm, not so much for me! :(
> 
> > 
> >   - On MinGW we hit I/O errors on rename. While we do have some retry
> >     logic in place to make the rename work in some cases, this is
> >     seemingly not sufficient when there is this much contention around
> >     the files.
> 
> I am seeing I/O errors.

Interesting! I wonder why I don't see them. Maybe it's boiling down to
timing again.

[snip]
> Can you think of anything else to try?

Not really, no. I'd be curious whether Windows 11 has the same failure
mode for Cygwin, but cannot test it myself.

> I would strongly suggest skipping this test on cygwin as well as MINGW.

Yup, I definitely agree. I was operating under the assumption that
Cygwin works alright. Now that we know that it doesn't we should adapt
the condition from "!MINGW" to "!WINDOWS".

Patrick

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH v2] t0610: work around flaky test with concurrent writers
  2024-10-04  1:02 v2.47.0-rc1 test failure on cygwin Ramsay Jones
  2024-10-04  3:59 ` Patrick Steinhardt
  2024-10-04 12:16 ` [PATCH] t0610: work around flaky test with concurrent writers Patrick Steinhardt
@ 2024-10-04 15:32 ` Patrick Steinhardt
  2024-10-04 16:32   ` Ramsay Jones
                     ` (2 more replies)
  2 siblings, 3 replies; 19+ messages in thread
From: Patrick Steinhardt @ 2024-10-04 15:32 UTC (permalink / raw)
  To: git
  Cc: Ramsay Jones, Junio C Hamano, Johannes Schindelin, Jeff King,
	Josh Steadmon

In 6241ce2170 (refs/reftable: reload locked stack when preparing
transaction, 2024-09-24) we have introduced a new test that exercises
how the reftable backend behaves with many concurrent writers all racing
with each other. This test was introduced after a couple of fixes in
this context that should make concurrent writes behave gracefully. As it
turns out though, Windows systems do not yet handle concurrent writes
properly, as we've got two reports for Cygwin and MinGW failing in this
newly added test.

The root cause of this is how we update the "tables.list" file: when
writing a new stack of tables we first write the data into a lockfile
and then rename that file into place. But Windows forbids us from doing
that rename when the target path is open for reading by another process.
And as the test races both readers and writers with each other we are
quite likely to hit this edge case.

This is not a regression: the logic didn't work before the mentioned
commit, and after the commit it performs well on Linux and macOS, and
the situation on Windows should have at least improved a bit. But the
test shows that we need to put more thought into how to make this work
properly there.

Work around the issue by disabling the test on Windows for now. While at
it, increase the locking timeout to address reported timeouts when using
either the address or memory sanitizer, which also tend to significantly
extend the runtime of this test.

This should be revisited after Git v2.47 is out.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 t/t0610-reftable-basics.sh | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/t/t0610-reftable-basics.sh b/t/t0610-reftable-basics.sh
index 2d951c8ceb..babec7993e 100755
--- a/t/t0610-reftable-basics.sh
+++ b/t/t0610-reftable-basics.sh
@@ -450,15 +450,22 @@ test_expect_success 'ref transaction: retry acquiring tables.list lock' '
 	)
 '
 
-test_expect_success 'ref transaction: many concurrent writers' '
+# This test fails most of the time on Windows systems. The root cause is
+# that Windows does not allow us to rename the "tables.list.lock" file into
+# place when "tables.list" is open for reading by a concurrent process.
+test_expect_success !WINDOWS 'ref transaction: many concurrent writers' '
 	test_when_finished "rm -rf repo" &&
 	git init repo &&
 	(
 		cd repo &&
-		# Set a high timeout such that a busy CI machine will not abort
-		# early. 10 seconds should hopefully be ample of time to make
-		# this non-flaky.
-		git config set reftable.lockTimeout 10000 &&
+		# Set a high timeout. While a couple of seconds should be
+		# plenty, using the address sanitizer will significantly slow
+		# us down here. So we are aiming way higher than you would ever
+		# think is necessary just to keep us from flaking. We could
+		# also lock indefinitely by passing -1, but that could
+		# potentially block CI jobs indefinitely if there was a bug
+		# here.
+		git config set reftable.lockTimeout 300000 &&
 		test_commit --no-tag initial &&
 
 		head=$(git rev-parse HEAD) &&

Interdiff against v1:
  diff --git a/t/t0610-reftable-basics.sh b/t/t0610-reftable-basics.sh
  index 86a746aff0..babec7993e 100755
  --- a/t/t0610-reftable-basics.sh
  +++ b/t/t0610-reftable-basics.sh
  @@ -453,23 +453,18 @@ test_expect_success 'ref transaction: retry acquiring tables.list lock' '
   # This test fails most of the time on Windows systems. The root cause is
   # that Windows does not allow us to rename the "tables.list.lock" file into
   # place when "tables.list" is open for reading by a concurrent process.
  -#
  -# The same issue does not happen on Cygwin because its implementation of
  -# rename(3P) is emulating POSIX-style renames, including renames over files
  -# that are open.
  -test_expect_success !MINGW 'ref transaction: many concurrent writers' '
  +test_expect_success !WINDOWS 'ref transaction: many concurrent writers' '
   	test_when_finished "rm -rf repo" &&
   	git init repo &&
   	(
   		cd repo &&
   		# Set a high timeout. While a couple of seconds should be
   		# plenty, using the address sanitizer will significantly slow
  -		# us down here. Furthermore, Cygwin is also way slower due to
  -		# the POSIX-style rename emulation. So we are aiming way higher
  -		# than you would ever think is necessary just to keep us from
  -		# flaking. We could also lock indefinitely by passing -1, but
  -		# that could potentially block CI jobs indefinitely if there
  -		# was a bug here.
  +		# us down here. So we are aiming way higher than you would ever
  +		# think is necessary just to keep us from flaking. We could
  +		# also lock indefinitely by passing -1, but that could
  +		# potentially block CI jobs indefinitely if there was a bug
  +		# here.
   		git config set reftable.lockTimeout 300000 &&
   		test_commit --no-tag initial &&
   

base-commit: 111e864d69c84284441b083966c2065c2e9a4e78
-- 
2.47.0.rc0.dirty


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: v2.47.0-rc1 test failure on cygwin
  2024-10-04 11:11         ` Johannes Schindelin
  2024-10-04 11:32           ` Patrick Steinhardt
@ 2024-10-04 16:09           ` Junio C Hamano
  2024-10-04 17:14             ` Patrick Steinhardt
  1 sibling, 1 reply; 19+ messages in thread
From: Junio C Hamano @ 2024-10-04 16:09 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Patrick Steinhardt, Ramsay Jones, GIT Mailing-list

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

>> But unfortunately this still caused permission errors when the new path
>> was held open by another process.
>
> Yes, this will _always_ be a problem, I think. The
> `FILE_RENAME_POSIX_SEMANTICS` as per its documentation should help, but if
> it does not in your tests it might actually not quite work as advertised
> (wouldn't be the first time I encounter such an issue).
>
> I tried to read through the code (it's a lot!) to figure out whether there
> is potentially any situation when the `tables.list` file is opened but not
> closed immediately, but couldn't find any. Do you know off-hand of any
> such scenario?
>
>> I think for now I'd still lean into the direction of adding the !WINDOWS
>> prerequisite to the test and increasing timeouts such that I can
>> continue to investigate without time pressure.
>
> Let me bang my head against this problem for a little while longer. You
> might be right, though, that this is a thing we cannot fix in time for
> v2.47.0, which would be sad.

If you folks think it would help stabilizing the tentative fix, I am
open to the idea of delaying the 2.47 by a few days.  Currently the
2.47-final is scheduled on the 7th (Monday), but we can do 2.47-rc2
on that day instead, and move the final to 10th (Thu) or 11th (Fri)
[*].

Thanks, all, for working together.


[Footnote]

 * All dates are US/Pacific, 10:00 am

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] t0610: work around flaky test with concurrent writers
  2024-10-04 12:16 ` [PATCH] t0610: work around flaky test with concurrent writers Patrick Steinhardt
  2024-10-04 14:47   ` Ramsay Jones
@ 2024-10-04 16:22   ` Junio C Hamano
  1 sibling, 0 replies; 19+ messages in thread
From: Junio C Hamano @ 2024-10-04 16:22 UTC (permalink / raw)
  To: Patrick Steinhardt
  Cc: git, Ramsay Jones, Johannes Schindelin, Jeff King, Josh Steadmon

Patrick Steinhardt <ps@pks.im> writes:

> This fix can be applied to remove some of the stress with the Git v2.47
> release pending. If would of course be preferable to find an alternate
> fix that makes MinGW work as required, but if you take the 500 lines of
> code that is the rename(3P) implemenation of Cygwin as a hint you
> quickly figure out that this is a rather complex problem.

Yup, this is good for 2.47 and I agree it is sensible to punt the
real solution after the upcoming release.

Thanks.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] t0610: work around flaky test with concurrent writers
  2024-10-04 14:47   ` Ramsay Jones
  2024-10-04 15:26     ` Patrick Steinhardt
@ 2024-10-04 16:32     ` Junio C Hamano
  1 sibling, 0 replies; 19+ messages in thread
From: Junio C Hamano @ 2024-10-04 16:32 UTC (permalink / raw)
  To: Ramsay Jones
  Cc: Patrick Steinhardt, git, Johannes Schindelin, Jeff King,
	Josh Steadmon

Ramsay Jones <ramsay@ramsayjones.plus.com> writes:

> This doesn't work for me.
> Can you think of anything else to try?
>   
> I would strongly suggest skipping this test on cygwin as well as MINGW.

There was some mention of "Win 10 or later" in the thread, IIRC,
while explaining why Cygwin version of rename() works.  Is it
possible that the version of Windows and Cygwin you and Patrick
used are different enough?

In the meantime, I'll queue a SQUASH??? on top for preparing today's
integration.

This is a tangent, but there are places that use !MINGW,!CYGWIN and
there are places that use !WINDOWS; I think they are equivalent
prerequisites that we might want to straighten out their uses as a
clean-up later.



 t/t0610-reftable-basics.sh | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git c/t/t0610-reftable-basics.sh w/t/t0610-reftable-basics.sh
index 86a746aff0..e10c894cb0 100755
--- c/t/t0610-reftable-basics.sh
+++ w/t/t0610-reftable-basics.sh
@@ -454,10 +454,10 @@ test_expect_success 'ref transaction: retry acquiring tables.list lock' '
 # that Windows does not allow us to rename the "tables.list.lock" file into
 # place when "tables.list" is open for reading by a concurrent process.
 #
-# The same issue does not happen on Cygwin because its implementation of
+# The same issue may not happen on Cygwin because its implementation of
 # rename(3P) is emulating POSIX-style renames, including renames over files
-# that are open.
-test_expect_success !MINGW 'ref transaction: many concurrent writers' '
+# that are open, but it probably depends on the version.
+test_expect_success !WINDOWS 'ref transaction: many concurrent writers' '
 	test_when_finished "rm -rf repo" &&
 	git init repo &&
 	(

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] t0610: work around flaky test with concurrent writers
  2024-10-04 15:32 ` [PATCH v2] " Patrick Steinhardt
@ 2024-10-04 16:32   ` Ramsay Jones
  2024-10-04 16:35   ` Junio C Hamano
  2024-10-04 22:41   ` Jeff King
  2 siblings, 0 replies; 19+ messages in thread
From: Ramsay Jones @ 2024-10-04 16:32 UTC (permalink / raw)
  To: Patrick Steinhardt, git
  Cc: Junio C Hamano, Johannes Schindelin, Jeff King, Josh Steadmon

Hi Patrick,

On 04/10/2024 16:32, Patrick Steinhardt wrote:
> In 6241ce2170 (refs/reftable: reload locked stack when preparing
> transaction, 2024-09-24) we have introduced a new test that exercises
> how the reftable backend behaves with many concurrent writers all racing
> with each other. This test was introduced after a couple of fixes in
> this context that should make concurrent writes behave gracefully. As it
> turns out though, Windows systems do not yet handle concurrent writes
> properly, as we've got two reports for Cygwin and MinGW failing in this
> newly added test.
> 
> The root cause of this is how we update the "tables.list" file: when
> writing a new stack of tables we first write the data into a lockfile
> and then rename that file into place. But Windows forbids us from doing
> that rename when the target path is open for reading by another process.
> And as the test races both readers and writers with each other we are
> quite likely to hit this edge case.
> 
> This is not a regression: the logic didn't work before the mentioned
> commit, and after the commit it performs well on Linux and macOS, and
> the situation on Windows should have at least improved a bit. But the
> test shows that we need to put more thought into how to make this work
> properly there.
> 
> Work around the issue by disabling the test on Windows for now. While at
> it, increase the locking timeout to address reported timeouts when using
> either the address or memory sanitizer, which also tend to significantly
> extend the runtime of this test.
> 
> This should be revisited after Git v2.47 is out.
> 
> Signed-off-by: Patrick Steinhardt <ps@pks.im>
> ---
>  t/t0610-reftable-basics.sh | 17 ++++++++++++-----
>  1 file changed, 12 insertions(+), 5 deletions(-)
> 
> diff --git a/t/t0610-reftable-basics.sh b/t/t0610-reftable-basics.sh
> index 2d951c8ceb..babec7993e 100755
> --- a/t/t0610-reftable-basics.sh
> +++ b/t/t0610-reftable-basics.sh
> @@ -450,15 +450,22 @@ test_expect_success 'ref transaction: retry acquiring tables.list lock' '
>  	)
>  '
>  
> -test_expect_success 'ref transaction: many concurrent writers' '
> +# This test fails most of the time on Windows systems. The root cause is
> +# that Windows does not allow us to rename the "tables.list.lock" file into
> +# place when "tables.list" is open for reading by a concurrent process.
> +test_expect_success !WINDOWS 'ref transaction: many concurrent writers' '
>  	test_when_finished "rm -rf repo" &&
>  	git init repo &&
>  	(
>  		cd repo &&
> -		# Set a high timeout such that a busy CI machine will not abort
> -		# early. 10 seconds should hopefully be ample of time to make
> -		# this non-flaky.
> -		git config set reftable.lockTimeout 10000 &&
> +		# Set a high timeout. While a couple of seconds should be
> +		# plenty, using the address sanitizer will significantly slow
> +		# us down here. So we are aiming way higher than you would ever
> +		# think is necessary just to keep us from flaking. We could
> +		# also lock indefinitely by passing -1, but that could
> +		# potentially block CI jobs indefinitely if there was a bug
> +		# here.
> +		git config set reftable.lockTimeout 300000 &&
>  		test_commit --no-tag initial &&
>  
>  		head=$(git rev-parse HEAD) &&

I did apply this patch and (unsurprising) it now passes just fine! :)

ATB,
Ramsay Jones


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] t0610: work around flaky test with concurrent writers
  2024-10-04 15:32 ` [PATCH v2] " Patrick Steinhardt
  2024-10-04 16:32   ` Ramsay Jones
@ 2024-10-04 16:35   ` Junio C Hamano
  2024-10-04 22:41   ` Jeff King
  2 siblings, 0 replies; 19+ messages in thread
From: Junio C Hamano @ 2024-10-04 16:35 UTC (permalink / raw)
  To: Patrick Steinhardt
  Cc: git, Ramsay Jones, Johannes Schindelin, Jeff King, Josh Steadmon

Patrick Steinhardt <ps@pks.im> writes:

> This should be revisited after Git v2.47 is out.
>
> Signed-off-by: Patrick Steinhardt <ps@pks.im>
> ---
>  t/t0610-reftable-basics.sh | 17 ++++++++++++-----
>  1 file changed, 12 insertions(+), 5 deletions(-)

Ah, I spoke before I finished my inbox.  Let's use this version.

Thanks.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: v2.47.0-rc1 test failure on cygwin
  2024-10-04 16:09           ` Junio C Hamano
@ 2024-10-04 17:14             ` Patrick Steinhardt
  2024-10-04 17:54               ` Junio C Hamano
  0 siblings, 1 reply; 19+ messages in thread
From: Patrick Steinhardt @ 2024-10-04 17:14 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Johannes Schindelin, Ramsay Jones, GIT Mailing-list

On Fri, Oct 04, 2024 at 09:09:01AM -0700, Junio C Hamano wrote:
> Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
> 
> >> But unfortunately this still caused permission errors when the new path
> >> was held open by another process.
> >
> > Yes, this will _always_ be a problem, I think. The
> > `FILE_RENAME_POSIX_SEMANTICS` as per its documentation should help, but if
> > it does not in your tests it might actually not quite work as advertised
> > (wouldn't be the first time I encounter such an issue).
> >
> > I tried to read through the code (it's a lot!) to figure out whether there
> > is potentially any situation when the `tables.list` file is opened but not
> > closed immediately, but couldn't find any. Do you know off-hand of any
> > such scenario?
> >
> >> I think for now I'd still lean into the direction of adding the !WINDOWS
> >> prerequisite to the test and increasing timeouts such that I can
> >> continue to investigate without time pressure.
> >
> > Let me bang my head against this problem for a little while longer. You
> > might be right, though, that this is a thing we cannot fix in time for
> > v2.47.0, which would be sad.
> 
> If you folks think it would help stabilizing the tentative fix, I am
> open to the idea of delaying the 2.47 by a few days.  Currently the
> 2.47-final is scheduled on the 7th (Monday), but we can do 2.47-rc2
> on that day instead, and move the final to 10th (Thu) or 11th (Fri)
> [*].
> 
> Thanks, all, for working together.
> 
> 
> [Footnote]
> 
>  * All dates are US/Pacific, 10:00 am

Right now I don't yet have a good idea for how to fix the issue. I've
been trying a bunch of different things that, according to Windows docs,
should've made the renames work. But they didn't, and I don't really
have an alternative right now. So I'll need to keep on thinking about
this, and maybe get some more help from people familiar with Windows.

So deferring Git 2.47 because of it probably does not make much sense
unless somebody can up with a solution. Also, as noted already, this is
not a regression, the behaviour actually improved even on Windows. Not
to the degree I was hoping for, but at least a bit. So that's another
reason why I don't think it is worth deferring over this.

Thanks!

Patrick

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: v2.47.0-rc1 test failure on cygwin
  2024-10-04 17:14             ` Patrick Steinhardt
@ 2024-10-04 17:54               ` Junio C Hamano
  0 siblings, 0 replies; 19+ messages in thread
From: Junio C Hamano @ 2024-10-04 17:54 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: Johannes Schindelin, Ramsay Jones, GIT Mailing-list

Patrick Steinhardt <ps@pks.im> writes:

> Right now I don't yet have a good idea for how to fix the issue. I've
> been trying a bunch of different things that, according to Windows docs,
> should've made the renames work. But they didn't, and I don't really
> have an alternative right now. So I'll need to keep on thinking about
> this, and maybe get some more help from people familiar with Windows.
>
> So deferring Git 2.47 because of it probably does not make much sense
> unless somebody can up with a solution. Also, as noted already, this is
> not a regression, the behaviour actually improved even on Windows. Not
> to the degree I was hoping for, but at least a bit. So that's another
> reason why I don't think it is worth deferring over this.

OK.  

I was thinking that we may need more time to cook the real fix on
target (read: not Windows) platforms, but if that is not the case,
then let's stick to the original schedule.

Thanks.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] t0610: work around flaky test with concurrent writers
  2024-10-04 15:32 ` [PATCH v2] " Patrick Steinhardt
  2024-10-04 16:32   ` Ramsay Jones
  2024-10-04 16:35   ` Junio C Hamano
@ 2024-10-04 22:41   ` Jeff King
  2 siblings, 0 replies; 19+ messages in thread
From: Jeff King @ 2024-10-04 22:41 UTC (permalink / raw)
  To: Patrick Steinhardt
  Cc: git, Ramsay Jones, Junio C Hamano, Johannes Schindelin,
	Josh Steadmon

On Fri, Oct 04, 2024 at 05:32:16PM +0200, Patrick Steinhardt wrote:

> Work around the issue by disabling the test on Windows for now. While at
> it, increase the locking timeout to address reported timeouts when using
> either the address or memory sanitizer, which also tend to significantly
> extend the runtime of this test.
> 
> This should be revisited after Git v2.47 is out.

This sounds OK to me.

In general, I worry about papering over bugs with a "we'll revisit it
later" note. We may forget about things or never prioritize them. And
the worst-case scenario here is that we might later decide to promote
reftable to the default format, forgetting that concurrency has problems
on Windows.

But given that it's an area you've been actively working in, there's
already been some discussion towards the real fix, and we still consider
reftables somewhat experimental, I feel OK bumping it for now.

Thanks for all the digging on this (from you and from other folks).

-Peff

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2024-10-04 22:41 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-10-04  1:02 v2.47.0-rc1 test failure on cygwin Ramsay Jones
2024-10-04  3:59 ` Patrick Steinhardt
2024-10-04  6:13   ` Patrick Steinhardt
2024-10-04  9:13     ` Johannes Schindelin
2024-10-04 10:09       ` Patrick Steinhardt
2024-10-04 11:11         ` Johannes Schindelin
2024-10-04 11:32           ` Patrick Steinhardt
2024-10-04 16:09           ` Junio C Hamano
2024-10-04 17:14             ` Patrick Steinhardt
2024-10-04 17:54               ` Junio C Hamano
2024-10-04 12:16 ` [PATCH] t0610: work around flaky test with concurrent writers Patrick Steinhardt
2024-10-04 14:47   ` Ramsay Jones
2024-10-04 15:26     ` Patrick Steinhardt
2024-10-04 16:32     ` Junio C Hamano
2024-10-04 16:22   ` Junio C Hamano
2024-10-04 15:32 ` [PATCH v2] " Patrick Steinhardt
2024-10-04 16:32   ` Ramsay Jones
2024-10-04 16:35   ` Junio C Hamano
2024-10-04 22:41   ` Jeff King

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).