ETXTBSY window in _

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* ETXTBSY window in __fput
@ 2025-08-26 21:05 Alexander Monakov
  2025-08-26 22:00 ` Al Viro
                   ` (3 more replies)
  0 siblings, 4 replies; 23+ messages in thread
From: Alexander Monakov @ 2025-08-26 21:05 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Alexander Viro, Christian Brauner, Jan Kara, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1987 bytes --]

Dear fs hackers,

I suspect there's an unfortunate race window in __fput where file locks are
dropped (locks_remove_file) prior to decreasing writer refcount
(put_file_access). If I'm not mistaken, this window is observable and it
breaks a solution to ETXTBSY problem on exec'ing a just-written file, explained
in more detail below.

The program demonstrating the problem is attached (a slightly modified version
of the demo given by Russ Cox on the Go issue tracker, see URL in first line).
It makes 20 threads, each executing an infinite loop doing the following:

1) open an fd for writing with O_CLOEXEC
2) write executable code into it
3) close it
4) fork
5) in the child, attempt to execve the just-written file

If you compile it with -DNOWAIT, you'll see that execve often fails with
ETXTBSY. This happens if another thread forked while we were holding an open fd
between steps 1 and 3, our fd "leaked" in that child, and then we reached our
step 5 before that child did execve (at which point the leaked fd would be
closed thanks to O_CLOEXEC).

I suggested on the Go bugreport that the problem can be solved without any
inter-thread cooperation by utilizing BSD locks. Replace step 3 by

3a) place an exlusive lock on the file identified by fd (flock(fd, LOCK_EX))
3b) close the fd
3c) open an fd on the same path again
3d) place a lock on it again
3e) close it again

Since BSD locks are placed via the open file description, the lock placed at
step 3a is not released until all descriptors duplicated via forks are closed.
Hence, at step 3d we wait until all forked children proceeded to execve.

Recently another person tried this solution and observed that they still see the
errors, albeit at a much lower rate, about three per 30 minutes (I've not been
able to replicate that). I suspect the race window from the first paragraph
makes that possible.

If so, would it be possible to close that window? Would be nice to have this
algorithm work reliably.

Thanks.
Alexander

[-- Attachment #2: Type: text/plain, Size: 1450 bytes --]

/* ETXTBSY race example from https://github.com/golang/go/issues/22315 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <errno.h>
#include <pthread.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/wait.h>
#include <sys/file.h>

static void *runner(void *);

int
main(void)
{
	pthread_t thr[20];

	for (int i=1; i<20; i++)
		pthread_create(&thr[i], 0, runner, (void*)(uintptr_t)i);
	runner(0);
}

static const char *script = "#!/bin/sh\nexit 0\n";

static void *
runner(void *v)
{
	int i, fd, pid, status;
	char buf[100], *argv[2];

	i = (int)(uintptr_t)v;
	snprintf(buf, sizeof buf, "txtbusy-%d", i);
	argv[0] = buf;
	argv[1] = 0;
	for(;;) {
		fd = open(buf, O_WRONLY|O_CREAT|O_TRUNC|O_CLOEXEC, 0777);
		if(fd < 0) {
			perror("open");
			exit(2);
		}
		write(fd, script, strlen(script));
#ifndef NOWAIT
		flock(fd, LOCK_EX);
		close(fd);
		fd = open(buf, O_RDONLY|O_CLOEXEC);
		flock(fd, LOCK_SH);
#endif
		close(fd);
		pid = fork();
		if(pid < 0) {
			perror("fork");
			exit(2);
		}
		if(pid == 0) {
			execve(buf, argv, 0);
			exit(errno);
		}
		if(waitpid(pid, &status, 0) < 0) {
			perror("waitpid");
			exit(2);
		}
		if(!WIFEXITED(status)) {
			perror("waitpid not exited");
			exit(2);
		}
		status = WEXITSTATUS(status);
		if(status != 0)
			fprintf(stderr, "exec: %d %s\n", status, strerror(status));
	}
	return 0;
}

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: ETXTBSY window in __fput
  2025-08-26 21:05 ETXTBSY window in __fput Alexander Monakov
@ 2025-08-26 22:00 ` Al Viro
  2025-08-27  7:22   ` Alexander Monakov
  2025-08-29  7:21 ` Alexander Monakov
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 23+ messages in thread
From: Al Viro @ 2025-08-26 22:00 UTC (permalink / raw)
  To: Alexander Monakov
  Cc: linux-fsdevel, Christian Brauner, Jan Kara, linux-kernel

On Wed, Aug 27, 2025 at 12:05:38AM +0300, Alexander Monakov wrote:
> Dear fs hackers,
> 
> I suspect there's an unfortunate race window in __fput where file locks are
> dropped (locks_remove_file) prior to decreasing writer refcount
> (put_file_access). If I'm not mistaken, this window is observable and it
> breaks a solution to ETXTBSY problem on exec'ing a just-written file, explained
> in more detail below.
> 
> The program demonstrating the problem is attached (a slightly modified version
> of the demo given by Russ Cox on the Go issue tracker, see URL in first line).
> It makes 20 threads, each executing an infinite loop doing the following:
> 
> 1) open an fd for writing with O_CLOEXEC
> 2) write executable code into it
> 3) close it
> 4) fork
> 5) in the child, attempt to execve the just-written file
> 
> If you compile it with -DNOWAIT, you'll see that execve often fails with
> ETXTBSY. This happens if another thread forked while we were holding an open fd
> between steps 1 and 3, our fd "leaked" in that child, and then we reached our
> step 5 before that child did execve (at which point the leaked fd would be
> closed thanks to O_CLOEXEC).

Egads...  Let me get it straight - you have a bunch of threads sharing descriptor
tables and some of them are forking (or cloning without shared descriptor tables)
while that is going on?

Frankly, in such situation I would spawn a thread for that, did unshare(CLONE_FILES)
in it, replaced the binary and buggered off, with parent waiting for it to complete.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: ETXTBSY window in __fput
  2025-08-26 22:00 ` Al Viro
@ 2025-08-27  7:22   ` Alexander Monakov
  2025-08-27 11:52     ` Theodore Ts'o
  2025-08-27 13:16     ` Aleksa Sarai
  0 siblings, 2 replies; 23+ messages in thread
From: Alexander Monakov @ 2025-08-27  7:22 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, Christian Brauner, Jan Kara, linux-kernel

On Tue, 26 Aug 2025, Al Viro wrote:

> Egads...  Let me get it straight - you have a bunch of threads sharing descriptor
> tables and some of them are forking (or cloning without shared descriptor tables)
> while that is going on?

I suppose if they could start a new process in a more straightforward manner,
they would. But you cannot start a new process without fork. Anyway, I'm but
a messenger here: the problem has been hit by various people in the Go community
(and by Go team itself, at least twice). Here I'm asking about a potential
shortcoming in __fput that exacerbates the problem.

> Frankly, in such situation I would spawn a thread for that, did unshare(CLONE_FILES)
> in it, replaced the binary and buggered off, with parent waiting for it to complete.

Good to know, but it doesn't sound very efficient (and like something that could be
integrated in Go runtime).

Anyhow, if the alleged race window in __fput is real, why not close it for good?

Alexander

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: ETXTBSY window in __fput
  2025-08-27  7:22   ` Alexander Monakov
@ 2025-08-27 11:52     ` Theodore Ts'o
  2025-08-27 13:05       ` Alexander Monakov
  2025-08-27 13:16     ` Aleksa Sarai
  1 sibling, 1 reply; 23+ messages in thread
From: Theodore Ts'o @ 2025-08-27 11:52 UTC (permalink / raw)
  To: Alexander Monakov
  Cc: Al Viro, linux-fsdevel, Christian Brauner, Jan Kara, linux-kernel

On Wed, Aug 27, 2025 at 10:22:14AM +0300, Alexander Monakov wrote:
> 
> On Tue, 26 Aug 2025, Al Viro wrote:
> 
> > Egads...  Let me get it straight - you have a bunch of threads sharing descriptor
> > tables and some of them are forking (or cloning without shared descriptor tables)
> > while that is going on?
> 
> I suppose if they could start a new process in a more straightforward manner,
> they would. But you cannot start a new process without fork. Anyway, I'm but
> a messenger here: the problem has been hit by various people in the Go community
> (and by Go team itself, at least twice). Here I'm asking about a potential
> shortcoming in __fput that exacerbates the problem.

I'm assuming that the problem is showing up in real life when users
run a go problem using "go run" where the golang compiler freshly
writes the executable, and then fork/exec's the binary.  And using
multiple threads sharing descriptor tables was just to make a reliable
reproducer?

						- Ted

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: ETXTBSY window in __fput
  2025-08-27 11:52     ` Theodore Ts'o
@ 2025-08-27 13:05       ` Alexander Monakov
  2025-08-31 19:22         ` David Laight
  0 siblings, 1 reply; 23+ messages in thread
From: Alexander Monakov @ 2025-08-27 13:05 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Al Viro, linux-fsdevel, Christian Brauner, Jan Kara, linux-kernel

On Wed, 27 Aug 2025, Theodore Ts'o wrote:

> On Wed, Aug 27, 2025 at 10:22:14AM +0300, Alexander Monakov wrote:
> > 
> > On Tue, 26 Aug 2025, Al Viro wrote:
> > 
> > > Egads...  Let me get it straight - you have a bunch of threads sharing descriptor
> > > tables and some of them are forking (or cloning without shared descriptor tables)
> > > while that is going on?
> > 
> > I suppose if they could start a new process in a more straightforward manner,
> > they would. But you cannot start a new process without fork. Anyway, I'm but
> > a messenger here: the problem has been hit by various people in the Go community
> > (and by Go team itself, at least twice). Here I'm asking about a potential
> > shortcoming in __fput that exacerbates the problem.
> 
> I'm assuming that the problem is showing up in real life when users
> run a go problem using "go run" where the golang compiler freshly
> writes the executable, and then fork/exec's the binary.  And using
> multiple threads sharing descriptor tables was just to make a reliable
> reproducer?

You need at least two threads: while one thread does open-write-close-fork,
there needs to be another thread that forks concurrently with the write.

Alexander

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: ETXTBSY window in __fput
  2025-08-27 13:05       ` Alexander Monakov
@ 2025-08-31 19:22         ` David Laight
  2025-09-01  8:44           ` Jan Kara
  0 siblings, 1 reply; 23+ messages in thread
From: David Laight @ 2025-08-31 19:22 UTC (permalink / raw)
  To: Alexander Monakov
  Cc: Theodore Ts'o, Al Viro, linux-fsdevel, Christian Brauner,
	Jan Kara, linux-kernel

On Wed, 27 Aug 2025 16:05:51 +0300 (MSK)
Alexander Monakov <amonakov@ispras.ru> wrote:

> On Wed, 27 Aug 2025, Theodore Ts'o wrote:
> 
> > On Wed, Aug 27, 2025 at 10:22:14AM +0300, Alexander Monakov wrote:  
> > > 
> > > On Tue, 26 Aug 2025, Al Viro wrote:
> > >   
> > > > Egads...  Let me get it straight - you have a bunch of threads sharing descriptor
> > > > tables and some of them are forking (or cloning without shared descriptor tables)
> > > > while that is going on?  
> > > 
> > > I suppose if they could start a new process in a more straightforward manner,
> > > they would. But you cannot start a new process without fork. Anyway, I'm but
> > > a messenger here: the problem has been hit by various people in the Go community
> > > (and by Go team itself, at least twice). Here I'm asking about a potential
> > > shortcoming in __fput that exacerbates the problem.  
> > 
> > I'm assuming that the problem is showing up in real life when users
> > run a go problem using "go run" where the golang compiler freshly
> > writes the executable, and then fork/exec's the binary.  And using
> > multiple threads sharing descriptor tables was just to make a reliable
> > reproducer?  
> 
> You need at least two threads: while one thread does open-write-close-fork,
> there needs to be another thread that forks concurrently with the write.

Is this made worse by the code that defers fput to a worker thread?
(or am I misremembering things again?)

	David

> 
> Alexander
> 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: ETXTBSY window in __fput
  2025-08-31 19:22         ` David Laight
@ 2025-09-01  8:44           ` Jan Kara
  0 siblings, 0 replies; 23+ messages in thread
From: Jan Kara @ 2025-09-01  8:44 UTC (permalink / raw)
  To: David Laight
  Cc: Alexander Monakov, Theodore Ts'o, Al Viro, linux-fsdevel,
	Christian Brauner, Jan Kara, linux-kernel

On Sun 31-08-25 20:22:44, David Laight wrote:
> On Wed, 27 Aug 2025 16:05:51 +0300 (MSK)
> Alexander Monakov <amonakov@ispras.ru> wrote:
> 
> > On Wed, 27 Aug 2025, Theodore Ts'o wrote:
> > 
> > > On Wed, Aug 27, 2025 at 10:22:14AM +0300, Alexander Monakov wrote:  
> > > > 
> > > > On Tue, 26 Aug 2025, Al Viro wrote:
> > > >   
> > > > > Egads...  Let me get it straight - you have a bunch of threads sharing descriptor
> > > > > tables and some of them are forking (or cloning without shared descriptor tables)
> > > > > while that is going on?  
> > > > 
> > > > I suppose if they could start a new process in a more straightforward manner,
> > > > they would. But you cannot start a new process without fork. Anyway, I'm but
> > > > a messenger here: the problem has been hit by various people in the Go community
> > > > (and by Go team itself, at least twice). Here I'm asking about a potential
> > > > shortcoming in __fput that exacerbates the problem.  
> > > 
> > > I'm assuming that the problem is showing up in real life when users
> > > run a go problem using "go run" where the golang compiler freshly
> > > writes the executable, and then fork/exec's the binary.  And using
> > > multiple threads sharing descriptor tables was just to make a reliable
> > > reproducer?  
> > 
> > You need at least two threads: while one thread does open-write-close-fork,
> > there needs to be another thread that forks concurrently with the write.
> 
> Is this made worse by the code that defers fput to a worker thread?
> (or am I misremembering things again?)

fput() is offloaded to task work (i.e., it happens on exit of a task to
userspace). But I don't think it impacts this particular problem in a
significant way.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: ETXTBSY window in __fput
  2025-08-27  7:22   ` Alexander Monakov
  2025-08-27 11:52     ` Theodore Ts'o
@ 2025-08-27 13:16     ` Aleksa Sarai
  2025-08-27 14:29       ` Alexander Monakov
  1 sibling, 1 reply; 23+ messages in thread
From: Aleksa Sarai @ 2025-08-27 13:16 UTC (permalink / raw)
  To: Alexander Monakov
  Cc: Al Viro, linux-fsdevel, Christian Brauner, Jan Kara, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 877 bytes --]

On 2025-08-27, Alexander Monakov <amonakov@ispras.ru> wrote:
> > Frankly, in such situation I would spawn a thread for that, did unshare(CLONE_FILES)
> > in it, replaced the binary and buggered off, with parent waiting for it to complete.
> 
> Good to know, but it doesn't sound very efficient (and like something that could be
> integrated in Go runtime).

Can't you create a goroutine, runtime.LockOSThread,
unshare(CLONE_FILES), do the work, and then return -- without
runtime.UnlockOSThread (to kill the thread and stop it from being used
by other Go code)? Or does that not work in stdlib?

We have to do this a lot in runc and other Go programs that mess around
with unshare() or other per-thread attributes that don't play well with
Go's process model.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 265 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: ETXTBSY window in __fput
  2025-08-27 13:16     ` Aleksa Sarai
@ 2025-08-27 14:29       ` Alexander Monakov
  0 siblings, 0 replies; 23+ messages in thread
From: Alexander Monakov @ 2025-08-27 14:29 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Al Viro, linux-fsdevel, Christian Brauner, Jan Kara, linux-kernel

On Wed, 27 Aug 2025, Aleksa Sarai wrote:

> On 2025-08-27, Alexander Monakov <amonakov@ispras.ru> wrote:
> > > Frankly, in such situation I would spawn a thread for that, did unshare(CLONE_FILES)
> > > in it, replaced the binary and buggered off, with parent waiting for it to complete.
> > 
> > Good to know, but it doesn't sound very efficient (and like something that could be
> > integrated in Go runtime).
> 
> Can't you create a goroutine, runtime.LockOSThread,
> unshare(CLONE_FILES), do the work, and then return -- without
> runtime.UnlockOSThread (to kill the thread and stop it from being used
> by other Go code)? Or does that not work in stdlib?

Again, with regards to Go backstory, I'm just the messenger. But were I
a Go runtime implementor, I'm afraid no, I wouldn't be able to do that,
because while the file is being written, it's done with normal I/O APIs.
The runtime has no way to look in the future and anticipate that the
file currently being written will be execve'd.

My first priority here is not to dig up alternative solutions to the
ETXTBSY problem. I'm asking about a race window in __fput.

Alexander

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: ETXTBSY window in __fput
  2025-08-26 21:05 ETXTBSY window in __fput Alexander Monakov
  2025-08-26 22:00 ` Al Viro
@ 2025-08-29  7:21 ` Alexander Monakov
  2025-08-29  9:47   ` Christian Brauner
  2025-08-29 18:32 ` Colin Walters
  2025-09-01 18:39 ` Mateusz Guzik
  3 siblings, 1 reply; 23+ messages in thread
From: Alexander Monakov @ 2025-08-29  7:21 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Alexander Viro, Christian Brauner, Jan Kara, linux-kernel

On Wed, 27 Aug 2025, Alexander Monakov wrote:

> Dear fs hackers,
> 
> I suspect there's an unfortunate race window in __fput where file locks are
> dropped (locks_remove_file) prior to decreasing writer refcount
> (put_file_access). If I'm not mistaken, this window is observable and it
> breaks a solution to ETXTBSY problem on exec'ing a just-written file, explained
> in more detail below.

The race in __fput is a problem irrespective of how the testcase triggers it,
right? It's just showing a real-world scenario. But the issue can be
demonstrated without a multithreaded fork: imagine one process placing an
exclusive lock on a file and writing to it, another process waiting on that
lock and immediately execve'ing when the lock is released.

Can put_file_access be moved prior to locks_remove_file in __fput?

Alexander

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: ETXTBSY window in __fput
  2025-08-29  7:21 ` Alexander Monakov
@ 2025-08-29  9:47   ` Christian Brauner
  2025-08-29 10:17     ` Alexander Monakov
  0 siblings, 1 reply; 23+ messages in thread
From: Christian Brauner @ 2025-08-29  9:47 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: linux-fsdevel, Alexander Viro, Jan Kara, linux-kernel

On Fri, Aug 29, 2025 at 10:21:35AM +0300, Alexander Monakov wrote:
> 
> On Wed, 27 Aug 2025, Alexander Monakov wrote:
> 
> > Dear fs hackers,
> > 
> > I suspect there's an unfortunate race window in __fput where file locks are
> > dropped (locks_remove_file) prior to decreasing writer refcount
> > (put_file_access). If I'm not mistaken, this window is observable and it
> > breaks a solution to ETXTBSY problem on exec'ing a just-written file, explained
> > in more detail below.
> 
> The race in __fput is a problem irrespective of how the testcase triggers it,
> right? It's just showing a real-world scenario. But the issue can be
> demonstrated without a multithreaded fork: imagine one process placing an
> exclusive lock on a file and writing to it, another process waiting on that
> lock and immediately execve'ing when the lock is released.
> 
> Can put_file_access be moved prior to locks_remove_file in __fput?

Even if we fix this there's no guarantee that the kernel will give that
letting the close() of a writably opened file race against a concurrent
exec of the same file will not result in EBUSY in some arcane way
currently or in the future.

The fundamental problem is the idiotic exe_file_deny_write_access()
mechanism for execve. This is the crux of the whole go issue. I've tried
to removethis nonsense (twice?). Everytime because of some odd userspace
regression we had to revert (And now we're apparently at the stage where
in another patchset people think that this stuff needs to become a uapi
O_* flag which will absolutely not happen.).

My point is: currently you need synchronization for this to work cleanly
in some form.

But what I would do is the following. So far we've always failed to
remove the deny on write mechanism because doing it unconditionally
broke very weird use-cases with very strange justificatons.

I think we should turn this on its head and give execveat() a flag
AT_EXECVE_NODENYWRITE that allows applications to ignore the write
restrictions.

Applications like go can just start using this flag when exec'ing.
Applications that need the annoying deny-write mechanism can just
continue exec'ing as before without AT_EXECVE_NODENYWRITE.

__Completely untested and uncompiled__ draft below:

diff --git a/fs/exec.c b/fs/exec.c
index 2a1e5e4042a1..59e8fcd3fc19 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -766,7 +766,7 @@ static struct file *do_open_execat(int fd, struct filename *name, int flags)
        int err;
        struct file *file __free(fput) = NULL;
        struct open_flags open_exec_flags = {
-               .open_flag = O_LARGEFILE | O_RDONLY | __FMODE_EXEC,
+               .open_flag = O_LARGEFILE | O_RDONLY | __FMODE_EXEC | __F_EXEC_NODENYWRITE,
                .acc_mode = MAY_EXEC,
                .intent = LOOKUP_OPEN,
                .lookup_flags = LOOKUP_FOLLOW,
diff --git a/fs/fcntl.c b/fs/fcntl.c
index 5598e4d57422..b0c01cba1560 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -1158,10 +1158,10 @@ static int __init fcntl_init(void)
         * Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY
         * is defined as O_NONBLOCK on some platforms and not on others.
         */
-       BUILD_BUG_ON(20 - 1 /* for O_RDONLY being 0 */ !=
+       BUILD_BUG_ON(21 - 1 /* for O_RDONLY being 0 */ !=
                HWEIGHT32(
                        (VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)) |
-                       __FMODE_EXEC));
+                       __FMODE_EXEC | __F_EXEC_NODENYWRITE));

        fasync_cache = kmem_cache_create("fasync_cache",
                                         sizeof(struct fasync_struct), 0,
diff --git a/include/linux/fs.h b/include/linux/fs.h
index d7ab4f96d705..123f74cbe7a4 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3215,12 +3215,16 @@ static inline int exe_file_deny_write_access(struct file *exe_file)
 {
        if (unlikely(FMODE_FSNOTIFY_HSM(exe_file->f_mode)))
                return 0;
+       if (unlikely(exe_file->f_flags & __F_EXEC_NODENYWRITE))
+               return 0;
        return deny_write_access(exe_file);
 }
 static inline void exe_file_allow_write_access(struct file *exe_file)
 {
        if (unlikely(!exe_file || FMODE_FSNOTIFY_HSM(exe_file->f_mode)))
                return;
+       if (unlikely(exe_file->f_flags & __F_EXEC_NODENYWRITE))
+               return;
        allow_write_access(exe_file);
 }

diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
index 613475285643..f3c8a457bc7d 100644
--- a/include/uapi/asm-generic/fcntl.h
+++ b/include/uapi/asm-generic/fcntl.h
@@ -61,6 +61,9 @@
 #ifndef O_CLOEXEC
 #define O_CLOEXEC      02000000        /* set close_on_exec */
 #endif
+#ifdef __KERNEL__
+#define __F_EXEC_NODENYWRITE 020000000000 /* bit 31 */
+#endif

 /*
  * Before Linux 2.6.33 only O_DSYNC semantics were implemented, but using
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index f291ab4f94eb..123fb158dc5b 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -174,6 +174,7 @@
 #define AT_HANDLE_CONNECTABLE  0x002   /* Request a connectable file handle */

 /* Flags for execveat2(2). */
+#define AT_EXECVE_NODENYWRITE  0x001   /* Allow writable file descriptors to the executable file. */
 #define AT_EXECVE_CHECK                0x10000 /* Only perform a check if execution
                                           would be allowed. */

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: ETXTBSY window in __fput
  2025-08-29  9:47   ` Christian Brauner
@ 2025-08-29 10:17     ` Alexander Monakov
  2025-08-29 11:07       ` Christian Brauner
  0 siblings, 1 reply; 23+ messages in thread
From: Alexander Monakov @ 2025-08-29 10:17 UTC (permalink / raw)
  To: Christian Brauner; +Cc: linux-fsdevel, Alexander Viro, Jan Kara, linux-kernel


On Fri, 29 Aug 2025, Christian Brauner wrote:

> On Fri, Aug 29, 2025 at 10:21:35AM +0300, Alexander Monakov wrote:
> > 
> > On Wed, 27 Aug 2025, Alexander Monakov wrote:
> > 
> > > Dear fs hackers,
> > > 
> > > I suspect there's an unfortunate race window in __fput where file locks are
> > > dropped (locks_remove_file) prior to decreasing writer refcount
> > > (put_file_access). If I'm not mistaken, this window is observable and it
> > > breaks a solution to ETXTBSY problem on exec'ing a just-written file, explained
> > > in more detail below.
> > 
> > The race in __fput is a problem irrespective of how the testcase triggers it,
> > right? It's just showing a real-world scenario. But the issue can be
> > demonstrated without a multithreaded fork: imagine one process placing an
> > exclusive lock on a file and writing to it, another process waiting on that
> > lock and immediately execve'ing when the lock is released.
> > 
> > Can put_file_access be moved prior to locks_remove_file in __fput?
> 
> Even if we fix this there's no guarantee that the kernel will give that
> letting the close() of a writably opened file race against a concurrent
> exec of the same file will not result in EBUSY in some arcane way
> currently or in the future.

Forget Go and execve. Take the two-process scenario from my last email.
The program waiting on flock shouldn't be able to observe elevated
refcounts on the file after the lock is released. It matters not only
for execve, but also for unmounting the underlying filesystem, right?
And maybe other things too. So why not fix the ordering issue in __fput
and if there are other bugs breaking valid uses of flock, fix them too?

Alexander

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: ETXTBSY window in __fput
  2025-08-29 10:17     ` Alexander Monakov
@ 2025-08-29 11:07       ` Christian Brauner
  2025-08-29 11:45         ` Alexander Monakov
  0 siblings, 1 reply; 23+ messages in thread
From: Christian Brauner @ 2025-08-29 11:07 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: linux-fsdevel, Alexander Viro, Jan Kara, linux-kernel

On Fri, Aug 29, 2025 at 01:17:16PM +0300, Alexander Monakov wrote:
> 
> On Fri, 29 Aug 2025, Christian Brauner wrote:
> 
> > On Fri, Aug 29, 2025 at 10:21:35AM +0300, Alexander Monakov wrote:
> > > 
> > > On Wed, 27 Aug 2025, Alexander Monakov wrote:
> > > 
> > > > Dear fs hackers,
> > > > 
> > > > I suspect there's an unfortunate race window in __fput where file locks are
> > > > dropped (locks_remove_file) prior to decreasing writer refcount
> > > > (put_file_access). If I'm not mistaken, this window is observable and it
> > > > breaks a solution to ETXTBSY problem on exec'ing a just-written file, explained
> > > > in more detail below.
> > > 
> > > The race in __fput is a problem irrespective of how the testcase triggers it,
> > > right? It's just showing a real-world scenario. But the issue can be
> > > demonstrated without a multithreaded fork: imagine one process placing an
> > > exclusive lock on a file and writing to it, another process waiting on that
> > > lock and immediately execve'ing when the lock is released.
> > > 
> > > Can put_file_access be moved prior to locks_remove_file in __fput?
> > 
> > Even if we fix this there's no guarantee that the kernel will give that
> > letting the close() of a writably opened file race against a concurrent
> > exec of the same file will not result in EBUSY in some arcane way
> > currently or in the future.
> 
> Forget Go and execve. Take the two-process scenario from my last email.
> The program waiting on flock shouldn't be able to observe elevated
> refcounts on the file after the lock is released. It matters not only
> for execve, but also for unmounting the underlying filesystem, right?

What? No. How?: with details, please.

> And maybe other things too. So why not fix the ordering issue in __fput
> and if there are other bugs breaking valid uses of flock, fix them too?

For locks_remove_file() to be movable after put_file_access() we'd have
to prove that no filesystem implementing f_op->lock() doesn't rely on
f_op->release() to not have run. It is fundamentally backwards to have
run f_ops after f_op->release() ran.

Random quick look into 9pfs:

static int v9fs_file_do_lock(struct file *filp, int cmd, struct file_lock *fl)
{
	struct p9_flock flock;
	struct p9_fid *fid;
	uint8_t status = P9_LOCK_ERROR;
	int res = 0;
	struct v9fs_session_info *v9ses;

	fid = filp->private_data;
	BUG_ON(fid == NULL);

This relies on filp->private_data to be valid which it wouldn't be
anymore after f_op->release().

Moving put_file_access() before f_op->release() is also wrong and would
require to prove that no filesystem depends on file access not having
changed before f_op->release() has run. So no, not a trivial thing to
move around.

And you are explicitly advertising this as a fix to the go execve
problem; both in the bugtracker and here. And it's simply not a good
solution. The problem remains exec's deny-write mechanism. But hooking
into file locks to serialize exec against writable openers isn't a good
solution. It surely is creative though.

We can give userspace a simple way to sidestep this problem though as I
tried to show.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: ETXTBSY window in __fput
  2025-08-29 11:07       ` Christian Brauner
@ 2025-08-29 11:45         ` Alexander Monakov
  2025-08-29 14:02           ` Jan Kara
  0 siblings, 1 reply; 23+ messages in thread
From: Alexander Monakov @ 2025-08-29 11:45 UTC (permalink / raw)
  To: Christian Brauner; +Cc: linux-fsdevel, Alexander Viro, Jan Kara, linux-kernel


On Fri, 29 Aug 2025, Christian Brauner wrote:

> > > Even if we fix this there's no guarantee that the kernel will give that
> > > letting the close() of a writably opened file race against a concurrent
> > > exec of the same file will not result in EBUSY in some arcane way
> > > currently or in the future.
> > 
> > Forget Go and execve. Take the two-process scenario from my last email.
> > The program waiting on flock shouldn't be able to observe elevated
> > refcounts on the file after the lock is released. It matters not only
> > for execve, but also for unmounting the underlying filesystem, right?
> 
> What? No. How?: with details, please.

Apologies if there's a misunderstanding on my side, but put_file_access
does file_put_write_access, which in turn does

  mnt_put_write_access(file->f_path.mnt);

and I think elevated refcount on mnt will cause -EBUSY from mnt_hold_writers.
Which is then checked in mnt_make_readonly. I expect it affects unmount too,
just don't see exactly where.

(basically as remember non-lazy unmounting a filesystem with open files errors
out, right? as well as ro-remounting when some files are open for writing)

Alexander

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: ETXTBSY window in __fput
  2025-08-29 11:45         ` Alexander Monakov
@ 2025-08-29 14:02           ` Jan Kara
  2025-09-01 17:53             ` Alexander Monakov
  0 siblings, 1 reply; 23+ messages in thread
From: Jan Kara @ 2025-08-29 14:02 UTC (permalink / raw)
  To: Alexander Monakov
  Cc: Christian Brauner, linux-fsdevel, Alexander Viro, Jan Kara,
	linux-kernel

Hello!

On Fri 29-08-25 14:45:57, Alexander Monakov wrote:
> On Fri, 29 Aug 2025, Christian Brauner wrote:
> 
> > > > Even if we fix this there's no guarantee that the kernel will give that
> > > > letting the close() of a writably opened file race against a concurrent
> > > > exec of the same file will not result in EBUSY in some arcane way
> > > > currently or in the future.
> > > 
> > > Forget Go and execve. Take the two-process scenario from my last email.
> > > The program waiting on flock shouldn't be able to observe elevated
> > > refcounts on the file after the lock is released. It matters not only
> > > for execve, but also for unmounting the underlying filesystem, right?
> > 
> > What? No. How?: with details, please.
> 
> Apologies if there's a misunderstanding on my side, but put_file_access
> does file_put_write_access, which in turn does
> 
>   mnt_put_write_access(file->f_path.mnt);
> 
> and I think elevated refcount on mnt will cause -EBUSY from mnt_hold_writers.
> Which is then checked in mnt_make_readonly. I expect it affects unmount too,
> just don't see exactly where.

Umount (may_umount_tree()) looks at mnt->mnt_count which is decremented by
mntput() completely at the end of __fput(). I tend to agree with Christian
here: We've never promised that all effects of open fd are cleaned up
before the flock is released and as Christian explained it will be actually
pretty hard to implement such behavior. So attempts to wait for fd to close
by waiting for its flock are racy...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: ETXTBSY window in __fput
  2025-08-29 14:02           ` Jan Kara
@ 2025-09-01 17:53             ` Alexander Monakov
  2025-09-02 10:36               ` Jan Kara
  0 siblings, 1 reply; 23+ messages in thread
From: Alexander Monakov @ 2025-09-01 17:53 UTC (permalink / raw)
  To: Jan Kara; +Cc: Christian Brauner, linux-fsdevel, Alexander Viro, linux-kernel

On Fri, 29 Aug 2025, Jan Kara wrote:

> Umount (may_umount_tree()) looks at mnt->mnt_count which is decremented by
> mntput() completely at the end of __fput(). I tend to agree with Christian
> here: We've never promised that all effects of open fd are cleaned up
> before the flock is released and as Christian explained it will be actually
> pretty hard to implement such behavior. So attempts to wait for fd to close
> by waiting for its flock are racy...

(flock is not a Linux invention, if BSD implementations offered that guarantee,
I'd expect Linux to follow, but I'm not sure if they did)

That's unfortunate. If the remount/unmount issues are not convincing, shall we
try to get this issue called out in the Linux man pages? Would you help me with
wordsmithing?

How about adding the following to the NOTES section in flock.2?

Releasing the lock when a file descriptor is closed is not sequenced after all
observable effects of close(). For example, when one process places an exclusive
lock on a file, writes to it, then closes it, and another process waits on a
shared lock for the file to be closed, it may observe that subsequent execve()
fails with ETXTBSY, and umount() of the underlying filesystem fails with EBUSY,
as if the file is still open in the first process.

Alexander

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: ETXTBSY window in __fput
  2025-09-01 17:53             ` Alexander Monakov
@ 2025-09-02 10:36               ` Jan Kara
  0 siblings, 0 replies; 23+ messages in thread
From: Jan Kara @ 2025-09-02 10:36 UTC (permalink / raw)
  To: Alexander Monakov
  Cc: Jan Kara, Christian Brauner, linux-fsdevel, Alexander Viro,
	linux-kernel

On Mon 01-09-25 20:53:38, Alexander Monakov wrote:
> 
> On Fri, 29 Aug 2025, Jan Kara wrote:
> 
> > Umount (may_umount_tree()) looks at mnt->mnt_count which is decremented by
> > mntput() completely at the end of __fput(). I tend to agree with Christian
> > here: We've never promised that all effects of open fd are cleaned up
> > before the flock is released and as Christian explained it will be actually
> > pretty hard to implement such behavior. So attempts to wait for fd to close
> > by waiting for its flock are racy...
> 
> (flock is not a Linux invention, if BSD implementations offered that guarantee,
> I'd expect Linux to follow, but I'm not sure if they did)
> 
> That's unfortunate. If the remount/unmount issues are not convincing, shall we
> try to get this issue called out in the Linux man pages? Would you help me with
> wordsmithing?
> 
> How about adding the following to the NOTES section in flock.2?
> 
> Releasing the lock when a file descriptor is closed is not sequenced
> after all observable effects of close(). For example, when one process
> places an exclusive lock on a file, writes to it, then closes it, and
> another process waits on a shared lock for the file to be closed, it may
> observe that subsequent execve() fails with ETXTBSY, and umount() of the
> underlying filesystem fails with EBUSY, as if the file is still open in
> the first process.

The paragraph sounds good to me. Thanks for writing it!

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: ETXTBSY window in __fput
  2025-08-26 21:05 ETXTBSY window in __fput Alexander Monakov
  2025-08-26 22:00 ` Al Viro
  2025-08-29  7:21 ` Alexander Monakov
@ 2025-08-29 18:32 ` Colin Walters
  2025-09-01 18:39 ` Mateusz Guzik
  3 siblings, 0 replies; 23+ messages in thread
From: Colin Walters @ 2025-08-29 18:32 UTC (permalink / raw)
  To: Alexander Monakov, linux-fsdevel
  Cc: Al Viro, Christian Brauner, Jan Kara, linux-kernel



On Tue, Aug 26, 2025, at 5:05 PM, Alexander Monakov wrote:
> Dear fs hackers,
>
> I suspect there's an unfortunate race window in __fput where file locks are
> dropped (locks_remove_file) prior to decreasing writer refcount
> (put_file_access). If I'm not mistaken, this window is observable and it
> breaks a solution to ETXTBSY problem on exec'ing a just-written file, explained
> in more detail below.
>
> The program demonstrating the problem is attached (a slightly modified version
> of the demo given by Russ Cox on the Go issue tracker, see URL in first line).
> It makes 20 threads, each executing an infinite loop doing the following:
>
> 1) open an fd for writing with O_CLOEXEC
> 2) write executable code into it
> 3) close it
> 4) fork
> 5) in the child, attempt to execve the just-written file
>
> If you compile it with -DNOWAIT, you'll see that execve often fails with
> ETXTBSY. This happens if another thread forked while we were holding an open fd
> between steps 1 and 3, our fd "leaked" in that child, and then we reached our
> step 5 before that child did execve (at which point the leaked fd would be
> closed thanks to O_CLOEXEC).

This one looks to be the same as what we hit in https://github.com/containers/composefs-rs/issues/106 but with fsverity.

I came to the conclusion that we want O_CLOFORK (ref https://www.austingroupbugs.net/view.php?id=1318 ).




^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: ETXTBSY window in __fput
  2025-08-26 21:05 ETXTBSY window in __fput Alexander Monakov
                   ` (2 preceding siblings ...)
  2025-08-29 18:32 ` Colin Walters
@ 2025-09-01 18:39 ` Mateusz Guzik
  2025-09-01 19:57   ` Colin Walters
  2025-09-02  8:33   ` Christian Brauner
  3 siblings, 2 replies; 23+ messages in thread
From: Mateusz Guzik @ 2025-09-01 18:39 UTC (permalink / raw)
  To: Alexander Monakov
  Cc: linux-fsdevel, Alexander Viro, Christian Brauner, Jan Kara,
	linux-kernel

On Wed, Aug 27, 2025 at 12:05:38AM +0300, Alexander Monakov wrote:
> Dear fs hackers,
> 
> I suspect there's an unfortunate race window in __fput where file locks are
> dropped (locks_remove_file) prior to decreasing writer refcount
> (put_file_access). If I'm not mistaken, this window is observable and it
> breaks a solution to ETXTBSY problem on exec'ing a just-written file, explained
> in more detail below.
> 
> The program demonstrating the problem is attached (a slightly modified version
> of the demo given by Russ Cox on the Go issue tracker, see URL in first line).
> It makes 20 threads, each executing an infinite loop doing the following:
> 
> 1) open an fd for writing with O_CLOEXEC
> 2) write executable code into it
> 3) close it
> 4) fork
> 5) in the child, attempt to execve the just-written file
> 
> If you compile it with -DNOWAIT, you'll see that execve often fails with
> ETXTBSY.

This problem was reported a few times and is quite ancient by now.

While acknowleding the resulting behavior needs to be fixed, I find the
proposed solutions are merely trying to put more lipstick or a wig on a
pig.

The age of the problem suggests it is not *urgent* to fix it.

The O_CLOFORM idea was accepted into POSIX and recent-ish implemented in
all the BSDs (no, really) and illumos, but got NAKed in Linux. It's also
a part of pig's attire so I think that's the right call.

Not denying execs of files open for writing had to get reverted as
apparently some software depends on it, so that's a no-go either.

The flag proposed by Christian elsewhere in the thread would sort this
out, but it's just another hack which would serve no purpose if the
issue stopped showing up.

The real problem is fork()+execve() combo being crap syscalls with crap
semantics, perpetuating the unix tradition of screwing you over unless
you explicitly ask it not to (e.g., with O_CLOEXEC so that the new proc
does not hang out with surprise fds).

While I don't have anything fleshed out nor have any interest in putting
any work in the area, I would suggest anyone looking to solve the ETXTBSY
went after the real culprit instead of damage-controlling the current
API.

To that end, my sketch of a suggestion boils down to a new API which
allows you to construct a new process one step at a time explicitly
spelling out resources which are going to get passed on, finally doing
an actual exec. You would start with getting a file descriptor to a new
task_struct which you gradually populate and eventually exec something
on. There would be no forking.

It could look like this (ignore specific naming):

/* get a file descriptor for the new process. there is no *fork* here,
 * but task_struct & related get allocated
 * clean slate, no sigmask bullshit and similar
 */
pfd = proc_new();

nullfd = open("/dev/null", O_RDONLY);

/* map /dev/null as 0/1/2 in the new proc */
proc_install_fd(pfd, nullfd, 0); 
proc_install_fd(pfd, nullfd, 2); 
proc_install_fd(pfd, nullfd, 2); 

/* if we can run the proc as someone else, set it up here */
proc_install_cred(pfd, uid, gid, groups, ...);

proc_set_umask(pfd, ...);

/* finally exec */
proc_exec_by_path("/bin/sh", argp, envp);

Notice how not once at any point random-ass file descriptors popped into
the new task, which has a side effect of completely avoiding the
problem.

you may also notice this should be faster to execute as it does not have
to pay the mm overhead.

While proc_install_fd is spelled out as singular syscalls, this can be
batched to accept an array of <from, to> pairs etc.

Also notice the thread executing it is not shackled by any of vfork
limitations.

So... if someone is serious about the transient ETXTBSY, I would really
hope you will consider solving the source of the problem, even if you
come up with someting other than I did (hopefully better). It would be a
damn shame to add even more hacks to pacify this problem (like the O_
stuff).

What to do in the meantime? There is a lol hack you can do in userspace
which so ugly I'm not even going to spell it out, but given the
temporary nature of ETXTBSY I'm sure you can guess what it is.

Something to ponder, cheers.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: ETXTBSY window in __fput
  2025-09-01 18:39 ` Mateusz Guzik
@ 2025-09-01 19:57   ` Colin Walters
  2025-09-01 20:22     ` Mateusz Guzik
  2025-09-02  8:33   ` Christian Brauner
  1 sibling, 1 reply; 23+ messages in thread
From: Colin Walters @ 2025-09-01 19:57 UTC (permalink / raw)
  To: Mateusz Guzik, Alexander Monakov
  Cc: linux-fsdevel, Al Viro, Christian Brauner, Jan Kara, linux-kernel



On Mon, Sep 1, 2025, at 2:39 PM, Mateusz Guzik wrote:
> 
> The O_CLOFORM idea was accepted into POSIX and recent-ish implemented in
> all the BSDs (no, really) and illumos, but got NAKed in Linux. It's also
> a part of pig's attire so I think that's the right call.

Do you have a reference handy for that NAK?

> To that end, my sketch of a suggestion boils down to a new API which
> allows you to construct a new process one step at a time 

In this vein I think io_uring_spawn work sounds like the best: https://lwn.net/Articles/908268/

However...if we predicate any solution to this problem on changing every single codebase which is spawning processes, it's going to take a long time. I think changing the few special cases around "sealing" (fsverity and write + fexecve()) is more tractable.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: ETXTBSY window in __fput
  2025-09-01 19:57   ` Colin Walters
@ 2025-09-01 20:22     ` Mateusz Guzik
  0 siblings, 0 replies; 23+ messages in thread
From: Mateusz Guzik @ 2025-09-01 20:22 UTC (permalink / raw)
  To: Colin Walters
  Cc: Alexander Monakov, linux-fsdevel, Al Viro, Christian Brauner,
	Jan Kara, linux-kernel

On Mon, Sep 1, 2025 at 9:57 PM Colin Walters <walters@verbum.org> wrote:
> On Mon, Sep 1, 2025, at 2:39 PM, Mateusz Guzik wrote:
> >
> > The O_CLOFORM idea was accepted into POSIX and recent-ish implemented in
> > all the BSDs (no, really) and illumos, but got NAKed in Linux. It's also
> > a part of pig's attire so I think that's the right call.
>
> Do you have a reference handy for that NAK?
>

https://lore.kernel.org/linux-fsdevel/20200515160342.GE23230@ZenIV.linux.org.uk/

> > To that end, my sketch of a suggestion boils down to a new API which
> > allows you to construct a new process one step at a time
>
> In this vein I think io_uring_spawn work sounds like the best: https://lwn.net/Articles/908268/
>

Indeed sounds like the same core idea, I don't understand why tie it
to io_uring though.

> However...if we predicate any solution to this problem on changing every single codebase which is spawning processes, it's going to take a long time. I think changing the few special cases around "sealing" (fsverity and write + fexecve()) is more tractable.

The problem does not concern every single codebase which spawns a
process. It concerns the few which are heavily multithreaded while
creating binaries they are about to exec.

If they have to be patched in your proposal anyway, they may as well
use the new API.

For non-affected consumers, I was mostly thinking make and shells just
from perf standpoint.
-- 
Mateusz Guzik <mjguzik gmail.com>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: ETXTBSY window in __fput
  2025-09-01 18:39 ` Mateusz Guzik
  2025-09-01 19:57   ` Colin Walters
@ 2025-09-02  8:33   ` Christian Brauner
  2025-09-02  8:44     ` Mateusz Guzik
  1 sibling, 1 reply; 23+ messages in thread
From: Christian Brauner @ 2025-09-02  8:33 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: Alexander Monakov, linux-fsdevel, Alexander Viro, Jan Kara,
	linux-kernel

On Mon, Sep 01, 2025 at 08:39:27PM +0200, Mateusz Guzik wrote:
> On Wed, Aug 27, 2025 at 12:05:38AM +0300, Alexander Monakov wrote:
> > Dear fs hackers,
> > 
> > I suspect there's an unfortunate race window in __fput where file locks are
> > dropped (locks_remove_file) prior to decreasing writer refcount
> > (put_file_access). If I'm not mistaken, this window is observable and it
> > breaks a solution to ETXTBSY problem on exec'ing a just-written file, explained
> > in more detail below.
> > 
> > The program demonstrating the problem is attached (a slightly modified version
> > of the demo given by Russ Cox on the Go issue tracker, see URL in first line).
> > It makes 20 threads, each executing an infinite loop doing the following:
> > 
> > 1) open an fd for writing with O_CLOEXEC
> > 2) write executable code into it
> > 3) close it
> > 4) fork
> > 5) in the child, attempt to execve the just-written file
> > 
> > If you compile it with -DNOWAIT, you'll see that execve often fails with
> > ETXTBSY.
> 
> This problem was reported a few times and is quite ancient by now.
> 
> While acknowleding the resulting behavior needs to be fixed, I find the
> proposed solutions are merely trying to put more lipstick or a wig on a
> pig.
> 
> The age of the problem suggests it is not *urgent* to fix it.
> 
> The O_CLOFORM idea was accepted into POSIX and recent-ish implemented in
> all the BSDs (no, really) and illumos, but got NAKed in Linux. It's also
> a part of pig's attire so I think that's the right call.
> 
> Not denying execs of files open for writing had to get reverted as
> apparently some software depends on it, so that's a no-go either.
> 
> The flag proposed by Christian elsewhere in the thread would sort this
> out, but it's just another hack which would serve no purpose if the
> issue stopped showing up.
> 
> The real problem is fork()+execve() combo being crap syscalls with crap
> semantics, perpetuating the unix tradition of screwing you over unless
> you explicitly ask it not to (e.g., with O_CLOEXEC so that the new proc
> does not hang out with surprise fds).
> 
> While I don't have anything fleshed out nor have any interest in putting
> any work in the area, I would suggest anyone looking to solve the ETXTBSY
> went after the real culprit instead of damage-controlling the current
> API.
> 
> To that end, my sketch of a suggestion boils down to a new API which
> allows you to construct a new process one step at a time explicitly
> spelling out resources which are going to get passed on, finally doing
> an actual exec. You would start with getting a file descriptor to a new
> task_struct which you gradually populate and eventually exec something
> on. There would be no forking.
> 
> It could look like this (ignore specific naming):
> 
> /* get a file descriptor for the new process. there is no *fork* here,
>  * but task_struct & related get allocated
>  * clean slate, no sigmask bullshit and similar
>  */
> pfd = proc_new();
> 
> nullfd = open("/dev/null", O_RDONLY);
> 
> /* map /dev/null as 0/1/2 in the new proc */
> proc_install_fd(pfd, nullfd, 0); 
> proc_install_fd(pfd, nullfd, 2); 
> proc_install_fd(pfd, nullfd, 2); 
> 
> /* if we can run the proc as someone else, set it up here */
> proc_install_cred(pfd, uid, gid, groups, ...);
> 
> proc_set_umask(pfd, ...);
> 
> /* finally exec */
> proc_exec_by_path("/bin/sh", argp, envp);

You can trivially build this API on top of pidfs. Like:

pidfd_empty = pidfd_open(FD_PIDFS_ROOT/FD_INVALID, PIDFD_EMPTY)

where FD_PIDFS_ROOT and FD_INVALID are things we already have in the
uapi headers.

Then either just add a new system call like pidfd_config() or just use
ioctls() on that empty pidfd.

With pidfs you have the complete freedom to implement that api however
you want.

I had a prototype for that as well but I can't find it anymore. Other
VFS work took priority and so I never finished it but I remember it
wasn't very difficult.

I would definitely merge a patch series like that.

> 
> Notice how not once at any point random-ass file descriptors popped into
> the new task, which has a side effect of completely avoiding the
> problem.
> 
> you may also notice this should be faster to execute as it does not have
> to pay the mm overhead.
> 
> While proc_install_fd is spelled out as singular syscalls, this can be
> batched to accept an array of <from, to> pairs etc.
> 
> Also notice the thread executing it is not shackled by any of vfork
> limitations.
> 
> So... if someone is serious about the transient ETXTBSY, I would really
> hope you will consider solving the source of the problem, even if you
> come up with someting other than I did (hopefully better). It would be a
> damn shame to add even more hacks to pacify this problem (like the O_
> stuff).
> 
> What to do in the meantime? There is a lol hack you can do in userspace
> which so ugly I'm not even going to spell it out, but given the
> temporary nature of ETXTBSY I'm sure you can guess what it is.
> 
> Something to ponder, cheers.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: ETXTBSY window in __fput
  2025-09-02  8:33   ` Christian Brauner
@ 2025-09-02  8:44     ` Mateusz Guzik
  0 siblings, 0 replies; 23+ messages in thread
From: Mateusz Guzik @ 2025-09-02  8:44 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Alexander Monakov, linux-fsdevel, Alexander Viro, Jan Kara,
	linux-kernel

On Tue, Sep 2, 2025 at 10:33 AM Christian Brauner <brauner@kernel.org> wrote:
>
> On Mon, Sep 01, 2025 at 08:39:27PM +0200, Mateusz Guzik wrote:
> > On Wed, Aug 27, 2025 at 12:05:38AM +0300, Alexander Monakov wrote:
> > > Dear fs hackers,
> > >
> > > I suspect there's an unfortunate race window in __fput where file locks are
> > > dropped (locks_remove_file) prior to decreasing writer refcount
> > > (put_file_access). If I'm not mistaken, this window is observable and it
> > > breaks a solution to ETXTBSY problem on exec'ing a just-written file, explained
> > > in more detail below.
> > >
> > > The program demonstrating the problem is attached (a slightly modified version
> > > of the demo given by Russ Cox on the Go issue tracker, see URL in first line).
> > > It makes 20 threads, each executing an infinite loop doing the following:
> > >
> > > 1) open an fd for writing with O_CLOEXEC
> > > 2) write executable code into it
> > > 3) close it
> > > 4) fork
> > > 5) in the child, attempt to execve the just-written file
> > >
> > > If you compile it with -DNOWAIT, you'll see that execve often fails with
> > > ETXTBSY.
> >
> > This problem was reported a few times and is quite ancient by now.
> >
> > While acknowleding the resulting behavior needs to be fixed, I find the
> > proposed solutions are merely trying to put more lipstick or a wig on a
> > pig.
> >
> > The age of the problem suggests it is not *urgent* to fix it.
> >
> > The O_CLOFORM idea was accepted into POSIX and recent-ish implemented in
> > all the BSDs (no, really) and illumos, but got NAKed in Linux. It's also
> > a part of pig's attire so I think that's the right call.
> >
> > Not denying execs of files open for writing had to get reverted as
> > apparently some software depends on it, so that's a no-go either.
> >
> > The flag proposed by Christian elsewhere in the thread would sort this
> > out, but it's just another hack which would serve no purpose if the
> > issue stopped showing up.
> >
> > The real problem is fork()+execve() combo being crap syscalls with crap
> > semantics, perpetuating the unix tradition of screwing you over unless
> > you explicitly ask it not to (e.g., with O_CLOEXEC so that the new proc
> > does not hang out with surprise fds).
> >
> > While I don't have anything fleshed out nor have any interest in putting
> > any work in the area, I would suggest anyone looking to solve the ETXTBSY
> > went after the real culprit instead of damage-controlling the current
> > API.
> >
> > To that end, my sketch of a suggestion boils down to a new API which
> > allows you to construct a new process one step at a time explicitly
> > spelling out resources which are going to get passed on, finally doing
> > an actual exec. You would start with getting a file descriptor to a new
> > task_struct which you gradually populate and eventually exec something
> > on. There would be no forking.
> >
> > It could look like this (ignore specific naming):
> >
> > /* get a file descriptor for the new process. there is no *fork* here,
> >  * but task_struct & related get allocated
> >  * clean slate, no sigmask bullshit and similar
> >  */
> > pfd = proc_new();
> >
> > nullfd = open("/dev/null", O_RDONLY);
> >
> > /* map /dev/null as 0/1/2 in the new proc */
> > proc_install_fd(pfd, nullfd, 0);
> > proc_install_fd(pfd, nullfd, 2);
> > proc_install_fd(pfd, nullfd, 2);
> >
> > /* if we can run the proc as someone else, set it up here */
> > proc_install_cred(pfd, uid, gid, groups, ...);
> >
> > proc_set_umask(pfd, ...);
> >
> > /* finally exec */
> > proc_exec_by_path("/bin/sh", argp, envp);
>
> You can trivially build this API on top of pidfs. Like:
>
> pidfd_empty = pidfd_open(FD_PIDFS_ROOT/FD_INVALID, PIDFD_EMPTY)
>
> where FD_PIDFS_ROOT and FD_INVALID are things we already have in the
> uapi headers.
>
> Then either just add a new system call like pidfd_config() or just use
> ioctls() on that empty pidfd.
>
> With pidfs you have the complete freedom to implement that api however
> you want.
>
> I had a prototype for that as well but I can't find it anymore. Other
> VFS work took priority and so I never finished it but I remember it
> wasn't very difficult.
>
> I would definitely merge a patch series like that.
>

I'm happy we are on the same page here, I'm unhappy it is down to the
point of neither of us doing the work though. :-P

I don't think the work is difficult from tech standpoint, but it does
warrant some caution.

With the current model fork + exec model, whatever process-specific
bullshit gets added, you automatically get it on fork.

Something which gradually spells out the state will need a way to
figure out what was not sorted out and fill it up in kernel-side.
There maybe some other caveats, basically it would be good if
someone(tm) really thought this through. It would be really
embarrassing to end up with a deficient & unfixable API which has to
remain supported. :-P

I offer half of the kingdom for a finished product, the interested
party will have to arrange a bride from elsewhere.

> >
> > Notice how not once at any point random-ass file descriptors popped into
> > the new task, which has a side effect of completely avoiding the
> > problem.
> >
> > you may also notice this should be faster to execute as it does not have
> > to pay the mm overhead.
> >
> > While proc_install_fd is spelled out as singular syscalls, this can be
> > batched to accept an array of <from, to> pairs etc.
> >
> > Also notice the thread executing it is not shackled by any of vfork
> > limitations.
> >
> > So... if someone is serious about the transient ETXTBSY, I would really
> > hope you will consider solving the source of the problem, even if you
> > come up with someting other than I did (hopefully better). It would be a
> > damn shame to add even more hacks to pacify this problem (like the O_
> > stuff).
> >
> > What to do in the meantime? There is a lol hack you can do in userspace
> > which so ugly I'm not even going to spell it out, but given the
> > temporary nature of ETXTBSY I'm sure you can guess what it is.
> >
> > Something to ponder, cheers.



-- 
Mateusz Guzik <mjguzik gmail.com>

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2025-09-02 10:36 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-08-26 21:05 ETXTBSY window in __fput Alexander Monakov
2025-08-26 22:00 ` Al Viro
2025-08-27  7:22   ` Alexander Monakov
2025-08-27 11:52     ` Theodore Ts'o
2025-08-27 13:05       ` Alexander Monakov
2025-08-31 19:22         ` David Laight
2025-09-01  8:44           ` Jan Kara
2025-08-27 13:16     ` Aleksa Sarai
2025-08-27 14:29       ` Alexander Monakov
2025-08-29  7:21 ` Alexander Monakov
2025-08-29  9:47   ` Christian Brauner
2025-08-29 10:17     ` Alexander Monakov
2025-08-29 11:07       ` Christian Brauner
2025-08-29 11:45         ` Alexander Monakov
2025-08-29 14:02           ` Jan Kara
2025-09-01 17:53             ` Alexander Monakov
2025-09-02 10:36               ` Jan Kara
2025-08-29 18:32 ` Colin Walters
2025-09-01 18:39 ` Mateusz Guzik
2025-09-01 19:57   ` Colin Walters
2025-09-01 20:22     ` Mateusz Guzik
2025-09-02  8:33   ` Christian Brauner
2025-09-02  8:44     ` Mateusz Guzik

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).