* ETXTBSY window in __fput @ 2025-08-26 21:05 Alexander Monakov 2025-08-26 22:00 ` Al Viro ` (3 more replies) 0 siblings, 4 replies; 23+ messages in thread From: Alexander Monakov @ 2025-08-26 21:05 UTC (permalink / raw) To: linux-fsdevel; +Cc: Alexander Viro, Christian Brauner, Jan Kara, linux-kernel [-- Attachment #1: Type: text/plain, Size: 1987 bytes --] Dear fs hackers, I suspect there's an unfortunate race window in __fput where file locks are dropped (locks_remove_file) prior to decreasing writer refcount (put_file_access). If I'm not mistaken, this window is observable and it breaks a solution to ETXTBSY problem on exec'ing a just-written file, explained in more detail below. The program demonstrating the problem is attached (a slightly modified version of the demo given by Russ Cox on the Go issue tracker, see URL in first line). It makes 20 threads, each executing an infinite loop doing the following: 1) open an fd for writing with O_CLOEXEC 2) write executable code into it 3) close it 4) fork 5) in the child, attempt to execve the just-written file If you compile it with -DNOWAIT, you'll see that execve often fails with ETXTBSY. This happens if another thread forked while we were holding an open fd between steps 1 and 3, our fd "leaked" in that child, and then we reached our step 5 before that child did execve (at which point the leaked fd would be closed thanks to O_CLOEXEC). I suggested on the Go bugreport that the problem can be solved without any inter-thread cooperation by utilizing BSD locks. Replace step 3 by 3a) place an exlusive lock on the file identified by fd (flock(fd, LOCK_EX)) 3b) close the fd 3c) open an fd on the same path again 3d) place a lock on it again 3e) close it again Since BSD locks are placed via the open file description, the lock placed at step 3a is not released until all descriptors duplicated via forks are closed. Hence, at step 3d we wait until all forked children proceeded to execve. Recently another person tried this solution and observed that they still see the errors, albeit at a much lower rate, about three per 30 minutes (I've not been able to replicate that). I suspect the race window from the first paragraph makes that possible. If so, would it be possible to close that window? Would be nice to have this algorithm work reliably. Thanks. Alexander [-- Attachment #2: Type: text/plain, Size: 1450 bytes --] /* ETXTBSY race example from https://github.com/golang/go/issues/22315 */ #include <stdint.h> #include <stdio.h> #include <string.h> #include <stdlib.h> #include <errno.h> #include <pthread.h> #include <unistd.h> #include <fcntl.h> #include <sys/wait.h> #include <sys/file.h> static void *runner(void *); int main(void) { pthread_t thr[20]; for (int i=1; i<20; i++) pthread_create(&thr[i], 0, runner, (void*)(uintptr_t)i); runner(0); } static const char *script = "#!/bin/sh\nexit 0\n"; static void * runner(void *v) { int i, fd, pid, status; char buf[100], *argv[2]; i = (int)(uintptr_t)v; snprintf(buf, sizeof buf, "txtbusy-%d", i); argv[0] = buf; argv[1] = 0; for(;;) { fd = open(buf, O_WRONLY|O_CREAT|O_TRUNC|O_CLOEXEC, 0777); if(fd < 0) { perror("open"); exit(2); } write(fd, script, strlen(script)); #ifndef NOWAIT flock(fd, LOCK_EX); close(fd); fd = open(buf, O_RDONLY|O_CLOEXEC); flock(fd, LOCK_SH); #endif close(fd); pid = fork(); if(pid < 0) { perror("fork"); exit(2); } if(pid == 0) { execve(buf, argv, 0); exit(errno); } if(waitpid(pid, &status, 0) < 0) { perror("waitpid"); exit(2); } if(!WIFEXITED(status)) { perror("waitpid not exited"); exit(2); } status = WEXITSTATUS(status); if(status != 0) fprintf(stderr, "exec: %d %s\n", status, strerror(status)); } return 0; } ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: ETXTBSY window in __fput 2025-08-26 21:05 ETXTBSY window in __fput Alexander Monakov @ 2025-08-26 22:00 ` Al Viro 2025-08-27 7:22 ` Alexander Monakov 2025-08-29 7:21 ` Alexander Monakov ` (2 subsequent siblings) 3 siblings, 1 reply; 23+ messages in thread From: Al Viro @ 2025-08-26 22:00 UTC (permalink / raw) To: Alexander Monakov Cc: linux-fsdevel, Christian Brauner, Jan Kara, linux-kernel On Wed, Aug 27, 2025 at 12:05:38AM +0300, Alexander Monakov wrote: > Dear fs hackers, > > I suspect there's an unfortunate race window in __fput where file locks are > dropped (locks_remove_file) prior to decreasing writer refcount > (put_file_access). If I'm not mistaken, this window is observable and it > breaks a solution to ETXTBSY problem on exec'ing a just-written file, explained > in more detail below. > > The program demonstrating the problem is attached (a slightly modified version > of the demo given by Russ Cox on the Go issue tracker, see URL in first line). > It makes 20 threads, each executing an infinite loop doing the following: > > 1) open an fd for writing with O_CLOEXEC > 2) write executable code into it > 3) close it > 4) fork > 5) in the child, attempt to execve the just-written file > > If you compile it with -DNOWAIT, you'll see that execve often fails with > ETXTBSY. This happens if another thread forked while we were holding an open fd > between steps 1 and 3, our fd "leaked" in that child, and then we reached our > step 5 before that child did execve (at which point the leaked fd would be > closed thanks to O_CLOEXEC). Egads... Let me get it straight - you have a bunch of threads sharing descriptor tables and some of them are forking (or cloning without shared descriptor tables) while that is going on? Frankly, in such situation I would spawn a thread for that, did unshare(CLONE_FILES) in it, replaced the binary and buggered off, with parent waiting for it to complete. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: ETXTBSY window in __fput 2025-08-26 22:00 ` Al Viro @ 2025-08-27 7:22 ` Alexander Monakov 2025-08-27 11:52 ` Theodore Ts'o 2025-08-27 13:16 ` Aleksa Sarai 0 siblings, 2 replies; 23+ messages in thread From: Alexander Monakov @ 2025-08-27 7:22 UTC (permalink / raw) To: Al Viro; +Cc: linux-fsdevel, Christian Brauner, Jan Kara, linux-kernel On Tue, 26 Aug 2025, Al Viro wrote: > Egads... Let me get it straight - you have a bunch of threads sharing descriptor > tables and some of them are forking (or cloning without shared descriptor tables) > while that is going on? I suppose if they could start a new process in a more straightforward manner, they would. But you cannot start a new process without fork. Anyway, I'm but a messenger here: the problem has been hit by various people in the Go community (and by Go team itself, at least twice). Here I'm asking about a potential shortcoming in __fput that exacerbates the problem. > Frankly, in such situation I would spawn a thread for that, did unshare(CLONE_FILES) > in it, replaced the binary and buggered off, with parent waiting for it to complete. Good to know, but it doesn't sound very efficient (and like something that could be integrated in Go runtime). Anyhow, if the alleged race window in __fput is real, why not close it for good? Alexander ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: ETXTBSY window in __fput 2025-08-27 7:22 ` Alexander Monakov @ 2025-08-27 11:52 ` Theodore Ts'o 2025-08-27 13:05 ` Alexander Monakov 2025-08-27 13:16 ` Aleksa Sarai 1 sibling, 1 reply; 23+ messages in thread From: Theodore Ts'o @ 2025-08-27 11:52 UTC (permalink / raw) To: Alexander Monakov Cc: Al Viro, linux-fsdevel, Christian Brauner, Jan Kara, linux-kernel On Wed, Aug 27, 2025 at 10:22:14AM +0300, Alexander Monakov wrote: > > On Tue, 26 Aug 2025, Al Viro wrote: > > > Egads... Let me get it straight - you have a bunch of threads sharing descriptor > > tables and some of them are forking (or cloning without shared descriptor tables) > > while that is going on? > > I suppose if they could start a new process in a more straightforward manner, > they would. But you cannot start a new process without fork. Anyway, I'm but > a messenger here: the problem has been hit by various people in the Go community > (and by Go team itself, at least twice). Here I'm asking about a potential > shortcoming in __fput that exacerbates the problem. I'm assuming that the problem is showing up in real life when users run a go problem using "go run" where the golang compiler freshly writes the executable, and then fork/exec's the binary. And using multiple threads sharing descriptor tables was just to make a reliable reproducer? - Ted ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: ETXTBSY window in __fput 2025-08-27 11:52 ` Theodore Ts'o @ 2025-08-27 13:05 ` Alexander Monakov 2025-08-31 19:22 ` David Laight 0 siblings, 1 reply; 23+ messages in thread From: Alexander Monakov @ 2025-08-27 13:05 UTC (permalink / raw) To: Theodore Ts'o Cc: Al Viro, linux-fsdevel, Christian Brauner, Jan Kara, linux-kernel On Wed, 27 Aug 2025, Theodore Ts'o wrote: > On Wed, Aug 27, 2025 at 10:22:14AM +0300, Alexander Monakov wrote: > > > > On Tue, 26 Aug 2025, Al Viro wrote: > > > > > Egads... Let me get it straight - you have a bunch of threads sharing descriptor > > > tables and some of them are forking (or cloning without shared descriptor tables) > > > while that is going on? > > > > I suppose if they could start a new process in a more straightforward manner, > > they would. But you cannot start a new process without fork. Anyway, I'm but > > a messenger here: the problem has been hit by various people in the Go community > > (and by Go team itself, at least twice). Here I'm asking about a potential > > shortcoming in __fput that exacerbates the problem. > > I'm assuming that the problem is showing up in real life when users > run a go problem using "go run" where the golang compiler freshly > writes the executable, and then fork/exec's the binary. And using > multiple threads sharing descriptor tables was just to make a reliable > reproducer? You need at least two threads: while one thread does open-write-close-fork, there needs to be another thread that forks concurrently with the write. Alexander ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: ETXTBSY window in __fput 2025-08-27 13:05 ` Alexander Monakov @ 2025-08-31 19:22 ` David Laight 2025-09-01 8:44 ` Jan Kara 0 siblings, 1 reply; 23+ messages in thread From: David Laight @ 2025-08-31 19:22 UTC (permalink / raw) To: Alexander Monakov Cc: Theodore Ts'o, Al Viro, linux-fsdevel, Christian Brauner, Jan Kara, linux-kernel On Wed, 27 Aug 2025 16:05:51 +0300 (MSK) Alexander Monakov <amonakov@ispras.ru> wrote: > On Wed, 27 Aug 2025, Theodore Ts'o wrote: > > > On Wed, Aug 27, 2025 at 10:22:14AM +0300, Alexander Monakov wrote: > > > > > > On Tue, 26 Aug 2025, Al Viro wrote: > > > > > > > Egads... Let me get it straight - you have a bunch of threads sharing descriptor > > > > tables and some of them are forking (or cloning without shared descriptor tables) > > > > while that is going on? > > > > > > I suppose if they could start a new process in a more straightforward manner, > > > they would. But you cannot start a new process without fork. Anyway, I'm but > > > a messenger here: the problem has been hit by various people in the Go community > > > (and by Go team itself, at least twice). Here I'm asking about a potential > > > shortcoming in __fput that exacerbates the problem. > > > > I'm assuming that the problem is showing up in real life when users > > run a go problem using "go run" where the golang compiler freshly > > writes the executable, and then fork/exec's the binary. And using > > multiple threads sharing descriptor tables was just to make a reliable > > reproducer? > > You need at least two threads: while one thread does open-write-close-fork, > there needs to be another thread that forks concurrently with the write. Is this made worse by the code that defers fput to a worker thread? (or am I misremembering things again?) David > > Alexander > ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: ETXTBSY window in __fput 2025-08-31 19:22 ` David Laight @ 2025-09-01 8:44 ` Jan Kara 0 siblings, 0 replies; 23+ messages in thread From: Jan Kara @ 2025-09-01 8:44 UTC (permalink / raw) To: David Laight Cc: Alexander Monakov, Theodore Ts'o, Al Viro, linux-fsdevel, Christian Brauner, Jan Kara, linux-kernel On Sun 31-08-25 20:22:44, David Laight wrote: > On Wed, 27 Aug 2025 16:05:51 +0300 (MSK) > Alexander Monakov <amonakov@ispras.ru> wrote: > > > On Wed, 27 Aug 2025, Theodore Ts'o wrote: > > > > > On Wed, Aug 27, 2025 at 10:22:14AM +0300, Alexander Monakov wrote: > > > > > > > > On Tue, 26 Aug 2025, Al Viro wrote: > > > > > > > > > Egads... Let me get it straight - you have a bunch of threads sharing descriptor > > > > > tables and some of them are forking (or cloning without shared descriptor tables) > > > > > while that is going on? > > > > > > > > I suppose if they could start a new process in a more straightforward manner, > > > > they would. But you cannot start a new process without fork. Anyway, I'm but > > > > a messenger here: the problem has been hit by various people in the Go community > > > > (and by Go team itself, at least twice). Here I'm asking about a potential > > > > shortcoming in __fput that exacerbates the problem. > > > > > > I'm assuming that the problem is showing up in real life when users > > > run a go problem using "go run" where the golang compiler freshly > > > writes the executable, and then fork/exec's the binary. And using > > > multiple threads sharing descriptor tables was just to make a reliable > > > reproducer? > > > > You need at least two threads: while one thread does open-write-close-fork, > > there needs to be another thread that forks concurrently with the write. > > Is this made worse by the code that defers fput to a worker thread? > (or am I misremembering things again?) fput() is offloaded to task work (i.e., it happens on exit of a task to userspace). But I don't think it impacts this particular problem in a significant way. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: ETXTBSY window in __fput 2025-08-27 7:22 ` Alexander Monakov 2025-08-27 11:52 ` Theodore Ts'o @ 2025-08-27 13:16 ` Aleksa Sarai 2025-08-27 14:29 ` Alexander Monakov 1 sibling, 1 reply; 23+ messages in thread From: Aleksa Sarai @ 2025-08-27 13:16 UTC (permalink / raw) To: Alexander Monakov Cc: Al Viro, linux-fsdevel, Christian Brauner, Jan Kara, linux-kernel [-- Attachment #1: Type: text/plain, Size: 877 bytes --] On 2025-08-27, Alexander Monakov <amonakov@ispras.ru> wrote: > > Frankly, in such situation I would spawn a thread for that, did unshare(CLONE_FILES) > > in it, replaced the binary and buggered off, with parent waiting for it to complete. > > Good to know, but it doesn't sound very efficient (and like something that could be > integrated in Go runtime). Can't you create a goroutine, runtime.LockOSThread, unshare(CLONE_FILES), do the work, and then return -- without runtime.UnlockOSThread (to kill the thread and stop it from being used by other Go code)? Or does that not work in stdlib? We have to do this a lot in runc and other Go programs that mess around with unshare() or other per-thread attributes that don't play well with Go's process model. -- Aleksa Sarai Senior Software Engineer (Containers) SUSE Linux GmbH https://www.cyphar.com/ [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 265 bytes --] ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: ETXTBSY window in __fput 2025-08-27 13:16 ` Aleksa Sarai @ 2025-08-27 14:29 ` Alexander Monakov 0 siblings, 0 replies; 23+ messages in thread From: Alexander Monakov @ 2025-08-27 14:29 UTC (permalink / raw) To: Aleksa Sarai Cc: Al Viro, linux-fsdevel, Christian Brauner, Jan Kara, linux-kernel On Wed, 27 Aug 2025, Aleksa Sarai wrote: > On 2025-08-27, Alexander Monakov <amonakov@ispras.ru> wrote: > > > Frankly, in such situation I would spawn a thread for that, did unshare(CLONE_FILES) > > > in it, replaced the binary and buggered off, with parent waiting for it to complete. > > > > Good to know, but it doesn't sound very efficient (and like something that could be > > integrated in Go runtime). > > Can't you create a goroutine, runtime.LockOSThread, > unshare(CLONE_FILES), do the work, and then return -- without > runtime.UnlockOSThread (to kill the thread and stop it from being used > by other Go code)? Or does that not work in stdlib? Again, with regards to Go backstory, I'm just the messenger. But were I a Go runtime implementor, I'm afraid no, I wouldn't be able to do that, because while the file is being written, it's done with normal I/O APIs. The runtime has no way to look in the future and anticipate that the file currently being written will be execve'd. My first priority here is not to dig up alternative solutions to the ETXTBSY problem. I'm asking about a race window in __fput. Alexander ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: ETXTBSY window in __fput 2025-08-26 21:05 ETXTBSY window in __fput Alexander Monakov 2025-08-26 22:00 ` Al Viro @ 2025-08-29 7:21 ` Alexander Monakov 2025-08-29 9:47 ` Christian Brauner 2025-08-29 18:32 ` Colin Walters 2025-09-01 18:39 ` Mateusz Guzik 3 siblings, 1 reply; 23+ messages in thread From: Alexander Monakov @ 2025-08-29 7:21 UTC (permalink / raw) To: linux-fsdevel; +Cc: Alexander Viro, Christian Brauner, Jan Kara, linux-kernel On Wed, 27 Aug 2025, Alexander Monakov wrote: > Dear fs hackers, > > I suspect there's an unfortunate race window in __fput where file locks are > dropped (locks_remove_file) prior to decreasing writer refcount > (put_file_access). If I'm not mistaken, this window is observable and it > breaks a solution to ETXTBSY problem on exec'ing a just-written file, explained > in more detail below. The race in __fput is a problem irrespective of how the testcase triggers it, right? It's just showing a real-world scenario. But the issue can be demonstrated without a multithreaded fork: imagine one process placing an exclusive lock on a file and writing to it, another process waiting on that lock and immediately execve'ing when the lock is released. Can put_file_access be moved prior to locks_remove_file in __fput? Alexander ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: ETXTBSY window in __fput 2025-08-29 7:21 ` Alexander Monakov @ 2025-08-29 9:47 ` Christian Brauner 2025-08-29 10:17 ` Alexander Monakov 0 siblings, 1 reply; 23+ messages in thread From: Christian Brauner @ 2025-08-29 9:47 UTC (permalink / raw) To: Alexander Monakov; +Cc: linux-fsdevel, Alexander Viro, Jan Kara, linux-kernel On Fri, Aug 29, 2025 at 10:21:35AM +0300, Alexander Monakov wrote: > > On Wed, 27 Aug 2025, Alexander Monakov wrote: > > > Dear fs hackers, > > > > I suspect there's an unfortunate race window in __fput where file locks are > > dropped (locks_remove_file) prior to decreasing writer refcount > > (put_file_access). If I'm not mistaken, this window is observable and it > > breaks a solution to ETXTBSY problem on exec'ing a just-written file, explained > > in more detail below. > > The race in __fput is a problem irrespective of how the testcase triggers it, > right? It's just showing a real-world scenario. But the issue can be > demonstrated without a multithreaded fork: imagine one process placing an > exclusive lock on a file and writing to it, another process waiting on that > lock and immediately execve'ing when the lock is released. > > Can put_file_access be moved prior to locks_remove_file in __fput? Even if we fix this there's no guarantee that the kernel will give that letting the close() of a writably opened file race against a concurrent exec of the same file will not result in EBUSY in some arcane way currently or in the future. The fundamental problem is the idiotic exe_file_deny_write_access() mechanism for execve. This is the crux of the whole go issue. I've tried to removethis nonsense (twice?). Everytime because of some odd userspace regression we had to revert (And now we're apparently at the stage where in another patchset people think that this stuff needs to become a uapi O_* flag which will absolutely not happen.). My point is: currently you need synchronization for this to work cleanly in some form. But what I would do is the following. So far we've always failed to remove the deny on write mechanism because doing it unconditionally broke very weird use-cases with very strange justificatons. I think we should turn this on its head and give execveat() a flag AT_EXECVE_NODENYWRITE that allows applications to ignore the write restrictions. Applications like go can just start using this flag when exec'ing. Applications that need the annoying deny-write mechanism can just continue exec'ing as before without AT_EXECVE_NODENYWRITE. __Completely untested and uncompiled__ draft below: diff --git a/fs/exec.c b/fs/exec.c index 2a1e5e4042a1..59e8fcd3fc19 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -766,7 +766,7 @@ static struct file *do_open_execat(int fd, struct filename *name, int flags) int err; struct file *file __free(fput) = NULL; struct open_flags open_exec_flags = { - .open_flag = O_LARGEFILE | O_RDONLY | __FMODE_EXEC, + .open_flag = O_LARGEFILE | O_RDONLY | __FMODE_EXEC | __F_EXEC_NODENYWRITE, .acc_mode = MAY_EXEC, .intent = LOOKUP_OPEN, .lookup_flags = LOOKUP_FOLLOW, diff --git a/fs/fcntl.c b/fs/fcntl.c index 5598e4d57422..b0c01cba1560 100644 --- a/fs/fcntl.c +++ b/fs/fcntl.c @@ -1158,10 +1158,10 @@ static int __init fcntl_init(void) * Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY * is defined as O_NONBLOCK on some platforms and not on others. */ - BUILD_BUG_ON(20 - 1 /* for O_RDONLY being 0 */ != + BUILD_BUG_ON(21 - 1 /* for O_RDONLY being 0 */ != HWEIGHT32( (VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)) | - __FMODE_EXEC)); + __FMODE_EXEC | __F_EXEC_NODENYWRITE)); fasync_cache = kmem_cache_create("fasync_cache", sizeof(struct fasync_struct), 0, diff --git a/include/linux/fs.h b/include/linux/fs.h index d7ab4f96d705..123f74cbe7a4 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -3215,12 +3215,16 @@ static inline int exe_file_deny_write_access(struct file *exe_file) { if (unlikely(FMODE_FSNOTIFY_HSM(exe_file->f_mode))) return 0; + if (unlikely(exe_file->f_flags & __F_EXEC_NODENYWRITE)) + return 0; return deny_write_access(exe_file); } static inline void exe_file_allow_write_access(struct file *exe_file) { if (unlikely(!exe_file || FMODE_FSNOTIFY_HSM(exe_file->f_mode))) return; + if (unlikely(exe_file->f_flags & __F_EXEC_NODENYWRITE)) + return; allow_write_access(exe_file); } diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h index 613475285643..f3c8a457bc7d 100644 --- a/include/uapi/asm-generic/fcntl.h +++ b/include/uapi/asm-generic/fcntl.h @@ -61,6 +61,9 @@ #ifndef O_CLOEXEC #define O_CLOEXEC 02000000 /* set close_on_exec */ #endif +#ifdef __KERNEL__ +#define __F_EXEC_NODENYWRITE 020000000000 /* bit 31 */ +#endif /* * Before Linux 2.6.33 only O_DSYNC semantics were implemented, but using diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h index f291ab4f94eb..123fb158dc5b 100644 --- a/include/uapi/linux/fcntl.h +++ b/include/uapi/linux/fcntl.h @@ -174,6 +174,7 @@ #define AT_HANDLE_CONNECTABLE 0x002 /* Request a connectable file handle */ /* Flags for execveat2(2). */ +#define AT_EXECVE_NODENYWRITE 0x001 /* Allow writable file descriptors to the executable file. */ #define AT_EXECVE_CHECK 0x10000 /* Only perform a check if execution would be allowed. */ ^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: ETXTBSY window in __fput 2025-08-29 9:47 ` Christian Brauner @ 2025-08-29 10:17 ` Alexander Monakov 2025-08-29 11:07 ` Christian Brauner 0 siblings, 1 reply; 23+ messages in thread From: Alexander Monakov @ 2025-08-29 10:17 UTC (permalink / raw) To: Christian Brauner; +Cc: linux-fsdevel, Alexander Viro, Jan Kara, linux-kernel On Fri, 29 Aug 2025, Christian Brauner wrote: > On Fri, Aug 29, 2025 at 10:21:35AM +0300, Alexander Monakov wrote: > > > > On Wed, 27 Aug 2025, Alexander Monakov wrote: > > > > > Dear fs hackers, > > > > > > I suspect there's an unfortunate race window in __fput where file locks are > > > dropped (locks_remove_file) prior to decreasing writer refcount > > > (put_file_access). If I'm not mistaken, this window is observable and it > > > breaks a solution to ETXTBSY problem on exec'ing a just-written file, explained > > > in more detail below. > > > > The race in __fput is a problem irrespective of how the testcase triggers it, > > right? It's just showing a real-world scenario. But the issue can be > > demonstrated without a multithreaded fork: imagine one process placing an > > exclusive lock on a file and writing to it, another process waiting on that > > lock and immediately execve'ing when the lock is released. > > > > Can put_file_access be moved prior to locks_remove_file in __fput? > > Even if we fix this there's no guarantee that the kernel will give that > letting the close() of a writably opened file race against a concurrent > exec of the same file will not result in EBUSY in some arcane way > currently or in the future. Forget Go and execve. Take the two-process scenario from my last email. The program waiting on flock shouldn't be able to observe elevated refcounts on the file after the lock is released. It matters not only for execve, but also for unmounting the underlying filesystem, right? And maybe other things too. So why not fix the ordering issue in __fput and if there are other bugs breaking valid uses of flock, fix them too? Alexander ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: ETXTBSY window in __fput 2025-08-29 10:17 ` Alexander Monakov @ 2025-08-29 11:07 ` Christian Brauner 2025-08-29 11:45 ` Alexander Monakov 0 siblings, 1 reply; 23+ messages in thread From: Christian Brauner @ 2025-08-29 11:07 UTC (permalink / raw) To: Alexander Monakov; +Cc: linux-fsdevel, Alexander Viro, Jan Kara, linux-kernel On Fri, Aug 29, 2025 at 01:17:16PM +0300, Alexander Monakov wrote: > > On Fri, 29 Aug 2025, Christian Brauner wrote: > > > On Fri, Aug 29, 2025 at 10:21:35AM +0300, Alexander Monakov wrote: > > > > > > On Wed, 27 Aug 2025, Alexander Monakov wrote: > > > > > > > Dear fs hackers, > > > > > > > > I suspect there's an unfortunate race window in __fput where file locks are > > > > dropped (locks_remove_file) prior to decreasing writer refcount > > > > (put_file_access). If I'm not mistaken, this window is observable and it > > > > breaks a solution to ETXTBSY problem on exec'ing a just-written file, explained > > > > in more detail below. > > > > > > The race in __fput is a problem irrespective of how the testcase triggers it, > > > right? It's just showing a real-world scenario. But the issue can be > > > demonstrated without a multithreaded fork: imagine one process placing an > > > exclusive lock on a file and writing to it, another process waiting on that > > > lock and immediately execve'ing when the lock is released. > > > > > > Can put_file_access be moved prior to locks_remove_file in __fput? > > > > Even if we fix this there's no guarantee that the kernel will give that > > letting the close() of a writably opened file race against a concurrent > > exec of the same file will not result in EBUSY in some arcane way > > currently or in the future. > > Forget Go and execve. Take the two-process scenario from my last email. > The program waiting on flock shouldn't be able to observe elevated > refcounts on the file after the lock is released. It matters not only > for execve, but also for unmounting the underlying filesystem, right? What? No. How?: with details, please. > And maybe other things too. So why not fix the ordering issue in __fput > and if there are other bugs breaking valid uses of flock, fix them too? For locks_remove_file() to be movable after put_file_access() we'd have to prove that no filesystem implementing f_op->lock() doesn't rely on f_op->release() to not have run. It is fundamentally backwards to have run f_ops after f_op->release() ran. Random quick look into 9pfs: static int v9fs_file_do_lock(struct file *filp, int cmd, struct file_lock *fl) { struct p9_flock flock; struct p9_fid *fid; uint8_t status = P9_LOCK_ERROR; int res = 0; struct v9fs_session_info *v9ses; fid = filp->private_data; BUG_ON(fid == NULL); This relies on filp->private_data to be valid which it wouldn't be anymore after f_op->release(). Moving put_file_access() before f_op->release() is also wrong and would require to prove that no filesystem depends on file access not having changed before f_op->release() has run. So no, not a trivial thing to move around. And you are explicitly advertising this as a fix to the go execve problem; both in the bugtracker and here. And it's simply not a good solution. The problem remains exec's deny-write mechanism. But hooking into file locks to serialize exec against writable openers isn't a good solution. It surely is creative though. We can give userspace a simple way to sidestep this problem though as I tried to show. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: ETXTBSY window in __fput 2025-08-29 11:07 ` Christian Brauner @ 2025-08-29 11:45 ` Alexander Monakov 2025-08-29 14:02 ` Jan Kara 0 siblings, 1 reply; 23+ messages in thread From: Alexander Monakov @ 2025-08-29 11:45 UTC (permalink / raw) To: Christian Brauner; +Cc: linux-fsdevel, Alexander Viro, Jan Kara, linux-kernel On Fri, 29 Aug 2025, Christian Brauner wrote: > > > Even if we fix this there's no guarantee that the kernel will give that > > > letting the close() of a writably opened file race against a concurrent > > > exec of the same file will not result in EBUSY in some arcane way > > > currently or in the future. > > > > Forget Go and execve. Take the two-process scenario from my last email. > > The program waiting on flock shouldn't be able to observe elevated > > refcounts on the file after the lock is released. It matters not only > > for execve, but also for unmounting the underlying filesystem, right? > > What? No. How?: with details, please. Apologies if there's a misunderstanding on my side, but put_file_access does file_put_write_access, which in turn does mnt_put_write_access(file->f_path.mnt); and I think elevated refcount on mnt will cause -EBUSY from mnt_hold_writers. Which is then checked in mnt_make_readonly. I expect it affects unmount too, just don't see exactly where. (basically as remember non-lazy unmounting a filesystem with open files errors out, right? as well as ro-remounting when some files are open for writing) Alexander ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: ETXTBSY window in __fput 2025-08-29 11:45 ` Alexander Monakov @ 2025-08-29 14:02 ` Jan Kara 2025-09-01 17:53 ` Alexander Monakov 0 siblings, 1 reply; 23+ messages in thread From: Jan Kara @ 2025-08-29 14:02 UTC (permalink / raw) To: Alexander Monakov Cc: Christian Brauner, linux-fsdevel, Alexander Viro, Jan Kara, linux-kernel Hello! On Fri 29-08-25 14:45:57, Alexander Monakov wrote: > On Fri, 29 Aug 2025, Christian Brauner wrote: > > > > > Even if we fix this there's no guarantee that the kernel will give that > > > > letting the close() of a writably opened file race against a concurrent > > > > exec of the same file will not result in EBUSY in some arcane way > > > > currently or in the future. > > > > > > Forget Go and execve. Take the two-process scenario from my last email. > > > The program waiting on flock shouldn't be able to observe elevated > > > refcounts on the file after the lock is released. It matters not only > > > for execve, but also for unmounting the underlying filesystem, right? > > > > What? No. How?: with details, please. > > Apologies if there's a misunderstanding on my side, but put_file_access > does file_put_write_access, which in turn does > > mnt_put_write_access(file->f_path.mnt); > > and I think elevated refcount on mnt will cause -EBUSY from mnt_hold_writers. > Which is then checked in mnt_make_readonly. I expect it affects unmount too, > just don't see exactly where. Umount (may_umount_tree()) looks at mnt->mnt_count which is decremented by mntput() completely at the end of __fput(). I tend to agree with Christian here: We've never promised that all effects of open fd are cleaned up before the flock is released and as Christian explained it will be actually pretty hard to implement such behavior. So attempts to wait for fd to close by waiting for its flock are racy... Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: ETXTBSY window in __fput 2025-08-29 14:02 ` Jan Kara @ 2025-09-01 17:53 ` Alexander Monakov 2025-09-02 10:36 ` Jan Kara 0 siblings, 1 reply; 23+ messages in thread From: Alexander Monakov @ 2025-09-01 17:53 UTC (permalink / raw) To: Jan Kara; +Cc: Christian Brauner, linux-fsdevel, Alexander Viro, linux-kernel On Fri, 29 Aug 2025, Jan Kara wrote: > Umount (may_umount_tree()) looks at mnt->mnt_count which is decremented by > mntput() completely at the end of __fput(). I tend to agree with Christian > here: We've never promised that all effects of open fd are cleaned up > before the flock is released and as Christian explained it will be actually > pretty hard to implement such behavior. So attempts to wait for fd to close > by waiting for its flock are racy... (flock is not a Linux invention, if BSD implementations offered that guarantee, I'd expect Linux to follow, but I'm not sure if they did) That's unfortunate. If the remount/unmount issues are not convincing, shall we try to get this issue called out in the Linux man pages? Would you help me with wordsmithing? How about adding the following to the NOTES section in flock.2? Releasing the lock when a file descriptor is closed is not sequenced after all observable effects of close(). For example, when one process places an exclusive lock on a file, writes to it, then closes it, and another process waits on a shared lock for the file to be closed, it may observe that subsequent execve() fails with ETXTBSY, and umount() of the underlying filesystem fails with EBUSY, as if the file is still open in the first process. Alexander ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: ETXTBSY window in __fput 2025-09-01 17:53 ` Alexander Monakov @ 2025-09-02 10:36 ` Jan Kara 0 siblings, 0 replies; 23+ messages in thread From: Jan Kara @ 2025-09-02 10:36 UTC (permalink / raw) To: Alexander Monakov Cc: Jan Kara, Christian Brauner, linux-fsdevel, Alexander Viro, linux-kernel On Mon 01-09-25 20:53:38, Alexander Monakov wrote: > > On Fri, 29 Aug 2025, Jan Kara wrote: > > > Umount (may_umount_tree()) looks at mnt->mnt_count which is decremented by > > mntput() completely at the end of __fput(). I tend to agree with Christian > > here: We've never promised that all effects of open fd are cleaned up > > before the flock is released and as Christian explained it will be actually > > pretty hard to implement such behavior. So attempts to wait for fd to close > > by waiting for its flock are racy... > > (flock is not a Linux invention, if BSD implementations offered that guarantee, > I'd expect Linux to follow, but I'm not sure if they did) > > That's unfortunate. If the remount/unmount issues are not convincing, shall we > try to get this issue called out in the Linux man pages? Would you help me with > wordsmithing? > > How about adding the following to the NOTES section in flock.2? > > Releasing the lock when a file descriptor is closed is not sequenced > after all observable effects of close(). For example, when one process > places an exclusive lock on a file, writes to it, then closes it, and > another process waits on a shared lock for the file to be closed, it may > observe that subsequent execve() fails with ETXTBSY, and umount() of the > underlying filesystem fails with EBUSY, as if the file is still open in > the first process. The paragraph sounds good to me. Thanks for writing it! Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: ETXTBSY window in __fput 2025-08-26 21:05 ETXTBSY window in __fput Alexander Monakov 2025-08-26 22:00 ` Al Viro 2025-08-29 7:21 ` Alexander Monakov @ 2025-08-29 18:32 ` Colin Walters 2025-09-01 18:39 ` Mateusz Guzik 3 siblings, 0 replies; 23+ messages in thread From: Colin Walters @ 2025-08-29 18:32 UTC (permalink / raw) To: Alexander Monakov, linux-fsdevel Cc: Al Viro, Christian Brauner, Jan Kara, linux-kernel On Tue, Aug 26, 2025, at 5:05 PM, Alexander Monakov wrote: > Dear fs hackers, > > I suspect there's an unfortunate race window in __fput where file locks are > dropped (locks_remove_file) prior to decreasing writer refcount > (put_file_access). If I'm not mistaken, this window is observable and it > breaks a solution to ETXTBSY problem on exec'ing a just-written file, explained > in more detail below. > > The program demonstrating the problem is attached (a slightly modified version > of the demo given by Russ Cox on the Go issue tracker, see URL in first line). > It makes 20 threads, each executing an infinite loop doing the following: > > 1) open an fd for writing with O_CLOEXEC > 2) write executable code into it > 3) close it > 4) fork > 5) in the child, attempt to execve the just-written file > > If you compile it with -DNOWAIT, you'll see that execve often fails with > ETXTBSY. This happens if another thread forked while we were holding an open fd > between steps 1 and 3, our fd "leaked" in that child, and then we reached our > step 5 before that child did execve (at which point the leaked fd would be > closed thanks to O_CLOEXEC). This one looks to be the same as what we hit in https://github.com/containers/composefs-rs/issues/106 but with fsverity. I came to the conclusion that we want O_CLOFORK (ref https://www.austingroupbugs.net/view.php?id=1318 ). ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: ETXTBSY window in __fput 2025-08-26 21:05 ETXTBSY window in __fput Alexander Monakov ` (2 preceding siblings ...) 2025-08-29 18:32 ` Colin Walters @ 2025-09-01 18:39 ` Mateusz Guzik 2025-09-01 19:57 ` Colin Walters 2025-09-02 8:33 ` Christian Brauner 3 siblings, 2 replies; 23+ messages in thread From: Mateusz Guzik @ 2025-09-01 18:39 UTC (permalink / raw) To: Alexander Monakov Cc: linux-fsdevel, Alexander Viro, Christian Brauner, Jan Kara, linux-kernel On Wed, Aug 27, 2025 at 12:05:38AM +0300, Alexander Monakov wrote: > Dear fs hackers, > > I suspect there's an unfortunate race window in __fput where file locks are > dropped (locks_remove_file) prior to decreasing writer refcount > (put_file_access). If I'm not mistaken, this window is observable and it > breaks a solution to ETXTBSY problem on exec'ing a just-written file, explained > in more detail below. > > The program demonstrating the problem is attached (a slightly modified version > of the demo given by Russ Cox on the Go issue tracker, see URL in first line). > It makes 20 threads, each executing an infinite loop doing the following: > > 1) open an fd for writing with O_CLOEXEC > 2) write executable code into it > 3) close it > 4) fork > 5) in the child, attempt to execve the just-written file > > If you compile it with -DNOWAIT, you'll see that execve often fails with > ETXTBSY. This problem was reported a few times and is quite ancient by now. While acknowleding the resulting behavior needs to be fixed, I find the proposed solutions are merely trying to put more lipstick or a wig on a pig. The age of the problem suggests it is not *urgent* to fix it. The O_CLOFORM idea was accepted into POSIX and recent-ish implemented in all the BSDs (no, really) and illumos, but got NAKed in Linux. It's also a part of pig's attire so I think that's the right call. Not denying execs of files open for writing had to get reverted as apparently some software depends on it, so that's a no-go either. The flag proposed by Christian elsewhere in the thread would sort this out, but it's just another hack which would serve no purpose if the issue stopped showing up. The real problem is fork()+execve() combo being crap syscalls with crap semantics, perpetuating the unix tradition of screwing you over unless you explicitly ask it not to (e.g., with O_CLOEXEC so that the new proc does not hang out with surprise fds). While I don't have anything fleshed out nor have any interest in putting any work in the area, I would suggest anyone looking to solve the ETXTBSY went after the real culprit instead of damage-controlling the current API. To that end, my sketch of a suggestion boils down to a new API which allows you to construct a new process one step at a time explicitly spelling out resources which are going to get passed on, finally doing an actual exec. You would start with getting a file descriptor to a new task_struct which you gradually populate and eventually exec something on. There would be no forking. It could look like this (ignore specific naming): /* get a file descriptor for the new process. there is no *fork* here, * but task_struct & related get allocated * clean slate, no sigmask bullshit and similar */ pfd = proc_new(); nullfd = open("/dev/null", O_RDONLY); /* map /dev/null as 0/1/2 in the new proc */ proc_install_fd(pfd, nullfd, 0); proc_install_fd(pfd, nullfd, 2); proc_install_fd(pfd, nullfd, 2); /* if we can run the proc as someone else, set it up here */ proc_install_cred(pfd, uid, gid, groups, ...); proc_set_umask(pfd, ...); /* finally exec */ proc_exec_by_path("/bin/sh", argp, envp); Notice how not once at any point random-ass file descriptors popped into the new task, which has a side effect of completely avoiding the problem. you may also notice this should be faster to execute as it does not have to pay the mm overhead. While proc_install_fd is spelled out as singular syscalls, this can be batched to accept an array of <from, to> pairs etc. Also notice the thread executing it is not shackled by any of vfork limitations. So... if someone is serious about the transient ETXTBSY, I would really hope you will consider solving the source of the problem, even if you come up with someting other than I did (hopefully better). It would be a damn shame to add even more hacks to pacify this problem (like the O_ stuff). What to do in the meantime? There is a lol hack you can do in userspace which so ugly I'm not even going to spell it out, but given the temporary nature of ETXTBSY I'm sure you can guess what it is. Something to ponder, cheers. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: ETXTBSY window in __fput 2025-09-01 18:39 ` Mateusz Guzik @ 2025-09-01 19:57 ` Colin Walters 2025-09-01 20:22 ` Mateusz Guzik 2025-09-02 8:33 ` Christian Brauner 1 sibling, 1 reply; 23+ messages in thread From: Colin Walters @ 2025-09-01 19:57 UTC (permalink / raw) To: Mateusz Guzik, Alexander Monakov Cc: linux-fsdevel, Al Viro, Christian Brauner, Jan Kara, linux-kernel On Mon, Sep 1, 2025, at 2:39 PM, Mateusz Guzik wrote: > > The O_CLOFORM idea was accepted into POSIX and recent-ish implemented in > all the BSDs (no, really) and illumos, but got NAKed in Linux. It's also > a part of pig's attire so I think that's the right call. Do you have a reference handy for that NAK? > To that end, my sketch of a suggestion boils down to a new API which > allows you to construct a new process one step at a time In this vein I think io_uring_spawn work sounds like the best: https://lwn.net/Articles/908268/ However...if we predicate any solution to this problem on changing every single codebase which is spawning processes, it's going to take a long time. I think changing the few special cases around "sealing" (fsverity and write + fexecve()) is more tractable. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: ETXTBSY window in __fput 2025-09-01 19:57 ` Colin Walters @ 2025-09-01 20:22 ` Mateusz Guzik 0 siblings, 0 replies; 23+ messages in thread From: Mateusz Guzik @ 2025-09-01 20:22 UTC (permalink / raw) To: Colin Walters Cc: Alexander Monakov, linux-fsdevel, Al Viro, Christian Brauner, Jan Kara, linux-kernel On Mon, Sep 1, 2025 at 9:57 PM Colin Walters <walters@verbum.org> wrote: > On Mon, Sep 1, 2025, at 2:39 PM, Mateusz Guzik wrote: > > > > The O_CLOFORM idea was accepted into POSIX and recent-ish implemented in > > all the BSDs (no, really) and illumos, but got NAKed in Linux. It's also > > a part of pig's attire so I think that's the right call. > > Do you have a reference handy for that NAK? > https://lore.kernel.org/linux-fsdevel/20200515160342.GE23230@ZenIV.linux.org.uk/ > > To that end, my sketch of a suggestion boils down to a new API which > > allows you to construct a new process one step at a time > > In this vein I think io_uring_spawn work sounds like the best: https://lwn.net/Articles/908268/ > Indeed sounds like the same core idea, I don't understand why tie it to io_uring though. > However...if we predicate any solution to this problem on changing every single codebase which is spawning processes, it's going to take a long time. I think changing the few special cases around "sealing" (fsverity and write + fexecve()) is more tractable. The problem does not concern every single codebase which spawns a process. It concerns the few which are heavily multithreaded while creating binaries they are about to exec. If they have to be patched in your proposal anyway, they may as well use the new API. For non-affected consumers, I was mostly thinking make and shells just from perf standpoint. -- Mateusz Guzik <mjguzik gmail.com> ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: ETXTBSY window in __fput 2025-09-01 18:39 ` Mateusz Guzik 2025-09-01 19:57 ` Colin Walters @ 2025-09-02 8:33 ` Christian Brauner 2025-09-02 8:44 ` Mateusz Guzik 1 sibling, 1 reply; 23+ messages in thread From: Christian Brauner @ 2025-09-02 8:33 UTC (permalink / raw) To: Mateusz Guzik Cc: Alexander Monakov, linux-fsdevel, Alexander Viro, Jan Kara, linux-kernel On Mon, Sep 01, 2025 at 08:39:27PM +0200, Mateusz Guzik wrote: > On Wed, Aug 27, 2025 at 12:05:38AM +0300, Alexander Monakov wrote: > > Dear fs hackers, > > > > I suspect there's an unfortunate race window in __fput where file locks are > > dropped (locks_remove_file) prior to decreasing writer refcount > > (put_file_access). If I'm not mistaken, this window is observable and it > > breaks a solution to ETXTBSY problem on exec'ing a just-written file, explained > > in more detail below. > > > > The program demonstrating the problem is attached (a slightly modified version > > of the demo given by Russ Cox on the Go issue tracker, see URL in first line). > > It makes 20 threads, each executing an infinite loop doing the following: > > > > 1) open an fd for writing with O_CLOEXEC > > 2) write executable code into it > > 3) close it > > 4) fork > > 5) in the child, attempt to execve the just-written file > > > > If you compile it with -DNOWAIT, you'll see that execve often fails with > > ETXTBSY. > > This problem was reported a few times and is quite ancient by now. > > While acknowleding the resulting behavior needs to be fixed, I find the > proposed solutions are merely trying to put more lipstick or a wig on a > pig. > > The age of the problem suggests it is not *urgent* to fix it. > > The O_CLOFORM idea was accepted into POSIX and recent-ish implemented in > all the BSDs (no, really) and illumos, but got NAKed in Linux. It's also > a part of pig's attire so I think that's the right call. > > Not denying execs of files open for writing had to get reverted as > apparently some software depends on it, so that's a no-go either. > > The flag proposed by Christian elsewhere in the thread would sort this > out, but it's just another hack which would serve no purpose if the > issue stopped showing up. > > The real problem is fork()+execve() combo being crap syscalls with crap > semantics, perpetuating the unix tradition of screwing you over unless > you explicitly ask it not to (e.g., with O_CLOEXEC so that the new proc > does not hang out with surprise fds). > > While I don't have anything fleshed out nor have any interest in putting > any work in the area, I would suggest anyone looking to solve the ETXTBSY > went after the real culprit instead of damage-controlling the current > API. > > To that end, my sketch of a suggestion boils down to a new API which > allows you to construct a new process one step at a time explicitly > spelling out resources which are going to get passed on, finally doing > an actual exec. You would start with getting a file descriptor to a new > task_struct which you gradually populate and eventually exec something > on. There would be no forking. > > It could look like this (ignore specific naming): > > /* get a file descriptor for the new process. there is no *fork* here, > * but task_struct & related get allocated > * clean slate, no sigmask bullshit and similar > */ > pfd = proc_new(); > > nullfd = open("/dev/null", O_RDONLY); > > /* map /dev/null as 0/1/2 in the new proc */ > proc_install_fd(pfd, nullfd, 0); > proc_install_fd(pfd, nullfd, 2); > proc_install_fd(pfd, nullfd, 2); > > /* if we can run the proc as someone else, set it up here */ > proc_install_cred(pfd, uid, gid, groups, ...); > > proc_set_umask(pfd, ...); > > /* finally exec */ > proc_exec_by_path("/bin/sh", argp, envp); You can trivially build this API on top of pidfs. Like: pidfd_empty = pidfd_open(FD_PIDFS_ROOT/FD_INVALID, PIDFD_EMPTY) where FD_PIDFS_ROOT and FD_INVALID are things we already have in the uapi headers. Then either just add a new system call like pidfd_config() or just use ioctls() on that empty pidfd. With pidfs you have the complete freedom to implement that api however you want. I had a prototype for that as well but I can't find it anymore. Other VFS work took priority and so I never finished it but I remember it wasn't very difficult. I would definitely merge a patch series like that. > > Notice how not once at any point random-ass file descriptors popped into > the new task, which has a side effect of completely avoiding the > problem. > > you may also notice this should be faster to execute as it does not have > to pay the mm overhead. > > While proc_install_fd is spelled out as singular syscalls, this can be > batched to accept an array of <from, to> pairs etc. > > Also notice the thread executing it is not shackled by any of vfork > limitations. > > So... if someone is serious about the transient ETXTBSY, I would really > hope you will consider solving the source of the problem, even if you > come up with someting other than I did (hopefully better). It would be a > damn shame to add even more hacks to pacify this problem (like the O_ > stuff). > > What to do in the meantime? There is a lol hack you can do in userspace > which so ugly I'm not even going to spell it out, but given the > temporary nature of ETXTBSY I'm sure you can guess what it is. > > Something to ponder, cheers. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: ETXTBSY window in __fput 2025-09-02 8:33 ` Christian Brauner @ 2025-09-02 8:44 ` Mateusz Guzik 0 siblings, 0 replies; 23+ messages in thread From: Mateusz Guzik @ 2025-09-02 8:44 UTC (permalink / raw) To: Christian Brauner Cc: Alexander Monakov, linux-fsdevel, Alexander Viro, Jan Kara, linux-kernel On Tue, Sep 2, 2025 at 10:33 AM Christian Brauner <brauner@kernel.org> wrote: > > On Mon, Sep 01, 2025 at 08:39:27PM +0200, Mateusz Guzik wrote: > > On Wed, Aug 27, 2025 at 12:05:38AM +0300, Alexander Monakov wrote: > > > Dear fs hackers, > > > > > > I suspect there's an unfortunate race window in __fput where file locks are > > > dropped (locks_remove_file) prior to decreasing writer refcount > > > (put_file_access). If I'm not mistaken, this window is observable and it > > > breaks a solution to ETXTBSY problem on exec'ing a just-written file, explained > > > in more detail below. > > > > > > The program demonstrating the problem is attached (a slightly modified version > > > of the demo given by Russ Cox on the Go issue tracker, see URL in first line). > > > It makes 20 threads, each executing an infinite loop doing the following: > > > > > > 1) open an fd for writing with O_CLOEXEC > > > 2) write executable code into it > > > 3) close it > > > 4) fork > > > 5) in the child, attempt to execve the just-written file > > > > > > If you compile it with -DNOWAIT, you'll see that execve often fails with > > > ETXTBSY. > > > > This problem was reported a few times and is quite ancient by now. > > > > While acknowleding the resulting behavior needs to be fixed, I find the > > proposed solutions are merely trying to put more lipstick or a wig on a > > pig. > > > > The age of the problem suggests it is not *urgent* to fix it. > > > > The O_CLOFORM idea was accepted into POSIX and recent-ish implemented in > > all the BSDs (no, really) and illumos, but got NAKed in Linux. It's also > > a part of pig's attire so I think that's the right call. > > > > Not denying execs of files open for writing had to get reverted as > > apparently some software depends on it, so that's a no-go either. > > > > The flag proposed by Christian elsewhere in the thread would sort this > > out, but it's just another hack which would serve no purpose if the > > issue stopped showing up. > > > > The real problem is fork()+execve() combo being crap syscalls with crap > > semantics, perpetuating the unix tradition of screwing you over unless > > you explicitly ask it not to (e.g., with O_CLOEXEC so that the new proc > > does not hang out with surprise fds). > > > > While I don't have anything fleshed out nor have any interest in putting > > any work in the area, I would suggest anyone looking to solve the ETXTBSY > > went after the real culprit instead of damage-controlling the current > > API. > > > > To that end, my sketch of a suggestion boils down to a new API which > > allows you to construct a new process one step at a time explicitly > > spelling out resources which are going to get passed on, finally doing > > an actual exec. You would start with getting a file descriptor to a new > > task_struct which you gradually populate and eventually exec something > > on. There would be no forking. > > > > It could look like this (ignore specific naming): > > > > /* get a file descriptor for the new process. there is no *fork* here, > > * but task_struct & related get allocated > > * clean slate, no sigmask bullshit and similar > > */ > > pfd = proc_new(); > > > > nullfd = open("/dev/null", O_RDONLY); > > > > /* map /dev/null as 0/1/2 in the new proc */ > > proc_install_fd(pfd, nullfd, 0); > > proc_install_fd(pfd, nullfd, 2); > > proc_install_fd(pfd, nullfd, 2); > > > > /* if we can run the proc as someone else, set it up here */ > > proc_install_cred(pfd, uid, gid, groups, ...); > > > > proc_set_umask(pfd, ...); > > > > /* finally exec */ > > proc_exec_by_path("/bin/sh", argp, envp); > > You can trivially build this API on top of pidfs. Like: > > pidfd_empty = pidfd_open(FD_PIDFS_ROOT/FD_INVALID, PIDFD_EMPTY) > > where FD_PIDFS_ROOT and FD_INVALID are things we already have in the > uapi headers. > > Then either just add a new system call like pidfd_config() or just use > ioctls() on that empty pidfd. > > With pidfs you have the complete freedom to implement that api however > you want. > > I had a prototype for that as well but I can't find it anymore. Other > VFS work took priority and so I never finished it but I remember it > wasn't very difficult. > > I would definitely merge a patch series like that. > I'm happy we are on the same page here, I'm unhappy it is down to the point of neither of us doing the work though. :-P I don't think the work is difficult from tech standpoint, but it does warrant some caution. With the current model fork + exec model, whatever process-specific bullshit gets added, you automatically get it on fork. Something which gradually spells out the state will need a way to figure out what was not sorted out and fill it up in kernel-side. There maybe some other caveats, basically it would be good if someone(tm) really thought this through. It would be really embarrassing to end up with a deficient & unfixable API which has to remain supported. :-P I offer half of the kingdom for a finished product, the interested party will have to arrange a bride from elsewhere. > > > > Notice how not once at any point random-ass file descriptors popped into > > the new task, which has a side effect of completely avoiding the > > problem. > > > > you may also notice this should be faster to execute as it does not have > > to pay the mm overhead. > > > > While proc_install_fd is spelled out as singular syscalls, this can be > > batched to accept an array of <from, to> pairs etc. > > > > Also notice the thread executing it is not shackled by any of vfork > > limitations. > > > > So... if someone is serious about the transient ETXTBSY, I would really > > hope you will consider solving the source of the problem, even if you > > come up with someting other than I did (hopefully better). It would be a > > damn shame to add even more hacks to pacify this problem (like the O_ > > stuff). > > > > What to do in the meantime? There is a lol hack you can do in userspace > > which so ugly I'm not even going to spell it out, but given the > > temporary nature of ETXTBSY I'm sure you can guess what it is. > > > > Something to ponder, cheers. -- Mateusz Guzik <mjguzik gmail.com> ^ permalink raw reply [flat|nested] 23+ messages in thread
end of thread, other threads:[~2025-09-02 10:36 UTC | newest] Thread overview: 23+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2025-08-26 21:05 ETXTBSY window in __fput Alexander Monakov 2025-08-26 22:00 ` Al Viro 2025-08-27 7:22 ` Alexander Monakov 2025-08-27 11:52 ` Theodore Ts'o 2025-08-27 13:05 ` Alexander Monakov 2025-08-31 19:22 ` David Laight 2025-09-01 8:44 ` Jan Kara 2025-08-27 13:16 ` Aleksa Sarai 2025-08-27 14:29 ` Alexander Monakov 2025-08-29 7:21 ` Alexander Monakov 2025-08-29 9:47 ` Christian Brauner 2025-08-29 10:17 ` Alexander Monakov 2025-08-29 11:07 ` Christian Brauner 2025-08-29 11:45 ` Alexander Monakov 2025-08-29 14:02 ` Jan Kara 2025-09-01 17:53 ` Alexander Monakov 2025-09-02 10:36 ` Jan Kara 2025-08-29 18:32 ` Colin Walters 2025-09-01 18:39 ` Mateusz Guzik 2025-09-01 19:57 ` Colin Walters 2025-09-01 20:22 ` Mateusz Guzik 2025-09-02 8:33 ` Christian Brauner 2025-09-02 8:44 ` Mateusz Guzik
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).