linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Christian Brauner <brauner@kernel.org>
To: Alexander Monakov <amonakov@ispras.ru>
Cc: linux-fsdevel@vger.kernel.org,
	 Alexander Viro <viro@zeniv.linux.org.uk>,
	Jan Kara <jack@suse.cz>,
	linux-kernel@vger.kernel.org
Subject: Re: ETXTBSY window in __fput
Date: Fri, 29 Aug 2025 11:47:58 +0200	[thread overview]
Message-ID: <20250829-diskette-landbrot-aa01bc844435@brauner> (raw)
In-Reply-To: <5a4513fe-6eae-9269-c235-c8b0bc1ae05b@ispras.ru>

On Fri, Aug 29, 2025 at 10:21:35AM +0300, Alexander Monakov wrote:
> 
> On Wed, 27 Aug 2025, Alexander Monakov wrote:
> 
> > Dear fs hackers,
> > 
> > I suspect there's an unfortunate race window in __fput where file locks are
> > dropped (locks_remove_file) prior to decreasing writer refcount
> > (put_file_access). If I'm not mistaken, this window is observable and it
> > breaks a solution to ETXTBSY problem on exec'ing a just-written file, explained
> > in more detail below.
> 
> The race in __fput is a problem irrespective of how the testcase triggers it,
> right? It's just showing a real-world scenario. But the issue can be
> demonstrated without a multithreaded fork: imagine one process placing an
> exclusive lock on a file and writing to it, another process waiting on that
> lock and immediately execve'ing when the lock is released.
> 
> Can put_file_access be moved prior to locks_remove_file in __fput?

Even if we fix this there's no guarantee that the kernel will give that
letting the close() of a writably opened file race against a concurrent
exec of the same file will not result in EBUSY in some arcane way
currently or in the future.

The fundamental problem is the idiotic exe_file_deny_write_access()
mechanism for execve. This is the crux of the whole go issue. I've tried
to removethis nonsense (twice?). Everytime because of some odd userspace
regression we had to revert (And now we're apparently at the stage where
in another patchset people think that this stuff needs to become a uapi
O_* flag which will absolutely not happen.).

My point is: currently you need synchronization for this to work cleanly
in some form.

But what I would do is the following. So far we've always failed to
remove the deny on write mechanism because doing it unconditionally
broke very weird use-cases with very strange justificatons.

I think we should turn this on its head and give execveat() a flag
AT_EXECVE_NODENYWRITE that allows applications to ignore the write
restrictions.

Applications like go can just start using this flag when exec'ing.
Applications that need the annoying deny-write mechanism can just
continue exec'ing as before without AT_EXECVE_NODENYWRITE.

__Completely untested and uncompiled__ draft below:

diff --git a/fs/exec.c b/fs/exec.c
index 2a1e5e4042a1..59e8fcd3fc19 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -766,7 +766,7 @@ static struct file *do_open_execat(int fd, struct filename *name, int flags)
        int err;
        struct file *file __free(fput) = NULL;
        struct open_flags open_exec_flags = {
-               .open_flag = O_LARGEFILE | O_RDONLY | __FMODE_EXEC,
+               .open_flag = O_LARGEFILE | O_RDONLY | __FMODE_EXEC | __F_EXEC_NODENYWRITE,
                .acc_mode = MAY_EXEC,
                .intent = LOOKUP_OPEN,
                .lookup_flags = LOOKUP_FOLLOW,
diff --git a/fs/fcntl.c b/fs/fcntl.c
index 5598e4d57422..b0c01cba1560 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -1158,10 +1158,10 @@ static int __init fcntl_init(void)
         * Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY
         * is defined as O_NONBLOCK on some platforms and not on others.
         */
-       BUILD_BUG_ON(20 - 1 /* for O_RDONLY being 0 */ !=
+       BUILD_BUG_ON(21 - 1 /* for O_RDONLY being 0 */ !=
                HWEIGHT32(
                        (VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)) |
-                       __FMODE_EXEC));
+                       __FMODE_EXEC | __F_EXEC_NODENYWRITE));

        fasync_cache = kmem_cache_create("fasync_cache",
                                         sizeof(struct fasync_struct), 0,
diff --git a/include/linux/fs.h b/include/linux/fs.h
index d7ab4f96d705..123f74cbe7a4 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3215,12 +3215,16 @@ static inline int exe_file_deny_write_access(struct file *exe_file)
 {
        if (unlikely(FMODE_FSNOTIFY_HSM(exe_file->f_mode)))
                return 0;
+       if (unlikely(exe_file->f_flags & __F_EXEC_NODENYWRITE))
+               return 0;
        return deny_write_access(exe_file);
 }
 static inline void exe_file_allow_write_access(struct file *exe_file)
 {
        if (unlikely(!exe_file || FMODE_FSNOTIFY_HSM(exe_file->f_mode)))
                return;
+       if (unlikely(exe_file->f_flags & __F_EXEC_NODENYWRITE))
+               return;
        allow_write_access(exe_file);
 }

diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
index 613475285643..f3c8a457bc7d 100644
--- a/include/uapi/asm-generic/fcntl.h
+++ b/include/uapi/asm-generic/fcntl.h
@@ -61,6 +61,9 @@
 #ifndef O_CLOEXEC
 #define O_CLOEXEC      02000000        /* set close_on_exec */
 #endif
+#ifdef __KERNEL__
+#define __F_EXEC_NODENYWRITE 020000000000 /* bit 31 */
+#endif

 /*
  * Before Linux 2.6.33 only O_DSYNC semantics were implemented, but using
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index f291ab4f94eb..123fb158dc5b 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -174,6 +174,7 @@
 #define AT_HANDLE_CONNECTABLE  0x002   /* Request a connectable file handle */

 /* Flags for execveat2(2). */
+#define AT_EXECVE_NODENYWRITE  0x001   /* Allow writable file descriptors to the executable file. */
 #define AT_EXECVE_CHECK                0x10000 /* Only perform a check if execution
                                           would be allowed. */

  reply	other threads:[~2025-08-29  9:48 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-08-26 21:05 ETXTBSY window in __fput Alexander Monakov
2025-08-26 22:00 ` Al Viro
2025-08-27  7:22   ` Alexander Monakov
2025-08-27 11:52     ` Theodore Ts'o
2025-08-27 13:05       ` Alexander Monakov
2025-08-31 19:22         ` David Laight
2025-09-01  8:44           ` Jan Kara
2025-08-27 13:16     ` Aleksa Sarai
2025-08-27 14:29       ` Alexander Monakov
2025-08-29  7:21 ` Alexander Monakov
2025-08-29  9:47   ` Christian Brauner [this message]
2025-08-29 10:17     ` Alexander Monakov
2025-08-29 11:07       ` Christian Brauner
2025-08-29 11:45         ` Alexander Monakov
2025-08-29 14:02           ` Jan Kara
2025-09-01 17:53             ` Alexander Monakov
2025-09-02 10:36               ` Jan Kara
2025-08-29 18:32 ` Colin Walters
2025-09-01 18:39 ` Mateusz Guzik
2025-09-01 19:57   ` Colin Walters
2025-09-01 20:22     ` Mateusz Guzik
2025-09-02  8:33   ` Christian Brauner
2025-09-02  8:44     ` Mateusz Guzik

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20250829-diskette-landbrot-aa01bc844435@brauner \
    --to=brauner@kernel.org \
    --cc=amonakov@ispras.ru \
    --cc=jack@suse.cz \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).