linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Mateusz Guzik <mjguzik@gmail.com>
To: Christian Brauner <brauner@kernel.org>
Cc: Alexander Monakov <amonakov@ispras.ru>,
	linux-fsdevel@vger.kernel.org,
	 Alexander Viro <viro@zeniv.linux.org.uk>,
	Jan Kara <jack@suse.cz>,
	linux-kernel@vger.kernel.org
Subject: Re: ETXTBSY window in __fput
Date: Tue, 2 Sep 2025 10:44:56 +0200	[thread overview]
Message-ID: <CAGudoHHV6U5TobvSobzSCor1nN8tb9Ez0fYKxc9+OZtP9EgE-w@mail.gmail.com> (raw)
In-Reply-To: <20250902-faust-kolibri-0898c1980de8@brauner>

On Tue, Sep 2, 2025 at 10:33 AM Christian Brauner <brauner@kernel.org> wrote:
>
> On Mon, Sep 01, 2025 at 08:39:27PM +0200, Mateusz Guzik wrote:
> > On Wed, Aug 27, 2025 at 12:05:38AM +0300, Alexander Monakov wrote:
> > > Dear fs hackers,
> > >
> > > I suspect there's an unfortunate race window in __fput where file locks are
> > > dropped (locks_remove_file) prior to decreasing writer refcount
> > > (put_file_access). If I'm not mistaken, this window is observable and it
> > > breaks a solution to ETXTBSY problem on exec'ing a just-written file, explained
> > > in more detail below.
> > >
> > > The program demonstrating the problem is attached (a slightly modified version
> > > of the demo given by Russ Cox on the Go issue tracker, see URL in first line).
> > > It makes 20 threads, each executing an infinite loop doing the following:
> > >
> > > 1) open an fd for writing with O_CLOEXEC
> > > 2) write executable code into it
> > > 3) close it
> > > 4) fork
> > > 5) in the child, attempt to execve the just-written file
> > >
> > > If you compile it with -DNOWAIT, you'll see that execve often fails with
> > > ETXTBSY.
> >
> > This problem was reported a few times and is quite ancient by now.
> >
> > While acknowleding the resulting behavior needs to be fixed, I find the
> > proposed solutions are merely trying to put more lipstick or a wig on a
> > pig.
> >
> > The age of the problem suggests it is not *urgent* to fix it.
> >
> > The O_CLOFORM idea was accepted into POSIX and recent-ish implemented in
> > all the BSDs (no, really) and illumos, but got NAKed in Linux. It's also
> > a part of pig's attire so I think that's the right call.
> >
> > Not denying execs of files open for writing had to get reverted as
> > apparently some software depends on it, so that's a no-go either.
> >
> > The flag proposed by Christian elsewhere in the thread would sort this
> > out, but it's just another hack which would serve no purpose if the
> > issue stopped showing up.
> >
> > The real problem is fork()+execve() combo being crap syscalls with crap
> > semantics, perpetuating the unix tradition of screwing you over unless
> > you explicitly ask it not to (e.g., with O_CLOEXEC so that the new proc
> > does not hang out with surprise fds).
> >
> > While I don't have anything fleshed out nor have any interest in putting
> > any work in the area, I would suggest anyone looking to solve the ETXTBSY
> > went after the real culprit instead of damage-controlling the current
> > API.
> >
> > To that end, my sketch of a suggestion boils down to a new API which
> > allows you to construct a new process one step at a time explicitly
> > spelling out resources which are going to get passed on, finally doing
> > an actual exec. You would start with getting a file descriptor to a new
> > task_struct which you gradually populate and eventually exec something
> > on. There would be no forking.
> >
> > It could look like this (ignore specific naming):
> >
> > /* get a file descriptor for the new process. there is no *fork* here,
> >  * but task_struct & related get allocated
> >  * clean slate, no sigmask bullshit and similar
> >  */
> > pfd = proc_new();
> >
> > nullfd = open("/dev/null", O_RDONLY);
> >
> > /* map /dev/null as 0/1/2 in the new proc */
> > proc_install_fd(pfd, nullfd, 0);
> > proc_install_fd(pfd, nullfd, 2);
> > proc_install_fd(pfd, nullfd, 2);
> >
> > /* if we can run the proc as someone else, set it up here */
> > proc_install_cred(pfd, uid, gid, groups, ...);
> >
> > proc_set_umask(pfd, ...);
> >
> > /* finally exec */
> > proc_exec_by_path("/bin/sh", argp, envp);
>
> You can trivially build this API on top of pidfs. Like:
>
> pidfd_empty = pidfd_open(FD_PIDFS_ROOT/FD_INVALID, PIDFD_EMPTY)
>
> where FD_PIDFS_ROOT and FD_INVALID are things we already have in the
> uapi headers.
>
> Then either just add a new system call like pidfd_config() or just use
> ioctls() on that empty pidfd.
>
> With pidfs you have the complete freedom to implement that api however
> you want.
>
> I had a prototype for that as well but I can't find it anymore. Other
> VFS work took priority and so I never finished it but I remember it
> wasn't very difficult.
>
> I would definitely merge a patch series like that.
>

I'm happy we are on the same page here, I'm unhappy it is down to the
point of neither of us doing the work though. :-P

I don't think the work is difficult from tech standpoint, but it does
warrant some caution.

With the current model fork + exec model, whatever process-specific
bullshit gets added, you automatically get it on fork.

Something which gradually spells out the state will need a way to
figure out what was not sorted out and fill it up in kernel-side.
There maybe some other caveats, basically it would be good if
someone(tm) really thought this through. It would be really
embarrassing to end up with a deficient & unfixable API which has to
remain supported. :-P

I offer half of the kingdom for a finished product, the interested
party will have to arrange a bride from elsewhere.

> >
> > Notice how not once at any point random-ass file descriptors popped into
> > the new task, which has a side effect of completely avoiding the
> > problem.
> >
> > you may also notice this should be faster to execute as it does not have
> > to pay the mm overhead.
> >
> > While proc_install_fd is spelled out as singular syscalls, this can be
> > batched to accept an array of <from, to> pairs etc.
> >
> > Also notice the thread executing it is not shackled by any of vfork
> > limitations.
> >
> > So... if someone is serious about the transient ETXTBSY, I would really
> > hope you will consider solving the source of the problem, even if you
> > come up with someting other than I did (hopefully better). It would be a
> > damn shame to add even more hacks to pacify this problem (like the O_
> > stuff).
> >
> > What to do in the meantime? There is a lol hack you can do in userspace
> > which so ugly I'm not even going to spell it out, but given the
> > temporary nature of ETXTBSY I'm sure you can guess what it is.
> >
> > Something to ponder, cheers.



-- 
Mateusz Guzik <mjguzik gmail.com>

      reply	other threads:[~2025-09-02  8:45 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-08-26 21:05 ETXTBSY window in __fput Alexander Monakov
2025-08-26 22:00 ` Al Viro
2025-08-27  7:22   ` Alexander Monakov
2025-08-27 11:52     ` Theodore Ts'o
2025-08-27 13:05       ` Alexander Monakov
2025-08-31 19:22         ` David Laight
2025-09-01  8:44           ` Jan Kara
2025-08-27 13:16     ` Aleksa Sarai
2025-08-27 14:29       ` Alexander Monakov
2025-08-29  7:21 ` Alexander Monakov
2025-08-29  9:47   ` Christian Brauner
2025-08-29 10:17     ` Alexander Monakov
2025-08-29 11:07       ` Christian Brauner
2025-08-29 11:45         ` Alexander Monakov
2025-08-29 14:02           ` Jan Kara
2025-09-01 17:53             ` Alexander Monakov
2025-09-02 10:36               ` Jan Kara
2025-08-29 18:32 ` Colin Walters
2025-09-01 18:39 ` Mateusz Guzik
2025-09-01 19:57   ` Colin Walters
2025-09-01 20:22     ` Mateusz Guzik
2025-09-02  8:33   ` Christian Brauner
2025-09-02  8:44     ` Mateusz Guzik [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAGudoHHV6U5TobvSobzSCor1nN8tb9Ez0fYKxc9+OZtP9EgE-w@mail.gmail.com \
    --to=mjguzik@gmail.com \
    --cc=amonakov@ispras.ru \
    --cc=brauner@kernel.org \
    --cc=jack@suse.cz \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).