* [syzbot] [fs?] [mm?] KCSAN: data-race in bprm_execve / copy_fs (4) @ 2025-03-20 19:09 syzbot 2025-03-20 20:09 ` Kees Cook 2025-03-24 16:00 ` [PATCH] exec: fix the racy usage of fs_struct->in_exec Oleg Nesterov 0 siblings, 2 replies; 30+ messages in thread From: syzbot @ 2025-03-20 19:09 UTC (permalink / raw) To: brauner, jack, kees, linux-fsdevel, linux-kernel, linux-mm, syzkaller-bugs, viro Hello, syzbot found the following issue on: HEAD commit: a7f2e10ecd8f Merge tag 'hwmon-fixes-for-v6.14-rc8/6.14' of.. git tree: upstream console output: https://syzkaller.appspot.com/x/log.txt?x=114fee98580000 kernel config: https://syzkaller.appspot.com/x/.config?x=f33d372c4021745 dashboard link: https://syzkaller.appspot.com/bug?extid=1c486d0b62032c82a968 compiler: Debian clang version 15.0.6, GNU ld (GNU Binutils for Debian) 2.40 Unfortunately, I don't have any reproducer for this issue yet. Downloadable assets: disk image: https://storage.googleapis.com/syzbot-assets/614aabc71b48/disk-a7f2e10e.raw.xz vmlinux: https://storage.googleapis.com/syzbot-assets/d47dd90a010a/vmlinux-a7f2e10e.xz kernel image: https://storage.googleapis.com/syzbot-assets/418d8cf8782b/bzImage-a7f2e10e.xz IMPORTANT: if you fix the issue, please add the following tag to the commit: Reported-by: syzbot+1c486d0b62032c82a968@syzkaller.appspotmail.com ================================================================== BUG: KCSAN: data-race in bprm_execve / copy_fs write to 0xffff8881044f8250 of 4 bytes by task 13692 on cpu 0: bprm_execve+0x748/0x9c0 fs/exec.c:1884 do_execveat_common+0x769/0x7e0 fs/exec.c:1966 do_execveat fs/exec.c:2051 [inline] __do_sys_execveat fs/exec.c:2125 [inline] __se_sys_execveat fs/exec.c:2119 [inline] __x64_sys_execveat+0x75/0x90 fs/exec.c:2119 x64_sys_call+0x291e/0x2dc0 arch/x86/include/generated/asm/syscalls_64.h:323 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xc9/0x1c0 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f read to 0xffff8881044f8250 of 4 bytes by task 13686 on cpu 1: copy_fs+0x95/0xf0 kernel/fork.c:1770 copy_process+0xc91/0x1f50 kernel/fork.c:2394 kernel_clone+0x167/0x5e0 kernel/fork.c:2815 __do_sys_clone3 kernel/fork.c:3119 [inline] __se_sys_clone3+0x1c1/0x200 kernel/fork.c:3098 __x64_sys_clone3+0x31/0x40 kernel/fork.c:3098 x64_sys_call+0x2d56/0x2dc0 arch/x86/include/generated/asm/syscalls_64.h:436 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xc9/0x1c0 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f value changed: 0x00000001 -> 0x00000000 Reported by Kernel Concurrency Sanitizer on: CPU: 1 UID: 0 PID: 13686 Comm: syz.1.3826 Not tainted 6.14.0-rc7-syzkaller-00074-ga7f2e10ecd8f #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/12/2025 ================================================================== --- This report is generated by a bot. It may contain errors. See https://goo.gl/tpsmEJ for more information about syzbot. syzbot engineers can be reached at syzkaller@googlegroups.com. syzbot will keep track of this issue. See: https://goo.gl/tpsmEJ#status for how to communicate with syzbot. 
If the report is already addressed, let syzbot know by replying with: #syz fix: exact-commit-title If you want to overwrite report's subsystems, reply with: #syz set subsystems: new-subsystem (See the list of subsystem names on the web dashboard) If the report is a duplicate of another one, reply with: #syz dup: exact-subject-of-another-report If you want to undo deduplication, reply with: #syz undup ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [syzbot] [fs?] [mm?] KCSAN: data-race in bprm_execve / copy_fs (4) 2025-03-20 19:09 [syzbot] [fs?] [mm?] KCSAN: data-race in bprm_execve / copy_fs (4) syzbot @ 2025-03-20 20:09 ` Kees Cook 2025-03-21 1:44 ` Al Viro 2025-03-21 8:45 ` Christian Brauner 2025-03-24 16:00 ` [PATCH] exec: fix the racy usage of fs_struct->in_exec Oleg Nesterov 1 sibling, 2 replies; 30+ messages in thread From: Kees Cook @ 2025-03-20 20:09 UTC (permalink / raw) To: Oleg Nesterov, brauner Cc: jack, linux-fsdevel, linux-kernel, linux-mm, syzkaller-bugs, viro, syzbot Hey look another threaded exec bug. :| On Thu, Mar 20, 2025 at 12:09:36PM -0700, syzbot wrote: > ================================================================== > BUG: KCSAN: data-race in bprm_execve / copy_fs > > write to 0xffff8881044f8250 of 4 bytes by task 13692 on cpu 0: > bprm_execve+0x748/0x9c0 fs/exec.c:1884 This is: current->fs->in_exec = 0; And is part of the execve failure path: out: ... if (bprm->point_of_no_return && !fatal_signal_pending(current)) force_fatal_sig(SIGSEGV); sched_mm_cid_after_execve(current); current->fs->in_exec = 0; current->in_execve = 0; return retval; } > do_execveat_common+0x769/0x7e0 fs/exec.c:1966 > do_execveat fs/exec.c:2051 [inline] > __do_sys_execveat fs/exec.c:2125 [inline] > __se_sys_execveat fs/exec.c:2119 [inline] > __x64_sys_execveat+0x75/0x90 fs/exec.c:2119 > x64_sys_call+0x291e/0x2dc0 arch/x86/include/generated/asm/syscalls_64.h:323 > do_syscall_x64 arch/x86/entry/common.c:52 [inline] > do_syscall_64+0xc9/0x1c0 arch/x86/entry/common.c:83 > entry_SYSCALL_64_after_hwframe+0x77/0x7f > > read to 0xffff8881044f8250 of 4 bytes by task 13686 on cpu 1: > copy_fs+0x95/0xf0 kernel/fork.c:1770 This is: if (fs->in_exec) { Which is under lock: struct fs_struct *fs = current->fs; if (clone_flags & CLONE_FS) { /* tsk->fs is already what we want */ spin_lock(&fs->lock); /* "users" and "in_exec" locked for check_unsafe_exec() * */ if (fs->in_exec) { spin_unlock(&fs->lock); return -EAGAIN; } fs->users++; spin_unlock(&fs->lock); Does execve need to be taking this lock? The other thing touching it is check_unsafe_exec(), which takes the lock. It looks like the bprm_execve() lock was removed in commit 8c652f96d385 ("do_execve() must not clear fs->in_exec if it was set by another thread") which used the return value from check_unsafe_exec(): When do_execve() succeeds, it is safe to clear ->in_exec unconditionally. It can be set only if we don't share ->fs with another process, and since we already killed all sub-threads either ->in_exec == 0 or we are the only user of this ->fs. Also, we do not need fs->lock to clear fs->in_exec. This logic was updated in commit 9e00cdb091b0 ("exec:check_unsafe_exec: kill the dead -EAGAIN and clear_in_exec logic"), which includes this rationale: 2. "out_unmark:" in do_execve_common() is either called under ->cred_guard_mutex, or after de_thread() which kills other threads, so we can't race with sub-thread which could set ->in_exec. And if ->fs is shared with another process ->in_exec should be false anyway. The de_thread() is part of the "point of no return" in exec_binprm(), called via exec_binprm(). But the bprm_execve() error path is reachable from many paths prior to the point of no return. What I can imagine here is two failing execs racing a fork: A start execve B fork with CLONE_FS C start execve, reach check_unsafe_exec(), set fs->in_exec A bprm_execve() failure, clear fs->in_exec B copy_fs() increment fs->users. 
C bprm_execve() failure, clear fs->in_exec But I don't think this is a "real" flaw, though, since the locking is to protect a _successful_ execve from a fork (i.e. getting the user count right). A successful execve will de_thread, and I don't see any wrong counting of fs->users with regard to thread lifetime. Did I miss something in the analysis? Should we perform locking anyway, or add data race annotations, or something else? -- Kees Cook ^ permalink raw reply [flat|nested] 30+ messages in thread
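For reference, the check being discussed above lives in check_unsafe_exec(); here is a trimmed sketch of it, abridged from fs/exec.c as of the kernels in this thread (the ptrace/no_new_privs handling and other unrelated checks are dropped, and details may vary slightly by kernel version):

static void check_unsafe_exec(struct linux_binprm *bprm)
{
	struct task_struct *p = current, *t;
	unsigned n_fs = 1;

	/* ... ptrace / no_new_privs handling trimmed ... */

	spin_lock(&p->fs->lock);
	rcu_read_lock();
	/* count how many of our own threads share ->fs */
	for_other_threads(p, t) {
		if (t->fs == p->fs)
			n_fs++;
	}
	rcu_read_unlock();

	/* "users" and "in_exec" locked for copy_fs() */
	if (p->fs->users > n_fs)
		/* someone outside this thread group shares ->fs */
		bprm->unsafe |= LSM_UNSAFE_SHARE;
	else
		p->fs->in_exec = 1;
	spin_unlock(&p->fs->lock);
}

copy_fs(), quoted above, is the consumer of this flag: it refuses clone(CLONE_FS) with -EAGAIN while in_exec is set, and check_unsafe_exec() only sets the flag when nobody outside the exec'ing thread group shares ->fs.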
* Re: [syzbot] [fs?] [mm?] KCSAN: data-race in bprm_execve / copy_fs (4) 2025-03-20 20:09 ` Kees Cook @ 2025-03-21 1:44 ` Al Viro 2025-03-21 8:10 ` Kees Cook 2025-03-21 8:45 ` Christian Brauner 1 sibling, 1 reply; 30+ messages in thread From: Al Viro @ 2025-03-21 1:44 UTC (permalink / raw) To: Kees Cook Cc: Oleg Nesterov, brauner, jack, linux-fsdevel, linux-kernel, linux-mm, syzkaller-bugs, syzbot On Thu, Mar 20, 2025 at 01:09:38PM -0700, Kees Cook wrote: > What I can imagine here is two failing execs racing a fork: > > A start execve > B fork with CLONE_FS > C start execve, reach check_unsafe_exec(), set fs->in_exec > A bprm_execve() failure, clear fs->in_exec > B copy_fs() increment fs->users. > C bprm_execve() failure, clear fs->in_exec > > But I don't think this is a "real" flaw, though, since the locking is to > protect a _successful_ execve from a fork (i.e. getting the user count > right). A successful execve will de_thread, and I don't see any wrong > counting of fs->users with regard to thread lifetime. > > Did I miss something in the analysis? Should we perform locking anyway, > or add data race annotations, or something else? Umm... What if C succeeds, ending up with suid sharing ->fs? ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [syzbot] [fs?] [mm?] KCSAN: data-race in bprm_execve / copy_fs (4) 2025-03-21 1:44 ` Al Viro @ 2025-03-21 8:10 ` Kees Cook 2025-03-21 8:49 ` Christian Brauner 0 siblings, 1 reply; 30+ messages in thread From: Kees Cook @ 2025-03-21 8:10 UTC (permalink / raw) To: Al Viro Cc: Oleg Nesterov, brauner, jack, linux-fsdevel, linux-kernel, linux-mm, syzkaller-bugs, syzbot On Fri, Mar 21, 2025 at 01:44:23AM +0000, Al Viro wrote: > On Thu, Mar 20, 2025 at 01:09:38PM -0700, Kees Cook wrote: > > > What I can imagine here is two failing execs racing a fork: > > > > A start execve > > B fork with CLONE_FS > > C start execve, reach check_unsafe_exec(), set fs->in_exec > > A bprm_execve() failure, clear fs->in_exec > > B copy_fs() increment fs->users. > > C bprm_execve() failure, clear fs->in_exec > > > > But I don't think this is a "real" flaw, though, since the locking is to > > protect a _successful_ execve from a fork (i.e. getting the user count > > right). A successful execve will de_thread, and I don't see any wrong > > counting of fs->users with regard to thread lifetime. > > > > Did I miss something in the analysis? Should we perform locking anyway, > > or add data race annotations, or something else? > > Umm... What if C succeeds, ending up with suid sharing ->fs? I still can't quite construct it -- fs->users is always correct, I think? Below would be the bad set of events, but it's wrong that "fs->users==1". If A and C are both running with CLONE_FS then fs->users==2. A would need to exit first, but it can't do that and also set fs->in_exec=0 A execve, reaches bprm_execve() failure path B fork with CLONE_FS, reaches copy_fs() C execve, reaches check_unsafe_exec() C takes fs->lock, counts, finds safe fs->users==1, sets in_exec=1, unlocks A sets fs->in_exec=0 B takes fs->lock, sees in_exec==0, does fs->users++, unlocks C goes setuid, sharing fs with unpriv B Something still feels very weird, though. Does fs->in_exec not matter at all? Hmm, no, it stops fs->users++ happening after it was validated to be 1. -- Kees Cook ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [syzbot] [fs?] [mm?] KCSAN: data-race in bprm_execve / copy_fs (4) 2025-03-21 8:10 ` Kees Cook @ 2025-03-21 8:49 ` Christian Brauner 0 siblings, 0 replies; 30+ messages in thread From: Christian Brauner @ 2025-03-21 8:49 UTC (permalink / raw) To: Kees Cook Cc: Al Viro, Oleg Nesterov, jack, linux-fsdevel, linux-kernel, linux-mm, syzkaller-bugs, syzbot On Fri, Mar 21, 2025 at 01:10:26AM -0700, Kees Cook wrote: > On Fri, Mar 21, 2025 at 01:44:23AM +0000, Al Viro wrote: > > On Thu, Mar 20, 2025 at 01:09:38PM -0700, Kees Cook wrote: > > > > > What I can imagine here is two failing execs racing a fork: > > > > > > A start execve > > > B fork with CLONE_FS > > > C start execve, reach check_unsafe_exec(), set fs->in_exec > > > A bprm_execve() failure, clear fs->in_exec > > > B copy_fs() increment fs->users. > > > C bprm_execve() failure, clear fs->in_exec > > > > > > But I don't think this is a "real" flaw, though, since the locking is to > > > protect a _successful_ execve from a fork (i.e. getting the user count > > > right). A successful execve will de_thread, and I don't see any wrong > > > counting of fs->users with regard to thread lifetime. > > > > > > Did I miss something in the analysis? Should we perform locking anyway, > > > or add data race annotations, or something else? > > > > Umm... What if C succeeds, ending up with suid sharing ->fs? > > I still can't quite construct it -- fs->users is always correct, I > think? > > Below would be the bad set of events, but it's wrong that "fs->users==1". > If A and C are both running with CLONE_FS then fs->users==2. A would need to > exit first, but it can't do that and also set fs->in_exec=0 > > A execve, reaches bprm_execve() failure path > B fork with CLONE_FS, reaches copy_fs() > C execve, reaches check_unsafe_exec() > C takes fs->lock, counts, finds safe fs->users==1, sets in_exec=1, unlocks > A sets fs->in_exec=0 > B takes fs->lock, sees in_exec==0, does fs->users++, unlocks > C goes setuid, sharing fs with unpriv B > > Something still feels very weird, though. Does fs->in_exec not matter at > all? Hmm, no, it stops fs->users++ happening after it was validated to be 1. This is a harmless data race afaict. See my other mail. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [syzbot] [fs?] [mm?] KCSAN: data-race in bprm_execve / copy_fs (4) 2025-03-20 20:09 ` Kees Cook 2025-03-21 1:44 ` Al Viro @ 2025-03-21 8:45 ` Christian Brauner 2025-03-22 1:00 ` Al Viro 1 sibling, 1 reply; 30+ messages in thread From: Christian Brauner @ 2025-03-21 8:45 UTC (permalink / raw) To: Kees Cook Cc: Oleg Nesterov, jack, linux-fsdevel, linux-kernel, linux-mm, syzkaller-bugs, viro, syzbot On Thu, Mar 20, 2025 at 01:09:38PM -0700, Kees Cook wrote: > Hey look another threaded exec bug. :| > > On Thu, Mar 20, 2025 at 12:09:36PM -0700, syzbot wrote: > > ================================================================== > > BUG: KCSAN: data-race in bprm_execve / copy_fs > > > > write to 0xffff8881044f8250 of 4 bytes by task 13692 on cpu 0: > > bprm_execve+0x748/0x9c0 fs/exec.c:1884 > > This is: > > current->fs->in_exec = 0; > > And is part of the execve failure path: > > out: > ... > if (bprm->point_of_no_return && !fatal_signal_pending(current)) > force_fatal_sig(SIGSEGV); > > sched_mm_cid_after_execve(current); > current->fs->in_exec = 0; > current->in_execve = 0; > > return retval; > } > > > do_execveat_common+0x769/0x7e0 fs/exec.c:1966 > > do_execveat fs/exec.c:2051 [inline] > > __do_sys_execveat fs/exec.c:2125 [inline] > > __se_sys_execveat fs/exec.c:2119 [inline] > > __x64_sys_execveat+0x75/0x90 fs/exec.c:2119 > > x64_sys_call+0x291e/0x2dc0 arch/x86/include/generated/asm/syscalls_64.h:323 > > do_syscall_x64 arch/x86/entry/common.c:52 [inline] > > do_syscall_64+0xc9/0x1c0 arch/x86/entry/common.c:83 > > entry_SYSCALL_64_after_hwframe+0x77/0x7f > > > > read to 0xffff8881044f8250 of 4 bytes by task 13686 on cpu 1: > > copy_fs+0x95/0xf0 kernel/fork.c:1770 > > This is: > > if (fs->in_exec) { > > Which is under lock: > > struct fs_struct *fs = current->fs; > if (clone_flags & CLONE_FS) { > /* tsk->fs is already what we want */ > spin_lock(&fs->lock); > /* "users" and "in_exec" locked for check_unsafe_exec() * */ > if (fs->in_exec) { > spin_unlock(&fs->lock); > return -EAGAIN; > } > fs->users++; > spin_unlock(&fs->lock); > > > Does execve need to be taking this lock? The other thing touching it is > check_unsafe_exec(), which takes the lock. It looks like the bprm_execve() > lock was removed in commit 8c652f96d385 ("do_execve() must not clear > fs->in_exec if it was set by another thread") which used the return > value from check_unsafe_exec(): > > When do_execve() succeeds, it is safe to clear ->in_exec unconditionally. > It can be set only if we don't share ->fs with another process, and since > we already killed all sub-threads either ->in_exec == 0 or we are the > only user of this ->fs. > > Also, we do not need fs->lock to clear fs->in_exec. > > This logic was updated in commit 9e00cdb091b0 ("exec:check_unsafe_exec: > kill the dead -EAGAIN and clear_in_exec logic"), which includes this > rationale: > > 2. "out_unmark:" in do_execve_common() is either called > under ->cred_guard_mutex, or after de_thread() which > kills other threads, so we can't race with sub-thread > which could set ->in_exec. And if ->fs is shared with > another process ->in_exec should be false anyway. > > The de_thread() is part of the "point of no return" in exec_binprm(), > called via exec_binprm(). But the bprm_execve() error path is reachable > from many paths prior to the point of no return. 
> > What I can imagine here is two failing execs racing a fork: > > A start execve > B fork with CLONE_FS > C start execve, reach check_unsafe_exec(), set fs->in_exec > A bprm_execve() failure, clear fs->in_exec > B copy_fs() increment fs->users. > C bprm_execve() failure, clear fs->in_exec > > But I don't think this is a "real" flaw, though, since the locking is to > protect a _successful_ execve from a fork (i.e. getting the user count > right). A successful execve will de_thread, and I don't see any wrong > counting of fs->users with regard to thread lifetime. > > Did I miss something in the analysis? Should we perform locking anyway, > or add data race annotations, or something else? Afaict, the only way this data race can happen is if we jump to the cleanup label and then reset current->fs->in_exec. If the execve was successful there's no one to race us with CLONE_FS obviously because we took down all other threads. I think the logic in commit 9e00cdb091b0 ("exec:check_unsafe_exec: kill the dead -EAGAIN and clear_in_exec logic") is sound. This is a harmless data race that can only happen if the execve fails. The worst that can happen is that a subthread does clone(CLONE_FS) and gets a spurious error because it raced with the exec'ing subthread resetting fs->in_exec. So I think all we need is: diff --git a/fs/exec.c b/fs/exec.c index 506cd411f4ac..177acaf196a9 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1881,7 +1881,13 @@ static int bprm_execve(struct linux_binprm *bprm) force_fatal_sig(SIGSEGV); sched_mm_cid_after_execve(current); - current->fs->in_exec = 0; + /* + * If this execve failed before de_thread() and another + * subthread is concurrently forking with CLONE_FS they race + * with us resetting current->fs->in_exec. This is fine, + * annotate it. + */ + data_race(current->fs->in_exec = 1); current->in_execve = 0; return retval; ^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [syzbot] [fs?] [mm?] KCSAN: data-race in bprm_execve / copy_fs (4) 2025-03-21 8:45 ` Christian Brauner @ 2025-03-22 1:00 ` Al Viro 2025-03-22 6:26 ` Kees Cook ` (2 more replies) 0 siblings, 3 replies; 30+ messages in thread From: Al Viro @ 2025-03-22 1:00 UTC (permalink / raw) To: Christian Brauner Cc: Kees Cook, Oleg Nesterov, jack, linux-fsdevel, linux-kernel, linux-mm, syzkaller-bugs, syzbot On Fri, Mar 21, 2025 at 09:45:39AM +0100, Christian Brauner wrote: > Afaict, the only way this data race can happen is if we jump to the > cleanup label and then reset current->fs->in_exec. If the execve was > successful there's no one to race us with CLONE_FS obviously because we > took down all other threads. Not really. 1) A enters check_unsafe_execve(), sets ->in_exec to 1 2) B enters check_unsafe_execve(), sets ->in_exec to 1 3) A calls exec_binprm(), fails (bad binary) 4) A clears ->in_exec 5) C calls clone(2) with CLONE_FS and spawns D - ->in_exec is 0 6) B gets through exec_binprm(), kills A and C, but not D. 7) B clears ->in_exec, returns Result: B and D share ->fs, B runs suid binary. Had (5) happened prior to (2), (2) wouldn't have set ->in_exec; had (5) happened prior to (4), clone() would've failed; had (5) been delayed past (6), there wouldn't have been a thread to call clone(). But in the window between (4) and (6), clone() doesn't see execve() in progress and check_unsafe_execve() has already been done, so it hadn't seen the extra thread. IOW, it really is racy. It's a counter, not a flag. ^ permalink raw reply [flat|nested] 30+ messages in thread
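To make the arrangement of actors concrete, here is a structural userspace sketch of the scenario Al describes. It only shows how A, B, C and D relate; it is not a working reproducer (the window between steps (4) and (6) is narrow and timing dependent), and the binary paths are placeholders -- the suid target in particular is assumed to exist.

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SZ (64 * 1024)

/* D: a separate process sharing ->fs with the exec'ing thread group */
static int fs_sharing_child(void *arg)
{
	(void)arg;
	usleep(100 * 1000);	/* in the bad interleaving, D outlives B's de_thread() */
	return 0;
}

/* A: keep exec'ing a binary that fails, driving bprm_execve()'s error path */
static void *thread_a(void *arg)
{
	(void)arg;
	for (;;)
		execve("/tmp/always-fails", (char *[]){ "x", NULL },
		       (char *[]){ NULL });
	return NULL;
}

/* C: keep doing clone(CLONE_FS), hoping to land between steps (4) and (6) */
static void *thread_c(void *arg)
{
	char *stack = malloc(STACK_SZ);
	pid_t pid;

	(void)arg;
	for (;;) {
		pid = clone(fs_sharing_child, stack + STACK_SZ,
			    CLONE_FS | SIGCHLD, NULL);
		if (pid > 0)
			waitpid(pid, NULL, 0);
	}
	return NULL;
}

int main(void)
{
	pthread_t a, c;

	pthread_create(&a, NULL, thread_a, NULL);
	pthread_create(&c, NULL, thread_c, NULL);

	/*
	 * B: exec a set-uid binary.  On the bad interleaving it succeeds
	 * while a freshly cloned D still shares ->fs with the new image.
	 */
	execve("/tmp/some-suid-binary", (char *[]){ "suid", NULL },
	       (char *[]){ NULL });
	perror("execve");
	return 1;
}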
* Re: [syzbot] [fs?] [mm?] KCSAN: data-race in bprm_execve / copy_fs (4) 2025-03-22 1:00 ` Al Viro @ 2025-03-22 6:26 ` Kees Cook 2025-03-22 10:15 ` Mateusz Guzik 2025-03-22 10:23 ` Christian Brauner 2025-03-22 15:55 ` Oleg Nesterov 2 siblings, 1 reply; 30+ messages in thread From: Kees Cook @ 2025-03-22 6:26 UTC (permalink / raw) To: Al Viro Cc: Christian Brauner, Oleg Nesterov, jack, linux-fsdevel, linux-kernel, linux-mm, syzkaller-bugs, syzbot On Sat, Mar 22, 2025 at 01:00:08AM +0000, Al Viro wrote: > On Fri, Mar 21, 2025 at 09:45:39AM +0100, Christian Brauner wrote: > > > Afaict, the only way this data race can happen is if we jump to the > > cleanup label and then reset current->fs->in_exec. If the execve was > > successful there's no one to race us with CLONE_FS obviously because we > > took down all other threads. > > Not really. Yeah, you found it. Thank you! > 1) A enters check_unsafe_execve(), sets ->in_exec to 1 > 2) B enters check_unsafe_execve(), sets ->in_exec to 1 With 3 threads A, B, and C already running, fs->users == 3, so steps (1) and (2) happily pass. > 3) A calls exec_binprm(), fails (bad binary) > 4) A clears ->in_exec > 5) C calls clone(2) with CLONE_FS and spawns D - ->in_exec is 0 D's creation bumps fs->users == 4. > 6) B gets through exec_binprm(), kills A and C, but not D. > 7) B clears ->in_exec, returns > > Result: B and D share ->fs, B runs suid binary. > > Had (5) happened prior to (2), (2) wouldn't have set ->in_exec; > had (5) happened prior to (4), clone() would've failed; had > (5) been delayed past (6), there wouldn't have been a thread > to call clone(). > > But in the window between (4) and (6), clone() doesn't see > execve() in progress and check_unsafe_execve() has already > been done, so it hadn't seen the extra thread. > > IOW, it really is racy. It's a counter, not a flag. Yeah, I would agree. 
Totally untested patch: diff --git a/fs/exec.c b/fs/exec.c index 506cd411f4ac..988b8621c079 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1632,7 +1632,7 @@ static void check_unsafe_exec(struct linux_binprm *bprm) if (p->fs->users > n_fs) bprm->unsafe |= LSM_UNSAFE_SHARE; else - p->fs->in_exec = 1; + refcount_inc(&p->fs->in_exec); spin_unlock(&p->fs->lock); } @@ -1862,7 +1862,7 @@ static int bprm_execve(struct linux_binprm *bprm) sched_mm_cid_after_execve(current); /* execve succeeded */ - current->fs->in_exec = 0; + refcount_dec(¤t->fs->in_exec); current->in_execve = 0; rseq_execve(current); user_events_execve(current); @@ -1881,7 +1881,7 @@ static int bprm_execve(struct linux_binprm *bprm) force_fatal_sig(SIGSEGV); sched_mm_cid_after_execve(current); - current->fs->in_exec = 0; + refcount_dec(¤t->fs->in_exec); current->in_execve = 0; return retval; diff --git a/fs/fs_struct.c b/fs/fs_struct.c index 64c2d0814ed6..df46b873c425 100644 --- a/fs/fs_struct.c +++ b/fs/fs_struct.c @@ -115,7 +115,7 @@ struct fs_struct *copy_fs_struct(struct fs_struct *old) /* We don't need to lock fs - think why ;-) */ if (fs) { fs->users = 1; - fs->in_exec = 0; + fs->in_exec = REFCOUNT_INIT(0); spin_lock_init(&fs->lock); seqcount_spinlock_init(&fs->seq, &fs->lock); fs->umask = old->umask; diff --git a/include/linux/fs_struct.h b/include/linux/fs_struct.h index 783b48dedb72..aebc0b7aedb9 100644 --- a/include/linux/fs_struct.h +++ b/include/linux/fs_struct.h @@ -11,7 +11,7 @@ struct fs_struct { spinlock_t lock; seqcount_spinlock_t seq; int umask; - int in_exec; + refcount_t in_exec; struct path root, pwd; } __randomize_layout; diff --git a/kernel/fork.c b/kernel/fork.c index 735405a9c5f3..8b427045fd86 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1767,7 +1767,7 @@ static int copy_fs(unsigned long clone_flags, struct task_struct *tsk) /* tsk->fs is already what we want */ spin_lock(&fs->lock); /* "users" and "in_exec" locked for check_unsafe_exec() */ - if (fs->in_exec) { + if (refcount_read(&fs->in_exec)) { spin_unlock(&fs->lock); return -EAGAIN; } -- Kees Cook ^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [syzbot] [fs?] [mm?] KCSAN: data-race in bprm_execve / copy_fs (4) 2025-03-22 6:26 ` Kees Cook @ 2025-03-22 10:15 ` Mateusz Guzik 2025-03-22 10:28 ` Christian Brauner 0 siblings, 1 reply; 30+ messages in thread From: Mateusz Guzik @ 2025-03-22 10:15 UTC (permalink / raw) To: Kees Cook Cc: Al Viro, Christian Brauner, Oleg Nesterov, jack, linux-fsdevel, linux-kernel, linux-mm, syzkaller-bugs, syzbot On Fri, Mar 21, 2025 at 11:26:03PM -0700, Kees Cook wrote: > On Sat, Mar 22, 2025 at 01:00:08AM +0000, Al Viro wrote: > > On Fri, Mar 21, 2025 at 09:45:39AM +0100, Christian Brauner wrote: > > > > > Afaict, the only way this data race can happen is if we jump to the > > > cleanup label and then reset current->fs->in_exec. If the execve was > > > successful there's no one to race us with CLONE_FS obviously because we > > > took down all other threads. > > > > Not really. > > Yeah, you found it. Thank you! > > > 1) A enters check_unsafe_execve(), sets ->in_exec to 1 > > 2) B enters check_unsafe_execve(), sets ->in_exec to 1 > > With 3 threads A, B, and C already running, fs->users == 3, so steps (1) > and (2) happily pass. > > > 3) A calls exec_binprm(), fails (bad binary) > > 4) A clears ->in_exec > > 5) C calls clone(2) with CLONE_FS and spawns D - ->in_exec is 0 > > D's creation bumps fs->users == 4. > > > 6) B gets through exec_binprm(), kills A and C, but not D. > > 7) B clears ->in_exec, returns > > > > Result: B and D share ->fs, B runs suid binary. > > > > Had (5) happened prior to (2), (2) wouldn't have set ->in_exec; > > had (5) happened prior to (4), clone() would've failed; had > > (5) been delayed past (6), there wouldn't have been a thread > > to call clone(). > > > > But in the window between (4) and (6), clone() doesn't see > > execve() in progress and check_unsafe_execve() has already > > been done, so it hadn't seen the extra thread. > > > > IOW, it really is racy. It's a counter, not a flag. > > Yeah, I would agree. Totally untested patch: > > diff --git a/fs/exec.c b/fs/exec.c > index 506cd411f4ac..988b8621c079 100644 > --- a/fs/exec.c > +++ b/fs/exec.c > @@ -1632,7 +1632,7 @@ static void check_unsafe_exec(struct linux_binprm *bprm) > if (p->fs->users > n_fs) > bprm->unsafe |= LSM_UNSAFE_SHARE; > else > - p->fs->in_exec = 1; > + refcount_inc(&p->fs->in_exec); > spin_unlock(&p->fs->lock); > } > > @@ -1862,7 +1862,7 @@ static int bprm_execve(struct linux_binprm *bprm) > > sched_mm_cid_after_execve(current); > /* execve succeeded */ > - current->fs->in_exec = 0; > + refcount_dec(¤t->fs->in_exec); > current->in_execve = 0; > rseq_execve(current); > user_events_execve(current); > @@ -1881,7 +1881,7 @@ static int bprm_execve(struct linux_binprm *bprm) > force_fatal_sig(SIGSEGV); > > sched_mm_cid_after_execve(current); > - current->fs->in_exec = 0; > + refcount_dec(¤t->fs->in_exec); > current->in_execve = 0; > > return retval; The bump is conditional and with this patch you may be issuing refcount_dec when you declined to refcount_inc. A special case where there are others to worry about and which proceeds with an exec without leaving in any indicators is imo sketchy. I would argue it would make the most sense to serialize these execs. Vast majority of programs are single-threaded when they exec with an unshared ->fs, so they don't need to bear any overhead nor complexity modulo a branch. For any fucky case you can park yourself waiting for any pending exec to finish. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [syzbot] [fs?] [mm?] KCSAN: data-race in bprm_execve / copy_fs (4) 2025-03-22 10:15 ` Mateusz Guzik @ 2025-03-22 10:28 ` Christian Brauner 0 siblings, 0 replies; 30+ messages in thread From: Christian Brauner @ 2025-03-22 10:28 UTC (permalink / raw) To: Mateusz Guzik Cc: Kees Cook, Al Viro, Oleg Nesterov, jack, linux-fsdevel, linux-kernel, linux-mm, syzkaller-bugs, syzbot On Sat, Mar 22, 2025 at 11:15:44AM +0100, Mateusz Guzik wrote: > On Fri, Mar 21, 2025 at 11:26:03PM -0700, Kees Cook wrote: > > On Sat, Mar 22, 2025 at 01:00:08AM +0000, Al Viro wrote: > > > On Fri, Mar 21, 2025 at 09:45:39AM +0100, Christian Brauner wrote: > > > > > > > Afaict, the only way this data race can happen is if we jump to the > > > > cleanup label and then reset current->fs->in_exec. If the execve was > > > > successful there's no one to race us with CLONE_FS obviously because we > > > > took down all other threads. > > > > > > Not really. > > > > Yeah, you found it. Thank you! > > > > > 1) A enters check_unsafe_execve(), sets ->in_exec to 1 > > > 2) B enters check_unsafe_execve(), sets ->in_exec to 1 > > > > With 3 threads A, B, and C already running, fs->users == 3, so steps (1) > > and (2) happily pass. > > > > > 3) A calls exec_binprm(), fails (bad binary) > > > 4) A clears ->in_exec > > > 5) C calls clone(2) with CLONE_FS and spawns D - ->in_exec is 0 > > > > D's creation bumps fs->users == 4. > > > > > 6) B gets through exec_binprm(), kills A and C, but not D. > > > 7) B clears ->in_exec, returns > > > > > > Result: B and D share ->fs, B runs suid binary. > > > > > > Had (5) happened prior to (2), (2) wouldn't have set ->in_exec; > > > had (5) happened prior to (4), clone() would've failed; had > > > (5) been delayed past (6), there wouldn't have been a thread > > > to call clone(). > > > > > > But in the window between (4) and (6), clone() doesn't see > > > execve() in progress and check_unsafe_execve() has already > > > been done, so it hadn't seen the extra thread. > > > > > > IOW, it really is racy. It's a counter, not a flag. > > > > Yeah, I would agree. Totally untested patch: > > > > diff --git a/fs/exec.c b/fs/exec.c > > index 506cd411f4ac..988b8621c079 100644 > > --- a/fs/exec.c > > +++ b/fs/exec.c > > @@ -1632,7 +1632,7 @@ static void check_unsafe_exec(struct linux_binprm *bprm) > > if (p->fs->users > n_fs) > > bprm->unsafe |= LSM_UNSAFE_SHARE; > > else > > - p->fs->in_exec = 1; > > + refcount_inc(&p->fs->in_exec); > > spin_unlock(&p->fs->lock); > > } > > > > @@ -1862,7 +1862,7 @@ static int bprm_execve(struct linux_binprm *bprm) > > > > sched_mm_cid_after_execve(current); > > /* execve succeeded */ > > - current->fs->in_exec = 0; > > + refcount_dec(¤t->fs->in_exec); > > current->in_execve = 0; > > rseq_execve(current); > > user_events_execve(current); > > @@ -1881,7 +1881,7 @@ static int bprm_execve(struct linux_binprm *bprm) > > force_fatal_sig(SIGSEGV); > > > > sched_mm_cid_after_execve(current); > > - current->fs->in_exec = 0; > > + refcount_dec(¤t->fs->in_exec); > > current->in_execve = 0; > > > > return retval; > > The bump is conditional and with this patch you may be issuing > refcount_dec when you declined to refcount_inc. > > A special case where there are others to worry about and which proceeds > with an exec without leaving in any indicators is imo sketchy. > > I would argue it would make the most sense to serialize these execs. Yeah, I tend to agree. 
And fwiw, I had proposed somewhere else already that we should start restricting thread-group exec to the thread-group leader because subthread exec is about as ugly as it can get. I'd like to add a sysctl for this with the goal of removing this completely in the future. I think there are very few, if any, legitimate cases for subthread aka non-thread-group-leader exec. It not only complicates things, it also means two tasks switch TIDs, which is super confusing from userspace (it also complicates pidfd polling, but that's an aside). I'll whip up a patch for this once I get back from travel. > > Vast majority of programs are single-threaded when they exec with an unshared ->fs, so they don't need to bear any overhead nor complexity modulo a branch. > > For any fucky case you can park yourself waiting for any pending exec to finish. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [syzbot] [fs?] [mm?] KCSAN: data-race in bprm_execve / copy_fs (4) 2025-03-22 1:00 ` Al Viro 2025-03-22 6:26 ` Kees Cook @ 2025-03-22 10:23 ` Christian Brauner 2025-03-22 15:55 ` Oleg Nesterov 2 siblings, 0 replies; 30+ messages in thread From: Christian Brauner @ 2025-03-22 10:23 UTC (permalink / raw) To: Al Viro Cc: Kees Cook, Oleg Nesterov, jack, linux-fsdevel, linux-kernel, linux-mm, syzkaller-bugs, syzbot On Sat, Mar 22, 2025 at 01:00:08AM +0000, Al Viro wrote: > On Fri, Mar 21, 2025 at 09:45:39AM +0100, Christian Brauner wrote: > > > Afaict, the only way this data race can happen is if we jump to the > > cleanup label and then reset current->fs->in_exec. If the execve was > > successful there's no one to race us with CLONE_FS obviously because we > > took down all other threads. > > Not really. > > 1) A enters check_unsafe_execve(), sets ->in_exec to 1 > 2) B enters check_unsafe_execve(), sets ->in_exec to 1 > 3) A calls exec_binprm(), fails (bad binary) > 4) A clears ->in_exec > 5) C calls clone(2) with CLONE_FS and spawns D - ->in_exec is 0 > 6) B gets through exec_binprm(), kills A and C, but not D. > 7) B clears ->in_exec, returns > > Result: B and D share ->fs, B runs suid binary. > > Had (5) happened prior to (2), (2) wouldn't have set ->in_exec; > had (5) happened prior to (4), clone() would've failed; had > (5) been delayed past (6), there wouldn't have been a thread > to call clone(). > > But in the window between (4) and (6), clone() doesn't see > execve() in progress and check_unsafe_execve() has already > been done, so it hadn't seen the extra thread. Eewww, you're right. That's ugly as hell. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [syzbot] [fs?] [mm?] KCSAN: data-race in bprm_execve / copy_fs (4) 2025-03-22 1:00 ` Al Viro 2025-03-22 6:26 ` Kees Cook 2025-03-22 10:23 ` Christian Brauner @ 2025-03-22 15:55 ` Oleg Nesterov 2025-03-22 18:50 ` Al Viro 2 siblings, 1 reply; 30+ messages in thread From: Oleg Nesterov @ 2025-03-22 15:55 UTC (permalink / raw) To: Al Viro Cc: Christian Brauner, Kees Cook, jack, linux-fsdevel, linux-kernel, linux-mm, syzkaller-bugs, syzbot Quite possibly I am wrong, I need to recall this code, but at first glance... On 03/22, Al Viro wrote: > > Not really. I agree, it is really racy. But, > 1) A enters check_unsafe_execve(), sets ->in_exec to 1 > 2) B enters check_unsafe_execve(), sets ->in_exec to 1 No, check_unsafe_execve() is called with cred_guard_mutex held, see prepare_bprm_creds() > 3) A calls exec_binprm(), fails (bad binary) > 4) A clears ->in_exec So (2) can only happen after A fails and drops cred_guard_mutex. And this means that we just need to ensure that ->in_exec is cleared before this mutex is dropped, no? Something like below? Oleg. --- diff --git a/fs/exec.c b/fs/exec.c index 506cd411f4ac..f8bf3c96e181 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1233,6 +1233,7 @@ int begin_new_exec(struct linux_binprm * bprm) * Make this the only thread in the thread group. */ retval = de_thread(me); + current->fs->in_exec = 0; if (retval) goto out; @@ -1497,6 +1498,8 @@ static void free_bprm(struct linux_binprm *bprm) } free_arg_pages(bprm); if (bprm->cred) { + // for the case exec fails before de_thread() + current->fs->in_exec = 0; mutex_unlock(¤t->signal->cred_guard_mutex); abort_creds(bprm->cred); } @@ -1862,7 +1865,6 @@ static int bprm_execve(struct linux_binprm *bprm) sched_mm_cid_after_execve(current); /* execve succeeded */ - current->fs->in_exec = 0; current->in_execve = 0; rseq_execve(current); user_events_execve(current); @@ -1881,7 +1883,6 @@ static int bprm_execve(struct linux_binprm *bprm) force_fatal_sig(SIGSEGV); sched_mm_cid_after_execve(current); - current->fs->in_exec = 0; current->in_execve = 0; return retval; ^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [syzbot] [fs?] [mm?] KCSAN: data-race in bprm_execve / copy_fs (4) 2025-03-22 15:55 ` Oleg Nesterov @ 2025-03-22 18:50 ` Al Viro 2025-03-23 18:14 ` Oleg Nesterov 0 siblings, 1 reply; 30+ messages in thread From: Al Viro @ 2025-03-22 18:50 UTC (permalink / raw) To: Oleg Nesterov Cc: Christian Brauner, Kees Cook, jack, linux-fsdevel, linux-kernel, linux-mm, syzkaller-bugs, syzbot On Sat, Mar 22, 2025 at 04:55:39PM +0100, Oleg Nesterov wrote: > No, check_unsafe_execve() is called with cred_guard_mutex held, > see prepare_bprm_creds() Point... > > 3) A calls exec_binprm(), fails (bad binary) > > 4) A clears ->in_exec > > So (2) can only happen after A fails and drops cred_guard_mutex. > > And this means that we just need to ensure that ->in_exec is cleared > before this mutex is dropped, no? Something like below? Probably should work, but I wonder if it would be cleaner to have ->in_exec replaced with pointer to task_struct responsible. Not "somebody with that fs_struct for ->fs is trying to do execve(), has verified that nothing outside of their threads is using this and had been holding ->signal->cred_guard_mutex ever since then", but "this is the thread that..." ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [syzbot] [fs?] [mm?] KCSAN: data-race in bprm_execve / copy_fs (4) 2025-03-22 18:50 ` Al Viro @ 2025-03-23 18:14 ` Oleg Nesterov 2025-03-23 20:57 ` Christian Brauner 0 siblings, 1 reply; 30+ messages in thread From: Oleg Nesterov @ 2025-03-23 18:14 UTC (permalink / raw) To: Al Viro Cc: Christian Brauner, Kees Cook, jack, linux-fsdevel, linux-kernel, linux-mm, syzkaller-bugs, syzbot On 03/22, Al Viro wrote: > > On Sat, Mar 22, 2025 at 04:55:39PM +0100, Oleg Nesterov wrote: > > > And this means that we just need to ensure that ->in_exec is cleared > > before this mutex is dropped, no? Something like below? > > Probably should work, but I wonder if it would be cleaner to have > ->in_exec replaced with pointer to task_struct responsible. Not > "somebody with that fs_struct for ->fs is trying to do execve(), > has verified that nothing outside of their threads is using this > and had been holding ->signal->cred_guard_mutex ever since then", > but "this is the thread that..." perhaps... or something else to make this "not immediately obvious" fs->in_exec more clear. But I guess we need something simple for -stable, so will you agree with this fix for now? Apart from changelog/comments. retval = de_thread(me); + current->fs->in_exec = 0; if (retval) current->fs->in_exec = 0; is correct but looks confusing. See "V2" below, it clears fs->in_exec after the "if (retval)" check. syzbot says: Unfortunately, I don't have any reproducer for this issue yet. so I guess "#syz test: " is pointless right now... Oleg. --- diff --git a/fs/exec.c b/fs/exec.c index 506cd411f4ac..02e8824fc9cd 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1236,6 +1236,7 @@ int begin_new_exec(struct linux_binprm * bprm) if (retval) goto out; + current->fs->in_exec = 0; /* * Cancel any io_uring activity across execve */ @@ -1497,6 +1498,8 @@ static void free_bprm(struct linux_binprm *bprm) } free_arg_pages(bprm); if (bprm->cred) { + // for the case exec fails before de_thread() + current->fs->in_exec = 0; mutex_unlock(¤t->signal->cred_guard_mutex); abort_creds(bprm->cred); } @@ -1862,7 +1865,6 @@ static int bprm_execve(struct linux_binprm *bprm) sched_mm_cid_after_execve(current); /* execve succeeded */ - current->fs->in_exec = 0; current->in_execve = 0; rseq_execve(current); user_events_execve(current); @@ -1881,7 +1883,6 @@ static int bprm_execve(struct linux_binprm *bprm) force_fatal_sig(SIGSEGV); sched_mm_cid_after_execve(current); - current->fs->in_exec = 0; current->in_execve = 0; return retval; ^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [syzbot] [fs?] [mm?] KCSAN: data-race in bprm_execve / copy_fs (4) 2025-03-23 18:14 ` Oleg Nesterov @ 2025-03-23 20:57 ` Christian Brauner 0 siblings, 0 replies; 30+ messages in thread From: Christian Brauner @ 2025-03-23 20:57 UTC (permalink / raw) To: Oleg Nesterov Cc: Al Viro, Kees Cook, jack, linux-fsdevel, linux-kernel, linux-mm, syzkaller-bugs, syzbot On Sun, Mar 23, 2025 at 07:14:21PM +0100, Oleg Nesterov wrote: > On 03/22, Al Viro wrote: > > > > On Sat, Mar 22, 2025 at 04:55:39PM +0100, Oleg Nesterov wrote: > > > > > And this means that we just need to ensure that ->in_exec is cleared > > > before this mutex is dropped, no? Something like below? > > > > Probably should work, but I wonder if it would be cleaner to have > > ->in_exec replaced with pointer to task_struct responsible. Not > > "somebody with that fs_struct for ->fs is trying to do execve(), > > has verified that nothing outside of their threads is using this > > and had been holding ->signal->cred_guard_mutex ever since then", > > but "this is the thread that..." > > perhaps... or something else to make this "not immediately obvious" > fs->in_exec more clear. Well, it would certainly help to document that cred_guard_mutex serializes concurrent exec. This is kind of important information given that begin_new_exec() and finalize_exec() are only called from ->load_binary() and are thus always located in the individual binfmt_*.c files. That makes this pretty implicit information. Let alone that the unlocking is all based on bprm->cred being set or unset. Otherwise the patch looks good to me. > > But I guess we need something simple for -stable, so will you agree > with this fix for now? Apart from changelog/comments. > > retval = de_thread(me); > + current->fs->in_exec = 0; > if (retval) > current->fs->in_exec = 0; > > is correct but looks confusing. See "V2" below, it clears fs->in_exec > after the "if (retval)" check. > > syzbot says: > > Unfortunately, I don't have any reproducer for this issue yet. > > so I guess "#syz test: " is pointless right now... > > Oleg. > --- > > diff --git a/fs/exec.c b/fs/exec.c > index 506cd411f4ac..02e8824fc9cd 100644 > --- a/fs/exec.c > +++ b/fs/exec.c > @@ -1236,6 +1236,7 @@ int begin_new_exec(struct linux_binprm * bprm) > if (retval) > goto out; > > + current->fs->in_exec = 0; > /* > * Cancel any io_uring activity across execve > */ > @@ -1497,6 +1498,8 @@ static void free_bprm(struct linux_binprm *bprm) > } > free_arg_pages(bprm); > if (bprm->cred) { > + // for the case exec fails before de_thread() > + current->fs->in_exec = 0; > mutex_unlock(¤t->signal->cred_guard_mutex); > abort_creds(bprm->cred); > } > @@ -1862,7 +1865,6 @@ static int bprm_execve(struct linux_binprm *bprm) > > sched_mm_cid_after_execve(current); > /* execve succeeded */ > - current->fs->in_exec = 0; > current->in_execve = 0; > rseq_execve(current); > user_events_execve(current); > @@ -1881,7 +1883,6 @@ static int bprm_execve(struct linux_binprm *bprm) > force_fatal_sig(SIGSEGV); > > sched_mm_cid_after_execve(current); > - current->fs->in_exec = 0; > current->in_execve = 0; > > return retval; > ^ permalink raw reply [flat|nested] 30+ messages in thread
* [PATCH] exec: fix the racy usage of fs_struct->in_exec 2025-03-20 19:09 [syzbot] [fs?] [mm?] KCSAN: data-race in bprm_execve / copy_fs (4) syzbot 2025-03-20 20:09 ` Kees Cook @ 2025-03-24 16:00 ` Oleg Nesterov 2025-03-24 17:01 ` Mateusz Guzik 2025-04-29 15:49 ` Oleg Nesterov 1 sibling, 2 replies; 30+ messages in thread From: Oleg Nesterov @ 2025-03-24 16:00 UTC (permalink / raw) To: syzbot, brauner, kees, viro Cc: jack, linux-fsdevel, linux-kernel, linux-mm, syzkaller-bugs, mjguzik check_unsafe_exec() sets fs->in_exec under cred_guard_mutex, then execve() paths clear fs->in_exec lockless. This is fine if exec succeeds, but if it fails we have the following race: T1 sets fs->in_exec = 1, fails, drops cred_guard_mutex T2 sets fs->in_exec = 1 T1 clears fs->in_exec T2 continues with fs->in_exec == 0 Change fs/exec.c to clear fs->in_exec with cred_guard_mutex held. Reported-by: syzbot+1c486d0b62032c82a968@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/67dc67f0.050a0220.25ae54.001f.GAE@google.com/ Cc: stable@vger.kernel.org Signed-off-by: Oleg Nesterov <oleg@redhat.com> --- fs/exec.c | 15 +++++++++------ 1 file changed, 9 insertions(+), 6 deletions(-) diff --git a/fs/exec.c b/fs/exec.c index 506cd411f4ac..17047210be46 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1229,13 +1229,12 @@ int begin_new_exec(struct linux_binprm * bprm) */ bprm->point_of_no_return = true; - /* - * Make this the only thread in the thread group. - */ + /* Make this the only thread in the thread group */ retval = de_thread(me); if (retval) goto out; - + /* see the comment in check_unsafe_exec() */ + current->fs->in_exec = 0; /* * Cancel any io_uring activity across execve */ @@ -1497,6 +1496,8 @@ static void free_bprm(struct linux_binprm *bprm) } free_arg_pages(bprm); if (bprm->cred) { + /* in case exec fails before de_thread() succeeds */ + current->fs->in_exec = 0; mutex_unlock(¤t->signal->cred_guard_mutex); abort_creds(bprm->cred); } @@ -1618,6 +1619,10 @@ static void check_unsafe_exec(struct linux_binprm *bprm) * suid exec because the differently privileged task * will be able to manipulate the current directory, etc. * It would be nice to force an unshare instead... + * + * Otherwise we set fs->in_exec = 1 to deny clone(CLONE_FS) + * from another sub-thread until de_thread() succeeds, this + * state is protected by cred_guard_mutex we hold. */ n_fs = 1; spin_lock(&p->fs->lock); @@ -1862,7 +1867,6 @@ static int bprm_execve(struct linux_binprm *bprm) sched_mm_cid_after_execve(current); /* execve succeeded */ - current->fs->in_exec = 0; current->in_execve = 0; rseq_execve(current); user_events_execve(current); @@ -1881,7 +1885,6 @@ static int bprm_execve(struct linux_binprm *bprm) force_fatal_sig(SIGSEGV); sched_mm_cid_after_execve(current); - current->fs->in_exec = 0; current->in_execve = 0; return retval; -- 2.25.1.362.g51ebf55 ^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [PATCH] exec: fix the racy usage of fs_struct->in_exec 2025-03-24 16:00 ` [PATCH] exec: fix the racy usage of fs_struct->in_exec Oleg Nesterov @ 2025-03-24 17:01 ` Mateusz Guzik 2025-03-24 18:27 ` Oleg Nesterov 2025-04-29 15:49 ` Oleg Nesterov 1 sibling, 1 reply; 30+ messages in thread From: Mateusz Guzik @ 2025-03-24 17:01 UTC (permalink / raw) To: Oleg Nesterov Cc: syzbot, brauner, kees, viro, jack, linux-fsdevel, linux-kernel, linux-mm, syzkaller-bugs On Mon, Mar 24, 2025 at 5:00 PM Oleg Nesterov <oleg@redhat.com> wrote: > > check_unsafe_exec() sets fs->in_exec under cred_guard_mutex, then execve() > paths clear fs->in_exec lockless. This is fine if exec succeeds, but if it > fails we have the following race: > > T1 sets fs->in_exec = 1, fails, drops cred_guard_mutex > > T2 sets fs->in_exec = 1 > > T1 clears fs->in_exec > > T2 continues with fs->in_exec == 0 > > Change fs/exec.c to clear fs->in_exec with cred_guard_mutex held. > I had cursory glances at this code earlier and the more I try to understand it the more confused I am. The mutex at hand hides in ->signal and fs->in_exec remains treated as a flag. The loop in check_unsafe_exec() tries to guard against a task which shares ->fs, but does not share ->mm? To my reading this implies unshared ->signal, so the mutex protection does not apply. I think this ends up being harmless as in this case nobody is going to set ->in_exec (instead they are going to share LSM_UNSAFE_SHARE), so clearing it in these spots becomes a nop. At the same time the check in copy_fs() no longer blocks clones as check_unsafe_exec() already opts to LSM_UNSAFE_SHARE? Even if this all works with the patch, this is an incredibly odd set of dependencies and I don't see a good reason for it to still be here. Per my other e-mail the obvious scheme would serialize all execs sharing ->fs and make copy_fs do a killable wait for execs to finish. Arguably this would also improve userspace-visible behavior as a transient -EBUSY would be eliminated. No matter what the specific solution, imo treating ->in_exec as a flag needs to die. is there a problem getting this done even for stable kernels? I understand it would be harder to backport churn-wise, but should be much easier to reason about? Just my $0,03 -- Mateusz Guzik <mjguzik gmail.com> ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] exec: fix the racy usage of fs_struct->in_exec 2025-03-24 17:01 ` Mateusz Guzik @ 2025-03-24 18:27 ` Oleg Nesterov 2025-03-24 18:37 ` Oleg Nesterov 2025-03-24 22:24 ` Mateusz Guzik 0 siblings, 2 replies; 30+ messages in thread From: Oleg Nesterov @ 2025-03-24 18:27 UTC (permalink / raw) To: Mateusz Guzik Cc: syzbot, brauner, kees, viro, jack, linux-fsdevel, linux-kernel, linux-mm, syzkaller-bugs On 03/24, Mateusz Guzik wrote: > > I had cursory glances at this code earlier and the more I try to > understand it the more confused I am. You are not alone ;) > Per my other e-mail the obvious scheme would serialize all execs > sharing ->fs and make copy_fs do a killable wait for execs to finish. > Arguably this would also improve userspace-visible behavior as a > transient -EBUSY would be eliminated. I had the same feeling years ago. Why didn't I do it? I can't recall. Perhaps because I found some problem in this idea, but most probably because I failed to make the correct and simple patch. > is there a problem getting this done even for stable kernels? I > understand it would be harder to backport churn-wise, but should be > much easier to reason about? I won't argue with another solution. But this problem is quite old, unless I am totally confused this logic was wrong from the very beginning when fs->in_exec was introduced by 498052bba55ec. So to me it would be better to have the trivial fix for stable, exactly because it is trivially backportable. Then cleanup/simplify this logic on top of it. Oleg. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] exec: fix the racy usage of fs_struct->in_exec 2025-03-24 18:27 ` Oleg Nesterov @ 2025-03-24 18:37 ` Oleg Nesterov 2025-03-24 22:24 ` Mateusz Guzik 1 sibling, 0 replies; 30+ messages in thread From: Oleg Nesterov @ 2025-03-24 18:37 UTC (permalink / raw) To: Mateusz Guzik Cc: syzbot, brauner, kees, viro, jack, linux-fsdevel, linux-kernel, linux-mm, syzkaller-bugs On 03/24, Oleg Nesterov wrote: > > I won't argue with another solution. But this problem is quite old, Yes, but... > unless I am totally confused this logic was wrong from the very > beginning when fs->in_exec was introduced by 498052bba55ec. OK, I was wrong. According to git show 498052bba55ec:fs/exec.c this patch was correct in this respect. Sorry for the noise. > So to me it would be better to have the trivial fix for stable, > exactly because it is trivially backportable. Then cleanup/simplify > this logic on top of it. Yes... Oleg. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] exec: fix the racy usage of fs_struct->in_exec 2025-03-24 18:27 ` Oleg Nesterov 2025-03-24 18:37 ` Oleg Nesterov @ 2025-03-24 22:24 ` Mateusz Guzik 2025-03-25 10:09 ` Oleg Nesterov 1 sibling, 1 reply; 30+ messages in thread From: Mateusz Guzik @ 2025-03-24 22:24 UTC (permalink / raw) To: Oleg Nesterov Cc: syzbot, brauner, kees, viro, jack, linux-fsdevel, linux-kernel, linux-mm, syzkaller-bugs On Mon, Mar 24, 2025 at 7:28 PM Oleg Nesterov <oleg@redhat.com> wrote: > I won't argue with another solution. But this problem is quite old, > unless I am totally confused this logic was wrong from the very > beginning when fs->in_exec was introduced by 498052bba55ec. > > So to me it would be better to have the trivial fix for stable, > exactly because it is trivially backportable. Then cleanup/simplify > this logic on top of it. > So I got myself a crap testcase with a CLONE_FS'ed task which can execve and sanity-checked that suid is indeed not honored as expected. The most convenient way out, planting a mutex in there, does not work because of the size -- fs_struct is already at 56 bytes and I'm not going to push it past 64. However, it looks like "wait on bit" machinery can be employed here, based on what I had seen with inodes (see __wait_on_freeing_inode), and that should avoid growing the struct, unless I'm grossly misreading something. Anyhow, the plan would be to serialize on the bit, synchronized with the current spin lock. copy_fs would call a helper to wait for it to clear, and would still bump ->users under the spin lock. This would decouple the handling from cred_guard_mutex and avoid weirdness like clearing the ->in_exec flag when we never set it. Should be easy enough and viable for backports; I'll try to hack it up tomorrow unless the idea is NAKed. The crapper mentioned above will be used to validate that exec vs clone works as expected. -- Mateusz Guzik <mjguzik gmail.com> ^ permalink raw reply [flat|nested] 30+ messages in thread
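For concreteness, a rough sketch of what the "wait instead of -EAGAIN" shape could look like, built on the existing wait_var_event()/wake_up_var() machinery so fs_struct does not grow. This is only an illustration of the idea under discussion, not Mateusz's actual patch; the helper names fs_wait_in_exec() and fs_clear_in_exec() are made up here, and the exact errno returned to fork() is left as a detail.

/*
 * Sketch only: in_exec stays an int written under fs->lock; waiters sleep
 * on the variable's address instead of failing with -EAGAIN.
 */
static int fs_wait_in_exec(struct fs_struct *fs)	/* hypothetical helper */
{
	return wait_var_event_killable(&fs->in_exec, !READ_ONCE(fs->in_exec));
}

static int copy_fs(unsigned long clone_flags, struct task_struct *tsk)
{
	struct fs_struct *fs = current->fs;

	if (clone_flags & CLONE_FS) {
		spin_lock(&fs->lock);
		while (fs->in_exec) {
			spin_unlock(&fs->lock);
			/* sleep until the exec'ing thread clears in_exec */
			if (fs_wait_in_exec(fs))
				return -EINTR;	/* exact errno is a detail */
			spin_lock(&fs->lock);
		}
		fs->users++;
		spin_unlock(&fs->lock);
		return 0;
	}
	tsk->fs = copy_fs_struct(fs);
	if (!tsk->fs)
		return -ENOMEM;
	return 0;
}

/* Every site that clears in_exec would pair the store with a wakeup: */
static void fs_clear_in_exec(struct fs_struct *fs)	/* hypothetical helper */
{
	spin_lock(&fs->lock);
	fs->in_exec = 0;
	spin_unlock(&fs->lock);
	wake_up_var(&fs->in_exec);
}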
* Re: [PATCH] exec: fix the racy usage of fs_struct->in_exec 2025-03-24 22:24 ` Mateusz Guzik @ 2025-03-25 10:09 ` Oleg Nesterov 2025-03-25 11:01 ` Mateusz Guzik 0 siblings, 1 reply; 30+ messages in thread From: Oleg Nesterov @ 2025-03-25 10:09 UTC (permalink / raw) To: Mateusz Guzik Cc: syzbot, brauner, kees, viro, jack, linux-fsdevel, linux-kernel, linux-mm, syzkaller-bugs On 03/24, Mateusz Guzik wrote: > > On Mon, Mar 24, 2025 at 7:28 PM Oleg Nesterov <oleg@redhat.com> wrote: > > > > So to me it would be better to have the trivial fix for stable, > > exactly because it is trivially backportable. Then cleanup/simplify > > this logic on top of it. > > So I got myself a crap testcase with a CLONE_FS'ed task which can > execve and sanity-checked that suid is indeed not honored as expected. So you mean my patch can't fix the problem? > Anyhow, the plan would be to serialize on the bit, synchronized with > the current spin lock. copy_fs would call a helper to wait for it to > clear, would still bump ->users under the spin lock. > > This would decouple the handling from cred_mutex and avoid weirdness > like clearing the ->in_exec flag when we never set it. I don't really understand the idea, but as I said I won't argue with another solution. Oleg. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] exec: fix the racy usage of fs_struct->in_exec 2025-03-25 10:09 ` Oleg Nesterov @ 2025-03-25 11:01 ` Mateusz Guzik 2025-03-25 13:21 ` Oleg Nesterov 0 siblings, 1 reply; 30+ messages in thread From: Mateusz Guzik @ 2025-03-25 11:01 UTC (permalink / raw) To: Oleg Nesterov Cc: syzbot, brauner, kees, viro, jack, linux-fsdevel, linux-kernel, linux-mm, syzkaller-bugs On Tue, Mar 25, 2025 at 11:10 AM Oleg Nesterov <oleg@redhat.com> wrote: > > On 03/24, Mateusz Guzik wrote: > > > > On Mon, Mar 24, 2025 at 7:28 PM Oleg Nesterov <oleg@redhat.com> wrote: > > > > > > So to me it would be better to have the trivial fix for stable, > > > exactly because it is trivially backportable. Then cleanup/simplify > > > this logic on top of it. > > > > So I got myself a crap testcase with a CLONE_FS'ed task which can > > execve and sanity-checked that suid is indeed not honored as expected. > > So you mean my patch can't fix the problem? No, I think the patch works. I am saying the current scheme is avoidably hard to reason about. > > > Anyhow, the plan would be to serialize on the bit, synchronized with > > the current spin lock. copy_fs would call a helper to wait for it to > > clear, would still bump ->users under the spin lock. > > > > This would decouple the handling from cred_mutex and avoid weirdness > > like clearing the ->in_exec flag when we never set it. > > I don't really understand the idea, but as I said I won't argue with > another solution. > I'll try to ship later today so that there will be something concrete to comment on. -- Mateusz Guzik <mjguzik gmail.com> ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] exec: fix the racy usage of fs_struct->in_exec 2025-03-25 11:01 ` Mateusz Guzik @ 2025-03-25 13:21 ` Oleg Nesterov 2025-03-25 13:30 ` Christian Brauner 0 siblings, 1 reply; 30+ messages in thread From: Oleg Nesterov @ 2025-03-25 13:21 UTC (permalink / raw) To: Mateusz Guzik Cc: syzbot, brauner, kees, viro, jack, linux-fsdevel, linux-kernel, linux-mm, syzkaller-bugs On 03/25, Mateusz Guzik wrote: > > On Tue, Mar 25, 2025 at 11:10 AM Oleg Nesterov <oleg@redhat.com> wrote: > > > > On 03/24, Mateusz Guzik wrote: > > > > > > On Mon, Mar 24, 2025 at 7:28 PM Oleg Nesterov <oleg@redhat.com> wrote: > > > > > > > > So to me it would be better to have the trivial fix for stable, > > > > exactly because it is trivially backportable. Then cleanup/simplify > > > > this logic on top of it. > > > > > > So I got myself a crap testcase with a CLONE_FS'ed task which can > > > execve and sanity-checked that suid is indeed not honored as expected. > > > > So you mean my patch can't fix the problem? > > No, I think the patch works. > > I am saying the current scheme is avoidably hard to reason about. Ah, OK, thanks. Then I still think it makes more sense to do the cleanups you propose on top of this fix. But I leave this to you and other fs/ maintainers. Oleg. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] exec: fix the racy usage of fs_struct->in_exec 2025-03-25 13:21 ` Oleg Nesterov @ 2025-03-25 13:30 ` Christian Brauner 2025-03-25 14:15 ` Mateusz Guzik 0 siblings, 1 reply; 30+ messages in thread From: Christian Brauner @ 2025-03-25 13:30 UTC (permalink / raw) To: Oleg Nesterov, Mateusz Guzik Cc: syzbot, kees, viro, jack, linux-fsdevel, linux-kernel, linux-mm, syzkaller-bugs On Tue, Mar 25, 2025 at 02:21:36PM +0100, Oleg Nesterov wrote: > On 03/25, Mateusz Guzik wrote: > > > > On Tue, Mar 25, 2025 at 11:10 AM Oleg Nesterov <oleg@redhat.com> wrote: > > > > > > On 03/24, Mateusz Guzik wrote: > > > > > > > > On Mon, Mar 24, 2025 at 7:28 PM Oleg Nesterov <oleg@redhat.com> wrote: > > > > > > > > > > So to me it would be better to have the trivial fix for stable, > > > > > exactly because it is trivially backportable. Then cleanup/simplify > > > > > this logic on top of it. > > > > > > > > So I got myself a crap testcase with a CLONE_FS'ed task which can > > > > execve and sanity-checked that suid is indeed not honored as expected. > > > > > > So you mean my patch can't fix the problem? > > > > No, I think the patch works. > > > > I am saying the current scheme is avoidably hard to reason about. > > Ah, OK, thanks. Then I still think it makes more sense to do the > cleanups you propose on top of this fix. I agree. We should go with Oleg's fix that in the old scheme and use that. And then @Mateusz your cleanup should please go on top! ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] exec: fix the racy usage of fs_struct->in_exec 2025-03-25 13:30 ` Christian Brauner @ 2025-03-25 14:15 ` Mateusz Guzik 2025-03-25 14:46 ` Christian Brauner 0 siblings, 1 reply; 30+ messages in thread From: Mateusz Guzik @ 2025-03-25 14:15 UTC (permalink / raw) To: Christian Brauner Cc: Oleg Nesterov, syzbot, kees, viro, jack, linux-fsdevel, linux-kernel, linux-mm, syzkaller-bugs On Tue, Mar 25, 2025 at 2:30 PM Christian Brauner <brauner@kernel.org> wrote: > > On Tue, Mar 25, 2025 at 02:21:36PM +0100, Oleg Nesterov wrote: > > On 03/25, Mateusz Guzik wrote: > > > > > > On Tue, Mar 25, 2025 at 11:10 AM Oleg Nesterov <oleg@redhat.com> wrote: > > > > > > > > On 03/24, Mateusz Guzik wrote: > > > > > > > > > > On Mon, Mar 24, 2025 at 7:28 PM Oleg Nesterov <oleg@redhat.com> wrote: > > > > > > > > > > > > So to me it would be better to have the trivial fix for stable, > > > > > > exactly because it is trivially backportable. Then cleanup/simplify > > > > > > this logic on top of it. > > > > > > > > > > So I got myself a crap testcase with a CLONE_FS'ed task which can > > > > > execve and sanity-checked that suid is indeed not honored as expected. > > > > > > > > So you mean my patch can't fix the problem? > > > > > > No, I think the patch works. > > > > > > I am saying the current scheme is avoidably hard to reason about. > > > > Ah, OK, thanks. Then I still think it makes more sense to do the > > cleanups you propose on top of this fix. > > I agree. We should go with Oleg's fix that in the old scheme and use > that. And then @Mateusz your cleanup should please go on top! Ok, in that case I'm gonna ship when I'm gonna ship(tm), maybe later this week. -- Mateusz Guzik <mjguzik gmail.com> ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] exec: fix the racy usage of fs_struct->in_exec 2025-03-25 14:15 ` Mateusz Guzik @ 2025-03-25 14:46 ` Christian Brauner 2025-03-25 18:40 ` Kees Cook 0 siblings, 1 reply; 30+ messages in thread From: Christian Brauner @ 2025-03-25 14:46 UTC (permalink / raw) To: Mateusz Guzik Cc: Oleg Nesterov, syzbot, kees, viro, jack, linux-fsdevel, linux-kernel, linux-mm, syzkaller-bugs On Tue, Mar 25, 2025 at 03:15:06PM +0100, Mateusz Guzik wrote: > On Tue, Mar 25, 2025 at 2:30 PM Christian Brauner <brauner@kernel.org> wrote: > > > > On Tue, Mar 25, 2025 at 02:21:36PM +0100, Oleg Nesterov wrote: > > > On 03/25, Mateusz Guzik wrote: > > > > > > > > On Tue, Mar 25, 2025 at 11:10 AM Oleg Nesterov <oleg@redhat.com> wrote: > > > > > > > > > > On 03/24, Mateusz Guzik wrote: > > > > > > > > > > > > On Mon, Mar 24, 2025 at 7:28 PM Oleg Nesterov <oleg@redhat.com> wrote: > > > > > > > > > > > > > > So to me it would be better to have the trivial fix for stable, > > > > > > > exactly because it is trivially backportable. Then cleanup/simplify > > > > > > > this logic on top of it. > > > > > > > > > > > > So I got myself a crap testcase with a CLONE_FS'ed task which can > > > > > > execve and sanity-checked that suid is indeed not honored as expected. > > > > > > > > > > So you mean my patch can't fix the problem? > > > > > > > > No, I think the patch works. > > > > > > > > I am saying the current scheme is avoidably hard to reason about. > > > > > > Ah, OK, thanks. Then I still think it makes more sense to do the > > > cleanups you propose on top of this fix. > > > > I agree. We should go with Oleg's fix that in the old scheme and use > > that. And then @Mateusz your cleanup should please go on top! > > Ok, in that case I'm gonna ship when I'm gonna ship(tm), maybe later this week. Ok, I've taken the patch as I've got a first round of fixes to send already. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] exec: fix the racy usage of fs_struct->in_exec 2025-03-25 14:46 ` Christian Brauner @ 2025-03-25 18:40 ` Kees Cook 0 siblings, 0 replies; 30+ messages in thread From: Kees Cook @ 2025-03-25 18:40 UTC (permalink / raw) To: Christian Brauner, Mateusz Guzik Cc: Oleg Nesterov, syzbot, viro, jack, linux-fsdevel, linux-kernel, linux-mm, syzkaller-bugs On March 25, 2025 7:46:15 AM PDT, Christian Brauner <brauner@kernel.org> wrote: >On Tue, Mar 25, 2025 at 03:15:06PM +0100, Mateusz Guzik wrote: >> On Tue, Mar 25, 2025 at 2:30 PM Christian Brauner <brauner@kernel.org> wrote: >> > >> > On Tue, Mar 25, 2025 at 02:21:36PM +0100, Oleg Nesterov wrote: >> > > On 03/25, Mateusz Guzik wrote: >> > > > >> > > > On Tue, Mar 25, 2025 at 11:10 AM Oleg Nesterov <oleg@redhat.com> wrote: >> > > > > >> > > > > On 03/24, Mateusz Guzik wrote: >> > > > > > >> > > > > > On Mon, Mar 24, 2025 at 7:28 PM Oleg Nesterov <oleg@redhat.com> wrote: >> > > > > > > >> > > > > > > So to me it would be better to have the trivial fix for stable, >> > > > > > > exactly because it is trivially backportable. Then cleanup/simplify >> > > > > > > this logic on top of it. >> > > > > > >> > > > > > So I got myself a crap testcase with a CLONE_FS'ed task which can >> > > > > > execve and sanity-checked that suid is indeed not honored as expected. >> > > > > >> > > > > So you mean my patch can't fix the problem? >> > > > >> > > > No, I think the patch works. >> > > > >> > > > I am saying the current scheme is avoidably hard to reason about. >> > > >> > > Ah, OK, thanks. Then I still think it makes more sense to do the >> > > cleanups you propose on top of this fix. >> > >> > I agree. We should go with Oleg's fix that in the old scheme and use >> > that. And then @Mateusz your cleanup should please go on top! >> >> Ok, in that case I'm gonna ship when I'm gonna ship(tm), maybe later this week. > >Ok, I've taken the patch as I've got a first round of fixes to send >already. Thanks! Acked-by: Kees Cook <kees@kernel.org> -- Kees Cook ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] exec: fix the racy usage of fs_struct->in_exec 2025-03-24 16:00 ` [PATCH] exec: fix the racy usage of fs_struct->in_exec Oleg Nesterov 2025-03-24 17:01 ` Mateusz Guzik @ 2025-04-29 15:49 ` Oleg Nesterov 2025-04-29 16:57 ` Kees Cook 2025-04-29 17:12 ` Mateusz Guzik 1 sibling, 2 replies; 30+ messages in thread From: Oleg Nesterov @ 2025-04-29 15:49 UTC (permalink / raw) To: brauner, kees, viro Cc: jack, linux-fsdevel, linux-kernel, linux-mm, syzkaller-bugs, mjguzik, Michal Hocko Damn, I am stupid. On 03/24, Oleg Nesterov wrote: > > check_unsafe_exec() sets fs->in_exec under cred_guard_mutex, then execve() > paths clear fs->in_exec lockless. This is fine if exec succeeds, but if it > fails we have the following race: > > T1 sets fs->in_exec = 1, fails, drops cred_guard_mutex > > T2 sets fs->in_exec = 1 > > T1 clears fs->in_exec When I look at this code again, I think this race was not possible and thus this patch (applied as af7bb0d2ca45) was not needed. Yes, begin_new_exec() can drop cred_guard_mutex on failure, but only after de_thread() succeeds, when we can't race with another sub-thread. I hope this patch didn't make the things worse so we don't need to revert it. Plus I think it makes this (confusing) logic a bit more clear. Just, unless I am confused again, it wasn't really needed. ----------------------------------------------------------------------------- But. I didn't read the original report from syzbot, https://lore.kernel.org/all/67dc67f0.050a0220.25ae54.001f.GAE@google.com/#t because I wasn't CC'ed. and then - sorry Kees!!! - I didn't bother to read your first reply carefully. So yes, with or without this patch the "if (fs->in_exec)" check in copy_fs() can obviously hit the 1 -> 0 transition. This is harmless, but should be probably fixed just to avoid another report from KCSAN. I do not want to add another spin_lock(fs->lock). We can change copy_fs() to use data_race(), but I'd prefer the patch below. Yes, it needs the additional comment(s) to explain READ_ONCE(). What do you think? Did I miss something again??? Quite possibly... Mateusz, I hope you will cleanup this horror sooner or later ;) Oleg. --- diff --git a/fs/exec.c b/fs/exec.c index 5d1c0d2dc403..42a7f9b43911 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1495,7 +1495,7 @@ static void free_bprm(struct linux_binprm *bprm) free_arg_pages(bprm); if (bprm->cred) { /* in case exec fails before de_thread() succeeds */ - current->fs->in_exec = 0; + WRITE_ONCE(current->fs->in_exec, 0); mutex_unlock(&current->signal->cred_guard_mutex); abort_creds(bprm->cred); } diff --git a/kernel/fork.c b/kernel/fork.c index 4c2df3816728..381af8c8ece8 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1802,7 +1802,7 @@ static int copy_fs(unsigned long clone_flags, struct task_struct *tsk) /* tsk->fs is already what we want */ spin_lock(&fs->lock); /* "users" and "in_exec" locked for check_unsafe_exec() */ - if (fs->in_exec) { + if (READ_ONCE(fs->in_exec)) { spin_unlock(&fs->lock); return -EAGAIN; } ^ permalink raw reply related [flat|nested] 30+ messages in thread
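For comparison, the data_race() route mentioned (and passed over) above would only annotate the reader in copy_fs(), documenting the benign 1 -> 0 race for KCSAN; a minimal sketch of that alternative, relative to the patch above (not an applied change, hunk location elided):

diff --git a/kernel/fork.c b/kernel/fork.c
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ ... @@ static int copy_fs(unsigned long clone_flags, struct task_struct *tsk)
 		spin_lock(&fs->lock);
 		/* "users" and "in_exec" locked for check_unsafe_exec() */
-		if (READ_ONCE(fs->in_exec)) {
+		if (data_race(fs->in_exec)) {
 			spin_unlock(&fs->lock);
 			return -EAGAIN;
 		}

The practical difference is small: READ_ONCE() additionally keeps the compiler from tearing or refetching the load, while data_race() merely tells KCSAN the race is known and tolerated.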
* Re: [PATCH] exec: fix the racy usage of fs_struct->in_exec 2025-04-29 15:49 ` Oleg Nesterov @ 2025-04-29 16:57 ` Kees Cook 2025-04-29 17:12 ` Mateusz Guzik 1 sibling, 0 replies; 30+ messages in thread From: Kees Cook @ 2025-04-29 16:57 UTC (permalink / raw) To: Oleg Nesterov Cc: brauner, viro, jack, linux-fsdevel, linux-kernel, linux-mm, syzkaller-bugs, mjguzik, Michal Hocko On Tue, Apr 29, 2025 at 05:49:44PM +0200, Oleg Nesterov wrote: > Damn, I am stupid. > > On 03/24, Oleg Nesterov wrote: > > > > check_unsafe_exec() sets fs->in_exec under cred_guard_mutex, then execve() > > paths clear fs->in_exec lockless. This is fine if exec succeeds, but if it > > fails we have the following race: > > > > T1 sets fs->in_exec = 1, fails, drops cred_guard_mutex > > > > T2 sets fs->in_exec = 1 > > > > T1 clears fs->in_exec > > When I look at this code again, I think this race was not possible and thus > this patch (applied as af7bb0d2ca45) was not needed. > > Yes, begin_new_exec() can drop cred_guard_mutex on failure, but only after > de_thread() succeeds, when we can't race with another sub-thread. > > I hope this patch didn't make the things worse so we don't need to revert it. > Plus I think it makes this (confusing) logic a bit more clear. Just, unless > I am confused again, it wasn't really needed. > > ----------------------------------------------------------------------------- > But. I didn't read the original report from syzbot, > https://lore.kernel.org/all/67dc67f0.050a0220.25ae54.001f.GAE@google.com/#t > because I wasn't CC'ed. and then - sorry Kees!!! - I didn't bother to read > your first reply carefully. > > So yes, with or without this patch the "if (fs->in_exec)" check in copy_fs() > can obviously hit the 1 -> 0 transition. > > This is harmless, but should be probably fixed just to avoid another report > from KCSAN. > > I do not want to add another spin_lock(fs->lock). We can change copy_fs() to > use data_race(), but I'd prefer the patch below. Yes, it needs the additional > comment(s) to explain READ_ONCE(). > > What do you think? Did I miss something again??? Quite possibly... > > Mateusz, I hope you will cleanup this horror sooner or later ;) > > Oleg. > --- > > diff --git a/fs/exec.c b/fs/exec.c > index 5d1c0d2dc403..42a7f9b43911 100644 > --- a/fs/exec.c > +++ b/fs/exec.c > @@ -1495,7 +1495,7 @@ static void free_bprm(struct linux_binprm *bprm) > free_arg_pages(bprm); > if (bprm->cred) { > /* in case exec fails before de_thread() succeeds */ > - current->fs->in_exec = 0; > + WRITE_ONCE(current->fs->in_exec, 0); > mutex_unlock(&current->signal->cred_guard_mutex); > abort_creds(bprm->cred); > } > diff --git a/kernel/fork.c b/kernel/fork.c > index 4c2df3816728..381af8c8ece8 100644 > --- a/kernel/fork.c > +++ b/kernel/fork.c > @@ -1802,7 +1802,7 @@ static int copy_fs(unsigned long clone_flags, struct task_struct *tsk) > /* tsk->fs is already what we want */ > spin_lock(&fs->lock); > /* "users" and "in_exec" locked for check_unsafe_exec() */ > - if (fs->in_exec) { > + if (READ_ONCE(fs->in_exec)) { > spin_unlock(&fs->lock); > return -EAGAIN; > } > Yeah, this seems reasonable. -- Kees Cook ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] exec: fix the racy usage of fs_struct->in_exec 2025-04-29 15:49 ` Oleg Nesterov 2025-04-29 16:57 ` Kees Cook @ 2025-04-29 17:12 ` Mateusz Guzik 1 sibling, 0 replies; 30+ messages in thread From: Mateusz Guzik @ 2025-04-29 17:12 UTC (permalink / raw) To: Oleg Nesterov Cc: brauner, kees, viro, jack, linux-fsdevel, linux-kernel, linux-mm, syzkaller-bugs, Michal Hocko On Tue, Apr 29, 2025 at 5:50 PM Oleg Nesterov <oleg@redhat.com> wrote: > > Damn, I am stupid. > > On 03/24, Oleg Nesterov wrote: > > > > check_unsafe_exec() sets fs->in_exec under cred_guard_mutex, then execve() > > paths clear fs->in_exec lockless. This is fine if exec succeeds, but if it > > fails we have the following race: > > > > T1 sets fs->in_exec = 1, fails, drops cred_guard_mutex > > > > T2 sets fs->in_exec = 1 > > > > T1 clears fs->in_exec > > When I look at this code again, I think this race was not possible and thus > this patch (applied as af7bb0d2ca45) was not needed. > > Yes, begin_new_exec() can drop cred_guard_mutex on failure, but only after > de_thread() succeeds, when we can't race with another sub-thread. > > I hope this patch didn't make the things worse so we don't need to revert it. > Plus I think it makes this (confusing) logic a bit more clear. Just, unless > I am confused again, it wasn't really needed. > > ----------------------------------------------------------------------------- > But. I didn't read the original report from syzbot, > https://lore.kernel.org/all/67dc67f0.050a0220.25ae54.001f.GAE@google.com/#t > because I wasn't CC'ed. and then - sorry Kees!!! - I didn't bother to read > your first reply carefully. > > So yes, with or without this patch the "if (fs->in_exec)" check in copy_fs() > can obviously hit the 1 -> 0 transition. > > This is harmless, but should be probably fixed just to avoid another report > from KCSAN. > > I do not want to add another spin_lock(fs->lock). We can change copy_fs() to > use data_race(), but I'd prefer the patch below. Yes, it needs the additional > comment(s) to explain READ_ONCE(). > > What do you think? Did I miss something again??? Quite possibly... > > Mateusz, I hope you will cleanup this horror sooner or later ;) > > Oleg. > --- > > diff --git a/fs/exec.c b/fs/exec.c > index 5d1c0d2dc403..42a7f9b43911 100644 > --- a/fs/exec.c > +++ b/fs/exec.c > @@ -1495,7 +1495,7 @@ static void free_bprm(struct linux_binprm *bprm) > free_arg_pages(bprm); > if (bprm->cred) { > /* in case exec fails before de_thread() succeeds */ > - current->fs->in_exec = 0; > + WRITE_ONCE(current->fs->in_exec, 0); > mutex_unlock(&current->signal->cred_guard_mutex); > abort_creds(bprm->cred); > } > diff --git a/kernel/fork.c b/kernel/fork.c > index 4c2df3816728..381af8c8ece8 100644 > --- a/kernel/fork.c > +++ b/kernel/fork.c > @@ -1802,7 +1802,7 @@ static int copy_fs(unsigned long clone_flags, struct task_struct *tsk) > /* tsk->fs is already what we want */ > spin_lock(&fs->lock); > /* "users" and "in_exec" locked for check_unsafe_exec() */ > - if (fs->in_exec) { > + if (READ_ONCE(fs->in_exec)) { > spin_unlock(&fs->lock); > return -EAGAIN; > } > good grief man ;) I have this on my TODO list, (un)fortunately $life got in the way and by now I swapped out almost all of the context. I mostly remember the code is hard to follow. ;) That said, maybe I'll give it a stab soon(tm). I have a testcase somewhere to validate that this does provide the expected behavior vs suid, so it's not going to be me flying blindly.
-- Mateusz Guzik <mjguzik gmail.com> ^ permalink raw reply [flat|nested] 30+ messages in thread
end of thread, other threads:[~2025-04-29 17:12 UTC | newest] Thread overview: 30+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2025-03-20 19:09 [syzbot] [fs?] [mm?] KCSAN: data-race in bprm_execve / copy_fs (4) syzbot 2025-03-20 20:09 ` Kees Cook 2025-03-21 1:44 ` Al Viro 2025-03-21 8:10 ` Kees Cook 2025-03-21 8:49 ` Christian Brauner 2025-03-21 8:45 ` Christian Brauner 2025-03-22 1:00 ` Al Viro 2025-03-22 6:26 ` Kees Cook 2025-03-22 10:15 ` Mateusz Guzik 2025-03-22 10:28 ` Christian Brauner 2025-03-22 10:23 ` Christian Brauner 2025-03-22 15:55 ` Oleg Nesterov 2025-03-22 18:50 ` Al Viro 2025-03-23 18:14 ` Oleg Nesterov 2025-03-23 20:57 ` Christian Brauner 2025-03-24 16:00 ` [PATCH] exec: fix the racy usage of fs_struct->in_exec Oleg Nesterov 2025-03-24 17:01 ` Mateusz Guzik 2025-03-24 18:27 ` Oleg Nesterov 2025-03-24 18:37 ` Oleg Nesterov 2025-03-24 22:24 ` Mateusz Guzik 2025-03-25 10:09 ` Oleg Nesterov 2025-03-25 11:01 ` Mateusz Guzik 2025-03-25 13:21 ` Oleg Nesterov 2025-03-25 13:30 ` Christian Brauner 2025-03-25 14:15 ` Mateusz Guzik 2025-03-25 14:46 ` Christian Brauner 2025-03-25 18:40 ` Kees Cook 2025-04-29 15:49 ` Oleg Nesterov 2025-04-29 16:57 ` Kees Cook 2025-04-29 17:12 ` Mateusz Guzik