* [syzbot] [fs?] [mm?] KCSAN: data-race in bprm_execve / copy_fs (4) @ 2025-03-20 19:09 syzbot 2025-03-20 20:09 ` Kees Cook 2025-03-24 16:00 ` [PATCH] exec: fix the racy usage of fs_struct->in_exec Oleg Nesterov 0 siblings, 2 replies; 30+ messages in thread From: syzbot @ 2025-03-20 19:09 UTC (permalink / raw) To: brauner, jack, kees, linux-fsdevel, linux-kernel, linux-mm, syzkaller-bugs, viro Hello, syzbot found the following issue on: HEAD commit: a7f2e10ecd8f Merge tag 'hwmon-fixes-for-v6.14-rc8/6.14' of.. git tree: upstream console output: https://syzkaller.appspot.com/x/log.txt?x=114fee98580000 kernel config: https://syzkaller.appspot.com/x/.config?x=f33d372c4021745 dashboard link: https://syzkaller.appspot.com/bug?extid=1c486d0b62032c82a968 compiler: Debian clang version 15.0.6, GNU ld (GNU Binutils for Debian) 2.40 Unfortunately, I don't have any reproducer for this issue yet. Downloadable assets: disk image: https://storage.googleapis.com/syzbot-assets/614aabc71b48/disk-a7f2e10e.raw.xz vmlinux: https://storage.googleapis.com/syzbot-assets/d47dd90a010a/vmlinux-a7f2e10e.xz kernel image: https://storage.googleapis.com/syzbot-assets/418d8cf8782b/bzImage-a7f2e10e.xz IMPORTANT: if you fix the issue, please add the following tag to the commit: Reported-by: syzbot+1c486d0b62032c82a968@syzkaller.appspotmail.com ================================================================== BUG: KCSAN: data-race in bprm_execve / copy_fs write to 0xffff8881044f8250 of 4 bytes by task 13692 on cpu 0: bprm_execve+0x748/0x9c0 fs/exec.c:1884 do_execveat_common+0x769/0x7e0 fs/exec.c:1966 do_execveat fs/exec.c:2051 [inline] __do_sys_execveat fs/exec.c:2125 [inline] __se_sys_execveat fs/exec.c:2119 [inline] __x64_sys_execveat+0x75/0x90 fs/exec.c:2119 x64_sys_call+0x291e/0x2dc0 arch/x86/include/generated/asm/syscalls_64.h:323 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xc9/0x1c0 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f read to 0xffff8881044f8250 of 4 bytes by task 13686 on cpu 1: copy_fs+0x95/0xf0 kernel/fork.c:1770 copy_process+0xc91/0x1f50 kernel/fork.c:2394 kernel_clone+0x167/0x5e0 kernel/fork.c:2815 __do_sys_clone3 kernel/fork.c:3119 [inline] __se_sys_clone3+0x1c1/0x200 kernel/fork.c:3098 __x64_sys_clone3+0x31/0x40 kernel/fork.c:3098 x64_sys_call+0x2d56/0x2dc0 arch/x86/include/generated/asm/syscalls_64.h:436 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xc9/0x1c0 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f value changed: 0x00000001 -> 0x00000000 Reported by Kernel Concurrency Sanitizer on: CPU: 1 UID: 0 PID: 13686 Comm: syz.1.3826 Not tainted 6.14.0-rc7-syzkaller-00074-ga7f2e10ecd8f #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/12/2025 ================================================================== --- This report is generated by a bot. It may contain errors. See https://goo.gl/tpsmEJ for more information about syzbot. syzbot engineers can be reached at syzkaller@googlegroups.com. syzbot will keep track of this issue. See: https://goo.gl/tpsmEJ#status for how to communicate with syzbot. 
If the report is already addressed, let syzbot know by replying with: #syz fix: exact-commit-title If you want to overwrite report's subsystems, reply with: #syz set subsystems: new-subsystem (See the list of subsystem names on the web dashboard) If the report is a duplicate of another one, reply with: #syz dup: exact-subject-of-another-report If you want to undo deduplication, reply with: #syz undup ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [syzbot] [fs?] [mm?] KCSAN: data-race in bprm_execve / copy_fs (4) 2025-03-20 19:09 [syzbot] [fs?] [mm?] KCSAN: data-race in bprm_execve / copy_fs (4) syzbot @ 2025-03-20 20:09 ` Kees Cook 2025-03-21 1:44 ` Al Viro 2025-03-21 8:45 ` Christian Brauner 2025-03-24 16:00 ` [PATCH] exec: fix the racy usage of fs_struct->in_exec Oleg Nesterov 1 sibling, 2 replies; 30+ messages in thread From: Kees Cook @ 2025-03-20 20:09 UTC (permalink / raw) To: Oleg Nesterov, brauner Cc: jack, linux-fsdevel, linux-kernel, linux-mm, syzkaller-bugs, viro, syzbot Hey look another threaded exec bug. :| On Thu, Mar 20, 2025 at 12:09:36PM -0700, syzbot wrote: > ================================================================== > BUG: KCSAN: data-race in bprm_execve / copy_fs > > write to 0xffff8881044f8250 of 4 bytes by task 13692 on cpu 0: > bprm_execve+0x748/0x9c0 fs/exec.c:1884 This is: current->fs->in_exec = 0; And is part of the execve failure path: out: ... if (bprm->point_of_no_return && !fatal_signal_pending(current)) force_fatal_sig(SIGSEGV); sched_mm_cid_after_execve(current); current->fs->in_exec = 0; current->in_execve = 0; return retval; } > do_execveat_common+0x769/0x7e0 fs/exec.c:1966 > do_execveat fs/exec.c:2051 [inline] > __do_sys_execveat fs/exec.c:2125 [inline] > __se_sys_execveat fs/exec.c:2119 [inline] > __x64_sys_execveat+0x75/0x90 fs/exec.c:2119 > x64_sys_call+0x291e/0x2dc0 arch/x86/include/generated/asm/syscalls_64.h:323 > do_syscall_x64 arch/x86/entry/common.c:52 [inline] > do_syscall_64+0xc9/0x1c0 arch/x86/entry/common.c:83 > entry_SYSCALL_64_after_hwframe+0x77/0x7f > > read to 0xffff8881044f8250 of 4 bytes by task 13686 on cpu 1: > copy_fs+0x95/0xf0 kernel/fork.c:1770 This is: if (fs->in_exec) { Which is under lock: struct fs_struct *fs = current->fs; if (clone_flags & CLONE_FS) { /* tsk->fs is already what we want */ spin_lock(&fs->lock); /* "users" and "in_exec" locked for check_unsafe_exec() * */ if (fs->in_exec) { spin_unlock(&fs->lock); return -EAGAIN; } fs->users++; spin_unlock(&fs->lock); Does execve need to be taking this lock? The other thing touching it is check_unsafe_exec(), which takes the lock. It looks like the bprm_execve() lock was removed in commit 8c652f96d385 ("do_execve() must not clear fs->in_exec if it was set by another thread") which used the return value from check_unsafe_exec(): When do_execve() succeeds, it is safe to clear ->in_exec unconditionally. It can be set only if we don't share ->fs with another process, and since we already killed all sub-threads either ->in_exec == 0 or we are the only user of this ->fs. Also, we do not need fs->lock to clear fs->in_exec. This logic was updated in commit 9e00cdb091b0 ("exec:check_unsafe_exec: kill the dead -EAGAIN and clear_in_exec logic"), which includes this rationale: 2. "out_unmark:" in do_execve_common() is either called under ->cred_guard_mutex, or after de_thread() which kills other threads, so we can't race with sub-thread which could set ->in_exec. And if ->fs is shared with another process ->in_exec should be false anyway. The de_thread() is part of the "point of no return" in exec_binprm(), called via exec_binprm(). But the bprm_execve() error path is reachable from many paths prior to the point of no return. What I can imagine here is two failing execs racing a fork: A start execve B fork with CLONE_FS C start execve, reach check_unsafe_exec(), set fs->in_exec A bprm_execve() failure, clear fs->in_exec B copy_fs() increment fs->users. 
C bprm_execve() failure, clear fs->in_exec But I don't think this is a "real" flaw, though, since the locking is to protect a _successful_ execve from a fork (i.e. getting the user count right). A successful execve will de_thread, and I don't see any wrong counting of fs->users with regard to thread lifetime. Did I miss something in the analysis? Should we perform locking anyway, or add data race annotations, or something else? -- Kees Cook ^ permalink raw reply [flat|nested] 30+ messages in thread
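For reference, the check being discussed above lives in check_unsafe_exec(); here is a trimmed sketch of it, abridged from fs/exec.c as of the kernels in this thread (the ptrace/no_new_privs handling and other unrelated checks are dropped, and details may vary slightly by kernel version):

static void check_unsafe_exec(struct linux_binprm *bprm)
{
	struct task_struct *p = current, *t;
	unsigned n_fs = 1;

	/* ... ptrace / no_new_privs handling trimmed ... */

	spin_lock(&p->fs->lock);
	rcu_read_lock();
	/* count how many of our own threads share ->fs */
	for_other_threads(p, t) {
		if (t->fs == p->fs)
			n_fs++;
	}
	rcu_read_unlock();

	/* "users" and "in_exec" locked for copy_fs() */
	if (p->fs->users > n_fs)
		/* someone outside this thread group shares ->fs */
		bprm->unsafe |= LSM_UNSAFE_SHARE;
	else
		p->fs->in_exec = 1;
	spin_unlock(&p->fs->lock);
}

copy_fs(), quoted above, is the consumer of this flag: it refuses clone(CLONE_FS) with -EAGAIN while in_exec is set, and check_unsafe_exec() only sets the flag when nobody outside the exec'ing thread group shares ->fs.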
* Re: [syzbot] [fs?] [mm?] KCSAN: data-race in bprm_execve / copy_fs (4) 2025-03-20 20:09 ` Kees Cook @ 2025-03-21 1:44 ` Al Viro 2025-03-21 8:10 ` Kees Cook 2025-03-21 8:45 ` Christian Brauner 1 sibling, 1 reply; 30+ messages in thread From: Al Viro @ 2025-03-21 1:44 UTC (permalink / raw) To: Kees Cook Cc: Oleg Nesterov, brauner, jack, linux-fsdevel, linux-kernel, linux-mm, syzkaller-bugs, syzbot On Thu, Mar 20, 2025 at 01:09:38PM -0700, Kees Cook wrote: > What I can imagine here is two failing execs racing a fork: > > A start execve > B fork with CLONE_FS > C start execve, reach check_unsafe_exec(), set fs->in_exec > A bprm_execve() failure, clear fs->in_exec > B copy_fs() increment fs->users. > C bprm_execve() failure, clear fs->in_exec > > But I don't think this is a "real" flaw, though, since the locking is to > protect a _successful_ execve from a fork (i.e. getting the user count > right). A successful execve will de_thread, and I don't see any wrong > counting of fs->users with regard to thread lifetime. > > Did I miss something in the analysis? Should we perform locking anyway, > or add data race annotations, or something else? Umm... What if C succeeds, ending up with suid sharing ->fs? ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [syzbot] [fs?] [mm?] KCSAN: data-race in bprm_execve / copy_fs (4) 2025-03-21 1:44 ` Al Viro @ 2025-03-21 8:10 ` Kees Cook 2025-03-21 8:49 ` Christian Brauner 0 siblings, 1 reply; 30+ messages in thread From: Kees Cook @ 2025-03-21 8:10 UTC (permalink / raw) To: Al Viro Cc: Oleg Nesterov, brauner, jack, linux-fsdevel, linux-kernel, linux-mm, syzkaller-bugs, syzbot On Fri, Mar 21, 2025 at 01:44:23AM +0000, Al Viro wrote: > On Thu, Mar 20, 2025 at 01:09:38PM -0700, Kees Cook wrote: > > > What I can imagine here is two failing execs racing a fork: > > > > A start execve > > B fork with CLONE_FS > > C start execve, reach check_unsafe_exec(), set fs->in_exec > > A bprm_execve() failure, clear fs->in_exec > > B copy_fs() increment fs->users. > > C bprm_execve() failure, clear fs->in_exec > > > > But I don't think this is a "real" flaw, though, since the locking is to > > protect a _successful_ execve from a fork (i.e. getting the user count > > right). A successful execve will de_thread, and I don't see any wrong > > counting of fs->users with regard to thread lifetime. > > > > Did I miss something in the analysis? Should we perform locking anyway, > > or add data race annotations, or something else? > > Umm... What if C succeeds, ending up with suid sharing ->fs? I still can't quite construct it -- fs->users is always correct, I think? Below would be the bad set of events, but it's wrong that "fs->users==1". If A and C are both running with CLONE_FS then fs->users==2. A would need to exit first, but it can't do that and also set fs->in_exec=0 A execve, reaches bprm_execve() failure path B fork with CLONE_FS, reaches copy_fs() C execve, reaches check_unsafe_exec() C takes fs->lock, counts, finds safe fs->users==1, sets in_exec=1, unlocks A sets fs->in_exec=0 B takes fs->lock, sees in_exec==0, does fs->users++, unlocks C goes setuid, sharing fs with unpriv B Something still feels very weird, though. Does fs->in_exec not matter at all? Hmm, no, it stops fs->users++ happening after it was validated to be 1. -- Kees Cook ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [syzbot] [fs?] [mm?] KCSAN: data-race in bprm_execve / copy_fs (4) 2025-03-21 8:10 ` Kees Cook @ 2025-03-21 8:49 ` Christian Brauner 0 siblings, 0 replies; 30+ messages in thread From: Christian Brauner @ 2025-03-21 8:49 UTC (permalink / raw) To: Kees Cook Cc: Al Viro, Oleg Nesterov, jack, linux-fsdevel, linux-kernel, linux-mm, syzkaller-bugs, syzbot On Fri, Mar 21, 2025 at 01:10:26AM -0700, Kees Cook wrote: > On Fri, Mar 21, 2025 at 01:44:23AM +0000, Al Viro wrote: > > On Thu, Mar 20, 2025 at 01:09:38PM -0700, Kees Cook wrote: > > > > > What I can imagine here is two failing execs racing a fork: > > > > > > A start execve > > > B fork with CLONE_FS > > > C start execve, reach check_unsafe_exec(), set fs->in_exec > > > A bprm_execve() failure, clear fs->in_exec > > > B copy_fs() increment fs->users. > > > C bprm_execve() failure, clear fs->in_exec > > > > > > But I don't think this is a "real" flaw, though, since the locking is to > > > protect a _successful_ execve from a fork (i.e. getting the user count > > > right). A successful execve will de_thread, and I don't see any wrong > > > counting of fs->users with regard to thread lifetime. > > > > > > Did I miss something in the analysis? Should we perform locking anyway, > > > or add data race annotations, or something else? > > > > Umm... What if C succeeds, ending up with suid sharing ->fs? > > I still can't quite construct it -- fs->users is always correct, I > think? > > Below would be the bad set of events, but it's wrong that "fs->users==1". > If A and C are both running with CLONE_FS then fs->users==2. A would need to > exit first, but it can't do that and also set fs->in_exec=0 > > A execve, reaches bprm_execve() failure path > B fork with CLONE_FS, reaches copy_fs() > C execve, reaches check_unsafe_exec() > C takes fs->lock, counts, finds safe fs->users==1, sets in_exec=1, unlocks > A sets fs->in_exec=0 > B takes fs->lock, sees in_exec==0, does fs->users++, unlocks > C goes setuid, sharing fs with unpriv B > > Something still feels very weird, though. Does fs->in_exec not matter at > all? Hmm, no, it stops fs->users++ happening after it was validated to be 1. This is a harmless data race afaict. See my other mail. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [syzbot] [fs?] [mm?] KCSAN: data-race in bprm_execve / copy_fs (4) 2025-03-20 20:09 ` Kees Cook 2025-03-21 1:44 ` Al Viro @ 2025-03-21 8:45 ` Christian Brauner 2025-03-22 1:00 ` Al Viro 1 sibling, 1 reply; 30+ messages in thread From: Christian Brauner @ 2025-03-21 8:45 UTC (permalink / raw) To: Kees Cook Cc: Oleg Nesterov, jack, linux-fsdevel, linux-kernel, linux-mm, syzkaller-bugs, viro, syzbot On Thu, Mar 20, 2025 at 01:09:38PM -0700, Kees Cook wrote: > Hey look another threaded exec bug. :| > > On Thu, Mar 20, 2025 at 12:09:36PM -0700, syzbot wrote: > > ================================================================== > > BUG: KCSAN: data-race in bprm_execve / copy_fs > > > > write to 0xffff8881044f8250 of 4 bytes by task 13692 on cpu 0: > > bprm_execve+0x748/0x9c0 fs/exec.c:1884 > > This is: > > current->fs->in_exec = 0; > > And is part of the execve failure path: > > out: > ... > if (bprm->point_of_no_return && !fatal_signal_pending(current)) > force_fatal_sig(SIGSEGV); > > sched_mm_cid_after_execve(current); > current->fs->in_exec = 0; > current->in_execve = 0; > > return retval; > } > > > do_execveat_common+0x769/0x7e0 fs/exec.c:1966 > > do_execveat fs/exec.c:2051 [inline] > > __do_sys_execveat fs/exec.c:2125 [inline] > > __se_sys_execveat fs/exec.c:2119 [inline] > > __x64_sys_execveat+0x75/0x90 fs/exec.c:2119 > > x64_sys_call+0x291e/0x2dc0 arch/x86/include/generated/asm/syscalls_64.h:323 > > do_syscall_x64 arch/x86/entry/common.c:52 [inline] > > do_syscall_64+0xc9/0x1c0 arch/x86/entry/common.c:83 > > entry_SYSCALL_64_after_hwframe+0x77/0x7f > > > > read to 0xffff8881044f8250 of 4 bytes by task 13686 on cpu 1: > > copy_fs+0x95/0xf0 kernel/fork.c:1770 > > This is: > > if (fs->in_exec) { > > Which is under lock: > > struct fs_struct *fs = current->fs; > if (clone_flags & CLONE_FS) { > /* tsk->fs is already what we want */ > spin_lock(&fs->lock); > /* "users" and "in_exec" locked for check_unsafe_exec() * */ > if (fs->in_exec) { > spin_unlock(&fs->lock); > return -EAGAIN; > } > fs->users++; > spin_unlock(&fs->lock); > > > Does execve need to be taking this lock? The other thing touching it is > check_unsafe_exec(), which takes the lock. It looks like the bprm_execve() > lock was removed in commit 8c652f96d385 ("do_execve() must not clear > fs->in_exec if it was set by another thread") which used the return > value from check_unsafe_exec(): > > When do_execve() succeeds, it is safe to clear ->in_exec unconditionally. > It can be set only if we don't share ->fs with another process, and since > we already killed all sub-threads either ->in_exec == 0 or we are the > only user of this ->fs. > > Also, we do not need fs->lock to clear fs->in_exec. > > This logic was updated in commit 9e00cdb091b0 ("exec:check_unsafe_exec: > kill the dead -EAGAIN and clear_in_exec logic"), which includes this > rationale: > > 2. "out_unmark:" in do_execve_common() is either called > under ->cred_guard_mutex, or after de_thread() which > kills other threads, so we can't race with sub-thread > which could set ->in_exec. And if ->fs is shared with > another process ->in_exec should be false anyway. > > The de_thread() is part of the "point of no return" in exec_binprm(), > called via exec_binprm(). But the bprm_execve() error path is reachable > from many paths prior to the point of no return. 
> > What I can imagine here is two failing execs racing a fork: > > A start execve > B fork with CLONE_FS > C start execve, reach check_unsafe_exec(), set fs->in_exec > A bprm_execve() failure, clear fs->in_exec > B copy_fs() increment fs->users. > C bprm_execve() failure, clear fs->in_exec > > But I don't think this is a "real" flaw, though, since the locking is to > protect a _successful_ execve from a fork (i.e. getting the user count > right). A successful execve will de_thread, and I don't see any wrong > counting of fs->users with regard to thread lifetime. > > Did I miss something in the analysis? Should we perform locking anyway, > or add data race annotations, or something else? Afaict, the only way this data race can happen is if we jump to the cleanup label and then reset current->fs->in_exec. If the execve was successful there's no one to race us with CLONE_FS obviously because we took down all other threads. I think the logic in commit 9e00cdb091b0 ("exec:check_unsafe_exec: kill the dead -EAGAIN and clear_in_exec logic") is sound. This is a harmless data race that can only happen if the execve fails. The worst that can happen is that a subthread does clone(CLONE_FS) and gets a spurious error because it raced with the exec'ing subthread resetting fs->in_exec. So I think all we need is: diff --git a/fs/exec.c b/fs/exec.c index 506cd411f4ac..177acaf196a9 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1881,7 +1881,13 @@ static int bprm_execve(struct linux_binprm *bprm) force_fatal_sig(SIGSEGV); sched_mm_cid_after_execve(current); - current->fs->in_exec = 0; + /* + * If this execve failed before de_thread() and another + * subthread is concurrently forking with CLONE_FS they race + * with us resetting current->fs->in_exec. This is fine, + * annotate it. + */ + data_race(current->fs->in_exec = 1); current->in_execve = 0; return retval; ^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [syzbot] [fs?] [mm?] KCSAN: data-race in bprm_execve / copy_fs (4) 2025-03-21 8:45 ` Christian Brauner @ 2025-03-22 1:00 ` Al Viro 2025-03-22 6:26 ` Kees Cook ` (2 more replies) 0 siblings, 3 replies; 30+ messages in thread From: Al Viro @ 2025-03-22 1:00 UTC (permalink / raw) To: Christian Brauner Cc: Kees Cook, Oleg Nesterov, jack, linux-fsdevel, linux-kernel, linux-mm, syzkaller-bugs, syzbot On Fri, Mar 21, 2025 at 09:45:39AM +0100, Christian Brauner wrote: > Afaict, the only way this data race can happen is if we jump to the > cleanup label and then reset current->fs->in_exec. If the execve was > successful there's no one to race us with CLONE_FS obviously because we > took down all other threads. Not really. 1) A enters check_unsafe_execve(), sets ->in_exec to 1 2) B enters check_unsafe_execve(), sets ->in_exec to 1 3) A calls exec_binprm(), fails (bad binary) 4) A clears ->in_exec 5) C calls clone(2) with CLONE_FS and spawns D - ->in_exec is 0 6) B gets through exec_binprm(), kills A and C, but not D. 7) B clears ->in_exec, returns Result: B and D share ->fs, B runs suid binary. Had (5) happened prior to (2), (2) wouldn't have set ->in_exec; had (5) happened prior to (4), clone() would've failed; had (5) been delayed past (6), there wouldn't have been a thread to call clone(). But in the window between (4) and (6), clone() doesn't see execve() in progress and check_unsafe_execve() has already been done, so it hadn't seen the extra thread. IOW, it really is racy. It's a counter, not a flag. ^ permalink raw reply [flat|nested] 30+ messages in thread
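To make the arrangement of actors concrete, here is a structural userspace sketch of the scenario Al describes. It only shows how A, B, C and D relate; it is not a working reproducer (the window between steps (4) and (6) is narrow and timing dependent), and the binary paths are placeholders -- the suid target in particular is assumed to exist.

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SZ (64 * 1024)

/* D: a separate process sharing ->fs with the exec'ing thread group */
static int fs_sharing_child(void *arg)
{
	(void)arg;
	usleep(100 * 1000);	/* in the bad interleaving, D outlives B's de_thread() */
	return 0;
}

/* A: keep exec'ing a binary that fails, driving bprm_execve()'s error path */
static void *thread_a(void *arg)
{
	(void)arg;
	for (;;)
		execve("/tmp/always-fails", (char *[]){ "x", NULL },
		       (char *[]){ NULL });
	return NULL;
}

/* C: keep doing clone(CLONE_FS), hoping to land between steps (4) and (6) */
static void *thread_c(void *arg)
{
	char *stack = malloc(STACK_SZ);
	pid_t pid;

	(void)arg;
	for (;;) {
		pid = clone(fs_sharing_child, stack + STACK_SZ,
			    CLONE_FS | SIGCHLD, NULL);
		if (pid > 0)
			waitpid(pid, NULL, 0);
	}
	return NULL;
}

int main(void)
{
	pthread_t a, c;

	pthread_create(&a, NULL, thread_a, NULL);
	pthread_create(&c, NULL, thread_c, NULL);

	/*
	 * B: exec a set-uid binary.  On the bad interleaving it succeeds
	 * while a freshly cloned D still shares ->fs with the new image.
	 */
	execve("/tmp/some-suid-binary", (char *[]){ "suid", NULL },
	       (char *[]){ NULL });
	perror("execve");
	return 1;
}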
* Re: [syzbot] [fs?] [mm?] KCSAN: data-race in bprm_execve / copy_fs (4) 2025-03-22 1:00 ` Al Viro @ 2025-03-22 6:26 ` Kees Cook 2025-03-22 10:15 ` Mateusz Guzik 2025-03-22 10:23 ` Christian Brauner 2025-03-22 15:55 ` Oleg Nesterov 2 siblings, 1 reply; 30+ messages in thread From: Kees Cook @ 2025-03-22 6:26 UTC (permalink / raw) To: Al Viro Cc: Christian Brauner, Oleg Nesterov, jack, linux-fsdevel, linux-kernel, linux-mm, syzkaller-bugs, syzbot On Sat, Mar 22, 2025 at 01:00:08AM +0000, Al Viro wrote: > On Fri, Mar 21, 2025 at 09:45:39AM +0100, Christian Brauner wrote: > > > Afaict, the only way this data race can happen is if we jump to the > > cleanup label and then reset current->fs->in_exec. If the execve was > > successful there's no one to race us with CLONE_FS obviously because we > > took down all other threads. > > Not really. Yeah, you found it. Thank you! > 1) A enters check_unsafe_execve(), sets ->in_exec to 1 > 2) B enters check_unsafe_execve(), sets ->in_exec to 1 With 3 threads A, B, and C already running, fs->users == 3, so steps (1) and (2) happily pass. > 3) A calls exec_binprm(), fails (bad binary) > 4) A clears ->in_exec > 5) C calls clone(2) with CLONE_FS and spawns D - ->in_exec is 0 D's creation bumps fs->users == 4. > 6) B gets through exec_binprm(), kills A and C, but not D. > 7) B clears ->in_exec, returns > > Result: B and D share ->fs, B runs suid binary. > > Had (5) happened prior to (2), (2) wouldn't have set ->in_exec; > had (5) happened prior to (4), clone() would've failed; had > (5) been delayed past (6), there wouldn't have been a thread > to call clone(). > > But in the window between (4) and (6), clone() doesn't see > execve() in progress and check_unsafe_execve() has already > been done, so it hadn't seen the extra thread. > > IOW, it really is racy. It's a counter, not a flag. Yeah, I would agree. 
Totally untested patch: diff --git a/fs/exec.c b/fs/exec.c index 506cd411f4ac..988b8621c079 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1632,7 +1632,7 @@ static void check_unsafe_exec(struct linux_binprm *bprm) if (p->fs->users > n_fs) bprm->unsafe |= LSM_UNSAFE_SHARE; else - p->fs->in_exec = 1; + refcount_inc(&p->fs->in_exec); spin_unlock(&p->fs->lock); } @@ -1862,7 +1862,7 @@ static int bprm_execve(struct linux_binprm *bprm) sched_mm_cid_after_execve(current); /* execve succeeded */ - current->fs->in_exec = 0; + refcount_dec(¤t->fs->in_exec); current->in_execve = 0; rseq_execve(current); user_events_execve(current); @@ -1881,7 +1881,7 @@ static int bprm_execve(struct linux_binprm *bprm) force_fatal_sig(SIGSEGV); sched_mm_cid_after_execve(current); - current->fs->in_exec = 0; + refcount_dec(¤t->fs->in_exec); current->in_execve = 0; return retval; diff --git a/fs/fs_struct.c b/fs/fs_struct.c index 64c2d0814ed6..df46b873c425 100644 --- a/fs/fs_struct.c +++ b/fs/fs_struct.c @@ -115,7 +115,7 @@ struct fs_struct *copy_fs_struct(struct fs_struct *old) /* We don't need to lock fs - think why ;-) */ if (fs) { fs->users = 1; - fs->in_exec = 0; + fs->in_exec = REFCOUNT_INIT(0); spin_lock_init(&fs->lock); seqcount_spinlock_init(&fs->seq, &fs->lock); fs->umask = old->umask; diff --git a/include/linux/fs_struct.h b/include/linux/fs_struct.h index 783b48dedb72..aebc0b7aedb9 100644 --- a/include/linux/fs_struct.h +++ b/include/linux/fs_struct.h @@ -11,7 +11,7 @@ struct fs_struct { spinlock_t lock; seqcount_spinlock_t seq; int umask; - int in_exec; + refcount_t in_exec; struct path root, pwd; } __randomize_layout; diff --git a/kernel/fork.c b/kernel/fork.c index 735405a9c5f3..8b427045fd86 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1767,7 +1767,7 @@ static int copy_fs(unsigned long clone_flags, struct task_struct *tsk) /* tsk->fs is already what we want */ spin_lock(&fs->lock); /* "users" and "in_exec" locked for check_unsafe_exec() */ - if (fs->in_exec) { + if (refcount_read(&fs->in_exec)) { spin_unlock(&fs->lock); return -EAGAIN; } -- Kees Cook ^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [syzbot] [fs?] [mm?] KCSAN: data-race in bprm_execve / copy_fs (4) 2025-03-22 6:26 ` Kees Cook @ 2025-03-22 10:15 ` Mateusz Guzik 2025-03-22 10:28 ` Christian Brauner 0 siblings, 1 reply; 30+ messages in thread From: Mateusz Guzik @ 2025-03-22 10:15 UTC (permalink / raw) To: Kees Cook Cc: Al Viro, Christian Brauner, Oleg Nesterov, jack, linux-fsdevel, linux-kernel, linux-mm, syzkaller-bugs, syzbot On Fri, Mar 21, 2025 at 11:26:03PM -0700, Kees Cook wrote: > On Sat, Mar 22, 2025 at 01:00:08AM +0000, Al Viro wrote: > > On Fri, Mar 21, 2025 at 09:45:39AM +0100, Christian Brauner wrote: > > > > > Afaict, the only way this data race can happen is if we jump to the > > > cleanup label and then reset current->fs->in_exec. If the execve was > > > successful there's no one to race us with CLONE_FS obviously because we > > > took down all other threads. > > > > Not really. > > Yeah, you found it. Thank you! > > > 1) A enters check_unsafe_execve(), sets ->in_exec to 1 > > 2) B enters check_unsafe_execve(), sets ->in_exec to 1 > > With 3 threads A, B, and C already running, fs->users == 3, so steps (1) > and (2) happily pass. > > > 3) A calls exec_binprm(), fails (bad binary) > > 4) A clears ->in_exec > > 5) C calls clone(2) with CLONE_FS and spawns D - ->in_exec is 0 > > D's creation bumps fs->users == 4. > > > 6) B gets through exec_binprm(), kills A and C, but not D. > > 7) B clears ->in_exec, returns > > > > Result: B and D share ->fs, B runs suid binary. > > > > Had (5) happened prior to (2), (2) wouldn't have set ->in_exec; > > had (5) happened prior to (4), clone() would've failed; had > > (5) been delayed past (6), there wouldn't have been a thread > > to call clone(). > > > > But in the window between (4) and (6), clone() doesn't see > > execve() in progress and check_unsafe_execve() has already > > been done, so it hadn't seen the extra thread. > > > > IOW, it really is racy. It's a counter, not a flag. > > Yeah, I would agree. Totally untested patch: > > diff --git a/fs/exec.c b/fs/exec.c > index 506cd411f4ac..988b8621c079 100644 > --- a/fs/exec.c > +++ b/fs/exec.c > @@ -1632,7 +1632,7 @@ static void check_unsafe_exec(struct linux_binprm *bprm) > if (p->fs->users > n_fs) > bprm->unsafe |= LSM_UNSAFE_SHARE; > else > - p->fs->in_exec = 1; > + refcount_inc(&p->fs->in_exec); > spin_unlock(&p->fs->lock); > } > > @@ -1862,7 +1862,7 @@ static int bprm_execve(struct linux_binprm *bprm) > > sched_mm_cid_after_execve(current); > /* execve succeeded */ > - current->fs->in_exec = 0; > + refcount_dec(¤t->fs->in_exec); > current->in_execve = 0; > rseq_execve(current); > user_events_execve(current); > @@ -1881,7 +1881,7 @@ static int bprm_execve(struct linux_binprm *bprm) > force_fatal_sig(SIGSEGV); > > sched_mm_cid_after_execve(current); > - current->fs->in_exec = 0; > + refcount_dec(¤t->fs->in_exec); > current->in_execve = 0; > > return retval; The bump is conditional and with this patch you may be issuing refcount_dec when you declined to refcount_inc. A special case where there are others to worry about and which proceeds with an exec without leaving in any indicators is imo sketchy. I would argue it would make the most sense to serialize these execs. Vast majority of programs are single-threaded when they exec with an unshared ->fs, so they don't need to bear any overhead nor complexity modulo a branch. For any fucky case you can park yourself waiting for any pending exec to finish. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [syzbot] [fs?] [mm?] KCSAN: data-race in bprm_execve / copy_fs (4) 2025-03-22 10:15 ` Mateusz Guzik @ 2025-03-22 10:28 ` Christian Brauner 0 siblings, 0 replies; 30+ messages in thread From: Christian Brauner @ 2025-03-22 10:28 UTC (permalink / raw) To: Mateusz Guzik Cc: Kees Cook, Al Viro, Oleg Nesterov, jack, linux-fsdevel, linux-kernel, linux-mm, syzkaller-bugs, syzbot On Sat, Mar 22, 2025 at 11:15:44AM +0100, Mateusz Guzik wrote: > On Fri, Mar 21, 2025 at 11:26:03PM -0700, Kees Cook wrote: > > On Sat, Mar 22, 2025 at 01:00:08AM +0000, Al Viro wrote: > > > On Fri, Mar 21, 2025 at 09:45:39AM +0100, Christian Brauner wrote: > > > > > > > Afaict, the only way this data race can happen is if we jump to the > > > > cleanup label and then reset current->fs->in_exec. If the execve was > > > > successful there's no one to race us with CLONE_FS obviously because we > > > > took down all other threads. > > > > > > Not really. > > > > Yeah, you found it. Thank you! > > > > > 1) A enters check_unsafe_execve(), sets ->in_exec to 1 > > > 2) B enters check_unsafe_execve(), sets ->in_exec to 1 > > > > With 3 threads A, B, and C already running, fs->users == 3, so steps (1) > > and (2) happily pass. > > > > > 3) A calls exec_binprm(), fails (bad binary) > > > 4) A clears ->in_exec > > > 5) C calls clone(2) with CLONE_FS and spawns D - ->in_exec is 0 > > > > D's creation bumps fs->users == 4. > > > > > 6) B gets through exec_binprm(), kills A and C, but not D. > > > 7) B clears ->in_exec, returns > > > > > > Result: B and D share ->fs, B runs suid binary. > > > > > > Had (5) happened prior to (2), (2) wouldn't have set ->in_exec; > > > had (5) happened prior to (4), clone() would've failed; had > > > (5) been delayed past (6), there wouldn't have been a thread > > > to call clone(). > > > > > > But in the window between (4) and (6), clone() doesn't see > > > execve() in progress and check_unsafe_execve() has already > > > been done, so it hadn't seen the extra thread. > > > > > > IOW, it really is racy. It's a counter, not a flag. > > > > Yeah, I would agree. Totally untested patch: > > > > diff --git a/fs/exec.c b/fs/exec.c > > index 506cd411f4ac..988b8621c079 100644 > > --- a/fs/exec.c > > +++ b/fs/exec.c > > @@ -1632,7 +1632,7 @@ static void check_unsafe_exec(struct linux_binprm *bprm) > > if (p->fs->users > n_fs) > > bprm->unsafe |= LSM_UNSAFE_SHARE; > > else > > - p->fs->in_exec = 1; > > + refcount_inc(&p->fs->in_exec); > > spin_unlock(&p->fs->lock); > > } > > > > @@ -1862,7 +1862,7 @@ static int bprm_execve(struct linux_binprm *bprm) > > > > sched_mm_cid_after_execve(current); > > /* execve succeeded */ > > - current->fs->in_exec = 0; > > + refcount_dec(¤t->fs->in_exec); > > current->in_execve = 0; > > rseq_execve(current); > > user_events_execve(current); > > @@ -1881,7 +1881,7 @@ static int bprm_execve(struct linux_binprm *bprm) > > force_fatal_sig(SIGSEGV); > > > > sched_mm_cid_after_execve(current); > > - current->fs->in_exec = 0; > > + refcount_dec(¤t->fs->in_exec); > > current->in_execve = 0; > > > > return retval; > > The bump is conditional and with this patch you may be issuing > refcount_dec when you declined to refcount_inc. > > A special case where there are others to worry about and which proceeds > with an exec without leaving in any indicators is imo sketchy. > > I would argue it would make the most sense to serialize these execs. Yeah, I tend to agree. 
And fwiw, I had proposed somewhere else already that we should start restricting thread-group exec to the thread-group leader because subthread exec is about as ugly as it can get. I'd like to add a sysctl for this with the goal of removing this completely in the future. I think there are very few, if any, legitimate cases for subthread aka non-thread-group-leader exec. It not only complicates things, it also means two tasks switch TIDs, which is super confusing from userspace (it also complicates pidfd polling, but that's an aside). I'll whip up a patch for this once I get back from travel. > > Vast majority of programs are single-threaded when they exec with an unshared ->fs, so they don't need to bear any overhead nor complexity modulo a branch. > > For any fucky case you can park yourself waiting for any pending exec to finish. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [syzbot] [fs?] [mm?] KCSAN: data-race in bprm_execve / copy_fs (4) 2025-03-22 1:00 ` Al Viro 2025-03-22 6:26 ` Kees Cook @ 2025-03-22 10:23 ` Christian Brauner 2025-03-22 15:55 ` Oleg Nesterov 2 siblings, 0 replies; 30+ messages in thread From: Christian Brauner @ 2025-03-22 10:23 UTC (permalink / raw) To: Al Viro Cc: Kees Cook, Oleg Nesterov, jack, linux-fsdevel, linux-kernel, linux-mm, syzkaller-bugs, syzbot On Sat, Mar 22, 2025 at 01:00:08AM +0000, Al Viro wrote: > On Fri, Mar 21, 2025 at 09:45:39AM +0100, Christian Brauner wrote: > > > Afaict, the only way this data race can happen is if we jump to the > > cleanup label and then reset current->fs->in_exec. If the execve was > > successful there's no one to race us with CLONE_FS obviously because we > > took down all other threads. > > Not really. > > 1) A enters check_unsafe_execve(), sets ->in_exec to 1 > 2) B enters check_unsafe_execve(), sets ->in_exec to 1 > 3) A calls exec_binprm(), fails (bad binary) > 4) A clears ->in_exec > 5) C calls clone(2) with CLONE_FS and spawns D - ->in_exec is 0 > 6) B gets through exec_binprm(), kills A and C, but not D. > 7) B clears ->in_exec, returns > > Result: B and D share ->fs, B runs suid binary. > > Had (5) happened prior to (2), (2) wouldn't have set ->in_exec; > had (5) happened prior to (4), clone() would've failed; had > (5) been delayed past (6), there wouldn't have been a thread > to call clone(). > > But in the window between (4) and (6), clone() doesn't see > execve() in progress and check_unsafe_execve() has already > been done, so it hadn't seen the extra thread. Eewww, you're right. That's ugly as hell. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [syzbot] [fs?] [mm?] KCSAN: data-race in bprm_execve / copy_fs (4) 2025-03-22 1:00 ` Al Viro 2025-03-22 6:26 ` Kees Cook 2025-03-22 10:23 ` Christian Brauner @ 2025-03-22 15:55 ` Oleg Nesterov 2025-03-22 18:50 ` Al Viro 2 siblings, 1 reply; 30+ messages in thread From: Oleg Nesterov @ 2025-03-22 15:55 UTC (permalink / raw) To: Al Viro Cc: Christian Brauner, Kees Cook, jack, linux-fsdevel, linux-kernel, linux-mm, syzkaller-bugs, syzbot Quite possibly I am wrong, I need to recall this code, but at first glance... On 03/22, Al Viro wrote: > > Not really. I agree, it is really racy. But, > 1) A enters check_unsafe_execve(), sets ->in_exec to 1 > 2) B enters check_unsafe_execve(), sets ->in_exec to 1 No, check_unsafe_execve() is called with cred_guard_mutex held, see prepare_bprm_creds() > 3) A calls exec_binprm(), fails (bad binary) > 4) A clears ->in_exec So (2) can only happen after A fails and drops cred_guard_mutex. And this means that we just need to ensure that ->in_exec is cleared before this mutex is dropped, no? Something like below? Oleg. --- diff --git a/fs/exec.c b/fs/exec.c index 506cd411f4ac..f8bf3c96e181 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1233,6 +1233,7 @@ int begin_new_exec(struct linux_binprm * bprm) * Make this the only thread in the thread group. */ retval = de_thread(me); + current->fs->in_exec = 0; if (retval) goto out; @@ -1497,6 +1498,8 @@ static void free_bprm(struct linux_binprm *bprm) } free_arg_pages(bprm); if (bprm->cred) { + // for the case exec fails before de_thread() + current->fs->in_exec = 0; mutex_unlock(¤t->signal->cred_guard_mutex); abort_creds(bprm->cred); } @@ -1862,7 +1865,6 @@ static int bprm_execve(struct linux_binprm *bprm) sched_mm_cid_after_execve(current); /* execve succeeded */ - current->fs->in_exec = 0; current->in_execve = 0; rseq_execve(current); user_events_execve(current); @@ -1881,7 +1883,6 @@ static int bprm_execve(struct linux_binprm *bprm) force_fatal_sig(SIGSEGV); sched_mm_cid_after_execve(current); - current->fs->in_exec = 0; current->in_execve = 0; return retval; ^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [syzbot] [fs?] [mm?] KCSAN: data-race in bprm_execve / copy_fs (4) 2025-03-22 15:55 ` Oleg Nesterov @ 2025-03-22 18:50 ` Al Viro 2025-03-23 18:14 ` Oleg Nesterov 0 siblings, 1 reply; 30+ messages in thread From: Al Viro @ 2025-03-22 18:50 UTC (permalink / raw) To: Oleg Nesterov Cc: Christian Brauner, Kees Cook, jack, linux-fsdevel, linux-kernel, linux-mm, syzkaller-bugs, syzbot On Sat, Mar 22, 2025 at 04:55:39PM +0100, Oleg Nesterov wrote: > No, check_unsafe_execve() is called with cred_guard_mutex held, > see prepare_bprm_creds() Point... > > 3) A calls exec_binprm(), fails (bad binary) > > 4) A clears ->in_exec > > So (2) can only happen after A fails and drops cred_guard_mutex. > > And this means that we just need to ensure that ->in_exec is cleared > before this mutex is dropped, no? Something like below? Probably should work, but I wonder if it would be cleaner to have ->in_exec replaced with pointer to task_struct responsible. Not "somebody with that fs_struct for ->fs is trying to do execve(), has verified that nothing outside of their threads is using this and had been holding ->signal->cred_guard_mutex ever since then", but "this is the thread that..." ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [syzbot] [fs?] [mm?] KCSAN: data-race in bprm_execve / copy_fs (4) 2025-03-22 18:50 ` Al Viro @ 2025-03-23 18:14 ` Oleg Nesterov 2025-03-23 20:57 ` Christian Brauner 0 siblings, 1 reply; 30+ messages in thread From: Oleg Nesterov @ 2025-03-23 18:14 UTC (permalink / raw) To: Al Viro Cc: Christian Brauner, Kees Cook, jack, linux-fsdevel, linux-kernel, linux-mm, syzkaller-bugs, syzbot On 03/22, Al Viro wrote: > > On Sat, Mar 22, 2025 at 04:55:39PM +0100, Oleg Nesterov wrote: > > > And this means that we just need to ensure that ->in_exec is cleared > > before this mutex is dropped, no? Something like below? > > Probably should work, but I wonder if it would be cleaner to have > ->in_exec replaced with pointer to task_struct responsible. Not > "somebody with that fs_struct for ->fs is trying to do execve(), > has verified that nothing outside of their threads is using this > and had been holding ->signal->cred_guard_mutex ever since then", > but "this is the thread that..." perhaps... or something else to make this "not immediately obvious" fs->in_exec more clear. But I guess we need something simple for -stable, so will you agree with this fix for now? Apart from changelog/comments. retval = de_thread(me); + current->fs->in_exec = 0; if (retval) current->fs->in_exec = 0; is correct but looks confusing. See "V2" below, it clears fs->in_exec after the "if (retval)" check. syzbot says: Unfortunately, I don't have any reproducer for this issue yet. so I guess "#syz test: " is pointless right now... Oleg. --- diff --git a/fs/exec.c b/fs/exec.c index 506cd411f4ac..02e8824fc9cd 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1236,6 +1236,7 @@ int begin_new_exec(struct linux_binprm * bprm) if (retval) goto out; + current->fs->in_exec = 0; /* * Cancel any io_uring activity across execve */ @@ -1497,6 +1498,8 @@ static void free_bprm(struct linux_binprm *bprm) } free_arg_pages(bprm); if (bprm->cred) { + // for the case exec fails before de_thread() + current->fs->in_exec = 0; mutex_unlock(¤t->signal->cred_guard_mutex); abort_creds(bprm->cred); } @@ -1862,7 +1865,6 @@ static int bprm_execve(struct linux_binprm *bprm) sched_mm_cid_after_execve(current); /* execve succeeded */ - current->fs->in_exec = 0; current->in_execve = 0; rseq_execve(current); user_events_execve(current); @@ -1881,7 +1883,6 @@ static int bprm_execve(struct linux_binprm *bprm) force_fatal_sig(SIGSEGV); sched_mm_cid_after_execve(current); - current->fs->in_exec = 0; current->in_execve = 0; return retval; ^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [syzbot] [fs?] [mm?] KCSAN: data-race in bprm_execve / copy_fs (4) 2025-03-23 18:14 ` Oleg Nesterov @ 2025-03-23 20:57 ` Christian Brauner 0 siblings, 0 replies; 30+ messages in thread From: Christian Brauner @ 2025-03-23 20:57 UTC (permalink / raw) To: Oleg Nesterov Cc: Al Viro, Kees Cook, jack, linux-fsdevel, linux-kernel, linux-mm, syzkaller-bugs, syzbot On Sun, Mar 23, 2025 at 07:14:21PM +0100, Oleg Nesterov wrote: > On 03/22, Al Viro wrote: > > > > On Sat, Mar 22, 2025 at 04:55:39PM +0100, Oleg Nesterov wrote: > > > > > And this means that we just need to ensure that ->in_exec is cleared > > > before this mutex is dropped, no? Something like below? > > > > Probably should work, but I wonder if it would be cleaner to have > > ->in_exec replaced with pointer to task_struct responsible. Not > > "somebody with that fs_struct for ->fs is trying to do execve(), > > has verified that nothing outside of their threads is using this > > and had been holding ->signal->cred_guard_mutex ever since then", > > but "this is the thread that..." > > perhaps... or something else to make this "not immediately obvious" > fs->in_exec more clear. Well, it would certainly help to document that cred_guard_mutex serializes concurrent exec. This is kind of important information given that begin_new_exec() and finalize_exec() are only called from ->load_binary() and are thus always located in the individual binfmt_*.c files. That makes this pretty implicit information. Let alone that the unlocking is all based on bprm->cred being set or unset. Otherwise the patch looks good to me. > > But I guess we need something simple for -stable, so will you agree > with this fix for now? Apart from changelog/comments. > > retval = de_thread(me); > + current->fs->in_exec = 0; > if (retval) > current->fs->in_exec = 0; > > is correct but looks confusing. See "V2" below, it clears fs->in_exec > after the "if (retval)" check. > > syzbot says: > > Unfortunately, I don't have any reproducer for this issue yet. > > so I guess "#syz test: " is pointless right now... > > Oleg. > --- > > diff --git a/fs/exec.c b/fs/exec.c > index 506cd411f4ac..02e8824fc9cd 100644 > --- a/fs/exec.c > +++ b/fs/exec.c > @@ -1236,6 +1236,7 @@ int begin_new_exec(struct linux_binprm * bprm) > if (retval) > goto out; > > + current->fs->in_exec = 0; > /* > * Cancel any io_uring activity across execve > */ > @@ -1497,6 +1498,8 @@ static void free_bprm(struct linux_binprm *bprm) > } > free_arg_pages(bprm); > if (bprm->cred) { > + // for the case exec fails before de_thread() > + current->fs->in_exec = 0; > mutex_unlock(¤t->signal->cred_guard_mutex); > abort_creds(bprm->cred); > } > @@ -1862,7 +1865,6 @@ static int bprm_execve(struct linux_binprm *bprm) > > sched_mm_cid_after_execve(current); > /* execve succeeded */ > - current->fs->in_exec = 0; > current->in_execve = 0; > rseq_execve(current); > user_events_execve(current); > @@ -1881,7 +1883,6 @@ static int bprm_execve(struct linux_binprm *bprm) > force_fatal_sig(SIGSEGV); > > sched_mm_cid_after_execve(current); > - current->fs->in_exec = 0; > current->in_execve = 0; > > return retval; > ^ permalink raw reply [flat|nested] 30+ messages in thread
* [PATCH] exec: fix the racy usage of fs_struct->in_exec 2025-03-20 19:09 [syzbot] [fs?] [mm?] KCSAN: data-race in bprm_execve / copy_fs (4) syzbot 2025-03-20 20:09 ` Kees Cook @ 2025-03-24 16:00 ` Oleg Nesterov 2025-03-24 17:01 ` Mateusz Guzik 2025-04-29 15:49 ` Oleg Nesterov 1 sibling, 2 replies; 30+ messages in thread From: Oleg Nesterov @ 2025-03-24 16:00 UTC (permalink / raw) To: syzbot, brauner, kees, viro Cc: jack, linux-fsdevel, linux-kernel, linux-mm, syzkaller-bugs, mjguzik check_unsafe_exec() sets fs->in_exec under cred_guard_mutex, then execve() paths clear fs->in_exec lockless. This is fine if exec succeeds, but if it fails we have the following race: T1 sets fs->in_exec = 1, fails, drops cred_guard_mutex T2 sets fs->in_exec = 1 T1 clears fs->in_exec T2 continues with fs->in_exec == 0 Change fs/exec.c to clear fs->in_exec with cred_guard_mutex held. Reported-by: syzbot+1c486d0b62032c82a968@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/67dc67f0.050a0220.25ae54.001f.GAE@google.com/ Cc: stable@vger.kernel.org Signed-off-by: Oleg Nesterov <oleg@redhat.com> --- fs/exec.c | 15 +++++++++------ 1 file changed, 9 insertions(+), 6 deletions(-) diff --git a/fs/exec.c b/fs/exec.c index 506cd411f4ac..17047210be46 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1229,13 +1229,12 @@ int begin_new_exec(struct linux_binprm * bprm) */ bprm->point_of_no_return = true; - /* - * Make this the only thread in the thread group. - */ + /* Make this the only thread in the thread group */ retval = de_thread(me); if (retval) goto out; - + /* see the comment in check_unsafe_exec() */ + current->fs->in_exec = 0; /* * Cancel any io_uring activity across execve */ @@ -1497,6 +1496,8 @@ static void free_bprm(struct linux_binprm *bprm) } free_arg_pages(bprm); if (bprm->cred) { + /* in case exec fails before de_thread() succeeds */ + current->fs->in_exec = 0; mutex_unlock(¤t->signal->cred_guard_mutex); abort_creds(bprm->cred); } @@ -1618,6 +1619,10 @@ static void check_unsafe_exec(struct linux_binprm *bprm) * suid exec because the differently privileged task * will be able to manipulate the current directory, etc. * It would be nice to force an unshare instead... + * + * Otherwise we set fs->in_exec = 1 to deny clone(CLONE_FS) + * from another sub-thread until de_thread() succeeds, this + * state is protected by cred_guard_mutex we hold. */ n_fs = 1; spin_lock(&p->fs->lock); @@ -1862,7 +1867,6 @@ static int bprm_execve(struct linux_binprm *bprm) sched_mm_cid_after_execve(current); /* execve succeeded */ - current->fs->in_exec = 0; current->in_execve = 0; rseq_execve(current); user_events_execve(current); @@ -1881,7 +1885,6 @@ static int bprm_execve(struct linux_binprm *bprm) force_fatal_sig(SIGSEGV); sched_mm_cid_after_execve(current); - current->fs->in_exec = 0; current->in_execve = 0; return retval; -- 2.25.1.362.g51ebf55 ^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [PATCH] exec: fix the racy usage of fs_struct->in_exec 2025-03-24 16:00 ` [PATCH] exec: fix the racy usage of fs_struct->in_exec Oleg Nesterov @ 2025-03-24 17:01 ` Mateusz Guzik 2025-03-24 18:27 ` Oleg Nesterov 2025-04-29 15:49 ` Oleg Nesterov 1 sibling, 1 reply; 30+ messages in thread From: Mateusz Guzik @ 2025-03-24 17:01 UTC (permalink / raw) To: Oleg Nesterov Cc: syzbot, brauner, kees, viro, jack, linux-fsdevel, linux-kernel, linux-mm, syzkaller-bugs On Mon, Mar 24, 2025 at 5:00 PM Oleg Nesterov <oleg@redhat.com> wrote: > > check_unsafe_exec() sets fs->in_exec under cred_guard_mutex, then execve() > paths clear fs->in_exec lockless. This is fine if exec succeeds, but if it > fails we have the following race: > > T1 sets fs->in_exec = 1, fails, drops cred_guard_mutex > > T2 sets fs->in_exec = 1 > > T1 clears fs->in_exec > > T2 continues with fs->in_exec == 0 > > Change fs/exec.c to clear fs->in_exec with cred_guard_mutex held. > I had cursory glances at this code earlier and the more I try to understand it the more confused I am. The mutex at hand hides in ->signal and fs->in_exec remains treated as a flag. The loop in check_unsafe_exec() tries to guard against a task which shares ->fs, but does not share ->mm? To my reading this implies unshared ->signal, so the mutex protection does not apply. I think this ends up being harmless as in this case nobody is going to set ->in_exec (instead they are going to share LSM_UNSAFE_SHARE), so clearing it in these spots becomes a nop. At the same time the check in copy_fs() no longer blocks clones as check_unsafe_exec() already opts to LSM_UNSAFE_SHARE? Even if this all works with the patch, this is an incredibly odd set of dependencies and I don't see a good reason for it to still be here. Per my other e-mail the obvious scheme would serialize all execs sharing ->fs and make copy_fs do a killable wait for execs to finish. Arguably this would also improve userspace-visible behavior as a transient -EBUSY would be eliminated. No matter what the specific solution, imo treating ->in_exec as a flag needs to die. is there a problem getting this done even for stable kernels? I understand it would be harder to backport churn-wise, but should be much easier to reason about? Just my $0,03 -- Mateusz Guzik <mjguzik gmail.com> ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] exec: fix the racy usage of fs_struct->in_exec 2025-03-24 17:01 ` Mateusz Guzik @ 2025-03-24 18:27 ` Oleg Nesterov 2025-03-24 18:37 ` Oleg Nesterov 2025-03-24 22:24 ` Mateusz Guzik 0 siblings, 2 replies; 30+ messages in thread From: Oleg Nesterov @ 2025-03-24 18:27 UTC (permalink / raw) To: Mateusz Guzik Cc: syzbot, brauner, kees, viro, jack, linux-fsdevel, linux-kernel, linux-mm, syzkaller-bugs On 03/24, Mateusz Guzik wrote: > > I had cursory glances at this code earlier and the more I try to > understand it the more confused I am. You are not alone ;) > Per my other e-mail the obvious scheme would serialize all execs > sharing ->fs and make copy_fs do a killable wait for execs to finish. > Arguably this would also improve userspace-visible behavior as a > transient -EBUSY would be eliminated. I had the same feeling years ago. Why didn't I do it? I can't recall. Perhaps because I found some problem in this idea, but most probably because I failed to make the correct and simple patch. > is there a problem getting this done even for stable kernels? I > understand it would be harder to backport churn-wise, but should be > much easier to reason about? I won't argue with another solution. But this problem is quite old, unless I am totally confused this logic was wrong from the very beginning when fs->in_exec was introduced by 498052bba55ec. So to me it would be better to have the trivial fix for stable, exactly because it is trivially backportable. Then cleanup/simplify this logic on top of it. Oleg. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] exec: fix the racy usage of fs_struct->in_exec 2025-03-24 18:27 ` Oleg Nesterov @ 2025-03-24 18:37 ` Oleg Nesterov 2025-03-24 22:24 ` Mateusz Guzik 1 sibling, 0 replies; 30+ messages in thread From: Oleg Nesterov @ 2025-03-24 18:37 UTC (permalink / raw) To: Mateusz Guzik Cc: syzbot, brauner, kees, viro, jack, linux-fsdevel, linux-kernel, linux-mm, syzkaller-bugs On 03/24, Oleg Nesterov wrote: > > I won't argue with another solution. But this problem is quite old, Yes, but... > unless I am totally confused this logic was wrong from the very > beginning when fs->in_exec was introduced by 498052bba55ec. OK, I was wrong. According to git show 498052bba55ec:fs/exec.c this patch was correct in this respect. Sorry for the noise. > So to me it would be better to have the trivial fix for stable, > exactly because it is trivially backportable. Then cleanup/simplify > this logic on top of it. Yes... Oleg. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] exec: fix the racy usage of fs_struct->in_exec 2025-03-24 18:27 ` Oleg Nesterov 2025-03-24 18:37 ` Oleg Nesterov @ 2025-03-24 22:24 ` Mateusz Guzik 2025-03-25 10:09 ` Oleg Nesterov 1 sibling, 1 reply; 30+ messages in thread From: Mateusz Guzik @ 2025-03-24 22:24 UTC (permalink / raw) To: Oleg Nesterov Cc: syzbot, brauner, kees, viro, jack, linux-fsdevel, linux-kernel, linux-mm, syzkaller-bugs On Mon, Mar 24, 2025 at 7:28 PM Oleg Nesterov <oleg@redhat.com> wrote: > I won't argue with another solution. But this problem is quite old, > unless I am totally confused this logic was wrong from the very > beginning when fs->in_exec was introduced by 498052bba55ec. > > So to me it would be better to have the trivial fix for stable, > exactly because it is trivially backportable. Then cleanup/simplify > this logic on top of it. > So I got myself a crap testcase with a CLONE_FS'ed task which can execve and sanity-checked that suid is indeed not honored as expected. The most convenient way out, planting a mutex in there, does not work because of the size -- fs_struct is already at 56 bytes and I'm not going to push it past 64. However, it looks like "wait on bit" machinery can be employed here, based on what I had seen with inodes (see __wait_on_freeing_inode), and that should avoid growing the struct, unless I'm grossly misreading something. Anyhow, the plan would be to serialize on the bit, synchronized with the current spin lock. copy_fs would call a helper to wait for it to clear, and would still bump ->users under the spin lock. This would decouple the handling from cred_guard_mutex and avoid weirdness like clearing the ->in_exec flag when we never set it. Should be easy enough and viable for backports; I'll try to hack it up tomorrow unless the idea is NAKed. The crapper mentioned above will be used to validate that exec vs clone works as expected. -- Mateusz Guzik <mjguzik gmail.com> ^ permalink raw reply [flat|nested] 30+ messages in thread
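For concreteness, a rough sketch of what the "wait instead of -EAGAIN" shape could look like, built on the existing wait_var_event()/wake_up_var() machinery so fs_struct does not grow. This is only an illustration of the idea under discussion, not Mateusz's actual patch; the helper names fs_wait_in_exec() and fs_clear_in_exec() are made up here, and the exact errno returned to fork() is left as a detail.

/*
 * Sketch only: in_exec stays an int written under fs->lock; waiters sleep
 * on the variable's address instead of failing with -EAGAIN.
 */
static int fs_wait_in_exec(struct fs_struct *fs)	/* hypothetical helper */
{
	return wait_var_event_killable(&fs->in_exec, !READ_ONCE(fs->in_exec));
}

static int copy_fs(unsigned long clone_flags, struct task_struct *tsk)
{
	struct fs_struct *fs = current->fs;

	if (clone_flags & CLONE_FS) {
		spin_lock(&fs->lock);
		while (fs->in_exec) {
			spin_unlock(&fs->lock);
			/* sleep until the exec'ing thread clears in_exec */
			if (fs_wait_in_exec(fs))
				return -EINTR;	/* exact errno is a detail */
			spin_lock(&fs->lock);
		}
		fs->users++;
		spin_unlock(&fs->lock);
		return 0;
	}
	tsk->fs = copy_fs_struct(fs);
	if (!tsk->fs)
		return -ENOMEM;
	return 0;
}

/* Every site that clears in_exec would pair the store with a wakeup: */
static void fs_clear_in_exec(struct fs_struct *fs)	/* hypothetical helper */
{
	spin_lock(&fs->lock);
	fs->in_exec = 0;
	spin_unlock(&fs->lock);
	wake_up_var(&fs->in_exec);
}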
* Re: [PATCH] exec: fix the racy usage of fs_struct->in_exec 2025-03-24 22:24 ` Mateusz Guzik @ 2025-03-25 10:09 ` Oleg Nesterov 2025-03-25 11:01 ` Mateusz Guzik 0 siblings, 1 reply; 30+ messages in thread From: Oleg Nesterov @ 2025-03-25 10:09 UTC (permalink / raw) To: Mateusz Guzik Cc: syzbot, brauner, kees, viro, jack, linux-fsdevel, linux-kernel, linux-mm, syzkaller-bugs On 03/24, Mateusz Guzik wrote: > > On Mon, Mar 24, 2025 at 7:28 PM Oleg Nesterov <oleg@redhat.com> wrote: > > > > So to me it would be better to have the trivial fix for stable, > > exactly because it is trivially backportable. Then cleanup/simplify > > this logic on top of it. > > So I got myself a crap testcase with a CLONE_FS'ed task which can > execve and sanity-checked that suid is indeed not honored as expected. So you mean my patch can't fix the problem? > Anyhow, the plan would be to serialize on the bit, synchronized with > the current spin lock. copy_fs would call a helper to wait for it to > clear, would still bump ->users under the spin lock. > > This would decouple the handling from cred_mutex and avoid weirdness > like clearing the ->in_exec flag when we never set it. I don't really understand the idea, but as I said I won't argue with another solution. Oleg. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] exec: fix the racy usage of fs_struct->in_exec 2025-03-25 10:09 ` Oleg Nesterov @ 2025-03-25 11:01 ` Mateusz Guzik 2025-03-25 13:21 ` Oleg Nesterov 0 siblings, 1 reply; 30+ messages in thread From: Mateusz Guzik @ 2025-03-25 11:01 UTC (permalink / raw) To: Oleg Nesterov Cc: syzbot, brauner, kees, viro, jack, linux-fsdevel, linux-kernel, linux-mm, syzkaller-bugs On Tue, Mar 25, 2025 at 11:10 AM Oleg Nesterov <oleg@redhat.com> wrote: > > On 03/24, Mateusz Guzik wrote: > > > > On Mon, Mar 24, 2025 at 7:28 PM Oleg Nesterov <oleg@redhat.com> wrote: > > > > > > So to me it would be better to have the trivial fix for stable, > > > exactly because it is trivially backportable. Then cleanup/simplify > > > this logic on top of it. > > > > So I got myself a crap testcase with a CLONE_FS'ed task which can > > execve and sanity-checked that suid is indeed not honored as expected. > > So you mean my patch can't fix the problem? No, I think the patch works. I am saying the current scheme is avoidably hard to reason about. > > > Anyhow, the plan would be to serialize on the bit, synchronized with > > the current spin lock. copy_fs would call a helper to wait for it to > > clear, would still bump ->users under the spin lock. > > > > This would decouple the handling from cred_mutex and avoid weirdness > > like clearing the ->in_exec flag when we never set it. > > I don't really understand the idea, but as I said I won't argue with > another solution. > I'll try to ship later today so that there will be something concrete to comment on. -- Mateusz Guzik <mjguzik gmail.com> ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] exec: fix the racy usage of fs_struct->in_exec 2025-03-25 11:01 ` Mateusz Guzik @ 2025-03-25 13:21 ` Oleg Nesterov 2025-03-25 13:30 ` Christian Brauner 0 siblings, 1 reply; 30+ messages in thread From: Oleg Nesterov @ 2025-03-25 13:21 UTC (permalink / raw) To: Mateusz Guzik Cc: syzbot, brauner, kees, viro, jack, linux-fsdevel, linux-kernel, linux-mm, syzkaller-bugs On 03/25, Mateusz Guzik wrote: > > On Tue, Mar 25, 2025 at 11:10 AM Oleg Nesterov <oleg@redhat.com> wrote: > > > > On 03/24, Mateusz Guzik wrote: > > > > > > On Mon, Mar 24, 2025 at 7:28 PM Oleg Nesterov <oleg@redhat.com> wrote: > > > > > > > > So to me it would be better to have the trivial fix for stable, > > > > exactly because it is trivially backportable. Then cleanup/simplify > > > > this logic on top of it. > > > > > > So I got myself a crap testcase with a CLONE_FS'ed task which can > > > execve and sanity-checked that suid is indeed not honored as expected. > > > > So you mean my patch can't fix the problem? > > No, I think the patch works. > > I am saying the current scheme is avoidably hard to reason about. Ah, OK, thanks. Then I still think it makes more sense to do the cleanups you propose on top of this fix. But I leave this to you and other fs/ maintainers. Oleg. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] exec: fix the racy usage of fs_struct->in_exec 2025-03-25 13:21 ` Oleg Nesterov @ 2025-03-25 13:30 ` Christian Brauner 2025-03-25 14:15 ` Mateusz Guzik 0 siblings, 1 reply; 30+ messages in thread From: Christian Brauner @ 2025-03-25 13:30 UTC (permalink / raw) To: Oleg Nesterov, Mateusz Guzik Cc: syzbot, kees, viro, jack, linux-fsdevel, linux-kernel, linux-mm, syzkaller-bugs On Tue, Mar 25, 2025 at 02:21:36PM +0100, Oleg Nesterov wrote: > On 03/25, Mateusz Guzik wrote: > > > > On Tue, Mar 25, 2025 at 11:10 AM Oleg Nesterov <oleg@redhat.com> wrote: > > > > > > On 03/24, Mateusz Guzik wrote: > > > > > > > > On Mon, Mar 24, 2025 at 7:28 PM Oleg Nesterov <oleg@redhat.com> wrote: > > > > > > > > > > So to me it would be better to have the trivial fix for stable, > > > > > exactly because it is trivially backportable. Then cleanup/simplify > > > > > this logic on top of it. > > > > > > > > So I got myself a crap testcase with a CLONE_FS'ed task which can > > > > execve and sanity-checked that suid is indeed not honored as expected. > > > > > > So you mean my patch can't fix the problem? > > > > No, I think the patch works. > > > > I am saying the current scheme is avoidably hard to reason about. > > Ah, OK, thanks. Then I still think it makes more sense to do the > cleanups you propose on top of this fix. I agree. We should go with Oleg's fix that in the old scheme and use that. And then @Mateusz your cleanup should please go on top! ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] exec: fix the racy usage of fs_struct->in_exec 2025-03-25 13:30 ` Christian Brauner @ 2025-03-25 14:15 ` Mateusz Guzik 2025-03-25 14:46 ` Christian Brauner 0 siblings, 1 reply; 30+ messages in thread From: Mateusz Guzik @ 2025-03-25 14:15 UTC (permalink / raw) To: Christian Brauner Cc: Oleg Nesterov, syzbot, kees, viro, jack, linux-fsdevel, linux-kernel, linux-mm, syzkaller-bugs On Tue, Mar 25, 2025 at 2:30 PM Christian Brauner <brauner@kernel.org> wrote: > > On Tue, Mar 25, 2025 at 02:21:36PM +0100, Oleg Nesterov wrote: > > On 03/25, Mateusz Guzik wrote: > > > > > > On Tue, Mar 25, 2025 at 11:10 AM Oleg Nesterov <oleg@redhat.com> wrote: > > > > > > > > On 03/24, Mateusz Guzik wrote: > > > > > > > > > > On Mon, Mar 24, 2025 at 7:28 PM Oleg Nesterov <oleg@redhat.com> wrote: > > > > > > > > > > > > So to me it would be better to have the trivial fix for stable, > > > > > > exactly because it is trivially backportable. Then cleanup/simplify > > > > > > this logic on top of it. > > > > > > > > > > So I got myself a crap testcase with a CLONE_FS'ed task which can > > > > > execve and sanity-checked that suid is indeed not honored as expected. > > > > > > > > So you mean my patch can't fix the problem? > > > > > > No, I think the patch works. > > > > > > I am saying the current scheme is avoidably hard to reason about. > > > > Ah, OK, thanks. Then I still think it makes more sense to do the > > cleanups you propose on top of this fix. > > I agree. We should go with Oleg's fix that in the old scheme and use > that. And then @Mateusz your cleanup should please go on top! Ok, in that case I'm gonna ship when I'm gonna ship(tm), maybe later this week. -- Mateusz Guzik <mjguzik gmail.com> ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] exec: fix the racy usage of fs_struct->in_exec 2025-03-25 14:15 ` Mateusz Guzik @ 2025-03-25 14:46 ` Christian Brauner 2025-03-25 18:40 ` Kees Cook 0 siblings, 1 reply; 30+ messages in thread From: Christian Brauner @ 2025-03-25 14:46 UTC (permalink / raw) To: Mateusz Guzik Cc: Oleg Nesterov, syzbot, kees, viro, jack, linux-fsdevel, linux-kernel, linux-mm, syzkaller-bugs On Tue, Mar 25, 2025 at 03:15:06PM +0100, Mateusz Guzik wrote: > On Tue, Mar 25, 2025 at 2:30 PM Christian Brauner <brauner@kernel.org> wrote: > > > > On Tue, Mar 25, 2025 at 02:21:36PM +0100, Oleg Nesterov wrote: > > > On 03/25, Mateusz Guzik wrote: > > > > > > > > On Tue, Mar 25, 2025 at 11:10 AM Oleg Nesterov <oleg@redhat.com> wrote: > > > > > > > > > > On 03/24, Mateusz Guzik wrote: > > > > > > > > > > > > On Mon, Mar 24, 2025 at 7:28 PM Oleg Nesterov <oleg@redhat.com> wrote: > > > > > > > > > > > > > > So to me it would be better to have the trivial fix for stable, > > > > > > > exactly because it is trivially backportable. Then cleanup/simplify > > > > > > > this logic on top of it. > > > > > > > > > > > > So I got myself a crap testcase with a CLONE_FS'ed task which can > > > > > > execve and sanity-checked that suid is indeed not honored as expected. > > > > > > > > > > So you mean my patch can't fix the problem? > > > > > > > > No, I think the patch works. > > > > > > > > I am saying the current scheme is avoidably hard to reason about. > > > > > > Ah, OK, thanks. Then I still think it makes more sense to do the > > > cleanups you propose on top of this fix. > > > > I agree. We should go with Oleg's fix that in the old scheme and use > > that. And then @Mateusz your cleanup should please go on top! > > Ok, in that case I'm gonna ship when I'm gonna ship(tm), maybe later this week. Ok, I've taken the patch as I've got a first round of fixes to send already. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] exec: fix the racy usage of fs_struct->in_exec 2025-03-25 14:46 ` Christian Brauner @ 2025-03-25 18:40 ` Kees Cook 0 siblings, 0 replies; 30+ messages in thread From: Kees Cook @ 2025-03-25 18:40 UTC (permalink / raw) To: Christian Brauner, Mateusz Guzik Cc: Oleg Nesterov, syzbot, viro, jack, linux-fsdevel, linux-kernel, linux-mm, syzkaller-bugs On March 25, 2025 7:46:15 AM PDT, Christian Brauner <brauner@kernel.org> wrote: >On Tue, Mar 25, 2025 at 03:15:06PM +0100, Mateusz Guzik wrote: >> On Tue, Mar 25, 2025 at 2:30 PM Christian Brauner <brauner@kernel.org> wrote: >> > >> > On Tue, Mar 25, 2025 at 02:21:36PM +0100, Oleg Nesterov wrote: >> > > On 03/25, Mateusz Guzik wrote: >> > > > >> > > > On Tue, Mar 25, 2025 at 11:10 AM Oleg Nesterov <oleg@redhat.com> wrote: >> > > > > >> > > > > On 03/24, Mateusz Guzik wrote: >> > > > > > >> > > > > > On Mon, Mar 24, 2025 at 7:28 PM Oleg Nesterov <oleg@redhat.com> wrote: >> > > > > > > >> > > > > > > So to me it would be better to have the trivial fix for stable, >> > > > > > > exactly because it is trivially backportable. Then cleanup/simplify >> > > > > > > this logic on top of it. >> > > > > > >> > > > > > So I got myself a crap testcase with a CLONE_FS'ed task which can >> > > > > > execve and sanity-checked that suid is indeed not honored as expected. >> > > > > >> > > > > So you mean my patch can't fix the problem? >> > > > >> > > > No, I think the patch works. >> > > > >> > > > I am saying the current scheme is avoidably hard to reason about. >> > > >> > > Ah, OK, thanks. Then I still think it makes more sense to do the >> > > cleanups you propose on top of this fix. >> > >> > I agree. We should go with Oleg's fix that in the old scheme and use >> > that. And then @Mateusz your cleanup should please go on top! >> >> Ok, in that case I'm gonna ship when I'm gonna ship(tm), maybe later this week. > >Ok, I've taken the patch as I've got a first round of fixes to send >already. Thanks! Acked-by: Kees Cook <kees@kernel.org> -- Kees Cook ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] exec: fix the racy usage of fs_struct->in_exec 2025-03-24 16:00 ` [PATCH] exec: fix the racy usage of fs_struct->in_exec Oleg Nesterov 2025-03-24 17:01 ` Mateusz Guzik @ 2025-04-29 15:49 ` Oleg Nesterov 2025-04-29 16:57 ` Kees Cook 2025-04-29 17:12 ` Mateusz Guzik 1 sibling, 2 replies; 30+ messages in thread From: Oleg Nesterov @ 2025-04-29 15:49 UTC (permalink / raw) To: brauner, kees, viro Cc: jack, linux-fsdevel, linux-kernel, linux-mm, syzkaller-bugs, mjguzik, Michal Hocko Damn, I am stupid. On 03/24, Oleg Nesterov wrote: > > check_unsafe_exec() sets fs->in_exec under cred_guard_mutex, then execve() > paths clear fs->in_exec lockless. This is fine if exec succeeds, but if it > fails we have the following race: > > T1 sets fs->in_exec = 1, fails, drops cred_guard_mutex > > T2 sets fs->in_exec = 1 > > T1 clears fs->in_exec When I look at this code again, I think this race was not possible and thus this patch (applied as af7bb0d2ca45) was not needed. Yes, begin_new_exec() can drop cred_guard_mutex on failure, but only after de_thread() succeeds, when we can't race with another sub-thread. I hope this patch didn't make the things worse so we don't need to revert it. Plus I think it makes this (confusing) logic a bit more clear. Just, unless I am confused again, it wasn't really needed. ----------------------------------------------------------------------------- But. I didn't read the original report from syzbot, https://lore.kernel.org/all/67dc67f0.050a0220.25ae54.001f.GAE@google.com/#t because I wasn't CC'ed. and then - sorry Kees!!! - I didn't bother to read your first reply carefully. So yes, with or without this patch the "if (fs->in_exec)" check in copy_fs() can obviously hit the 1 -> 0 transition. This is harmless, but should be probably fixed just to avoid another report from KCSAN. I do not want to add another spin_lock(fs->lock). We can change copy_fs() to use data_race(), but I'd prefer the patch below. Yes, it needs the additional comment(s) to explain READ_ONCE(). What do you think? Did I miss something again??? Quite possibly... Mateusz, I hope you will cleanup this horror sooner or later ;) Oleg. --- diff --git a/fs/exec.c b/fs/exec.c index 5d1c0d2dc403..42a7f9b43911 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1495,7 +1495,7 @@ static void free_bprm(struct linux_binprm *bprm) free_arg_pages(bprm); if (bprm->cred) { /* in case exec fails before de_thread() succeeds */ - current->fs->in_exec = 0; + WRITE_ONCE(current->fs->in_exec, 0); mutex_unlock(&current->signal->cred_guard_mutex); abort_creds(bprm->cred); } diff --git a/kernel/fork.c b/kernel/fork.c index 4c2df3816728..381af8c8ece8 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1802,7 +1802,7 @@ static int copy_fs(unsigned long clone_flags, struct task_struct *tsk) /* tsk->fs is already what we want */ spin_lock(&fs->lock); /* "users" and "in_exec" locked for check_unsafe_exec() */ - if (fs->in_exec) { + if (READ_ONCE(fs->in_exec)) { spin_unlock(&fs->lock); return -EAGAIN; } ^ permalink raw reply related [flat|nested] 30+ messages in thread
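For comparison, the data_race() route mentioned (and passed over) above would only annotate the reader in copy_fs(), documenting the benign 1 -> 0 race for KCSAN; a minimal sketch of that alternative, relative to the patch above (not an applied change, hunk location elided):

diff --git a/kernel/fork.c b/kernel/fork.c
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ ... @@ static int copy_fs(unsigned long clone_flags, struct task_struct *tsk)
 		spin_lock(&fs->lock);
 		/* "users" and "in_exec" locked for check_unsafe_exec() */
-		if (READ_ONCE(fs->in_exec)) {
+		if (data_race(fs->in_exec)) {
 			spin_unlock(&fs->lock);
 			return -EAGAIN;
 		}

The practical difference is small: READ_ONCE() additionally keeps the compiler from tearing or refetching the load, while data_race() merely tells KCSAN the race is known and tolerated.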
* Re: [PATCH] exec: fix the racy usage of fs_struct->in_exec 2025-04-29 15:49 ` Oleg Nesterov @ 2025-04-29 16:57 ` Kees Cook 2025-04-29 17:12 ` Mateusz Guzik 1 sibling, 0 replies; 30+ messages in thread From: Kees Cook @ 2025-04-29 16:57 UTC (permalink / raw) To: Oleg Nesterov Cc: brauner, viro, jack, linux-fsdevel, linux-kernel, linux-mm, syzkaller-bugs, mjguzik, Michal Hocko On Tue, Apr 29, 2025 at 05:49:44PM +0200, Oleg Nesterov wrote: > Damn, I am stupid. > > On 03/24, Oleg Nesterov wrote: > > > > check_unsafe_exec() sets fs->in_exec under cred_guard_mutex, then execve() > > paths clear fs->in_exec lockless. This is fine if exec succeeds, but if it > > fails we have the following race: > > > > T1 sets fs->in_exec = 1, fails, drops cred_guard_mutex > > > > T2 sets fs->in_exec = 1 > > > > T1 clears fs->in_exec > > When I look at this code again, I think this race was not possible and thus > this patch (applied as af7bb0d2ca45) was not needed. > > Yes, begin_new_exec() can drop cred_guard_mutex on failure, but only after > de_thread() succeeds, when we can't race with another sub-thread. > > I hope this patch didn't make the things worse so we don't need to revert it. > Plus I think it makes this (confusing) logic a bit more clear. Just, unless > I am confused again, it wasn't really needed. > > ----------------------------------------------------------------------------- > But. I didn't read the original report from syzbot, > https://lore.kernel.org/all/67dc67f0.050a0220.25ae54.001f.GAE@google.com/#t > because I wasn't CC'ed. and then - sorry Kees!!! - I didn't bother to read > your first reply carefully. > > So yes, with or without this patch the "if (fs->in_exec)" check in copy_fs() > can obviously hit the 1 -> 0 transition. > > This is harmless, but should be probably fixed just to avoid another report > from KCSAN. > > I do not want to add another spin_lock(fs->lock). We can change copy_fs() to > use data_race(), but I'd prefer the patch below. Yes, it needs the additional > comment(s) to explain READ_ONCE(). > > What do you think? Did I miss something again??? Quite possibly... > > Mateusz, I hope you will cleanup this horror sooner or later ;) > > Oleg. > --- > > diff --git a/fs/exec.c b/fs/exec.c > index 5d1c0d2dc403..42a7f9b43911 100644 > --- a/fs/exec.c > +++ b/fs/exec.c > @@ -1495,7 +1495,7 @@ static void free_bprm(struct linux_binprm *bprm) > free_arg_pages(bprm); > if (bprm->cred) { > /* in case exec fails before de_thread() succeeds */ > - current->fs->in_exec = 0; > + WRITE_ONCE(current->fs->in_exec, 0); > mutex_unlock(&current->signal->cred_guard_mutex); > abort_creds(bprm->cred); > } > diff --git a/kernel/fork.c b/kernel/fork.c > index 4c2df3816728..381af8c8ece8 100644 > --- a/kernel/fork.c > +++ b/kernel/fork.c > @@ -1802,7 +1802,7 @@ static int copy_fs(unsigned long clone_flags, struct task_struct *tsk) > /* tsk->fs is already what we want */ > spin_lock(&fs->lock); > /* "users" and "in_exec" locked for check_unsafe_exec() */ > - if (fs->in_exec) { > + if (READ_ONCE(fs->in_exec)) { > spin_unlock(&fs->lock); > return -EAGAIN; > } > Yeah, this seems reasonable. -- Kees Cook ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] exec: fix the racy usage of fs_struct->in_exec 2025-04-29 15:49 ` Oleg Nesterov 2025-04-29 16:57 ` Kees Cook @ 2025-04-29 17:12 ` Mateusz Guzik 1 sibling, 0 replies; 30+ messages in thread From: Mateusz Guzik @ 2025-04-29 17:12 UTC (permalink / raw) To: Oleg Nesterov Cc: brauner, kees, viro, jack, linux-fsdevel, linux-kernel, linux-mm, syzkaller-bugs, Michal Hocko On Tue, Apr 29, 2025 at 5:50 PM Oleg Nesterov <oleg@redhat.com> wrote: > > Damn, I am stupid. > > On 03/24, Oleg Nesterov wrote: > > > > check_unsafe_exec() sets fs->in_exec under cred_guard_mutex, then execve() > > paths clear fs->in_exec lockless. This is fine if exec succeeds, but if it > > fails we have the following race: > > > > T1 sets fs->in_exec = 1, fails, drops cred_guard_mutex > > > > T2 sets fs->in_exec = 1 > > > > T1 clears fs->in_exec > > When I look at this code again, I think this race was not possible and thus > this patch (applied as af7bb0d2ca45) was not needed. > > Yes, begin_new_exec() can drop cred_guard_mutex on failure, but only after > de_thread() succeeds, when we can't race with another sub-thread. > > I hope this patch didn't make the things worse so we don't need to revert it. > Plus I think it makes this (confusing) logic a bit more clear. Just, unless > I am confused again, it wasn't really needed. > > ----------------------------------------------------------------------------- > But. I didn't read the original report from syzbot, > https://lore.kernel.org/all/67dc67f0.050a0220.25ae54.001f.GAE@google.com/#t > because I wasn't CC'ed. and then - sorry Kees!!! - I didn't bother to read > your first reply carefully. > > So yes, with or without this patch the "if (fs->in_exec)" check in copy_fs() > can obviously hit the 1 -> 0 transition. > > This is harmless, but should be probably fixed just to avoid another report > from KCSAN. > > I do not want to add another spin_lock(fs->lock). We can change copy_fs() to > use data_race(), but I'd prefer the patch below. Yes, it needs the additional > comment(s) to explain READ_ONCE(). > > What do you think? Did I miss something again??? Quite possibly... > > Mateusz, I hope you will cleanup this horror sooner or later ;) > > Oleg. > --- > > diff --git a/fs/exec.c b/fs/exec.c > index 5d1c0d2dc403..42a7f9b43911 100644 > --- a/fs/exec.c > +++ b/fs/exec.c > @@ -1495,7 +1495,7 @@ static void free_bprm(struct linux_binprm *bprm) > free_arg_pages(bprm); > if (bprm->cred) { > /* in case exec fails before de_thread() succeeds */ > - current->fs->in_exec = 0; > + WRITE_ONCE(current->fs->in_exec, 0); > mutex_unlock(&current->signal->cred_guard_mutex); > abort_creds(bprm->cred); > } > diff --git a/kernel/fork.c b/kernel/fork.c > index 4c2df3816728..381af8c8ece8 100644 > --- a/kernel/fork.c > +++ b/kernel/fork.c > @@ -1802,7 +1802,7 @@ static int copy_fs(unsigned long clone_flags, struct task_struct *tsk) > /* tsk->fs is already what we want */ > spin_lock(&fs->lock); > /* "users" and "in_exec" locked for check_unsafe_exec() */ > - if (fs->in_exec) { > + if (READ_ONCE(fs->in_exec)) { > spin_unlock(&fs->lock); > return -EAGAIN; > } > good grief man ;) I have this on my TODO list, (un)fortunately $life got in the way and by now I swapped out almost all of the context. I mostly remember the code is hard to follow. ;) That said, maybe I'll give it a stab soon(tm). I have a testcase somewhere to validate that this does provide the expected behavior vs suid, so it's not going to be me flying blindly.
-- Mateusz Guzik <mjguzik gmail.com> ^ permalink raw reply [flat|nested] 30+ messages in thread
end of thread, other threads:[~2025-04-29 17:12 UTC | newest] Thread overview: 30+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2025-03-20 19:09 [syzbot] [fs?] [mm?] KCSAN: data-race in bprm_execve / copy_fs (4) syzbot 2025-03-20 20:09 ` Kees Cook 2025-03-21 1:44 ` Al Viro 2025-03-21 8:10 ` Kees Cook 2025-03-21 8:49 ` Christian Brauner 2025-03-21 8:45 ` Christian Brauner 2025-03-22 1:00 ` Al Viro 2025-03-22 6:26 ` Kees Cook 2025-03-22 10:15 ` Mateusz Guzik 2025-03-22 10:28 ` Christian Brauner 2025-03-22 10:23 ` Christian Brauner 2025-03-22 15:55 ` Oleg Nesterov 2025-03-22 18:50 ` Al Viro 2025-03-23 18:14 ` Oleg Nesterov 2025-03-23 20:57 ` Christian Brauner 2025-03-24 16:00 ` [PATCH] exec: fix the racy usage of fs_struct->in_exec Oleg Nesterov 2025-03-24 17:01 ` Mateusz Guzik 2025-03-24 18:27 ` Oleg Nesterov 2025-03-24 18:37 ` Oleg Nesterov 2025-03-24 22:24 ` Mateusz Guzik 2025-03-25 10:09 ` Oleg Nesterov 2025-03-25 11:01 ` Mateusz Guzik 2025-03-25 13:21 ` Oleg Nesterov 2025-03-25 13:30 ` Christian Brauner 2025-03-25 14:15 ` Mateusz Guzik 2025-03-25 14:46 ` Christian Brauner 2025-03-25 18:40 ` Kees Cook 2025-04-29 15:49 ` Oleg Nesterov 2025-04-29 16:57 ` Kees Cook 2025-04-29 17:12 ` Mateusz Guzik