* [PATCH 0/2] fs/exec: Explicitly unshare fs_struct on exec @ 2022-10-06 8:27 Kees Cook 2022-10-06 8:27 ` [PATCH 1/2] " Kees Cook 2022-10-06 8:27 ` [PATCH 2/2] exec: Remove LSM_UNSAFE_SHARE Kees Cook 0 siblings, 2 replies; 25+ messages in thread From: Kees Cook @ 2022-10-06 8:27 UTC (permalink / raw) To: Eric Biederman Cc: Kees Cook, Jorge Merlino, Al Viro, Christian Brauner (Microsoft), Thomas Gleixner, Andy Lutomirski, Sebastian Andrzej Siewior, Andrew Morton, John Johansen, Paul Moore, James Morris, Serge E. Hallyn, Stephen Smalley, Eric Paris, Richard Haines, Casey Schaufler, Xin Long, David S. Miller, Todd Kjos, Ondrej Mosnacek, Prashanth Prahlad, Micah Morton, Fenghua Yu, Andrei Vagin, linux-kernel, linux-mm, linux-fsdevel, apparmor, linux-security-module, selinux, linux-hardening Hi, These changes seek to address an issue reported[1] by Jorge Merlino where high-thread-count processes would sometimes fail to setuid during a setuid execve(). It looks to me like the solution is to explicitly do an unshare_fs(), which should almost always be a no-op. Current testing seems to indicate that only the swapper->init exec triggers this condition (and I'm unclear on whether that's expected or undesirable). This has only received very light testing so far, but I wanted to share it so other folks could look it over. Jorge, can you test with these patches? Your PoC triggered immediately for me on an unpatched kernel, and did not trigger on a patched one. I added this patch on top of the series to see if the code ever fired: diff --git a/kernel/fork.c b/kernel/fork.c index 53b7248f7a4b..3c197d9d8daa 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -3113,6 +3113,7 @@ int unshare_fs(void) if (error || !new_fs) return error; + pr_notice("UNSHARE of \"%s\" [%d]\n", current->comm, current->pid); unshare_fs_finalize(&new_fs); if (new_fs) Thanks! -Kees [1] https://lore.kernel.org/lkml/20220910211215.140270-1-jorge.merlino@canonical.com/ Kees Cook (2): fs/exec: Explicitly unshare fs_struct on exec exec: Remove LSM_UNSAFE_SHARE fs/exec.c | 26 ++++------------ fs/fs_struct.c | 1 - include/linux/fdtable.h | 1 + include/linux/fs_struct.h | 1 - include/linux/security.h | 5 ++- kernel/fork.c | 62 ++++++++++++++++++++++++++------------ security/apparmor/domain.c | 5 --- security/selinux/hooks.c | 10 ------ 8 files changed, 51 insertions(+), 60 deletions(-) -- 2.34.1 ^ permalink raw reply related [flat|nested] 25+ messages in thread
* [PATCH 1/2] fs/exec: Explicitly unshare fs_struct on exec 2022-10-06 8:27 [PATCH 0/2] fs/exec: Explicitly unshare fs_struct on exec Kees Cook @ 2022-10-06 8:27 ` Kees Cook 2022-10-06 9:05 ` Christian Brauner 2022-10-06 8:27 ` [PATCH 2/2] exec: Remove LSM_UNSAFE_SHARE Kees Cook 1 sibling, 1 reply; 25+ messages in thread From: Kees Cook @ 2022-10-06 8:27 UTC (permalink / raw) To: Eric Biederman Cc: Kees Cook, Jorge Merlino, Alexander Viro, Christian Brauner (Microsoft), Thomas Gleixner, Andy Lutomirski, Sebastian Andrzej Siewior, Andrew Morton, linux-mm, linux-fsdevel, John Johansen, Paul Moore, James Morris, Serge E. Hallyn, Stephen Smalley, Eric Paris, Richard Haines, Casey Schaufler, Xin Long, David S. Miller, Todd Kjos, Ondrej Mosnacek, Prashanth Prahlad, Micah Morton, Fenghua Yu, Andrei Vagin, linux-kernel, apparmor, linux-security-module, selinux, linux-hardening The check_unsafe_exec() counting of n_fs would not add up under a heavily threaded process trying to perform a suid exec, causing the suid portion to fail. This counting error appears to be unneeded, but to catch any possible conditions, explicitly unshare fs_struct on exec, if it ends up needing to happen. This means fs->in_exec must be removed (since it would never get cleared in the old copy), and is specifically no longer needed. See also commit 498052bba55e ("New locking/refcounting for fs_struct"). This additionally allows the "in_exec" member to be dropped, showing an 8 bytes savings per fs_struct on 64-bit. Before: struct fs_struct { int users; /* 0 4 */ spinlock_t lock; /* 4 4 */ seqcount_spinlock_t seq; /* 8 4 */ int umask; /* 12 4 */ int in_exec; /* 16 4 */ /* XXX 4 bytes hole, try to pack */ struct path root; /* 24 16 */ struct path pwd; /* 40 16 */ /* size: 56, cachelines: 1, members: 7 */ /* sum members: 52, holes: 1, sum holes: 4 */ /* last cacheline: 56 bytes */ }; After: struct fs_struct { int users; /* 0 4 */ spinlock_t lock; /* 4 4 */ seqcount_spinlock_t seq; /* 8 4 */ int umask; /* 12 4 */ struct path root; /* 16 16 */ struct path pwd; /* 32 16 */ /* size: 48, cachelines: 1, members: 6 */ /* last cacheline: 48 bytes */ }; Reported-by: Jorge Merlino <jorge.merlino@canonical.com> Link: https://lore.kernel.org/lkml/20220910211215.140270-1-jorge.merlino@canonical.com/ Cc: Eric Biederman <ebiederm@xmission.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: "Christian Brauner (Microsoft)" <brauner@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Andy Lutomirski <luto@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: linux-mm@kvack.org Cc: linux-fsdevel@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org> --- fs/exec.c | 9 +++--- fs/fs_struct.c | 1 - include/linux/fdtable.h | 1 + include/linux/fs_struct.h | 1 - kernel/fork.c | 62 ++++++++++++++++++++++++++------------- 5 files changed, 48 insertions(+), 26 deletions(-) diff --git a/fs/exec.c b/fs/exec.c index 902bce45b116..7d5f63f03c58 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1272,6 +1272,11 @@ int begin_new_exec(struct linux_binprm * bprm) if (retval) goto out; + /* Ensure the fs_struct is not shared. */ + retval = unshare_fs(); + if (retval) + goto out; + /* * Must be called _before_ exec_mmap() as bprm->mm is * not visible until then. This also enables the update @@ -1583,8 +1588,6 @@ static void check_unsafe_exec(struct linux_binprm *bprm) if (p->fs->users > n_fs) bprm->unsafe |= LSM_UNSAFE_SHARE; - else - p->fs->in_exec = 1; spin_unlock(&p->fs->lock); } @@ -1843,7 +1846,6 @@ static int bprm_execve(struct linux_binprm *bprm, goto out; /* execve succeeded */ - current->fs->in_exec = 0; current->in_execve = 0; rseq_execve(current); acct_update_integrals(current); @@ -1861,7 +1863,6 @@ static int bprm_execve(struct linux_binprm *bprm, force_fatal_sig(SIGSEGV); out_unmark: - current->fs->in_exec = 0; current->in_execve = 0; return retval; diff --git a/fs/fs_struct.c b/fs/fs_struct.c index 04b3f5b9c629..c27c67023d01 100644 --- a/fs/fs_struct.c +++ b/fs/fs_struct.c @@ -115,7 +115,6 @@ struct fs_struct *copy_fs_struct(struct fs_struct *old) /* We don't need to lock fs - think why ;-) */ if (fs) { fs->users = 1; - fs->in_exec = 0; spin_lock_init(&fs->lock); seqcount_spinlock_init(&fs->seq, &fs->lock); fs->umask = old->umask; diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h index e066816f3519..fbfb89ee3bda 100644 --- a/include/linux/fdtable.h +++ b/include/linux/fdtable.h @@ -117,6 +117,7 @@ struct task_struct; void put_files_struct(struct files_struct *fs); int unshare_files(void); +int unshare_fs(void); struct files_struct *dup_fd(struct files_struct *, unsigned, int *) __latent_entropy; void do_close_on_exec(struct files_struct *); int iterate_fd(struct files_struct *, unsigned, diff --git a/include/linux/fs_struct.h b/include/linux/fs_struct.h index 783b48dedb72..08b82a90ceae 100644 --- a/include/linux/fs_struct.h +++ b/include/linux/fs_struct.h @@ -11,7 +11,6 @@ struct fs_struct { spinlock_t lock; seqcount_spinlock_t seq; int umask; - int in_exec; struct path root, pwd; } __randomize_layout; diff --git a/kernel/fork.c b/kernel/fork.c index b4a799d9c50f..53b7248f7a4b 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1589,10 +1589,6 @@ static int copy_fs(unsigned long clone_flags, struct task_struct *tsk) if (clone_flags & CLONE_FS) { /* tsk->fs is already what we want */ spin_lock(&fs->lock); - if (fs->in_exec) { - spin_unlock(&fs->lock); - return -EAGAIN; - } fs->users++; spin_unlock(&fs->lock); return 0; @@ -3070,10 +3066,9 @@ static int check_unshare_flags(unsigned long unshare_flags) return 0; } -/* - * Unshare the filesystem structure if it is being shared - */ -static int unshare_fs(unsigned long unshare_flags, struct fs_struct **new_fsp) +/* Allocate an fs_struct if it is currently shared and CLONE_FS requested. */ +static int unshare_fs_alloc(unsigned long unshare_flags, + struct fs_struct **new_fsp) { struct fs_struct *fs = current->fs; @@ -3091,6 +3086,41 @@ static int unshare_fs(unsigned long unshare_flags, struct fs_struct **new_fsp) return 0; } +/* Swap out fs_struct, returning old fs if it needs freeing. */ +static void unshare_fs_finalize(struct fs_struct **new_fsp) +{ + struct task_struct *task = current; + struct fs_struct *fs = task->fs; + + spin_lock(&fs->lock); + task->fs = *new_fsp; + if (--fs->users) + *new_fsp = NULL; + else + *new_fsp = fs; + spin_unlock(&fs->lock); +} + +/* + * Unshare the filesystem structure if it is being shared + */ +int unshare_fs(void) +{ + struct fs_struct *new_fs = NULL; + int error; + + error = unshare_fs_alloc(CLONE_FS, &new_fs); + if (error || !new_fs) + return error; + + unshare_fs_finalize(&new_fs); + + if (new_fs) + free_fs_struct(new_fs); + + return 0; +} + /* * Unshare file descriptor table if it is being shared */ @@ -3120,7 +3150,7 @@ int unshare_fd(unsigned long unshare_flags, unsigned int max_fds, */ int ksys_unshare(unsigned long unshare_flags) { - struct fs_struct *fs, *new_fs = NULL; + struct fs_struct *new_fs = NULL; struct files_struct *new_fd = NULL; struct cred *new_cred = NULL; struct nsproxy *new_nsproxy = NULL; @@ -3159,7 +3189,7 @@ int ksys_unshare(unsigned long unshare_flags) */ if (unshare_flags & (CLONE_NEWIPC|CLONE_SYSVSEM)) do_sysvsem = 1; - err = unshare_fs(unshare_flags, &new_fs); + err = unshare_fs_alloc(unshare_flags, &new_fs); if (err) goto bad_unshare_out; err = unshare_fd(unshare_flags, NR_OPEN_MAX, &new_fd); @@ -3197,16 +3227,8 @@ int ksys_unshare(unsigned long unshare_flags) task_lock(current); - if (new_fs) { - fs = current->fs; - spin_lock(&fs->lock); - current->fs = new_fs; - if (--fs->users) - new_fs = NULL; - else - new_fs = fs; - spin_unlock(&fs->lock); - } + if (new_fs) + unshare_fs_finalize(&new_fs); if (new_fd) swap(current->files, new_fd); -- 2.34.1 ^ permalink raw reply related [flat|nested] 25+ messages in thread
* Re: [PATCH 1/2] fs/exec: Explicitly unshare fs_struct on exec 2022-10-06 8:27 ` [PATCH 1/2] " Kees Cook @ 2022-10-06 9:05 ` Christian Brauner 2022-10-06 10:48 ` David Laight 2022-10-06 14:13 ` Jann Horn 0 siblings, 2 replies; 25+ messages in thread From: Christian Brauner @ 2022-10-06 9:05 UTC (permalink / raw) To: Kees Cook Cc: Eric Biederman, Jorge Merlino, Alexander Viro, Thomas Gleixner, Andy Lutomirski, Sebastian Andrzej Siewior, Andrew Morton, linux-mm, linux-fsdevel, John Johansen, Paul Moore, James Morris, Serge E. Hallyn, Stephen Smalley, Eric Paris, Richard Haines, Casey Schaufler, Xin Long, David S. Miller, Todd Kjos, Ondrej Mosnacek, Prashanth Prahlad, Micah Morton, Fenghua Yu, Andrei Vagin, linux-kernel, apparmor, linux-security-module, selinux, linux-hardening On Thu, Oct 06, 2022 at 01:27:34AM -0700, Kees Cook wrote: > The check_unsafe_exec() counting of n_fs would not add up under a heavily > threaded process trying to perform a suid exec, causing the suid portion > to fail. This counting error appears to be unneeded, but to catch any > possible conditions, explicitly unshare fs_struct on exec, if it ends up Isn't this a potential uapi break? Afaict, before this change a call to clone{3}(CLONE_FS) followed by an exec in the child would have the parent and child share fs information. So if the child e.g., changes the working directory post exec it would also affect the parent. But after this change here this would no longer be true. So a child changing a workding directoro would not affect the parent anymore. IOW, an exec is accompanied by an unshare(CLONE_FS). Might still be worth trying ofc but it seems like a non-trivial uapi change but there might be few users that do clone{3}(CLONE_FS) followed by an exec. > +/* > + * Unshare the filesystem structure if it is being shared > + */ > +int unshare_fs(void) > +{ > + struct fs_struct *new_fs = NULL; > + int error; > + > + error = unshare_fs_alloc(CLONE_FS, &new_fs); > + if (error || !new_fs) > + return error; You should just check for error here, not new_fs. ^ permalink raw reply [flat|nested] 25+ messages in thread
* RE: [PATCH 1/2] fs/exec: Explicitly unshare fs_struct on exec 2022-10-06 9:05 ` Christian Brauner @ 2022-10-06 10:48 ` David Laight 2022-10-06 14:13 ` Jann Horn 1 sibling, 0 replies; 25+ messages in thread From: David Laight @ 2022-10-06 10:48 UTC (permalink / raw) To: 'Christian Brauner', Kees Cook Cc: Eric Biederman, Jorge Merlino, Alexander Viro, Thomas Gleixner, Andy Lutomirski, Sebastian Andrzej Siewior, Andrew Morton, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, John Johansen, Paul Moore, James Morris, Serge E. Hallyn, Stephen Smalley, Eric Paris, Richard Haines, Casey Schaufler, Xin Long, David S. Miller, Todd Kjos, Ondrej Mosnacek, Prashanth Prahlad, Micah Morton, Fenghua Yu, Andrei Vagin, linux-kernel@vger.kernel.org, apparmor@lists.ubuntu.com, linux-security-module@vger.kernel.org, selinux@vger.kernel.org, linux-hardening@vger.kernel.org From: Christian Brauner > Sent: 06 October 2022 10:05 > > On Thu, Oct 06, 2022 at 01:27:34AM -0700, Kees Cook wrote: > > The check_unsafe_exec() counting of n_fs would not add up under a heavily > > threaded process trying to perform a suid exec, causing the suid portion > > to fail. This counting error appears to be unneeded, but to catch any > > possible conditions, explicitly unshare fs_struct on exec, if it ends up > > Isn't this a potential uapi break? Afaict, before this change a call to > clone{3}(CLONE_FS) followed by an exec in the child would have the > parent and child share fs information. So if the child e.g., changes the > working directory post exec it would also affect the parent. But after > this change here this would no longer be true. So a child changing a > workding directoro would not affect the parent anymore. IOW, an exec is > accompanied by an unshare(CLONE_FS). Might still be worth trying ofc but > it seems like a non-trivial uapi change but there might be few users > that do clone{3}(CLONE_FS) followed by an exec. The thought of that is entirely horrid... I presume a suid exec will fail in that case? If the old code is trying to compare the number of threads with the number of users of the fs table isn't is just buggy? If a thread unshares the fs table there can be another reference somewhere else - which is what (I presume) is being tested for. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales) ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 1/2] fs/exec: Explicitly unshare fs_struct on exec 2022-10-06 9:05 ` Christian Brauner 2022-10-06 10:48 ` David Laight @ 2022-10-06 14:13 ` Jann Horn 2022-10-06 15:25 ` Kees Cook 2022-10-14 3:18 ` Andy Lutomirski 1 sibling, 2 replies; 25+ messages in thread From: Jann Horn @ 2022-10-06 14:13 UTC (permalink / raw) To: Christian Brauner Cc: Kees Cook, Eric Biederman, Jorge Merlino, Alexander Viro, Thomas Gleixner, Andy Lutomirski, Sebastian Andrzej Siewior, Andrew Morton, linux-mm, linux-fsdevel, John Johansen, Paul Moore, James Morris, Serge E. Hallyn, Stephen Smalley, Eric Paris, Richard Haines, Casey Schaufler, Xin Long, David S. Miller, Todd Kjos, Ondrej Mosnacek, Prashanth Prahlad, Micah Morton, Fenghua Yu, Andrei Vagin, linux-kernel, apparmor, linux-security-module, selinux, linux-hardening On Thu, Oct 6, 2022 at 11:05 AM Christian Brauner <brauner@kernel.org> wrote: > On Thu, Oct 06, 2022 at 01:27:34AM -0700, Kees Cook wrote: > > The check_unsafe_exec() counting of n_fs would not add up under a heavily > > threaded process trying to perform a suid exec, causing the suid portion > > to fail. This counting error appears to be unneeded, but to catch any > > possible conditions, explicitly unshare fs_struct on exec, if it ends up > > Isn't this a potential uapi break? Afaict, before this change a call to > clone{3}(CLONE_FS) followed by an exec in the child would have the > parent and child share fs information. So if the child e.g., changes the > working directory post exec it would also affect the parent. But after > this change here this would no longer be true. So a child changing a > workding directoro would not affect the parent anymore. IOW, an exec is > accompanied by an unshare(CLONE_FS). Might still be worth trying ofc but > it seems like a non-trivial uapi change but there might be few users > that do clone{3}(CLONE_FS) followed by an exec. I believe the following code in Chromium explicitly relies on this behavior, but I'm not sure whether this code is in active use anymore: https://source.chromium.org/chromium/chromium/src/+/main:sandbox/linux/suid/sandbox.c;l=101?q=CLONE_FS&sq=&ss=chromium ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 1/2] fs/exec: Explicitly unshare fs_struct on exec 2022-10-06 14:13 ` Jann Horn @ 2022-10-06 15:25 ` Kees Cook 2022-10-06 15:35 ` Jann Horn 2025-05-13 13:05 ` Mateusz Guzik 2022-10-14 3:18 ` Andy Lutomirski 1 sibling, 2 replies; 25+ messages in thread From: Kees Cook @ 2022-10-06 15:25 UTC (permalink / raw) To: Jann Horn, Christian Brauner Cc: Eric Biederman, Jorge Merlino, Alexander Viro, Thomas Gleixner, Andy Lutomirski, Sebastian Andrzej Siewior, Andrew Morton, linux-mm, linux-fsdevel, John Johansen, Paul Moore, James Morris, Serge E. Hallyn, Stephen Smalley, Eric Paris, Richard Haines, Casey Schaufler, Xin Long, David S. Miller, Todd Kjos, Ondrej Mosnacek, Prashanth Prahlad, Micah Morton, Fenghua Yu, Andrei Vagin, linux-kernel, apparmor, linux-security-module, selinux, linux-hardening On October 6, 2022 7:13:37 AM PDT, Jann Horn <jannh@google.com> wrote: >On Thu, Oct 6, 2022 at 11:05 AM Christian Brauner <brauner@kernel.org> wrote: >> On Thu, Oct 06, 2022 at 01:27:34AM -0700, Kees Cook wrote: >> > The check_unsafe_exec() counting of n_fs would not add up under a heavily >> > threaded process trying to perform a suid exec, causing the suid portion >> > to fail. This counting error appears to be unneeded, but to catch any >> > possible conditions, explicitly unshare fs_struct on exec, if it ends up >> >> Isn't this a potential uapi break? Afaict, before this change a call to >> clone{3}(CLONE_FS) followed by an exec in the child would have the >> parent and child share fs information. So if the child e.g., changes the >> working directory post exec it would also affect the parent. But after >> this change here this would no longer be true. So a child changing a >> workding directoro would not affect the parent anymore. IOW, an exec is >> accompanied by an unshare(CLONE_FS). Might still be worth trying ofc but >> it seems like a non-trivial uapi change but there might be few users >> that do clone{3}(CLONE_FS) followed by an exec. > >I believe the following code in Chromium explicitly relies on this >behavior, but I'm not sure whether this code is in active use anymore: > >https://source.chromium.org/chromium/chromium/src/+/main:sandbox/linux/suid/sandbox.c;l=101?q=CLONE_FS&sq=&ss=chromium Oh yes. I think I had tried to forget this existed. Ugh. Okay, so back to the drawing board, I guess. The counting will need to be fixed... It's possible we can move the counting after dethread -- it seems the early count was just to avoid setting flags after the point of no return, but it's not an error condition... -- Kees Cook ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 1/2] fs/exec: Explicitly unshare fs_struct on exec 2022-10-06 15:25 ` Kees Cook @ 2022-10-06 15:35 ` Jann Horn 2025-05-13 13:05 ` Mateusz Guzik 1 sibling, 0 replies; 25+ messages in thread From: Jann Horn @ 2022-10-06 15:35 UTC (permalink / raw) To: Kees Cook Cc: Christian Brauner, Eric Biederman, Jorge Merlino, Alexander Viro, Thomas Gleixner, Andy Lutomirski, Sebastian Andrzej Siewior, Andrew Morton, linux-mm, linux-fsdevel, John Johansen, Paul Moore, James Morris, Serge E. Hallyn, Stephen Smalley, Eric Paris, Richard Haines, Casey Schaufler, Xin Long, David S. Miller, Todd Kjos, Ondrej Mosnacek, Prashanth Prahlad, Micah Morton, Fenghua Yu, Andrei Vagin, linux-kernel, apparmor, linux-security-module, selinux, linux-hardening On Thu, Oct 6, 2022 at 5:25 PM Kees Cook <keescook@chromium.org> wrote: > On October 6, 2022 7:13:37 AM PDT, Jann Horn <jannh@google.com> wrote: > >On Thu, Oct 6, 2022 at 11:05 AM Christian Brauner <brauner@kernel.org> wrote: > >> On Thu, Oct 06, 2022 at 01:27:34AM -0700, Kees Cook wrote: > >> > The check_unsafe_exec() counting of n_fs would not add up under a heavily > >> > threaded process trying to perform a suid exec, causing the suid portion > >> > to fail. This counting error appears to be unneeded, but to catch any > >> > possible conditions, explicitly unshare fs_struct on exec, if it ends up > >> > >> Isn't this a potential uapi break? Afaict, before this change a call to > >> clone{3}(CLONE_FS) followed by an exec in the child would have the > >> parent and child share fs information. So if the child e.g., changes the > >> working directory post exec it would also affect the parent. But after > >> this change here this would no longer be true. So a child changing a > >> workding directoro would not affect the parent anymore. IOW, an exec is > >> accompanied by an unshare(CLONE_FS). Might still be worth trying ofc but > >> it seems like a non-trivial uapi change but there might be few users > >> that do clone{3}(CLONE_FS) followed by an exec. > > > >I believe the following code in Chromium explicitly relies on this > >behavior, but I'm not sure whether this code is in active use anymore: > > > >https://source.chromium.org/chromium/chromium/src/+/main:sandbox/linux/suid/sandbox.c;l=101?q=CLONE_FS&sq=&ss=chromium > > Oh yes. I think I had tried to forget this existed. Ugh. Okay, so back to the drawing board, I guess. The counting will need to be fixed... > > It's possible we can move the counting after dethread -- it seems the early count was just to avoid setting flags after the point of no return, but it's not an error condition... Random idea that I haven't thought about a lot: One approach might be to not do it by counting, but instead have a flag on the fs_struct that we set when someone does a clone() with CLONE_FS but without CLONE_THREAD? Then we'd end up with the following possible states for fs_struct: - single-process, normal - single-process, pending execve past check_unsafe_exec() (prevent concurrent CLONE_FS) - shared between processes The slight difference from the old semantics would be that once you've used CLONE_FS without CLONE_THREAD, you can never do setuid execve() from your current process again (without calling unshare()), even if the child disappears in the meantime. I think that might be an acceptably small UAPI break. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 1/2] fs/exec: Explicitly unshare fs_struct on exec 2022-10-06 15:25 ` Kees Cook 2022-10-06 15:35 ` Jann Horn @ 2025-05-13 13:05 ` Mateusz Guzik 2025-05-13 15:29 ` Eric W. Biederman 2025-05-13 20:57 ` Kees Cook 1 sibling, 2 replies; 25+ messages in thread From: Mateusz Guzik @ 2025-05-13 13:05 UTC (permalink / raw) To: Kees Cook Cc: Jann Horn, Christian Brauner, Eric Biederman, Jorge Merlino, Alexander Viro, Thomas Gleixner, Andy Lutomirski, Sebastian Andrzej Siewior, Andrew Morton, linux-mm, linux-fsdevel, John Johansen, Paul Moore, James Morris, Serge E. Hallyn, Stephen Smalley, Eric Paris, Richard Haines, Casey Schaufler, Xin Long, David S. Miller, Todd Kjos, Ondrej Mosnacek, Prashanth Prahlad, Micah Morton, Fenghua Yu, Andrei Vagin, linux-kernel, apparmor, linux-security-module, selinux, linux-hardening, oleg On Thu, Oct 06, 2022 at 08:25:01AM -0700, Kees Cook wrote: > On October 6, 2022 7:13:37 AM PDT, Jann Horn <jannh@google.com> wrote: > >On Thu, Oct 6, 2022 at 11:05 AM Christian Brauner <brauner@kernel.org> wrote: > >> On Thu, Oct 06, 2022 at 01:27:34AM -0700, Kees Cook wrote: > >> > The check_unsafe_exec() counting of n_fs would not add up under a heavily > >> > threaded process trying to perform a suid exec, causing the suid portion > >> > to fail. This counting error appears to be unneeded, but to catch any > >> > possible conditions, explicitly unshare fs_struct on exec, if it ends up > >> > >> Isn't this a potential uapi break? Afaict, before this change a call to > >> clone{3}(CLONE_FS) followed by an exec in the child would have the > >> parent and child share fs information. So if the child e.g., changes the > >> working directory post exec it would also affect the parent. But after > >> this change here this would no longer be true. So a child changing a > >> workding directoro would not affect the parent anymore. IOW, an exec is > >> accompanied by an unshare(CLONE_FS). Might still be worth trying ofc but > >> it seems like a non-trivial uapi change but there might be few users > >> that do clone{3}(CLONE_FS) followed by an exec. > > > >I believe the following code in Chromium explicitly relies on this > >behavior, but I'm not sure whether this code is in active use anymore: > > > >https://source.chromium.org/chromium/chromium/src/+/main:sandbox/linux/suid/sandbox.c;l=101?q=CLONE_FS&sq=&ss=chromium > > Oh yes. I think I had tried to forget this existed. Ugh. Okay, so back to the drawing board, I guess. The counting will need to be fixed... > > It's possible we can move the counting after dethread -- it seems the early count was just to avoid setting flags after the point of no return, but it's not an error condition... > I landed here from git blame. I was looking at sanitizing shared fs vs suid handling, but the entire ordeal is so convoluted I'm confident the best way forward is to whack the problem to begin with. Per the above link, the notion of a shared fs struct across different processes is depended on so merely unsharing is a no-go. However, the shared state is only a problem for suid/sgid. Here is my proposal: *deny* exec of suid/sgid binaries if fs_struct is shared. This will have to be checked for after the execing proc becomes single-threaded ofc. While technically speaking this does introduce a change in behavior, there is precedent for doing it and seeing if anyone yells. With this in place there is no point maintainig ->in_exec or checking the flag. There is the known example of depending on shared fs_struct across exec. Hopefully there is no example of depending on execing a suid/sgid binary in such a setting -- it would be quite a weird setup given that for security reasons the perms must not be changed. The upshot of this method is that any breakage will be immediately visible in the form of a failed exec. Another route would be to do the mandatory unshare but only for suid/sgid, except that would have a hidden failure (if you will). Comments? ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 1/2] fs/exec: Explicitly unshare fs_struct on exec 2025-05-13 13:05 ` Mateusz Guzik @ 2025-05-13 15:29 ` Eric W. Biederman 2025-05-13 20:57 ` Kees Cook 1 sibling, 0 replies; 25+ messages in thread From: Eric W. Biederman @ 2025-05-13 15:29 UTC (permalink / raw) To: Mateusz Guzik Cc: Kees Cook, Jann Horn, Christian Brauner, Jorge Merlino, Alexander Viro, Thomas Gleixner, Andy Lutomirski, Sebastian Andrzej Siewior, Andrew Morton, linux-mm, linux-fsdevel, John Johansen, Paul Moore, James Morris, Serge E. Hallyn, Stephen Smalley, Eric Paris, Richard Haines, Casey Schaufler, Xin Long, David S. Miller, Todd Kjos, Ondrej Mosnacek, Prashanth Prahlad, Micah Morton, Fenghua Yu, Andrei Vagin, linux-kernel, apparmor, linux-security-module, selinux, linux-hardening, oleg Mateusz Guzik <mjguzik@gmail.com> writes: > On Thu, Oct 06, 2022 at 08:25:01AM -0700, Kees Cook wrote: >> On October 6, 2022 7:13:37 AM PDT, Jann Horn <jannh@google.com> wrote: >> >On Thu, Oct 6, 2022 at 11:05 AM Christian Brauner <brauner@kernel.org> wrote: >> >> On Thu, Oct 06, 2022 at 01:27:34AM -0700, Kees Cook wrote: >> >> > The check_unsafe_exec() counting of n_fs would not add up under a heavily >> >> > threaded process trying to perform a suid exec, causing the suid portion >> >> > to fail. This counting error appears to be unneeded, but to catch any >> >> > possible conditions, explicitly unshare fs_struct on exec, if it ends up >> >> >> >> Isn't this a potential uapi break? Afaict, before this change a call to >> >> clone{3}(CLONE_FS) followed by an exec in the child would have the >> >> parent and child share fs information. So if the child e.g., changes the >> >> working directory post exec it would also affect the parent. But after >> >> this change here this would no longer be true. So a child changing a >> >> workding directoro would not affect the parent anymore. IOW, an exec is >> >> accompanied by an unshare(CLONE_FS). Might still be worth trying ofc but >> >> it seems like a non-trivial uapi change but there might be few users >> >> that do clone{3}(CLONE_FS) followed by an exec. >> > >> >I believe the following code in Chromium explicitly relies on this >> >behavior, but I'm not sure whether this code is in active use anymore: >> > >> >https://source.chromium.org/chromium/chromium/src/+/main:sandbox/linux/suid/sandbox.c;l=101?q=CLONE_FS&sq=&ss=chromium >> >> Oh yes. I think I had tried to forget this existed. Ugh. Okay, so back to the drawing board, I guess. The counting will need to be fixed... >> >> It's possible we can move the counting after dethread -- it seems the early count was just to avoid setting flags after the point of no return, but it's not an error condition... >> > > I landed here from git blame. > > I was looking at sanitizing shared fs vs suid handling, but the entire > ordeal is so convoluted I'm confident the best way forward is to whack > the problem to begin with. > > Per the above link, the notion of a shared fs struct across different > processes is depended on so merely unsharing is a no-go. > > However, the shared state is only a problem for suid/sgid. > > Here is my proposal: *deny* exec of suid/sgid binaries if fs_struct is > shared. This will have to be checked for after the execing proc becomes > single-threaded ofc. > > While technically speaking this does introduce a change in behavior, > there is precedent for doing it and seeing if anyone yells. > > With this in place there is no point maintainig ->in_exec or checking > the flag. > > There is the known example of depending on shared fs_struct across exec. > Hopefully there is no example of depending on execing a suid/sgid binary > in such a setting -- it would be quite a weird setup given that for > security reasons the perms must not be changed. > > The upshot of this method is that any breakage will be immediately > visible in the form of a failed exec. > > Another route would be to do the mandatory unshare but only for > suid/sgid, except that would have a hidden failure (if you will). > > Comments? What is the problem that is trying to be fixed? A uapi change to not allow sharing a fs_struct for processes that change their cred on exec seems possible. I said changing cred instead of suid/sgid because there are capabilities and LSM labels that we probably want this to apply to as well. I think such a limitation can be justified based upon having a shared fs_struct is likely to allow confuse suid executables. Earlier in the thread there was talk about the refcount for fs_struct. I don't see that problem at the moment, and I don't see how dealing with suid+sgid exectuables will have any bearing on how the refcount works. Eric ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 1/2] fs/exec: Explicitly unshare fs_struct on exec 2025-05-13 13:05 ` Mateusz Guzik 2025-05-13 15:29 ` Eric W. Biederman @ 2025-05-13 20:57 ` Kees Cook 2025-05-13 21:09 ` Jann Horn 1 sibling, 1 reply; 25+ messages in thread From: Kees Cook @ 2025-05-13 20:57 UTC (permalink / raw) To: Mateusz Guzik, Kees Cook Cc: Jann Horn, Christian Brauner, Eric Biederman, Jorge Merlino, Alexander Viro, Thomas Gleixner, Andy Lutomirski, Sebastian Andrzej Siewior, Andrew Morton, linux-mm, linux-fsdevel, John Johansen, Paul Moore, James Morris, Serge E. Hallyn, Stephen Smalley, Eric Paris, Richard Haines, Casey Schaufler, Xin Long, David S. Miller, Todd Kjos, Ondrej Mosnacek, Prashanth Prahlad, Micah Morton, Fenghua Yu, Andrei Vagin, linux-kernel, apparmor, linux-security-module, selinux, linux-hardening, oleg On May 13, 2025 6:05:45 AM PDT, Mateusz Guzik <mjguzik@gmail.com> wrote: >On Thu, Oct 06, 2022 at 08:25:01AM -0700, Kees Cook wrote: >> On October 6, 2022 7:13:37 AM PDT, Jann Horn <jannh@google.com> wrote: >> >On Thu, Oct 6, 2022 at 11:05 AM Christian Brauner <brauner@kernel.org> wrote: >> >> On Thu, Oct 06, 2022 at 01:27:34AM -0700, Kees Cook wrote: >> >> > The check_unsafe_exec() counting of n_fs would not add up under a heavily >> >> > threaded process trying to perform a suid exec, causing the suid portion >> >> > to fail. This counting error appears to be unneeded, but to catch any >> >> > possible conditions, explicitly unshare fs_struct on exec, if it ends up >> >> >> >> Isn't this a potential uapi break? Afaict, before this change a call to >> >> clone{3}(CLONE_FS) followed by an exec in the child would have the >> >> parent and child share fs information. So if the child e.g., changes the >> >> working directory post exec it would also affect the parent. But after >> >> this change here this would no longer be true. So a child changing a >> >> workding directoro would not affect the parent anymore. IOW, an exec is >> >> accompanied by an unshare(CLONE_FS). Might still be worth trying ofc but >> >> it seems like a non-trivial uapi change but there might be few users >> >> that do clone{3}(CLONE_FS) followed by an exec. >> > >> >I believe the following code in Chromium explicitly relies on this >> >behavior, but I'm not sure whether this code is in active use anymore: >> > >> >https://source.chromium.org/chromium/chromium/src/+/main:sandbox/linux/suid/sandbox.c;l=101?q=CLONE_FS&sq=&ss=chromium >> >> Oh yes. I think I had tried to forget this existed. Ugh. Okay, so back to the drawing board, I guess. The counting will need to be fixed... >> >> It's possible we can move the counting after dethread -- it seems the early count was just to avoid setting flags after the point of no return, but it's not an error condition... >> > >I landed here from git blame. > >I was looking at sanitizing shared fs vs suid handling, but the entire >ordeal is so convoluted I'm confident the best way forward is to whack >the problem to begin with. > >Per the above link, the notion of a shared fs struct across different >processes is depended on so merely unsharing is a no-go. > >However, the shared state is only a problem for suid/sgid. > >Here is my proposal: *deny* exec of suid/sgid binaries if fs_struct is >shared. This will have to be checked for after the execing proc becomes >single-threaded ofc. Unfortunately the above Chrome helper is setuid and uses CLONE_FS. But to echo what Eric asked: what problem are you trying to solve? -Kees -- Kees Cook ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 1/2] fs/exec: Explicitly unshare fs_struct on exec 2025-05-13 20:57 ` Kees Cook @ 2025-05-13 21:09 ` Jann Horn 2025-05-13 22:16 ` Eric W. Biederman 2025-05-13 23:15 ` Kees Cook 0 siblings, 2 replies; 25+ messages in thread From: Jann Horn @ 2025-05-13 21:09 UTC (permalink / raw) To: Kees Cook Cc: Mateusz Guzik, Kees Cook, Christian Brauner, Eric Biederman, Jorge Merlino, Alexander Viro, Thomas Gleixner, Andy Lutomirski, Sebastian Andrzej Siewior, Andrew Morton, linux-mm, linux-fsdevel, John Johansen, Paul Moore, James Morris, Serge E. Hallyn, Stephen Smalley, Eric Paris, Richard Haines, Casey Schaufler, Xin Long, David S. Miller, Todd Kjos, Ondrej Mosnacek, Prashanth Prahlad, Micah Morton, Fenghua Yu, Andrei Vagin, linux-kernel, apparmor, linux-security-module, selinux, linux-hardening, oleg On Tue, May 13, 2025 at 10:57 PM Kees Cook <kees@kernel.org> wrote: > On May 13, 2025 6:05:45 AM PDT, Mateusz Guzik <mjguzik@gmail.com> wrote: > >Here is my proposal: *deny* exec of suid/sgid binaries if fs_struct is > >shared. This will have to be checked for after the execing proc becomes > >single-threaded ofc. > > Unfortunately the above Chrome helper is setuid and uses CLONE_FS. Chrome first launches a setuid helper, and then the setuid helper does CLONE_FS. Mateusz's proposal would not impact this usecase. Mateusz is proposing to block the case where a process first does CLONE_FS, and *then* one of the processes sharing the fs_struct does a setuid execve(). Linux already downgrades such an execve() to be non-setuid, which probably means anyone trying to do this will get hard-to-understand problems. Mateusz' proposal would just turn this hard-to-debug edgecase, which already doesn't really work, into a clean error; I think that is a nice improvement even just from the UAPI standpoint. If this change makes it possible to clean up the kernel code a bit, even better. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 1/2] fs/exec: Explicitly unshare fs_struct on exec 2025-05-13 21:09 ` Jann Horn @ 2025-05-13 22:16 ` Eric W. Biederman 2025-05-14 0:03 ` Mateusz Guzik 2025-05-13 23:15 ` Kees Cook 1 sibling, 1 reply; 25+ messages in thread From: Eric W. Biederman @ 2025-05-13 22:16 UTC (permalink / raw) To: Jann Horn Cc: Kees Cook, Mateusz Guzik, Kees Cook, Christian Brauner, Jorge Merlino, Alexander Viro, Thomas Gleixner, Andy Lutomirski, Sebastian Andrzej Siewior, Andrew Morton, linux-mm, linux-fsdevel, John Johansen, Paul Moore, James Morris, Serge E. Hallyn, Stephen Smalley, Eric Paris, Richard Haines, Casey Schaufler, Xin Long, David S. Miller, Todd Kjos, Ondrej Mosnacek, Prashanth Prahlad, Micah Morton, Fenghua Yu, Andrei Vagin, linux-kernel, apparmor, linux-security-module, selinux, linux-hardening, oleg Jann Horn <jannh@google.com> writes: > On Tue, May 13, 2025 at 10:57 PM Kees Cook <kees@kernel.org> wrote: >> On May 13, 2025 6:05:45 AM PDT, Mateusz Guzik <mjguzik@gmail.com> wrote: >> >Here is my proposal: *deny* exec of suid/sgid binaries if fs_struct is >> >shared. This will have to be checked for after the execing proc becomes >> >single-threaded ofc. >> >> Unfortunately the above Chrome helper is setuid and uses CLONE_FS. > > Chrome first launches a setuid helper, and then the setuid helper does > CLONE_FS. Mateusz's proposal would not impact this usecase. > > Mateusz is proposing to block the case where a process first does > CLONE_FS, and *then* one of the processes sharing the fs_struct does a > setuid execve(). Linux already downgrades such an execve() to be > non-setuid, which probably means anyone trying to do this will get > hard-to-understand problems. Mateusz' proposal would just turn this > hard-to-debug edgecase, which already doesn't really work, into a > clean error; I think that is a nice improvement even just from the > UAPI standpoint. > > If this change makes it possible to clean up the kernel code a bit, even better. What has brought this to everyone's attention just now? This is the second mention of this code path I have seen this week. AKA: security/commoncap.c:cap_bprm_creds_from_file(...) > ... > /* Don't let someone trace a set[ug]id/setpcap binary with the revised > * credentials unless they have the appropriate permit. > * > * In addition, if NO_NEW_PRIVS, then ensure we get no new privs. > */ > is_setid = __is_setuid(new, old) || __is_setgid(new, old); > > if ((is_setid || __cap_gained(permitted, new, old)) && > ((bprm->unsafe & ~LSM_UNSAFE_PTRACE) || > !ptracer_capable(current, new->user_ns))) { > /* downgrade; they get no more than they had, and maybe less */ > if (!ns_capable(new->user_ns, CAP_SETUID) || > (bprm->unsafe & LSM_UNSAFE_NO_NEW_PRIVS)) { > new->euid = new->uid; > new->egid = new->gid; > } > new->cap_permitted = cap_intersect(new->cap_permitted, > old->cap_permitted); > } The actual downgrade is because a ptrace'd executable also takes this path. I have seen it argued rather forcefully that continuing rather than simply failing seems better in the ptrace case. In general I think it can be said this policy is "safe". AKA we don't let a shared fs struct confuse privileged applications. So nothing to panic about. It looks like most of the lsm's also test bprm->unsafe. So I imagine someone could very carefully separate the non-ptrace case from the ptrace case but *shrug*. Perhaps: if ((is_setid || __cap_gained(permitted, new_old)) && ((bprm->unsafe & ~LSM_UNSAFE_PTRACE) || !ptracer_capable(current, new->user_ns))) { + if (!(bprm->unsafe & LSM_UNSAFE_PTRACE)) { + return -EPERM; + } /* downgrade; they get no more than they had, and maybe less */ if (!ns_capable(new->user_ns, CAP_SETUID) || (bprm->unsafe & LSM_UNSAFE_NO_NEW_PRIVS)) { new->euid = new->uid; new->egid = new->gid; } new->cap_permitted = cap_intersect(new->cap_permitted, old->cap_permitted); } If that is what you want that doesn't look to scary. I don't think it simplifies anything about fs->in_exec. As fs->in_exec is set when the processing calling exec is the only process that owns the fs_struct. With fs->in_exec just being a flag that doesn't allow another thread to call fork and start sharing the fs_struct during exec. *Shrug* I don't see why anyone would care. It is just a very silly corner case. Eric ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 1/2] fs/exec: Explicitly unshare fs_struct on exec 2025-05-13 22:16 ` Eric W. Biederman @ 2025-05-14 0:03 ` Mateusz Guzik 2025-05-14 15:33 ` Eric W. Biederman 2025-05-14 15:42 ` Kees Cook 0 siblings, 2 replies; 25+ messages in thread From: Mateusz Guzik @ 2025-05-14 0:03 UTC (permalink / raw) To: Eric W. Biederman Cc: Jann Horn, Kees Cook, Kees Cook, Christian Brauner, Jorge Merlino, Alexander Viro, Thomas Gleixner, Andy Lutomirski, Sebastian Andrzej Siewior, Andrew Morton, linux-mm, linux-fsdevel, John Johansen, Paul Moore, James Morris, Serge E. Hallyn, Stephen Smalley, Eric Paris, Richard Haines, Casey Schaufler, Xin Long, David S. Miller, Todd Kjos, Ondrej Mosnacek, Prashanth Prahlad, Micah Morton, Fenghua Yu, Andrei Vagin, linux-kernel, apparmor, linux-security-module, selinux, linux-hardening, oleg On Wed, May 14, 2025 at 12:17 AM Eric W. Biederman <ebiederm@xmission.com> wrote: > > Jann Horn <jannh@google.com> writes: > > > On Tue, May 13, 2025 at 10:57 PM Kees Cook <kees@kernel.org> wrote: > >> On May 13, 2025 6:05:45 AM PDT, Mateusz Guzik <mjguzik@gmail.com> wrote: > >> >Here is my proposal: *deny* exec of suid/sgid binaries if fs_struct is > >> >shared. This will have to be checked for after the execing proc becomes > >> >single-threaded ofc. > >> > >> Unfortunately the above Chrome helper is setuid and uses CLONE_FS. > > > > Chrome first launches a setuid helper, and then the setuid helper does > > CLONE_FS. Mateusz's proposal would not impact this usecase. > > > > Mateusz is proposing to block the case where a process first does > > CLONE_FS, and *then* one of the processes sharing the fs_struct does a > > setuid execve(). Linux already downgrades such an execve() to be > > non-setuid, which probably means anyone trying to do this will get > > hard-to-understand problems. Mateusz' proposal would just turn this > > hard-to-debug edgecase, which already doesn't really work, into a > > clean error; I think that is a nice improvement even just from the > > UAPI standpoint. > > > > If this change makes it possible to clean up the kernel code a bit, even better. > > What has brought this to everyone's attention just now? This is > the second mention of this code path I have seen this week. > There is a syzkaller report concerning ->in_exec handling, for example: https://lore.kernel.org/all/67dc67f0.050a0220.25ae54.001f.GAE@google.com/#t > AKA: security/commoncap.c:cap_bprm_creds_from_file(...) > > ... > > /* Don't let someone trace a set[ug]id/setpcap binary with the revised > > * credentials unless they have the appropriate permit. > > * > > * In addition, if NO_NEW_PRIVS, then ensure we get no new privs. > > */ > > is_setid = __is_setuid(new, old) || __is_setgid(new, old); > > > > if ((is_setid || __cap_gained(permitted, new, old)) && > > ((bprm->unsafe & ~LSM_UNSAFE_PTRACE) || > > !ptracer_capable(current, new->user_ns))) { > > /* downgrade; they get no more than they had, and maybe less */ > > if (!ns_capable(new->user_ns, CAP_SETUID) || > > (bprm->unsafe & LSM_UNSAFE_NO_NEW_PRIVS)) { > > new->euid = new->uid; > > new->egid = new->gid; > > } > > new->cap_permitted = cap_intersect(new->cap_permitted, > > old->cap_permitted); > > } > > The actual downgrade is because a ptrace'd executable also takes > this path. > > I have seen it argued rather forcefully that continuing rather than > simply failing seems better in the ptrace case. > > In general I think it can be said this policy is "safe". AKA we don't > let a shared fs struct confuse privileged applications. So nothing > to panic about. > > It looks like most of the lsm's also test bprm->unsafe. > > So I imagine someone could very carefully separate the non-ptrace case > from the ptrace case but *shrug*. > > Perhaps: > > if ((is_setid || __cap_gained(permitted, new_old)) && > ((bprm->unsafe & ~LSM_UNSAFE_PTRACE) || > !ptracer_capable(current, new->user_ns))) { > + if (!(bprm->unsafe & LSM_UNSAFE_PTRACE)) { > + return -EPERM; > + } > /* downgrade; they get no more than they had, and maybe less */ > if (!ns_capable(new->user_ns, CAP_SETUID) || > (bprm->unsafe & LSM_UNSAFE_NO_NEW_PRIVS)) { > new->euid = new->uid; > new->egid = new->gid; > } > new->cap_permitted = cap_intersect(new->cap_permitted, > old->cap_permitted); > } > > If that is what you want that doesn't look to scary. I don't think > it simplifies anything about fs->in_exec. As fs->in_exec is set when > the processing calling exec is the only process that owns the fs_struct. > With fs->in_exec just being a flag that doesn't allow another thread > to call fork and start sharing the fs_struct during exec. > > *Shrug* > > I don't see why anyone would care. It is just a very silly corner case. Well I don't see how ptrace factors into any of this, apart from being a different case of ignoring suid/sgid. I can agree the suid/sgid situation vs CLONE_FS is a silly corner case, but one which needs to be handled for security reasons and which currently has weirdly convoluted code to do it. The intent behind my proposal is very much to get the crapper out of the way in a future-proof and simple manner. In check_unsafe_exec() you can find a nasty loop over threads in the group to find out if the fs struct is used by anyone outside of said group. Since fs struct users are not explicitly tracked and any of them can have different creds than the current thread, the kernel opts to ignore suid/sgid if there are extra users found (for security reasons). The loop depends on no new threads showing up as the list is being walked, to that end copy_fs() can transiently return an error if it spots ->in_exec. The >in_exec field is used as a boolean/flag, but parallel execs using the same fs struct from different thread groups don't look serialized. This is supposed to be fine as in this case ->in_exec is not getting set to begin with, but it gets unconditionally unset on all execs. And so on. It's all weird af. Initially I was thinking about serializing all execs using a given fs_struct to bypass majority of the fuckery, but that's some churn to add and it still leaves possible breakage down the road -- should this unsafe sharing detection ever become racing nobody will find out until the bad guys have their turn with it. While unconditional unsharing turns out to be a no-go because of chrome, one can still do postpone detection until after the caller is single-threaded. By that time, if this is only the that thread and fs_struct has ->users == 1, we know there is nobody sharing the struct or racing to add a ref to it. This allows treating ->users as a regular refcount, removes the weird loop over threads and removes the (at best misleading) ->in_exec field. With this in place it becomes trivial to also *deny* suid/sgid exec instead of trying to placate it. If you are sharing fs and are execing a binary in the first place, things are a little fishy. But if you are execing a suid/sgid, the kernel has to ignore the bit so either you are doing something wrong or are trying to exploit a bug. In order to sort this crapper out, I think one can start with a runtime tunable and a once-per-boot printk stating it denied such an exec (and stating how to bring it back). To be removed some time after hitting LTS perhaps. -- Mateusz Guzik <mjguzik gmail.com> ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 1/2] fs/exec: Explicitly unshare fs_struct on exec 2025-05-14 0:03 ` Mateusz Guzik @ 2025-05-14 15:33 ` Eric W. Biederman 2025-05-14 15:42 ` Kees Cook 1 sibling, 0 replies; 25+ messages in thread From: Eric W. Biederman @ 2025-05-14 15:33 UTC (permalink / raw) To: Mateusz Guzik Cc: Jann Horn, Kees Cook, Kees Cook, Christian Brauner, Jorge Merlino, Alexander Viro, Thomas Gleixner, Andy Lutomirski, Sebastian Andrzej Siewior, Andrew Morton, linux-mm, linux-fsdevel, John Johansen, Paul Moore, James Morris, Serge E. Hallyn, Stephen Smalley, Eric Paris, Richard Haines, Casey Schaufler, Xin Long, David S. Miller, Todd Kjos, Ondrej Mosnacek, Prashanth Prahlad, Micah Morton, Fenghua Yu, Andrei Vagin, linux-kernel, apparmor, linux-security-module, selinux, linux-hardening, oleg Mateusz Guzik <mjguzik@gmail.com> writes: > On Wed, May 14, 2025 at 12:17 AM Eric W. Biederman > <ebiederm@xmission.com> wrote: >> >> Jann Horn <jannh@google.com> writes: >> >> > On Tue, May 13, 2025 at 10:57 PM Kees Cook <kees@kernel.org> wrote: >> >> On May 13, 2025 6:05:45 AM PDT, Mateusz Guzik <mjguzik@gmail.com> wrote: >> >> >Here is my proposal: *deny* exec of suid/sgid binaries if fs_struct is >> >> >shared. This will have to be checked for after the execing proc becomes >> >> >single-threaded ofc. >> >> >> >> Unfortunately the above Chrome helper is setuid and uses CLONE_FS. >> > >> > Chrome first launches a setuid helper, and then the setuid helper does >> > CLONE_FS. Mateusz's proposal would not impact this usecase. >> > >> > Mateusz is proposing to block the case where a process first does >> > CLONE_FS, and *then* one of the processes sharing the fs_struct does a >> > setuid execve(). Linux already downgrades such an execve() to be >> > non-setuid, which probably means anyone trying to do this will get >> > hard-to-understand problems. Mateusz' proposal would just turn this >> > hard-to-debug edgecase, which already doesn't really work, into a >> > clean error; I think that is a nice improvement even just from the >> > UAPI standpoint. >> > >> > If this change makes it possible to clean up the kernel code a bit, even better. >> >> What has brought this to everyone's attention just now? This is >> the second mention of this code path I have seen this week. >> > > There is a syzkaller report concerning ->in_exec handling, for example: > https://lore.kernel.org/all/67dc67f0.050a0220.25ae54.001f.GAE@google.com/#t Thanks that grounds the conversation. I am trying to think how there could be a real race (and not a theoretical race) between copy_fs and bprm_execve. Unless I am missing something the write in bprm_execve to current->fs->in_exec happens after the point of no-return and de_thread. Unless the exec fails without getting to de_thread, or we are in the multi-threaded case. So I think it would be sensible to do something like: /* Clear in_exec if we set it */ if (!(bprm->unsafe & LSM_UNSAFE_SHARE)) { spin_lock(¤t->fs->lock); current->fs->in_exec = 0; spin_unlock(¤t->fs->lock); } I expect that will cause KCSAN to be quiet as well as fixing any real races that might happen when exec fails. In all but the most extreme cases there should be no contention on the lock so I don't see a downside of taking the fs->lock. >> AKA: security/commoncap.c:cap_bprm_creds_from_file(...) >> > ... >> > /* Don't let someone trace a set[ug]id/setpcap binary with the revised >> > * credentials unless they have the appropriate permit. >> > * >> > * In addition, if NO_NEW_PRIVS, then ensure we get no new privs. >> > */ >> > is_setid = __is_setuid(new, old) || __is_setgid(new, old); >> > >> > if ((is_setid || __cap_gained(permitted, new, old)) && >> > ((bprm->unsafe & ~LSM_UNSAFE_PTRACE) || >> > !ptracer_capable(current, new->user_ns))) { >> > /* downgrade; they get no more than they had, and maybe less */ >> > if (!ns_capable(new->user_ns, CAP_SETUID) || >> > (bprm->unsafe & LSM_UNSAFE_NO_NEW_PRIVS)) { >> > new->euid = new->uid; >> > new->egid = new->gid; >> > } >> > new->cap_permitted = cap_intersect(new->cap_permitted, >> > old->cap_permitted); >> > } >> >> The actual downgrade is because a ptrace'd executable also takes >> this path. >> >> I have seen it argued rather forcefully that continuing rather than >> simply failing seems better in the ptrace case. >> >> In general I think it can be said this policy is "safe". AKA we don't >> let a shared fs struct confuse privileged applications. So nothing >> to panic about. >> >> It looks like most of the lsm's also test bprm->unsafe. >> >> So I imagine someone could very carefully separate the non-ptrace case >> from the ptrace case but *shrug*. >> >> Perhaps: >> >> if ((is_setid || __cap_gained(permitted, new_old)) && >> ((bprm->unsafe & ~LSM_UNSAFE_PTRACE) || >> !ptracer_capable(current, new->user_ns))) { >> + if (!(bprm->unsafe & LSM_UNSAFE_PTRACE)) { >> + return -EPERM; >> + } >> /* downgrade; they get no more than they had, and maybe less */ >> if (!ns_capable(new->user_ns, CAP_SETUID) || >> (bprm->unsafe & LSM_UNSAFE_NO_NEW_PRIVS)) { >> new->euid = new->uid; >> new->egid = new->gid; >> } >> new->cap_permitted = cap_intersect(new->cap_permitted, >> old->cap_permitted); >> } >> >> If that is what you want that doesn't look to scary. I don't think >> it simplifies anything about fs->in_exec. As fs->in_exec is set when >> the processing calling exec is the only process that owns the fs_struct. >> With fs->in_exec just being a flag that doesn't allow another thread >> to call fork and start sharing the fs_struct during exec. >> >> *Shrug* >> >> I don't see why anyone would care. It is just a very silly corner case. > > Well I don't see how ptrace factors into any of this, apart from being > a different case of ignoring suid/sgid. It is simple. Ptrace is the case for ignoring suid/sgid, and since that is the only case that is likely to happen in practice, and the only case people actually care about everything all of the other cases just got lumped on to ptrace case. > I can agree the suid/sgid situation vs CLONE_FS is a silly corner > case, but one which needs to be handled for security reasons and which > currently has weirdly convoluted code to do it. > > The intent behind my proposal is very much to get the crapper out of > the way in a future-proof and simple manner. > > In check_unsafe_exec() you can find a nasty loop over threads in the > group to find out if the fs struct is used by anyone outside of said > group. Since fs struct users are not explicitly tracked and any of > them can have different creds than the current thread, the kernel opts > to ignore suid/sgid if there are extra users found (for security > reasons). The loop depends on no new threads showing up as the list is > being walked, to that end copy_fs() can transiently return an error if > it spots ->in_exec. > > The >in_exec field is used as a boolean/flag, but parallel execs using > the same fs struct from different thread groups don't look serialized. > This is supposed to be fine as in this case ->in_exec is not getting > set to begin with, but it gets unconditionally unset on all execs. > > And so on. It's all weird af. > > Initially I was thinking about serializing all execs using a given > fs_struct to bypass majority of the fuckery, but that's some churn to > add and it still leaves possible breakage down the road -- should this > unsafe sharing detection ever become racing nobody will find out until > the bad guys have their turn with it. > > While unconditional unsharing turns out to be a no-go because of > chrome, one can still do postpone detection until after the caller is > single-threaded. By that time, if this is only the that thread and > fs_struct has ->users == 1, we know there is nobody sharing the struct > or racing to add a ref to it. This allows treating ->users as a > regular refcount, removes the weird loop over threads and removes the > (at best misleading) ->in_exec field. The problem is that after becoming single threaded there is no way to fail the exec. The process calling exec must be terminated, or you drop privs as we currently do. That terminating the process that calls exec absolutely sucks for debug-ability, plus you are changing where the work happens which means auditing all of the security hooks, which is pain. So I really don't think we want to perform the test after de_thread. > With this in place it becomes trivial to also *deny* suid/sgid exec > instead of trying to placate it. If you are sharing fs and are execing > a binary in the first place, things are a little fishy. But if you are > execing a suid/sgid, the kernel has to ignore the bit so either you > are doing something wrong or are trying to exploit a bug. In order to > sort this crapper out, I think one can start with a runtime tunable > and a once-per-boot printk stating it denied such an exec (and stating > how to bring it back). To be removed some time after hitting LTS > perhaps. There is a potential cleanup here that is worth exploring. Not allowing clone(CLONE_THREAD | CLONE_FS). AKA forcing all threads to share an fs_struct. If we can do that then the reference to fs_struct can live in signal_struct (potentially duplicated in current for performance). Still that doesn't let us remove fs->in_exec. Hmm.. The real issue is in_exec, and the convolutions it requires to maintain it. The simplest thing I can think of that doesn't meaningfully change user-space behavior is to make any other calls to clone/fork and execve fail with -EAGAIN, while our current execve is running. In practice we already do that with clone/fork in copy_fs so we don't need to worry about that case introducing regressions. I can't see a meaningful use for racing calls of execve, so that should not be a big deal. That prevents any races with clearing the state if only one thread of a process can be in execve at a time. It would allow replacing both current->in_execve and current->fs->in_exec. We would probably use the sighand lock to protect whatever flag we use in signal_struct. We can't use signal->flags as we are going to want to keep this separate from the de_thread and group exit logic. .... Those are the directions I would suggest exploring if you want simplicity. Eric ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 1/2] fs/exec: Explicitly unshare fs_struct on exec 2025-05-14 0:03 ` Mateusz Guzik 2025-05-14 15:33 ` Eric W. Biederman @ 2025-05-14 15:42 ` Kees Cook 2025-05-15 16:48 ` Eric W. Biederman 1 sibling, 1 reply; 25+ messages in thread From: Kees Cook @ 2025-05-14 15:42 UTC (permalink / raw) To: Mateusz Guzik Cc: Eric W. Biederman, Jann Horn, Christian Brauner, Jorge Merlino, Alexander Viro, Thomas Gleixner, Andy Lutomirski, Sebastian Andrzej Siewior, Andrew Morton, linux-mm, linux-fsdevel, John Johansen, Paul Moore, James Morris, Serge E. Hallyn, Stephen Smalley, Eric Paris, Richard Haines, Casey Schaufler, Xin Long, David S. Miller, Todd Kjos, Ondrej Mosnacek, Prashanth Prahlad, Micah Morton, Fenghua Yu, Andrei Vagin, linux-kernel, apparmor, linux-security-module, selinux, linux-hardening, oleg On Wed, May 14, 2025 at 02:03:31AM +0200, Mateusz Guzik wrote: > On Wed, May 14, 2025 at 12:17 AM Eric W. Biederman > <ebiederm@xmission.com> wrote: > > > > Jann Horn <jannh@google.com> writes: > > > > > On Tue, May 13, 2025 at 10:57 PM Kees Cook <kees@kernel.org> wrote: > > >> On May 13, 2025 6:05:45 AM PDT, Mateusz Guzik <mjguzik@gmail.com> wrote: > > >> >Here is my proposal: *deny* exec of suid/sgid binaries if fs_struct is > > >> >shared. This will have to be checked for after the execing proc becomes > > >> >single-threaded ofc. > > >> > > >> Unfortunately the above Chrome helper is setuid and uses CLONE_FS. > > > > > > Chrome first launches a setuid helper, and then the setuid helper does > > > CLONE_FS. Mateusz's proposal would not impact this usecase. > > > > > > Mateusz is proposing to block the case where a process first does > > > CLONE_FS, and *then* one of the processes sharing the fs_struct does a > > > setuid execve(). Linux already downgrades such an execve() to be > > > non-setuid, which probably means anyone trying to do this will get > > > hard-to-understand problems. Mateusz' proposal would just turn this > > > hard-to-debug edgecase, which already doesn't really work, into a > > > clean error; I think that is a nice improvement even just from the > > > UAPI standpoint. > > > > > > If this change makes it possible to clean up the kernel code a bit, even better. > > > > What has brought this to everyone's attention just now? This is > > the second mention of this code path I have seen this week. > > > > There is a syzkaller report concerning ->in_exec handling, for example: > https://lore.kernel.org/all/67dc67f0.050a0220.25ae54.001f.GAE@google.com/#t > > [...] > > It looks like most of the lsm's also test bprm->unsafe. > > > > So I imagine someone could very carefully separate the non-ptrace case > > from the ptrace case but *shrug*. > > > > Perhaps: > > > > if ((is_setid || __cap_gained(permitted, new_old)) && > > ((bprm->unsafe & ~LSM_UNSAFE_PTRACE) || > > !ptracer_capable(current, new->user_ns))) { > > + if (!(bprm->unsafe & LSM_UNSAFE_PTRACE)) { > > + return -EPERM; > > + } > > /* downgrade; they get no more than they had, and maybe less */ > > if (!ns_capable(new->user_ns, CAP_SETUID) || > > (bprm->unsafe & LSM_UNSAFE_NO_NEW_PRIVS)) { > > new->euid = new->uid; > > new->egid = new->gid; > > } > > new->cap_permitted = cap_intersect(new->cap_permitted, > > old->cap_permitted); > > } > > > > If that is what you want that doesn't look to scary. I don't think > > it simplifies anything about fs->in_exec. As fs->in_exec is set when > > the processing calling exec is the only process that owns the fs_struct. > > With fs->in_exec just being a flag that doesn't allow another thread > > to call fork and start sharing the fs_struct during exec. > > > > *Shrug* > > > > I don't see why anyone would care. It is just a very silly corner case. > > Well I don't see how ptrace factors into any of this, apart from being > a different case of ignoring suid/sgid. I actually think we might want to expand the above bit of logic to use an explicit tests of each LSM_UNSAFE case -- the merged logic is very difficult to read currently. Totally untested expansion, if I'm reading everything correctly: if (bprm->unsafe && (is_setid || __cap_gained(permitted, new_old))) { bool limit_caps = false; bool strip_eid = false; unsigned int unsafe = bprm->unsafe; /* Check each bit */ if (unsafe & LSM_UNSAFE_PTRACE) { if (!ptracer_capable(current, new->user_ns)) limit_caps = true; unsafe &= ~LSM_UNSAFE_PTRACE; } if (unsafe & LSM_UNSAFE_SHARE) { limit_caps = true; if (!ns_capable(new->user_ns, CAP_SETUID)) strip_eid = true; unsafe &= ~LSM_UNSAFE_SHARE; } if (unsafe & LSM_UNSAFE_NO_NEW_PRIVS) { limit_caps = true; if (!ns_capable(new->user_ns, CAP_SETUID)) strip_eid = true; unsafe &= ~LSM_UNSAFE_NO_NEW_PRIVS; } if (WARN(unsafe, "Unhandled LSM_UNSAFE flag: %u?!\n", unsafe)) return -EINVAL; if (limit_caps) { new->cap_permitted = cap_intersect(new->cap_permitted, old->cap_permitted); } if (strip_eid) { new->euid = new->uid; new->egid = new->gid; } } > I can agree the suid/sgid situation vs CLONE_FS is a silly corner > case, but one which needs to be handled for security reasons and which > currently has weirdly convoluted code to do it. > > The intent behind my proposal is very much to get the crapper out of > the way in a future-proof and simple manner. > > In check_unsafe_exec() you can find a nasty loop over threads in the > group to find out if the fs struct is used by anyone outside of said > group. Since fs struct users are not explicitly tracked and any of > them can have different creds than the current thread, the kernel opts > to ignore suid/sgid if there are extra users found (for security > reasons). The loop depends on no new threads showing up as the list is > being walked, to that end copy_fs() can transiently return an error if > it spots ->in_exec. > > The >in_exec field is used as a boolean/flag, but parallel execs using > the same fs struct from different thread groups don't look serialized. > This is supposed to be fine as in this case ->in_exec is not getting > set to begin with, but it gets unconditionally unset on all execs. > > And so on. It's all weird af. 100% :) -- Kees Cook ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 1/2] fs/exec: Explicitly unshare fs_struct on exec 2025-05-14 15:42 ` Kees Cook @ 2025-05-15 16:48 ` Eric W. Biederman 0 siblings, 0 replies; 25+ messages in thread From: Eric W. Biederman @ 2025-05-15 16:48 UTC (permalink / raw) To: Kees Cook Cc: Mateusz Guzik, Jann Horn, Christian Brauner, Jorge Merlino, Alexander Viro, Thomas Gleixner, Andy Lutomirski, Sebastian Andrzej Siewior, Andrew Morton, linux-mm, linux-fsdevel, John Johansen, Paul Moore, James Morris, Serge E. Hallyn, Stephen Smalley, Eric Paris, Richard Haines, Casey Schaufler, Xin Long, David S. Miller, Todd Kjos, Ondrej Mosnacek, Prashanth Prahlad, Micah Morton, Fenghua Yu, Andrei Vagin, linux-kernel, apparmor, linux-security-module, selinux, linux-hardening, oleg Kees Cook <kees@kernel.org> writes: > On Wed, May 14, 2025 at 02:03:31AM +0200, Mateusz Guzik wrote: >> On Wed, May 14, 2025 at 12:17 AM Eric W. Biederman >> <ebiederm@xmission.com> wrote: >> > >> > Jann Horn <jannh@google.com> writes: >> > >> > > On Tue, May 13, 2025 at 10:57 PM Kees Cook <kees@kernel.org> wrote: >> > >> On May 13, 2025 6:05:45 AM PDT, Mateusz Guzik <mjguzik@gmail.com> wrote: >> > >> >Here is my proposal: *deny* exec of suid/sgid binaries if fs_struct is >> > >> >shared. This will have to be checked for after the execing proc becomes >> > >> >single-threaded ofc. >> > >> >> > >> Unfortunately the above Chrome helper is setuid and uses CLONE_FS. >> > > >> > > Chrome first launches a setuid helper, and then the setuid helper does >> > > CLONE_FS. Mateusz's proposal would not impact this usecase. >> > > >> > > Mateusz is proposing to block the case where a process first does >> > > CLONE_FS, and *then* one of the processes sharing the fs_struct does a >> > > setuid execve(). Linux already downgrades such an execve() to be >> > > non-setuid, which probably means anyone trying to do this will get >> > > hard-to-understand problems. Mateusz' proposal would just turn this >> > > hard-to-debug edgecase, which already doesn't really work, into a >> > > clean error; I think that is a nice improvement even just from the >> > > UAPI standpoint. >> > > >> > > If this change makes it possible to clean up the kernel code a bit, even better. >> > >> > What has brought this to everyone's attention just now? This is >> > the second mention of this code path I have seen this week. >> > >> >> There is a syzkaller report concerning ->in_exec handling, for example: >> https://lore.kernel.org/all/67dc67f0.050a0220.25ae54.001f.GAE@google.com/#t >> >> [...] >> > It looks like most of the lsm's also test bprm->unsafe. >> > >> > So I imagine someone could very carefully separate the non-ptrace case >> > from the ptrace case but *shrug*. >> > >> > Perhaps: >> > >> > if ((is_setid || __cap_gained(permitted, new_old)) && >> > ((bprm->unsafe & ~LSM_UNSAFE_PTRACE) || >> > !ptracer_capable(current, new->user_ns))) { >> > + if (!(bprm->unsafe & LSM_UNSAFE_PTRACE)) { >> > + return -EPERM; >> > + } >> > /* downgrade; they get no more than they had, and maybe less */ >> > if (!ns_capable(new->user_ns, CAP_SETUID) || >> > (bprm->unsafe & LSM_UNSAFE_NO_NEW_PRIVS)) { >> > new->euid = new->uid; >> > new->egid = new->gid; >> > } >> > new->cap_permitted = cap_intersect(new->cap_permitted, >> > old->cap_permitted); >> > } >> > >> > If that is what you want that doesn't look to scary. I don't think >> > it simplifies anything about fs->in_exec. As fs->in_exec is set when >> > the processing calling exec is the only process that owns the fs_struct. >> > With fs->in_exec just being a flag that doesn't allow another thread >> > to call fork and start sharing the fs_struct during exec. >> > >> > *Shrug* >> > >> > I don't see why anyone would care. It is just a very silly corner case. >> >> Well I don't see how ptrace factors into any of this, apart from being >> a different case of ignoring suid/sgid. > > I actually think we might want to expand the above bit of logic to use > an explicit tests of each LSM_UNSAFE case -- the merged > logic is very difficult to read currently. Totally untested expansion, > if I'm reading everything correctly: > > if (bprm->unsafe && > (is_setid || __cap_gained(permitted, new_old))) { > bool limit_caps = false; > bool strip_eid = false; > unsigned int unsafe = bprm->unsafe; > /* Check each bit */ > > if (unsafe & LSM_UNSAFE_PTRACE) { > if (!ptracer_capable(current, new->user_ns)) > limit_caps = true; strip_eid = true; You missed the euid stripping there. > unsafe &= ~LSM_UNSAFE_PTRACE; > } > if (unsafe & LSM_UNSAFE_SHARE) { > limit_caps = true; > if (!ns_capable(new->user_ns, CAP_SETUID)) > strip_eid = true; > unsafe &= ~LSM_UNSAFE_SHARE; > } > if (unsafe & LSM_UNSAFE_NO_NEW_PRIVS) { > limit_caps = true; > if (!ns_capable(new->user_ns, CAP_SETUID)) > strip_eid = true; > unsafe &= ~LSM_UNSAFE_NO_NEW_PRIVS; > } > > if (WARN(unsafe, "Unhandled LSM_UNSAFE flag: %u?!\n", unsafe)) > return -EINVAL; > > if (limit_caps) { > new->cap_permitted = cap_intersect(new->cap_permitted, > old->cap_permitted); > } > if (strip_eid) { > new->euid = new->uid; > new->egid = new->gid; > } > } I think I would simplify this all to: if ((id_changed || __cap_gained(permitted, new, old)) && !ptracer_capable(current->new_user_ns)) { if (!ns_capable(new->user_ns, CAP_SETUID)) { new->euid = old->euid; new->egid = old->egid; } new->cap_permitted = cap_intersect(new->cap_permitted, old->cap_permitted); } if ((id_changed || __cap_gained(permitted, new, old)) && (bprm->unsafe & ~LSM_UNSAFE_PTRACE)) { return -EPERM; } The code of no_new_privs doesn't prevent capset so ignoring the results of ns_capable when NO_NEW_PRIVS is set doesn't make sense. If we are going to do anything please let's find ways to understand what is happening and simplify this code, not add to it's complexity. Eric ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 1/2] fs/exec: Explicitly unshare fs_struct on exec 2025-05-13 21:09 ` Jann Horn 2025-05-13 22:16 ` Eric W. Biederman @ 2025-05-13 23:15 ` Kees Cook 1 sibling, 0 replies; 25+ messages in thread From: Kees Cook @ 2025-05-13 23:15 UTC (permalink / raw) To: Jann Horn Cc: Mateusz Guzik, Kees Cook, Christian Brauner, Eric Biederman, Jorge Merlino, Alexander Viro, Thomas Gleixner, Andy Lutomirski, Sebastian Andrzej Siewior, Andrew Morton, linux-mm, linux-fsdevel, John Johansen, Paul Moore, James Morris, Serge E. Hallyn, Stephen Smalley, Eric Paris, Richard Haines, Casey Schaufler, Xin Long, David S. Miller, Todd Kjos, Ondrej Mosnacek, Prashanth Prahlad, Micah Morton, Fenghua Yu, Andrei Vagin, linux-kernel, apparmor, linux-security-module, selinux, linux-hardening, oleg On May 13, 2025 2:09:48 PM PDT, Jann Horn <jannh@google.com> wrote: >On Tue, May 13, 2025 at 10:57 PM Kees Cook <kees@kernel.org> wrote: >> On May 13, 2025 6:05:45 AM PDT, Mateusz Guzik <mjguzik@gmail.com> wrote: >> >Here is my proposal: *deny* exec of suid/sgid binaries if fs_struct is >> >shared. This will have to be checked for after the execing proc becomes >> >single-threaded ofc. >> >> Unfortunately the above Chrome helper is setuid and uses CLONE_FS. > >Chrome first launches a setuid helper, and then the setuid helper does >CLONE_FS. Mateusz's proposal would not impact this usecase. > >Mateusz is proposing to block the case where a process first does >CLONE_FS, and *then* one of the processes sharing the fs_struct does a >setuid execve(). Linux already downgrades such an execve() to be >non-setuid, which probably means anyone trying to do this will get >hard-to-understand problems. Mateusz' proposal would just turn this >hard-to-debug edgecase, which already doesn't really work, into a >clean error; I think that is a nice improvement even just from the >UAPI standpoint. > >If this change makes it possible to clean up the kernel code a bit, even better. Ah! Okay, I appreciate the clarification. :) I'm game to try making it an error instead of silent downgrading. -Kees -- Kees Cook -- Kees Cook ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 1/2] fs/exec: Explicitly unshare fs_struct on exec 2022-10-06 14:13 ` Jann Horn 2022-10-06 15:25 ` Kees Cook @ 2022-10-14 3:18 ` Andy Lutomirski 2022-10-14 3:54 ` Kees Cook ` (2 more replies) 1 sibling, 3 replies; 25+ messages in thread From: Andy Lutomirski @ 2022-10-14 3:18 UTC (permalink / raw) To: Jann Horn, Christian Brauner Cc: Kees Cook, Eric W. Biederman, Jorge Merlino, Al Viro, Thomas Gleixner, Sebastian Andrzej Siewior, Andrew Morton, linux-mm, linux-fsdevel, John Johansen, Paul Moore, James Morris, Serge E. Hallyn, Stephen Smalley, Eric Paris, Richard Haines, Casey Schaufler, Xin Long, David S. Miller, Todd Kjos, Ondrej Mosnacek, Prashanth Prahlad, Micah Morton, Fenghua Yu, Andrei Vagin, Linux Kernel Mailing List, apparmor, linux-security-module, selinux, linux-hardening On Thu, Oct 6, 2022, at 7:13 AM, Jann Horn wrote: > On Thu, Oct 6, 2022 at 11:05 AM Christian Brauner <brauner@kernel.org> wrote: >> On Thu, Oct 06, 2022 at 01:27:34AM -0700, Kees Cook wrote: >> > The check_unsafe_exec() counting of n_fs would not add up under a heavily >> > threaded process trying to perform a suid exec, causing the suid portion >> > to fail. This counting error appears to be unneeded, but to catch any >> > possible conditions, explicitly unshare fs_struct on exec, if it ends up >> >> Isn't this a potential uapi break? Afaict, before this change a call to >> clone{3}(CLONE_FS) followed by an exec in the child would have the >> parent and child share fs information. So if the child e.g., changes the >> working directory post exec it would also affect the parent. But after >> this change here this would no longer be true. So a child changing a >> workding directoro would not affect the parent anymore. IOW, an exec is >> accompanied by an unshare(CLONE_FS). Might still be worth trying ofc but >> it seems like a non-trivial uapi change but there might be few users >> that do clone{3}(CLONE_FS) followed by an exec. > > I believe the following code in Chromium explicitly relies on this > behavior, but I'm not sure whether this code is in active use anymore: > > https://source.chromium.org/chromium/chromium/src/+/main:sandbox/linux/suid/sandbox.c;l=101?q=CLONE_FS&sq=&ss=chromium Wait, this is absolutely nucking futs. On a very quick inspection, the sharable things like this are fs, files, sighand, and io. files and sighand get unshared, which makes sense. fs supposedly checks for extra refs and prevents gaining privilege. io is... ignored! At least it's not immediately obvious that io is a problem. But seriously, this makes no sense at all. It should not be possible to exec a program and then, without ptrace, change its cwd out from under it. Do we really need to preserve this behavior? --Andy ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 1/2] fs/exec: Explicitly unshare fs_struct on exec 2022-10-14 3:18 ` Andy Lutomirski @ 2022-10-14 3:54 ` Kees Cook 2022-10-14 15:35 ` Jann Horn 2022-10-14 22:03 ` David Laight 2 siblings, 0 replies; 25+ messages in thread From: Kees Cook @ 2022-10-14 3:54 UTC (permalink / raw) To: Andy Lutomirski Cc: Jann Horn, Christian Brauner, Eric W. Biederman, Jorge Merlino, Al Viro, Thomas Gleixner, Sebastian Andrzej Siewior, Andrew Morton, linux-mm, linux-fsdevel, John Johansen, Paul Moore, James Morris, Serge E. Hallyn, Stephen Smalley, Eric Paris, Richard Haines, Casey Schaufler, Xin Long, David S. Miller, Todd Kjos, Ondrej Mosnacek, Prashanth Prahlad, Micah Morton, Fenghua Yu, Andrei Vagin, Linux Kernel Mailing List, apparmor, linux-security-module, selinux, linux-hardening On Thu, Oct 13, 2022 at 08:18:04PM -0700, Andy Lutomirski wrote: > But seriously, this makes no sense at all. It should not be possible to exec a program and then, without ptrace, change its cwd out from under it. Do we really need to preserve this behavior? Yup, already abandoned: https://lore.kernel.org/lkml/202210061301.207A20C8E5@keescook/ -- Kees Cook ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 1/2] fs/exec: Explicitly unshare fs_struct on exec 2022-10-14 3:18 ` Andy Lutomirski 2022-10-14 3:54 ` Kees Cook @ 2022-10-14 15:35 ` Jann Horn 2022-10-18 7:09 ` Kees Cook 2022-10-14 22:03 ` David Laight 2 siblings, 1 reply; 25+ messages in thread From: Jann Horn @ 2022-10-14 15:35 UTC (permalink / raw) To: Andy Lutomirski Cc: Christian Brauner, Kees Cook, Eric W. Biederman, Jorge Merlino, Al Viro, Thomas Gleixner, Sebastian Andrzej Siewior, Andrew Morton, linux-mm, linux-fsdevel, John Johansen, Paul Moore, James Morris, Serge E. Hallyn, Stephen Smalley, Eric Paris, Richard Haines, Casey Schaufler, Xin Long, David S. Miller, Todd Kjos, Ondrej Mosnacek, Prashanth Prahlad, Micah Morton, Fenghua Yu, Andrei Vagin, Linux Kernel Mailing List, apparmor, linux-security-module, selinux, linux-hardening On Fri, Oct 14, 2022 at 5:18 AM Andy Lutomirski <luto@kernel.org> wrote: > On Thu, Oct 6, 2022, at 7:13 AM, Jann Horn wrote: > > On Thu, Oct 6, 2022 at 11:05 AM Christian Brauner <brauner@kernel.org> wrote: > >> On Thu, Oct 06, 2022 at 01:27:34AM -0700, Kees Cook wrote: > >> > The check_unsafe_exec() counting of n_fs would not add up under a heavily > >> > threaded process trying to perform a suid exec, causing the suid portion > >> > to fail. This counting error appears to be unneeded, but to catch any > >> > possible conditions, explicitly unshare fs_struct on exec, if it ends up > >> > >> Isn't this a potential uapi break? Afaict, before this change a call to > >> clone{3}(CLONE_FS) followed by an exec in the child would have the > >> parent and child share fs information. So if the child e.g., changes the > >> working directory post exec it would also affect the parent. But after > >> this change here this would no longer be true. So a child changing a > >> workding directoro would not affect the parent anymore. IOW, an exec is > >> accompanied by an unshare(CLONE_FS). Might still be worth trying ofc but > >> it seems like a non-trivial uapi change but there might be few users > >> that do clone{3}(CLONE_FS) followed by an exec. > > > > I believe the following code in Chromium explicitly relies on this > > behavior, but I'm not sure whether this code is in active use anymore: > > > > https://source.chromium.org/chromium/chromium/src/+/main:sandbox/linux/suid/sandbox.c;l=101?q=CLONE_FS&sq=&ss=chromium > > Wait, this is absolutely nucking futs. On a very quick inspection, the sharable things like this are fs, files, sighand, and io. files and sighand get unshared, which makes sense. fs supposedly checks for extra refs and prevents gaining privilege. io is... ignored! At least it's not immediately obvious that io is a problem. > > But seriously, this makes no sense at all. It should not be possible to exec a program and then, without ptrace, change its cwd out from under it. Do we really need to preserve this behavior? I agree that this is pretty wild. The single user I'm aware of is Chrome, and as far as I know, they use it for establishing their sandbox on systems where unprivileged user namespaces are disabled - see <https://chromium.googlesource.com/chromium/src/+/main/docs/linux/suid_sandbox.md>. They also have seccomp-based sandboxing, but IIRC there are some small holes that mean it's still useful for them to be able to set up namespaces, like how sendmsg() on a unix domain socket can specify a file path as the destination address. (By the way, I think maybe Chrome wouldn't need this wacky trick with the shared fs_struct if the "NO_NEW_PRIVS permits chroot()" thing had ever landed that you (https://lore.kernel.org/lkml/0e2f0f54e19bff53a3739ecfddb4ffa9a6dbde4d.1327858005.git.luto@amacapital.net/) and Mickaël Salaün proposed in the past... or alternatively, if there was a way to properly filter all the syscalls that Chrome has to permit for renderers.) (But also, to be clear, I don't speak for Chrome, this is just my understanding of how their stuff works.) ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 1/2] fs/exec: Explicitly unshare fs_struct on exec 2022-10-14 15:35 ` Jann Horn @ 2022-10-18 7:09 ` Kees Cook 2022-10-18 11:19 ` Jann Horn 0 siblings, 1 reply; 25+ messages in thread From: Kees Cook @ 2022-10-18 7:09 UTC (permalink / raw) To: Jann Horn Cc: Andy Lutomirski, Christian Brauner, Eric W. Biederman, Jorge Merlino, Al Viro, Thomas Gleixner, Sebastian Andrzej Siewior, Andrew Morton, linux-mm, linux-fsdevel, John Johansen, Paul Moore, James Morris, Serge E. Hallyn, Stephen Smalley, Eric Paris, Richard Haines, Casey Schaufler, Xin Long, David S. Miller, Todd Kjos, Ondrej Mosnacek, Prashanth Prahlad, Micah Morton, Fenghua Yu, Andrei Vagin, Linux Kernel Mailing List, apparmor, linux-security-module, selinux, linux-hardening On Fri, Oct 14, 2022 at 05:35:26PM +0200, Jann Horn wrote: > On Fri, Oct 14, 2022 at 5:18 AM Andy Lutomirski <luto@kernel.org> wrote: > > On Thu, Oct 6, 2022, at 7:13 AM, Jann Horn wrote: > > > On Thu, Oct 6, 2022 at 11:05 AM Christian Brauner <brauner@kernel.org> wrote: > > >> On Thu, Oct 06, 2022 at 01:27:34AM -0700, Kees Cook wrote: > > >> > The check_unsafe_exec() counting of n_fs would not add up under a heavily > > >> > threaded process trying to perform a suid exec, causing the suid portion > > >> > to fail. This counting error appears to be unneeded, but to catch any > > >> > possible conditions, explicitly unshare fs_struct on exec, if it ends up > > >> > > >> Isn't this a potential uapi break? Afaict, before this change a call to > > >> clone{3}(CLONE_FS) followed by an exec in the child would have the > > >> parent and child share fs information. So if the child e.g., changes the > > >> working directory post exec it would also affect the parent. But after > > >> this change here this would no longer be true. So a child changing a > > >> workding directoro would not affect the parent anymore. IOW, an exec is > > >> accompanied by an unshare(CLONE_FS). Might still be worth trying ofc but > > >> it seems like a non-trivial uapi change but there might be few users > > >> that do clone{3}(CLONE_FS) followed by an exec. > > > > > > I believe the following code in Chromium explicitly relies on this > > > behavior, but I'm not sure whether this code is in active use anymore: > > > > > > https://source.chromium.org/chromium/chromium/src/+/main:sandbox/linux/suid/sandbox.c;l=101?q=CLONE_FS&sq=&ss=chromium > > > > Wait, this is absolutely nucking futs. On a very quick inspection, the sharable things like this are fs, files, sighand, and io. files and sighand get unshared, which makes sense. fs supposedly checks for extra refs and prevents gaining privilege. io is... ignored! At least it's not immediately obvious that io is a problem. > > > > But seriously, this makes no sense at all. It should not be possible to exec a program and then, without ptrace, change its cwd out from under it. Do we really need to preserve this behavior? > > I agree that this is pretty wild. > > The single user I'm aware of is Chrome, and as far as I know, they use > it for establishing their sandbox on systems where unprivileged user > namespaces are disabled - see > <https://chromium.googlesource.com/chromium/src/+/main/docs/linux/suid_sandbox.md>. > They also have seccomp-based sandboxing, but IIRC there are some small > holes that mean it's still useful for them to be able to set up > namespaces, like how sendmsg() on a unix domain socket can specify a > file path as the destination address. > > (By the way, I think maybe Chrome wouldn't need this wacky trick with > the shared fs_struct if the "NO_NEW_PRIVS permits chroot()" thing had > ever landed that you > (https://lore.kernel.org/lkml/0e2f0f54e19bff53a3739ecfddb4ffa9a6dbde4d.1327858005.git.luto@amacapital.net/) > and Mickaël Salaün proposed in the past... or alternatively, if there > was a way to properly filter all the syscalls that Chrome has to > permit for renderers.) > > (But also, to be clear, I don't speak for Chrome, this is just my > understanding of how their stuff works.) Chrome seems to just want a totally empty filesystem view, yes? Let's land the nnp+chroot change. :P Only 10 years late! Then we can have Chrome use this and we can unshare fs on exec... -- Kees Cook ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 1/2] fs/exec: Explicitly unshare fs_struct on exec 2022-10-18 7:09 ` Kees Cook @ 2022-10-18 11:19 ` Jann Horn 0 siblings, 0 replies; 25+ messages in thread From: Jann Horn @ 2022-10-18 11:19 UTC (permalink / raw) To: Kees Cook Cc: Andy Lutomirski, Christian Brauner, Eric W. Biederman, Jorge Merlino, Al Viro, Thomas Gleixner, Sebastian Andrzej Siewior, Andrew Morton, linux-mm, linux-fsdevel, John Johansen, Paul Moore, James Morris, Serge E. Hallyn, Stephen Smalley, Eric Paris, Richard Haines, Casey Schaufler, Xin Long, David S. Miller, Todd Kjos, Ondrej Mosnacek, Prashanth Prahlad, Micah Morton, Fenghua Yu, Andrei Vagin, Linux Kernel Mailing List, apparmor, linux-security-module, selinux, linux-hardening On Tue, Oct 18, 2022 at 9:09 AM Kees Cook <keescook@chromium.org> wrote: > On Fri, Oct 14, 2022 at 05:35:26PM +0200, Jann Horn wrote: > > On Fri, Oct 14, 2022 at 5:18 AM Andy Lutomirski <luto@kernel.org> wrote: > > > On Thu, Oct 6, 2022, at 7:13 AM, Jann Horn wrote: > > > > On Thu, Oct 6, 2022 at 11:05 AM Christian Brauner <brauner@kernel.org> wrote: > > > >> On Thu, Oct 06, 2022 at 01:27:34AM -0700, Kees Cook wrote: > > > >> > The check_unsafe_exec() counting of n_fs would not add up under a heavily > > > >> > threaded process trying to perform a suid exec, causing the suid portion > > > >> > to fail. This counting error appears to be unneeded, but to catch any > > > >> > possible conditions, explicitly unshare fs_struct on exec, if it ends up > > > >> > > > >> Isn't this a potential uapi break? Afaict, before this change a call to > > > >> clone{3}(CLONE_FS) followed by an exec in the child would have the > > > >> parent and child share fs information. So if the child e.g., changes the > > > >> working directory post exec it would also affect the parent. But after > > > >> this change here this would no longer be true. So a child changing a > > > >> workding directoro would not affect the parent anymore. IOW, an exec is > > > >> accompanied by an unshare(CLONE_FS). Might still be worth trying ofc but > > > >> it seems like a non-trivial uapi change but there might be few users > > > >> that do clone{3}(CLONE_FS) followed by an exec. > > > > > > > > I believe the following code in Chromium explicitly relies on this > > > > behavior, but I'm not sure whether this code is in active use anymore: > > > > > > > > https://source.chromium.org/chromium/chromium/src/+/main:sandbox/linux/suid/sandbox.c;l=101?q=CLONE_FS&sq=&ss=chromium > > > > > > Wait, this is absolutely nucking futs. On a very quick inspection, the sharable things like this are fs, files, sighand, and io. files and sighand get unshared, which makes sense. fs supposedly checks for extra refs and prevents gaining privilege. io is... ignored! At least it's not immediately obvious that io is a problem. > > > > > > But seriously, this makes no sense at all. It should not be possible to exec a program and then, without ptrace, change its cwd out from under it. Do we really need to preserve this behavior? > > > > I agree that this is pretty wild. > > > > The single user I'm aware of is Chrome, and as far as I know, they use > > it for establishing their sandbox on systems where unprivileged user > > namespaces are disabled - see > > <https://chromium.googlesource.com/chromium/src/+/main/docs/linux/suid_sandbox.md>. > > They also have seccomp-based sandboxing, but IIRC there are some small > > holes that mean it's still useful for them to be able to set up > > namespaces, like how sendmsg() on a unix domain socket can specify a > > file path as the destination address. > > > > (By the way, I think maybe Chrome wouldn't need this wacky trick with > > the shared fs_struct if the "NO_NEW_PRIVS permits chroot()" thing had > > ever landed that you > > (https://lore.kernel.org/lkml/0e2f0f54e19bff53a3739ecfddb4ffa9a6dbde4d.1327858005.git.luto@amacapital.net/) > > and Mickaël Salaün proposed in the past... or alternatively, if there > > was a way to properly filter all the syscalls that Chrome has to > > permit for renderers.) > > > > (But also, to be clear, I don't speak for Chrome, this is just my > > understanding of how their stuff works.) > > Chrome seems to just want a totally empty filesystem view, yes? > Let's land the nnp+chroot change. :P Only 10 years late! Then we can > have Chrome use this and we can unshare fs on exec... Someone should check with Chrome first though to make sure what I said accurately represents what they think... ^ permalink raw reply [flat|nested] 25+ messages in thread
* RE: [PATCH 1/2] fs/exec: Explicitly unshare fs_struct on exec 2022-10-14 3:18 ` Andy Lutomirski 2022-10-14 3:54 ` Kees Cook 2022-10-14 15:35 ` Jann Horn @ 2022-10-14 22:03 ` David Laight 2022-11-28 17:49 ` Eric W. Biederman 2 siblings, 1 reply; 25+ messages in thread From: David Laight @ 2022-10-14 22:03 UTC (permalink / raw) To: 'Andy Lutomirski', Jann Horn, Christian Brauner Cc: Kees Cook, Eric W. Biederman, Jorge Merlino, Al Viro, Thomas Gleixner, Sebastian Andrzej Siewior, Andrew Morton, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, John Johansen, Paul Moore, James Morris, Serge E. Hallyn, Stephen Smalley, Eric Paris, Richard Haines, Casey Schaufler, Xin Long, David S. Miller, Todd Kjos, Ondrej Mosnacek, Prashanth Prahlad, Micah Morton, Fenghua Yu, Andrei Vagin, Linux Kernel Mailing List, apparmor@lists.ubuntu.com, linux-security-module@vger.kernel.org, selinux@vger.kernel.org, linux-hardening@vger.kernel.org From: Andy Lutomirski > Sent: 14 October 2022 04:18 ... > But seriously, this makes no sense at all. It should not be possible to exec a program and then, > without ptrace, change its cwd out from under it. Do we really need to preserve this behavior? it maybe ok if the exec'ed program also 'bought-in' to the fact that its cwd and open files might get changed. But imagine someone doing it to a login shell! David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales) ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 1/2] fs/exec: Explicitly unshare fs_struct on exec 2022-10-14 22:03 ` David Laight @ 2022-11-28 17:49 ` Eric W. Biederman 0 siblings, 0 replies; 25+ messages in thread From: Eric W. Biederman @ 2022-11-28 17:49 UTC (permalink / raw) To: David Laight Cc: 'Andy Lutomirski', Jann Horn, Christian Brauner, Kees Cook, Jorge Merlino, Al Viro, Thomas Gleixner, Sebastian Andrzej Siewior, Andrew Morton, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, John Johansen, Paul Moore, James Morris, Serge E. Hallyn, Stephen Smalley, Eric Paris, Richard Haines, Casey Schaufler, Xin Long, David S. Miller, Todd Kjos, Ondrej Mosnacek, Prashanth Prahlad, Micah Morton, Fenghua Yu, Andrei Vagin, Linux Kernel Mailing List, apparmor@lists.ubuntu.com, linux-security-module@vger.kernel.org, selinux@vger.kernel.org, linux-hardening@vger.kernel.org David Laight <David.Laight@ACULAB.COM> writes: > From: Andy Lutomirski >> Sent: 14 October 2022 04:18 > ... >> But seriously, this makes no sense at all. It should not be possible to exec a program and then, >> without ptrace, change its cwd out from under it. Do we really need to preserve this behavior? > > it maybe ok if the exec'ed program also 'bought-in' to the > fact that its cwd and open files might get changed. > But imagine someone doing it to a login shell! I am slowly catching up on my email and I saw this conversation. When I initially saw this thread I was confused and thought this might run into an issue with fs/locks.c. I was close but wrong. fs/locks.c uses current->files as a sort of process identifier and so is very sensitive to when it is unshared. Making unsharing current->files unconditionally a bug. Not relevant to this conversation. There are several clone options that were only relevant for the old LinuxThreads implementation including CLONE_FS and CLONE_SIGHAND. The LinuxThreads implementation has not been needed since the introduction of CLONE_THREAD in linux-2.6.0 in 17 Dec 2003. Almost 20 years ago. I suggest we introduce CONFIG_CLONE_FS and CONFIG_SIGHAND to allow disabling support of these clone options. No known user space will care. The are both getting in the way of kernel maintenance so there is a reason to start pushing them out. Further simply not worrying about UNSHARE_FS during exec fixes the race so it essentially a bug fix by code removal. I believe something like the patch below should get the job done. diff --git a/fs/exec.c b/fs/exec.c index a0b1f0337a62..7ff13c77ad04 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1186,7 +1186,8 @@ static int unshare_sighand(struct task_struct *me) { struct sighand_struct *oldsighand = me->sighand; - if (refcount_read(&oldsighand->count) != 1) { + if (IS_ENABLED(CONFIG_CLONE_SIGHAND) && + refcount_read(&oldsighand->count) != 1) { struct sighand_struct *newsighand; /* * This ->sighand is shared with the CLONE_SIGHAND @@ -1568,6 +1569,9 @@ static void check_unsafe_exec(struct linux_binprm *bprm) if (task_no_new_privs(current)) bprm->unsafe |= LSM_UNSAFE_NO_NEW_PRIVS; + if (!IS_ENABLED(CONFIG_CLONE_FS)) + return; + t = p; n_fs = 1; spin_lock(&p->fs->lock); diff --git a/init/Kconfig b/init/Kconfig index 94125d3b6893..8660a6bcc1cf 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -1764,6 +1764,23 @@ config KALLSYMS_BASE_RELATIVE time constants, and no relocation pass is required at runtime to fix up the entries based on the runtime load address of the kernel. +config CLONE_FS + bool + default y + help + Support CLONE_FS being passed to clone. The only known user + is the old LinuxThreads package so it should be safe to disable + this option. + +config CLONE_SIGHAND + bool + default y + help + Support CLONE_SIGHAND being passed to clone. The only known user + is the old LinuxThreads package so it should be safe to disable + this option. + + # end of the "standard kernel features (expert users)" menu # syscall, maps, verifier diff --git a/kernel/fork.c b/kernel/fork.c index 08969f5aa38d..da9017b51da4 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -2023,6 +2023,16 @@ static __latent_entropy struct task_struct *copy_process( if ((clone_flags & CLONE_SIGHAND) && !(clone_flags & CLONE_VM)) return ERR_PTR(-EINVAL); + /* Don't allow CLONE_FS if not enabled */ + if (!IS_ENABLED(CONFIG_CLONE_FS) && + ((clone_flags & (CLONE_THREAD | CLONE_FS)) == CLONE_FS)) + return ERR_PTR(-EINVAL); + + /* Don't allow CLONE_SIGHAND if not enabled */ + if (!IS_ENABLED(CONFIG_CLONE_SIGHAND) && + ((clone_flags & (CLONE_THREAD | CLONE_SIGHAND)) == CLONE_SIGHAND)) + return ERR_PTR(-EINVAL); + /* * Siblings of global init remain as zombies on exit since they are * not reaped by their parent (swapper). To solve this and to avoid @@ -3101,6 +3111,9 @@ static int unshare_fs(unsigned long unshare_flags, struct fs_struct **new_fsp) if (fs->users == 1) return 0; + if (!IS_ENABLED(CONFIG_CLONE_FS)) + return -EINVAL; + *new_fsp = copy_fs_struct(fs); if (!*new_fsp) return -ENOMEM; Eric ^ permalink raw reply related [flat|nested] 25+ messages in thread
* [PATCH 2/2] exec: Remove LSM_UNSAFE_SHARE 2022-10-06 8:27 [PATCH 0/2] fs/exec: Explicitly unshare fs_struct on exec Kees Cook 2022-10-06 8:27 ` [PATCH 1/2] " Kees Cook @ 2022-10-06 8:27 ` Kees Cook 1 sibling, 0 replies; 25+ messages in thread From: Kees Cook @ 2022-10-06 8:27 UTC (permalink / raw) To: Eric Biederman Cc: Kees Cook, Alexander Viro, John Johansen, Paul Moore, James Morris, Serge E. Hallyn, Stephen Smalley, Eric Paris, Richard Haines, Casey Schaufler, Xin Long, David S. Miller, Todd Kjos, Ondrej Mosnacek, linux-fsdevel, linux-mm, apparmor, linux-security-module, selinux, Jorge Merlino, Christian Brauner (Microsoft), Thomas Gleixner, Andy Lutomirski, Sebastian Andrzej Siewior, Andrew Morton, Prashanth Prahlad, Micah Morton, Fenghua Yu, Andrei Vagin, linux-kernel, linux-hardening With fs_struct explicitly unshared during exec, it is no longer possible to have unexpected shared state, and LSM_UNSAFE_SHARE can be entirely removed. Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Eric Biederman <ebiederm@xmission.com> Cc: John Johansen <john.johansen@canonical.com> Cc: Paul Moore <paul@paul-moore.com> Cc: James Morris <jmorris@namei.org> Cc: "Serge E. Hallyn" <serge@hallyn.com> Cc: Stephen Smalley <stephen.smalley.work@gmail.com> Cc: Eric Paris <eparis@parisplace.org> Cc: Richard Haines <richard_c_haines@btinternet.com> Cc: Casey Schaufler <casey@schaufler-ca.com> Cc: Xin Long <lucien.xin@gmail.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Todd Kjos <tkjos@google.com> Cc: Ondrej Mosnacek <omosnace@redhat.com> Cc: linux-fsdevel@vger.kernel.org Cc: linux-mm@kvack.org Cc: apparmor@lists.ubuntu.com Cc: linux-security-module@vger.kernel.org Cc: selinux@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org> --- fs/exec.c | 17 +---------------- include/linux/security.h | 5 ++--- security/apparmor/domain.c | 5 ----- security/selinux/hooks.c | 10 ---------- 4 files changed, 3 insertions(+), 34 deletions(-) diff --git a/fs/exec.c b/fs/exec.c index 7d5f63f03c58..3cd058711098 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1563,8 +1563,7 @@ EXPORT_SYMBOL(bprm_change_interp); */ static void check_unsafe_exec(struct linux_binprm *bprm) { - struct task_struct *p = current, *t; - unsigned n_fs; + struct task_struct *p = current; if (p->ptrace) bprm->unsafe |= LSM_UNSAFE_PTRACE; @@ -1575,20 +1574,6 @@ static void check_unsafe_exec(struct linux_binprm *bprm) */ if (task_no_new_privs(current)) bprm->unsafe |= LSM_UNSAFE_NO_NEW_PRIVS; - - t = p; - n_fs = 1; - spin_lock(&p->fs->lock); - rcu_read_lock(); - while_each_thread(p, t) { - if (t->fs == p->fs) - n_fs++; - } - rcu_read_unlock(); - - if (p->fs->users > n_fs) - bprm->unsafe |= LSM_UNSAFE_SHARE; - spin_unlock(&p->fs->lock); } static void bprm_fill_uid(struct linux_binprm *bprm, struct file *file) diff --git a/include/linux/security.h b/include/linux/security.h index 1bc362cb413f..db508a8c3cc7 100644 --- a/include/linux/security.h +++ b/include/linux/security.h @@ -215,9 +215,8 @@ struct sched_param; struct request_sock; /* bprm->unsafe reasons */ -#define LSM_UNSAFE_SHARE 1 -#define LSM_UNSAFE_PTRACE 2 -#define LSM_UNSAFE_NO_NEW_PRIVS 4 +#define LSM_UNSAFE_PTRACE BIT(0) +#define LSM_UNSAFE_NO_NEW_PRIVS BIT(1) #ifdef CONFIG_MMU extern int mmap_min_addr_handler(struct ctl_table *table, int write, diff --git a/security/apparmor/domain.c b/security/apparmor/domain.c index 91689d34d281..1b2c0bb4d9ae 100644 --- a/security/apparmor/domain.c +++ b/security/apparmor/domain.c @@ -924,11 +924,6 @@ int apparmor_bprm_creds_for_exec(struct linux_binprm *bprm) goto audit; } - if (bprm->unsafe & LSM_UNSAFE_SHARE) { - /* FIXME: currently don't mediate shared state */ - ; - } - if (bprm->unsafe & (LSM_UNSAFE_PTRACE)) { /* TODO: test needs to be profile of label to new */ error = may_change_ptraced_domain(new, &info); diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c index 79573504783b..3ec80cc8ad1c 100644 --- a/security/selinux/hooks.c +++ b/security/selinux/hooks.c @@ -2349,16 +2349,6 @@ static int selinux_bprm_creds_for_exec(struct linux_binprm *bprm) if (rc) return rc; - /* Check for shared state */ - if (bprm->unsafe & LSM_UNSAFE_SHARE) { - rc = avc_has_perm(&selinux_state, - old_tsec->sid, new_tsec->sid, - SECCLASS_PROCESS, PROCESS__SHARE, - NULL); - if (rc) - return -EPERM; - } - /* Make sure that anyone attempting to ptrace over a task that * changes its SID has the appropriate permit */ if (bprm->unsafe & LSM_UNSAFE_PTRACE) { -- 2.34.1 ^ permalink raw reply related [flat|nested] 25+ messages in thread
end of thread, other threads:[~2025-05-15 16:48 UTC | newest] Thread overview: 25+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2022-10-06 8:27 [PATCH 0/2] fs/exec: Explicitly unshare fs_struct on exec Kees Cook 2022-10-06 8:27 ` [PATCH 1/2] " Kees Cook 2022-10-06 9:05 ` Christian Brauner 2022-10-06 10:48 ` David Laight 2022-10-06 14:13 ` Jann Horn 2022-10-06 15:25 ` Kees Cook 2022-10-06 15:35 ` Jann Horn 2025-05-13 13:05 ` Mateusz Guzik 2025-05-13 15:29 ` Eric W. Biederman 2025-05-13 20:57 ` Kees Cook 2025-05-13 21:09 ` Jann Horn 2025-05-13 22:16 ` Eric W. Biederman 2025-05-14 0:03 ` Mateusz Guzik 2025-05-14 15:33 ` Eric W. Biederman 2025-05-14 15:42 ` Kees Cook 2025-05-15 16:48 ` Eric W. Biederman 2025-05-13 23:15 ` Kees Cook 2022-10-14 3:18 ` Andy Lutomirski 2022-10-14 3:54 ` Kees Cook 2022-10-14 15:35 ` Jann Horn 2022-10-18 7:09 ` Kees Cook 2022-10-18 11:19 ` Jann Horn 2022-10-14 22:03 ` David Laight 2022-11-28 17:49 ` Eric W. Biederman 2022-10-06 8:27 ` [PATCH 2/2] exec: Remove LSM_UNSAFE_SHARE Kees Cook
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).