[PATCH 0/2] fs/exec: Explicitly unshare fs

linux-security-module.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/2] fs/exec: Explicitly unshare fs_struct on exec
@ 2022-10-06  8:27 Kees Cook
  2022-10-06  8:27 ` [PATCH 1/2] " Kees Cook
  2022-10-06  8:27 ` [PATCH 2/2] exec: Remove LSM_UNSAFE_SHARE Kees Cook
  0 siblings, 2 replies; 25+ messages in thread
From: Kees Cook @ 2022-10-06  8:27 UTC (permalink / raw)
  To: Eric Biederman
  Cc: Kees Cook, Jorge Merlino, Al Viro, Christian Brauner (Microsoft),
	Thomas Gleixner, Andy Lutomirski, Sebastian Andrzej Siewior,
	Andrew Morton, John Johansen, Paul Moore, James Morris,
	Serge E. Hallyn, Stephen Smalley, Eric Paris, Richard Haines,
	Casey Schaufler, Xin Long, David S. Miller, Todd Kjos,
	Ondrej Mosnacek, Prashanth Prahlad, Micah Morton, Fenghua Yu,
	Andrei Vagin, linux-kernel, linux-mm, linux-fsdevel, apparmor,
	linux-security-module, selinux, linux-hardening

Hi,

These changes seek to address an issue reported[1] by Jorge Merlino where
high-thread-count processes would sometimes fail to setuid during a
setuid execve().

It looks to me like the solution is to explicitly do an unshare_fs(),
which should almost always be a no-op. Current testing seems to indicate
that only the swapper->init exec triggers this condition (and I'm unclear
on whether that's expected or undesirable). This has only received very
light testing so far, but I wanted to share it so other folks could look
it over.

Jorge, can you test with these patches? Your PoC triggered immediately
for me on an unpatched kernel, and did not trigger on a patched one.

I added this patch on top of the series to see if the code ever fired:

diff --git a/kernel/fork.c b/kernel/fork.c
index 53b7248f7a4b..3c197d9d8daa 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -3113,6 +3113,7 @@ int unshare_fs(void)
 	if (error || !new_fs)
 		return error;
 
+	pr_notice("UNSHARE of \"%s\" [%d]\n", current->comm, current->pid);
 	unshare_fs_finalize(&new_fs);
 
 	if (new_fs)

Thanks!

-Kees

[1] https://lore.kernel.org/lkml/20220910211215.140270-1-jorge.merlino@canonical.com/

Kees Cook (2):
  fs/exec: Explicitly unshare fs_struct on exec
  exec: Remove LSM_UNSAFE_SHARE

 fs/exec.c                  | 26 ++++------------
 fs/fs_struct.c             |  1 -
 include/linux/fdtable.h    |  1 +
 include/linux/fs_struct.h  |  1 -
 include/linux/security.h   |  5 ++-
 kernel/fork.c              | 62 ++++++++++++++++++++++++++------------
 security/apparmor/domain.c |  5 ---
 security/selinux/hooks.c   | 10 ------
 8 files changed, 51 insertions(+), 60 deletions(-)

-- 
2.34.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 1/2] fs/exec: Explicitly unshare fs_struct on exec
  2022-10-06  8:27 [PATCH 0/2] fs/exec: Explicitly unshare fs_struct on exec Kees Cook
@ 2022-10-06  8:27 ` Kees Cook
  2022-10-06  9:05   ` Christian Brauner
  2022-10-06  8:27 ` [PATCH 2/2] exec: Remove LSM_UNSAFE_SHARE Kees Cook
  1 sibling, 1 reply; 25+ messages in thread
From: Kees Cook @ 2022-10-06  8:27 UTC (permalink / raw)
  To: Eric Biederman
  Cc: Kees Cook, Jorge Merlino, Alexander Viro,
	Christian Brauner (Microsoft), Thomas Gleixner, Andy Lutomirski,
	Sebastian Andrzej Siewior, Andrew Morton, linux-mm, linux-fsdevel,
	John Johansen, Paul Moore, James Morris, Serge E. Hallyn,
	Stephen Smalley, Eric Paris, Richard Haines, Casey Schaufler,
	Xin Long, David S. Miller, Todd Kjos, Ondrej Mosnacek,
	Prashanth Prahlad, Micah Morton, Fenghua Yu, Andrei Vagin,
	linux-kernel, apparmor, linux-security-module, selinux,
	linux-hardening

The check_unsafe_exec() counting of n_fs would not add up under a heavily
threaded process trying to perform a suid exec, causing the suid portion
to fail. This counting error appears to be unneeded, but to catch any
possible conditions, explicitly unshare fs_struct on exec, if it ends up
needing to happen. This means fs->in_exec must be removed (since it would
never get cleared in the old copy), and is specifically no longer needed.

See also commit 498052bba55e ("New locking/refcounting for fs_struct").

This additionally allows the "in_exec" member to be dropped, showing an
8 bytes savings per fs_struct on 64-bit. Before:

struct fs_struct {
        int                        users;                /*     0     4 */
        spinlock_t                 lock;                 /*     4     4 */
        seqcount_spinlock_t        seq;                  /*     8     4 */
        int                        umask;                /*    12     4 */
        int                        in_exec;              /*    16     4 */

        /* XXX 4 bytes hole, try to pack */

        struct path                root;                 /*    24    16 */
        struct path                pwd;                  /*    40    16 */

        /* size: 56, cachelines: 1, members: 7 */
        /* sum members: 52, holes: 1, sum holes: 4 */
        /* last cacheline: 56 bytes */
};

After:

struct fs_struct {
        int                        users;                /*     0     4 */
        spinlock_t                 lock;                 /*     4     4 */
        seqcount_spinlock_t        seq;                  /*     8     4 */
        int                        umask;                /*    12     4 */
        struct path                root;                 /*    16    16 */
        struct path                pwd;                  /*    32    16 */

        /* size: 48, cachelines: 1, members: 6 */
        /* last cacheline: 48 bytes */
};

Reported-by: Jorge Merlino <jorge.merlino@canonical.com>
Link: https://lore.kernel.org/lkml/20220910211215.140270-1-jorge.merlino@canonical.com/
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Christian Brauner (Microsoft)" <brauner@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
---
 fs/exec.c                 |  9 +++---
 fs/fs_struct.c            |  1 -
 include/linux/fdtable.h   |  1 +
 include/linux/fs_struct.h |  1 -
 kernel/fork.c             | 62 ++++++++++++++++++++++++++-------------
 5 files changed, 48 insertions(+), 26 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 902bce45b116..7d5f63f03c58 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1272,6 +1272,11 @@ int begin_new_exec(struct linux_binprm * bprm)
 	if (retval)
 		goto out;
 
+	/* Ensure the fs_struct is not shared. */
+	retval = unshare_fs();
+	if (retval)
+		goto out;
+
 	/*
 	 * Must be called _before_ exec_mmap() as bprm->mm is
 	 * not visible until then. This also enables the update
@@ -1583,8 +1588,6 @@ static void check_unsafe_exec(struct linux_binprm *bprm)
 
 	if (p->fs->users > n_fs)
 		bprm->unsafe |= LSM_UNSAFE_SHARE;
-	else
-		p->fs->in_exec = 1;
 	spin_unlock(&p->fs->lock);
 }
 
@@ -1843,7 +1846,6 @@ static int bprm_execve(struct linux_binprm *bprm,
 		goto out;
 
 	/* execve succeeded */
-	current->fs->in_exec = 0;
 	current->in_execve = 0;
 	rseq_execve(current);
 	acct_update_integrals(current);
@@ -1861,7 +1863,6 @@ static int bprm_execve(struct linux_binprm *bprm,
 		force_fatal_sig(SIGSEGV);
 
 out_unmark:
-	current->fs->in_exec = 0;
 	current->in_execve = 0;
 
 	return retval;
diff --git a/fs/fs_struct.c b/fs/fs_struct.c
index 04b3f5b9c629..c27c67023d01 100644
--- a/fs/fs_struct.c
+++ b/fs/fs_struct.c
@@ -115,7 +115,6 @@ struct fs_struct *copy_fs_struct(struct fs_struct *old)
 	/* We don't need to lock fs - think why ;-) */
 	if (fs) {
 		fs->users = 1;
-		fs->in_exec = 0;
 		spin_lock_init(&fs->lock);
 		seqcount_spinlock_init(&fs->seq, &fs->lock);
 		fs->umask = old->umask;
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index e066816f3519..fbfb89ee3bda 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -117,6 +117,7 @@ struct task_struct;
 
 void put_files_struct(struct files_struct *fs);
 int unshare_files(void);
+int unshare_fs(void);
 struct files_struct *dup_fd(struct files_struct *, unsigned, int *) __latent_entropy;
 void do_close_on_exec(struct files_struct *);
 int iterate_fd(struct files_struct *, unsigned,
diff --git a/include/linux/fs_struct.h b/include/linux/fs_struct.h
index 783b48dedb72..08b82a90ceae 100644
--- a/include/linux/fs_struct.h
+++ b/include/linux/fs_struct.h
@@ -11,7 +11,6 @@ struct fs_struct {
 	spinlock_t lock;
 	seqcount_spinlock_t seq;
 	int umask;
-	int in_exec;
 	struct path root, pwd;
 } __randomize_layout;
 
diff --git a/kernel/fork.c b/kernel/fork.c
index b4a799d9c50f..53b7248f7a4b 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1589,10 +1589,6 @@ static int copy_fs(unsigned long clone_flags, struct task_struct *tsk)
 	if (clone_flags & CLONE_FS) {
 		/* tsk->fs is already what we want */
 		spin_lock(&fs->lock);
-		if (fs->in_exec) {
-			spin_unlock(&fs->lock);
-			return -EAGAIN;
-		}
 		fs->users++;
 		spin_unlock(&fs->lock);
 		return 0;
@@ -3070,10 +3066,9 @@ static int check_unshare_flags(unsigned long unshare_flags)
 	return 0;
 }
 
-/*
- * Unshare the filesystem structure if it is being shared
- */
-static int unshare_fs(unsigned long unshare_flags, struct fs_struct **new_fsp)
+/* Allocate an fs_struct if it is currently shared and CLONE_FS requested. */
+static int unshare_fs_alloc(unsigned long unshare_flags,
+			    struct fs_struct **new_fsp)
 {
 	struct fs_struct *fs = current->fs;
 
@@ -3091,6 +3086,41 @@ static int unshare_fs(unsigned long unshare_flags, struct fs_struct **new_fsp)
 	return 0;
 }
 
+/* Swap out fs_struct, returning old fs if it needs freeing. */
+static void unshare_fs_finalize(struct fs_struct **new_fsp)
+{
+	struct task_struct *task = current;
+	struct fs_struct *fs = task->fs;
+
+	spin_lock(&fs->lock);
+	task->fs = *new_fsp;
+	if (--fs->users)
+		*new_fsp = NULL;
+	else
+		*new_fsp = fs;
+	spin_unlock(&fs->lock);
+}
+
+/*
+ * Unshare the filesystem structure if it is being shared
+ */
+int unshare_fs(void)
+{
+	struct fs_struct *new_fs = NULL;
+	int error;
+
+	error = unshare_fs_alloc(CLONE_FS, &new_fs);
+	if (error || !new_fs)
+		return error;
+
+	unshare_fs_finalize(&new_fs);
+
+	if (new_fs)
+		free_fs_struct(new_fs);
+
+	return 0;
+}
+
 /*
  * Unshare file descriptor table if it is being shared
  */
@@ -3120,7 +3150,7 @@ int unshare_fd(unsigned long unshare_flags, unsigned int max_fds,
  */
 int ksys_unshare(unsigned long unshare_flags)
 {
-	struct fs_struct *fs, *new_fs = NULL;
+	struct fs_struct *new_fs = NULL;
 	struct files_struct *new_fd = NULL;
 	struct cred *new_cred = NULL;
 	struct nsproxy *new_nsproxy = NULL;
@@ -3159,7 +3189,7 @@ int ksys_unshare(unsigned long unshare_flags)
 	 */
 	if (unshare_flags & (CLONE_NEWIPC|CLONE_SYSVSEM))
 		do_sysvsem = 1;
-	err = unshare_fs(unshare_flags, &new_fs);
+	err = unshare_fs_alloc(unshare_flags, &new_fs);
 	if (err)
 		goto bad_unshare_out;
 	err = unshare_fd(unshare_flags, NR_OPEN_MAX, &new_fd);
@@ -3197,16 +3227,8 @@ int ksys_unshare(unsigned long unshare_flags)
 
 		task_lock(current);
 
-		if (new_fs) {
-			fs = current->fs;
-			spin_lock(&fs->lock);
-			current->fs = new_fs;
-			if (--fs->users)
-				new_fs = NULL;
-			else
-				new_fs = fs;
-			spin_unlock(&fs->lock);
-		}
+		if (new_fs)
+			unshare_fs_finalize(&new_fs);
 
 		if (new_fd)
 			swap(current->files, new_fd);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] fs/exec: Explicitly unshare fs_struct on exec
  2022-10-06  8:27 ` [PATCH 1/2] " Kees Cook
@ 2022-10-06  9:05   ` Christian Brauner
  2022-10-06 10:48     ` David Laight
  2022-10-06 14:13     ` Jann Horn
  0 siblings, 2 replies; 25+ messages in thread
From: Christian Brauner @ 2022-10-06  9:05 UTC (permalink / raw)
  To: Kees Cook
  Cc: Eric Biederman, Jorge Merlino, Alexander Viro, Thomas Gleixner,
	Andy Lutomirski, Sebastian Andrzej Siewior, Andrew Morton,
	linux-mm, linux-fsdevel, John Johansen, Paul Moore, James Morris,
	Serge E. Hallyn, Stephen Smalley, Eric Paris, Richard Haines,
	Casey Schaufler, Xin Long, David S. Miller, Todd Kjos,
	Ondrej Mosnacek, Prashanth Prahlad, Micah Morton, Fenghua Yu,
	Andrei Vagin, linux-kernel, apparmor, linux-security-module,
	selinux, linux-hardening

On Thu, Oct 06, 2022 at 01:27:34AM -0700, Kees Cook wrote:
> The check_unsafe_exec() counting of n_fs would not add up under a heavily
> threaded process trying to perform a suid exec, causing the suid portion
> to fail. This counting error appears to be unneeded, but to catch any
> possible conditions, explicitly unshare fs_struct on exec, if it ends up

Isn't this a potential uapi break? Afaict, before this change a call to
clone{3}(CLONE_FS) followed by an exec in the child would have the
parent and child share fs information. So if the child e.g., changes the
working directory post exec it would also affect the parent. But after
this change here this would no longer be true. So a child changing a
workding directoro would not affect the parent anymore. IOW, an exec is
accompanied by an unshare(CLONE_FS). Might still be worth trying ofc but
it seems like a non-trivial uapi change but there might be few users
that do clone{3}(CLONE_FS) followed by an exec.

> +/*
> + * Unshare the filesystem structure if it is being shared
> + */
> +int unshare_fs(void)
> +{
> +	struct fs_struct *new_fs = NULL;
> +	int error;
> +
> +	error = unshare_fs_alloc(CLONE_FS, &new_fs);
> +	if (error || !new_fs)
> +		return error;

You should just check for error here, not new_fs.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* RE: [PATCH 1/2] fs/exec: Explicitly unshare fs_struct on exec
  2022-10-06  9:05   ` Christian Brauner
@ 2022-10-06 10:48     ` David Laight
  2022-10-06 14:13     ` Jann Horn
  1 sibling, 0 replies; 25+ messages in thread
From: David Laight @ 2022-10-06 10:48 UTC (permalink / raw)
  To: 'Christian Brauner', Kees Cook
  Cc: Eric Biederman, Jorge Merlino, Alexander Viro, Thomas Gleixner,
	Andy Lutomirski, Sebastian Andrzej Siewior, Andrew Morton,
	linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, John Johansen,
	Paul Moore, James Morris, Serge E. Hallyn, Stephen Smalley,
	Eric Paris, Richard Haines, Casey Schaufler, Xin Long,
	David S. Miller, Todd Kjos, Ondrej Mosnacek, Prashanth Prahlad,
	Micah Morton, Fenghua Yu, Andrei Vagin,
	linux-kernel@vger.kernel.org, apparmor@lists.ubuntu.com,
	linux-security-module@vger.kernel.org, selinux@vger.kernel.org,
	linux-hardening@vger.kernel.org

From: Christian Brauner
> Sent: 06 October 2022 10:05
> 
> On Thu, Oct 06, 2022 at 01:27:34AM -0700, Kees Cook wrote:
> > The check_unsafe_exec() counting of n_fs would not add up under a heavily
> > threaded process trying to perform a suid exec, causing the suid portion
> > to fail. This counting error appears to be unneeded, but to catch any
> > possible conditions, explicitly unshare fs_struct on exec, if it ends up
> 
> Isn't this a potential uapi break? Afaict, before this change a call to
> clone{3}(CLONE_FS) followed by an exec in the child would have the
> parent and child share fs information. So if the child e.g., changes the
> working directory post exec it would also affect the parent. But after
> this change here this would no longer be true. So a child changing a
> workding directoro would not affect the parent anymore. IOW, an exec is
> accompanied by an unshare(CLONE_FS). Might still be worth trying ofc but
> it seems like a non-trivial uapi change but there might be few users
> that do clone{3}(CLONE_FS) followed by an exec.

The thought of that is entirely horrid...

I presume a suid exec will fail in that case?

If the old code is trying to compare the number of threads
with the number of users of the fs table isn't is just buggy?
If a thread unshares the fs table there can be another
reference somewhere else - which is what (I presume) is being
tested for.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] fs/exec: Explicitly unshare fs_struct on exec
  2022-10-06  9:05   ` Christian Brauner
  2022-10-06 10:48     ` David Laight
@ 2022-10-06 14:13     ` Jann Horn
  2022-10-06 15:25       ` Kees Cook
  2022-10-14  3:18       ` Andy Lutomirski
  1 sibling, 2 replies; 25+ messages in thread
From: Jann Horn @ 2022-10-06 14:13 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Kees Cook, Eric Biederman, Jorge Merlino, Alexander Viro,
	Thomas Gleixner, Andy Lutomirski, Sebastian Andrzej Siewior,
	Andrew Morton, linux-mm, linux-fsdevel, John Johansen, Paul Moore,
	James Morris, Serge E. Hallyn, Stephen Smalley, Eric Paris,
	Richard Haines, Casey Schaufler, Xin Long, David S. Miller,
	Todd Kjos, Ondrej Mosnacek, Prashanth Prahlad, Micah Morton,
	Fenghua Yu, Andrei Vagin, linux-kernel, apparmor,
	linux-security-module, selinux, linux-hardening

On Thu, Oct 6, 2022 at 11:05 AM Christian Brauner <brauner@kernel.org> wrote:
> On Thu, Oct 06, 2022 at 01:27:34AM -0700, Kees Cook wrote:
> > The check_unsafe_exec() counting of n_fs would not add up under a heavily
> > threaded process trying to perform a suid exec, causing the suid portion
> > to fail. This counting error appears to be unneeded, but to catch any
> > possible conditions, explicitly unshare fs_struct on exec, if it ends up
>
> Isn't this a potential uapi break? Afaict, before this change a call to
> clone{3}(CLONE_FS) followed by an exec in the child would have the
> parent and child share fs information. So if the child e.g., changes the
> working directory post exec it would also affect the parent. But after
> this change here this would no longer be true. So a child changing a
> workding directoro would not affect the parent anymore. IOW, an exec is
> accompanied by an unshare(CLONE_FS). Might still be worth trying ofc but
> it seems like a non-trivial uapi change but there might be few users
> that do clone{3}(CLONE_FS) followed by an exec.

I believe the following code in Chromium explicitly relies on this
behavior, but I'm not sure whether this code is in active use anymore:

https://source.chromium.org/chromium/chromium/src/+/main:sandbox/linux/suid/sandbox.c;l=101?q=CLONE_FS&sq=&ss=chromium

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] fs/exec: Explicitly unshare fs_struct on exec
  2022-10-06 14:13     ` Jann Horn
@ 2022-10-06 15:25       ` Kees Cook
  2022-10-06 15:35         ` Jann Horn
  2025-05-13 13:05         ` Mateusz Guzik
  2022-10-14  3:18       ` Andy Lutomirski
  1 sibling, 2 replies; 25+ messages in thread
From: Kees Cook @ 2022-10-06 15:25 UTC (permalink / raw)
  To: Jann Horn, Christian Brauner
  Cc: Eric Biederman, Jorge Merlino, Alexander Viro, Thomas Gleixner,
	Andy Lutomirski, Sebastian Andrzej Siewior, Andrew Morton,
	linux-mm, linux-fsdevel, John Johansen, Paul Moore, James Morris,
	Serge E. Hallyn, Stephen Smalley, Eric Paris, Richard Haines,
	Casey Schaufler, Xin Long, David S. Miller, Todd Kjos,
	Ondrej Mosnacek, Prashanth Prahlad, Micah Morton, Fenghua Yu,
	Andrei Vagin, linux-kernel, apparmor, linux-security-module,
	selinux, linux-hardening



On October 6, 2022 7:13:37 AM PDT, Jann Horn <jannh@google.com> wrote:
>On Thu, Oct 6, 2022 at 11:05 AM Christian Brauner <brauner@kernel.org> wrote:
>> On Thu, Oct 06, 2022 at 01:27:34AM -0700, Kees Cook wrote:
>> > The check_unsafe_exec() counting of n_fs would not add up under a heavily
>> > threaded process trying to perform a suid exec, causing the suid portion
>> > to fail. This counting error appears to be unneeded, but to catch any
>> > possible conditions, explicitly unshare fs_struct on exec, if it ends up
>>
>> Isn't this a potential uapi break? Afaict, before this change a call to
>> clone{3}(CLONE_FS) followed by an exec in the child would have the
>> parent and child share fs information. So if the child e.g., changes the
>> working directory post exec it would also affect the parent. But after
>> this change here this would no longer be true. So a child changing a
>> workding directoro would not affect the parent anymore. IOW, an exec is
>> accompanied by an unshare(CLONE_FS). Might still be worth trying ofc but
>> it seems like a non-trivial uapi change but there might be few users
>> that do clone{3}(CLONE_FS) followed by an exec.
>
>I believe the following code in Chromium explicitly relies on this
>behavior, but I'm not sure whether this code is in active use anymore:
>
>https://source.chromium.org/chromium/chromium/src/+/main:sandbox/linux/suid/sandbox.c;l=101?q=CLONE_FS&sq=&ss=chromium

Oh yes. I think I had tried to forget this existed. Ugh. Okay, so back to the drawing board, I guess. The counting will need to be fixed...

It's possible we can move the counting after dethread -- it seems the early count was just to avoid setting flags after the point of no return, but it's not an error condition...

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] fs/exec: Explicitly unshare fs_struct on exec
  2022-10-06 15:25       ` Kees Cook
@ 2022-10-06 15:35         ` Jann Horn
  2025-05-13 13:05         ` Mateusz Guzik
  1 sibling, 0 replies; 25+ messages in thread
From: Jann Horn @ 2022-10-06 15:35 UTC (permalink / raw)
  To: Kees Cook
  Cc: Christian Brauner, Eric Biederman, Jorge Merlino, Alexander Viro,
	Thomas Gleixner, Andy Lutomirski, Sebastian Andrzej Siewior,
	Andrew Morton, linux-mm, linux-fsdevel, John Johansen, Paul Moore,
	James Morris, Serge E. Hallyn, Stephen Smalley, Eric Paris,
	Richard Haines, Casey Schaufler, Xin Long, David S. Miller,
	Todd Kjos, Ondrej Mosnacek, Prashanth Prahlad, Micah Morton,
	Fenghua Yu, Andrei Vagin, linux-kernel, apparmor,
	linux-security-module, selinux, linux-hardening

On Thu, Oct 6, 2022 at 5:25 PM Kees Cook <keescook@chromium.org> wrote:
> On October 6, 2022 7:13:37 AM PDT, Jann Horn <jannh@google.com> wrote:
> >On Thu, Oct 6, 2022 at 11:05 AM Christian Brauner <brauner@kernel.org> wrote:
> >> On Thu, Oct 06, 2022 at 01:27:34AM -0700, Kees Cook wrote:
> >> > The check_unsafe_exec() counting of n_fs would not add up under a heavily
> >> > threaded process trying to perform a suid exec, causing the suid portion
> >> > to fail. This counting error appears to be unneeded, but to catch any
> >> > possible conditions, explicitly unshare fs_struct on exec, if it ends up
> >>
> >> Isn't this a potential uapi break? Afaict, before this change a call to
> >> clone{3}(CLONE_FS) followed by an exec in the child would have the
> >> parent and child share fs information. So if the child e.g., changes the
> >> working directory post exec it would also affect the parent. But after
> >> this change here this would no longer be true. So a child changing a
> >> workding directoro would not affect the parent anymore. IOW, an exec is
> >> accompanied by an unshare(CLONE_FS). Might still be worth trying ofc but
> >> it seems like a non-trivial uapi change but there might be few users
> >> that do clone{3}(CLONE_FS) followed by an exec.
> >
> >I believe the following code in Chromium explicitly relies on this
> >behavior, but I'm not sure whether this code is in active use anymore:
> >
> >https://source.chromium.org/chromium/chromium/src/+/main:sandbox/linux/suid/sandbox.c;l=101?q=CLONE_FS&sq=&ss=chromium
>
> Oh yes. I think I had tried to forget this existed. Ugh. Okay, so back to the drawing board, I guess. The counting will need to be fixed...
>
> It's possible we can move the counting after dethread -- it seems the early count was just to avoid setting flags after the point of no return, but it's not an error condition...

Random idea that I haven't thought about a lot:

One approach might be to not do it by counting, but instead have a
flag on the fs_struct that we set when someone does a clone() with
CLONE_FS but without CLONE_THREAD? Then we'd end up with the following
possible states for fs_struct:

 - single-process, normal
 - single-process, pending execve past check_unsafe_exec() (prevent
concurrent CLONE_FS)
 - shared between processes

The slight difference from the old semantics would be that once you've
used CLONE_FS without CLONE_THREAD, you can never do setuid execve()
from your current process again (without calling unshare()), even if
the child disappears in the meantime. I think that might be an
acceptably small UAPI break.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] fs/exec: Explicitly unshare fs_struct on exec
  2022-10-06 15:25       ` Kees Cook
  2022-10-06 15:35         ` Jann Horn
@ 2025-05-13 13:05         ` Mateusz Guzik
  2025-05-13 15:29           ` Eric W. Biederman
  2025-05-13 20:57           ` Kees Cook
  1 sibling, 2 replies; 25+ messages in thread
From: Mateusz Guzik @ 2025-05-13 13:05 UTC (permalink / raw)
  To: Kees Cook
  Cc: Jann Horn, Christian Brauner, Eric Biederman, Jorge Merlino,
	Alexander Viro, Thomas Gleixner, Andy Lutomirski,
	Sebastian Andrzej Siewior, Andrew Morton, linux-mm, linux-fsdevel,
	John Johansen, Paul Moore, James Morris, Serge E. Hallyn,
	Stephen Smalley, Eric Paris, Richard Haines, Casey Schaufler,
	Xin Long, David S. Miller, Todd Kjos, Ondrej Mosnacek,
	Prashanth Prahlad, Micah Morton, Fenghua Yu, Andrei Vagin,
	linux-kernel, apparmor, linux-security-module, selinux,
	linux-hardening, oleg

On Thu, Oct 06, 2022 at 08:25:01AM -0700, Kees Cook wrote:
> On October 6, 2022 7:13:37 AM PDT, Jann Horn <jannh@google.com> wrote:
> >On Thu, Oct 6, 2022 at 11:05 AM Christian Brauner <brauner@kernel.org> wrote:
> >> On Thu, Oct 06, 2022 at 01:27:34AM -0700, Kees Cook wrote:
> >> > The check_unsafe_exec() counting of n_fs would not add up under a heavily
> >> > threaded process trying to perform a suid exec, causing the suid portion
> >> > to fail. This counting error appears to be unneeded, but to catch any
> >> > possible conditions, explicitly unshare fs_struct on exec, if it ends up
> >>
> >> Isn't this a potential uapi break? Afaict, before this change a call to
> >> clone{3}(CLONE_FS) followed by an exec in the child would have the
> >> parent and child share fs information. So if the child e.g., changes the
> >> working directory post exec it would also affect the parent. But after
> >> this change here this would no longer be true. So a child changing a
> >> workding directoro would not affect the parent anymore. IOW, an exec is
> >> accompanied by an unshare(CLONE_FS). Might still be worth trying ofc but
> >> it seems like a non-trivial uapi change but there might be few users
> >> that do clone{3}(CLONE_FS) followed by an exec.
> >
> >I believe the following code in Chromium explicitly relies on this
> >behavior, but I'm not sure whether this code is in active use anymore:
> >
> >https://source.chromium.org/chromium/chromium/src/+/main:sandbox/linux/suid/sandbox.c;l=101?q=CLONE_FS&sq=&ss=chromium
> 
> Oh yes. I think I had tried to forget this existed. Ugh. Okay, so back to the drawing board, I guess. The counting will need to be fixed...
> 
> It's possible we can move the counting after dethread -- it seems the early count was just to avoid setting flags after the point of no return, but it's not an error condition...
> 

I landed here from git blame.

I was looking at sanitizing shared fs vs suid handling, but the entire
ordeal is so convoluted I'm confident the best way forward is to whack
the problem to begin with.

Per the above link, the notion of a shared fs struct across different
processes is depended on so merely unsharing is a no-go.

However, the shared state is only a problem for suid/sgid.

Here is my proposal: *deny* exec of suid/sgid binaries if fs_struct is
shared. This will have to be checked for after the execing proc becomes
single-threaded ofc.

While technically speaking this does introduce a change in behavior,
there is precedent for doing it and seeing if anyone yells.

With this in place there is no point maintainig ->in_exec or checking
the flag.

There is the known example of depending on shared fs_struct across exec.
Hopefully there is no example of depending on execing a suid/sgid binary
in such a setting -- it would be quite a weird setup given that for
security reasons the perms must not be changed.

The upshot of this method is that any breakage will be immediately
visible in the form of a failed exec.

Another route would be to do the mandatory unshare but only for
suid/sgid, except that would have a hidden failure (if you will).

Comments?

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] fs/exec: Explicitly unshare fs_struct on exec
  2025-05-13 13:05         ` Mateusz Guzik
@ 2025-05-13 15:29           ` Eric W. Biederman
  2025-05-13 20:57           ` Kees Cook
  1 sibling, 0 replies; 25+ messages in thread
From: Eric W. Biederman @ 2025-05-13 15:29 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: Kees Cook, Jann Horn, Christian Brauner, Jorge Merlino,
	Alexander Viro, Thomas Gleixner, Andy Lutomirski,
	Sebastian Andrzej Siewior, Andrew Morton, linux-mm, linux-fsdevel,
	John Johansen, Paul Moore, James Morris, Serge E. Hallyn,
	Stephen Smalley, Eric Paris, Richard Haines, Casey Schaufler,
	Xin Long, David S. Miller, Todd Kjos, Ondrej Mosnacek,
	Prashanth Prahlad, Micah Morton, Fenghua Yu, Andrei Vagin,
	linux-kernel, apparmor, linux-security-module, selinux,
	linux-hardening, oleg

Mateusz Guzik <mjguzik@gmail.com> writes:

> On Thu, Oct 06, 2022 at 08:25:01AM -0700, Kees Cook wrote:
>> On October 6, 2022 7:13:37 AM PDT, Jann Horn <jannh@google.com> wrote:
>> >On Thu, Oct 6, 2022 at 11:05 AM Christian Brauner <brauner@kernel.org> wrote:
>> >> On Thu, Oct 06, 2022 at 01:27:34AM -0700, Kees Cook wrote:
>> >> > The check_unsafe_exec() counting of n_fs would not add up under a heavily
>> >> > threaded process trying to perform a suid exec, causing the suid portion
>> >> > to fail. This counting error appears to be unneeded, but to catch any
>> >> > possible conditions, explicitly unshare fs_struct on exec, if it ends up
>> >>
>> >> Isn't this a potential uapi break? Afaict, before this change a call to
>> >> clone{3}(CLONE_FS) followed by an exec in the child would have the
>> >> parent and child share fs information. So if the child e.g., changes the
>> >> working directory post exec it would also affect the parent. But after
>> >> this change here this would no longer be true. So a child changing a
>> >> workding directoro would not affect the parent anymore. IOW, an exec is
>> >> accompanied by an unshare(CLONE_FS). Might still be worth trying ofc but
>> >> it seems like a non-trivial uapi change but there might be few users
>> >> that do clone{3}(CLONE_FS) followed by an exec.
>> >
>> >I believe the following code in Chromium explicitly relies on this
>> >behavior, but I'm not sure whether this code is in active use anymore:
>> >
>> >https://source.chromium.org/chromium/chromium/src/+/main:sandbox/linux/suid/sandbox.c;l=101?q=CLONE_FS&sq=&ss=chromium
>> 
>> Oh yes. I think I had tried to forget this existed. Ugh. Okay, so back to the drawing board, I guess. The counting will need to be fixed...
>> 
>> It's possible we can move the counting after dethread -- it seems the early count was just to avoid setting flags after the point of no return, but it's not an error condition...
>> 
>
> I landed here from git blame.
>
> I was looking at sanitizing shared fs vs suid handling, but the entire
> ordeal is so convoluted I'm confident the best way forward is to whack
> the problem to begin with.
>
> Per the above link, the notion of a shared fs struct across different
> processes is depended on so merely unsharing is a no-go.
>
> However, the shared state is only a problem for suid/sgid.
>
> Here is my proposal: *deny* exec of suid/sgid binaries if fs_struct is
> shared. This will have to be checked for after the execing proc becomes
> single-threaded ofc.
>
> While technically speaking this does introduce a change in behavior,
> there is precedent for doing it and seeing if anyone yells.
>
> With this in place there is no point maintainig ->in_exec or checking
> the flag.
>
> There is the known example of depending on shared fs_struct across exec.
> Hopefully there is no example of depending on execing a suid/sgid binary
> in such a setting -- it would be quite a weird setup given that for
> security reasons the perms must not be changed.
>
> The upshot of this method is that any breakage will be immediately
> visible in the form of a failed exec.
>
> Another route would be to do the mandatory unshare but only for
> suid/sgid, except that would have a hidden failure (if you will).
>
> Comments?

What is the problem that is trying to be fixed?

A uapi change to not allow sharing a fs_struct for processes that change
their cred on exec seems possible.

I said changing cred instead of suid/sgid because there are capabilities
and LSM labels that we probably want this to apply to as well.

I think such a limitation can be justified based upon having a shared
fs_struct is likely to allow confuse suid executables.


Earlier in the thread there was talk about the refcount for fs_struct.
I don't see that problem at the moment, and I don't see how dealing with
suid+sgid exectuables will have any bearing on how the refcount works.

Eric



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] fs/exec: Explicitly unshare fs_struct on exec
  2025-05-13 13:05         ` Mateusz Guzik
  2025-05-13 15:29           ` Eric W. Biederman
@ 2025-05-13 20:57           ` Kees Cook
  2025-05-13 21:09             ` Jann Horn
  1 sibling, 1 reply; 25+ messages in thread
From: Kees Cook @ 2025-05-13 20:57 UTC (permalink / raw)
  To: Mateusz Guzik, Kees Cook
  Cc: Jann Horn, Christian Brauner, Eric Biederman, Jorge Merlino,
	Alexander Viro, Thomas Gleixner, Andy Lutomirski,
	Sebastian Andrzej Siewior, Andrew Morton, linux-mm, linux-fsdevel,
	John Johansen, Paul Moore, James Morris, Serge E. Hallyn,
	Stephen Smalley, Eric Paris, Richard Haines, Casey Schaufler,
	Xin Long, David S. Miller, Todd Kjos, Ondrej Mosnacek,
	Prashanth Prahlad, Micah Morton, Fenghua Yu, Andrei Vagin,
	linux-kernel, apparmor, linux-security-module, selinux,
	linux-hardening, oleg



On May 13, 2025 6:05:45 AM PDT, Mateusz Guzik <mjguzik@gmail.com> wrote:
>On Thu, Oct 06, 2022 at 08:25:01AM -0700, Kees Cook wrote:
>> On October 6, 2022 7:13:37 AM PDT, Jann Horn <jannh@google.com> wrote:
>> >On Thu, Oct 6, 2022 at 11:05 AM Christian Brauner <brauner@kernel.org> wrote:
>> >> On Thu, Oct 06, 2022 at 01:27:34AM -0700, Kees Cook wrote:
>> >> > The check_unsafe_exec() counting of n_fs would not add up under a heavily
>> >> > threaded process trying to perform a suid exec, causing the suid portion
>> >> > to fail. This counting error appears to be unneeded, but to catch any
>> >> > possible conditions, explicitly unshare fs_struct on exec, if it ends up
>> >>
>> >> Isn't this a potential uapi break? Afaict, before this change a call to
>> >> clone{3}(CLONE_FS) followed by an exec in the child would have the
>> >> parent and child share fs information. So if the child e.g., changes the
>> >> working directory post exec it would also affect the parent. But after
>> >> this change here this would no longer be true. So a child changing a
>> >> workding directoro would not affect the parent anymore. IOW, an exec is
>> >> accompanied by an unshare(CLONE_FS). Might still be worth trying ofc but
>> >> it seems like a non-trivial uapi change but there might be few users
>> >> that do clone{3}(CLONE_FS) followed by an exec.
>> >
>> >I believe the following code in Chromium explicitly relies on this
>> >behavior, but I'm not sure whether this code is in active use anymore:
>> >
>> >https://source.chromium.org/chromium/chromium/src/+/main:sandbox/linux/suid/sandbox.c;l=101?q=CLONE_FS&sq=&ss=chromium
>> 
>> Oh yes. I think I had tried to forget this existed. Ugh. Okay, so back to the drawing board, I guess. The counting will need to be fixed...
>> 
>> It's possible we can move the counting after dethread -- it seems the early count was just to avoid setting flags after the point of no return, but it's not an error condition...
>> 
>
>I landed here from git blame.
>
>I was looking at sanitizing shared fs vs suid handling, but the entire
>ordeal is so convoluted I'm confident the best way forward is to whack
>the problem to begin with.
>
>Per the above link, the notion of a shared fs struct across different
>processes is depended on so merely unsharing is a no-go.
>
>However, the shared state is only a problem for suid/sgid.
>
>Here is my proposal: *deny* exec of suid/sgid binaries if fs_struct is
>shared. This will have to be checked for after the execing proc becomes
>single-threaded ofc.

Unfortunately the above Chrome helper is setuid and uses CLONE_FS.

But to echo what Eric asked: what problem are you trying to solve?

-Kees

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] fs/exec: Explicitly unshare fs_struct on exec
  2025-05-13 20:57           ` Kees Cook
@ 2025-05-13 21:09             ` Jann Horn
  2025-05-13 22:16               ` Eric W. Biederman
  2025-05-13 23:15               ` Kees Cook
  0 siblings, 2 replies; 25+ messages in thread
From: Jann Horn @ 2025-05-13 21:09 UTC (permalink / raw)
  To: Kees Cook
  Cc: Mateusz Guzik, Kees Cook, Christian Brauner, Eric Biederman,
	Jorge Merlino, Alexander Viro, Thomas Gleixner, Andy Lutomirski,
	Sebastian Andrzej Siewior, Andrew Morton, linux-mm, linux-fsdevel,
	John Johansen, Paul Moore, James Morris, Serge E. Hallyn,
	Stephen Smalley, Eric Paris, Richard Haines, Casey Schaufler,
	Xin Long, David S. Miller, Todd Kjos, Ondrej Mosnacek,
	Prashanth Prahlad, Micah Morton, Fenghua Yu, Andrei Vagin,
	linux-kernel, apparmor, linux-security-module, selinux,
	linux-hardening, oleg

On Tue, May 13, 2025 at 10:57 PM Kees Cook <kees@kernel.org> wrote:
> On May 13, 2025 6:05:45 AM PDT, Mateusz Guzik <mjguzik@gmail.com> wrote:
> >Here is my proposal: *deny* exec of suid/sgid binaries if fs_struct is
> >shared. This will have to be checked for after the execing proc becomes
> >single-threaded ofc.
>
> Unfortunately the above Chrome helper is setuid and uses CLONE_FS.

Chrome first launches a setuid helper, and then the setuid helper does
CLONE_FS. Mateusz's proposal would not impact this usecase.

Mateusz is proposing to block the case where a process first does
CLONE_FS, and *then* one of the processes sharing the fs_struct does a
setuid execve(). Linux already downgrades such an execve() to be
non-setuid, which probably means anyone trying to do this will get
hard-to-understand problems. Mateusz' proposal would just turn this
hard-to-debug edgecase, which already doesn't really work, into a
clean error; I think that is a nice improvement even just from the
UAPI standpoint.

If this change makes it possible to clean up the kernel code a bit, even better.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] fs/exec: Explicitly unshare fs_struct on exec
  2025-05-13 21:09             ` Jann Horn
@ 2025-05-13 22:16               ` Eric W. Biederman
  2025-05-14  0:03                 ` Mateusz Guzik
  2025-05-13 23:15               ` Kees Cook
  1 sibling, 1 reply; 25+ messages in thread
From: Eric W. Biederman @ 2025-05-13 22:16 UTC (permalink / raw)
  To: Jann Horn
  Cc: Kees Cook, Mateusz Guzik, Kees Cook, Christian Brauner,
	Jorge Merlino, Alexander Viro, Thomas Gleixner, Andy Lutomirski,
	Sebastian Andrzej Siewior, Andrew Morton, linux-mm, linux-fsdevel,
	John Johansen, Paul Moore, James Morris, Serge E. Hallyn,
	Stephen Smalley, Eric Paris, Richard Haines, Casey Schaufler,
	Xin Long, David S. Miller, Todd Kjos, Ondrej Mosnacek,
	Prashanth Prahlad, Micah Morton, Fenghua Yu, Andrei Vagin,
	linux-kernel, apparmor, linux-security-module, selinux,
	linux-hardening, oleg

Jann Horn <jannh@google.com> writes:

> On Tue, May 13, 2025 at 10:57 PM Kees Cook <kees@kernel.org> wrote:
>> On May 13, 2025 6:05:45 AM PDT, Mateusz Guzik <mjguzik@gmail.com> wrote:
>> >Here is my proposal: *deny* exec of suid/sgid binaries if fs_struct is
>> >shared. This will have to be checked for after the execing proc becomes
>> >single-threaded ofc.
>>
>> Unfortunately the above Chrome helper is setuid and uses CLONE_FS.
>
> Chrome first launches a setuid helper, and then the setuid helper does
> CLONE_FS. Mateusz's proposal would not impact this usecase.
>
> Mateusz is proposing to block the case where a process first does
> CLONE_FS, and *then* one of the processes sharing the fs_struct does a
> setuid execve(). Linux already downgrades such an execve() to be
> non-setuid, which probably means anyone trying to do this will get
> hard-to-understand problems. Mateusz' proposal would just turn this
> hard-to-debug edgecase, which already doesn't really work, into a
> clean error; I think that is a nice improvement even just from the
> UAPI standpoint.
>
> If this change makes it possible to clean up the kernel code a bit, even better.

What has brought this to everyone's attention just now?  This is
the second mention of this code path I have seen this week.

AKA: security/commoncap.c:cap_bprm_creds_from_file(...)
> ...
> 	/* Don't let someone trace a set[ug]id/setpcap binary with the revised
> 	 * credentials unless they have the appropriate permit.
> 	 *
> 	 * In addition, if NO_NEW_PRIVS, then ensure we get no new privs.
> 	 */
> 	is_setid = __is_setuid(new, old) || __is_setgid(new, old);
> 
> 	if ((is_setid || __cap_gained(permitted, new, old)) &&
> 	    ((bprm->unsafe & ~LSM_UNSAFE_PTRACE) ||
> 	     !ptracer_capable(current, new->user_ns))) {
> 		/* downgrade; they get no more than they had, and maybe less */
> 		if (!ns_capable(new->user_ns, CAP_SETUID) ||
> 		    (bprm->unsafe & LSM_UNSAFE_NO_NEW_PRIVS)) {
> 			new->euid = new->uid;
> 			new->egid = new->gid;
> 		}
> 		new->cap_permitted = cap_intersect(new->cap_permitted,
> 						   old->cap_permitted);
> 	}

The actual downgrade is because a ptrace'd executable also takes
this path.

I have seen it argued rather forcefully that continuing rather than
simply failing seems better in the ptrace case.

In general I think it can be said this policy is "safe".  AKA we don't
let a shared fs struct confuse privileged applications.  So nothing
to panic about.

It looks like most of the lsm's also test bprm->unsafe.

So I imagine someone could very carefully separate the non-ptrace case
from the ptrace case but *shrug*.

Perhaps:

 	if ((is_setid || __cap_gained(permitted, new_old)) &&
 	    ((bprm->unsafe & ~LSM_UNSAFE_PTRACE) ||
 	     !ptracer_capable(current, new->user_ns))) {
+		if (!(bprm->unsafe & LSM_UNSAFE_PTRACE)) {
+			return -EPERM;
+		}
  		/* downgrade; they get no more than they had, and maybe less */
  		if (!ns_capable(new->user_ns, CAP_SETUID) ||
  		    (bprm->unsafe & LSM_UNSAFE_NO_NEW_PRIVS)) {
  			new->euid = new->uid;
  			new->egid = new->gid;
  		}
  		new->cap_permitted = cap_intersect(new->cap_permitted,
  						   old->cap_permitted);
         }

If that is what you want that doesn't look to scary.  I don't think
it simplifies anything about fs->in_exec.  As fs->in_exec is set when
the processing calling exec is the only process that owns the fs_struct.
With fs->in_exec just being a flag that doesn't allow another thread
to call fork and start sharing the fs_struct during exec.

*Shrug*

I don't see why anyone would care.  It is just a very silly corner case.

Eric

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] fs/exec: Explicitly unshare fs_struct on exec
  2025-05-13 22:16               ` Eric W. Biederman
@ 2025-05-14  0:03                 ` Mateusz Guzik
  2025-05-14 15:33                   ` Eric W. Biederman
  2025-05-14 15:42                   ` Kees Cook
  0 siblings, 2 replies; 25+ messages in thread
From: Mateusz Guzik @ 2025-05-14  0:03 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Jann Horn, Kees Cook, Kees Cook, Christian Brauner, Jorge Merlino,
	Alexander Viro, Thomas Gleixner, Andy Lutomirski,
	Sebastian Andrzej Siewior, Andrew Morton, linux-mm, linux-fsdevel,
	John Johansen, Paul Moore, James Morris, Serge E. Hallyn,
	Stephen Smalley, Eric Paris, Richard Haines, Casey Schaufler,
	Xin Long, David S. Miller, Todd Kjos, Ondrej Mosnacek,
	Prashanth Prahlad, Micah Morton, Fenghua Yu, Andrei Vagin,
	linux-kernel, apparmor, linux-security-module, selinux,
	linux-hardening, oleg

On Wed, May 14, 2025 at 12:17 AM Eric W. Biederman
<ebiederm@xmission.com> wrote:
>
> Jann Horn <jannh@google.com> writes:
>
> > On Tue, May 13, 2025 at 10:57 PM Kees Cook <kees@kernel.org> wrote:
> >> On May 13, 2025 6:05:45 AM PDT, Mateusz Guzik <mjguzik@gmail.com> wrote:
> >> >Here is my proposal: *deny* exec of suid/sgid binaries if fs_struct is
> >> >shared. This will have to be checked for after the execing proc becomes
> >> >single-threaded ofc.
> >>
> >> Unfortunately the above Chrome helper is setuid and uses CLONE_FS.
> >
> > Chrome first launches a setuid helper, and then the setuid helper does
> > CLONE_FS. Mateusz's proposal would not impact this usecase.
> >
> > Mateusz is proposing to block the case where a process first does
> > CLONE_FS, and *then* one of the processes sharing the fs_struct does a
> > setuid execve(). Linux already downgrades such an execve() to be
> > non-setuid, which probably means anyone trying to do this will get
> > hard-to-understand problems. Mateusz' proposal would just turn this
> > hard-to-debug edgecase, which already doesn't really work, into a
> > clean error; I think that is a nice improvement even just from the
> > UAPI standpoint.
> >
> > If this change makes it possible to clean up the kernel code a bit, even better.
>
> What has brought this to everyone's attention just now?  This is
> the second mention of this code path I have seen this week.
>

There is a syzkaller report concerning ->in_exec handling, for example:
https://lore.kernel.org/all/67dc67f0.050a0220.25ae54.001f.GAE@google.com/#t

> AKA: security/commoncap.c:cap_bprm_creds_from_file(...)
> > ...
> >       /* Don't let someone trace a set[ug]id/setpcap binary with the revised
> >        * credentials unless they have the appropriate permit.
> >        *
> >        * In addition, if NO_NEW_PRIVS, then ensure we get no new privs.
> >        */
> >       is_setid = __is_setuid(new, old) || __is_setgid(new, old);
> >
> >       if ((is_setid || __cap_gained(permitted, new, old)) &&
> >           ((bprm->unsafe & ~LSM_UNSAFE_PTRACE) ||
> >            !ptracer_capable(current, new->user_ns))) {
> >               /* downgrade; they get no more than they had, and maybe less */
> >               if (!ns_capable(new->user_ns, CAP_SETUID) ||
> >                   (bprm->unsafe & LSM_UNSAFE_NO_NEW_PRIVS)) {
> >                       new->euid = new->uid;
> >                       new->egid = new->gid;
> >               }
> >               new->cap_permitted = cap_intersect(new->cap_permitted,
> >                                                  old->cap_permitted);
> >       }
>
> The actual downgrade is because a ptrace'd executable also takes
> this path.
>
> I have seen it argued rather forcefully that continuing rather than
> simply failing seems better in the ptrace case.
>
> In general I think it can be said this policy is "safe".  AKA we don't
> let a shared fs struct confuse privileged applications.  So nothing
> to panic about.
>
> It looks like most of the lsm's also test bprm->unsafe.
>
> So I imagine someone could very carefully separate the non-ptrace case
> from the ptrace case but *shrug*.
>
> Perhaps:
>
>         if ((is_setid || __cap_gained(permitted, new_old)) &&
>             ((bprm->unsafe & ~LSM_UNSAFE_PTRACE) ||
>              !ptracer_capable(current, new->user_ns))) {
> +               if (!(bprm->unsafe & LSM_UNSAFE_PTRACE)) {
> +                       return -EPERM;
> +               }
>                 /* downgrade; they get no more than they had, and maybe less */
>                 if (!ns_capable(new->user_ns, CAP_SETUID) ||
>                     (bprm->unsafe & LSM_UNSAFE_NO_NEW_PRIVS)) {
>                         new->euid = new->uid;
>                         new->egid = new->gid;
>                 }
>                 new->cap_permitted = cap_intersect(new->cap_permitted,
>                                                    old->cap_permitted);
>          }
>
> If that is what you want that doesn't look to scary.  I don't think
> it simplifies anything about fs->in_exec.  As fs->in_exec is set when
> the processing calling exec is the only process that owns the fs_struct.
> With fs->in_exec just being a flag that doesn't allow another thread
> to call fork and start sharing the fs_struct during exec.
>
> *Shrug*
>
> I don't see why anyone would care.  It is just a very silly corner case.

Well I don't see how ptrace factors into any of this, apart from being
a different case of ignoring suid/sgid.

I can agree the suid/sgid situation vs CLONE_FS is a silly corner
case, but one which needs to be handled for security reasons and which
currently has weirdly convoluted code to do it.

The intent behind my proposal is very much to get the crapper out of
the way in a future-proof and simple manner.

In check_unsafe_exec() you can find a nasty loop over threads in the
group to find out if the fs struct is used by anyone outside of said
group. Since fs struct users are not explicitly tracked and any of
them can have different creds than the current thread, the kernel opts
to ignore suid/sgid if there are extra users found (for security
reasons). The loop depends on no new threads showing up as the list is
being walked, to that end copy_fs() can transiently return an error if
it spots ->in_exec.

The >in_exec field is used as a boolean/flag, but parallel execs using
the same fs struct from different thread groups don't look serialized.
This is supposed to be fine as in this case ->in_exec is not getting
set to begin with, but it gets unconditionally unset on all execs.

And so on. It's all weird af.

Initially I was thinking about serializing all execs using a given
fs_struct to bypass majority of the fuckery, but that's some churn to
add and it still leaves possible breakage down the road -- should this
unsafe sharing detection ever become racing nobody will find out until
the bad guys have their turn with it.

While unconditional unsharing turns out to be a no-go because of
chrome, one can still do postpone detection until after the caller is
single-threaded. By that time, if this is only the that thread and
fs_struct has ->users == 1, we know there is nobody sharing the struct
or racing to add a ref to it. This allows treating ->users as a
regular refcount, removes the weird loop over threads and removes the
(at best misleading) ->in_exec field.

With this in place it becomes trivial to also *deny* suid/sgid exec
instead of trying to placate it. If you are sharing fs and are execing
a binary in the first place, things are a little fishy. But if you are
execing a suid/sgid, the kernel has to ignore the bit so either you
are doing something wrong or are trying to exploit a bug. In order to
sort this crapper out, I think one can start with a runtime tunable
and a once-per-boot printk stating it denied such an exec (and stating
how to bring it back). To be removed some time after hitting LTS
perhaps.

-- 
Mateusz Guzik <mjguzik gmail.com>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] fs/exec: Explicitly unshare fs_struct on exec
  2025-05-14  0:03                 ` Mateusz Guzik
@ 2025-05-14 15:33                   ` Eric W. Biederman
  2025-05-14 15:42                   ` Kees Cook
  1 sibling, 0 replies; 25+ messages in thread
From: Eric W. Biederman @ 2025-05-14 15:33 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: Jann Horn, Kees Cook, Kees Cook, Christian Brauner, Jorge Merlino,
	Alexander Viro, Thomas Gleixner, Andy Lutomirski,
	Sebastian Andrzej Siewior, Andrew Morton, linux-mm, linux-fsdevel,
	John Johansen, Paul Moore, James Morris, Serge E. Hallyn,
	Stephen Smalley, Eric Paris, Richard Haines, Casey Schaufler,
	Xin Long, David S. Miller, Todd Kjos, Ondrej Mosnacek,
	Prashanth Prahlad, Micah Morton, Fenghua Yu, Andrei Vagin,
	linux-kernel, apparmor, linux-security-module, selinux,
	linux-hardening, oleg

Mateusz Guzik <mjguzik@gmail.com> writes:

> On Wed, May 14, 2025 at 12:17 AM Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>>
>> Jann Horn <jannh@google.com> writes:
>>
>> > On Tue, May 13, 2025 at 10:57 PM Kees Cook <kees@kernel.org> wrote:
>> >> On May 13, 2025 6:05:45 AM PDT, Mateusz Guzik <mjguzik@gmail.com> wrote:
>> >> >Here is my proposal: *deny* exec of suid/sgid binaries if fs_struct is
>> >> >shared. This will have to be checked for after the execing proc becomes
>> >> >single-threaded ofc.
>> >>
>> >> Unfortunately the above Chrome helper is setuid and uses CLONE_FS.
>> >
>> > Chrome first launches a setuid helper, and then the setuid helper does
>> > CLONE_FS. Mateusz's proposal would not impact this usecase.
>> >
>> > Mateusz is proposing to block the case where a process first does
>> > CLONE_FS, and *then* one of the processes sharing the fs_struct does a
>> > setuid execve(). Linux already downgrades such an execve() to be
>> > non-setuid, which probably means anyone trying to do this will get
>> > hard-to-understand problems. Mateusz' proposal would just turn this
>> > hard-to-debug edgecase, which already doesn't really work, into a
>> > clean error; I think that is a nice improvement even just from the
>> > UAPI standpoint.
>> >
>> > If this change makes it possible to clean up the kernel code a bit, even better.
>>
>> What has brought this to everyone's attention just now?  This is
>> the second mention of this code path I have seen this week.
>>
>
> There is a syzkaller report concerning ->in_exec handling, for example:
> https://lore.kernel.org/all/67dc67f0.050a0220.25ae54.001f.GAE@google.com/#t

Thanks that grounds the conversation.

I am trying to think how there could be a real race (and not a
theoretical race) between copy_fs and bprm_execve.

Unless I am missing something the write in bprm_execve to
current->fs->in_exec happens after the point of no-return
and de_thread.

Unless the exec fails without getting to de_thread, or
we are in the multi-threaded case.

So I think it would be sensible to do something like:

	/* Clear in_exec if we set it */
	if (!(bprm->unsafe & LSM_UNSAFE_SHARE)) {
		spin_lock(&current->fs->lock);
        	current->fs->in_exec = 0;
	        spin_unlock(&current->fs->lock);
        }

I expect that will cause KCSAN to be quiet as well as fixing
any real races that might happen when exec fails.

In all but the most extreme cases there should be no contention
on the lock so I don't see a downside of taking the fs->lock.

>> AKA: security/commoncap.c:cap_bprm_creds_from_file(...)
>> > ...
>> >       /* Don't let someone trace a set[ug]id/setpcap binary with the revised
>> >        * credentials unless they have the appropriate permit.
>> >        *
>> >        * In addition, if NO_NEW_PRIVS, then ensure we get no new privs.
>> >        */
>> >       is_setid = __is_setuid(new, old) || __is_setgid(new, old);
>> >
>> >       if ((is_setid || __cap_gained(permitted, new, old)) &&
>> >           ((bprm->unsafe & ~LSM_UNSAFE_PTRACE) ||
>> >            !ptracer_capable(current, new->user_ns))) {
>> >               /* downgrade; they get no more than they had, and maybe less */
>> >               if (!ns_capable(new->user_ns, CAP_SETUID) ||
>> >                   (bprm->unsafe & LSM_UNSAFE_NO_NEW_PRIVS)) {
>> >                       new->euid = new->uid;
>> >                       new->egid = new->gid;
>> >               }
>> >               new->cap_permitted = cap_intersect(new->cap_permitted,
>> >                                                  old->cap_permitted);
>> >       }
>>
>> The actual downgrade is because a ptrace'd executable also takes
>> this path.
>>
>> I have seen it argued rather forcefully that continuing rather than
>> simply failing seems better in the ptrace case.
>>
>> In general I think it can be said this policy is "safe".  AKA we don't
>> let a shared fs struct confuse privileged applications.  So nothing
>> to panic about.
>>
>> It looks like most of the lsm's also test bprm->unsafe.
>>
>> So I imagine someone could very carefully separate the non-ptrace case
>> from the ptrace case but *shrug*.
>>
>> Perhaps:
>>
>>         if ((is_setid || __cap_gained(permitted, new_old)) &&
>>             ((bprm->unsafe & ~LSM_UNSAFE_PTRACE) ||
>>              !ptracer_capable(current, new->user_ns))) {
>> +               if (!(bprm->unsafe & LSM_UNSAFE_PTRACE)) {
>> +                       return -EPERM;
>> +               }
>>                 /* downgrade; they get no more than they had, and maybe less */
>>                 if (!ns_capable(new->user_ns, CAP_SETUID) ||
>>                     (bprm->unsafe & LSM_UNSAFE_NO_NEW_PRIVS)) {
>>                         new->euid = new->uid;
>>                         new->egid = new->gid;
>>                 }
>>                 new->cap_permitted = cap_intersect(new->cap_permitted,
>>                                                    old->cap_permitted);
>>          }
>>
>> If that is what you want that doesn't look to scary.  I don't think
>> it simplifies anything about fs->in_exec.  As fs->in_exec is set when
>> the processing calling exec is the only process that owns the fs_struct.
>> With fs->in_exec just being a flag that doesn't allow another thread
>> to call fork and start sharing the fs_struct during exec.
>>
>> *Shrug*
>>
>> I don't see why anyone would care.  It is just a very silly corner case.
>
> Well I don't see how ptrace factors into any of this, apart from being
> a different case of ignoring suid/sgid.

It is simple.  Ptrace is the case for ignoring suid/sgid, and since
that is the only case that is likely to happen in practice, and the
only case people actually care about everything all of the other cases
just got lumped on to ptrace case.

> I can agree the suid/sgid situation vs CLONE_FS is a silly corner
> case, but one which needs to be handled for security reasons and which
> currently has weirdly convoluted code to do it.
>
> The intent behind my proposal is very much to get the crapper out of
> the way in a future-proof and simple manner.
>
> In check_unsafe_exec() you can find a nasty loop over threads in the
> group to find out if the fs struct is used by anyone outside of said
> group. Since fs struct users are not explicitly tracked and any of
> them can have different creds than the current thread, the kernel opts
> to ignore suid/sgid if there are extra users found (for security
> reasons). The loop depends on no new threads showing up as the list is
> being walked, to that end copy_fs() can transiently return an error if
> it spots ->in_exec.
>
> The >in_exec field is used as a boolean/flag, but parallel execs using
> the same fs struct from different thread groups don't look serialized.
> This is supposed to be fine as in this case ->in_exec is not getting
> set to begin with, but it gets unconditionally unset on all execs.
>
> And so on. It's all weird af.
>
> Initially I was thinking about serializing all execs using a given
> fs_struct to bypass majority of the fuckery, but that's some churn to
> add and it still leaves possible breakage down the road -- should this
> unsafe sharing detection ever become racing nobody will find out until
> the bad guys have their turn with it.
>
> While unconditional unsharing turns out to be a no-go because of
> chrome, one can still do postpone detection until after the caller is
> single-threaded. By that time, if this is only the that thread and
> fs_struct has ->users == 1, we know there is nobody sharing the struct
> or racing to add a ref to it. This allows treating ->users as a
> regular refcount, removes the weird loop over threads and removes the
> (at best misleading) ->in_exec field.

The problem is that after becoming single threaded there is no way to
fail the exec.  The process calling exec must be terminated, or
you drop privs as we currently do.

That terminating the process that calls exec absolutely sucks for
debug-ability, plus you are changing where the work happens which
means auditing all of the security hooks, which is pain.

So I really don't think we want to perform the test after de_thread.

> With this in place it becomes trivial to also *deny* suid/sgid exec
> instead of trying to placate it. If you are sharing fs and are execing
> a binary in the first place, things are a little fishy. But if you are
> execing a suid/sgid, the kernel has to ignore the bit so either you
> are doing something wrong or are trying to exploit a bug. In order to
> sort this crapper out, I think one can start with a runtime tunable
> and a once-per-boot printk stating it denied such an exec (and stating
> how to bring it back). To be removed some time after hitting LTS
> perhaps.

There is a potential cleanup here that is worth exploring.

Not allowing clone(CLONE_THREAD | CLONE_FS).  AKA forcing
all threads to share an fs_struct.

If we can do that then the reference to fs_struct can live in
signal_struct (potentially duplicated in current for performance).

Still that doesn't let us remove fs->in_exec.


Hmm..


The real issue is in_exec, and the convolutions it requires to maintain
it.

The simplest thing I can think of that doesn't meaningfully change
user-space behavior is to make any other calls to clone/fork and execve
fail with -EAGAIN, while our current execve is running.

In practice we already do that with clone/fork in copy_fs so we don't
need to worry about that case introducing regressions.

I can't see a meaningful use for racing calls of execve, so that
should not be a big deal.

That prevents any races with clearing the state if only one thread of a
process can be in execve at a time.

It would allow replacing both current->in_execve and
current->fs->in_exec.

We would probably use the sighand lock to protect whatever flag we use
in signal_struct.  We can't use signal->flags as we are going to want
to keep this separate from the de_thread and group exit logic.

....

Those are the directions I would suggest exploring if you want simplicity.

Eric

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] fs/exec: Explicitly unshare fs_struct on exec
  2025-05-14  0:03                 ` Mateusz Guzik
  2025-05-14 15:33                   ` Eric W. Biederman
@ 2025-05-14 15:42                   ` Kees Cook
  2025-05-15 16:48                     ` Eric W. Biederman
  1 sibling, 1 reply; 25+ messages in thread
From: Kees Cook @ 2025-05-14 15:42 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: Eric W. Biederman, Jann Horn, Christian Brauner, Jorge Merlino,
	Alexander Viro, Thomas Gleixner, Andy Lutomirski,
	Sebastian Andrzej Siewior, Andrew Morton, linux-mm, linux-fsdevel,
	John Johansen, Paul Moore, James Morris, Serge E. Hallyn,
	Stephen Smalley, Eric Paris, Richard Haines, Casey Schaufler,
	Xin Long, David S. Miller, Todd Kjos, Ondrej Mosnacek,
	Prashanth Prahlad, Micah Morton, Fenghua Yu, Andrei Vagin,
	linux-kernel, apparmor, linux-security-module, selinux,
	linux-hardening, oleg

On Wed, May 14, 2025 at 02:03:31AM +0200, Mateusz Guzik wrote:
> On Wed, May 14, 2025 at 12:17 AM Eric W. Biederman
> <ebiederm@xmission.com> wrote:
> >
> > Jann Horn <jannh@google.com> writes:
> >
> > > On Tue, May 13, 2025 at 10:57 PM Kees Cook <kees@kernel.org> wrote:
> > >> On May 13, 2025 6:05:45 AM PDT, Mateusz Guzik <mjguzik@gmail.com> wrote:
> > >> >Here is my proposal: *deny* exec of suid/sgid binaries if fs_struct is
> > >> >shared. This will have to be checked for after the execing proc becomes
> > >> >single-threaded ofc.
> > >>
> > >> Unfortunately the above Chrome helper is setuid and uses CLONE_FS.
> > >
> > > Chrome first launches a setuid helper, and then the setuid helper does
> > > CLONE_FS. Mateusz's proposal would not impact this usecase.
> > >
> > > Mateusz is proposing to block the case where a process first does
> > > CLONE_FS, and *then* one of the processes sharing the fs_struct does a
> > > setuid execve(). Linux already downgrades such an execve() to be
> > > non-setuid, which probably means anyone trying to do this will get
> > > hard-to-understand problems. Mateusz' proposal would just turn this
> > > hard-to-debug edgecase, which already doesn't really work, into a
> > > clean error; I think that is a nice improvement even just from the
> > > UAPI standpoint.
> > >
> > > If this change makes it possible to clean up the kernel code a bit, even better.
> >
> > What has brought this to everyone's attention just now?  This is
> > the second mention of this code path I have seen this week.
> >
> 
> There is a syzkaller report concerning ->in_exec handling, for example:
> https://lore.kernel.org/all/67dc67f0.050a0220.25ae54.001f.GAE@google.com/#t
>
> [...]
> > It looks like most of the lsm's also test bprm->unsafe.
> >
> > So I imagine someone could very carefully separate the non-ptrace case
> > from the ptrace case but *shrug*.
> >
> > Perhaps:
> >
> >         if ((is_setid || __cap_gained(permitted, new_old)) &&
> >             ((bprm->unsafe & ~LSM_UNSAFE_PTRACE) ||
> >              !ptracer_capable(current, new->user_ns))) {
> > +               if (!(bprm->unsafe & LSM_UNSAFE_PTRACE)) {
> > +                       return -EPERM;
> > +               }
> >                 /* downgrade; they get no more than they had, and maybe less */
> >                 if (!ns_capable(new->user_ns, CAP_SETUID) ||
> >                     (bprm->unsafe & LSM_UNSAFE_NO_NEW_PRIVS)) {
> >                         new->euid = new->uid;
> >                         new->egid = new->gid;
> >                 }
> >                 new->cap_permitted = cap_intersect(new->cap_permitted,
> >                                                    old->cap_permitted);
> >          }
> >
> > If that is what you want that doesn't look to scary.  I don't think
> > it simplifies anything about fs->in_exec.  As fs->in_exec is set when
> > the processing calling exec is the only process that owns the fs_struct.
> > With fs->in_exec just being a flag that doesn't allow another thread
> > to call fork and start sharing the fs_struct during exec.
> >
> > *Shrug*
> >
> > I don't see why anyone would care.  It is just a very silly corner case.
> 
> Well I don't see how ptrace factors into any of this, apart from being
> a different case of ignoring suid/sgid.

I actually think we might want to expand the above bit of logic to use
an explicit tests of each LSM_UNSAFE case -- the merged
logic is very difficult to read currently. Totally untested expansion,
if I'm reading everything correctly:

	if (bprm->unsafe &&
	    (is_setid || __cap_gained(permitted, new_old))) {
		bool limit_caps = false;
		bool strip_eid = false;
		unsigned int unsafe = bprm->unsafe;
		/* Check each bit */

		if (unsafe & LSM_UNSAFE_PTRACE) {
			if (!ptracer_capable(current, new->user_ns))
				limit_caps = true;
			unsafe &= ~LSM_UNSAFE_PTRACE;
		}
		if (unsafe & LSM_UNSAFE_SHARE) {
			limit_caps = true;
			if (!ns_capable(new->user_ns, CAP_SETUID))
				strip_eid = true;
			unsafe &= ~LSM_UNSAFE_SHARE;
		}
		if (unsafe & LSM_UNSAFE_NO_NEW_PRIVS) {
			limit_caps = true;
			if (!ns_capable(new->user_ns, CAP_SETUID))
				strip_eid = true;
			unsafe &= ~LSM_UNSAFE_NO_NEW_PRIVS;
		}

		if (WARN(unsafe, "Unhandled LSM_UNSAFE flag: %u?!\n", unsafe))
			return -EINVAL;

		if (limit_caps) {
			new->cap_permitted = cap_intersect(new->cap_permitted,
							   old->cap_permitted);
		}
		if (strip_eid) {
			new->euid = new->uid;
			new->egid = new->gid;
		}
	}

> I can agree the suid/sgid situation vs CLONE_FS is a silly corner
> case, but one which needs to be handled for security reasons and which
> currently has weirdly convoluted code to do it.
> 
> The intent behind my proposal is very much to get the crapper out of
> the way in a future-proof and simple manner.
> 
> In check_unsafe_exec() you can find a nasty loop over threads in the
> group to find out if the fs struct is used by anyone outside of said
> group. Since fs struct users are not explicitly tracked and any of
> them can have different creds than the current thread, the kernel opts
> to ignore suid/sgid if there are extra users found (for security
> reasons). The loop depends on no new threads showing up as the list is
> being walked, to that end copy_fs() can transiently return an error if
> it spots ->in_exec.
> 
> The >in_exec field is used as a boolean/flag, but parallel execs using
> the same fs struct from different thread groups don't look serialized.
> This is supposed to be fine as in this case ->in_exec is not getting
> set to begin with, but it gets unconditionally unset on all execs.
> 
> And so on. It's all weird af.

100% :)


-- 
Kees Cook

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] fs/exec: Explicitly unshare fs_struct on exec
  2025-05-14 15:42                   ` Kees Cook
@ 2025-05-15 16:48                     ` Eric W. Biederman
  0 siblings, 0 replies; 25+ messages in thread
From: Eric W. Biederman @ 2025-05-15 16:48 UTC (permalink / raw)
  To: Kees Cook
  Cc: Mateusz Guzik, Jann Horn, Christian Brauner, Jorge Merlino,
	Alexander Viro, Thomas Gleixner, Andy Lutomirski,
	Sebastian Andrzej Siewior, Andrew Morton, linux-mm, linux-fsdevel,
	John Johansen, Paul Moore, James Morris, Serge E. Hallyn,
	Stephen Smalley, Eric Paris, Richard Haines, Casey Schaufler,
	Xin Long, David S. Miller, Todd Kjos, Ondrej Mosnacek,
	Prashanth Prahlad, Micah Morton, Fenghua Yu, Andrei Vagin,
	linux-kernel, apparmor, linux-security-module, selinux,
	linux-hardening, oleg

Kees Cook <kees@kernel.org> writes:

> On Wed, May 14, 2025 at 02:03:31AM +0200, Mateusz Guzik wrote:
>> On Wed, May 14, 2025 at 12:17 AM Eric W. Biederman
>> <ebiederm@xmission.com> wrote:
>> >
>> > Jann Horn <jannh@google.com> writes:
>> >
>> > > On Tue, May 13, 2025 at 10:57 PM Kees Cook <kees@kernel.org> wrote:
>> > >> On May 13, 2025 6:05:45 AM PDT, Mateusz Guzik <mjguzik@gmail.com> wrote:
>> > >> >Here is my proposal: *deny* exec of suid/sgid binaries if fs_struct is
>> > >> >shared. This will have to be checked for after the execing proc becomes
>> > >> >single-threaded ofc.
>> > >>
>> > >> Unfortunately the above Chrome helper is setuid and uses CLONE_FS.
>> > >
>> > > Chrome first launches a setuid helper, and then the setuid helper does
>> > > CLONE_FS. Mateusz's proposal would not impact this usecase.
>> > >
>> > > Mateusz is proposing to block the case where a process first does
>> > > CLONE_FS, and *then* one of the processes sharing the fs_struct does a
>> > > setuid execve(). Linux already downgrades such an execve() to be
>> > > non-setuid, which probably means anyone trying to do this will get
>> > > hard-to-understand problems. Mateusz' proposal would just turn this
>> > > hard-to-debug edgecase, which already doesn't really work, into a
>> > > clean error; I think that is a nice improvement even just from the
>> > > UAPI standpoint.
>> > >
>> > > If this change makes it possible to clean up the kernel code a bit, even better.
>> >
>> > What has brought this to everyone's attention just now?  This is
>> > the second mention of this code path I have seen this week.
>> >
>> 
>> There is a syzkaller report concerning ->in_exec handling, for example:
>> https://lore.kernel.org/all/67dc67f0.050a0220.25ae54.001f.GAE@google.com/#t
>>
>> [...]
>> > It looks like most of the lsm's also test bprm->unsafe.
>> >
>> > So I imagine someone could very carefully separate the non-ptrace case
>> > from the ptrace case but *shrug*.
>> >
>> > Perhaps:
>> >
>> >         if ((is_setid || __cap_gained(permitted, new_old)) &&
>> >             ((bprm->unsafe & ~LSM_UNSAFE_PTRACE) ||
>> >              !ptracer_capable(current, new->user_ns))) {
>> > +               if (!(bprm->unsafe & LSM_UNSAFE_PTRACE)) {
>> > +                       return -EPERM;
>> > +               }
>> >                 /* downgrade; they get no more than they had, and maybe less */
>> >                 if (!ns_capable(new->user_ns, CAP_SETUID) ||
>> >                     (bprm->unsafe & LSM_UNSAFE_NO_NEW_PRIVS)) {
>> >                         new->euid = new->uid;
>> >                         new->egid = new->gid;
>> >                 }
>> >                 new->cap_permitted = cap_intersect(new->cap_permitted,
>> >                                                    old->cap_permitted);
>> >          }
>> >
>> > If that is what you want that doesn't look to scary.  I don't think
>> > it simplifies anything about fs->in_exec.  As fs->in_exec is set when
>> > the processing calling exec is the only process that owns the fs_struct.
>> > With fs->in_exec just being a flag that doesn't allow another thread
>> > to call fork and start sharing the fs_struct during exec.
>> >
>> > *Shrug*
>> >
>> > I don't see why anyone would care.  It is just a very silly corner case.
>> 
>> Well I don't see how ptrace factors into any of this, apart from being
>> a different case of ignoring suid/sgid.
>
> I actually think we might want to expand the above bit of logic to use
> an explicit tests of each LSM_UNSAFE case -- the merged
> logic is very difficult to read currently. Totally untested expansion,
> if I'm reading everything correctly:
>
> 	if (bprm->unsafe &&
> 	    (is_setid || __cap_gained(permitted, new_old))) {
> 		bool limit_caps = false;
> 		bool strip_eid = false;
> 		unsigned int unsafe = bprm->unsafe;
> 		/* Check each bit */
>
> 		if (unsafe & LSM_UNSAFE_PTRACE) {
> 			if (!ptracer_capable(current, new->user_ns))
> 				limit_caps = true;
                                strip_eid  = true;
You missed the euid stripping there.
> 			unsafe &= ~LSM_UNSAFE_PTRACE;
> 		}
> 		if (unsafe & LSM_UNSAFE_SHARE) {
> 			limit_caps = true;
> 			if (!ns_capable(new->user_ns, CAP_SETUID))
> 				strip_eid = true;
> 			unsafe &= ~LSM_UNSAFE_SHARE;
> 		}
> 		if (unsafe & LSM_UNSAFE_NO_NEW_PRIVS) {
> 			limit_caps = true;
> 			if (!ns_capable(new->user_ns, CAP_SETUID))
> 				strip_eid = true;
> 			unsafe &= ~LSM_UNSAFE_NO_NEW_PRIVS;
> 		}
>
> 		if (WARN(unsafe, "Unhandled LSM_UNSAFE flag: %u?!\n", unsafe))
> 			return -EINVAL;
>
> 		if (limit_caps) {
> 			new->cap_permitted = cap_intersect(new->cap_permitted,
> 							   old->cap_permitted);
> 		}
> 		if (strip_eid) {
> 			new->euid = new->uid;
> 			new->egid = new->gid;
> 		}
> 	}


I think I would simplify this all to:

	if ((id_changed || __cap_gained(permitted, new, old)) &&
            !ptracer_capable(current->new_user_ns)) {
        	if (!ns_capable(new->user_ns, CAP_SETUID)) {
                	new->euid = old->euid;
                        new->egid = old->egid;
                }
                new->cap_permitted = cap_intersect(new->cap_permitted,
                				   old->cap_permitted);
        }
        if ((id_changed || __cap_gained(permitted, new, old)) &&
            (bprm->unsafe & ~LSM_UNSAFE_PTRACE)) {
        	return -EPERM;
        }

The code of no_new_privs doesn't prevent capset so ignoring the results
of ns_capable when NO_NEW_PRIVS is set doesn't make sense.

If we are going to do anything please let's find ways to understand
what is happening and simplify this code, not add to it's complexity.

Eric

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] fs/exec: Explicitly unshare fs_struct on exec
  2025-05-13 21:09             ` Jann Horn
  2025-05-13 22:16               ` Eric W. Biederman
@ 2025-05-13 23:15               ` Kees Cook
  1 sibling, 0 replies; 25+ messages in thread
From: Kees Cook @ 2025-05-13 23:15 UTC (permalink / raw)
  To: Jann Horn
  Cc: Mateusz Guzik, Kees Cook, Christian Brauner, Eric Biederman,
	Jorge Merlino, Alexander Viro, Thomas Gleixner, Andy Lutomirski,
	Sebastian Andrzej Siewior, Andrew Morton, linux-mm, linux-fsdevel,
	John Johansen, Paul Moore, James Morris, Serge E. Hallyn,
	Stephen Smalley, Eric Paris, Richard Haines, Casey Schaufler,
	Xin Long, David S. Miller, Todd Kjos, Ondrej Mosnacek,
	Prashanth Prahlad, Micah Morton, Fenghua Yu, Andrei Vagin,
	linux-kernel, apparmor, linux-security-module, selinux,
	linux-hardening, oleg



On May 13, 2025 2:09:48 PM PDT, Jann Horn <jannh@google.com> wrote:
>On Tue, May 13, 2025 at 10:57 PM Kees Cook <kees@kernel.org> wrote:
>> On May 13, 2025 6:05:45 AM PDT, Mateusz Guzik <mjguzik@gmail.com> wrote:
>> >Here is my proposal: *deny* exec of suid/sgid binaries if fs_struct is
>> >shared. This will have to be checked for after the execing proc becomes
>> >single-threaded ofc.
>>
>> Unfortunately the above Chrome helper is setuid and uses CLONE_FS.
>
>Chrome first launches a setuid helper, and then the setuid helper does
>CLONE_FS. Mateusz's proposal would not impact this usecase.
>
>Mateusz is proposing to block the case where a process first does
>CLONE_FS, and *then* one of the processes sharing the fs_struct does a
>setuid execve(). Linux already downgrades such an execve() to be
>non-setuid, which probably means anyone trying to do this will get
>hard-to-understand problems. Mateusz' proposal would just turn this
>hard-to-debug edgecase, which already doesn't really work, into a
>clean error; I think that is a nice improvement even just from the
>UAPI standpoint.
>
>If this change makes it possible to clean up the kernel code a bit, even better.


Ah! Okay, I appreciate the clarification. :) I'm game to try making it an error instead of silent downgrading.

-Kees


-- 
Kees Cook
-- 
Kees Cook

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] fs/exec: Explicitly unshare fs_struct on exec
  2022-10-06 14:13     ` Jann Horn
  2022-10-06 15:25       ` Kees Cook
@ 2022-10-14  3:18       ` Andy Lutomirski
  2022-10-14  3:54         ` Kees Cook
                           ` (2 more replies)
  1 sibling, 3 replies; 25+ messages in thread
From: Andy Lutomirski @ 2022-10-14  3:18 UTC (permalink / raw)
  To: Jann Horn, Christian Brauner
  Cc: Kees Cook, Eric W. Biederman, Jorge Merlino, Al Viro,
	Thomas Gleixner, Sebastian Andrzej Siewior, Andrew Morton,
	linux-mm, linux-fsdevel, John Johansen, Paul Moore, James Morris,
	Serge E. Hallyn, Stephen Smalley, Eric Paris, Richard Haines,
	Casey Schaufler, Xin Long, David S. Miller, Todd Kjos,
	Ondrej Mosnacek, Prashanth Prahlad, Micah Morton, Fenghua Yu,
	Andrei Vagin, Linux Kernel Mailing List, apparmor,
	linux-security-module, selinux, linux-hardening



On Thu, Oct 6, 2022, at 7:13 AM, Jann Horn wrote:
> On Thu, Oct 6, 2022 at 11:05 AM Christian Brauner <brauner@kernel.org> wrote:
>> On Thu, Oct 06, 2022 at 01:27:34AM -0700, Kees Cook wrote:
>> > The check_unsafe_exec() counting of n_fs would not add up under a heavily
>> > threaded process trying to perform a suid exec, causing the suid portion
>> > to fail. This counting error appears to be unneeded, but to catch any
>> > possible conditions, explicitly unshare fs_struct on exec, if it ends up
>>
>> Isn't this a potential uapi break? Afaict, before this change a call to
>> clone{3}(CLONE_FS) followed by an exec in the child would have the
>> parent and child share fs information. So if the child e.g., changes the
>> working directory post exec it would also affect the parent. But after
>> this change here this would no longer be true. So a child changing a
>> workding directoro would not affect the parent anymore. IOW, an exec is
>> accompanied by an unshare(CLONE_FS). Might still be worth trying ofc but
>> it seems like a non-trivial uapi change but there might be few users
>> that do clone{3}(CLONE_FS) followed by an exec.
>
> I believe the following code in Chromium explicitly relies on this
> behavior, but I'm not sure whether this code is in active use anymore:
>
> https://source.chromium.org/chromium/chromium/src/+/main:sandbox/linux/suid/sandbox.c;l=101?q=CLONE_FS&sq=&ss=chromium

Wait, this is absolutely nucking futs.  On a very quick inspection, the sharable things like this are fs, files, sighand, and io.    files and sighand get unshared, which makes sense.  fs supposedly checks for extra refs and prevents gaining privilege.  io is... ignored!  At least it's not immediately obvious that io is a problem.

But seriously, this makes no sense at all.  It should not be possible to exec a program and then, without ptrace, change its cwd out from under it.  Do we really need to preserve this behavior?

--Andy

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] fs/exec: Explicitly unshare fs_struct on exec
  2022-10-14  3:18       ` Andy Lutomirski
@ 2022-10-14  3:54         ` Kees Cook
  2022-10-14 15:35         ` Jann Horn
  2022-10-14 22:03         ` David Laight
  2 siblings, 0 replies; 25+ messages in thread
From: Kees Cook @ 2022-10-14  3:54 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jann Horn, Christian Brauner, Eric W. Biederman, Jorge Merlino,
	Al Viro, Thomas Gleixner, Sebastian Andrzej Siewior,
	Andrew Morton, linux-mm, linux-fsdevel, John Johansen, Paul Moore,
	James Morris, Serge E. Hallyn, Stephen Smalley, Eric Paris,
	Richard Haines, Casey Schaufler, Xin Long, David S. Miller,
	Todd Kjos, Ondrej Mosnacek, Prashanth Prahlad, Micah Morton,
	Fenghua Yu, Andrei Vagin, Linux Kernel Mailing List, apparmor,
	linux-security-module, selinux, linux-hardening

On Thu, Oct 13, 2022 at 08:18:04PM -0700, Andy Lutomirski wrote:
> But seriously, this makes no sense at all.  It should not be possible to exec a program and then, without ptrace, change its cwd out from under it.  Do we really need to preserve this behavior?

Yup, already abandoned:
https://lore.kernel.org/lkml/202210061301.207A20C8E5@keescook/

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] fs/exec: Explicitly unshare fs_struct on exec
  2022-10-14  3:18       ` Andy Lutomirski
  2022-10-14  3:54         ` Kees Cook
@ 2022-10-14 15:35         ` Jann Horn
  2022-10-18  7:09           ` Kees Cook
  2022-10-14 22:03         ` David Laight
  2 siblings, 1 reply; 25+ messages in thread
From: Jann Horn @ 2022-10-14 15:35 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Christian Brauner, Kees Cook, Eric W. Biederman, Jorge Merlino,
	Al Viro, Thomas Gleixner, Sebastian Andrzej Siewior,
	Andrew Morton, linux-mm, linux-fsdevel, John Johansen, Paul Moore,
	James Morris, Serge E. Hallyn, Stephen Smalley, Eric Paris,
	Richard Haines, Casey Schaufler, Xin Long, David S. Miller,
	Todd Kjos, Ondrej Mosnacek, Prashanth Prahlad, Micah Morton,
	Fenghua Yu, Andrei Vagin, Linux Kernel Mailing List, apparmor,
	linux-security-module, selinux, linux-hardening

On Fri, Oct 14, 2022 at 5:18 AM Andy Lutomirski <luto@kernel.org> wrote:
> On Thu, Oct 6, 2022, at 7:13 AM, Jann Horn wrote:
> > On Thu, Oct 6, 2022 at 11:05 AM Christian Brauner <brauner@kernel.org> wrote:
> >> On Thu, Oct 06, 2022 at 01:27:34AM -0700, Kees Cook wrote:
> >> > The check_unsafe_exec() counting of n_fs would not add up under a heavily
> >> > threaded process trying to perform a suid exec, causing the suid portion
> >> > to fail. This counting error appears to be unneeded, but to catch any
> >> > possible conditions, explicitly unshare fs_struct on exec, if it ends up
> >>
> >> Isn't this a potential uapi break? Afaict, before this change a call to
> >> clone{3}(CLONE_FS) followed by an exec in the child would have the
> >> parent and child share fs information. So if the child e.g., changes the
> >> working directory post exec it would also affect the parent. But after
> >> this change here this would no longer be true. So a child changing a
> >> workding directoro would not affect the parent anymore. IOW, an exec is
> >> accompanied by an unshare(CLONE_FS). Might still be worth trying ofc but
> >> it seems like a non-trivial uapi change but there might be few users
> >> that do clone{3}(CLONE_FS) followed by an exec.
> >
> > I believe the following code in Chromium explicitly relies on this
> > behavior, but I'm not sure whether this code is in active use anymore:
> >
> > https://source.chromium.org/chromium/chromium/src/+/main:sandbox/linux/suid/sandbox.c;l=101?q=CLONE_FS&sq=&ss=chromium
>
> Wait, this is absolutely nucking futs.  On a very quick inspection, the sharable things like this are fs, files, sighand, and io.    files and sighand get unshared, which makes sense.  fs supposedly checks for extra refs and prevents gaining privilege.  io is... ignored!  At least it's not immediately obvious that io is a problem.
>
> But seriously, this makes no sense at all.  It should not be possible to exec a program and then, without ptrace, change its cwd out from under it.  Do we really need to preserve this behavior?

I agree that this is pretty wild.

The single user I'm aware of is Chrome, and as far as I know, they use
it for establishing their sandbox on systems where unprivileged user
namespaces are disabled - see
<https://chromium.googlesource.com/chromium/src/+/main/docs/linux/suid_sandbox.md>.
They also have seccomp-based sandboxing, but IIRC there are some small
holes that mean it's still useful for them to be able to set up
namespaces, like how sendmsg() on a unix domain socket can specify a
file path as the destination address.

(By the way, I think maybe Chrome wouldn't need this wacky trick with
the shared fs_struct if the "NO_NEW_PRIVS permits chroot()" thing had
ever landed that you
(https://lore.kernel.org/lkml/0e2f0f54e19bff53a3739ecfddb4ffa9a6dbde4d.1327858005.git.luto@amacapital.net/)
and Mickaël Salaün proposed in the past... or alternatively, if there
was a way to properly filter all the syscalls that Chrome has to
permit for renderers.)

(But also, to be clear, I don't speak for Chrome, this is just my
understanding of how their stuff works.)

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] fs/exec: Explicitly unshare fs_struct on exec
  2022-10-14 15:35         ` Jann Horn
@ 2022-10-18  7:09           ` Kees Cook
  2022-10-18 11:19             ` Jann Horn
  0 siblings, 1 reply; 25+ messages in thread
From: Kees Cook @ 2022-10-18  7:09 UTC (permalink / raw)
  To: Jann Horn
  Cc: Andy Lutomirski, Christian Brauner, Eric W. Biederman,
	Jorge Merlino, Al Viro, Thomas Gleixner,
	Sebastian Andrzej Siewior, Andrew Morton, linux-mm, linux-fsdevel,
	John Johansen, Paul Moore, James Morris, Serge E. Hallyn,
	Stephen Smalley, Eric Paris, Richard Haines, Casey Schaufler,
	Xin Long, David S. Miller, Todd Kjos, Ondrej Mosnacek,
	Prashanth Prahlad, Micah Morton, Fenghua Yu, Andrei Vagin,
	Linux Kernel Mailing List, apparmor, linux-security-module,
	selinux, linux-hardening

On Fri, Oct 14, 2022 at 05:35:26PM +0200, Jann Horn wrote:
> On Fri, Oct 14, 2022 at 5:18 AM Andy Lutomirski <luto@kernel.org> wrote:
> > On Thu, Oct 6, 2022, at 7:13 AM, Jann Horn wrote:
> > > On Thu, Oct 6, 2022 at 11:05 AM Christian Brauner <brauner@kernel.org> wrote:
> > >> On Thu, Oct 06, 2022 at 01:27:34AM -0700, Kees Cook wrote:
> > >> > The check_unsafe_exec() counting of n_fs would not add up under a heavily
> > >> > threaded process trying to perform a suid exec, causing the suid portion
> > >> > to fail. This counting error appears to be unneeded, but to catch any
> > >> > possible conditions, explicitly unshare fs_struct on exec, if it ends up
> > >>
> > >> Isn't this a potential uapi break? Afaict, before this change a call to
> > >> clone{3}(CLONE_FS) followed by an exec in the child would have the
> > >> parent and child share fs information. So if the child e.g., changes the
> > >> working directory post exec it would also affect the parent. But after
> > >> this change here this would no longer be true. So a child changing a
> > >> workding directoro would not affect the parent anymore. IOW, an exec is
> > >> accompanied by an unshare(CLONE_FS). Might still be worth trying ofc but
> > >> it seems like a non-trivial uapi change but there might be few users
> > >> that do clone{3}(CLONE_FS) followed by an exec.
> > >
> > > I believe the following code in Chromium explicitly relies on this
> > > behavior, but I'm not sure whether this code is in active use anymore:
> > >
> > > https://source.chromium.org/chromium/chromium/src/+/main:sandbox/linux/suid/sandbox.c;l=101?q=CLONE_FS&sq=&ss=chromium
> >
> > Wait, this is absolutely nucking futs.  On a very quick inspection, the sharable things like this are fs, files, sighand, and io.    files and sighand get unshared, which makes sense.  fs supposedly checks for extra refs and prevents gaining privilege.  io is... ignored!  At least it's not immediately obvious that io is a problem.
> >
> > But seriously, this makes no sense at all.  It should not be possible to exec a program and then, without ptrace, change its cwd out from under it.  Do we really need to preserve this behavior?
> 
> I agree that this is pretty wild.
> 
> The single user I'm aware of is Chrome, and as far as I know, they use
> it for establishing their sandbox on systems where unprivileged user
> namespaces are disabled - see
> <https://chromium.googlesource.com/chromium/src/+/main/docs/linux/suid_sandbox.md>.
> They also have seccomp-based sandboxing, but IIRC there are some small
> holes that mean it's still useful for them to be able to set up
> namespaces, like how sendmsg() on a unix domain socket can specify a
> file path as the destination address.
> 
> (By the way, I think maybe Chrome wouldn't need this wacky trick with
> the shared fs_struct if the "NO_NEW_PRIVS permits chroot()" thing had
> ever landed that you
> (https://lore.kernel.org/lkml/0e2f0f54e19bff53a3739ecfddb4ffa9a6dbde4d.1327858005.git.luto@amacapital.net/)
> and Mickaël Salaün proposed in the past... or alternatively, if there
> was a way to properly filter all the syscalls that Chrome has to
> permit for renderers.)
> 
> (But also, to be clear, I don't speak for Chrome, this is just my
> understanding of how their stuff works.)

Chrome seems to just want a totally empty filesystem view, yes?
Let's land the nnp+chroot change. :P Only 10 years late! Then we can
have Chrome use this and we can unshare fs on exec...

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] fs/exec: Explicitly unshare fs_struct on exec
  2022-10-18  7:09           ` Kees Cook
@ 2022-10-18 11:19             ` Jann Horn
  0 siblings, 0 replies; 25+ messages in thread
From: Jann Horn @ 2022-10-18 11:19 UTC (permalink / raw)
  To: Kees Cook
  Cc: Andy Lutomirski, Christian Brauner, Eric W. Biederman,
	Jorge Merlino, Al Viro, Thomas Gleixner,
	Sebastian Andrzej Siewior, Andrew Morton, linux-mm, linux-fsdevel,
	John Johansen, Paul Moore, James Morris, Serge E. Hallyn,
	Stephen Smalley, Eric Paris, Richard Haines, Casey Schaufler,
	Xin Long, David S. Miller, Todd Kjos, Ondrej Mosnacek,
	Prashanth Prahlad, Micah Morton, Fenghua Yu, Andrei Vagin,
	Linux Kernel Mailing List, apparmor, linux-security-module,
	selinux, linux-hardening

On Tue, Oct 18, 2022 at 9:09 AM Kees Cook <keescook@chromium.org> wrote:
> On Fri, Oct 14, 2022 at 05:35:26PM +0200, Jann Horn wrote:
> > On Fri, Oct 14, 2022 at 5:18 AM Andy Lutomirski <luto@kernel.org> wrote:
> > > On Thu, Oct 6, 2022, at 7:13 AM, Jann Horn wrote:
> > > > On Thu, Oct 6, 2022 at 11:05 AM Christian Brauner <brauner@kernel.org> wrote:
> > > >> On Thu, Oct 06, 2022 at 01:27:34AM -0700, Kees Cook wrote:
> > > >> > The check_unsafe_exec() counting of n_fs would not add up under a heavily
> > > >> > threaded process trying to perform a suid exec, causing the suid portion
> > > >> > to fail. This counting error appears to be unneeded, but to catch any
> > > >> > possible conditions, explicitly unshare fs_struct on exec, if it ends up
> > > >>
> > > >> Isn't this a potential uapi break? Afaict, before this change a call to
> > > >> clone{3}(CLONE_FS) followed by an exec in the child would have the
> > > >> parent and child share fs information. So if the child e.g., changes the
> > > >> working directory post exec it would also affect the parent. But after
> > > >> this change here this would no longer be true. So a child changing a
> > > >> workding directoro would not affect the parent anymore. IOW, an exec is
> > > >> accompanied by an unshare(CLONE_FS). Might still be worth trying ofc but
> > > >> it seems like a non-trivial uapi change but there might be few users
> > > >> that do clone{3}(CLONE_FS) followed by an exec.
> > > >
> > > > I believe the following code in Chromium explicitly relies on this
> > > > behavior, but I'm not sure whether this code is in active use anymore:
> > > >
> > > > https://source.chromium.org/chromium/chromium/src/+/main:sandbox/linux/suid/sandbox.c;l=101?q=CLONE_FS&sq=&ss=chromium
> > >
> > > Wait, this is absolutely nucking futs.  On a very quick inspection, the sharable things like this are fs, files, sighand, and io.    files and sighand get unshared, which makes sense.  fs supposedly checks for extra refs and prevents gaining privilege.  io is... ignored!  At least it's not immediately obvious that io is a problem.
> > >
> > > But seriously, this makes no sense at all.  It should not be possible to exec a program and then, without ptrace, change its cwd out from under it.  Do we really need to preserve this behavior?
> >
> > I agree that this is pretty wild.
> >
> > The single user I'm aware of is Chrome, and as far as I know, they use
> > it for establishing their sandbox on systems where unprivileged user
> > namespaces are disabled - see
> > <https://chromium.googlesource.com/chromium/src/+/main/docs/linux/suid_sandbox.md>.
> > They also have seccomp-based sandboxing, but IIRC there are some small
> > holes that mean it's still useful for them to be able to set up
> > namespaces, like how sendmsg() on a unix domain socket can specify a
> > file path as the destination address.
> >
> > (By the way, I think maybe Chrome wouldn't need this wacky trick with
> > the shared fs_struct if the "NO_NEW_PRIVS permits chroot()" thing had
> > ever landed that you
> > (https://lore.kernel.org/lkml/0e2f0f54e19bff53a3739ecfddb4ffa9a6dbde4d.1327858005.git.luto@amacapital.net/)
> > and Mickaël Salaün proposed in the past... or alternatively, if there
> > was a way to properly filter all the syscalls that Chrome has to
> > permit for renderers.)
> >
> > (But also, to be clear, I don't speak for Chrome, this is just my
> > understanding of how their stuff works.)
>
> Chrome seems to just want a totally empty filesystem view, yes?
> Let's land the nnp+chroot change. :P Only 10 years late! Then we can
> have Chrome use this and we can unshare fs on exec...

Someone should check with Chrome first though to make sure what I said
accurately represents what they think...

^ permalink raw reply	[flat|nested] 25+ messages in thread

* RE: [PATCH 1/2] fs/exec: Explicitly unshare fs_struct on exec
  2022-10-14  3:18       ` Andy Lutomirski
  2022-10-14  3:54         ` Kees Cook
  2022-10-14 15:35         ` Jann Horn
@ 2022-10-14 22:03         ` David Laight
  2022-11-28 17:49           ` Eric W. Biederman
  2 siblings, 1 reply; 25+ messages in thread
From: David Laight @ 2022-10-14 22:03 UTC (permalink / raw)
  To: 'Andy Lutomirski', Jann Horn, Christian Brauner
  Cc: Kees Cook, Eric W. Biederman, Jorge Merlino, Al Viro,
	Thomas Gleixner, Sebastian Andrzej Siewior, Andrew Morton,
	linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, John Johansen,
	Paul Moore, James Morris, Serge E. Hallyn, Stephen Smalley,
	Eric Paris, Richard Haines, Casey Schaufler, Xin Long,
	David S. Miller, Todd Kjos, Ondrej Mosnacek, Prashanth Prahlad,
	Micah Morton, Fenghua Yu, Andrei Vagin, Linux Kernel Mailing List,
	apparmor@lists.ubuntu.com, linux-security-module@vger.kernel.org,
	selinux@vger.kernel.org, linux-hardening@vger.kernel.org

From: Andy Lutomirski
> Sent: 14 October 2022 04:18
...
> But seriously, this makes no sense at all.  It should not be possible to exec a program and then,
> without ptrace, change its cwd out from under it.  Do we really need to preserve this behavior?

it maybe ok if the exec'ed program also 'bought-in' to the
fact that its cwd and open files might get changed.
But imagine someone doing it to a login shell!

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] fs/exec: Explicitly unshare fs_struct on exec
  2022-10-14 22:03         ` David Laight
@ 2022-11-28 17:49           ` Eric W. Biederman
  0 siblings, 0 replies; 25+ messages in thread
From: Eric W. Biederman @ 2022-11-28 17:49 UTC (permalink / raw)
  To: David Laight
  Cc: 'Andy Lutomirski', Jann Horn, Christian Brauner,
	Kees Cook, Jorge Merlino, Al Viro, Thomas Gleixner,
	Sebastian Andrzej Siewior, Andrew Morton, linux-mm@kvack.org,
	linux-fsdevel@vger.kernel.org, John Johansen, Paul Moore,
	James Morris, Serge E. Hallyn, Stephen Smalley, Eric Paris,
	Richard Haines, Casey Schaufler, Xin Long, David S. Miller,
	Todd Kjos, Ondrej Mosnacek, Prashanth Prahlad, Micah Morton,
	Fenghua Yu, Andrei Vagin, Linux Kernel Mailing List,
	apparmor@lists.ubuntu.com, linux-security-module@vger.kernel.org,
	selinux@vger.kernel.org, linux-hardening@vger.kernel.org

David Laight <David.Laight@ACULAB.COM> writes:

> From: Andy Lutomirski
>> Sent: 14 October 2022 04:18
> ...
>> But seriously, this makes no sense at all.  It should not be possible to exec a program and then,
>> without ptrace, change its cwd out from under it.  Do we really need to preserve this behavior?
>
> it maybe ok if the exec'ed program also 'bought-in' to the
> fact that its cwd and open files might get changed.
> But imagine someone doing it to a login shell!


I am slowly catching up on my email and I saw this conversation.

When I initially saw this thread I was confused and thought this
might run into an issue with fs/locks.c.  I was close but wrong.
fs/locks.c uses current->files as a sort of process identifier
and so is very sensitive to when it is unshared.  Making
unsharing current->files unconditionally a bug.  Not relevant to
this conversation.


There are several clone options that were only relevant for the old
LinuxThreads implementation including CLONE_FS and CLONE_SIGHAND.
The LinuxThreads implementation has not been needed since
the introduction of CLONE_THREAD in linux-2.6.0 in 17 Dec 2003.
Almost 20 years ago.

I suggest we introduce CONFIG_CLONE_FS and CONFIG_SIGHAND to allow
disabling support of these clone options.  No known user space will
care.  The are both getting in the way of kernel maintenance so there
is a reason to start pushing them out.

Further simply not worrying about UNSHARE_FS during exec fixes the
race so it essentially a bug fix by code removal.

I believe something like the patch below should get the job done.

diff --git a/fs/exec.c b/fs/exec.c
index a0b1f0337a62..7ff13c77ad04 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1186,7 +1186,8 @@ static int unshare_sighand(struct task_struct *me)
 {
 	struct sighand_struct *oldsighand = me->sighand;
 
-	if (refcount_read(&oldsighand->count) != 1) {
+	if (IS_ENABLED(CONFIG_CLONE_SIGHAND) &&
+	    refcount_read(&oldsighand->count) != 1) {
 		struct sighand_struct *newsighand;
 		/*
 		 * This ->sighand is shared with the CLONE_SIGHAND
@@ -1568,6 +1569,9 @@ static void check_unsafe_exec(struct linux_binprm *bprm)
 	if (task_no_new_privs(current))
 		bprm->unsafe |= LSM_UNSAFE_NO_NEW_PRIVS;
 
+	if (!IS_ENABLED(CONFIG_CLONE_FS))
+		return;
+
 	t = p;
 	n_fs = 1;
 	spin_lock(&p->fs->lock);
diff --git a/init/Kconfig b/init/Kconfig
index 94125d3b6893..8660a6bcc1cf 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1764,6 +1764,23 @@ config KALLSYMS_BASE_RELATIVE
 	  time constants, and no relocation pass is required at runtime to fix
 	  up the entries based on the runtime load address of the kernel.
 
+config CLONE_FS
+	bool
+	default y
+	help
+	  Support CLONE_FS being passed to clone.  The only known user
+	  is the old LinuxThreads package so it should be safe to disable
+	  this option.
+
+config CLONE_SIGHAND
+	bool
+	default y
+	help
+	  Support CLONE_SIGHAND being passed to clone.  The only known user
+	  is the old LinuxThreads package so it should be safe to disable
+	  this option.
+
+
 # end of the "standard kernel features (expert users)" menu
 
 # syscall, maps, verifier
diff --git a/kernel/fork.c b/kernel/fork.c
index 08969f5aa38d..da9017b51da4 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2023,6 +2023,16 @@ static __latent_entropy struct task_struct *copy_process(
 	if ((clone_flags & CLONE_SIGHAND) && !(clone_flags & CLONE_VM))
 		return ERR_PTR(-EINVAL);
 
+	/* Don't allow CLONE_FS if not enabled */
+	if (!IS_ENABLED(CONFIG_CLONE_FS) &&
+	    ((clone_flags & (CLONE_THREAD | CLONE_FS)) == CLONE_FS))
+		return ERR_PTR(-EINVAL);
+
+	/* Don't allow CLONE_SIGHAND if not enabled */
+	if (!IS_ENABLED(CONFIG_CLONE_SIGHAND) &&
+	    ((clone_flags & (CLONE_THREAD | CLONE_SIGHAND)) == CLONE_SIGHAND))
+		return ERR_PTR(-EINVAL);
+
 	/*
 	 * Siblings of global init remain as zombies on exit since they are
 	 * not reaped by their parent (swapper). To solve this and to avoid
@@ -3101,6 +3111,9 @@ static int unshare_fs(unsigned long unshare_flags, struct fs_struct **new_fsp)
 	if (fs->users == 1)
 		return 0;
 
+	if (!IS_ENABLED(CONFIG_CLONE_FS))
+		return -EINVAL;
+
 	*new_fsp = copy_fs_struct(fs);
 	if (!*new_fsp)
 		return -ENOMEM;

Eric

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 2/2] exec: Remove LSM_UNSAFE_SHARE
  2022-10-06  8:27 [PATCH 0/2] fs/exec: Explicitly unshare fs_struct on exec Kees Cook
  2022-10-06  8:27 ` [PATCH 1/2] " Kees Cook
@ 2022-10-06  8:27 ` Kees Cook
  1 sibling, 0 replies; 25+ messages in thread
From: Kees Cook @ 2022-10-06  8:27 UTC (permalink / raw)
  To: Eric Biederman
  Cc: Kees Cook, Alexander Viro, John Johansen, Paul Moore,
	James Morris, Serge E. Hallyn, Stephen Smalley, Eric Paris,
	Richard Haines, Casey Schaufler, Xin Long, David S. Miller,
	Todd Kjos, Ondrej Mosnacek, linux-fsdevel, linux-mm, apparmor,
	linux-security-module, selinux, Jorge Merlino,
	Christian Brauner (Microsoft), Thomas Gleixner, Andy Lutomirski,
	Sebastian Andrzej Siewior, Andrew Morton, Prashanth Prahlad,
	Micah Morton, Fenghua Yu, Andrei Vagin, linux-kernel,
	linux-hardening

With fs_struct explicitly unshared during exec, it is no longer possible
to have unexpected shared state, and LSM_UNSAFE_SHARE can be entirely
removed.

Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: John Johansen <john.johansen@canonical.com>
Cc: Paul Moore <paul@paul-moore.com>
Cc: James Morris <jmorris@namei.org>
Cc: "Serge E. Hallyn" <serge@hallyn.com>
Cc: Stephen Smalley <stephen.smalley.work@gmail.com>
Cc: Eric Paris <eparis@parisplace.org>
Cc: Richard Haines <richard_c_haines@btinternet.com>
Cc: Casey Schaufler <casey@schaufler-ca.com>
Cc: Xin Long <lucien.xin@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Todd Kjos <tkjos@google.com>
Cc: Ondrej Mosnacek <omosnace@redhat.com>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: apparmor@lists.ubuntu.com
Cc: linux-security-module@vger.kernel.org
Cc: selinux@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
---
 fs/exec.c                  | 17 +----------------
 include/linux/security.h   |  5 ++---
 security/apparmor/domain.c |  5 -----
 security/selinux/hooks.c   | 10 ----------
 4 files changed, 3 insertions(+), 34 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 7d5f63f03c58..3cd058711098 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1563,8 +1563,7 @@ EXPORT_SYMBOL(bprm_change_interp);
  */
 static void check_unsafe_exec(struct linux_binprm *bprm)
 {
-	struct task_struct *p = current, *t;
-	unsigned n_fs;
+	struct task_struct *p = current;
 
 	if (p->ptrace)
 		bprm->unsafe |= LSM_UNSAFE_PTRACE;
@@ -1575,20 +1574,6 @@ static void check_unsafe_exec(struct linux_binprm *bprm)
 	 */
 	if (task_no_new_privs(current))
 		bprm->unsafe |= LSM_UNSAFE_NO_NEW_PRIVS;
-
-	t = p;
-	n_fs = 1;
-	spin_lock(&p->fs->lock);
-	rcu_read_lock();
-	while_each_thread(p, t) {
-		if (t->fs == p->fs)
-			n_fs++;
-	}
-	rcu_read_unlock();
-
-	if (p->fs->users > n_fs)
-		bprm->unsafe |= LSM_UNSAFE_SHARE;
-	spin_unlock(&p->fs->lock);
 }
 
 static void bprm_fill_uid(struct linux_binprm *bprm, struct file *file)
diff --git a/include/linux/security.h b/include/linux/security.h
index 1bc362cb413f..db508a8c3cc7 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -215,9 +215,8 @@ struct sched_param;
 struct request_sock;
 
 /* bprm->unsafe reasons */
-#define LSM_UNSAFE_SHARE	1
-#define LSM_UNSAFE_PTRACE	2
-#define LSM_UNSAFE_NO_NEW_PRIVS	4
+#define LSM_UNSAFE_PTRACE	BIT(0)
+#define LSM_UNSAFE_NO_NEW_PRIVS	BIT(1)
 
 #ifdef CONFIG_MMU
 extern int mmap_min_addr_handler(struct ctl_table *table, int write,
diff --git a/security/apparmor/domain.c b/security/apparmor/domain.c
index 91689d34d281..1b2c0bb4d9ae 100644
--- a/security/apparmor/domain.c
+++ b/security/apparmor/domain.c
@@ -924,11 +924,6 @@ int apparmor_bprm_creds_for_exec(struct linux_binprm *bprm)
 		goto audit;
 	}
 
-	if (bprm->unsafe & LSM_UNSAFE_SHARE) {
-		/* FIXME: currently don't mediate shared state */
-		;
-	}
-
 	if (bprm->unsafe & (LSM_UNSAFE_PTRACE)) {
 		/* TODO: test needs to be profile of label to new */
 		error = may_change_ptraced_domain(new, &info);
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 79573504783b..3ec80cc8ad1c 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -2349,16 +2349,6 @@ static int selinux_bprm_creds_for_exec(struct linux_binprm *bprm)
 		if (rc)
 			return rc;
 
-		/* Check for shared state */
-		if (bprm->unsafe & LSM_UNSAFE_SHARE) {
-			rc = avc_has_perm(&selinux_state,
-					  old_tsec->sid, new_tsec->sid,
-					  SECCLASS_PROCESS, PROCESS__SHARE,
-					  NULL);
-			if (rc)
-				return -EPERM;
-		}
-
 		/* Make sure that anyone attempting to ptrace over a task that
 		 * changes its SID has the appropriate permit */
 		if (bprm->unsafe & LSM_UNSAFE_PTRACE) {
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2025-05-15 16:48 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2022-10-06  8:27 [PATCH 0/2] fs/exec: Explicitly unshare fs_struct on exec Kees Cook
2022-10-06  8:27 ` [PATCH 1/2] " Kees Cook
2022-10-06  9:05   ` Christian Brauner
2022-10-06 10:48     ` David Laight
2022-10-06 14:13     ` Jann Horn
2022-10-06 15:25       ` Kees Cook
2022-10-06 15:35         ` Jann Horn
2025-05-13 13:05         ` Mateusz Guzik
2025-05-13 15:29           ` Eric W. Biederman
2025-05-13 20:57           ` Kees Cook
2025-05-13 21:09             ` Jann Horn
2025-05-13 22:16               ` Eric W. Biederman
2025-05-14  0:03                 ` Mateusz Guzik
2025-05-14 15:33                   ` Eric W. Biederman
2025-05-14 15:42                   ` Kees Cook
2025-05-15 16:48                     ` Eric W. Biederman
2025-05-13 23:15               ` Kees Cook
2022-10-14  3:18       ` Andy Lutomirski
2022-10-14  3:54         ` Kees Cook
2022-10-14 15:35         ` Jann Horn
2022-10-18  7:09           ` Kees Cook
2022-10-18 11:19             ` Jann Horn
2022-10-14 22:03         ` David Laight
2022-11-28 17:49           ` Eric W. Biederman
2022-10-06  8:27 ` [PATCH 2/2] exec: Remove LSM_UNSAFE_SHARE Kees Cook

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).