[PATCH v2 0/2] User namespace aware fanotify

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v2 0/2] User namespace aware fanotify
@ 2025-04-19 10:06 Amir Goldstein
  2025-04-19 10:06 ` [PATCH v2 1/2] fanotify: remove redundant permission checks Amir Goldstein
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Amir Goldstein @ 2025-04-19 10:06 UTC (permalink / raw)
  To: Jan Kara; +Cc: Miklos Szeredi, Christian Brauner, linux-fsdevel

Jan,

This v2 is following a two years leap from the RFC path [1].
the code is based on the mntns fix patches I posted and is available
on my github [2].

Since then, Christian added support for open_by_handle_at(2)
to admin inside userns, which makes watching FS_USERNS_MOUNT
sb more useful.

And this should also be useful for Miklos' mntns mount tree watch
inside userns.

Tested sb/mount watches inside userns manually with fsnotifywatch -S
and -M with some changes to inotify-tools [3].

Ran mount-notify test manually inside userns and saw that it works
after this change.

I was going to write a variant of mount-notify selftest that clones
also a userns, but did not get to it.

Christian, Miklos,

If you guys have interest and time in this work, it would be nice if
you can help with this test variant or give me some pointers.

I can work on the test and address review comments when I get back from
vacation around rc5 time, but wanted to get this out soon for review.

Thanks,
Amir.

changes since v1:
- Split cleanup patch (Jan)
- Logic simplified a bit
- Add support for mntns marks inside userns

[1] https://lore.kernel.org/linux-fsdevel/20230416060722.1912831-1-amir73il@gmail.com/
[2] https://github.com/amir73il/linux/commits/fanotify_userns/
[3] https://github.com/amir73il/inotify-tools/commits/fanotify_userns/

Amir Goldstein (2):
  fanotify: remove redundant permission checks
  fanotify: support watching filesystems and mounts inside userns

 fs/notify/fanotify/fanotify.c      |  1 +
 fs/notify/fanotify/fanotify_user.c | 47 ++++++++++++++++++------------
 include/linux/fanotify.h           |  5 ++--
 include/linux/fsnotify_backend.h   |  1 +
 4 files changed, 32 insertions(+), 22 deletions(-)

-- 
2.34.1

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH v2 1/2] fanotify: remove redundant permission checks
  2025-04-19 10:06 [PATCH v2 0/2] User namespace aware fanotify Amir Goldstein
@ 2025-04-19 10:06 ` Amir Goldstein
  2025-04-19 10:06 ` [PATCH v2 2/2] fanotify: support watching filesystems and mounts inside userns Amir Goldstein
  2025-04-19 11:48 ` [PATCH v2 0/2] User namespace aware fanotify Amir Goldstein
  2 siblings, 0 replies; 14+ messages in thread
From: Amir Goldstein @ 2025-04-19 10:06 UTC (permalink / raw)
  To: Jan Kara; +Cc: Miklos Szeredi, Christian Brauner, linux-fsdevel

FAN_UNLIMITED_QUEUE and FAN_UNLIMITED_MARK flags are already checked
as part of the CAP_SYS_ADMIN check for any FANOTIFY_ADMIN_INIT_FLAGS.

Remove the individual CAP_SYS_ADMIN checks for these flags.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
 fs/notify/fanotify/fanotify_user.c | 11 ++---------
 1 file changed, 2 insertions(+), 9 deletions(-)

diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index 87f861e9004f..471c57832357 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -1334,6 +1334,7 @@ static struct fsnotify_mark *fanotify_add_new_mark(struct fsnotify_group *group,
 	 * A group with FAN_UNLIMITED_MARKS does not contribute to mark count
 	 * in the limited groups account.
 	 */
+	BUILD_BUG_ON(!(FANOTIFY_ADMIN_INIT_FLAGS & FAN_UNLIMITED_MARKS));
 	if (!FAN_GROUP_FLAG(group, FAN_UNLIMITED_MARKS) &&
 	    !inc_ucount(ucounts->ns, ucounts->uid, UCOUNT_FANOTIFY_MARKS))
 		return ERR_PTR(-ENOSPC);
@@ -1637,21 +1638,13 @@ SYSCALL_DEFINE2(fanotify_init, unsigned int, flags, unsigned int, event_f_flags)
 		goto out_destroy_group;
 	}
 
+	BUILD_BUG_ON(!(FANOTIFY_ADMIN_INIT_FLAGS & FAN_UNLIMITED_QUEUE));
 	if (flags & FAN_UNLIMITED_QUEUE) {
-		fd = -EPERM;
-		if (!capable(CAP_SYS_ADMIN))
-			goto out_destroy_group;
 		group->max_events = UINT_MAX;
 	} else {
 		group->max_events = fanotify_max_queued_events;
 	}
 
-	if (flags & FAN_UNLIMITED_MARKS) {
-		fd = -EPERM;
-		if (!capable(CAP_SYS_ADMIN))
-			goto out_destroy_group;
-	}
-
 	if (flags & FAN_ENABLE_AUDIT) {
 		fd = -EPERM;
 		if (!capable(CAP_AUDIT_WRITE))
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v2 2/2] fanotify: support watching filesystems and mounts inside userns
  2025-04-19 10:06 [PATCH v2 0/2] User namespace aware fanotify Amir Goldstein
  2025-04-19 10:06 ` [PATCH v2 1/2] fanotify: remove redundant permission checks Amir Goldstein
@ 2025-04-19 10:06 ` Amir Goldstein
  2025-05-14 15:49   ` Jan Kara
                     ` (2 more replies)
  2025-04-19 11:48 ` [PATCH v2 0/2] User namespace aware fanotify Amir Goldstein
  2 siblings, 3 replies; 14+ messages in thread
From: Amir Goldstein @ 2025-04-19 10:06 UTC (permalink / raw)
  To: Jan Kara; +Cc: Miklos Szeredi, Christian Brauner, linux-fsdevel

An unprivileged user is allowed to create an fanotify group and add
inode marks, but not filesystem, mntns and mount marks.

Add limited support for setting up filesystem, mntns and mount marks by
an unprivileged user under the following conditions:

1.   User has CAP_SYS_ADMIN in the user ns where the group was created
2.a. User has CAP_SYS_ADMIN in the user ns where the filesystem was
     mounted (implies FS_USERNS_MOUNT)
  OR (in case setting up a mntns or mount mark)
2.b. User has CAP_SYS_ADMIN in the user ns associated with the mntns

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
 fs/notify/fanotify/fanotify.c      |  1 +
 fs/notify/fanotify/fanotify_user.c | 36 +++++++++++++++++++++---------
 include/linux/fanotify.h           |  5 ++---
 include/linux/fsnotify_backend.h   |  1 +
 4 files changed, 30 insertions(+), 13 deletions(-)

diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
index 6d386080faf2..060d9bee34bd 100644
--- a/fs/notify/fanotify/fanotify.c
+++ b/fs/notify/fanotify/fanotify.c
@@ -1009,6 +1009,7 @@ static int fanotify_handle_event(struct fsnotify_group *group, u32 mask,
 
 static void fanotify_free_group_priv(struct fsnotify_group *group)
 {
+	put_user_ns(group->user_ns);
 	kfree(group->fanotify_data.merge_hash);
 	if (group->fanotify_data.ucounts)
 		dec_ucount(group->fanotify_data.ucounts,
diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index 471c57832357..b4255b661bda 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -1499,6 +1499,7 @@ static struct hlist_head *fanotify_alloc_merge_hash(void)
 /* fanotify syscalls */
 SYSCALL_DEFINE2(fanotify_init, unsigned int, flags, unsigned int, event_f_flags)
 {
+	struct user_namespace *user_ns = current_user_ns();
 	struct fsnotify_group *group;
 	int f_flags, fd;
 	unsigned int fid_mode = flags & FANOTIFY_FID_BITS;
@@ -1513,10 +1514,11 @@ SYSCALL_DEFINE2(fanotify_init, unsigned int, flags, unsigned int, event_f_flags)
 		/*
 		 * An unprivileged user can setup an fanotify group with
 		 * limited functionality - an unprivileged group is limited to
-		 * notification events with file handles and it cannot use
-		 * unlimited queue/marks.
+		 * notification events with file handles or mount ids and it
+		 * cannot use unlimited queue/marks.
 		 */
-		if ((flags & FANOTIFY_ADMIN_INIT_FLAGS) || !fid_mode)
+		if ((flags & FANOTIFY_ADMIN_INIT_FLAGS) ||
+		    !(flags & (FANOTIFY_FID_BITS | FAN_REPORT_MNT)))
 			return -EPERM;
 
 		/*
@@ -1595,8 +1597,7 @@ SYSCALL_DEFINE2(fanotify_init, unsigned int, flags, unsigned int, event_f_flags)
 	}
 
 	/* Enforce groups limits per user in all containing user ns */
-	group->fanotify_data.ucounts = inc_ucount(current_user_ns(),
-						  current_euid(),
+	group->fanotify_data.ucounts = inc_ucount(user_ns, current_euid(),
 						  UCOUNT_FANOTIFY_GROUPS);
 	if (!group->fanotify_data.ucounts) {
 		fd = -EMFILE;
@@ -1605,6 +1606,7 @@ SYSCALL_DEFINE2(fanotify_init, unsigned int, flags, unsigned int, event_f_flags)
 
 	group->fanotify_data.flags = flags | internal_flags;
 	group->memcg = get_mem_cgroup_from_mm(current->mm);
+	group->user_ns = get_user_ns(user_ns);
 
 	group->fanotify_data.merge_hash = fanotify_alloc_merge_hash();
 	if (!group->fanotify_data.merge_hash) {
@@ -1804,6 +1806,7 @@ static int do_fanotify_mark(int fanotify_fd, unsigned int flags, __u64 mask,
 	struct fsnotify_group *group;
 	struct path path;
 	struct fan_fsid __fsid, *fsid = NULL;
+	struct user_namespace *user_ns = NULL;
 	u32 valid_mask = FANOTIFY_EVENTS | FANOTIFY_EVENT_FLAGS;
 	unsigned int mark_type = flags & FANOTIFY_MARK_TYPE_BITS;
 	unsigned int mark_cmd = flags & FANOTIFY_MARK_CMD_BITS;
@@ -1897,12 +1900,10 @@ static int do_fanotify_mark(int fanotify_fd, unsigned int flags, __u64 mask,
 	}
 
 	/*
-	 * An unprivileged user is not allowed to setup mount nor filesystem
-	 * marks.  This also includes setting up such marks by a group that
-	 * was initialized by an unprivileged user.
+	 * A user is allowed to setup sb/mount/mntns marks only if it is
+	 * capable in the user ns where the group was created.
 	 */
-	if ((!capable(CAP_SYS_ADMIN) ||
-	     FAN_GROUP_FLAG(group, FANOTIFY_UNPRIV)) &&
+	if (!ns_capable(group->user_ns, CAP_SYS_ADMIN) &&
 	    mark_type != FAN_MARK_INODE)
 		return -EPERM;
 
@@ -1987,12 +1988,27 @@ static int do_fanotify_mark(int fanotify_fd, unsigned int flags, __u64 mask,
 		obj = inode;
 	} else if (obj_type == FSNOTIFY_OBJ_TYPE_VFSMOUNT) {
 		obj = path.mnt;
+		user_ns = real_mount(obj)->mnt_ns->user_ns;
 	} else if (obj_type == FSNOTIFY_OBJ_TYPE_SB) {
 		obj = path.mnt->mnt_sb;
+		user_ns = path.mnt->mnt_sb->s_user_ns;
 	} else if (obj_type == FSNOTIFY_OBJ_TYPE_MNTNS) {
 		obj = mnt_ns_from_dentry(path.dentry);
+		user_ns = ((struct mnt_namespace *)obj)->user_ns;
 	}
 
+	/*
+	 * In addition to being capable in the user ns where group was created,
+	 * the user also needs to be capable in the user ns associated with
+	 * the marked filesystem (for FS_USERNS_MOUNT filesystems) or in the
+	 * user ns associated with the mntns (when marking a mount or mntns).
+	 * This is aligned with the required permissions to open_by_handle_at()
+	 * a directory fid provided with the events.
+	 */
+	ret = -EPERM;
+	if (user_ns && !ns_capable(user_ns, CAP_SYS_ADMIN))
+		goto path_put_and_out;
+
 	ret = -EINVAL;
 	if (!obj)
 		goto path_put_and_out;
diff --git a/include/linux/fanotify.h b/include/linux/fanotify.h
index 3c817dc6292e..879cff5eccd4 100644
--- a/include/linux/fanotify.h
+++ b/include/linux/fanotify.h
@@ -38,8 +38,7 @@
 					 FAN_REPORT_PIDFD | \
 					 FAN_REPORT_FD_ERROR | \
 					 FAN_UNLIMITED_QUEUE | \
-					 FAN_UNLIMITED_MARKS | \
-					 FAN_REPORT_MNT)
+					 FAN_UNLIMITED_MARKS)
 
 /*
  * fanotify_init() flags that are allowed for user without CAP_SYS_ADMIN.
@@ -48,7 +47,7 @@
  * so one of the flags for reporting file handles is required.
  */
 #define FANOTIFY_USER_INIT_FLAGS	(FAN_CLASS_NOTIF | \
-					 FANOTIFY_FID_BITS | \
+					 FANOTIFY_FID_BITS | FAN_REPORT_MNT | \
 					 FAN_CLOEXEC | FAN_NONBLOCK)
 
 #define FANOTIFY_INIT_FLAGS	(FANOTIFY_ADMIN_INIT_FLAGS | \
diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
index fc27b53c58c2..d4034ddaf392 100644
--- a/include/linux/fsnotify_backend.h
+++ b/include/linux/fsnotify_backend.h
@@ -250,6 +250,7 @@ struct fsnotify_group {
 						 * full */
 
 	struct mem_cgroup *memcg;	/* memcg to charge allocations */
+	struct user_namespace *user_ns;	/* user ns where group was created */
 
 	/* groups can define private fields here or use the void *private */
 	union {
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 0/2] User namespace aware fanotify
  2025-04-19 10:06 [PATCH v2 0/2] User namespace aware fanotify Amir Goldstein
  2025-04-19 10:06 ` [PATCH v2 1/2] fanotify: remove redundant permission checks Amir Goldstein
  2025-04-19 10:06 ` [PATCH v2 2/2] fanotify: support watching filesystems and mounts inside userns Amir Goldstein
@ 2025-04-19 11:48 ` Amir Goldstein
  2025-05-08 20:46   ` Amir Goldstein
  2 siblings, 1 reply; 14+ messages in thread
From: Amir Goldstein @ 2025-04-19 11:48 UTC (permalink / raw)
  To: Jan Kara; +Cc: Miklos Szeredi, Christian Brauner, linux-fsdevel

On Sat, Apr 19, 2025 at 12:07 PM Amir Goldstein <amir73il@gmail.com> wrote:
>
> Jan,
>
> This v2 is following a two years leap from the RFC path [1].
> the code is based on the mntns fix patches I posted and is available
> on my github [2].
>
> Since then, Christian added support for open_by_handle_at(2)
> to admin inside userns, which makes watching FS_USERNS_MOUNT
> sb more useful.
>
> And this should also be useful for Miklos' mntns mount tree watch
> inside userns.
>
> Tested sb/mount watches inside userns manually with fsnotifywatch -S
> and -M with some changes to inotify-tools [3].
>
> Ran mount-notify test manually inside userns and saw that it works
> after this change.
>
> I was going to write a variant of mount-notify selftest that clones
> also a userns, but did not get to it.
>
> Christian, Miklos,
>
> If you guys have interest and time in this work, it would be nice if
> you can help with this test variant or give me some pointers.
>
> I can work on the test and address review comments when I get back from
> vacation around rc5 time, but wanted to get this out soon for review.
>

FWIW, this is my failed attempt to copy what statmount_test_ns does
to mount-notify_test_ns:

https://github.com/amir73il/linux/commits/fanotify_selftests/

Maybe there is a simple way to fix it?
or maybe it should use the better infrastructure that Chritian
added for overlayfs selftests?

I did not have much time to look into it.

Thanks,
Amir.

>
> changes since v1:
> - Split cleanup patch (Jan)
> - Logic simplified a bit
> - Add support for mntns marks inside userns
>
> [1] https://lore.kernel.org/linux-fsdevel/20230416060722.1912831-1-amir73il@gmail.com/
> [2] https://github.com/amir73il/linux/commits/fanotify_userns/
> [3] https://github.com/amir73il/inotify-tools/commits/fanotify_userns/
>
> Amir Goldstein (2):
>   fanotify: remove redundant permission checks
>   fanotify: support watching filesystems and mounts inside userns
>
>  fs/notify/fanotify/fanotify.c      |  1 +
>  fs/notify/fanotify/fanotify_user.c | 47 ++++++++++++++++++------------
>  include/linux/fanotify.h           |  5 ++--
>  include/linux/fsnotify_backend.h   |  1 +
>  4 files changed, 32 insertions(+), 22 deletions(-)
>
> --
> 2.34.1
>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 0/2] User namespace aware fanotify
  2025-04-19 11:48 ` [PATCH v2 0/2] User namespace aware fanotify Amir Goldstein
@ 2025-05-08 20:46   ` Amir Goldstein
  2025-05-14  9:00     ` Amir Goldstein
  0 siblings, 1 reply; 14+ messages in thread
From: Amir Goldstein @ 2025-05-08 20:46 UTC (permalink / raw)
  To: Jan Kara; +Cc: Miklos Szeredi, Christian Brauner, linux-fsdevel

On Sat, Apr 19, 2025 at 1:48 PM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Sat, Apr 19, 2025 at 12:07 PM Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > Jan,
> >
> > This v2 is following a two years leap from the RFC path [1].
> > the code is based on the mntns fix patches I posted and is available
> > on my github [2].
> >
> > Since then, Christian added support for open_by_handle_at(2)
> > to admin inside userns, which makes watching FS_USERNS_MOUNT
> > sb more useful.
> >
> > And this should also be useful for Miklos' mntns mount tree watch
> > inside userns.
> >
> > Tested sb/mount watches inside userns manually with fsnotifywatch -S
> > and -M with some changes to inotify-tools [3].
> >
> > Ran mount-notify test manually inside userns and saw that it works
> > after this change.
> >
> > I was going to write a variant of mount-notify selftest that clones
> > also a userns, but did not get to it.
> >
> > Christian, Miklos,
> >
> > If you guys have interest and time in this work, it would be nice if
> > you can help with this test variant or give me some pointers.
> >
> > I can work on the test and address review comments when I get back from
> > vacation around rc5 time, but wanted to get this out soon for review.
> >
>
> FWIW, this is my failed attempt to copy what statmount_test_ns does
> to mount-notify_test_ns:
>
> https://github.com/amir73il/linux/commits/fanotify_selftests/
>

Hi Jan,

This selftests branch is now updated.
The test is working as expected and verifies the changes in this patch set.

Would you consider queuing the fanotify patches for v6.6?

We need to collaborate the merge of the selftests with Christian
because my selftests branch has some cleanups with a minor
conflict with Christian's vfs/vfs-6.16.mount branch.

Maybe you will carry only the fanotify patches to v6.6 and
Christian will carry the tests in a separate branch?
because the fanotify patches and the tests do not actually depend
on each other to build, only for the test to pass.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 0/2] User namespace aware fanotify
  2025-05-08 20:46   ` Amir Goldstein
@ 2025-05-14  9:00     ` Amir Goldstein
  0 siblings, 0 replies; 14+ messages in thread
From: Amir Goldstein @ 2025-05-14  9:00 UTC (permalink / raw)
  To: Jan Kara; +Cc: Miklos Szeredi, Christian Brauner, linux-fsdevel

On Thu, May 8, 2025 at 10:46 PM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Sat, Apr 19, 2025 at 1:48 PM Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > On Sat, Apr 19, 2025 at 12:07 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > >
> > > Jan,
> > >
> > > This v2 is following a two years leap from the RFC path [1].
> > > the code is based on the mntns fix patches I posted and is available
> > > on my github [2].
> > >
> > > Since then, Christian added support for open_by_handle_at(2)
> > > to admin inside userns, which makes watching FS_USERNS_MOUNT
> > > sb more useful.
> > >
> > > And this should also be useful for Miklos' mntns mount tree watch
> > > inside userns.
> > >
> > > Tested sb/mount watches inside userns manually with fsnotifywatch -S
> > > and -M with some changes to inotify-tools [3].
> > >
> > > Ran mount-notify test manually inside userns and saw that it works
> > > after this change.
> > >
> > > I was going to write a variant of mount-notify selftest that clones
> > > also a userns, but did not get to it.
> > >
> > > Christian, Miklos,
> > >
> > > If you guys have interest and time in this work, it would be nice if
> > > you can help with this test variant or give me some pointers.
> > >
> > > I can work on the test and address review comments when I get back from
> > > vacation around rc5 time, but wanted to get this out soon for review.
> > >
> >
> > FWIW, this is my failed attempt to copy what statmount_test_ns does
> > to mount-notify_test_ns:
> >
> > https://github.com/amir73il/linux/commits/fanotify_selftests/
> >
>
> Hi Jan,
>
> This selftests branch is now updated.
> The test is working as expected and verifies the changes in this patch set.
>
> Would you consider queuing the fanotify patches for v6.6?
>
> We need to collaborate the merge of the selftests with Christian
> because my selftests branch has some cleanups with a minor
> conflict with Christian's vfs/vfs-6.16.mount branch.
>
> Maybe you will carry only the fanotify patches to v6.6 and
> Christian will carry the tests in a separate branch?
> because the fanotify patches and the tests do not actually depend
> on each other to build, only for the test to pass.

FYI, the selftests to verify these fanotify patches have been queued
on Christian's vfs-6.16.selftests branch.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 2/2] fanotify: support watching filesystems and mounts inside userns
  2025-04-19 10:06 ` [PATCH v2 2/2] fanotify: support watching filesystems and mounts inside userns Amir Goldstein
@ 2025-05-14 15:49   ` Jan Kara
  2025-05-14 18:39     ` Amir Goldstein
  2025-05-16 13:22   ` Miklos Szeredi
  2025-05-19 10:03   ` Christian Brauner
  2 siblings, 1 reply; 14+ messages in thread
From: Jan Kara @ 2025-05-14 15:49 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Jan Kara, Miklos Szeredi, Christian Brauner, linux-fsdevel

On Sat 19-04-25 12:06:57, Amir Goldstein wrote:
> An unprivileged user is allowed to create an fanotify group and add
> inode marks, but not filesystem, mntns and mount marks.
> 
> Add limited support for setting up filesystem, mntns and mount marks by
> an unprivileged user under the following conditions:
> 
> 1.   User has CAP_SYS_ADMIN in the user ns where the group was created
> 2.a. User has CAP_SYS_ADMIN in the user ns where the filesystem was
>      mounted (implies FS_USERNS_MOUNT)
>   OR (in case setting up a mntns or mount mark)
> 2.b. User has CAP_SYS_ADMIN in the user ns associated with the mntns
> 
> Signed-off-by: Amir Goldstein <amir73il@gmail.com>

I'm sorry for the delay. Both patches look good to me but I'd like somebody
more versed with user namespaces to double check namespace checks in this
patch are indeed sound (so that we don't introduce some security issue).
Christian, can you have a look please?

								Honza

> ---
>  fs/notify/fanotify/fanotify.c      |  1 +
>  fs/notify/fanotify/fanotify_user.c | 36 +++++++++++++++++++++---------
>  include/linux/fanotify.h           |  5 ++---
>  include/linux/fsnotify_backend.h   |  1 +
>  4 files changed, 30 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
> index 6d386080faf2..060d9bee34bd 100644
> --- a/fs/notify/fanotify/fanotify.c
> +++ b/fs/notify/fanotify/fanotify.c
> @@ -1009,6 +1009,7 @@ static int fanotify_handle_event(struct fsnotify_group *group, u32 mask,
>  
>  static void fanotify_free_group_priv(struct fsnotify_group *group)
>  {
> +	put_user_ns(group->user_ns);
>  	kfree(group->fanotify_data.merge_hash);
>  	if (group->fanotify_data.ucounts)
>  		dec_ucount(group->fanotify_data.ucounts,
> diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
> index 471c57832357..b4255b661bda 100644
> --- a/fs/notify/fanotify/fanotify_user.c
> +++ b/fs/notify/fanotify/fanotify_user.c
> @@ -1499,6 +1499,7 @@ static struct hlist_head *fanotify_alloc_merge_hash(void)
>  /* fanotify syscalls */
>  SYSCALL_DEFINE2(fanotify_init, unsigned int, flags, unsigned int, event_f_flags)
>  {
> +	struct user_namespace *user_ns = current_user_ns();
>  	struct fsnotify_group *group;
>  	int f_flags, fd;
>  	unsigned int fid_mode = flags & FANOTIFY_FID_BITS;
> @@ -1513,10 +1514,11 @@ SYSCALL_DEFINE2(fanotify_init, unsigned int, flags, unsigned int, event_f_flags)
>  		/*
>  		 * An unprivileged user can setup an fanotify group with
>  		 * limited functionality - an unprivileged group is limited to
> -		 * notification events with file handles and it cannot use
> -		 * unlimited queue/marks.
> +		 * notification events with file handles or mount ids and it
> +		 * cannot use unlimited queue/marks.
>  		 */
> -		if ((flags & FANOTIFY_ADMIN_INIT_FLAGS) || !fid_mode)
> +		if ((flags & FANOTIFY_ADMIN_INIT_FLAGS) ||
> +		    !(flags & (FANOTIFY_FID_BITS | FAN_REPORT_MNT)))
>  			return -EPERM;
>  
>  		/*
> @@ -1595,8 +1597,7 @@ SYSCALL_DEFINE2(fanotify_init, unsigned int, flags, unsigned int, event_f_flags)
>  	}
>  
>  	/* Enforce groups limits per user in all containing user ns */
> -	group->fanotify_data.ucounts = inc_ucount(current_user_ns(),
> -						  current_euid(),
> +	group->fanotify_data.ucounts = inc_ucount(user_ns, current_euid(),
>  						  UCOUNT_FANOTIFY_GROUPS);
>  	if (!group->fanotify_data.ucounts) {
>  		fd = -EMFILE;
> @@ -1605,6 +1606,7 @@ SYSCALL_DEFINE2(fanotify_init, unsigned int, flags, unsigned int, event_f_flags)
>  
>  	group->fanotify_data.flags = flags | internal_flags;
>  	group->memcg = get_mem_cgroup_from_mm(current->mm);
> +	group->user_ns = get_user_ns(user_ns);
>  
>  	group->fanotify_data.merge_hash = fanotify_alloc_merge_hash();
>  	if (!group->fanotify_data.merge_hash) {
> @@ -1804,6 +1806,7 @@ static int do_fanotify_mark(int fanotify_fd, unsigned int flags, __u64 mask,
>  	struct fsnotify_group *group;
>  	struct path path;
>  	struct fan_fsid __fsid, *fsid = NULL;
> +	struct user_namespace *user_ns = NULL;
>  	u32 valid_mask = FANOTIFY_EVENTS | FANOTIFY_EVENT_FLAGS;
>  	unsigned int mark_type = flags & FANOTIFY_MARK_TYPE_BITS;
>  	unsigned int mark_cmd = flags & FANOTIFY_MARK_CMD_BITS;
> @@ -1897,12 +1900,10 @@ static int do_fanotify_mark(int fanotify_fd, unsigned int flags, __u64 mask,
>  	}
>  
>  	/*
> -	 * An unprivileged user is not allowed to setup mount nor filesystem
> -	 * marks.  This also includes setting up such marks by a group that
> -	 * was initialized by an unprivileged user.
> +	 * A user is allowed to setup sb/mount/mntns marks only if it is
> +	 * capable in the user ns where the group was created.
>  	 */
> -	if ((!capable(CAP_SYS_ADMIN) ||
> -	     FAN_GROUP_FLAG(group, FANOTIFY_UNPRIV)) &&
> +	if (!ns_capable(group->user_ns, CAP_SYS_ADMIN) &&
>  	    mark_type != FAN_MARK_INODE)
>  		return -EPERM;
>  
> @@ -1987,12 +1988,27 @@ static int do_fanotify_mark(int fanotify_fd, unsigned int flags, __u64 mask,
>  		obj = inode;
>  	} else if (obj_type == FSNOTIFY_OBJ_TYPE_VFSMOUNT) {
>  		obj = path.mnt;
> +		user_ns = real_mount(obj)->mnt_ns->user_ns;
>  	} else if (obj_type == FSNOTIFY_OBJ_TYPE_SB) {
>  		obj = path.mnt->mnt_sb;
> +		user_ns = path.mnt->mnt_sb->s_user_ns;
>  	} else if (obj_type == FSNOTIFY_OBJ_TYPE_MNTNS) {
>  		obj = mnt_ns_from_dentry(path.dentry);
> +		user_ns = ((struct mnt_namespace *)obj)->user_ns;
>  	}
>  
> +	/*
> +	 * In addition to being capable in the user ns where group was created,
> +	 * the user also needs to be capable in the user ns associated with
> +	 * the marked filesystem (for FS_USERNS_MOUNT filesystems) or in the
> +	 * user ns associated with the mntns (when marking a mount or mntns).
> +	 * This is aligned with the required permissions to open_by_handle_at()
> +	 * a directory fid provided with the events.
> +	 */
> +	ret = -EPERM;
> +	if (user_ns && !ns_capable(user_ns, CAP_SYS_ADMIN))
> +		goto path_put_and_out;
> +
>  	ret = -EINVAL;
>  	if (!obj)
>  		goto path_put_and_out;
> diff --git a/include/linux/fanotify.h b/include/linux/fanotify.h
> index 3c817dc6292e..879cff5eccd4 100644
> --- a/include/linux/fanotify.h
> +++ b/include/linux/fanotify.h
> @@ -38,8 +38,7 @@
>  					 FAN_REPORT_PIDFD | \
>  					 FAN_REPORT_FD_ERROR | \
>  					 FAN_UNLIMITED_QUEUE | \
> -					 FAN_UNLIMITED_MARKS | \
> -					 FAN_REPORT_MNT)
> +					 FAN_UNLIMITED_MARKS)
>  
>  /*
>   * fanotify_init() flags that are allowed for user without CAP_SYS_ADMIN.
> @@ -48,7 +47,7 @@
>   * so one of the flags for reporting file handles is required.
>   */
>  #define FANOTIFY_USER_INIT_FLAGS	(FAN_CLASS_NOTIF | \
> -					 FANOTIFY_FID_BITS | \
> +					 FANOTIFY_FID_BITS | FAN_REPORT_MNT | \
>  					 FAN_CLOEXEC | FAN_NONBLOCK)
>  
>  #define FANOTIFY_INIT_FLAGS	(FANOTIFY_ADMIN_INIT_FLAGS | \
> diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
> index fc27b53c58c2..d4034ddaf392 100644
> --- a/include/linux/fsnotify_backend.h
> +++ b/include/linux/fsnotify_backend.h
> @@ -250,6 +250,7 @@ struct fsnotify_group {
>  						 * full */
>  
>  	struct mem_cgroup *memcg;	/* memcg to charge allocations */
> +	struct user_namespace *user_ns;	/* user ns where group was created */
>  
>  	/* groups can define private fields here or use the void *private */
>  	union {
> -- 
> 2.34.1
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 2/2] fanotify: support watching filesystems and mounts inside userns
  2025-05-14 15:49   ` Jan Kara
@ 2025-05-14 18:39     ` Amir Goldstein
  2025-05-16 17:28       ` Amir Goldstein
  0 siblings, 1 reply; 14+ messages in thread
From: Amir Goldstein @ 2025-05-14 18:39 UTC (permalink / raw)
  To: Jan Kara; +Cc: Miklos Szeredi, Christian Brauner, linux-fsdevel

On Wed, May 14, 2025 at 5:49 PM Jan Kara <jack@suse.cz> wrote:
>
> On Sat 19-04-25 12:06:57, Amir Goldstein wrote:
> > An unprivileged user is allowed to create an fanotify group and add
> > inode marks, but not filesystem, mntns and mount marks.
> >
> > Add limited support for setting up filesystem, mntns and mount marks by
> > an unprivileged user under the following conditions:
> >
> > 1.   User has CAP_SYS_ADMIN in the user ns where the group was created
> > 2.a. User has CAP_SYS_ADMIN in the user ns where the filesystem was
> >      mounted (implies FS_USERNS_MOUNT)
> >   OR (in case setting up a mntns or mount mark)
> > 2.b. User has CAP_SYS_ADMIN in the user ns associated with the mntns
> >
> > Signed-off-by: Amir Goldstein <amir73il@gmail.com>
>
> I'm sorry for the delay. Both patches look good to me but I'd like somebody
> more versed with user namespaces to double check namespace checks in this
> patch are indeed sound (so that we don't introduce some security issue).

Good idea!

> Christian, can you have a look please?
>

Christian,

Please note that the checks below are loosely modeled after the tests in
may_decode_fh(), with some differences:

> >
> >       /*
> > -      * An unprivileged user is not allowed to setup mount nor filesystem
> > -      * marks.  This also includes setting up such marks by a group that
> > -      * was initialized by an unprivileged user.
> > +      * A user is allowed to setup sb/mount/mntns marks only if it is
> > +      * capable in the user ns where the group was created.
> >        */
> > -     if ((!capable(CAP_SYS_ADMIN) ||
> > -          FAN_GROUP_FLAG(group, FANOTIFY_UNPRIV)) &&
> > +     if (!ns_capable(group->user_ns, CAP_SYS_ADMIN) &&
> >           mark_type != FAN_MARK_INODE)
> >               return -EPERM;
> >

1. This is an extra restriction. Not sure is need to remain forever,
but it reduces
    attack surface and does not limit the common use cases,
    so I think it's worth to have.

> > @@ -1987,12 +1988,27 @@ static int do_fanotify_mark(int fanotify_fd, unsigned int flags, __u64 mask,
> >               obj = inode;
> >       } else if (obj_type == FSNOTIFY_OBJ_TYPE_VFSMOUNT) {
> >               obj = path.mnt;
> > +             user_ns = real_mount(obj)->mnt_ns->user_ns;
> >       } else if (obj_type == FSNOTIFY_OBJ_TYPE_SB) {
> >               obj = path.mnt->mnt_sb;
> > +             user_ns = path.mnt->mnt_sb->s_user_ns;
> >       } else if (obj_type == FSNOTIFY_OBJ_TYPE_MNTNS) {
> >               obj = mnt_ns_from_dentry(path.dentry);
> > +             user_ns = ((struct mnt_namespace *)obj)->user_ns;
> >       }
> >
> > +     /*
> > +      * In addition to being capable in the user ns where group was created,
> > +      * the user also needs to be capable in the user ns associated with
> > +      * the marked filesystem (for FS_USERNS_MOUNT filesystems) or in the
> > +      * user ns associated with the mntns (when marking a mount or mntns).
> > +      * This is aligned with the required permissions to open_by_handle_at()
> > +      * a directory fid provided with the events.
> > +      */
> > +     ret = -EPERM;
> > +     if (user_ns && !ns_capable(user_ns, CAP_SYS_ADMIN))
> > +             goto path_put_and_out;
> > +

2. In may_decode_fh() we know the mount that resulting file will be
    opened from so we accept
    Either capable mnt->mnt_sb->s_user_ns
    OR capable real_mount(mnt)->mnt_ns->user_ns
whereas here we only check capable mnt->mnt_sb->s_user_ns
   when subscribing to fs events on sb
   and only check capable real_mount(mnt)->mnt_ns->user_ns
   when subscribing to fs events on a specific mount

I am not sure if there is a use case to allow watching events on a
specific mount for capable mnt->mnt_sb->s_user_ns where
real_mount(mnt)->mnt_ns->user_ns is not capable
or if that setup is even possible?

3. Unlike may_decode_fh(), we do not check has_locked_children()
    Not sure how bad it is to allow receiving events for fs changes in
    obstructed paths (with file handles that cannot be opened).
If this case does needs to be checked  then perhaps we should
check for capable mnt->mnt_sb->s_user_ns also when subscribing
to fs events on a mount.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 2/2] fanotify: support watching filesystems and mounts inside userns
  2025-04-19 10:06 ` [PATCH v2 2/2] fanotify: support watching filesystems and mounts inside userns Amir Goldstein
  2025-05-14 15:49   ` Jan Kara
@ 2025-05-16 13:22   ` Miklos Szeredi
  2025-05-16 15:32     ` Amir Goldstein
  2025-05-19 10:03   ` Christian Brauner
  2 siblings, 1 reply; 14+ messages in thread
From: Miklos Szeredi @ 2025-05-16 13:22 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Jan Kara, Christian Brauner, linux-fsdevel

On Sat, 19 Apr 2025 at 12:07, Amir Goldstein <amir73il@gmail.com> wrote:

> @@ -1987,12 +1988,27 @@ static int do_fanotify_mark(int fanotify_fd, unsigned int flags, __u64 mask,
>                 obj = inode;
>         } else if (obj_type == FSNOTIFY_OBJ_TYPE_VFSMOUNT) {
>                 obj = path.mnt;
> +               user_ns = real_mount(obj)->mnt_ns->user_ns;
>         } else if (obj_type == FSNOTIFY_OBJ_TYPE_SB) {
>                 obj = path.mnt->mnt_sb;
> +               user_ns = path.mnt->mnt_sb->s_user_ns;

The patch header notes that user_ns != &init_user_ns implies
FS_USERNS_MOUNT, but it'd be nice to document this with a WARN_ON() in
the code as well.

>         } else if (obj_type == FSNOTIFY_OBJ_TYPE_MNTNS) {
>                 obj = mnt_ns_from_dentry(path.dentry);
> +               user_ns = ((struct mnt_namespace *)obj)->user_ns;

It would be much more elegant if the type wasn't lost before this assignment.

Otherwise looks good:

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 2/2] fanotify: support watching filesystems and mounts inside userns
  2025-05-16 13:22   ` Miklos Szeredi
@ 2025-05-16 15:32     ` Amir Goldstein
  0 siblings, 0 replies; 14+ messages in thread
From: Amir Goldstein @ 2025-05-16 15:32 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: Jan Kara, Christian Brauner, linux-fsdevel

On Fri, May 16, 2025 at 3:22 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> On Sat, 19 Apr 2025 at 12:07, Amir Goldstein <amir73il@gmail.com> wrote:
>
> > @@ -1987,12 +1988,27 @@ static int do_fanotify_mark(int fanotify_fd, unsigned int flags, __u64 mask,
> >                 obj = inode;
> >         } else if (obj_type == FSNOTIFY_OBJ_TYPE_VFSMOUNT) {
> >                 obj = path.mnt;
> > +               user_ns = real_mount(obj)->mnt_ns->user_ns;
> >         } else if (obj_type == FSNOTIFY_OBJ_TYPE_SB) {
> >                 obj = path.mnt->mnt_sb;
> > +               user_ns = path.mnt->mnt_sb->s_user_ns;
>
> The patch header notes that user_ns != &init_user_ns implies
> FS_USERNS_MOUNT, but it'd be nice to document this with a WARN_ON() in
> the code as well.
>

Can't do that because the commit message is wrong...
An sb *can* have non-init s_user_ns without FS_USERNS_MOUNT
if the mounter was running in non-init user ns, but had CAP_SYS_ADMIN
in init_user_ns.

Maybe not a very likely use case, but still cannot be asserted,
so we better remove the (*FS_USERNS_MOUNT*) remark
from comment and commit message.

> >         } else if (obj_type == FSNOTIFY_OBJ_TYPE_MNTNS) {
> >                 obj = mnt_ns_from_dentry(path.dentry);
> > +               user_ns = ((struct mnt_namespace *)obj)->user_ns;
>
> It would be much more elegant if the type wasn't lost before this assignment.

True.
We can make it:

        struct mnt_namespace *mntns = mnt_ns_from_dentry(path.dentry);
        user_ns = mntns->user_ns;
        obj = mntns;

>
> Otherwise looks good:
>
> Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
>

Thanks for the review!

Amir.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 2/2] fanotify: support watching filesystems and mounts inside userns
  2025-05-14 18:39     ` Amir Goldstein
@ 2025-05-16 17:28       ` Amir Goldstein
  2025-05-16 18:52         ` Amir Goldstein
  0 siblings, 1 reply; 14+ messages in thread
From: Amir Goldstein @ 2025-05-16 17:28 UTC (permalink / raw)
  To: Christian Brauner; +Cc: Miklos Szeredi, linux-fsdevel

On Wed, May 14, 2025 at 8:39 PM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Wed, May 14, 2025 at 5:49 PM Jan Kara <jack@suse.cz> wrote:
> >
> > On Sat 19-04-25 12:06:57, Amir Goldstein wrote:
> > > An unprivileged user is allowed to create an fanotify group and add
> > > inode marks, but not filesystem, mntns and mount marks.
> > >
> > > Add limited support for setting up filesystem, mntns and mount marks by
> > > an unprivileged user under the following conditions:
> > >
> > > 1.   User has CAP_SYS_ADMIN in the user ns where the group was created
> > > 2.a. User has CAP_SYS_ADMIN in the user ns where the filesystem was
> > >      mounted (implies FS_USERNS_MOUNT)
> > >   OR (in case setting up a mntns or mount mark)
> > > 2.b. User has CAP_SYS_ADMIN in the user ns associated with the mntns
> > >
> > > Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> >
> > I'm sorry for the delay. Both patches look good to me but I'd like somebody
> > more versed with user namespaces to double check namespace checks in this
> > patch are indeed sound (so that we don't introduce some security issue).
>
> Good idea!
>
> > Christian, can you have a look please?
> >
>
> Christian,
>
> Please note that the checks below are loosely modeled after the tests in
> may_decode_fh(), with some differences:
>
> > >
> > >       /*
> > > -      * An unprivileged user is not allowed to setup mount nor filesystem
> > > -      * marks.  This also includes setting up such marks by a group that
> > > -      * was initialized by an unprivileged user.
> > > +      * A user is allowed to setup sb/mount/mntns marks only if it is
> > > +      * capable in the user ns where the group was created.
> > >        */
> > > -     if ((!capable(CAP_SYS_ADMIN) ||
> > > -          FAN_GROUP_FLAG(group, FANOTIFY_UNPRIV)) &&
> > > +     if (!ns_capable(group->user_ns, CAP_SYS_ADMIN) &&
> > >           mark_type != FAN_MARK_INODE)
> > >               return -EPERM;
> > >
>
> 1. This is an extra restriction. Not sure is need to remain forever,
> but it reduces
>     attack surface and does not limit the common use cases,
>     so I think it's worth to have.
>
> > > @@ -1987,12 +1988,27 @@ static int do_fanotify_mark(int fanotify_fd, unsigned int flags, __u64 mask,
> > >               obj = inode;
> > >       } else if (obj_type == FSNOTIFY_OBJ_TYPE_VFSMOUNT) {
> > >               obj = path.mnt;
> > > +             user_ns = real_mount(obj)->mnt_ns->user_ns;
> > >       } else if (obj_type == FSNOTIFY_OBJ_TYPE_SB) {
> > >               obj = path.mnt->mnt_sb;
> > > +             user_ns = path.mnt->mnt_sb->s_user_ns;
> > >       } else if (obj_type == FSNOTIFY_OBJ_TYPE_MNTNS) {
> > >               obj = mnt_ns_from_dentry(path.dentry);
> > > +             user_ns = ((struct mnt_namespace *)obj)->user_ns;
> > >       }
> > >
> > > +     /*
> > > +      * In addition to being capable in the user ns where group was created,
> > > +      * the user also needs to be capable in the user ns associated with
> > > +      * the marked filesystem (for FS_USERNS_MOUNT filesystems) or in the
> > > +      * user ns associated with the mntns (when marking a mount or mntns).
> > > +      * This is aligned with the required permissions to open_by_handle_at()
> > > +      * a directory fid provided with the events.
> > > +      */
> > > +     ret = -EPERM;
> > > +     if (user_ns && !ns_capable(user_ns, CAP_SYS_ADMIN))
> > > +             goto path_put_and_out;
> > > +
>
> 2. In may_decode_fh() we know the mount that resulting file will be
>     opened from so we accept
>     Either capable mnt->mnt_sb->s_user_ns
>     OR capable real_mount(mnt)->mnt_ns->user_ns
> whereas here we only check capable mnt->mnt_sb->s_user_ns
>    when subscribing to fs events on sb
>    and only check capable real_mount(mnt)->mnt_ns->user_ns
>    when subscribing to fs events on a specific mount
>
> I am not sure if there is a use case to allow watching events on a
> specific mount for capable mnt->mnt_sb->s_user_ns where
> real_mount(mnt)->mnt_ns->user_ns is not capable
> or if that setup is even possible?
>
> 3. Unlike may_decode_fh(), we do not check has_locked_children()
>     Not sure how bad it is to allow receiving events for fs changes in
>     obstructed paths (with file handles that cannot be opened).
> If this case does needs to be checked  then perhaps we should
> check for capable mnt->mnt_sb->s_user_ns also when subscribing
> to fs events on a mount.
>

OK, after some thinking I think it is best to align the logic to match
may_decode_fh() more closely, like this:

@@ -1986,13 +1987,40 @@ static int do_fanotify_mark(int fanotify_fd,
unsigned int flags, __u64 mask,
                inode = path.dentry->d_inode;
                obj = inode;
        } else if (obj_type == FSNOTIFY_OBJ_TYPE_VFSMOUNT) {
+               struct mount *mnt = real_mount(path.mnt);
+
                obj = path.mnt;
+               user_ns = path.mnt->mnt_sb->s_user_ns;
+               /*
+                * Do not allow watching a mount with locked mounts on top
+                * that could be hiding access to paths, unless user is also
+                * capable on the user ns that created the sb.
+                */
+               if (!ns_capable(user_ns, CAP_SYS_ADMIN) &&
+                   !has_locked_children(mnt, path.mnt->mnt_root))
+                       user_ns = mnt->mnt_ns->user_ns;
        } else if (obj_type == FSNOTIFY_OBJ_TYPE_SB) {
                obj = path.mnt->mnt_sb;
+               user_ns = path.mnt->mnt_sb->s_user_ns;
        } else if (obj_type == FSNOTIFY_OBJ_TYPE_MNTNS) {
-               obj = mnt_ns_from_dentry(path.dentry);
+               struct mnt_namespace *mntns = mnt_ns_from_dentry(path.dentry);
+
+               obj = mntns;
+               user_ns = mntns->user_ns;
        }

+       /*
+        * In addition to being capable in the user ns where group was created,
+        * the user also needs to be capable in the user ns associated with
+        * the marked filesystem or in the user ns associated with the mntns
+        * (when marking a mount or mntns).
+        * This is aligned with the required permissions to open_by_handle_at()
+        * a directory fid provided with the events.
+        */
+       ret = -EPERM;
+       if (user_ns && !ns_capable(user_ns, CAP_SYS_ADMIN))
+               goto path_put_and_out;
+

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 2/2] fanotify: support watching filesystems and mounts inside userns
  2025-05-16 17:28       ` Amir Goldstein
@ 2025-05-16 18:52         ` Amir Goldstein
  0 siblings, 0 replies; 14+ messages in thread
From: Amir Goldstein @ 2025-05-16 18:52 UTC (permalink / raw)
  To: Christian Brauner; +Cc: Miklos Szeredi, linux-fsdevel

On Fri, May 16, 2025 at 7:28 PM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Wed, May 14, 2025 at 8:39 PM Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > On Wed, May 14, 2025 at 5:49 PM Jan Kara <jack@suse.cz> wrote:
> > >
> > > On Sat 19-04-25 12:06:57, Amir Goldstein wrote:
> > > > An unprivileged user is allowed to create an fanotify group and add
> > > > inode marks, but not filesystem, mntns and mount marks.
> > > >
> > > > Add limited support for setting up filesystem, mntns and mount marks by
> > > > an unprivileged user under the following conditions:
> > > >
> > > > 1.   User has CAP_SYS_ADMIN in the user ns where the group was created
> > > > 2.a. User has CAP_SYS_ADMIN in the user ns where the filesystem was
> > > >      mounted (implies FS_USERNS_MOUNT)
> > > >   OR (in case setting up a mntns or mount mark)
> > > > 2.b. User has CAP_SYS_ADMIN in the user ns associated with the mntns
> > > >
> > > > Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> > >
> > > I'm sorry for the delay. Both patches look good to me but I'd like somebody
> > > more versed with user namespaces to double check namespace checks in this
> > > patch are indeed sound (so that we don't introduce some security issue).
> >
> > Good idea!
> >
> > > Christian, can you have a look please?
> > >
> >
> > Christian,
> >
> > Please note that the checks below are loosely modeled after the tests in
> > may_decode_fh(), with some differences:
> >
> > > >
> > > >       /*
> > > > -      * An unprivileged user is not allowed to setup mount nor filesystem
> > > > -      * marks.  This also includes setting up such marks by a group that
> > > > -      * was initialized by an unprivileged user.
> > > > +      * A user is allowed to setup sb/mount/mntns marks only if it is
> > > > +      * capable in the user ns where the group was created.
> > > >        */
> > > > -     if ((!capable(CAP_SYS_ADMIN) ||
> > > > -          FAN_GROUP_FLAG(group, FANOTIFY_UNPRIV)) &&
> > > > +     if (!ns_capable(group->user_ns, CAP_SYS_ADMIN) &&
> > > >           mark_type != FAN_MARK_INODE)
> > > >               return -EPERM;
> > > >
> >
> > 1. This is an extra restriction. Not sure is need to remain forever,
> > but it reduces
> >     attack surface and does not limit the common use cases,
> >     so I think it's worth to have.
> >
> > > > @@ -1987,12 +1988,27 @@ static int do_fanotify_mark(int fanotify_fd, unsigned int flags, __u64 mask,
> > > >               obj = inode;
> > > >       } else if (obj_type == FSNOTIFY_OBJ_TYPE_VFSMOUNT) {
> > > >               obj = path.mnt;
> > > > +             user_ns = real_mount(obj)->mnt_ns->user_ns;
> > > >       } else if (obj_type == FSNOTIFY_OBJ_TYPE_SB) {
> > > >               obj = path.mnt->mnt_sb;
> > > > +             user_ns = path.mnt->mnt_sb->s_user_ns;
> > > >       } else if (obj_type == FSNOTIFY_OBJ_TYPE_MNTNS) {
> > > >               obj = mnt_ns_from_dentry(path.dentry);
> > > > +             user_ns = ((struct mnt_namespace *)obj)->user_ns;
> > > >       }
> > > >
> > > > +     /*
> > > > +      * In addition to being capable in the user ns where group was created,
> > > > +      * the user also needs to be capable in the user ns associated with
> > > > +      * the marked filesystem (for FS_USERNS_MOUNT filesystems) or in the
> > > > +      * user ns associated with the mntns (when marking a mount or mntns).
> > > > +      * This is aligned with the required permissions to open_by_handle_at()
> > > > +      * a directory fid provided with the events.
> > > > +      */
> > > > +     ret = -EPERM;
> > > > +     if (user_ns && !ns_capable(user_ns, CAP_SYS_ADMIN))
> > > > +             goto path_put_and_out;
> > > > +
> >
> > 2. In may_decode_fh() we know the mount that resulting file will be
> >     opened from so we accept
> >     Either capable mnt->mnt_sb->s_user_ns
> >     OR capable real_mount(mnt)->mnt_ns->user_ns
> > whereas here we only check capable mnt->mnt_sb->s_user_ns
> >    when subscribing to fs events on sb
> >    and only check capable real_mount(mnt)->mnt_ns->user_ns
> >    when subscribing to fs events on a specific mount
> >
> > I am not sure if there is a use case to allow watching events on a
> > specific mount for capable mnt->mnt_sb->s_user_ns where
> > real_mount(mnt)->mnt_ns->user_ns is not capable
> > or if that setup is even possible?
> >
> > 3. Unlike may_decode_fh(), we do not check has_locked_children()
> >     Not sure how bad it is to allow receiving events for fs changes in
> >     obstructed paths (with file handles that cannot be opened).
> > If this case does needs to be checked  then perhaps we should
> > check for capable mnt->mnt_sb->s_user_ns also when subscribing
> > to fs events on a mount.
> >
>
> OK, after some thinking I think it is best to align the logic to match
> may_decode_fh() more closely, like this:
>
> @@ -1986,13 +1987,40 @@ static int do_fanotify_mark(int fanotify_fd,
> unsigned int flags, __u64 mask,
>                 inode = path.dentry->d_inode;
>                 obj = inode;
>         } else if (obj_type == FSNOTIFY_OBJ_TYPE_VFSMOUNT) {
> +               struct mount *mnt = real_mount(path.mnt);
> +
>                 obj = path.mnt;
> +               user_ns = path.mnt->mnt_sb->s_user_ns;
> +               /*
> +                * Do not allow watching a mount with locked mounts on top
> +                * that could be hiding access to paths, unless user is also
> +                * capable on the user ns that created the sb.
> +                */
> +               if (!ns_capable(user_ns, CAP_SYS_ADMIN) &&
> +                   !has_locked_children(mnt, path.mnt->mnt_root))
> +                       user_ns = mnt->mnt_ns->user_ns;

No scratch that.
I will send v2 that is much simpler.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 2/2] fanotify: support watching filesystems and mounts inside userns
  2025-04-19 10:06 ` [PATCH v2 2/2] fanotify: support watching filesystems and mounts inside userns Amir Goldstein
  2025-05-14 15:49   ` Jan Kara
  2025-05-16 13:22   ` Miklos Szeredi
@ 2025-05-19 10:03   ` Christian Brauner
  2025-05-19 10:46     ` Amir Goldstein
  2 siblings, 1 reply; 14+ messages in thread
From: Christian Brauner @ 2025-05-19 10:03 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Jan Kara, Miklos Szeredi, linux-fsdevel

On Sat, Apr 19, 2025 at 12:06:57PM +0200, Amir Goldstein wrote:
> An unprivileged user is allowed to create an fanotify group and add
> inode marks, but not filesystem, mntns and mount marks.
> 
> Add limited support for setting up filesystem, mntns and mount marks by
> an unprivileged user under the following conditions:
> 
> 1.   User has CAP_SYS_ADMIN in the user ns where the group was created
> 2.a. User has CAP_SYS_ADMIN in the user ns where the filesystem was
>      mounted (implies FS_USERNS_MOUNT)
>   OR (in case setting up a mntns or mount mark)
> 2.b. User has CAP_SYS_ADMIN in the user ns associated with the mntns

So the crux of the problem is that we need to be sure that for all
options we need to be sure that the scope of the permission guarantees
necessary privileges over all the associated objects are held.

CAP_SYS_ADMIN in the owning user namespace of the filesystem (1.) seems
trivially ok because it means that the caller has privileges to mount
that filesystem.

If the caller is just privileged over the owning user namespace of the
mount namespace (2.b) then they are able to listen for mount
notifications starting with v6.15. Note how that the permissions are
specifically scoped to mount objects in that api.

But what you're trying to do here is not scoped to mounts. You're using
that permission check to delegate privileges over non-mount objects such
as directories accessible from that mount as well.

IOW, if I set up a mount mark on a mount based on the fact that I have
privileges over that mount's owning users namespace then it reads to me
that if I have:

mount --bind-into-unprvileged-container /etc /my/container/rootfs/etc

such that the new bind-mount for /etc that gets plugging into the
container is owned by the unprivileged containers's mount namespace then
the container can see write/open/read events on /etc/passwd and
/etc/shadow? But that bind-mount exposes the host's /etc/shadow and
/etc/passwd. That seems like a no go to me.

> 
> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> ---
>  fs/notify/fanotify/fanotify.c      |  1 +
>  fs/notify/fanotify/fanotify_user.c | 36 +++++++++++++++++++++---------
>  include/linux/fanotify.h           |  5 ++---
>  include/linux/fsnotify_backend.h   |  1 +
>  4 files changed, 30 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
> index 6d386080faf2..060d9bee34bd 100644
> --- a/fs/notify/fanotify/fanotify.c
> +++ b/fs/notify/fanotify/fanotify.c
> @@ -1009,6 +1009,7 @@ static int fanotify_handle_event(struct fsnotify_group *group, u32 mask,
>  
>  static void fanotify_free_group_priv(struct fsnotify_group *group)
>  {
> +	put_user_ns(group->user_ns);
>  	kfree(group->fanotify_data.merge_hash);
>  	if (group->fanotify_data.ucounts)
>  		dec_ucount(group->fanotify_data.ucounts,
> diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
> index 471c57832357..b4255b661bda 100644
> --- a/fs/notify/fanotify/fanotify_user.c
> +++ b/fs/notify/fanotify/fanotify_user.c
> @@ -1499,6 +1499,7 @@ static struct hlist_head *fanotify_alloc_merge_hash(void)
>  /* fanotify syscalls */
>  SYSCALL_DEFINE2(fanotify_init, unsigned int, flags, unsigned int, event_f_flags)
>  {
> +	struct user_namespace *user_ns = current_user_ns();
>  	struct fsnotify_group *group;
>  	int f_flags, fd;
>  	unsigned int fid_mode = flags & FANOTIFY_FID_BITS;
> @@ -1513,10 +1514,11 @@ SYSCALL_DEFINE2(fanotify_init, unsigned int, flags, unsigned int, event_f_flags)
>  		/*
>  		 * An unprivileged user can setup an fanotify group with
>  		 * limited functionality - an unprivileged group is limited to
> -		 * notification events with file handles and it cannot use
> -		 * unlimited queue/marks.
> +		 * notification events with file handles or mount ids and it
> +		 * cannot use unlimited queue/marks.
>  		 */
> -		if ((flags & FANOTIFY_ADMIN_INIT_FLAGS) || !fid_mode)
> +		if ((flags & FANOTIFY_ADMIN_INIT_FLAGS) ||
> +		    !(flags & (FANOTIFY_FID_BITS | FAN_REPORT_MNT)))
>  			return -EPERM;
>  
>  		/*
> @@ -1595,8 +1597,7 @@ SYSCALL_DEFINE2(fanotify_init, unsigned int, flags, unsigned int, event_f_flags)
>  	}
>  
>  	/* Enforce groups limits per user in all containing user ns */
> -	group->fanotify_data.ucounts = inc_ucount(current_user_ns(),
> -						  current_euid(),
> +	group->fanotify_data.ucounts = inc_ucount(user_ns, current_euid(),
>  						  UCOUNT_FANOTIFY_GROUPS);
>  	if (!group->fanotify_data.ucounts) {
>  		fd = -EMFILE;
> @@ -1605,6 +1606,7 @@ SYSCALL_DEFINE2(fanotify_init, unsigned int, flags, unsigned int, event_f_flags)
>  
>  	group->fanotify_data.flags = flags | internal_flags;
>  	group->memcg = get_mem_cgroup_from_mm(current->mm);
> +	group->user_ns = get_user_ns(user_ns);
>  
>  	group->fanotify_data.merge_hash = fanotify_alloc_merge_hash();
>  	if (!group->fanotify_data.merge_hash) {
> @@ -1804,6 +1806,7 @@ static int do_fanotify_mark(int fanotify_fd, unsigned int flags, __u64 mask,
>  	struct fsnotify_group *group;
>  	struct path path;
>  	struct fan_fsid __fsid, *fsid = NULL;
> +	struct user_namespace *user_ns = NULL;
>  	u32 valid_mask = FANOTIFY_EVENTS | FANOTIFY_EVENT_FLAGS;
>  	unsigned int mark_type = flags & FANOTIFY_MARK_TYPE_BITS;
>  	unsigned int mark_cmd = flags & FANOTIFY_MARK_CMD_BITS;
> @@ -1897,12 +1900,10 @@ static int do_fanotify_mark(int fanotify_fd, unsigned int flags, __u64 mask,
>  	}
>  
>  	/*
> -	 * An unprivileged user is not allowed to setup mount nor filesystem
> -	 * marks.  This also includes setting up such marks by a group that
> -	 * was initialized by an unprivileged user.
> +	 * A user is allowed to setup sb/mount/mntns marks only if it is
> +	 * capable in the user ns where the group was created.
>  	 */
> -	if ((!capable(CAP_SYS_ADMIN) ||
> -	     FAN_GROUP_FLAG(group, FANOTIFY_UNPRIV)) &&
> +	if (!ns_capable(group->user_ns, CAP_SYS_ADMIN) &&
>  	    mark_type != FAN_MARK_INODE)
>  		return -EPERM;
>  
> @@ -1987,12 +1988,27 @@ static int do_fanotify_mark(int fanotify_fd, unsigned int flags, __u64 mask,
>  		obj = inode;
>  	} else if (obj_type == FSNOTIFY_OBJ_TYPE_VFSMOUNT) {
>  		obj = path.mnt;
> +		user_ns = real_mount(obj)->mnt_ns->user_ns;
>  	} else if (obj_type == FSNOTIFY_OBJ_TYPE_SB) {
>  		obj = path.mnt->mnt_sb;
> +		user_ns = path.mnt->mnt_sb->s_user_ns;
>  	} else if (obj_type == FSNOTIFY_OBJ_TYPE_MNTNS) {
>  		obj = mnt_ns_from_dentry(path.dentry);
> +		user_ns = ((struct mnt_namespace *)obj)->user_ns;
>  	}
>  
> +	/*
> +	 * In addition to being capable in the user ns where group was created,
> +	 * the user also needs to be capable in the user ns associated with
> +	 * the marked filesystem (for FS_USERNS_MOUNT filesystems) or in the
> +	 * user ns associated with the mntns (when marking a mount or mntns).
> +	 * This is aligned with the required permissions to open_by_handle_at()
> +	 * a directory fid provided with the events.
> +	 */
> +	ret = -EPERM;
> +	if (user_ns && !ns_capable(user_ns, CAP_SYS_ADMIN))
> +		goto path_put_and_out;
> +
>  	ret = -EINVAL;
>  	if (!obj)
>  		goto path_put_and_out;
> diff --git a/include/linux/fanotify.h b/include/linux/fanotify.h
> index 3c817dc6292e..879cff5eccd4 100644
> --- a/include/linux/fanotify.h
> +++ b/include/linux/fanotify.h
> @@ -38,8 +38,7 @@
>  					 FAN_REPORT_PIDFD | \
>  					 FAN_REPORT_FD_ERROR | \
>  					 FAN_UNLIMITED_QUEUE | \
> -					 FAN_UNLIMITED_MARKS | \
> -					 FAN_REPORT_MNT)
> +					 FAN_UNLIMITED_MARKS)
>  
>  /*
>   * fanotify_init() flags that are allowed for user without CAP_SYS_ADMIN.
> @@ -48,7 +47,7 @@
>   * so one of the flags for reporting file handles is required.
>   */
>  #define FANOTIFY_USER_INIT_FLAGS	(FAN_CLASS_NOTIF | \
> -					 FANOTIFY_FID_BITS | \
> +					 FANOTIFY_FID_BITS | FAN_REPORT_MNT | \
>  					 FAN_CLOEXEC | FAN_NONBLOCK)
>  
>  #define FANOTIFY_INIT_FLAGS	(FANOTIFY_ADMIN_INIT_FLAGS | \
> diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
> index fc27b53c58c2..d4034ddaf392 100644
> --- a/include/linux/fsnotify_backend.h
> +++ b/include/linux/fsnotify_backend.h
> @@ -250,6 +250,7 @@ struct fsnotify_group {
>  						 * full */
>  
>  	struct mem_cgroup *memcg;	/* memcg to charge allocations */
> +	struct user_namespace *user_ns;	/* user ns where group was created */
>  
>  	/* groups can define private fields here or use the void *private */
>  	union {
> -- 
> 2.34.1
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 2/2] fanotify: support watching filesystems and mounts inside userns
  2025-05-19 10:03   ` Christian Brauner
@ 2025-05-19 10:46     ` Amir Goldstein
  0 siblings, 0 replies; 14+ messages in thread
From: Amir Goldstein @ 2025-05-19 10:46 UTC (permalink / raw)
  To: Christian Brauner; +Cc: Jan Kara, Miklos Szeredi, linux-fsdevel

On Mon, May 19, 2025 at 12:03 PM Christian Brauner <brauner@kernel.org> wrote:
>
> On Sat, Apr 19, 2025 at 12:06:57PM +0200, Amir Goldstein wrote:
> > An unprivileged user is allowed to create an fanotify group and add
> > inode marks, but not filesystem, mntns and mount marks.
> >
> > Add limited support for setting up filesystem, mntns and mount marks by
> > an unprivileged user under the following conditions:
> >
> > 1.   User has CAP_SYS_ADMIN in the user ns where the group was created
> > 2.a. User has CAP_SYS_ADMIN in the user ns where the filesystem was
> >      mounted (implies FS_USERNS_MOUNT)
> >   OR (in case setting up a mntns or mount mark)
> > 2.b. User has CAP_SYS_ADMIN in the user ns associated with the mntns
>
> So the crux of the problem is that we need to be sure that for all
> options we need to be sure that the scope of the permission guarantees
> necessary privileges over all the associated objects are held.
>
> CAP_SYS_ADMIN in the owning user namespace of the filesystem (1.) seems
> trivially ok because it means that the caller has privileges to mount
> that filesystem.
>
> If the caller is just privileged over the owning user namespace of the
> mount namespace (2.b) then they are able to listen for mount
> notifications starting with v6.15. Note how that the permissions are
> specifically scoped to mount objects in that api.
>
> But what you're trying to do here is not scoped to mounts. You're using
> that permission check to delegate privileges over non-mount objects such
> as directories accessible from that mount as well.
>
> IOW, if I set up a mount mark on a mount based on the fact that I have
> privileges over that mount's owning users namespace then it reads to me
> that if I have:
>
> mount --bind-into-unprvileged-container /etc /my/container/rootfs/etc
>
> such that the new bind-mount for /etc that gets plugging into the
> container is owned by the unprivileged containers's mount namespace then
> the container can see write/open/read events on /etc/passwd and
> /etc/shadow? But that bind-mount exposes the host's /etc/shadow and
> /etc/passwd. That seems like a no go to me.
>

You are absolutely right.

This is why I change the mount mark to require same privileges
as filesystem mark in v3:

https://lore.kernel.org/linux-fsdevel/20250516192803.838659-1-amir73il@gmail.com/

Sorry that I have not clarified this in this thread.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2025-05-19 10:46 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-04-19 10:06 [PATCH v2 0/2] User namespace aware fanotify Amir Goldstein
2025-04-19 10:06 ` [PATCH v2 1/2] fanotify: remove redundant permission checks Amir Goldstein
2025-04-19 10:06 ` [PATCH v2 2/2] fanotify: support watching filesystems and mounts inside userns Amir Goldstein
2025-05-14 15:49   ` Jan Kara
2025-05-14 18:39     ` Amir Goldstein
2025-05-16 17:28       ` Amir Goldstein
2025-05-16 18:52         ` Amir Goldstein
2025-05-16 13:22   ` Miklos Szeredi
2025-05-16 15:32     ` Amir Goldstein
2025-05-19 10:03   ` Christian Brauner
2025-05-19 10:46     ` Amir Goldstein
2025-04-19 11:48 ` [PATCH v2 0/2] User namespace aware fanotify Amir Goldstein
2025-05-08 20:46   ` Amir Goldstein
2025-05-14  9:00     ` Amir Goldstein

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).