[PATCH v4 0/3] kernfs: Add inotify IN_DELETE_SELF, IN_IGNORED support

public inbox for linux-fsdevel@vger.kernel.org

* [PATCH v4 0/3] kernfs: Add inotify IN_DELETE_SELF, IN_IGNORED support
@ 2026-02-20  5:54 T.J. Mercier
  2026-02-20  5:54 ` [PATCH v4 1/3] kernfs: Don't set_nlink for directories being removed T.J. Mercier
                   ` (3 more replies)
  0 siblings, 4 replies; 20+ messages in thread
From: T.J. Mercier @ 2026-02-20  5:54 UTC (permalink / raw)
  To: gregkh, tj, driver-core, linux-kernel, cgroups, linux-fsdevel,
	jack, amir73il, shuah, linux-kselftest
  Cc: T.J. Mercier

This series adds support for IN_DELETE_SELF and IN_IGNORED inotify
events to kernfs files and directories.

Currently, kernfs (used by cgroup and others) supports IN_MODIFY events
but fails to notify watchers when the file is removed (e.g. during
cgroup destruction). This forces userspace monitors to maintain
resource-intensive side-channels like pidfds, procfs polling, or redundant
directory watches to detect when a cgroup dies and a watched file is
removed.

By generating IN_DELETE_SELF events on destruction, we allow watchers to
rely on a single watch descriptor for the entire lifecycle of the
monitored file, reducing resource usage (file descriptors, CPU cycles)
and complexity in userspace.
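
To illustrate the single-watch lifecycle described above, here is a minimal
userspace sketch. It is not part of the series. The names watch_summary,
follow_watch and demo_regular_file are hypothetical, and the demo runs
against a regular temporary file, where IN_DELETE_SELF and IN_IGNORED are
already delivered; with this series applied, the same loop is expected to
work for a kernfs file such as memory.events.

```c
#include <assert.h>
#include <poll.h>
#include <stdlib.h>
#include <sys/inotify.h>
#include <unistd.h>

/* What one watch reported over the lifetime of the watched file. */
struct watch_summary {
	int modified;		/* number of IN_MODIFY events seen */
	int deleted_self;	/* 1 once IN_DELETE_SELF arrived */
	int ignored;		/* 1 once IN_IGNORED arrived (watch is gone) */
};

/*
 * Read events for watch descriptor wd until IN_IGNORED arrives, or give up
 * if nothing shows up within timeout_ms. Returns 0 on success, -1 otherwise.
 */
static int follow_watch(int ifd, int wd, int timeout_ms,
			struct watch_summary *out)
{
	/* Buffer aligned for struct inotify_event, as inotify(7) suggests. */
	char buf[4096]
		__attribute__((aligned(__alignof__(struct inotify_event))));

	assert(out != NULL);
	out->modified = out->deleted_self = out->ignored = 0;

	while (!out->ignored) {
		struct pollfd pfd = { .fd = ifd, .events = POLLIN };
		ssize_t len;
		char *p;

		if (poll(&pfd, 1, timeout_ms) <= 0)
			return -1;
		len = read(ifd, buf, sizeof(buf));
		if (len <= 0)
			return -1;

		/* Events are variable length: header plus ev->len name bytes. */
		for (p = buf; p < buf + len; ) {
			const struct inotify_event *ev =
				(const struct inotify_event *)p;

			if (ev->wd == wd) {
				if (ev->mask & IN_MODIFY)
					out->modified++;
				if (ev->mask & IN_DELETE_SELF)
					out->deleted_self = 1;
				if (ev->mask & IN_IGNORED)
					out->ignored = 1;
			}
			p += sizeof(*ev) + ev->len;
		}
	}
	return 0;
}

/* Demo on a regular temporary file; returns 0 if both events arrived. */
static int demo_regular_file(void)
{
	char path[] = "/tmp/inotify_sketch_XXXXXX";
	struct watch_summary s;
	int tmpfd, ifd, wd, ret = -1;

	tmpfd = mkstemp(path);
	if (tmpfd == -1)
		return -1;
	close(tmpfd);	/* no open fds, so unlink() drops the last reference */

	ifd = inotify_init1(IN_CLOEXEC);
	if (ifd == -1) {
		unlink(path);
		return -1;
	}

	wd = inotify_add_watch(ifd, path, IN_MODIFY | IN_DELETE_SELF);
	unlink(path);	/* remove the file whether or not the watch was added */
	if (wd != -1 && follow_watch(ifd, wd, 2000, &s) == 0)
		ret = (s.deleted_self && s.ignored) ? 0 : -1;

	close(ifd);
	return ret;
}
```

A monitor would call follow_watch() from its event loop and drop its
per-cgroup state once the summary reports ignored, since the watch
descriptor is no longer valid at that point.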

The series is structured as follows:
Patch 1 preemptively addresses a race to set/clear i_nlink that would
        arise in patch 2.
Patch 2 implements the logic to generate DELETE_SELF and IGNORED events
        on file / dir removal.
Patch 3 adds selftests to verify the new behavior.

---
Changes in v4:
Clear inode i_nlink upon kernfs removal instead of calling fsnotify
from kernfs per Jan. This adds support for directories.
Abandon support for files removed from vfs_writes.
Add selftest for directory watch per Amir.
Add Amir's Ack to selftests.

Changes in v3:
Remove parent IN_DELETE notification per Amir.
  Refactored kernfs_notify_workfn to avoid grabbing parent when
  unnecessary for DELETE events as a result.
Use notify_event for fsnotify_inode call per Amir
Initialize memcg pointers to NULL in selftests
Add Amir's Ack
Add Tejun's Acks to the series

Changes in v2:
Remove unused variables from new selftests per kernel test robot
Fix kernfs_type argument per Tejun
Inline checks for FS_MODIFY, FS_DELETE in kernfs_notify_workfn per Tejun

T.J. Mercier (3):
  kernfs: Don't set_nlink for directories being removed
  kernfs: Send IN_DELETE_SELF and IN_IGNORED
  selftests: memcg: Add tests for IN_DELETE_SELF and IN_IGNORED

 fs/kernfs/dir.c                               |  32 ++++-
 fs/kernfs/inode.c                             |   2 +-
 .../selftests/cgroup/test_memcontrol.c        | 112 ++++++++++++++++++
 3 files changed, 144 insertions(+), 2 deletions(-)

base-commit: ba268514ea14b44570030e8ed2aef92a38679e85
-- 
2.53.0.414.gf7e9f6c205-goog

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH v4 1/3] kernfs: Don't set_nlink for directories being removed
  2026-02-20  5:54 [PATCH v4 0/3] kernfs: Add inotify IN_DELETE_SELF, IN_IGNORED support T.J. Mercier
@ 2026-02-20  5:54 ` T.J. Mercier
  2026-02-20  5:54 ` [PATCH v4 2/3] kernfs: Send IN_DELETE_SELF and IN_IGNORED T.J. Mercier
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 20+ messages in thread
From: T.J. Mercier @ 2026-02-20  5:54 UTC (permalink / raw)
  To: gregkh, tj, driver-core, linux-kernel, cgroups, linux-fsdevel,
	jack, amir73il, shuah, linux-kselftest
  Cc: T.J. Mercier

If a directory is already in the process of removal, its i_nlink count
becomes irrelevant because its contents are also about to be removed and
any pending filesystem operations on it or its contents will soon start
to fail. So we can avoid setting it for directories already flagged for
removal.

This avoids a race in the next patch, which adds clearing of the i_nlink
count for kernfs nodes being removed to support inotify delete events.

Use protection from the kernfs_iattr_rwsem to avoid adding more
contention to the kernfs_rwsem for calls to kernfs_refresh_inode.

Signed-off-by: T.J. Mercier <tjmercier@google.com>
---
 fs/kernfs/dir.c   | 2 ++
 fs/kernfs/inode.c | 2 +-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 29baeeb97871..5b6ce2351a53 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -1491,12 +1491,14 @@ static void __kernfs_remove(struct kernfs_node *kn)
 	pr_debug("kernfs %s: removing\n", kernfs_rcu_name(kn));
 
 	/* prevent new usage by marking all nodes removing and deactivating */
+	down_write(&kernfs_root(kn)->kernfs_iattr_rwsem);
 	pos = NULL;
 	while ((pos = kernfs_next_descendant_post(pos, kn))) {
 		pos->flags |= KERNFS_REMOVING;
 		if (kernfs_active(pos))
 			atomic_add(KN_DEACTIVATED_BIAS, &pos->active);
 	}
+	up_write(&kernfs_root(kn)->kernfs_iattr_rwsem);
 
 	/* deactivate and unlink the subtree node-by-node */
 	do {
diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
index a36aaee98dce..afdc4021e81a 100644
--- a/fs/kernfs/inode.c
+++ b/fs/kernfs/inode.c
@@ -178,7 +178,7 @@ static void kernfs_refresh_inode(struct kernfs_node *kn, struct inode *inode)
 		 */
 		set_inode_attr(inode, attrs);
 
-	if (kernfs_type(kn) == KERNFS_DIR)
+	if (kernfs_type(kn) == KERNFS_DIR && !(kn->flags & KERNFS_REMOVING))
 		set_nlink(inode, kn->dir.subdirs + 2);
 }
 
-- 
2.53.0.414.gf7e9f6c205-goog



* [PATCH v4 2/3] kernfs: Send IN_DELETE_SELF and IN_IGNORED
  2026-02-20  5:54 [PATCH v4 0/3] kernfs: Add inotify IN_DELETE_SELF, IN_IGNORED support T.J. Mercier
  2026-02-20  5:54 ` [PATCH v4 1/3] kernfs: Don't set_nlink for directories being removed T.J. Mercier
@ 2026-02-20  5:54 ` T.J. Mercier
  2026-02-20 15:32   ` Tejun Heo
  2026-02-20  5:54 ` [PATCH v4 3/3] selftests: memcg: Add tests for " T.J. Mercier
  2026-02-20 10:14 ` [syzbot ci] Re: kernfs: Add inotify IN_DELETE_SELF, IN_IGNORED support syzbot ci
  3 siblings, 1 reply; 20+ messages in thread
From: T.J. Mercier @ 2026-02-20  5:54 UTC (permalink / raw)
  To: gregkh, tj, driver-core, linux-kernel, cgroups, linux-fsdevel,
	jack, amir73il, shuah, linux-kselftest
  Cc: T.J. Mercier

Currently some kernfs files (e.g. cgroup.events, memory.events) support
inotify watches for IN_MODIFY, but unlike with regular filesystems, they
do not receive IN_DELETE_SELF or IN_IGNORED events when they are
removed. This means inotify watches persist after file deletion until
the process exits and the inotify file descriptor is cleaned up, or
until inotify_rm_watch is called manually.
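
For reference, the manual cleanup mentioned above can be sketched as
follows. This is an illustration only, not code from the series: the helper
names remove_watch_and_confirm and demo_manual_cleanup are hypothetical, the
demo uses a regular temporary file, and the helper assumes no other events
are queued ahead of the IN_IGNORED it expects.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/inotify.h>
#include <unistd.h>

/*
 * Remove a watch explicitly and check for the IN_IGNORED event that the
 * kernel queues in response. Returns 0 if it was seen, -1 otherwise.
 */
static int remove_watch_and_confirm(int ifd, int wd)
{
	struct inotify_event ev;	/* IN_IGNORED has no name, so len == 0 */

	assert(ifd >= 0 && wd >= 0);

	if (inotify_rm_watch(ifd, wd) == -1)
		return -1;
	if (read(ifd, &ev, sizeof(ev)) != (ssize_t)sizeof(ev))
		return -1;

	return (ev.wd == wd && (ev.mask & IN_IGNORED)) ? 0 : -1;
}

/* Demo on a regular temporary file; returns 0 on success. */
static int demo_manual_cleanup(void)
{
	char path[] = "/tmp/inotify_rm_XXXXXX";
	int tmpfd, ifd, wd, ret = -1;

	tmpfd = mkstemp(path);
	if (tmpfd == -1)
		return -1;
	close(tmpfd);

	/* Non-blocking, so a missing event fails the read instead of hanging. */
	ifd = inotify_init1(IN_NONBLOCK);
	if (ifd != -1) {
		wd = inotify_add_watch(ifd, path, IN_MODIFY);
		if (wd != -1)
			ret = remove_watch_and_confirm(ifd, wd);
		close(ifd);
	}

	unlink(path);
	return ret;
}
```

The catch for kernfs today is that userspace has no event telling it when to
call inotify_rm_watch(), which is why the watch lingers.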

This creates a problem for processes monitoring cgroups. For example, a
service monitoring memory.events for memory.high breaches needs to know
when a cgroup is removed to clean up its state. Where it's known that a
cgroup is removed when all processes die, without IN_DELETE_SELF the
service must resort to inefficient workarounds such as:
  1) Periodically scanning procfs to detect process death (wastes CPU
     and is susceptible to PID reuse).
  2) Holding a pidfd for every monitored cgroup (can exhaust file
     descriptors).
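
Workaround 2 can be sketched roughly as below, to show where the file
descriptor cost comes from: one pidfd stays open per tracked process. This
is a hedged illustration, not code from any real monitor. The helper names
are hypothetical, and the fallback syscall number 434 is an assumption that
holds for architectures using the generic numbering (Linux 5.3 or later).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434	/* assumption: generic syscall number */
#endif

/* Thin wrapper, since older C libraries do not provide pidfd_open(). */
static int open_pidfd(pid_t pid)
{
	return (int)syscall(SYS_pidfd_open, pid, 0);
}

/* Returns 1 if the process exited within timeout_ms, 0 if not, -1 on error. */
static int wait_for_exit(int pidfd, int timeout_ms)
{
	struct pollfd pfd = { .fd = pidfd, .events = POLLIN };
	int n;

	assert(pidfd >= 0);
	n = poll(&pfd, 1, timeout_ms);
	if (n < 0)
		return -1;
	return (n > 0 && (pfd.revents & POLLIN)) ? 1 : 0;
}

/*
 * Demo: returns 0 if the child's exit was observed through the pidfd,
 * 1 if the kernel lacks pidfd_open(), and -1 on any other failure.
 */
static int demo_pidfd(void)
{
	pid_t child = fork();
	int pidfd, exited;

	if (child == -1)
		return -1;
	if (child == 0)
		_exit(0);	/* child: exit immediately */

	pidfd = open_pidfd(child);
	if (pidfd == -1) {
		int unsupported = (errno == ENOSYS);

		waitpid(child, NULL, 0);
		return unsupported ? 1 : -1;
	}

	exited = wait_for_exit(pidfd, 2000);
	close(pidfd);
	waitpid(child, NULL, 0);	/* reap the child */

	return exited == 1 ? 0 : -1;
}
```

A service tracking many cgroups this way keeps at least one such descriptor
open per cgroup for as long as it is monitored, in addition to its inotify
watch.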

This patch enables IN_DELETE_SELF and IN_IGNORED events for kernfs files
and directories by clearing inode i_nlink values during removal. This
allows VFS to make the necessary fsnotify calls so that userspace
receives the inotify events.

As a result, applications can rely on a single existing watch on a file
of interest (e.g. memory.events) to receive notifications for both
modifications and the eventual removal of the file, as well as automatic
watch descriptor cleanup, simplifying userspace logic and improving
efficiency.

There is a gap in this implementation for certain file removals due to
their unique nature in kernfs. Directory removals that trigger file
removals occur through vfs_rmdir, which shrinks the dcache and emits
fsnotify events after the rmdir operation; there is no issue here.
However, kernfs writes to particular files (e.g. cgroup.subtree_control)
can also cause file removal, but vfs_write does not attempt to emit
fsnotify events after the write operation, even if i_nlink counts are 0.
As a use case for monitoring this category of file removals is not known,
they are left without having IN_DELETE or IN_DELETE_SELF events
generated.

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: T.J. Mercier <tjmercier@google.com>
---
 fs/kernfs/dir.c | 30 +++++++++++++++++++++++++++++-
 1 file changed, 29 insertions(+), 1 deletion(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 5b6ce2351a53..41541b969fb2 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -1471,6 +1471,23 @@ void kernfs_show(struct kernfs_node *kn, bool show)
 	up_write(&root->kernfs_rwsem);
 }
 
+static void kernfs_clear_inode_nlink(struct kernfs_node *kn)
+{
+	struct kernfs_root *root = kernfs_root(kn);
+	struct kernfs_super_info *info;
+
+	lockdep_assert_held_read(&root->kernfs_supers_rwsem);
+
+	list_for_each_entry(info, &root->supers, node) {
+		struct inode *inode = ilookup(info->sb, kernfs_ino(kn));
+
+		if (inode) {
+			clear_nlink(inode);
+			iput(inode);
+		}
+	}
+}
+
 static void __kernfs_remove(struct kernfs_node *kn)
 {
 	struct kernfs_node *pos, *parent;
@@ -1479,6 +1496,7 @@ static void __kernfs_remove(struct kernfs_node *kn)
 	if (!kn)
 		return;
 
+	lockdep_assert_held_read(&kernfs_root(kn)->kernfs_supers_rwsem);
 	lockdep_assert_held_write(&kernfs_root(kn)->kernfs_rwsem);
 
 	/*
@@ -1522,9 +1540,11 @@ static void __kernfs_remove(struct kernfs_node *kn)
 			struct kernfs_iattrs *ps_iattr =
 				parent ? parent->iattr : NULL;
 
-			/* update timestamps on the parent */
 			down_write(&kernfs_root(kn)->kernfs_iattr_rwsem);
 
+			kernfs_clear_inode_nlink(pos);
+
+			/* update timestamps on the parent */
 			if (ps_iattr) {
 				ktime_get_real_ts64(&ps_iattr->ia_ctime);
 				ps_iattr->ia_mtime = ps_iattr->ia_ctime;
@@ -1553,9 +1573,11 @@ void kernfs_remove(struct kernfs_node *kn)
 
 	root = kernfs_root(kn);
 
+	down_read(&root->kernfs_supers_rwsem);
 	down_write(&root->kernfs_rwsem);
 	__kernfs_remove(kn);
 	up_write(&root->kernfs_rwsem);
+	up_read(&root->kernfs_supers_rwsem);
 }
 
 /**
@@ -1646,6 +1668,7 @@ bool kernfs_remove_self(struct kernfs_node *kn)
 	bool ret;
 	struct kernfs_root *root = kernfs_root(kn);
 
+	down_read(&root->kernfs_supers_rwsem);
 	down_write(&root->kernfs_rwsem);
 	kernfs_break_active_protection(kn);
 
@@ -1675,7 +1698,9 @@ bool kernfs_remove_self(struct kernfs_node *kn)
 				break;
 
 			up_write(&root->kernfs_rwsem);
+			up_read(&root->kernfs_supers_rwsem);
 			schedule();
+			down_read(&root->kernfs_supers_rwsem);
 			down_write(&root->kernfs_rwsem);
 		}
 		finish_wait(waitq, &wait);
@@ -1690,6 +1715,7 @@ bool kernfs_remove_self(struct kernfs_node *kn)
 	kernfs_unbreak_active_protection(kn);
 
 	up_write(&root->kernfs_rwsem);
+	up_read(&root->kernfs_supers_rwsem);
 	return ret;
 }
 
@@ -1716,6 +1742,7 @@ int kernfs_remove_by_name_ns(struct kernfs_node *parent, const char *name,
 	}
 
 	root = kernfs_root(parent);
+	down_read(&root->kernfs_supers_rwsem);
 	down_write(&root->kernfs_rwsem);
 
 	kn = kernfs_find_ns(parent, name, ns);
@@ -1726,6 +1753,7 @@ int kernfs_remove_by_name_ns(struct kernfs_node *parent, const char *name,
 	}
 
 	up_write(&root->kernfs_rwsem);
+	up_read(&root->kernfs_supers_rwsem);
 
 	if (kn)
 		return 0;
-- 
2.53.0.414.gf7e9f6c205-goog



* [PATCH v4 3/3] selftests: memcg: Add tests for IN_DELETE_SELF and IN_IGNORED
  2026-02-20  5:54 [PATCH v4 0/3] kernfs: Add inotify IN_DELETE_SELF, IN_IGNORED support T.J. Mercier
  2026-02-20  5:54 ` [PATCH v4 1/3] kernfs: Don't set_nlink for directories being removed T.J. Mercier
  2026-02-20  5:54 ` [PATCH v4 2/3] kernfs: Send IN_DELETE_SELF and IN_IGNORED T.J. Mercier
@ 2026-02-20  5:54 ` T.J. Mercier
  2026-02-20 17:43   ` Amir Goldstein
  2026-02-20 10:14 ` [syzbot ci] Re: kernfs: Add inotify IN_DELETE_SELF, IN_IGNORED support syzbot ci
  3 siblings, 1 reply; 20+ messages in thread
From: T.J. Mercier @ 2026-02-20  5:54 UTC (permalink / raw)
  To: gregkh, tj, driver-core, linux-kernel, cgroups, linux-fsdevel,
	jack, amir73il, shuah, linux-kselftest
  Cc: T.J. Mercier

Add two new tests that verify inotify events are sent when memcg files
or directories are removed with rmdir.
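
The read_event() helper in this patch reads exactly one fixed-size event,
which is sufficient for watches placed on the file or directory itself
because those events carry no name. As a hedged sketch of the more general
case, the code below scans a buffer of variable-length events. It is not
part of the patch: find_event, put_event and demo_find_event are
hypothetical names, and the buffer in the demo is synthetic rather than read
from an inotify fd.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/inotify.h>

/*
 * Scan a buffer as filled by read(2) on an inotify fd. Return the index of
 * the first event for wd whose mask contains all bits in mask, or -1.
 */
static int find_event(const char *buf, size_t len, int wd, uint32_t mask)
{
	size_t off = 0;
	int index = 0;

	while (off + sizeof(struct inotify_event) <= len) {
		struct inotify_event ev;

		/* memcpy avoids any alignment assumption about buf. */
		memcpy(&ev, buf + off, sizeof(ev));
		if (off + sizeof(ev) + ev.len > len)
			break;	/* truncated record */
		if (ev.wd == wd && (ev.mask & mask) == mask)
			return index;

		off += sizeof(ev) + ev.len;	/* skip header and name */
		index++;
	}
	return -1;
}

/* Append a synthetic event to buf at off; returns the new offset. */
static size_t put_event(char *buf, size_t off, int wd, uint32_t mask,
			const char *name, uint32_t name_len)
{
	struct inotify_event ev = {
		.wd = wd, .mask = mask, .cookie = 0, .len = name_len,
	};

	/* The padded name field must have room for the terminating NUL. */
	assert(name_len == 0 || strlen(name) < name_len);

	memcpy(buf + off, &ev, sizeof(ev));
	memset(buf + off + sizeof(ev), 0, name_len);
	if (name_len)
		memcpy(buf + off + sizeof(ev), name, strlen(name));

	return off + sizeof(ev) + name_len;
}

/* Demo with three synthetic events; returns 0 if lookups behave. */
static int demo_find_event(void)
{
	char buf[256];
	size_t len = 0;

	len = put_event(buf, len, 1, IN_DELETE, "memory.events", 16);
	len = put_event(buf, len, 2, IN_DELETE_SELF, "", 0);
	len = put_event(buf, len, 2, IN_IGNORED, "", 0);

	if (find_event(buf, len, 2, IN_DELETE_SELF) != 1)
		return -1;
	if (find_event(buf, len, 2, IN_IGNORED) != 2)
		return -1;
	if (find_event(buf, len, 3, IN_IGNORED) != -1)
		return -1;
	return 0;
}
```

A parser of this shape would only matter if the tests also watched a parent
directory, where IN_DELETE events carry the child's name.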

Signed-off-by: T.J. Mercier <tjmercier@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Amir Goldstein <amir73il@gmail.com>
---
 .../selftests/cgroup/test_memcontrol.c        | 112 ++++++++++++++++++
 1 file changed, 112 insertions(+)

diff --git a/tools/testing/selftests/cgroup/test_memcontrol.c b/tools/testing/selftests/cgroup/test_memcontrol.c
index 4e1647568c5b..57726bc82757 100644
--- a/tools/testing/selftests/cgroup/test_memcontrol.c
+++ b/tools/testing/selftests/cgroup/test_memcontrol.c
@@ -10,6 +10,7 @@
 #include <sys/stat.h>
 #include <sys/types.h>
 #include <unistd.h>
+#include <sys/inotify.h>
 #include <sys/socket.h>
 #include <sys/wait.h>
 #include <arpa/inet.h>
@@ -1625,6 +1626,115 @@ static int test_memcg_oom_group_score_events(const char *root)
 	return ret;
 }
 
+static int read_event(int inotify_fd, int expected_event, int expected_wd)
+{
+	struct inotify_event event;
+	ssize_t len = 0;
+
+	len = read(inotify_fd, &event, sizeof(event));
+	if (len < (ssize_t)sizeof(event))
+		return -1;
+
+	if (event.mask != expected_event || event.wd != expected_wd) {
+		fprintf(stderr,
+			"event does not match expected values: mask %d (expected %d) wd %d (expected %d)\n",
+			event.mask, expected_event, event.wd, expected_wd);
+		return -1;
+	}
+
+	return 0;
+}
+
+static int test_memcg_inotify_delete_file(const char *root)
+{
+	int ret = KSFT_FAIL;
+	char *memcg = NULL;
+	int fd = -1, wd;
+
+	memcg = cg_name(root, "memcg_test_0");
+
+	if (!memcg)
+		goto cleanup;
+
+	if (cg_create(memcg))
+		goto cleanup;
+
+	fd = inotify_init1(0);
+	if (fd == -1)
+		goto cleanup;
+
+	wd = inotify_add_watch(fd, cg_control(memcg, "memory.events"), IN_DELETE_SELF);
+	if (wd == -1)
+		goto cleanup;
+
+	if (cg_destroy(memcg))
+		goto cleanup;
+	free(memcg);
+	memcg = NULL;
+
+	if (read_event(fd, IN_DELETE_SELF, wd))
+		goto cleanup;
+
+	if (read_event(fd, IN_IGNORED, wd))
+		goto cleanup;
+
+	ret = KSFT_PASS;
+
+cleanup:
+	if (fd >= 0)
+		close(fd);
+	if (memcg)
+		cg_destroy(memcg);
+	free(memcg);
+
+	return ret;
+}
+
+static int test_memcg_inotify_delete_dir(const char *root)
+{
+	int ret = KSFT_FAIL;
+	char *memcg = NULL;
+	int fd = -1, wd;
+
+	memcg = cg_name(root, "memcg_test_0");
+
+	if (!memcg)
+		goto cleanup;
+
+	if (cg_create(memcg))
+		goto cleanup;
+
+	fd = inotify_init1(0);
+	if (fd == -1)
+		goto cleanup;
+
+	wd = inotify_add_watch(fd, memcg, IN_DELETE_SELF);
+	if (wd == -1)
+		goto cleanup;
+
+	if (cg_destroy(memcg))
+		goto cleanup;
+	free(memcg);
+	memcg = NULL;
+
+	if (read_event(fd, IN_DELETE_SELF, wd))
+		goto cleanup;
+
+	if (read_event(fd, IN_IGNORED, wd))
+		goto cleanup;
+
+	ret = KSFT_PASS;
+
+cleanup:
+	if (fd >= 0)
+		close(fd);
+	if (memcg)
+		cg_destroy(memcg);
+	free(memcg);
+
+	return ret;
+}
+
 #define T(x) { x, #x }
 struct memcg_test {
 	int (*fn)(const char *root);
@@ -1644,6 +1754,8 @@ struct memcg_test {
 	T(test_memcg_oom_group_leaf_events),
 	T(test_memcg_oom_group_parent_events),
 	T(test_memcg_oom_group_score_events),
+	T(test_memcg_inotify_delete_file),
+	T(test_memcg_inotify_delete_dir),
 };
 #undef T
 
-- 
2.53.0.414.gf7e9f6c205-goog



* [syzbot ci] Re: kernfs: Add inotify IN_DELETE_SELF, IN_IGNORED support
  2026-02-20  5:54 [PATCH v4 0/3] kernfs: Add inotify IN_DELETE_SELF, IN_IGNORED support T.J. Mercier
                   ` (2 preceding siblings ...)
  2026-02-20  5:54 ` [PATCH v4 3/3] selftests: memcg: Add tests for " T.J. Mercier
@ 2026-02-20 10:14 ` syzbot ci
  2026-02-20 18:41   ` T.J. Mercier
  3 siblings, 1 reply; 20+ messages in thread
From: syzbot ci @ 2026-02-20 10:14 UTC (permalink / raw)
  To: amir73il, cgroups, driver-core, gregkh, jack, linux-fsdevel,
	linux-kernel, linux-kselftest, shuah, tj, tjmercier
  Cc: syzbot, syzkaller-bugs

syzbot ci has tested the following series

[v4] kernfs: Add inotify IN_DELETE_SELF, IN_IGNORED support
https://lore.kernel.org/all/20260220055449.3073-1-tjmercier@google.com
* [PATCH v4 1/3] kernfs: Don't set_nlink for directories being removed
* [PATCH v4 2/3] kernfs: Send IN_DELETE_SELF and IN_IGNORED
* [PATCH v4 3/3] selftests: memcg: Add tests for IN_DELETE_SELF and IN_IGNORED

and found the following issue:
possible deadlock in __kernfs_remove

Full report is available here:
https://ci.syzbot.org/series/4b44d5c2-c2eb-4425-a19a-f9963b64f74f

***

possible deadlock in __kernfs_remove

tree:      bpf-next
URL:       https://kernel.googlesource.com/pub/scm/linux/kernel/git/bpf/bpf-next.git
base:      ba268514ea14b44570030e8ed2aef92a38679e85
arch:      amd64
compiler:  Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8
config:    https://ci.syzbot.org/builds/45ab774f-e8d7-4def-8279-888a5cb2d01e/config
syz repro: https://ci.syzbot.org/findings/b74cbc6a-1cef-4ae9-be46-dd9e8b29b648/syz_repro

======================================================
WARNING: possible circular locking dependency detected
syzkaller #0 Not tainted
------------------------------------------------------
kworker/u8:1/13 is trying to acquire lock:
ffff88816ef2b878 (kn->active#5){++++}-{0:0}, at: __kernfs_remove+0x47e/0x8c0 fs/kernfs/dir.c:1533

but task is already holding lock:
ffff8881012e8ab8 (&root->kernfs_supers_rwsem){++++}-{4:4}, at: kernfs_remove_by_name_ns+0x3f/0x140 fs/kernfs/dir.c:1745

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #2 (&root->kernfs_supers_rwsem){++++}-{4:4}:
       down_read+0x47/0x2e0 kernel/locking/rwsem.c:1537
       kernfs_remove_by_name_ns+0x3f/0x140 fs/kernfs/dir.c:1745
       acpi_unbind_one+0x2d8/0x3b0 drivers/acpi/glue.c:337
       device_platform_notify_remove drivers/base/core.c:2386 [inline]
       device_del+0x547/0x8f0 drivers/base/core.c:3881
       serdev_controller_add+0x46f/0x640 drivers/tty/serdev/core.c:785
       serdev_tty_port_register+0x159/0x260 drivers/tty/serdev/serdev-ttyport.c:291
       tty_port_register_device_attr_serdev+0xe7/0x170 drivers/tty/tty_port.c:187
       serial_core_add_one_port drivers/tty/serial/serial_core.c:3107 [inline]
       serial_core_register_port+0x103a/0x28b0 drivers/tty/serial/serial_core.c:3305
       serial8250_register_8250_port+0x1658/0x1fd0 drivers/tty/serial/8250/8250_core.c:822
       serial_pnp_probe+0x568/0x7f0 drivers/tty/serial/8250/8250_pnp.c:480
       pnp_device_probe+0x30b/0x4c0 drivers/pnp/driver.c:111
       call_driver_probe drivers/base/dd.c:-1 [inline]
       really_probe+0x267/0xaf0 drivers/base/dd.c:661
       __driver_probe_device+0x18c/0x320 drivers/base/dd.c:803
       driver_probe_device+0x4f/0x240 drivers/base/dd.c:833
       __driver_attach+0x3e7/0x710 drivers/base/dd.c:1227
       bus_for_each_dev+0x23b/0x2c0 drivers/base/bus.c:383
       bus_add_driver+0x345/0x670 drivers/base/bus.c:715
       driver_register+0x23a/0x320 drivers/base/driver.c:249
       serial8250_init+0x8f/0x160 drivers/tty/serial/8250/8250_platform.c:317
       do_one_initcall+0x250/0x840 init/main.c:1378
       do_initcall_level+0x104/0x190 init/main.c:1440
       do_initcalls+0x59/0xa0 init/main.c:1456
       kernel_init_freeable+0x2a6/0x3d0 init/main.c:1688
       kernel_init+0x1d/0x1d0 init/main.c:1578
       ret_from_fork+0x51b/0xa40 arch/x86/kernel/process.c:158
       ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:246

-> #1 (&device->physical_node_lock){+.+.}-{4:4}:
       __mutex_lock_common kernel/locking/mutex.c:614 [inline]
       __mutex_lock+0x19f/0x1300 kernel/locking/mutex.c:776
       acpi_get_first_physical_node drivers/acpi/bus.c:691 [inline]
       acpi_primary_dev_companion drivers/acpi/bus.c:710 [inline]
       acpi_companion_match+0x8a/0x120 drivers/acpi/bus.c:764
       acpi_device_uevent_modalias+0x1a/0x30 drivers/acpi/device_sysfs.c:280
       platform_uevent+0x3c/0xb0 drivers/base/platform.c:1411
       dev_uevent+0x446/0x8a0 drivers/base/core.c:2692
       kobject_uevent_env+0x477/0x9e0 lib/kobject_uevent.c:573
       kobject_synth_uevent+0x585/0xbd0 lib/kobject_uevent.c:207
       uevent_store+0x26/0x70 drivers/base/core.c:2773
       kernfs_fop_write_iter+0x3af/0x540 fs/kernfs/file.c:352
       new_sync_write fs/read_write.c:593 [inline]
       vfs_write+0x61d/0xb90 fs/read_write.c:686
       ksys_write+0x150/0x270 fs/read_write.c:738
       do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
       do_syscall_64+0xe2/0xf80 arch/x86/entry/syscall_64.c:94
       entry_SYSCALL_64_after_hwframe+0x77/0x7f

-> #0 (kn->active#5){++++}-{0:0}:
       check_prev_add kernel/locking/lockdep.c:3165 [inline]
       check_prevs_add kernel/locking/lockdep.c:3284 [inline]
       validate_chain kernel/locking/lockdep.c:3908 [inline]
       __lock_acquire+0x15a5/0x2cf0 kernel/locking/lockdep.c:5237
       lock_acquire+0x106/0x330 kernel/locking/lockdep.c:5868
       kernfs_drain+0x27c/0x5f0 fs/kernfs/dir.c:511
       __kernfs_remove+0x47e/0x8c0 fs/kernfs/dir.c:1533
       kernfs_remove_by_name_ns+0xc0/0x140 fs/kernfs/dir.c:1751
       sysfs_remove_file include/linux/sysfs.h:780 [inline]
       device_remove_file drivers/base/core.c:3071 [inline]
       device_del+0x506/0x8f0 drivers/base/core.c:3876
       device_unregister+0x21/0xf0 drivers/base/core.c:3919
       mac80211_hwsim_del_radio+0x2dc/0x490 drivers/net/wireless/virtual/mac80211_hwsim.c:5918
       hwsim_exit_net+0xede/0xfa0 drivers/net/wireless/virtual/mac80211_hwsim.c:6807
       ops_exit_list net/core/net_namespace.c:199 [inline]
       ops_undo_list+0x49f/0x940 net/core/net_namespace.c:252
       cleanup_net+0x4df/0x7b0 net/core/net_namespace.c:696
       process_one_work kernel/workqueue.c:3257 [inline]
       process_scheduled_works+0xaec/0x17a0 kernel/workqueue.c:3340
       worker_thread+0xda6/0x1360 kernel/workqueue.c:3421
       kthread+0x726/0x8b0 kernel/kthread.c:463
       ret_from_fork+0x51b/0xa40 arch/x86/kernel/process.c:158
       ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:246

other info that might help us debug this:

Chain exists of:
  kn->active#5 --> &device->physical_node_lock --> &root->kernfs_supers_rwsem

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  rlock(&root->kernfs_supers_rwsem);
                               lock(&device->physical_node_lock);
                               lock(&root->kernfs_supers_rwsem);
  lock(kn->active#5);

 *** DEADLOCK ***

4 locks held by kworker/u8:1/13:
 #0: ffff888100ef7948 ((wq_completion)netns){+.+.}-{0:0}, at: process_one_work kernel/workqueue.c:3232 [inline]
 #0: ffff888100ef7948 ((wq_completion)netns){+.+.}-{0:0}, at: process_scheduled_works+0x9d4/0x17a0 kernel/workqueue.c:3340
 #1: ffffc90000127bc0 (net_cleanup_work){+.+.}-{0:0}, at: process_one_work kernel/workqueue.c:3233 [inline]
 #1: ffffc90000127bc0 (net_cleanup_work){+.+.}-{0:0}, at: process_scheduled_works+0xa0f/0x17a0 kernel/workqueue.c:3340
 #2: ffffffff8f99d2d0 (pernet_ops_rwsem){++++}-{4:4}, at: cleanup_net+0xfe/0x7b0 net/core/net_namespace.c:670
 #3: ffff8881012e8ab8 (&root->kernfs_supers_rwsem){++++}-{4:4}, at: kernfs_remove_by_name_ns+0x3f/0x140 fs/kernfs/dir.c:1745

stack backtrace:
CPU: 0 UID: 0 PID: 13 Comm: kworker/u8:1 Not tainted syzkaller #0 PREEMPT(full) 
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
Workqueue: netns cleanup_net
Call Trace:
 <TASK>
 dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120
 print_circular_bug+0x2e1/0x300 kernel/locking/lockdep.c:2043
 check_noncircular+0x12e/0x150 kernel/locking/lockdep.c:2175
 check_prev_add kernel/locking/lockdep.c:3165 [inline]
 check_prevs_add kernel/locking/lockdep.c:3284 [inline]
 validate_chain kernel/locking/lockdep.c:3908 [inline]
 __lock_acquire+0x15a5/0x2cf0 kernel/locking/lockdep.c:5237
 lock_acquire+0x106/0x330 kernel/locking/lockdep.c:5868
 kernfs_drain+0x27c/0x5f0 fs/kernfs/dir.c:511
 __kernfs_remove+0x47e/0x8c0 fs/kernfs/dir.c:1533
 kernfs_remove_by_name_ns+0xc0/0x140 fs/kernfs/dir.c:1751
 sysfs_remove_file include/linux/sysfs.h:780 [inline]
 device_remove_file drivers/base/core.c:3071 [inline]
 device_del+0x506/0x8f0 drivers/base/core.c:3876
 device_unregister+0x21/0xf0 drivers/base/core.c:3919
 mac80211_hwsim_del_radio+0x2dc/0x490 drivers/net/wireless/virtual/mac80211_hwsim.c:5918
 hwsim_exit_net+0xede/0xfa0 drivers/net/wireless/virtual/mac80211_hwsim.c:6807
 ops_exit_list net/core/net_namespace.c:199 [inline]
 ops_undo_list+0x49f/0x940 net/core/net_namespace.c:252
 cleanup_net+0x4df/0x7b0 net/core/net_namespace.c:696
 process_one_work kernel/workqueue.c:3257 [inline]
 process_scheduled_works+0xaec/0x17a0 kernel/workqueue.c:3340
 worker_thread+0xda6/0x1360 kernel/workqueue.c:3421
 kthread+0x726/0x8b0 kernel/kthread.c:463
 ret_from_fork+0x51b/0xa40 arch/x86/kernel/process.c:158
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:246
 </TASK>
hsr_slave_0: left promiscuous mode
hsr_slave_1: left promiscuous mode
batman_adv: batadv0: Interface deactivated: batadv_slave_0
batman_adv: batadv0: Removing interface: batadv_slave_0
batman_adv: batadv0: Interface deactivated: batadv_slave_1
batman_adv: batadv0: Removing interface: batadv_slave_1
veth1_macvtap: left promiscuous mode
veth0_macvtap: left promiscuous mode
veth1_vlan: left promiscuous mode
veth0_vlan: left promiscuous mode
team0 (unregistering): Port device team_slave_1 removed
team0 (unregistering): Port device team_slave_0 removed
netdevsim netdevsim2 netdevsim0: set [1, 0] type 2 family 0 port 6081 - 0
netdevsim netdevsim2 netdevsim1: set [1, 0] type 2 family 0 port 6081 - 0
netdevsim netdevsim2 netdevsim2: set [1, 0] type 2 family 0 port 6081 - 0
netdevsim netdevsim2 netdevsim3: set [1, 0] type 2 family 0 port 6081 - 0


***

If these findings have caused you to resend the series or submit a
separate fix, please add the following tag to your commit message:
  Tested-by: syzbot@syzkaller.appspotmail.com

---
This report is generated by a bot. It may contain errors.
syzbot ci engineers can be reached at syzkaller@googlegroups.com.


* Re: [PATCH v4 2/3] kernfs: Send IN_DELETE_SELF and IN_IGNORED
  2026-02-20  5:54 ` [PATCH v4 2/3] kernfs: Send IN_DELETE_SELF and IN_IGNORED T.J. Mercier
@ 2026-02-20 15:32   ` Tejun Heo
  2026-02-20 17:15     ` Amir Goldstein
  0 siblings, 1 reply; 20+ messages in thread
From: Tejun Heo @ 2026-02-20 15:32 UTC (permalink / raw)
  To: T.J. Mercier
  Cc: gregkh, driver-core, linux-kernel, cgroups, linux-fsdevel, jack,
	amir73il, shuah, linux-kselftest

Hello,

On Thu, Feb 19, 2026 at 09:54:47PM -0800, T.J. Mercier wrote:
> Currently some kernfs files (e.g. cgroup.events, memory.events) support
> inotify watches for IN_MODIFY, but unlike with regular filesystems, they
> do not receive IN_DELETE_SELF or IN_IGNORED events when they are
> removed. This means inotify watches persist after file deletion until
> the process exits and the inotify file descriptor is cleaned up, or
> until inotify_rm_watch is called manually.
> 
> This creates a problem for processes monitoring cgroups. For example, a
> service monitoring memory.events for memory.high breaches needs to know
> when a cgroup is removed to clean up its state. Where it's known that a
> cgroup is removed when all processes die, without IN_DELETE_SELF the
> service must resort to inefficient workarounds such as:
>   1) Periodically scanning procfs to detect process death (wastes CPU
>      and is susceptible to PID reuse).
>   2) Holding a pidfd for every monitored cgroup (can exhaust file
>      descriptors).
> 
> This patch enables IN_DELETE_SELF and IN_IGNORED events for kernfs files
> and directories by clearing inode i_nlink values during removal. This
> allows VFS to make the necessary fsnotify calls so that userspace
> receives the inotify events.
> 
> As a result, applications can rely on a single existing watch on a file
> of interest (e.g. memory.events) to receive notifications for both
> modifications and the eventual removal of the file, as well as automatic
> watch descriptor cleanup, simplifying userspace logic and improving
> efficiency.
> 
> There is a gap in this implementation for certain file removals due to their
> unique nature in kernfs. Directory removals that trigger file removals
> occur through vfs_rmdir, which shrinks the dcache and emits fsnotify
> events after the rmdir operation; there is no issue here. However kernfs
> writes to particular files (e.g. cgroup.subtree_control) can also cause
> file removal, but vfs_write does not attempt to emit fsnotify events
> after the write operation, even if i_nlink counts are 0. As a usecase
> for monitoring this category of file removals is not known, they are
> left without having IN_DELETE or IN_DELETE_SELF events generated.

Adding a comment with the above content would probably be useful. It also
might be worthwhile to note that fanotify recursive monitoring wouldn't work
reliably as cgroups can go away while inodes are not attached.

> Suggested-by: Jan Kara <jack@suse.cz>
> Signed-off-by: T.J. Mercier <tjmercier@google.com>

Acked-by: Tejun Heo <tj@kernel.org>

Thanks.

-- 
tejun


* Re: [PATCH v4 2/3] kernfs: Send IN_DELETE_SELF and IN_IGNORED
  2026-02-20 15:32   ` Tejun Heo
@ 2026-02-20 17:15     ` Amir Goldstein
  2026-02-20 19:50       ` Tejun Heo
  0 siblings, 1 reply; 20+ messages in thread
From: Amir Goldstein @ 2026-02-20 17:15 UTC (permalink / raw)
  To: Tejun Heo
  Cc: T.J. Mercier, gregkh, driver-core, linux-kernel, cgroups,
	linux-fsdevel, jack, shuah, linux-kselftest

On Fri, Feb 20, 2026 at 4:32 PM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Thu, Feb 19, 2026 at 09:54:47PM -0800, T.J. Mercier wrote:
> > Currently some kernfs files (e.g. cgroup.events, memory.events) support
> > inotify watches for IN_MODIFY, but unlike with regular filesystems, they
> > do not receive IN_DELETE_SELF or IN_IGNORED events when they are
> > removed. This means inotify watches persist after file deletion until
> > the process exits and the inotify file descriptor is cleaned up, or
> > until inotify_rm_watch is called manually.
> >
> > This creates a problem for processes monitoring cgroups. For example, a
> > service monitoring memory.events for memory.high breaches needs to know
> > when a cgroup is removed to clean up its state. Where it's known that a
> > cgroup is removed when all processes die, without IN_DELETE_SELF the
> > service must resort to inefficient workarounds such as:
> >   1) Periodically scanning procfs to detect process death (wastes CPU
> >      and is susceptible to PID reuse).
> >   2) Holding a pidfd for every monitored cgroup (can exhaust file
> >      descriptors).
> >
> > This patch enables IN_DELETE_SELF and IN_IGNORED events for kernfs files
> > and directories by clearing inode i_nlink values during removal. This
> > allows VFS to make the necessary fsnotify calls so that userspace
> > receives the inotify events.
> >
> > As a result, applications can rely on a single existing watch on a file
> > of interest (e.g. memory.events) to receive notifications for both
> > modifications and the eventual removal of the file, as well as automatic
> > watch descriptor cleanup, simplifying userspace logic and improving
> > efficiency.
> >
> > There is a gap in this implementation for certain file removals due to their
> > unique nature in kernfs. Directory removals that trigger file removals
> > occur through vfs_rmdir, which shrinks the dcache and emits fsnotify
> > events after the rmdir operation; there is no issue here. However kernfs
> > writes to particular files (e.g. cgroup.subtree_control) can also cause
> > file removal, but vfs_write does not attempt to emit fsnotify events
> > after the write operation, even if i_nlink counts are 0. As a use case
> > for monitoring this category of file removals is not known, they are
> > left without having IN_DELETE or IN_DELETE_SELF events generated.
>
> Adding a comment with the above content would probably be useful. It also
> might be worthwhile to note that fanotify recursive monitoring wouldn't work
> reliably as cgroups can go away while inodes are not attached.

Sigh.. it's a shame to grow more weird semantics.

But I take this back to the POV of "remote" vs. "local" vfs notifications.
The IN_DELETE_SELF events added by this change are actually
"local" vfs notifications.

If we wanted to support monitoring the cgroup fs super block
for all added/removed cgroups with fanotify, we could implement
that as "remote" notifications, and in that case adding
explicit fsnotify() calls could make sense.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 3/3] selftests: memcg: Add tests for IN_DELETE_SELF and IN_IGNORED
  2026-02-20  5:54 ` [PATCH v4 3/3] selftests: memcg: Add tests for " T.J. Mercier
@ 2026-02-20 17:43   ` Amir Goldstein
  2026-02-20 17:46     ` T.J. Mercier
  0 siblings, 1 reply; 20+ messages in thread
From: Amir Goldstein @ 2026-02-20 17:43 UTC (permalink / raw)
  To: T.J. Mercier
  Cc: gregkh, tj, driver-core, linux-kernel, cgroups, linux-fsdevel,
	jack, shuah, linux-kselftest

On Fri, Feb 20, 2026 at 6:55 AM T.J. Mercier <tjmercier@google.com> wrote:
>
> Add two new tests that verify inotify events are sent when memcg files
> or directories are removed with rmdir.
>
> Signed-off-by: T.J. Mercier <tjmercier@google.com>
> Acked-by: Tejun Heo <tj@kernel.org>
> Acked-by: Amir Goldstein <amir73il@gmail.com>
> ---
>  .../selftests/cgroup/test_memcontrol.c        | 112 ++++++++++++++++++
>  1 file changed, 112 insertions(+)
>
> diff --git a/tools/testing/selftests/cgroup/test_memcontrol.c b/tools/testing/selftests/cgroup/test_memcontrol.c
> index 4e1647568c5b..57726bc82757 100644
> --- a/tools/testing/selftests/cgroup/test_memcontrol.c
> +++ b/tools/testing/selftests/cgroup/test_memcontrol.c
> @@ -10,6 +10,7 @@
>  #include <sys/stat.h>
>  #include <sys/types.h>
>  #include <unistd.h>
> +#include <sys/inotify.h>
>  #include <sys/socket.h>
>  #include <sys/wait.h>
>  #include <arpa/inet.h>
> @@ -1625,6 +1626,115 @@ static int test_memcg_oom_group_score_events(const char *root)
>         return ret;
>  }
>
> +static int read_event(int inotify_fd, int expected_event, int expected_wd)
> +{
> +       struct inotify_event event;
> +       ssize_t len = 0;
> +
> +       len = read(inotify_fd, &event, sizeof(event));
> +       if (len < (ssize_t)sizeof(event))
> +               return -1;
> +
> +       if (event.mask != expected_event || event.wd != expected_wd) {
> +               fprintf(stderr,
> +                       "event does not match expected values: mask %d (expected %d) wd %d (expected %d)\n",
> +                       event.mask, expected_event, event.wd, expected_wd);
> +               return -1;
> +       }
> +
> +       return 0;
> +}
> +
> +static int test_memcg_inotify_delete_file(const char *root)
> +{
> +       int ret = KSFT_FAIL;
> +       char *memcg = NULL;
> +       int fd = -1, wd;
> +
> +       memcg = cg_name(root, "memcg_test_0");
> +
> +       if (!memcg)
> +               goto cleanup;
> +
> +       if (cg_create(memcg))
> +               goto cleanup;
> +
> +       fd = inotify_init1(0);
> +       if (fd == -1)
> +               goto cleanup;
> +
> +       wd = inotify_add_watch(fd, cg_control(memcg, "memory.events"), IN_DELETE_SELF);
> +       if (wd == -1)
> +               goto cleanup;
> +
> +       if (cg_destroy(memcg))
> +               goto cleanup;
> +       free(memcg);
> +       memcg = NULL;
> +
> +       if (read_event(fd, IN_DELETE_SELF, wd))
> +               goto cleanup;
> +
> +       if (read_event(fd, IN_IGNORED, wd))
> +               goto cleanup;
> +
> +       ret = KSFT_PASS;
> +
> +cleanup:
> +       if (fd >= 0)
> +               close(fd);
> +       if (memcg)
> +               cg_destroy(memcg);
> +       free(memcg);
> +
> +       return ret;
> +}
> +
> +static int test_memcg_inotify_delete_dir(const char *root)
> +{
> +       int ret = KSFT_FAIL;
> +       char *memcg = NULL;
> +       int fd = -1, wd;
> +
> +       memcg = cg_name(root, "memcg_test_0");
> +
> +       if (!memcg)
> +               goto cleanup;
> +
> +       if (cg_create(memcg))
> +               goto cleanup;
> +
> +       fd = inotify_init1(0);
> +       if (fd == -1)
> +               goto cleanup;
> +
> +       wd = inotify_add_watch(fd, memcg, IN_DELETE_SELF);
> +       if (wd == -1)
> +               goto cleanup;
> +
> +       if (cg_destroy(memcg))
> +               goto cleanup;
> +       free(memcg);
> +       memcg = NULL;
> +
> +       if (read_event(fd, IN_DELETE_SELF, wd))
> +               goto cleanup;


Does this test pass? I expect that listener would get event mask
IN_DELETE_SELF | IN_ISDIR?

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 3/3] selftests: memcg: Add tests for IN_DELETE_SELF and IN_IGNORED
  2026-02-20 17:43   ` Amir Goldstein
@ 2026-02-20 17:46     ` T.J. Mercier
  2026-02-20 17:53       ` T.J. Mercier
  0 siblings, 1 reply; 20+ messages in thread
From: T.J. Mercier @ 2026-02-20 17:46 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: gregkh, tj, driver-core, linux-kernel, cgroups, linux-fsdevel,
	jack, shuah, linux-kselftest

On Fri, Feb 20, 2026 at 9:44 AM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Fri, Feb 20, 2026 at 6:55 AM T.J. Mercier <tjmercier@google.com> wrote:
> >
> > Add two new tests that verify inotify events are sent when memcg files
> > or directories are removed with rmdir.
> >
> > Signed-off-by: T.J. Mercier <tjmercier@google.com>
> > Acked-by: Tejun Heo <tj@kernel.org>
> > Acked-by: Amir Goldstein <amir73il@gmail.com>
> > ---
> >  .../selftests/cgroup/test_memcontrol.c        | 112 ++++++++++++++++++
> >  1 file changed, 112 insertions(+)
> >
> > diff --git a/tools/testing/selftests/cgroup/test_memcontrol.c b/tools/testing/selftests/cgroup/test_memcontrol.c
> > index 4e1647568c5b..57726bc82757 100644
> > --- a/tools/testing/selftests/cgroup/test_memcontrol.c
> > +++ b/tools/testing/selftests/cgroup/test_memcontrol.c
> > @@ -10,6 +10,7 @@
> >  #include <sys/stat.h>
> >  #include <sys/types.h>
> >  #include <unistd.h>
> > +#include <sys/inotify.h>
> >  #include <sys/socket.h>
> >  #include <sys/wait.h>
> >  #include <arpa/inet.h>
> > @@ -1625,6 +1626,115 @@ static int test_memcg_oom_group_score_events(const char *root)
> >         return ret;
> >  }
> >
> > +static int read_event(int inotify_fd, int expected_event, int expected_wd)
> > +{
> > +       struct inotify_event event;
> > +       ssize_t len = 0;
> > +
> > +       len = read(inotify_fd, &event, sizeof(event));
> > +       if (len < (ssize_t)sizeof(event))
> > +               return -1;
> > +
> > +       if (event.mask != expected_event || event.wd != expected_wd) {
> > +               fprintf(stderr,
> > +                       "event does not match expected values: mask %d (expected %d) wd %d (expected %d)\n",
> > +                       event.mask, expected_event, event.wd, expected_wd);
> > +               return -1;
> > +       }
> > +
> > +       return 0;
> > +}
> > +
> > +static int test_memcg_inotify_delete_file(const char *root)
> > +{
> > +       int ret = KSFT_FAIL;
> > +       char *memcg = NULL;
> > +       int fd = -1, wd;
> > +
> > +       memcg = cg_name(root, "memcg_test_0");
> > +
> > +       if (!memcg)
> > +               goto cleanup;
> > +
> > +       if (cg_create(memcg))
> > +               goto cleanup;
> > +
> > +       fd = inotify_init1(0);
> > +       if (fd == -1)
> > +               goto cleanup;
> > +
> > +       wd = inotify_add_watch(fd, cg_control(memcg, "memory.events"), IN_DELETE_SELF);
> > +       if (wd == -1)
> > +               goto cleanup;
> > +
> > +       if (cg_destroy(memcg))
> > +               goto cleanup;
> > +       free(memcg);
> > +       memcg = NULL;
> > +
> > +       if (read_event(fd, IN_DELETE_SELF, wd))
> > +               goto cleanup;
> > +
> > +       if (read_event(fd, IN_IGNORED, wd))
> > +               goto cleanup;
> > +
> > +       ret = KSFT_PASS;
> > +
> > +cleanup:
> > +       if (fd >= 0)
> > +               close(fd);
> > +       if (memcg)
> > +               cg_destroy(memcg);
> > +       free(memcg);
> > +
> > +       return ret;
> > +}
> > +
> > +static int test_memcg_inotify_delete_dir(const char *root)
> > +{
> > +       int ret = KSFT_FAIL;
> > +       char *memcg = NULL;
> > +       int fd = -1, wd;
> > +
> > +       memcg = cg_name(root, "memcg_test_0");
> > +
> > +       if (!memcg)
> > +               goto cleanup;
> > +
> > +       if (cg_create(memcg))
> > +               goto cleanup;
> > +
> > +       fd = inotify_init1(0);
> > +       if (fd == -1)
> > +               goto cleanup;
> > +
> > +       wd = inotify_add_watch(fd, memcg, IN_DELETE_SELF);
> > +       if (wd == -1)
> > +               goto cleanup;
> > +
> > +       if (cg_destroy(memcg))
> > +               goto cleanup;
> > +       free(memcg);
> > +       memcg = NULL;
> > +
> > +       if (read_event(fd, IN_DELETE_SELF, wd))
> > +               goto cleanup;
>
>
> Does this test pass? I expect that listener would get event mask
> IN_DELETE_SELF | IN_ISDIR?

Yes, I tested on 4 different machines across different filesystems and
none of them set IN_ISDIR with IN_DELETE_SELF. The inotify docs say,
"may be set"... I wonder if that is wishful thinking?

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 3/3] selftests: memcg: Add tests for IN_DELETE_SELF and IN_IGNORED
  2026-02-20 17:46     ` T.J. Mercier
@ 2026-02-20 17:53       ` T.J. Mercier
  2026-02-20 18:01         ` Amir Goldstein
  0 siblings, 1 reply; 20+ messages in thread
From: T.J. Mercier @ 2026-02-20 17:53 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: gregkh, tj, driver-core, linux-kernel, cgroups, linux-fsdevel,
	jack, shuah, linux-kselftest

On Fri, Feb 20, 2026 at 9:46 AM T.J. Mercier <tjmercier@google.com> wrote:
>
> On Fri, Feb 20, 2026 at 9:44 AM Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > On Fri, Feb 20, 2026 at 6:55 AM T.J. Mercier <tjmercier@google.com> wrote:
> > >
> > > Add two new tests that verify inotify events are sent when memcg files
> > > or directories are removed with rmdir.
> > >
> > > Signed-off-by: T.J. Mercier <tjmercier@google.com>
> > > Acked-by: Tejun Heo <tj@kernel.org>
> > > Acked-by: Amir Goldstein <amir73il@gmail.com>
> > > ---
> > >  .../selftests/cgroup/test_memcontrol.c        | 112 ++++++++++++++++++
> > >  1 file changed, 112 insertions(+)
> > >
> > > diff --git a/tools/testing/selftests/cgroup/test_memcontrol.c b/tools/testing/selftests/cgroup/test_memcontrol.c
> > > index 4e1647568c5b..57726bc82757 100644
> > > --- a/tools/testing/selftests/cgroup/test_memcontrol.c
> > > +++ b/tools/testing/selftests/cgroup/test_memcontrol.c
> > > @@ -10,6 +10,7 @@
> > >  #include <sys/stat.h>
> > >  #include <sys/types.h>
> > >  #include <unistd.h>
> > > +#include <sys/inotify.h>
> > >  #include <sys/socket.h>
> > >  #include <sys/wait.h>
> > >  #include <arpa/inet.h>
> > > @@ -1625,6 +1626,115 @@ static int test_memcg_oom_group_score_events(const char *root)
> > >         return ret;
> > >  }
> > >
> > > +static int read_event(int inotify_fd, int expected_event, int expected_wd)
> > > +{
> > > +       struct inotify_event event;
> > > +       ssize_t len = 0;
> > > +
> > > +       len = read(inotify_fd, &event, sizeof(event));
> > > +       if (len < (ssize_t)sizeof(event))
> > > +               return -1;
> > > +
> > > +       if (event.mask != expected_event || event.wd != expected_wd) {
> > > +               fprintf(stderr,
> > > +                       "event does not match expected values: mask %d (expected %d) wd %d (expected %d)\n",
> > > +                       event.mask, expected_event, event.wd, expected_wd);
> > > +               return -1;
> > > +       }
> > > +
> > > +       return 0;
> > > +}
> > > +
> > > +static int test_memcg_inotify_delete_file(const char *root)
> > > +{
> > > +       int ret = KSFT_FAIL;
> > > +       char *memcg = NULL;
> > > +       int fd = -1, wd;
> > > +
> > > +       memcg = cg_name(root, "memcg_test_0");
> > > +
> > > +       if (!memcg)
> > > +               goto cleanup;
> > > +
> > > +       if (cg_create(memcg))
> > > +               goto cleanup;
> > > +
> > > +       fd = inotify_init1(0);
> > > +       if (fd == -1)
> > > +               goto cleanup;
> > > +
> > > +       wd = inotify_add_watch(fd, cg_control(memcg, "memory.events"), IN_DELETE_SELF);
> > > +       if (wd == -1)
> > > +               goto cleanup;
> > > +
> > > +       if (cg_destroy(memcg))
> > > +               goto cleanup;
> > > +       free(memcg);
> > > +       memcg = NULL;
> > > +
> > > +       if (read_event(fd, IN_DELETE_SELF, wd))
> > > +               goto cleanup;
> > > +
> > > +       if (read_event(fd, IN_IGNORED, wd))
> > > +               goto cleanup;
> > > +
> > > +       ret = KSFT_PASS;
> > > +
> > > +cleanup:
> > > +       if (fd >= 0)
> > > +               close(fd);
> > > +       if (memcg)
> > > +               cg_destroy(memcg);
> > > +       free(memcg);
> > > +
> > > +       return ret;
> > > +}
> > > +
> > > +static int test_memcg_inotify_delete_dir(const char *root)
> > > +{
> > > +       int ret = KSFT_FAIL;
> > > +       char *memcg = NULL;
> > > +       int fd = -1, wd;
> > > +
> > > +       memcg = cg_name(root, "memcg_test_0");
> > > +
> > > +       if (!memcg)
> > > +               goto cleanup;
> > > +
> > > +       if (cg_create(memcg))
> > > +               goto cleanup;
> > > +
> > > +       fd = inotify_init1(0);
> > > +       if (fd == -1)
> > > +               goto cleanup;
> > > +
> > > +       wd = inotify_add_watch(fd, memcg, IN_DELETE_SELF);
> > > +       if (wd == -1)
> > > +               goto cleanup;
> > > +
> > > +       if (cg_destroy(memcg))
> > > +               goto cleanup;
> > > +       free(memcg);
> > > +       memcg = NULL;
> > > +
> > > +       if (read_event(fd, IN_DELETE_SELF, wd))
> > > +               goto cleanup;
> >
> >
> > Does this test pass? I expect that listener would get event mask
> > IN_DELETE_SELF | IN_ISDIR?
>
> Yes, I tested on 4 different machines across different filesystems and
> none of them set IN_ISDIR with IN_DELETE_SELF. The inotify docs say,
> "may be set"... I wonder if that is wishful thinking?

Oh, very intentional:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/notify/inotify/inotify_fsnotify.c?h=v6.19#n109
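
For anyone who wants to reproduce the check outside the selftest, here is a
minimal userspace sketch. It is not part of the patch: the helper name
first_mask_after_rmdir() and the /tmp path template are made up for
illustration, and it exercises an ordinary temporary directory rather than a
cgroup. It watches the directory for IN_DELETE_SELF, removes it, and returns
the mask of the first event so the IN_ISDIR bit can be inspected.

```c
#define _GNU_SOURCE
#include <limits.h>
#include <poll.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/inotify.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the patch): watch a fresh directory under
 * /tmp for IN_DELETE_SELF, remove it, and return the mask of the first
 * event. Returns 0 on any failure or if no event arrives within 1 second.
 */
static uint32_t first_mask_after_rmdir(void)
{
	char path[] = "/tmp/inotify_isdir_XXXXXX";
	/* Room for at least one event with the longest possible name. */
	char buf[sizeof(struct inotify_event) + NAME_MAX + 1];
	struct inotify_event ev;
	struct pollfd pfd;
	uint32_t mask = 0;
	int fd;

	if (!mkdtemp(path))
		return 0;

	fd = inotify_init1(IN_CLOEXEC);
	if (fd < 0) {
		rmdir(path);
		return 0;
	}

	if (inotify_add_watch(fd, path, IN_DELETE_SELF) < 0) {
		rmdir(path);
		goto out;
	}

	if (rmdir(path))
		goto out;

	/* Bound the wait so a missing event cannot hang the caller. */
	pfd.fd = fd;
	pfd.events = POLLIN;
	if (poll(&pfd, 1, 1000) == 1 &&
	    read(fd, buf, sizeof(buf)) >= (ssize_t)sizeof(ev)) {
		/* Copy out the first event header; later events are ignored. */
		memcpy(&ev, buf, sizeof(ev));
		mask = ev.mask;
	}
out:
	close(fd);
	return mask;
}
```

A caller can test the result with (mask & IN_DELETE_SELF) and
(mask & IN_ISDIR); the bounded poll() is there only so the sketch returns
instead of blocking if the event never shows up.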

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 3/3] selftests: memcg: Add tests for IN_DELETE_SELF and IN_IGNORED
  2026-02-20 17:53       ` T.J. Mercier
@ 2026-02-20 18:01         ` Amir Goldstein
  0 siblings, 0 replies; 20+ messages in thread
From: Amir Goldstein @ 2026-02-20 18:01 UTC (permalink / raw)
  To: T.J. Mercier
  Cc: gregkh, tj, driver-core, linux-kernel, cgroups, linux-fsdevel,
	jack, shuah, linux-kselftest

On Fri, Feb 20, 2026 at 6:53 PM T.J. Mercier <tjmercier@google.com> wrote:
>
> On Fri, Feb 20, 2026 at 9:46 AM T.J. Mercier <tjmercier@google.com> wrote:
> >
> > On Fri, Feb 20, 2026 at 9:44 AM Amir Goldstein <amir73il@gmail.com> wrote:
> > >
> > > On Fri, Feb 20, 2026 at 6:55 AM T.J. Mercier <tjmercier@google.com> wrote:
> > > >
> > > > Add two new tests that verify inotify events are sent when memcg files
> > > > or directories are removed with rmdir.
> > > >
> > > > Signed-off-by: T.J. Mercier <tjmercier@google.com>
> > > > Acked-by: Tejun Heo <tj@kernel.org>
> > > > Acked-by: Amir Goldstein <amir73il@gmail.com>
> > > > ---
> > > >  .../selftests/cgroup/test_memcontrol.c        | 112 ++++++++++++++++++
> > > >  1 file changed, 112 insertions(+)
> > > >
> > > > diff --git a/tools/testing/selftests/cgroup/test_memcontrol.c b/tools/testing/selftests/cgroup/test_memcontrol.c
> > > > index 4e1647568c5b..57726bc82757 100644
> > > > --- a/tools/testing/selftests/cgroup/test_memcontrol.c
> > > > +++ b/tools/testing/selftests/cgroup/test_memcontrol.c
> > > > @@ -10,6 +10,7 @@
> > > >  #include <sys/stat.h>
> > > >  #include <sys/types.h>
> > > >  #include <unistd.h>
> > > > +#include <sys/inotify.h>
> > > >  #include <sys/socket.h>
> > > >  #include <sys/wait.h>
> > > >  #include <arpa/inet.h>
> > > > @@ -1625,6 +1626,115 @@ static int test_memcg_oom_group_score_events(const char *root)
> > > >         return ret;
> > > >  }
> > > >
> > > > +static int read_event(int inotify_fd, int expected_event, int expected_wd)
> > > > +{
> > > > +       struct inotify_event event;
> > > > +       ssize_t len = 0;
> > > > +
> > > > +       len = read(inotify_fd, &event, sizeof(event));
> > > > +       if (len < (ssize_t)sizeof(event))
> > > > +               return -1;
> > > > +
> > > > +       if (event.mask != expected_event || event.wd != expected_wd) {
> > > > +               fprintf(stderr,
> > > > +                       "event does not match expected values: mask %d (expected %d) wd %d (expected %d)\n",
> > > > +                       event.mask, expected_event, event.wd, expected_wd);
> > > > +               return -1;
> > > > +       }
> > > > +
> > > > +       return 0;
> > > > +}
> > > > +
> > > > +static int test_memcg_inotify_delete_file(const char *root)
> > > > +{
> > > > +       int ret = KSFT_FAIL;
> > > > +       char *memcg = NULL;
> > > > +       int fd = -1, wd;
> > > > +
> > > > +       memcg = cg_name(root, "memcg_test_0");
> > > > +
> > > > +       if (!memcg)
> > > > +               goto cleanup;
> > > > +
> > > > +       if (cg_create(memcg))
> > > > +               goto cleanup;
> > > > +
> > > > +       fd = inotify_init1(0);
> > > > +       if (fd == -1)
> > > > +               goto cleanup;
> > > > +
> > > > +       wd = inotify_add_watch(fd, cg_control(memcg, "memory.events"), IN_DELETE_SELF);
> > > > +       if (wd == -1)
> > > > +               goto cleanup;
> > > > +
> > > > +       if (cg_destroy(memcg))
> > > > +               goto cleanup;
> > > > +       free(memcg);
> > > > +       memcg = NULL;
> > > > +
> > > > +       if (read_event(fd, IN_DELETE_SELF, wd))
> > > > +               goto cleanup;
> > > > +
> > > > +       if (read_event(fd, IN_IGNORED, wd))
> > > > +               goto cleanup;
> > > > +
> > > > +       ret = KSFT_PASS;
> > > > +
> > > > +cleanup:
> > > > +       if (fd >= 0)
> > > > +               close(fd);
> > > > +       if (memcg)
> > > > +               cg_destroy(memcg);
> > > > +       free(memcg);
> > > > +
> > > > +       return ret;
> > > > +}
> > > > +
> > > > +static int test_memcg_inotify_delete_dir(const char *root)
> > > > +{
> > > > +       int ret = KSFT_FAIL;
> > > > +       char *memcg = NULL;
> > > > +       int fd = -1, wd;
> > > > +
> > > > +       memcg = cg_name(root, "memcg_test_0");
> > > > +
> > > > +       if (!memcg)
> > > > +               goto cleanup;
> > > > +
> > > > +       if (cg_create(memcg))
> > > > +               goto cleanup;
> > > > +
> > > > +       fd = inotify_init1(0);
> > > > +       if (fd == -1)
> > > > +               goto cleanup;
> > > > +
> > > > +       wd = inotify_add_watch(fd, memcg, IN_DELETE_SELF);
> > > > +       if (wd == -1)
> > > > +               goto cleanup;
> > > > +
> > > > +       if (cg_destroy(memcg))
> > > > +               goto cleanup;
> > > > +       free(memcg);
> > > > +       memcg = NULL;
> > > > +
> > > > +       if (read_event(fd, IN_DELETE_SELF, wd))
> > > > +               goto cleanup;
> > >
> > >
> > > Does this test pass? I expect that listener would get event mask
> > > IN_DELETE_SELF | IN_ISDIR?
> >
> > Yes, I tested on 4 different machines across different filesystems and
> > none of them set IN_ISDIR with IN_DELETE_SELF. The inotify docs say,
> > "may be set"... I wonder if that is wishful thinking?
>
> Oh, very intentional:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/notify/inotify/inotify_fsnotify.c?h=v6.19#n109

LOL yeh ok :)

Thanks for checking

Amir.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [syzbot ci] Re: kernfs: Add inotify IN_DELETE_SELF, IN_IGNORED support
  2026-02-20 10:14 ` [syzbot ci] Re: kernfs: Add inotify IN_DELETE_SELF, IN_IGNORED support syzbot ci
@ 2026-02-20 18:41   ` T.J. Mercier
  0 siblings, 0 replies; 20+ messages in thread
From: T.J. Mercier @ 2026-02-20 18:41 UTC (permalink / raw)
  To: syzbot ci
  Cc: amir73il, cgroups, driver-core, gregkh, jack, linux-fsdevel,
	linux-kernel, linux-kselftest, shuah, tj, syzbot, syzkaller-bugs

On Fri, Feb 20, 2026 at 2:14 AM syzbot ci
<syzbot+cif2121bcf05a8d84e@syzkaller.appspotmail.com> wrote:
>
> syzbot ci has tested the following series
>
> [v4] kernfs: Add inotify IN_DELETE_SELF, IN_IGNORED support
> https://lore.kernel.org/all/20260220055449.3073-1-tjmercier@google.com
> * [PATCH v4 1/3] kernfs: Don't set_nlink for directories being removed
> * [PATCH v4 2/3] kernfs: Send IN_DELETE_SELF and IN_IGNORED
> * [PATCH v4 3/3] selftests: memcg: Add tests for IN_DELETE_SELF and IN_IGNORED
>
> and found the following issue:
> possible deadlock in __kernfs_remove
>
> Full report is available here:
> https://ci.syzbot.org/series/4b44d5c2-c2eb-4425-a19a-f9963b64f74f
>
> ***
>
> possible deadlock in __kernfs_remove
>
> tree:      bpf-next
> URL:       https://kernel.googlesource.com/pub/scm/linux/kernel/git/bpf/bpf-next.git
> base:      ba268514ea14b44570030e8ed2aef92a38679e85
> arch:      amd64
> compiler:  Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8
> config:    https://ci.syzbot.org/builds/45ab774f-e8d7-4def-8279-888a5cb2d01e/config
> syz repro: https://ci.syzbot.org/findings/b74cbc6a-1cef-4ae9-be46-dd9e8b29b648/syz_repro
>
> ======================================================
> WARNING: possible circular locking dependency detected
> syzkaller #0 Not tainted
> ------------------------------------------------------
> kworker/u8:1/13 is trying to acquire lock:
> ffff88816ef2b878 (kn->active#5){++++}-{0:0}, at: __kernfs_remove+0x47e/0x8c0 fs/kernfs/dir.c:1533
>
> but task is already holding lock:
> ffff8881012e8ab8 (&root->kernfs_supers_rwsem){++++}-{4:4}, at: kernfs_remove_by_name_ns+0x3f/0x140 fs/kernfs/dir.c:1745
>
> which lock already depends on the new lock.
>
>
> the existing dependency chain (in reverse order) is:
>
> -> #2 (&root->kernfs_supers_rwsem){++++}-{4:4}:
>        down_read+0x47/0x2e0 kernel/locking/rwsem.c:1537
>        kernfs_remove_by_name_ns+0x3f/0x140 fs/kernfs/dir.c:1745
>        acpi_unbind_one+0x2d8/0x3b0 drivers/acpi/glue.c:337
>        device_platform_notify_remove drivers/base/core.c:2386 [inline]
>        device_del+0x547/0x8f0 drivers/base/core.c:3881
>        serdev_controller_add+0x46f/0x640 drivers/tty/serdev/core.c:785
>        serdev_tty_port_register+0x159/0x260 drivers/tty/serdev/serdev-ttyport.c:291
>        tty_port_register_device_attr_serdev+0xe7/0x170 drivers/tty/tty_port.c:187
>        serial_core_add_one_port drivers/tty/serial/serial_core.c:3107 [inline]
>        serial_core_register_port+0x103a/0x28b0 drivers/tty/serial/serial_core.c:3305
>        serial8250_register_8250_port+0x1658/0x1fd0 drivers/tty/serial/8250/8250_core.c:822
>        serial_pnp_probe+0x568/0x7f0 drivers/tty/serial/8250/8250_pnp.c:480
>        pnp_device_probe+0x30b/0x4c0 drivers/pnp/driver.c:111
>        call_driver_probe drivers/base/dd.c:-1 [inline]
>        really_probe+0x267/0xaf0 drivers/base/dd.c:661
>        __driver_probe_device+0x18c/0x320 drivers/base/dd.c:803
>        driver_probe_device+0x4f/0x240 drivers/base/dd.c:833
>        __driver_attach+0x3e7/0x710 drivers/base/dd.c:1227
>        bus_for_each_dev+0x23b/0x2c0 drivers/base/bus.c:383
>        bus_add_driver+0x345/0x670 drivers/base/bus.c:715
>        driver_register+0x23a/0x320 drivers/base/driver.c:249
>        serial8250_init+0x8f/0x160 drivers/tty/serial/8250/8250_platform.c:317
>        do_one_initcall+0x250/0x840 init/main.c:1378
>        do_initcall_level+0x104/0x190 init/main.c:1440
>        do_initcalls+0x59/0xa0 init/main.c:1456
>        kernel_init_freeable+0x2a6/0x3d0 init/main.c:1688
>        kernel_init+0x1d/0x1d0 init/main.c:1578
>        ret_from_fork+0x51b/0xa40 arch/x86/kernel/process.c:158
>        ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:246
>
> -> #1 (&device->physical_node_lock){+.+.}-{4:4}:
>        __mutex_lock_common kernel/locking/mutex.c:614 [inline]
>        __mutex_lock+0x19f/0x1300 kernel/locking/mutex.c:776
>        acpi_get_first_physical_node drivers/acpi/bus.c:691 [inline]
>        acpi_primary_dev_companion drivers/acpi/bus.c:710 [inline]
>        acpi_companion_match+0x8a/0x120 drivers/acpi/bus.c:764
>        acpi_device_uevent_modalias+0x1a/0x30 drivers/acpi/device_sysfs.c:280
>        platform_uevent+0x3c/0xb0 drivers/base/platform.c:1411
>        dev_uevent+0x446/0x8a0 drivers/base/core.c:2692
>        kobject_uevent_env+0x477/0x9e0 lib/kobject_uevent.c:573
>        kobject_synth_uevent+0x585/0xbd0 lib/kobject_uevent.c:207
>        uevent_store+0x26/0x70 drivers/base/core.c:2773
>        kernfs_fop_write_iter+0x3af/0x540 fs/kernfs/file.c:352
>        new_sync_write fs/read_write.c:593 [inline]
>        vfs_write+0x61d/0xb90 fs/read_write.c:686
>        ksys_write+0x150/0x270 fs/read_write.c:738
>        do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
>        do_syscall_64+0xe2/0xf80 arch/x86/entry/syscall_64.c:94
>        entry_SYSCALL_64_after_hwframe+0x77/0x7f
>
> -> #0 (kn->active#5){++++}-{0:0}:
>        check_prev_add kernel/locking/lockdep.c:3165 [inline]
>        check_prevs_add kernel/locking/lockdep.c:3284 [inline]
>        validate_chain kernel/locking/lockdep.c:3908 [inline]
>        __lock_acquire+0x15a5/0x2cf0 kernel/locking/lockdep.c:5237
>        lock_acquire+0x106/0x330 kernel/locking/lockdep.c:5868
>        kernfs_drain+0x27c/0x5f0 fs/kernfs/dir.c:511
>        __kernfs_remove+0x47e/0x8c0 fs/kernfs/dir.c:1533
>        kernfs_remove_by_name_ns+0xc0/0x140 fs/kernfs/dir.c:1751
>        sysfs_remove_file include/linux/sysfs.h:780 [inline]
>        device_remove_file drivers/base/core.c:3071 [inline]
>        device_del+0x506/0x8f0 drivers/base/core.c:3876
>        device_unregister+0x21/0xf0 drivers/base/core.c:3919
>        mac80211_hwsim_del_radio+0x2dc/0x490 drivers/net/wireless/virtual/mac80211_hwsim.c:5918
>        hwsim_exit_net+0xede/0xfa0 drivers/net/wireless/virtual/mac80211_hwsim.c:6807
>        ops_exit_list net/core/net_namespace.c:199 [inline]
>        ops_undo_list+0x49f/0x940 net/core/net_namespace.c:252
>        cleanup_net+0x4df/0x7b0 net/core/net_namespace.c:696
>        process_one_work kernel/workqueue.c:3257 [inline]
>        process_scheduled_works+0xaec/0x17a0 kernel/workqueue.c:3340
>        worker_thread+0xda6/0x1360 kernel/workqueue.c:3421
>        kthread+0x726/0x8b0 kernel/kthread.c:463
>        ret_from_fork+0x51b/0xa40 arch/x86/kernel/process.c:158
>        ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:246
>
> other info that might help us debug this:
>
> Chain exists of:
>   kn->active#5 --> &device->physical_node_lock --> &root->kernfs_supers_rwsem
>
>  Possible unsafe locking scenario:
>
>        CPU0                    CPU1
>        ----                    ----
>   rlock(&root->kernfs_supers_rwsem);
>                                lock(&device->physical_node_lock);
>                                lock(&root->kernfs_supers_rwsem);
>   lock(kn->active#5);
>
>  *** DEADLOCK ***
>
> 4 locks held by kworker/u8:1/13:
>  #0: ffff888100ef7948 ((wq_completion)netns){+.+.}-{0:0}, at: process_one_work kernel/workqueue.c:3232 [inline]
>  #0: ffff888100ef7948 ((wq_completion)netns){+.+.}-{0:0}, at: process_scheduled_works+0x9d4/0x17a0 kernel/workqueue.c:3340
>  #1: ffffc90000127bc0 (net_cleanup_work){+.+.}-{0:0}, at: process_one_work kernel/workqueue.c:3233 [inline]
>  #1: ffffc90000127bc0 (net_cleanup_work){+.+.}-{0:0}, at: process_scheduled_works+0xa0f/0x17a0 kernel/workqueue.c:3340
>  #2: ffffffff8f99d2d0 (pernet_ops_rwsem){++++}-{4:4}, at: cleanup_net+0xfe/0x7b0 net/core/net_namespace.c:670
>  #3: ffff8881012e8ab8 (&root->kernfs_supers_rwsem){++++}-{4:4}, at: kernfs_remove_by_name_ns+0x3f/0x140 fs/kernfs/dir.c:1745
>
> stack backtrace:
> CPU: 0 UID: 0 PID: 13 Comm: kworker/u8:1 Not tainted syzkaller #0 PREEMPT(full)
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
> Workqueue: netns cleanup_net
> Call Trace:
>  <TASK>
>  dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120
>  print_circular_bug+0x2e1/0x300 kernel/locking/lockdep.c:2043
>  check_noncircular+0x12e/0x150 kernel/locking/lockdep.c:2175
>  check_prev_add kernel/locking/lockdep.c:3165 [inline]
>  check_prevs_add kernel/locking/lockdep.c:3284 [inline]
>  validate_chain kernel/locking/lockdep.c:3908 [inline]
>  __lock_acquire+0x15a5/0x2cf0 kernel/locking/lockdep.c:5237
>  lock_acquire+0x106/0x330 kernel/locking/lockdep.c:5868
>  kernfs_drain+0x27c/0x5f0 fs/kernfs/dir.c:511
>  __kernfs_remove+0x47e/0x8c0 fs/kernfs/dir.c:1533
>  kernfs_remove_by_name_ns+0xc0/0x140 fs/kernfs/dir.c:1751
>  sysfs_remove_file include/linux/sysfs.h:780 [inline]
>  device_remove_file drivers/base/core.c:3071 [inline]
>  device_del+0x506/0x8f0 drivers/base/core.c:3876
>  device_unregister+0x21/0xf0 drivers/base/core.c:3919
>  mac80211_hwsim_del_radio+0x2dc/0x490 drivers/net/wireless/virtual/mac80211_hwsim.c:5918
>  hwsim_exit_net+0xede/0xfa0 drivers/net/wireless/virtual/mac80211_hwsim.c:6807
>  ops_exit_list net/core/net_namespace.c:199 [inline]
>  ops_undo_list+0x49f/0x940 net/core/net_namespace.c:252
>  cleanup_net+0x4df/0x7b0 net/core/net_namespace.c:696
>  process_one_work kernel/workqueue.c:3257 [inline]
>  process_scheduled_works+0xaec/0x17a0 kernel/workqueue.c:3340
>  worker_thread+0xda6/0x1360 kernel/workqueue.c:3421
>  kthread+0x726/0x8b0 kernel/kthread.c:463
>  ret_from_fork+0x51b/0xa40 arch/x86/kernel/process.c:158
>  ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:246
>  </TASK>
> hsr_slave_0: left promiscuous mode
> hsr_slave_1: left promiscuous mode
> batman_adv: batadv0: Interface deactivated: batadv_slave_0
> batman_adv: batadv0: Removing interface: batadv_slave_0
> batman_adv: batadv0: Interface deactivated: batadv_slave_1
> batman_adv: batadv0: Removing interface: batadv_slave_1
> veth1_macvtap: left promiscuous mode
> veth0_macvtap: left promiscuous mode
> veth1_vlan: left promiscuous mode
> veth0_vlan: left promiscuous mode
> team0 (unregistering): Port device team_slave_1 removed
> team0 (unregistering): Port device team_slave_0 removed
> netdevsim netdevsim2 netdevsim0: set [1, 0] type 2 family 0 port 6081 - 0
> netdevsim netdevsim2 netdevsim1: set [1, 0] type 2 family 0 port 6081 - 0
> netdevsim netdevsim2 netdevsim2: set [1, 0] type 2 family 0 port 6081 - 0
> netdevsim netdevsim2 netdevsim3: set [1, 0] type 2 family 0 port 6081 - 0
>
>
> ***
>
> If these findings have caused you to resend the series or submit a
> separate fix, please add the following tag to your commit message:
>   Tested-by: syzbot@syzkaller.appspotmail.com
>
> ---
> This report is generated by a bot. It may contain errors.
> syzbot ci engineers can be reached at syzkaller@googlegroups.com.

Hm, I can see two ways to fix this.

The first is to drop the acpi_dev->physical_node_lock mutex in
acpi_unbind_one before calling sysfs_remove_link. Clearing entry->dev
but leaving the entry on the list until the lock is retaken keeps the
node ID reserved while the sysfs links are still being removed, so we
don't get any sysfs filename collisions (the names are derived from the
node ID). This seems like a good optimization to do anyway:

+++ b/drivers/acpi/glue.c
@@ -329,18 +329,22 @@ int acpi_unbind_one(struct device *dev)
        list_for_each_entry(entry, &acpi_dev->physical_node_list, node)
                if (entry->dev == dev) {
                        char physnode_name[PHYSICAL_NODE_NAME_SIZE];

-                       list_del(&entry->node);
-                       acpi_dev->physical_node_count--;
+                       entry->dev = NULL;
+                       mutex_unlock(&acpi_dev->physical_node_lock);

                        acpi_physnode_link_name(physnode_name, entry->node_id);
                        sysfs_remove_link(&acpi_dev->dev.kobj, physnode_name);
                        sysfs_remove_link(&dev->kobj, "firmware_node");
                        ACPI_COMPANION_SET(dev, NULL);
                        /* Drop references taken by acpi_bind_one(). */
                        put_device(dev);
                        acpi_dev_put(acpi_dev);
+
+                       mutex_lock(&acpi_dev->physical_node_lock);
+                       list_del(&entry->node);
+                       acpi_dev->physical_node_count--;
                        kfree(entry);
                        break;
                }
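
To make the idea easier to poke at outside the kernel, here is a
minimal userspace sketch of the same pattern: claim the entry under the
lock, drop the lock for the slow part, then retake it to unlink the
entry. Everything in it is an assumption for illustration: struct
companion, bind_one, unbind_one and probe_teardown are hypothetical
names, a pthread mutex stands in for physical_node_lock, and the ID
allocation is only loosely modelled on what acpi_bind_one does.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
struct phys_node {
	struct phys_node *next;
	unsigned int node_id;
	const void *dev;	/* NULL while the entry is being torn down */
};

struct companion {
	pthread_mutex_t lock;	/* stands in for physical_node_lock */
	struct phys_node *nodes;
	unsigned int count;
};

/* Caller holds c->lock. Half-removed entries still count as users of an ID. */
static bool node_id_in_use(const struct companion *c, unsigned int node_id)
{
	for (const struct phys_node *n = c->nodes; n; n = n->next)
		if (n->node_id == node_id)
			return true;
	return false;
}

/* Bind @dev under the lowest free node ID; returns the ID or -1. */
static int bind_one(struct companion *c, const void *dev)
{
	struct phys_node *n = malloc(sizeof(*n));
	unsigned int id = 0;

	assert(dev != NULL);
	if (!n)
		return -1;
	pthread_mutex_lock(&c->lock);
	while (node_id_in_use(c, id))
		id++;
	n->node_id = id;
	n->dev = dev;
	n->next = c->nodes;
	c->nodes = n;
	c->count++;
	pthread_mutex_unlock(&c->lock);
	return (int)id;
}

/* Unbind @dev; @slow_teardown runs without the lock held. */
static int unbind_one(struct companion *c, const void *dev,
		      void (*slow_teardown)(struct companion *, unsigned int))
{
	struct phys_node **pp, *entry = NULL;

	assert(dev != NULL);
	pthread_mutex_lock(&c->lock);
	for (struct phys_node *n = c->nodes; n; n = n->next)
		if (n->dev == dev) {
			entry = n;
			break;
		}
	if (!entry) {
		pthread_mutex_unlock(&c->lock);
		return -1;
	}
	entry->dev = NULL;	/* claimed: lookups by dev no longer match it */
	pthread_mutex_unlock(&c->lock);

	slow_teardown(c, entry->node_id);	/* may sleep or take other locks */

	pthread_mutex_lock(&c->lock);
	/* The list may have changed while unlocked, so find the link again. */
	for (pp = &c->nodes; *pp != entry; pp = &(*pp)->next)
		;
	*pp = entry->next;
	c->count--;
	pthread_mutex_unlock(&c->lock);
	free(entry);
	return 0;
}

/* Example teardown hook: records whether the ID was still reserved. */
static bool id_reserved_during_teardown;

static void probe_teardown(struct companion *c, unsigned int node_id)
{
	pthread_mutex_lock(&c->lock);
	id_reserved_during_teardown = node_id_in_use(c, node_id);
	pthread_mutex_unlock(&c->lock);
}
```

In this model a concurrent bind_one cannot reuse the ID during the
unlocked window because the half-removed entry is still on the list,
which is the property the glue.c change above relies on.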


The second is to drop the kernfs_supers_rwsem for the kernfs_drain,
similar to how the kernfs_rwsem is dropped there. I don't think
kernfs_supers_rwsem is usually heavily contended, but it's probably a
good idea to avoid holding it while potentially sleeping in
kernfs_drain. Since the kernfs_supers_rwsem is only held for
kernfs_drain when called from __kernfs_remove (but not from
kernfs_show), that means:

+++ b/fs/kernfs/dir.c
@@ -486,7 +486,7 @@ void kernfs_put_active(struct kernfs_node *kn)
  * removers may invoke this function concurrently on @kn and all will
  * return after draining is complete.
  */
-static void kernfs_drain(struct kernfs_node *kn)
+static void kernfs_drain(struct kernfs_node *kn, bool drop_supers)
        __releases(&kernfs_root(kn)->kernfs_rwsem)
        __acquires(&kernfs_root(kn)->kernfs_rwsem)
 {
@@ -506,6 +506,8 @@ static void kernfs_drain(struct kernfs_node *kn)
                return;

        up_write(&root->kernfs_rwsem);
+       if (drop_supers)
+               up_read(&root->kernfs_supers_rwsem);

        if (kernfs_lockdep(kn)) {
                rwsem_acquire(&kn->dep_map, 0, 0, _RET_IP_);
@@ -524,6 +526,8 @@ static void kernfs_drain(struct kernfs_node *kn)
        if (kernfs_should_drain_open_files(kn))
                kernfs_drain_open_files(kn);

+       if (drop_supers)
+               down_read(&root->kernfs_supers_rwsem);
        down_write(&root->kernfs_rwsem);
 }

@@ -1465,7 +1469,7 @@ void kernfs_show(struct kernfs_node *kn, bool show)
                kn->flags |= KERNFS_HIDDEN;
                if (kernfs_active(kn))
                        atomic_add(KN_DEACTIVATED_BIAS, &kn->active);
-               kernfs_drain(kn);
+               kernfs_drain(kn, false);
        }

        up_write(&root->kernfs_rwsem);
@@ -1530,7 +1534,7 @@ static void __kernfs_remove(struct kernfs_node *kn)
                 */
                kernfs_get(pos);

-               kernfs_drain(pos);
+               kernfs_drain(pos, true);
                parent = kernfs_parent(pos);
                /*
                 * kernfs_unlink_sibling() succeeds once per node.  Use it


* Re: [PATCH v4 2/3] kernfs: Send IN_DELETE_SELF and IN_IGNORED
  2026-02-20 17:15     ` Amir Goldstein
@ 2026-02-20 19:50       ` Tejun Heo
  2026-02-20 20:11         ` Amir Goldstein
  0 siblings, 1 reply; 20+ messages in thread
From: Tejun Heo @ 2026-02-20 19:50 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: T.J. Mercier, gregkh, driver-core, linux-kernel, cgroups,
	linux-fsdevel, jack, shuah, linux-kselftest

Hello,

On Fri, Feb 20, 2026 at 07:15:56PM +0200, Amir Goldstein wrote:
...
> > Adding a comment with the above content would probably be useful. It also
> > might be worthwhile to note that fanotify recursive monitoring wouldn't work
> > reliably as cgroups can go away while inodes are not attached.
> 
> Sigh.. it's a shame to grow more weird semantics.

Yeah, I mean, kernfs *is* weird.

> But I take this back to the POV of "remote" vs. "local" vfs notifications.
> the IN_DELETE_SELF events added by this change are actually
> "local" vfs notifications.
> 
> If we would want to support monitoring cgroups fs super block
> for all added/removed cgroups with fanotify, we would be able
> to implement this as "remote" notifications and in this case, adding
> explicit fsnotify() calls could make sense.
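
For reference, this is roughly what a "local" notification looks like
from the watcher's side. The sketch below is a plain userspace inotify
consumer run against an ordinary temp file, where the DELETE_SELF and
IGNORED pair is already delivered today. The helper names
(collect_until_ignored, watch_for_self_delete, demo_self_delete_events),
the /tmp path and the 2000 ms timeout are all made up for the example.
With this series the same loop is expected to work against a kernfs
file, but that is the claim under test, not something the sketch shows.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/inotify.h>
#include <unistd.h>

/*
 * Drain events from inotify fd @ifd until the watch is torn down
 * (IN_IGNORED) or @timeout_ms passes without an event.
 * Returns the OR of all event masks seen.
 */
static uint32_t collect_until_ignored(int ifd, int timeout_ms)
{
	char buf[4096] __attribute__((aligned(__alignof__(struct inotify_event))));
	struct pollfd pfd = { .fd = ifd, .events = POLLIN };
	uint32_t seen = 0;

	assert(ifd >= 0);
	while (!(seen & IN_IGNORED)) {
		int ready = poll(&pfd, 1, timeout_ms);

		if (ready < 0 && errno == EINTR)
			continue;
		if (ready <= 0)
			break;		/* timeout or error */
		ssize_t len = read(ifd, buf, sizeof(buf));
		if (len <= 0)
			break;
		for (char *p = buf; p < buf + len; ) {
			const struct inotify_event *ev =
				(const struct inotify_event *)p;

			seen |= ev->mask;
			p += sizeof(*ev) + ev->len;
		}
	}
	return seen;
}

/* Watch @path for its own removal only; returns the inotify fd or -1. */
static int watch_for_self_delete(const char *path)
{
	int ifd = inotify_init1(IN_CLOEXEC);

	if (ifd < 0)
		return -1;
	if (inotify_add_watch(ifd, path, IN_DELETE_SELF) < 0) {
		close(ifd);
		return -1;
	}
	return ifd;
}

/* Demo on an ordinary temp file: create, watch, unlink, collect. */
static uint32_t demo_self_delete_events(void)
{
	char path[] = "/tmp/delete-self-XXXXXX";
	int fd = mkstemp(path);
	uint32_t seen;
	int ifd;

	if (fd < 0)
		return 0;
	/* With no open fds the event fires at unlink, not at last close. */
	close(fd);
	ifd = watch_for_self_delete(path);
	if (ifd < 0) {
		unlink(path);
		return 0;
	}
	unlink(path);
	seen = collect_until_ignored(ifd, 2000);
	close(ifd);
	return seen;
}
```

The point of the single watch descriptor is visible here: the watcher
asks only for IN_DELETE_SELF, and IN_IGNORED arrives on its own once the
kernel drops the watch.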

Yeah, that can be useful. For cgroupfs, there would probably need to be a
way to scope it so that it can be used on delegation boundaries too (which
we can require to coincide with cgroup NS boundaries). Would it be possible
to make FAN_MNT_ATTACH work for that?

Thanks.

-- 
tejun


* Re: [PATCH v4 2/3] kernfs: Send IN_DELETE_SELF and IN_IGNORED
  2026-02-20 19:50       ` Tejun Heo
@ 2026-02-20 20:11         ` Amir Goldstein
  2026-02-20 23:32           ` Tejun Heo
  0 siblings, 1 reply; 20+ messages in thread
From: Amir Goldstein @ 2026-02-20 20:11 UTC (permalink / raw)
  To: Tejun Heo
  Cc: T.J. Mercier, gregkh, driver-core, linux-kernel, cgroups,
	linux-fsdevel, jack, shuah, linux-kselftest

On Fri, Feb 20, 2026 at 8:50 PM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Fri, Feb 20, 2026 at 07:15:56PM +0200, Amir Goldstein wrote:
> ...
> > > Adding a comment with the above content would probably be useful. It also
> > > might be worthwhile to note that fanotify recursive monitoring wouldn't work
> > > reliably as cgroups can go away while inodes are not attached.
> >
> > Sigh.. it's a shame to grow more weird semantics.
>
> Yeah, I mean, kernfs *is* weird.
>
> > But I take this back to the POV of "remote" vs. "local" vfs notifications.
> > the IN_DELETE_SELF events added by this change are actually
> > "local" vfs notifications.
> >
> > If we would want to support monitoring cgroups fs super block
> > for all added/removed cgroups with fanotify, we would be able
> > to implement this as "remote" notifications and in this case, adding
> > explicit fsnotify() calls could make sense.
>
> Yeah, that can be useful. For cgroupfs, there would probably need to be a
> way to scope it so that it can be used on delegation boundaries too (which
> we can require to coincide with cgroup NS boundaries).

I have no idea what the above means.
I could ask Gemini or you and I prefer the latter ;)
What are delegation boundaries and NS boundaries in this context?

> Would it be possible to make FAN_MNT_ATTACH work for that?
>

FAN_MNT_ATTACH is an event generated on a mntns object.
If "cgroup NS boundaries" is referring to a mntns object and if
this object is available in the context of cgroup create/destroy
then it should be possible.

But FAN_MNT_ATTACH reports a mountid. Is there a mountid
to report on cgroup create? Probably not?

Thanks,
Amir.


* Re: [PATCH v4 2/3] kernfs: Send IN_DELETE_SELF and IN_IGNORED
  2026-02-20 20:11         ` Amir Goldstein
@ 2026-02-20 23:32           ` Tejun Heo
  2026-02-21 16:11             ` Amir Goldstein
  0 siblings, 1 reply; 20+ messages in thread
From: Tejun Heo @ 2026-02-20 23:32 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: T.J. Mercier, gregkh, driver-core, linux-kernel, cgroups,
	linux-fsdevel, jack, shuah, linux-kselftest

Hello, Amir.

On Fri, Feb 20, 2026 at 10:11:15PM +0200, Amir Goldstein wrote:
> > Yeah, that can be useful. For cgroupfs, there would probably need to be a
> > way to scope it so that it can be used on delegation boundaries too (which
> > we can require to coincide with cgroup NS boundaries).
> 
> I have no idea what the above means.
> I could ask Gemini or you and I prefer the latter ;)

Ah, you chose wrong. :)

> What are delegation boundaries and NFS boundaries in this context?

cgroup delegation is giving control of a subtree to someone else:

https://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git/tree/Documentation/admin-guide/cgroup-v2.rst#n537

There's an old way of doing it by changing perms on some files and new way
using cgroup namespace.

> > Would it be possible to make FAN_MNT_ATTACH work for that?
> 
> FAN_MNT_ATTACH is an event generated on a mntns object.
> If "cgroup NS boundaries" is referring to a mntns object and if
> this object is available in the context of cgroup create/destroy
> then it should be possible.

Great, yes, cgroup namespace way should work then.

> But FAN_MNT_ATTACH reports a mountid. Is there a mountid
> to report on cgroup create? Probably not?

Sorry, I thought that was per-mount recursive file event monitoring.
FAN_MARK_MOUNT looks like the right thing if we want to allow monitoring
cgroup creations / destructions in a subtree without recursively watching
each cgroup.

Thanks.

-- 
tejun


* Re: [PATCH v4 2/3] kernfs: Send IN_DELETE_SELF and IN_IGNORED
  2026-02-20 23:32           ` Tejun Heo
@ 2026-02-21 16:11             ` Amir Goldstein
  2026-02-23 16:27               ` Tejun Heo
  0 siblings, 1 reply; 20+ messages in thread
From: Amir Goldstein @ 2026-02-21 16:11 UTC (permalink / raw)
  To: Tejun Heo
  Cc: T.J. Mercier, gregkh, driver-core, linux-kernel, cgroups,
	linux-fsdevel, jack, shuah, linux-kselftest

On Sat, Feb 21, 2026 at 12:32 AM Tejun Heo <tj@kernel.org> wrote:
>
> Hello, Amir.
>
> On Fri, Feb 20, 2026 at 10:11:15PM +0200, Amir Goldstein wrote:
> > > Yeah, that can be useful. For cgroupfs, there would probably need to be a
> > > way to scope it so that it can be used on delegation boundaries too (which
> > > we can require to coincide with cgroup NS boundaries).
> >
> > I have no idea what the above means.
> > I could ask Gemini or you and I prefer the latter ;)
>
> Ah, you chose wrong. :)
>
> > What are delegation boundaries and NFS boundaries in this context?
>
> cgroup delegation is giving control of a subtree to someone else:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git/tree/Documentation/admin-guide/cgroup-v2.rst#n537
>
> There's an old way of doing it by changing perms on some files and new way
> using cgroup namespace.
>
> > > Would it be possible to make FAN_MNT_ATTACH work for that?
> >
> > FAN_MNT_ATTACH is an event generated on a mntns object.
> > If "cgroup NS boundaries" is referring to a mntns object and if
> > this object is available in the context of cgroup create/destroy
> > then it should be possible.
>
> Great, yes, cgroup namespace way should work then.
>
> > But FAN_MNT_ATTACH reports a mountid. Is there a mountid
> > to report on cgroup create? Probably not?
>
> Sorry, I thought that was per-mount recursive file event monitoring.
> FAN_MARK_MOUNT looks like the right thing if we want to allow monitoring
> cgroup creations / destructions in a subtree without recursively watching
> each cgroup.

The problem sounds very similar to subtree monitoring for mkdir/rmdir on
a filesystem, which is a problem that we have not yet solved.

The problem with FAN_MARK_MOUNT is that it does not support the
events CREATE/DELETE, because those events are currently
monitored in a context where the mount is not available, and anyway
what users want is to get notified on a deleted file/dir in a subtree
regardless of the mount through which the create/delete was done.

Since commit 58f5fbeb367ff ("fanotify: support watching filesystems
and mounts inside userns"), fanotify groups can be associated
with a userns.

I was thinking that we can have a model where events are delivered
to a listener based on whether or not the uid/gid of the object are
mappable to the userns of the group.

In a filesystem, this criterion cannot guarantee subtree isolation.
I imagine that for delegated cgroups this criterion could match what
you need, but I am basing this on pure speculation.

Thanks,
Amir.


* Re: [PATCH v4 2/3] kernfs: Send IN_DELETE_SELF and IN_IGNORED
  2026-02-21 16:11             ` Amir Goldstein
@ 2026-02-23 16:27               ` Tejun Heo
  2026-02-24 11:03                 ` Christian Brauner
  0 siblings, 1 reply; 20+ messages in thread
From: Tejun Heo @ 2026-02-23 16:27 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: T.J. Mercier, gregkh, driver-core, linux-kernel, cgroups,
	linux-fsdevel, jack, shuah, linux-kselftest, Christian Brauner

(cc'ing Christian Brauner)

On Sat, Feb 21, 2026 at 06:11:28PM +0200, Amir Goldstein wrote:
> On Sat, Feb 21, 2026 at 12:32 AM Tejun Heo <tj@kernel.org> wrote:
> >
> > Hello, Amir.
> >
> > On Fri, Feb 20, 2026 at 10:11:15PM +0200, Amir Goldstein wrote:
> > > > Yeah, that can be useful. For cgroupfs, there would probably need to be a
> > > > way to scope it so that it can be used on delegation boundaries too (which
> > > > we can require to coincide with cgroup NS boundaries).
> > >
> > > I have no idea what the above means.
> > > I could ask Gemini or you and I prefer the latter ;)
> >
> > Ah, you chose wrong. :)
> >
> > > What are delegation boundaries and NFS boundaries in this context?
> >
> > cgroup delegation is giving control of a subtree to someone else:
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git/tree/Documentation/admin-guide/cgroup-v2.rst#n537
> >
> > There's an old way of doing it by changing perms on some files and new way
> > using cgroup namespace.
> >
> > > > Would it be possible to make FAN_MNT_ATTACH work for that?
> > >
> > > FAN_MNT_ATTACH is an event generated on a mntns object.
> > > If "cgroup NS boundaries" is referring to a mntns object and if
> > > this object is available in the context of cgroup create/destroy
> > > then it should be possible.
> >
> > Great, yes, cgroup namespace way should work then.
> >
> > > But FAN_MNT_ATTACH reports a mountid. Is there a mountid
> > > to report on cgroup create? Probably not?
> >
> > Sorry, I thought that was per-mount recursive file event monitoring.
> > FAN_MARK_MOUNT looks like the right thing if we want to allow monitoring
> > cgroup creations / destructions in a subtree without recursively watching
> > each cgroup.
> 
> The problem sounds very similar to subtree monitoring for mkdir/rmdir on
> a filesystem, which is a problem that we have not yet solved.
> 
> The problem with FAN_MARK_MOUNT is that it does not support the
> events CREATE/DELETE, because those events are currently

Ah, bummer.

> monitored in context where the mount is not available and anyway
> what users want to get notified on a deleted file/dir in a subtree
> regardless of the mount through which the create/delete was done.
> 
> Since commit 58f5fbeb367ff ("fanotify: support watching filesystems
> and mounts inside userns") and fnaotify groups can be associated
> with a userns.
> 
> I was thinking that we can have a model where events are delivered
> to a listener based on whether or not the uid/gid of the object are
> mappable to the userns of the group.

Given how different NSes can be used independently of each other, it'd
probably be cleaner if it doesn't have to depend on another NS.

> In a filesystem, this criteria cannot guarantee the subtree isolation.
> I imagine that for delegated cgroups this criteria could match what
> you need, but I am basing this on pure speculation.

There's a lot of flexibility in the mechanism, so it's difficult to tell.
e.g. There's nothing preventing somebody from creating two separate subtrees
delegated to the same user.

Christian was mentioning allowing separate super for different cgroup mounts
in another thread. cc'ing him for context.

Thanks.

-- 
tejun


* Re: [PATCH v4 2/3] kernfs: Send IN_DELETE_SELF and IN_IGNORED
  2026-02-23 16:27               ` Tejun Heo
@ 2026-02-24 11:03                 ` Christian Brauner
  2026-03-03 14:27                   ` Amir Goldstein
  0 siblings, 1 reply; 20+ messages in thread
From: Christian Brauner @ 2026-02-24 11:03 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Amir Goldstein, T.J. Mercier, gregkh, driver-core, linux-kernel,
	cgroups, linux-fsdevel, jack, shuah, linux-kselftest

On Mon, Feb 23, 2026 at 06:27:31AM -1000, Tejun Heo wrote:
> (cc'ing Christian Brauner)
> 
> On Sat, Feb 21, 2026 at 06:11:28PM +0200, Amir Goldstein wrote:
> > On Sat, Feb 21, 2026 at 12:32 AM Tejun Heo <tj@kernel.org> wrote:
> > >
> > > Hello, Amir.
> > >
> > > On Fri, Feb 20, 2026 at 10:11:15PM +0200, Amir Goldstein wrote:
> > > > > Yeah, that can be useful. For cgroupfs, there would probably need to be a
> > > > > way to scope it so that it can be used on delegation boundaries too (which
> > > > > we can require to coincide with cgroup NS boundaries).
> > > >
> > > > I have no idea what the above means.
> > > > I could ask Gemini or you and I prefer the latter ;)
> > >
> > > Ah, you chose wrong. :)
> > >
> > > > What are delegation boundaries and NFS boundaries in this context?
> > >
> > > cgroup delegation is giving control of a subtree to someone else:
> > >
> > > https://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git/tree/Documentation/admin-guide/cgroup-v2.rst#n537
> > >
> > > There's an old way of doing it by changing perms on some files and new way
> > > using cgroup namespace.
> > >
> > > > > Would it be possible to make FAN_MNT_ATTACH work for that?
> > > >
> > > > FAN_MNT_ATTACH is an event generated on a mntns object.
> > > > If "cgroup NS boundaries" is referring to a mntns object and if
> > > > this object is available in the context of cgroup create/destroy
> > > > then it should be possible.
> > >
> > > Great, yes, cgroup namespace way should work then.
> > >
> > > > But FAN_MNT_ATTACH reports a mountid. Is there a mountid
> > > > to report on cgroup create? Probably not?
> > >
> > > Sorry, I thought that was per-mount recursive file event monitoring.
> > > FAN_MARK_MOUNT looks like the right thing if we want to allow monitoring
> > > cgroup creations / destructions in a subtree without recursively watching
> > > each cgroup.
> > 
> > The problem sounds very similar to subtree monitoring for mkdir/rmdir on
> > a filesystem, which is a problem that we have not yet solved.
> > 
> > The problem with FAN_MARK_MOUNT is that it does not support the
> > events CREATE/DELETE, because those events are currently
> 
> Ah, bummer.
> 
> > monitored in context where the mount is not available and anyway
> > what users want to get notified on a deleted file/dir in a subtree
> > regardless of the mount through which the create/delete was done.
> > 
> > Since commit 58f5fbeb367ff ("fanotify: support watching filesystems
> > and mounts inside userns") and fnaotify groups can be associated
> > with a userns.
> > 
> > I was thinking that we can have a model where events are delivered
> > to a listener based on whether or not the uid/gid of the object are
> > mappable to the userns of the group.
> 
> Given how different NSes can be used independently of each other, it'd
> probably be cleaner if it doesn't have to depend on another NS.
> 
> > In a filesystem, this criteria cannot guarantee the subtree isolation.
> > I imagine that for delegated cgroups this criteria could match what
> > you need, but I am basing this on pure speculation.
> 
> There's a lot of flexibility in the mechanism, so it's difficult to tell.
> e.g. There's nothing preventing somebody from creating two separate subtrees
> delegated to the same user.

Delegation is based on inode ownership, and I'm not sure how well this
will fit into the fanotify model. Maybe the group logic for userns that
fanotify added works. I'm not super sure.

> Christian was mentioning allowing separate super for different cgroup mounts
> in another thread. cc'ing him for context.

If cgroupfs changes to tmpfs semantics where each mount gives you a new
superblock then it's possible to give each container its own superblock.
That in turn would make it possible to place fanotify watches on the
superblock itself. I think you'd roughly need something like the
following permission model:

* Cgroupfs mounted on the host -> would require global CAP_SYS_ADMIN as
  you'd get notified about all tree changes ofc.
* If cgroupfs is mounted in user namespace with a cgroup namespace then
  allow the container to monitor the whole superblock.

I think kernfs currently has logic to gate mounting of sysfs in a
container on the network namespace. We would need similar logic to gate
creation of a new superblock for cgroupfs behind the cgroup namespace
(that's the kernfs tagging mechanism iirc).

There's some more annoyance ofc: the current model has one superblock
for the whole system. As such each cgroup is associated with exactly one
inode. So any ownership changes to a given inode are visible _system
wide_. That leads to problems such as an unpriv user having a hard time
deleting cgroups that were delegated to an unprivileged container that
it owns - at least not without setting up a helper userns and running rm
-rf in it.

Note, if we allow separate cgroup superblocks then this automatically
entails that multiple inodes from different superblocks refer to the
same underlying cgroup - like separate procfs instances have different
inodes that refer to the same task struct or whatever. This should be
fine locking wise because you serialize on locks associated with the
underlying cgroup - which would be referenced by all inodes.

With this in place, cgroupfs could be mounted inside of a container
with a separate inode/dentry tree where each inode->i_{uid,gid} can be
set according to the container's user namespace.

That also gets rid of the aforementioned problem where an unprivileged
container user on the host cannot remove cgroups that were delegated to
the container.

It also introduces a change in the delegation model that is worth
considering:

Right now if you delegate ownership it means chown()ing a bunch of files
to the relevant user. With separate superblocks mountable in containers
you could technically delegate write access to multiple containers at
the same time even though they might have completely distinct user
namespaces with isolated idmappings (iow, their global uid/gid ranges
don't overlap and so they can't meaningfully interact with each other).

If container A mounts a new cgroupfs instance and container B mounts a
new cgroupfs instance and someone was crazy enough to let both A and B
share the same cgroup they could both write around in it. It would also
mean that all files in a given cgroup change ownership _within the
superblock that was mounted_ - other superblocks are ofc unaffected:

  mkdir /sys/fs/cgroup/lets/go/deeper/

delegate the cgroup

  echo 1234 > /sys/fs/cgroup/lets/go/deeper

Now the container payload running as 1234 does:

  unshare(CLONE_NEWUSER | CLONE_NEWNS | CLONE_NEWCGROUP);

It sets up the rootfs, chroots into it, and then mounts cgroupfs within
its namespaces:

  mount("cgroup", "/sys/fs/cgroup", "cgroup2", 0, NULL);

This would create a new cgroupfs superblock with "deeper" being the
root dentry - similar to how "remounting" changes the visibility. Then
the idmapping associated with the user namespace is taken into account
and all files under "deeper" will be owned by the container root making
it all writable for the container (That is different from today where
you need to chown around.).

TL;DR:

* multi-instance cgroupfs implies multiple inodes for the same cgroup
  with custom ownership for each inode
* multi-instance cgroupfs means per-container superblock fanotify
  watches
* multi-instance cgroupfs means per-container superblock mount options


* Re: [PATCH v4 2/3] kernfs: Send IN_DELETE_SELF and IN_IGNORED
  2026-02-24 11:03                 ` Christian Brauner
@ 2026-03-03 14:27                   ` Amir Goldstein
  2026-03-04 13:26                     ` Christian Brauner
  0 siblings, 1 reply; 20+ messages in thread
From: Amir Goldstein @ 2026-03-03 14:27 UTC (permalink / raw)
  To: Christian Brauner, jack, Tejun Heo
  Cc: T.J. Mercier, gregkh, driver-core, linux-kernel, cgroups,
	linux-fsdevel, shuah, linux-kselftest

[-- Attachment #1: Type: text/plain, Size: 5576 bytes --]

On Tue, Feb 24, 2026 at 12:03 PM Christian Brauner <brauner@kernel.org> wrote:
>
> On Mon, Feb 23, 2026 at 06:27:31AM -1000, Tejun Heo wrote:
> > (cc'ing Christian Brauner)
> >
> > On Sat, Feb 21, 2026 at 06:11:28PM +0200, Amir Goldstein wrote:
> > > On Sat, Feb 21, 2026 at 12:32 AM Tejun Heo <tj@kernel.org> wrote:
> > > >
> > > > Hello, Amir.
> > > >
> > > > On Fri, Feb 20, 2026 at 10:11:15PM +0200, Amir Goldstein wrote:
> > > > > > Yeah, that can be useful. For cgroupfs, there would probably need to be a
> > > > > > way to scope it so that it can be used on delegation boundaries too (which
> > > > > > we can require to coincide with cgroup NS boundaries).
> > > > >
> > > > > I have no idea what the above means.
> > > > > I could ask Gemini or you and I prefer the latter ;)
> > > >
> > > > Ah, you chose wrong. :)
> > > >
> > > > > What are delegation boundaries and NFS boundaries in this context?
> > > >
> > > > cgroup delegation is giving control of a subtree to someone else:
> > > >
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git/tree/Documentation/admin-guide/cgroup-v2.rst#n537
> > > >
> > > > There's an old way of doing it by changing perms on some files and new way
> > > > using cgroup namespace.
> > > >
> > > > > > Would it be possible to make FAN_MNT_ATTACH work for that?
> > > > >
> > > > > FAN_MNT_ATTACH is an event generated on a mntns object.
> > > > > If "cgroup NS boundaries" is referring to a mntns object and if
> > > > > this object is available in the context of cgroup create/destroy
> > > > > then it should be possible.
> > > >
> > > > Great, yes, cgroup namespace way should work then.
> > > >
> > > > > But FAN_MNT_ATTACH reports a mountid. Is there a mountid
> > > > > to report on cgroup create? Probably not?
> > > >
> > > > Sorry, I thought that was per-mount recursive file event monitoring.
> > > > FAN_MARK_MOUNT looks like the right thing if we want to allow monitoring
> > > > cgroup creations / destructions in a subtree without recursively watching
> > > > each cgroup.
> > >
> > > The problem sounds very similar to subtree monitoring for mkdir/rmdir on
> > > a filesystem, which is a problem that we have not yet solved.
> > >
> > > The problem with FAN_MARK_MOUNT is that it does not support the
> > > events CREATE/DELETE, because those events are currently
> >
> > Ah, bummer.
> >
> > > monitored in context where the mount is not available and anyway
> > > what users want to get notified on a deleted file/dir in a subtree
> > > regardless of the mount through which the create/delete was done.
> > >
> > > Since commit 58f5fbeb367ff ("fanotify: support watching filesystems
> > > and mounts inside userns") and fnaotify groups can be associated
> > > with a userns.
> > >
> > > I was thinking that we can have a model where events are delivered
> > > to a listener based on whether or not the uid/gid of the object are
> > > mappable to the userns of the group.
> >
> > Given how different NSes can be used independently of each other, it'd
> > probably be cleaner if it doesn't have to depend on another NS.
> >
> > > In a filesystem, this criteria cannot guarantee the subtree isolation.
> > > I imagine that for delegated cgroups this criteria could match what
> > > you need, but I am basing this on pure speculation.
> >
> > There's a lot of flexibility in the mechanism, so it's difficult to tell.
> > e.g. There's nothing preventing somebody from creating two separate subtrees
> > delegated to the same user.
>
> Delegation is based on inode ownership I'm not sure how well this will
> fit into the fanotify model. Maybe the group logic for userns that
> fanotify added works. I'm not super sure.
>
> > Christian was mentioning allowing separate super for different cgroup mounts
> > in another thread. cc'ing him for context.
>
> If cgroupfs changes to tmpfs semantics where each mount gives you a new
> superblock then it's possible to give each container its own superblock.
> That in turn would make it possible to place fanotify watches on the
> superblock itself. I think you'd roughly need something like the
> following permission model:
>

It's hard for me to estimate the effort of changing to multi sb model,
but judging by the length of the email I trimmed below, it does not
sound trivial...

How do you guys feel about something like this patch which associates
an owner userns to every cgroup?

I have this POC branch from a long time ago [1] to filter all events
on sb by in_userns() criteria.  The semantics for real filesystems
were a bit difficult, but perhaps this model can work well for these
pseudo singleton fs.

I am trying to work on a model that could be useful for both cgroupfs
and nsfs:

If a user is capable in a userns, that user will be able to set an sb
watch for all events (say DELETE_SELF) on the sb, for objects
whose owner_userns is in_userns() of the fanotify listener.
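
To spell out the filtering rule, here is a small userspace model of it.
It is only a sketch under assumptions: the struct and function names
(userns_model, in_userns_model, cgroup_model, listener_model,
event_visible) are hypothetical, the ancestry walk mirrors how I
understand the kernel's in_userns() to behave (same namespace or a
descendant), and nothing here is the actual fanotify code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model of a user namespace hierarchy. */
struct userns_model {
	const struct userns_model *parent;	/* NULL for the initial ns */
	int level;				/* depth below the initial ns */
};

/* True if @child is @ancestor itself or one of its descendants. */
static bool in_userns_model(const struct userns_model *ancestor,
			    const struct userns_model *child)
{
	const struct userns_model *ns;

	assert(ancestor && child);
	/* Climb from @child until we reach @ancestor's depth. */
	for (ns = child; ns->level > ancestor->level; ns = ns->parent)
		;
	return ns == ancestor;
}

/* A cgroup records the userns that owned it at creation time. */
struct cgroup_model {
	const struct userns_model *owner_userns;
};

/* A listener carries the userns its fanotify group is associated with. */
struct listener_model {
	const struct userns_model *userns;
};

/* Deliver an event only if the cgroup's owner is within the listener's userns. */
static bool event_visible(const struct listener_model *l,
			  const struct cgroup_model *cg)
{
	return in_userns_model(l->userns, cg->owner_userns);
}
```

Under this rule a listener in the initial userns sees every cgroup,
while a listener in a container sees only cgroups owned by its own
userns or by namespaces nested below it. Note that the model filters by
userns alone, so it does not tell apart two separate subtrees delegated
to the same userns.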

This will enable watching for torn down cgroups and namespaces
which are visible to said user via delegated cgroups mount
or via listns().

I would like to allow calling fsnotify_obj_remove() hook with
encoded object fid (e.g. nsfs_file_handle) instead of the vfs inode,
so that cgroupfs/nsfs could report dying objects without needing
to associate a vfs inode with them.

WDYT? Is this an interesting direction to pursue?

Thanks,
Amir.

[1] https://lore.kernel.org/linux-fsdevel/CAOQ4uxgt1Cx5jx3L6iaDvbzCWPv=fcMgLaa9ODkiu9h718MkwQ@mail.gmail.com/

[-- Attachment #2: 0001-cgroup-track-owner_userns-per-cgroup.patch --]
[-- Type: text/x-patch, Size: 3344 bytes --]

From 4b3a56b8ca548354214329729997a78c72a016d3 Mon Sep 17 00:00:00 2001
From: Amir Goldstein <amir73il@gmail.com>
Date: Tue, 3 Mar 2026 14:04:22 +0100
Subject: [PATCH] cgroup: track owner_userns per cgroup

Add owner_userns field to struct cgroup to record which user namespace
owns a given cgroup.

For hierarchy roots, the owner is always init_user_ns.
For cgroups created via mkdir (cgroup_create()), possibly inside a
delegated cgroup namespace, the owner is the user namespace of the
creating task's cgroup namespace.

This field is a prerequisite for delivering userns-scoped fsnotify
events (e.g. FAN_DELETE_SELF via FAN_FILESYSTEM_MARK) when a cgroup is
destroyed, allowing a sufficiently privileged admin inside a delegated
cgroup namespace to watch for cgroup teardown without requiring access
to the full system view.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
 include/linux/cgroup-defs.h | 8 ++++++++
 kernel/cgroup/cgroup.c      | 6 ++++++
 2 files changed, 14 insertions(+)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index bb92f5c169ca2..4ee344792a1d5 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -33,6 +33,7 @@ struct kernfs_ops;
 struct kernfs_open_file;
 struct seq_file;
 struct poll_table_struct;
+struct user_namespace;
 
 #define MAX_CGROUP_TYPE_NAMELEN 32
 #define MAX_CGROUP_ROOT_NAMELEN 64
@@ -551,6 +552,13 @@ struct cgroup {
 
 	struct cgroup_root *root;
 
+	/*
+	 * The user namespace that owns this cgroup: the creating task's
+	 * cgroup_ns->user_ns for child cgroups, or init_user_ns for
+	 * hierarchy roots.  Determines the scope of filesystem watches.
+	 */
+	struct user_namespace *owner_userns;
+
 	/*
 	 * List of cgrp_cset_links pointing at css_sets with tasks in this
 	 * cgroup.  Protected by css_set_lock.
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index c22cda7766d84..e0beaf5cc8c49 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1381,6 +1381,7 @@ static void cgroup_exit_root_id(struct cgroup_root *root)
 
 void cgroup_free_root(struct cgroup_root *root)
 {
+	put_user_ns(root->cgrp.owner_userns);
 	kfree_rcu(root, rcu);
 }
 
@@ -2195,6 +2196,7 @@ int cgroup_setup_root(struct cgroup_root *root, u32 ss_mask)
 	root_cgrp->kn = kernfs_root_to_node(root->kf_root);
 	WARN_ON_ONCE(cgroup_ino(root_cgrp) != 1);
 	root_cgrp->ancestors[0] = root_cgrp;
+	root_cgrp->owner_userns = get_user_ns(&init_user_ns);
 
 	ret = css_populate_dir(&root_cgrp->self);
 	if (ret)
@@ -5607,6 +5609,7 @@ static void css_free_rwork_fn(struct work_struct *work)
 			cgroup_put(cgroup_parent(cgrp));
 			kernfs_put(cgrp->kn);
 			psi_cgroup_free(cgrp);
+			put_user_ns(cgrp->owner_userns);
 			kfree(cgrp);
 		} else {
 			/*
@@ -5848,6 +5851,8 @@ static struct cgroup *cgroup_create(struct cgroup *parent, const char *name,
 	if (!cgrp)
 		return ERR_PTR(-ENOMEM);
 
+	cgrp->owner_userns = get_user_ns(current->nsproxy->cgroup_ns->user_ns);
+
 	ret = percpu_ref_init(&cgrp->self.refcnt, css_release, 0, GFP_KERNEL);
 	if (ret)
 		goto out_free_cgrp;
@@ -5956,6 +5961,7 @@ static struct cgroup *cgroup_create(struct cgroup *parent, const char *name,
 out_cancel_ref:
 	percpu_ref_exit(&cgrp->self.refcnt);
 out_free_cgrp:
+	put_user_ns(cgrp->owner_userns);
 	kfree(cgrp);
 	return ERR_PTR(ret);
 }
-- 
2.53.0
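
The reference pairing in the patch above can be checked with a small
user-space model. This is only a sketch: `user_ns_model`, `cgroup_model`
and the `model_*` helpers are hypothetical stand-ins invented for
illustration, not kernel structures or APIs, and the model does not free
the namespace when its count reaches zero as the real put_user_ns() does.
It shows only that each cgroup takes one reference at creation and drops
exactly one on the free path.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical user-space stand-ins; NOT the kernel structures. */
struct user_ns_model {
	int refcount;
};

struct cgroup_model {
	struct user_ns_model *owner_userns;
};

/* Models get_user_ns(): take a reference and return the same pointer. */
static struct user_ns_model *model_get_user_ns(struct user_ns_model *ns)
{
	ns->refcount++;
	return ns;
}

/* Models put_user_ns(): drop a reference that must have been taken. */
static void model_put_user_ns(struct user_ns_model *ns)
{
	assert(ns->refcount > 0);
	ns->refcount--;
}

/* Models cgroup_create(): pin the creator's namespace for the cgroup's lifetime. */
static struct cgroup_model *model_cgroup_create(struct user_ns_model *creator_ns)
{
	struct cgroup_model *cgrp = calloc(1, sizeof(*cgrp));

	if (!cgrp)
		return NULL;
	cgrp->owner_userns = model_get_user_ns(creator_ns);
	return cgrp;
}

/* Models the free path in css_free_rwork_fn(): unpin, then free. */
static void model_cgroup_free(struct cgroup_model *cgrp)
{
	model_put_user_ns(cgrp->owner_userns);
	free(cgrp);
}
```

Creating two model cgroups against the same namespace raises its count
by two, and freeing both returns it to the starting value, which is the
invariant the get/put calls in the patch are meant to preserve.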



* Re: [PATCH v4 2/3] kernfs: Send IN_DELETE_SELF and IN_IGNORED
  2026-03-03 14:27                   ` Amir Goldstein
@ 2026-03-04 13:26                     ` Christian Brauner
  0 siblings, 0 replies; 20+ messages in thread
From: Christian Brauner @ 2026-03-04 13:26 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: jack, Tejun Heo, T.J. Mercier, gregkh, driver-core, linux-kernel,
	cgroups, linux-fsdevel, shuah, linux-kselftest

On Tue, Mar 03, 2026 at 03:27:52PM +0100, Amir Goldstein wrote:
> On Tue, Feb 24, 2026 at 12:03 PM Christian Brauner <brauner@kernel.org> wrote:
> >
> > On Mon, Feb 23, 2026 at 06:27:31AM -1000, Tejun Heo wrote:
> > > (cc'ing Christian Brauner)
> > >
> > > On Sat, Feb 21, 2026 at 06:11:28PM +0200, Amir Goldstein wrote:
> > > > On Sat, Feb 21, 2026 at 12:32 AM Tejun Heo <tj@kernel.org> wrote:
> > > > >
> > > > > Hello, Amir.
> > > > >
> > > > > On Fri, Feb 20, 2026 at 10:11:15PM +0200, Amir Goldstein wrote:
> > > > > > > Yeah, that can be useful. For cgroupfs, there would probably need to be a
> > > > > > > way to scope it so that it can be used on delegation boundaries too (which
> > > > > > > we can require to coincide with cgroup NS boundaries).
> > > > > >
> > > > > > I have no idea what the above means.
> > > > > > I could ask Gemini or you and I prefer the latter ;)
> > > > >
> > > > > Ah, you chose wrong. :)
> > > > >
> > > > > > What are delegation boundaries and NS boundaries in this context?
> > > > >
> > > > > cgroup delegation is giving control of a subtree to someone else:
> > > > >
> > > > > https://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git/tree/Documentation/admin-guide/cgroup-v2.rst#n537
> > > > >
> > > > > There's an old way of doing it by changing perms on some files and a
> > > > > new way using cgroup namespaces.
> > > > >
> > > > > > > Would it be possible to make FAN_MNT_ATTACH work for that?
> > > > > >
> > > > > > FAN_MNT_ATTACH is an event generated on a mntns object.
> > > > > > If "cgroup NS boundaries" is referring to a mntns object and if
> > > > > > this object is available in the context of cgroup create/destroy
> > > > > > then it should be possible.
> > > > >
> > > > > Great, yes, cgroup namespace way should work then.
> > > > >
> > > > > > But FAN_MNT_ATTACH reports a mountid. Is there a mountid
> > > > > > to report on cgroup create? Probably not?
> > > > >
> > > > > Sorry, I thought that was per-mount recursive file event monitoring.
> > > > > FAN_MARK_MOUNT looks like the right thing if we want to allow monitoring
> > > > > cgroup creations / destructions in a subtree without recursively watching
> > > > > each cgroup.
> > > >
> > > > The problem sounds very similar to subtree monitoring for mkdir/rmdir on
> > > > a filesystem, which is a problem that we have not yet solved.
> > > >
> > > > The problem with FAN_MARK_MOUNT is that it does not support the
> > > > events CREATE/DELETE, because those events are currently
> > >
> > > Ah, bummer.
> > >
> > > > monitored in a context where the mount is not available, and anyway
> > > > what users want is to get notified on a deleted file/dir in a subtree
> > > > regardless of the mount through which the create/delete was done.
> > > >
> > > > Since commit 58f5fbeb367ff ("fanotify: support watching filesystems
> > > > and mounts inside userns"), fanotify groups can be associated
> > > > with a userns.
> > > >
> > > > I was thinking that we can have a model where events are delivered
> > > > to a listener based on whether or not the uid/gid of the object are
> > > > mappable to the userns of the group.
> > >
> > > Given how different NSes can be used independently of each other, it'd
> > > probably be cleaner if it doesn't have to depend on another NS.
> > >
> > > > In a filesystem, this criterion cannot guarantee subtree isolation.
> > > > I imagine that for delegated cgroups this criterion could match what
> > > > you need, but I am basing this on pure speculation.
> > >
> > > There's a lot of flexibility in the mechanism, so it's difficult to tell.
> > > e.g. There's nothing preventing somebody from creating two separate subtrees
> > > delegated to the same user.
> >
> > Delegation is based on inode ownership, so I'm not sure how well this
> > will fit into the fanotify model. Maybe the group logic for userns that
> > fanotify added works. I'm not super sure.
> >
> > > Christian was mentioning allowing separate super for different cgroup mounts
> > > in another thread. cc'ing him for context.
> >
> > If cgroupfs changes to tmpfs semantics where each mount gives you a new
> > superblock then it's possible to give each container its own superblock.
> > That in turn would make it possible to place fanotify watches on the
> > superblock itself. I think you'd roughly need something like the
> > following permission model:
> >
> 
> It's hard for me to estimate the effort of changing to multi sb model,
> but judging by the length of the email I trimmed below, it does not
> sound trivial...
> 
> How do you guys feel about something like this patch which associates
> an owner userns to every cgroup?
> 
> I have this POC branch from a long time ago [1] to filter all events
> on sb by in_userns() criteria.  The semantics for real filesystems
> were a bit difficult, but perhaps this model can work well for these
> pseudo singleton fs.
> 
> I am trying to work on a model that could be useful for both cgroupfs
> and nsfs:
> 
> If a user is capable in a userns, that user will be able to set an sb
> watch for all events (say DELETE_SELF) on the sb, for objects
> whose owner_userns is in_userns() of the fanotify listener.
> 
> This will enable watching for torn down cgroups and namespaces
> which are visible to said user via delegated cgroups mount
> or via listns().
> 
> I would like to allow calling fsnotify_obj_remove() hook with
> encoded object fid (e.g. nsfs_file_handle) instead of the vfs inode,
> so that cgroupfs/nsfs could report dying objects without needing
> to associate a vfs inode with them.
> 
> WDYT? Is this an interesting direction to pursue?

I'd need to see the patches. I barely remember the details tbh.
It doesn't sound crazy though.
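
For readers following the thread, the delivery rule quoted above (an sb
watch that only sees objects whose owner_userns is in_userns() of the
listener) might be modelled in user space roughly as below. This is a
sketch under assumptions, not code from any posted patch: `userns_model`
and `model_should_deliver()` are hypothetical names, and only the
ancestry walk is meant to follow the kernel's in_userns() helper.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model of a user namespace; NOT the kernel struct. */
struct userns_model {
	struct userns_model *parent;	/* NULL for the initial namespace */
	int level;			/* 0 for the initial namespace */
};

/*
 * Same walk as the kernel's in_userns(): climb from child until reaching
 * ancestor's level, then compare.  True if child is ancestor itself or
 * one of its descendants.
 */
static bool model_in_userns(const struct userns_model *ancestor,
			    const struct userns_model *child)
{
	const struct userns_model *ns;

	for (ns = child; ns->level > ancestor->level; ns = ns->parent)
		;
	return ns == ancestor;
}

/*
 * Hypothetical delivery rule: report an event on an object only if the
 * object's owner_userns lies within the listener's user namespace.
 */
static bool model_should_deliver(const struct userns_model *listener_ns,
				 const struct userns_model *owner_userns)
{
	assert(listener_ns && owner_userns);
	return model_in_userns(listener_ns, owner_userns);
}
```

Under this rule a listener in a container's userns would see teardown of
cgroups owned by that userns or by namespaces nested below it, but not
of cgroups owned by the initial namespace or by a sibling container.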


end of thread, other threads:[~2026-03-04 13:26 UTC | newest]

Thread overview: 20+ messages
2026-02-20  5:54 [PATCH v4 0/3] kernfs: Add inotify IN_DELETE_SELF, IN_IGNORED support T.J. Mercier
2026-02-20  5:54 ` [PATCH v4 1/3] kernfs: Don't set_nlink for directories being removed T.J. Mercier
2026-02-20  5:54 ` [PATCH v4 2/3] kernfs: Send IN_DELETE_SELF and IN_IGNORED T.J. Mercier
2026-02-20 15:32   ` Tejun Heo
2026-02-20 17:15     ` Amir Goldstein
2026-02-20 19:50       ` Tejun Heo
2026-02-20 20:11         ` Amir Goldstein
2026-02-20 23:32           ` Tejun Heo
2026-02-21 16:11             ` Amir Goldstein
2026-02-23 16:27               ` Tejun Heo
2026-02-24 11:03                 ` Christian Brauner
2026-03-03 14:27                   ` Amir Goldstein
2026-03-04 13:26                     ` Christian Brauner
2026-02-20  5:54 ` [PATCH v4 3/3] selftests: memcg: Add tests for " T.J. Mercier
2026-02-20 17:43   ` Amir Goldstein
2026-02-20 17:46     ` T.J. Mercier
2026-02-20 17:53       ` T.J. Mercier
2026-02-20 18:01         ` Amir Goldstein
2026-02-20 10:14 ` [syzbot ci] Re: kernfs: Add inotify IN_DELETE_SELF, IN_IGNORED support syzbot ci
2026-02-20 18:41   ` T.J. Mercier
