[REGRESSION] 6.9.11: systemd hangs in cgroup_drain_dying during cleanup after podman operations

Linux cgroups development

* [REGRESSION] 6.9.11: systemd hangs in cgroup_drain_dying during cleanup after podman operations
@ 2026-04-29  9:21 Martin Pitt
  2026-04-29 16:21 ` Tejun Heo
  2026-05-01  2:29 ` [PATCH] cgroup: Defer css percpu_ref kill on rmdir until cgroup is depopulated Tejun Heo
  0 siblings, 2 replies; 8+ messages in thread
From: Martin Pitt @ 2026-04-29  9:21 UTC (permalink / raw)
  To: regressions; +Cc: cgroups, tj, lizefan.x, hannes

Hello,

Our cockpit tests found a kernel regression introduced between 6.9.10 (working)
and 6.9.11 (broken) that causes a system hang during cgroup cleanup after
podman container operations. I've kept notes in
https://github.com/cockpit-project/bots/pull/8970#issuecomment-4342147158 , but
now I am at my wits' end as to how to squeeze more information out of this.

=== Summary ===

When running podman REST API operations on rootless containers followed by user
session cleanup (loginctl/pkill), systemd (pid 1) gets stuck in
cgroup_drain_dying trying to remove an empty cgroup. After that, I'm

- Unable to run commands that access /proc (ps, top, lsns, ls /proc, etc.)
- Unable to create new SSH sessions or VT logins
- If I previously logged into the QEMU VT, that login session remains
  more or less functional, except not being able to run most commands

=== Kernel Versions ===

- Last known working: 6.9.10
- Broken: 6.9.11 (OpenSUSE Tumbleweed), 6.9.13 (Fedora 44), 6.9.14 (Fedora 44),
  Ubuntu 26.04 (7.0.0)

=== Stack Trace ===

From sysrq-trigger task dump, systemd is stuck in:

[  207.958946] task:systemd         state:D stack:0     pid:1     tgid:1     ppid:0
[  207.959734] Call Trace:
[  207.960117]  <TASK>
[  207.960333]  __schedule+0x2b2/0x5d0
[  207.960603]  schedule+0x27/0x80
[  207.960945]  cgroup_drain_dying+0xef/0x1a0
[  207.961287]  ? __pfx_autoremove_wake_function+0x10/0x10
[  207.961639]  cgroup_rmdir+0x37/0x100
[  207.961945]  kernfs_iop_rmdir+0x6a/0xd0
[  207.962239]  vfs_rmdir+0x154/0x270
[  207.962486]  do_rmdir+0x201/0x280
[  207.962723]  __x64_sys_unlinkat+0x8c/0xd0

=== Observations ===

- /sys/fs/cgroup/user.slice/user-1000.slice/cgroup.procs was empty, indicating
  all processes were killed but the cgroup itself cannot be removed
- Multiple zombie processes present, unable to be reaped (user@1000.service
  systemd, podman, conmon processes)
- RCU subsystem appears healthy (rcu_exp_gp_kthr in S state)

=== Reproducer ===

The bug is triggered by a specific sequence of podman REST API operations on
rootless containers, followed by user cleanup. The reproducer is part of the
cockpit-podman test suite. I created a branch where I reduced the test to the
absolute minimum, and also replaced as many UI clicks as possible with shell
operations (all but one):

  https://github.com/martinpitt/cockpit-podman/blob/kernel-hang/test/check-application#L1486

Sequence:
1. Create and stop a rootless container as the admin user
2. Call podman REST API lifecycle operations: start → restart → stop
3. Create an exec session (console/TTY connection) via REST API
4. Start the container again via REST API
5. Cleanup: loginctl terminate-user admin; loginctl kill-user admin; pkill -9 -u admin

Using podman CLI commands (e.g., "podman start swamped-crate") instead of the
REST API does NOT trigger the hang, only when using the REST API. That may be
because of the different process layout, or just sheer timing -- as eventually,
both CLI and API should result in the same actual cgroup/container operations
on the podman side.

The bug is very timing-sensitive. I attempted to create a standalone shell
script reproducer but failed: it always passes. Even the original
cockpit-podman integration test fails unreliably: it can hang on the first
iteration, most of the time it fails within 5 runs, but I've had stretches
where 50+ iterations passed before the hang happened.

=== Full debug output ===

The above GitHub PR comment links to the full dmesg log. Direct link:
https://github.com/user-attachments/files/27195205/dmesg-cgrouphang.txt

This covers initial boot up to the hang, and then the outputs of sysrq task
dump (t), memory info (m), and blocked tasks (w).

=== Additional Notes ===

In one early test run, a different hang pattern was observed where
rcu_exp_gp_kthr was in D state with a process stuck in
synchronize_rcu_expedited during namespace cleanup, but this variant has not
reproduced in subsequent runs. The cgroup cleanup deadlock appears to be the
primary manifestation.

This is my first (non-trivial) kernel bug report, so please bear with me. I
normally stay firmly in userland.

Thanks,

Martin Pitt


* Re: [REGRESSION] 6.9.11: systemd hangs in cgroup_drain_dying during cleanup after podman operations
  2026-04-29  9:21 [REGRESSION] 6.9.11: systemd hangs in cgroup_drain_dying during cleanup after podman operations Martin Pitt
@ 2026-04-29 16:21 ` Tejun Heo
  2026-04-29 21:15   ` Tejun Heo
  2026-04-30  6:15   ` Martin Pitt
  2026-05-01  2:29 ` [PATCH] cgroup: Defer css percpu_ref kill on rmdir until cgroup is depopulated Tejun Heo
  1 sibling, 2 replies; 8+ messages in thread
From: Tejun Heo @ 2026-04-29 16:21 UTC (permalink / raw)
  To: Martin Pitt; +Cc: regressions, cgroups, lizefan.x, hannes

Hello,

Thanks for the report. The dmesg you attached has only a partial sysrq-t
- the dying-task stacks I need were pushed out of the ring buffer. Could
you increase log_buf_len, reproduce, trigger sysrq-t, and send the
resulting dmesg?

Thanks.
--
tejun


* Re: [REGRESSION] 6.9.11: systemd hangs in cgroup_drain_dying during cleanup after podman operations
  2026-04-29 16:21 ` Tejun Heo
@ 2026-04-29 21:15   ` Tejun Heo
  2026-04-30  6:15   ` Martin Pitt
  1 sibling, 0 replies; 8+ messages in thread
From: Tejun Heo @ 2026-04-29 21:15 UTC (permalink / raw)
  To: Martin Pitt
  Cc: regressions, cgroups, lizefan.x, hannes,
	Sebastian Andrzej Siewior


Hello,

I think I have the mechanism. The deadlock chains three things together.

1. Host PID 1 systemd is doing rmdir on user-1001.slice. rmdir enters
   cgroup_drain_dying() which waits until nr_populated_csets drops to
   0. The wait was added in 1b164b876c36 ("cgroup: Wait for dying
   tasks to leave on rmdir") so that rmdir doesn't succeed while
   dying tasks are still on the cgroup's css_set - the controller
   invariant being kept is "no tasks running in an offlined css"
   (see d245698d727a for where that got established on the dying-task
   side).

2. nr_populated_csets only drops when cgroup_task_dead() runs, which
   happens from finish_task_switch() after the dying task's
   do_task_dead() - i.e., after the very last context switch out of
   the task.

3. The container's PID 1 (whatever the entrypoint runs) is in
   do_exit() but parked in zap_pid_ns_processes' second wait loop:

       for (;;) {
           set_current_state(TASK_INTERRUPTIBLE);
           if (pid_ns->pid_allocated == init_pids)
               break;
           schedule();
       }

   pid_allocated stays > init_pids because at least one struct pid in
   the namespace is still in the idr - i.e., its task hasn't been
   reaped (release_task -> free_pid hasn't run).

The unreaped task is one inside the container's pid namespace whose
parent lives on the host - in the podman case, an exec session child
whose host parent died during the pkill cascade (conmon, the user
manager, etc.) and which got re-parented to host PID 1. PID 1 was
supposed to wait4() it as part of normal reaping, but PID 1 is blocked
in (1). Chicken-and-egg.
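
The circular wait can be sketched as a toy wait-for graph in userspace C.
This is an illustration only: the enum labels and the waits_on_itself()
helper are hypothetical and model none of the real kernel data structures;
each edge just records who has to make progress before whom.

```c
#include <assert.h>

/* Hypothetical labels for the three parties in the wedge. */
enum waiter {
	HOST_PID1,	/* host systemd, sleeping in cgroup_drain_dying() */
	CONTAINER_PID1,	/* container init, sleeping in zap_pid_ns_processes() */
	ORPHAN_ZOMBIE,	/* zombie still holding a struct pid in the namespace */
	NR_WAITERS,
	NOBODY = -1,
};

/*
 * waits_on[x] is the party that must make progress before x can.
 * Returns 1 if following the edges from @start leads back to @start.
 */
static int waits_on_itself(const int *waits_on, int start)
{
	int cur = waits_on[start];

	/* At most NR_WAITERS hops are needed to get back to @start. */
	for (int hops = 0; hops < NR_WAITERS && cur != NOBODY; hops++) {
		assert(cur >= 0 && cur < NR_WAITERS);
		if (cur == start)
			return 1;
		cur = waits_on[cur];
	}
	return 0;
}

/* The chain described above: rmdir -> container PID 1 -> zombie -> reaper. */
static int wedge_is_cyclic(void)
{
	const int waits_on[NR_WAITERS] = {
		[HOST_PID1]	 = CONTAINER_PID1, /* needs cgroup_task_dead() */
		[CONTAINER_PID1] = ORPHAN_ZOMBIE,  /* needs pid_allocated to drop */
		[ORPHAN_ZOMBIE]	 = HOST_PID1,	   /* needs wait4() from its reaper */
	};

	return waits_on_itself(waits_on, HOST_PID1);
}

/* Same chain, but the zombie's reaper is some other, unblocked process. */
static int other_reaper_is_cyclic(void)
{
	const int waits_on[NR_WAITERS] = {
		[HOST_PID1]	 = CONTAINER_PID1,
		[CONTAINER_PID1] = ORPHAN_ZOMBIE,
		[ORPHAN_ZOMBIE]	 = NOBODY,
	};

	return waits_on_itself(waits_on, HOST_PID1);
}
```

wedge_is_cyclic() returns 1 for the chain described above, while
other_reaper_is_cyclic() returns 0: as soon as the zombie's reaper is
anyone other than the task stuck in rmdir, the chain has an end and
everything drains.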

Why fuse-overlayfs masks it: with fuse-overlayfs the container
teardown finishes fast enough that the container's PID 1 reaches
do_task_dead before host PID 1 has had time to become the reaper of
the orphan. With kernel-overlayfs the teardown drags long enough that
the reparent + drain wait land in the iterator-hidden window
simultaneously, and once they do, the wedge is permanent.

I have a minimal C reproducer (~150 lines, no podman, no daemons,
runs as root) that hits the exact same code paths. The shape: A
unshares a pid namespace and forks B (PID 1 of the namespace) and C
(PID 2; A is C's host parent, NOT B). Both run in an inner cgroup. A
SIGKILLs both, then calls rmdir() without wait4()-ing C. C becomes a
zombie pinning pid_allocated; B parks in zap_pid_ns_processes; A's
rmdir parks in cgroup_drain_dying. Verified on stock 6.19.14 and
mainline 7.1-rc1+, also under virtme-ng. Source inlined at the bottom
of this mail.

For the fix, I'm still thinking. The rough direction is to split
cgroup rmdir: the user-visible side (directory unlink, cgroup.procs
disappears, the cgroup looks gone to userspace) completes as soon as
the iterator goes empty, while the backend teardown (kill_css /
css_offline) gets deferred until nr_populated_csets actually drops to
0. That keeps the "no running tasks in an offlined css" invariant
intact while letting the rmdir syscall return so host PID 1 can
resume reaping. Need to look at what shapes this cleanly without
breaking other cgroup core expectations.
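
A toy userspace model of that direction, assuming nothing beyond what is
described above. struct toy_cgroup and the toy_*() functions are
hypothetical names, and locking, workqueues and percpu_ref are deliberately
left out; this sketches the ordering only, not an actual kernel change.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a cgroup; the field names are hypothetical. */
struct toy_cgroup {
	int nr_tasks;	/* tasks still linked, including hidden dying ones */
	int nr_visible;	/* tasks userspace can see in cgroup.procs */
	bool removed;	/* directory is gone as far as userspace can tell */
	bool offlined;	/* backend teardown (css_offline) has run */
};

static void toy_finish_destroy(struct toy_cgroup *cg)
{
	/* The invariant being preserved: never offline with tasks attached. */
	assert(cg->removed && cg->nr_tasks == 0);
	cg->offlined = true;
}

/* Returns 0 on success, -1 (think -EBUSY) if visible tasks remain. */
static int toy_rmdir(struct toy_cgroup *cg)
{
	if (cg->nr_visible > 0)
		return -1;

	cg->removed = true;		/* user-visible half, never sleeps */
	if (cg->nr_tasks == 0)
		toy_finish_destroy(cg);	/* nothing left to wait for */
	return 0;
}

/* A dying task makes its final context switch and leaves the cgroup. */
static void toy_task_dead(struct toy_cgroup *cg)
{
	assert(cg->nr_tasks > 0);
	if (--cg->nr_tasks == 0 && cg->removed)
		toy_finish_destroy(cg);	/* deferred half fires here */
}
```

With one hidden dying task attached (nr_tasks = 1, nr_visible = 0),
toy_rmdir() returns 0 right away with offlined still false, and the later
toy_task_dead() call is what sets offlined.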

Thanks for the report.

--
tejun

----- min-repro.c -----

/*
 * Minimal reproducer for cgroup_drain_dying / zap_pid_ns_processes
 * deadlock. No podman, no daemons.
 *
 *   A — host process. Forks B (PID 1 of new pidns) and C (PID 2 of
 *       same pidns, but with A as host-side parent). After unshare,
 *       A's nsproxy->pid_ns_for_children is the level-1 namespace,
 *       so each subsequent fork lands a new task there.
 *   B — A's first child after unshare. PID 1 of new pidns.
 *   C — A's second child. PID 2 of same pidns, but parented to A
 *       at the host level (not to B). When C dies, SIGCHLD goes to
 *       A, NOT to B's zap_pid_ns_processes loop.
 *
 * Sequence:
 *   1. A puts itself in inner cgroup; B and C inherit on fork.
 *   2. A unshares CLONE_NEWPID, forks B and C.
 *   3. A moves out of inner so inner can be rmdir'd later.
 *   4. A SIGKILLs B and C. Both are killed.
 *   5. B's exit invokes zap_pid_ns_processes. zap walks idr, sees C
 *      (already SIGKILL'd), but C is NOT B's child — kernel_wait4(-1)
 *      returns -ECHILD immediately. zap reaches second loop:
 *      `pid_allocated > 1` because C's struct pid is still in idr
 *      (C is zombie awaiting A's wait4). zap sleeps.
 *   6. A calls rmdir(inner) WITHOUT calling wait4. cgroup_drain_dying
 *      sees nr_populated_csets > 0 (B still on cset->tasks; C may or
 *      may not be), iterator empty (B has PF_EXITING && live==0).
 *      Drain sleeps waiting for cgroup_task_dead, which only fires
 *      after B reaches do_task_dead, which it can't until zap returns,
 *      which it can't until pid_allocated drops to 1, which won't
 *      happen until A reaps C — but A is stuck in rmdir.
 *
 * Run as root.
 */

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static void die(const char *msg)
{
	perror(msg);
	exit(1);
}

static void write_str(const char *path, const char *s)
{
	int fd = open(path, O_WRONLY);
	if (fd < 0)
		die(path);
	if (write(fd, s, strlen(s)) < 0)
		die(path);
	close(fd);
}

int main(int argc, char **argv)
{
	const char *cgroot = (argc > 1) ? argv[1] : "/sys/fs/cgroup/drain-min";
	char cginner[256], path[300], buf[32];

	snprintf(cginner, sizeof(cginner), "%s/inner", cgroot);

	mkdir(cgroot, 0755);
	if (mkdir(cginner, 0755) < 0 && errno != EEXIST)
		die("mkdir inner");

	snprintf(path, sizeof(path), "%s/cgroup.procs", cginner);
	snprintf(buf, sizeof(buf), "%d", getpid());
	write_str(path, buf);

	if (unshare(CLONE_NEWPID) < 0)
		die("unshare CLONE_NEWPID");

	pid_t b = fork();
	if (b < 0)
		die("fork B");
	if (b == 0) {
		/* B: PID 1 of new pidns */
		pause();
		_exit(0);
	}

	pid_t c = fork();
	if (c < 0)
		die("fork C");
	if (c == 0) {
		/* C: PID 2 of new pidns, host parent is A */
		pause();
		_exit(0);
	}

	snprintf(path, sizeof(path), "%s/cgroup.procs", cgroot);
	write_str(path, buf);

	fprintf(stderr, "A: B host pid=%d, C host pid=%d\n", b, c);

	/* Briefly verify pidns membership */
	for (int i = 0; i < 2; i++) {
		pid_t p = (i == 0) ? b : c;
		char dpath[64], dbuf[256];
		snprintf(dpath, sizeof(dpath), "/proc/%d/status", p);
		int fd = open(dpath, O_RDONLY);
		if (fd >= 0) {
			ssize_t n = read(fd, dbuf, sizeof(dbuf) - 1);
			close(fd);
			if (n > 0) {
				dbuf[n] = 0;
				char *l = strstr(dbuf, "NSpid:");
				if (l) {
					char *e = strchr(l, '\n');
					if (e) *e = 0;
					fprintf(stderr, "  pid=%d %s\n", p, l);
				}
			}
		}
	}

	/* SIGKILL both. After this, B starts zap_pid_ns_processes; C
	 * dies and becomes a zombie waiting for A's wait4. */
	kill(b, SIGKILL);
	kill(c, SIGKILL);

	usleep(500000);

	fprintf(stderr, "A: rmdir(%s) — wedges if bug present "
		"(deliberately NOT wait4-ing C)\n", cginner);

	int rc = rmdir(cginner);
	int saved_errno = errno;
	fprintf(stderr, "A: rmdir returned %d (errno=%d %s)\n", rc,
		saved_errno, strerror(saved_errno));

	/* For cleanup if rmdir succeeded */
	waitpid(c, NULL, WNOHANG);
	waitpid(b, NULL, WNOHANG);
	rmdir(cgroot);
	return rc;
}
----- end min-repro.c -----


* Re: [REGRESSION] 6.9.11: systemd hangs in cgroup_drain_dying during cleanup after podman operations
  2026-04-29 16:21 ` Tejun Heo
  2026-04-29 21:15   ` Tejun Heo
@ 2026-04-30  6:15   ` Martin Pitt
  1 sibling, 0 replies; 8+ messages in thread
From: Martin Pitt @ 2026-04-30  6:15 UTC (permalink / raw)
  To: Tejun Heo; +Cc: regressions, cgroups, hannes, Sebastian Andrzej Siewior

Hello Tejun,

(Dropping lizefan.x@bytedance.com from CC:, it doesn't exist any more)

Tejun Heo [2026-04-29  6:21 -1000]:
> Thanks for the report. The dmesg you attached has only a partial sysrq-t
> - the dying-task stacks I need were pushed out of the ring buffer. Could
> you increase log_buf_len, reproduce, trigger sysrq-t, and send the
> resulting dmesg?

Increased to 4M, which was enough. I added it to the bottom of the debug notes
comment [1], direct link: [2]. I suppose it's not necessary any more, but just
for the record.

[1] https://github.com/cockpit-project/bots/pull/8970#issuecomment-4342147158
[2] https://github.com/user-attachments/files/27231725/dmesg-task-dump.txt

Tejun Heo [2026-04-29 11:15 -1000]:
> I think I have the mechanism. The deadlock chains three things together.

You are a genius!

> 3. The container's PID 1 (whatever the entrypoint runs) is in
>    do_exit() but parked in zap_pid_ns_processes' second wait loop:

FTR, the container is pretty dumb, just 

  podman run quay.io/prometheus/busybox sh -c 'echo 123; sleep infinity'

we are not actually interested in the container workload for these tests, but
in testing cockpit-podman for managing containers on the host.

However, I just confirmed that busybox's sh, like "proper" bash, does reap
child processes (unlike, for example, running `sleep` directly as pid 1, where
you do get zombies).
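
For reference, a small sketch of what "not reaped" looks like from userspace,
assuming Linux with /proc mounted. proc_state() and state_of_unreaped_child()
are made-up helper names; waitid() with WNOWAIT waits for the child to exit
without reaping it, so the zombie can be inspected before waitpid() frees its
pid.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Read the state letter (R, S, D, Z, ...) from /proc/<pid>/stat; 0 on error. */
static char proc_state(pid_t pid)
{
	char path[64], buf[512];
	size_t n;
	char *p;
	FILE *f;

	snprintf(path, sizeof(path), "/proc/%d/stat", (int)pid);
	f = fopen(path, "r");
	if (!f)
		return 0;
	n = fread(buf, 1, sizeof(buf) - 1, f);
	fclose(f);
	buf[n] = '\0';

	/* Format is "pid (comm) state ..."; comm may itself contain ')'. */
	p = strrchr(buf, ')');
	if (!p || p[1] != ' ')
		return 0;
	return p[2];
}

/*
 * Fork a child that exits at once and return its state letter as seen
 * after it has exited but before the parent has reaped it.
 */
static char state_of_unreaped_child(void)
{
	siginfo_t info;
	pid_t pid, reaped;
	char state;

	pid = fork();
	if (pid < 0)
		return 0;
	if (pid == 0)
		_exit(0);

	/* WNOWAIT: block until the child has exited, but do not reap it. */
	if (waitid(P_PID, pid, &info, WEXITED | WNOWAIT) < 0)
		return 0;
	state = proc_state(pid);

	/* Reaping is what finally releases the pid. */
	reaped = waitpid(pid, NULL, 0);
	assert(reaped == pid);
	return state;
}
```

On Linux, state_of_unreaped_child() should return 'Z', the state that ps
reports as <defunct>.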

> ----- min-repro.c -----

On Fedora 44 with 6.9.13, this hangs at

    A: rmdir(/sys/fs/cgroup/drain-min/inner) — wedges if bug present (deliberately NOT wait4-ing C)

root        1501  0.0  0.1   2460  1764 pts/0    D+   06:10   0:00 /tmp/repr
root        1502  0.0  0.0      0     0 pts/0    S+   06:10   0:00 [repr]
root        1503  0.0  0.0      0     0 pts/0    Z+   06:10   0:00 [repr] <defunct>

as expected. Unlike the original hang, it does not wedge the whole system;
"ls /proc" and such keep working.

On Fedora 44 with the older 6.9.10 kernel the reproducer finishes (no hang),
with EBUSY:

A: B host pid=1444, C host pid=1445
  pid=1444 NSpid:	1444	1
  pid=1445 NSpid:	1445	2
A: rmdir(/sys/fs/cgroup/drain-min/inner) — wedges if bug present (deliberately NOT wait4-ing C)
A: rmdir returned -1 (errno=16 Device or resource busy)

I suppose you know all that, but just in case confirming on my setup helps in
any way.

Thanks!

Martin


* [PATCH] cgroup: Defer css percpu_ref kill on rmdir until cgroup is depopulated
  2026-04-29  9:21 [REGRESSION] 6.9.11: systemd hangs in cgroup_drain_dying during cleanup after podman operations Martin Pitt
  2026-04-29 16:21 ` Tejun Heo
@ 2026-05-01  2:29 ` Tejun Heo
  2026-05-03 19:30   ` kernel test robot
                     ` (2 more replies)
  1 sibling, 3 replies; 8+ messages in thread
From: Tejun Heo @ 2026-05-01  2:29 UTC (permalink / raw)
  To: Martin Pitt
  Cc: regressions, cgroups, lizefan.x, hannes,
	Sebastian Andrzej Siewior, linux-kernel, Tejun Heo

A chain of commits going back to v7.0 reworked rmdir to satisfy the
controller invariant that a subsystem's ->css_offline() must not run while
tasks are still doing kernel-side work in the cgroup.

[1] d245698d727a ("cgroup: Defer task cgroup unlink until after the task is done switching out")
[2] a72f73c4dd9b ("cgroup: Don't expose dead tasks in cgroup")
[3] 1b164b876c36 ("cgroup: Wait for dying tasks to leave on rmdir")
[4] 4c56a8ac6869 ("cgroup: Fix cgroup_drain_dying() testing the wrong condition")
[5] 13e786b64bd3 ("cgroup: Increment nr_dying_subsys_* from rmdir context")

[1] moved task cset unlink from do_exit() to finish_task_switch() so a
task's cset link drops only after the task has fully stopped scheduling.
That made tasks past exit_signals() linger on cset->tasks until their final
context switch, which led to a series of problems as what userspace expected
to see after rmdir diverged from what the kernel needs to wait for. [2]-[5]
tried to bridge that divergence: [2] filtered the exiting tasks from
cgroup.procs; [3] had rmdir(2) sleep in TASK_UNINTERRUPTIBLE for them; [4]
fixed the wait's condition; [5] made nr_dying_subsys_* visible
synchronously.

The cgroup_drain_dying() wait in [3] turned out to be a dead end. When the
rmdir caller is also the reaper of a zombie that pins a pidns teardown (e.g.
host PID 1 systemd reaping orphan pids that were re-parented to it during
the same teardown), rmdir blocks in TASK_UNINTERRUPTIBLE waiting for those
pids to free, the pids can't free because PID 1 is the reaper and it's stuck
in rmdir, and the system A-A deadlocks. No internal lock ordering breaks
this; the wait itself is the bug.

The css killing side that drove the original reorder, however, can be made
cleanly asynchronous: ->css_offline() is already async, run from
css_killed_work_fn() driven by percpu_ref_kill_and_confirm(). The fix is to
make that chain start only after all tasks have left the cgroup. rmdir's
user-visible side then returns as soon as cgroup.procs and friends are
empty, while ->css_offline() still runs only after the cgroup is fully
drained.

Verified by the original reproducer (pidns teardown + zombie reaper, runs
under vng) which hangs vanilla and succeeds here, and by per-commit
deterministic repros for [2], [3], [4], [5] with a boot parameter that
widens the post-exit_signals() window so each state is reliably reachable.
Some stress tests on top of that.

This seems like the right approach and I don't see problems with it. The
changes are somewhat invasive but not excessively so, so backporting to
-stable should be okay. If something does turn out to be wrong, the fallback
is to revert the entire chain ([1]-[5]) and rework in the development branch
instead.

Fixes: 1b164b876c36 ("cgroup: Wait for dying tasks to leave on rmdir")
Cc: stable@vger.kernel.org # v7.0+
Reported-by: Martin Pitt <martin@piware.de>
Link: https://lore.kernel.org/all/afHNg2VX2jy9bW7y@piware.de/
Link: https://lore.kernel.org/all/35e0670adb4abeab13da2c321582af9f@kernel.org/
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
Hello Martin,

Could you give this a try? It defers the percpu_ref kill chain on
rmdir until the cgroup is fully drained, which removes the
TASK_UNINTERRUPTIBLE wait that deadlocked against PID 1 reaping. The
patch description has the details.

Thanks.

 include/linux/cgroup-defs.h |   4 +-
 kernel/cgroup/cgroup.c      | 241 ++++++++++++++++--------------------
 2 files changed, 110 insertions(+), 135 deletions(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index f42563739d2e..50a784da7a81 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -611,8 +611,8 @@ struct cgroup {
 	/* used to wait for offlining of csses */
 	wait_queue_head_t offline_waitq;
 
-	/* used by cgroup_rmdir() to wait for dying tasks to leave */
-	wait_queue_head_t dying_populated_waitq;
+	/* defers killing csses after removal until cgroup is depopulated */
+	struct work_struct finish_destroy_work;
 
 	/* used to schedule release agent */
 	struct work_struct release_agent_work;
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index c928dea9dea6..5634bac9cb9c 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -264,10 +264,10 @@ static void cgroup_finalize_control(struct cgroup *cgrp, int ret);
 static void css_task_iter_skip(struct css_task_iter *it,
 			       struct task_struct *task);
 static int cgroup_destroy_locked(struct cgroup *cgrp);
+static void cgroup_finish_destroy(struct cgroup *cgrp);
 static struct cgroup_subsys_state *css_create(struct cgroup *cgrp,
 					      struct cgroup_subsys *ss);
 static void css_release(struct percpu_ref *ref);
-static void kill_css(struct cgroup_subsys_state *css);
 static int cgroup_addrm_files(struct cgroup_subsys_state *css,
 			      struct cgroup *cgrp, struct cftype cfts[],
 			      bool is_add);
@@ -797,6 +797,10 @@ static void cgroup_update_populated(struct cgroup *cgrp, bool populated)
 		if (was_populated == cgroup_is_populated(cgrp))
 			break;
 
+		/* subtree just emptied below an offlined cgrp; fire deferred destroy */
+		if (was_populated && !css_is_online(&cgrp->self))
+			queue_work(cgroup_offline_wq, &cgrp->finish_destroy_work);
+
 		cgroup1_check_for_release(cgrp);
 		TRACE_CGROUP_PATH(notify_populated, cgrp,
 				  cgroup_is_populated(cgrp));
@@ -2039,6 +2043,15 @@ static int cgroup_reconfigure(struct fs_context *fc)
 	return 0;
 }
 
+static void cgroup_finish_destroy_work_fn(struct work_struct *work)
+{
+	struct cgroup *cgrp = container_of(work, struct cgroup, finish_destroy_work);
+
+	cgroup_lock();
+	cgroup_finish_destroy(cgrp);
+	cgroup_unlock();
+}
+
 static void init_cgroup_housekeeping(struct cgroup *cgrp)
 {
 	struct cgroup_subsys *ss;
@@ -2065,7 +2078,7 @@ static void init_cgroup_housekeeping(struct cgroup *cgrp)
 #endif
 
 	init_waitqueue_head(&cgrp->offline_waitq);
-	init_waitqueue_head(&cgrp->dying_populated_waitq);
+	INIT_WORK(&cgrp->finish_destroy_work, cgroup_finish_destroy_work_fn);
 	INIT_WORK(&cgrp->release_agent_work, cgroup1_release_agent);
 }
 
@@ -3375,7 +3388,8 @@ static void cgroup_apply_control_disable(struct cgroup *cgrp)
 
 			if (css->parent &&
 			    !(cgroup_ss_mask(dsct) & (1 << ss->id))) {
-				kill_css(css);
+				kill_css_sync(css);
+				kill_css_finish(css);
 			} else if (!css_visible(css)) {
 				css_clear_dir(css);
 				if (ss->css_reset)
@@ -5514,7 +5528,7 @@ static struct cftype cgroup_psi_files[] = {
  * css destruction is four-stage process.
  *
  * 1. Destruction starts.  Killing of the percpu_ref is initiated.
- *    Implemented in kill_css().
+ *    Implemented in kill_css_finish().
  *
  * 2. When the percpu_ref is confirmed to be visible as killed on all CPUs
  *    and thus css_tryget_online() is guaranteed to fail, the css can be
@@ -5993,7 +6007,7 @@ int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name, umode_t mode)
 /*
  * This is called when the refcnt of a css is confirmed to be killed.
  * css_tryget_online() is now guaranteed to fail.  Tell the subsystem to
- * initiate destruction and put the css ref from kill_css().
+ * initiate destruction and put the css ref from kill_css_finish().
  */
 static void css_killed_work_fn(struct work_struct *work)
 {
@@ -6025,15 +6039,12 @@ static void css_killed_ref_fn(struct percpu_ref *ref)
 }
 
 /**
- * kill_css - destroy a css
- * @css: css to destroy
+ * kill_css_sync - synchronous half of css teardown
+ * @css: css being killed
  *
- * This function initiates destruction of @css by removing cgroup interface
- * files and putting its base reference.  ->css_offline() will be invoked
- * asynchronously once css_tryget_online() is guaranteed to fail and when
- * the reference count reaches zero, @css will be released.
+ * See cgroup_destroy_locked().
  */
-static void kill_css(struct cgroup_subsys_state *css)
+static void kill_css_sync(struct cgroup_subsys_state *css)
 {
 	struct cgroup_subsys *ss = css->ss;
 
@@ -6056,24 +6067,6 @@ static void kill_css(struct cgroup_subsys_state *css)
 	 */
 	css_clear_dir(css);
 
-	/*
-	 * Killing would put the base ref, but we need to keep it alive
-	 * until after ->css_offline().
-	 */
-	css_get(css);
-
-	/*
-	 * cgroup core guarantees that, by the time ->css_offline() is
-	 * invoked, no new css reference will be given out via
-	 * css_tryget_online().  We can't simply call percpu_ref_kill() and
-	 * proceed to offlining css's because percpu_ref_kill() doesn't
-	 * guarantee that the ref is seen as killed on all CPUs on return.
-	 *
-	 * Use percpu_ref_kill_and_confirm() to get notifications as each
-	 * css is confirmed to be seen as killed on all CPUs.
-	 */
-	percpu_ref_kill_and_confirm(&css->refcnt, css_killed_ref_fn);
-
 	css->cgroup->nr_dying_subsys[ss->id]++;
 	/*
 	 * Parent css and cgroup cannot be freed until after the freeing
@@ -6086,44 +6079,88 @@ static void kill_css(struct cgroup_subsys_state *css)
 }
 
 /**
- * cgroup_destroy_locked - the first stage of cgroup destruction
+ * kill_css_finish - deferred half of css teardown
+ * @css: css being killed
+ *
+ * See cgroup_destroy_locked().
+ */
+static void kill_css_finish(struct cgroup_subsys_state *css)
+{
+	lockdep_assert_held(&cgroup_mutex);
+
+	/*
+	 * Skip on re-entry: cgroup_apply_control_disable() may have killed @css
+	 * earlier. cgroup_destroy_locked() can still walk it because
+	 * offline_css() (which NULLs cgrp->subsys[ssid]) runs async.
+	 */
+	if (percpu_ref_is_dying(&css->refcnt))
+		return;
+
+	/*
+	 * Killing would put the base ref, but we need to keep it alive until
+	 * after ->css_offline().
+	 */
+	css_get(css);
+
+	/*
+	 * cgroup core guarantees that, by the time ->css_offline() is invoked,
+	 * no new css reference will be given out via css_tryget_online(). We
+	 * can't simply call percpu_ref_kill() and proceed to offlining css's
+	 * because percpu_ref_kill() doesn't guarantee that the ref is seen as
+	 * killed on all CPUs on return.
+	 *
+	 * Use percpu_ref_kill_and_confirm() to get notifications as each css is
+	 * confirmed to be seen as killed on all CPUs.
+	 */
+	percpu_ref_kill_and_confirm(&css->refcnt, css_killed_ref_fn);
+}
+
+/**
+ * cgroup_destroy_locked - destroy @cgrp (called on rmdir)
  * @cgrp: cgroup to be destroyed
  *
- * css's make use of percpu refcnts whose killing latency shouldn't be
- * exposed to userland and are RCU protected.  Also, cgroup core needs to
- * guarantee that css_tryget_online() won't succeed by the time
- * ->css_offline() is invoked.  To satisfy all the requirements,
- * destruction is implemented in the following two steps.
- *
- * s1. Verify @cgrp can be destroyed and mark it dying.  Remove all
- *     userland visible parts and start killing the percpu refcnts of
- *     css's.  Set up so that the next stage will be kicked off once all
- *     the percpu refcnts are confirmed to be killed.
- *
- * s2. Invoke ->css_offline(), mark the cgroup dead and proceed with the
- *     rest of destruction.  Once all cgroup references are gone, the
- *     cgroup is RCU-freed.
- *
- * This function implements s1.  After this step, @cgrp is gone as far as
- * the userland is concerned and a new cgroup with the same name may be
- * created.  As cgroup doesn't care about the names internally, this
- * doesn't cause any problem.
+ * Tear down @cgrp on behalf of rmdir. Constraints:
+ *
+ * - Userspace: rmdir must succeed when cgroup.procs and friends are empty.
+ *
+ * - Kernel: subsystem ->css_offline() must not run while any task in @cgrp's
+ *   subtree is still doing kernel work. A task hidden from cgroup.procs (past
+ *   exit_signals() with signal->live cleared) can still schedule, allocate, and
+ *   consume resources until its final context switch. Dying descendants in the
+ *   subtree can host such tasks too.
+ *
+ * - Kernel: css_tryget_online() must fail by the time ->css_offline() runs.
+ *
+ * The destruction runs in three parts:
+ *
+ * - This function: synchronous user-visible state teardown plus kill_css_sync()
+ *   on each subsystem css.
+ *
+ * - cgroup_finish_destroy(): kicks the percpu_ref kill via kill_css_finish() on
+ *   each subsystem css. Fires once @cgrp's subtree is fully drained, either
+ *   inline here or from cgroup_update_populated().
+ *
+ * - The percpu_ref kill chain: css_killed_ref_fn -> css_killed_work_fn ->
+ *   ->css_offline() -> release/free.
+ *
+ * Return 0 on success, -EBUSY if a userspace-visible task or an online child
+ * remains.
  */
 static int cgroup_destroy_locked(struct cgroup *cgrp)
-	__releases(&cgroup_mutex) __acquires(&cgroup_mutex)
 {
 	struct cgroup *tcgrp, *parent = cgroup_parent(cgrp);
 	struct cgroup_subsys_state *css;
 	struct cgrp_cset_link *link;
+	struct css_task_iter it;
+	struct task_struct *task;
 	int ssid, ret;
 
 	lockdep_assert_held(&cgroup_mutex);
 
-	/*
-	 * Only migration can raise populated from zero and we're already
-	 * holding cgroup_mutex.
-	 */
-	if (cgroup_is_populated(cgrp))
+	css_task_iter_start(&cgrp->self, 0, &it);
+	task = css_task_iter_next(&it);
+	css_task_iter_end(&it);
+	if (task)
 		return -EBUSY;
 
 	/*
@@ -6147,9 +6184,8 @@ static int cgroup_destroy_locked(struct cgroup *cgrp)
 		link->cset->dead = true;
 	spin_unlock_irq(&css_set_lock);
 
-	/* initiate massacre of all css's */
 	for_each_css(css, ssid, cgrp)
-		kill_css(css);
+		kill_css_sync(css);
 
 	/* clear and remove @cgrp dir, @cgrp has an extra ref on its kn */
 	css_clear_dir(&cgrp->self);
@@ -6180,79 +6216,27 @@ static int cgroup_destroy_locked(struct cgroup *cgrp)
 	/* put the base reference */
 	percpu_ref_kill(&cgrp->self.refcnt);
 
+	if (!cgroup_is_populated(cgrp))
+		cgroup_finish_destroy(cgrp);
+
 	return 0;
 };
 
 /**
- * cgroup_drain_dying - wait for dying tasks to leave before rmdir
- * @cgrp: the cgroup being removed
- *
- * cgroup.procs and cgroup.threads use css_task_iter which filters out
- * PF_EXITING tasks so that userspace doesn't see tasks that have already been
- * reaped via waitpid(). However, cgroup_has_tasks() - which tests whether the
- * cgroup has non-empty css_sets - is only updated when dying tasks pass through
- * cgroup_task_dead() in finish_task_switch(). This creates a window where
- * cgroup.procs reads empty but cgroup_has_tasks() is still true, making rmdir
- * fail with -EBUSY from cgroup_destroy_locked() even though userspace sees no
- * tasks.
+ * cgroup_finish_destroy - deferred half of @cgrp destruction
+ * @cgrp: cgroup whose subtree just became empty
  *
- * This function aligns cgroup_has_tasks() with what userspace can observe. If
- * cgroup_has_tasks() but the task iterator sees nothing (all remaining tasks are
- * PF_EXITING), we wait for cgroup_task_dead() to finish processing them. As the
- * window between PF_EXITING and cgroup_task_dead() is short, the wait is brief.
- *
- * This function only concerns itself with this cgroup's own dying tasks.
- * Whether the cgroup has children is cgroup_destroy_locked()'s problem.
- *
- * Each cgroup_task_dead() kicks the waitqueue via cset->cgrp_links, and we
- * retry the full check from scratch.
- *
- * Must be called with cgroup_mutex held.
+ * See cgroup_destroy_locked() for the rationale.
  */
-static int cgroup_drain_dying(struct cgroup *cgrp)
-	__releases(&cgroup_mutex) __acquires(&cgroup_mutex)
+static void cgroup_finish_destroy(struct cgroup *cgrp)
 {
-	struct css_task_iter it;
-	struct task_struct *task;
-	DEFINE_WAIT(wait);
+	struct cgroup_subsys_state *css;
+	int ssid;
 
 	lockdep_assert_held(&cgroup_mutex);
-retry:
-	if (!cgroup_has_tasks(cgrp))
-		return 0;
-
-	/* Same iterator as cgroup.threads - if any task is visible, it's busy */
-	css_task_iter_start(&cgrp->self, 0, &it);
-	task = css_task_iter_next(&it);
-	css_task_iter_end(&it);
-
-	if (task)
-		return -EBUSY;
 
-	/*
-	 * All remaining tasks are PF_EXITING and will pass through
-	 * cgroup_task_dead() shortly. Wait for a kick and retry.
-	 *
-	 * cgroup_has_tasks() can't transition from false to true while we're
-	 * holding cgroup_mutex, but the true to false transition happens
-	 * under css_set_lock (via cgroup_task_dead()). We must retest and
-	 * prepare_to_wait() under css_set_lock. Otherwise, the transition
-	 * can happen between our first test and prepare_to_wait(), and we
-	 * sleep with no one to wake us.
-	 */
-	spin_lock_irq(&css_set_lock);
-	if (!cgroup_has_tasks(cgrp)) {
-		spin_unlock_irq(&css_set_lock);
-		return 0;
-	}
-	prepare_to_wait(&cgrp->dying_populated_waitq, &wait,
-			TASK_UNINTERRUPTIBLE);
-	spin_unlock_irq(&css_set_lock);
-	mutex_unlock(&cgroup_mutex);
-	schedule();
-	finish_wait(&cgrp->dying_populated_waitq, &wait);
-	mutex_lock(&cgroup_mutex);
-	goto retry;
+	for_each_css(css, ssid, cgrp)
+		kill_css_finish(css);
 }
 
 int cgroup_rmdir(struct kernfs_node *kn)
@@ -6264,12 +6248,9 @@ int cgroup_rmdir(struct kernfs_node *kn)
 	if (!cgrp)
 		return 0;
 
-	ret = cgroup_drain_dying(cgrp);
-	if (!ret) {
-		ret = cgroup_destroy_locked(cgrp);
-		if (!ret)
-			TRACE_CGROUP_PATH(rmdir, cgrp);
-	}
+	ret = cgroup_destroy_locked(cgrp);
+	if (!ret)
+		TRACE_CGROUP_PATH(rmdir, cgrp);
 
 	cgroup_kn_unlock(kn);
 	return ret;
@@ -7029,7 +7010,6 @@ void cgroup_task_exit(struct task_struct *tsk)
 
 static void do_cgroup_task_dead(struct task_struct *tsk)
 {
-	struct cgrp_cset_link *link;
 	struct css_set *cset;
 	unsigned long flags;
 
@@ -7043,11 +7023,6 @@ static void do_cgroup_task_dead(struct task_struct *tsk)
 	if (thread_group_leader(tsk) && atomic_read(&tsk->signal->live))
 		list_add_tail(&tsk->cg_list, &cset->dying_tasks);
 
-	/* kick cgroup_drain_dying() waiters, see cgroup_rmdir() */
-	list_for_each_entry(link, &cset->cgrp_links, cgrp_link)
-		if (waitqueue_active(&link->cgrp->dying_populated_waitq))
-			wake_up(&link->cgrp->dying_populated_waitq);
-
 	if (dl_task(tsk))
 		dec_dl_tasks_cs(tsk);
 
-- 
2.54.0
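
[Editor's sketch, not part of the patch] The split that the diff above introduces can be modelled in userspace as follows. The synchronous half runs at rmdir time; the deferred half runs only once no tasks remain. All toy_* names are hypothetical. The hook in the exit path that triggers the deferred half is an assumption here, since that part of the patch is not shown in this excerpt.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy stand-in for struct cgroup; every name here is hypothetical. */
struct toy_cgroup {
	int nr_tasks;   /* tasks still attached, including exiting ones */
	bool dying;     /* set by the synchronous half at rmdir time */
	bool finished;  /* set once the deferred half has run */
};

/* Deferred half: models the kill_css_finish() loop in cgroup_finish_destroy(). */
static void toy_finish_destroy(struct toy_cgroup *cg)
{
	assert(cg->dying && cg->nr_tasks == 0);
	cg->finished = true;
}

/*
 * rmdir path: fails with -EBUSY if any task is still visible to userspace,
 * otherwise marks the cgroup dying and returns without sleeping.
 */
static int toy_rmdir(struct toy_cgroup *cg, int nr_visible_tasks)
{
	if (nr_visible_tasks > 0)
		return -EBUSY;

	cg->dying = true;               /* synchronous half */
	if (cg->nr_tasks == 0)
		toy_finish_destroy(cg); /* already depopulated, finish now */
	return 0;
}

/* Exit path (assumed hook): the last exiting task runs the deferred half. */
static void toy_task_dead(struct toy_cgroup *cg)
{
	assert(cg->nr_tasks > 0);
	if (--cg->nr_tasks == 0 && cg->dying && !cg->finished)
		toy_finish_destroy(cg);
}
```

In this model rmdir never waits for exiting tasks, which is the property the patch is after: the wait that cgroup_drain_dying() performed is replaced by work that completes when the last task leaves.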



* Re: [PATCH] cgroup: Defer css percpu_ref kill on rmdir until cgroup is depopulated
  2026-05-01  2:29 ` [PATCH] cgroup: Defer css percpu_ref kill on rmdir until cgroup is depopulated Tejun Heo
@ 2026-05-03 19:30   ` kernel test robot
  2026-05-03 20:15   ` kernel test robot
  2026-05-03 22:45   ` kernel test robot
  2 siblings, 0 replies; 8+ messages in thread
From: kernel test robot @ 2026-05-03 19:30 UTC (permalink / raw)
  To: Tejun Heo, Martin Pitt
  Cc: oe-kbuild-all, regressions, cgroups, lizefan.x, hannes,
	Sebastian Andrzej Siewior, linux-kernel, Tejun Heo

Hi Tejun,

kernel test robot noticed the following build warnings:

[auto build test WARNING on tj-cgroup/for-next]
[also build test WARNING on linus/master next-20260430]
[cannot apply to v7.1-rc1]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Tejun-Heo/cgroup-Defer-css-percpu_ref-kill-on-rmdir-until-cgroup-is-depopulated/20260503-165802
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git for-next
patch link:    https://lore.kernel.org/r/20260501022943.3714461-1-tj%40kernel.org
patch subject: [PATCH] cgroup: Defer css percpu_ref kill on rmdir until cgroup is depopulated
config: m68k-allmodconfig (https://download.01.org/0day-ci/archive/20260504/202605040315.QbFTzfWy-lkp@intel.com/config)
compiler: m68k-linux-gcc (GCC) 15.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260504/202605040315.QbFTzfWy-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202605040315.QbFTzfWy-lkp@intel.com/

All warnings (new ones prefixed by >>):

   kernel/cgroup/cgroup.c: In function 'cgroup_apply_control_disable':
   kernel/cgroup/cgroup.c:3391:33: error: implicit declaration of function 'kill_css_sync'; did you mean 'kill_fasync'? [-Wimplicit-function-declaration]
    3391 |                                 kill_css_sync(css);
         |                                 ^~~~~~~~~~~~~
         |                                 kill_fasync
   kernel/cgroup/cgroup.c:3392:33: error: implicit declaration of function 'kill_css_finish' [-Wimplicit-function-declaration]
    3392 |                                 kill_css_finish(css);
         |                                 ^~~~~~~~~~~~~~~
   kernel/cgroup/cgroup.c: At top level:
>> kernel/cgroup/cgroup.c:6047:13: warning: conflicting types for 'kill_css_sync'; have 'void(struct cgroup_subsys_state *)'
    6047 | static void kill_css_sync(struct cgroup_subsys_state *css)
         |             ^~~~~~~~~~~~~
   kernel/cgroup/cgroup.c:6047:13: error: static declaration of 'kill_css_sync' follows non-static declaration
   kernel/cgroup/cgroup.c:3391:33: note: previous implicit declaration of 'kill_css_sync' with type 'void(struct cgroup_subsys_state *)'
    3391 |                                 kill_css_sync(css);
         |                                 ^~~~~~~~~~~~~
>> kernel/cgroup/cgroup.c:6087:13: warning: conflicting types for 'kill_css_finish'; have 'void(struct cgroup_subsys_state *)'
    6087 | static void kill_css_finish(struct cgroup_subsys_state *css)
         |             ^~~~~~~~~~~~~~~
   kernel/cgroup/cgroup.c:6087:13: error: static declaration of 'kill_css_finish' follows non-static declaration
   kernel/cgroup/cgroup.c:3392:33: note: previous implicit declaration of 'kill_css_finish' with type 'void(struct cgroup_subsys_state *)'
    3392 |                                 kill_css_finish(css);
         |                                 ^~~~~~~~~~~~~~~


vim +6047 kernel/cgroup/cgroup.c

  6040	
  6041	/**
  6042	 * kill_css_sync - synchronous half of css teardown
  6043	 * @css: css being killed
  6044	 *
  6045	 * See cgroup_destroy_locked().
  6046	 */
> 6047	static void kill_css_sync(struct cgroup_subsys_state *css)
  6048	{
  6049		struct cgroup_subsys *ss = css->ss;
  6050	
  6051		lockdep_assert_held(&cgroup_mutex);
  6052	
  6053		if (css->flags & CSS_DYING)
  6054			return;
  6055	
  6056		/*
  6057		 * Call css_killed(), if defined, before setting the CSS_DYING flag
  6058		 */
  6059		if (css->ss->css_killed)
  6060			css->ss->css_killed(css);
  6061	
  6062		css->flags |= CSS_DYING;
  6063	
  6064		/*
  6065		 * This must happen before css is disassociated with its cgroup.
  6066		 * See seq_css() for details.
  6067		 */
  6068		css_clear_dir(css);
  6069	
  6070		css->cgroup->nr_dying_subsys[ss->id]++;
  6071		/*
  6072		 * Parent css and cgroup cannot be freed until after the freeing
  6073		 * of child css, see css_free_rwork_fn().
  6074		 */
  6075		while ((css = css->parent)) {
  6076			css->nr_descendants--;
  6077			css->cgroup->nr_dying_subsys[ss->id]++;
  6078		}
  6079	}
  6080	
  6081	/**
  6082	 * kill_css_finish - deferred half of css teardown
  6083	 * @css: css being killed
  6084	 *
  6085	 * See cgroup_destroy_locked().
  6086	 */
> 6087	static void kill_css_finish(struct cgroup_subsys_state *css)
  6088	{
  6089		lockdep_assert_held(&cgroup_mutex);
  6090	
  6091		/*
  6092		 * Skip on re-entry: cgroup_apply_control_disable() may have killed @css
  6093		 * earlier. cgroup_destroy_locked() can still walk it because
  6094		 * offline_css() (which NULLs cgrp->subsys[ssid]) runs async.
  6095		 */
  6096		if (percpu_ref_is_dying(&css->refcnt))
  6097			return;
  6098	
  6099		/*
  6100		 * Killing would put the base ref, but we need to keep it alive until
  6101		 * after ->css_offline().
  6102		 */
  6103		css_get(css);
  6104	
  6105		/*
  6106		 * cgroup core guarantees that, by the time ->css_offline() is invoked,
  6107		 * no new css reference will be given out via css_tryget_online(). We
  6108		 * can't simply call percpu_ref_kill() and proceed to offlining css's
  6109		 * because percpu_ref_kill() doesn't guarantee that the ref is seen as
  6110		 * killed on all CPUs on return.
  6111		 *
  6112		 * Use percpu_ref_kill_and_confirm() to get notifications as each css is
  6113		 * confirmed to be seen as killed on all CPUs.
  6114		 */
  6115		percpu_ref_kill_and_confirm(&css->refcnt, css_killed_ref_fn);
  6116	}
  6117	
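
[Editor's sketch, not part of the report] The re-entry guard and ancestor accounting in kill_css_sync() above can be illustrated with a small userspace model. The struct and function names are hypothetical, and a single counter stands in for nr_dying_subsys[ss->id].

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for one subsystem's css; all names are hypothetical. */
struct toy_css {
	struct toy_css *parent;
	bool dying;          /* models CSS_DYING */
	int nr_descendants;  /* live descendants below this css */
	int nr_dying;        /* models cgroup->nr_dying_subsys[ss->id] */
};

/* Models the synchronous half: a second call on the same css is a no-op. */
static void toy_kill_sync(struct toy_css *css)
{
	assert(css != NULL);

	if (css->dying)
		return;
	css->dying = true;

	css->nr_dying++;
	/* Propagate the accounting change to every ancestor. */
	while ((css = css->parent)) {
		css->nr_descendants--;
		css->nr_dying++;
	}
}
```

Because of the early return, calling the function from two paths (for example a controller-disable path followed by rmdir) does not double-count on the ancestors.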

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


* Re: [PATCH] cgroup: Defer css percpu_ref kill on rmdir until cgroup is depopulated
  2026-05-01  2:29 ` [PATCH] cgroup: Defer css percpu_ref kill on rmdir until cgroup is depopulated Tejun Heo
  2026-05-03 19:30   ` kernel test robot
@ 2026-05-03 20:15   ` kernel test robot
  2026-05-03 22:45   ` kernel test robot
  2 siblings, 0 replies; 8+ messages in thread
From: kernel test robot @ 2026-05-03 20:15 UTC (permalink / raw)
  To: Tejun Heo, Martin Pitt
  Cc: oe-kbuild-all, regressions, cgroups, lizefan.x, hannes,
	Sebastian Andrzej Siewior, linux-kernel, Tejun Heo

Hi Tejun,

kernel test robot noticed the following build warnings:

[auto build test WARNING on tj-cgroup/for-next]
[also build test WARNING on linus/master next-20260430]
[cannot apply to v7.1-rc1]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Tejun-Heo/cgroup-Defer-css-percpu_ref-kill-on-rmdir-until-cgroup-is-depopulated/20260503-165802
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git for-next
patch link:    https://lore.kernel.org/r/20260501022943.3714461-1-tj%40kernel.org
patch subject: [PATCH] cgroup: Defer css percpu_ref kill on rmdir until cgroup is depopulated
config: powerpc-randconfig-r071-20260504 (https://download.01.org/0day-ci/archive/20260504/202605040408.yt7xcKug-lkp@intel.com/config)
compiler: powerpc-linux-gcc (GCC) 8.5.0
smatch: v0.5.0-9065-ge9cc34fd
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260504/202605040408.yt7xcKug-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202605040408.yt7xcKug-lkp@intel.com/

All warnings (new ones prefixed by >>):

   kernel/cgroup/cgroup.c: In function 'cgroup_apply_control_disable':
   kernel/cgroup/cgroup.c:3391:5: error: implicit declaration of function 'kill_css_sync'; did you mean 'kill_fasync'? [-Werror=implicit-function-declaration]
        kill_css_sync(css);
        ^~~~~~~~~~~~~
        kill_fasync
   kernel/cgroup/cgroup.c:3392:5: error: implicit declaration of function 'kill_css_finish'; did you mean 'kill_cad_pid'? [-Werror=implicit-function-declaration]
        kill_css_finish(css);
        ^~~~~~~~~~~~~~~
        kill_cad_pid
   kernel/cgroup/cgroup.c: At top level:
>> kernel/cgroup/cgroup.c:6047:13: warning: conflicting types for 'kill_css_sync'
    static void kill_css_sync(struct cgroup_subsys_state *css)
                ^~~~~~~~~~~~~
   kernel/cgroup/cgroup.c:6047:13: error: static declaration of 'kill_css_sync' follows non-static declaration
   kernel/cgroup/cgroup.c:3391:5: note: previous implicit declaration of 'kill_css_sync' was here
        kill_css_sync(css);
        ^~~~~~~~~~~~~
>> kernel/cgroup/cgroup.c:6087:13: warning: conflicting types for 'kill_css_finish'
    static void kill_css_finish(struct cgroup_subsys_state *css)
                ^~~~~~~~~~~~~~~
   kernel/cgroup/cgroup.c:6087:13: error: static declaration of 'kill_css_finish' follows non-static declaration
   kernel/cgroup/cgroup.c:3392:5: note: previous implicit declaration of 'kill_css_finish' was here
        kill_css_finish(css);
        ^~~~~~~~~~~~~~~
   cc1: some warnings being treated as errors


-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


* Re: [PATCH] cgroup: Defer css percpu_ref kill on rmdir until cgroup is depopulated
  2026-05-01  2:29 ` [PATCH] cgroup: Defer css percpu_ref kill on rmdir until cgroup is depopulated Tejun Heo
  2026-05-03 19:30   ` kernel test robot
  2026-05-03 20:15   ` kernel test robot
@ 2026-05-03 22:45   ` kernel test robot
  2 siblings, 0 replies; 8+ messages in thread
From: kernel test robot @ 2026-05-03 22:45 UTC (permalink / raw)
  To: Tejun Heo, Martin Pitt
  Cc: oe-kbuild-all, regressions, cgroups, lizefan.x, hannes,
	Sebastian Andrzej Siewior, linux-kernel, Tejun Heo

Hi Tejun,

kernel test robot noticed the following build errors:

[auto build test ERROR on tj-cgroup/for-next]
[also build test ERROR on linus/master next-20260430]
[cannot apply to v7.1-rc1]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Tejun-Heo/cgroup-Defer-css-percpu_ref-kill-on-rmdir-until-cgroup-is-depopulated/20260503-165802
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git for-next
patch link:    https://lore.kernel.org/r/20260501022943.3714461-1-tj%40kernel.org
patch subject: [PATCH] cgroup: Defer css percpu_ref kill on rmdir until cgroup is depopulated
config: m68k-allmodconfig (https://download.01.org/0day-ci/archive/20260504/202605040655.KI0GsBVb-lkp@intel.com/config)
compiler: m68k-linux-gcc (GCC) 15.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260504/202605040655.KI0GsBVb-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202605040655.KI0GsBVb-lkp@intel.com/

All errors (new ones prefixed by >>):

   kernel/cgroup/cgroup.c: In function 'cgroup_apply_control_disable':
>> kernel/cgroup/cgroup.c:3391:33: error: implicit declaration of function 'kill_css_sync'; did you mean 'kill_fasync'? [-Wimplicit-function-declaration]
    3391 |                                 kill_css_sync(css);
         |                                 ^~~~~~~~~~~~~
         |                                 kill_fasync
>> kernel/cgroup/cgroup.c:3392:33: error: implicit declaration of function 'kill_css_finish' [-Wimplicit-function-declaration]
    3392 |                                 kill_css_finish(css);
         |                                 ^~~~~~~~~~~~~~~
   kernel/cgroup/cgroup.c: At top level:
   kernel/cgroup/cgroup.c:6047:13: warning: conflicting types for 'kill_css_sync'; have 'void(struct cgroup_subsys_state *)'
    6047 | static void kill_css_sync(struct cgroup_subsys_state *css)
         |             ^~~~~~~~~~~~~
>> kernel/cgroup/cgroup.c:6047:13: error: static declaration of 'kill_css_sync' follows non-static declaration
   kernel/cgroup/cgroup.c:3391:33: note: previous implicit declaration of 'kill_css_sync' with type 'void(struct cgroup_subsys_state *)'
    3391 |                                 kill_css_sync(css);
         |                                 ^~~~~~~~~~~~~
   kernel/cgroup/cgroup.c:6087:13: warning: conflicting types for 'kill_css_finish'; have 'void(struct cgroup_subsys_state *)'
    6087 | static void kill_css_finish(struct cgroup_subsys_state *css)
         |             ^~~~~~~~~~~~~~~
>> kernel/cgroup/cgroup.c:6087:13: error: static declaration of 'kill_css_finish' follows non-static declaration
   kernel/cgroup/cgroup.c:3392:33: note: previous implicit declaration of 'kill_css_finish' with type 'void(struct cgroup_subsys_state *)'
    3392 |                                 kill_css_finish(css);
         |                                 ^~~~~~~~~~~~~~~


vim +3391 kernel/cgroup/cgroup.c

  3359	
  3360	/**
  3361	 * cgroup_apply_control_disable - kill or hide csses according to control
  3362	 * @cgrp: root of the target subtree
  3363	 *
  3364	 * Walk @cgrp's subtree and kill and hide csses so that they match
  3365	 * cgroup_ss_mask() and cgroup_visible_mask().
  3366	 *
  3367	 * A css is hidden when the userland requests it to be disabled while other
  3368	 * subsystems are still depending on it.  The css must not actively control
  3369	 * resources and be in the vanilla state if it's made visible again later.
  3370	 * Controllers which may be depended upon should provide ->css_reset() for
  3371	 * this purpose.
  3372	 */
  3373	static void cgroup_apply_control_disable(struct cgroup *cgrp)
  3374	{
  3375		struct cgroup *dsct;
  3376		struct cgroup_subsys_state *d_css;
  3377		struct cgroup_subsys *ss;
  3378		int ssid;
  3379	
  3380		cgroup_for_each_live_descendant_post(dsct, d_css, cgrp) {
  3381			for_each_subsys(ss, ssid) {
  3382				struct cgroup_subsys_state *css = cgroup_css(dsct, ss);
  3383	
  3384				if (!css)
  3385					continue;
  3386	
  3387				WARN_ON_ONCE(percpu_ref_is_dying(&css->refcnt));
  3388	
  3389				if (css->parent &&
  3390				    !(cgroup_ss_mask(dsct) & (1 << ss->id))) {
> 3391					kill_css_sync(css);
> 3392					kill_css_finish(css);
  3393				} else if (!css_visible(css)) {
  3394					css_clear_dir(css);
  3395					if (ss->css_reset)
  3396						ss->css_reset(css);
  3397				}
  3398			}
  3399		}
  3400	}
  3401	
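
[Editor's sketch, not part of the report] The errors above follow from a static function being called (line 3391) before its definition (line 6047) with no prior declaration in scope. One common way to resolve this class of error is a forward declaration ahead of the first caller; whether the actual fix does that or moves the definitions is not stated in this thread. The demo_* names below are hypothetical.

```c
#include <assert.h>

/* Toy css that only counts calls; all names are hypothetical. */
struct demo_css {
	int sync_calls;
	int finish_calls;
};

/*
 * Forward declarations. Without these two lines the calls below would be
 * implicit declarations, and the later static definitions would then
 * conflict with them, as in the report above.
 */
static void demo_kill_sync(struct demo_css *css);
static void demo_kill_finish(struct demo_css *css);

/* Early caller, analogous to cgroup_apply_control_disable(). */
static void demo_apply_control_disable(struct demo_css *css)
{
	assert(css);
	demo_kill_sync(css);
	demo_kill_finish(css);
}

/* Definitions that appear later in the file than their first caller. */
static void demo_kill_sync(struct demo_css *css)
{
	css->sync_calls++;
}

static void demo_kill_finish(struct demo_css *css)
{
	css->finish_calls++;
}
```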

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


Thread overview: 8+ messages
2026-04-29  9:21 [REGRESSION] 6.9.11: systemd hangs in cgroup_drain_dying during cleanup after podman operations Martin Pitt
2026-04-29 16:21 ` Tejun Heo
2026-04-29 21:15   ` Tejun Heo
2026-04-30  6:15   ` Martin Pitt
2026-05-01  2:29 ` [PATCH] cgroup: Defer css percpu_ref kill on rmdir until cgroup is depopulated Tejun Heo
2026-05-03 19:30   ` kernel test robot
2026-05-03 20:15   ` kernel test robot
2026-05-03 22:45   ` kernel test robot
