* [REGRESSION] 6.9.11: systemd hangs in cgroup_drain_dying during cleanup after podman operations
@ 2026-04-29  9:21 Martin Pitt
  2026-04-29 16:21 ` Tejun Heo
  2026-05-01  2:29 ` [PATCH] cgroup: Defer css percpu_ref kill on rmdir until cgroup is depopulated Tejun Heo
  0 siblings, 2 replies; 8+ messages in thread

From: Martin Pitt @ 2026-04-29 9:21 UTC (permalink / raw)
To: regressions; +Cc: cgroups, tj, lizefan.x, hannes

Hello,

Our cockpit tests found a kernel regression introduced between 6.9.10
(working) and 6.9.11 (broken) that causes a system hang during cgroup
cleanup after podman container operations. I've kept notes in
https://github.com/cockpit-project/bots/pull/8970#issuecomment-4342147158 ,
but I am now at my wits' end as to how to squeeze more information out of
this.

=== Summary ===

When running podman REST API operations on rootless containers followed by
user session cleanup (loginctl/pkill), systemd (pid 1) gets stuck in
cgroup_drain_dying trying to remove an empty cgroup. After that, I am:

- Unable to run commands that access /proc (ps, top, lsns, ls /proc, etc.)
- Unable to create new SSH sessions or VT logins
- If I previously logged into the QEMU VT, that login session remains more
  or less functional, except that it cannot run most commands

=== Kernel Versions ===

- Last known working: 6.9.10
- Broken: 6.9.11 (OpenSUSE Tumbleweed), 6.9.13 (Fedora 44), 6.9.14 (Fedora 44), Ubuntu 26.04 (7.0.0)

=== Stack Trace ===

From the sysrq-trigger task dump, systemd is stuck in:

[  207.958946] task:systemd state:D stack:0 pid:1 tgid:1 ppid:0
[  207.959734] Call Trace:
[  207.960117]  <TASK>
[  207.960333]  __schedule+0x2b2/0x5d0
[  207.960603]  schedule+0x27/0x80
[  207.960945]  cgroup_drain_dying+0xef/0x1a0
[  207.961287]  ?
__pfx_autoremove_wake_function+0x10/0x10
[  207.961639]  cgroup_rmdir+0x37/0x100
[  207.961945]  kernfs_iop_rmdir+0x6a/0xd0
[  207.962239]  vfs_rmdir+0x154/0x270
[  207.962486]  do_rmdir+0x201/0x280
[  207.962723]  __x64_sys_unlinkat+0x8c/0xd0

=== Observations ===

- /sys/fs/cgroup/user.slice/user-1000.slice/cgroup.procs was empty,
  indicating all processes were killed but the cgroup itself could not be
  removed
- Multiple zombie processes were present and could not be reaped
  (user@1000.service systemd, podman, conmon processes)
- The RCU subsystem appears healthy (rcu_exp_gp_kthr in S state)

=== Reproducer ===

The bug is triggered by a specific sequence of podman REST API operations
on rootless containers, followed by user cleanup. The reproducer is part of
the cockpit-podman test suite. I created a branch where I reduced the test
to the absolute minimum, and also replaced as many UI clicks as possible
with shell operations (all but one):

https://github.com/martinpitt/cockpit-podman/blob/kernel-hang/test/check-application#L1486

Sequence:

1. Create and stop a rootless container as the admin user
2. Call podman REST API lifecycle operations: start → restart → stop
3. Create an exec session (console/TTY connection) via REST API
4. Start the container again via REST API
5. Cleanup: loginctl terminate-user admin; loginctl kill-user admin; pkill -9 -u admin

Using podman CLI commands (e.g., "podman start swamped-crate") instead of
the REST API does NOT trigger the hang; it happens only when using the REST
API. That may be because of the different process layout, or just sheer
timing -- as eventually, both CLI and API should result in the same actual
cgroup/container operations on the podman side.

The bug is very timing-sensitive. I attempted to create a standalone shell
script reproducer, but failed; with it, the test always passes.
Even the original cockpit-podman integration test is unreliable: it can
hang on the first iteration, and most of the time it fails within 5 runs,
but I've had stretches where 50+ iterations passed before the hang
happened.

=== Full debug output ===

The above GitHub PR comment links to the full dmesg log. Direct link:
https://github.com/user-attachments/files/27195205/dmesg-cgrouphang.txt

This covers initial boot up to the hang, and then the outputs of the sysrq
task dump (t), memory info (m), and blocked tasks (w).

=== Additional Notes ===

In one early test run, a different hang pattern was observed where
rcu_exp_gp_kthr was in D state with a process stuck in
synchronize_rcu_expedited during namespace cleanup, but this variant has
not reproduced in subsequent runs. The cgroup cleanup deadlock appears to
be the primary manifestation.

This is my first (non-trivial) kernel bug report, so please bear with me. I
normally stay firmly in userland.

Thanks,

Martin Pitt

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: [REGRESSION] 6.9.11: systemd hangs in cgroup_drain_dying during cleanup after podman operations
  2026-04-29  9:21 [REGRESSION] 6.9.11: systemd hangs in cgroup_drain_dying during cleanup after podman operations Martin Pitt
@ 2026-04-29 16:21 ` Tejun Heo
  2026-04-29 21:15   ` Tejun Heo
  2026-04-30  6:15   ` Martin Pitt
  2026-05-01  2:29 ` [PATCH] cgroup: Defer css percpu_ref kill on rmdir until cgroup is depopulated Tejun Heo
  1 sibling, 2 replies; 8+ messages in thread

From: Tejun Heo @ 2026-04-29 16:21 UTC (permalink / raw)
To: Martin Pitt; +Cc: regressions, cgroups, lizefan.x, hannes

Hello,

Thanks for the report. The dmesg you attached has only a partial sysrq-t -
the dying-task stacks I need were pushed out of the ring buffer. Could you
increase log_buf_len, reproduce, trigger sysrq-t, and send the resulting
dmesg?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: [REGRESSION] 6.9.11: systemd hangs in cgroup_drain_dying during cleanup after podman operations
  2026-04-29 16:21 ` Tejun Heo
@ 2026-04-29 21:15   ` Tejun Heo
  2026-04-30  6:15   ` Martin Pitt
  1 sibling, 0 replies; 8+ messages in thread

From: Tejun Heo @ 2026-04-29 21:15 UTC (permalink / raw)
To: Martin Pitt
Cc: regressions, cgroups, lizefan.x, hannes, Sebastian Andrzej Siewior

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain, Size: 7343 bytes --]

Hello,

I think I have the mechanism. The deadlock chains three things together.

1. Host PID 1 systemd is doing rmdir on user-1001.slice. rmdir enters
   cgroup_drain_dying() which waits until nr_populated_csets drops to 0.
   The wait was added in 1b164b876c36 ("cgroup: Wait for dying tasks to
   leave on rmdir") so that rmdir doesn't succeed while dying tasks are
   still on the cgroup's css_set - the controller invariant being kept is
   "no tasks running in an offlined css" (see d245698d727a for where that
   got established on the dying-task side).

2. nr_populated_csets only drops when cgroup_task_dead() runs, which
   happens from finish_task_switch() after the dying task's
   do_task_dead() - i.e., after the very last context switch out of the
   task.

3. The container's PID 1 (whatever the entrypoint runs) is in do_exit()
   but parked in zap_pid_ns_processes' second wait loop:

	for (;;) {
		set_current_state(TASK_INTERRUPTIBLE);
		if (pid_ns->pid_allocated == init_pids)
			break;
		schedule();
	}

   pid_allocated stays > init_pids because at least one struct pid in the
   namespace is still in the idr - i.e., its task hasn't been reaped
   (release_task -> free_pid hasn't run).

   The unreaped task is the host parent of one of the container helpers -
   in the podman case, an exec session child whose host parent died during
   the pkill cascade (conmon, the user manager, etc.) and got re-parented
   to host PID 1. PID 1 was supposed to wait4() it as part of normal
   reaping, but PID 1 is blocked in (1). Chicken-and-egg.

Why fuse-overlayfs masks it: with fuse-overlayfs the container teardown
finishes fast enough that the container's PID 1 reaches do_task_dead
before host PID 1 has had time to become the reaper of the orphan. With
kernel-overlayfs the teardown drags long enough that the reparent + drain
wait land in the iterator-hidden window simultaneously, and once they do,
the wedge is permanent.

I have a minimal C reproducer (~150 lines, no podman, no daemons, runs as
root) that hits the exact same code paths. The shape:

A unshares a pid namespace and forks B (PID 1 of the namespace) and C
(PID 2; A is C's host parent, NOT B). Both run in an inner cgroup. A
SIGKILLs both, then calls rmdir() without wait4()-ing C. C becomes a
zombie pinning pid_allocated; B parks in zap_pid_ns_processes; A's rmdir
parks in cgroup_drain_dying.

Verified on stock 6.9.14 and mainline 7.1-rc1+, also under virtme-ng.
Source inlined at the bottom of this mail.

For the fix, I'm still thinking. The rough direction is to split cgroup
rmdir: the user-visible side (directory unlink, cgroup.procs disappears,
the cgroup looks gone to userspace) completes as soon as the iterator goes
empty, while the backend teardown (kill_css / css_offline) gets deferred
until nr_populated_csets actually drops to 0. That keeps the "no running
tasks in an offlined css" invariant intact while letting the rmdir syscall
return so host PID 1 can resume reaping. Need to look at what shapes this
cleanly without breaking other cgroup core expectations.

Thanks for the report.

-- 
tejun

----- min-repro.c -----

/*
 * Minimal reproducer for cgroup_drain_dying / zap_pid_ns_processes
 * deadlock. No podman, no daemons.
 *
 * A — host process. Forks B (PID 1 of new pidns) and C (PID 2 of
 *     same pidns, but with A as host-side parent). After unshare,
 *     A's nsproxy->pid_ns_for_children is the level-1 namespace,
 *     so each subsequent fork lands a new task there.
 * B — A's first child after unshare. PID 1 of new pidns.
 * C — A's second child. PID 2 of same pidns, but parented to A
 *     at the host level (not to B). When C dies, SIGCHLD goes to
 *     A, NOT to B's zap_pid_ns_processes loop.
 *
 * Sequence:
 *  1. A puts itself in inner cgroup; B and C inherit on fork.
 *  2. A unshares CLONE_NEWPID, forks B and C.
 *  3. A moves out of inner so inner can be rmdir'd later.
 *  4. A SIGKILLs B and C. Both are killed.
 *  5. B's exit invokes zap_pid_ns_processes. zap walks idr, sees C
 *     (already SIGKILL'd), but C is NOT B's child — kernel_wait4(-1)
 *     returns -ECHILD immediately. zap reaches second loop:
 *     `pid_allocated > 1` because C's struct pid is still in idr
 *     (C is zombie awaiting A's wait4). zap sleeps.
 *  6. A calls rmdir(inner) WITHOUT calling wait4. cgroup_drain_dying
 *     sees nr_populated_csets > 0 (B still on cset->tasks; C may or
 *     may not be), iterator empty (B has PF_EXITING && live==0).
 *     Drain sleeps waiting for cgroup_task_dead, which only fires
 *     after B reaches do_task_dead, which it can't until zap returns,
 *     which it can't until pid_allocated drops to 1, which won't
 *     happen until A reaps C — but A is stuck in rmdir.
 *
 * Run as root.
 */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static void die(const char *msg)
{
	perror(msg);
	exit(1);
}

static void write_str(const char *path, const char *s)
{
	int fd = open(path, O_WRONLY);

	if (fd < 0)
		die(path);
	if (write(fd, s, strlen(s)) < 0)
		die(path);
	close(fd);
}

int main(int argc, char **argv)
{
	const char *cgroot = (argc > 1) ? argv[1] : "/sys/fs/cgroup/drain-min";
	char cginner[256], path[300], buf[32];

	snprintf(cginner, sizeof(cginner), "%s/inner", cgroot);

	mkdir(cgroot, 0755);
	if (mkdir(cginner, 0755) < 0 && errno != EEXIST)
		die("mkdir inner");

	snprintf(path, sizeof(path), "%s/cgroup.procs", cginner);
	snprintf(buf, sizeof(buf), "%d", getpid());
	write_str(path, buf);

	if (unshare(CLONE_NEWPID) < 0)
		die("unshare CLONE_NEWPID");

	pid_t b = fork();
	if (b < 0)
		die("fork B");
	if (b == 0) {
		/* B: PID 1 of new pidns */
		pause();
		_exit(0);
	}

	pid_t c = fork();
	if (c < 0)
		die("fork C");
	if (c == 0) {
		/* C: PID 2 of new pidns, host parent is A */
		pause();
		_exit(0);
	}

	snprintf(path, sizeof(path), "%s/cgroup.procs", cgroot);
	write_str(path, buf);

	fprintf(stderr, "A: B host pid=%d, C host pid=%d\n", b, c);

	/* Briefly verify pidns membership */
	for (int i = 0; i < 2; i++) {
		pid_t p = (i == 0) ? b : c;
		char dpath[64], dbuf[256];

		snprintf(dpath, sizeof(dpath), "/proc/%d/status", p);
		int fd = open(dpath, O_RDONLY);
		if (fd >= 0) {
			ssize_t n = read(fd, dbuf, sizeof(dbuf) - 1);
			close(fd);
			if (n > 0) {
				dbuf[n] = 0;
				char *l = strstr(dbuf, "NSpid:");
				if (l) {
					char *e = strchr(l, '\n');
					if (e)
						*e = 0;
					fprintf(stderr, "  pid=%d %s\n", p, l);
				}
			}
		}
	}

	/* SIGKILL both. After this, B starts zap_pid_ns_processes; C
	 * dies and becomes a zombie waiting for A's wait4. */
	kill(b, SIGKILL);
	kill(c, SIGKILL);
	usleep(500000);

	fprintf(stderr, "A: rmdir(%s) — wedges if bug present "
		"(deliberately NOT wait4-ing C)\n", cginner);
	int rc = rmdir(cginner);
	int saved_errno = errno;
	fprintf(stderr, "A: rmdir returned %d (errno=%d %s)\n",
		rc, saved_errno, strerror(saved_errno));

	/* For cleanup if rmdir succeeded */
	waitpid(c, NULL, WNOHANG);
	waitpid(b, NULL, WNOHANG);
	rmdir(cgroot);
	return rc;
}

----- end min-repro.c -----

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: [REGRESSION] 6.9.11: systemd hangs in cgroup_drain_dying during cleanup after podman operations
  2026-04-29 16:21 ` Tejun Heo
  2026-04-29 21:15   ` Tejun Heo
@ 2026-04-30  6:15   ` Martin Pitt
  1 sibling, 0 replies; 8+ messages in thread

From: Martin Pitt @ 2026-04-30 6:15 UTC (permalink / raw)
To: Tejun Heo; +Cc: regressions, cgroups, hannes, Sebastian Andrzej Siewior

Hello Tejun,

(Dropping lizefan.x@bytedance.com from CC:, the address doesn't exist any
more)

Tejun Heo [2026-04-29 6:21 -1000]:
> Thanks for the report. The dmesg you attached has only a partial sysrq-t
> - the dying-task stacks I need were pushed out of the ring buffer. Could
> you increase log_buf_len, reproduce, trigger sysrq-t, and send the
> resulting dmesg?

Increased to 4M, which was enough. I added it to the bottom of the debug
notes comment [1], direct link: [2]. I suppose it's not necessary any more,
but just for the record.

[1] https://github.com/cockpit-project/bots/pull/8970#issuecomment-4342147158
[2] https://github.com/user-attachments/files/27231725/dmesg-task-dump.txt

Tejun Heo [2026-04-29 11:15 -1000]:
> I think I have the mechanism. The deadlock chains three things together.

You are a genius!

> 3. The container's PID 1 (whatever the entrypoint runs) is in
>    do_exit() but parked in zap_pid_ns_processes' second wait loop:

FTR, the container is pretty dumb, just

  podman run quay.io/prometheus/busybox sh -c 'echo 123; sleep infinity'

We are not actually interested in the container workload for these tests;
we are testing cockpit-podman for managing containers on the host.
However, I just confirmed that busybox's sh, like "proper" bash, does reap
child processes (unlike, for example, running `sleep` directly as pid 1, in
which case you do get zombies).

> ----- min-repro.c -----

On Fedora 44 with 6.9.13, this hangs at

  A: rmdir(/sys/fs/cgroup/drain-min/inner) — wedges if bug present (deliberately NOT wait4-ing C)

  root  1501  0.0  0.1  2460  1764 pts/0  D+  06:10  0:00 /tmp/repr
  root  1502  0.0  0.0     0     0 pts/0  S+  06:10  0:00 [repr]
  root  1503  0.0  0.0     0     0 pts/0  Z+  06:10  0:00 [repr] <defunct>

as expected. It does not wedge up the whole system in the same way, though
(breaking "ls /proc" and such).

On Fedora 44 with the older 6.9.10 kernel the reproducer finishes (no
hang), with EBUSY:

  A: B host pid=1444, C host pid=1445
    pid=1444 NSpid: 1444 1
    pid=1445 NSpid: 1445 2
  A: rmdir(/sys/fs/cgroup/drain-min/inner) — wedges if bug present (deliberately NOT wait4-ing C)
  A: rmdir returned -1 (errno=16 Device or resource busy)

I suppose you know all that, but just in case confirming it on my setup
helps in any way.

Thanks!

Martin

^ permalink raw reply	[flat|nested] 8+ messages in thread
* [PATCH] cgroup: Defer css percpu_ref kill on rmdir until cgroup is depopulated
  2026-04-29  9:21 [REGRESSION] 6.9.11: systemd hangs in cgroup_drain_dying during cleanup after podman operations Martin Pitt
  2026-04-29 16:21 ` Tejun Heo
@ 2026-05-01  2:29 ` Tejun Heo
  2026-05-03 19:30   ` kernel test robot
  ` (2 more replies)
  1 sibling, 3 replies; 8+ messages in thread

From: Tejun Heo @ 2026-05-01 2:29 UTC (permalink / raw)
To: Martin Pitt
Cc: regressions, cgroups, lizefan.x, hannes, Sebastian Andrzej Siewior,
    linux-kernel, Tejun Heo

A chain of commits going back to v7.0 reworked rmdir to satisfy the
controller invariant that a subsystem's ->css_offline() must not run while
tasks are still doing kernel-side work in the cgroup.

[1] d245698d727a ("cgroup: Defer task cgroup unlink until after the task is done switching out")
[2] a72f73c4dd9b ("cgroup: Don't expose dead tasks in cgroup")
[3] 1b164b876c36 ("cgroup: Wait for dying tasks to leave on rmdir")
[4] 4c56a8ac6869 ("cgroup: Fix cgroup_drain_dying() testing the wrong condition")
[5] 13e786b64bd3 ("cgroup: Increment nr_dying_subsys_* from rmdir context")

[1] moved task cset unlink from do_exit() to finish_task_switch() so a
task's cset link drops only after the task has fully stopped scheduling.
That made tasks past exit_signals() linger on cset->tasks until their final
context switch, which led to a series of problems as what userspace
expected to see after rmdir diverged from what the kernel needs to wait
for. [2]-[5] tried to bridge that divergence: [2] filtered the exiting
tasks from cgroup.procs; [3] had rmdir(2) sleep in TASK_UNINTERRUPTIBLE for
them; [4] fixed the wait's condition; [5] made nr_dying_subsys_* visible
synchronously.

The cgroup_drain_dying() wait in [3] turned out to be a dead end. When the
rmdir caller is also the reaper of a zombie that pins a pidns teardown
(e.g. host PID 1 systemd reaping orphan pids that were re-parented to it
during the same teardown), rmdir blocks in TASK_UNINTERRUPTIBLE waiting for
those pids to free, the pids can't free because PID 1 is the reaper and
it's stuck in rmdir, and the system A-A deadlocks. No internal lock
ordering breaks this; the wait itself is the bug.

The css killing side that drove the original reorder, however, can be made
cleanly asynchronous: ->css_offline() is already async, run from
css_killed_work_fn() driven by percpu_ref_kill_and_confirm(). The fix is to
make that chain start only after all tasks have left the cgroup. rmdir's
user-visible side then returns as soon as cgroup.procs and friends are
empty, while ->css_offline() still runs only after the cgroup is fully
drained.

Verified by the original reproducer (pidns teardown + zombie reaper, runs
under vng) which hangs vanilla and succeeds here, and by per-commit
deterministic repros for [2], [3], [4], [5] with a boot parameter that
widens the post-exit_signals() window so each state is reliably reachable.
Some stress tests on top of that.

This seems like the right approach and I don't see problems with it. The
changes are somewhat invasive but not excessively so, so backporting to
-stable should be okay. If something does turn out to be wrong, the
fallback is to revert the entire chain ([1]-[5]) and rework in the
development branch instead.

Fixes: 1b164b876c36 ("cgroup: Wait for dying tasks to leave on rmdir")
Cc: stable@vger.kernel.org # v7.0+
Reported-by: Martin Pitt <martin@piware.de>
Link: https://lore.kernel.org/all/afHNg2VX2jy9bW7y@piware.de/
Link: https://lore.kernel.org/all/35e0670adb4abeab13da2c321582af9f@kernel.org/
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
Hello Martin,

Could you give this a try? It defers the percpu_ref kill chain on rmdir
until the cgroup is fully drained, which removes the TASK_UNINTERRUPTIBLE
wait that deadlocked against PID 1 reaping.
The patch description has the details. Thanks. include/linux/cgroup-defs.h | 4 +- kernel/cgroup/cgroup.c | 241 ++++++++++++++++-------------------- 2 files changed, 110 insertions(+), 135 deletions(-) diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h index f42563739d2e..50a784da7a81 100644 --- a/include/linux/cgroup-defs.h +++ b/include/linux/cgroup-defs.h @@ -611,8 +611,8 @@ struct cgroup { /* used to wait for offlining of csses */ wait_queue_head_t offline_waitq; - /* used by cgroup_rmdir() to wait for dying tasks to leave */ - wait_queue_head_t dying_populated_waitq; + /* defers killing csses after removal until cgroup is depopulated */ + struct work_struct finish_destroy_work; /* used to schedule release agent */ struct work_struct release_agent_work; diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index c928dea9dea6..5634bac9cb9c 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -264,10 +264,10 @@ static void cgroup_finalize_control(struct cgroup *cgrp, int ret); static void css_task_iter_skip(struct css_task_iter *it, struct task_struct *task); static int cgroup_destroy_locked(struct cgroup *cgrp); +static void cgroup_finish_destroy(struct cgroup *cgrp); static struct cgroup_subsys_state *css_create(struct cgroup *cgrp, struct cgroup_subsys *ss); static void css_release(struct percpu_ref *ref); -static void kill_css(struct cgroup_subsys_state *css); static int cgroup_addrm_files(struct cgroup_subsys_state *css, struct cgroup *cgrp, struct cftype cfts[], bool is_add); @@ -797,6 +797,10 @@ static void cgroup_update_populated(struct cgroup *cgrp, bool populated) if (was_populated == cgroup_is_populated(cgrp)) break; + /* subtree just emptied below an offlined cgrp; fire deferred destroy */ + if (was_populated && !css_is_online(&cgrp->self)) + queue_work(cgroup_offline_wq, &cgrp->finish_destroy_work); + cgroup1_check_for_release(cgrp); TRACE_CGROUP_PATH(notify_populated, cgrp, cgroup_is_populated(cgrp)); @@ 
-2039,6 +2043,15 @@ static int cgroup_reconfigure(struct fs_context *fc) return 0; } +static void cgroup_finish_destroy_work_fn(struct work_struct *work) +{ + struct cgroup *cgrp = container_of(work, struct cgroup, finish_destroy_work); + + cgroup_lock(); + cgroup_finish_destroy(cgrp); + cgroup_unlock(); +} + static void init_cgroup_housekeeping(struct cgroup *cgrp) { struct cgroup_subsys *ss; @@ -2065,7 +2078,7 @@ static void init_cgroup_housekeeping(struct cgroup *cgrp) #endif init_waitqueue_head(&cgrp->offline_waitq); - init_waitqueue_head(&cgrp->dying_populated_waitq); + INIT_WORK(&cgrp->finish_destroy_work, cgroup_finish_destroy_work_fn); INIT_WORK(&cgrp->release_agent_work, cgroup1_release_agent); } @@ -3375,7 +3388,8 @@ static void cgroup_apply_control_disable(struct cgroup *cgrp) if (css->parent && !(cgroup_ss_mask(dsct) & (1 << ss->id))) { - kill_css(css); + kill_css_sync(css); + kill_css_finish(css); } else if (!css_visible(css)) { css_clear_dir(css); if (ss->css_reset) @@ -5514,7 +5528,7 @@ static struct cftype cgroup_psi_files[] = { * css destruction is four-stage process. * * 1. Destruction starts. Killing of the percpu_ref is initiated. - * Implemented in kill_css(). + * Implemented in kill_css_finish(). * * 2. When the percpu_ref is confirmed to be visible as killed on all CPUs * and thus css_tryget_online() is guaranteed to fail, the css can be @@ -5993,7 +6007,7 @@ int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name, umode_t mode) /* * This is called when the refcnt of a css is confirmed to be killed. * css_tryget_online() is now guaranteed to fail. Tell the subsystem to - * initiate destruction and put the css ref from kill_css(). + * initiate destruction and put the css ref from kill_css_finish(). 
*/ static void css_killed_work_fn(struct work_struct *work) { @@ -6025,15 +6039,12 @@ static void css_killed_ref_fn(struct percpu_ref *ref) } /** - * kill_css - destroy a css - * @css: css to destroy + * kill_css_sync - synchronous half of css teardown + * @css: css being killed * - * This function initiates destruction of @css by removing cgroup interface - * files and putting its base reference. ->css_offline() will be invoked - * asynchronously once css_tryget_online() is guaranteed to fail and when - * the reference count reaches zero, @css will be released. + * See cgroup_destroy_locked(). */ -static void kill_css(struct cgroup_subsys_state *css) +static void kill_css_sync(struct cgroup_subsys_state *css) { struct cgroup_subsys *ss = css->ss; @@ -6056,24 +6067,6 @@ static void kill_css(struct cgroup_subsys_state *css) */ css_clear_dir(css); - /* - * Killing would put the base ref, but we need to keep it alive - * until after ->css_offline(). - */ - css_get(css); - - /* - * cgroup core guarantees that, by the time ->css_offline() is - * invoked, no new css reference will be given out via - * css_tryget_online(). We can't simply call percpu_ref_kill() and - * proceed to offlining css's because percpu_ref_kill() doesn't - * guarantee that the ref is seen as killed on all CPUs on return. - * - * Use percpu_ref_kill_and_confirm() to get notifications as each - * css is confirmed to be seen as killed on all CPUs. - */ - percpu_ref_kill_and_confirm(&css->refcnt, css_killed_ref_fn); - css->cgroup->nr_dying_subsys[ss->id]++; /* * Parent css and cgroup cannot be freed until after the freeing @@ -6086,44 +6079,88 @@ static void kill_css(struct cgroup_subsys_state *css) } /** - * cgroup_destroy_locked - the first stage of cgroup destruction + * kill_css_finish - deferred half of css teardown + * @css: css being killed + * + * See cgroup_destroy_locked(). 
+ */ +static void kill_css_finish(struct cgroup_subsys_state *css) +{ + lockdep_assert_held(&cgroup_mutex); + + /* + * Skip on re-entry: cgroup_apply_control_disable() may have killed @css + * earlier. cgroup_destroy_locked() can still walk it because + * offline_css() (which NULLs cgrp->subsys[ssid]) runs async. + */ + if (percpu_ref_is_dying(&css->refcnt)) + return; + + /* + * Killing would put the base ref, but we need to keep it alive until + * after ->css_offline(). + */ + css_get(css); + + /* + * cgroup core guarantees that, by the time ->css_offline() is invoked, + * no new css reference will be given out via css_tryget_online(). We + * can't simply call percpu_ref_kill() and proceed to offlining css's + * because percpu_ref_kill() doesn't guarantee that the ref is seen as + * killed on all CPUs on return. + * + * Use percpu_ref_kill_and_confirm() to get notifications as each css is + * confirmed to be seen as killed on all CPUs. + */ + percpu_ref_kill_and_confirm(&css->refcnt, css_killed_ref_fn); +} + +/** + * cgroup_destroy_locked - destroy @cgrp (called on rmdir) * @cgrp: cgroup to be destroyed * - * css's make use of percpu refcnts whose killing latency shouldn't be - * exposed to userland and are RCU protected. Also, cgroup core needs to - * guarantee that css_tryget_online() won't succeed by the time - * ->css_offline() is invoked. To satisfy all the requirements, - * destruction is implemented in the following two steps. - * - * s1. Verify @cgrp can be destroyed and mark it dying. Remove all - * userland visible parts and start killing the percpu refcnts of - * css's. Set up so that the next stage will be kicked off once all - * the percpu refcnts are confirmed to be killed. - * - * s2. Invoke ->css_offline(), mark the cgroup dead and proceed with the - * rest of destruction. Once all cgroup references are gone, the - * cgroup is RCU-freed. - * - * This function implements s1. 
After this step, @cgrp is gone as far as - * the userland is concerned and a new cgroup with the same name may be - * created. As cgroup doesn't care about the names internally, this - * doesn't cause any problem. + * Tear down @cgrp on behalf of rmdir. Constraints: + * + * - Userspace: rmdir must succeed when cgroup.procs and friends are empty. + * + * - Kernel: subsystem ->css_offline() must not run while any task in @cgrp's + * subtree is still doing kernel work. A task hidden from cgroup.procs (past + * exit_signals() with signal->live cleared) can still schedule, allocate, and + * consume resources until its final context switch. Dying descendants in the + * subtree can host such tasks too. + * + * - Kernel: css_tryget_online() must fail by the time ->css_offline() runs. + * + * The destruction runs in three parts: + * + * - This function: synchronous user-visible state teardown plus kill_css_sync() + * on each subsystem css. + * + * - cgroup_finish_destroy(): kicks the percpu_ref kill via kill_css_finish() on + * each subsystem css. Fires once @cgrp's subtree is fully drained, either + * inline here or from cgroup_update_populated(). + * + * - The percpu_ref kill chain: css_killed_ref_fn -> css_killed_work_fn -> + * ->css_offline() -> release/free. + * + * Return 0 on success, -EBUSY if a userspace-visible task or an online child + * remains. */ static int cgroup_destroy_locked(struct cgroup *cgrp) - __releases(&cgroup_mutex) __acquires(&cgroup_mutex) { struct cgroup *tcgrp, *parent = cgroup_parent(cgrp); struct cgroup_subsys_state *css; struct cgrp_cset_link *link; + struct css_task_iter it; + struct task_struct *task; int ssid, ret; lockdep_assert_held(&cgroup_mutex); - /* - * Only migration can raise populated from zero and we're already - * holding cgroup_mutex. 
- */ - if (cgroup_is_populated(cgrp)) + css_task_iter_start(&cgrp->self, 0, &it); + task = css_task_iter_next(&it); + css_task_iter_end(&it); + if (task) return -EBUSY; /* @@ -6147,9 +6184,8 @@ static int cgroup_destroy_locked(struct cgroup *cgrp) link->cset->dead = true; spin_unlock_irq(&css_set_lock); - /* initiate massacre of all css's */ for_each_css(css, ssid, cgrp) - kill_css(css); + kill_css_sync(css); /* clear and remove @cgrp dir, @cgrp has an extra ref on its kn */ css_clear_dir(&cgrp->self); @@ -6180,79 +6216,27 @@ static int cgroup_destroy_locked(struct cgroup *cgrp) /* put the base reference */ percpu_ref_kill(&cgrp->self.refcnt); + if (!cgroup_is_populated(cgrp)) + cgroup_finish_destroy(cgrp); + return 0; }; /** - * cgroup_drain_dying - wait for dying tasks to leave before rmdir - * @cgrp: the cgroup being removed - * - * cgroup.procs and cgroup.threads use css_task_iter which filters out - * PF_EXITING tasks so that userspace doesn't see tasks that have already been - * reaped via waitpid(). However, cgroup_has_tasks() - which tests whether the - * cgroup has non-empty css_sets - is only updated when dying tasks pass through - * cgroup_task_dead() in finish_task_switch(). This creates a window where - * cgroup.procs reads empty but cgroup_has_tasks() is still true, making rmdir - * fail with -EBUSY from cgroup_destroy_locked() even though userspace sees no - * tasks. + * cgroup_finish_destroy - deferred half of @cgrp destruction + * @cgrp: cgroup whose subtree just became empty * - * This function aligns cgroup_has_tasks() with what userspace can observe. If - * cgroup_has_tasks() but the task iterator sees nothing (all remaining tasks are - * PF_EXITING), we wait for cgroup_task_dead() to finish processing them. As the - * window between PF_EXITING and cgroup_task_dead() is short, the wait is brief. - * - * This function only concerns itself with this cgroup's own dying tasks. - * Whether the cgroup has children is cgroup_destroy_locked()'s problem. 
- * - * Each cgroup_task_dead() kicks the waitqueue via cset->cgrp_links, and we - * retry the full check from scratch. - * - * Must be called with cgroup_mutex held. + * See cgroup_destroy_locked() for the rationale. */ -static int cgroup_drain_dying(struct cgroup *cgrp) - __releases(&cgroup_mutex) __acquires(&cgroup_mutex) +static void cgroup_finish_destroy(struct cgroup *cgrp) { - struct css_task_iter it; - struct task_struct *task; - DEFINE_WAIT(wait); + struct cgroup_subsys_state *css; + int ssid; lockdep_assert_held(&cgroup_mutex); -retry: - if (!cgroup_has_tasks(cgrp)) - return 0; - - /* Same iterator as cgroup.threads - if any task is visible, it's busy */ - css_task_iter_start(&cgrp->self, 0, &it); - task = css_task_iter_next(&it); - css_task_iter_end(&it); - - if (task) - return -EBUSY; - /* - * All remaining tasks are PF_EXITING and will pass through - * cgroup_task_dead() shortly. Wait for a kick and retry. - * - * cgroup_has_tasks() can't transition from false to true while we're - * holding cgroup_mutex, but the true to false transition happens - * under css_set_lock (via cgroup_task_dead()). We must retest and - * prepare_to_wait() under css_set_lock. Otherwise, the transition - * can happen between our first test and prepare_to_wait(), and we - * sleep with no one to wake us. 
-	 */
-	spin_lock_irq(&css_set_lock);
-	if (!cgroup_has_tasks(cgrp)) {
-		spin_unlock_irq(&css_set_lock);
-		return 0;
-	}
-	prepare_to_wait(&cgrp->dying_populated_waitq, &wait,
-			TASK_UNINTERRUPTIBLE);
-	spin_unlock_irq(&css_set_lock);
-	mutex_unlock(&cgroup_mutex);
-	schedule();
-	finish_wait(&cgrp->dying_populated_waitq, &wait);
-	mutex_lock(&cgroup_mutex);
-	goto retry;
+	for_each_css(css, ssid, cgrp)
+		kill_css_finish(css);
 }
 
 int cgroup_rmdir(struct kernfs_node *kn)
@@ -6264,12 +6248,9 @@ int cgroup_rmdir(struct kernfs_node *kn)
 	if (!cgrp)
 		return 0;
 
-	ret = cgroup_drain_dying(cgrp);
-	if (!ret) {
-		ret = cgroup_destroy_locked(cgrp);
-		if (!ret)
-			TRACE_CGROUP_PATH(rmdir, cgrp);
-	}
+	ret = cgroup_destroy_locked(cgrp);
+	if (!ret)
+		TRACE_CGROUP_PATH(rmdir, cgrp);
 
 	cgroup_kn_unlock(kn);
 	return ret;
@@ -7029,7 +7010,6 @@ void cgroup_task_exit(struct task_struct *tsk)
 
 static void do_cgroup_task_dead(struct task_struct *tsk)
 {
-	struct cgrp_cset_link *link;
 	struct css_set *cset;
 	unsigned long flags;
@@ -7043,11 +7023,6 @@ static void do_cgroup_task_dead(struct task_struct *tsk)
 	if (thread_group_leader(tsk) && atomic_read(&tsk->signal->live))
 		list_add_tail(&tsk->cg_list, &cset->dying_tasks);
 
-	/* kick cgroup_drain_dying() waiters, see cgroup_rmdir() */
-	list_for_each_entry(link, &cset->cgrp_links, cgrp_link)
-		if (waitqueue_active(&link->cgrp->dying_populated_waitq))
-			wake_up(&link->cgrp->dying_populated_waitq);
-
 	if (dl_task(tsk))
 		dec_dl_tasks_cs(tsk);

-- 
2.54.0
* Re: [PATCH] cgroup: Defer css percpu_ref kill on rmdir until cgroup is depopulated
  2026-05-01  2:29 ` [PATCH] cgroup: Defer css percpu_ref kill on rmdir until cgroup is depopulated Tejun Heo
@ 2026-05-03 19:30   ` kernel test robot
  2026-05-03 20:15   ` kernel test robot
  2026-05-03 22:45   ` kernel test robot
  2 siblings, 0 replies; 8+ messages in thread
From: kernel test robot @ 2026-05-03 19:30 UTC (permalink / raw)
  To: Tejun Heo, Martin Pitt
  Cc: oe-kbuild-all, regressions, cgroups, lizefan.x, hannes,
	Sebastian Andrzej Siewior, linux-kernel, Tejun Heo

Hi Tejun,

kernel test robot noticed the following build warnings:

[auto build test WARNING on tj-cgroup/for-next]
[also build test WARNING on linus/master next-20260430]
[cannot apply to v7.1-rc1]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Tejun-Heo/cgroup-Defer-css-percpu_ref-kill-on-rmdir-until-cgroup-is-depopulated/20260503-165802
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git for-next
patch link:    https://lore.kernel.org/r/20260501022943.3714461-1-tj%40kernel.org
patch subject: [PATCH] cgroup: Defer css percpu_ref kill on rmdir until cgroup is depopulated
config: m68k-allmodconfig (https://download.01.org/0day-ci/archive/20260504/202605040315.QbFTzfWy-lkp@intel.com/config)
compiler: m68k-linux-gcc (GCC) 15.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260504/202605040315.QbFTzfWy-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202605040315.QbFTzfWy-lkp@intel.com/

All warnings (new ones prefixed by >>):

   kernel/cgroup/cgroup.c: In function 'cgroup_apply_control_disable':
   kernel/cgroup/cgroup.c:3391:33: error: implicit declaration of function 'kill_css_sync'; did you mean 'kill_fasync'? [-Wimplicit-function-declaration]
    3391 |                                 kill_css_sync(css);
         |                                 ^~~~~~~~~~~~~
         |                                 kill_fasync
   kernel/cgroup/cgroup.c:3392:33: error: implicit declaration of function 'kill_css_finish' [-Wimplicit-function-declaration]
    3392 |                                 kill_css_finish(css);
         |                                 ^~~~~~~~~~~~~~~
   kernel/cgroup/cgroup.c: At top level:
>> kernel/cgroup/cgroup.c:6047:13: warning: conflicting types for 'kill_css_sync'; have 'void(struct cgroup_subsys_state *)'
    6047 | static void kill_css_sync(struct cgroup_subsys_state *css)
         |             ^~~~~~~~~~~~~
   kernel/cgroup/cgroup.c:6047:13: error: static declaration of 'kill_css_sync' follows non-static declaration
   kernel/cgroup/cgroup.c:3391:33: note: previous implicit declaration of 'kill_css_sync' with type 'void(struct cgroup_subsys_state *)'
    3391 |                                 kill_css_sync(css);
         |                                 ^~~~~~~~~~~~~
>> kernel/cgroup/cgroup.c:6087:13: warning: conflicting types for 'kill_css_finish'; have 'void(struct cgroup_subsys_state *)'
    6087 | static void kill_css_finish(struct cgroup_subsys_state *css)
         |             ^~~~~~~~~~~~~~~
   kernel/cgroup/cgroup.c:6087:13: error: static declaration of 'kill_css_finish' follows non-static declaration
   kernel/cgroup/cgroup.c:3392:33: note: previous implicit declaration of 'kill_css_finish' with type 'void(struct cgroup_subsys_state *)'
    3392 |                                 kill_css_finish(css);
         |                                 ^~~~~~~~~~~~~~~

vim +6047 kernel/cgroup/cgroup.c

  6040	
  6041	/**
  6042	 * kill_css_sync - synchronous half of css teardown
  6043	 * @css: css being killed
  6044	 *
  6045	 * See cgroup_destroy_locked().
  6046	 */
> 6047	static void kill_css_sync(struct cgroup_subsys_state *css)
  6048	{
  6049		struct cgroup_subsys *ss = css->ss;
  6050	
  6051		lockdep_assert_held(&cgroup_mutex);
  6052	
  6053		if (css->flags & CSS_DYING)
  6054			return;
  6055	
  6056		/*
  6057		 * Call css_killed(), if defined, before setting the CSS_DYING flag
  6058		 */
  6059		if (css->ss->css_killed)
  6060			css->ss->css_killed(css);
  6061	
  6062		css->flags |= CSS_DYING;
  6063	
  6064		/*
  6065		 * This must happen before css is disassociated with its cgroup.
  6066		 * See seq_css() for details.
  6067		 */
  6068		css_clear_dir(css);
  6069	
  6070		css->cgroup->nr_dying_subsys[ss->id]++;
  6071		/*
  6072		 * Parent css and cgroup cannot be freed until after the freeing
  6073		 * of child css, see css_free_rwork_fn().
  6074		 */
  6075		while ((css = css->parent)) {
  6076			css->nr_descendants--;
  6077			css->cgroup->nr_dying_subsys[ss->id]++;
  6078		}
  6079	}
  6080	
  6081	/**
  6082	 * kill_css_finish - deferred half of css teardown
  6083	 * @css: css being killed
  6084	 *
  6085	 * See cgroup_destroy_locked().
  6086	 */
> 6087	static void kill_css_finish(struct cgroup_subsys_state *css)
  6088	{
  6089		lockdep_assert_held(&cgroup_mutex);
  6090	
  6091		/*
  6092		 * Skip on re-entry: cgroup_apply_control_disable() may have killed @css
  6093		 * earlier. cgroup_destroy_locked() can still walk it because
  6094		 * offline_css() (which NULLs cgrp->subsys[ssid]) runs async.
  6095		 */
  6096		if (percpu_ref_is_dying(&css->refcnt))
  6097			return;
  6098	
  6099		/*
  6100		 * Killing would put the base ref, but we need to keep it alive until
  6101		 * after ->css_offline().
  6102		 */
  6103		css_get(css);
  6104	
  6105		/*
  6106		 * cgroup core guarantees that, by the time ->css_offline() is invoked,
  6107		 * no new css reference will be given out via css_tryget_online(). We
  6108		 * can't simply call percpu_ref_kill() and proceed to offlining css's
  6109		 * because percpu_ref_kill() doesn't guarantee that the ref is seen as
  6110		 * killed on all CPUs on return.
  6111		 *
  6112		 * Use percpu_ref_kill_and_confirm() to get notifications as each css is
  6113		 * confirmed to be seen as killed on all CPUs.
  6114		 */
  6115		percpu_ref_kill_and_confirm(&css->refcnt, css_killed_ref_fn);
  6116	}
  6117	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [PATCH] cgroup: Defer css percpu_ref kill on rmdir until cgroup is depopulated
  2026-05-01  2:29 ` [PATCH] cgroup: Defer css percpu_ref kill on rmdir until cgroup is depopulated Tejun Heo
  2026-05-03 19:30   ` kernel test robot
@ 2026-05-03 20:15   ` kernel test robot
  2026-05-03 22:45   ` kernel test robot
  2 siblings, 0 replies; 8+ messages in thread
From: kernel test robot @ 2026-05-03 20:15 UTC (permalink / raw)
  To: Tejun Heo, Martin Pitt
  Cc: oe-kbuild-all, regressions, cgroups, lizefan.x, hannes,
	Sebastian Andrzej Siewior, linux-kernel, Tejun Heo

Hi Tejun,

kernel test robot noticed the following build warnings:

[auto build test WARNING on tj-cgroup/for-next]
[also build test WARNING on linus/master next-20260430]
[cannot apply to v7.1-rc1]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Tejun-Heo/cgroup-Defer-css-percpu_ref-kill-on-rmdir-until-cgroup-is-depopulated/20260503-165802
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git for-next
patch link:    https://lore.kernel.org/r/20260501022943.3714461-1-tj%40kernel.org
patch subject: [PATCH] cgroup: Defer css percpu_ref kill on rmdir until cgroup is depopulated
config: powerpc-randconfig-r071-20260504 (https://download.01.org/0day-ci/archive/20260504/202605040408.yt7xcKug-lkp@intel.com/config)
compiler: powerpc-linux-gcc (GCC) 8.5.0
smatch: v0.5.0-9065-ge9cc34fd
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260504/202605040408.yt7xcKug-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202605040408.yt7xcKug-lkp@intel.com/

All warnings (new ones prefixed by >>):

   kernel/cgroup/cgroup.c: In function 'cgroup_apply_control_disable':
   kernel/cgroup/cgroup.c:3391:5: error: implicit declaration of function 'kill_css_sync'; did you mean 'kill_fasync'? [-Werror=implicit-function-declaration]
     kill_css_sync(css);
     ^~~~~~~~~~~~~
     kill_fasync
   kernel/cgroup/cgroup.c:3392:5: error: implicit declaration of function 'kill_css_finish'; did you mean 'kill_cad_pid'? [-Werror=implicit-function-declaration]
     kill_css_finish(css);
     ^~~~~~~~~~~~~~~
     kill_cad_pid
   kernel/cgroup/cgroup.c: At top level:
>> kernel/cgroup/cgroup.c:6047:13: warning: conflicting types for 'kill_css_sync'
    static void kill_css_sync(struct cgroup_subsys_state *css)
                ^~~~~~~~~~~~~
   kernel/cgroup/cgroup.c:6047:13: error: static declaration of 'kill_css_sync' follows non-static declaration
   kernel/cgroup/cgroup.c:3391:5: note: previous implicit declaration of 'kill_css_sync' was here
     kill_css_sync(css);
     ^~~~~~~~~~~~~
>> kernel/cgroup/cgroup.c:6087:13: warning: conflicting types for 'kill_css_finish'
    static void kill_css_finish(struct cgroup_subsys_state *css)
                ^~~~~~~~~~~~~~~
   kernel/cgroup/cgroup.c:6087:13: error: static declaration of 'kill_css_finish' follows non-static declaration
   kernel/cgroup/cgroup.c:3392:5: note: previous implicit declaration of 'kill_css_finish' was here
     kill_css_finish(css);
     ^~~~~~~~~~~~~~~
   cc1: some warnings being treated as errors

vim +/kill_css_sync +6047 kernel/cgroup/cgroup.c

  6040	
  6041	/**
  6042	 * kill_css_sync - synchronous half of css teardown
  6043	 * @css: css being killed
  6044	 *
  6045	 * See cgroup_destroy_locked().
  6046	 */
> 6047	static void kill_css_sync(struct cgroup_subsys_state *css)
  6048	{
  6049		struct cgroup_subsys *ss = css->ss;
  6050	
  6051		lockdep_assert_held(&cgroup_mutex);
  6052	
  6053		if (css->flags & CSS_DYING)
  6054			return;
  6055	
  6056		/*
  6057		 * Call css_killed(), if defined, before setting the CSS_DYING flag
  6058		 */
  6059		if (css->ss->css_killed)
  6060			css->ss->css_killed(css);
  6061	
  6062		css->flags |= CSS_DYING;
  6063	
  6064		/*
  6065		 * This must happen before css is disassociated with its cgroup.
  6066		 * See seq_css() for details.
  6067		 */
  6068		css_clear_dir(css);
  6069	
  6070		css->cgroup->nr_dying_subsys[ss->id]++;
  6071		/*
  6072		 * Parent css and cgroup cannot be freed until after the freeing
  6073		 * of child css, see css_free_rwork_fn().
  6074		 */
  6075		while ((css = css->parent)) {
  6076			css->nr_descendants--;
  6077			css->cgroup->nr_dying_subsys[ss->id]++;
  6078		}
  6079	}
  6080	
  6081	/**
  6082	 * kill_css_finish - deferred half of css teardown
  6083	 * @css: css being killed
  6084	 *
  6085	 * See cgroup_destroy_locked().
  6086	 */
> 6087	static void kill_css_finish(struct cgroup_subsys_state *css)
  6088	{
  6089		lockdep_assert_held(&cgroup_mutex);
  6090	
  6091		/*
  6092		 * Skip on re-entry: cgroup_apply_control_disable() may have killed @css
  6093		 * earlier. cgroup_destroy_locked() can still walk it because
  6094		 * offline_css() (which NULLs cgrp->subsys[ssid]) runs async.
  6095		 */
  6096		if (percpu_ref_is_dying(&css->refcnt))
  6097			return;
  6098	
  6099		/*
  6100		 * Killing would put the base ref, but we need to keep it alive until
  6101		 * after ->css_offline().
  6102		 */
  6103		css_get(css);
  6104	
  6105		/*
  6106		 * cgroup core guarantees that, by the time ->css_offline() is invoked,
  6107		 * no new css reference will be given out via css_tryget_online(). We
  6108		 * can't simply call percpu_ref_kill() and proceed to offlining css's
  6109		 * because percpu_ref_kill() doesn't guarantee that the ref is seen as
  6110		 * killed on all CPUs on return.
  6111		 *
  6112		 * Use percpu_ref_kill_and_confirm() to get notifications as each css is
  6113		 * confirmed to be seen as killed on all CPUs.
  6114		 */
  6115		percpu_ref_kill_and_confirm(&css->refcnt, css_killed_ref_fn);
  6116	}
  6117	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [PATCH] cgroup: Defer css percpu_ref kill on rmdir until cgroup is depopulated
  2026-05-01  2:29 ` [PATCH] cgroup: Defer css percpu_ref kill on rmdir until cgroup is depopulated Tejun Heo
  2026-05-03 19:30   ` kernel test robot
  2026-05-03 20:15   ` kernel test robot
@ 2026-05-03 22:45   ` kernel test robot
  2 siblings, 0 replies; 8+ messages in thread
From: kernel test robot @ 2026-05-03 22:45 UTC (permalink / raw)
  To: Tejun Heo, Martin Pitt
  Cc: oe-kbuild-all, regressions, cgroups, lizefan.x, hannes,
	Sebastian Andrzej Siewior, linux-kernel, Tejun Heo

Hi Tejun,

kernel test robot noticed the following build errors:

[auto build test ERROR on tj-cgroup/for-next]
[also build test ERROR on linus/master next-20260430]
[cannot apply to v7.1-rc1]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Tejun-Heo/cgroup-Defer-css-percpu_ref-kill-on-rmdir-until-cgroup-is-depopulated/20260503-165802
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git for-next
patch link:    https://lore.kernel.org/r/20260501022943.3714461-1-tj%40kernel.org
patch subject: [PATCH] cgroup: Defer css percpu_ref kill on rmdir until cgroup is depopulated
config: m68k-allmodconfig (https://download.01.org/0day-ci/archive/20260504/202605040655.KI0GsBVb-lkp@intel.com/config)
compiler: m68k-linux-gcc (GCC) 15.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260504/202605040655.KI0GsBVb-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202605040655.KI0GsBVb-lkp@intel.com/

All errors (new ones prefixed by >>):

   kernel/cgroup/cgroup.c: In function 'cgroup_apply_control_disable':
>> kernel/cgroup/cgroup.c:3391:33: error: implicit declaration of function 'kill_css_sync'; did you mean 'kill_fasync'? [-Wimplicit-function-declaration]
    3391 |                                 kill_css_sync(css);
         |                                 ^~~~~~~~~~~~~
         |                                 kill_fasync
>> kernel/cgroup/cgroup.c:3392:33: error: implicit declaration of function 'kill_css_finish' [-Wimplicit-function-declaration]
    3392 |                                 kill_css_finish(css);
         |                                 ^~~~~~~~~~~~~~~
   kernel/cgroup/cgroup.c: At top level:
   kernel/cgroup/cgroup.c:6047:13: warning: conflicting types for 'kill_css_sync'; have 'void(struct cgroup_subsys_state *)'
    6047 | static void kill_css_sync(struct cgroup_subsys_state *css)
         |             ^~~~~~~~~~~~~
>> kernel/cgroup/cgroup.c:6047:13: error: static declaration of 'kill_css_sync' follows non-static declaration
   kernel/cgroup/cgroup.c:3391:33: note: previous implicit declaration of 'kill_css_sync' with type 'void(struct cgroup_subsys_state *)'
    3391 |                                 kill_css_sync(css);
         |                                 ^~~~~~~~~~~~~
   kernel/cgroup/cgroup.c:6087:13: warning: conflicting types for 'kill_css_finish'; have 'void(struct cgroup_subsys_state *)'
    6087 | static void kill_css_finish(struct cgroup_subsys_state *css)
         |             ^~~~~~~~~~~~~~~
>> kernel/cgroup/cgroup.c:6087:13: error: static declaration of 'kill_css_finish' follows non-static declaration
   kernel/cgroup/cgroup.c:3392:33: note: previous implicit declaration of 'kill_css_finish' with type 'void(struct cgroup_subsys_state *)'
    3392 |                                 kill_css_finish(css);
         |                                 ^~~~~~~~~~~~~~~

vim +3391 kernel/cgroup/cgroup.c

  3359	
  3360	/**
  3361	 * cgroup_apply_control_disable - kill or hide csses according to control
  3362	 * @cgrp: root of the target subtree
  3363	 *
  3364	 * Walk @cgrp's subtree and kill and hide csses so that they match
  3365	 * cgroup_ss_mask() and cgroup_visible_mask().
  3366	 *
  3367	 * A css is hidden when the userland requests it to be disabled while other
  3368	 * subsystems are still depending on it. The css must not actively control
  3369	 * resources and be in the vanilla state if it's made visible again later.
  3370	 * Controllers which may be depended upon should provide ->css_reset() for
  3371	 * this purpose.
  3372	 */
  3373	static void cgroup_apply_control_disable(struct cgroup *cgrp)
  3374	{
  3375		struct cgroup *dsct;
  3376		struct cgroup_subsys_state *d_css;
  3377		struct cgroup_subsys *ss;
  3378		int ssid;
  3379	
  3380		cgroup_for_each_live_descendant_post(dsct, d_css, cgrp) {
  3381			for_each_subsys(ss, ssid) {
  3382				struct cgroup_subsys_state *css = cgroup_css(dsct, ss);
  3383	
  3384				if (!css)
  3385					continue;
  3386	
  3387				WARN_ON_ONCE(percpu_ref_is_dying(&css->refcnt));
  3388	
  3389				if (css->parent &&
  3390				    !(cgroup_ss_mask(dsct) & (1 << ss->id))) {
> 3391					kill_css_sync(css);
> 3392					kill_css_finish(css);
  3393				} else if (!css_visible(css)) {
  3394					css_clear_dir(css);
  3395					if (ss->css_reset)
  3396						ss->css_reset(css);
  3397				}
  3398			}
  3399		}
  3400	}
  3401	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
end of thread, other threads:[~2026-05-03 22:45 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --
2026-04-29  9:21 [REGRESSION] 6.9.11: systemd hangs in cgroup_drain_dying during cleanup after podman operations Martin Pitt
2026-04-29 16:21 ` Tejun Heo
2026-04-29 21:15   ` Tejun Heo
2026-04-30  6:15     ` Martin Pitt
2026-05-01  2:29 ` [PATCH] cgroup: Defer css percpu_ref kill on rmdir until cgroup is depopulated Tejun Heo
2026-05-03 19:30   ` kernel test robot
2026-05-03 20:15   ` kernel test robot
2026-05-03 22:45   ` kernel test robot