From: Tejun Heo <tj@kernel.org>
To: Martin Pitt <martin@piware.de>
Cc: regressions@lists.linux.dev, cgroups@vger.kernel.org,
lizefan.x@bytedance.com, hannes@cmpxchg.org,
Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Subject: Re: [REGRESSION] 6.9.11: systemd hangs in cgroup_drain_dying during cleanup after podman operations
Date: Wed, 29 Apr 2026 11:15:03 -1000
Message-ID: <35e0670adb4abeab13da2c321582af9f@kernel.org>
In-Reply-To: <f19d08689301f9cc0211e6273f833246@kernel.org>
Hello,

I think I have the mechanism. The deadlock chains three things
together.
1. Host PID 1 systemd is doing rmdir on user-1001.slice. rmdir enters
   cgroup_drain_dying() which waits until nr_populated_csets drops to
   0. The wait was added in 1b164b876c36 ("cgroup: Wait for dying
   tasks to leave on rmdir") so that rmdir doesn't succeed while
   dying tasks are still on the cgroup's css_set - the controller
   invariant being kept is "no tasks running in an offlined css"
   (see d245698d727a for where that got established on the dying-task
   side).

2. nr_populated_csets only drops when cgroup_task_dead() runs, which
   happens from finish_task_switch() after the dying task's
   do_task_dead() - i.e., after the very last context switch out of
   the task.

3. The container's PID 1 (whatever the entrypoint runs) is in
   do_exit() but parked in zap_pid_ns_processes()'s second wait loop:

	for (;;) {
		set_current_state(TASK_INTERRUPTIBLE);
		if (pid_ns->pid_allocated == init_pids)
			break;
		schedule();
	}

   pid_allocated stays > init_pids because at least one struct pid in
   the namespace is still in the idr - i.e., its task hasn't been
   reaped (release_task() -> free_pid() hasn't run).
The unreaped task is the host parent of one of the container helpers
- in the podman case, an exec session child whose host parent died
during the pkill cascade (conmon, the user manager, etc.) and got
re-parented to host PID 1. PID 1 was supposed to wait4() it as part of
normal reaping, but PID 1 is blocked in (1). Chicken-and-egg.
Why fuse-overlayfs masks it: with fuse-overlayfs the container
teardown finishes fast enough that the container's PID 1 reaches
do_task_dead before host PID 1 has had time to become the reaper of
the orphan. With kernel-overlayfs the teardown drags long enough that
the reparent + drain wait land in the iterator-hidden window
simultaneously, and once they do, the wedge is permanent.
I have a minimal C reproducer (~150 lines, no podman, no daemons,
runs as root) that hits the exact same code paths. The shape: A
unshares a pid namespace and forks B (PID 1 of the namespace) and C
(PID 2; A is C's host parent, NOT B). Both run in an inner cgroup. A
SIGKILLs both, then calls rmdir() without wait4()-ing C. C becomes a
zombie pinning pid_allocated; B parks in zap_pid_ns_processes; A's
rmdir parks in cgroup_drain_dying. Verified on stock 6.19.14 and
mainline 7.1-rc1+, also under virtme-ng. Source inlined at the bottom
of this mail.
For the fix, I'm still thinking. The rough direction is to split
cgroup rmdir: the user-visible side (directory unlink, cgroup.procs
disappears, the cgroup looks gone to userspace) completes as soon as
the iterator goes empty, while the backend teardown (kill_css /
css_offline) gets deferred until nr_populated_csets actually drops to
0. That keeps the "no running tasks in an offlined css" invariant
intact while letting the rmdir syscall return so host PID 1 can
resume reaping. Need to look at what shapes this cleanly without
breaking other cgroup core expectations.
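Roughly, the shape I have in mind is something like the following.
This is pseudocode, not a patch - the helper names are made up:

```
/* pseudocode sketch only - helper names are invented */
cgroup_rmdir(cgrp):
	if (cgroup_has_live_tasks(cgrp))
		return -EBUSY;
	cgroup_unlink_visible(cgrp);		/* dir and cgroup.procs gone */
	if (cgrp->nr_populated_csets == 0)
		cgroup_finish_offline(cgrp);	/* kill_css / css_offline now */
	else
		mark_offline_pending(cgrp);	/* defer backend teardown */
	return 0;				/* syscall returns, PID 1 reaps */

cgroup_task_dead(task):
	/* existing accounting, plus: */
	if (--cgrp->nr_populated_csets == 0 && offline_pending(cgrp))
		cgroup_finish_offline(cgrp);
```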
Thanks for the report.
--
tejun
----- min-repro.c -----
/*
 * Minimal reproducer for cgroup_drain_dying / zap_pid_ns_processes
 * deadlock. No podman, no daemons.
 *
 * A — host process. Forks B (PID 1 of new pidns) and C (PID 2 of
 *     same pidns, but with A as host-side parent). After unshare,
 *     A's nsproxy->pid_ns_for_children is the level-1 namespace,
 *     so each subsequent fork lands a new task there.
 * B — A's first child after unshare. PID 1 of new pidns.
 * C — A's second child. PID 2 of same pidns, but parented to A
 *     at the host level (not to B). When C dies, SIGCHLD goes to
 *     A, NOT to B's zap_pid_ns_processes loop.
 *
 * Sequence:
 * 1. A puts itself in inner cgroup; B and C inherit on fork.
 * 2. A unshares CLONE_NEWPID, forks B and C.
 * 3. A moves out of inner so inner can be rmdir'd later.
 * 4. A SIGKILLs B and C. Both are killed.
 * 5. B's exit invokes zap_pid_ns_processes. zap walks idr, sees C
 *    (already SIGKILL'd), but C is NOT B's child — kernel_wait4(-1)
 *    returns -ECHILD immediately. zap reaches second loop:
 *    `pid_allocated > 1` because C's struct pid is still in idr
 *    (C is zombie awaiting A's wait4). zap sleeps.
 * 6. A calls rmdir(inner) WITHOUT calling wait4. cgroup_drain_dying
 *    sees nr_populated_csets > 0 (B still on cset->tasks; C may or
 *    may not be), iterator empty (B has PF_EXITING && live==0).
 *    Drain sleeps waiting for cgroup_task_dead, which only fires
 *    after B reaches do_task_dead, which it can't until zap returns,
 *    which it can't until pid_allocated drops to 1, which won't
 *    happen until A reaps C — but A is stuck in rmdir.
 *
 * Run as root.
 */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static void die(const char *msg)
{
	perror(msg);
	exit(1);
}

static void write_str(const char *path, const char *s)
{
	int fd = open(path, O_WRONLY);

	if (fd < 0)
		die(path);
	if (write(fd, s, strlen(s)) < 0)
		die(path);
	close(fd);
}

int main(int argc, char **argv)
{
	const char *cgroot = (argc > 1) ? argv[1] : "/sys/fs/cgroup/drain-min";
	char cginner[256], path[300], buf[32];

	snprintf(cginner, sizeof(cginner), "%s/inner", cgroot);
	if (mkdir(cgroot, 0755) < 0 && errno != EEXIST)
		die("mkdir root");
	if (mkdir(cginner, 0755) < 0 && errno != EEXIST)
		die("mkdir inner");

	snprintf(path, sizeof(path), "%s/cgroup.procs", cginner);
	snprintf(buf, sizeof(buf), "%d", getpid());
	write_str(path, buf);

	if (unshare(CLONE_NEWPID) < 0)
		die("unshare CLONE_NEWPID");

	pid_t b = fork();
	if (b < 0)
		die("fork B");
	if (b == 0) {
		/* B: PID 1 of new pidns */
		pause();
		_exit(0);
	}

	pid_t c = fork();
	if (c < 0)
		die("fork C");
	if (c == 0) {
		/* C: PID 2 of new pidns, host parent is A */
		pause();
		_exit(0);
	}

	snprintf(path, sizeof(path), "%s/cgroup.procs", cgroot);
	write_str(path, buf);

	fprintf(stderr, "A: B host pid=%d, C host pid=%d\n", b, c);

	/* Briefly verify pidns membership */
	for (int i = 0; i < 2; i++) {
		pid_t p = (i == 0) ? b : c;
		char dpath[64], dbuf[256];

		snprintf(dpath, sizeof(dpath), "/proc/%d/status", p);
		int fd = open(dpath, O_RDONLY);
		if (fd >= 0) {
			ssize_t n = read(fd, dbuf, sizeof(dbuf) - 1);
			close(fd);
			if (n > 0) {
				dbuf[n] = 0;
				char *l = strstr(dbuf, "NSpid:");
				if (l) {
					char *e = strchr(l, '\n');
					if (e)
						*e = 0;
					fprintf(stderr, "  pid=%d %s\n", p, l);
				}
			}
		}
	}

	/*
	 * SIGKILL both. After this, B starts zap_pid_ns_processes; C
	 * dies and becomes a zombie waiting for A's wait4.
	 */
	kill(b, SIGKILL);
	kill(c, SIGKILL);
	usleep(500000);

	fprintf(stderr, "A: rmdir(%s) — wedges if bug present "
		"(deliberately NOT wait4-ing C)\n", cginner);
	int rc = rmdir(cginner);
	int saved_errno = errno;
	fprintf(stderr, "A: rmdir returned %d (errno=%d %s)\n", rc,
		saved_errno, strerror(saved_errno));

	/* For cleanup if rmdir succeeded */
	waitpid(c, NULL, WNOHANG);
	waitpid(b, NULL, WNOHANG);
	rmdir(cgroot);
	return rc;
}
----- end min-repro.c -----
Thread overview: 8+ messages
2026-04-29 9:21 [REGRESSION] 6.9.11: systemd hangs in cgroup_drain_dying during cleanup after podman operations Martin Pitt
2026-04-29 16:21 ` Tejun Heo
2026-04-29 21:15 ` Tejun Heo [this message]
2026-04-30 6:15 ` Martin Pitt
2026-05-01 2:29 ` [PATCH] cgroup: Defer css percpu_ref kill on rmdir until cgroup is depopulated Tejun Heo
2026-05-03 19:30 ` kernel test robot
2026-05-03 20:15 ` kernel test robot
2026-05-03 22:45 ` kernel test robot