Date: Wed, 29 Apr 2026 11:15:03 -1000
From: Tejun Heo
To: Martin Pitt
Cc: regressions@lists.linux.dev, cgroups@vger.kernel.org, lizefan.x@bytedance.com, hannes@cmpxchg.org, Sebastian Andrzej Siewior
Subject: Re: [REGRESSION] 6.9.11: systemd hangs in cgroup_drain_dying during cleanup after podman operations
Message-ID: <35e0670adb4abeab13da2c321582af9f@kernel.org>
X-Mailing-List: cgroups@vger.kernel.org

Hello,

I think I have the mechanism. The deadlock chains three things together.

1. Host PID 1 (systemd) is doing rmdir on user-1001.slice. rmdir enters
   cgroup_drain_dying(), which waits until nr_populated_csets drops to 0.
   The wait was added in 1b164b876c36 ("cgroup: Wait for dying tasks to
   leave on rmdir") so that rmdir doesn't succeed while dying tasks are
   still on the cgroup's css_set - the controller invariant being kept is
   "no tasks running in an offlined css" (see d245698d727a for where that
   got established on the dying-task side).

2. nr_populated_csets only drops when cgroup_task_dead() runs, which
   happens from finish_task_switch() after the dying task's
   do_task_dead() - i.e., after the very last context switch out of the
   task.

3. The container's PID 1 (whatever the entrypoint runs) is in do_exit()
   but parked in zap_pid_ns_processes()'s second wait loop:

        for (;;) {
                set_current_state(TASK_INTERRUPTIBLE);
                if (pid_ns->pid_allocated == init_pids)
                        break;
                schedule();
        }

   pid_allocated stays > init_pids because at least one struct pid in the
   namespace is still in the idr - i.e., its task hasn't been reaped
   (release_task -> free_pid hasn't run). The unreaped task is one of the
   container helpers - in the podman case, an exec session child - whose
   host-side parent died during the pkill cascade (conmon, the user
   manager, etc.), so it got re-parented to host PID 1. PID 1 was supposed
   to wait4() it as part of normal reaping, but PID 1 is blocked in (1).
   Chicken-and-egg.
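To spell the cycle out (same facts as points 1-3, just arranged as who
waits on whom):

        host PID 1:       rmdir -> cgroup_drain_dying(); waits for
                          nr_populated_csets == 0, i.e. for the container's
                          PID 1 to get through do_task_dead()
        container PID 1:  do_exit() -> zap_pid_ns_processes(); waits for
                          pid_allocated == init_pids, i.e. for the orphaned
                          helper to be reaped
        orphaned helper:  zombie, re-parented to host PID 1; waits for
                          host PID 1 to wait4() it - but host PID 1 is
                          parked in the rmdir above

Each leg waits on the next, so nothing ever moves.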
Why fuse-overlayfs masks it: with fuse-overlayfs the container teardown
finishes fast enough that the container's PID 1 reaches do_task_dead()
before host PID 1 has had time to become the reaper of the orphan. With
kernel overlayfs the teardown drags long enough that the reparent and the
drain wait land in the iterator-hidden window simultaneously, and once
they do, the wedge is permanent.

I have a minimal C reproducer (~150 lines, no podman, no daemons, runs as
root) that hits the exact same code paths. The shape: A unshares a pid
namespace and forks B (PID 1 of the namespace) and C (PID 2; A is C's
host parent, NOT B). Both run in an inner cgroup. A SIGKILLs both, then
calls rmdir() without wait4()-ing C. C becomes a zombie pinning
pid_allocated; B parks in zap_pid_ns_processes(); A's rmdir parks in
cgroup_drain_dying(). Verified on stock 6.19.14 and mainline 7.1-rc1+,
also under virtme-ng. Source inlined at the bottom of this mail.

For the fix, I'm still thinking. The rough direction is to split cgroup
rmdir: the user-visible side (directory unlink, cgroup.procs disappears,
the cgroup looks gone to userspace) completes as soon as the iterator
goes empty, while the backend teardown (kill_css / css_offline) gets
deferred until nr_populated_csets actually drops to 0. That keeps the "no
running tasks in an offlined css" invariant intact while letting the
rmdir syscall return so host PID 1 can resume reaping. Need to look at
what shapes this cleanly without breaking other cgroup core expectations.
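Roughly the shape I have in mind - emphatically hand-wavy pseudo-code,
not against the actual tree, and cgroup_mark_dying_and_unlink() is a
made-up placeholder name:

        /* in the rmdir path, instead of sleeping in cgroup_drain_dying(): */
        if (cgrp->nr_populated_csets) {
                /*
                 * Dying tasks are still attached to a css_set.  Make the
                 * cgroup invisible to userspace now (unlink the directory,
                 * return from the syscall), but leave kill_css() /
                 * css_offline() to be kicked off from cgroup_task_dead()
                 * once the last dying task has switched away.  The "no
                 * running tasks in an offlined css" invariant holds
                 * because offlining still waits for the count to reach
                 * zero.
                 */
                cgroup_mark_dying_and_unlink(cgrp);     /* placeholder */
                return 0;
        }

        /* already empty: keep the existing synchronous offline path */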
Thanks for the report.

--
tejun

----- min-repro.c -----
/*
 * Minimal reproducer for cgroup_drain_dying / zap_pid_ns_processes
 * deadlock. No podman, no daemons.
 *
 * A — host process. Forks B (PID 1 of new pidns) and C (PID 2 of
 *     same pidns, but with A as host-side parent). After unshare,
 *     A's nsproxy->pid_ns_for_children is the level-1 namespace,
 *     so each subsequent fork lands a new task there.
 * B — A's first child after unshare. PID 1 of new pidns.
 * C — A's second child. PID 2 of same pidns, but parented to A
 *     at the host level (not to B). When C dies, SIGCHLD goes to
 *     A, NOT to B's zap_pid_ns_processes loop.
 *
 * Sequence:
 *  1. A puts itself in inner cgroup; B and C inherit on fork.
 *  2. A unshares CLONE_NEWPID, forks B and C.
 *  3. A moves out of inner so inner can be rmdir'd later.
 *  4. A SIGKILLs B and C. Both are killed.
 *  5. B's exit invokes zap_pid_ns_processes. zap walks idr, sees C
 *     (already SIGKILL'd), but C is NOT B's child — kernel_wait4(-1)
 *     returns -ECHILD immediately. zap reaches second loop:
 *     `pid_allocated > 1` because C's struct pid is still in idr
 *     (C is zombie awaiting A's wait4). zap sleeps.
 *  6. A calls rmdir(inner) WITHOUT calling wait4. cgroup_drain_dying
 *     sees nr_populated_csets > 0 (B still on cset->tasks; C may or
 *     may not be), iterator empty (B has PF_EXITING && live==0).
 *     Drain sleeps waiting for cgroup_task_dead, which only fires
 *     after B reaches do_task_dead, which it can't until zap returns,
 *     which it can't until pid_allocated drops to 1, which won't
 *     happen until A reaps C — but A is stuck in rmdir.
 *
 * Run as root.
 */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static void die(const char *msg)
{
        perror(msg);
        exit(1);
}

static void write_str(const char *path, const char *s)
{
        int fd = open(path, O_WRONLY);

        if (fd < 0)
                die(path);
        if (write(fd, s, strlen(s)) < 0)
                die(path);
        close(fd);
}

int main(int argc, char **argv)
{
        const char *cgroot = (argc > 1) ? argv[1] : "/sys/fs/cgroup/drain-min";
        char cginner[256], path[300], buf[32];

        snprintf(cginner, sizeof(cginner), "%s/inner", cgroot);
        mkdir(cgroot, 0755);
        if (mkdir(cginner, 0755) < 0 && errno != EEXIST)
                die("mkdir inner");

        /* A joins the inner cgroup so B and C inherit it on fork */
        snprintf(path, sizeof(path), "%s/cgroup.procs", cginner);
        snprintf(buf, sizeof(buf), "%d", getpid());
        write_str(path, buf);

        if (unshare(CLONE_NEWPID) < 0)
                die("unshare CLONE_NEWPID");

        pid_t b = fork();
        if (b < 0)
                die("fork B");
        if (b == 0) {
                /* B: PID 1 of new pidns */
                pause();
                _exit(0);
        }

        pid_t c = fork();
        if (c < 0)
                die("fork C");
        if (c == 0) {
                /* C: PID 2 of new pidns, host parent is A */
                pause();
                _exit(0);
        }

        /* A moves back to the parent cgroup so inner can be rmdir'd */
        snprintf(path, sizeof(path), "%s/cgroup.procs", cgroot);
        write_str(path, buf);

        fprintf(stderr, "A: B host pid=%d, C host pid=%d\n", b, c);

        /* Briefly verify pidns membership */
        for (int i = 0; i < 2; i++) {
                pid_t p = (i == 0) ? b : c;
                char dpath[64], dbuf[256];

                snprintf(dpath, sizeof(dpath), "/proc/%d/status", p);
                int fd = open(dpath, O_RDONLY);
                if (fd >= 0) {
                        ssize_t n = read(fd, dbuf, sizeof(dbuf) - 1);
                        close(fd);
                        if (n > 0) {
                                dbuf[n] = 0;
                                char *l = strstr(dbuf, "NSpid:");
                                if (l) {
                                        char *e = strchr(l, '\n');
                                        if (e)
                                                *e = 0;
                                        fprintf(stderr, " pid=%d %s\n", p, l);
                                }
                        }
                }
        }

        /* SIGKILL both. After this, B starts zap_pid_ns_processes; C
         * dies and becomes a zombie waiting for A's wait4. */
        kill(b, SIGKILL);
        kill(c, SIGKILL);
        usleep(500000);

        fprintf(stderr, "A: rmdir(%s) — wedges if bug present "
                "(deliberately NOT wait4-ing C)\n", cginner);
        int rc = rmdir(cginner);
        int saved_errno = errno;
        fprintf(stderr, "A: rmdir returned %d (errno=%d %s)\n",
                rc, saved_errno, strerror(saved_errno));

        /* For cleanup if rmdir succeeded */
        waitpid(c, NULL, WNOHANG);
        waitpid(b, NULL, WNOHANG);
        rmdir(cgroot);
        return rc;
}
----- end min-repro.c -----
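Build and run: gcc -o min-repro min-repro.c, then run the binary as root
(it defines _GNU_SOURCE itself, no special flags needed). It assumes a
writable cgroup hierarchy at /sys/fs/cgroup; pass a different root as
argv[1] if yours lives elsewhere. On an affected kernel the final
"A: rmdir returned ..." line never prints and A sits in the drain wait;
on an unaffected kernel the rmdir should come back promptly.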