From: Tejun Heo <tj@kernel.org>
To: Martin Pitt <martin@piware.de>
Cc: regressions@lists.linux.dev, cgroups@vger.kernel.org,
lizefan.x@bytedance.com, hannes@cmpxchg.org,
Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Subject: Re: [REGRESSION] 6.9.11: systemd hangs in cgroup_drain_dying during cleanup after podman operations
Date: Wed, 29 Apr 2026 11:15:03 -1000 [thread overview]
Message-ID: <35e0670adb4abeab13da2c321582af9f@kernel.org> (raw)
In-Reply-To: <f19d08689301f9cc0211e6273f833246@kernel.org>
Hello,
I think I have the mechanism. The deadlock chains three things
together; call-chain sketches follow the list.
1. Host PID 1 (systemd) is doing rmdir on user-1001.slice. rmdir
   enters cgroup_drain_dying(), which waits until nr_populated_csets
   drops to 0. The wait was added in 1b164b876c36 ("cgroup: Wait for
   dying tasks to leave on rmdir") so that rmdir doesn't succeed
   while dying tasks are still on the cgroup's css_set - the
   controller invariant being kept is "no tasks running in an
   offlined css" (see d245698d727a for where that got established on
   the dying-task side).
2. nr_populated_csets only drops when cgroup_task_dead() runs, which
happens from finish_task_switch() after the dying task's
do_task_dead() - i.e., after the very last context switch out of
the task.
3. The container's PID 1 (whatever the entrypoint runs) is in
do_exit() but parked in zap_pid_ns_processes' second wait loop:
   for (;;) {
           set_current_state(TASK_INTERRUPTIBLE);
           if (pid_ns->pid_allocated == init_pids)
                   break;
           schedule();
   }
pid_allocated stays > init_pids because at least one struct pid in
the namespace is still in the idr - i.e., its task hasn't been
reaped (release_task -> free_pid hasn't run).
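For reference, the three legs as (simplified) call chains. The
cgroup_drain_dying() body here is paraphrased from the description in
(1), not quoted from the tree:

  /* leg 1: the rmdir side */
  cgroup_rmdir()
    -> cgroup_drain_dying()
         /* roughly: wait_event(..., !cgrp->nr_populated_csets) */

  /* leg 2: the only path that drops nr_populated_csets */
  do_exit()
    -> do_task_dead()                 /* final __schedule(), never returns */
         -> context_switch()
              -> finish_task_switch() /* runs on the next task's stack */
                   -> cgroup_task_dead()   /* nr_populated_csets-- */

  /* leg 3: the only path that drops pid_allocated */
  wait4() in the host-side parent
    -> release_task()
         -> free_pid()                /* idr_remove() + pid_allocated-- */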
The unreaped task is one of the container helpers - in the podman
case, an exec session child whose host-side parent (conmon, the user
manager, etc.) died during the pkill cascade, so the child got
re-parented to host PID 1. PID 1 was supposed to wait4() it as part
of normal reaping, but PID 1 is blocked in (1). Chicken-and-egg.
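Spelled out as a wait cycle:

  rmdir(user-1001.slice)   waits on   nr_populated_csets == 0
  nr_populated_csets       waits on   container PID 1 reaching do_task_dead()
  container PID 1          waits on   pid_allocated == init_pids
  pid_allocated            waits on   host PID 1 wait4()-ing the zombie
  host PID 1               is stuck   in rmdir(user-1001.slice)

Every edge is a hard dependency, so nothing can make progress.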
Why fuse-overlayfs masks it: with fuse-overlayfs the container
teardown finishes fast enough that the container's PID 1 reaches
do_task_dead before host PID 1 has had time to become the reaper of
the orphan. With kernel-overlayfs the teardown drags long enough that
the reparent + drain wait land in the iterator-hidden window
simultaneously, and once they do, the wedge is permanent.
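The two timelines, schematically (my reading of the traces, so take
the exact interleaving with a grain of salt):

  fuse-overlayfs (fast teardown):
    container PID 1 dies -> cgroup_task_dead() -> csets drained
    ... only later does host PID 1 start the rmdir; drain returns at once

  kernel-overlayfs (slow teardown):
    orphan re-parents to host PID 1
    host PID 1 enters rmdir -> cgroup_drain_dying()   [can no longer wait4()]
    container PID 1 parks in zap_pid_ns_processes()   [can no longer die]
    -> permanent wedge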
I have a minimal C reproducer (~150 lines, no podman, no daemons,
runs as root) that hits the exact same code paths. The shape: A
unshares a pid namespace and forks B (PID 1 of the namespace) and C
(PID 2; A is C's host parent, NOT B). Both run in an inner cgroup. A
SIGKILLs both, then calls rmdir() without wait4()-ing C. C becomes a
zombie pinning pid_allocated; B parks in zap_pid_ns_processes; A's
rmdir parks in cgroup_drain_dying. Verified on stock 6.19.14 and
mainline 7.1-rc1+, also under virtme-ng. Source inlined at the bottom
of this mail.
For the fix, I'm still thinking. The rough direction is to split
cgroup rmdir: the user-visible side (directory unlink, cgroup.procs
disappears, the cgroup looks gone to userspace) completes as soon as
the iterator goes empty, while the backend teardown (kill_css /
css_offline) gets deferred until nr_populated_csets actually drops to
0. That keeps the "no running tasks in an offlined css" invariant
intact while letting the rmdir syscall return so host PID 1 can
resume reaping. Need to look at what shapes this cleanly without
breaking other cgroup core expectations.
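To make the direction concrete, a very rough sketch - placeholder
names (cgroup_mark_dying, cgroup_finish_destroy), no locking, not
against any actual tree:

  /* user-visible half: returns as soon as the iterator goes empty */
  static int cgroup_rmdir(struct kernfs_node *kn)
  {
          /* dir unhashed, cgroup.procs gone - looks removed to
           * userspace; deliberately do NOT kill_css() here */
          cgroup_mark_dying(cgrp);        /* placeholder */
          return 0;                       /* reaper unblocks here */
  }

  /* backend half: driven from cgroup_task_dead() once the last dying
   * task leaves and nr_populated_csets hits 0 */
  static void cgroup_finish_destroy(struct cgroup *cgrp)
  {
          struct cgroup_subsys_state *css;
          int ssid;

          for_each_css(css, ssid, cgrp)
                  kill_css(css);  /* css_offline still fires with no
                                   * tasks running -> invariant holds */
  }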
Thanks for the report.
--
tejun
----- min-repro.c -----
/*
* Minimal reproducer for cgroup_drain_dying / zap_pid_ns_processes
* deadlock. No podman, no daemons.
*
* A — host process. Forks B (PID 1 of new pidns) and C (PID 2 of
* same pidns, but with A as host-side parent). After unshare,
* A's nsproxy->pid_ns_for_children is the level-1 namespace,
* so each subsequent fork lands a new task there.
* B — A's first child after unshare. PID 1 of new pidns.
* C — A's second child. PID 2 of same pidns, but parented to A
* at the host level (not to B). When C dies, SIGCHLD goes to
* A, NOT to B's zap_pid_ns_processes loop.
*
* Sequence:
* 1. A puts itself in inner cgroup; B and C inherit on fork.
* 2. A unshares CLONE_NEWPID, forks B and C.
* 3. A moves out of inner so inner can be rmdir'd later.
* 4. A SIGKILLs B and C. Both are killed.
* 5. B's exit invokes zap_pid_ns_processes. zap walks idr, sees C
* (already SIGKILL'd), but C is NOT B's child — kernel_wait4(-1)
* returns -ECHILD immediately. zap reaches second loop:
* `pid_allocated > 1` because C's struct pid is still in idr
* (C is zombie awaiting A's wait4). zap sleeps.
* 6. A calls rmdir(inner) WITHOUT calling wait4. cgroup_drain_dying
* sees nr_populated_csets > 0 (B still on cset->tasks; C may or
* may not be), iterator empty (B has PF_EXITING && live==0).
* Drain sleeps waiting for cgroup_task_dead, which only fires
* after B reaches do_task_dead, which it can't until zap returns,
* which it can't until pid_allocated drops to 1, which won't
* happen until A reaps C — but A is stuck in rmdir.
*
* Run as root.
*/
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
static void die(const char *msg)
{
perror(msg);
exit(1);
}
static void write_str(const char *path, const char *s)
{
int fd = open(path, O_WRONLY);
if (fd < 0)
die(path);
if (write(fd, s, strlen(s)) < 0)
die(path);
close(fd);
}
int main(int argc, char **argv)
{
const char *cgroot = (argc > 1) ? argv[1] : "/sys/fs/cgroup/drain-min";
char cginner[256], path[300], buf[32];
snprintf(cginner, sizeof(cginner), "%s/inner", cgroot);
	if (mkdir(cgroot, 0755) < 0 && errno != EEXIST)
		die("mkdir cgroot");
	if (mkdir(cginner, 0755) < 0 && errno != EEXIST)
		die("mkdir inner");
snprintf(path, sizeof(path), "%s/cgroup.procs", cginner);
snprintf(buf, sizeof(buf), "%d", getpid());
write_str(path, buf);
if (unshare(CLONE_NEWPID) < 0)
die("unshare CLONE_NEWPID");
pid_t b = fork();
if (b < 0)
die("fork B");
if (b == 0) {
/* B: PID 1 of new pidns */
pause();
_exit(0);
}
pid_t c = fork();
if (c < 0)
die("fork C");
if (c == 0) {
/* C: PID 2 of new pidns, host parent is A */
pause();
_exit(0);
}
	/* Step 3: move A out of inner so inner can be rmdir'd. */
	snprintf(path, sizeof(path), "%s/cgroup.procs", cgroot);
	write_str(path, buf);
fprintf(stderr, "A: B host pid=%d, C host pid=%d\n", b, c);
/* Briefly verify pidns membership */
for (int i = 0; i < 2; i++) {
pid_t p = (i == 0) ? b : c;
char dpath[64], dbuf[256];
snprintf(dpath, sizeof(dpath), "/proc/%d/status", p);
int fd = open(dpath, O_RDONLY);
if (fd >= 0) {
ssize_t n = read(fd, dbuf, sizeof(dbuf) - 1);
close(fd);
if (n > 0) {
dbuf[n] = 0;
char *l = strstr(dbuf, "NSpid:");
if (l) {
char *e = strchr(l, '\n');
if (e) *e = 0;
fprintf(stderr, " pid=%d %s\n", p, l);
}
}
}
}
/* SIGKILL both. After this, B starts zap_pid_ns_processes; C
* dies and becomes a zombie waiting for A's wait4. */
kill(b, SIGKILL);
kill(c, SIGKILL);
	/* Give B time to reach zap's second wait loop before rmdir. */
	usleep(500000);
fprintf(stderr, "A: rmdir(%s) — wedges if bug present "
"(deliberately NOT wait4-ing C)\n", cginner);
int rc = rmdir(cginner);
int saved_errno = errno;
fprintf(stderr, "A: rmdir returned %d (errno=%d %s)\n", rc,
saved_errno, strerror(saved_errno));
/* For cleanup if rmdir succeeded */
waitpid(c, NULL, WNOHANG);
waitpid(b, NULL, WNOHANG);
rmdir(cgroot);
return rc;
}
----- end min-repro.c -----
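To build and run (the cgroup paths are just the defaults the program
picks; pass a different root as argv[1] if needed):

  $ gcc -Wall -O2 -o min-repro min-repro.c
  # ./min-repro

On an affected kernel the last thing printed is the "A: rmdir(...)"
line and the process hangs; A's /proc/<pid>/stack should show it
parked under cgroup_drain_dying. On a fixed kernel rmdir returns 0
and the program cleans up and exits.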