Hello,

I think I have the mechanism. The deadlock chains three things together:

1. Host PID 1 (systemd) is doing rmdir on user-1001.slice. rmdir enters
   cgroup_drain_dying(), which waits until nr_populated_csets drops to 0.
   The wait was added in 1b164b876c36 ("cgroup: Wait for dying tasks to
   leave on rmdir") so that rmdir doesn't succeed while dying tasks are
   still on the cgroup's css_set - the controller invariant being kept
   is "no tasks running in an offlined css" (see d245698d727a for where
   that got established on the dying-task side).

2. nr_populated_csets only drops when cgroup_task_dead() runs, which
   happens from finish_task_switch() after the dying task's
   do_task_dead() - i.e. after the very last context switch out of the
   task.

3. The container's PID 1 (whatever the entrypoint runs) is in do_exit()
   but parked in zap_pid_ns_processes()' second wait loop:

	for (;;) {
		set_current_state(TASK_INTERRUPTIBLE);
		if (pid_ns->pid_allocated == init_pids)
			break;
		schedule();
	}

   pid_allocated stays > init_pids because at least one struct pid in
   the namespace is still in the idr - i.e. its task hasn't been reaped
   (release_task() -> free_pid() hasn't run).

The unreaped task is one of the container helpers - in the podman case,
an exec session child whose host parent (conmon, the user manager, etc.)
died during the pkill cascade, so it got re-parented to host PID 1.
PID 1 was supposed to wait4() it as part of normal reaping, but PID 1 is
blocked in (1). Chicken-and-egg.

Why fuse-overlayfs masks it: with fuse-overlayfs the container teardown
finishes fast enough that the container's PID 1 reaches do_task_dead()
before host PID 1 has had time to become the reaper of the orphan. With
kernel overlayfs the teardown drags long enough that the reparent and
the drain wait land in the iterator-hidden window simultaneously, and
once they do, the wedge is permanent.

I have a minimal C reproducer (~150 lines, no podman, no daemons, runs
as root) that hits the exact same code paths.
The shape: A unshares a pid namespace and forks B (PID 1 of the
namespace) and C (PID 2; A is C's host parent, NOT B). Both run in an
inner cgroup. A SIGKILLs both, then calls rmdir() without wait4()-ing C.
C becomes a zombie pinning pid_allocated; B parks in
zap_pid_ns_processes(); A's rmdir parks in cgroup_drain_dying().

Verified on stock 6.19.14 and mainline 7.1-rc1+, also under virtme-ng.
Source inlined at the bottom of this mail.

For the fix, I'm still thinking. The rough direction is to split cgroup
rmdir: the user-visible side (directory unlink, cgroup.procs disappears,
the cgroup looks gone to userspace) completes as soon as the iterator
goes empty, while the backend teardown (kill_css / css_offline) gets
deferred until nr_populated_csets actually drops to 0. That keeps the
"no running tasks in an offlined css" invariant intact while letting the
rmdir syscall return so host PID 1 can resume reaping. Need to look at
what shapes this cleanly without breaking other cgroup core
expectations.

Thanks for the report.

-- 
tejun

----- min-repro.c -----
/*
 * Minimal reproducer for cgroup_drain_dying / zap_pid_ns_processes
 * deadlock. No podman, no daemons.
 *
 * A - host process. Forks B (PID 1 of new pidns) and C (PID 2 of
 *     same pidns, but with A as host-side parent). After unshare,
 *     A's nsproxy->pid_ns_for_children is the level-1 namespace,
 *     so each subsequent fork lands a new task there.
 * B - A's first child after unshare. PID 1 of new pidns.
 * C - A's second child. PID 2 of same pidns, but parented to A
 *     at the host level (not to B). When C dies, SIGCHLD goes to
 *     A, NOT to B's zap_pid_ns_processes loop.
 *
 * Sequence:
 *   1. A puts itself in inner cgroup; B and C inherit on fork.
 *   2. A unshares CLONE_NEWPID, forks B and C.
 *   3. A moves out of inner so inner can be rmdir'd later.
 *   4. A SIGKILLs B and C. Both are killed.
 *   5. B's exit invokes zap_pid_ns_processes. zap walks idr, sees C
 *      (already SIGKILL'd), but C is NOT B's child - kernel_wait4(-1)
 *      returns -ECHILD immediately. zap reaches second loop:
 *      pid_allocated > 1 because C's struct pid is still in idr
 *      (C is zombie awaiting A's wait4). zap sleeps.
 *   6. A calls rmdir(inner) WITHOUT calling wait4. cgroup_drain_dying
 *      sees nr_populated_csets > 0 (B still on cset->tasks; C may or
 *      may not be), iterator empty (B has PF_EXITING && live==0).
 *      Drain sleeps waiting for cgroup_task_dead, which only fires
 *      after B reaches do_task_dead, which it can't until zap returns,
 *      which it can't until pid_allocated drops to 1, which won't
 *      happen until A reaps C - but A is stuck in rmdir.
 *
 * Run as root.
 */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>

static void die(const char *msg)
{
	perror(msg);
	exit(1);
}

static void write_str(const char *path, const char *s)
{
	int fd = open(path, O_WRONLY);

	if (fd < 0)
		die(path);
	if (write(fd, s, strlen(s)) < 0)
		die(path);
	close(fd);
}

int main(int argc, char **argv)
{
	const char *cgroot = (argc > 1) ? argv[1] : "/sys/fs/cgroup/drain-min";
	char cginner[256], path[300], buf[32];

	snprintf(cginner, sizeof(cginner), "%s/inner", cgroot);
	mkdir(cgroot, 0755);
	if (mkdir(cginner, 0755) < 0 && errno != EEXIST)
		die("mkdir inner");

	snprintf(path, sizeof(path), "%s/cgroup.procs", cginner);
	snprintf(buf, sizeof(buf), "%d", getpid());
	write_str(path, buf);

	if (unshare(CLONE_NEWPID) < 0)
		die("unshare CLONE_NEWPID");

	pid_t b = fork();
	if (b < 0)
		die("fork B");
	if (b == 0) {
		/* B: PID 1 of new pidns */
		pause();
		_exit(0);
	}

	pid_t c = fork();
	if (c < 0)
		die("fork C");
	if (c == 0) {
		/* C: PID 2 of new pidns, host parent is A */
		pause();
		_exit(0);
	}

	snprintf(path, sizeof(path), "%s/cgroup.procs", cgroot);
	write_str(path, buf);

	fprintf(stderr, "A: B host pid=%d, C host pid=%d\n", b, c);

	/* Briefly verify pidns membership */
	for (int i = 0; i < 2; i++) {
		pid_t p = (i == 0) ? b : c;
		char dpath[64], dbuf[256];

		snprintf(dpath, sizeof(dpath), "/proc/%d/status", p);
		int fd = open(dpath, O_RDONLY);
		if (fd >= 0) {
			ssize_t n = read(fd, dbuf, sizeof(dbuf) - 1);
			close(fd);
			if (n > 0) {
				dbuf[n] = 0;
				char *l = strstr(dbuf, "NSpid:");
				if (l) {
					char *e = strchr(l, '\n');
					if (e)
						*e = 0;
					fprintf(stderr, "  pid=%d %s\n", p, l);
				}
			}
		}
	}

	/* SIGKILL both. After this, B starts zap_pid_ns_processes; C
	 * dies and becomes a zombie waiting for A's wait4. */
	kill(b, SIGKILL);
	kill(c, SIGKILL);
	usleep(500000);

	fprintf(stderr, "A: rmdir(%s) - wedges if bug present "
		"(deliberately NOT wait4-ing C)\n", cginner);
	int rc = rmdir(cginner);
	int saved_errno = errno;
	fprintf(stderr, "A: rmdir returned %d (errno=%d %s)\n",
		rc, saved_errno, strerror(saved_errno));

	/* For cleanup if rmdir succeeded */
	waitpid(c, NULL, WNOHANG);
	waitpid(b, NULL, WNOHANG);
	rmdir(cgroot);
	return rc;
}
----- end min-repro.c -----