Linux cgroups development
* [REGRESSION] 6.9.11: systemd hangs in cgroup_drain_dying during cleanup after podman operations
@ 2026-04-29  9:21 Martin Pitt
  2026-04-29 16:21 ` Tejun Heo
  2026-05-01  2:29 ` [PATCH] cgroup: Defer css percpu_ref kill on rmdir until cgroup is depopulated Tejun Heo
  0 siblings, 2 replies; 8+ messages in thread
From: Martin Pitt @ 2026-04-29  9:21 UTC
  To: regressions; +Cc: cgroups, tj, lizefan.x, hannes

Hello,

Our cockpit tests found a kernel regression introduced between 6.9.10 (working)
and 6.9.11 (broken) that causes a system hang during cgroup cleanup after
podman container operations. I've kept notes in
https://github.com/cockpit-project/bots/pull/8970#issuecomment-4342147158, but
by now I have run out of ideas for how to squeeze more information out of this.

=== Summary ===

When running podman REST API operations on rootless containers followed by user
session cleanup (loginctl/pkill), systemd (pid 1) gets stuck in
cgroup_drain_dying trying to remove an empty cgroup. After that:

- I cannot run commands that access /proc (ps, top, lsns, ls /proc, etc.)
- I cannot create new SSH sessions or VT logins
- A login session opened earlier on the QEMU VT remains more or less
  functional, except that most commands no longer run

=== Kernel Versions ===

- Last known working: 6.9.10
- Broken: 6.9.11 (openSUSE Tumbleweed), 6.9.13 (Fedora 44), 6.9.14 (Fedora 44),
  7.0.0 (Ubuntu 26.04)

=== Stack Trace ===

From the sysrq-trigger task dump, systemd is stuck in:

[  207.958946] task:systemd         state:D stack:0     pid:1     tgid:1     ppid:0
[  207.959734] Call Trace:
[  207.960117]  <TASK>
[  207.960333]  __schedule+0x2b2/0x5d0
[  207.960603]  schedule+0x27/0x80
[  207.960945]  cgroup_drain_dying+0xef/0x1a0
[  207.961287]  ? __pfx_autoremove_wake_function+0x10/0x10
[  207.961639]  cgroup_rmdir+0x37/0x100
[  207.961945]  kernfs_iop_rmdir+0x6a/0xd0
[  207.962239]  vfs_rmdir+0x154/0x270
[  207.962486]  do_rmdir+0x201/0x280
[  207.962723]  __x64_sys_unlinkat+0x8c/0xd0
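The trace ends in unlinkat() with AT_REMOVEDIR, i.e. systemd removing the
cgroup directory through kernfs. For illustration, the same kernel path
(vfs_rmdir on the kernfs directory) can be reached from a shell; the exact
directory is an assumption based on the observations below:

  # presumably blocks the same way once pid 1 is stuck in cgroup_drain_dying;
  # shown only to make the triggering operation concrete
  rmdir /sys/fs/cgroup/user.slice/user-1000.slice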

=== Observations ===

- /sys/fs/cgroup/user.slice/user-1000.slice/cgroup.procs was empty, indicating
  that all processes had been killed but the cgroup itself could not be removed
  (see the inspection commands sketched below)
- Multiple zombie processes were present and could not be reaped (the
  user@1000.service systemd, podman, and conmon processes)
- The RCU subsystem appears healthy (rcu_exp_gp_kthr in S state)
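For anyone poking at a hung machine, the relevant state can be read through
the standard cgroup v2 interface files (the paths are the ones from this
report; cgroup.events and cgroup.stat are part of the documented cgroup v2
interface):

  # does the kernel still consider the cgroup populated?
  cat /sys/fs/cgroup/user.slice/user-1000.slice/cgroup.procs
  cat /sys/fs/cgroup/user.slice/user-1000.slice/cgroup.events
  # nr_dying_descendants counts cgroups that were removed but whose css
  # has not been released yet
  cat /sys/fs/cgroup/user.slice/cgroup.stat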

=== Reproducer ===

The bug is triggered by a specific sequence of podman REST API operations on
rootless containers, followed by user cleanup. The reproducer is part of the
cockpit-podman test suite. I created a branch where I reduced the test to the
absolute minimum, and also replaced as many UI clicks as possible with shell
operations (all but one):

  https://github.com/martinpitt/cockpit-podman/blob/kernel-hang/test/check-application#L1486

Sequence (a rough shell equivalent is sketched after this list):
1. Create and stop a rootless container as the admin user
2. Call podman REST API lifecycle operations: start → restart → stop
3. Create an exec session (console/TTY connection) via REST API
4. Start the container again via REST API
5. Cleanup: loginctl terminate-user admin; loginctl kill-user admin; pkill -9 -u admin
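For illustration, here is a rough shell equivalent of that sequence. To be
clear, a standalone script like this does NOT reproduce the hang for me (see
below); the container name is the one from the test, while the image, API
version, and exec parameters are placeholders, not verbatim from the test:

  # --- as the "admin" user; the API socket must be running:
  #     systemctl --user start podman.socket
  sock=$XDG_RUNTIME_DIR/podman/podman.sock
  base=http://d/v5.0.0/libpod/containers/swamped-crate
  api() { curl -s --unix-socket "$sock" -X POST "$base/$1"; }

  # (1) create and stop a rootless container; details differ in the real test
  podman create --name swamped-crate quay.io/libpod/busybox sleep 10000
  podman start swamped-crate
  podman stop swamped-crate

  # (2) lifecycle operations through the REST API
  api start
  api restart
  api stop

  # (3) create an exec session (the console/TTY connection)
  curl -s --unix-socket "$sock" -X POST -H 'Content-Type: application/json' \
       -d '{"Cmd": ["/bin/sh"], "AttachStdin": true, "Tty": true}' "$base/exec"

  # (4) start the container again
  api start

  # --- as root ---
  # (5) user cleanup; this is where systemd (pid 1) gets stuck
  loginctl terminate-user admin
  loginctl kill-user admin
  pkill -9 -u admin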

Using podman CLI commands (e.g. "podman start swamped-crate") instead of the
REST API does NOT trigger the hang. That may be because of the different
process layout, or just sheer timing -- eventually, both CLI and API should
result in the same actual cgroup/container operations on the podman side.

The bug is very timing-sensitive. I attempted to create a standalone shell
script reproducer, but failed: with that, the test always passes. Even the
original cockpit-podman integration test is unreliable: it can hang on the
first iteration, and most of the time it hangs within 5 runs, but I've had
stretches where 50+ iterations passed before the hang happened.
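To get those numbers, a retry loop along these lines works (run-reproducer is
a placeholder for the actual test invocation from the branch above; the
timeout is there because a hung run never finishes on its own):

  i=0
  # GNU timeout kills the run if the VM hangs instead of failing cleanly
  while timeout 15m run-reproducer; do
      i=$((i + 1))
      echo "iteration $i passed"
  done
  echo "hang or failure after $i passing iterations"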

=== Full debug output ===

The above GitHub PR comment links to the full dmesg log. Direct link:
https://github.com/user-attachments/files/27195205/dmesg-cgrouphang.txt

This covers the initial boot up to the hang, and then the output of the sysrq
task dump (t), memory info (m), and blocked tasks (w).
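For reference, those dumps were produced through the magic SysRq interface.
Writing to /proc/sysrq-trigger still worked here despite the /proc trouble;
the QEMU monitor's sendkey is an alternative when no shell is usable:

  # enable all SysRq functions, if not enabled already
  sysctl kernel.sysrq=1
  echo t > /proc/sysrq-trigger   # task dump
  echo m > /proc/sysrq-trigger   # memory info
  echo w > /proc/sysrq-trigger   # blocked (D state) tasks

  # alternative from the QEMU monitor prompt:
  #   sendkey alt-sysrq-t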

=== Additional Notes ===

In one early test run, a different hang pattern was observed where
rcu_exp_gp_kthr was in D state with a process stuck in
synchronize_rcu_expedited during namespace cleanup, but this variant has not
reproduced in subsequent runs. The cgroup cleanup deadlock appears to be the
primary manifestation.

This is my first (non-trivial) kernel bug report, so please bear with me; I
normally stay firmly in userland.

Thanks,

Martin Pitt


Thread overview: 8+ messages
2026-04-29  9:21 [REGRESSION] 6.9.11: systemd hangs in cgroup_drain_dying during cleanup after podman operations Martin Pitt
2026-04-29 16:21 ` Tejun Heo
2026-04-29 21:15   ` Tejun Heo
2026-04-30  6:15   ` Martin Pitt
2026-05-01  2:29 ` [PATCH] cgroup: Defer css percpu_ref kill on rmdir until cgroup is depopulated Tejun Heo
2026-05-03 19:30   ` kernel test robot
2026-05-03 20:15   ` kernel test robot
2026-05-03 22:45   ` kernel test robot
