Date: Wed, 29 Apr 2026 11:21:07 +0200
From: Martin Pitt
To: regressions@lists.linux.dev
Cc: cgroups@vger.kernel.org, tj@kernel.org, lizefan.x@bytedance.com, hannes@cmpxchg.org
Subject: [REGRESSION] 6.9.11: systemd hangs in cgroup_drain_dying during cleanup after podman operations

Hello,

Our cockpit tests found a kernel regression introduced between 6.9.10 (working)
and 6.9.11 (broken) that causes a system hang during cgroup cleanup after
podman container operations. I have kept notes in
https://github.com/cockpit-project/bots/pull/8970#issuecomment-4342147158 ,
but I am now at my wits' end as to how to squeeze more information out of this.

=== Summary ===

When running podman REST API operations on rootless containers followed by user
session cleanup (loginctl/pkill), systemd (pid 1) gets stuck in
cgroup_drain_dying trying to remove an empty cgroup. After that, I am:

- Unable to run commands that access /proc (ps, top, lsns, ls /proc, etc.)
- Unable to create new SSH sessions or VT logins
- If I previously logged into the QEMU VT, that login session remains more or
  less functional, except that most commands cannot be run

=== Kernel Versions ===

- Last known working: 6.9.10
- Broken: 6.9.11 (openSUSE Tumbleweed), 6.9.13 (Fedora 44), 6.9.14 (Fedora 44), Ubuntu 26.04 (7.0.0)

=== Stack Trace ===

From the sysrq-trigger task dump, systemd is stuck in:

[  207.958946] task:systemd    state:D stack:0     pid:1     tgid:1     ppid:0
[  207.959734] Call Trace:
[  207.960117]  <TASK>
[  207.960333]  __schedule+0x2b2/0x5d0
[  207.960603]  schedule+0x27/0x80
[  207.960945]  cgroup_drain_dying+0xef/0x1a0
[  207.961287]  ? __pfx_autoremove_wake_function+0x10/0x10
[  207.961639]  cgroup_rmdir+0x37/0x100
[  207.961945]  kernfs_iop_rmdir+0x6a/0xd0
[  207.962239]  vfs_rmdir+0x154/0x270
[  207.962486]  do_rmdir+0x201/0x280
[  207.962723]  __x64_sys_unlinkat+0x8c/0xd0

=== Observations ===

- /sys/fs/cgroup/user.slice/user-1000.slice/cgroup.procs was empty, indicating
  that all processes were killed but the cgroup itself cannot be removed
- Multiple zombie processes are present and cannot be reaped (user@1000.service
  systemd, podman, conmon processes)
- The RCU subsystem appears healthy (rcu_exp_gp_kthr in S state)

=== Reproducer ===

The bug is triggered by a specific sequence of podman REST API operations on
rootless containers, followed by user cleanup. The reproducer is part of the
cockpit-podman test suite. I created a branch where I reduced the test to the
absolute minimum, and also replaced as many UI clicks as possible with shell
operations (all but one):
https://github.com/martinpitt/cockpit-podman/blob/kernel-hang/test/check-application#L1486

Sequence (a rough shell transcription follows below):

1. Create and stop a rootless container as the admin user
2. Call podman REST API lifecycle operations: start → restart → stop
3. Create an exec session (console/TTY connection) via the REST API
4. Start the container again via the REST API
5. Cleanup: loginctl terminate-user admin; loginctl kill-user admin; pkill -9 -u admin
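For reference, this is roughly what that sequence looks like as plain shell
commands against the rootless podman socket, run as the admin user. This is
only a sketch from memory of the libpod REST API, not a verbatim copy of the
test: the image, API version prefix and exec payload are assumptions on my
side, and (as noted below) a standalone script of this shape has not reproduced
the hang for me:

  # make the rootless podman REST API socket available
  systemctl --user start podman.socket
  SOCK=$XDG_RUNTIME_DIR/podman/podman.sock
  API=http://d/v4.0.0/libpod

  # 1. create and stop a rootless container (image/command are placeholders)
  podman run -d --name swamped-crate quay.io/libpod/busybox sleep 10000
  podman stop swamped-crate

  # 2. lifecycle operations through the REST API: start -> restart -> stop
  curl -s --unix-socket "$SOCK" -X POST "$API/containers/swamped-crate/start"
  curl -s --unix-socket "$SOCK" -X POST "$API/containers/swamped-crate/restart"
  curl -s --unix-socket "$SOCK" -X POST "$API/containers/swamped-crate/stop"

  # 3. create an exec (console/TTY) session via the REST API
  curl -s --unix-socket "$SOCK" -X POST \
      -H 'Content-Type: application/json' \
      -d '{"Cmd": ["/bin/sh"], "Tty": true, "AttachStdin": true, "AttachStdout": true, "AttachStderr": true}' \
      "$API/containers/swamped-crate/exec"

  # 4. start the container again via the REST API
  curl -s --unix-socket "$SOCK" -X POST "$API/containers/swamped-crate/start"

  # 5. cleanup (run as root)
  loginctl terminate-user admin
  loginctl kill-user admin
  pkill -9 -u admin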
Using podman CLI commands (e.g. "podman start swamped-crate") instead of the
REST API does NOT trigger the hang; it only happens with the REST API. That may
be because of the different process layout, or just sheer timing -- eventually,
both the CLI and the API should result in the same actual cgroup/container
operations on the podman side.

The bug is very timing-sensitive. I attempted to create a standalone shell
script reproducer, but failed; the test always passes with that. Even with the
original cockpit-podman integration test the failure is unreliable: it can hang
on the first iteration, and most of the time it fails within 5 runs, but I have
had stretches where 50+ iterations passed before the hang happened.

=== Full debug output ===

The GitHub PR comment above links to the full dmesg log. Direct link:
https://github.com/user-attachments/files/27195205/dmesg-cgrouphang.txt

This covers the initial boot up to the hang, and then the outputs of the sysrq
task dump (t), memory info (m), and blocked tasks (w).

=== Additional Notes ===

In one early test run, a different hang pattern was observed where
rcu_exp_gp_kthr was in D state with a process stuck in
synchronize_rcu_expedited during namespace cleanup, but this variant has not
reproduced in subsequent runs. The cgroup cleanup deadlock appears to be the
primary manifestation.

This is my first (non-trivial) kernel bug report, so please bear with me. I
normally stay firmly in userland.

Thanks,

Martin Pitt
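P.S. In case it helps whoever picks this up: even with /proc inaccessible, the
state of the stuck cgroup should still be readable through cgroupfs from the
surviving VT session. This is a suggestion on my side rather than output from
the hung machine; the files are standard cgroup v2 interface files, and the
slice path is the one from the Observations section above:

  # the cgroup that systemd is trying to rmdir
  cat /sys/fs/cgroup/user.slice/user-1000.slice/cgroup.procs    # was empty
  cat /sys/fs/cgroup/user.slice/user-1000.slice/cgroup.events   # populated/frozen state
  # descendant counters, including dying (removed but not yet released) cgroups
  cat /sys/fs/cgroup/user.slice/user-1000.slice/cgroup.stat
  cat /sys/fs/cgroup/cgroup.stat                                # system-wide counters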