From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Mon, 23 Mar 2026 09:55:40 -1000
From: Tejun Heo
To: Sebastian Andrzej Siewior
Cc: cgroups@vger.kernel.org, linux-kernel@vger.kernel.org,
	Bert Karwatzki, Michal Koutny, kernel test robot
Subject: Re: [PATCH] cgroup: Wait for dying tasks to leave on rmdir
Message-ID: 
References: <20260323035806.724798-1-tj@kernel.org>
	<20260323113252.xsuwQA3z@linutronix.de>
In-Reply-To: <20260323113252.xsuwQA3z@linutronix.de>

Hello,

On Mon, Mar 23, 2026 at 12:32:52PM +0100, Sebastian Andrzej Siewior wrote:
...
> I saw instances on PREEMPT_RT where the above cgroup_is_populated()
> reported true due to cgrp->nr_populated_csets = 1, the following
> iterator returned NULL but in that time do_cgroup_task_dead() saw no
> waiter and continued without a wake_up and then the following
> schedule() hung.

Ah, right, false->true is protected by cgroup_mutex but true->false is
only by css_set_lock. It should check populated again with css_set_lock
held and then do prepare_to_wait().

> There is no serialisation between this wait/check and the later wake.
> An alternative would be to check and prepare_to_wait() under
> css_set_lock.

Yeap.

> > +	finish_wait(&cgrp->dying_populated_waitq, &wait);
> > +	mutex_lock(&cgroup_mutex);
> > +	goto retry;
> > +}
>
> Then I added my RCU patch. This led to a problem already during boot
> up (didn't manage to get to the test suite).

Is that the patch to move cgroup_task_dead() to
delayed_put_task_struct()? I don't think we can delay populated state
update till usage count reaches zero. e.g.
bpf_task_acquire() can be used by arbitrary bpf programs and will pin
the usage count indefinitely, delaying the populated state update.
Similar to delaying the event to the free path, you can construct a
deadlock scenario too.

> systemd-1 places modprobe-1044 in a cgroup, then destroys the cgroup.
> It hangs in cgroup_drain_dying() because nr_populated_csets is still
> 1. modprobe-1044 is still there in Z so the cgroup removal didn't get
> there yet. That irq_work was quicker than RCU in this case. This can
> be reproduced without RCU by

Isn't this the exact scenario? systemd is the one who should reap and
drop the usage count but it's waiting for rmdir() to finish, which
can't finish due to the usage count which hasn't been reaped by
systemd? We can't interlock these two. They have to make progress
independently.

> -	irq_work_queue(this_cpu_ptr(&cgrp_dead_tasks_iwork));
> +	schedule_delayed_work(this_cpu_ptr(&cgrp_delayed_tasks_iwork), HZ);
>
> So there is always a one second delay. If I give up waiting after
> 10secs then it boots eventually and there are no zombies around. The
> test_core seems to complete…
>
> Having the irq_work as-is, then the "cgroup_dead()" happens on the HZ
> tick. test_core then complains just with
> | not ok 7 test_cgcore_populated

The test is assuming that waitpid() success guarantees the cgroup
!populated event. While that held before all these changes, it wasn't
intentional and the test just picked up on arbitrary ordering. I'll
just remove that particular test.

Thanks.

-- 
tejun