From: chenridong <chenridong@huawei.com>
To: "Michal Koutný" <mkoutny@suse.com>
Cc: Hillf Danton <hdanton@sina.com>,
Roman Gushchin <roman.gushchin@linux.dev>, <tj@kernel.org>,
<bpf@vger.kernel.org>, <cgroups@vger.kernel.org>,
<linux-kernel@vger.kernel.org>
Subject: Re: [PATCH -v2] cgroup: fix deadlock caused by cgroup_mutex and cpu_hotplug_lock
Date: Thu, 8 Aug 2024 10:22:21 +0800
Message-ID: <8be4c357-a111-4134-b7de-ffa6f769c9e4@huawei.com>
In-Reply-To: <mxyismki3ln2pvrbhd36japfffpfcwgyvgmy5him3n746w6wd6@24zlflalef6x>
On 2024/8/7 21:32, Michal Koutný wrote:
> Hello.
>
> On Sat, Jul 27, 2024 at 06:21:55PM GMT, chenridong <chenridong@huawei.com> wrote:
>> Yes, I have offered the scripts in Link(V1).
>
> Thanks (and thanks for patience).
> There is no lockdep complain about a deadlock (i.e. some circular
> locking dependencies). (I admit the multiple holders of cgroup_mutex
> reported there confuse me, I guess that's an artifact of this lockdep
> report and they could be also waiters.)
>
>>> Who'd be the holder of cgroup_mutex preventing cgroup_bpf_release from
>>> progress? (That's not clear to me from your diagram.)
>>>
>> This is a cumulative process. The stress testing deletes a large number of
>> cgroups, and cgroup_bpf_release is asynchronous, competing with cgroup
>> release works.
>
> Those are different situations:
> - waiting for one holder that's stuck for some reason (that's what we're
> after),
> - waiting because the mutex is contended (that's slow but progresses
> eventually).
>
>> You know, cgroup_mutex is used in many places. Finally, the number of
>> `cgroup_bpf_release` instances in system_wq accumulates up to 256, and
>> it leads to this issue.
>
> Reaching max_active doesn't mean that queue_work() would block or the
> items were lost. They are only queued onto inactive_works list.
Yes, I agree. But what if the 256 active works cannot finish because they
are all waiting for a lock? Then the works on the inactive list can never
be executed.
> (Remark: cgroup_destroy_wq has only max_active=1 but it apparently
> doesn't stop progress should there be more items queued (when
> cgroup_mutex is not guarding losing references.))
>
cgroup_destroy_wq is not blocked by cgroup_mutex; it has already acquired
cgroup_mutex, but it is blocked on cpu_hotplug_lock.read, while
cpu_hotplug_lock.write is held by the cpu offline process (step 3).
> ---
>
> The change on its own (deferred cgroup bpf progs removal via
> cgroup_destroy_wq instead of system_wq) is sensible by collecting
> related objects removal together (at the same time it shouldn't cause
> problems by sharing one cgroup_destroy_wq).
>
> But the reasoning in the commit message doesn't add up to me. There
> isn't obvious deadlock, I'd say that system is overloaded with repeated
> calls of __lockup_detector_reconfigure() and it is not in deadlock
> state -- i.e. when you stop the test, it should eventually recover.
> Given that, I'd neither put Fixes: 4bfc0bb2c60e there.
If I stop the test, it can never recover. It would not need to be fixed if
it could recover.
I have to admit, it is a complicated issue.
system_wq was not overloaded with __lockup_detector_reconfigure but with
cgroup_bpf_release_fn. A large number of cgroups were deleted, so there
were 256 active works in system_wq running cgroup_bpf_release_fn, and they
were all blocked on cgroup_mutex.
To keep it simple, imagine that the max_active of system_wq were 1: could
that result in a deadlock? If it could, then imagine that all the works in
system_wq are the same, as in the sketch below.
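To make that concrete, here is a minimal userspace sketch (my own analogue
for illustration only; it is not kernel code, and the names
fake_cgroup_mutex, release_item and unblock_item are made up). The single
active slot is occupied by an item waiting for a mutex, while the mutex
holder depends on a later item on the same queue, so nothing ever runs
again. That is the same shape as 256 cgroup_bpf_release_fn works filling
system_wq while the cgroup_mutex holder in turn waits for another system_wq
item.

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t fake_cgroup_mutex = PTHREAD_MUTEX_INITIALIZER;

/* A toy "workqueue" with max_active = 1: one worker runs items in order. */
static void (*items[8])(void);
static int queued, taken;
static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t qcond = PTHREAD_COND_INITIALIZER;
static int unblock_ran;

static void queue_item(void (*fn)(void))
{
	pthread_mutex_lock(&qlock);
	items[queued++] = fn;
	pthread_cond_signal(&qcond);
	pthread_mutex_unlock(&qlock);
}

static void *worker(void *arg)
{
	(void)arg;
	for (;;) {
		void (*fn)(void);

		pthread_mutex_lock(&qlock);
		while (taken == queued)
			pthread_cond_wait(&qcond, &qlock);
		fn = items[taken++];
		pthread_mutex_unlock(&qlock);
		fn();	/* the only active slot stays busy until fn() returns */
	}
	return NULL;
}

/* Stand-in for a release work item: it needs the shared mutex. */
static void release_item(void)
{
	puts("release_item: waiting for fake_cgroup_mutex...");
	pthread_mutex_lock(&fake_cgroup_mutex);
	pthread_mutex_unlock(&fake_cgroup_mutex);
}

/* Stand-in for the item the mutex holder is waiting for. */
static void unblock_item(void)
{
	unblock_ran = 1;
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, worker, NULL);

	/* "Holder" side: take the mutex, then depend on unblock_item. */
	pthread_mutex_lock(&fake_cgroup_mutex);
	queue_item(release_item);  /* takes the only slot, blocks on the mutex */
	queue_item(unblock_item);  /* can never start: the slot never frees up */
	sleep(2);                  /* the real holder would wait here forever */
	printf("unblock_item ran: %d (0 means the queue is stuck)\n",
	       unblock_ran);
	return 0;
}

Build with "gcc -pthread": unblock_item never runs, and if the holder
actually waited for it instead of timing out, all parties would be stuck
for good.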
> (One could symmetrically argue to move smp_call_on_cpu() away from
> system_wq instead of cgroup_bpf_release_fn().)
>
I agree as well; that is why I moved cgroup_bpf_release_fn away, since
cgroup has its own queue. As TJ said, "system wqs are for misc things which
shouldn't create a large number of concurrent work items. If something is
going to generate 256+ concurrent work items, it should use its own
workqueue."
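That is essentially the direction of this patch. Roughly (a simplified
sketch only, omitting the plumbing needed to make cgroup_destroy_wq
reachable from kernel/bpf/cgroup.c, so please do not read it as the exact
diff):

static void cgroup_bpf_release_fn(struct percpu_ref *ref)
{
	struct cgroup *cgrp = container_of(ref, struct cgroup, bpf.refcnt);

	INIT_WORK(&cgrp->bpf.release_work, cgroup_bpf_release);
	/* was: queue_work(system_wq, &cgrp->bpf.release_work); */
	queue_work(cgroup_destroy_wq, &cgrp->bpf.release_work);
}

With the release items on cgroup_destroy_wq they no longer occupy system_wq
slots, so system_wq stays available for the items the cpu_hotplug_lock
writer is waiting on.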
> Honestly, I'm not sure it's worth the effort if there's no deadlock.
>
There is a deadlock, and I think it has to be fixed.
> It's possible that I'm misunderstanding or I've missed a substantial
> detail for why this could lead to a deadlock. It'd be best visible in a
> sequence diagram with tasks/CPUs left-to-right and time top-down (in the
> original scheme it looks like time goes right-to-left and there's the
> unclear situation of the initial cgroup_mutex holder).
>
> Thanks,
> Michal
I will revise the diagram, and I hope it will make clear how this leads to
the deadlock.
Thank you, Michal, for your reply.
Thanks,
Ridong