All of lore.kernel.org
 help / color / mirror / Atom feed
From: Roman Gushchin <roman.gushchin@linux.dev>
To: Chen Ridong <chenridong@huaweicloud.com>
Cc: "Michal Koutný" <mkoutny@suse.com>,
	"Chen Ridong" <chenridong@huawei.com>,
	martin.lau@linux.dev, ast@kernel.org, daniel@iogearbox.net,
	andrii@kernel.org, eddyz87@gmail.com, song@kernel.org,
	yonghong.song@linux.dev, john.fastabend@gmail.com,
	kpsingh@kernel.org, sdf@google.com, haoluo@google.com,
	jolsa@kernel.org, tj@kernel.org, lizefan.x@bytedance.com,
	hannes@cmpxchg.org, bpf@vger.kernel.org, cgroups@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH v3 1/1] cgroup: fix deadlock caused by cgroup_mutex and cpu_hotplug_lock
Date: Tue, 10 Sep 2024 21:02:59 +0000	[thread overview]
Message-ID: <ZuC0A98pxYc3TODM@google.com> (raw)
In-Reply-To: <07501c67-3b18-48e3-8929-e773d8d6920f@huaweicloud.com>

On Tue, Sep 10, 2024 at 09:31:41AM +0800, Chen Ridong wrote:
> 
> 
> On 2024/9/9 22:19, Michal Koutný wrote:
> > On Sat, Aug 17, 2024 at 09:33:34AM GMT, Chen Ridong <chenridong@huawei.com> wrote:
> > > The reason for this issue is cgroup_mutex and cpu_hotplug_lock are
> > > acquired in different tasks, which may lead to deadlock.
> > > It can lead to a deadlock through the following steps:
> > > 1. A large number of cpusets are deleted asynchronously, which puts a
> > >     large number of cgroup_bpf_release works into system_wq. The max_active
> > >     of system_wq is WQ_DFL_ACTIVE(256). Consequently, all active works are
> > >     cgroup_bpf_release works, and many cgroup_bpf_release works will be put
> > >     into inactive queue. As illustrated in the diagram, there are 256 (in
> > >     the acvtive queue) + n (in the inactive queue) works.
> > > 2. Setting watchdog_thresh will hold cpu_hotplug_lock.read and put
> > >     smp_call_on_cpu work into system_wq. However step 1 has already filled
> > >     system_wq, 'sscs.work' is put into inactive queue. 'sscs.work' has
> > >     to wait until the works that were put into the inacvtive queue earlier
> > >     have executed (n cgroup_bpf_release), so it will be blocked for a while.
> > > 3. Cpu offline requires cpu_hotplug_lock.write, which is blocked by step 2.
> > > 4. Cpusets that were deleted at step 1 put cgroup_release works into
> > >     cgroup_destroy_wq. They are competing to get cgroup_mutex all the time.
> > >     When cgroup_metux is acqured by work at css_killed_work_fn, it will
> > >     call cpuset_css_offline, which needs to acqure cpu_hotplug_lock.read.
> > >     However, cpuset_css_offline will be blocked for step 3.
> > > 5. At this moment, there are 256 works in active queue that are
> > >     cgroup_bpf_release, they are attempting to acquire cgroup_mutex, and as
> > >     a result, all of them are blocked. Consequently, sscs.work can not be
> > >     executed. Ultimately, this situation leads to four processes being
> > >     blocked, forming a deadlock.
> > > 
> > > system_wq(step1)		WatchDog(step2)			cpu offline(step3)	cgroup_destroy_wq(step4)
> > > ...
> > > 2000+ cgroups deleted asyn
> > > 256 actives + n inactives
> > > 				__lockup_detector_reconfigure
> > > 				P(cpu_hotplug_lock.read)
> > > 				put sscs.work into system_wq
> > > 256 + n + 1(sscs.work)
> > > sscs.work wait to be executed
> > > 				warting sscs.work finish
> > > 								percpu_down_write
> > > 								P(cpu_hotplug_lock.write)
> > > 								...blocking...
> > > 											css_killed_work_fn
> > > 											P(cgroup_mutex)
> > > 											cpuset_css_offline
> > > 											P(cpu_hotplug_lock.read)
> > > 											...blocking...
> > > 256 cgroup_bpf_release
> > > mutex_lock(&cgroup_mutex);
> > > ..blocking...
> > 
> > Thanks, Ridong, for laying this out.
> > Let me try to extract the core of the deps above.
> > 
> > The correct lock ordering is: cgroup_mutex then cpu_hotplug_lock.
> > However, the smp_call_on_cpu() under cpus_read_lock may lead to
> > a deadlock (ABBA over those two locks).
> > 
> 
> That's right.
> 
> > This is OK
> > 	thread T					system_wq worker
> > 	
> > 	  						lock(cgroup_mutex) (II)
> > 							...
> > 							unlock(cgroup_mutex)
> > 	down(cpu_hotplug_lock.read)
> > 	smp_call_on_cpu
> > 	  queue_work_on(cpu, system_wq, scss) (I)
> > 							scss.func
> > 	  wait_for_completion(scss)
> > 	up(cpu_hotplug_lock.read)
> > 
> > However, there is no ordering between (I) and (II) so they can also happen
> > in opposite
> > 
> > 	thread T					system_wq worker
> > 	
> > 	down(cpu_hotplug_lock.read)
> > 	smp_call_on_cpu
> > 	  queue_work_on(cpu, system_wq, scss) (I)
> > 	  						lock(cgroup_mutex)  (II)
> > 							...
> > 							unlock(cgroup_mutex)
> > 							scss.func
> > 	  wait_for_completion(scss)
> > 	up(cpu_hotplug_lock.read)
> > 
> > And here the thread T + system_wq worker effectively call
> > cpu_hotplug_lock and cgroup_mutex in the wrong order. (And since they're
> > two threads, it won't be caught by lockdep.)
> > 
> > By that reasoning any holder of cgroup_mutex on system_wq makes system
> > susceptible to a deadlock (in presence of cpu_hotplug_lock waiting
> > writers + cpuset operations). And the two work items must meet in same
> > worker's processing hence probability is low (zero?) with less than
> > WQ_DFL_ACTIVE items.

Right, I'm on the same page. Should we document then somewhere that
the cgroup mutex can't be locked from a system wq context?

I think thus will also make the Fixes tag more meaningful.

Thank you!

  reply	other threads:[~2024-09-10 21:03 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-08-17  9:33 [PATCH v3 0/1] Fix deadlock caused by cgroup_mutex and cpu_hotplug_lock Chen Ridong
2024-08-17  9:33 ` [PATCH v3 1/1] cgroup: fix " Chen Ridong
2024-08-22  0:57   ` Chen Ridong
2024-08-30  2:08     ` Chen Ridong
2024-09-09 14:19   ` Michal Koutný
2024-09-10  1:31     ` Chen Ridong
2024-09-10 21:02       ` Roman Gushchin [this message]
2024-09-10 21:17         ` Tejun Heo
2024-09-12  1:33           ` Chen Ridong
2024-09-12  5:59             ` Tejun Heo
2024-09-11 11:15     ` Hillf Danton
2024-09-26 12:53       ` Michal Koutný
2024-09-27 11:25         ` Hillf Danton
2024-09-27 14:03           ` Michal Koutný
2024-09-28  8:11       ` Tetsuo Handa
2024-09-29  1:30         ` Chen Ridong

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZuC0A98pxYc3TODM@google.com \
    --to=roman.gushchin@linux.dev \
    --cc=andrii@kernel.org \
    --cc=ast@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=cgroups@vger.kernel.org \
    --cc=chenridong@huawei.com \
    --cc=chenridong@huaweicloud.com \
    --cc=daniel@iogearbox.net \
    --cc=eddyz87@gmail.com \
    --cc=hannes@cmpxchg.org \
    --cc=haoluo@google.com \
    --cc=john.fastabend@gmail.com \
    --cc=jolsa@kernel.org \
    --cc=kpsingh@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=lizefan.x@bytedance.com \
    --cc=martin.lau@linux.dev \
    --cc=mkoutny@suse.com \
    --cc=sdf@google.com \
    --cc=song@kernel.org \
    --cc=tj@kernel.org \
    --cc=yonghong.song@linux.dev \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.