public inbox for linux-kernel@vger.kernel.org
* [PATCH v3 0/1] Fix deadlock caused by cgroup_mutex and cpu_hotplug_lock
@ 2024-08-17  9:33 Chen Ridong
  2024-08-17  9:33 ` [PATCH v3 1/1] cgroup: fix " Chen Ridong
  0 siblings, 1 reply; 16+ messages in thread
From: Chen Ridong @ 2024-08-17  9:33 UTC (permalink / raw)
  To: martin.lau, ast, daniel, andrii, eddyz87, song, yonghong.song,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, tj, lizefan.x,
	hannes, roman.gushchin
  Cc: bpf, cgroups, linux-kernel

Changes since v1/v2:
- Improve the commit message.

Link v1: https://lore.kernel.org/cgroups/20240607110313.2230669-1-chenridong@huawei.com/
Link v2: https://lore.kernel.org/cgroups/20240719025232.2143638-1-chenridong@huawei.com/

Chen Ridong (1):
  cgroup: fix deadlock caused by cgroup_mutex and cpu_hotplug_lock

 kernel/bpf/cgroup.c             | 2 +-
 kernel/cgroup/cgroup-internal.h | 1 +
 kernel/cgroup/cgroup.c          | 2 +-
 3 files changed, 3 insertions(+), 2 deletions(-)

-- 
2.34.1



* [PATCH v3 1/1] cgroup: fix deadlock caused by cgroup_mutex and cpu_hotplug_lock
  2024-08-17  9:33 [PATCH v3 0/1] Fix deadlock caused by cgroup_mutex and cpu_hotplug_lock Chen Ridong
@ 2024-08-17  9:33 ` Chen Ridong
  2024-08-22  0:57   ` Chen Ridong
  2024-09-09 14:19   ` Michal Koutný
  0 siblings, 2 replies; 16+ messages in thread
From: Chen Ridong @ 2024-08-17  9:33 UTC (permalink / raw)
  To: martin.lau, ast, daniel, andrii, eddyz87, song, yonghong.song,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, tj, lizefan.x,
	hannes, roman.gushchin
  Cc: bpf, cgroups, linux-kernel

We found a hung_task problem as shown below:

INFO: task kworker/0:0:8 blocked for more than 327 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/0:0     state:D stack:13920 pid:8     ppid:2       flags:0x00004000
Workqueue: events cgroup_bpf_release
Call Trace:
 <TASK>
 __schedule+0x5a2/0x2050
 ? find_held_lock+0x33/0x100
 ? wq_worker_sleeping+0x9e/0xe0
 schedule+0x9f/0x180
 schedule_preempt_disabled+0x25/0x50
 __mutex_lock+0x512/0x740
 ? cgroup_bpf_release+0x1e/0x4d0
 ? cgroup_bpf_release+0xcf/0x4d0
 ? process_scheduled_works+0x161/0x8a0
 ? cgroup_bpf_release+0x1e/0x4d0
 ? mutex_lock_nested+0x2b/0x40
 ? __pfx_delay_tsc+0x10/0x10
 mutex_lock_nested+0x2b/0x40
 cgroup_bpf_release+0xcf/0x4d0
 ? process_scheduled_works+0x161/0x8a0
 ? trace_event_raw_event_workqueue_execute_start+0x64/0xd0
 ? process_scheduled_works+0x161/0x8a0
 process_scheduled_works+0x23a/0x8a0
 worker_thread+0x231/0x5b0
 ? __pfx_worker_thread+0x10/0x10
 kthread+0x14d/0x1c0
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x59/0x70
 ? __pfx_kthread+0x10/0x10
 ret_from_fork_asm+0x1b/0x30
 </TASK>

This issue can be reproduced by the following pressure test:
1. A large number of cpuset cgroups are deleted.
2. Take CPUs online and offline repeatedly.
3. Set watchdog_thresh repeatedly.
The scripts can be obtained at LINK mentioned above the signature; a
rough sketch of the idea is shown below.
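(Illustrative sketch only: the cgroup v1 cpuset mount point, the names,
and the counts are assumptions, and in the real test the three steps run
concurrently in loops.)

----------------------------------------
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>

static void write_str(const char *path, const char *val)
{
	int fd = open(path, O_WRONLY);

	if (fd >= 0) {
		write(fd, val, strlen(val));
		close(fd);
	}
}

int main(void)
{
	char path[128];
	int i;

	/* Step 1: delete many cpuset cgroups; each rmdir() eventually
	 * queues a cgroup_bpf_release work on system_wq. */
	for (i = 0; i < 2000; i++) {
		snprintf(path, sizeof(path), "/sys/fs/cgroup/cpuset/t%d", i);
		mkdir(path, 0755);
		rmdir(path);
	}

	/* Step 2: toggle a CPU offline/online (cpu_hotplug_lock.write). */
	write_str("/sys/devices/system/cpu/cpu1/online", "0");
	write_str("/sys/devices/system/cpu/cpu1/online", "1");

	/* Step 3: rewrite watchdog_thresh; this takes
	 * cpu_hotplug_lock.read and queues sscs.work on system_wq. */
	write_str("/proc/sys/kernel/watchdog_thresh", "10");

	return 0;
}
----------------------------------------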

The reason for this issue is that cgroup_mutex and cpu_hotplug_lock are
acquired in different tasks, which may lead to a deadlock through the
following steps:
1. A large number of cpusets are deleted asynchronously, which puts a
   large number of cgroup_bpf_release works into system_wq. The max_active
   of system_wq is WQ_DFL_ACTIVE(256). Consequently, all active works are
   cgroup_bpf_release works, and many cgroup_bpf_release works will be put
   into the inactive queue. As illustrated in the diagram, there are 256
   (in the active queue) + n (in the inactive queue) works.
2. Setting watchdog_thresh will hold cpu_hotplug_lock.read and put an
   smp_call_on_cpu work into system_wq. However, step 1 has already filled
   system_wq, so 'sscs.work' is put into the inactive queue. 'sscs.work'
   has to wait until the works that were put into the inactive queue
   earlier have executed (n cgroup_bpf_release works), so it will be
   blocked for a while.
3. CPU offline requires cpu_hotplug_lock.write, which is blocked by step 2.
4. Cpusets that were deleted at step 1 put cgroup_release works into
   cgroup_destroy_wq. They are competing to get cgroup_mutex all the time.
   When cgroup_mutex is acquired by the work at css_killed_work_fn, it
   will call cpuset_css_offline, which needs to acquire
   cpu_hotplug_lock.read. However, cpuset_css_offline will be blocked due
   to step 3.
5. At this moment, there are 256 works in the active queue that are
   cgroup_bpf_release works; they are attempting to acquire cgroup_mutex,
   and as a result, all of them are blocked. Consequently, sscs.work
   cannot be executed. Ultimately, this situation leads to four processes
   being blocked, forming a deadlock.

system_wq(step1)		WatchDog(step2)			cpu offline(step3)	cgroup_destroy_wq(step4)
...
2000+ cgroups deleted async
256 actives + n inactives
				__lockup_detector_reconfigure
				P(cpu_hotplug_lock.read)
				put sscs.work into system_wq
256 + n + 1(sscs.work)
sscs.work waits to be executed
				waiting for sscs.work to finish
								percpu_down_write
								P(cpu_hotplug_lock.write)
								...blocking...
											css_killed_work_fn
											P(cgroup_mutex)
											cpuset_css_offline
											P(cpu_hotplug_lock.read)
											...blocking...
256 cgroup_bpf_release
mutex_lock(&cgroup_mutex);
...blocking...

To fix the problem, place cgroup_bpf_release works on cgroup_destroy_wq,
which can break the loop and solve the problem. System wqs are for misc
things which shouldn't create a large number of concurrent work items.
If something is going to generate >WQ_DFL_ACTIVE(256) concurrent work
items, it should use its own dedicated workqueue.

Fixes: 4bfc0bb2c60e ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself")
Link: https://lore.kernel.org/cgroups/e90c32d2-2a85-4f28-9154-09c7d320cb60@huawei.com/T/#t
Signed-off-by: Chen Ridong <chenridong@huawei.com>
---
 kernel/bpf/cgroup.c             | 2 +-
 kernel/cgroup/cgroup-internal.h | 1 +
 kernel/cgroup/cgroup.c          | 2 +-
 3 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index 8ba73042a239..a611a1274788 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -334,7 +334,7 @@ static void cgroup_bpf_release_fn(struct percpu_ref *ref)
 	struct cgroup *cgrp = container_of(ref, struct cgroup, bpf.refcnt);
 
 	INIT_WORK(&cgrp->bpf.release_work, cgroup_bpf_release);
-	queue_work(system_wq, &cgrp->bpf.release_work);
+	queue_work(cgroup_destroy_wq, &cgrp->bpf.release_work);
 }
 
 /* Get underlying bpf_prog of bpf_prog_list entry, regardless if it's through
diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
index c964dd7ff967..17ac19bc8106 100644
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/cgroup-internal.h
@@ -13,6 +13,7 @@
 extern spinlock_t trace_cgroup_path_lock;
 extern char trace_cgroup_path[TRACE_CGROUP_PATH_LEN];
 extern void __init enable_debug_cgroup(void);
+extern struct workqueue_struct *cgroup_destroy_wq;
 
 /*
  * cgroup_path() takes a spin lock. It is good practice not to take
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 75058fbf4450..77fa9ed69c86 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -124,7 +124,7 @@ DEFINE_PERCPU_RWSEM(cgroup_threadgroup_rwsem);
  * destruction work items don't end up filling up max_active of system_wq
  * which may lead to deadlock.
  */
-static struct workqueue_struct *cgroup_destroy_wq;
+struct workqueue_struct *cgroup_destroy_wq;
 
 /* generate an array of cgroup subsystem pointers */
 #define SUBSYS(_x) [_x ## _cgrp_id] = &_x ## _cgrp_subsys,
-- 
2.34.1



* Re: [PATCH v3 1/1] cgroup: fix deadlock caused by cgroup_mutex and cpu_hotplug_lock
  2024-08-17  9:33 ` [PATCH v3 1/1] cgroup: fix " Chen Ridong
@ 2024-08-22  0:57   ` Chen Ridong
  2024-08-30  2:08     ` Chen Ridong
  2024-09-09 14:19   ` Michal Koutný
  1 sibling, 1 reply; 16+ messages in thread
From: Chen Ridong @ 2024-08-22  0:57 UTC (permalink / raw)
  To: martin.lau, ast, daniel, andrii, eddyz87, song, yonghong.song,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, tj, lizefan.x,
	hannes, roman.gushchin, Michal Koutný
  Cc: bpf, cgroups, linux-kernel



On 2024/8/17 17:33, Chen Ridong wrote:
> [...]

Ping.
Hi TJ, Roman, and Michal, I have updated the commit message; I think it
is much clearer now. Could you review it again?

Thanks,
Ridong



* Re: [PATCH v3 1/1] cgroup: fix deadlock caused by cgroup_mutex and cpu_hotplug_lock
  2024-08-22  0:57   ` Chen Ridong
@ 2024-08-30  2:08     ` Chen Ridong
  0 siblings, 0 replies; 16+ messages in thread
From: Chen Ridong @ 2024-08-30  2:08 UTC (permalink / raw)
  To: martin.lau, ast, daniel, andrii, eddyz87, song, yonghong.song,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, tj, lizefan.x,
	hannes, roman.gushchin, Michal Koutný
  Cc: bpf, cgroups, linux-kernel



On 2024/8/22 8:57, Chen Ridong wrote:
> 
> 
> On 2024/8/17 17:33, Chen Ridong wrote:
>> [...]
> 
> Ping.
> Hi TJ, Roman, and Michal, I have updated the commit message; I think it
> is much clearer now. Could you review it again?
> 
> Thanks,
> Ridong
> 
Friendly ping.



* Re: [PATCH v3 1/1] cgroup: fix deadlock caused by cgroup_mutex and cpu_hotplug_lock
  2024-08-17  9:33 ` [PATCH v3 1/1] cgroup: fix " Chen Ridong
  2024-08-22  0:57   ` Chen Ridong
@ 2024-09-09 14:19   ` Michal Koutný
  2024-09-10  1:31     ` Chen Ridong
  2024-09-11 11:15     ` Hillf Danton
  1 sibling, 2 replies; 16+ messages in thread
From: Michal Koutný @ 2024-09-09 14:19 UTC (permalink / raw)
  To: Chen Ridong
  Cc: martin.lau, ast, daniel, andrii, eddyz87, song, yonghong.song,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, tj, lizefan.x,
	hannes, roman.gushchin, bpf, cgroups, linux-kernel


On Sat, Aug 17, 2024 at 09:33:34AM GMT, Chen Ridong <chenridong@huawei.com> wrote:
> The reason for this issue is cgroup_mutex and cpu_hotplug_lock are
> acquired in different tasks, which may lead to deadlock.
> It can lead to a deadlock through the following steps:
> 1. A large number of cpusets are deleted asynchronously, which puts a
>    large number of cgroup_bpf_release works into system_wq. The max_active
>    of system_wq is WQ_DFL_ACTIVE(256). Consequently, all active works are
>    cgroup_bpf_release works, and many cgroup_bpf_release works will be put
>    into the inactive queue. As illustrated in the diagram, there are 256 (in
>    the active queue) + n (in the inactive queue) works.
> 2. Setting watchdog_thresh will hold cpu_hotplug_lock.read and put
>    smp_call_on_cpu work into system_wq. However, step 1 has already filled
>    system_wq, so 'sscs.work' is put into the inactive queue. 'sscs.work' has
>    to wait until the works that were put into the inactive queue earlier
>    have executed (n cgroup_bpf_release), so it will be blocked for a while.
> 3. Cpu offline requires cpu_hotplug_lock.write, which is blocked by step 2.
> 4. Cpusets that were deleted at step 1 put cgroup_release works into
>    cgroup_destroy_wq. They are competing to get cgroup_mutex all the time.
>    When cgroup_mutex is acquired by the work at css_killed_work_fn, it will
>    call cpuset_css_offline, which needs to acquire cpu_hotplug_lock.read.
>    However, cpuset_css_offline will be blocked due to step 3.
> 5. At this moment, there are 256 works in active queue that are
>    cgroup_bpf_release, they are attempting to acquire cgroup_mutex, and as
>    a result, all of them are blocked. Consequently, sscs.work can not be
>    executed. Ultimately, this situation leads to four processes being
>    blocked, forming a deadlock.
> 
> system_wq(step1)		WatchDog(step2)			cpu offline(step3)	cgroup_destroy_wq(step4)
> ...
> 2000+ cgroups deleted async
> 256 actives + n inactives
> 				__lockup_detector_reconfigure
> 				P(cpu_hotplug_lock.read)
> 				put sscs.work into system_wq
> 256 + n + 1(sscs.work)
> sscs.work waits to be executed
> 				waiting for sscs.work to finish
> 								percpu_down_write
> 								P(cpu_hotplug_lock.write)
> 								...blocking...
> 											css_killed_work_fn
> 											P(cgroup_mutex)
> 											cpuset_css_offline
> 											P(cpu_hotplug_lock.read)
> 											...blocking...
> 256 cgroup_bpf_release
> mutex_lock(&cgroup_mutex);
> ...blocking...

Thanks, Ridong, for laying this out.
Let me try to extract the core of the deps above.

The correct lock ordering is: cgroup_mutex then cpu_hotplug_lock.
However, the smp_call_on_cpu() under cpus_read_lock may lead to
a deadlock (ABBA over those two locks).

This is OK
	thread T					system_wq worker
	
	  						lock(cgroup_mutex) (II)
							...
							unlock(cgroup_mutex)
	down(cpu_hotplug_lock.read)
	smp_call_on_cpu
	  queue_work_on(cpu, system_wq, scss) (I)
							scss.func
	  wait_for_completion(scss)
	up(cpu_hotplug_lock.read)

However, there is no ordering between (I) and (II), so they can also
happen in the opposite order

	thread T					system_wq worker
	
	down(cpu_hotplug_lock.read)
	smp_call_on_cpu
	  queue_work_on(cpu, system_wq, scss) (I)
	  						lock(cgroup_mutex)  (II)
							...
							unlock(cgroup_mutex)
							scss.func
	  wait_for_completion(scss)
	up(cpu_hotplug_lock.read)

And here thread T + the system_wq worker effectively take
cpu_hotplug_lock and cgroup_mutex in the wrong order. (And since they're
two threads, it won't be caught by lockdep.)

By that reasoning, any holder of cgroup_mutex on system_wq makes the
system susceptible to a deadlock (in the presence of cpu_hotplug_lock
waiting writers + cpuset operations). And the two work items must meet
in the same worker's processing, hence the probability is low (zero?)
with fewer than WQ_DFL_ACTIVE items.

(And more generally, any lock that is ordered before cpu_hotplug_lock
should not be taken in system_wq work functions. Or at least such work
items should not saturate the WQ_DFL_ACTIVE workers.)

Wrt other uses of cgroup_mutex, I only see
  bpf_map_free_in_work
    queue_work(system_unbound_wq)
      bpf_map_free_deferred
        ops->map_free == cgroup_storage_map_free
          cgroup_lock()
which is safe since it uses a different workqueue than system_wq.

> To fix the problem, place cgroup_bpf_release works on cgroup_destroy_wq,
> which can break the loop and solve the problem.

Yes, it moves the problematic cgroup_mutex holder away from system_wq,
and cgroup_destroy_wq cannot cause similar problems because there are
no explicit waiters for particular work items or for its flushing.


> System wqs are for misc things which shouldn't create a large number
> of concurrent work items.  If something is going to generate
> >WQ_DFL_ACTIVE(256) concurrent work
> items, it should use its own dedicated workqueue.

Actually, I'm not sure (because I lack workqueue knowledge) if producing
fewer than WQ_DFL_ACTIVE work items completely eliminates the chance of
two offending work items producing the wrong lock ordering.


> Fixes: 4bfc0bb2c60e ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself")

I'm now indifferent whether this is needed (perhaps in the sense that it
is the _latest_ of multiple changes that contributed to the possibility
of this deadlock scenario).


> Link: https://lore.kernel.org/cgroups/e90c32d2-2a85-4f28-9154-09c7d320cb60@huawei.com/T/#t
> Signed-off-by: Chen Ridong <chenridong@huawei.com>
> ---
>  kernel/bpf/cgroup.c             | 2 +-
>  kernel/cgroup/cgroup-internal.h | 1 +
>  kernel/cgroup/cgroup.c          | 2 +-
>  3 files changed, 3 insertions(+), 2 deletions(-)

I have convinced myself now that you can put

Reviewed-by: Michal Koutný <mkoutny@suse.com>

Regards,
Michal



* Re: [PATCH v3 1/1] cgroup: fix deadlock caused by cgroup_mutex and cpu_hotplug_lock
  2024-09-09 14:19   ` Michal Koutný
@ 2024-09-10  1:31     ` Chen Ridong
  2024-09-10 21:02       ` Roman Gushchin
  2024-09-11 11:15     ` Hillf Danton
  1 sibling, 1 reply; 16+ messages in thread
From: Chen Ridong @ 2024-09-10  1:31 UTC (permalink / raw)
  To: Michal Koutný, Chen Ridong
  Cc: martin.lau, ast, daniel, andrii, eddyz87, song, yonghong.song,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, tj, lizefan.x,
	hannes, roman.gushchin, bpf, cgroups, linux-kernel



On 2024/9/9 22:19, Michal Koutný wrote:
> On Sat, Aug 17, 2024 at 09:33:34AM GMT, Chen Ridong <chenridong@huawei.com> wrote:
>> [...]
> 
> Thanks, Ridong, for laying this out.
> Let me try to extract the core of the deps above.
> 
> The correct lock ordering is: cgroup_mutex then cpu_hotplug_lock.
> However, the smp_call_on_cpu() under cpus_read_lock may lead to
> a deadlock (ABBA over those two locks).
> 

That's right.

> This is OK
> 	thread T					system_wq worker
> 	
> 	  						lock(cgroup_mutex) (II)
> 							...
> 							unlock(cgroup_mutex)
> 	down(cpu_hotplug_lock.read)
> 	smp_call_on_cpu
> 	  queue_work_on(cpu, system_wq, scss) (I)
> 							scss.func
> 	  wait_for_completion(scss)
> 	up(cpu_hotplug_lock.read)
> 
> However, there is no ordering between (I) and (II) so they can also happen
> in opposite
> 
> 	thread T					system_wq worker
> 	
> 	down(cpu_hotplug_lock.read)
> 	smp_call_on_cpu
> 	  queue_work_on(cpu, system_wq, scss) (I)
> 	  						lock(cgroup_mutex)  (II)
> 							...
> 							unlock(cgroup_mutex)
> 							scss.func
> 	  wait_for_completion(scss)
> 	up(cpu_hotplug_lock.read)
> 
> And here the thread T + system_wq worker effectively call
> cpu_hotplug_lock and cgroup_mutex in the wrong order. (And since they're
> two threads, it won't be caught by lockdep.)
> 
> By that reasoning any holder of cgroup_mutex on system_wq makes system
> susceptible to a deadlock (in presence of cpu_hotplug_lock waiting
> writers + cpuset operations). And the two work items must meet in same
> worker's processing hence probability is low (zero?) with less than
> WQ_DFL_ACTIVE items.
> 
> (And more generally, any lock that is ordered before cpu_hotplug_lock
> should not be taken in system_wq work functions. Or at least such works
> items should not saturate WQ_DFL_ACTIVE workers.)
> 
> Wrt other uses of cgroup_mutex, I only see
>    bpf_map_free_in_work
>      queue_work(system_unbound_wq)
>        bpf_map_free_deferred
>          ops->map_free == cgroup_storage_map_free
>            cgroup_lock()
> which is safe since it uses a different workqueue than system_wq.
> 
>> To fix the problem, place cgroup_bpf_release works on cgroup_destroy_wq,
>> which can break the loop and solve the problem.
> 
> Yes, it moves the problematic cgroup_mutex holder away from system_wq
> and cgroup_destroy_wq could not cause similar problems because there are
> no explicit waiter for particular work items or its flushing.
> 
> 
>> System wqs are for misc things which shouldn't create a large number
>> of concurrent work items.  If something is going to generate
>>> WQ_DFL_ACTIVE(256) concurrent work
>> items, it should use its own dedicated workqueue.
> 
> Actually, I'm not sure (because I lack workqueue knowledge) if producing
> less than WQ_DFL_ACTIVE work items completely eliminates the chance of
> two offending work items producing the wrong lock ordering.
> 

If fewer than WQ_DFL_ACTIVE work items are produced, it won't lead to a
deadlock, because scss.func can be executed without having to wait for
the works that hold cgroup_mutex to complete. Therefore, the probability
is low, and this issue can only be reproduced under a pressure test.

> 
>> Fixes: 4bfc0bb2c60e ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself")
> 
> I'm now indifferent whether this is needed (perhaps in the sense it is
> the _latest_ of multiple changes that contributed to possibility of this
> deadlock scenario).
> 
> 
>> Link: https://lore.kernel.org/cgroups/e90c32d2-2a85-4f28-9154-09c7d320cb60@huawei.com/T/#t
>> Signed-off-by: Chen Ridong <chenridong@huawei.com>
>> ---
>>   kernel/bpf/cgroup.c             | 2 +-
>>   kernel/cgroup/cgroup-internal.h | 1 +
>>   kernel/cgroup/cgroup.c          | 2 +-
>>   3 files changed, 3 insertions(+), 2 deletions(-)
> 
> I have convinved myself now that you can put
> 
> Reviewed-by: Michal Koutný <mkoutny@suse.com>
> 
> Regards,
> Michal

Thank you very much.

Best Regards,
Ridong



* Re: [PATCH v3 1/1] cgroup: fix deadlock caused by cgroup_mutex and cpu_hotplug_lock
  2024-09-10  1:31     ` Chen Ridong
@ 2024-09-10 21:02       ` Roman Gushchin
  2024-09-10 21:17         ` Tejun Heo
  0 siblings, 1 reply; 16+ messages in thread
From: Roman Gushchin @ 2024-09-10 21:02 UTC (permalink / raw)
  To: Chen Ridong
  Cc: Michal Koutný, Chen Ridong, martin.lau, ast, daniel, andrii,
	eddyz87, song, yonghong.song, john.fastabend, kpsingh, sdf,
	haoluo, jolsa, tj, lizefan.x, hannes, bpf, cgroups, linux-kernel

On Tue, Sep 10, 2024 at 09:31:41AM +0800, Chen Ridong wrote:
> 
> 
> On 2024/9/9 22:19, Michal Koutný wrote:
> > On Sat, Aug 17, 2024 at 09:33:34AM GMT, Chen Ridong <chenridong@huawei.com> wrote:
> > > The reason for this issue is cgroup_mutex and cpu_hotplug_lock are
> > > acquired in different tasks, which may lead to deadlock.
> > > It can lead to a deadlock through the following steps:
> > > 1. A large number of cpusets are deleted asynchronously, which puts a
> > >     large number of cgroup_bpf_release works into system_wq. The max_active
> > >     of system_wq is WQ_DFL_ACTIVE(256). Consequently, all active works are
> > >     cgroup_bpf_release works, and many cgroup_bpf_release works will be put
> > >     into inactive queue. As illustrated in the diagram, there are 256 (in
> > >     the acvtive queue) + n (in the inactive queue) works.
> > > 2. Setting watchdog_thresh will hold cpu_hotplug_lock.read and put
> > >     smp_call_on_cpu work into system_wq. However step 1 has already filled
> > >     system_wq, 'sscs.work' is put into inactive queue. 'sscs.work' has
> > >     to wait until the works that were put into the inacvtive queue earlier
> > >     have executed (n cgroup_bpf_release), so it will be blocked for a while.
> > > 3. Cpu offline requires cpu_hotplug_lock.write, which is blocked by step 2.
> > > 4. Cpusets that were deleted at step 1 put cgroup_release works into
> > >     cgroup_destroy_wq. They are competing to get cgroup_mutex all the time.
> > >     When cgroup_metux is acqured by work at css_killed_work_fn, it will
> > >     call cpuset_css_offline, which needs to acqure cpu_hotplug_lock.read.
> > >     However, cpuset_css_offline will be blocked for step 3.
> > > 5. At this moment, there are 256 works in active queue that are
> > >     cgroup_bpf_release, they are attempting to acquire cgroup_mutex, and as
> > >     a result, all of them are blocked. Consequently, sscs.work can not be
> > >     executed. Ultimately, this situation leads to four processes being
> > >     blocked, forming a deadlock.
> > > 
> > > system_wq(step1)		WatchDog(step2)			cpu offline(step3)	cgroup_destroy_wq(step4)
> > > ...
> > > 2000+ cgroups deleted asyn
> > > 256 actives + n inactives
> > > 				__lockup_detector_reconfigure
> > > 				P(cpu_hotplug_lock.read)
> > > 				put sscs.work into system_wq
> > > 256 + n + 1(sscs.work)
> > > sscs.work wait to be executed
> > > 				warting sscs.work finish
> > > 								percpu_down_write
> > > 								P(cpu_hotplug_lock.write)
> > > 								...blocking...
> > > 											css_killed_work_fn
> > > 											P(cgroup_mutex)
> > > 											cpuset_css_offline
> > > 											P(cpu_hotplug_lock.read)
> > > 											...blocking...
> > > 256 cgroup_bpf_release
> > > mutex_lock(&cgroup_mutex);
> > > ..blocking...
> > 
> > Thanks, Ridong, for laying this out.
> > Let me try to extract the core of the deps above.
> > 
> > The correct lock ordering is: cgroup_mutex then cpu_hotplug_lock.
> > However, the smp_call_on_cpu() under cpus_read_lock may lead to
> > a deadlock (ABBA over those two locks).
> > 
> 
> That's right.
> 
> > This is OK
> > 	thread T					system_wq worker
> > 	
> > 	  						lock(cgroup_mutex) (II)
> > 							...
> > 							unlock(cgroup_mutex)
> > 	down(cpu_hotplug_lock.read)
> > 	smp_call_on_cpu
> > 	  queue_work_on(cpu, system_wq, scss) (I)
> > 							scss.func
> > 	  wait_for_completion(scss)
> > 	up(cpu_hotplug_lock.read)
> > 
> > However, there is no ordering between (I) and (II) so they can also happen
> > in opposite
> > 
> > 	thread T					system_wq worker
> > 	
> > 	down(cpu_hotplug_lock.read)
> > 	smp_call_on_cpu
> > 	  queue_work_on(cpu, system_wq, scss) (I)
> > 	  						lock(cgroup_mutex)  (II)
> > 							...
> > 							unlock(cgroup_mutex)
> > 							scss.func
> > 	  wait_for_completion(scss)
> > 	up(cpu_hotplug_lock.read)
> > 
> > And here the thread T + system_wq worker effectively call
> > cpu_hotplug_lock and cgroup_mutex in the wrong order. (And since they're
> > two threads, it won't be caught by lockdep.)
> > 
> > By that reasoning any holder of cgroup_mutex on system_wq makes system
> > susceptible to a deadlock (in presence of cpu_hotplug_lock waiting
> > writers + cpuset operations). And the two work items must meet in same
> > worker's processing hence probability is low (zero?) with less than
> > WQ_DFL_ACTIVE items.

Right, I'm on the same page. Should we then document somewhere that
cgroup_mutex can't be locked from a system wq context?

I think this will also make the Fixes tag more meaningful.

Thank you!


* Re: [PATCH v3 1/1] cgroup: fix deadlock caused by cgroup_mutex and cpu_hotplug_lock
  2024-09-10 21:02       ` Roman Gushchin
@ 2024-09-10 21:17         ` Tejun Heo
  2024-09-12  1:33           ` Chen Ridong
  0 siblings, 1 reply; 16+ messages in thread
From: Tejun Heo @ 2024-09-10 21:17 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Chen Ridong, Michal Koutný, Chen Ridong, martin.lau, ast,
	daniel, andrii, eddyz87, song, yonghong.song, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, lizefan.x, hannes, bpf, cgroups,
	linux-kernel

On Tue, Sep 10, 2024 at 09:02:59PM +0000, Roman Gushchin wrote:
...
> > > By that reasoning any holder of cgroup_mutex on system_wq makes system
> > > susceptible to a deadlock (in presence of cpu_hotplug_lock waiting
> > > writers + cpuset operations). And the two work items must meet in same
> > > worker's processing hence probability is low (zero?) with less than
> > > WQ_DFL_ACTIVE items.
> 
> Right, I'm on the same page. Should we document then somewhere that
> the cgroup mutex can't be locked from a system wq context?
> 
> I think thus will also make the Fixes tag more meaningful.

I think that's completely fine. What's not fine is saturating system_wq.
Anything which creates a large number of concurrent work items should be
using its own workqueue. If anything, workqueue needs to add a warning
for saturation conditions and report who the offenders are.
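For example, cgroup's own destruction workqueue already follows this
pattern; a sketch modeled on cgroup_wq_init() in kernel/cgroup/cgroup.c
(the names here are mine):

----------------------------------------
static struct workqueue_struct *my_destroy_wq;

static int __init my_wq_init(void)
{
	/* max_active = 1: destruction work items are serialized on
	 * their own queue and can never saturate a shared one. */
	my_destroy_wq = alloc_workqueue("my_destroy", 0, 1);
	BUG_ON(!my_destroy_wq);
	return 0;
}
core_initcall(my_wq_init);
----------------------------------------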

Thanks.

-- 
tejun


* Re: [PATCH v3 1/1] cgroup: fix deadlock caused by cgroup_mutex and cpu_hotplug_lock
  2024-09-09 14:19   ` Michal Koutný
  2024-09-10  1:31     ` Chen Ridong
@ 2024-09-11 11:15     ` Hillf Danton
  2024-09-26 12:53       ` Michal Koutný
  2024-09-28  8:11       ` Tetsuo Handa
  1 sibling, 2 replies; 16+ messages in thread
From: Hillf Danton @ 2024-09-11 11:15 UTC (permalink / raw)
  To: Michal Koutny
  Cc: Chen Ridong, tj, cgroups, Tetsuo Handa, Boqun Feng,
	Linus Torvalds, linux-kernel

On Mon, 9 Sep 2024 16:19:38 +0200 Michal Koutny <mkoutny@suse.com>
> On Sat, Aug 17, 2024 at 09:33:34AM GMT, Chen Ridong <chenridong@huawei.com> wrote:
> > [...]
> 
> Thanks, Ridong, for laying this out.
> Let me try to extract the core of the deps above.
> 
> The correct lock ordering is: cgroup_mutex then cpu_hotplug_lock.
> However, the smp_call_on_cpu() under cpus_read_lock may lead to
> a deadlock (ABBA over those two locks).
> 
> This is OK
> 	thread T					system_wq worker
> 
> 	  						lock(cgroup_mutex) (II)
> 							...
> 							unlock(cgroup_mutex)
> 	down(cpu_hotplug_lock.read)
> 	smp_call_on_cpu
> 	  queue_work_on(cpu, system_wq, scss) (I)
> 							scss.func
> 	  wait_for_completion(scss)
> 	up(cpu_hotplug_lock.read)
> 
> However, there is no ordering between (I) and (II) so they can also happen
> in opposite
> 
> 	thread T					system_wq worker
> 
> 	down(cpu_hotplug_lock.read)
> 	smp_call_on_cpu
> 	  queue_work_on(cpu, system_wq, scss) (I)
> 	  						lock(cgroup_mutex)  (II)
> 							...
> 							unlock(cgroup_mutex)
> 							scss.func
> 	  wait_for_completion(scss)
> 	up(cpu_hotplug_lock.read)
> 
> And here the thread T + system_wq worker effectively call
> cpu_hotplug_lock and cgroup_mutex in the wrong order. (And since they're
> two threads, it won't be caught by lockdep.)
> 
Given that no workqueue work is executed without being dequeued, any
queued work that acquires cgroup_mutex, regardless of whether there are
more than 2048 of them, could not prevent the work queued by thread-T
from being executed, so thread-T can make safe forward progress, leaving
no chance for the ABBA deadlock you spotted where lockdep fails to work.


* Re: [PATCH v3 1/1] cgroup: fix deadlock caused by cgroup_mutex and cpu_hotplug_lock
  2024-09-10 21:17         ` Tejun Heo
@ 2024-09-12  1:33           ` Chen Ridong
  2024-09-12  5:59             ` Tejun Heo
  0 siblings, 1 reply; 16+ messages in thread
From: Chen Ridong @ 2024-09-12  1:33 UTC (permalink / raw)
  To: Tejun Heo, Roman Gushchin
  Cc: Michal Koutný, Chen Ridong, martin.lau, ast, daniel, andrii,
	eddyz87, song, yonghong.song, john.fastabend, kpsingh, sdf,
	haoluo, jolsa, lizefan.x, hannes, bpf, cgroups, linux-kernel



On 2024/9/11 5:17, Tejun Heo wrote:
> On Tue, Sep 10, 2024 at 09:02:59PM +0000, Roman Gushchin wrote:
> ...
>>>> By that reasoning any holder of cgroup_mutex on system_wq makes system
>>>> susceptible to a deadlock (in presence of cpu_hotplug_lock waiting
>>>> writers + cpuset operations). And the two work items must meet in same
>>>> worker's processing hence probability is low (zero?) with less than
>>>> WQ_DFL_ACTIVE items.
>>
>> Right, I'm on the same page. Should we document then somewhere that
>> the cgroup mutex can't be locked from a system wq context?
>>
>> I think this will also make the Fixes tag more meaningful.
> 
> I think that's completely fine. What's not fine is saturating system_wq.
> Anything which creates a large number of concurrent work items should be
> using its own workqueue. If anything, workqueue needs to add a warning
> for saturation conditions and report who the offenders are.
> 
> Thanks.
> 

I will add a patch to document that.
Should we modify WQ_DFL_ACTIVE (256 now)? Maybe 1024 is acceptable?
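For reference, the relevant constants in include/linux/workqueue.h look
roughly like this (shape approximate):

----------------------------------------
WQ_MAX_ACTIVE	= 512,		  /* I like 512, better ideas? */
WQ_DFL_ACTIVE	= WQ_MAX_ACTIVE / 2,
----------------------------------------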

Best regards,
Ridong



* Re: [PATCH v3 1/1] cgroup: fix deadlock caused by cgroup_mutex and cpu_hotplug_lock
  2024-09-12  1:33           ` Chen Ridong
@ 2024-09-12  5:59             ` Tejun Heo
  0 siblings, 0 replies; 16+ messages in thread
From: Tejun Heo @ 2024-09-12  5:59 UTC (permalink / raw)
  To: Chen Ridong
  Cc: Roman Gushchin, Michal Koutný, Chen Ridong, martin.lau, ast,
	daniel, andrii, eddyz87, song, yonghong.song, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, lizefan.x, hannes, bpf, cgroups,
	linux-kernel

On Thu, Sep 12, 2024 at 09:33:23AM +0800, Chen Ridong wrote:
> I will add a patch to document that.
> Should we modify WQ_DFL_ACTIVE(256 now)? Maybe 1024 is acceptable?

Yeah, let's bump it.

Thanks.

-- 
tejun


* Re: [PATCH v3 1/1] cgroup: fix deadlock caused by cgroup_mutex and cpu_hotplug_lock
  2024-09-11 11:15     ` Hillf Danton
@ 2024-09-26 12:53       ` Michal Koutný
  2024-09-27 11:25         ` Hillf Danton
  2024-09-28  8:11       ` Tetsuo Handa
  1 sibling, 1 reply; 16+ messages in thread
From: Michal Koutný @ 2024-09-26 12:53 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Chen Ridong, tj, cgroups, Tetsuo Handa, Boqun Feng,
	Linus Torvalds, linux-kernel


Hello Hillf.

(sorry for the late reply)

On Wed, Sep 11, 2024 at 07:15:42PM GMT, Hillf Danton <hdanton@sina.com> wrote:
> > However, there is no ordering between (I) and (II) so they can also happen
> > in opposite
> > 
> > 	thread T					system_wq worker
> > 
> > 	down(cpu_hotplug_lock.read)
> > 	smp_call_on_cpu
> > 	  queue_work_on(cpu, system_wq, scss) (I)
> > 	  						lock(cgroup_mutex)  (II)
> > 							...
> > 							unlock(cgroup_mutex)
> > 							scss.func
> > 	  wait_for_completion(scss)
> > 	up(cpu_hotplug_lock.read)
> > 
> > And here the thread T + system_wq worker effectively call
> > cpu_hotplug_lock and cgroup_mutex in the wrong order. (And since they're
> > two threads, it won't be caught by lockdep.)
> > 
> Given no workqueue work executed without being dequeued, any queued work,
> regardless if they are more than 2048, that acquires cgroup_mutex could not
> prevent the work queued by thread-T from being executed, so thread-T can
> make safe forward progress, therefore with no chance left for the ABBA 
> deadlock you spotted where lockdep fails to work.

Is there a forgotten negation and did you intend to write: "any queued
work ... that acquired cgroup_mutex could prevent"?

Or if the negation is correct, why do you say that a processed work item
is _not_ preventing thread T from running (in the case I left quoted
above)?

Thanks,
Michal



* Re: [PATCH v3 1/1] cgroup: fix deadlock caused by cgroup_mutex and cpu_hotplug_lock
  2024-09-26 12:53       ` Michal Koutný
@ 2024-09-27 11:25         ` Hillf Danton
  2024-09-27 14:03           ` Michal Koutný
  0 siblings, 1 reply; 16+ messages in thread
From: Hillf Danton @ 2024-09-27 11:25 UTC (permalink / raw)
  To: Michal Koutny
  Cc: Chen Ridong, tj, cgroups, Tetsuo Handa, Boqun Feng,
	Linus Torvalds, linux-kernel

On Thu, 26 Sep 2024 14:53:46 +0200 Michal Koutny <mkoutny@suse.com>
> On Wed, Sep 11, 2024 at 07:15:42PM GMT, Hillf Danton <hdanton@sina.com> wrote:
> > > However, there is no ordering between (I) and (II) so they can also happen
> > > in opposite
> > > 
> > > 	thread T					system_wq worker
> > > 
> > > 	down(cpu_hotplug_lock.read)
> > > 	smp_call_on_cpu
> > > 	  queue_work_on(cpu, system_wq, scss) (I)
> > > 	  						lock(cgroup_mutex)  (II)
> > > 							...
> > > 							unlock(cgroup_mutex)
> > > 							scss.func
> > > 	  wait_for_completion(scss)
> > > 	up(cpu_hotplug_lock.read)
> > > 
> > > And here the thread T + system_wq worker effectively call
> > > cpu_hotplug_lock and cgroup_mutex in the wrong order. (And since they're
> > > two threads, it won't be caught by lockdep.)
> > > 
> > Given no workqueue work executed without being dequeued, any queued work,
> > regardless if they are more than 2048, that acquires cgroup_mutex could not
> > prevent the work queued by thread-T from being executed, so thread-T can
> > make safe forward progress, therefore with no chance left for the ABBA 
> > deadlock you spotted where lockdep fails to work.
> 
> Is there a forgotten negation and did you intend to write: "any queued
> work ... that acquired cgroup_mutex could prevent"?
> 
No I did not.

> Or if the negation is correct, why do you mean that processed work item
> is _not_ preventing thread T from running (in the case I left quoted
> above)?
>
If N (N > 1) cgroup work items are queued before one cpu hotplug work, then
1) workqueue worker1 dequeues cgroup work1 and executes it,
2) worker1 goes off CPU and sleeps because it fails to acquire
cgroup_mutex,
3) worker2 starts processing cgroup work2 and repeats 1) and 2),
4) after N sleepers, workerN+1 dequeues the hotplug work, executes it,
and finally completes.

Clear lad?


* Re: [PATCH v3 1/1] cgroup: fix deadlock caused by cgroup_mutex and cpu_hotplug_lock
  2024-09-27 11:25         ` Hillf Danton
@ 2024-09-27 14:03           ` Michal Koutný
  0 siblings, 0 replies; 16+ messages in thread
From: Michal Koutný @ 2024-09-27 14:03 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Chen Ridong, tj, cgroups, Tetsuo Handa, Boqun Feng,
	Linus Torvalds, linux-kernel


On Fri, Sep 27, 2024 at 07:25:16PM GMT, Hillf Danton <hdanton@sina.com> wrote:
> > Or if the negation is correct, why do you mean that processed work item
> > is _not_ preventing thread T from running (in the case I left quoted
> > above)?
> >
> If N (N > 1) cgroup work items are queued before one cpu hotplug work, then
> 1) workqueue worker1 dequeues cgroup work1 and executes it,
> 2) worker1 goes off cpu and falls in nap because of failure of acquiring
> cgroup_mutex,
> 3) worker2 starts processing cgroup work2 and repeats 1) and 2),
> 4) after N sleepers, workerN+1 dequeus the hotplug work and executes it
> and completes finally.

My picture of putting everything under one system_wq worker was a bit
clumsy. I see how other workers can help out with processing the queue;
that's where N >= WQ_DFL_ACTIVE then comes into play and this gets
stuck(?). [1]

IOW, if N < WQ_DFL_ACTIVE, the mutex waiters in the queue are harmless.

> Clear lad?

I hope, thanks!

Michal

[1] I don't see a trivial way to modify lockdep to catch this
    (besides taking wq saturation into account, it would also need to
    propagate some info across complete->wait_for_completion).




* Re: [PATCH v3 1/1] cgroup: fix deadlock caused by cgroup_mutex and cpu_hotplug_lock
  2024-09-11 11:15     ` Hillf Danton
  2024-09-26 12:53       ` Michal Koutný
@ 2024-09-28  8:11       ` Tetsuo Handa
  2024-09-29  1:30         ` Chen Ridong
  1 sibling, 1 reply; 16+ messages in thread
From: Tetsuo Handa @ 2024-09-28  8:11 UTC (permalink / raw)
  To: Hillf Danton, Michal Koutny
  Cc: Chen Ridong, tj, cgroups, Boqun Feng, Linus Torvalds,
	linux-kernel

On 2024/09/11 20:15, Hillf Danton wrote:
> On Mon, 9 Sep 2024 16:19:38 +0200 Michal Koutny <mkoutny@suse.com>
>> On Sat, Aug 17, 2024 at 09:33:34AM GMT, Chen Ridong <chenridong@huawei.com> wrote:
>>> [...]
> Given no workqueue work executed without being dequeued, any queued work,
> regardless if they are more than 2048, that acquires cgroup_mutex could not
> prevent the work queued by thread-T from being executed, so thread-T can
> make safe forward progress, therefore with no chance left for the ABBA 
> deadlock you spotted where lockdep fails to work.

I made a simple test which queues many work items into system_wq and
measures the time needed to flush the last work item.

As the number of work items increases, the time needed also increases.
Although nobody uses flush_workqueue() on system_wq, several users
use flush_work() on work items in system_wq. Therefore, I think that
queuing thousands of work items into system_wq should be avoided,
regardless of whether there is a possibility of deadlock.

----------------------------------------
#include <linux/module.h>
#include <linux/workqueue.h>
#include <linux/sched.h>
#include <linux/jiffies.h>

/* Each work item just sleeps for one second. */
static void worker_func(struct work_struct *work)
{
        schedule_timeout_uninterruptible(HZ);
}

#define MAX_WORKS 8192
static struct work_struct works[MAX_WORKS];

static int __init test_init(void)
{
        int i;
        unsigned long start, end;

        /* Queue MAX_WORKS one-second work items into system_wq. */
        for (i = 0; i < MAX_WORKS; i++) {
                INIT_WORK(&works[i], worker_func);
                schedule_work(&works[i]);
        }
        /* Measure how long flushing the last queued item takes. */
        start = jiffies;
        flush_work(&works[MAX_WORKS - 1]);
        end = jiffies;
        printk("%u: Took %lu jiffies. (HZ=%u)\n", MAX_WORKS, end - start, HZ);
        /* Flush the rest before init fails and the module text goes away. */
        for (i = 0; i < MAX_WORKS; i++)
                flush_work(&works[i]);
        return -EINVAL;
}

module_init(test_init);
MODULE_LICENSE("GPL");
----------------------------------------

12 CPUs
256: Took 1025 jiffies. (HZ=1000)
512: Took 2091 jiffies. (HZ=1000)
1024: Took 4105 jiffies. (HZ=1000)
2048: Took 8321 jiffies. (HZ=1000)
4096: Took 16382 jiffies. (HZ=1000)
8192: Took 32770 jiffies. (HZ=1000)

1 CPU
256: Took 1133 jiffies. (HZ=1000)
512: Took 2047 jiffies. (HZ=1000)
1024: Took 4117 jiffies. (HZ=1000)
2048: Took 8210 jiffies. (HZ=1000)
4096: Took 16424 jiffies. (HZ=1000)
8192: Took 32774 jiffies. (HZ=1000)

The scaling is what you would expect: with max_active = 256 and each
work item sleeping one second, flushing the last of N items takes
roughly N/256 seconds, independent of the CPU count.



* Re: [PATCH v3 1/1] cgroup: fix deadlock caused by cgroup_mutex and cpu_hotplug_lock
  2024-09-28  8:11       ` Tetsuo Handa
@ 2024-09-29  1:30         ` Chen Ridong
  0 siblings, 0 replies; 16+ messages in thread
From: Chen Ridong @ 2024-09-29  1:30 UTC (permalink / raw)
  To: Tetsuo Handa, Hillf Danton, Michal Koutny
  Cc: tj, cgroups, Boqun Feng, Linus Torvalds, linux-kernel



On 2024/9/28 16:11, Tetsuo Handa wrote:
> On 2024/09/11 20:15, Hillf Danton wrote:
>> On Mon, 9 Sep 2024 16:19:38 +0200 Michal Koutny <mkoutny@suse.com>
>>> On Sat, Aug 17, 2024 at 09:33:34AM GMT, Chen Ridong <chenridong@huawei.com> wrote:
>>>> [...]
>> Given no workqueue work executed without being dequeued, any queued work,
>> regardless if they are more than 2048, that acquires cgroup_mutex could not
>> prevent the work queued by thread-T from being executed, so thread-T can
>> make safe forward progress, therefore with no chance left for the ABBA
>> deadlock you spotted where lockdep fails to work.
> 
> I made a simple test which queues many work items into system_wq and
> measures the time needed to flush the last work item.
> 
> As the number of work items increases, the time needed also increases.
> Although nobody uses flush_workqueue() on system_wq, several users
> use flush_work() on work items in system_wq. Therefore, I think that
> queuing thousands of work items in system_wq should be avoided,
> regardless of whether there is possibility of deadlock.
> 

I have sent a patch to document this.

Link: https://lore.kernel.org/linux-kernel/20240923114352.4001560-3-chenridong@huaweicloud.com/

Michal and I are discussing how to make this constraint clear. If you 
can express this constraint more clearly, just reply.

Best regards,
Ridong


