* [PATCH -v2] cgroup: fix deadlock caused by cgroup_mutex and cpu_hotplug_lock
From: Chen Ridong @ 2024-07-19 2:52 UTC
To: martin.lau, ast, daniel, andrii, eddyz87, song, yonghong.song,
john.fastabend, kpsingh, sdf, haoluo, jolsa, tj, lizefan.x,
hannes, roman.gushchin
Cc: bpf, cgroups, linux-kernel
We found a hung_task problem as shown below:
INFO: task kworker/0:0:8 blocked for more than 327 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/0:0 state:D stack:13920 pid:8 ppid:2 flags:0x00004000
Workqueue: events cgroup_bpf_release
Call Trace:
<TASK>
__schedule+0x5a2/0x2050
? find_held_lock+0x33/0x100
? wq_worker_sleeping+0x9e/0xe0
schedule+0x9f/0x180
schedule_preempt_disabled+0x25/0x50
__mutex_lock+0x512/0x740
? cgroup_bpf_release+0x1e/0x4d0
? cgroup_bpf_release+0xcf/0x4d0
? process_scheduled_works+0x161/0x8a0
? cgroup_bpf_release+0x1e/0x4d0
? mutex_lock_nested+0x2b/0x40
? __pfx_delay_tsc+0x10/0x10
mutex_lock_nested+0x2b/0x40
cgroup_bpf_release+0xcf/0x4d0
? process_scheduled_works+0x161/0x8a0
? trace_event_raw_event_workqueue_execute_start+0x64/0xd0
? process_scheduled_works+0x161/0x8a0
process_scheduled_works+0x23a/0x8a0
worker_thread+0x231/0x5b0
? __pfx_worker_thread+0x10/0x10
kthread+0x14d/0x1c0
? __pfx_kthread+0x10/0x10
ret_from_fork+0x59/0x70
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1b/0x30
</TASK>
This issue can be reproduced by the following methods:
1. A large number of cpuset cgroups are deleted.
2. CPUs are set offline and online repeatedly.
3. watchdog_thresh is set repeatedly.
The reason for this issue is that cgroup_mutex and cpu_hotplug_lock are
acquired in different tasks, which may lead to a deadlock.
The deadlock forms through the following steps:
1. A large number of cgroups are deleted, which queues a large number
of cgroup_bpf_release work items on system_wq. The max_active of
system_wq is WQ_DFL_ACTIVE (256). When the cgroup_bpf_release work
items cannot get cgroup_mutex, they fill up system_wq and block any
work enqueued later.
2. Setting watchdog_thresh holds cpu_hotplug_lock.read and queues an
smp_call_on_cpu work item on system_wq, which may be blocked by step 1.
3. CPU offline requires cpu_hotplug_lock.write, which is blocked by step 2.
4. When a cpuset is deleted, its cgroup release work is placed on
cgroup_destroy_wq; it holds cgroup_mutex and then tries to acquire
cpu_hotplug_lock.read, which is blocked by the cpu_hotplug_lock.write
waiter from step 3. This closes the loop and leads to a deadlock.
cgroup_destroy_wq(step4)  cpu offline(step3)        WatchDog(step2)                 system_wq(step1)
                                                                                    ......
                                                    __lockup_detector_reconfigure:
                                                    P(cpu_hotplug_lock.read)
                                                    ...
                          ...
                          percpu_down_write:
                          P(cpu_hotplug_lock.write)
                                                                                    ...256+ works
                                                                                    cgroup_bpf_release:
                                                                                    P(cgroup_mutex)
                                                    smp_call_on_cpu:
                                                    Wait system_wq
                                                                                    ...
css_killed_work_fn:
P(cgroup_mutex)
...
cpuset_css_offline:
P(cpu_hotplug_lock.read)
To fix the problem, place the cgroup_bpf_release work items on
cgroup_destroy_wq, which breaks the loop. System workqueues are for
miscellaneous things which shouldn't create a large number of
concurrent work items; if something is going to generate more than
WQ_DFL_ACTIVE (256) concurrent work items, it should use its own
dedicated workqueue.
Fixes: 4bfc0bb2c60e ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself")
Link: https://lore.kernel.org/cgroups/e90c32d2-2a85-4f28-9154-09c7d320cb60@huawei.com/T/#t
Signed-off-by: Chen Ridong <chenridong@huawei.com>
---
 kernel/bpf/cgroup.c             | 2 +-
 kernel/cgroup/cgroup-internal.h | 1 +
 kernel/cgroup/cgroup.c          | 2 +-
 3 files changed, 3 insertions(+), 2 deletions(-)
diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index 8ba73042a239..a611a1274788 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -334,7 +334,7 @@ static void cgroup_bpf_release_fn(struct percpu_ref *ref)
 	struct cgroup *cgrp = container_of(ref, struct cgroup, bpf.refcnt);
 
 	INIT_WORK(&cgrp->bpf.release_work, cgroup_bpf_release);
-	queue_work(system_wq, &cgrp->bpf.release_work);
+	queue_work(cgroup_destroy_wq, &cgrp->bpf.release_work);
 }
 
 /* Get underlying bpf_prog of bpf_prog_list entry, regardless if it's through
diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
index 520b90dd97ec..9e57f3e9316e 100644
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/cgroup-internal.h
@@ -13,6 +13,7 @@
 extern spinlock_t trace_cgroup_path_lock;
 extern char trace_cgroup_path[TRACE_CGROUP_PATH_LEN];
 extern void __init enable_debug_cgroup(void);
+extern struct workqueue_struct *cgroup_destroy_wq;
 
 /*
  * cgroup_path() takes a spin lock. It is good practice not to take
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index e32b6972c478..3317e03fe2fb 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -124,7 +124,7 @@ DEFINE_PERCPU_RWSEM(cgroup_threadgroup_rwsem);
  * destruction work items don't end up filling up max_active of system_wq
  * which may lead to deadlock.
  */
-static struct workqueue_struct *cgroup_destroy_wq;
+struct workqueue_struct *cgroup_destroy_wq;
 
 /* generate an array of cgroup subsystem pointers */
 #define SUBSYS(_x) [_x ## _cgrp_id] = &_x ## _cgrp_subsys,
--
2.34.1
* Re: [PATCH -v2] cgroup: fix deadlock caused by cgroup_mutex and cpu_hotplug_lock
From: chenridong @ 2024-07-24 0:53 UTC
To: martin.lau, ast, daniel, andrii, eddyz87, song, yonghong.song,
john.fastabend, kpsingh, sdf, haoluo, jolsa, tj, lizefan.x,
hannes, roman.gushchin
Cc: bpf, cgroups, linux-kernel
On 2024/7/19 10:52, Chen Ridong wrote:
> We found a hung_task problem as shown below:
>
> INFO: task kworker/0:0:8 blocked for more than 327 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:kworker/0:0 state:D stack:13920 pid:8 ppid:2 flags:0x00004000
> Workqueue: events cgroup_bpf_release
> Call Trace:
> <TASK>
> __schedule+0x5a2/0x2050
> ? find_held_lock+0x33/0x100
> ? wq_worker_sleeping+0x9e/0xe0
> schedule+0x9f/0x180
> schedule_preempt_disabled+0x25/0x50
> __mutex_lock+0x512/0x740
> ? cgroup_bpf_release+0x1e/0x4d0
> ? cgroup_bpf_release+0xcf/0x4d0
> ? process_scheduled_works+0x161/0x8a0
> ? cgroup_bpf_release+0x1e/0x4d0
> ? mutex_lock_nested+0x2b/0x40
> ? __pfx_delay_tsc+0x10/0x10
> mutex_lock_nested+0x2b/0x40
> cgroup_bpf_release+0xcf/0x4d0
> ? process_scheduled_works+0x161/0x8a0
> ? trace_event_raw_event_workqueue_execute_start+0x64/0xd0
> ? process_scheduled_works+0x161/0x8a0
> process_scheduled_works+0x23a/0x8a0
> worker_thread+0x231/0x5b0
> ? __pfx_worker_thread+0x10/0x10
> kthread+0x14d/0x1c0
> ? __pfx_kthread+0x10/0x10
> ret_from_fork+0x59/0x70
> ? __pfx_kthread+0x10/0x10
> ret_from_fork_asm+0x1b/0x30
> </TASK>
>
> This issue can be reproduced by the following methods:
> 1. A large number of cpuset cgroups are deleted.
> 2. Set cpu on and off repeatly.
> 3. Set watchdog_thresh repeatly.
>
> The reason for this issue is cgroup_mutex and cpu_hotplug_lock are
> acquired in different tasks, which may lead to deadlock.
> It can lead to a deadlock through the following steps:
> 1. A large number of cgroups are deleted, which will put a large
> number of cgroup_bpf_release works into system_wq. The max_active
> of system_wq is WQ_DFL_ACTIVE(256). When cgroup_bpf_release can not
> get cgroup_metux, it may cram system_wq, and it will block work
> enqueued later.
> 2. Setting watchdog_thresh will hold cpu_hotplug_lock.read and put
> smp_call_on_cpu work into system_wq. However it may be blocked by
> step 1.
> 3. Cpu offline requires cpu_hotplug_lock.write, which is blocked by step 2.
> 4. When a cpuset is deleted, cgroup release work is placed on
> cgroup_destroy_wq, it will hold cgroup_metux and acquire
> cpu_hotplug_lock.read. Acquiring cpu_hotplug_lock.read is blocked by
> cpu_hotplug_lock.write as mentioned by step 3. Finally, it forms a
> loop and leads to a deadlock.
>
> cgroup_destroy_wq(step4) cpu offline(step3) WatchDog(step2) system_wq(step1)
> ......
> __lockup_detector_reconfigure:
> P(cpu_hotplug_lock.read)
> ...
> ...
> percpu_down_write:
> P(cpu_hotplug_lock.write)
> ...256+ works
> cgroup_bpf_release:
> P(cgroup_mutex)
> smp_call_on_cpu:
> Wait system_wq
> ...
> css_killed_work_fn:
> P(cgroup_mutex)
> ...
> cpuset_css_offline:
> P(cpu_hotplug_lock.read)
>
> To fix the problem, place cgroup_bpf_release works on cgroup_destroy_wq,
> which can break the loop and solve the problem. System wqs are for misc
> things which shouldn't create a large number of concurrent work items.
> If something is going to generate >WQ_DFL_ACTIVE(256) concurrent work
> items, it should use its own dedicated workqueue.
>
> Fixes: 4bfc0bb2c60e ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself")
> Link: https://lore.kernel.org/cgroups/e90c32d2-2a85-4f28-9154-09c7d320cb60@huawei.com/T/#t
> Signed-off-by: Chen Ridong <chenridong@huawei.com>
> ---
> kernel/bpf/cgroup.c | 2 +-
> kernel/cgroup/cgroup-internal.h | 1 +
> kernel/cgroup/cgroup.c | 2 +-
> 3 files changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
> index 8ba73042a239..a611a1274788 100644
> --- a/kernel/bpf/cgroup.c
> +++ b/kernel/bpf/cgroup.c
> @@ -334,7 +334,7 @@ static void cgroup_bpf_release_fn(struct percpu_ref *ref)
> struct cgroup *cgrp = container_of(ref, struct cgroup, bpf.refcnt);
>
> INIT_WORK(&cgrp->bpf.release_work, cgroup_bpf_release);
> - queue_work(system_wq, &cgrp->bpf.release_work);
> + queue_work(cgroup_destroy_wq, &cgrp->bpf.release_work);
> }
>
> /* Get underlying bpf_prog of bpf_prog_list entry, regardless if it's through
> diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
> index 520b90dd97ec..9e57f3e9316e 100644
> --- a/kernel/cgroup/cgroup-internal.h
> +++ b/kernel/cgroup/cgroup-internal.h
> @@ -13,6 +13,7 @@
> extern spinlock_t trace_cgroup_path_lock;
> extern char trace_cgroup_path[TRACE_CGROUP_PATH_LEN];
> extern void __init enable_debug_cgroup(void);
> +extern struct workqueue_struct *cgroup_destroy_wq;
>
> /*
> * cgroup_path() takes a spin lock. It is good practice not to take
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index e32b6972c478..3317e03fe2fb 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -124,7 +124,7 @@ DEFINE_PERCPU_RWSEM(cgroup_threadgroup_rwsem);
> * destruction work items don't end up filling up max_active of system_wq
> * which may lead to deadlock.
> */
> -static struct workqueue_struct *cgroup_destroy_wq;
> +struct workqueue_struct *cgroup_destroy_wq;
>
> /* generate an array of cgroup subsystem pointers */
> #define SUBSYS(_x) [_x ## _cgrp_id] = &_x ## _cgrp_subsys,
Friendly ping.
* Re: [PATCH -v2] cgroup: fix deadlock caused by cgroup_mutex and cpu_hotplug_lock
From: Hillf Danton @ 2024-07-24 11:08 UTC
To: Chen Ridong; +Cc: Roman Gushchin, tj, bpf, cgroups, linux-kernel
On Fri, 19 Jul 2024 02:52:32 +0000 Chen Ridong <chenridong@huawei.com>
> We found a hung_task problem as shown below:
>
> INFO: task kworker/0:0:8 blocked for more than 327 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:kworker/0:0 state:D stack:13920 pid:8 ppid:2 flags:0x00004000
> Workqueue: events cgroup_bpf_release
> Call Trace:
> <TASK>
> __schedule+0x5a2/0x2050
> ? find_held_lock+0x33/0x100
> ? wq_worker_sleeping+0x9e/0xe0
> schedule+0x9f/0x180
> schedule_preempt_disabled+0x25/0x50
> __mutex_lock+0x512/0x740
> ? cgroup_bpf_release+0x1e/0x4d0
> ? cgroup_bpf_release+0xcf/0x4d0
> ? process_scheduled_works+0x161/0x8a0
> ? cgroup_bpf_release+0x1e/0x4d0
> ? mutex_lock_nested+0x2b/0x40
> ? __pfx_delay_tsc+0x10/0x10
> mutex_lock_nested+0x2b/0x40
> cgroup_bpf_release+0xcf/0x4d0
> ? process_scheduled_works+0x161/0x8a0
> ? trace_event_raw_event_workqueue_execute_start+0x64/0xd0
> ? process_scheduled_works+0x161/0x8a0
> process_scheduled_works+0x23a/0x8a0
> worker_thread+0x231/0x5b0
> ? __pfx_worker_thread+0x10/0x10
> kthread+0x14d/0x1c0
> ? __pfx_kthread+0x10/0x10
> ret_from_fork+0x59/0x70
> ? __pfx_kthread+0x10/0x10
> ret_from_fork_asm+0x1b/0x30
> </TASK>
>
> This issue can be reproduced by the following methods:
> 1. A large number of cpuset cgroups are deleted.
> 2. Set cpu on and off repeatly.
> 3. Set watchdog_thresh repeatly.
>
> The reason for this issue is cgroup_mutex and cpu_hotplug_lock are
> acquired in different tasks, which may lead to deadlock.
> It can lead to a deadlock through the following steps:
> 1. A large number of cgroups are deleted, which will put a large
> number of cgroup_bpf_release works into system_wq. The max_active
> of system_wq is WQ_DFL_ACTIVE(256). When cgroup_bpf_release can not
> get cgroup_metux, it may cram system_wq, and it will block work
> enqueued later.
> 2. Setting watchdog_thresh will hold cpu_hotplug_lock.read and put
> smp_call_on_cpu work into system_wq. However it may be blocked by
> step 1.
> 3. Cpu offline requires cpu_hotplug_lock.write, which is blocked by step 2.
> 4. When a cpuset is deleted, cgroup release work is placed on
> cgroup_destroy_wq, it will hold cgroup_metux and acquire
> cpu_hotplug_lock.read. Acquiring cpu_hotplug_lock.read is blocked by
> cpu_hotplug_lock.write as mentioned by step 3. Finally, it forms a
> loop and leads to a deadlock.
>
> cgroup_destroy_wq(step4) cpu offline(step3) WatchDog(step2) system_wq(step1)
> ......
> __lockup_detector_reconfigure:
> P(cpu_hotplug_lock.read)
> ...
> ...
> percpu_down_write:
> P(cpu_hotplug_lock.write)
> ...256+ works
> cgroup_bpf_release:
> P(cgroup_mutex)
> smp_call_on_cpu:
> Wait system_wq
> ...
> css_killed_work_fn:
> P(cgroup_mutex)
> ...
> cpuset_css_offline:
> P(cpu_hotplug_lock.read)
>
worker_thread()
    manage_workers()
        maybe_create_worker()
            create_worker()	// has nothing to do with WQ_DFL_ACTIVE
    process_scheduled_works()

Given that an idle worker is created, independent of WQ_DFL_ACTIVE, before a
work item is handled, no deadlock could arise in your scenario above.
* Re: [PATCH -v2] cgroup: fix deadlock caused by cgroup_mutex and cpu_hotplug_lock
From: chenridong @ 2024-07-25 1:48 UTC
To: Hillf Danton; +Cc: Roman Gushchin, tj, bpf, cgroups, linux-kernel
On 2024/7/24 19:08, Hillf Danton wrote:
> On Fri, 19 Jul 2024 02:52:32 +0000 Chen Ridong <chenridong@huawei.com>
>> We found a hung_task problem as shown below:
>>
>> INFO: task kworker/0:0:8 blocked for more than 327 seconds.
>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> task:kworker/0:0 state:D stack:13920 pid:8 ppid:2 flags:0x00004000
>> Workqueue: events cgroup_bpf_release
>> Call Trace:
>> <TASK>
>> __schedule+0x5a2/0x2050
>> ? find_held_lock+0x33/0x100
>> ? wq_worker_sleeping+0x9e/0xe0
>> schedule+0x9f/0x180
>> schedule_preempt_disabled+0x25/0x50
>> __mutex_lock+0x512/0x740
>> ? cgroup_bpf_release+0x1e/0x4d0
>> ? cgroup_bpf_release+0xcf/0x4d0
>> ? process_scheduled_works+0x161/0x8a0
>> ? cgroup_bpf_release+0x1e/0x4d0
>> ? mutex_lock_nested+0x2b/0x40
>> ? __pfx_delay_tsc+0x10/0x10
>> mutex_lock_nested+0x2b/0x40
>> cgroup_bpf_release+0xcf/0x4d0
>> ? process_scheduled_works+0x161/0x8a0
>> ? trace_event_raw_event_workqueue_execute_start+0x64/0xd0
>> ? process_scheduled_works+0x161/0x8a0
>> process_scheduled_works+0x23a/0x8a0
>> worker_thread+0x231/0x5b0
>> ? __pfx_worker_thread+0x10/0x10
>> kthread+0x14d/0x1c0
>> ? __pfx_kthread+0x10/0x10
>> ret_from_fork+0x59/0x70
>> ? __pfx_kthread+0x10/0x10
>> ret_from_fork_asm+0x1b/0x30
>> </TASK>
>>
>> This issue can be reproduced by the following methods:
>> 1. A large number of cpuset cgroups are deleted.
>> 2. Set cpu on and off repeatly.
>> 3. Set watchdog_thresh repeatly.
>>
>> The reason for this issue is cgroup_mutex and cpu_hotplug_lock are
>> acquired in different tasks, which may lead to deadlock.
>> It can lead to a deadlock through the following steps:
>> 1. A large number of cgroups are deleted, which will put a large
>> number of cgroup_bpf_release works into system_wq. The max_active
>> of system_wq is WQ_DFL_ACTIVE(256). When cgroup_bpf_release can not
>> get cgroup_metux, it may cram system_wq, and it will block work
>> enqueued later.
>> 2. Setting watchdog_thresh will hold cpu_hotplug_lock.read and put
>> smp_call_on_cpu work into system_wq. However it may be blocked by
>> step 1.
>> 3. Cpu offline requires cpu_hotplug_lock.write, which is blocked by step 2.
>> 4. When a cpuset is deleted, cgroup release work is placed on
>> cgroup_destroy_wq, it will hold cgroup_metux and acquire
>> cpu_hotplug_lock.read. Acquiring cpu_hotplug_lock.read is blocked by
>> cpu_hotplug_lock.write as mentioned by step 3. Finally, it forms a
>> loop and leads to a deadlock.
>>
>> cgroup_destroy_wq(step4) cpu offline(step3) WatchDog(step2) system_wq(step1)
>> ......
>> __lockup_detector_reconfigure:
>> P(cpu_hotplug_lock.read)
>> ...
>> ...
>> percpu_down_write:
>> P(cpu_hotplug_lock.write)
>> ...256+ works
>> cgroup_bpf_release:
>> P(cgroup_mutex)
>> smp_call_on_cpu:
>> Wait system_wq
>> ...
>> css_killed_work_fn:
>> P(cgroup_mutex)
>> ...
>> cpuset_css_offline:
>> P(cpu_hotplug_lock.read)
>>
> worker_thread()
> manage_workers()
> maybe_create_worker()
> create_worker() // has nothing to do with WQ_DFL_ACTIVE
> process_scheduled_works()
>
> Given idle worker created independent of WQ_DFL_ACTIVE before handling
> work item, no deadlock could rise in your scenario above.
Hello Hillf, did you mean to say this issue couldn't happen?
I wish it hadn't happened, as it took me a long time to figure out.
However, it did happen. It can be reproduced with the method I
offered; you can access the scripts via this link:
https://lore.kernel.org/cgroups/e90c32d2-2a85-4f28-9154-09c7d320cb60@huawei.com/T/#t.
It's not about how the pool's workers are created, but rather about the
max_active limit of system_wq. If system_wq reaches its max_active
limit, the work enqueued afterwards is placed on pwq->inactive_works.
In the scenario described above, the problem is that all the active
work items (cgroup_bpf_release) are blocked on cgroup_mutex, and the
inactive work item (smp_call_on_cpu) can never execute once the loop
forms; a minimal sketch of that saturation follows below.
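To make the saturation concrete, below is a minimal, hypothetical test-module
sketch (illustrative only, not part of the patch; the module and function
names are made up). It pins WQ_DFL_ACTIVE sleeping work items on one CPU's
system_wq pool and then queues one more item, which stays parked on
pwq->inactive_works until an active slot frees up:

#include <linux/module.h>
#include <linux/workqueue.h>
#include <linux/completion.h>
#include <linux/printk.h>

/* Stands in for cgroup_mutex never becoming available. */
static DECLARE_COMPLETION(release);

static struct work_struct blocked_works[WQ_DFL_ACTIVE];
static struct work_struct late_work;	/* stands in for smp_call_on_cpu's item */

static void blocked_fn(struct work_struct *w)
{
	/* Each instance occupies one of the 256 active slots while it sleeps. */
	wait_for_completion(&release);
}

static void late_fn(struct work_struct *w)
{
	pr_info("late work finally ran\n");
}

static int __init wq_sat_init(void)
{
	int i;

	/* max_active is accounted per-CPU pool for system_wq, so pin to CPU 0. */
	for (i = 0; i < WQ_DFL_ACTIVE; i++) {
		INIT_WORK(&blocked_works[i], blocked_fn);
		queue_work_on(0, system_wq, &blocked_works[i]);
	}
	/* The 257th item on the same pool sits on pwq->inactive_works. */
	INIT_WORK(&late_work, late_fn);
	queue_work_on(0, system_wq, &late_work);
	return 0;
}

static void __exit wq_sat_exit(void)
{
	int i;

	complete_all(&release);			/* let the pile-up drain */
	for (i = 0; i < WQ_DFL_ACTIVE; i++)
		flush_work(&blocked_works[i]);
	flush_work(&late_work);
}

module_init(wq_sat_init);
module_exit(wq_sat_exit);
MODULE_LICENSE("GPL");

As long as the stand-in completion is not completed, this is the state in the
hung task log above: every active slot is occupied by a work item sleeping on
a lock, and anything queued afterwards (such as smp_call_on_cpu's work item in
step 2) waits behind them.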
We have discussed this before in v1; you can find it via the Link above.
Thanks
Ridong
* Re: [PATCH -v2] cgroup: fix deadlock caused by cgroup_mutex and cpu_hotplug_lock
From: Hillf Danton @ 2024-07-25 11:01 UTC
To: Chen Ridong; +Cc: Roman Gushchin, tj, bpf, cgroups, linux-kernel
On Thu, 25 Jul 2024 09:48:36 +0800 Chen Ridong <chenridong@huawei.com>
>
> did you mean to say this issue couldn't happen?
>
1) It is a deadlock if the watchdog work cannot complete before any cgroup work.
2) It is not if you cannot trigger it with no more than a dozen cgroup works.
Not important, boy, respin with the deadlock wording erased.
* Re: [PATCH -v2] cgroup: fix deadlock caused by cgroup_mutex and cpu_hotplug_lock
From: Michal Koutný @ 2024-07-26 13:04 UTC
To: chenridong; +Cc: Hillf Danton, Roman Gushchin, tj, bpf, cgroups, linux-kernel
Hello.
On Thu, Jul 25, 2024 at 09:48:36AM GMT, chenridong <chenridong@huawei.com> wrote:
> > > This issue can be reproduced by the following methods:
> > > 1. A large number of cpuset cgroups are deleted.
> > > 2. Set cpu on and off repeatly.
> > > 3. Set watchdog_thresh repeatly.
BTW I assume this is some stress testing, not a regular use scenario of
yours, right?
> > >
> > > The reason for this issue is cgroup_mutex and cpu_hotplug_lock are
> > > acquired in different tasks, which may lead to deadlock.
> > > It can lead to a deadlock through the following steps:
> > > 1. A large number of cgroups are deleted, which will put a large
> > > number of cgroup_bpf_release works into system_wq. The max_active
> > > of system_wq is WQ_DFL_ACTIVE(256). When cgroup_bpf_release can not
> > > get cgroup_metux, it may cram system_wq, and it will block work
> > > enqueued later.
Who'd be the holder of cgroup_mutex preventing cgroup_bpf_release from
progress? (That's not clear to me from your diagram.)
...
> > Given idle worker created independent of WQ_DFL_ACTIVE before handling
> > work item, no deadlock could rise in your scenario above.
>
> Hello Hillf, did you mean to say this issue couldn't happen?
Ridong, can you reproduce this with CONFIG_PROVE_LOCKING (or do you have
a lockdep message from it already)? It'd be helpful to get insight into
the suspected dependencies.
Thanks,
Michal
* Re: [PATCH -v2] cgroup: fix deadlock caused by cgroup_mutex and cpu_hotplug_lock
From: chenridong @ 2024-07-27 10:21 UTC
To: Michal Koutný
Cc: Hillf Danton, Roman Gushchin, tj, bpf, cgroups, linux-kernel
On 2024/7/26 21:04, Michal Koutný wrote:
> Hello.
>
> On Thu, Jul 25, 2024 at 09:48:36AM GMT, chenridong <chenridong@huawei.com> wrote:
>>>> This issue can be reproduced by the following methods:
>>>> 1. A large number of cpuset cgroups are deleted.
>>>> 2. Set cpu on and off repeatly.
>>>> 3. Set watchdog_thresh repeatly.
>
> BTW I assume this is some stress testing, not a regular use scenario of
> yours, right?
>
Yes, I have offered the scripts in Link(V1).
>>>>
>>>> The reason for this issue is cgroup_mutex and cpu_hotplug_lock are
>>>> acquired in different tasks, which may lead to deadlock.
>>>> It can lead to a deadlock through the following steps:
>>>> 1. A large number of cgroups are deleted, which will put a large
>>>> number of cgroup_bpf_release works into system_wq. The max_active
>>>> of system_wq is WQ_DFL_ACTIVE(256). When cgroup_bpf_release can not
>>>> get cgroup_metux, it may cram system_wq, and it will block work
>>>> enqueued later.
>
> Who'd be the holder of cgroup_mutex preventing cgroup_bpf_release from
> progress? (That's not clear to me from your diagram.)
>
This is a cumulative process. The stress testing deletes a large number
of cgroups, and cgroup_bpf_release is asynchronous, competing with the
cgroup release works. As you know, cgroup_mutex is used in many places.
Eventually, the number of `cgroup_bpf_release` work items in system_wq
accumulates up to 256, and that leads to this issue.
> ...
>>> Given idle worker created independent of WQ_DFL_ACTIVE before handling
>>> work item, no deadlock could rise in your scenario above.
>>
>> Hello Hillf, did you mean to say this issue couldn't happen?
>
> Ridong, can you reproduce this with CONFIG_PROVE_LOCKING (or do you have
> lockdep message from it aready)? It'd be helpful to get insight into
> the suspected dependencies.
>
> Thanks,
> Michal
Here is a part of the logs; I hope that is enough.
Thanks.
[ 955.972520] Not tainted 6.10.0-rc6-00163-g661e504db04c-dirty #135
[ 955.976820] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 955.979235] task:kworker/0:0 state:D stack:14824 pid:8 tgid:8
ppid:2 flags:00
[ 955.982879] Workqueue: events cgroup_bpf_release
[ 955.985403] Call Trace:
[ 955.986851] <TASK>
[ 955.988108] __schedule+0x393/0x1650
[ 955.990202] ? find_held_lock+0x2b/0x80
[ 955.991481] ? find_held_lock+0x2b/0x80
[ 955.992734] schedule+0x3e/0x130
[ 955.993661] schedule_preempt_disabled+0x15/0x30
[ 955.995196] __mutex_lock+0x6a7/0xce0
[ 955.996443] ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
[ 955.997280] ? cgroup_bpf_release+0x5c/0x360
[ 955.999220] ? cgroup_bpf_release+0x5c/0x360
[ 956.000452] ? __pfx_delay_tsc+0x10/0x10
[ 956.001390] cgroup_bpf_release+0x5c/0x360
[ 956.002865] ? trace_event_raw_event_workqueue_execute_start+0x52/0x90
[ 956.004836] process_one_work+0x1f0/0x610
[ 956.005974] worker_thread+0x183/0x340
[ 956.007115] ? __pfx_worker_thread+0x10/0x10
[ 956.008276] kthread+0xd6/0x110
[ 956.009291] ? __pfx_kthread+0x10/0x10
[ 956.010474] ret_from_fork+0x34/0x50
[ 956.011581] ? __pfx_kthread+0x10/0x10
[ 956.012741] ret_from_fork_asm+0x1a/0x30
[ 956.014069] </TASK>
[ 956.014875] INFO: task kworker/0:1:9 blocked for more than 327 seconds.
[ 956.016724] Not tainted 6.10.0-rc6-00163-g661e504db04c-dirty #135
[ 956.018531] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 956.020718] task:kworker/0:1 state:D stack:14648 pid:9 tgid:9
ppid:2 flags:00
[ 956.021664] Workqueue: events cgroup_bpf_release
[ 956.024475] Call Trace:
[ 956.025225] <TASK>
[ 956.025913] __schedule+0x393/0x1650
[ 956.026947] ? find_held_lock+0x2b/0x80
[ 956.027862] ? find_held_lock+0x2b/0x80
[ 956.029165] schedule+0x3e/0x130
[ 956.030012] schedule_preempt_disabled+0x15/0x30
[ 956.031474] __mutex_lock+0x6a7/0xce0
[ 956.032729] ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
[ 956.034174] ? cgroup_bpf_release+0x5c/0x360
[ 956.035561] ? cgroup_bpf_release+0x5c/0x360
[ 956.036839] ? __pfx_delay_tsc+0x10/0x10
[ 956.038046] cgroup_bpf_release+0x5c/0x360
[ 956.039261] ? trace_event_raw_event_workqueue_execute_start+0x52/0x90
[ 956.041119] process_one_work+0x1f0/0x610
[ 956.042348] worker_thread+0x183/0x340
[ 956.043531] ? __pfx_worker_thread+0x10/0x10
[ 956.044993] kthread+0xd6/0x110
[ 956.045171] ? __pfx_kthread+0x10/0x10
[ 956.045306] ret_from_fork+0x34/0x50
[ 956.045428] ? __pfx_kthread+0x10/0x10
[ 956.045552] ret_from_fork_asm+0x1a/0x30
[ 956.046025] </TASK>
[ 956.046162] INFO: task kworker/2:0:30 blocked for more than 327 seconds.
[ 956.046400] Not tainted 6.10.0-rc6-00163-g661e504db04c-dirty #135
[ 956.046566] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 956.047068] task:kworker/2:0 state:D stack:14152 pid:30
tgid:30 ppid:2 flags:00
[ 956.047377] Workqueue: events cgroup_bpf_release
[ 956.047539] Call Trace:
[ 956.047809] <TASK>
[ 956.047927] __schedule+0x393/0x1650
[ 956.048068] ? find_held_lock+0x2b/0x80
[ 956.048224] ? find_held_lock+0x2b/0x80
[ 956.048452] schedule+0x3e/0x130
[ 956.048561] schedule_preempt_disabled+0x15/0x30
[ 956.048881] __mutex_lock+0x6a7/0xce0
[ 956.049039] ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
[ 956.049203] ? cgroup_bpf_release+0x5c/0x360
[ 956.049441] ? cgroup_bpf_release+0x5c/0x360
[ 956.049584] ? __pfx_delay_tsc+0x10/0x10
[ 956.049887] cgroup_bpf_release+0x5c/0x360
[ 956.050038] ? trace_event_raw_event_workqueue_execute_start+0x52/0x90
[ 956.050247] process_one_work+0x1f0/0x610
[ 956.050395] worker_thread+0x183/0x340
[ 956.050570] ? __pfx_worker_thread+0x10/0x10
[ 956.050925] kthread+0xd6/0x110
[ 956.051049] ? __pfx_kthread+0x10/0x10
[ 956.051177] ret_from_fork+0x34/0x50
[ 956.051294] ? __pfx_kthread+0x10/0x10
[ 956.051414] ret_from_fork_asm+0x1a/0x30
[ 956.051604] </TASK>
[ 956.051804] INFO: task kworker/3:0:35 blocked for more than 327 seconds.
[ 956.052038] Not tainted 6.10.0-rc6-00163-g661e504db04c-dirty #135
[ 956.052222] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 956.052424] task:kworker/3:0 state:D stack:14968 pid:35
tgid:35 ppid:2 flags:00
[ 956.052964] Workqueue: events cgroup_bpf_release
[ 956.053199] Call Trace:
[ 956.053304] <TASK>
[ 956.053397] __schedule+0x393/0x1650
[ 956.053536] ? find_held_lock+0x2b/0x80
[ 956.053801] ? find_held_lock+0x2b/0x80
[ 956.054017] schedule+0x3e/0x130
[ 956.054166] schedule_preempt_disabled+0x15/0x30
[ 956.054312] __mutex_lock+0x6a7/0xce0
[ 956.054454] ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
[ 956.054645] ? cgroup_bpf_release+0x5c/0x360
[ 956.055025] ? cgroup_bpf_release+0x5c/0x360
[ 956.055200] ? __pfx_delay_tsc+0x10/0x10
[ 956.055347] cgroup_bpf_release+0x5c/0x360
[ 956.055479] ? trace_event_raw_event_workqueue_execute_start+0x52/0x90
[ 956.055846] process_one_work+0x1f0/0x610
[ 956.056026] worker_thread+0x183/0x340
[ 956.056211] ? __pfx_worker_thread+0x10/0x10
[ 956.056354] kthread+0xd6/0x110
[ 956.056511] ? __pfx_kthread+0x10/0x10
[ 956.056636] ret_from_fork+0x34/0x50
[ 956.056949] ? __pfx_kthread+0x10/0x10
[ 956.057091] ret_from_fork_asm+0x1a/0x30
[ 956.057315] </TASK>
[ 956.057430] INFO: task kworker/2:1:48 blocked for more than 327 seconds.
[ 956.057602] Not tainted 6.10.0-rc6-00163-g661e504db04c-dirty #135
[ 956.058009] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 956.058291] task:kworker/2:1 state:D stack:15016 pid:48
tgid:48 ppid:2 flags:00
[ 956.058555] Workqueue: events cgroup_bpf_release
[ 956.058727] Call Trace:
[ 956.058984] <TASK>
[ 956.059082] __schedule+0x393/0x1650
[ 956.059260] ? find_held_lock+0x2b/0x80
[ 956.059382] ? find_held_lock+0x2b/0x80
[ 956.059524] schedule+0x3e/0x130
[ 956.059636] schedule_preempt_disabled+0x15/0x30
[ 956.059965] __mutex_lock+0x6a7/0xce0
[ 956.060096] ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
[ 956.060344] ? cgroup_bpf_release+0x5c/0x360
[ 956.060529] ? cgroup_bpf_release+0x5c/0x360
[ 956.060828] ? __pfx_delay_tsc+0x10/0x10
[ 956.060980] cgroup_bpf_release+0x5c/0x360
[ 956.061114] ? trace_event_raw_event_workqueue_execute_start+0x52/0x90
[ 956.061412] process_one_work+0x1f0/0x610
[ 956.061614] worker_thread+0x183/0x340
[ 956.062090] ? __pfx_worker_thread+0x10/0x10
[ 956.062278] kthread+0xd6/0x110
[ 956.062395] ? __pfx_kthread+0x10/0x10
[ 956.062544] ret_from_fork+0x34/0x50
[ 956.062807] ? __pfx_kthread+0x10/0x10
[ 956.062955] ret_from_fork_asm+0x1a/0x30
[ 956.063167] </TASK>
[ 956.063276] INFO: task kworker/3:1:49 blocked for more than 327 seconds.
[ 956.063466] Not tainted 6.10.0-rc6-00163-g661e504db04c-dirty #135
[ 956.063646] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 956.064045] task:kworker/3:1 state:D stack:13592 pid:49
tgid:49 ppid:2 flags:00
[ 956.064339] Workqueue: events cgroup_bpf_release
[ 956.064488] Call Trace:
[ 956.064571] <TASK>
[ 956.064864] __schedule+0x393/0x1650
[ 956.065082] ? find_held_lock+0x2b/0x80
[ 956.065222] ? find_held_lock+0x2b/0x80
[ 956.065395] schedule+0x3e/0x130
[ 956.065521] schedule_preempt_disabled+0x15/0x30
[ 956.065945] __mutex_lock+0x6a7/0xce0
[ 956.066078] ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
[ 956.066240] ? cgroup_bpf_release+0x5c/0x360
[ 956.066396] ? cgroup_bpf_release+0x5c/0x360
[ 956.066542] ? __pfx_delay_tsc+0x10/0x10
[ 956.066777] cgroup_bpf_release+0x5c/0x360
[ 956.067008] ? trace_event_raw_event_workqueue_execute_start+0x52/0x90
[ 956.067225] process_one_work+0x1f0/0x610
[ 956.067411] worker_thread+0x183/0x340
[ 956.067554] ? __pfx_worker_thread+0x10/0x10
[ 956.067934] kthread+0xd6/0x110
[ 956.068092] ? __pfx_kthread+0x10/0x10
[ 956.068234] ret_from_fork+0x34/0x50
[ 956.068426] ? __pfx_kthread+0x10/0x10
[ 956.068554] ret_from_fork_asm+0x1a/0x30
[ 956.068962] </TASK>
[ 956.069129] INFO: task kworker/3:2:69 blocked for more than 327 seconds.
[ 956.069319] Not tainted 6.10.0-rc6-00163-g661e504db04c-dirty #135
[ 956.069484] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 956.069842] task:kworker/3:2 state:D stack:14576 pid:69
tgid:69 ppid:2 flags:00
[ 956.070163] Workqueue: events cgroup_bpf_release
[ 956.070312] Call Trace:
[ 956.070419] <TASK>
[ 956.070510] __schedule+0x393/0x1650
[ 956.070948] ? find_held_lock+0x2b/0x80
[ 956.071104] ? find_held_lock+0x2b/0x80
[ 956.071280] schedule+0x3e/0x130
[ 956.071415] schedule_preempt_disabled+0x15/0x30
[ 956.071574] __mutex_lock+0x6a7/0xce0
[ 956.071888] ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
[ 956.072075] ? cgroup_bpf_release+0x5c/0x360
[ 956.072233] ? cgroup_bpf_release+0x5c/0x360
[ 956.072484] ? __pfx_delay_tsc+0x10/0x10
[ 956.072623] cgroup_bpf_release+0x5c/0x360
[ 956.073091] ? trace_event_raw_event_workqueue_execute_start+0x52/0x90
[ 956.073395] process_one_work+0x1f0/0x610
[ 956.073582] worker_thread+0x183/0x340
[ 956.073981] ? __pfx_worker_thread+0x10/0x10
[ 956.074125] kthread+0xd6/0x110
[ 956.074308] ? __pfx_kthread+0x10/0x10
[ 956.074441] ret_from_fork+0x34/0x50
[ 956.074555] ? __pfx_kthread+0x10/0x10
[ 956.074834] ret_from_fork_asm+0x1a/0x30
[ 956.075067] </TASK>
[ 956.075407] INFO: task kworker/2:2:78 blocked for more than 327 seconds.
[ 956.075593] Not tainted 6.10.0-rc6-00163-g661e504db04c-dirty #135
[ 956.076081] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 956.076375] task:kworker/2:2 state:D stack:14992 pid:78
tgid:78 ppid:2 flags:00
[ 956.076814] Workqueue: events cgroup_bpf_release
[ 956.076993] Call Trace:
[ 956.077081] <TASK>
[ 956.077174] __schedule+0x393/0x1650
[ 956.077369] ? find_held_lock+0x2b/0x80
[ 956.077524] ? find_held_lock+0x2b/0x80
[ 956.077796] schedule+0x3e/0x130
[ 956.077935] schedule_preempt_disabled+0x15/0x30
[ 956.078112] __mutex_lock+0x6a7/0xce0
[ 956.078261] ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
[ 956.078474] ? cgroup_bpf_release+0x5c/0x360
[ 956.078631] ? cgroup_bpf_release+0x5c/0x360
[ 956.079004] ? __pfx_delay_tsc+0x10/0x10
[ 956.079149] cgroup_bpf_release+0x5c/0x360
[ 956.079358] ? trace_event_raw_event_workqueue_execute_start+0x52/0x90
[ 956.079604] process_one_work+0x1f0/0x610
[ 956.080047] worker_thread+0x183/0x340
[ 956.080186] ? __pfx_worker_thread+0x10/0x10
[ 956.080369] kthread+0xd6/0x110
[ 956.080480] ? __pfx_kthread+0x10/0x10
[ 956.080608] ret_from_fork+0x34/0x50
[ 956.080954] ? __pfx_kthread+0x10/0x10
[ 956.081123] ret_from_fork_asm+0x1a/0x30
[ 956.081286] </TASK>
[ 956.081386] INFO: task kworker/0:2:110 blocked for more than 327 seconds.
[ 956.081621] Not tainted 6.10.0-rc6-00163-g661e504db04c-dirty #135
[ 956.082080] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 956.082313] task:kworker/0:2 state:D stack:14856 pid:110
tgid:110 ppid:2 flags:00
[ 956.082643] Workqueue: events cgroup_bpf_release
[ 956.082995] Call Trace:
[ 956.083091] <TASK>
[ 956.083226] __schedule+0x393/0x1650
[ 956.083366] ? find_held_lock+0x2b/0x80
[ 956.083531] ? find_held_lock+0x2b/0x80
[ 956.083977] schedule+0x3e/0x130
[ 956.084161] schedule_preempt_disabled+0x15/0x30
[ 956.084314] __mutex_lock+0x6a7/0xce0
[ 956.084450] ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
[ 956.084613] ? cgroup_bpf_release+0x5c/0x360
[ 956.085053] ? cgroup_bpf_release+0x5c/0x360
[ 956.085200] ? __pfx_delay_tsc+0x10/0x10
[ 956.085340] cgroup_bpf_release+0x5c/0x360
[ 956.085508] ? trace_event_raw_event_workqueue_execute_start+0x52/0x90
[ 956.085919] process_one_work+0x1f0/0x610
[ 956.086098] worker_thread+0x183/0x340
[ 956.086235] ? __pfx_worker_thread+0x10/0x10
[ 956.086367] kthread+0xd6/0x110
[ 956.086478] ? __pfx_kthread+0x10/0x10
[ 956.086610] ret_from_fork+0x34/0x50
[ 956.087015] ? __pfx_kthread+0x10/0x10
[ 956.087156] ret_from_fork_asm+0x1a/0x30
[ 956.087368] </TASK>
[ 956.087456] INFO: task kworker/0:3:111 blocked for more than 327 seconds.
[ 956.087631] Not tainted 6.10.0-rc6-00163-g661e504db04c-dirty #135
[ 956.088141] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 956.088397] task:kworker/0:3 state:D stack:14312 pid:111
tgid:111 ppid:2 flags:00
[ 956.088828] Workqueue: events cgroup_bpf_release
[ 956.089092] Call Trace:
[ 956.089200] <TASK>
[ 956.089308] __schedule+0x393/0x1650
[ 956.089480] ? find_held_lock+0x2b/0x80
[ 956.089624] ? find_held_lock+0x2b/0x80
[ 956.090053] schedule+0x3e/0x130
[ 956.090205] schedule_preempt_disabled+0x15/0x30
[ 956.090383] __mutex_lock+0x6a7/0xce0
[ 956.090503] ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
[ 956.090889] ? cgroup_bpf_release+0x5c/0x360
[ 956.091069] ? cgroup_bpf_release+0x5c/0x360
[ 956.091211] ? __pfx_delay_tsc+0x10/0x10
[ 956.091368] cgroup_bpf_release+0x5c/0x360
[ 956.091498] ? trace_event_raw_event_workqueue_execute_start+0x52/0x90
[ 956.092000] process_one_work+0x1f0/0x610
[ 956.092216] worker_thread+0x183/0x340
[ 956.092417] ? __pfx_worker_thread+0x10/0x10
[ 956.092558] kthread+0xd6/0x110
[ 956.092883] ? __pfx_kthread+0x10/0x10
[ 956.093102] ret_from_fork+0x34/0x50
[ 956.093229] ? __pfx_kthread+0x10/0x10
[ 956.093363] ret_from_fork_asm+0x1a/0x30
[ 956.093547] </TASK>
[ 956.093642] Future hung task reports are suppressed, see sysctl
kernel.hung_task_warnings
[ 956.139943]
[ 956.139943] Showing all locks held in the system:
[ 956.140446] 4 locks held by systemd/1:
[ 956.140645] #0: ffff8881021503f8 (sb_writers#7){....}-{0:0}, at:
do_rmdir+0xde/0x1b0
[ 956.141933] #1: ffff8881025c1350
(&type->i_mutex_dir_key#7/1){....}-{3:3}, at: do_rmdir+0x100
[ 956.142365] #2: ffff888105dd17d0
(&type->i_mutex_dir_key#8){....}-{3:3}, at: vfs_rmdir+0x5b/0
[ 956.142887] #3: ffffffff82b67d08 (cgroup_mutex){....}-{3:3}, at:
cgroup_kn_lock_live+0x4e/0x0
[ 956.143460] 3 locks held by kworker/0:0/8:
[ 956.143825] #0: ffff888100064548
((wq_completion)events){....}-{0:0}, at: process_one_work+00
[ 956.144198] #1: ffffc9000004be60
((work_completion)(&cgrp->bpf.release_work)){....}-{0:0}, a0
[ 956.144560] #2: ffffffff82b67d08 (cgroup_mutex){....}-{3:3}, at:
cgroup_bpf_release+0x5c/0x30
[ 956.145087] 3 locks held by kworker/0:1/9:
[ 956.145256] #0: ffff888100064548
((wq_completion)events){....}-{0:0}, at: process_one_work+00
[ 956.145547] #1: ffffc90000053e60
((work_completion)(&cgrp->bpf.release_work)){....}-{0:0}, a0
[ 956.146180] #2: ffffffff82b67d08 (cgroup_mutex){....}-{3:3}, at:
cgroup_bpf_release+0x5c/0x30
[ 956.146556] 3 locks held by kworker/2:0/30:
[ 956.146926] #0: ffff888100064548
((wq_completion)events){....}-{0:0}, at: process_one_work+00
[ 956.147253] #1: ffffc90000117e60
((work_completion)(&cgrp->bpf.release_work)){....}-{0:0}, a0
[ 956.147632] #2: ffffffff82b67d08 (cgroup_mutex){....}-{3:3}, at:
cgroup_bpf_release+0x5c/0x30
[ 956.148195] 3 locks held by kworker/3:0/35:
[ 956.148331] #0: ffff888100064548
((wq_completion)events){....}-{0:0}, at: process_one_work+00
[ 956.148826] #1: ffffc90000143e60
((work_completion)(&cgrp->bpf.release_work)){....}-{0:0}, a0
[ 956.149171] #2: ffffffff82b67d08 (cgroup_mutex){....}-{3:3}, at:
cgroup_bpf_release+0x5c/0x30
[ 956.149487] 1 lock held by khungtaskd/40:
[ 956.149624] #0: ffffffff82b58840 (rcu_read_lock){....}-{1:2}, at:
debug_show_all_locks+0x36/0
[ 956.150097] 3 locks held by kworker/2:1/48:
[ 956.150252] #0: ffff888100064548
((wq_completion)events){....}-{0:0}, at: process_one_work+00
[ 956.150541] #1: ffffc900001afe60
((work_completion)(&cgrp->bpf.release_work)){....}-{0:0}, a0
[ 956.151173] #2: ffffffff82b67d08 (cgroup_mutex){....}-{3:3}, at:
cgroup_bpf_release+0x5c/0x30
[ 956.151457] 3 locks held by kworker/3:1/49:
[ 956.151619] #0: ffff888100064548
((wq_completion)events){....}-{0:0}, at: process_one_work+00
[ 956.152091] #1: ffffc900001b7e60
((work_completion)(&cgrp->bpf.release_work)){....}-{0:0}, a0
[ 956.152475] #2: ffffffff82b67d08 (cgroup_mutex){....}-{3:3}, at:
cgroup_bpf_release+0x5c/0x30
[ 956.153026] 3 locks held by kworker/3:2/69:
[ 956.153156] #0: ffff888100064548
((wq_completion)events){....}-{0:0}, at: process_one_work+00
[ 956.153471] #1: ffffc90000dafe60
((work_completion)(&cgrp->bpf.release_work)){....}-{0:0}, a0
[ 956.154094] #2: ffffffff82b67d08 (cgroup_mutex){....}-{3:3}, at:
cgroup_bpf_release+0x5c/0x30
[ 956.154430] 3 locks held by kworker/2:2/78:
[ 956.154589] #0: ffff888100064548
((wq_completion)events){....}-{0:0}, at: process_one_work+00
[ 956.155061] #1: ffffc90000d5fe60
((work_completion)(&cgrp->bpf.release_work)){....}-{0:0}, a0
[ 956.156230] #2: ffffffff82b67d08 (cgroup_mutex){....}-{3:3}, at:
cgroup_bpf_release+0x5c/0x30
[ 956.156638] 3 locks held by kworker/0:2/110:
[ 956.157016] #0: ffff888100064548
((wq_completion)events){....}-{0:0}, at: process_one_work+00
[ 956.157357] #1: ffffc90000e43e60
((work_completion)(&cgrp->bpf.release_work)){....}-{0:0}, a0
[ 956.157983] #2: ffffffff82b67d08 (cgroup_mutex){....}-{3:3}, at:
cgroup_bpf_release+0x5c/0x30
[ 956.158303] 3 locks held by kworker/0:3/111:
[ 956.158425] #0: ffff888100064548
((wq_completion)events){....}-{0:0}, at: process_one_work+00
[ 956.159041] #1: ffffc90000e53e60
((work_completion)(&cgrp->bpf.release_work)){....}-{0:0}, a0
[ 956.159390] #2: ffffffff82b67d08 (cgroup_mutex){....}-{3:3}, at:
cgroup_bpf_release+0x5c/0x30
[ 956.159832] 3 locks held by kworker/3:3/113:
[ 956.160005] #0: ffff888100064548
((wq_completion)events){....}-{0:0}, at: process_one_work+00
[ 956.160357] #1: ffffc90000e6fe60
((work_completion)(&cgrp->bpf.release_work)){....}-{0:0}, a0
[ 956.160912] #2: ffffffff82b67d08 (cgroup_mutex){....}-{3:3}, at:
cgroup_bpf_release+0x5c/0x30
[ 956.161235] 3 locks held by kworker/0:4/122:
[ 956.161388] #0: ffff888100064548
((wq_completion)events){....}-{0:0}, at: process_one_work+00
[ 956.161816] #1: ffffc90000ebbe60
((work_completion)(&cgrp->bpf.release_work)){....}-{0:0}, a0
[ 956.162164] #2: ffffffff82b67d08 (cgroup_mutex){....}-{3:3}, at:
cgroup_bpf_release+0x5c/0x30
[ 956.162486] 3 locks held by kworker/0:5/124:
[ 956.162605] #0: ffff888100064548
((wq_completion)events){....}-{0:0}, at: process_one_work+00
[ 956.163015] #1: ffffc90000ec7e60
((work_completion)(&cgrp->bpf.release_work)){....}-{0:0}, a0
[ 956.163406] #2: ffffffff82b67d08 (cgroup_mutex){....}-{3:3}, at:
cgroup_bpf_release+0x5c/0x30
[ 956.163890] 3 locks held by kworker/3:4/126:
[ 956.164045] #0: ffff888100064548
((wq_completion)events){....}-{0:0}, at: process_one_work+00
[ 956.164322] #1: ffffc90000ed7e60
((work_completion)(&cgrp->bpf.release_work)){....}-{0:0}, a0
[ 956.164906] #2: ffffffff82b67d08 (cgroup_mutex){....}-{3:3}, at:
cgroup_bpf_release+0x5c/0x30
[ 956.165331] 4 locks held by systemd-udevd/127:
[ 956.165463] #0: ffff888103e376f8 (&p->lock){....}-{3:3}, at:
seq_read_iter+0x59/0x4c0
[ 956.166017] #1: ffff888111081488 (&of->mutex){....}-{3:3}, at:
kernfs_seq_start+0x27/0x110
[ 956.166409] #2: ffff8881008774d8 (kn->active#11){....}-{0:0}, at:
kernfs_seq_start+0x2f/0x110
[ 956.166953] #3: ffff888237cae410 (&dev->mutex){....}-{3:3}, at:
uevent_show+0x99/0x120
[ 956.167248] 3 locks held by kworker/3:5/128:
[ 956.167403] #0: ffff888100064548
((wq_completion)events){....}-{0:0}, at: process_one_work+00
[ 956.167965] #1: ffffc90000e9be60
((work_completion)(&cgrp->bpf.release_work)){....}-{0:0}, a0
[ 956.168292] #2: ffffffff82b67d08 (cgroup_mutex){....}-{3:3}, at:
cgroup_bpf_release+0x5c/0x30
[ 956.168611] 3 locks held by kworker/0:6/132:
[ 956.168952] #0: ffff888100064548
((wq_completion)events){....}-{0:0}, at: process_one_work+00
[ 956.169238] #1: ffffc90000e77e60
((work_completion)(&cgrp->bpf.release_work)){....}-{0:0}, a0
[ 956.169578] #2: ffffffff82b67d08 (cgroup_mutex){....}-{3:3}, at:
cgroup_bpf_release+0x5c/0x30
[ 956.170047] 3 locks held by kworker/2:3/173:
[ 956.170179] #0: ffff888100064548
((wq_completion)events){....}-{0:0}, at: process_one_work+00
[ 956.170485] #1: ffffc90000e0fe60
((work_completion)(&cgrp->bpf.release_work)){....}-{0:0}, a0
[ 956.171090] #2: ffffffff82b67d08 (cgroup_mutex){....}-{3:3}, at:
cgroup_bpf_release+0x5c/0x30
[ 956.171455] 3 locks held by kworker/2:4/182:
[ 956.171603] #0: ffff888100064548
((wq_completion)events){....}-{0:0}, at: process_one_work+00
[ 956.172039] #1: ffffc90000f57e60
((work_completion)(&cgrp->bpf.release_work)){....}-{0:0}, a0
[ 956.172365] #2: ffffffff82b67d08 (cgroup_mutex){....}-{3:3}, at:
cgroup_bpf_release+0x5c/0x30
[ 956.172801] 1 lock held by in:imklog/204:
[ 956.173012] 1 lock held by sshd/346:
[ 956.173153] 4 locks held by systemd-udevd/388:
[ 956.173299] #0: ffff888103e378c8 (&p->lock){....}-{3:3}, at:
seq_read_iter+0x59/0x4c0
[ 956.173633] #1: ffff888111080088 (&of->mutex){....}-{3:3}, at:
kernfs_seq_start+0x27/0x110
[ 956.174083] #2: ffff8881008774d8 (kn->active#11){....}-{0:0}, at:
kernfs_seq_start+0x2f/0x110
[ 956.174410] #3: ffff888237cae410 (&dev->mutex){....}-{3:3}, at:
uevent_show+0x99/0x120
[ 956.174882] 7 locks held by cpu_up_down.sh/389:
[ 956.175110] #0: ffff8881021413f8 (sb_writers#5){....}-{0:0}, at:
ksys_write+0x69/0xf0
[ 956.175440] #1: ffff888111080288 (&of->mutex#2){....}-{3:3}, at:
kernfs_fop_write_iter+0xf7/0
[ 956.176030] #2: ffff8881008776e8 (kn->active#54){....}-{0:0}, at:
kernfs_fop_write_iter+0xff0
[ 956.176355] #3: ffffffff82c12ce8 (device_hotplug_lock){....}-{3:3},
at: online_store+0x42/0x0
[ 956.176623] #4: ffff888237cae410 (&dev->mutex){....}-{3:3}, at:
device_online+0x24/0x90
[ 956.177126] #5: ffffffff82a67a48 (cpu_add_remove_lock){....}-{3:3},
at: cpu_up+0x31/0xb0
[ 956.177436] #6: ffffffff82a679b0 (cpu_hotplug_lock){....}-{0:0}, at:
_cpu_up+0x32/0x1e0
[ 956.178182] 3 locks held by watchdog.sh/391:
[ 956.178362] #0: ffff8881021403f8 (sb_writers#4){....}-{0:0}, at:
ksys_write+0x69/0xf0
[ 956.179092] #1: ffffffff82b6ea48 (watchdog_mutex){....}-{3:3}, at:
proc_watchdog_thresh+0x2f0
[ 956.179377] #2: ffffffff82a679b0 (cpu_hotplug_lock){....}-{0:0}, at:
__lockup_detector_recon0
[ 956.179996] 3 locks held by kworker/2:5/4630:
[ 956.180188] #0: ffff888100064548
((wq_completion)events){....}-{0:0}, at: process_one_work+00
[ 956.180503] #1: ffffc90007cebe60
((work_completion)(&cgrp->bpf.release_work)){....}-{0:0}, a0
[ 956.181158] #2: ffffffff82b67d08 (cgroup_mutex){....}-{3:3}, at:
cgroup_bpf_release+0x5c/0x30
[ 956.181517] 3 locks held by kworker/3:6/4635:
[ 956.181640] #0: ffff888100064548
((wq_completion)events){....}-{0:0}, at: process_one_work+00
[ 956.182118] #1: ffffc90007803e60
((work_completion)(&cgrp->bpf.release_work)){....}-{0:0}, a0
[ 956.182542] #2: ffffffff82b67d08 (cgroup_mutex){....}-{3:3}, at:
cgroup_bpf_release+0x5c/0x30
[ 956.183216] 3 locks held by kworker/0:7/4637:
[ 956.183348] #0: ffff888100064548
((wq_completion)events){....}-{0:0}, at: process_one_work+00
[ 956.183827] #1: ffffc90007ba3e60
((work_completion)(&cgrp->bpf.release_work)){....}-{0:0}, a0
[ 956.184228] #2: ffffffff82b67d08 (cgroup_mutex){....}-{3:3}, at:
cgroup_bpf_release+0x5c/0x30
[ 956.184523] 3 locks held by kworker/2:6/4640:
[ 956.184812] #0: ffff888100064548
((wq_completion)events){....}-{0:0}, at: process_one_work+00
[ 956.185162] #1: ffffc90007bf3e60
((work_completion)(&cgrp->bpf.release_work)){....}-{0:0}, a0
[ 956.185518] #2: ffffffff82b67d08 (cgroup_mutex){....}-{3:3}, at:
cgroup_bpf_release+0x5c/0x30
[ 956.186047] 3 locks held by kworker/3:7/4641:
[ 956.186211] #0: ffff888100064548
((wq_completion)events){....}-{0:0}, at: process_one_work+00
[ 956.186508] #1: ffffc90007d33e60
((work_completion)(&cgrp->bpf.release_work)){....}-{0:0}, a0
[ 956.187106] #2: ffffffff82b67d08 (cgroup_mutex){....}-{3:3}, at:
cgroup_bpf_release+0x5c/0x30
[ 956.187443] 3 locks held by kworker/0:8/4642:
[ 956.187565] #0: ffff888100064548
((wq_completion)events){....}-{0:0}, at: process_one_work+00
[ 956.188177] #1: ffffc90007d3be60
((work_completion)(&cgrp->bpf.release_work)){....}-{0:0}, a0
[ 956.188589] #2: ffffffff82b67d08 (cgroup_mutex){....}-{3:3}, at:
cgroup_bpf_release+0x5c/0x30
[ 956.189051] 3 locks held by kworker/2:7/4643:
* Re: [PATCH -v2] cgroup: fix deadlock caused by cgroup_mutex and cpu_hotplug_lock
From: chenridong @ 2024-08-01 1:34 UTC
To: martin.lau, ast, daniel, andrii, eddyz87, song, yonghong.song,
john.fastabend, kpsingh, sdf, haoluo, jolsa, tj, lizefan.x,
hannes, roman.gushchin, Michal Koutný
Cc: bpf, cgroups, linux-kernel
On 2024/7/24 8:53, chenridong wrote:
>
>
> On 2024/7/19 10:52, Chen Ridong wrote:
>> We found a hung_task problem as shown below:
>>
>> INFO: task kworker/0:0:8 blocked for more than 327 seconds.
>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> task:kworker/0:0 state:D stack:13920 pid:8 ppid:2
>> flags:0x00004000
>> Workqueue: events cgroup_bpf_release
>> Call Trace:
>> <TASK>
>> __schedule+0x5a2/0x2050
>> ? find_held_lock+0x33/0x100
>> ? wq_worker_sleeping+0x9e/0xe0
>> schedule+0x9f/0x180
>> schedule_preempt_disabled+0x25/0x50
>> __mutex_lock+0x512/0x740
>> ? cgroup_bpf_release+0x1e/0x4d0
>> ? cgroup_bpf_release+0xcf/0x4d0
>> ? process_scheduled_works+0x161/0x8a0
>> ? cgroup_bpf_release+0x1e/0x4d0
>> ? mutex_lock_nested+0x2b/0x40
>> ? __pfx_delay_tsc+0x10/0x10
>> mutex_lock_nested+0x2b/0x40
>> cgroup_bpf_release+0xcf/0x4d0
>> ? process_scheduled_works+0x161/0x8a0
>> ? trace_event_raw_event_workqueue_execute_start+0x64/0xd0
>> ? process_scheduled_works+0x161/0x8a0
>> process_scheduled_works+0x23a/0x8a0
>> worker_thread+0x231/0x5b0
>> ? __pfx_worker_thread+0x10/0x10
>> kthread+0x14d/0x1c0
>> ? __pfx_kthread+0x10/0x10
>> ret_from_fork+0x59/0x70
>> ? __pfx_kthread+0x10/0x10
>> ret_from_fork_asm+0x1b/0x30
>> </TASK>
>>
>> This issue can be reproduced by the following methods:
>> 1. A large number of cpuset cgroups are deleted.
>> 2. Set cpu on and off repeatly.
>> 3. Set watchdog_thresh repeatly.
>>
>> The reason for this issue is cgroup_mutex and cpu_hotplug_lock are
>> acquired in different tasks, which may lead to deadlock.
>> It can lead to a deadlock through the following steps:
>> 1. A large number of cgroups are deleted, which will put a large
>> number of cgroup_bpf_release works into system_wq. The max_active
>> of system_wq is WQ_DFL_ACTIVE(256). When cgroup_bpf_release can not
>> get cgroup_metux, it may cram system_wq, and it will block work
>> enqueued later.
>> 2. Setting watchdog_thresh will hold cpu_hotplug_lock.read and put
>> smp_call_on_cpu work into system_wq. However it may be blocked by
>> step 1.
>> 3. Cpu offline requires cpu_hotplug_lock.write, which is blocked by
>> step 2.
>> 4. When a cpuset is deleted, cgroup release work is placed on
>> cgroup_destroy_wq, it will hold cgroup_metux and acquire
>> cpu_hotplug_lock.read. Acquiring cpu_hotplug_lock.read is blocked by
>> cpu_hotplug_lock.write as mentioned by step 3. Finally, it forms a
>> loop and leads to a deadlock.
>>
>> cgroup_destroy_wq(step4) cpu offline(step3)
>> WatchDog(step2) system_wq(step1)
>> ......
>> __lockup_detector_reconfigure:
>> P(cpu_hotplug_lock.read)
>> ...
>> ...
>> percpu_down_write:
>> P(cpu_hotplug_lock.write)
>> ...256+ works
>> cgroup_bpf_release:
>> P(cgroup_mutex)
>> smp_call_on_cpu:
>> Wait system_wq
>> ...
>> css_killed_work_fn:
>> P(cgroup_mutex)
>> ...
>> cpuset_css_offline:
>> P(cpu_hotplug_lock.read)
>>
>> To fix the problem, place cgroup_bpf_release works on cgroup_destroy_wq,
>> which can break the loop and solve the problem. System wqs are for misc
>> things which shouldn't create a large number of concurrent work items.
>> If something is going to generate >WQ_DFL_ACTIVE(256) concurrent work
>> items, it should use its own dedicated workqueue.
>>
>> Fixes: 4bfc0bb2c60e ("bpf: decouple the lifetime of cgroup_bpf from
>> cgroup itself")
>> Link:
>> https://lore.kernel.org/cgroups/e90c32d2-2a85-4f28-9154-09c7d320cb60@huawei.com/T/#t
>> Signed-off-by: Chen Ridong <chenridong@huawei.com>
>> ---
>> kernel/bpf/cgroup.c | 2 +-
>> kernel/cgroup/cgroup-internal.h | 1 +
>> kernel/cgroup/cgroup.c | 2 +-
>> 3 files changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
>> index 8ba73042a239..a611a1274788 100644
>> --- a/kernel/bpf/cgroup.c
>> +++ b/kernel/bpf/cgroup.c
>> @@ -334,7 +334,7 @@ static void cgroup_bpf_release_fn(struct
>> percpu_ref *ref)
>> struct cgroup *cgrp = container_of(ref, struct cgroup, bpf.refcnt);
>> INIT_WORK(&cgrp->bpf.release_work, cgroup_bpf_release);
>> - queue_work(system_wq, &cgrp->bpf.release_work);
>> + queue_work(cgroup_destroy_wq, &cgrp->bpf.release_work);
>> }
>> /* Get underlying bpf_prog of bpf_prog_list entry, regardless if
>> it's through
>> diff --git a/kernel/cgroup/cgroup-internal.h
>> b/kernel/cgroup/cgroup-internal.h
>> index 520b90dd97ec..9e57f3e9316e 100644
>> --- a/kernel/cgroup/cgroup-internal.h
>> +++ b/kernel/cgroup/cgroup-internal.h
>> @@ -13,6 +13,7 @@
>> extern spinlock_t trace_cgroup_path_lock;
>> extern char trace_cgroup_path[TRACE_CGROUP_PATH_LEN];
>> extern void __init enable_debug_cgroup(void);
>> +extern struct workqueue_struct *cgroup_destroy_wq;
>> /*
>> * cgroup_path() takes a spin lock. It is good practice not to take
>> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
>> index e32b6972c478..3317e03fe2fb 100644
>> --- a/kernel/cgroup/cgroup.c
>> +++ b/kernel/cgroup/cgroup.c
>> @@ -124,7 +124,7 @@ DEFINE_PERCPU_RWSEM(cgroup_threadgroup_rwsem);
>> * destruction work items don't end up filling up max_active of
>> system_wq
>> * which may lead to deadlock.
>> */
>> -static struct workqueue_struct *cgroup_destroy_wq;
>> +struct workqueue_struct *cgroup_destroy_wq;
>> /* generate an array of cgroup subsystem pointers */
>> #define SUBSYS(_x) [_x ## _cgrp_id] = &_x ## _cgrp_subsys,
>
> Friendly ping.
>
Hi Tejun, Roman, and Michal, do you have any opinions? Can this patch be
merged?
* Re: [PATCH -v2] cgroup: fix deadlock caused by cgroup_mutex and cpu_hotplug_lock
From: Michal Koutný @ 2024-08-07 13:32 UTC
To: chenridong; +Cc: Hillf Danton, Roman Gushchin, tj, bpf, cgroups, linux-kernel
Hello.
On Sat, Jul 27, 2024 at 06:21:55PM GMT, chenridong <chenridong@huawei.com> wrote:
> Yes, I have offered the scripts in Link(V1).
Thanks (and thanks for your patience).
There is no lockdep complaint about a deadlock (i.e. some circular
locking dependency). (I admit the multiple holders of cgroup_mutex
reported there confuse me; I guess that's an artifact of this lockdep
report and they could also be waiters.)
> > Who'd be the holder of cgroup_mutex preventing cgroup_bpf_release from
> > progress? (That's not clear to me from your diagram.)
> >
> This is a cumulative process. The stress testing deletes a large member of
> cgroups, and cgroup_bpf_release is asynchronous, competing with cgroup
> release works.
Those are different situations:
- waiting for one holder that's stuck for some reason (that's what we're
after),
- waiting because the mutex is contended (that's slow but progresses
eventually).
> You know, cgroup_mutex is used in many places. Finally, the number of
> `cgroup_bpf_release` instances in system_wq accumulates up to 256, and
> it leads to this issue.
Reaching max_active doesn't mean that queue_work() would block or that
the items were lost. They are only queued onto the inactive_works list.
(Remark: cgroup_destroy_wq has only max_active=1, but that apparently
doesn't stop progress should more items be queued (when cgroup_mutex is
not guarding losing references).)
---
The change on its own (deferring cgroup bpf progs removal via
cgroup_destroy_wq instead of system_wq) is sensible, since it collects
the removal of related objects together (and at the same time sharing
one cgroup_destroy_wq shouldn't cause problems).
But the reasoning in the commit message doesn't add up for me. There
isn't an obvious deadlock; I'd say the system is overloaded with
repeated calls of __lockup_detector_reconfigure() and is not in a
deadlock state -- i.e. when you stop the test, it should eventually
recover. Given that, I wouldn't put Fixes: 4bfc0bb2c60e there either.
(One could symmetrically argue to move smp_call_on_cpu() away from
system_wq instead of cgroup_bpf_release_fn().)
Honestly, I'm not sure it's worth the effort if there's no deadlock.
It's possible that I'm misunderstanding or I've missed a substantial
detail for why this could lead to a deadlock. It'd be best visible in a
sequence diagram with tasks/CPUs left-to-right and time top-down (in the
original scheme it looks like time goes right-to-left and there's the
unclear situation of the initial cgroup_mutex holder).
Thanks,
Michal
* Re: [PATCH -v2] cgroup: fix deadlock caused by cgroup_mutex and cpu_hotplug_lock
From: chenridong @ 2024-08-08 2:22 UTC
To: Michal Koutný
Cc: Hillf Danton, Roman Gushchin, tj, bpf, cgroups, linux-kernel
On 2024/8/7 21:32, Michal Koutný wrote:
> Hello.
>
> On Sat, Jul 27, 2024 at 06:21:55PM GMT, chenridong <chenridong@huawei.com> wrote:
>> Yes, I have offered the scripts in Link(V1).
>
> Thanks (and thanks for patience).
> There is no lockdep complain about a deadlock (i.e. some circular
> locking dependencies). (I admit the multiple holders of cgroup_mutex
> reported there confuse me, I guess that's an artifact of this lockdep
> report and they could be also waiters.)
>
>>> Who'd be the holder of cgroup_mutex preventing cgroup_bpf_release from
>>> progress? (That's not clear to me from your diagram.)
>>>
>> This is a cumulative process. The stress testing deletes a large member of
>> cgroups, and cgroup_bpf_release is asynchronous, competing with cgroup
>> release works.
>
> Those are different situations:
> - waiting for one holder that's stuck for some reason (that's what we're
> after),
> - waiting because the mutex is contended (that's slow but progresses
> eventually).
>
>> You know, cgroup_mutex is used in many places. Finally, the number of
>> `cgroup_bpf_release` instances in system_wq accumulates up to 256, and
>> it leads to this issue.
>
> Reaching max_active doesn't mean that queue_work() would block or the
> items were lost. They are only queued onto inactive_works list.
Yes, I agree. But what if the 256 active work items can't finish because
they are waiting for a lock? The work items on the inactive list can never
be executed.
> (Remark: cgroup_destroy_wq has only max_active=1 but it apparently
> doesn't stop progress should there be more items queued (when
> when cgroup_mutex is not guarding losing references.))
>
cgroup_destroy_wq is not stopped by cgroup_mutex; it has acquired
cgroup_mutex, but it is blocked on cpu_hotplug_lock.read, because
cpu_hotplug_lock.write is being acquired by the CPU offline process
(step 3).
> ---
>
> The change on its own (deferred cgroup bpf progs removal via
> cgroup_destroy_wq instead of system_wq) is sensible by collecting
> related objects removal together (at the same time it shouldn't cause
> problems by sharing one cgroup_destroy_wq).
>
> But the reasoning in the commit message doesn't add up to me. There
> isn't obvious deadlock, I'd say that system is overloaded with repeated
> calls of __lockup_detector_reconfigure() and it is not in deadlock
> state -- i.e. when you stop the test, it should eventually recover.
> Given that, I'd neither put Fixes: 4bfc0bb2c60e there.
If I stop the test, it never recovers. It would not need to be fixed if
it could recover.
I have to admit it is a complicated issue.
system_wq was not overloaded with __lockup_detector_reconfigure but
with cgroup_bpf_release_fn. A large number of cgroups were deleted, so
there were 256 active work items in system_wq that were all
cgroup_bpf_release_fn, and they were all blocked on cgroup_mutex.
To make it simple, just imagine the max_active of system_wq were 1:
could that result in a deadlock? If it could, then imagine that all the
work items in system_wq are the same; a minimal sketch of that thought
experiment follows below.
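To illustrate that thought experiment, here is a hypothetical sketch (all the
names are made up, and it is deliberately broken: loading it permanently wedges
one kworker and one kthread) of a max_active = 1 workqueue deadlocking in the
same shape, with a single blocked work item standing in for the 256
cgroup_bpf_release instances:

#include <linux/module.h>
#include <linux/workqueue.h>
#include <linux/mutex.h>
#include <linux/kthread.h>
#include <linux/delay.h>
#include <linux/err.h>

static struct workqueue_struct *demo_wq;
static DEFINE_MUTEX(demo_lock);			/* plays the role of cgroup_mutex */
static struct work_struct work_a, work_b;

static void work_a_fn(struct work_struct *w)
{
	mutex_lock(&demo_lock);		/* sleeps here, holding the only active slot */
	mutex_unlock(&demo_lock);
}

static void work_b_fn(struct work_struct *w)
{
	/* whatever the demo_lock holder is waiting for */
}

static int holder_fn(void *unused)
{
	mutex_lock(&demo_lock);			/* "cgroup_mutex" is held here */
	queue_work(demo_wq, &work_a);		/* activates, then blocks on demo_lock */
	msleep(100);				/* let work_a occupy the single slot */
	queue_work(demo_wq, &work_b);		/* parked on inactive_works behind work_a */
	flush_work(&work_b);			/* never returns: work_b needs work_a's
						 * slot, work_a needs demo_lock, and
						 * demo_lock is held right here */
	mutex_unlock(&demo_lock);
	return 0;
}

static int __init demo_init(void)
{
	struct task_struct *t;

	demo_wq = alloc_workqueue("demo", 0, 1);	/* max_active = 1 */
	if (!demo_wq)
		return -ENOMEM;
	INIT_WORK(&work_a, work_a_fn);
	INIT_WORK(&work_b, work_b_fn);
	t = kthread_run(holder_fn, NULL, "demo_holder");
	return IS_ERR(t) ? PTR_ERR(t) : 0;
}
module_init(demo_init);
MODULE_LICENSE("GPL");

Scale max_active back up to 256, replace demo_lock with cgroup_mutex, and
replace the flush_work() with smp_call_on_cpu() plus the cpu_hotplug_lock
readers and writer, and you get the loop shown in the diagram.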
> (One could symetrically argue to move smp_call_on_cpu() away from
> system_wq instead of cgroup_bpf_release_fn().)
>
I also agree; that is why I moved cgroup_bpf_release_fn away, since
cgroup has its own queue. As TJ said, "system wqs are for misc things
which shouldn't create a large number of concurrent work items. If
something is going to generate 256+ concurrent work items, it should
use its own workqueue." A minimal sketch of that pattern follows below.
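For reference, a small sketch of that pattern (the names here are
illustrative; cgroup_destroy_wq itself is allocated this way, with
max_active = 1, in cgroup_wq_init(), which is why the patch simply reuses it):

#include <linux/init.h>
#include <linux/workqueue.h>

static struct workqueue_struct *my_destroy_wq;

static int __init my_destroy_wq_init(void)
{
	/* A dedicated queue, so a burst of teardown items cannot saturate
	 * system_wq; max_active = 1 also keeps destruction serialized.
	 */
	my_destroy_wq = alloc_workqueue("my_destroy", 0, 1);
	return my_destroy_wq ? 0 : -ENOMEM;
}

Release paths then call queue_work(my_destroy_wq, ...) instead of
queue_work(system_wq, ...), exactly as the hunk in kernel/bpf/cgroup.c does
with cgroup_destroy_wq.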
> Honestly, I'm not sure it's worth the effort if there's no deadlock.
>
There is a deadlock, and I think it has to be fixed.
> It's possible that I'm misunderstanding or I've missed a substantial
> detail for why this could lead to a deadlock. It'd be best visible in a
> sequence diagram with tasks/CPUs left-to-right and time top-down (in the
> original scheme it looks like time goes right-to-left and there's the
> unclear situation of the initial cgroup_mutex holder).
>
> Thanks,
> Michal
I will modify the diagram, and I hope it will make it clear how this
leads to a deadlock.
Thank you, Michal, for your reply.
Thanks,
Ridong
* Re: [PATCH -v2] cgroup: fix deadlock caused by cgroup_mutex and cpu_hotplug_lock
From: Roman Gushchin @ 2024-08-08 17:03 UTC
To: Michal Koutný
Cc: chenridong, Hillf Danton, tj, bpf, cgroups, linux-kernel
On Wed, Aug 07, 2024 at 03:32:59PM +0200, Michal Koutny wrote:
> Hello.
>
> On Sat, Jul 27, 2024 at 06:21:55PM GMT, chenridong <chenridong@huawei.com> wrote:
> > Yes, I have offered the scripts in Link(V1).
>
> Thanks (and thanks for patience).
> There is no lockdep complain about a deadlock (i.e. some circular
> locking dependencies). (I admit the multiple holders of cgroup_mutex
> reported there confuse me, I guess that's an artifact of this lockdep
> report and they could be also waiters.)
>
> ...
>
> The change on its own (deferred cgroup bpf progs removal via
> cgroup_destroy_wq instead of system_wq) is sensible by collecting
> related objects removal together (at the same time it shouldn't cause
> problems by sharing one cgroup_destroy_wq).
>
> But the reasoning in the commit message doesn't add up to me. There
> isn't obvious deadlock, I'd say that system is overloaded with repeated
> calls of __lockup_detector_reconfigure() and it is not in deadlock
> state -- i.e. when you stop the test, it should eventually recover.
Thanks, Michal! I have exactly the same feelings about this change.