* [PATCH v3 0/1] Fix deadlock caused by cgroup_mutex and cpu_hotplug_lock @ 2024-08-17 9:33 Chen Ridong 2024-08-17 9:33 ` [PATCH v3 1/1] cgroup: fix " Chen Ridong 0 siblings, 1 reply; 16+ messages in thread From: Chen Ridong @ 2024-08-17 9:33 UTC (permalink / raw) To: martin.lau, ast, daniel, andrii, eddyz87, song, yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, tj, lizefan.x, hannes, roman.gushchin Cc: bpf, cgroups, linux-kernel changes from v1/v2: - Optimize commit msg. Link v1: https://lore.kernel.org/cgroups/20240607110313.2230669-1-chenridong@huawei.com/ Link v2: https://lore.kernel.org/cgroups/20240719025232.2143638-1-chenridong@huawei.com/ Chen Ridong (1): cgroup: fix deadlock caused by cgroup_mutex and cpu_hotplug_lock kernel/bpf/cgroup.c | 2 +- kernel/cgroup/cgroup-internal.h | 1 + kernel/cgroup/cgroup.c | 2 +- 3 files changed, 3 insertions(+), 2 deletions(-) -- 2.34.1 ^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH v3 1/1] cgroup: fix deadlock caused by cgroup_mutex and cpu_hotplug_lock
  2024-08-17  9:33 [PATCH v3 0/1] Fix deadlock caused by cgroup_mutex and cpu_hotplug_lock Chen Ridong
@ 2024-08-17  9:33 ` Chen Ridong
  2024-08-22  0:57   ` Chen Ridong
  2024-09-09 14:19   ` Michal Koutný
  0 siblings, 2 replies; 16+ messages in thread
From: Chen Ridong @ 2024-08-17  9:33 UTC (permalink / raw)
  To: martin.lau, ast, daniel, andrii, eddyz87, song, yonghong.song,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, tj, lizefan.x, hannes,
	roman.gushchin
  Cc: bpf, cgroups, linux-kernel

We found a hung_task problem as shown below:

INFO: task kworker/0:0:8 blocked for more than 327 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/0:0 state:D stack:13920 pid:8 ppid:2 flags:0x00004000
Workqueue: events cgroup_bpf_release
Call Trace:
 <TASK>
 __schedule+0x5a2/0x2050
 ? find_held_lock+0x33/0x100
 ? wq_worker_sleeping+0x9e/0xe0
 schedule+0x9f/0x180
 schedule_preempt_disabled+0x25/0x50
 __mutex_lock+0x512/0x740
 ? cgroup_bpf_release+0x1e/0x4d0
 ? cgroup_bpf_release+0xcf/0x4d0
 ? process_scheduled_works+0x161/0x8a0
 ? cgroup_bpf_release+0x1e/0x4d0
 ? mutex_lock_nested+0x2b/0x40
 ? __pfx_delay_tsc+0x10/0x10
 mutex_lock_nested+0x2b/0x40
 cgroup_bpf_release+0xcf/0x4d0
 ? process_scheduled_works+0x161/0x8a0
 ? trace_event_raw_event_workqueue_execute_start+0x64/0xd0
 ? process_scheduled_works+0x161/0x8a0
 process_scheduled_works+0x23a/0x8a0
 worker_thread+0x231/0x5b0
 ? __pfx_worker_thread+0x10/0x10
 kthread+0x14d/0x1c0
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x59/0x70
 ? __pfx_kthread+0x10/0x10
 ret_from_fork_asm+0x1b/0x30
 </TASK>

This issue can be reproduced by the following pressure test:
1. A large number of cpuset cgroups are deleted.
2. Set cpu on and off repeatedly.
3. Set watchdog_thresh repeatedly.
The scripts can be obtained at LINK mentioned above the signature.
The reason for this issue is that cgroup_mutex and cpu_hotplug_lock are
acquired by different tasks, which may lead to a deadlock through the
following steps:
1. A large number of cpusets are deleted asynchronously, which puts a
   large number of cgroup_bpf_release works into system_wq. The max_active
   of system_wq is WQ_DFL_ACTIVE(256). Consequently, all active works are
   cgroup_bpf_release works, and many cgroup_bpf_release works will be put
   into the inactive queue. As illustrated in the diagram, there are
   256 (in the active queue) + n (in the inactive queue) works.
2. Setting watchdog_thresh will hold cpu_hotplug_lock.read and put the
   smp_call_on_cpu work into system_wq. However, step 1 has already filled
   system_wq, so 'sscs.work' is put into the inactive queue. 'sscs.work'
   has to wait until the works that were put into the inactive queue
   earlier have executed (n cgroup_bpf_release), so it will be blocked
   for a while.
3. CPU offline requires cpu_hotplug_lock.write, which is blocked by step 2.
4. Cpusets that were deleted at step 1 put cgroup_release works into
   cgroup_destroy_wq. They are competing to get cgroup_mutex all the time.
   When cgroup_mutex is acquired by the work at css_killed_work_fn, it
   will call cpuset_css_offline, which needs to acquire
   cpu_hotplug_lock.read. However, cpuset_css_offline will be blocked
   because of step 3.
5. At this moment, there are 256 works in the active queue that are
   cgroup_bpf_release; they are attempting to acquire cgroup_mutex, and
   as a result, all of them are blocked. Consequently, sscs.work cannot
   be executed. Ultimately, this situation leads to four processes being
   blocked, forming a deadlock.

system_wq(step1)		WatchDog(step2)			cpu offline(step3)	cgroup_destroy_wq(step4)
...
2000+ cgroups deleted async
256 actives + n inactives
				__lockup_detector_reconfigure
				P(cpu_hotplug_lock.read)
				put sscs.work into system_wq
256 + n + 1(sscs.work)
				sscs.work waits to be executed
				waiting for sscs.work to finish
								percpu_down_write
								P(cpu_hotplug_lock.write)
								...blocking...
											css_killed_work_fn
											P(cgroup_mutex)
											cpuset_css_offline
											P(cpu_hotplug_lock.read)
											...blocking...
256 cgroup_bpf_release
mutex_lock(&cgroup_mutex);
...blocking...

To fix the problem, place cgroup_bpf_release works on cgroup_destroy_wq,
which breaks the loop and solves the problem. System wqs are for misc
things which shouldn't create a large number of concurrent work items.
If something is going to generate >WQ_DFL_ACTIVE(256) concurrent work
items, it should use its own dedicated workqueue.

Fixes: 4bfc0bb2c60e ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself")
Link: https://lore.kernel.org/cgroups/e90c32d2-2a85-4f28-9154-09c7d320cb60@huawei.com/T/#t
Signed-off-by: Chen Ridong <chenridong@huawei.com>
---
 kernel/bpf/cgroup.c             | 2 +-
 kernel/cgroup/cgroup-internal.h | 1 +
 kernel/cgroup/cgroup.c          | 2 +-
 3 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index 8ba73042a239..a611a1274788 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -334,7 +334,7 @@ static void cgroup_bpf_release_fn(struct percpu_ref *ref)
 	struct cgroup *cgrp = container_of(ref, struct cgroup, bpf.refcnt);
 
 	INIT_WORK(&cgrp->bpf.release_work, cgroup_bpf_release);
-	queue_work(system_wq, &cgrp->bpf.release_work);
+	queue_work(cgroup_destroy_wq, &cgrp->bpf.release_work);
 }
 
 /* Get underlying bpf_prog of bpf_prog_list entry, regardless if it's through
diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
index c964dd7ff967..17ac19bc8106 100644
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/cgroup-internal.h
@@ -13,6 +13,7 @@
 extern spinlock_t trace_cgroup_path_lock;
 extern char trace_cgroup_path[TRACE_CGROUP_PATH_LEN];
 extern void
__init enable_debug_cgroup(void); +extern struct workqueue_struct *cgroup_destroy_wq; /* * cgroup_path() takes a spin lock. It is good practice not to take diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 75058fbf4450..77fa9ed69c86 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -124,7 +124,7 @@ DEFINE_PERCPU_RWSEM(cgroup_threadgroup_rwsem); * destruction work items don't end up filling up max_active of system_wq * which may lead to deadlock. */ -static struct workqueue_struct *cgroup_destroy_wq; +struct workqueue_struct *cgroup_destroy_wq; /* generate an array of cgroup subsystem pointers */ #define SUBSYS(_x) [_x ## _cgrp_id] = &_x ## _cgrp_subsys, -- 2.34.1 ^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH v3 1/1] cgroup: fix deadlock caused by cgroup_mutex and cpu_hotplug_lock 2024-08-17 9:33 ` [PATCH v3 1/1] cgroup: fix " Chen Ridong @ 2024-08-22 0:57 ` Chen Ridong 2024-08-30 2:08 ` Chen Ridong 2024-09-09 14:19 ` Michal Koutný 1 sibling, 1 reply; 16+ messages in thread From: Chen Ridong @ 2024-08-22 0:57 UTC (permalink / raw) To: martin.lau, ast, daniel, andrii, eddyz87, song, yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, tj, lizefan.x, hannes, roman.gushchin, Michal Koutný Cc: bpf, cgroups, linux-kernel On 2024/8/17 17:33, Chen Ridong wrote: > We found a hung_task problem as shown below: > > INFO: task kworker/0:0:8 blocked for more than 327 seconds. > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > task:kworker/0:0 state:D stack:13920 pid:8 ppid:2 flags:0x00004000 > Workqueue: events cgroup_bpf_release > Call Trace: > <TASK> > __schedule+0x5a2/0x2050 > ? find_held_lock+0x33/0x100 > ? wq_worker_sleeping+0x9e/0xe0 > schedule+0x9f/0x180 > schedule_preempt_disabled+0x25/0x50 > __mutex_lock+0x512/0x740 > ? cgroup_bpf_release+0x1e/0x4d0 > ? cgroup_bpf_release+0xcf/0x4d0 > ? process_scheduled_works+0x161/0x8a0 > ? cgroup_bpf_release+0x1e/0x4d0 > ? mutex_lock_nested+0x2b/0x40 > ? __pfx_delay_tsc+0x10/0x10 > mutex_lock_nested+0x2b/0x40 > cgroup_bpf_release+0xcf/0x4d0 > ? process_scheduled_works+0x161/0x8a0 > ? trace_event_raw_event_workqueue_execute_start+0x64/0xd0 > ? process_scheduled_works+0x161/0x8a0 > process_scheduled_works+0x23a/0x8a0 > worker_thread+0x231/0x5b0 > ? __pfx_worker_thread+0x10/0x10 > kthread+0x14d/0x1c0 > ? __pfx_kthread+0x10/0x10 > ret_from_fork+0x59/0x70 > ? __pfx_kthread+0x10/0x10 > ret_from_fork_asm+0x1b/0x30 > </TASK> > > This issue can be reproduced by the following pressuse test: > 1. A large number of cpuset cgroups are deleted. > 2. Set cpu on and off repeatly. > 3. Set watchdog_thresh repeatly. > The scripts can be obtained at LINK mentioned above the signature. 
> > The reason for this issue is cgroup_mutex and cpu_hotplug_lock are > acquired in different tasks, which may lead to deadlock. > It can lead to a deadlock through the following steps: > 1. A large number of cpusets are deleted asynchronously, which puts a > large number of cgroup_bpf_release works into system_wq. The max_active > of system_wq is WQ_DFL_ACTIVE(256). Consequently, all active works are > cgroup_bpf_release works, and many cgroup_bpf_release works will be put > into inactive queue. As illustrated in the diagram, there are 256 (in > the acvtive queue) + n (in the inactive queue) works. > 2. Setting watchdog_thresh will hold cpu_hotplug_lock.read and put > smp_call_on_cpu work into system_wq. However step 1 has already filled > system_wq, 'sscs.work' is put into inactive queue. 'sscs.work' has > to wait until the works that were put into the inacvtive queue earlier > have executed (n cgroup_bpf_release), so it will be blocked for a while. > 3. Cpu offline requires cpu_hotplug_lock.write, which is blocked by step 2. > 4. Cpusets that were deleted at step 1 put cgroup_release works into > cgroup_destroy_wq. They are competing to get cgroup_mutex all the time. > When cgroup_metux is acqured by work at css_killed_work_fn, it will > call cpuset_css_offline, which needs to acqure cpu_hotplug_lock.read. > However, cpuset_css_offline will be blocked for step 3. > 5. At this moment, there are 256 works in active queue that are > cgroup_bpf_release, they are attempting to acquire cgroup_mutex, and as > a result, all of them are blocked. Consequently, sscs.work can not be > executed. Ultimately, this situation leads to four processes being > blocked, forming a deadlock. > > system_wq(step1) WatchDog(step2) cpu offline(step3) cgroup_destroy_wq(step4) > ... 
> 2000+ cgroups deleted asyn > 256 actives + n inactives > __lockup_detector_reconfigure > P(cpu_hotplug_lock.read) > put sscs.work into system_wq > 256 + n + 1(sscs.work) > sscs.work wait to be executed > warting sscs.work finish > percpu_down_write > P(cpu_hotplug_lock.write) > ...blocking... > css_killed_work_fn > P(cgroup_mutex) > cpuset_css_offline > P(cpu_hotplug_lock.read) > ...blocking... > 256 cgroup_bpf_release > mutex_lock(&cgroup_mutex); > ..blocking... > > To fix the problem, place cgroup_bpf_release works on cgroup_destroy_wq, > which can break the loop and solve the problem. System wqs are for misc > things which shouldn't create a large number of concurrent work items. > If something is going to generate >WQ_DFL_ACTIVE(256) concurrent work > items, it should use its own dedicated workqueue. > > Fixes: 4bfc0bb2c60e ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself") > Link: https://lore.kernel.org/cgroups/e90c32d2-2a85-4f28-9154-09c7d320cb60@huawei.com/T/#t > Signed-off-by: Chen Ridong <chenridong@huawei.com> > --- > kernel/bpf/cgroup.c | 2 +- > kernel/cgroup/cgroup-internal.h | 1 + > kernel/cgroup/cgroup.c | 2 +- > 3 files changed, 3 insertions(+), 2 deletions(-) > > diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c > index 8ba73042a239..a611a1274788 100644 > --- a/kernel/bpf/cgroup.c > +++ b/kernel/bpf/cgroup.c > @@ -334,7 +334,7 @@ static void cgroup_bpf_release_fn(struct percpu_ref *ref) > struct cgroup *cgrp = container_of(ref, struct cgroup, bpf.refcnt); > > INIT_WORK(&cgrp->bpf.release_work, cgroup_bpf_release); > - queue_work(system_wq, &cgrp->bpf.release_work); > + queue_work(cgroup_destroy_wq, &cgrp->bpf.release_work); > } > > /* Get underlying bpf_prog of bpf_prog_list entry, regardless if it's through > diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h > index c964dd7ff967..17ac19bc8106 100644 > --- a/kernel/cgroup/cgroup-internal.h > +++ b/kernel/cgroup/cgroup-internal.h > @@ -13,6 +13,7 @@ > 
extern spinlock_t trace_cgroup_path_lock; > extern char trace_cgroup_path[TRACE_CGROUP_PATH_LEN]; > extern void __init enable_debug_cgroup(void); > +extern struct workqueue_struct *cgroup_destroy_wq; > > /* > * cgroup_path() takes a spin lock. It is good practice not to take > diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c > index 75058fbf4450..77fa9ed69c86 100644 > --- a/kernel/cgroup/cgroup.c > +++ b/kernel/cgroup/cgroup.c > @@ -124,7 +124,7 @@ DEFINE_PERCPU_RWSEM(cgroup_threadgroup_rwsem); > * destruction work items don't end up filling up max_active of system_wq > * which may lead to deadlock. > */ > -static struct workqueue_struct *cgroup_destroy_wq; > +struct workqueue_struct *cgroup_destroy_wq; > > /* generate an array of cgroup subsystem pointers */ > #define SUBSYS(_x) [_x ## _cgrp_id] = &_x ## _cgrp_subsys, Ping. Hi,TJ, Roman and Michal, I have updated commit message, I think it can be much clearer now, can you review it again? Thanks, Ridong ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v3 1/1] cgroup: fix deadlock caused by cgroup_mutex and cpu_hotplug_lock 2024-08-22 0:57 ` Chen Ridong @ 2024-08-30 2:08 ` Chen Ridong 0 siblings, 0 replies; 16+ messages in thread From: Chen Ridong @ 2024-08-30 2:08 UTC (permalink / raw) To: martin.lau, ast, daniel, andrii, eddyz87, song, yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, tj, lizefan.x, hannes, roman.gushchin, Michal Koutný Cc: bpf, cgroups, linux-kernel On 2024/8/22 8:57, Chen Ridong wrote: > > > On 2024/8/17 17:33, Chen Ridong wrote: >> We found a hung_task problem as shown below: >> >> INFO: task kworker/0:0:8 blocked for more than 327 seconds. >> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. >> task:kworker/0:0 state:D stack:13920 pid:8 ppid:2 >> flags:0x00004000 >> Workqueue: events cgroup_bpf_release >> Call Trace: >> <TASK> >> __schedule+0x5a2/0x2050 >> ? find_held_lock+0x33/0x100 >> ? wq_worker_sleeping+0x9e/0xe0 >> schedule+0x9f/0x180 >> schedule_preempt_disabled+0x25/0x50 >> __mutex_lock+0x512/0x740 >> ? cgroup_bpf_release+0x1e/0x4d0 >> ? cgroup_bpf_release+0xcf/0x4d0 >> ? process_scheduled_works+0x161/0x8a0 >> ? cgroup_bpf_release+0x1e/0x4d0 >> ? mutex_lock_nested+0x2b/0x40 >> ? __pfx_delay_tsc+0x10/0x10 >> mutex_lock_nested+0x2b/0x40 >> cgroup_bpf_release+0xcf/0x4d0 >> ? process_scheduled_works+0x161/0x8a0 >> ? trace_event_raw_event_workqueue_execute_start+0x64/0xd0 >> ? process_scheduled_works+0x161/0x8a0 >> process_scheduled_works+0x23a/0x8a0 >> worker_thread+0x231/0x5b0 >> ? __pfx_worker_thread+0x10/0x10 >> kthread+0x14d/0x1c0 >> ? __pfx_kthread+0x10/0x10 >> ret_from_fork+0x59/0x70 >> ? __pfx_kthread+0x10/0x10 >> ret_from_fork_asm+0x1b/0x30 >> </TASK> >> >> This issue can be reproduced by the following pressuse test: >> 1. A large number of cpuset cgroups are deleted. >> 2. Set cpu on and off repeatly. >> 3. Set watchdog_thresh repeatly. >> The scripts can be obtained at LINK mentioned above the signature. 
>> >> The reason for this issue is cgroup_mutex and cpu_hotplug_lock are >> acquired in different tasks, which may lead to deadlock. >> It can lead to a deadlock through the following steps: >> 1. A large number of cpusets are deleted asynchronously, which puts a >> large number of cgroup_bpf_release works into system_wq. The >> max_active >> of system_wq is WQ_DFL_ACTIVE(256). Consequently, all active works >> are >> cgroup_bpf_release works, and many cgroup_bpf_release works will >> be put >> into inactive queue. As illustrated in the diagram, there are 256 (in >> the acvtive queue) + n (in the inactive queue) works. >> 2. Setting watchdog_thresh will hold cpu_hotplug_lock.read and put >> smp_call_on_cpu work into system_wq. However step 1 has already >> filled >> system_wq, 'sscs.work' is put into inactive queue. 'sscs.work' has >> to wait until the works that were put into the inacvtive queue >> earlier >> have executed (n cgroup_bpf_release), so it will be blocked for a >> while. >> 3. Cpu offline requires cpu_hotplug_lock.write, which is blocked by >> step 2. >> 4. Cpusets that were deleted at step 1 put cgroup_release works into >> cgroup_destroy_wq. They are competing to get cgroup_mutex all the >> time. >> When cgroup_metux is acqured by work at css_killed_work_fn, it will >> call cpuset_css_offline, which needs to acqure cpu_hotplug_lock.read. >> However, cpuset_css_offline will be blocked for step 3. >> 5. At this moment, there are 256 works in active queue that are >> cgroup_bpf_release, they are attempting to acquire cgroup_mutex, >> and as >> a result, all of them are blocked. Consequently, sscs.work can not be >> executed. Ultimately, this situation leads to four processes being >> blocked, forming a deadlock. >> >> system_wq(step1) WatchDog(step2) cpu >> offline(step3) cgroup_destroy_wq(step4) >> ... 
>> 2000+ cgroups deleted asyn >> 256 actives + n inactives >> __lockup_detector_reconfigure >> P(cpu_hotplug_lock.read) >> put sscs.work into system_wq >> 256 + n + 1(sscs.work) >> sscs.work wait to be executed >> warting sscs.work finish >> percpu_down_write >> P(cpu_hotplug_lock.write) >> ...blocking... >> css_killed_work_fn >> P(cgroup_mutex) >> cpuset_css_offline >> P(cpu_hotplug_lock.read) >> ...blocking... >> 256 cgroup_bpf_release >> mutex_lock(&cgroup_mutex); >> ..blocking... >> >> To fix the problem, place cgroup_bpf_release works on cgroup_destroy_wq, >> which can break the loop and solve the problem. System wqs are for misc >> things which shouldn't create a large number of concurrent work items. >> If something is going to generate >WQ_DFL_ACTIVE(256) concurrent work >> items, it should use its own dedicated workqueue. >> >> Fixes: 4bfc0bb2c60e ("bpf: decouple the lifetime of cgroup_bpf from >> cgroup itself") >> Link: >> https://lore.kernel.org/cgroups/e90c32d2-2a85-4f28-9154-09c7d320cb60@huawei.com/T/#t >> Signed-off-by: Chen Ridong <chenridong@huawei.com> >> --- >> kernel/bpf/cgroup.c | 2 +- >> kernel/cgroup/cgroup-internal.h | 1 + >> kernel/cgroup/cgroup.c | 2 +- >> 3 files changed, 3 insertions(+), 2 deletions(-) >> >> diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c >> index 8ba73042a239..a611a1274788 100644 >> --- a/kernel/bpf/cgroup.c >> +++ b/kernel/bpf/cgroup.c >> @@ -334,7 +334,7 @@ static void cgroup_bpf_release_fn(struct >> percpu_ref *ref) >> struct cgroup *cgrp = container_of(ref, struct cgroup, bpf.refcnt); >> INIT_WORK(&cgrp->bpf.release_work, cgroup_bpf_release); >> - queue_work(system_wq, &cgrp->bpf.release_work); >> + queue_work(cgroup_destroy_wq, &cgrp->bpf.release_work); >> } >> /* Get underlying bpf_prog of bpf_prog_list entry, regardless if >> it's through >> diff --git a/kernel/cgroup/cgroup-internal.h >> b/kernel/cgroup/cgroup-internal.h >> index c964dd7ff967..17ac19bc8106 100644 >> --- a/kernel/cgroup/cgroup-internal.h 
>> +++ b/kernel/cgroup/cgroup-internal.h >> @@ -13,6 +13,7 @@ >> extern spinlock_t trace_cgroup_path_lock; >> extern char trace_cgroup_path[TRACE_CGROUP_PATH_LEN]; >> extern void __init enable_debug_cgroup(void); >> +extern struct workqueue_struct *cgroup_destroy_wq; >> /* >> * cgroup_path() takes a spin lock. It is good practice not to take >> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c >> index 75058fbf4450..77fa9ed69c86 100644 >> --- a/kernel/cgroup/cgroup.c >> +++ b/kernel/cgroup/cgroup.c >> @@ -124,7 +124,7 @@ DEFINE_PERCPU_RWSEM(cgroup_threadgroup_rwsem); >> * destruction work items don't end up filling up max_active of >> system_wq >> * which may lead to deadlock. >> */ >> -static struct workqueue_struct *cgroup_destroy_wq; >> +struct workqueue_struct *cgroup_destroy_wq; >> /* generate an array of cgroup subsystem pointers */ >> #define SUBSYS(_x) [_x ## _cgrp_id] = &_x ## _cgrp_subsys, > > Ping. > Hi,TJ, Roman and Michal, I have updated commit message, I think it can > be much clearer now, can you review it again? > > Thanks, > Ridong > Friendly ping. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v3 1/1] cgroup: fix deadlock caused by cgroup_mutex and cpu_hotplug_lock 2024-08-17 9:33 ` [PATCH v3 1/1] cgroup: fix " Chen Ridong 2024-08-22 0:57 ` Chen Ridong @ 2024-09-09 14:19 ` Michal Koutný 2024-09-10 1:31 ` Chen Ridong 2024-09-11 11:15 ` Hillf Danton 1 sibling, 2 replies; 16+ messages in thread From: Michal Koutný @ 2024-09-09 14:19 UTC (permalink / raw) To: Chen Ridong Cc: martin.lau, ast, daniel, andrii, eddyz87, song, yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, tj, lizefan.x, hannes, roman.gushchin, bpf, cgroups, linux-kernel [-- Attachment #1: Type: text/plain, Size: 5924 bytes --] On Sat, Aug 17, 2024 at 09:33:34AM GMT, Chen Ridong <chenridong@huawei.com> wrote: > The reason for this issue is cgroup_mutex and cpu_hotplug_lock are > acquired in different tasks, which may lead to deadlock. > It can lead to a deadlock through the following steps: > 1. A large number of cpusets are deleted asynchronously, which puts a > large number of cgroup_bpf_release works into system_wq. The max_active > of system_wq is WQ_DFL_ACTIVE(256). Consequently, all active works are > cgroup_bpf_release works, and many cgroup_bpf_release works will be put > into inactive queue. As illustrated in the diagram, there are 256 (in > the acvtive queue) + n (in the inactive queue) works. > 2. Setting watchdog_thresh will hold cpu_hotplug_lock.read and put > smp_call_on_cpu work into system_wq. However step 1 has already filled > system_wq, 'sscs.work' is put into inactive queue. 'sscs.work' has > to wait until the works that were put into the inacvtive queue earlier > have executed (n cgroup_bpf_release), so it will be blocked for a while. > 3. Cpu offline requires cpu_hotplug_lock.write, which is blocked by step 2. > 4. Cpusets that were deleted at step 1 put cgroup_release works into > cgroup_destroy_wq. They are competing to get cgroup_mutex all the time. 
> When cgroup_metux is acqured by work at css_killed_work_fn, it will > call cpuset_css_offline, which needs to acqure cpu_hotplug_lock.read. > However, cpuset_css_offline will be blocked for step 3. > 5. At this moment, there are 256 works in active queue that are > cgroup_bpf_release, they are attempting to acquire cgroup_mutex, and as > a result, all of them are blocked. Consequently, sscs.work can not be > executed. Ultimately, this situation leads to four processes being > blocked, forming a deadlock. > > system_wq(step1) WatchDog(step2) cpu offline(step3) cgroup_destroy_wq(step4) > ... > 2000+ cgroups deleted asyn > 256 actives + n inactives > __lockup_detector_reconfigure > P(cpu_hotplug_lock.read) > put sscs.work into system_wq > 256 + n + 1(sscs.work) > sscs.work wait to be executed > warting sscs.work finish > percpu_down_write > P(cpu_hotplug_lock.write) > ...blocking... > css_killed_work_fn > P(cgroup_mutex) > cpuset_css_offline > P(cpu_hotplug_lock.read) > ...blocking... > 256 cgroup_bpf_release > mutex_lock(&cgroup_mutex); > ..blocking... Thanks, Ridong, for laying this out. Let me try to extract the core of the deps above. The correct lock ordering is: cgroup_mutex then cpu_hotplug_lock. However, the smp_call_on_cpu() under cpus_read_lock may lead to a deadlock (ABBA over those two locks). This is OK thread T system_wq worker lock(cgroup_mutex) (II) ... unlock(cgroup_mutex) down(cpu_hotplug_lock.read) smp_call_on_cpu queue_work_on(cpu, system_wq, scss) (I) scss.func wait_for_completion(scss) up(cpu_hotplug_lock.read) However, there is no ordering between (I) and (II) so they can also happen in opposite thread T system_wq worker down(cpu_hotplug_lock.read) smp_call_on_cpu queue_work_on(cpu, system_wq, scss) (I) lock(cgroup_mutex) (II) ... unlock(cgroup_mutex) scss.func wait_for_completion(scss) up(cpu_hotplug_lock.read) And here the thread T + system_wq worker effectively call cpu_hotplug_lock and cgroup_mutex in the wrong order. 
(And since they're two threads, it won't be caught by lockdep.) By that reasoning any holder of cgroup_mutex on system_wq makes system susceptible to a deadlock (in presence of cpu_hotplug_lock waiting writers + cpuset operations). And the two work items must meet in same worker's processing hence probability is low (zero?) with less than WQ_DFL_ACTIVE items. (And more generally, any lock that is ordered before cpu_hotplug_lock should not be taken in system_wq work functions. Or at least such works items should not saturate WQ_DFL_ACTIVE workers.) Wrt other uses of cgroup_mutex, I only see bpf_map_free_in_work queue_work(system_unbound_wq) bpf_map_free_deferred ops->map_free == cgroup_storage_map_free cgroup_lock() which is safe since it uses a different workqueue than system_wq. > To fix the problem, place cgroup_bpf_release works on cgroup_destroy_wq, > which can break the loop and solve the problem. Yes, it moves the problematic cgroup_mutex holder away from system_wq and cgroup_destroy_wq could not cause similar problems because there are no explicit waiter for particular work items or its flushing. > System wqs are for misc things which shouldn't create a large number > of concurrent work items. If something is going to generate > >WQ_DFL_ACTIVE(256) concurrent work > items, it should use its own dedicated workqueue. Actually, I'm not sure (because I lack workqueue knowledge) if producing less than WQ_DFL_ACTIVE work items completely eliminates the chance of two offending work items producing the wrong lock ordering. > Fixes: 4bfc0bb2c60e ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself") I'm now indifferent whether this is needed (perhaps in the sense it is the _latest_ of multiple changes that contributed to possibility of this deadlock scenario). 
> Link: https://lore.kernel.org/cgroups/e90c32d2-2a85-4f28-9154-09c7d320cb60@huawei.com/T/#t
> Signed-off-by: Chen Ridong <chenridong@huawei.com>
> ---
>  kernel/bpf/cgroup.c             | 2 +-
>  kernel/cgroup/cgroup-internal.h | 1 +
>  kernel/cgroup/cgroup.c          | 2 +-
>  3 files changed, 3 insertions(+), 2 deletions(-)

I have convinced myself now that you can put

Reviewed-by: Michal Koutný <mkoutny@suse.com>

Regards,
Michal

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [PATCH v3 1/1] cgroup: fix deadlock caused by cgroup_mutex and cpu_hotplug_lock 2024-09-09 14:19 ` Michal Koutný @ 2024-09-10 1:31 ` Chen Ridong 2024-09-10 21:02 ` Roman Gushchin 2024-09-11 11:15 ` Hillf Danton 1 sibling, 1 reply; 16+ messages in thread From: Chen Ridong @ 2024-09-10 1:31 UTC (permalink / raw) To: Michal Koutný, Chen Ridong Cc: martin.lau, ast, daniel, andrii, eddyz87, song, yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, tj, lizefan.x, hannes, roman.gushchin, bpf, cgroups, linux-kernel On 2024/9/9 22:19, Michal Koutný wrote: > On Sat, Aug 17, 2024 at 09:33:34AM GMT, Chen Ridong <chenridong@huawei.com> wrote: >> The reason for this issue is cgroup_mutex and cpu_hotplug_lock are >> acquired in different tasks, which may lead to deadlock. >> It can lead to a deadlock through the following steps: >> 1. A large number of cpusets are deleted asynchronously, which puts a >> large number of cgroup_bpf_release works into system_wq. The max_active >> of system_wq is WQ_DFL_ACTIVE(256). Consequently, all active works are >> cgroup_bpf_release works, and many cgroup_bpf_release works will be put >> into inactive queue. As illustrated in the diagram, there are 256 (in >> the acvtive queue) + n (in the inactive queue) works. >> 2. Setting watchdog_thresh will hold cpu_hotplug_lock.read and put >> smp_call_on_cpu work into system_wq. However step 1 has already filled >> system_wq, 'sscs.work' is put into inactive queue. 'sscs.work' has >> to wait until the works that were put into the inacvtive queue earlier >> have executed (n cgroup_bpf_release), so it will be blocked for a while. >> 3. Cpu offline requires cpu_hotplug_lock.write, which is blocked by step 2. >> 4. Cpusets that were deleted at step 1 put cgroup_release works into >> cgroup_destroy_wq. They are competing to get cgroup_mutex all the time. 
>> When cgroup_metux is acqured by work at css_killed_work_fn, it will >> call cpuset_css_offline, which needs to acqure cpu_hotplug_lock.read. >> However, cpuset_css_offline will be blocked for step 3. >> 5. At this moment, there are 256 works in active queue that are >> cgroup_bpf_release, they are attempting to acquire cgroup_mutex, and as >> a result, all of them are blocked. Consequently, sscs.work can not be >> executed. Ultimately, this situation leads to four processes being >> blocked, forming a deadlock. >> >> system_wq(step1) WatchDog(step2) cpu offline(step3) cgroup_destroy_wq(step4) >> ... >> 2000+ cgroups deleted asyn >> 256 actives + n inactives >> __lockup_detector_reconfigure >> P(cpu_hotplug_lock.read) >> put sscs.work into system_wq >> 256 + n + 1(sscs.work) >> sscs.work wait to be executed >> warting sscs.work finish >> percpu_down_write >> P(cpu_hotplug_lock.write) >> ...blocking... >> css_killed_work_fn >> P(cgroup_mutex) >> cpuset_css_offline >> P(cpu_hotplug_lock.read) >> ...blocking... >> 256 cgroup_bpf_release >> mutex_lock(&cgroup_mutex); >> ..blocking... > > Thanks, Ridong, for laying this out. > Let me try to extract the core of the deps above. > > The correct lock ordering is: cgroup_mutex then cpu_hotplug_lock. > However, the smp_call_on_cpu() under cpus_read_lock may lead to > a deadlock (ABBA over those two locks). > That's right. > This is OK > thread T system_wq worker > > lock(cgroup_mutex) (II) > ... > unlock(cgroup_mutex) > down(cpu_hotplug_lock.read) > smp_call_on_cpu > queue_work_on(cpu, system_wq, scss) (I) > scss.func > wait_for_completion(scss) > up(cpu_hotplug_lock.read) > > However, there is no ordering between (I) and (II) so they can also happen > in opposite > > thread T system_wq worker > > down(cpu_hotplug_lock.read) > smp_call_on_cpu > queue_work_on(cpu, system_wq, scss) (I) > lock(cgroup_mutex) (II) > ... 
> unlock(cgroup_mutex) > scss.func > wait_for_completion(scss) > up(cpu_hotplug_lock.read) > > And here the thread T + system_wq worker effectively call > cpu_hotplug_lock and cgroup_mutex in the wrong order. (And since they're > two threads, it won't be caught by lockdep.) > > By that reasoning any holder of cgroup_mutex on system_wq makes system > susceptible to a deadlock (in presence of cpu_hotplug_lock waiting > writers + cpuset operations). And the two work items must meet in same > worker's processing hence probability is low (zero?) with less than > WQ_DFL_ACTIVE items. > > (And more generally, any lock that is ordered before cpu_hotplug_lock > should not be taken in system_wq work functions. Or at least such works > items should not saturate WQ_DFL_ACTIVE workers.) > > Wrt other uses of cgroup_mutex, I only see > bpf_map_free_in_work > queue_work(system_unbound_wq) > bpf_map_free_deferred > ops->map_free == cgroup_storage_map_free > cgroup_lock() > which is safe since it uses a different workqueue than system_wq. > >> To fix the problem, place cgroup_bpf_release works on cgroup_destroy_wq, >> which can break the loop and solve the problem. > > Yes, it moves the problematic cgroup_mutex holder away from system_wq > and cgroup_destroy_wq could not cause similar problems because there are > no explicit waiter for particular work items or its flushing. > > >> System wqs are for misc things which shouldn't create a large number >> of concurrent work items. If something is going to generate >>> WQ_DFL_ACTIVE(256) concurrent work >> items, it should use its own dedicated workqueue. > > Actually, I'm not sure (because I lack workqueue knowledge) if producing > less than WQ_DFL_ACTIVE work items completely eliminates the chance of > two offending work items producing the wrong lock ordering. > If producing less than WQ_DFL_ACTIVE work items, it won't lead to a deadlock. 
That is because scss.func can be executed without having to wait for the
works that hold cgroup_mutex to complete. Therefore, the probability is
low, and this issue can only be reproduced under a pressure test.

>
>> Fixes: 4bfc0bb2c60e ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself")
>
> I'm now indifferent whether this is needed (perhaps in the sense it is
> the _latest_ of multiple changes that contributed to possibility of this
> deadlock scenario).
>
>> Link: https://lore.kernel.org/cgroups/e90c32d2-2a85-4f28-9154-09c7d320cb60@huawei.com/T/#t
>> Signed-off-by: Chen Ridong <chenridong@huawei.com>
>> ---
>>  kernel/bpf/cgroup.c             | 2 +-
>>  kernel/cgroup/cgroup-internal.h | 1 +
>>  kernel/cgroup/cgroup.c          | 2 +-
>>  3 files changed, 3 insertions(+), 2 deletions(-)
>
> I have convinced myself now that you can put
>
> Reviewed-by: Michal Koutný <mkoutny@suse.com>
>
> Regards,
> Michal

Thank you very much.

Best Regards,
Ridong

^ permalink raw reply	[flat|nested] 16+ messages in thread
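The saturation scenario discussed above can be sketched in user space. The following toy model (Python; one worker thread stands in for the saturated max_active slots, a `Lock` stands in for cgroup_mutex, and the names `bpf_release`/`sscs` are illustrative, not kernel code) shows why sscs.work queued behind mutex-taking works cannot run while the mutex is held elsewhere:

```python
import threading
import queue

# Toy workqueue: one FIFO worker thread models a pool whose active
# slots are all occupied, like system_wq saturated at max_active.
wq = queue.Queue()
order = []
cgroup_mutex = threading.Lock()

def worker():
    # Dequeue and run work items strictly in FIFO order.
    while True:
        work = wq.get()
        if work is None:
            return
        work()

def make_bpf_release(i):
    def bpf_release():
        with cgroup_mutex:            # parks until "thread T" drops the mutex
            order.append(f"bpf_release-{i}")
    return bpf_release

sscs_done = threading.Event()

def sscs_func():
    order.append("sscs")
    sscs_done.set()

t = threading.Thread(target=worker)
t.start()

cgroup_mutex.acquire()                # thread T holds cgroup_mutex ...
for i in range(3):
    wq.put(make_bpf_release(i))       # ... while release works fill the queue
wq.put(sscs_func)                     # sscs.work lands behind all of them

# sscs.work cannot run yet: the only active slot is parked on cgroup_mutex.
assert not sscs_done.wait(timeout=0.2)

cgroup_mutex.release()                # in the real deadlock this never happens
sscs_done.wait()
wq.put(None)
t.join()
print(order)                          # all bpf_release items first, sscs last
```

In the real deadlock the release never comes, because the mutex holder is itself waiting (transitively) on sscs.work; here it is released only so the sketch terminates.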
* Re: [PATCH v3 1/1] cgroup: fix deadlock caused by cgroup_mutex and cpu_hotplug_lock
  2024-09-10  1:31 ` Chen Ridong
@ 2024-09-10 21:02   ` Roman Gushchin
  2024-09-10 21:17     ` Tejun Heo
  0 siblings, 1 reply; 16+ messages in thread
From: Roman Gushchin @ 2024-09-10 21:02 UTC (permalink / raw)
To: Chen Ridong
Cc: Michal Koutný, Chen Ridong, martin.lau, ast, daniel, andrii,
	eddyz87, song, yonghong.song, john.fastabend, kpsingh, sdf,
	haoluo, jolsa, tj, lizefan.x, hannes, bpf, cgroups, linux-kernel

On Tue, Sep 10, 2024 at 09:31:41AM +0800, Chen Ridong wrote:
>
>
> On 2024/9/9 22:19, Michal Koutný wrote:
> > On Sat, Aug 17, 2024 at 09:33:34AM GMT, Chen Ridong <chenridong@huawei.com> wrote:
> > > The reason for this issue is cgroup_mutex and cpu_hotplug_lock are
> > > acquired in different tasks, which may lead to deadlock.
> > > It can lead to a deadlock through the following steps:
> > > 1. A large number of cpusets are deleted asynchronously, which puts a
> > > large number of cgroup_bpf_release works into system_wq. The max_active
> > > of system_wq is WQ_DFL_ACTIVE(256). Consequently, all active works are
> > > cgroup_bpf_release works, and many cgroup_bpf_release works will be put
> > > into the inactive queue. As illustrated in the diagram, there are 256 (in
> > > the active queue) + n (in the inactive queue) works.
> > > 2. Setting watchdog_thresh will hold cpu_hotplug_lock.read and put
> > > smp_call_on_cpu work into system_wq. However step 1 has already filled
> > > system_wq, 'sscs.work' is put into the inactive queue. 'sscs.work' has
> > > to wait until the works that were put into the inactive queue earlier
> > > have executed (n cgroup_bpf_release), so it will be blocked for a while.
> > > 3. Cpu offline requires cpu_hotplug_lock.write, which is blocked by step 2.
> > > 4. Cpusets that were deleted at step 1 put cgroup_release works into
> > > cgroup_destroy_wq. They are competing to get cgroup_mutex all the time.
> > > When cgroup_mutex is acquired by work at css_killed_work_fn, it will
> > > call cpuset_css_offline, which needs to acquire cpu_hotplug_lock.read.
> > > However, cpuset_css_offline will be blocked for step 3.
> > > 5. At this moment, there are 256 works in the active queue that are
> > > cgroup_bpf_release, they are attempting to acquire cgroup_mutex, and as
> > > a result, all of them are blocked. Consequently, sscs.work cannot be
> > > executed. Ultimately, this situation leads to four processes being
> > > blocked, forming a deadlock.
> > >
> > > system_wq(step1)        WatchDog(step2)                 cpu offline(step3)      cgroup_destroy_wq(step4)
> > > ...
> > > 2000+ cgroups deleted async
> > > 256 actives + n inactives
> > >                         __lockup_detector_reconfigure
> > >                         P(cpu_hotplug_lock.read)
> > >                         put sscs.work into system_wq
> > > 256 + n + 1(sscs.work)
> > > sscs.work wait to be executed
> > >                         waiting for sscs.work to finish
> > >                                                         percpu_down_write
> > >                                                         P(cpu_hotplug_lock.write)
> > >                                                         ...blocking...
> > >                                                                                 css_killed_work_fn
> > >                                                                                 P(cgroup_mutex)
> > >                                                                                 cpuset_css_offline
> > >                                                                                 P(cpu_hotplug_lock.read)
> > >                                                                                 ...blocking...
> > > 256 cgroup_bpf_release
> > > mutex_lock(&cgroup_mutex);
> > > ..blocking...
> >
> > Thanks, Ridong, for laying this out.
> > Let me try to extract the core of the deps above.
> >
> > The correct lock ordering is: cgroup_mutex then cpu_hotplug_lock.
> > However, the smp_call_on_cpu() under cpus_read_lock may lead to
> > a deadlock (ABBA over those two locks).
> >
>
> That's right.
>
> > This is OK
> > thread T                              system_wq worker
> >
> > lock(cgroup_mutex)  (II)
> > ...
> > unlock(cgroup_mutex)
> > down(cpu_hotplug_lock.read)
> > smp_call_on_cpu
> >   queue_work_on(cpu, system_wq, scss)  (I)
> >                                       scss.func
> > wait_for_completion(scss)
> > up(cpu_hotplug_lock.read)
> >
> > However, there is no ordering between (I) and (II) so they can also happen
> > in the opposite order
> >
> > thread T                              system_wq worker
> >
> > down(cpu_hotplug_lock.read)
> > smp_call_on_cpu
> >   queue_work_on(cpu, system_wq, scss)  (I)
> >                                       lock(cgroup_mutex)  (II)
> >                                       ...
> >                                       unlock(cgroup_mutex)
> >                                       scss.func
> > wait_for_completion(scss)
> > up(cpu_hotplug_lock.read)
> >
> > And here the thread T + system_wq worker effectively call
> > cpu_hotplug_lock and cgroup_mutex in the wrong order. (And since they're
> > two threads, it won't be caught by lockdep.)
> >
> > By that reasoning any holder of cgroup_mutex on system_wq makes system
> > susceptible to a deadlock (in presence of cpu_hotplug_lock waiting
> > writers + cpuset operations). And the two work items must meet in same
> > worker's processing hence probability is low (zero?) with less than
> > WQ_DFL_ACTIVE items.

Right, I'm on the same page. Should we document then somewhere that
the cgroup mutex can't be locked from a system wq context?

I think this will also make the Fixes tag more meaningful.

Thank you!

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [PATCH v3 1/1] cgroup: fix deadlock caused by cgroup_mutex and cpu_hotplug_lock
  2024-09-10 21:02 ` Roman Gushchin
@ 2024-09-10 21:17   ` Tejun Heo
  2024-09-12  1:33     ` Chen Ridong
  0 siblings, 1 reply; 16+ messages in thread
From: Tejun Heo @ 2024-09-10 21:17 UTC (permalink / raw)
To: Roman Gushchin
Cc: Chen Ridong, Michal Koutný, Chen Ridong, martin.lau, ast, daniel,
	andrii, eddyz87, song, yonghong.song, john.fastabend, kpsingh,
	sdf, haoluo, jolsa, lizefan.x, hannes, bpf, cgroups, linux-kernel

On Tue, Sep 10, 2024 at 09:02:59PM +0000, Roman Gushchin wrote:
...
> > > By that reasoning any holder of cgroup_mutex on system_wq makes system
> > > susceptible to a deadlock (in presence of cpu_hotplug_lock waiting
> > > writers + cpuset operations). And the two work items must meet in same
> > > worker's processing hence probability is low (zero?) with less than
> > > WQ_DFL_ACTIVE items.
>
> Right, I'm on the same page. Should we document then somewhere that
> the cgroup mutex can't be locked from a system wq context?
>
> I think this will also make the Fixes tag more meaningful.

I think that's completely fine. What's not fine is saturating system_wq.
Anything which creates a large number of concurrent work items should be
using its own workqueue. If anything, workqueue needs to add a warning for
saturation conditions and who are the offenders.

Thanks.

--
tejun

^ permalink raw reply	[flat|nested] 16+ messages in thread
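The effect of the fix (and of Tejun's "use your own workqueue" advice) can be sketched in the same user-space model. This is a hedged illustration only — Python queues with one worker each, illustrative names `system_wq`/`destroy_wq` — not the kernel implementation: once the mutex-taking works live on a separate queue, they no longer occupy the slot that sscs.work needs.

```python
import threading
import queue

def run_queue(wq):
    # One worker per queue, processing callables FIFO.
    def worker():
        while True:
            work = wq.get()
            if work is None:
                return
            work()
    t = threading.Thread(target=worker)
    t.start()
    return t

system_wq = queue.Queue()        # where sscs.work is queued
destroy_wq = queue.Queue()       # where the release works go after the fix
threads = [run_queue(system_wq), run_queue(destroy_wq)]

cgroup_mutex = threading.Lock()
sscs_done = threading.Event()

def release_work():
    with cgroup_mutex:           # parks the destroy_wq worker, not system_wq
        pass

cgroup_mutex.acquire()                       # the mutex is held elsewhere
for _ in range(3):
    destroy_wq.put(release_work)
system_wq.put(sscs_done.set)                 # sscs.work on system_wq

# sscs.work completes even though the destroy_wq worker is parked on
# cgroup_mutex -- the two queues no longer share active slots.
assert sscs_done.wait(timeout=1)

cgroup_mutex.release()                       # let the parked works drain
for wq in (system_wq, destroy_wq):
    wq.put(None)
for t in threads:
    t.join()
print("sscs.work ran without waiting for cgroup_mutex")
```

Compare with the earlier single-queue sketch: the only structural change is which queue the mutex takers are placed on, mirroring the move of cgroup_bpf_release to cgroup_destroy_wq.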
* Re: [PATCH v3 1/1] cgroup: fix deadlock caused by cgroup_mutex and cpu_hotplug_lock
  2024-09-10 21:17 ` Tejun Heo
@ 2024-09-12  1:33   ` Chen Ridong
  2024-09-12  5:59     ` Tejun Heo
  0 siblings, 1 reply; 16+ messages in thread
From: Chen Ridong @ 2024-09-12 1:33 UTC (permalink / raw)
To: Tejun Heo, Roman Gushchin
Cc: Michal Koutný, Chen Ridong, martin.lau, ast, daniel, andrii,
	eddyz87, song, yonghong.song, john.fastabend, kpsingh, sdf,
	haoluo, jolsa, lizefan.x, hannes, bpf, cgroups, linux-kernel

On 2024/9/11 5:17, Tejun Heo wrote:
> On Tue, Sep 10, 2024 at 09:02:59PM +0000, Roman Gushchin wrote:
> ...
>>>> By that reasoning any holder of cgroup_mutex on system_wq makes system
>>>> susceptible to a deadlock (in presence of cpu_hotplug_lock waiting
>>>> writers + cpuset operations). And the two work items must meet in same
>>>> worker's processing hence probability is low (zero?) with less than
>>>> WQ_DFL_ACTIVE items.
>>
>> Right, I'm on the same page. Should we document then somewhere that
>> the cgroup mutex can't be locked from a system wq context?
>>
>> I think this will also make the Fixes tag more meaningful.
>
> I think that's completely fine. What's not fine is saturating system_wq.
> Anything which creates a large number of concurrent work items should be
> using its own workqueue. If anything, workqueue needs to add a warning for
> saturation conditions and who are the offenders.
>
> Thanks.
>

I will add a patch to document that.
Should we modify WQ_DFL_ACTIVE(256 now)? Maybe 1024 is acceptable?

Best regards,
Ridong

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [PATCH v3 1/1] cgroup: fix deadlock caused by cgroup_mutex and cpu_hotplug_lock 2024-09-12 1:33 ` Chen Ridong @ 2024-09-12 5:59 ` Tejun Heo 0 siblings, 0 replies; 16+ messages in thread From: Tejun Heo @ 2024-09-12 5:59 UTC (permalink / raw) To: Chen Ridong Cc: Roman Gushchin, Michal Koutný, Chen Ridong, martin.lau, ast, daniel, andrii, eddyz87, song, yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, lizefan.x, hannes, bpf, cgroups, linux-kernel On Thu, Sep 12, 2024 at 09:33:23AM +0800, Chen Ridong wrote: > I will add a patch do document that. > Should we modify WQ_DFL_ACTIVE(256 now)? Maybe 1024 is acceptable? Yeah, let's bump it. Thanks. -- tejun ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v3 1/1] cgroup: fix deadlock caused by cgroup_mutex and cpu_hotplug_lock
  2024-09-09 14:19 ` Michal Koutný
  2024-09-10  1:31 ` Chen Ridong
@ 2024-09-11 11:15   ` Hillf Danton
  2024-09-26 12:53     ` Michal Koutný
  2024-09-28  8:11     ` Tetsuo Handa
  1 sibling, 2 replies; 16+ messages in thread
From: Hillf Danton @ 2024-09-11 11:15 UTC (permalink / raw)
To: Michal Koutny
Cc: Chen Ridong, tj, cgroups, Tetsuo Handa, Boqun Feng,
	Linus Torvalds, linux-kernel

On Mon, 9 Sep 2024 16:19:38 +0200 Michal Koutny <mkoutny@suse.com>
> On Sat, Aug 17, 2024 at 09:33:34AM GMT, Chen Ridong <chenridong@huawei.com> wrote:
> > The reason for this issue is cgroup_mutex and cpu_hotplug_lock are
> > acquired in different tasks, which may lead to deadlock.
> > It can lead to a deadlock through the following steps:
> > 1. A large number of cpusets are deleted asynchronously, which puts a
> > large number of cgroup_bpf_release works into system_wq. The max_active
> > of system_wq is WQ_DFL_ACTIVE(256). Consequently, all active works are
> > cgroup_bpf_release works, and many cgroup_bpf_release works will be put
> > into the inactive queue. As illustrated in the diagram, there are 256 (in
> > the active queue) + n (in the inactive queue) works.
> > 2. Setting watchdog_thresh will hold cpu_hotplug_lock.read and put
> > smp_call_on_cpu work into system_wq. However step 1 has already filled
> > system_wq, 'sscs.work' is put into the inactive queue. 'sscs.work' has
> > to wait until the works that were put into the inactive queue earlier
> > have executed (n cgroup_bpf_release), so it will be blocked for a while.
> > 3. Cpu offline requires cpu_hotplug_lock.write, which is blocked by step 2.
> > 4. Cpusets that were deleted at step 1 put cgroup_release works into
> > cgroup_destroy_wq. They are competing to get cgroup_mutex all the time.
> > When cgroup_mutex is acquired by work at css_killed_work_fn, it will
> > call cpuset_css_offline, which needs to acquire cpu_hotplug_lock.read.
> > However, cpuset_css_offline will be blocked for step 3.
> > 5. At this moment, there are 256 works in the active queue that are
> > cgroup_bpf_release, they are attempting to acquire cgroup_mutex, and as
> > a result, all of them are blocked. Consequently, sscs.work cannot be
> > executed. Ultimately, this situation leads to four processes being
> > blocked, forming a deadlock.
> >
> > system_wq(step1)        WatchDog(step2)                 cpu offline(step3)      cgroup_destroy_wq(step4)
> > ...
> > 2000+ cgroups deleted async
> > 256 actives + n inactives
> >                         __lockup_detector_reconfigure
> >                         P(cpu_hotplug_lock.read)
> >                         put sscs.work into system_wq
> > 256 + n + 1(sscs.work)
> > sscs.work wait to be executed
> >                         waiting for sscs.work to finish
> >                                                         percpu_down_write
> >                                                         P(cpu_hotplug_lock.write)
> >                                                         ...blocking...
> >                                                                                 css_killed_work_fn
> >                                                                                 P(cgroup_mutex)
> >                                                                                 cpuset_css_offline
> >                                                                                 P(cpu_hotplug_lock.read)
> >                                                                                 ...blocking...
> > 256 cgroup_bpf_release
> > mutex_lock(&cgroup_mutex);
> > ..blocking...
>
> Thanks, Ridong, for laying this out.
> Let me try to extract the core of the deps above.
>
> The correct lock ordering is: cgroup_mutex then cpu_hotplug_lock.
> However, the smp_call_on_cpu() under cpus_read_lock may lead to
> a deadlock (ABBA over those two locks).
>
> This is OK
> thread T                              system_wq worker
>
> lock(cgroup_mutex)  (II)
> ...
> unlock(cgroup_mutex)
> down(cpu_hotplug_lock.read)
> smp_call_on_cpu
>   queue_work_on(cpu, system_wq, scss)  (I)
>                                       scss.func
> wait_for_completion(scss)
> up(cpu_hotplug_lock.read)
>
> However, there is no ordering between (I) and (II) so they can also happen
> in the opposite order
>
> thread T                              system_wq worker
>
> down(cpu_hotplug_lock.read)
> smp_call_on_cpu
>   queue_work_on(cpu, system_wq, scss)  (I)
>                                       lock(cgroup_mutex)  (II)
>                                       ...
>                                       unlock(cgroup_mutex)
>                                       scss.func
> wait_for_completion(scss)
> up(cpu_hotplug_lock.read)
>
> And here the thread T + system_wq worker effectively call
> cpu_hotplug_lock and cgroup_mutex in the wrong order.
(And since they're > two threads, it won't be caught by lockdep.) > Given no workqueue work executed without being dequeued, any queued work, regardless if they are more than 2048, that acquires cgroup_mutex could not prevent the work queued by thread-T from being executed, so thread-T can make safe forward progress, therefore with no chance left for the ABBA deadlock you spotted where lockdep fails to work. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v3 1/1] cgroup: fix deadlock caused by cgroup_mutex and cpu_hotplug_lock
  2024-09-11 11:15 ` Hillf Danton
@ 2024-09-26 12:53   ` Michal Koutný
  2024-09-27 11:25     ` Hillf Danton
  0 siblings, 1 reply; 16+ messages in thread
From: Michal Koutný @ 2024-09-26 12:53 UTC (permalink / raw)
To: Hillf Danton
Cc: Chen Ridong, tj, cgroups, Tetsuo Handa, Boqun Feng,
	Linus Torvalds, linux-kernel

Hello Hillf.

(sorry for the late reply)

On Wed, Sep 11, 2024 at 07:15:42PM GMT, Hillf Danton <hdanton@sina.com> wrote:
> > However, there is no ordering between (I) and (II) so they can also happen
> > in the opposite order
> >
> > thread T                              system_wq worker
> >
> > down(cpu_hotplug_lock.read)
> > smp_call_on_cpu
> >   queue_work_on(cpu, system_wq, scss)  (I)
> >                                       lock(cgroup_mutex)  (II)
> >                                       ...
> >                                       unlock(cgroup_mutex)
> >                                       scss.func
> > wait_for_completion(scss)
> > up(cpu_hotplug_lock.read)
> >
> > And here the thread T + system_wq worker effectively call
> > cpu_hotplug_lock and cgroup_mutex in the wrong order. (And since they're
> > two threads, it won't be caught by lockdep.)
> >
> Given no workqueue work executed without being dequeued, any queued work,
> regardless if they are more than 2048, that acquires cgroup_mutex could not
> prevent the work queued by thread-T from being executed, so thread-T can
> make safe forward progress, therefore with no chance left for the ABBA
> deadlock you spotted where lockdep fails to work.

Is there a forgotten negation and did you intend to write: "any queued
work ... that acquired cgroup_mutex could prevent"?

Or if the negation is correct, why do you mean that processed work item
is _not_ preventing thread T from running (in the case I left quoted
above)?

Thanks,
Michal

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [PATCH v3 1/1] cgroup: fix deadlock caused by cgroup_mutex and cpu_hotplug_lock 2024-09-26 12:53 ` Michal Koutný @ 2024-09-27 11:25 ` Hillf Danton 2024-09-27 14:03 ` Michal Koutný 0 siblings, 1 reply; 16+ messages in thread From: Hillf Danton @ 2024-09-27 11:25 UTC (permalink / raw) To: Michal Koutny Cc: Chen Ridong, tj, cgroups, Tetsuo Handa, Boqun Feng, Linus Torvalds, linux-kernel On Thu, 26 Sep 2024 14:53:46 +0200 Michal Koutny <mkoutny@suse.com> > On Wed, Sep 11, 2024 at 07:15:42PM GMT, Hillf Danton <hdanton@sina.com> wrote: > > > However, there is no ordering between (I) and (II) so they can also happen > > > in opposite > > > > > > thread T system_wq worker > > > > > > down(cpu_hotplug_lock.read) > > > smp_call_on_cpu > > > queue_work_on(cpu, system_wq, scss) (I) > > > lock(cgroup_mutex) (II) > > > ... > > > unlock(cgroup_mutex) > > > scss.func > > > wait_for_completion(scss) > > > up(cpu_hotplug_lock.read) > > > > > > And here the thread T + system_wq worker effectively call > > > cpu_hotplug_lock and cgroup_mutex in the wrong order. (And since they're > > > two threads, it won't be caught by lockdep.) > > > > > Given no workqueue work executed without being dequeued, any queued work, > > regardless if they are more than 2048, that acquires cgroup_mutex could not > > prevent the work queued by thread-T from being executed, so thread-T can > > make safe forward progress, therefore with no chance left for the ABBA > > deadlock you spotted where lockdep fails to work. > > Is there a forgotten negation and did you intend to write: "any queued > work ... that acquired cgroup_mutex could prevent"? > No I did not. > Or if the negation is correct, why do you mean that processed work item > is _not_ preventing thread T from running (in the case I left quoted > above)? 
>
If N (N > 1) cgroup work items are queued before one cpu hotplug work, then
 1) workqueue worker1 dequeues cgroup work1 and executes it,
 2) worker1 goes off cpu and falls into a nap because it fails to acquire
    cgroup_mutex,
 3) worker2 starts processing cgroup work2 and repeats 1) and 2),
 4) after N sleepers, workerN+1 dequeues the hotplug work and executes it
    and completes finally.

Clear lad?

^ permalink raw reply	[flat|nested] 16+ messages in thread
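Hillf's sequence can be mimicked in user space. The following hedged sketch (Python's ThreadPoolExecutor standing in for the worker pool, illustrative names, not kernel code) shows the benign case: as long as a free worker slot remains beyond the N mutex takers, the hotplug work is still dequeued and completes while the N sleepers are parked.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

cgroup_mutex = threading.Lock()

def cgroup_work():
    # steps 1)-3): each worker dequeues an item, then naps on the mutex
    with cgroup_mutex:
        pass

def hotplug_work():
    # step 4): runs as soon as any worker slot is free
    return "completed"

N = 4
cgroup_mutex.acquire()                      # keep the N workers parked
with ThreadPoolExecutor(max_workers=N + 1) as pool:  # one spare "active" slot
    sleepers = [pool.submit(cgroup_work) for _ in range(N)]
    fut = pool.submit(hotplug_work)
    # the hotplug work completes despite the N parked sleepers
    result = fut.result(timeout=5)
    cgroup_mutex.release()                  # let the sleepers drain
print(result)
```

With `max_workers=N` instead (all slots saturated, the analogue of N >= WQ_DFL_ACTIVE), `fut` would sit in the queue behind the parked sleepers and `fut.result()` would time out — which is the window the thread is arguing about.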
* Re: [PATCH v3 1/1] cgroup: fix deadlock caused by cgroup_mutex and cpu_hotplug_lock
  2024-09-27 11:25 ` Hillf Danton
@ 2024-09-27 14:03   ` Michal Koutný
  0 siblings, 0 replies; 16+ messages in thread
From: Michal Koutný @ 2024-09-27 14:03 UTC (permalink / raw)
To: Hillf Danton
Cc: Chen Ridong, tj, cgroups, Tetsuo Handa, Boqun Feng,
	Linus Torvalds, linux-kernel

On Fri, Sep 27, 2024 at 07:25:16PM GMT, Hillf Danton <hdanton@sina.com> wrote:
> > Or if the negation is correct, why do you mean that processed work item
> > is _not_ preventing thread T from running (in the case I left quoted
> > above)?
> >
> If N (N > 1) cgroup work items are queued before one cpu hotplug work, then
>  1) workqueue worker1 dequeues cgroup work1 and executes it,
>  2) worker1 goes off cpu and falls into a nap because it fails to acquire
>     cgroup_mutex,
>  3) worker2 starts processing cgroup work2 and repeats 1) and 2),
>  4) after N sleepers, workerN+1 dequeues the hotplug work and executes it
>     and completes finally.

My picture of putting everything under one system_wq worker was a bit
clumsy. I see how other workers can help out with processing the queue;
that's where N >= WQ_DFL_ACTIVE comes into play, then this gets
stuck(?). [1]

IOW, if N < WQ_DFL_ACTIVE, the mutex waiters in the queue are harmless.

> Clear lad?

I hope, thanks!

Michal

[1] I don't see a trivial way to modify lockdep to catch this (besides
taking wq saturation into account it would also need to propagate some
info across complete->wait_for_completion).

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [PATCH v3 1/1] cgroup: fix deadlock caused by cgroup_mutex and cpu_hotplug_lock
  2024-09-11 11:15 ` Hillf Danton
  2024-09-26 12:53 ` Michal Koutný
@ 2024-09-28  8:11   ` Tetsuo Handa
  2024-09-29  1:30     ` Chen Ridong
  1 sibling, 1 reply; 16+ messages in thread
From: Tetsuo Handa @ 2024-09-28 8:11 UTC (permalink / raw)
To: Hillf Danton, Michal Koutny
Cc: Chen Ridong, tj, cgroups, Boqun Feng, Linus Torvalds, linux-kernel

On 2024/09/11 20:15, Hillf Danton wrote:
> On Mon, 9 Sep 2024 16:19:38 +0200 Michal Koutny <mkoutny@suse.com>
>> On Sat, Aug 17, 2024 at 09:33:34AM GMT, Chen Ridong <chenridong@huawei.com> wrote:
>>> The reason for this issue is cgroup_mutex and cpu_hotplug_lock are
>>> acquired in different tasks, which may lead to deadlock.
>>> It can lead to a deadlock through the following steps:
>>> 1. A large number of cpusets are deleted asynchronously, which puts a
>>> large number of cgroup_bpf_release works into system_wq. The max_active
>>> of system_wq is WQ_DFL_ACTIVE(256). Consequently, all active works are
>>> cgroup_bpf_release works, and many cgroup_bpf_release works will be put
>>> into the inactive queue. As illustrated in the diagram, there are 256 (in
>>> the active queue) + n (in the inactive queue) works.
> Given no workqueue work executed without being dequeued, any queued work,
> regardless if they are more than 2048, that acquires cgroup_mutex could not
> prevent the work queued by thread-T from being executed, so thread-T can
> make safe forward progress, therefore with no chance left for the ABBA
> deadlock you spotted where lockdep fails to work.

I made a simple test which queues many work items into system_wq and
measures the time needed to flush the last work item.

As the number of work items increased, the time needed also increased.
Although nobody uses flush_workqueue() on system_wq, several users use
flush_work() on work items in system_wq.
Therefore, I think that queuing thousands of work items in system_wq
should be avoided, regardless of whether there is a possibility of
deadlock.

----------------------------------------
#include <linux/module.h>
#include <linux/workqueue.h>

static void worker_func(struct work_struct *work)
{
	schedule_timeout_uninterruptible(HZ);
}

#define MAX_WORKS 8192
static struct work_struct works[MAX_WORKS];

static int __init test_init(void)
{
	int i;
	unsigned long start, end;

	for (i = 0; i < MAX_WORKS; i++) {
		INIT_WORK(&works[i], worker_func);
		schedule_work(&works[i]);
	}
	start = jiffies;
	flush_work(&works[MAX_WORKS - 1]);
	end = jiffies;
	printk("%u: Took %lu jiffies. (HZ=%u)\n", MAX_WORKS, end - start, HZ);
	for (i = 0; i < MAX_WORKS; i++)
		flush_work(&works[i]);
	return -EINVAL;
}

module_init(test_init);
MODULE_LICENSE("GPL");
----------------------------------------

12 CPUs
 256: Took  1025 jiffies. (HZ=1000)
 512: Took  2091 jiffies. (HZ=1000)
1024: Took  4105 jiffies. (HZ=1000)
2048: Took  8321 jiffies. (HZ=1000)
4096: Took 16382 jiffies. (HZ=1000)
8192: Took 32770 jiffies. (HZ=1000)

1 CPU
 256: Took  1133 jiffies. (HZ=1000)
 512: Took  2047 jiffies. (HZ=1000)
1024: Took  4117 jiffies. (HZ=1000)
2048: Took  8210 jiffies. (HZ=1000)
4096: Took 16424 jiffies. (HZ=1000)
8192: Took 32774 jiffies. (HZ=1000)

^ permalink raw reply	[flat|nested] 16+ messages in thread
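Tetsuo's measurement translates to user space almost directly. A rough sketch (Python; one worker and millisecond sleeps instead of a kernel module with HZ-long waits) reproduces the same behavior: the flush latency of the last queued item grows roughly linearly with queue depth.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def worker_func():
    # Stands in for schedule_timeout_uninterruptible(HZ), scaled down.
    time.sleep(0.005)

def flush_last(n):
    """Queue n work items on a single worker, then 'flush' the last one."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        futures = [pool.submit(worker_func) for _ in range(n)]
        start = time.monotonic()
        futures[-1].result()      # analogous to flush_work() on the final item
        return time.monotonic() - start

t_small, t_large = flush_last(50), flush_last(200)
print(f"50 items: {t_small:.2f}s, 200 items: {t_large:.2f}s")
assert t_large > t_small          # latency scales with queue depth
```

As in the kernel numbers above, the wait for the last item is dominated by everything queued ahead of it, which is why flush_work() users suffer when a shared queue is saturated.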
* Re: [PATCH v3 1/1] cgroup: fix deadlock caused by cgroup_mutex and cpu_hotplug_lock 2024-09-28 8:11 ` Tetsuo Handa @ 2024-09-29 1:30 ` Chen Ridong 0 siblings, 0 replies; 16+ messages in thread From: Chen Ridong @ 2024-09-29 1:30 UTC (permalink / raw) To: Tetsuo Handa, Hillf Danton, Michal Koutny Cc: tj, cgroups, Boqun Feng, Linus Torvalds, linux-kernel On 2024/9/28 16:11, Tetsuo Handa wrote: > On 2024/09/11 20:15, Hillf Danton wrote: >> On Mon, 9 Sep 2024 16:19:38 +0200 Michal Koutny <mkoutny@suse.com> >>> On Sat, Aug 17, 2024 at 09:33:34AM GMT, Chen Ridong <chenridong@huawei.com> wrote: >>>> The reason for this issue is cgroup_mutex and cpu_hotplug_lock are >>>> acquired in different tasks, which may lead to deadlock. >>>> It can lead to a deadlock through the following steps: >>>> 1. A large number of cpusets are deleted asynchronously, which puts a >>>> large number of cgroup_bpf_release works into system_wq. The max_active >>>> of system_wq is WQ_DFL_ACTIVE(256). Consequently, all active works are >>>> cgroup_bpf_release works, and many cgroup_bpf_release works will be put >>>> into inactive queue. As illustrated in the diagram, there are 256 (in >>>> the acvtive queue) + n (in the inactive queue) works. >> Given no workqueue work executed without being dequeued, any queued work, >> regardless if they are more than 2048, that acquires cgroup_mutex could not >> prevent the work queued by thread-T from being executed, so thread-T can >> make safe forward progress, therefore with no chance left for the ABBA >> deadlock you spotted where lockdep fails to work. > > I made a simple test which queues many work items into system_wq and > measures time needed for flushing last work item. > > As number of work items increased, time needed also increased. > Although nobody uses flush_workqueue() on system_wq, several users > use flush_work() on work item in system_wq. 
Therefore, I think that > queuing thousands of work items in system_wq should be avoided, > regardless of whether there is possibility of deadlock. > I have sent a patch to document this. Link: https://lore.kernel.org/linux-kernel/20240923114352.4001560-3-chenridong@huaweicloud.com/ Michal and I are discussing how to make this constraint clear. If you can express this constraint more clearly, just reply. Best regards, Ridong ^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2024-09-29 1:30 UTC | newest] Thread overview: 16+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2024-08-17 9:33 [PATCH v3 0/1] Fix deadlock caused by cgroup_mutex and cpu_hotplug_lock Chen Ridong 2024-08-17 9:33 ` [PATCH v3 1/1] cgroup: fix " Chen Ridong 2024-08-22 0:57 ` Chen Ridong 2024-08-30 2:08 ` Chen Ridong 2024-09-09 14:19 ` Michal Koutný 2024-09-10 1:31 ` Chen Ridong 2024-09-10 21:02 ` Roman Gushchin 2024-09-10 21:17 ` Tejun Heo 2024-09-12 1:33 ` Chen Ridong 2024-09-12 5:59 ` Tejun Heo 2024-09-11 11:15 ` Hillf Danton 2024-09-26 12:53 ` Michal Koutný 2024-09-27 11:25 ` Hillf Danton 2024-09-27 14:03 ` Michal Koutný 2024-09-28 8:11 ` Tetsuo Handa 2024-09-29 1:30 ` Chen Ridong