* [PATCH v6 1/3] cgroup/bpf: use a dedicated workqueue for cgroup bpf destruction
2024-10-08 11:24 [PATCH v6 0/3] add dedicated wq for cgroup bpf and adjust WQ_MAX_ACTIVE Chen Ridong
@ 2024-10-08 11:24 ` Chen Ridong
2024-10-08 18:45 ` Tejun Heo
2024-10-08 11:24 ` [PATCH v6 2/3] workqueue: doc: Add a note saturating the system_wq is not permitted Chen Ridong
2024-10-08 11:24 ` [PATCH v6 3/3] workqueue: Adjust WQ_MAX_ACTIVE from 512 to 2048 Chen Ridong
2 siblings, 1 reply; 6+ messages in thread
From: Chen Ridong @ 2024-10-08 11:24 UTC (permalink / raw)
To: martin.lau, ast, daniel, andrii, eddyz87, song, yonghong.song,
john.fastabend, kpsingh, sdf, haoluo, jolsa, tj, lizefan.x,
hannes, roman.gushchin, mkoutny
Cc: bpf, cgroups, linux-kernel, chenridong
From: Chen Ridong <chenridong@huawei.com>
A hung_task problem shown below was found:
INFO: task kworker/0:0:8 blocked for more than 327 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Workqueue: events cgroup_bpf_release
Call Trace:
<TASK>
__schedule+0x5a2/0x2050
? find_held_lock+0x33/0x100
? wq_worker_sleeping+0x9e/0xe0
schedule+0x9f/0x180
schedule_preempt_disabled+0x25/0x50
__mutex_lock+0x512/0x740
? cgroup_bpf_release+0x1e/0x4d0
? cgroup_bpf_release+0xcf/0x4d0
? process_scheduled_works+0x161/0x8a0
? cgroup_bpf_release+0x1e/0x4d0
? mutex_lock_nested+0x2b/0x40
? __pfx_delay_tsc+0x10/0x10
mutex_lock_nested+0x2b/0x40
cgroup_bpf_release+0xcf/0x4d0
? process_scheduled_works+0x161/0x8a0
? trace_event_raw_event_workqueue_execute_start+0x64/0xd0
? process_scheduled_works+0x161/0x8a0
process_scheduled_works+0x23a/0x8a0
worker_thread+0x231/0x5b0
? __pfx_worker_thread+0x10/0x10
kthread+0x14d/0x1c0
? __pfx_kthread+0x10/0x10
ret_from_fork+0x59/0x70
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1b/0x30
</TASK>
This issue can be reproduced by the following pressuse test:
1. A large number of cpuset cgroups are deleted.
2. Set cpu on and off repeatly.
3. Set watchdog_thresh repeatly.
The scripts can be obtained at LINK mentioned above the signature.
The reason for this issue is cgroup_mutex and cpu_hotplug_lock are
acquired in different tasks, which may lead to deadlock.
It can lead to a deadlock through the following steps:
1. A large number of cpusets are deleted asynchronously, which puts a
large number of cgroup_bpf_release works into system_wq. The max_active
of system_wq is WQ_DFL_ACTIVE(256). Consequently, all active works are
cgroup_bpf_release works, and many cgroup_bpf_release works will be put
into inactive queue. As illustrated in the diagram, there are 256 (in
the acvtive queue) + n (in the inactive queue) works.
2. Setting watchdog_thresh will hold cpu_hotplug_lock.read and put
smp_call_on_cpu work into system_wq. However step 1 has already filled
system_wq, 'sscs.work' is put into inactive queue. 'sscs.work' has
to wait until the works that were put into the inacvtive queue earlier
have executed (n cgroup_bpf_release), so it will be blocked for a while.
3. Cpu offline requires cpu_hotplug_lock.write, which is blocked by step 2.
4. Cpusets that were deleted at step 1 put cgroup_release works into
cgroup_destroy_wq. They are competing to get cgroup_mutex all the time.
When cgroup_metux is acqured by work at css_killed_work_fn, it will
call cpuset_css_offline, which needs to acqure cpu_hotplug_lock.read.
However, cpuset_css_offline will be blocked for step 3.
5. At this moment, there are 256 works in active queue that are
cgroup_bpf_release, they are attempting to acquire cgroup_mutex, and as
a result, all of them are blocked. Consequently, sscs.work can not be
executed. Ultimately, this situation leads to four processes being
blocked, forming a deadlock.
system_wq(step1) WatchDog(step2) cpu offline(step3) cgroup_destroy_wq(step4)
...
2000+ cgroups deleted asyn
256 actives + n inactives
__lockup_detector_reconfigure
P(cpu_hotplug_lock.read)
put sscs.work into system_wq
256 + n + 1(sscs.work)
sscs.work wait to be executed
warting sscs.work finish
percpu_down_write
P(cpu_hotplug_lock.write)
...blocking...
css_killed_work_fn
P(cgroup_mutex)
cpuset_css_offline
P(cpu_hotplug_lock.read)
...blocking...
256 cgroup_bpf_release
mutex_lock(&cgroup_mutex);
..blocking...
To fix the problem, place cgroup_bpf_release works on a dedicated
workqueue which can break the loop and solve the problem. System wqs are
for misc things which shouldn't create a large number of concurrent work
items. If something is going to generate >WQ_DFL_ACTIVE(256) concurrent
work items, it should use its own dedicated workqueue.
Fixes: 4bfc0bb2c60e ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself")
Link: https://lore.kernel.org/cgroups/e90c32d2-2a85-4f28-9154-09c7d320cb60@huawei.com/T/#t
Tested-by: Vishal Chourasia <vishalc@linux.ibm.com>
Signed-off-by: Chen Ridong <chenridong@huawei.com>
---
kernel/bpf/cgroup.c | 19 ++++++++++++++++++-
1 file changed, 18 insertions(+), 1 deletion(-)
diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index e7113d700b87..025d7e2214ae 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -24,6 +24,23 @@
DEFINE_STATIC_KEY_ARRAY_FALSE(cgroup_bpf_enabled_key, MAX_CGROUP_BPF_ATTACH_TYPE);
EXPORT_SYMBOL(cgroup_bpf_enabled_key);
+/*
+ * cgroup bpf destruction makes heavy use of work items and there can be a lot
+ * of concurrent destructions. Use a separate workqueue so that cgroup bpf
+ * destruction work items don't end up filling up max_active of system_wq
+ * which may lead to deadlock.
+ */
+static struct workqueue_struct *cgroup_bpf_destroy_wq;
+
+static int __init cgroup_bpf_wq_init(void)
+{
+ cgroup_bpf_destroy_wq = alloc_workqueue("cgroup_bpf_destroy", 0, 1);
+ if (!cgroup_bpf_destroy_wq)
+ panic("Failed to alloc workqueue for cgroup bpf destroy.\n");
+ return 0;
+}
+core_initcall(cgroup_bpf_wq_init);
+
/* __always_inline is necessary to prevent indirect call through run_prog
* function pointer.
*/
@@ -334,7 +351,7 @@ static void cgroup_bpf_release_fn(struct percpu_ref *ref)
struct cgroup *cgrp = container_of(ref, struct cgroup, bpf.refcnt);
INIT_WORK(&cgrp->bpf.release_work, cgroup_bpf_release);
- queue_work(system_wq, &cgrp->bpf.release_work);
+ queue_work(cgroup_bpf_destroy_wq, &cgrp->bpf.release_work);
}
/* Get underlying bpf_prog of bpf_prog_list entry, regardless if it's through
--
2.34.1
^ permalink raw reply related [flat|nested] 6+ messages in thread* Re: [PATCH v6 1/3] cgroup/bpf: use a dedicated workqueue for cgroup bpf destruction
2024-10-08 11:24 ` [PATCH v6 1/3] cgroup/bpf: use a dedicated workqueue for cgroup bpf destruction Chen Ridong
@ 2024-10-08 18:45 ` Tejun Heo
0 siblings, 0 replies; 6+ messages in thread
From: Tejun Heo @ 2024-10-08 18:45 UTC (permalink / raw)
To: Chen Ridong
Cc: martin.lau, ast, daniel, andrii, eddyz87, song, yonghong.song,
john.fastabend, kpsingh, sdf, haoluo, jolsa, lizefan.x, hannes,
roman.gushchin, mkoutny, bpf, cgroups, linux-kernel, chenridong
On Tue, Oct 08, 2024 at 11:24:56AM +0000, Chen Ridong wrote:
...
> To fix the problem, place cgroup_bpf_release works on a dedicated
> workqueue which can break the loop and solve the problem. System wqs are
> for misc things which shouldn't create a large number of concurrent work
> items. If something is going to generate >WQ_DFL_ACTIVE(256) concurrent
> work items, it should use its own dedicated workqueue.
>
> Fixes: 4bfc0bb2c60e ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself")
> Link: https://lore.kernel.org/cgroups/e90c32d2-2a85-4f28-9154-09c7d320cb60@huawei.com/T/#t
> Tested-by: Vishal Chourasia <vishalc@linux.ibm.com>
> Signed-off-by: Chen Ridong <chenridong@huawei.com>
Applied to cgroup/for-6.12-fixes w/ stable cc'd.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 6+ messages in thread
* [PATCH v6 2/3] workqueue: doc: Add a note saturating the system_wq is not permitted
2024-10-08 11:24 [PATCH v6 0/3] add dedicated wq for cgroup bpf and adjust WQ_MAX_ACTIVE Chen Ridong
2024-10-08 11:24 ` [PATCH v6 1/3] cgroup/bpf: use a dedicated workqueue for cgroup bpf destruction Chen Ridong
@ 2024-10-08 11:24 ` Chen Ridong
2024-10-08 11:24 ` [PATCH v6 3/3] workqueue: Adjust WQ_MAX_ACTIVE from 512 to 2048 Chen Ridong
2 siblings, 0 replies; 6+ messages in thread
From: Chen Ridong @ 2024-10-08 11:24 UTC (permalink / raw)
To: martin.lau, ast, daniel, andrii, eddyz87, song, yonghong.song,
john.fastabend, kpsingh, sdf, haoluo, jolsa, tj, lizefan.x,
hannes, roman.gushchin, mkoutny
Cc: bpf, cgroups, linux-kernel, chenridong
From: Chen Ridong <chenridong@huawei.com>
If something is expected to generate large number of concurrent works,
it should utilize its own dedicated workqueue rather than system wq.
Because this may saturate system_wq and potentially block other's works.
eg, cgroup release work. Let's document this as a note.
Signed-off-by: Chen Ridong <chenridong@huawei.com>
---
Documentation/core-api/workqueue.rst | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/Documentation/core-api/workqueue.rst b/Documentation/core-api/workqueue.rst
index 16f861c9791e..2b813f80ce39 100644
--- a/Documentation/core-api/workqueue.rst
+++ b/Documentation/core-api/workqueue.rst
@@ -357,6 +357,11 @@ Guidelines
difference in execution characteristics between using a dedicated wq
and a system wq.
+ Note: If something may generate more than @max_active outstanding
+ work items (do stress test your producers), it may saturate a system
+ wq and potentially lead to deadlock. It should utilize its own
+ dedicated workqueue rather than the system wq.
+
* Unless work items are expected to consume a huge amount of CPU
cycles, using a bound wq is usually beneficial due to the increased
level of locality in wq operations and work item execution.
--
2.34.1
^ permalink raw reply related [flat|nested] 6+ messages in thread
* [PATCH v6 3/3] workqueue: Adjust WQ_MAX_ACTIVE from 512 to 2048
2024-10-08 11:24 [PATCH v6 0/3] add dedicated wq for cgroup bpf and adjust WQ_MAX_ACTIVE Chen Ridong
2024-10-08 11:24 ` [PATCH v6 1/3] cgroup/bpf: use a dedicated workqueue for cgroup bpf destruction Chen Ridong
2024-10-08 11:24 ` [PATCH v6 2/3] workqueue: doc: Add a note saturating the system_wq is not permitted Chen Ridong
@ 2024-10-08 11:24 ` Chen Ridong
2024-10-08 18:48 ` Tejun Heo
2 siblings, 1 reply; 6+ messages in thread
From: Chen Ridong @ 2024-10-08 11:24 UTC (permalink / raw)
To: martin.lau, ast, daniel, andrii, eddyz87, song, yonghong.song,
john.fastabend, kpsingh, sdf, haoluo, jolsa, tj, lizefan.x,
hannes, roman.gushchin, mkoutny
Cc: bpf, cgroups, linux-kernel, chenridong
From: Chen Ridong <chenridong@huawei.com>
WQ_MAX_ACTIVE is currently set to 512, which was established approximately
15 yeas ago. However, with the significant increase in machine sizes and
capabilities, the previous limit of 256 concurrent tasks is no longer
sufficient. Therefore, we propose to increase WQ_MAX_ACTIVE to 2048.
and WQ_DFL_ACTIVE is 1024 now.
Signed-off-by: Chen Ridong <chenridong@huawei.com>
---
Documentation/core-api/workqueue.rst | 4 ++--
include/linux/workqueue.h | 2 +-
2 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/Documentation/core-api/workqueue.rst b/Documentation/core-api/workqueue.rst
index 2b813f80ce39..e295835fc116 100644
--- a/Documentation/core-api/workqueue.rst
+++ b/Documentation/core-api/workqueue.rst
@@ -245,8 +245,8 @@ CPU which can be assigned to the work items of a wq. For example, with
at the same time per CPU. This is always a per-CPU attribute, even for
unbound workqueues.
-The maximum limit for ``@max_active`` is 512 and the default value used
-when 0 is specified is 256. These values are chosen sufficiently high
+The maximum limit for ``@max_active`` is 2048 and the default value used
+when 0 is specified is 1024. These values are chosen sufficiently high
such that they are not the limiting factor while providing protection in
runaway cases.
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 59c2695e12e7..b0dc957c3e56 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -412,7 +412,7 @@ enum wq_flags {
};
enum wq_consts {
- WQ_MAX_ACTIVE = 512, /* I like 512, better ideas? */
+ WQ_MAX_ACTIVE = 2048, /* I like 2048, better ideas? */
WQ_UNBOUND_MAX_ACTIVE = WQ_MAX_ACTIVE,
WQ_DFL_ACTIVE = WQ_MAX_ACTIVE / 2,
--
2.34.1
^ permalink raw reply related [flat|nested] 6+ messages in thread* Re: [PATCH v6 3/3] workqueue: Adjust WQ_MAX_ACTIVE from 512 to 2048
2024-10-08 11:24 ` [PATCH v6 3/3] workqueue: Adjust WQ_MAX_ACTIVE from 512 to 2048 Chen Ridong
@ 2024-10-08 18:48 ` Tejun Heo
0 siblings, 0 replies; 6+ messages in thread
From: Tejun Heo @ 2024-10-08 18:48 UTC (permalink / raw)
To: Chen Ridong
Cc: martin.lau, ast, daniel, andrii, eddyz87, song, yonghong.song,
john.fastabend, kpsingh, sdf, haoluo, jolsa, lizefan.x, hannes,
roman.gushchin, mkoutny, bpf, cgroups, linux-kernel, chenridong
On Tue, Oct 08, 2024 at 11:24:58AM +0000, Chen Ridong wrote:
> From: Chen Ridong <chenridong@huawei.com>
>
> WQ_MAX_ACTIVE is currently set to 512, which was established approximately
> 15 yeas ago. However, with the significant increase in machine sizes and
> capabilities, the previous limit of 256 concurrent tasks is no longer
> sufficient. Therefore, we propose to increase WQ_MAX_ACTIVE to 2048.
> and WQ_DFL_ACTIVE is 1024 now.
>
> Signed-off-by: Chen Ridong <chenridong@huawei.com>
Applied 2-3 to wq/for-6.13.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 6+ messages in thread