BPF List
* [PATCH v6 0/3] add dedicated wq for cgroup bpf and adjust WQ_MAX_ACTIVE
From: Chen Ridong @ 2024-10-08 11:24 UTC
  To: martin.lau, ast, daniel, andrii, eddyz87, song, yonghong.song,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, tj, lizefan.x,
	hannes, roman.gushchin, mkoutny
  Cc: bpf, cgroups, linux-kernel, chenridong

From: Chen Ridong <chenridong@huawei.com>

This patch series adds a dedicated workqueue for cgroup bpf destruction
and adjusts WQ_MAX_ACTIVE from 512 to 2048.

v6:
- panic when alloc_workqueue fails, as suggested by Michal.
- update the note in the doc, as suggested by Michal.

v5:
- use a dedicated workqueue for cgroup bpf destruction.
- update some messages per TJ's feedback.

v4:
- add a patch to document that saturating the system_wq is not permitted.
- add a patch to adjust WQ_MAX_ACTIVE from 512 to 2048.

v3:
- improve the commit message.

Link v1: https://lore.kernel.org/cgroups/20240607110313.2230669-1-chenridong@huawei.com/
Link v2: https://lore.kernel.org/cgroups/20240719025232.2143638-1-chenridong@huawei.com/
Link v3: https://lore.kernel.org/cgroups/20240817093334.6062-1-chenridong@huawei.com/


Chen Ridong (3):
  cgroup/bpf: use a dedicated workqueue for cgroup bpf destruction
  workqueue: doc: Add a note saturating the system_wq is not permitted
  workqueue: Adjust WQ_MAX_ACTIVE from 512 to 2048

 Documentation/core-api/workqueue.rst |  9 +++++++--
 include/linux/workqueue.h            |  2 +-
 kernel/bpf/cgroup.c                  | 19 ++++++++++++++++++-
 3 files changed, 26 insertions(+), 4 deletions(-)

-- 
2.34.1



* [PATCH v6 1/3] cgroup/bpf: use a dedicated workqueue for cgroup bpf destruction
From: Chen Ridong @ 2024-10-08 11:24 UTC
  To: martin.lau, ast, daniel, andrii, eddyz87, song, yonghong.song,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, tj, lizefan.x,
	hannes, roman.gushchin, mkoutny
  Cc: bpf, cgroups, linux-kernel, chenridong

From: Chen Ridong <chenridong@huawei.com>

A hung_task problem shown below was found:

INFO: task kworker/0:0:8 blocked for more than 327 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Workqueue: events cgroup_bpf_release
Call Trace:
 <TASK>
 __schedule+0x5a2/0x2050
 ? find_held_lock+0x33/0x100
 ? wq_worker_sleeping+0x9e/0xe0
 schedule+0x9f/0x180
 schedule_preempt_disabled+0x25/0x50
 __mutex_lock+0x512/0x740
 ? cgroup_bpf_release+0x1e/0x4d0
 ? cgroup_bpf_release+0xcf/0x4d0
 ? process_scheduled_works+0x161/0x8a0
 ? cgroup_bpf_release+0x1e/0x4d0
 ? mutex_lock_nested+0x2b/0x40
 ? __pfx_delay_tsc+0x10/0x10
 mutex_lock_nested+0x2b/0x40
 cgroup_bpf_release+0xcf/0x4d0
 ? process_scheduled_works+0x161/0x8a0
 ? trace_event_raw_event_workqueue_execute_start+0x64/0xd0
 ? process_scheduled_works+0x161/0x8a0
 process_scheduled_works+0x23a/0x8a0
 worker_thread+0x231/0x5b0
 ? __pfx_worker_thread+0x10/0x10
 kthread+0x14d/0x1c0
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x59/0x70
 ? __pfx_kthread+0x10/0x10
 ret_from_fork_asm+0x1b/0x30
 </TASK>

This issue can be reproduced by the following pressure test:
1. A large number of cpuset cgroups are deleted.
2. Set CPUs online and offline repeatedly.
3. Set watchdog_thresh repeatedly.
The scripts can be obtained at the LINK mentioned above the signature.

The reason for this issue is that cgroup_mutex and cpu_hotplug_lock are
acquired in different tasks, which may lead to deadlock.
The deadlock arises through the following steps:
1. A large number of cpusets are deleted asynchronously, which puts a
   large number of cgroup_bpf_release works into system_wq. The max_active
   of system_wq is WQ_DFL_ACTIVE (256). Consequently, all active works are
   cgroup_bpf_release works, and many cgroup_bpf_release works are put
   into the inactive queue. As illustrated in the diagram below, there are
   256 works in the active queue plus n works in the inactive queue.
2. Setting watchdog_thresh holds cpu_hotplug_lock.read and puts the
   smp_call_on_cpu work ('sscs.work') into system_wq. However, step 1 has
   already filled system_wq, so 'sscs.work' is put into the inactive
   queue. It has to wait until the n cgroup_bpf_release works that were
   queued earlier have executed, so it is blocked for a while.
3. CPU offline requires cpu_hotplug_lock.write, which is blocked by
   step 2.
4. Cpusets that were deleted in step 1 put cgroup_release works into
   cgroup_destroy_wq. These works compete for cgroup_mutex. When
   cgroup_mutex is acquired by the work at css_killed_work_fn, it calls
   cpuset_css_offline, which needs to acquire cpu_hotplug_lock.read.
   However, that acquisition is blocked by step 3.
5. At this moment, all 256 works in the active queue are
   cgroup_bpf_release works attempting to acquire cgroup_mutex, so all of
   them are blocked. Consequently, sscs.work cannot be executed.
   Ultimately, four processes are blocked on one another, forming a
   deadlock.

system_wq(step1)		WatchDog(step2)			cpu offline(step3)	cgroup_destroy_wq(step4)
...
2000+ cgroups deleted async
256 actives + n inactives
				__lockup_detector_reconfigure
				P(cpu_hotplug_lock.read)
				put sscs.work into system_wq
256 + n + 1(sscs.work)
sscs.work wait to be executed
				waiting for sscs.work to finish
								percpu_down_write
								P(cpu_hotplug_lock.write)
								...blocking...
											css_killed_work_fn
											P(cgroup_mutex)
											cpuset_css_offline
											P(cpu_hotplug_lock.read)
											...blocking...
256 cgroup_bpf_release
mutex_lock(&cgroup_mutex);
..blocking...
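
As a minimal, hypothetical sketch (not part of this series), the module
below illustrates the saturation mechanism in step 1: once WQ_DFL_ACTIVE
work items are active on system_wq, queue_work() still succeeds, but
further items, like sscs.work, wait on the inactive list behind them.

#include <linux/module.h>
#include <linux/delay.h>
#include <linux/workqueue.h>

#define NR_WORKS 512	/* comfortably above WQ_DFL_ACTIVE (256) */

static struct work_struct works[NR_WORKS];

/* Stand-in for a work item that blocks, e.g. on cgroup_mutex. */
static void slow_fn(struct work_struct *work)
{
	msleep(10000);
}

static int __init saturate_init(void)
{
	int i;

	for (i = 0; i < NR_WORKS; i++) {
		INIT_WORK(&works[i], slow_fn);
		/* The first 256 become active; the rest queue as inactive,
		 * delaying anything queued on system_wq after them. */
		queue_work(system_wq, &works[i]);
	}
	return 0;
}
module_init(saturate_init);

MODULE_LICENSE("GPL");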

To fix the problem, place cgroup_bpf_release works on a dedicated
workqueue, which breaks the loop. System wqs are for misc things which
shouldn't create a large number of concurrent work items. If something
is going to generate more than WQ_DFL_ACTIVE (256) concurrent work
items, it should use its own dedicated workqueue.

Fixes: 4bfc0bb2c60e ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself")
Link: https://lore.kernel.org/cgroups/e90c32d2-2a85-4f28-9154-09c7d320cb60@huawei.com/T/#t
Tested-by: Vishal Chourasia <vishalc@linux.ibm.com>
Signed-off-by: Chen Ridong <chenridong@huawei.com>
---
 kernel/bpf/cgroup.c | 19 ++++++++++++++++++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index e7113d700b87..025d7e2214ae 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -24,6 +24,23 @@
 DEFINE_STATIC_KEY_ARRAY_FALSE(cgroup_bpf_enabled_key, MAX_CGROUP_BPF_ATTACH_TYPE);
 EXPORT_SYMBOL(cgroup_bpf_enabled_key);
 
+/*
+ * cgroup bpf destruction makes heavy use of work items and there can be a lot
+ * of concurrent destructions.  Use a separate workqueue so that cgroup bpf
+ * destruction work items don't end up filling up max_active of system_wq
+ * which may lead to deadlock.
+ */
+static struct workqueue_struct *cgroup_bpf_destroy_wq;
+
+static int __init cgroup_bpf_wq_init(void)
+{
+	cgroup_bpf_destroy_wq = alloc_workqueue("cgroup_bpf_destroy", 0, 1);
+	if (!cgroup_bpf_destroy_wq)
+		panic("Failed to alloc workqueue for cgroup bpf destroy.\n");
+	return 0;
+}
+core_initcall(cgroup_bpf_wq_init);
+
 /* __always_inline is necessary to prevent indirect call through run_prog
  * function pointer.
  */
@@ -334,7 +351,7 @@ static void cgroup_bpf_release_fn(struct percpu_ref *ref)
 	struct cgroup *cgrp = container_of(ref, struct cgroup, bpf.refcnt);
 
 	INIT_WORK(&cgrp->bpf.release_work, cgroup_bpf_release);
-	queue_work(system_wq, &cgrp->bpf.release_work);
+	queue_work(cgroup_bpf_destroy_wq, &cgrp->bpf.release_work);
 }
 
 /* Get underlying bpf_prog of bpf_prog_list entry, regardless if it's through
-- 
2.34.1



* [PATCH v6 2/3] workqueue: doc: Add a note saturating the system_wq is not permitted
From: Chen Ridong @ 2024-10-08 11:24 UTC
  To: martin.lau, ast, daniel, andrii, eddyz87, song, yonghong.song,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, tj, lizefan.x,
	hannes, roman.gushchin, mkoutny
  Cc: bpf, cgroups, linux-kernel, chenridong

From: Chen Ridong <chenridong@huawei.com>

If something is expected to generate a large number of concurrent work
items, it should use its own dedicated workqueue rather than a system
wq, because it may otherwise saturate system_wq and potentially block
other works, e.g. cgroup release works. Document this as a note.
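
As a hedged sketch of the pattern this note recommends (hypothetical
names, not code from this series), such a producer would allocate its
own workqueue once at init:

#include <linux/init.h>
#include <linux/workqueue.h>

/* Hypothetical subsystem that can generate many concurrent work items. */
static struct workqueue_struct *my_destroy_wq;

static int __init my_wq_init(void)
{
	/* max_active of 1 is enough when items only need to run eventually;
	 * raise it if the items benefit from concurrency. */
	my_destroy_wq = alloc_workqueue("my_destroy", 0, 1);
	if (!my_destroy_wq)
		return -ENOMEM;
	return 0;
}
core_initcall(my_wq_init);

/* Producers then use queue_work(my_destroy_wq, &work) instead of
 * queue_work(system_wq, &work). */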

Signed-off-by: Chen Ridong <chenridong@huawei.com>
---
 Documentation/core-api/workqueue.rst | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/Documentation/core-api/workqueue.rst b/Documentation/core-api/workqueue.rst
index 16f861c9791e..2b813f80ce39 100644
--- a/Documentation/core-api/workqueue.rst
+++ b/Documentation/core-api/workqueue.rst
@@ -357,6 +357,11 @@ Guidelines
   difference in execution characteristics between using a dedicated wq
   and a system wq.
 
+  Note: If something may generate more than @max_active outstanding
+  work items (do stress test your producers), it may saturate a system
+  wq and potentially lead to deadlock. It should utilize its own
+  dedicated workqueue rather than the system wq.
+
 * Unless work items are expected to consume a huge amount of CPU
   cycles, using a bound wq is usually beneficial due to the increased
   level of locality in wq operations and work item execution.
-- 
2.34.1



* [PATCH v6 3/3] workqueue: Adjust WQ_MAX_ACTIVE from 512 to 2048
From: Chen Ridong @ 2024-10-08 11:24 UTC
  To: martin.lau, ast, daniel, andrii, eddyz87, song, yonghong.song,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, tj, lizefan.x,
	hannes, roman.gushchin, mkoutny
  Cc: bpf, cgroups, linux-kernel, chenridong

From: Chen Ridong <chenridong@huawei.com>

WQ_MAX_ACTIVE is currently set to 512, a limit established approximately
15 years ago. However, with the significant increase in machine sizes and
capabilities, the previous default of 256 concurrent work items is no
longer sufficient. Therefore, increase WQ_MAX_ACTIVE to 2048, which
raises WQ_DFL_ACTIVE to 1024.

Signed-off-by: Chen Ridong <chenridong@huawei.com>
---
 Documentation/core-api/workqueue.rst | 4 ++--
 include/linux/workqueue.h            | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/Documentation/core-api/workqueue.rst b/Documentation/core-api/workqueue.rst
index 2b813f80ce39..e295835fc116 100644
--- a/Documentation/core-api/workqueue.rst
+++ b/Documentation/core-api/workqueue.rst
@@ -245,8 +245,8 @@ CPU which can be assigned to the work items of a wq. For example, with
 at the same time per CPU. This is always a per-CPU attribute, even for
 unbound workqueues.
 
-The maximum limit for ``@max_active`` is 512 and the default value used
-when 0 is specified is 256. These values are chosen sufficiently high
+The maximum limit for ``@max_active`` is 2048 and the default value used
+when 0 is specified is 1024. These values are chosen sufficiently high
 such that they are not the limiting factor while providing protection in
 runaway cases.
 
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 59c2695e12e7..b0dc957c3e56 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -412,7 +412,7 @@ enum wq_flags {
 };
 
 enum wq_consts {
-	WQ_MAX_ACTIVE		= 512,	  /* I like 512, better ideas? */
+	WQ_MAX_ACTIVE		= 2048,	  /* I like 2048, better ideas? */
 	WQ_UNBOUND_MAX_ACTIVE	= WQ_MAX_ACTIVE,
 	WQ_DFL_ACTIVE		= WQ_MAX_ACTIVE / 2,
 
-- 
2.34.1



* Re: [PATCH v6 1/3] cgroup/bpf: use a dedicated workqueue for cgroup bpf destruction
From: Tejun Heo @ 2024-10-08 18:45 UTC
  To: Chen Ridong
  Cc: martin.lau, ast, daniel, andrii, eddyz87, song, yonghong.song,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, lizefan.x, hannes,
	roman.gushchin, mkoutny, bpf, cgroups, linux-kernel, chenridong

On Tue, Oct 08, 2024 at 11:24:56AM +0000, Chen Ridong wrote:
...
> To fix the problem, place cgroup_bpf_release works on a dedicated
> workqueue, which breaks the loop. System wqs are for misc things which
> shouldn't create a large number of concurrent work items. If something
> is going to generate more than WQ_DFL_ACTIVE (256) concurrent work
> items, it should use its own dedicated workqueue.
> 
> Fixes: 4bfc0bb2c60e ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself")
> Link: https://lore.kernel.org/cgroups/e90c32d2-2a85-4f28-9154-09c7d320cb60@huawei.com/T/#t
> Tested-by: Vishal Chourasia <vishalc@linux.ibm.com>
> Signed-off-by: Chen Ridong <chenridong@huawei.com>

Applied to cgroup/for-6.12-fixes w/ stable cc'd.

Thanks.

-- 
tejun


* Re: [PATCH v6 3/3] workqueue: Adjust WQ_MAX_ACTIVE from 512 to 2048
From: Tejun Heo @ 2024-10-08 18:48 UTC
  To: Chen Ridong
  Cc: martin.lau, ast, daniel, andrii, eddyz87, song, yonghong.song,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, lizefan.x, hannes,
	roman.gushchin, mkoutny, bpf, cgroups, linux-kernel, chenridong

On Tue, Oct 08, 2024 at 11:24:58AM +0000, Chen Ridong wrote:
> From: Chen Ridong <chenridong@huawei.com>
> 
> WQ_MAX_ACTIVE is currently set to 512, a limit established approximately
> 15 years ago. However, with the significant increase in machine sizes and
> capabilities, the previous default of 256 concurrent work items is no
> longer sufficient. Therefore, increase WQ_MAX_ACTIVE to 2048, which
> raises WQ_DFL_ACTIVE to 1024.
> 
> Signed-off-by: Chen Ridong <chenridong@huawei.com>

Applied 2-3 to wq/for-6.13.

Thanks.

-- 
tejun

