* [PATCH v5 0/3] add dedicated wq for cgroup bpf and adjust WQ_MAX_ACTIVE
@ 2024-09-23 11:43 Chen Ridong
2024-09-23 11:43 ` [PATCH v5 1/3] cgroup/bpf: use a dedicated workqueue for cgroup bpf destruction Chen Ridong
` (3 more replies)
0 siblings, 4 replies; 13+ messages in thread
From: Chen Ridong @ 2024-09-23 11:43 UTC (permalink / raw)
To: tj, lizefan.x, hannes, longman, mkoutny, chenridong; +Cc: cgroups, linux-kernel
From: Chen Ridong <chenridong@huawei.com>
This patch series adds a dedicated workqueue for cgroup bpf destruction
and adjusts WQ_MAX_ACTIVE from 512 to 2048.
v5:
- use a dedicated workqueue for cgroup bpf destruction.
- update some messages per TJ's feedback.
v4:
- add a patch to document that saturating the system_wq is not permitted.
- add a patch to adjust WQ_MAX_ACTIVE from 512 to 2048.
v3:
- optimize commit msg.
Link v1: https://lore.kernel.org/cgroups/20240607110313.2230669-1-chenridong@huawei.com/
Link v2: https://lore.kernel.org/cgroups/20240719025232.2143638-1-chenridong@huawei.com/
Link v3: https://lore.kernel.org/cgroups/20240817093334.6062-1-chenridong@huawei.com/
Chen Ridong (3):
cgroup/bpf: use a dedicated workqueue for cgroup bpf destruction
workqueue: doc: Add a note saturating the system_wq is not permitted
workqueue: Adjust WQ_MAX_ACTIVE from 512 to 2048
Documentation/core-api/workqueue.rst | 9 +++++++--
include/linux/workqueue.h | 2 +-
kernel/bpf/cgroup.c | 18 +++++++++++++++++-
3 files changed, 25 insertions(+), 4 deletions(-)
--
2.34.1
^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH v5 1/3] cgroup/bpf: use a dedicated workqueue for cgroup bpf destruction
2024-09-23 11:43 [PATCH v5 0/3] add dedicated wq for cgroup bpf and adjust WQ_MAX_ACTIVE Chen Ridong
@ 2024-09-23 11:43 ` Chen Ridong
2024-09-26 12:49 ` Michal Koutný
[not found] ` <ZvYzIcYSJa3Loq4G@linux.ibm.com>
2024-09-23 11:43 ` [PATCH v5 2/3] workqueue: doc: Add a note saturating the system_wq is not permitted Chen Ridong
` (2 subsequent siblings)
3 siblings, 2 replies; 13+ messages in thread
From: Chen Ridong @ 2024-09-23 11:43 UTC (permalink / raw)
To: tj, lizefan.x, hannes, longman, mkoutny, chenridong; +Cc: cgroups, linux-kernel
From: Chen Ridong <chenridong@huawei.com>
I found a hung_task problem as shown below:
INFO: task kworker/0:0:8 blocked for more than 327 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Workqueue: events cgroup_bpf_release
Call Trace:
<TASK>
__schedule+0x5a2/0x2050
? find_held_lock+0x33/0x100
? wq_worker_sleeping+0x9e/0xe0
schedule+0x9f/0x180
schedule_preempt_disabled+0x25/0x50
__mutex_lock+0x512/0x740
? cgroup_bpf_release+0x1e/0x4d0
? cgroup_bpf_release+0xcf/0x4d0
? process_scheduled_works+0x161/0x8a0
? cgroup_bpf_release+0x1e/0x4d0
? mutex_lock_nested+0x2b/0x40
? __pfx_delay_tsc+0x10/0x10
mutex_lock_nested+0x2b/0x40
cgroup_bpf_release+0xcf/0x4d0
? process_scheduled_works+0x161/0x8a0
? trace_event_raw_event_workqueue_execute_start+0x64/0xd0
? process_scheduled_works+0x161/0x8a0
process_scheduled_works+0x23a/0x8a0
worker_thread+0x231/0x5b0
? __pfx_worker_thread+0x10/0x10
kthread+0x14d/0x1c0
? __pfx_kthread+0x10/0x10
ret_from_fork+0x59/0x70
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1b/0x30
</TASK>
This issue can be reproduced by the following pressure test:
1. A large number of cpuset cgroups are deleted.
2. CPUs are set online and offline repeatedly.
3. watchdog_thresh is set repeatedly.
The scripts can be obtained at the LINK mentioned above the signature.
The reason for this issue is that cgroup_mutex and cpu_hotplug_lock are
acquired in different tasks, which may lead to deadlock through the
following steps:
1. A large number of cpusets are deleted asynchronously, which puts a
large number of cgroup_bpf_release works into system_wq. The max_active
of system_wq is WQ_DFL_ACTIVE(256). Consequently, all active works are
cgroup_bpf_release works, and many cgroup_bpf_release works will be put
into the inactive queue. As illustrated in the diagram, there are 256 (in
the active queue) + n (in the inactive queue) works.
2. Setting watchdog_thresh will hold cpu_hotplug_lock.read and put an
smp_call_on_cpu work into system_wq. However, step 1 has already filled
system_wq, so 'sscs.work' is put into the inactive queue. 'sscs.work' has
to wait until the works that were put into the inactive queue earlier
have executed (n cgroup_bpf_release works), so it will be blocked for a
while.
3. Cpu offline requires cpu_hotplug_lock.write, which is blocked by step 2.
4. Cpusets that were deleted in step 1 put cgroup_release works into
cgroup_destroy_wq. They are competing to get cgroup_mutex all the time.
When cgroup_mutex is acquired by the work at css_killed_work_fn, it will
call cpuset_css_offline, which needs to acquire cpu_hotplug_lock.read.
However, cpuset_css_offline will be blocked by step 3.
5. At this moment, there are 256 works in the active queue that are
cgroup_bpf_release works; they are attempting to acquire cgroup_mutex,
and as a result, all of them are blocked. Consequently, sscs.work cannot
be executed. Ultimately, this situation leads to four processes being
blocked, forming a deadlock.
system_wq(step1)              WatchDog(step2)                 cpu offline(step3)        cgroup_destroy_wq(step4)
...
2000+ cgroups deleted async
256 actives + n inactives
                              __lockup_detector_reconfigure
                              P(cpu_hotplug_lock.read)
                              put sscs.work into system_wq
256 + n + 1(sscs.work)
sscs.work wait to be executed
                              waiting for sscs.work to finish
                                                              percpu_down_write
                                                              P(cpu_hotplug_lock.write)
                                                              ...blocking...
                                                                                        css_killed_work_fn
                                                                                        P(cgroup_mutex)
                                                                                        cpuset_css_offline
                                                                                        P(cpu_hotplug_lock.read)
                                                                                        ...blocking...
256 cgroup_bpf_release
mutex_lock(&cgroup_mutex);
..blocking...
To fix the problem, place cgroup_bpf_release works on a dedicated
workqueue, which breaks the loop and solves the problem. System wqs are
for misc things which shouldn't create a large number of concurrent work
items. If something is going to generate >WQ_DFL_ACTIVE(256) concurrent
work items, it should use its own dedicated workqueue.
Fixes: 4bfc0bb2c60e ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself")
Link: https://lore.kernel.org/cgroups/e90c32d2-2a85-4f28-9154-09c7d320cb60@huawei.com/T/#t
Signed-off-by: Chen Ridong <chenridong@huawei.com>
---
kernel/bpf/cgroup.c | 18 +++++++++++++++++-
1 file changed, 17 insertions(+), 1 deletion(-)
diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index e7113d700b87..1a7609f61d44 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -24,6 +24,22 @@
DEFINE_STATIC_KEY_ARRAY_FALSE(cgroup_bpf_enabled_key, MAX_CGROUP_BPF_ATTACH_TYPE);
EXPORT_SYMBOL(cgroup_bpf_enabled_key);
+/*
+ * cgroup bpf destruction makes heavy use of work items and there can be a lot
+ * of concurrent destructions. Use a separate workqueue so that cgroup bpf
+ * destruction work items don't end up filling up max_active of system_wq
+ * which may lead to deadlock.
+ */
+static struct workqueue_struct *cgroup_bpf_destroy_wq;
+
+static int __init cgroup_bpf_wq_init(void)
+{
+ cgroup_bpf_destroy_wq = alloc_workqueue("cgroup_bpf_destroy", 0, 1);
+ WARN_ON_ONCE(!cgroup_bpf_destroy_wq);
+ return 0;
+}
+core_initcall(cgroup_bpf_wq_init);
+
/* __always_inline is necessary to prevent indirect call through run_prog
* function pointer.
*/
@@ -334,7 +350,7 @@ static void cgroup_bpf_release_fn(struct percpu_ref *ref)
struct cgroup *cgrp = container_of(ref, struct cgroup, bpf.refcnt);
INIT_WORK(&cgrp->bpf.release_work, cgroup_bpf_release);
- queue_work(system_wq, &cgrp->bpf.release_work);
+ queue_work(cgroup_bpf_destroy_wq, &cgrp->bpf.release_work);
}
/* Get underlying bpf_prog of bpf_prog_list entry, regardless if it's through
--
2.34.1
* [PATCH v5 2/3] workqueue: doc: Add a note saturating the system_wq is not permitted
2024-09-23 11:43 [PATCH v5 0/3] add dedicated wq for cgroup bpf and adjust WQ_MAX_ACTIVE Chen Ridong
2024-09-23 11:43 ` [PATCH v5 1/3] cgroup/bpf: use a dedicated workqueue for cgroup bpf destruction Chen Ridong
@ 2024-09-23 11:43 ` Chen Ridong
2024-09-26 12:49 ` Michal Koutný
2024-09-23 11:43 ` [PATCH v5 3/3] workqueue: Adjust WQ_MAX_ACTIVE from 512 to 2048 Chen Ridong
2024-09-27 19:56 ` [PATCH v5 0/3] add dedicated wq for cgroup bpf and adjust WQ_MAX_ACTIVE Tejun Heo
3 siblings, 1 reply; 13+ messages in thread
From: Chen Ridong @ 2024-09-23 11:43 UTC (permalink / raw)
To: tj, lizefan.x, hannes, longman, mkoutny, chenridong; +Cc: cgroups, linux-kernel
From: Chen Ridong <chenridong@huawei.com>
If something is expected to generate a large number of concurrent work
items, it should use its own dedicated workqueue rather than a system wq,
because it may saturate system_wq and potentially block others' works,
e.g. cgroup release works. Let's document this as a note.
Signed-off-by: Chen Ridong <chenridong@huawei.com>
---
Documentation/core-api/workqueue.rst | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/Documentation/core-api/workqueue.rst b/Documentation/core-api/workqueue.rst
index 16f861c9791e..9de622188f2f 100644
--- a/Documentation/core-api/workqueue.rst
+++ b/Documentation/core-api/workqueue.rst
@@ -357,6 +357,11 @@ Guidelines
difference in execution characteristics between using a dedicated wq
and a system wq.
+ Note: If something is expected to generate a large number of concurrent
+ works, it should utilize its own dedicated workqueue rather than
+ system wq. Because this may saturate system_wq and potentially lead
+ to deadlock.
+
* Unless work items are expected to consume a huge amount of CPU
cycles, using a bound wq is usually beneficial due to the increased
level of locality in wq operations and work item execution.
--
2.34.1
* [PATCH v5 3/3] workqueue: Adjust WQ_MAX_ACTIVE from 512 to 2048
2024-09-23 11:43 [PATCH v5 0/3] add dedicated wq for cgroup bpf and adjust WQ_MAX_ACTIVE Chen Ridong
2024-09-23 11:43 ` [PATCH v5 1/3] cgroup/bpf: use a dedicated workqueue for cgroup bpf destruction Chen Ridong
2024-09-23 11:43 ` [PATCH v5 2/3] workqueue: doc: Add a note saturating the system_wq is not permitted Chen Ridong
@ 2024-09-23 11:43 ` Chen Ridong
2024-09-27 19:56 ` [PATCH v5 0/3] add dedicated wq for cgroup bpf and adjust WQ_MAX_ACTIVE Tejun Heo
3 siblings, 0 replies; 13+ messages in thread
From: Chen Ridong @ 2024-09-23 11:43 UTC (permalink / raw)
To: tj, lizefan.x, hannes, longman, mkoutny, chenridong; +Cc: cgroups, linux-kernel
From: Chen Ridong <chenridong@huawei.com>
WQ_MAX_ACTIVE is currently set to 512, a limit established approximately
15 years ago. However, with the significant increase in machine sizes and
capabilities, the previous default limit of 256 concurrent tasks is no
longer sufficient. Therefore, increase WQ_MAX_ACTIVE to 2048, which makes
WQ_DFL_ACTIVE 1024.
Signed-off-by: Chen Ridong <chenridong@huawei.com>
---
Documentation/core-api/workqueue.rst | 4 ++--
include/linux/workqueue.h | 2 +-
2 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/Documentation/core-api/workqueue.rst b/Documentation/core-api/workqueue.rst
index 9de622188f2f..c6a8ac2530e9 100644
--- a/Documentation/core-api/workqueue.rst
+++ b/Documentation/core-api/workqueue.rst
@@ -245,8 +245,8 @@ CPU which can be assigned to the work items of a wq. For example, with
at the same time per CPU. This is always a per-CPU attribute, even for
unbound workqueues.
-The maximum limit for ``@max_active`` is 512 and the default value used
-when 0 is specified is 256. These values are chosen sufficiently high
+The maximum limit for ``@max_active`` is 2048 and the default value used
+when 0 is specified is 1024. These values are chosen sufficiently high
such that they are not the limiting factor while providing protection in
runaway cases.
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 59c2695e12e7..b0dc957c3e56 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -412,7 +412,7 @@ enum wq_flags {
};
enum wq_consts {
- WQ_MAX_ACTIVE = 512, /* I like 512, better ideas? */
+ WQ_MAX_ACTIVE = 2048, /* I like 2048, better ideas? */
WQ_UNBOUND_MAX_ACTIVE = WQ_MAX_ACTIVE,
WQ_DFL_ACTIVE = WQ_MAX_ACTIVE / 2,
--
2.34.1
* Re: [PATCH v5 1/3] cgroup/bpf: use a dedicated workqueue for cgroup bpf destruction
2024-09-23 11:43 ` [PATCH v5 1/3] cgroup/bpf: use a dedicated workqueue for cgroup bpf destruction Chen Ridong
@ 2024-09-26 12:49 ` Michal Koutný
2024-09-27 7:45 ` Chen Ridong
[not found] ` <ZvYzIcYSJa3Loq4G@linux.ibm.com>
1 sibling, 1 reply; 13+ messages in thread
From: Michal Koutný @ 2024-09-26 12:49 UTC (permalink / raw)
To: Chen Ridong
Cc: tj, lizefan.x, hannes, longman, chenridong, cgroups, linux-kernel
Hello.
On Mon, Sep 23, 2024 at 11:43:50AM GMT, Chen Ridong <chenridong@huaweicloud.com> wrote:
> +static int __init cgroup_bpf_wq_init(void)
> +{
> + cgroup_bpf_destroy_wq = alloc_workqueue("cgroup_bpf_destroy", 0, 1);
> + WARN_ON_ONCE(!cgroup_bpf_destroy_wq);
> + return 0;
> +}
> +core_initcall(cgroup_bpf_wq_init);
I think a hard fail (panic() if you want to avoid BUG_ON) would be
warranted here; a mere warning would leave the system exposed to worse
errors later (and _ONCE in an initcall looks unnecessary).
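For illustration, a hard-fail variant of the initcall along these lines might look like the following (kernel-code sketch, not part of the posted patch; the panic message wording is hypothetical):

```c
static int __init cgroup_bpf_wq_init(void)
{
	cgroup_bpf_destroy_wq = alloc_workqueue("cgroup_bpf_destroy", 0, 1);
	if (!cgroup_bpf_destroy_wq)
		panic("Failed to allocate the cgroup bpf destroy workqueue\n");
	return 0;
}
core_initcall(cgroup_bpf_wq_init);
```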
Maybe look at other global wqs. I see that returning -ENOMEM might be an
option, however, I don't see that initcall's return value would be
processed anywhere currently :-/
Besides this allocation failpath this is a sensible change to me.
Thanks,
Michal
* Re: [PATCH v5 2/3] workqueue: doc: Add a note saturating the system_wq is not permitted
2024-09-23 11:43 ` [PATCH v5 2/3] workqueue: doc: Add a note saturating the system_wq is not permitted Chen Ridong
@ 2024-09-26 12:49 ` Michal Koutný
2024-09-27 8:08 ` Chen Ridong
0 siblings, 1 reply; 13+ messages in thread
From: Michal Koutný @ 2024-09-26 12:49 UTC (permalink / raw)
To: Chen Ridong
Cc: tj, lizefan.x, hannes, longman, chenridong, cgroups, linux-kernel
On Mon, Sep 23, 2024 at 11:43:51AM GMT, Chen Ridong <chenridong@huaweicloud.com> wrote:
> + Note: If something is expected to generate a large number of concurrent
> + works, it should utilize its own dedicated workqueue rather than
> + system wq. Because this may saturate system_wq and potentially lead
> + to deadlock.
How does "large number of concurrent" translate practically?
The example with released cgroup_bpf from
  cgroup_destroy_locked
    cgroup_bpf_offline
which is serialized under cgroup_mutex as argued previously. So this
generates a single entry at a time and it wouldn't hint towards the
creation of cgroup_bpf_destroy_wq.
I reckon the argument could be something like the processing rate vs
production rate of entry items should be such that number of active
items is bound. But I'm not sure it's practical since users may not know
the comparison result and they would end up always creating a dedicated
workqueue.
Michal
* Re: [PATCH v5 1/3] cgroup/bpf: use a dedicated workqueue for cgroup bpf destruction
2024-09-26 12:49 ` Michal Koutný
@ 2024-09-27 7:45 ` Chen Ridong
0 siblings, 0 replies; 13+ messages in thread
From: Chen Ridong @ 2024-09-27 7:45 UTC (permalink / raw)
To: Michal Koutný
Cc: tj, lizefan.x, hannes, longman, chenridong, cgroups, linux-kernel
On 2024/9/26 20:49, Michal Koutný wrote:
> Hello.
>
> On Mon, Sep 23, 2024 at 11:43:50AM GMT, Chen Ridong <chenridong@huaweicloud.com> wrote:
>> +static int __init cgroup_bpf_wq_init(void)
>> +{
>> + cgroup_bpf_destroy_wq = alloc_workqueue("cgroup_bpf_destroy", 0, 1);
>> + WARN_ON_ONCE(!cgroup_bpf_destroy_wq);
>> + return 0;
>> +}
>> +core_initcall(cgroup_bpf_wq_init);
>
> I think hard fail (panic() if you want to avoid BUG_ON) would be
> warranted here and mere warning would leave system exposed to worse
> errors later (and _ONCE in an initcall looks unnecessary).
>
Thank you. I think panic() is alright.
Best regards,
Ridong
> Maybe look at other global wqs. I see that returning -ENOMEM might be an
> option, however, I don't see that initcall's return value would be
> processed anywhere currently :-/
>
> Besides this allocation failpath this is a sensible change to me.
>
> Thanks,
> Michal
* Re: [PATCH v5 2/3] workqueue: doc: Add a note saturating the system_wq is not permitted
2024-09-26 12:49 ` Michal Koutný
@ 2024-09-27 8:08 ` Chen Ridong
2024-09-30 12:50 ` Michal Koutný
0 siblings, 1 reply; 13+ messages in thread
From: Chen Ridong @ 2024-09-27 8:08 UTC (permalink / raw)
To: Michal Koutný
Cc: tj, lizefan.x, hannes, longman, chenridong, cgroups, linux-kernel
On 2024/9/26 20:49, Michal Koutný wrote:
> On Mon, Sep 23, 2024 at 11:43:51AM GMT, Chen Ridong <chenridong@huaweicloud.com> wrote:
>> + Note: If something is expected to generate a large number of concurrent
>> + works, it should utilize its own dedicated workqueue rather than
>> + system wq. Because this may saturate system_wq and potentially lead
>> + to deadlock.
>
> How does "large number of concurrent" translate practically?
>
> The example with released cgroup_bpf from
> cgroup_destroy_locked
> cgroup_bpf_offline
> which is serialized under cgroup_mutex as argued previously. So this
> generates a single entry at a time and it wouldn't hint towards the
> creation of cgroup_bpf_destroy_wq.
>
> I reckon the argument could be something like the processing rate vs
> production rate of entry items should be such that number of active
> items is bound. But I'm not sure it's practical since users may not know
> the comparison result and they would end up always creating a dedicated
> workqueue.
>
>
> Michal
Thank you, Michal.
I think it's difficult to measure the comparison result. Actually, if
something generates work at a high frequency, it would be better to use
a dedicated wq.
How about:
Note: If something may generate works frequently, it may saturate the
system_wq and potentially lead to deadlock. It should utilize its own
dedicated workqueue rather than system wq.
Best regards,
Ridong
* Re: [PATCH v5 1/3] cgroup/bpf: use a dedicated workqueue for cgroup bpf destruction
[not found] ` <ZvYzIcYSJa3Loq4G@linux.ibm.com>
@ 2024-09-27 9:50 ` Chen Ridong
0 siblings, 0 replies; 13+ messages in thread
From: Chen Ridong @ 2024-09-27 9:50 UTC (permalink / raw)
To: Vishal Chourasia, Tejun Heo, Zefan Li,
Waiman Long, Michal Koutný
Cc: cgroups, LKML
On 2024/9/27 12:22, Vishal Chourasia wrote:
> On Mon, Sep 23, 2024 at 11:43:50AM +0000, Chen Ridong wrote:
>> I found a hung_task problem as shown below:
>>
>> INFO: task kworker/0:0:8 blocked for more than 327 seconds.
>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> Workqueue: events cgroup_bpf_release
>> Call Trace:
>> <TASK>
>> __schedule+0x5a2/0x2050
>> ? find_held_lock+0x33/0x100
>> ? wq_worker_sleeping+0x9e/0xe0
>> schedule+0x9f/0x180
>> schedule_preempt_disabled+0x25/0x50
>> __mutex_lock+0x512/0x740
>> ? cgroup_bpf_release+0x1e/0x4d0
>> ? cgroup_bpf_release+0xcf/0x4d0
>> ? process_scheduled_works+0x161/0x8a0
>> ? cgroup_bpf_release+0x1e/0x4d0
>> ? mutex_lock_nested+0x2b/0x40
>> ? __pfx_delay_tsc+0x10/0x10
>> mutex_lock_nested+0x2b/0x40
>> cgroup_bpf_release+0xcf/0x4d0
>> ? process_scheduled_works+0x161/0x8a0
>> ? trace_event_raw_event_workqueue_execute_start+0x64/0xd0
>> ? process_scheduled_works+0x161/0x8a0
>> process_scheduled_works+0x23a/0x8a0
>> worker_thread+0x231/0x5b0
>> ? __pfx_worker_thread+0x10/0x10
>> kthread+0x14d/0x1c0
>> ? __pfx_kthread+0x10/0x10
>> ret_from_fork+0x59/0x70
>> ? __pfx_kthread+0x10/0x10
>> ret_from_fork_asm+0x1b/0x30
>> </TASK>
>>
>> This issue can be reproduced by the following pressuse test:
>> 1. A large number of cpuset cgroups are deleted.
>> 2. Set cpu on and off repeatly.
>> 3. Set watchdog_thresh repeatly.
>> The scripts can be obtained at LINK mentioned above the signature.
>>
>> The reason for this issue is cgroup_mutex and cpu_hotplug_lock are
>> acquired in different tasks, which may lead to deadlock.
>> It can lead to a deadlock through the following steps:
>> 1. A large number of cpusets are deleted asynchronously, which puts a
>> large number of cgroup_bpf_release works into system_wq. The max_active
>> of system_wq is WQ_DFL_ACTIVE(256). Consequently, all active works are
>> cgroup_bpf_release works, and many cgroup_bpf_release works will be put
>> into inactive queue. As illustrated in the diagram, there are 256 (in
>> the acvtive queue) + n (in the inactive queue) works.
>> 2. Setting watchdog_thresh will hold cpu_hotplug_lock.read and put
>> smp_call_on_cpu work into system_wq. However step 1 has already filled
>> system_wq, 'sscs.work' is put into inactive queue. 'sscs.work' has
>> to wait until the works that were put into the inacvtive queue earlier
>> have executed (n cgroup_bpf_release), so it will be blocked for a while.
>> 3. Cpu offline requires cpu_hotplug_lock.write, which is blocked by step 2.
>> 4. Cpusets that were deleted at step 1 put cgroup_release works into
>> cgroup_destroy_wq. They are competing to get cgroup_mutex all the time.
>> When cgroup_metux is acqured by work at css_killed_work_fn, it will
>> call cpuset_css_offline, which needs to acqure cpu_hotplug_lock.read.
>> However, cpuset_css_offline will be blocked for step 3.
>> 5. At this moment, there are 256 works in active queue that are
>> cgroup_bpf_release, they are attempting to acquire cgroup_mutex, and as
>> a result, all of them are blocked. Consequently, sscs.work can not be
>> executed. Ultimately, this situation leads to four processes being
>> blocked, forming a deadlock.
>>
>> system_wq(step1) WatchDog(step2) cpu offline(step3) cgroup_destroy_wq(step4)
>> ...
>> 2000+ cgroups deleted asyn
>> 256 actives + n inactives
>> __lockup_detector_reconfigure
>> P(cpu_hotplug_lock.read)
>> put sscs.work into system_wq
>> 256 + n + 1(sscs.work)
>> sscs.work wait to be executed
>> warting sscs.work finish
>> percpu_down_write
>> P(cpu_hotplug_lock.write)
>> ...blocking...
>> css_killed_work_fn
>> P(cgroup_mutex)
>> cpuset_css_offline
>> P(cpu_hotplug_lock.read)
>> ...blocking...
>> 256 cgroup_bpf_release
>> mutex_lock(&cgroup_mutex);
>> ..blocking...
>>
>> To fix the problem, place cgroup_bpf_release works on a dedicated
>> workqueue which can break the loop and solve the problem. System wqs are
>> for misc things which shouldn't create a large number of concurrent work
>> items. If something is going to generate >WQ_DFL_ACTIVE(256) concurrent
>> work items, it should use its own dedicated workqueue.
>>
>> Fixes: 4bfc0bb2c60e ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself")
>> Link: https://lore.kernel.org/cgroups/e90c32d2-2a85-4f28-9154-09c7d320cb60@huawei.com/T/#t
>> Signed-off-by: Chen Ridong <chenridong@huawei.com>
> Tested-by: Vishal Chourasia <vishalc@linux.ibm.com>
>
> Thank you Chen, for sharing the details on how to reproduce, and for the
> patchset.
>
> Steps I followed to reproduce:
> 1) run cgroup-make.sh
> 2) run hotplug.sh
> 3) run watchdog.sh
>
> # cat cgroup-make.sh
> #!/bin/bash
>
> echo 30 > /proc/sys/kernel/hung_task_timeout_secs
> cat /proc/sys/kernel/hung_task_timeout_secs
>
> echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control
> mkdir /sys/fs/cgroup/memory
> # echo +memory > /sys/fs/cgroup/memory/cgroup.subtree_control
> mkdir /sys/fs/cgroup/cpuset
> echo +cpuset > /sys/fs/cgroup/cpuset/cgroup.subtree_control
> echo +cpu > /sys/fs/cgroup/cpuset/cgroup.subtree_control
>
>
> timestamp=$(date +%s)
> echo $timestamp
> while true; do
> for i in {0..2000}; do
> mkdir /sys/fs/cgroup/cpuset/test${timestamp}_${i} &
> mkdir /sys/fs/cgroup/memory/test${timestamp}_${i} &
> done
>
> for i in {0..2000}; do
> rmdir /sys/fs/cgroup/cpuset/test${timestamp}_${i} &
> rmdir /sys/fs/cgroup/memory/test${timestamp}_${i} &
> done
> done
>
> # cat hotplug.sh
> #!/bin/bash
>
> while true
> do
> echo 1 > /sys/devices/system/cpu/cpu2/online
> echo 1 > /sys/devices/system/cpu/cpu3/online
> echo 0 > /sys/devices/system/cpu/cpu2/online
> echo 0 > /sys/devices/system/cpu/cpu3/online
> done
>
> # cat watchdog.sh
> #!/bin/bash
>
> while true
> do
> echo 12 > /proc/sys/kernel/watchdog_thresh
> echo 11 > /proc/sys/kernel/watchdog_thresh
> echo 10 > /proc/sys/kernel/watchdog_thresh
> done
>
> With these steps I able to get the hung_task timeout log messages
> INFO: task kworker/7:1:84 blocked for more than 30 seconds.
> Not tainted 6.11.0-chenridong_base-10547-g684a64bf32b6-dirty #59
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:kworker/7:1 state:D stack:0 pid:84 tgid:84 ppid:2 flags:0x00000000
> Workqueue: events cgroup_bpf_release
> Call Trace:
> [c00000000ee779a0] [c00000000ee779e0] 0xc00000000ee779e0 (unreliable)
> [c00000000ee77b50] [c00000000001f79c] __switch_to+0x14c/0x220
> [c00000000ee77bb0] [c0000000010e8cd0] __schedule+0x2c0/0x840
> [c00000000ee77c90] [c0000000010e9290] schedule+0x40/0x110
> [c00000000ee77d00] [c0000000010e95b0] schedule_preempt_disabled+0x20/0x30
> [c00000000ee77d20] [c0000000010ec408] __mutex_lock.constprop.0+0x5e8/0xbe0
> [c00000000ee77db0] [c000000000472f58] cgroup_bpf_release+0x98/0x3d0
> [c00000000ee77e40] [c0000000001886a8] process_one_work+0x1f8/0x520
> [c00000000ee77ef0] [c00000000018a01c] worker_thread+0x33c/0x4f0
> [c00000000ee77f90] [c0000000001970c8] kthread+0x138/0x140
> [c00000000ee77fe0] [c00000000000dd58] start_kernel_thread+0x14/0x18
> INFO: task kworker/4:1:98 blocked for more than 30 seconds.
> Not tainted 6.11.0-chenridong_base-10547-g684a64bf32b6-dirty #59
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:kworker/4:1 state:D stack:0 pid:98 tgid:98 ppid:2 flags:0x00000000
> Workqueue: events cgroup_bpf_release
> Call Trace:
> [c00000000ee1f9a0] [c00000000ee1f9e0] 0xc00000000ee1f9e0 (unreliable)
> [c00000000ee1fb50] [c00000000001f79c] __switch_to+0x14c/0x220
> [c00000000ee1fbb0] [c0000000010e8cd0] __schedule+0x2c0/0x840
> [c00000000ee1fc90] [c0000000010e9290] schedule+0x40/0x110
> [c00000000ee1fd00] [c0000000010e95b0] schedule_preempt_disabled+0x20/0x30
> [c00000000ee1fd20] [c0000000010ec408] __mutex_lock.constprop.0+0x5e8/0xbe0
> [c00000000ee1fdb0] [c000000000472f58] cgroup_bpf_release+0x98/0x3d0
> [c00000000ee1fe40] [c0000000001886a8] process_one_work+0x1f8/0x520
> [c00000000ee1fef0] [c00000000018a01c] worker_thread+0x33c/0x4f0
> [c00000000ee1ff90] [c0000000001970c8] kthread+0x138/0x140
> [c00000000ee1ffe0] [c00000000000dd58] start_kernel_thread+0x14/0x18
> INFO: task kworker/5:1:110 blocked for more than 30 seconds.
> Not tainted 6.11.0-chenridong_base-10547-g684a64bf32b6-dirty #59
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:kworker/5:1 state:D stack:0 pid:110 tgid:110 ppid:2 flags:0x00000000
> Workqueue: events cgroup_bpf_release
> Call Trace:
> [c0000000608bf9a0] [c0000000608bf9e0] 0xc0000000608bf9e0 (unreliable)
> [c0000000608bfb50] [c00000000001f79c] __switch_to+0x14c/0x220
> [c0000000608bfbb0] [c0000000010e8cd0] __schedule+0x2c0/0x840
> [c0000000608bfc90] [c0000000010e9290] schedule+0x40/0x110
> [c0000000608bfd00] [c0000000010e95b0] schedule_preempt_disabled+0x20/0x30
> [c0000000608bfd20] [c0000000010ec408] __mutex_lock.constprop.0+0x5e8/0xbe0
> [c0000000608bfdb0] [c000000000472f58] cgroup_bpf_release+0x98/0x3d0
> [c0000000608bfe40] [c0000000001886a8] process_one_work+0x1f8/0x520
> [c0000000608bfef0] [c00000000018a01c] worker_thread+0x33c/0x4f0
> [c0000000608bff90] [c0000000001970c8] kthread+0x138/0x140
> [c0000000608bffe0] [c00000000000dd58] start_kernel_thread+0x14/0x18
> Future hung task reports are suppressed, see sysctl kernel.hung_task_warnings
>
> After applying this patchset, I didn't see any log messages being
> printed in dmesg.
>
> State of the git repo:
> $git log --oneline
> a40aebb33934 (HEAD -> patches/v5_20240923_chenridong_add_dedicated_wq_for_cgroup_bpf_and_adjust_wq_max_active) workqueue: Adjust WQ_MAX_ACTIVE from 512 to 2048
> 08a2979a9e59 workqueue: doc: Add a note saturating the system_wq is not permitted
> 0e6f5ea2955f cgroup/bpf: use a dedicated workqueue for cgroup bpf destruction
> 684a64bf32b6 Merge tag 'nfs-for-6.12-1' of git://git.linux-nfs.org/projects/anna/linux-nfs
>
>
>
>> ---
>> kernel/bpf/cgroup.c | 18 +++++++++++++++++-
>> 1 file changed, 17 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
>> index e7113d700b87..1a7609f61d44 100644
>> --- a/kernel/bpf/cgroup.c
>> +++ b/kernel/bpf/cgroup.c
>> @@ -24,6 +24,22 @@
>> DEFINE_STATIC_KEY_ARRAY_FALSE(cgroup_bpf_enabled_key, MAX_CGROUP_BPF_ATTACH_TYPE);
>> EXPORT_SYMBOL(cgroup_bpf_enabled_key);
>>
>> +/*
>> + * cgroup bpf destruction makes heavy use of work items and there can be a lot
>> + * of concurrent destructions. Use a separate workqueue so that cgroup bpf
>> + * destruction work items don't end up filling up max_active of system_wq
>> + * which may lead to deadlock.
>> + */
>> +static struct workqueue_struct *cgroup_bpf_destroy_wq;
>> +
>> +static int __init cgroup_bpf_wq_init(void)
>> +{
>> + cgroup_bpf_destroy_wq = alloc_workqueue("cgroup_bpf_destroy", 0, 1);
>> + WARN_ON_ONCE(!cgroup_bpf_destroy_wq);
>> + return 0;
>> +}
>> +core_initcall(cgroup_bpf_wq_init);
>> +
>> /* __always_inline is necessary to prevent indirect call through run_prog
>> * function pointer.
>> */
>> @@ -334,7 +350,7 @@ static void cgroup_bpf_release_fn(struct percpu_ref *ref)
>> struct cgroup *cgrp = container_of(ref, struct cgroup, bpf.refcnt);
>>
>> INIT_WORK(&cgrp->bpf.release_work, cgroup_bpf_release);
>> - queue_work(system_wq, &cgrp->bpf.release_work);
>> + queue_work(cgroup_bpf_destroy_wq, &cgrp->bpf.release_work);
>> }
>>
>> /* Get underlying bpf_prog of bpf_prog_list entry, regardless if it's through
>> --
>> 2.34.1
Thank you for doing that.
Best regards,
Ridong
* Re: [PATCH v5 0/3] add dedicated wq for cgroup bpf and adjust WQ_MAX_ACTIVE
2024-09-23 11:43 [PATCH v5 0/3] add dedicated wq for cgroup bpf and adjust WQ_MAX_ACTIVE Chen Ridong
` (2 preceding siblings ...)
2024-09-23 11:43 ` [PATCH v5 3/3] workqueue: Adjust WQ_MAX_ACTIVE from 512 to 2048 Chen Ridong
@ 2024-09-27 19:56 ` Tejun Heo
2024-09-29 2:38 ` Chen Ridong
3 siblings, 1 reply; 13+ messages in thread
From: Tejun Heo @ 2024-09-27 19:56 UTC (permalink / raw)
To: Chen Ridong
Cc: lizefan.x, hannes, longman, mkoutny, chenridong, cgroups,
linux-kernel
Hello,
On Mon, Sep 23, 2024 at 11:43:49AM +0000, Chen Ridong wrote:
> The patch series add a dedicated workqueue for cgroup bpf destruction,
> add adjust WQ_MAX_ACTIVE from 512 to 2048.
Patchset generally looks good to me. I'll wait for an updated version
addressing Michal's comments.
Thanks.
--
tejun
* Re: [PATCH v5 0/3] add dedicated wq for cgroup bpf and adjust WQ_MAX_ACTIVE
2024-09-27 19:56 ` [PATCH v5 0/3] add dedicated wq for cgroup bpf and adjust WQ_MAX_ACTIVE Tejun Heo
@ 2024-09-29 2:38 ` Chen Ridong
0 siblings, 0 replies; 13+ messages in thread
From: Chen Ridong @ 2024-09-29 2:38 UTC (permalink / raw)
To: Tejun Heo; +Cc: lizefan.x, hannes, longman, mkoutny, cgroups, linux-kernel
On 2024/9/28 3:56, Tejun Heo wrote:
> Hello,
>
> On Mon, Sep 23, 2024 at 11:43:49AM +0000, Chen Ridong wrote:
>> The patch series add a dedicated workqueue for cgroup bpf destruction,
>> add adjust WQ_MAX_ACTIVE from 512 to 2048.
>
> Patchset generally looks good to me. I'll wait for an updated version
> addressing Michal's comments.
>
> Thanks.
>
Thank you, TJ. I will send an updated version if Michal thinks the
"Note" is acceptable; I am waiting for his reply.
Best regards,
Ridong
* Re: [PATCH v5 2/3] workqueue: doc: Add a note saturating the system_wq is not permitted
2024-09-27 8:08 ` Chen Ridong
@ 2024-09-30 12:50 ` Michal Koutný
2024-10-08 1:32 ` Chen Ridong
0 siblings, 1 reply; 13+ messages in thread
From: Michal Koutný @ 2024-09-30 12:50 UTC (permalink / raw)
To: Chen Ridong
Cc: tj, lizefan.x, hannes, longman, chenridong, cgroups, linux-kernel
Hi.
On Fri, Sep 27, 2024 at 04:08:26PM GMT, Chen Ridong <chenridong@huaweicloud.com> wrote:
> How about:
> Note: If something may generate works frequently, it may saturate the
> system_wq and potentially lead to deadlock. It should utilize its own
> dedicated workqueue rather than system wq.
It doesn't depend only on generating frequency (in Tetsuo's example with
slow works, the "high" would only be 256/s) and accurate information is
likely only empirical, thus I'd refine it further:
> Note: If something may generate more than @max_active outstanding
> work items (do stress test your producers), it may saturate a system
> wq and potentially lead to deadlock. It should utilize its own
> dedicated workqueue rather than the system wq.
(besides @max_active reference, I also changed generic system_wq to
system wq as the surrounding text seems to refer to any of the
system_*wq)
Michal
* Re: [PATCH v5 2/3] workqueue: doc: Add a note saturating the system_wq is not permitted
2024-09-30 12:50 ` Michal Koutný
@ 2024-10-08 1:32 ` Chen Ridong
0 siblings, 0 replies; 13+ messages in thread
From: Chen Ridong @ 2024-10-08 1:32 UTC (permalink / raw)
To: Michal Koutný
Cc: tj, lizefan.x, hannes, longman, chenridong, cgroups, linux-kernel
On 2024/9/30 20:50, Michal Koutný wrote:
> Hi.
>
> On Fri, Sep 27, 2024 at 04:08:26PM GMT, Chen Ridong <chenridong@huaweicloud.com> wrote:
>> How about:
>> Note: If something may generate works frequently, it may saturate the
>> system_wq and potentially lead to deadlock. It should utilize its own
>> dedicated workqueue rather than system wq.
>
> It doesn't depend only on generating frequency (in Tetsuo's example with
> slow works, the "high" would only be 256/s) and accurate information is
> likely only empirical, thus I'd refine it further:
>
>> Note: If something may generate more than @max_active outstanding
>> work items (do stress test your producers), it may saturate a system
>> wq and potentially lead to deadlock. It should utilize its own
>> dedicated workqueue rather than the system wq.
>
> (besides @max_active reference, I also changed generic system_wq to
> system wq as the surrounding text seems to refer to any of the
> system_*wq)
>
> Michal
Thank you, Michal.
I took a week off.
I will update soon.
Best regards,
Ridong.
end of thread, newest message: 2024-10-08  1:32 UTC
Thread overview: 13+ messages
2024-09-23 11:43 [PATCH v5 0/3] add dedicated wq for cgroup bpf and adjust WQ_MAX_ACTIVE Chen Ridong
2024-09-23 11:43 ` [PATCH v5 1/3] cgroup/bpf: use a dedicated workqueue for cgroup bpf destruction Chen Ridong
2024-09-26 12:49 ` Michal Koutný
2024-09-27 7:45 ` Chen Ridong
[not found] ` <ZvYzIcYSJa3Loq4G@linux.ibm.com>
2024-09-27 9:50 ` Chen Ridong
2024-09-23 11:43 ` [PATCH v5 2/3] workqueue: doc: Add a note saturating the system_wq is not permitted Chen Ridong
2024-09-26 12:49 ` Michal Koutný
2024-09-27 8:08 ` Chen Ridong
2024-09-30 12:50 ` Michal Koutný
2024-10-08 1:32 ` Chen Ridong
2024-09-23 11:43 ` [PATCH v5 3/3] workqueue: Adjust WQ_MAX_ACTIVE from 512 to 2048 Chen Ridong
2024-09-27 19:56 ` [PATCH v5 0/3] add dedicated wq for cgroup bpf and adjust WQ_MAX_ACTIVE Tejun Heo
2024-09-29 2:38 ` Chen Ridong