* [PATCH 1/6] Revert "sched_ext: Use shorter slice while bypassing"
2024-10-09 21:40 [PATCHSET sched_ext/for-6.12-fixes] sched_ext: Fix RCU and other stalls while iterating tasks during enable/disable Tejun Heo
@ 2024-10-09 21:40 ` Tejun Heo
2024-10-10 17:59 ` David Vernet
2024-10-09 21:40 ` [PATCH 2/6] sched_ext: Start schedulers with consistent p->scx.slice values Tejun Heo
` (5 subsequent siblings)
6 siblings, 1 reply; 17+ messages in thread
From: Tejun Heo @ 2024-10-09 21:40 UTC (permalink / raw)
To: void; +Cc: kernel-team, linux-kernel, sched-ext, Tejun Heo
This reverts commit 6f34d8d382d64e7d8e77f5a9ddfd06f4c04937b0.
Slice length is ignored while bypassing, and tasks are switched on every tick,
so the patch makes no difference. The perceived difference was test noise.
Signed-off-by: Tejun Heo <tj@kernel.org>
---
kernel/sched/ext.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 22e18aec4ee1..e08cfba33cb4 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -9,7 +9,6 @@
#define SCX_OP_IDX(op) (offsetof(struct sched_ext_ops, op) / sizeof(void (*)(void)))
enum scx_consts {
- SCX_SLICE_BYPASS = SCX_SLICE_DFL / 4,
SCX_DSP_DFL_MAX_BATCH = 32,
SCX_DSP_MAX_LOOPS = 32,
SCX_WATCHDOG_MAX_TIMEOUT = 30 * HZ,
@@ -1949,7 +1948,6 @@ static bool scx_rq_online(struct rq *rq)
static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
int sticky_cpu)
{
- bool bypassing = scx_rq_bypassing(rq);
struct task_struct **ddsp_taskp;
unsigned long qseq;
@@ -1967,7 +1965,7 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
if (!scx_rq_online(rq))
goto local;
- if (bypassing)
+ if (scx_rq_bypassing(rq))
goto global;
if (p->scx.ddsp_dsq_id != SCX_DSQ_INVALID)
@@ -2022,7 +2020,7 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
global:
touch_core_sched(rq, p); /* see the comment in local: */
- p->scx.slice = bypassing ? SCX_SLICE_BYPASS : SCX_SLICE_DFL;
+ p->scx.slice = SCX_SLICE_DFL;
dispatch_enqueue(find_global_dsq(p), p, enq_flags);
}
--
2.46.2
* [PATCH 2/6] sched_ext: Start schedulers with consistent p->scx.slice values
From: Tejun Heo @ 2024-10-09 21:40 UTC (permalink / raw)
To: void; +Cc: kernel-team, linux-kernel, sched-ext, Tejun Heo
The disable path caps p->scx.slice to SCX_SLICE_DFL. As the field is already
being ignored at this stage during disable, the only effect this has is that
when the next BPF scheduler is loaded, it won't see unreasonable left-over
slices. Ultimately, this shouldn't matter but it's better to start in a
known state. Drop p->scx.slice capping from the disable path and instead
reset it to SCX_SLICE_DFL in the enable path.
Signed-off-by: Tejun Heo <tj@kernel.org>
---
kernel/sched/ext.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index e08cfba33cb4..7d2fa78ba0c9 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -4473,7 +4473,6 @@ static void scx_ops_disable_workfn(struct kthread_work *work)
sched_deq_and_put_task(p, DEQUEUE_SAVE | DEQUEUE_MOVE, &ctx);
- p->scx.slice = min_t(u64, p->scx.slice, SCX_SLICE_DFL);
__setscheduler_prio(p, p->prio);
check_class_changing(task_rq(p), p, old_class);
@@ -5190,6 +5189,7 @@ static int scx_ops_enable(struct sched_ext_ops *ops, struct bpf_link *link)
sched_deq_and_put_task(p, DEQUEUE_SAVE | DEQUEUE_MOVE, &ctx);
+ p->scx.slice = SCX_SLICE_DFL;
__setscheduler_prio(p, p->prio);
check_class_changing(task_rq(p), p, old_class);
--
2.46.2
* Re: [PATCH 2/6] sched_ext: Start schedulers with consistent p->scx.slice values
From: David Vernet @ 2024-10-10 18:00 UTC (permalink / raw)
To: Tejun Heo; +Cc: kernel-team, linux-kernel, sched-ext
On Wed, Oct 09, 2024 at 11:40:58AM -1000, Tejun Heo wrote:
> The disable path caps p->scx.slice to SCX_SLICE_DFL. As the field is already
> being ignored at this stage during disable, the only effect this has is that
> when the next BPF scheduler is loaded, it won't see unreasonable left-over
> slices. Ultimately, this shouldn't matter but it's better to start in a
> known state. Drop p->scx.slice capping from the disable path and instead
> reset it to SCX_SLICE_DFL in the enable path.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
* [PATCH 3/6] sched_ext: Move scx_builtin_idle_enabled check to scx_bpf_select_cpu_dfl()
From: Tejun Heo @ 2024-10-09 21:40 UTC (permalink / raw)
To: void; +Cc: kernel-team, linux-kernel, sched-ext, Tejun Heo
Move the sanity check from the inner function scx_select_cpu_dfl() to the
exported kfunc scx_bpf_select_cpu_dfl(). This doesn't cause behavior
differences and will allow using scx_select_cpu_dfl() in bypass mode
regardless of scx_builtin_idle_enabled.
Signed-off-by: Tejun Heo <tj@kernel.org>
---
kernel/sched/ext.c | 20 ++++++++++----------
1 file changed, 10 insertions(+), 10 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 7d2fa78ba0c9..1cae18000de1 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -3062,11 +3062,6 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
*found = false;
- if (!static_branch_likely(&scx_builtin_idle_enabled)) {
- scx_ops_error("built-in idle tracking is disabled");
- return prev_cpu;
- }
-
/*
* If WAKE_SYNC, the waker's local DSQ is empty, and the system is
* under utilized, wake up @p to the local DSQ of the waker. Checking
@@ -5870,16 +5865,21 @@ __bpf_kfunc_start_defs();
__bpf_kfunc s32 scx_bpf_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
u64 wake_flags, bool *is_idle)
{
- if (!scx_kf_allowed(SCX_KF_SELECT_CPU)) {
- *is_idle = false;
- return prev_cpu;
+ if (!static_branch_likely(&scx_builtin_idle_enabled)) {
+ scx_ops_error("built-in idle tracking is disabled");
+ goto prev_cpu;
}
+
+ if (!scx_kf_allowed(SCX_KF_SELECT_CPU))
+ goto prev_cpu;
+
#ifdef CONFIG_SMP
return scx_select_cpu_dfl(p, prev_cpu, wake_flags, is_idle);
-#else
+#endif
+
+prev_cpu:
*is_idle = false;
return prev_cpu;
-#endif
}
__bpf_kfunc_end_defs();
--
2.46.2
* [PATCH 4/6] sched_ext: bypass mode shouldn't depend on ops.select_cpu()
From: Tejun Heo @ 2024-10-09 21:41 UTC (permalink / raw)
To: void; +Cc: kernel-team, linux-kernel, sched-ext, Tejun Heo
Bypass mode depended on ops.select_cpu() which, like the rest of the BPF
scheduler, can't be trusted. Always enable and use scx_select_cpu_dfl() in
bypass mode.
Signed-off-by: Tejun Heo <tj@kernel.org>
---
kernel/sched/ext.c | 28 +++++++++++++++-------------
1 file changed, 15 insertions(+), 13 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 1cae18000de1..c5cda7368de5 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -3126,7 +3126,7 @@ static int select_task_rq_scx(struct task_struct *p, int prev_cpu, int wake_flag
if (unlikely(wake_flags & WF_EXEC))
return prev_cpu;
- if (SCX_HAS_OP(select_cpu)) {
+ if (SCX_HAS_OP(select_cpu) && !scx_rq_bypassing(task_rq(p))) {
s32 cpu;
struct task_struct **ddsp_taskp;
@@ -3191,7 +3191,7 @@ void __scx_update_idle(struct rq *rq, bool idle)
{
int cpu = cpu_of(rq);
- if (SCX_HAS_OP(update_idle)) {
+ if (SCX_HAS_OP(update_idle) && !scx_rq_bypassing(rq)) {
SCX_CALL_OP(SCX_KF_REST, update_idle, cpu_of(rq), idle);
if (!static_branch_unlikely(&scx_builtin_idle_enabled))
return;
@@ -4254,21 +4254,23 @@ bool task_should_scx(struct task_struct *p)
* the DISABLING state and then cycling the queued tasks through dequeue/enqueue
* to force global FIFO scheduling.
*
- * a. ops.enqueue() is ignored and tasks are queued in simple global FIFO order.
- * %SCX_OPS_ENQ_LAST is also ignored.
+ * - ops.select_cpu() is ignored and the default select_cpu() is used.
*
- * b. ops.dispatch() is ignored.
+ * - ops.enqueue() is ignored and tasks are queued in simple global FIFO order.
+ * %SCX_OPS_ENQ_LAST is also ignored.
*
- * c. balance_scx() does not set %SCX_RQ_BAL_KEEP on non-zero slice as slice
- * can't be trusted. Whenever a tick triggers, the running task is rotated to
- * the tail of the queue with core_sched_at touched.
+ * - ops.dispatch() is ignored.
*
- * d. pick_next_task() suppresses zero slice warning.
+ * - balance_scx() does not set %SCX_RQ_BAL_KEEP on non-zero slice as slice
+ * can't be trusted. Whenever a tick triggers, the running task is rotated to
+ * the tail of the queue with core_sched_at touched.
*
- * e. scx_bpf_kick_cpu() is disabled to avoid irq_work malfunction during PM
- * operations.
+ * - pick_next_task() suppresses zero slice warning.
*
- * f. scx_prio_less() reverts to the default core_sched_at order.
+ * - scx_bpf_kick_cpu() is disabled to avoid irq_work malfunction during PM
+ * operations.
+ *
+ * - scx_prio_less() reverts to the default core_sched_at order.
*/
static void scx_ops_bypass(bool bypass)
{
@@ -4338,7 +4340,7 @@ static void scx_ops_bypass(bool bypass)
rq_unlock_irqrestore(rq, &rf);
- /* kick to restore ticks */
+ /* resched to restore ticks and idle state */
resched_cpu(cpu);
}
}
--
2.46.2
* Re: [PATCH 4/6] sched_ext: bypass mode shouldn't depend on ops.select_cpu()
From: David Vernet @ 2024-10-10 18:15 UTC (permalink / raw)
To: Tejun Heo; +Cc: kernel-team, linux-kernel, sched-ext
On Wed, Oct 09, 2024 at 11:41:00AM -1000, Tejun Heo wrote:
Hi Tejun,
> Bypass mode depended on ops.select_cpu() which, like the rest of the BPF
> scheduler, can't be trusted. Always enable and use scx_select_cpu_dfl() in
> bypass mode.
Could you please clarify why we can't trust ops.select_cpu()? Even if it
returns a bogus, offline, etc, CPU, shouldn't core.c take care of
finding a valid CPU for us in select_fallback_rq()?
Assuming we really do require a valid CPU here in bypass mode, do we
need to reset the state of the idle masks for the case of
!scx_builtin_idle_enabled? The masks won't necessarily reflect the set
of online CPUs if we haven't been updating it, right?
Thanks,
David
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
> kernel/sched/ext.c | 28 +++++++++++++++-------------
> 1 file changed, 15 insertions(+), 13 deletions(-)
>
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 1cae18000de1..c5cda7368de5 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -3126,7 +3126,7 @@ static int select_task_rq_scx(struct task_struct *p, int prev_cpu, int wake_flag
> if (unlikely(wake_flags & WF_EXEC))
> return prev_cpu;
>
> - if (SCX_HAS_OP(select_cpu)) {
> + if (SCX_HAS_OP(select_cpu) && !scx_rq_bypassing(task_rq(p))) {
> s32 cpu;
> struct task_struct **ddsp_taskp;
>
> @@ -3191,7 +3191,7 @@ void __scx_update_idle(struct rq *rq, bool idle)
> {
> int cpu = cpu_of(rq);
>
> - if (SCX_HAS_OP(update_idle)) {
> + if (SCX_HAS_OP(update_idle) && !scx_rq_bypassing(rq)) {
> SCX_CALL_OP(SCX_KF_REST, update_idle, cpu_of(rq), idle);
> if (!static_branch_unlikely(&scx_builtin_idle_enabled))
> return;
> @@ -4254,21 +4254,23 @@ bool task_should_scx(struct task_struct *p)
> * the DISABLING state and then cycling the queued tasks through dequeue/enqueue
> * to force global FIFO scheduling.
> *
> - * a. ops.enqueue() is ignored and tasks are queued in simple global FIFO order.
> - * %SCX_OPS_ENQ_LAST is also ignored.
> + * - ops.select_cpu() is ignored and the default select_cpu() is used.
> *
> - * b. ops.dispatch() is ignored.
> + * - ops.enqueue() is ignored and tasks are queued in simple global FIFO order.
> + * %SCX_OPS_ENQ_LAST is also ignored.
> *
> - * c. balance_scx() does not set %SCX_RQ_BAL_KEEP on non-zero slice as slice
> - * can't be trusted. Whenever a tick triggers, the running task is rotated to
> - * the tail of the queue with core_sched_at touched.
> + * - ops.dispatch() is ignored.
> *
> - * d. pick_next_task() suppresses zero slice warning.
> + * - balance_scx() does not set %SCX_RQ_BAL_KEEP on non-zero slice as slice
> + * can't be trusted. Whenever a tick triggers, the running task is rotated to
> + * the tail of the queue with core_sched_at touched.
> *
> - * e. scx_bpf_kick_cpu() is disabled to avoid irq_work malfunction during PM
> - * operations.
> + * - pick_next_task() suppresses zero slice warning.
> *
> - * f. scx_prio_less() reverts to the default core_sched_at order.
> + * - scx_bpf_kick_cpu() is disabled to avoid irq_work malfunction during PM
> + * operations.
> + *
> + * - scx_prio_less() reverts to the default core_sched_at order.
> */
> static void scx_ops_bypass(bool bypass)
> {
> @@ -4338,7 +4340,7 @@ static void scx_ops_bypass(bool bypass)
>
> rq_unlock_irqrestore(rq, &rf);
>
> - /* kick to restore ticks */
> + /* resched to restore ticks and idle state */
> resched_cpu(cpu);
> }
> }
> --
> 2.46.2
>
* Re: [PATCH 4/6] sched_ext: bypass mode shouldn't depend on ops.select_cpu()
From: Tejun Heo @ 2024-10-10 18:26 UTC (permalink / raw)
To: David Vernet; +Cc: kernel-team, linux-kernel, sched-ext
Hello,
On Thu, Oct 10, 2024 at 01:15:17PM -0500, David Vernet wrote:
> On Wed, Oct 09, 2024 at 11:41:00AM -1000, Tejun Heo wrote:
> > Bypass mode depended on ops.select_cpu() which, like the rest of the BPF
> > scheduler, can't be trusted. Always enable and use scx_select_cpu_dfl() in
> > bypass mode.
>
> Could you please clarify why we can't trust ops.select_cpu()? Even if it
> returns a bogus, offline, etc, CPU, shouldn't core.c take care of
> finding a valid CPU for us in select_fallback_rq()?
For example, if select_cpu() returns the same CPU for all threads on a
loaded system, that CPU can get very overloaded which can lead to RCU and
workqueue stalls which can then cascade to other failures.
> Assuming we really do require a valid CPU here in bypass mode, do we
> need to reset the state of the idle masks for the case of
> !scx_builtin_idle_enabled? The masks won't necessarily reflect the set
> of online CPUs if we haven't been updating it, right?
I think resched_cpu() after switching each CPU into bypass mode is enough.
That guarantees that the CPU leaves the idle state, clearing the idle state
if set, and if the CPU is idle, it goes back into idle, setting the bit, so
at the end, it ends up synchronized.
Thanks.
--
tejun
* Re: [PATCH 4/6] sched_ext: bypass mode shouldn't depend on ops.select_cpu()
From: David Vernet @ 2024-10-10 18:31 UTC (permalink / raw)
To: Tejun Heo; +Cc: kernel-team, linux-kernel, sched-ext
On Thu, Oct 10, 2024 at 08:26:27AM -1000, Tejun Heo wrote:
> Hello,
>
> On Thu, Oct 10, 2024 at 01:15:17PM -0500, David Vernet wrote:
> > On Wed, Oct 09, 2024 at 11:41:00AM -1000, Tejun Heo wrote:
> > > Bypass mode depended on ops.select_cpu() which, like the rest of the BPF
> > > scheduler, can't be trusted. Always enable and use scx_select_cpu_dfl() in
> > > bypass mode.
> >
> > Could you please clarify why we can't trust ops.select_cpu()? Even if it
> > returns a bogus, offline, etc, CPU, shouldn't core.c take care of
> > finding a valid CPU for us in select_fallback_rq()?
>
> For example, if select_cpu() returns the same CPU for all threads on a
> loaded system, that CPU can get very overloaded which can lead to RCU and
> workqueue stalls which can then cascade to other failures.
Ok I see, so it's that it could make such poor scheduling decisions that
the system can't recover. Got it.
> > Assuming we really do require a valid CPU here in bypass mode, do we
> > need to reset the state of the idle masks for the case of
> > !scx_builtin_idle_enabled? The masks won't necessarily reflect the set
> > of online CPUs if we haven't been updating it, right?
>
> I think resched_cpu() after switching each CPU into bypass mode is enough.
> That guarantees that the CPU leaves the idle state, clearing the idle state
> if set, and if the CPU is idle, it goes back into idle, setting the bit, so
> at the end, it ends up synchronized.
Yeah that sounds reasonable. Thanks for explaining.
Acked-by: David Vernet <void@manifault.com>
* [PATCH 5/6] sched_ext: Move scx_tasks_lock handling into scx_task_iter helpers
From: Tejun Heo @ 2024-10-09 21:41 UTC (permalink / raw)
To: void; +Cc: kernel-team, linux-kernel, sched-ext, Tejun Heo
Iterating with scx_task_iter involves scx_tasks_lock and optionally the rq
lock of the task being iterated. Both locks can be released during iteration
and the iteration can be continued after re-grabbing scx_tasks_lock.
Currently, all lock handling is pushed to the caller which is a bit
cumbersome and makes it difficult to add lock-aware behaviors. Make the
scx_task_iter helpers handle scx_tasks_lock.
- scx_task_iter_init/scx_task_iter_exit() now grab and release
scx_tasks_lock, respectively. Renamed to
scx_task_iter_start/scx_task_iter_stop() to more clearly indicate that
there are non-trivial side-effects.
- Add __ prefix to scx_task_iter_rq_unlock() to indicate that the function
is internal.
- Add scx_task_iter_unlock/relock(). The former drops both rq lock (if held)
and scx_tasks_lock and the latter re-locks only scx_tasks_lock.
This doesn't cause behavior changes and will be used to implement stall
avoidance.
Signed-off-by: Tejun Heo <tj@kernel.org>
---
kernel/sched/ext.c | 110 +++++++++++++++++++++++----------------------
1 file changed, 56 insertions(+), 54 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index c5cda7368de5..d53c7a365fec 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -1276,76 +1276,86 @@ struct scx_task_iter {
};
/**
- * scx_task_iter_init - Initialize a task iterator
+ * scx_task_iter_start - Lock scx_tasks_lock and start a task iteration
* @iter: iterator to init
*
- * Initialize @iter. Must be called with scx_tasks_lock held. Once initialized,
- * @iter must eventually be exited with scx_task_iter_exit().
+ * Initialize @iter and return with scx_tasks_lock held. Once initialized, @iter
+ * must eventually be stopped with scx_task_iter_stop().
*
- * scx_tasks_lock may be released between this and the first next() call or
- * between any two next() calls. If scx_tasks_lock is released between two
- * next() calls, the caller is responsible for ensuring that the task being
- * iterated remains accessible either through RCU read lock or obtaining a
- * reference count.
+ * scx_tasks_lock and the rq lock may be released using scx_task_iter_unlock()
+ * between this and the first next() call or between any two next() calls. If
+ * the locks are released between two next() calls, the caller is responsible
+ * for ensuring that the task being iterated remains accessible either through
+ * RCU read lock or obtaining a reference count.
*
* All tasks which existed when the iteration started are guaranteed to be
* visited as long as they still exist.
*/
-static void scx_task_iter_init(struct scx_task_iter *iter)
+static void scx_task_iter_start(struct scx_task_iter *iter)
{
- lockdep_assert_held(&scx_tasks_lock);
-
BUILD_BUG_ON(__SCX_DSQ_ITER_ALL_FLAGS &
((1U << __SCX_DSQ_LNODE_PRIV_SHIFT) - 1));
+ spin_lock_irq(&scx_tasks_lock);
+
iter->cursor = (struct sched_ext_entity){ .flags = SCX_TASK_CURSOR };
list_add(&iter->cursor.tasks_node, &scx_tasks);
iter->locked = NULL;
}
+static void __scx_task_iter_rq_unlock(struct scx_task_iter *iter)
+{
+ if (iter->locked) {
+ task_rq_unlock(iter->rq, iter->locked, &iter->rf);
+ iter->locked = NULL;
+ }
+}
+
/**
- * scx_task_iter_rq_unlock - Unlock rq locked by a task iterator
- * @iter: iterator to unlock rq for
+ * scx_task_iter_unlock - Unlock rq and scx_tasks_lock held by a task iterator
+ * @iter: iterator to unlock
*
* If @iter is in the middle of a locked iteration, it may be locking the rq of
- * the task currently being visited. Unlock the rq if so. This function can be
- * safely called anytime during an iteration.
+ * the task currently being visited in addition to scx_tasks_lock. Unlock both.
+ * This function can be safely called anytime during an iteration.
+ */
+static void scx_task_iter_unlock(struct scx_task_iter *iter)
+{
+ __scx_task_iter_rq_unlock(iter);
+ spin_unlock_irq(&scx_tasks_lock);
+}
+
+/**
+ * scx_task_iter_relock - Lock scx_tasks_lock released by scx_task_iter_unlock()
+ * @iter: iterator to re-lock
*
- * Returns %true if the rq @iter was locking is unlocked. %false if @iter was
- * not locking an rq.
+ * Re-lock scx_tasks_lock unlocked by scx_task_iter_unlock(). Note that it
+ * doesn't re-lock the rq lock. Must be called before other iterator operations.
*/
-static bool scx_task_iter_rq_unlock(struct scx_task_iter *iter)
+static void scx_task_iter_relock(struct scx_task_iter *iter)
{
- if (iter->locked) {
- task_rq_unlock(iter->rq, iter->locked, &iter->rf);
- iter->locked = NULL;
- return true;
- } else {
- return false;
- }
+ spin_lock_irq(&scx_tasks_lock);
}
/**
- * scx_task_iter_exit - Exit a task iterator
+ * scx_task_iter_stop - Stop a task iteration and unlock scx_tasks_lock
* @iter: iterator to exit
*
- * Exit a previously initialized @iter. Must be called with scx_tasks_lock held.
- * If the iterator holds a task's rq lock, that rq lock is released. See
- * scx_task_iter_init() for details.
+ * Exit a previously initialized @iter. Must be called with scx_tasks_lock held
+ * which is released on return. If the iterator holds a task's rq lock, that rq
+ * lock is also released. See scx_task_iter_start() for details.
*/
-static void scx_task_iter_exit(struct scx_task_iter *iter)
+static void scx_task_iter_stop(struct scx_task_iter *iter)
{
- lockdep_assert_held(&scx_tasks_lock);
-
- scx_task_iter_rq_unlock(iter);
list_del_init(&iter->cursor.tasks_node);
+ scx_task_iter_unlock(iter);
}
/**
* scx_task_iter_next - Next task
* @iter: iterator to walk
*
- * Visit the next task. See scx_task_iter_init() for details.
+ * Visit the next task. See scx_task_iter_start() for details.
*/
static struct task_struct *scx_task_iter_next(struct scx_task_iter *iter)
{
@@ -1373,14 +1383,14 @@ static struct task_struct *scx_task_iter_next(struct scx_task_iter *iter)
* @include_dead: Whether we should include dead tasks in the iteration
*
* Visit the non-idle task with its rq lock held. Allows callers to specify
- * whether they would like to filter out dead tasks. See scx_task_iter_init()
+ * whether they would like to filter out dead tasks. See scx_task_iter_start()
* for details.
*/
static struct task_struct *scx_task_iter_next_locked(struct scx_task_iter *iter)
{
struct task_struct *p;
- scx_task_iter_rq_unlock(iter);
+ __scx_task_iter_rq_unlock(iter);
while ((p = scx_task_iter_next(iter))) {
/*
@@ -4462,8 +4472,7 @@ static void scx_ops_disable_workfn(struct kthread_work *work)
scx_ops_init_task_enabled = false;
- spin_lock_irq(&scx_tasks_lock);
- scx_task_iter_init(&sti);
+ scx_task_iter_start(&sti);
while ((p = scx_task_iter_next_locked(&sti))) {
const struct sched_class *old_class = p->sched_class;
struct sched_enq_and_set_ctx ctx;
@@ -4478,8 +4487,7 @@ static void scx_ops_disable_workfn(struct kthread_work *work)
check_class_changed(task_rq(p), p, old_class, p->prio);
scx_ops_exit_task(p);
}
- scx_task_iter_exit(&sti);
- spin_unlock_irq(&scx_tasks_lock);
+ scx_task_iter_stop(&sti);
percpu_up_write(&scx_fork_rwsem);
/* no task is on scx, turn off all the switches and flush in-progress calls */
@@ -5130,8 +5138,7 @@ static int scx_ops_enable(struct sched_ext_ops *ops, struct bpf_link *link)
if (ret)
goto err_disable_unlock_all;
- spin_lock_irq(&scx_tasks_lock);
- scx_task_iter_init(&sti);
+ scx_task_iter_start(&sti);
while ((p = scx_task_iter_next_locked(&sti))) {
/*
* @p may already be dead, have lost all its usages counts and
@@ -5141,15 +5148,13 @@ static int scx_ops_enable(struct sched_ext_ops *ops, struct bpf_link *link)
if (!tryget_task_struct(p))
continue;
- scx_task_iter_rq_unlock(&sti);
- spin_unlock_irq(&scx_tasks_lock);
+ scx_task_iter_unlock(&sti);
ret = scx_ops_init_task(p, task_group(p), false);
if (ret) {
put_task_struct(p);
- spin_lock_irq(&scx_tasks_lock);
- scx_task_iter_exit(&sti);
- spin_unlock_irq(&scx_tasks_lock);
+ scx_task_iter_relock(&sti);
+ scx_task_iter_stop(&sti);
scx_ops_error("ops.init_task() failed (%d) for %s[%d]",
ret, p->comm, p->pid);
goto err_disable_unlock_all;
@@ -5158,10 +5163,9 @@ static int scx_ops_enable(struct sched_ext_ops *ops, struct bpf_link *link)
scx_set_task_state(p, SCX_TASK_READY);
put_task_struct(p);
- spin_lock_irq(&scx_tasks_lock);
+ scx_task_iter_relock(&sti);
}
- scx_task_iter_exit(&sti);
- spin_unlock_irq(&scx_tasks_lock);
+ scx_task_iter_stop(&sti);
scx_cgroup_unlock();
percpu_up_write(&scx_fork_rwsem);
@@ -5178,8 +5182,7 @@ static int scx_ops_enable(struct sched_ext_ops *ops, struct bpf_link *link)
* scx_tasks_lock.
*/
percpu_down_write(&scx_fork_rwsem);
- spin_lock_irq(&scx_tasks_lock);
- scx_task_iter_init(&sti);
+ scx_task_iter_start(&sti);
while ((p = scx_task_iter_next_locked(&sti))) {
const struct sched_class *old_class = p->sched_class;
struct sched_enq_and_set_ctx ctx;
@@ -5194,8 +5197,7 @@ static int scx_ops_enable(struct sched_ext_ops *ops, struct bpf_link *link)
check_class_changed(task_rq(p), p, old_class, p->prio);
}
- scx_task_iter_exit(&sti);
- spin_unlock_irq(&scx_tasks_lock);
+ scx_task_iter_stop(&sti);
percpu_up_write(&scx_fork_rwsem);
scx_ops_bypass(false);
--
2.46.2
* Re: [PATCH 5/6] sched_ext: Move scx_tasks_lock handling into scx_task_iter helpers
From: David Vernet @ 2024-10-10 18:36 UTC (permalink / raw)
To: Tejun Heo; +Cc: kernel-team, linux-kernel, sched-ext
On Wed, Oct 09, 2024 at 11:41:01AM -1000, Tejun Heo wrote:
> Iterating with scx_task_iter involves scx_tasks_lock and optionally the rq
> lock of the task being iterated. Both locks can be released during iteration
> and the iteration can be continued after re-grabbing scx_tasks_lock.
> Currently, all lock handling is pushed to the caller which is a bit
> cumbersome and makes it difficult to add lock-aware behaviors. Make the
> scx_task_iter helpers handle scx_tasks_lock.
>
> - scx_task_iter_init/scx_task_iter_exit() now grab and release
> scx_tasks_lock, respectively. Renamed to
> scx_task_iter_start/scx_task_iter_stop() to more clearly indicate that
> there are non-trivial side-effects.
>
> - Add __ prefix to scx_task_iter_rq_unlock() to indicate that the function
> is internal.
>
> - Add scx_task_iter_unlock/relock(). The former drops both rq lock (if held)
> and scx_tasks_lock and the latter re-locks only scx_tasks_lock.
>
> This doesn't cause behavior changes and will be used to implement stall
> avoidance.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
Thanks for the nice little cleanup.
Acked-by: David Vernet <void@manifault.com>
* [PATCH 6/6] sched_ext: Don't hold scx_tasks_lock for too long
From: Tejun Heo @ 2024-10-09 21:41 UTC (permalink / raw)
To: void; +Cc: kernel-team, linux-kernel, sched-ext, Tejun Heo
While enabling and disabling a BPF scheduler, every task is iterated a
couple of times by walking scx_tasks. All iterations except one keep
holding scx_tasks_lock. On multi-socket systems under heavy rq lock
contention and a high number of threads, this can lead to RCU and other
stalls.
The following is triggered on a 2 x AMD EPYC 7642 system (192 logical CPUs)
running `stress-ng --workload 150 --workload-threads 10` with >400k idle
threads and RCU stall period reduced to 5s:
rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
rcu: 91-...!: (10 ticks this GP) idle=0754/1/0x4000000000000000 softirq=18204/18206 fqs=17
rcu: 186-...!: (17 ticks this GP) idle=ec54/1/0x4000000000000000 softirq=25863/25866 fqs=17
rcu: (detected by 80, t=10042 jiffies, g=89305, q=33 ncpus=192)
Sending NMI from CPU 80 to CPUs 91:
NMI backtrace for cpu 91
CPU: 91 UID: 0 PID: 284038 Comm: sched_ext_ops_h Kdump: loaded Not tainted 6.12.0-rc2-work-g6bf5681f7ee2-dirty #471
Hardware name: Supermicro Super Server/H11DSi, BIOS 2.8 12/14/2023
Sched_ext: simple (disabling+all)
RIP: 0010:queued_spin_lock_slowpath+0x17b/0x2f0
Code: 02 c0 10 03 00 83 79 08 00 75 08 f3 90 83 79 08 00 74 f8 48 8b 11 48 85 d2 74 09 0f 0d 0a eb 0a 31 d2 eb 06 31 d2 eb 02 f3 90 <8b> 07 66 85 c0 75 f7 39 d8 75 0d be 01 00 00 00 89 d8 f0 0f b1 37
RSP: 0018:ffffc9000fadfcb8 EFLAGS: 00000002
RAX: 0000000001700001 RBX: 0000000001700000 RCX: ffff88bfcaaf10c0
RDX: 0000000000000000 RSI: 0000000000000101 RDI: ffff88bfca8f0080
RBP: 0000000001700000 R08: 0000000000000090 R09: ffffffffffffffff
R10: ffff88a74761b268 R11: 0000000000000000 R12: ffff88a6b6765460
R13: ffffc9000fadfd60 R14: ffff88bfca8f0080 R15: ffff88bfcaac0000
FS: 0000000000000000(0000) GS:ffff88bfcaac0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f5c55f526a0 CR3: 0000000afd474000 CR4: 0000000000350eb0
Call Trace:
<NMI>
</NMI>
<TASK>
do_raw_spin_lock+0x9c/0xb0
task_rq_lock+0x50/0x190
scx_task_iter_next_locked+0x157/0x170
scx_ops_disable_workfn+0x2c2/0xbf0
kthread_worker_fn+0x108/0x2a0
kthread+0xeb/0x110
ret_from_fork+0x36/0x40
ret_from_fork_asm+0x1a/0x30
</TASK>
Sending NMI from CPU 80 to CPUs 186:
NMI backtrace for cpu 186
CPU: 186 UID: 0 PID: 51248 Comm: fish Kdump: loaded Not tainted 6.12.0-rc2-work-g6bf5681f7ee2-dirty #471
scx_task_iter can safely drop locks while iterating. Make
scx_task_iter_next() drop scx_tasks_lock every 32 iterations to avoid
stalls.
Signed-off-by: Tejun Heo <tj@kernel.org>
---
kernel/sched/ext.c | 19 +++++++++++++++++--
1 file changed, 17 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index d53c7a365fec..b44946198ea5 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -18,6 +18,12 @@ enum scx_consts {
SCX_EXIT_DUMP_DFL_LEN = 32768,
SCX_CPUPERF_ONE = SCHED_CAPACITY_SCALE,
+
+ /*
+ * Iterating all tasks may take a while. Periodically drop
+ * scx_tasks_lock to avoid causing e.g. CSD and RCU stalls.
+ */
+ SCX_OPS_TASK_ITER_BATCH = 32,
};
enum scx_exit_kind {
@@ -1273,6 +1279,7 @@ struct scx_task_iter {
struct task_struct *locked;
struct rq *rq;
struct rq_flags rf;
+ u32 cnt;
};
/**
@@ -1301,6 +1308,7 @@ static void scx_task_iter_start(struct scx_task_iter *iter)
iter->cursor = (struct sched_ext_entity){ .flags = SCX_TASK_CURSOR };
list_add(&iter->cursor.tasks_node, &scx_tasks);
iter->locked = NULL;
+ iter->cnt = 0;
}
static void __scx_task_iter_rq_unlock(struct scx_task_iter *iter)
@@ -1355,14 +1363,21 @@ static void scx_task_iter_stop(struct scx_task_iter *iter)
* scx_task_iter_next - Next task
* @iter: iterator to walk
*
- * Visit the next task. See scx_task_iter_start() for details.
+ * Visit the next task. See scx_task_iter_start() for details. Locks are dropped
+ * and re-acquired every %SCX_OPS_TASK_ITER_BATCH iterations to avoid causing
+ * stalls by holding scx_tasks_lock for too long.
*/
static struct task_struct *scx_task_iter_next(struct scx_task_iter *iter)
{
struct list_head *cursor = &iter->cursor.tasks_node;
struct sched_ext_entity *pos;
- lockdep_assert_held(&scx_tasks_lock);
+ if (!(++iter->cnt % SCX_OPS_TASK_ITER_BATCH)) {
+ scx_task_iter_unlock(iter);
+ cpu_relax();
+ cond_resched();
+ scx_task_iter_relock(iter);
+ }
list_for_each_entry(pos, cursor, tasks_node) {
if (&pos->tasks_node == &scx_tasks)
--
2.46.2
* Re: [PATCH 6/6] sched_ext: Don't hold scx_tasks_lock for too long
2024-10-09 21:41 ` [PATCH 6/6] sched_ext: Don't hold scx_tasks_lock for too long Tejun Heo
@ 2024-10-10 19:12 ` David Vernet
2024-10-10 21:38 ` Tejun Heo
0 siblings, 1 reply; 17+ messages in thread
From: David Vernet @ 2024-10-10 19:12 UTC (permalink / raw)
To: Tejun Heo; +Cc: kernel-team, linux-kernel, sched-ext
On Wed, Oct 09, 2024 at 11:41:02AM -1000, Tejun Heo wrote:
> While enabling and disabling a BPF scheduler, every task is iterated a
> couple times by walking scx_tasks. Except for one, all iterations keep
> holding scx_tasks_lock. On multi-socket systems under heavy rq lock
> contention and a high number of threads, this can lead to RCU and other
> stalls.
>
> The following is triggered on a 2 x AMD EPYC 7642 system (192 logical CPUs)
> running `stress-ng --workload 150 --workload-threads 10` with >400k idle
> threads and RCU stall period reduced to 5s:
>
> rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> rcu: 91-...!: (10 ticks this GP) idle=0754/1/0x4000000000000000 softirq=18204/18206 fqs=17
> rcu: 186-...!: (17 ticks this GP) idle=ec54/1/0x4000000000000000 softirq=25863/25866 fqs=17
> rcu: (detected by 80, t=10042 jiffies, g=89305, q=33 ncpus=192)
> Sending NMI from CPU 80 to CPUs 91:
> NMI backtrace for cpu 91
> CPU: 91 UID: 0 PID: 284038 Comm: sched_ext_ops_h Kdump: loaded Not tainted 6.12.0-rc2-work-g6bf5681f7ee2-dirty #471
> Hardware name: Supermicro Super Server/H11DSi, BIOS 2.8 12/14/2023
> Sched_ext: simple (disabling+all)
> RIP: 0010:queued_spin_lock_slowpath+0x17b/0x2f0
> Code: 02 c0 10 03 00 83 79 08 00 75 08 f3 90 83 79 08 00 74 f8 48 8b 11 48 85 d2 74 09 0f 0d 0a eb 0a 31 d2 eb 06 31 d2 eb 02 f3 90 <8b> 07 66 85 c0 75 f7 39 d8 75 0d be 01 00 00 00 89 d8 f0 0f b1 37
> RSP: 0018:ffffc9000fadfcb8 EFLAGS: 00000002
> RAX: 0000000001700001 RBX: 0000000001700000 RCX: ffff88bfcaaf10c0
> RDX: 0000000000000000 RSI: 0000000000000101 RDI: ffff88bfca8f0080
> RBP: 0000000001700000 R08: 0000000000000090 R09: ffffffffffffffff
> R10: ffff88a74761b268 R11: 0000000000000000 R12: ffff88a6b6765460
> R13: ffffc9000fadfd60 R14: ffff88bfca8f0080 R15: ffff88bfcaac0000
> FS: 0000000000000000(0000) GS:ffff88bfcaac0000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007f5c55f526a0 CR3: 0000000afd474000 CR4: 0000000000350eb0
> Call Trace:
> <NMI>
> </NMI>
> <TASK>
> do_raw_spin_lock+0x9c/0xb0
> task_rq_lock+0x50/0x190
> scx_task_iter_next_locked+0x157/0x170
> scx_ops_disable_workfn+0x2c2/0xbf0
> kthread_worker_fn+0x108/0x2a0
> kthread+0xeb/0x110
> ret_from_fork+0x36/0x40
> ret_from_fork_asm+0x1a/0x30
> </TASK>
> Sending NMI from CPU 80 to CPUs 186:
> NMI backtrace for cpu 186
> CPU: 186 UID: 0 PID: 51248 Comm: fish Kdump: loaded Not tainted 6.12.0-rc2-work-g6bf5681f7ee2-dirty #471
>
> scx_task_iter can safely drop locks while iterating. Make
> scx_task_iter_next() drop scx_tasks_lock every 32 iterations to avoid
> stalls.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
LG, just had one question below.
Acked-by: David Vernet <void@manifault.com>
> ---
> kernel/sched/ext.c | 19 +++++++++++++++++--
> 1 file changed, 17 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index d53c7a365fec..b44946198ea5 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -18,6 +18,12 @@ enum scx_consts {
> SCX_EXIT_DUMP_DFL_LEN = 32768,
>
> SCX_CPUPERF_ONE = SCHED_CAPACITY_SCALE,
> +
> + /*
> + * Iterating all tasks may take a while. Periodically drop
> + * scx_tasks_lock to avoid causing e.g. CSD and RCU stalls.
> + */
> + SCX_OPS_TASK_ITER_BATCH = 32,
> };
>
> enum scx_exit_kind {
> @@ -1273,6 +1279,7 @@ struct scx_task_iter {
> struct task_struct *locked;
> struct rq *rq;
> struct rq_flags rf;
> + u32 cnt;
> };
>
> /**
> @@ -1301,6 +1308,7 @@ static void scx_task_iter_start(struct scx_task_iter *iter)
> iter->cursor = (struct sched_ext_entity){ .flags = SCX_TASK_CURSOR };
> list_add(&iter->cursor.tasks_node, &scx_tasks);
> iter->locked = NULL;
> + iter->cnt = 0;
> }
>
> static void __scx_task_iter_rq_unlock(struct scx_task_iter *iter)
> @@ -1355,14 +1363,21 @@ static void scx_task_iter_stop(struct scx_task_iter *iter)
> * scx_task_iter_next - Next task
> * @iter: iterator to walk
> *
> - * Visit the next task. See scx_task_iter_start() for details.
> + * Visit the next task. See scx_task_iter_start() for details. Locks are dropped
> + * and re-acquired every %SCX_OPS_TASK_ITER_BATCH iterations to avoid causing
> + * stalls by holding scx_tasks_lock for too long.
> */
> static struct task_struct *scx_task_iter_next(struct scx_task_iter *iter)
> {
> struct list_head *cursor = &iter->cursor.tasks_node;
> struct sched_ext_entity *pos;
>
> - lockdep_assert_held(&scx_tasks_lock);
> + if (!(++iter->cnt % SCX_OPS_TASK_ITER_BATCH)) {
> + scx_task_iter_unlock(iter);
> + cpu_relax();
Could you explain why we need this cpu_relax()? I thought it was only
necessary for busy-wait loops.
> + cond_resched();
> + scx_task_iter_relock(iter);
> + }
>
> list_for_each_entry(pos, cursor, tasks_node) {
> if (&pos->tasks_node == &scx_tasks)
> --
> 2.46.2
>
* Re: [PATCH 6/6] sched_ext: Don't hold scx_tasks_lock for too long
2024-10-10 19:12 ` David Vernet
@ 2024-10-10 21:38 ` Tejun Heo
2024-10-10 23:38 ` Waiman Long
0 siblings, 1 reply; 17+ messages in thread
From: Tejun Heo @ 2024-10-10 21:38 UTC (permalink / raw)
To: David Vernet; +Cc: kernel-team, linux-kernel, sched-ext, Waiman Long
Hello,
On Thu, Oct 10, 2024 at 02:12:37PM -0500, David Vernet wrote:
...
> > static struct task_struct *scx_task_iter_next(struct scx_task_iter *iter)
> > {
> > struct list_head *cursor = &iter->cursor.tasks_node;
> > struct sched_ext_entity *pos;
> >
> > - lockdep_assert_held(&scx_tasks_lock);
> > + if (!(++iter->cnt % SCX_OPS_TASK_ITER_BATCH)) {
> > + scx_task_iter_unlock(iter);
> > + cpu_relax();
>
> Could you explain why we need this cpu_relax()? I thought it was only
> necessary for busy-wait loops.
I don't think we need them with the current queued spinlock implementation
but back when the spinlocks were dumb, just unlocking and relocking could
still starve out others.
cc'ing Waiman who should know a lot better than me. Waiman, what's the
current state? When releasing a spinlock to give others a chance, can we
just do unlock/lock or is cpu_relax() still useful in some cases?
Thanks.
--
tejun
* Re: [PATCH 6/6] sched_ext: Don't hold scx_tasks_lock for too long
2024-10-10 21:38 ` Tejun Heo
@ 2024-10-10 23:38 ` Waiman Long
0 siblings, 0 replies; 17+ messages in thread
From: Waiman Long @ 2024-10-10 23:38 UTC (permalink / raw)
To: Tejun Heo, David Vernet; +Cc: kernel-team, linux-kernel, sched-ext
On 10/10/24 5:38 PM, Tejun Heo wrote:
> Hello,
>
> On Thu, Oct 10, 2024 at 02:12:37PM -0500, David Vernet wrote:
> ...
>>> static struct task_struct *scx_task_iter_next(struct scx_task_iter *iter)
>>> {
>>> struct list_head *cursor = &iter->cursor.tasks_node;
>>> struct sched_ext_entity *pos;
>>>
>>> - lockdep_assert_held(&scx_tasks_lock);
>>> + if (!(++iter->cnt % SCX_OPS_TASK_ITER_BATCH)) {
>>> + scx_task_iter_unlock(iter);
>>> + cpu_relax();
>> Could you explain why we need this cpu_relax()? I thought it was only
>> necessary for busy-wait loops.
> I don't think we need them with the current queued spinlock implementation
> but back when the spinlocks were dumb, just unlocking and relocking could
> still starve out others.
>
> cc'ing Waiman who should know a lot better than me. Waiman, what's the
> current state? When releasing a spinlock to give others a chance, can we
> just do unlock/lock or is cpu_relax() still useful in some cases?
I agree with David that cpu_relax() isn't needed here. Queued spinlock
is a fair lock. If there is a pending waiter, the current task will have
to wait in the wait queue even if it attempts to acquire the lock again
immediately after an unlock.
A cpu_relax() is only useful in a spin loop in order to reduce the
frequency of pounding the same cacheline again and again.
Cheers,
Longman
* Re: [PATCHSET sched_ext/for-6.12-fixes] sched_ext: Fix RCU and other stalls while iterating tasks during enable/disable
2024-10-09 21:40 [PATCHSET sched_ext/for-6.12-fixes] sched_ext: Fix RCU and other stalls while iterating tasks during enable/disable Tejun Heo
` (5 preceding siblings ...)
2024-10-09 21:41 ` [PATCH 6/6] sched_ext: Don't hold scx_tasks_lock for too long Tejun Heo
@ 2024-10-10 21:43 ` Tejun Heo
6 siblings, 0 replies; 17+ messages in thread
From: Tejun Heo @ 2024-10-10 21:43 UTC (permalink / raw)
To: void; +Cc: kernel-team, linux-kernel, sched-ext
On Wed, Oct 09, 2024 at 11:40:56AM -1000, Tejun Heo wrote:
> The enable/disable paths walk all tasks a couple times in bypass mode. There
> are a couple problems:
>
> - Bypass mode incorrectly depends on ops.select_cpu() which must not be
> trusted in bypass mode.
>
> - scx_tasks_lock is held while walking all tasks. This can lead to RCU and
> other stalls on a large heavily contended system with many tasks.
>
> Fix the former by always using the default select_cpu() in bypass mode and
> the latter by periodically dropping scx_tasks_lock while iterating tasks.
>
> This patchset contains the following patches:
>
> 0001-Revert-sched_ext-Use-shorter-slice-while-bypassing.patch
> 0002-sched_ext-Start-schedulers-with-consistent-p-scx.sli.patch
> 0003-sched_ext-Move-scx_buildin_idle_enabled-check-to-scx.patch
> 0004-sched_ext-bypass-mode-shouldn-t-depend-on-ops.select.patch
> 0005-sched_ext-Move-scx_tasks_lock-handling-into-scx_task.patch
> 0006-sched_ext-Don-t-hold-scx_tasks_lock-for-too-long.patch
Applied to sched_ext/for-6.12-fixes.
Thanks.
--
tejun