[PATCHSET sched_ext/for-6.11] sched_ext: Fix scx_bpf_reenqueue

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCHSET sched_ext/for-6.11] sched_ext: Fix scx_bpf_reenqueue_local()
@ 2024-07-09 21:09 Tejun Heo
  2024-07-09 21:09 ` [PATCH 1/2] sched_ext: Reimplement scx_bpf_reenqueue_local() Tejun Heo
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Tejun Heo @ 2024-07-09 21:09 UTC (permalink / raw)
  To: void; +Cc: linux-kernel, kernel-team, schatzberg.dan

Fix scx_bpf_reenqueue_local()'s misbehaviors in two cases:

- when SCX_ENQ_HEAD is used to dispatch a re-enqueued task back to the same
  local DSQ.

- when the CPU is being preempted by stop_task for task migration.

This patchset contains the following two patches:

 0001-sched_ext-Reimplement-scx_bpf_reenqueue_local.patch
 0002-sched_ext-Make-scx_bpf_reenqueue_local-skip-tasks-th.patch

and is also available in the following git branch:

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext.git scx-fix-reenqueue_local

diffstat follows. Thanks.

 kernel/sched/ext.c |   40 +++++++++++++++++++++++++++-------------
 1 file changed, 27 insertions(+), 13 deletions(-)

--
tejun


^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH 1/2] sched_ext: Reimplement scx_bpf_reenqueue_local()
  2024-07-09 21:09 [PATCHSET sched_ext/for-6.11] sched_ext: Fix scx_bpf_reenqueue_local() Tejun Heo
@ 2024-07-09 21:09 ` Tejun Heo
  2024-07-09 21:18   ` David Vernet
  2024-07-09 21:09 ` [PATCH 2/2] sched_ext: Make scx_bpf_reenqueue_local() skip tasks that are being migrated Tejun Heo
  2024-07-09 22:31 ` [PATCHSET sched_ext/for-6.11] sched_ext: Fix scx_bpf_reenqueue_local() Tejun Heo
  2 siblings, 1 reply; 6+ messages in thread
From: Tejun Heo @ 2024-07-09 21:09 UTC (permalink / raw)
  To: void; +Cc: linux-kernel, kernel-team, schatzberg.dan, Tejun Heo

scx_bpf_reenqueue_local() is used to re-enqueue tasks on the local DSQ from
ops.cpu_release(). Because the BPF scheduler may dispatch tasks to the same
local DSQ, to avoid processing the same tasks repeatedly, it first takes the
number of queued tasks and processes the task at the head of the queue that
number of times.

This is incorrect as a task can be dispatched to the same local DSQ with
SCX_ENQ_HEAD. Such a task will be processed repeatedly until the count is
exhausted and the succeeding tasks won't be processed at all.

Fix it by first moving all candidate tasks to a private list and then
processing that list. While at it, remove the WARNs. They're rather
superflous as later steps will check them anyway.

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 245254f7081d ("sched_ext: Implement sched_ext_ops.cpu_acquire/release()")
Cc: David Vernet <void@manifault.com>
---
 kernel/sched/ext.c | 28 ++++++++++++++--------------
 1 file changed, 14 insertions(+), 14 deletions(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index f16d72d72635..0fabfe19327c 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -5628,8 +5628,10 @@ __bpf_kfunc_start_defs();
  */
 __bpf_kfunc u32 scx_bpf_reenqueue_local(void)
 {
-	u32 nr_enqueued, i;
+	LIST_HEAD(tasks);
+	u32 nr_enqueued = 0;
 	struct rq *rq;
+	struct task_struct *p, *n;
 
 	if (!scx_kf_allowed(SCX_KF_CPU_RELEASE))
 		return 0;
@@ -5638,22 +5640,20 @@ __bpf_kfunc u32 scx_bpf_reenqueue_local(void)
 	lockdep_assert_rq_held(rq);
 
 	/*
-	 * Get the number of tasks on the local DSQ before iterating over it to
-	 * pull off tasks. The enqueue callback below can signal that it wants
-	 * the task to stay on the local DSQ, and we want to prevent the BPF
-	 * scheduler from causing us to loop indefinitely.
+	 * The BPF scheduler may choose to dispatch tasks back to
+	 * @rq->scx.local_dsq. Move all candidate tasks off to a private list
+	 * first to avoid processing the same tasks repeatedly.
 	 */
-	nr_enqueued = rq->scx.local_dsq.nr;
-	for (i = 0; i < nr_enqueued; i++) {
-		struct task_struct *p;
-
-		p = first_local_task(rq);
-		WARN_ON_ONCE(atomic_long_read(&p->scx.ops_state) !=
-			     SCX_OPSS_NONE);
-		WARN_ON_ONCE(!(p->scx.flags & SCX_TASK_QUEUED));
-		WARN_ON_ONCE(p->scx.holding_cpu != -1);
+	list_for_each_entry_safe(p, n, &rq->scx.local_dsq.list,
+				 scx.dsq_list.node) {
 		dispatch_dequeue(rq, p);
+		list_add_tail(&p->scx.dsq_list.node, &tasks);
+	}
+
+	list_for_each_entry_safe(p, n, &tasks, scx.dsq_list.node) {
+		list_del_init(&p->scx.dsq_list.node);
 		do_enqueue_task(rq, p, SCX_ENQ_REENQ, -1);
+		nr_enqueued++;
 	}
 
 	return nr_enqueued;
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH 2/2] sched_ext: Make scx_bpf_reenqueue_local() skip tasks that are being migrated
  2024-07-09 21:09 [PATCHSET sched_ext/for-6.11] sched_ext: Fix scx_bpf_reenqueue_local() Tejun Heo
  2024-07-09 21:09 ` [PATCH 1/2] sched_ext: Reimplement scx_bpf_reenqueue_local() Tejun Heo
@ 2024-07-09 21:09 ` Tejun Heo
  2024-07-09 22:16   ` David Vernet
  2024-07-09 22:31 ` [PATCHSET sched_ext/for-6.11] sched_ext: Fix scx_bpf_reenqueue_local() Tejun Heo
  2 siblings, 1 reply; 6+ messages in thread
From: Tejun Heo @ 2024-07-09 21:09 UTC (permalink / raw)
  To: void; +Cc: linux-kernel, kernel-team, schatzberg.dan, Tejun Heo

When a running task is migrated to another CPU, the stop_task is used to
preempt the running task and migrate it. This, expectedly, invokes
ops.cpu_release(). If the BPF scheduler then calls
scx_bpf_reenqueue_local(), it re-enqueues all tasks on the local DSQ
including the task which is being migrated.

This creates an unnecessary re-enqueue of a task which is about to be
deactivated and re-activated for migration anyway. It can also cause
confusion for the BPF scheduler as scx_bpf_task_cpu() of the task and its
allowed CPUs may not agree while migration is pending.

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 245254f7081d ("sched_ext: Implement sched_ext_ops.cpu_acquire/release()")
Cc: David Vernet <void@manifault.com>
---
 kernel/sched/ext.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 0fabfe19327c..f4a7abcec793 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -5646,6 +5646,20 @@ __bpf_kfunc u32 scx_bpf_reenqueue_local(void)
 	 */
 	list_for_each_entry_safe(p, n, &rq->scx.local_dsq.list,
 				 scx.dsq_list.node) {
+		/*
+		 * If @p is being migrated, @p's current CPU may not agree with
+		 * its allowed CPUs and the migration_cpu_stop is about to
+		 * deactivate and re-activate @p anyway. Skip re-enqueueing.
+		 *
+		 * While racing sched property changes may also dequeue and
+		 * re-enqueue a migrating task while its current CPU and allowed
+		 * CPUs disagree, they use %ENQUEUE_RESTORE which is bypassed to
+		 * the current local DSQ for running tasks and thus are not
+		 * visible to the BPF scheduler.
+		 */
+		if (p->migration_pending)
+			continue;
+
 		dispatch_dequeue(rq, p);
 		list_add_tail(&p->scx.dsq_list.node, &tasks);
 	}
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH 1/2] sched_ext: Reimplement scx_bpf_reenqueue_local()
  2024-07-09 21:09 ` [PATCH 1/2] sched_ext: Reimplement scx_bpf_reenqueue_local() Tejun Heo
@ 2024-07-09 21:18   ` David Vernet
  0 siblings, 0 replies; 6+ messages in thread
From: David Vernet @ 2024-07-09 21:18 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, kernel-team, schatzberg.dan

[-- Attachment #1: Type: text/plain, Size: 1040 bytes --]

On Tue, Jul 09, 2024 at 11:09:42AM -1000, Tejun Heo wrote:
> scx_bpf_reenqueue_local() is used to re-enqueue tasks on the local DSQ from
> ops.cpu_release(). Because the BPF scheduler may dispatch tasks to the same
> local DSQ, to avoid processing the same tasks repeatedly, it first takes the
> number of queued tasks and processes the task at the head of the queue that
> number of times.
> 
> This is incorrect as a task can be dispatched to the same local DSQ with
> SCX_ENQ_HEAD. Such a task will be processed repeatedly until the count is
> exhausted and the succeeding tasks won't be processed at all.
> 
> Fix it by first moving all candidate tasks to a private list and then
> processing that list. While at it, remove the WARNs. They're rather
> superflous as later steps will check them anyway.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Fixes: 245254f7081d ("sched_ext: Implement sched_ext_ops.cpu_acquire/release()")
> Cc: David Vernet <void@manifault.com>

Acked-by: David Vernet <void@manifault.com>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH 2/2] sched_ext: Make scx_bpf_reenqueue_local() skip tasks that are being migrated
  2024-07-09 21:09 ` [PATCH 2/2] sched_ext: Make scx_bpf_reenqueue_local() skip tasks that are being migrated Tejun Heo
@ 2024-07-09 22:16   ` David Vernet
  0 siblings, 0 replies; 6+ messages in thread
From: David Vernet @ 2024-07-09 22:16 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, kernel-team, schatzberg.dan

[-- Attachment #1: Type: text/plain, Size: 883 bytes --]

On Tue, Jul 09, 2024 at 11:09:43AM -1000, Tejun Heo wrote:
> When a running task is migrated to another CPU, the stop_task is used to
> preempt the running task and migrate it. This, expectedly, invokes
> ops.cpu_release(). If the BPF scheduler then calls
> scx_bpf_reenqueue_local(), it re-enqueues all tasks on the local DSQ
> including the task which is being migrated.
> 
> This creates an unnecessary re-enqueue of a task which is about to be
> deactivated and re-activated for migration anyway. It can also cause
> confusion for the BPF scheduler as scx_bpf_task_cpu() of the task and its
> allowed CPUs may not agree while migration is pending.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Fixes: 245254f7081d ("sched_ext: Implement sched_ext_ops.cpu_acquire/release()")
> Cc: David Vernet <void@manifault.com>

Acked-by: David Vernet <void@manifault.com>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCHSET sched_ext/for-6.11] sched_ext: Fix scx_bpf_reenqueue_local()
  2024-07-09 21:09 [PATCHSET sched_ext/for-6.11] sched_ext: Fix scx_bpf_reenqueue_local() Tejun Heo
  2024-07-09 21:09 ` [PATCH 1/2] sched_ext: Reimplement scx_bpf_reenqueue_local() Tejun Heo
  2024-07-09 21:09 ` [PATCH 2/2] sched_ext: Make scx_bpf_reenqueue_local() skip tasks that are being migrated Tejun Heo
@ 2024-07-09 22:31 ` Tejun Heo
  2 siblings, 0 replies; 6+ messages in thread
From: Tejun Heo @ 2024-07-09 22:31 UTC (permalink / raw)
  To: void; +Cc: linux-kernel, kernel-team, schatzberg.dan

On Tue, Jul 09, 2024 at 11:09:41AM -1000, Tejun Heo wrote:
> Fix scx_bpf_reenqueue_local()'s misbehaviors in two cases:
> 
> - when SCX_ENQ_HEAD is used to dispatch a re-enqueued task back to the same
>   local DSQ.
> 
> - when the CPU is being preempted by stop_task for task migration.
> 
> This patchset contains the following two patches:
> 
>  0001-sched_ext-Reimplement-scx_bpf_reenqueue_local.patch
>  0002-sched_ext-Make-scx_bpf_reenqueue_local-skip-tasks-th.patch

Applied to sched_ext/for-6.11.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2024-07-09 22:31 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-07-09 21:09 [PATCHSET sched_ext/for-6.11] sched_ext: Fix scx_bpf_reenqueue_local() Tejun Heo
2024-07-09 21:09 ` [PATCH 1/2] sched_ext: Reimplement scx_bpf_reenqueue_local() Tejun Heo
2024-07-09 21:18   ` David Vernet
2024-07-09 21:09 ` [PATCH 2/2] sched_ext: Make scx_bpf_reenqueue_local() skip tasks that are being migrated Tejun Heo
2024-07-09 22:16   ` David Vernet
2024-07-09 22:31 ` [PATCHSET sched_ext/for-6.11] sched_ext: Fix scx_bpf_reenqueue_local() Tejun Heo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox