* [PATCHSET sched_ext/for-6.12-fixes] sched_ext: Split %SCX_DSQ_GLOBAL per-node
@ 2024-09-25 0:06 Tejun Heo
2024-09-25 0:06 ` [PATCH 1/5] scx_flatcg: Use a user DSQ for fallback instead of SCX_DSQ_GLOBAL Tejun Heo
` (5 more replies)
0 siblings, 6 replies; 15+ messages in thread
From: Tejun Heo @ 2024-09-25 0:06 UTC (permalink / raw)
To: void; +Cc: kernel-team, linux-kernel, sched-ext
In the bypass mode, the global DSQ is used to schedule all tasks in simple
FIFO order. All tasks are queued into the global DSQ and all CPUs try to
execute tasks from it. This creates a lot of cross-node cacheline accesses
and scheduling across node boundaries, and can lead to livelock conditions
where the system takes tens of minutes to disable the BPF scheduler while
executing in the bypass mode.
This patchset splits the global DSQ per NUMA node. Each node has its own
global DSQ. When a task is dispatched to SCX_DSQ_GLOBAL, it's put into the
global DSQ local to the task's CPU, and all CPUs in a node consume only from
their node-local global DSQ.
This resolves a livelock condition which could be reliably triggered on a
2x EPYC 7642 system by running `stress-ng --race-sched 1024` together with
`stress-ng --workload 80 --workload-threads 10` while repeatedly enabling
and disabling a SCX scheduler.
This patchset contains the following patches:
0001-scx_flatcg-Use-a-user-DSQ-for-fallback-instead-of-SC.patch
0002-sched_ext-Allow-only-user-DSQs-for-scx_bpf_consume-s.patch
0003-sched_ext-Relocate-find_user_dsq.patch
0004-sched_ext-Split-the-global-DSQ-per-NUMA-node.patch
0005-sched_ext-Use-shorter-slice-while-bypassing.patch
0001-0003 are preparations.
0004 splits %SCX_DSQ_GLOBAL per-node.
0005 reduces the time slice used while bypassing. This can make e.g.
unloading of the BPF scheduler complete faster under heavy contention.
This patchset can also be found in the following git branch:
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext.git scx-split-global
diffstat follows. Thanks.
kernel/sched/ext.c | 109 ++++++++++++++++++++++++++++++++++++++++++-------------------
tools/sched_ext/scx_flatcg.bpf.c | 17 +++++++--
2 files changed, 89 insertions(+), 37 deletions(-)
--
tejun
^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH 1/5] scx_flatcg: Use a user DSQ for fallback instead of SCX_DSQ_GLOBAL
2024-09-25 0:06 [PATCHSET sched_ext/for-6.12-fixes] sched_ext: Split %SCX_DSQ_GLOBAL per-node Tejun Heo
@ 2024-09-25 0:06 ` Tejun Heo
2024-09-25 16:45 ` David Vernet
2024-09-25 0:06 ` [PATCH 2/5] sched_ext: Allow only user DSQs for scx_bpf_consume(), scx_bpf_dsq_nr_queued() and bpf_iter_scx_dsq_new() Tejun Heo
` (4 subsequent siblings)
5 siblings, 1 reply; 15+ messages in thread
From: Tejun Heo @ 2024-09-25 0:06 UTC (permalink / raw)
To: void; +Cc: kernel-team, linux-kernel, sched-ext, Tejun Heo
scx_flatcg was using SCX_DSQ_GLOBAL for fallback handling. However, it
assumed that SCX_DSQ_GLOBAL isn't automatically consumed, which was true a
while ago but is no longer the case. Also, there are further changes planned
for SCX_DSQ_GLOBAL which will disallow explicit consumption from it. Switch
to a user DSQ for fallback.
Signed-off-by: Tejun Heo <tj@kernel.org>
---
tools/sched_ext/scx_flatcg.bpf.c | 17 +++++++++++++----
1 file changed, 13 insertions(+), 4 deletions(-)
diff --git a/tools/sched_ext/scx_flatcg.bpf.c b/tools/sched_ext/scx_flatcg.bpf.c
index 3ab2b60781a0..5027aafb2682 100644
--- a/tools/sched_ext/scx_flatcg.bpf.c
+++ b/tools/sched_ext/scx_flatcg.bpf.c
@@ -49,7 +49,10 @@
/*
* Maximum amount of retries to find a valid cgroup.
*/
-#define CGROUP_MAX_RETRIES 1024
+enum {
+ FALLBACK_DSQ = 0,
+ CGROUP_MAX_RETRIES = 1024,
+};
char _license[] SEC("license") = "GPL";
@@ -378,7 +381,7 @@ void BPF_STRUCT_OPS(fcg_enqueue, struct task_struct *p, u64 enq_flags)
scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, enq_flags);
} else {
stat_inc(FCG_STAT_GLOBAL);
- scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
+ scx_bpf_dispatch(p, FALLBACK_DSQ, SCX_SLICE_DFL, enq_flags);
}
return;
}
@@ -781,7 +784,7 @@ void BPF_STRUCT_OPS(fcg_dispatch, s32 cpu, struct task_struct *prev)
pick_next_cgroup:
cpuc->cur_at = now;
- if (scx_bpf_consume(SCX_DSQ_GLOBAL)) {
+ if (scx_bpf_consume(FALLBACK_DSQ)) {
cpuc->cur_cgid = 0;
return;
}
@@ -838,7 +841,7 @@ int BPF_STRUCT_OPS_SLEEPABLE(fcg_cgroup_init, struct cgroup *cgrp,
int ret;
/*
- * Technically incorrect as cgroup ID is full 64bit while dq ID is
+ * Technically incorrect as cgroup ID is full 64bit while dsq ID is
* 63bit. Should not be a problem in practice and easy to spot in the
* unlikely case that it breaks.
*/
@@ -926,6 +929,11 @@ void BPF_STRUCT_OPS(fcg_cgroup_move, struct task_struct *p,
p->scx.dsq_vtime = to_cgc->tvtime_now + vtime_delta;
}
+s32 BPF_STRUCT_OPS_SLEEPABLE(fcg_init)
+{
+ return scx_bpf_create_dsq(FALLBACK_DSQ, -1);
+}
+
void BPF_STRUCT_OPS(fcg_exit, struct scx_exit_info *ei)
{
UEI_RECORD(uei, ei);
@@ -944,6 +952,7 @@ SCX_OPS_DEFINE(flatcg_ops,
.cgroup_init = (void *)fcg_cgroup_init,
.cgroup_exit = (void *)fcg_cgroup_exit,
.cgroup_move = (void *)fcg_cgroup_move,
+ .init = (void *)fcg_init,
.exit = (void *)fcg_exit,
.flags = SCX_OPS_HAS_CGROUP_WEIGHT | SCX_OPS_ENQ_EXITING,
.name = "flatcg");
--
2.46.0
* [PATCH 2/5] sched_ext: Allow only user DSQs for scx_bpf_consume(), scx_bpf_dsq_nr_queued() and bpf_iter_scx_dsq_new()
2024-09-25 0:06 [PATCHSET sched_ext/for-6.12-fixes] sched_ext: Split %SCX_DSQ_GLOBAL per-node Tejun Heo
2024-09-25 0:06 ` [PATCH 1/5] scx_flatcg: Use a user DSQ for fallback instead of SCX_DSQ_GLOBAL Tejun Heo
@ 2024-09-25 0:06 ` Tejun Heo
2024-09-25 17:09 ` David Vernet
2024-09-25 0:06 ` [PATCH 3/5] sched_ext: Relocate find_user_dsq() Tejun Heo
` (3 subsequent siblings)
5 siblings, 1 reply; 15+ messages in thread
From: Tejun Heo @ 2024-09-25 0:06 UTC (permalink / raw)
To: void; +Cc: kernel-team, linux-kernel, sched-ext, Tejun Heo
SCX_DSQ_GLOBAL is special in that it can't be used as a priority queue and
is consumed implicitly, but all BPF DSQ-related kfuncs could be used on it.
SCX_DSQ_GLOBAL will be split per-node for scalability and those operations
won't make sense anymore. Disallow SCX_DSQ_GLOBAL on scx_bpf_consume(),
scx_bpf_dsq_nr_queued() and bpf_iter_scx_dsq_new(). This means that
SCX_DSQ_GLOBAL can only be used as a dispatch target from BPF schedulers.
As scx_flatcg, which was using SCX_DSQ_GLOBAL as its fallback DSQ, has
already been updated, this shouldn't affect any schedulers.
This leaves find_dsq_for_dispatch() the only user of find_non_local_dsq().
Open code and remove find_non_local_dsq().
Signed-off-by: Tejun Heo <tj@kernel.org>
---
kernel/sched/ext.c | 22 ++++++++--------------
1 file changed, 8 insertions(+), 14 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index d74d1fe06999..ed2b914c42d1 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -1808,16 +1808,6 @@ static struct scx_dispatch_q *find_user_dsq(u64 dsq_id)
return rhashtable_lookup_fast(&dsq_hash, &dsq_id, dsq_hash_params);
}
-static struct scx_dispatch_q *find_non_local_dsq(u64 dsq_id)
-{
- lockdep_assert(rcu_read_lock_any_held());
-
- if (dsq_id == SCX_DSQ_GLOBAL)
- return &scx_dsq_global;
- else
- return find_user_dsq(dsq_id);
-}
-
static struct scx_dispatch_q *find_dsq_for_dispatch(struct rq *rq, u64 dsq_id,
struct task_struct *p)
{
@@ -1835,7 +1825,11 @@ static struct scx_dispatch_q *find_dsq_for_dispatch(struct rq *rq, u64 dsq_id,
return &cpu_rq(cpu)->scx.local_dsq;
}
- dsq = find_non_local_dsq(dsq_id);
+ if (dsq_id == SCX_DSQ_GLOBAL)
+ dsq = &scx_dsq_global;
+ else
+ dsq = find_user_dsq(dsq_id);
+
if (unlikely(!dsq)) {
scx_ops_error("non-existent DSQ 0x%llx for %s[%d]",
dsq_id, p->comm, p->pid);
@@ -6176,7 +6170,7 @@ __bpf_kfunc bool scx_bpf_consume(u64 dsq_id)
flush_dispatch_buf(dspc->rq);
- dsq = find_non_local_dsq(dsq_id);
+ dsq = find_user_dsq(dsq_id);
if (unlikely(!dsq)) {
scx_ops_error("invalid DSQ ID 0x%016llx", dsq_id);
return false;
@@ -6497,7 +6491,7 @@ __bpf_kfunc s32 scx_bpf_dsq_nr_queued(u64 dsq_id)
goto out;
}
} else {
- dsq = find_non_local_dsq(dsq_id);
+ dsq = find_user_dsq(dsq_id);
if (dsq) {
ret = READ_ONCE(dsq->nr);
goto out;
@@ -6546,7 +6540,7 @@ __bpf_kfunc int bpf_iter_scx_dsq_new(struct bpf_iter_scx_dsq *it, u64 dsq_id,
if (flags & ~__SCX_DSQ_ITER_USER_FLAGS)
return -EINVAL;
- kit->dsq = find_non_local_dsq(dsq_id);
+ kit->dsq = find_user_dsq(dsq_id);
if (!kit->dsq)
return -ENOENT;
--
2.46.0
* [PATCH 3/5] sched_ext: Relocate find_user_dsq()
2024-09-25 0:06 [PATCHSET sched_ext/for-6.12-fixes] sched_ext: Split %SCX_DSQ_GLOBAL per-node Tejun Heo
2024-09-25 0:06 ` [PATCH 1/5] scx_flatcg: Use a user DSQ for fallback instead of SCX_DSQ_GLOBAL Tejun Heo
2024-09-25 0:06 ` [PATCH 2/5] sched_ext: Allow only user DSQs for scx_bpf_consume(), scx_bpf_dsq_nr_queued() and bpf_iter_scx_dsq_new() Tejun Heo
@ 2024-09-25 0:06 ` Tejun Heo
2024-09-26 21:46 ` David Vernet
2024-09-25 0:06 ` [PATCH 4/5] sched_ext: Split the global DSQ per NUMA node Tejun Heo
` (2 subsequent siblings)
5 siblings, 1 reply; 15+ messages in thread
From: Tejun Heo @ 2024-09-25 0:06 UTC (permalink / raw)
To: void; +Cc: kernel-team, linux-kernel, sched-ext, Tejun Heo
To prepare for the addition of find_global_dsq(). No functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
---
kernel/sched/ext.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index ed2b914c42d1..acb4db7827a4 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -1029,6 +1029,11 @@ static bool u32_before(u32 a, u32 b)
return (s32)(a - b) < 0;
}
+static struct scx_dispatch_q *find_user_dsq(u64 dsq_id)
+{
+ return rhashtable_lookup_fast(&dsq_hash, &dsq_id, dsq_hash_params);
+}
+
/*
* scx_kf_mask enforcement. Some kfuncs can only be called from specific SCX
* ops. When invoking SCX ops, SCX_CALL_OP[_RET]() should be used to indicate
@@ -1803,11 +1808,6 @@ static void dispatch_dequeue(struct rq *rq, struct task_struct *p)
raw_spin_unlock(&dsq->lock);
}
-static struct scx_dispatch_q *find_user_dsq(u64 dsq_id)
-{
- return rhashtable_lookup_fast(&dsq_hash, &dsq_id, dsq_hash_params);
-}
-
static struct scx_dispatch_q *find_dsq_for_dispatch(struct rq *rq, u64 dsq_id,
struct task_struct *p)
{
--
2.46.0
* [PATCH 4/5] sched_ext: Split the global DSQ per NUMA node
2024-09-25 0:06 [PATCHSET sched_ext/for-6.12-fixes] sched_ext: Split %SCX_DSQ_GLOBAL per-node Tejun Heo
` (2 preceding siblings ...)
2024-09-25 0:06 ` [PATCH 3/5] sched_ext: Relocate find_user_dsq() Tejun Heo
@ 2024-09-25 0:06 ` Tejun Heo
2024-09-26 21:56 ` David Vernet
2024-09-25 0:06 ` [PATCH 5/5] sched_ext: Use shorter slice while bypassing Tejun Heo
2024-09-26 23:00 ` [PATCHSET sched_ext/for-6.12-fixes] sched_ext: Split %SCX_DSQ_GLOBAL per-node Tejun Heo
5 siblings, 1 reply; 15+ messages in thread
From: Tejun Heo @ 2024-09-25 0:06 UTC (permalink / raw)
To: void; +Cc: kernel-team, linux-kernel, sched-ext, Tejun Heo
In the bypass mode, the global DSQ is used to schedule all tasks in simple
FIFO order. All tasks are queued into the global DSQ and all CPUs try to
execute tasks from it. This creates a lot of cross-node cacheline accesses
and scheduling across node boundaries, and can lead to livelock conditions
where the system takes tens of minutes to disable the BPF scheduler while
executing in the bypass mode.
Split the global DSQ per NUMA node. Each node has its own global DSQ. When a
task is dispatched to SCX_DSQ_GLOBAL, it's put into the global DSQ local to
the task's CPU, and all CPUs in a node consume only from their node-local
global DSQ.
This resolves a livelock condition which could be reliably triggered on a
2x EPYC 7642 system by running `stress-ng --race-sched 1024` together with
`stress-ng --workload 80 --workload-threads 10` while repeatedly enabling
and disabling a SCX scheduler.
Signed-off-by: Tejun Heo <tj@kernel.org>
---
kernel/sched/ext.c | 73 +++++++++++++++++++++++++++++++++++++---------
1 file changed, 60 insertions(+), 13 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index acb4db7827a4..949a3c43000a 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -925,8 +925,15 @@ static unsigned long __percpu *scx_kick_cpus_pnt_seqs;
*/
static DEFINE_PER_CPU(struct task_struct *, direct_dispatch_task);
-/* dispatch queues */
-static struct scx_dispatch_q __cacheline_aligned_in_smp scx_dsq_global;
+/*
+ * Dispatch queues.
+ *
+ * The global DSQ (%SCX_DSQ_GLOBAL) is split per-node for scalability. This is
+ * to avoid live-locking in bypass mode where all tasks are dispatched to
+ * %SCX_DSQ_GLOBAL and all CPUs consume from it. If per-node split isn't
+ * sufficient, it can be further split.
+ */
+static struct scx_dispatch_q **global_dsqs;
static const struct rhashtable_params dsq_hash_params = {
.key_len = 8,
@@ -1029,6 +1036,11 @@ static bool u32_before(u32 a, u32 b)
return (s32)(a - b) < 0;
}
+static struct scx_dispatch_q *find_global_dsq(struct task_struct *p)
+{
+ return global_dsqs[cpu_to_node(task_cpu(p))];
+}
+
static struct scx_dispatch_q *find_user_dsq(u64 dsq_id)
{
return rhashtable_lookup_fast(&dsq_hash, &dsq_id, dsq_hash_params);
@@ -1642,7 +1654,7 @@ static void dispatch_enqueue(struct scx_dispatch_q *dsq, struct task_struct *p,
scx_ops_error("attempting to dispatch to a destroyed dsq");
/* fall back to the global dsq */
raw_spin_unlock(&dsq->lock);
- dsq = &scx_dsq_global;
+ dsq = find_global_dsq(p);
raw_spin_lock(&dsq->lock);
}
}
@@ -1820,20 +1832,20 @@ static struct scx_dispatch_q *find_dsq_for_dispatch(struct rq *rq, u64 dsq_id,
s32 cpu = dsq_id & SCX_DSQ_LOCAL_CPU_MASK;
if (!ops_cpu_valid(cpu, "in SCX_DSQ_LOCAL_ON dispatch verdict"))
- return &scx_dsq_global;
+ return find_global_dsq(p);
return &cpu_rq(cpu)->scx.local_dsq;
}
if (dsq_id == SCX_DSQ_GLOBAL)
- dsq = &scx_dsq_global;
+ dsq = find_global_dsq(p);
else
dsq = find_user_dsq(dsq_id);
if (unlikely(!dsq)) {
scx_ops_error("non-existent DSQ 0x%llx for %s[%d]",
dsq_id, p->comm, p->pid);
- return &scx_dsq_global;
+ return find_global_dsq(p);
}
return dsq;
@@ -2005,7 +2017,7 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
global:
touch_core_sched(rq, p); /* see the comment in local: */
p->scx.slice = SCX_SLICE_DFL;
- dispatch_enqueue(&scx_dsq_global, p, enq_flags);
+ dispatch_enqueue(find_global_dsq(p), p, enq_flags);
}
static bool task_runnable(const struct task_struct *p)
@@ -2391,6 +2403,13 @@ static bool consume_dispatch_q(struct rq *rq, struct scx_dispatch_q *dsq)
return false;
}
+static bool consume_global_dsq(struct rq *rq)
+{
+ int node = cpu_to_node(cpu_of(rq));
+
+ return consume_dispatch_q(rq, global_dsqs[node]);
+}
+
/**
* dispatch_to_local_dsq - Dispatch a task to a local dsq
* @rq: current rq which is locked
@@ -2424,7 +2443,8 @@ static void dispatch_to_local_dsq(struct rq *rq, struct scx_dispatch_q *dst_dsq,
#ifdef CONFIG_SMP
if (unlikely(!task_can_run_on_remote_rq(p, dst_rq, true))) {
- dispatch_enqueue(&scx_dsq_global, p, enq_flags | SCX_ENQ_CLEAR_OPSS);
+ dispatch_enqueue(find_global_dsq(p), p,
+ enq_flags | SCX_ENQ_CLEAR_OPSS);
return;
}
@@ -2624,7 +2644,7 @@ static int balance_one(struct rq *rq, struct task_struct *prev)
if (rq->scx.local_dsq.nr)
goto has_tasks;
- if (consume_dispatch_q(rq, &scx_dsq_global))
+ if (consume_global_dsq(rq))
goto has_tasks;
if (!SCX_HAS_OP(dispatch) || scx_rq_bypassing(rq) || !scx_rq_online(rq))
@@ -2649,7 +2669,7 @@ static int balance_one(struct rq *rq, struct task_struct *prev)
if (rq->scx.local_dsq.nr)
goto has_tasks;
- if (consume_dispatch_q(rq, &scx_dsq_global))
+ if (consume_global_dsq(rq))
goto has_tasks;
/*
@@ -4924,7 +4944,7 @@ static int scx_ops_enable(struct sched_ext_ops *ops, struct bpf_link *link)
struct scx_task_iter sti;
struct task_struct *p;
unsigned long timeout;
- int i, cpu, ret;
+ int i, cpu, node, ret;
if (!cpumask_equal(housekeeping_cpumask(HK_TYPE_DOMAIN),
cpu_possible_mask)) {
@@ -4943,6 +4963,34 @@ static int scx_ops_enable(struct sched_ext_ops *ops, struct bpf_link *link)
}
}
+ if (!global_dsqs) {
+ struct scx_dispatch_q **dsqs;
+
+ dsqs = kcalloc(nr_node_ids, sizeof(dsqs[0]), GFP_KERNEL);
+ if (!dsqs) {
+ ret = -ENOMEM;
+ goto err_unlock;
+ }
+
+ for_each_node_state(node, N_POSSIBLE) {
+ struct scx_dispatch_q *dsq;
+
+ dsq = kzalloc_node(sizeof(*dsq), GFP_KERNEL, node);
+ if (!dsq) {
+ for_each_node_state(node, N_POSSIBLE)
+ kfree(dsqs[node]);
+ kfree(dsqs);
+ ret = -ENOMEM;
+ goto err_unlock;
+ }
+
+ init_dsq(dsq, SCX_DSQ_GLOBAL);
+ dsqs[node] = dsq;
+ }
+
+ global_dsqs = dsqs;
+ }
+
if (scx_ops_enable_state() != SCX_OPS_DISABLED) {
ret = -EBUSY;
goto err_unlock;
@@ -5777,7 +5825,6 @@ void __init init_sched_ext_class(void)
SCX_TG_ONLINE);
BUG_ON(rhashtable_init(&dsq_hash, &dsq_hash_params));
- init_dsq(&scx_dsq_global, SCX_DSQ_GLOBAL);
#ifdef CONFIG_SMP
BUG_ON(!alloc_cpumask_var(&idle_masks.cpu, GFP_KERNEL));
BUG_ON(!alloc_cpumask_var(&idle_masks.smt, GFP_KERNEL));
@@ -6053,7 +6100,7 @@ static bool scx_dispatch_from_dsq(struct bpf_iter_scx_dsq_kern *kit,
if (dst_dsq->id == SCX_DSQ_LOCAL) {
dst_rq = container_of(dst_dsq, struct rq, scx.local_dsq);
if (!task_can_run_on_remote_rq(p, dst_rq, true)) {
- dst_dsq = &scx_dsq_global;
+ dst_dsq = find_global_dsq(p);
dst_rq = src_rq;
}
} else {
--
2.46.0
* [PATCH 5/5] sched_ext: Use shorter slice while bypassing
2024-09-25 0:06 [PATCHSET sched_ext/for-6.12-fixes] sched_ext: Split %SCX_DSQ_GLOBAL per-node Tejun Heo
` (3 preceding siblings ...)
2024-09-25 0:06 ` [PATCH 4/5] sched_ext: Split the global DSQ per NUMA node Tejun Heo
@ 2024-09-25 0:06 ` Tejun Heo
2024-09-26 22:07 ` David Vernet
2024-09-26 23:00 ` [PATCHSET sched_ext/for-6.12-fixes] sched_ext: Split %SCX_DSQ_GLOBAL per-node Tejun Heo
5 siblings, 1 reply; 15+ messages in thread
From: Tejun Heo @ 2024-09-25 0:06 UTC (permalink / raw)
To: void; +Cc: kernel-team, linux-kernel, sched-ext, Tejun Heo
While bypassing, tasks are scheduled in FIFO order, which favors tasks that
hog CPUs. This can slow down e.g. unloading of the BPF scheduler. While
bypassing, guaranteeing timely forward progress is the main goal. There's no
point in giving long slices. Shorten the time slice used while bypassing
from 20ms to 5ms.
Signed-off-by: Tejun Heo <tj@kernel.org>
---
kernel/sched/ext.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 949a3c43000a..d6f6bf6caecc 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -9,6 +9,7 @@
#define SCX_OP_IDX(op) (offsetof(struct sched_ext_ops, op) / sizeof(void (*)(void)))
enum scx_consts {
+ SCX_SLICE_BYPASS = SCX_SLICE_DFL / 4,
SCX_DSP_DFL_MAX_BATCH = 32,
SCX_DSP_MAX_LOOPS = 32,
SCX_WATCHDOG_MAX_TIMEOUT = 30 * HZ,
@@ -1944,6 +1945,7 @@ static bool scx_rq_online(struct rq *rq)
static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
int sticky_cpu)
{
+ bool bypassing = scx_rq_bypassing(rq);
struct task_struct **ddsp_taskp;
unsigned long qseq;
@@ -1961,7 +1963,7 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
if (!scx_rq_online(rq))
goto local;
- if (scx_rq_bypassing(rq))
+ if (bypassing)
goto global;
if (p->scx.ddsp_dsq_id != SCX_DSQ_INVALID)
@@ -2016,7 +2018,7 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
global:
touch_core_sched(rq, p); /* see the comment in local: */
- p->scx.slice = SCX_SLICE_DFL;
+ p->scx.slice = bypassing ? SCX_SLICE_BYPASS : SCX_SLICE_DFL;
dispatch_enqueue(find_global_dsq(p), p, enq_flags);
}
--
2.46.0
* Re: [PATCH 1/5] scx_flatcg: Use a user DSQ for fallback instead of SCX_DSQ_GLOBAL
2024-09-25 0:06 ` [PATCH 1/5] scx_flatcg: Use a user DSQ for fallback instead of SCX_DSQ_GLOBAL Tejun Heo
@ 2024-09-25 16:45 ` David Vernet
0 siblings, 0 replies; 15+ messages in thread
From: David Vernet @ 2024-09-25 16:45 UTC (permalink / raw)
To: Tejun Heo; +Cc: kernel-team, linux-kernel, sched-ext
On Tue, Sep 24, 2024 at 02:06:03PM -1000, Tejun Heo wrote:
> scx_flatcg was using SCX_DSQ_GLOBAL for fallback handling. However, it is
> assuming that SCX_DSQ_GLOBAL isn't automatically consumed, which was true a
> while ago but is no longer the case. Also, there are further changes planned
> for SCX_DSQ_GLOBAL which will disallow explicit consumption from it. Switch
> to a user DSQ for fallback.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
* Re: [PATCH 2/5] sched_ext: Allow only user DSQs for scx_bpf_consume(), scx_bpf_dsq_nr_queued() and bpf_iter_scx_dsq_new()
2024-09-25 0:06 ` [PATCH 2/5] sched_ext: Allow only user DSQs for scx_bpf_consume(), scx_bpf_dsq_nr_queued() and bpf_iter_scx_dsq_new() Tejun Heo
@ 2024-09-25 17:09 ` David Vernet
2024-09-25 21:04 ` Tejun Heo
0 siblings, 1 reply; 15+ messages in thread
From: David Vernet @ 2024-09-25 17:09 UTC (permalink / raw)
To: Tejun Heo; +Cc: kernel-team, linux-kernel, sched-ext
On Tue, Sep 24, 2024 at 02:06:04PM -1000, Tejun Heo wrote:
Hi Tejun,
> SCX_DSQ_GLOBAL is special in that it can't be used as a priority queue and
> is consumed implicitly, but all BPF DSQ related kfuncs could be used on it.
> SCX_DSQ_GLOBAL will be split per-node for scalability and those operations
> won't make sense anymore. Disallow SCX_DSQ_GLOBAL on scx_bpf_consume(),
> scx_bpf_dsq_nr_queued() and bpf_iter_scx_dsq_new(). This means that
> SCX_DSQ_GLOBAL can only be used as a dispatch target from BPF schedulers.
This API impedance where you can dispatch but not consume feels unnatural, and
a bit leaky. I understand why we don't want to allow BPF to consume it -- we're
already doing it for the user as part of (and before) the dispatch loop. That's
also one-off logic that's separate from the normal interface for DSQs though,
and because of that, SCX_DSQ_GLOBAL imposes a cognitive overhead that IMO
arguably outweighs the convenience it provides.
I'm still of the opinion that we should just hide SCX_DSQ_GLOBAL from the user
altogether. It makes sense why we'd need it as a backup DSQ for when we're e.g.
unloading the scheduler, but as a building block that's provided by the kernel
to the user, I'm not sure. In a realistic production scenario where you're
doing something like running a scheduler that's latency sensitive and cares
about deadlines, I'm not sure it would be viable or ever the optimal decision
to throw the task in a global DSQ and tolerate it being consumed before other
higher-priority tasks that were enqueued in normal DSQs. Or at the very least,
I could see users being surprised by the semantics, and having instead expected
it to function as just a built-in / pre-created DSQ that functions normally
otherwise.
Thanks,
David
* Re: [PATCH 2/5] sched_ext: Allow only user DSQs for scx_bpf_consume(), scx_bpf_dsq_nr_queued() and bpf_iter_scx_dsq_new()
2024-09-25 17:09 ` David Vernet
@ 2024-09-25 21:04 ` Tejun Heo
2024-09-26 21:36 ` David Vernet
0 siblings, 1 reply; 15+ messages in thread
From: Tejun Heo @ 2024-09-25 21:04 UTC (permalink / raw)
To: David Vernet; +Cc: kernel-team, linux-kernel, sched-ext
Hello,
On Wed, Sep 25, 2024 at 12:09:56PM -0500, David Vernet wrote:
> On Tue, Sep 24, 2024 at 02:06:04PM -1000, Tejun Heo wrote:
...
> This API impedance where you can dispatch but not consume feels unnatural, and
> a bit leaky. I understand why we don't want to allow BPF to consume it -- we're
> already doing it for the user as part of (and before) the dispatch loop. That's
> also one-off logic that's separate from the normal interface for DSQs though,
> and because of that, SCX_DSQ_GLOBAL imposes a cognitive overhead that IMO
> arguably outweighs the convenience it provides.
I don't know. One can also argue that this makes all built-in DSQs behave
more consistently as the local DSQs can only be dispatched to too. Either
way, I don't think it makes meaningful differences.
> I'm still of the opinion that we should just hide SCX_DSQ_GLOBAL from the user
> altogether. It makes sense why we'd need it as a backup DSQ for when we're e.g.
> unloading the scheduler, but as a building block that's provided by the kernel
> to the user, I'm not sure. In a realistic production scenario where you're
> doing something like running a scheduler that's latency sensitive and cares
> about deadlines, I'm not sure it would be viable or ever the optimal decision
> to throw the task in a global DSQ and tolerate it being consumed before other
> higher-priority tasks that were enqueued in normal DSQs. Or at the very least,
> I could see users being surprised by the semantics, and having instead expected
> it to function as just a built-in / pre-created DSQ that functions normally
> otherwise.
Maybe we can further block it in the future but it doesn't seem material
either way and I tend to prefer not putting extra restrictions unless
necessary.
Thanks.
--
tejun
* Re: [PATCH 2/5] sched_ext: Allow only user DSQs for scx_bpf_consume(), scx_bpf_dsq_nr_queued() and bpf_iter_scx_dsq_new()
2024-09-25 21:04 ` Tejun Heo
@ 2024-09-26 21:36 ` David Vernet
0 siblings, 0 replies; 15+ messages in thread
From: David Vernet @ 2024-09-26 21:36 UTC (permalink / raw)
To: Tejun Heo; +Cc: kernel-team, linux-kernel, sched-ext
On Wed, Sep 25, 2024 at 11:04:15AM -1000, Tejun Heo wrote:
> Hello,
>
> On Wed, Sep 25, 2024 at 12:09:56PM -0500, David Vernet wrote:
> > On Tue, Sep 24, 2024 at 02:06:04PM -1000, Tejun Heo wrote:
> ...
> > This API impedance where you can dispatch but not consume feels unnatural, and
> > a bit leaky. I understand why we don't want to allow BPF to consume it -- we're
> > already doing it for the user as part of (and before) the dispatch loop. That's
> > also one-off logic that's separate from the normal interface for DSQs though,
> > and because of that, SCX_DSQ_GLOBAL imposes a cognitive overhead that IMO
> > arguably outweighs the convenience it provides.
>
> I don't know. One can also argue that this makes all built-in DSQs behave
> more consistently as the local DSQs can only be dispatched to too. Either
> way, I don't think it makes meaningful differences.
That's a good point re making it consistent with local. I don't think it
makes a big difference one way or the other, but this is something that folks
seem to get consistently confused about. To your point, maybe that won't be the
case now that you can't consume.
> > I'm still of the opinion that we should just hide SCX_DSQ_GLOBAL from the user
> > altogether. It makes sense why we'd need it as a backup DSQ for when we're e.g.
> > unloading the scheduler, but as a building block that's provided by the kernel
> > to the user, I'm not sure. In a realistic production scenario where you're
> > doing something like running a scheduler that's latency sensitive and cares
> > about deadlines, I'm not sure it would be viable or ever the optimal decision
> > to throw the task in a global DSQ and tolerate it being consumed before other
> > higher-priority tasks that were enqueued in normal DSQs. Or at the very least,
> > I could see users being surprised by the semantics, and having instead expected
> > it to function as just a built-in / pre-created DSQ that functions normally
> > otherwise.
>
> Maybe we can further block it in the future but it doesn't seem material
> either way and I tend to prefer not putting extra restrictions unless
> necessary.
Fine with me, patch LG otherwise:
Acked-by: David Vernet <void@manifault.com>
* Re: [PATCH 3/5] sched_ext: Relocate find_user_dsq()
2024-09-25 0:06 ` [PATCH 3/5] sched_ext: Relocate find_user_dsq() Tejun Heo
@ 2024-09-26 21:46 ` David Vernet
0 siblings, 0 replies; 15+ messages in thread
From: David Vernet @ 2024-09-26 21:46 UTC (permalink / raw)
To: Tejun Heo; +Cc: kernel-team, linux-kernel, sched-ext
On Tue, Sep 24, 2024 at 02:06:05PM -1000, Tejun Heo wrote:
> To prepare for the addition of find_global_dsq(). No functional changes.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
> ---
> kernel/sched/ext.c | 10 +++++-----
> 1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index ed2b914c42d1..acb4db7827a4 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -1029,6 +1029,11 @@ static bool u32_before(u32 a, u32 b)
> return (s32)(a - b) < 0;
> }
>
> +static struct scx_dispatch_q *find_user_dsq(u64 dsq_id)
> +{
> + return rhashtable_lookup_fast(&dsq_hash, &dsq_id, dsq_hash_params);
> +}
> +
> /*
> * scx_kf_mask enforcement. Some kfuncs can only be called from specific SCX
> * ops. When invoking SCX ops, SCX_CALL_OP[_RET]() should be used to indicate
> @@ -1803,11 +1808,6 @@ static void dispatch_dequeue(struct rq *rq, struct task_struct *p)
> raw_spin_unlock(&dsq->lock);
> }
>
> -static struct scx_dispatch_q *find_user_dsq(u64 dsq_id)
> -{
> - return rhashtable_lookup_fast(&dsq_hash, &dsq_id, dsq_hash_params);
> -}
> -
> static struct scx_dispatch_q *find_dsq_for_dispatch(struct rq *rq, u64 dsq_id,
> struct task_struct *p)
> {
> --
> 2.46.0
>
* Re: [PATCH 4/5] sched_ext: Split the global DSQ per NUMA node
2024-09-25 0:06 ` [PATCH 4/5] sched_ext: Split the global DSQ per NUMA node Tejun Heo
@ 2024-09-26 21:56 ` David Vernet
0 siblings, 0 replies; 15+ messages in thread
From: David Vernet @ 2024-09-26 21:56 UTC (permalink / raw)
To: Tejun Heo; +Cc: kernel-team, linux-kernel, sched-ext
On Tue, Sep 24, 2024 at 02:06:06PM -1000, Tejun Heo wrote:
> In the bypass mode, the global DSQ is used to schedule all tasks in simple
> FIFO order. All tasks are queued into the global DSQ and all CPUs try to
> execute tasks from it. This creates a lot of cross-node cacheline accesses
> and scheduling across the node boundaries, and can lead to live-lock
> conditions where the system takes tens of minutes to disable the BPF
> scheduler while executing in the bypass mode.
>
> Split the global DSQ per NUMA node. Each node has its own global DSQ. When a
> task is dispatched to SCX_DSQ_GLOBAL, it's put into the global DSQ local to
> the task's CPU and all CPUs in a node only consume its node-local global
> DSQ.
>
> This resolves a livelock condition which could be reliably triggered on a
> 2x EPYC 7642 system by running `stress-ng --race-sched 1024` together with
> `stress-ng --workload 80 --workload-threads 10` while repeatedly enabling
> and disabling a SCX scheduler.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
* Re: [PATCH 5/5] sched_ext: Use shorter slice while bypassing
2024-09-25 0:06 ` [PATCH 5/5] sched_ext: Use shorter slice while bypassing Tejun Heo
@ 2024-09-26 22:07 ` David Vernet
2024-09-26 22:55 ` Tejun Heo
0 siblings, 1 reply; 15+ messages in thread
From: David Vernet @ 2024-09-26 22:07 UTC (permalink / raw)
To: Tejun Heo; +Cc: kernel-team, linux-kernel, sched-ext
On Tue, Sep 24, 2024 at 02:06:07PM -1000, Tejun Heo wrote:
Hi Tejun,
> While bypassing, tasks are scheduled in FIFO order which favors tasks that
> hog CPUs. This can slow down e.g. unloading of the BPF scheduler. While
> bypassing, guaranteeing timely forward progress is the main goal. There's no
> point in giving long slices. Shorten the time slice used while bypassing
> from 20ms to 5ms.
My vote would be to just update SCX_SLICE_DFL to be 5ms, but this is fine as
well if you'd prefer to leave the default slice as something that observably
works well for throughput in fair schedulers.
Acked-by: David Vernet <void@manifault.com>
* Re: [PATCH 5/5] sched_ext: Use shorter slice while bypassing
2024-09-26 22:07 ` David Vernet
@ 2024-09-26 22:55 ` Tejun Heo
0 siblings, 0 replies; 15+ messages in thread
From: Tejun Heo @ 2024-09-26 22:55 UTC (permalink / raw)
To: David Vernet; +Cc: kernel-team, linux-kernel, sched-ext
Hello,
On Thu, Sep 26, 2024 at 05:07:21PM -0500, David Vernet wrote:
...
> My vote would be to just update SCX_SLICE_DFL to be 5ms, but this is fine as
> well if you'd prefer to leave the default slice as something that observably
> works well for throughput in fair schedulers.
20ms originally came from experimenting with simple FIFO scheduling on web
server workloads. Even with a scheduling policy as simple as FIFO, if CPUs
are not left idle and the scheduler doesn't induce unnecessary context
switches, many workloads do reasonably well. So, 20ms is the long-ish slice
which doesn't trigger involuntary context switches too much. We can shorten
it but it will have downstream effects as multiple schedulers still make
some use of the default slice duration. Unless there are strong reasons to
change, it'd probably be better to leave it as-is.
Thanks.
--
tejun
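In code terms, the distinction in the exchange above is roughly the following. SCX_SLICE_DFL is the real default-slice constant (20ms); the SCX_SLICE_BYPASS name and the helper are illustrative assumptions, not the literal patch.

```c
#include <assert.h>

#define NSEC_PER_MSEC 1000000ULL

/* Default slice, tuned for throughput under normal operation. */
#define SCX_SLICE_DFL    (20 * NSEC_PER_MSEC)
/* Shorter slice used only in bypass mode (hypothetical name). */
#define SCX_SLICE_BYPASS (5 * NSEC_PER_MSEC)

/*
 * While bypassing, forward progress matters more than throughput, so a
 * shorter slice is used; the default is left untouched for schedulers
 * that rely on it.
 */
static unsigned long long pick_slice(int bypassing)
{
	return bypassing ? SCX_SLICE_BYPASS : SCX_SLICE_DFL;
}
```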
* Re: [PATCHSET sched_ext/for-6.12-fixes] sched_ext: Split %SCX_DSQ_GLOBAL per-node
2024-09-25 0:06 [PATCHSET sched_ext/for-6.12-fixes] sched_ext: Split %SCX_DSQ_GLOBAL per-node Tejun Heo
` (4 preceding siblings ...)
2024-09-25 0:06 ` [PATCH 5/5] sched_ext: Use shorter slice while bypassing Tejun Heo
@ 2024-09-26 23:00 ` Tejun Heo
5 siblings, 0 replies; 15+ messages in thread
From: Tejun Heo @ 2024-09-26 23:00 UTC (permalink / raw)
To: void; +Cc: kernel-team, linux-kernel, sched-ext
On Tue, Sep 24, 2024 at 02:06:02PM -1000, Tejun Heo wrote:
> In the bypass mode, the global DSQ is used to schedule all tasks in simple
> FIFO order. All tasks are queued into the global DSQ and all CPUs try to
> execute tasks from it. This creates a lot of cross-node cacheline accesses
> and scheduling across the node boundaries, and can lead to live-lock
> conditions where the system takes tens of minutes to disable the BPF
> scheduler while executing in the bypass mode.
>
> This patchset splits the global DSQ per NUMA node. Each node has its own
> global DSQ. When a task is dispatched to SCX_DSQ_GLOBAL, it's put into the
> global DSQ local to the task's CPU and all CPUs in a node only consume its
> node-local global DSQ.
>
> This resolves a livelock condition which could be reliably triggered on a
> 2x EPYC 7642 system by running `stress-ng --race-sched 1024` together with
> `stress-ng --workload 80 --workload-threads 10` while repeatedly enabling
> and disabling a SCX scheduler.
>
> This patchset contains the following patches:
>
> 0001-scx_flatcg-Use-a-user-DSQ-for-fallback-instead-of-SC.patch
> 0002-sched_ext-Allow-only-user-DSQs-for-scx_bpf_consume-s.patch
> 0003-sched_ext-Relocate-find_user_dsq.patch
> 0004-sched_ext-Split-the-global-DSQ-per-NUMA-node.patch
> 0005-sched_ext-Use-shorter-slice-while-bypassing.patch
Applied to sched_ext/for-6.12-fixes.
Thanks.
--
tejun
end of thread, other threads:[~2024-09-26 23:00 UTC | newest]
Thread overview: 15+ messages
2024-09-25 0:06 [PATCHSET sched_ext/for-6.12-fixes] sched_ext: Split %SCX_DSQ_GLOBAL per-node Tejun Heo
2024-09-25 0:06 ` [PATCH 1/5] scx_flatcg: Use a user DSQ for fallback instead of SCX_DSQ_GLOBAL Tejun Heo
2024-09-25 16:45 ` David Vernet
2024-09-25 0:06 ` [PATCH 2/5] sched_ext: Allow only user DSQs for scx_bpf_consume(), scx_bpf_dsq_nr_queued() and bpf_iter_scx_dsq_new() Tejun Heo
2024-09-25 17:09 ` David Vernet
2024-09-25 21:04 ` Tejun Heo
2024-09-26 21:36 ` David Vernet
2024-09-25 0:06 ` [PATCH 3/5] sched_ext: Relocate find_user_dsq() Tejun Heo
2024-09-26 21:46 ` David Vernet
2024-09-25 0:06 ` [PATCH 4/5] sched_ext: Split the global DSQ per NUMA node Tejun Heo
2024-09-26 21:56 ` David Vernet
2024-09-25 0:06 ` [PATCH 5/5] sched_ext: Use shorter slice while bypassing Tejun Heo
2024-09-26 22:07 ` David Vernet
2024-09-26 22:55 ` Tejun Heo
2024-09-26 23:00 ` [PATCHSET sched_ext/for-6.12-fixes] sched_ext: Split %SCX_DSQ_GLOBAL per-node Tejun Heo