* [PATCH v3 0/2] sched_ext: lockless peek operation for DSQs
@ 2025-10-06 17:04 Ryan Newton
From: Ryan Newton @ 2025-10-06 17:04 UTC (permalink / raw)
To: linux-kernel; +Cc: sched-ext, tj, arighi, rrnewton, newton
This gives sched_ext schedulers an inexpensive way to peek at the
first element of a dispatch queue (DSQ) without creating an iterator
and acquiring the lock on that queue.
Note that manual testing has so far included a modified version of the
example qmap scheduler that exercises peek, as well as a modified LAVD
(from the SCX repo) that does the same. The attached selftest passes
>1000 stress runs, both when run in concurrent VMs and when run
sequentially on the host kernel. So far it has been tested on the
following workstation and server processors:
- AMD Ryzen Threadripper PRO 7975WX 32-Cores
- AMD EPYC 9D64 88-Core Processor
Initial experiments indicate a substantial speedup (on schbench) when
running an SCX scheduler with per-cpu DSQs and peeking each queue to
retrieve the task with the minimum vruntime across all the CPUs.
---
Changes in v3:
- inline helpers and simplify
- coding style tweaks
Changes in v2:
- make peek() only work for user DSQs and error otherwise
- added a stress test component to the selftest that performs many peeks
- responded to review comments from tj@kernel.org and arighi@nvidia.com
- link: https://lore.kernel.org/lkml/20251003195408.675527-1-rrnewton@gmail.com/
v1 link: https://lore.kernel.org/lkml/20251002025722.3420916-1-rrnewton@gmail.com/
Ryan Newton (2):
sched_ext: Add lockless peek operation for DSQs
sched_ext: Add a selftest for scx_bpf_dsq_peek
include/linux/sched/ext.h | 1 +
kernel/sched/ext.c | 56 +++-
tools/sched_ext/include/scx/common.bpf.h | 1 +
tools/sched_ext/include/scx/compat.bpf.h | 19 ++
tools/testing/selftests/sched_ext/Makefile | 1 +
.../selftests/sched_ext/peek_dsq.bpf.c | 265 ++++++++++++++++++
tools/testing/selftests/sched_ext/peek_dsq.c | 230 +++++++++++++++
7 files changed, 571 insertions(+), 2 deletions(-)
create mode 100644 tools/testing/selftests/sched_ext/peek_dsq.bpf.c
create mode 100644 tools/testing/selftests/sched_ext/peek_dsq.c
--
2.51.0
^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH v3 1/2] sched_ext: Add lockless peek operation for DSQs
From: Ryan Newton @ 2025-10-06 17:04 UTC (permalink / raw)
To: linux-kernel; +Cc: sched-ext, tj, arighi, rrnewton, newton

From: Ryan Newton <newton@meta.com>

The builtin DSQ data structures are meant to be used by a wide range of
sched_ext schedulers with different demands on them: they might be
per-CPU queues with low contention, or high-contention shared queues.
Unfortunately, DSQs have a coarse-grained lock around the whole data
structure. Without going all the way to a lock-free, more scalable
implementation, a small step we can take to reduce lock contention is
to allow a lockless, small-fixed-cost peek at the head of the queue.

This change allows certain custom SCX schedulers to cheaply peek at
queues, e.g. during load balancing, before locking them. The cost is a
few extra memory operations to update the pointer each time the DSQ is
modified, including a memory barrier on ARM so the write appears
correctly ordered.

This commit adds a first_task pointer field which is updated atomically
when the DSQ is modified, allowing any thread to peek at the head of
the queue without holding the lock.
Signed-off-by: Ryan Newton <newton@meta.com>
---
 include/linux/sched/ext.h                |  1 +
 kernel/sched/ext.c                       | 54 +++++++++++++++++++++++-
 tools/sched_ext/include/scx/common.bpf.h |  1 +
 tools/sched_ext/include/scx/compat.bpf.h | 19 +++++++++
 4 files changed, 73 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index d82b7a9b0658..81478d4ae782 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -58,6 +58,7 @@ enum scx_dsq_id_flags {
  */
 struct scx_dispatch_q {
 	raw_spinlock_t		lock;
+	struct task_struct __rcu *first_task;	/* lockless peek at head */
 	struct list_head	list;	/* tasks in dispatch order */
 	struct rb_root		priq;	/* used to order by p->scx.dsq_vtime */
 	u32			nr;
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 2b0e88206d07..6d3537e65001 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -944,8 +944,11 @@ static void dispatch_enqueue(struct scx_sched *sch, struct scx_dispatch_q *dsq,
 				container_of(rbp, struct task_struct,
 					     scx.dsq_priq);
 			list_add(&p->scx.dsq_list.node, &prev->scx.dsq_list.node);
+			/* first task unchanged - no update needed */
 		} else {
 			list_add(&p->scx.dsq_list.node, &dsq->list);
+			/* not builtin and new task is at head - use fastpath */
+			rcu_assign_pointer(dsq->first_task, p);
 		}
 	} else {
 		/* a FIFO DSQ shouldn't be using PRIQ enqueuing */
@@ -953,10 +956,19 @@ static void dispatch_enqueue(struct scx_sched *sch, struct scx_dispatch_q *dsq,
 			scx_error(sch, "DSQ ID 0x%016llx already had PRIQ-enqueued tasks",
 				  dsq->id);

-		if (enq_flags & (SCX_ENQ_HEAD | SCX_ENQ_PREEMPT))
+		if (enq_flags & (SCX_ENQ_HEAD | SCX_ENQ_PREEMPT)) {
 			list_add(&p->scx.dsq_list.node, &dsq->list);
-		else
+			/* new task inserted at head - use fastpath */
+			if (!(dsq->id & SCX_DSQ_FLAG_BUILTIN))
+				rcu_assign_pointer(dsq->first_task, p);
+		} else {
+			bool was_empty;
+
+			was_empty = list_empty(&dsq->list);
 			list_add_tail(&p->scx.dsq_list.node, &dsq->list);
+			if (was_empty && !(dsq->id & SCX_DSQ_FLAG_BUILTIN))
+				rcu_assign_pointer(dsq->first_task, p);
+		}
 	}

 	/* seq records the order tasks are queued, used by BPF DSQ iterator */
@@ -1011,6 +1023,13 @@ static void task_unlink_from_dsq(struct task_struct *p,
 		p->scx.dsq_flags &= ~SCX_TASK_DSQ_ON_PRIQ;
 	}

+	if (!(dsq->id & SCX_DSQ_FLAG_BUILTIN) && dsq->first_task == p) {
+		struct task_struct *first_task;
+
+		first_task = nldsq_next_task(dsq, NULL, false);
+		rcu_assign_pointer(dsq->first_task, first_task);
+	}
+
 	list_del_init(&p->scx.dsq_list.node);
 	dsq_mod_nr(dsq, -1);
 }
@@ -6084,6 +6103,36 @@ __bpf_kfunc void bpf_iter_scx_dsq_destroy(struct bpf_iter_scx_dsq *it)
 	kit->dsq = NULL;
 }

+/**
+ * scx_bpf_dsq_peek - Lockless peek at the first element.
+ * @dsq_id: DSQ to examine.
+ *
+ * Read the first element in the DSQ. This is semantically equivalent to using
+ * the DSQ iterator, but is lockfree.
+ *
+ * Returns the pointer, or NULL indicating an empty queue or an internal error.
+ */
+__bpf_kfunc struct task_struct *scx_bpf_dsq_peek(u64 dsq_id)
+{
+	struct scx_sched *sch;
+	struct scx_dispatch_q *dsq;
+
+	sch = rcu_dereference(scx_root);
+	if (unlikely(!sch))
+		return NULL;
+	if (unlikely(dsq_id & SCX_DSQ_FLAG_BUILTIN)) {
+		scx_error(sch, "peek disallowed on builtin DSQ 0x%llx", dsq_id);
+		return NULL;
+	}
+
+	dsq = find_user_dsq(sch, dsq_id);
+	if (unlikely(!dsq)) {
+		scx_error(sch, "peek on non-existent DSQ 0x%llx", dsq_id);
+		return NULL;
+	}
+	return rcu_dereference(dsq->first_task);
+}
+
 __bpf_kfunc_end_defs();

 static s32 __bstr_format(struct scx_sched *sch, u64 *data_buf, char *line_buf,
@@ -6641,6 +6690,7 @@ BTF_KFUNCS_START(scx_kfunc_ids_any)
 BTF_ID_FLAGS(func, scx_bpf_kick_cpu)
 BTF_ID_FLAGS(func, scx_bpf_dsq_nr_queued)
 BTF_ID_FLAGS(func, scx_bpf_destroy_dsq)
+BTF_ID_FLAGS(func, scx_bpf_dsq_peek, KF_RCU_PROTECTED | KF_RET_NULL)
 BTF_ID_FLAGS(func, bpf_iter_scx_dsq_new, KF_ITER_NEW | KF_RCU_PROTECTED)
 BTF_ID_FLAGS(func, bpf_iter_scx_dsq_next, KF_ITER_NEXT | KF_RET_NULL)
 BTF_ID_FLAGS(func, bpf_iter_scx_dsq_destroy, KF_ITER_DESTROY)
diff --git a/tools/sched_ext/include/scx/common.bpf.h b/tools/sched_ext/include/scx/common.bpf.h
index 06e2551033cb..fbf3e7f9526c 100644
--- a/tools/sched_ext/include/scx/common.bpf.h
+++ b/tools/sched_ext/include/scx/common.bpf.h
@@ -75,6 +75,7 @@ u32 scx_bpf_reenqueue_local(void) __ksym;
 void scx_bpf_kick_cpu(s32 cpu, u64 flags) __ksym;
 s32 scx_bpf_dsq_nr_queued(u64 dsq_id) __ksym;
 void scx_bpf_destroy_dsq(u64 dsq_id) __ksym;
+struct task_struct *scx_bpf_dsq_peek(u64 dsq_id) __ksym __weak;
 int bpf_iter_scx_dsq_new(struct bpf_iter_scx_dsq *it, u64 dsq_id, u64 flags) __ksym __weak;
 struct task_struct *bpf_iter_scx_dsq_next(struct bpf_iter_scx_dsq *it) __ksym __weak;
 void bpf_iter_scx_dsq_destroy(struct bpf_iter_scx_dsq *it) __ksym __weak;
diff --git a/tools/sched_ext/include/scx/compat.bpf.h b/tools/sched_ext/include/scx/compat.bpf.h
index dd9144624dc9..97b10c184b2c 100644
--- a/tools/sched_ext/include/scx/compat.bpf.h
+++ b/tools/sched_ext/include/scx/compat.bpf.h
@@ -130,6 +130,25 @@ int bpf_cpumask_populate(struct cpumask *dst, void *src, size_t src__sz) __ksym
 	 false;							\
 })

+
+/*
+ * v6.19: Introduce lockless peek API for user DSQs.
+ *
+ * Preserve the following macro until v6.21.
+ */
+static inline struct task_struct *__COMPAT_scx_bpf_dsq_peek(u64 dsq_id)
+{
+	struct task_struct *p = NULL;
+	struct bpf_iter_scx_dsq it;
+
+	if (bpf_ksym_exists(scx_bpf_dsq_peek))
+		return scx_bpf_dsq_peek(dsq_id);
+	if (!bpf_iter_scx_dsq_new(&it, dsq_id, 0))
+		p = bpf_iter_scx_dsq_next(&it);
+	bpf_iter_scx_dsq_destroy(&it);
+	return p;
+}
+
 /**
  * __COMPAT_is_enq_cpu_selected - Test if SCX_ENQ_CPU_SELECTED is on
  * in a compatible way. We will preserve this __COMPAT helper until v6.16.
--
2.51.0
* Re: [PATCH v3 1/2] sched_ext: Add lockless peek operation for DSQs
From: Andrea Righi @ 2025-10-06 17:26 UTC (permalink / raw)
To: Ryan Newton; +Cc: linux-kernel, sched-ext, tj, newton

Hi Ryan,

On Mon, Oct 06, 2025 at 01:04:02PM -0400, Ryan Newton wrote:
> From: Ryan Newton <newton@meta.com>
>
> [...]
>
> Signed-off-by: Ryan Newton <newton@meta.com>

With the minor nit from PATCH 2/2 this looks good to me.

Reviewed-by: Andrea Righi <arighi@nvidia.com>

Thanks!
-Andrea
* Re: [PATCH v3 1/2] sched_ext: Add lockless peek operation for DSQs
From: Jake Hillion @ 2025-10-14 22:27 UTC (permalink / raw)
To: Ryan Newton; +Cc: linux-kernel, sched-ext, tj, arighi, newton

Hi Ryan,

On Mon, Oct 06, 2025 at 01:04:02PM -0400, Ryan Newton wrote:
> From: Ryan Newton <newton@meta.com>
>
> [...]
>
> Signed-off-by: Ryan Newton <newton@meta.com>

Applied to my local kernel and modified my scheduler to use this
feature. dmesg is clean and it works great.

Tested-by: Jake Hillion <jake@hillion.co.uk>

Thanks!
Jake.
* [PATCH v3 2/2] sched_ext: Add a selftest for scx_bpf_dsq_peek
From: Ryan Newton @ 2025-10-06 17:04 UTC (permalink / raw)
To: linux-kernel; +Cc: sched-ext, tj, arighi, rrnewton, newton

From: Ryan Newton <newton@meta.com>

Perform the most basic unit test: make sure an empty queue peeks as
empty, and when we put one element in the queue, make sure peek returns
that element. However, even this simple test is complicated by the
different behavior of scx_bpf_dsq_insert() in different calling
contexts:

 - insert performs direct dispatch when called from enqueue
 - insert is delayed when called from select_cpu

In the latter case we split the insert and the peek that verifies the
result between enqueue/dispatch.

As a second phase, we stress test by performing many peeks on an array
of user DSQs.

Note: An alternative would be to call scx_bpf_dsq_move_to_local() on an
empty queue, which in turn calls flush_dispatch_buf(), in order to
flush the buffered insert. Unfortunately, this is not viable within the
enqueue path, as it attempts a voluntary context switch within an RCU
read-side critical section.
Signed-off-by: Ryan Newton <newton@meta.com>
---
 kernel/sched/ext.c                             |   2 +
 tools/testing/selftests/sched_ext/Makefile     |   1 +
 .../selftests/sched_ext/peek_dsq.bpf.c         | 265 ++++++++++++++++++
 tools/testing/selftests/sched_ext/peek_dsq.c   | 230 +++++++++++++++
 4 files changed, 498 insertions(+)
 create mode 100644 tools/testing/selftests/sched_ext/peek_dsq.bpf.c
 create mode 100644 tools/testing/selftests/sched_ext/peek_dsq.c

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 6d3537e65001..ec7e791cd4c8 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -6120,6 +6120,7 @@ __bpf_kfunc struct task_struct *scx_bpf_dsq_peek(u64 dsq_id)
 	sch = rcu_dereference(scx_root);
 	if (unlikely(!sch))
 		return NULL;
+
 	if (unlikely(dsq_id & SCX_DSQ_FLAG_BUILTIN)) {
 		scx_error(sch, "peek disallowed on builtin DSQ 0x%llx", dsq_id);
 		return NULL;
@@ -6130,6 +6131,7 @@ __bpf_kfunc struct task_struct *scx_bpf_dsq_peek(u64 dsq_id)
 		scx_error(sch, "peek on non-existent DSQ 0x%llx", dsq_id);
 		return NULL;
 	}
+
 	return rcu_dereference(dsq->first_task);
 }

diff --git a/tools/testing/selftests/sched_ext/Makefile b/tools/testing/selftests/sched_ext/Makefile
index 9d9d6b4c38b0..5fe45f9c5f8f 100644
--- a/tools/testing/selftests/sched_ext/Makefile
+++ b/tools/testing/selftests/sched_ext/Makefile
@@ -174,6 +174,7 @@ auto-test-targets := \
 	minimal \
 	numa \
 	allowed_cpus \
+	peek_dsq \
 	prog_run \
 	reload_loop \
 	select_cpu_dfl \
diff --git a/tools/testing/selftests/sched_ext/peek_dsq.bpf.c b/tools/testing/selftests/sched_ext/peek_dsq.bpf.c
new file mode 100644
index 000000000000..8d179d4c7efb
--- /dev/null
+++ b/tools/testing/selftests/sched_ext/peek_dsq.bpf.c
@@ -0,0 +1,265 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * A BPF program for testing DSQ operations including create, destroy,
+ * and peek operations. Uses a hybrid approach:
+ * - Syscall program for DSQ lifecycle (create/destroy)
+ * - Struct ops scheduler for task insertion/dequeue testing
+ *
+ * Copyright (c) 2025 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2025 Ryan Newton <ryan.newton@alum.mit.edu>
+ */
+
+#include <scx/common.bpf.h>
+#include <scx/compat.bpf.h>
+
+char _license[] SEC("license") = "GPL";
+
+#define MAX_SAMPLES 100
+#define MAX_CPUS 512
+#define DSQ_POOL_SIZE 8
+int max_samples = MAX_SAMPLES;
+int max_cpus = MAX_CPUS;
+int dsq_pool_size = DSQ_POOL_SIZE;
+
+/* Global variables to store test results */
+int dsq_create_result = -1;
+int dsq_destroy_result = -1;
+int dsq_peek_result1 = -1;
+long dsq_inserted_pid = -1;
+int insert_test_cpu = -1;	/* Set to the cpu that performs the test */
+long dsq_peek_result2 = -1;
+long dsq_peek_result2_pid = -1;
+long dsq_peek_result2_expected = -1;
+int test_dsq_id = 1234;		/* Use a simple ID like create_dsq example */
+int real_dsq_id = 1235;		/* DSQ for normal operation */
+int enqueue_count = -1;
+int dispatch_count = -1;
+int debug_ksym_exists = -1;
+
+/* DSQ pool for stress testing */
+int dsq_pool_base_id = 2000;
+int phase1_complete = -1;
+int total_peek_attempts = -1;
+int successful_peeks = -1;
+
+/* BPF map for sharing peek results with userspace */
+struct {
+	__uint(type, BPF_MAP_TYPE_ARRAY);
+	__uint(max_entries, MAX_SAMPLES);
+	__type(key, u32);
+	__type(value, long);
+} peek_results SEC(".maps");
+
+/* Test if we're actually using the native or compat version */
+int check_dsq_insert_ksym(void)
+{
+	return bpf_ksym_exists(scx_bpf_dsq_insert) ? 1 : 0;
+}
+
+int check_dsq_peek_ksym(void)
+{
+	return bpf_ksym_exists(scx_bpf_dsq_peek) ? 1 : 0;
+}
+
+static inline int get_random_dsq_id(void)
+{
+	u64 time = bpf_ktime_get_ns();
+
+	return dsq_pool_base_id + (time % DSQ_POOL_SIZE);
+}
+
+static inline void record_peek_result(long pid)
+{
+	u32 slot_key;
+	long *slot_pid_ptr;
+	int ix;
+
+	if (pid <= 0)
+		return;
+
+	/* Find an empty slot or one with the same PID */
+	bpf_for(ix, 0, 10) {
+		slot_key = (pid + ix) % MAX_SAMPLES;
+		slot_pid_ptr = bpf_map_lookup_elem(&peek_results, &slot_key);
+		if (!slot_pid_ptr)
+			continue;
+
+		if (*slot_pid_ptr == -1 || *slot_pid_ptr == pid) {
+			*slot_pid_ptr = pid;
+			break;
+		}
+	}
+}
+
+/* Scan all DSQs in the pool and try to move a task to local */
+static inline int scan_dsq_pool(void)
+{
+	struct task_struct *task;
+	int moved = 0;
+	int i;
+
+	bpf_for(i, 0, DSQ_POOL_SIZE) {
+		int dsq_id = dsq_pool_base_id + i;
+
+		total_peek_attempts++;
+
+		task = __COMPAT_scx_bpf_dsq_peek(dsq_id);
+		if (task) {
+			successful_peeks++;
+			record_peek_result(task->pid);
+
+			/* Try to move this task to local */
+			if (!moved && scx_bpf_dsq_move_to_local(dsq_id) == 0) {
+				moved = 1;
+				break;
+			}
+		}
+	}
+	return moved;
+}
+
+/* Struct_ops scheduler for testing DSQ peek operations */
+void BPF_STRUCT_OPS(peek_dsq_enqueue, struct task_struct *p, u64 enq_flags)
+{
+	struct task_struct *peek_result;
+	int last_insert_test_cpu, cpu;
+
+	enqueue_count++;
+	cpu = bpf_get_smp_processor_id();
+	last_insert_test_cpu = __sync_val_compare_and_swap(
+		&insert_test_cpu, -1, cpu);
+
+	/* Phase 1: Simple insert-then-peek test (only on first task) */
+	if (last_insert_test_cpu == -1) {
+		bpf_printk("peek_dsq_enqueue beginning phase 1 peek test on cpu %d\n", cpu);
+
+		/* Test 1: Peek empty DSQ - should return NULL */
+		peek_result = __COMPAT_scx_bpf_dsq_peek(test_dsq_id);
+		dsq_peek_result1 = (long)peek_result; /* Should be 0 (NULL) */
+
+		/* Test 2: Insert task into test DSQ for testing in dispatch callback */
+		dsq_inserted_pid = p->pid;
+		scx_bpf_dsq_insert(p, test_dsq_id, 0, enq_flags);
+		dsq_peek_result2_expected = (long)p; /* Expected the task we just inserted */
+	} else if (!phase1_complete) {
+		/* Still in phase 1, use real DSQ */
+		scx_bpf_dsq_insert(p, real_dsq_id, 0, enq_flags);
+	} else {
+		/* Phase 2: Random DSQ insertion for stress testing */
+		int random_dsq_id = get_random_dsq_id();
+
+		scx_bpf_dsq_insert(p, random_dsq_id, 0, enq_flags);
+	}
+}
+
+void BPF_STRUCT_OPS(peek_dsq_dispatch, s32 cpu, struct task_struct *prev)
+{
+	dispatch_count++;
+
+	/* Phase 1: Complete the simple peek test if we inserted a task but
+	 * haven't tested peek yet
+	 */
+	if (insert_test_cpu == cpu && dsq_peek_result2 == -1) {
+		struct task_struct *peek_result;
+
+		bpf_printk("peek_dsq_dispatch completing phase 1 peek test on cpu %d\n", cpu);
+
+		/* Test 3: Peek DSQ after insert - should return the task we inserted */
+		peek_result = __COMPAT_scx_bpf_dsq_peek(test_dsq_id);
+		/* Store the PID of the peeked task for comparison */
+		dsq_peek_result2 = (long)peek_result;
+		dsq_peek_result2_pid = peek_result ? peek_result->pid : -1;
+
+		/* Now consume the task since we've peeked at it */
+		scx_bpf_dsq_move_to_local(test_dsq_id);
+
+		/* Mark phase 1 as complete */
+		phase1_complete = 1;
+		bpf_printk("Phase 1 complete, starting phase 2 stress testing\n");
+	} else if (!phase1_complete) {
+		/* Still in phase 1, use real DSQ */
+		scx_bpf_dsq_move_to_local(real_dsq_id);
+	} else {
+		/* Phase 2: Scan all DSQs in the pool and try to move a task */
+		if (!scan_dsq_pool()) {
+			/* No tasks found in DSQ pool, fall back to real DSQ */
+			scx_bpf_dsq_move_to_local(real_dsq_id);
+		}
+	}
+}
+
+s32 BPF_STRUCT_OPS_SLEEPABLE(peek_dsq_init)
+{
+	s32 err;
+	int i;
+
+	/* Always set debug values so we can see which version we're using */
+	debug_ksym_exists = check_dsq_peek_ksym();
+
+	/* Initialize state first */
+	insert_test_cpu = -1;
+	enqueue_count = 0;
+	dispatch_count = 0;
+	phase1_complete = 0;
+	total_peek_attempts = 0;
+	successful_peeks = 0;
+	dsq_create_result = 0;	/* Reset to 0 before attempting */
+
+	/* Create the test and real DSQs */
+	err = scx_bpf_create_dsq(test_dsq_id, -1);
+	if (!err)
+		err = scx_bpf_create_dsq(real_dsq_id, -1);
+	if (err) {
+		dsq_create_result = err;
+		scx_bpf_error("Failed to create primary DSQ %d: %d", test_dsq_id, err);
+		return err;
+	}
+
+	/* Create the DSQ pool for stress testing */
+	bpf_for(i, 0, DSQ_POOL_SIZE) {
+		int dsq_id = dsq_pool_base_id + i;
+
+		err = scx_bpf_create_dsq(dsq_id, -1);
+		if (err) {
+			dsq_create_result = err;
+			scx_bpf_error("Failed to create DSQ pool entry %d: %d", dsq_id, err);
+			return err;
+		}
+	}
+
+	dsq_create_result = 1;	/* Success */
+
+	/* Initialize the peek results map */
+	bpf_for(i, 0, MAX_SAMPLES) {
+		u32 key = i;
+		long pid = -1;
+
+		bpf_map_update_elem(&peek_results, &key, &pid, BPF_ANY);
+	}
+
+	return 0;
+}
+
+void BPF_STRUCT_OPS(peek_dsq_exit, struct scx_exit_info *ei)
+{
+	int i;
+
+	scx_bpf_destroy_dsq(test_dsq_id);
+	scx_bpf_destroy_dsq(real_dsq_id);
+	bpf_for(i, 0, DSQ_POOL_SIZE) {
+		int dsq_id =
dsq_pool_base_id + i; + + scx_bpf_destroy_dsq(dsq_id); + } + + dsq_destroy_result = 1; +} + +SEC(".struct_ops.link") +struct sched_ext_ops peek_dsq_ops = { + .enqueue = (void *)peek_dsq_enqueue, + .dispatch = (void *)peek_dsq_dispatch, + .init = (void *)peek_dsq_init, + .exit = (void *)peek_dsq_exit, + .name = "peek_dsq", +}; diff --git a/tools/testing/selftests/sched_ext/peek_dsq.c b/tools/testing/selftests/sched_ext/peek_dsq.c new file mode 100644 index 000000000000..182dbdce2400 --- /dev/null +++ b/tools/testing/selftests/sched_ext/peek_dsq.c @@ -0,0 +1,230 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Test for DSQ operations including create, destroy, and peek operations. + * + * Copyright (c) 2025 Meta Platforms, Inc. and affiliates. + * Copyright (c) 2025 Ryan Newton <ryan.newton@alum.mit.edu> + */ +#include <bpf/bpf.h> +#include <scx/common.h> +#include <sys/wait.h> +#include <unistd.h> +#include <pthread.h> +#include <string.h> +#include <sched.h> +#include "peek_dsq.bpf.skel.h" +#include "scx_test.h" + +#define NUM_WORKERS 100 + +static bool workload_running = true; +static pthread_t workload_threads[NUM_WORKERS]; + +/** + * Background workload thread that sleeps and wakes rapidly to exercise + * the scheduler's enqueue operations and ensure DSQ operations get tested. 
+ */ +static void *workload_thread_fn(void *arg) +{ + while (workload_running) { + /* Sleep for a very short time to trigger scheduler activity */ + usleep(1000); /* 1ms sleep */ + sched_yield(); + } + return NULL; +} + +static enum scx_test_status setup(void **ctx) +{ + struct peek_dsq *skel; + + skel = peek_dsq__open(); + SCX_FAIL_IF(!skel, "Failed to open"); + SCX_ENUM_INIT(skel); + SCX_FAIL_IF(peek_dsq__load(skel), "Failed to load skel"); + + *ctx = skel; + + return SCX_TEST_PASS; +} + +static int print_observed_pids(struct bpf_map *map, int max_samples, const char *dsq_name) +{ + long count = 0; + + printf("Observed %s DSQ peek pids:\n", dsq_name); + for (int i = 0; i < max_samples; i++) { + long pid; + int err; + + err = bpf_map_lookup_elem(bpf_map__fd(map), &i, &pid); + if (err == 0) { + if (pid == 0) { + printf(" Sample %d: NULL peek\n", i); + } else if (pid > 0) { + printf(" Sample %d: pid %ld\n", i, pid); + count++; + } + } else { + printf(" Sample %d: error reading pid (err=%d)\n", i, err); + } + } + printf("Observed ~%ld pids in the %s DSQ(s)\n", count, dsq_name); + return count; +} + +static enum scx_test_status run(void *ctx) +{ + struct peek_dsq *skel = ctx; + bool failed = false; + int seconds = 3; + int err; + + printf("Enabling scheduler to test DSQ insert operations...\n"); + + struct bpf_link *link = + bpf_map__attach_struct_ops(skel->maps.peek_dsq_ops); + + if (!link) { + SCX_ERR("Failed to attach struct_ops"); + return SCX_TEST_FAIL; + } + + printf("Starting %d background workload threads...\n", NUM_WORKERS); + workload_running = true; + for (int i = 0; i < NUM_WORKERS; i++) { + err = pthread_create(&workload_threads[i], NULL, workload_thread_fn, NULL); + if (err) { + SCX_ERR("Failed to create workload thread %d: %s", i, strerror(err)); + /* Stop already-created threads */ + workload_running = false; + for (int j = 0; j < i; j++) + pthread_join(workload_threads[j], NULL); + bpf_link__destroy(link); + return SCX_TEST_FAIL; + } + } + + 
printf("Waiting for enqueue events.\n"); + sleep(seconds); + while (skel->data->enqueue_count <= 0) { + printf("."); + fflush(stdout); + sleep(1); + seconds++; + if (seconds >= 30) { + printf("\n\u2717 Timeout waiting for enqueue events\n"); + /* Stop workload threads and cleanup */ + workload_running = false; + for (int i = 0; i < NUM_WORKERS; i++) + pthread_join(workload_threads[i], NULL); + bpf_link__destroy(link); + return SCX_TEST_FAIL; + } + } + + workload_running = false; + for (int i = 0; i < NUM_WORKERS; i++) { + err = pthread_join(workload_threads[i], NULL); + if (err) { + SCX_ERR("Failed to join workload thread %d: %s", i, strerror(err)); + bpf_link__destroy(link); + return SCX_TEST_FAIL; + } + } + printf("Background workload threads stopped.\n"); + + /* Detach the scheduler */ + bpf_link__destroy(link); + + if (skel->data->dsq_create_result != 1) { + printf("\u2717 DSQ create failed: got %d, expected 1\n", + skel->data->dsq_create_result); + failed = true; + } else { + printf("\u2713 DSQ create succeeded\n"); + } + + printf("Enqueue/dispatch count over %d seconds: %d / %d\n", seconds, + skel->data->enqueue_count, skel->data->dispatch_count); + printf("Debug: ksym_exists=%d\n", + skel->data->debug_ksym_exists); + + printf("DSQ insert test done on cpu: %d\n", skel->data->insert_test_cpu); + if (skel->data->insert_test_cpu != -1) + printf("\u2713 DSQ insert succeeded !\n"); + else { + printf("\u2717 DSQ insert failed or not attempted\n"); + failed = true; + } + + printf(" DSQ peek result 1 (before insert): %d\n", + skel->data->dsq_peek_result1); + if (skel->data->dsq_peek_result1 == 0) + printf("\u2713 DSQ peek verification success: peek returned NULL!\n"); + else { + printf("\u2717 DSQ peek verification failed\n"); + failed = true; + } + + printf(" DSQ peek result 2 (after insert): %ld\n", + skel->data->dsq_peek_result2); + printf(" DSQ peek result 2, expected: %ld\n", + skel->data->dsq_peek_result2_expected); + if (skel->data->dsq_peek_result2 == + 
skel->data->dsq_peek_result2_expected) + printf("\u2713 DSQ peek verification success: peek returned the inserted task!\n"); + else { + printf("\u2717 DSQ peek verification failed\n"); + failed = true; + } + + printf(" Inserted test task -> pid: %ld\n", skel->data->dsq_inserted_pid); + printf(" DSQ peek result 2 -> pid: %ld\n", skel->data->dsq_peek_result2_pid); + + if (skel->data->dsq_destroy_result != 1) { + printf("\u2717 DSQ destroy failed: got %d, expected 1\n", + skel->data->dsq_destroy_result); + failed = true; + } + + int pid_count; + + pid_count = print_observed_pids(skel->maps.peek_results, + skel->data->max_samples, "DSQ pool"); + + if (skel->data->debug_ksym_exists && pid_count == 0) { + printf("\u2717 DSQ pool test failed: no successful peeks in native mode\n"); + failed = true; + } + if (skel->data->debug_ksym_exists && pid_count > 0) + printf("\u2713 DSQ pool test success: observed successful peeks in native mode\n"); + + if (failed) + return SCX_TEST_FAIL; + else + return SCX_TEST_PASS; +} + +static void cleanup(void *ctx) +{ + struct peek_dsq *skel = ctx; + + if (workload_running) { + workload_running = false; + for (int i = 0; i < NUM_WORKERS; i++) + pthread_join(workload_threads[i], NULL); + } + + peek_dsq__destroy(skel); +} + +struct scx_test peek_dsq = { + .name = "peek_dsq", + .description = + "Test DSQ create/destroy operations and future peek functionality", + .setup = setup, + .run = run, + .cleanup = cleanup, +}; +REGISTER_SCX_TEST(&peek_dsq) -- 2.51.0 ^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH v3 2/2] sched_ext: Add a selftest for scx_bpf_dsq_peek
  2025-10-06 17:04 ` [PATCH v3 2/2] sched_ext: Add a selftest for scx_bpf_dsq_peek Ryan Newton
@ 2025-10-06 17:17   ` Christian Loehle
  2025-10-06 17:22     ` Andrea Righi
  2025-10-06 19:36   ` Andrea Righi
  1 sibling, 1 reply; 10+ messages in thread
From: Christian Loehle @ 2025-10-06 17:17 UTC (permalink / raw)
To: Ryan Newton, linux-kernel; +Cc: sched-ext, tj, arighi, newton

On 10/6/25 18:04, Ryan Newton wrote:
> From: Ryan Newton <newton@meta.com>
>
> Perform the most basic unit test: make sure an empty queue peeks as
> empty, and when we put one element in the queue, make sure peek returns
> that element.
>
> However, even this simple test is a little complicated by the different
> behavior of scx_bpf_dsq_insert in different calling contexts:
> - insert is for direct dispatch in enqueue
> - insert is delayed when called from select_cpu
>
> In this case we split the insert and the peek that verifies the
> result between enqueue/dispatch. As a second phase, we stress test by
> performing many peeks on an array of user DSQs.
>
> Note: An alternative would be to call `scx_bpf_dsq_move_to_local` on an
> empty queue, which in turn calls `flush_dispatch_buf`, in order to flush
> the buffered insert. Unfortunately, this is not viable within the
> enqueue path, as it attempts a voluntary context switch within an RCU
> read-side critical section.
>
> Signed-off-by: Ryan Newton <newton@meta.com>
> ---
>  kernel/sched/ext.c                            |   2 +
>  tools/testing/selftests/sched_ext/Makefile    |   1 +
>  .../selftests/sched_ext/peek_dsq.bpf.c        | 265 ++++++++++++++++++
>  tools/testing/selftests/sched_ext/peek_dsq.c  | 230 +++++++++++++++
>  4 files changed, 498 insertions(+)
>  create mode 100644 tools/testing/selftests/sched_ext/peek_dsq.bpf.c
>  create mode 100644 tools/testing/selftests/sched_ext/peek_dsq.c
>
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 6d3537e65001..ec7e791cd4c8 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -6120,6 +6120,7 @@ __bpf_kfunc struct task_struct *scx_bpf_dsq_peek(u64 dsq_id)
>  	sch = rcu_dereference(scx_root);
>  	if (unlikely(!sch))
>  		return NULL;
> +

Accidental whitespace

>  	if (unlikely(dsq_id & SCX_DSQ_FLAG_BUILTIN)) {
>  		scx_error(sch, "peek disallowed on builtin DSQ 0x%llx", dsq_id);
>  		return NULL;
> @@ -6130,6 +6131,7 @@ __bpf_kfunc struct task_struct *scx_bpf_dsq_peek(u64 dsq_id)
>  		scx_error(sch, "peek on non-existent DSQ 0x%llx", dsq_id);
>  		return NULL;
>  	}
> +

Accidental whitespace

>  	return rcu_dereference(dsq->first_task);
>  }

[snip]
* Re: [PATCH v3 2/2] sched_ext: Add a selftest for scx_bpf_dsq_peek
  2025-10-06 17:17 ` Christian Loehle
@ 2025-10-06 17:22   ` Andrea Righi
  0 siblings, 0 replies; 10+ messages in thread
From: Andrea Righi @ 2025-10-06 17:22 UTC (permalink / raw)
To: Christian Loehle; +Cc: Ryan Newton, linux-kernel, sched-ext, tj, newton

On Mon, Oct 06, 2025 at 06:17:50PM +0100, Christian Loehle wrote:
> On 10/6/25 18:04, Ryan Newton wrote:
> > From: Ryan Newton <newton@meta.com>
> >
> > Perform the most basic unit test: make sure an empty queue peeks as
> > empty, and when we put one element in the queue, make sure peek returns
> > that element.
> >
> > However, even this simple test is a little complicated by the different
> > behavior of scx_bpf_dsq_insert in different calling contexts:
> > - insert is for direct dispatch in enqueue
> > - insert is delayed when called from select_cpu
> >
> > In this case we split the insert and the peek that verifies the
> > result between enqueue/dispatch. As a second phase, we stress test by
> > performing many peeks on an array of user DSQs.
> >
> > Note: An alternative would be to call `scx_bpf_dsq_move_to_local` on an
> > empty queue, which in turn calls `flush_dispatch_buf`, in order to flush
> > the buffered insert. Unfortunately, this is not viable within the
> > enqueue path, as it attempts a voluntary context switch within an RCU
> > read-side critical section.
> >
> > Signed-off-by: Ryan Newton <newton@meta.com>
> > ---
> >  kernel/sched/ext.c                            |   2 +
> >  tools/testing/selftests/sched_ext/Makefile    |   1 +
> >  .../selftests/sched_ext/peek_dsq.bpf.c        | 265 ++++++++++++++++++
> >  tools/testing/selftests/sched_ext/peek_dsq.c  | 230 +++++++++++++++
> >  4 files changed, 498 insertions(+)
> >  create mode 100644 tools/testing/selftests/sched_ext/peek_dsq.bpf.c
> >  create mode 100644 tools/testing/selftests/sched_ext/peek_dsq.c
> >
> > diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> > index 6d3537e65001..ec7e791cd4c8 100644
> > --- a/kernel/sched/ext.c
> > +++ b/kernel/sched/ext.c
> > @@ -6120,6 +6120,7 @@ __bpf_kfunc struct task_struct *scx_bpf_dsq_peek(u64 dsq_id)
> >  	sch = rcu_dereference(scx_root);
> >  	if (unlikely(!sch))
> >  		return NULL;
> > +
>
> Accidental whitespace

Yeah, this should go in the previous patch.

> >  	if (unlikely(dsq_id & SCX_DSQ_FLAG_BUILTIN)) {
> >  		scx_error(sch, "peek disallowed on builtin DSQ 0x%llx", dsq_id);
> >  		return NULL;
> > @@ -6130,6 +6131,7 @@ __bpf_kfunc struct task_struct *scx_bpf_dsq_peek(u64 dsq_id)
> >  		scx_error(sch, "peek on non-existent DSQ 0x%llx", dsq_id);
> >  		return NULL;
> >  	}
> > +
>
> Accidental whitespace

Ditto.

> >  	return rcu_dereference(dsq->first_task);
> >  }
>
> [snip]

Thanks,
-Andrea
* Re: [PATCH v3 2/2] sched_ext: Add a selftest for scx_bpf_dsq_peek
  2025-10-06 17:04 ` [PATCH v3 2/2] sched_ext: Add a selftest for scx_bpf_dsq_peek Ryan Newton
  2025-10-06 17:17   ` Christian Loehle
@ 2025-10-06 19:36   ` Andrea Righi
  1 sibling, 0 replies; 10+ messages in thread
From: Andrea Righi @ 2025-10-06 19:36 UTC (permalink / raw)
To: Ryan Newton; +Cc: linux-kernel, sched-ext, tj, newton

Hi Ryan,

On Mon, Oct 06, 2025 at 01:04:03PM -0400, Ryan Newton wrote:
> From: Ryan Newton <newton@meta.com>
>
> Perform the most basic unit test: make sure an empty queue peeks as
> empty, and when we put one element in the queue, make sure peek returns
> that element.
>
> However, even this simple test is a little complicated by the different
> behavior of scx_bpf_dsq_insert in different calling contexts:
> - insert is for direct dispatch in enqueue
> - insert is delayed when called from select_cpu
>
> In this case we split the insert and the peek that verifies the
> result between enqueue/dispatch. As a second phase, we stress test by
> performing many peeks on an array of user DSQs.
>
> Note: An alternative would be to call `scx_bpf_dsq_move_to_local` on an
> empty queue, which in turn calls `flush_dispatch_buf`, in order to flush
> the buffered insert. Unfortunately, this is not viable within the
> enqueue path, as it attempts a voluntary context switch within an RCU
> read-side critical section.

This test is a bit difficult to follow and it's completely hammering the
VM where I'm running it. I think we should avoid adding selftests that
are too heavy, as they might trigger false positives and lead distro
maintainers to ignore or disable them in their CI/CD.

Maybe we should go with something simpler to test the basic functionality
of this new API and validate the expected behavior.

For instance, most BPF schedulers using this API will likely implement
something like the following in their ops.dispatch():

	u64 min_vtime = ULLONG_MAX;
	u64 dsq_id, target_dsq;

	bpf_for(dsq_id, 0, MAX_DSQ) {
		struct task_struct *p = __COMPAT_scx_bpf_dsq_peek(dsq_id);

		if (p && bpf_cpumask_test_cpu(from_cpu, p->cpus_ptr) &&
		    p->scx.dsq_vtime < min_vtime) {
			min_vtime = p->scx.dsq_vtime;
			target_dsq = dsq_id;
		}
	}
	if (min_vtime < ULLONG_MAX)
		scx_bpf_dsq_move_to_local(target_dsq);

I think this is the specific use case we want to make sure doesn't break
in the future.

A way to validate this could be to use a global counter (vtime_enq),
every time a task is queued, increment vtime_enq and use the value with
scx_dsq_insert_vtime() and insert the task to a DSQ from a pool of DSQs
(up to MAX_DSQ). Then in ops.dispatch() we can always consume the task
with the minimum vtime, using the logic above and verify that min_vtime
is always incremented by one, or something similar.

That said, we can go with your approach for now and improve the selftest
later in the future, you don't have to completely rewrite the test. But I
think we should make it a bit more lightweight at least, maybe reduce the
workers or similar.

Also, a few minor comments below.

>
> Signed-off-by: Ryan Newton <newton@meta.com>
...
> diff --git a/tools/testing/selftests/sched_ext/peek_dsq.bpf.c b/tools/testing/selftests/sched_ext/peek_dsq.bpf.c
> new file mode 100644
> index 000000000000..8d179d4c7efb
> --- /dev/null
> +++ b/tools/testing/selftests/sched_ext/peek_dsq.bpf.c
> @@ -0,0 +1,265 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * A BPF program for testing DSQ operations including create, destroy,
> + * and peek operations. Uses a hybrid approach:
> + * - Syscall program for DSQ lifecycle (create/destroy)
> + * - Struct ops scheduler for task insertion/dequeue testing
> + *
> + * Copyright (c) 2025 Meta Platforms, Inc. and affiliates.
> + * Copyright (c) 2025 Ryan Newton <ryan.newton@alum.mit.edu>
> + */
> +
> +#include <scx/common.bpf.h>
> +#include <scx/compat.bpf.h>
> +
> +char _license[] SEC("license") = "GPL";
> +
> +#define MAX_SAMPLES 100
> +#define MAX_CPUS 512
> +#define DSQ_POOL_SIZE 8
> +int max_samples = MAX_SAMPLES;
> +int max_cpus = MAX_CPUS;
> +int dsq_pool_size = DSQ_POOL_SIZE;
> +
> +/* Global variables to store test results */
> +int dsq_create_result = -1;
> +int dsq_destroy_result = -1;
> +int dsq_peek_result1 = -1;
> +long dsq_inserted_pid = -1;
> +int insert_test_cpu = -1;	/* Set to the cpu that performs the test */
> +long dsq_peek_result2 = -1;
> +long dsq_peek_result2_pid = -1;
> +long dsq_peek_result2_expected = -1;
> +int test_dsq_id = 1234;		/* Use a simple ID like create_dsq example */
> +int real_dsq_id = 1235;		/* DSQ for normal operation */
> +int enqueue_count = -1;
> +int dispatch_count = -1;
> +int debug_ksym_exists = -1;

Maybe use a bool here.

> +
> +/* DSQ pool for stress testing */
> +int dsq_pool_base_id = 2000;
> +int phase1_complete = -1;
> +int total_peek_attempts = -1;
> +int successful_peeks = -1;
> +
> +/* BPF map for sharing peek results with userspace */
> +struct {
> +	__uint(type, BPF_MAP_TYPE_ARRAY);
> +	__uint(max_entries, MAX_SAMPLES);
> +	__type(key, u32);
> +	__type(value, long);
> +} peek_results SEC(".maps");
> +
> +/* Test if we're actually using the native or compat version */
> +int check_dsq_insert_ksym(void)
> +{
> +	return bpf_ksym_exists(scx_bpf_dsq_insert) ? 1 : 0;
> +}

This helper is unused, we can drop it.

> +
> +int check_dsq_peek_ksym(void)
> +{
> +	return bpf_ksym_exists(scx_bpf_dsq_peek) ? 1 : 0;
> +}

Do we need this helper? I think we can just do:

	debug_ksym_exists = bpf_ksym_exists(scx_bpf_dsq_peek);

And have debug_ksym_exists as bool.

> +
> +static inline int get_random_dsq_id(void)
> +{
> +	u64 time = bpf_ktime_get_ns();
> +
> +	return dsq_pool_base_id + (time % DSQ_POOL_SIZE);
> +}
> +
> +static inline void record_peek_result(long pid)

Does it really need to be inline? It's quite a big function.

> +{
> +	u32 slot_key;
> +	long *slot_pid_ptr;
> +	int ix;
> +
> +	if (pid <= 0)
> +		return;
> +
> +	/* Find an empty slot or one with the same PID */
> +	bpf_for(ix, 0, 10) {
> +		slot_key = (pid + ix) % MAX_SAMPLES;
> +		slot_pid_ptr = bpf_map_lookup_elem(&peek_results, &slot_key);
> +		if (!slot_pid_ptr)
> +			continue;
> +
> +		if (*slot_pid_ptr == -1 || *slot_pid_ptr == pid) {
> +			*slot_pid_ptr = pid;
> +			break;
> +		}
> +	}
> +}
> +
> +/* Scan all DSQs in the pool and try to move a task to local */
> +static inline int scan_dsq_pool(void)

This is also quite big, we can drop inline and let the compiler decide.

> +{
> +	struct task_struct *task;
> +	int moved = 0;
> +	int i;
> +
> +	bpf_for(i, 0, DSQ_POOL_SIZE) {
> +		int dsq_id = dsq_pool_base_id + i;
> +
> +		total_peek_attempts++;
> +
> +		task = __COMPAT_scx_bpf_dsq_peek(dsq_id);
> +		if (task) {
> +			successful_peeks++;
> +			record_peek_result(task->pid);
> +
> +			/* Try to move this task to local */
> +			if (!moved && scx_bpf_dsq_move_to_local(dsq_id) == 0) {
> +				moved = 1;
> +				break;
> +			}
> +		}
> +	}
> +	return moved;
> +}
> +
> +/* Struct_ops scheduler for testing DSQ peek operations */
> +void BPF_STRUCT_OPS(peek_dsq_enqueue, struct task_struct *p, u64 enq_flags)
> +{
> +	struct task_struct *peek_result;
> +	int last_insert_test_cpu, cpu;
> +
> +	enqueue_count++;
> +	cpu = bpf_get_smp_processor_id();
> +	last_insert_test_cpu = __sync_val_compare_and_swap(
> +		&insert_test_cpu, -1, cpu);

This can be in a single line.

> +
> +	/* Phase 1: Simple insert-then-peek test (only on first task) */
> +	if (last_insert_test_cpu == -1) {
> +		bpf_printk("peek_dsq_enqueue beginning phase 1 peek test on cpu %d\n", cpu);

No need the "\n" when you use bpf_printk().

> +
> +		/* Test 1: Peek empty DSQ - should return NULL */
> +		peek_result = __COMPAT_scx_bpf_dsq_peek(test_dsq_id);
> +		dsq_peek_result1 = (long)peek_result;	/* Should be 0 (NULL) */
> +
> +		/* Test 2: Insert task into test DSQ for testing in dispatch callback */
> +		dsq_inserted_pid = p->pid;
> +		scx_bpf_dsq_insert(p, test_dsq_id, 0, enq_flags);
> +		dsq_peek_result2_expected = (long)p;	/* Expected the task we just inserted */
> +	} else if (!phase1_complete) {
> +		/* Still in phase 1, use real DSQ */
> +		scx_bpf_dsq_insert(p, real_dsq_id, 0, enq_flags);
> +	} else {
> +		/* Phase 2: Random DSQ insertion for stress testing */
> +		int random_dsq_id = get_random_dsq_id();
> +
> +		scx_bpf_dsq_insert(p, random_dsq_id, 0, enq_flags);
> +	}
> +}
> +
> +void BPF_STRUCT_OPS(peek_dsq_dispatch, s32 cpu, struct task_struct *prev)
> +{
> +	dispatch_count++;
> +
> +	/* Phase 1: Complete the simple peek test if we inserted a task but
> +	 * haven't tested peek yet
> +	 */
> +	if (insert_test_cpu == cpu && dsq_peek_result2 == -1) {
> +		struct task_struct *peek_result;
> +
> +		bpf_printk("peek_dsq_dispatch completing phase 1 peek test on cpu %d\n", cpu);

Ditto about "\n".

> +
> +		/* Test 3: Peek DSQ after insert - should return the task we inserted */
> +		peek_result = __COMPAT_scx_bpf_dsq_peek(test_dsq_id);
> +		/* Store the PID of the peeked task for comparison */
> +		dsq_peek_result2 = (long)peek_result;
> +		dsq_peek_result2_pid = peek_result ? peek_result->pid : -1;
> +
> +		/* Now consume the task since we've peeked at it */
> +		scx_bpf_dsq_move_to_local(test_dsq_id);
> +
> +		/* Mark phase 1 as complete */
> +		phase1_complete = 1;
> +		bpf_printk("Phase 1 complete, starting phase 2 stress testing\n");

Ditto.

> +	} else if (!phase1_complete) {
> +		/* Still in phase 1, use real DSQ */
> +		scx_bpf_dsq_move_to_local(real_dsq_id);
> +	} else {
> +		/* Phase 2: Scan all DSQs in the pool and try to move a task */
> +		if (!scan_dsq_pool()) {
> +			/* No tasks found in DSQ pool, fall back to real DSQ */
> +			scx_bpf_dsq_move_to_local(real_dsq_id);
> +		}
> +	}
> +}
> +
> +s32 BPF_STRUCT_OPS_SLEEPABLE(peek_dsq_init)
> +{
> +	s32 err;
> +	int i;
> +
> +	/* Always set debug values so we can see which version we're using */
> +	debug_ksym_exists = check_dsq_peek_ksym();

As mentioned above this one can be:

	debug_ksym_exists = bpf_ksym_exists(scx_bpf_dsq_peek);

> +
> +	/* Initialize state first */
> +	insert_test_cpu = -1;
> +	enqueue_count = 0;
> +	dispatch_count = 0;
> +	phase1_complete = 0;
> +	total_peek_attempts = 0;
> +	successful_peeks = 0;
> +	dsq_create_result = 0;	/* Reset to 0 before attempting */
> +
> +	/* Create the test and real DSQs */
> +	err = scx_bpf_create_dsq(test_dsq_id, -1);
> +	if (!err)
> +		err = scx_bpf_create_dsq(real_dsq_id, -1);
> +	if (err) {
> +		dsq_create_result = err;
> +		scx_bpf_error("Failed to create primary DSQ %d: %d", test_dsq_id, err);
> +		return err;
> +	}

How about:

	err = scx_bpf_create_dsq(test_dsq_id, -1);
	if (err) {
		dsq_create_result = err;
		scx_bpf_error("Failed to create DSQ %d: %d", test_dsq_id, err);
		return err;
	}

	err = scx_bpf_create_dsq(real_dsq_id, -1);
	if (err) {
		dsq_create_result = err;
		scx_bpf_error("Failed to create DSQ %d: %d", real_dsq_id, err);
		return err;
	}

Also do we need to save the error in dsq_create_result (similar with the
other error variables)? We could just use scx_bpf_error(), have
UEI_RECORD(uei, ei) in ops.exit() and use SCX_EQ() in peek_dsq.c to catch
error conditions and trigger a failure, see for example
allowed_cpus[.bpf].c.

> +
> +	/* Create the DSQ pool for stress testing */
> +	bpf_for(i, 0, DSQ_POOL_SIZE) {
> +		int dsq_id = dsq_pool_base_id + i;
> +
> +		err = scx_bpf_create_dsq(dsq_id, -1);
> +		if (err) {
> +			dsq_create_result = err;
> +			scx_bpf_error("Failed to create DSQ pool entry %d: %d", dsq_id, err);
> +			return err;
> +		}
> +	}
> +
> +	dsq_create_result = 1;	/* Success */
> +
> +	/* Initialize the peek results map */
> +	bpf_for(i, 0, MAX_SAMPLES) {
> +		u32 key = i;
> +		long pid = -1;
> +
> +		bpf_map_update_elem(&peek_results, &key, &pid, BPF_ANY);
> +	}
> +
> +	return 0;
> +}
> +
> +void BPF_STRUCT_OPS(peek_dsq_exit, struct scx_exit_info *ei)
> +{
> +	int i;
> +
> +	scx_bpf_destroy_dsq(test_dsq_id);
> +	scx_bpf_destroy_dsq(real_dsq_id);
> +	bpf_for(i, 0, DSQ_POOL_SIZE) {
> +		int dsq_id = dsq_pool_base_id + i;
> +
> +		scx_bpf_destroy_dsq(dsq_id);
> +	}
> +
> +	dsq_destroy_result = 1;
> +}
> +
> +SEC(".struct_ops.link")
> +struct sched_ext_ops peek_dsq_ops = {
> +	.enqueue = (void *)peek_dsq_enqueue,
> +	.dispatch = (void *)peek_dsq_dispatch,
> +	.init = (void *)peek_dsq_init,
> +	.exit = (void *)peek_dsq_exit,
> +	.name = "peek_dsq",
> +};

Thanks,
-Andrea
* Re: [PATCH v3 0/2] sched_ext: lockless peek operation for DSQs
  2025-10-06 17:04 [PATCH v3 0/2] sched_ext: lockless peek operation for DSQs Ryan Newton
  2025-10-06 17:04 ` [PATCH v3 1/2] sched_ext: Add " Ryan Newton
  2025-10-06 17:04 ` [PATCH v3 2/2] sched_ext: Add a selftest for scx_bpf_dsq_peek Ryan Newton
@ 2025-10-06 17:20 ` Christian Loehle
  2025-10-06 17:34   ` Ryan Newton
  2 siblings, 1 reply; 10+ messages in thread
From: Christian Loehle @ 2025-10-06 17:20 UTC (permalink / raw)
To: Ryan Newton, linux-kernel; +Cc: sched-ext, tj, arighi, newton

On 10/6/25 18:04, Ryan Newton wrote:
> This allows sched_ext schedulers an inexpensive operation to peek
> at the first element in a queue (DSQ), without creating an iterator
> and acquiring the lock on that queue.
>
> Note that manual testing has thus far included a modified version of the
> example qmap scheduler that exercises peek, as well as a modified
> LAVD (from the SCX repo) that exercises peek. The attached test
> passes >1000 stress tests when run in concurrent VMs, and when run
> sequentially on the host kernel. Presently, tested on the below
> workstation and server processors.
> - AMD Ryzen Threadripper PRO 7975WX 32-Cores
> - AMD EPYC 9D64 88-Core Processor

Is the adapted qmap and lavd available somewhere?

>
> Initial experiments indicate a substantial speedup (on schbench) when
> running an SCX scheduler with per-cpu DSQs and peeking each queue to
> retrieve the task with the minimum vruntime across all the CPUs.
>
> ---
> Changes in v3:
> - inline helpers and simplify
> - coding style tweaks
>
> Changes in v2:
> - make peek() only work for user DSQs and error otherwise
> - added a stress test component to the selftest that performs many peeks
> - responded to review comments from tj@kernel.org and arighi@nvidia.com
> - link: https://lore.kernel.org/lkml/20251003195408.675527-1-rrnewton@gmail.com/
>
> v1 link: https://lore.kernel.org/lkml/20251002025722.3420916-1-rrnewton@gmail.com/
>
> Ryan Newton (2):
>   sched_ext: Add lockless peek operation for DSQs
>   sched_ext: Add a selftest for scx_bpf_dsq_peek
>
>  include/linux/sched/ext.h                     |   1 +
>  kernel/sched/ext.c                            |  56 +++-
>  tools/sched_ext/include/scx/common.bpf.h      |   1 +
>  tools/sched_ext/include/scx/compat.bpf.h      |  19 ++
>  tools/testing/selftests/sched_ext/Makefile    |   1 +
>  .../selftests/sched_ext/peek_dsq.bpf.c        | 265 ++++++++++++++++++
>  tools/testing/selftests/sched_ext/peek_dsq.c  | 230 +++++++++++++++
>  7 files changed, 571 insertions(+), 2 deletions(-)
>  create mode 100644 tools/testing/selftests/sched_ext/peek_dsq.bpf.c
>  create mode 100644 tools/testing/selftests/sched_ext/peek_dsq.c
>
* Re: [PATCH v3 0/2] sched_ext: lockless peek operation for DSQs 2025-10-06 17:20 ` [PATCH v3 0/2] sched_ext: lockless peek operation for DSQs Christian Loehle @ 2025-10-06 17:34 ` Ryan Newton 0 siblings, 0 replies; 10+ messages in thread From: Ryan Newton @ 2025-10-06 17:34 UTC (permalink / raw) To: Christian Loehle; +Cc: linux-kernel, sched-ext, tj, arighi, newton Hello Christian, Sure, the lavd with peek is here: https://github.com/sched-ext/scx/pull/2675 Beerland with peek is here: https://github.com/sched-ext/scx/commit/c2a0f185051c06cc1ebae1dc40e5fe2bd3022c1e The qmap one was not a meaningful change to the scheduler but just extra peeks thrown in with debug prints. Cheers, -Ryan P.S. Good catch on the two newlines being split into the wrong diff. Argh, I was playing with Sapling's `sl absorb` and I swear it said it amended the earlier commit. Apologies for not catching it. FWIW here's a tip with that correction: https://github.com/rrnewton/linux/commit/db426f852813e2b6deeae0869d20df1bea647a07 On Mon, Oct 6, 2025 at 1:20 PM Christian Loehle <christian.loehle@arm.com> wrote: > > On 10/6/25 18:04, Ryan Newton wrote: > > This allows sched_ext schedulers an inexpensive operation to peek > > at the first element in a queue (DSQ), without creating an iterator > > and acquiring the lock on that queue. > > > > Note that manual testing has thus far included a modified version of the > > example qmap scheduler that exercises peek, as well as a modified > > modified LAVD (from the SCX repo) that exercises peek. The attached test > > passes >1000 stress tests when run in concurrent VMs, and when run > > sequentially on the host kernel. Presently, tested on the below > > workstation and server processors. > > - AMD Ryzen Threadripper PRO 7975WX 32-Cores > > - AMD EPYC 9D64 88-Core Processor > > Is the adapted qmap and lavd available somewhere? 
>
> >
> > Initial experiments indicate a substantial speedup (on schbench) when
> > running an SCX scheduler with per-cpu DSQs and peeking each queue to
> > retrieve the task with the minimum vruntime across all the CPUs.
> >
> > ---
> > Changes in v3:
> > - inline helpers and simplify
> > - coding style tweaks
> >
> > Changes in v2:
> > - make peek() only work for user DSQs and error otherwise
> > - added a stress test component to the selftest that performs many peeks
> > - responded to review comments from tj@kernel.org and arighi@nvidia.com
> > - link: https://lore.kernel.org/lkml/20251003195408.675527-1-rrnewton@gmail.com/
> >
> > v1 link: https://lore.kernel.org/lkml/20251002025722.3420916-1-rrnewton@gmail.com/
> >
> > Ryan Newton (2):
> >   sched_ext: Add lockless peek operation for DSQs
> >   sched_ext: Add a selftest for scx_bpf_dsq_peek
> >
> >  include/linux/sched/ext.h                    |   1 +
> >  kernel/sched/ext.c                           |  56 +++-
> >  tools/sched_ext/include/scx/common.bpf.h     |   1 +
> >  tools/sched_ext/include/scx/compat.bpf.h     |  19 ++
> >  tools/testing/selftests/sched_ext/Makefile   |   1 +
> >  .../selftests/sched_ext/peek_dsq.bpf.c       | 265 ++++++++++++++++++
> >  tools/testing/selftests/sched_ext/peek_dsq.c | 230 +++++++++++++++
> >  7 files changed, 571 insertions(+), 2 deletions(-)
> >  create mode 100644 tools/testing/selftests/sched_ext/peek_dsq.bpf.c
> >  create mode 100644 tools/testing/selftests/sched_ext/peek_dsq.c
end of thread, other threads:[~2025-10-14 23:14 UTC | newest]

Thread overview: 10+ messages
2025-10-06 17:04 [PATCH v3 0/2] sched_ext: lockless peek operation for DSQs Ryan Newton
2025-10-06 17:04 ` [PATCH v3 1/2] sched_ext: Add " Ryan Newton
2025-10-06 17:26   ` Andrea Righi
2025-10-14 22:27   ` Jake Hillion
2025-10-06 17:04 ` [PATCH v3 2/2] sched_ext: Add a selftest for scx_bpf_dsq_peek Ryan Newton
2025-10-06 17:17   ` Christian Loehle
2025-10-06 17:22   ` Andrea Righi
2025-10-06 19:36     ` Andrea Righi
2025-10-06 17:20 ` [PATCH v3 0/2] sched_ext: lockless peek operation for DSQs Christian Loehle
2025-10-06 17:34   ` Ryan Newton