public inbox for linux-kernel@vger.kernel.org
* [PATCH 1/7] sched_ext: Use rhashtable_lookup() instead of rhashtable_lookup_fast()
@ 2025-09-22  1:32 Tejun Heo
  2025-09-22  1:32 ` [PATCH 2/7] sched_ext: Improve SCX_KF_DISPATCH comment Tejun Heo
                   ` (10 more replies)
  0 siblings, 11 replies; 26+ messages in thread
From: Tejun Heo @ 2025-09-22  1:32 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: linux-kernel, sched-ext, Tejun Heo

The find_user_dsq() function is called from contexts that are already
under RCU read lock protection. Switch from rhashtable_lookup_fast() to
rhashtable_lookup() to avoid redundant RCU locking.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/sched/ext.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index f5873f8ed669..df433f6fab4b 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -207,7 +207,7 @@ static struct scx_dispatch_q *find_global_dsq(struct task_struct *p)
 
 static struct scx_dispatch_q *find_user_dsq(struct scx_sched *sch, u64 dsq_id)
 {
-	return rhashtable_lookup_fast(&sch->dsq_hash, &dsq_id, dsq_hash_params);
+	return rhashtable_lookup(&sch->dsq_hash, &dsq_id, dsq_hash_params);
 }
 
 /*
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 2/7] sched_ext: Improve SCX_KF_DISPATCH comment
  2025-09-22  1:32 [PATCH 1/7] sched_ext: Use rhashtable_lookup() instead of rhashtable_lookup_fast() Tejun Heo
@ 2025-09-22  1:32 ` Tejun Heo
  2025-09-23  7:20   ` Andrea Righi
  2025-09-22  1:32 ` [PATCH 3/7] sched_ext: Fix stray scx_root usage in task_can_run_on_remote_rq() Tejun Heo
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 26+ messages in thread
From: Tejun Heo @ 2025-09-22  1:32 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: linux-kernel, sched-ext, Tejun Heo

The comment for SCX_KF_DISPATCH was incomplete and didn't explain that
ops.dispatch() may temporarily release the rq lock, allowing ENQUEUE and
SELECT_CPU operations to be nested inside DISPATCH contexts.

Update the comment to clarify this nesting behavior and provide better
context for when these operations can occur within dispatch.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 include/linux/sched/ext.h | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index 7047101dbf58..d82b7a9b0658 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -108,7 +108,11 @@ enum scx_kf_mask {
 	SCX_KF_UNLOCKED		= 0,	  /* sleepable and not rq locked */
 	/* ENQUEUE and DISPATCH may be nested inside CPU_RELEASE */
 	SCX_KF_CPU_RELEASE	= 1 << 0, /* ops.cpu_release() */
-	/* ops.dequeue (in REST) may be nested inside DISPATCH */
+	/*
+	 * ops.dispatch() may release rq lock temporarily and thus ENQUEUE and
+	 * SELECT_CPU may be nested inside. ops.dequeue (in REST) may also be
+	 * nested inside DISPATCH.
+	 */
 	SCX_KF_DISPATCH		= 1 << 1, /* ops.dispatch() */
 	SCX_KF_ENQUEUE		= 1 << 2, /* ops.enqueue() and ops.select_cpu() */
 	SCX_KF_SELECT_CPU	= 1 << 3, /* ops.select_cpu() */
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 3/7] sched_ext: Fix stray scx_root usage in task_can_run_on_remote_rq()
  2025-09-22  1:32 [PATCH 1/7] sched_ext: Use rhashtable_lookup() instead of rhashtable_lookup_fast() Tejun Heo
  2025-09-22  1:32 ` [PATCH 2/7] sched_ext: Improve SCX_KF_DISPATCH comment Tejun Heo
@ 2025-09-22  1:32 ` Tejun Heo
  2025-09-23  7:22   ` Andrea Righi
  2025-09-22  1:32 ` [PATCH 4/7] sched_ext: Use bitfields for boolean warning flags Tejun Heo
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 26+ messages in thread
From: Tejun Heo @ 2025-09-22  1:32 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: linux-kernel, sched-ext, Tejun Heo

task_can_run_on_remote_rq() takes @sch but uses scx_root when incrementing
SCX_EV_DISPATCH_LOCAL_DSQ_OFFLINE, which is inconsistent and gets in the way
of implementing multiple scheduler support. Use @sch instead. As scx_root is
currently the only possible scheduler instance, this doesn't cause any
behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/sched/ext.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index df433f6fab4b..5801ac676d59 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -1622,8 +1622,7 @@ static bool task_can_run_on_remote_rq(struct scx_sched *sch,
 
 	if (!scx_rq_online(rq)) {
 		if (enforce)
-			__scx_add_event(scx_root,
-					SCX_EV_DISPATCH_LOCAL_DSQ_OFFLINE, 1);
+			__scx_add_event(sch, SCX_EV_DISPATCH_LOCAL_DSQ_OFFLINE, 1);
 		return false;
 	}
 
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 4/7] sched_ext: Use bitfields for boolean warning flags
  2025-09-22  1:32 [PATCH 1/7] sched_ext: Use rhashtable_lookup() instead of rhashtable_lookup_fast() Tejun Heo
  2025-09-22  1:32 ` [PATCH 2/7] sched_ext: Improve SCX_KF_DISPATCH comment Tejun Heo
  2025-09-22  1:32 ` [PATCH 3/7] sched_ext: Fix stray scx_root usage in task_can_run_on_remote_rq() Tejun Heo
@ 2025-09-22  1:32 ` Tejun Heo
  2025-09-23  7:45   ` Andrea Righi
  2025-09-23 16:58   ` [PATCH v2 " Tejun Heo
  2025-09-22  1:32 ` [PATCH 5/7] sched_ext: Add SCX_EFLAG_INITIALIZED to indicate successful ops.init() Tejun Heo
                   ` (7 subsequent siblings)
  10 siblings, 2 replies; 26+ messages in thread
From: Tejun Heo @ 2025-09-22  1:32 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: linux-kernel, sched-ext, Tejun Heo

Convert warned_zero_slice and warned_deprecated_rq in scx_sched struct to
single-bit bitfields to reduce struct size.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/sched/ext_internal.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h
index 2e289931e567..1a80d01b1f0c 100644
--- a/kernel/sched/ext_internal.h
+++ b/kernel/sched/ext_internal.h
@@ -871,8 +871,8 @@ struct scx_sched {
 	struct scx_dispatch_q	**global_dsqs;
 	struct scx_sched_pcpu __percpu *pcpu;
 
-	bool			warned_zero_slice;
-	bool			warned_deprecated_rq;
+	bool			warned_zero_slice:1;
+	bool			warned_deprecated_rq:1;
 
 	atomic_t		exit_kind;
 	struct scx_exit_info	*exit_info;
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 5/7] sched_ext: Add SCX_EFLAG_INITIALIZED to indicate successful ops.init()
  2025-09-22  1:32 [PATCH 1/7] sched_ext: Use rhashtable_lookup() instead of rhashtable_lookup_fast() Tejun Heo
                   ` (2 preceding siblings ...)
  2025-09-22  1:32 ` [PATCH 4/7] sched_ext: Use bitfields for boolean warning flags Tejun Heo
@ 2025-09-22  1:32 ` Tejun Heo
  2025-09-23  8:08   ` Andrea Righi
  2025-09-22  1:32 ` [PATCH 6/7] sched_ext: Make qmap dump operation non-destructive Tejun Heo
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 26+ messages in thread
From: Tejun Heo @ 2025-09-22  1:32 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: linux-kernel, sched-ext, Tejun Heo

ops.exit() may be called even if loading fails before ops.init() finishes
successfully, because ops.exit() allows rich exit info communication. Add
an SCX_EFLAG_INITIALIZED flag to scx_exit_info.flags to indicate whether
ops.init() finished successfully.

This enables BPF schedulers to distinguish between exit scenarios and
handle cleanup appropriately based on initialization state.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/sched/ext.c          |  1 +
 kernel/sched/ext_internal.h | 13 +++++++++++++
 2 files changed, 14 insertions(+)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 5801ac676d59..d131e98156ac 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -4554,6 +4554,7 @@ static int scx_enable(struct sched_ext_ops *ops, struct bpf_link *link)
 			scx_error(sch, "ops.init() failed (%d)", ret);
 			goto err_disable;
 		}
+		sch->exit_info->flags |= SCX_EFLAG_INITIALIZED;
 	}
 
 	for (i = SCX_OPI_CPU_HOTPLUG_BEGIN; i < SCX_OPI_CPU_HOTPLUG_END; i++)
diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h
index 1a80d01b1f0c..b3617abed510 100644
--- a/kernel/sched/ext_internal.h
+++ b/kernel/sched/ext_internal.h
@@ -62,6 +62,16 @@ enum scx_exit_code {
 	SCX_ECODE_ACT_RESTART	= 1LLU << 48,
 };
 
+enum scx_exit_flags {
+	/*
+	 * ops.exit() may be called even if the loading failed before ops.init()
+	 * finishes successfully. This is because ops.exit() allows rich exit
+	 * info communication. The following flag indicates whether ops.init()
+	 * finished successfully.
+	 */
+	SCX_EFLAG_INITIALIZED	= 1LLU << 0,
+};
+
 /*
  * scx_exit_info is passed to ops.exit() to describe why the BPF scheduler is
  * being disabled.
@@ -73,6 +83,9 @@ struct scx_exit_info {
 	/* exit code if gracefully exiting */
 	s64			exit_code;
 
+	/* %SCX_EFLAG_* */
+	u64			flags;
+
 	/* textual representation of the above */
 	const char		*reason;
 
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 6/7] sched_ext: Make qmap dump operation non-destructive
  2025-09-22  1:32 [PATCH 1/7] sched_ext: Use rhashtable_lookup() instead of rhashtable_lookup_fast() Tejun Heo
                   ` (3 preceding siblings ...)
  2025-09-22  1:32 ` [PATCH 5/7] sched_ext: Add SCX_EFLAG_INITIALIZED to indicate successful ops.init() Tejun Heo
@ 2025-09-22  1:32 ` Tejun Heo
  2025-09-23  8:09   ` Andrea Righi
  2025-09-22  1:32 ` [PATCH 7/7] tools/sched_ext: scx_qmap: Make debug output quieter by default Tejun Heo
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 26+ messages in thread
From: Tejun Heo @ 2025-09-22  1:32 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: linux-kernel, sched-ext, Tejun Heo

The qmap dump operation was destructively consuming queue entries while
displaying them. As dump can be triggered anytime, this can easily lead to
stalls. Add a temporary dump_store queue and modify the dump logic to pop
entries, display them, and then restore them to the original queue.
This allows dump operations to be performed without affecting the
scheduler's queue state.

Note that if racing against new enqueues during dump, ordering can get
mixed up, but this is acceptable for debugging purposes.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 tools/sched_ext/scx_qmap.bpf.c | 18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/tools/sched_ext/scx_qmap.bpf.c b/tools/sched_ext/scx_qmap.bpf.c
index 69d877501cb7..cd50a94326e3 100644
--- a/tools/sched_ext/scx_qmap.bpf.c
+++ b/tools/sched_ext/scx_qmap.bpf.c
@@ -56,7 +56,8 @@ struct qmap {
   queue1 SEC(".maps"),
   queue2 SEC(".maps"),
   queue3 SEC(".maps"),
-  queue4 SEC(".maps");
+  queue4 SEC(".maps"),
+  dump_store SEC(".maps");
 
 struct {
 	__uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS);
@@ -578,11 +579,26 @@ void BPF_STRUCT_OPS(qmap_dump, struct scx_dump_ctx *dctx)
 			return;
 
 		scx_bpf_dump("QMAP FIFO[%d]:", i);
+
+		/*
+		 * Dump can be invoked anytime and there is no way to iterate in
+		 * a non-destructive way. Pop and store in dump_store and then
+		 * restore afterwards. If racing against new enqueues, ordering
+		 * can get mixed up.
+		 */
 		bpf_repeat(4096) {
 			if (bpf_map_pop_elem(fifo, &pid))
 				break;
+			bpf_map_push_elem(&dump_store, &pid, 0);
 			scx_bpf_dump(" %d", pid);
 		}
+
+		bpf_repeat(4096) {
+			if (bpf_map_pop_elem(&dump_store, &pid))
+				break;
+			bpf_map_push_elem(fifo, &pid, 0);
+		}
+
 		scx_bpf_dump("\n");
 	}
 }
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 7/7] tools/sched_ext: scx_qmap: Make debug output quieter by default
  2025-09-22  1:32 [PATCH 1/7] sched_ext: Use rhashtable_lookup() instead of rhashtable_lookup_fast() Tejun Heo
                   ` (4 preceding siblings ...)
  2025-09-22  1:32 ` [PATCH 6/7] sched_ext: Make qmap dump operation non-destructive Tejun Heo
@ 2025-09-22  1:32 ` Tejun Heo
  2025-09-23  8:11   ` Andrea Righi
  2025-09-23  7:18 ` [PATCH 1/7] sched_ext: Use rhashtable_lookup() instead of rhashtable_lookup_fast() Andrea Righi
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 26+ messages in thread
From: Tejun Heo @ 2025-09-22  1:32 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: linux-kernel, sched-ext, Tejun Heo

scx_qmap currently outputs verbose debug messages by default, including ones
for cgroup operations and CPU online/offline events, which can be noisy
during normal operation. While the existing -P option controls DSQ dumps and
event statistics, there's no way to suppress the other debug messages.

Split the debug output controls to make scx_qmap quieter by default. The -P
option continues to control DSQ dumps and event statistics
(print_dsqs_and_events), while a new -M option controls debug messages like
cgroup operations and CPU events (print_msgs). This allows users to run
scx_qmap with minimal output and selectively enable debug information as
needed.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 tools/sched_ext/scx_qmap.bpf.c | 80 +++++++++++++++++++---------------
 tools/sched_ext/scx_qmap.c     | 12 +++--
 2 files changed, 53 insertions(+), 39 deletions(-)

diff --git a/tools/sched_ext/scx_qmap.bpf.c b/tools/sched_ext/scx_qmap.bpf.c
index cd50a94326e3..3072b593f898 100644
--- a/tools/sched_ext/scx_qmap.bpf.c
+++ b/tools/sched_ext/scx_qmap.bpf.c
@@ -39,7 +39,8 @@ const volatile u32 stall_kernel_nth;
 const volatile u32 dsp_inf_loop_after;
 const volatile u32 dsp_batch;
 const volatile bool highpri_boosting;
-const volatile bool print_shared_dsq;
+const volatile bool print_dsqs_and_events;
+const volatile bool print_msgs;
 const volatile s32 disallow_tgid;
 const volatile bool suppress_dump;
 
@@ -633,22 +634,25 @@ void BPF_STRUCT_OPS(qmap_dump_task, struct scx_dump_ctx *dctx, struct task_struc
 
 s32 BPF_STRUCT_OPS(qmap_cgroup_init, struct cgroup *cgrp, struct scx_cgroup_init_args *args)
 {
-	bpf_printk("CGRP INIT %llu weight=%u period=%lu quota=%ld burst=%lu",
-		   cgrp->kn->id, args->weight, args->bw_period_us,
-		   args->bw_quota_us, args->bw_burst_us);
+	if (print_msgs)
+		bpf_printk("CGRP INIT %llu weight=%u period=%lu quota=%ld burst=%lu",
+			   cgrp->kn->id, args->weight, args->bw_period_us,
+			   args->bw_quota_us, args->bw_burst_us);
 	return 0;
 }
 
 void BPF_STRUCT_OPS(qmap_cgroup_set_weight, struct cgroup *cgrp, u32 weight)
 {
-	bpf_printk("CGRP SET %llu weight=%u", cgrp->kn->id, weight);
+	if (print_msgs)
+		bpf_printk("CGRP SET %llu weight=%u", cgrp->kn->id, weight);
 }
 
 void BPF_STRUCT_OPS(qmap_cgroup_set_bandwidth, struct cgroup *cgrp,
 		    u64 period_us, u64 quota_us, u64 burst_us)
 {
-	bpf_printk("CGRP SET %llu period=%lu quota=%ld burst=%lu", cgrp->kn->id,
-		   period_us, quota_us, burst_us);
+	if (print_msgs)
+		bpf_printk("CGRP SET %llu period=%lu quota=%ld burst=%lu",
+			   cgrp->kn->id, period_us, quota_us, burst_us);
 }
 
 /*
@@ -692,16 +696,20 @@ static void print_cpus(void)
 
 void BPF_STRUCT_OPS(qmap_cpu_online, s32 cpu)
 {
-	bpf_printk("CPU %d coming online", cpu);
-	/* @cpu is already online at this point */
-	print_cpus();
+	if (print_msgs) {
+		bpf_printk("CPU %d coming online", cpu);
+		/* @cpu is already online at this point */
+		print_cpus();
+	}
 }
 
 void BPF_STRUCT_OPS(qmap_cpu_offline, s32 cpu)
 {
-	bpf_printk("CPU %d going offline", cpu);
-	/* @cpu is still online at this point */
-	print_cpus();
+	if (print_msgs) {
+		bpf_printk("CPU %d going offline", cpu);
+		/* @cpu is still online at this point */
+		print_cpus();
+	}
 }
 
 struct monitor_timer {
@@ -799,35 +807,36 @@ static void dump_shared_dsq(void)
 
 static int monitor_timerfn(void *map, int *key, struct bpf_timer *timer)
 {
-	struct scx_event_stats events;
-
 	bpf_rcu_read_lock();
 	dispatch_highpri(true);
 	bpf_rcu_read_unlock();
 
 	monitor_cpuperf();
 
-	if (print_shared_dsq)
+	if (print_dsqs_and_events) {
+		struct scx_event_stats events;
+
 		dump_shared_dsq();
 
-	__COMPAT_scx_bpf_events(&events, sizeof(events));
-
-	bpf_printk("%35s: %lld", "SCX_EV_SELECT_CPU_FALLBACK",
-		   scx_read_event(&events, SCX_EV_SELECT_CPU_FALLBACK));
-	bpf_printk("%35s: %lld", "SCX_EV_DISPATCH_LOCAL_DSQ_OFFLINE",
-		   scx_read_event(&events, SCX_EV_DISPATCH_LOCAL_DSQ_OFFLINE));
-	bpf_printk("%35s: %lld", "SCX_EV_DISPATCH_KEEP_LAST",
-		   scx_read_event(&events, SCX_EV_DISPATCH_KEEP_LAST));
-	bpf_printk("%35s: %lld", "SCX_EV_ENQ_SKIP_EXITING",
-		   scx_read_event(&events, SCX_EV_ENQ_SKIP_EXITING));
-	bpf_printk("%35s: %lld", "SCX_EV_REFILL_SLICE_DFL",
-		   scx_read_event(&events, SCX_EV_REFILL_SLICE_DFL));
-	bpf_printk("%35s: %lld", "SCX_EV_BYPASS_DURATION",
-		   scx_read_event(&events, SCX_EV_BYPASS_DURATION));
-	bpf_printk("%35s: %lld", "SCX_EV_BYPASS_DISPATCH",
-		   scx_read_event(&events, SCX_EV_BYPASS_DISPATCH));
-	bpf_printk("%35s: %lld", "SCX_EV_BYPASS_ACTIVATE",
-		   scx_read_event(&events, SCX_EV_BYPASS_ACTIVATE));
+		__COMPAT_scx_bpf_events(&events, sizeof(events));
+
+		bpf_printk("%35s: %lld", "SCX_EV_SELECT_CPU_FALLBACK",
+			   scx_read_event(&events, SCX_EV_SELECT_CPU_FALLBACK));
+		bpf_printk("%35s: %lld", "SCX_EV_DISPATCH_LOCAL_DSQ_OFFLINE",
+			   scx_read_event(&events, SCX_EV_DISPATCH_LOCAL_DSQ_OFFLINE));
+		bpf_printk("%35s: %lld", "SCX_EV_DISPATCH_KEEP_LAST",
+			   scx_read_event(&events, SCX_EV_DISPATCH_KEEP_LAST));
+		bpf_printk("%35s: %lld", "SCX_EV_ENQ_SKIP_EXITING",
+			   scx_read_event(&events, SCX_EV_ENQ_SKIP_EXITING));
+		bpf_printk("%35s: %lld", "SCX_EV_REFILL_SLICE_DFL",
+			   scx_read_event(&events, SCX_EV_REFILL_SLICE_DFL));
+		bpf_printk("%35s: %lld", "SCX_EV_BYPASS_DURATION",
+			   scx_read_event(&events, SCX_EV_BYPASS_DURATION));
+		bpf_printk("%35s: %lld", "SCX_EV_BYPASS_DISPATCH",
+			   scx_read_event(&events, SCX_EV_BYPASS_DISPATCH));
+		bpf_printk("%35s: %lld", "SCX_EV_BYPASS_ACTIVATE",
+			   scx_read_event(&events, SCX_EV_BYPASS_ACTIVATE));
+	}
 
 	bpf_timer_start(timer, ONE_SEC_IN_NS, 0);
 	return 0;
@@ -839,7 +848,8 @@ s32 BPF_STRUCT_OPS_SLEEPABLE(qmap_init)
 	struct bpf_timer *timer;
 	s32 ret;
 
-	print_cpus();
+	if (print_msgs)
+		print_cpus();
 
 	ret = scx_bpf_create_dsq(SHARED_DSQ, -1);
 	if (ret)
diff --git a/tools/sched_ext/scx_qmap.c b/tools/sched_ext/scx_qmap.c
index c4912ab2e76f..ef701d45ba43 100644
--- a/tools/sched_ext/scx_qmap.c
+++ b/tools/sched_ext/scx_qmap.c
@@ -20,7 +20,7 @@ const char help_fmt[] =
 "See the top-level comment in .bpf.c for more details.\n"
 "\n"
 "Usage: %s [-s SLICE_US] [-e COUNT] [-t COUNT] [-T COUNT] [-l COUNT] [-b COUNT]\n"
-"       [-P] [-d PID] [-D LEN] [-p] [-v]\n"
+"       [-P] [-M] [-d PID] [-D LEN] [-p] [-v]\n"
 "\n"
 "  -s SLICE_US   Override slice duration\n"
 "  -e COUNT      Trigger scx_bpf_error() after COUNT enqueues\n"
@@ -28,7 +28,8 @@ const char help_fmt[] =
 "  -T COUNT      Stall every COUNT'th kernel thread\n"
 "  -l COUNT      Trigger dispatch infinite looping after COUNT dispatches\n"
 "  -b COUNT      Dispatch upto COUNT tasks together\n"
-"  -P            Print out DSQ content to trace_pipe every second, use with -b\n"
+"  -P            Print out DSQ content and event counters to trace_pipe every second\n"
+"  -M            Print out debug messages to trace_pipe\n"
 "  -H            Boost nice -20 tasks in SHARED_DSQ, use with -b\n"
 "  -d PID        Disallow a process from switching into SCHED_EXT (-1 for self)\n"
 "  -D LEN        Set scx_exit_info.dump buffer length\n"
@@ -66,7 +67,7 @@ int main(int argc, char **argv)
 
 	skel->rodata->slice_ns = __COMPAT_ENUM_OR_ZERO("scx_public_consts", "SCX_SLICE_DFL");
 
-	while ((opt = getopt(argc, argv, "s:e:t:T:l:b:PHd:D:Spvh")) != -1) {
+	while ((opt = getopt(argc, argv, "s:e:t:T:l:b:PMHd:D:Spvh")) != -1) {
 		switch (opt) {
 		case 's':
 			skel->rodata->slice_ns = strtoull(optarg, NULL, 0) * 1000;
@@ -87,7 +88,10 @@ int main(int argc, char **argv)
 			skel->rodata->dsp_batch = strtoul(optarg, NULL, 0);
 			break;
 		case 'P':
-			skel->rodata->print_shared_dsq = true;
+			skel->rodata->print_dsqs_and_events = true;
+			break;
+		case 'M':
+			skel->rodata->print_msgs = true;
 			break;
 		case 'H':
 			skel->rodata->highpri_boosting = true;
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/7] sched_ext: Use rhashtable_lookup() instead of rhashtable_lookup_fast()
  2025-09-22  1:32 [PATCH 1/7] sched_ext: Use rhashtable_lookup() instead of rhashtable_lookup_fast() Tejun Heo
                   ` (5 preceding siblings ...)
  2025-09-22  1:32 ` [PATCH 7/7] tools/sched_ext: scx_qmap: Make debug output quieter by default Tejun Heo
@ 2025-09-23  7:18 ` Andrea Righi
  2025-09-23 15:59   ` Tejun Heo
  2025-09-23 19:13 ` Tejun Heo
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 26+ messages in thread
From: Andrea Righi @ 2025-09-23  7:18 UTC (permalink / raw)
  To: Tejun Heo; +Cc: David Vernet, Changwoo Min, linux-kernel, sched-ext

Hi Tejun,

On Sun, Sep 21, 2025 at 03:32:40PM -1000, Tejun Heo wrote:
> The find_user_dsq() function is called from contexts that are already
> under RCU read lock protection. Switch from rhashtable_lookup_fast() to
> rhashtable_lookup() to avoid redundant RCU locking.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
>  kernel/sched/ext.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index f5873f8ed669..df433f6fab4b 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -207,7 +207,7 @@ static struct scx_dispatch_q *find_global_dsq(struct task_struct *p)
>  
>  static struct scx_dispatch_q *find_user_dsq(struct scx_sched *sch, u64 dsq_id)
>  {

Maybe we should add a WARN_ON_ONCE(!rcu_read_lock_held()) to make sure the
assumption is correct and catch potential non RCU read locked usage in the
future?

> -	return rhashtable_lookup_fast(&sch->dsq_hash, &dsq_id, dsq_hash_params);
> +	return rhashtable_lookup(&sch->dsq_hash, &dsq_id, dsq_hash_params);
>  }
>  
>  /*

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/7] sched_ext: Improve SCX_KF_DISPATCH comment
  2025-09-22  1:32 ` [PATCH 2/7] sched_ext: Improve SCX_KF_DISPATCH comment Tejun Heo
@ 2025-09-23  7:20   ` Andrea Righi
  0 siblings, 0 replies; 26+ messages in thread
From: Andrea Righi @ 2025-09-23  7:20 UTC (permalink / raw)
  To: Tejun Heo; +Cc: David Vernet, Changwoo Min, linux-kernel, sched-ext

On Sun, Sep 21, 2025 at 03:32:41PM -1000, Tejun Heo wrote:
> The comment for SCX_KF_DISPATCH was incomplete and didn't explain that
> ops.dispatch() may temporarily release the rq lock, allowing ENQUEUE and
> SELECT_CPU operations to be nested inside DISPATCH contexts.
> 
> Update the comment to clarify this nesting behavior and provide better
> context for when these operations can occur within dispatch.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>

Looks good.

Acked-by: Andrea Righi <arighi@nvidia.com>

> ---
>  include/linux/sched/ext.h | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
> index 7047101dbf58..d82b7a9b0658 100644
> --- a/include/linux/sched/ext.h
> +++ b/include/linux/sched/ext.h
> @@ -108,7 +108,11 @@ enum scx_kf_mask {
>  	SCX_KF_UNLOCKED		= 0,	  /* sleepable and not rq locked */
>  	/* ENQUEUE and DISPATCH may be nested inside CPU_RELEASE */
>  	SCX_KF_CPU_RELEASE	= 1 << 0, /* ops.cpu_release() */
> -	/* ops.dequeue (in REST) may be nested inside DISPATCH */
> +	/*
> +	 * ops.dispatch() may release rq lock temporarily and thus ENQUEUE and
> +	 * SELECT_CPU may be nested inside. ops.dequeue (in REST) may also be
> +	 * nested inside DISPATCH.
> +	 */
>  	SCX_KF_DISPATCH		= 1 << 1, /* ops.dispatch() */
>  	SCX_KF_ENQUEUE		= 1 << 2, /* ops.enqueue() and ops.select_cpu() */
>  	SCX_KF_SELECT_CPU	= 1 << 3, /* ops.select_cpu() */
> -- 
> 2.51.0
> 

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/7] sched_ext: Fix stray scx_root usage in task_can_run_on_remote_rq()
  2025-09-22  1:32 ` [PATCH 3/7] sched_ext: Fix stray scx_root usage in task_can_run_on_remote_rq() Tejun Heo
@ 2025-09-23  7:22   ` Andrea Righi
  0 siblings, 0 replies; 26+ messages in thread
From: Andrea Righi @ 2025-09-23  7:22 UTC (permalink / raw)
  To: Tejun Heo; +Cc: David Vernet, Changwoo Min, linux-kernel, sched-ext

On Sun, Sep 21, 2025 at 03:32:42PM -1000, Tejun Heo wrote:
> task_can_run_on_remote_rq() takes @sch but it is using scx_root when
> incrementing SCX_EV_DISPATCH_LOCAL_DSQ_OFFLINE, which is inconsistent and
> gets in the way of implementing multiple scheduler support. Use @sch
> instead. As currently scx_root is the only possible scheduler instance, this
> doesn't cause any behavior changes.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>

Good catch.

Acked-by: Andrea Righi <arighi@nvidia.com>

> ---
>  kernel/sched/ext.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index df433f6fab4b..5801ac676d59 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -1622,8 +1622,7 @@ static bool task_can_run_on_remote_rq(struct scx_sched *sch,
>  
>  	if (!scx_rq_online(rq)) {
>  		if (enforce)
> -			__scx_add_event(scx_root,
> -					SCX_EV_DISPATCH_LOCAL_DSQ_OFFLINE, 1);
> +			__scx_add_event(sch, SCX_EV_DISPATCH_LOCAL_DSQ_OFFLINE, 1);
>  		return false;
>  	}
>  
> -- 
> 2.51.0
> 

-Andrea

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/7] sched_ext: Use bitfields for boolean warning flags
  2025-09-22  1:32 ` [PATCH 4/7] sched_ext: Use bitfields for boolean warning flags Tejun Heo
@ 2025-09-23  7:45   ` Andrea Righi
  2025-09-23 16:00     ` Tejun Heo
  2025-09-23 16:58   ` [PATCH v2 " Tejun Heo
  1 sibling, 1 reply; 26+ messages in thread
From: Andrea Righi @ 2025-09-23  7:45 UTC (permalink / raw)
  To: Tejun Heo; +Cc: David Vernet, Changwoo Min, linux-kernel, sched-ext

On Sun, Sep 21, 2025 at 03:32:43PM -1000, Tejun Heo wrote:
> Convert warned_zero_slice and warned_deprecated_rq in scx_sched struct to
> single-bit bitfields to reduce struct size.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>

Hm... I think the struct size remains the same, at least in my build:
 - before: /* size: 1072, cachelines: 17, members: 14 */
 - after:  /* size: 1072, cachelines: 17, members: 14 */

Maybe if we add more attributes in the future?

Thanks,
-Andrea

> ---
>  kernel/sched/ext_internal.h | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h
> index 2e289931e567..1a80d01b1f0c 100644
> --- a/kernel/sched/ext_internal.h
> +++ b/kernel/sched/ext_internal.h
> @@ -871,8 +871,8 @@ struct scx_sched {
>  	struct scx_dispatch_q	**global_dsqs;
>  	struct scx_sched_pcpu __percpu *pcpu;
>  
> -	bool			warned_zero_slice;
> -	bool			warned_deprecated_rq;
> +	bool			warned_zero_slice:1;
> +	bool			warned_deprecated_rq:1;
>  
>  	atomic_t		exit_kind;
>  	struct scx_exit_info	*exit_info;
> -- 
> 2.51.0
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 5/7] sched_ext: Add SCX_EFLAG_INITIALIZED to indicate successful ops.init()
  2025-09-22  1:32 ` [PATCH 5/7] sched_ext: Add SCX_EFLAG_INITIALIZED to indicate successful ops.init() Tejun Heo
@ 2025-09-23  8:08   ` Andrea Righi
  0 siblings, 0 replies; 26+ messages in thread
From: Andrea Righi @ 2025-09-23  8:08 UTC (permalink / raw)
  To: Tejun Heo; +Cc: David Vernet, Changwoo Min, linux-kernel, sched-ext

On Sun, Sep 21, 2025 at 03:32:44PM -1000, Tejun Heo wrote:
> ops.exit() may be called even if the loading failed before ops.init()
> finishes successfully. This is because ops.exit() allows rich exit info
> communication. Add SCX_EFLAG_INITIALIZED flag to scx_exit_info.flags to
> indicate whether ops.init() finished successfully.
> 
> This enables BPF schedulers to distinguish between exit scenarios and
> handle cleanup appropriately based on initialization state.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>

This can be useful, we could update UEI_REPORT() to show the flag, but we
can also do it later in a separate patch.

Acked-by: Andrea Righi <arighi@nvidia.com>

> ---
>  kernel/sched/ext.c          |  1 +
>  kernel/sched/ext_internal.h | 13 +++++++++++++
>  2 files changed, 14 insertions(+)
> 
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 5801ac676d59..d131e98156ac 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -4554,6 +4554,7 @@ static int scx_enable(struct sched_ext_ops *ops, struct bpf_link *link)
>  			scx_error(sch, "ops.init() failed (%d)", ret);
>  			goto err_disable;
>  		}
> +		sch->exit_info->flags |= SCX_EFLAG_INITIALIZED;
>  	}
>  
>  	for (i = SCX_OPI_CPU_HOTPLUG_BEGIN; i < SCX_OPI_CPU_HOTPLUG_END; i++)
> diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h
> index 1a80d01b1f0c..b3617abed510 100644
> --- a/kernel/sched/ext_internal.h
> +++ b/kernel/sched/ext_internal.h
> @@ -62,6 +62,16 @@ enum scx_exit_code {
>  	SCX_ECODE_ACT_RESTART	= 1LLU << 48,
>  };
>  
> +enum scx_exit_flags {
> +	/*
> +	 * ops.exit() may be called even if the loading failed before ops.init()
> +	 * finishes successfully. This is because ops.exit() allows rich exit
> +	 * info communication. The following flag indicates whether ops.init()
> +	 * finished successfully.
> +	 */
> +	SCX_EFLAG_INITIALIZED	= 1LLU << 0,
> +};
> +
>  /*
>   * scx_exit_info is passed to ops.exit() to describe why the BPF scheduler is
>   * being disabled.
> @@ -73,6 +83,9 @@ struct scx_exit_info {
>  	/* exit code if gracefully exiting */
>  	s64			exit_code;
>  
> +	/* %SCX_EFLAG_* */
> +	u64			flags;
> +
>  	/* textual representation of the above */
>  	const char		*reason;
>  
> -- 
> 2.51.0
> 

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 6/7] sched_ext: Make qmap dump operation non-destructive
  2025-09-22  1:32 ` [PATCH 6/7] sched_ext: Make qmap dump operation non-destructive Tejun Heo
@ 2025-09-23  8:09   ` Andrea Righi
  0 siblings, 0 replies; 26+ messages in thread
From: Andrea Righi @ 2025-09-23  8:09 UTC (permalink / raw)
  To: Tejun Heo; +Cc: David Vernet, Changwoo Min, linux-kernel, sched-ext

On Sun, Sep 21, 2025 at 03:32:45PM -1000, Tejun Heo wrote:
> The qmap dump operation was destructively consuming queue entries while
> displaying them. As dump can be triggered anytime, this can easily lead to
> stalls. Add a temporary dump_store queue and modify the dump logic to pop
> entries, display them, and then restore them back to the original queue.
> This allows dump operations to be performed without affecting the
> scheduler's queue state.
> 
> Note that if racing against new enqueues during dump, ordering can get
> mixed up, but this is acceptable for debugging purposes.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>

Makes sense.

Acked-by: Andrea Righi <arighi@nvidia.com>

Thanks,
-Andrea

> ---
>  tools/sched_ext/scx_qmap.bpf.c | 18 +++++++++++++++++-
>  1 file changed, 17 insertions(+), 1 deletion(-)
> 
> diff --git a/tools/sched_ext/scx_qmap.bpf.c b/tools/sched_ext/scx_qmap.bpf.c
> index 69d877501cb7..cd50a94326e3 100644
> --- a/tools/sched_ext/scx_qmap.bpf.c
> +++ b/tools/sched_ext/scx_qmap.bpf.c
> @@ -56,7 +56,8 @@ struct qmap {
>    queue1 SEC(".maps"),
>    queue2 SEC(".maps"),
>    queue3 SEC(".maps"),
> -  queue4 SEC(".maps");
> +  queue4 SEC(".maps"),
> +  dump_store SEC(".maps");
>  
>  struct {
>  	__uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS);
> @@ -578,11 +579,26 @@ void BPF_STRUCT_OPS(qmap_dump, struct scx_dump_ctx *dctx)
>  			return;
>  
>  		scx_bpf_dump("QMAP FIFO[%d]:", i);
> +
> +		/*
> +		 * Dump can be invoked anytime and there is no way to iterate in
> +		 * a non-destructive way. Pop and store in dump_store and then
> +		 * restore afterwards. If racing against new enqueues, ordering
> +		 * can get mixed up.
> +		 */
>  		bpf_repeat(4096) {
>  			if (bpf_map_pop_elem(fifo, &pid))
>  				break;
> +			bpf_map_push_elem(&dump_store, &pid, 0);
>  			scx_bpf_dump(" %d", pid);
>  		}
> +
> +		bpf_repeat(4096) {
> +			if (bpf_map_pop_elem(&dump_store, &pid))
> +				break;
> +			bpf_map_push_elem(fifo, &pid, 0);
> +		}
> +
>  		scx_bpf_dump("\n");
>  	}
>  }
> -- 
> 2.51.0
> 


* Re: [PATCH 7/7] tools/sched_ext: scx_qmap: Make debug output quieter by default
  2025-09-22  1:32 ` [PATCH 7/7] tools/sched_ext: scx_qmap: Make debug output quieter by default Tejun Heo
@ 2025-09-23  8:11   ` Andrea Righi
  0 siblings, 0 replies; 26+ messages in thread
From: Andrea Righi @ 2025-09-23  8:11 UTC (permalink / raw)
  To: Tejun Heo; +Cc: David Vernet, Changwoo Min, linux-kernel, sched-ext

On Sun, Sep 21, 2025 at 03:32:46PM -1000, Tejun Heo wrote:
> scx_qmap currently outputs verbose debug messages including cgroup operations
> and CPU online/offline events by default, which can be noisy during normal
> operation. While the existing -P option controls DSQ dumps and event
> statistics, there's no way to suppress the other debug messages.
> 
> Split the debug output controls to make scx_qmap quieter by default. The -P
> option continues to control DSQ dumps and event statistics
> (print_dsqs_and_events), while a new -M option controls debug messages like
> cgroup operations and CPU events (print_msgs). This allows users to run
> scx_qmap with minimal output and selectively enable debug information as
> needed.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>

Acked-by: Andrea Righi <arighi@nvidia.com>

Thanks,
-Andrea

> ---
>  tools/sched_ext/scx_qmap.bpf.c | 80 +++++++++++++++++++---------------
>  tools/sched_ext/scx_qmap.c     | 12 +++--
>  2 files changed, 53 insertions(+), 39 deletions(-)
> 
> diff --git a/tools/sched_ext/scx_qmap.bpf.c b/tools/sched_ext/scx_qmap.bpf.c
> index cd50a94326e3..3072b593f898 100644
> --- a/tools/sched_ext/scx_qmap.bpf.c
> +++ b/tools/sched_ext/scx_qmap.bpf.c
> @@ -39,7 +39,8 @@ const volatile u32 stall_kernel_nth;
>  const volatile u32 dsp_inf_loop_after;
>  const volatile u32 dsp_batch;
>  const volatile bool highpri_boosting;
> -const volatile bool print_shared_dsq;
> +const volatile bool print_dsqs_and_events;
> +const volatile bool print_msgs;
>  const volatile s32 disallow_tgid;
>  const volatile bool suppress_dump;
>  
> @@ -633,22 +634,25 @@ void BPF_STRUCT_OPS(qmap_dump_task, struct scx_dump_ctx *dctx, struct task_struc
>  
>  s32 BPF_STRUCT_OPS(qmap_cgroup_init, struct cgroup *cgrp, struct scx_cgroup_init_args *args)
>  {
> -	bpf_printk("CGRP INIT %llu weight=%u period=%lu quota=%ld burst=%lu",
> -		   cgrp->kn->id, args->weight, args->bw_period_us,
> -		   args->bw_quota_us, args->bw_burst_us);
> +	if (print_msgs)
> +		bpf_printk("CGRP INIT %llu weight=%u period=%lu quota=%ld burst=%lu",
> +			   cgrp->kn->id, args->weight, args->bw_period_us,
> +			   args->bw_quota_us, args->bw_burst_us);
>  	return 0;
>  }
>  
>  void BPF_STRUCT_OPS(qmap_cgroup_set_weight, struct cgroup *cgrp, u32 weight)
>  {
> -	bpf_printk("CGRP SET %llu weight=%u", cgrp->kn->id, weight);
> +	if (print_msgs)
> +		bpf_printk("CGRP SET %llu weight=%u", cgrp->kn->id, weight);
>  }
>  
>  void BPF_STRUCT_OPS(qmap_cgroup_set_bandwidth, struct cgroup *cgrp,
>  		    u64 period_us, u64 quota_us, u64 burst_us)
>  {
> -	bpf_printk("CGRP SET %llu period=%lu quota=%ld burst=%lu", cgrp->kn->id,
> -		   period_us, quota_us, burst_us);
> +	if (print_msgs)
> +		bpf_printk("CGRP SET %llu period=%lu quota=%ld burst=%lu",
> +			   cgrp->kn->id, period_us, quota_us, burst_us);
>  }
>  
>  /*
> @@ -692,16 +696,20 @@ static void print_cpus(void)
>  
>  void BPF_STRUCT_OPS(qmap_cpu_online, s32 cpu)
>  {
> -	bpf_printk("CPU %d coming online", cpu);
> -	/* @cpu is already online at this point */
> -	print_cpus();
> +	if (print_msgs) {
> +		bpf_printk("CPU %d coming online", cpu);
> +		/* @cpu is already online at this point */
> +		print_cpus();
> +	}
>  }
>  
>  void BPF_STRUCT_OPS(qmap_cpu_offline, s32 cpu)
>  {
> -	bpf_printk("CPU %d going offline", cpu);
> -	/* @cpu is still online at this point */
> -	print_cpus();
> +	if (print_msgs) {
> +		bpf_printk("CPU %d going offline", cpu);
> +		/* @cpu is still online at this point */
> +		print_cpus();
> +	}
>  }
>  
>  struct monitor_timer {
> @@ -799,35 +807,36 @@ static void dump_shared_dsq(void)
>  
>  static int monitor_timerfn(void *map, int *key, struct bpf_timer *timer)
>  {
> -	struct scx_event_stats events;
> -
>  	bpf_rcu_read_lock();
>  	dispatch_highpri(true);
>  	bpf_rcu_read_unlock();
>  
>  	monitor_cpuperf();
>  
> -	if (print_shared_dsq)
> +	if (print_dsqs_and_events) {
> +		struct scx_event_stats events;
> +
>  		dump_shared_dsq();
>  
> -	__COMPAT_scx_bpf_events(&events, sizeof(events));
> -
> -	bpf_printk("%35s: %lld", "SCX_EV_SELECT_CPU_FALLBACK",
> -		   scx_read_event(&events, SCX_EV_SELECT_CPU_FALLBACK));
> -	bpf_printk("%35s: %lld", "SCX_EV_DISPATCH_LOCAL_DSQ_OFFLINE",
> -		   scx_read_event(&events, SCX_EV_DISPATCH_LOCAL_DSQ_OFFLINE));
> -	bpf_printk("%35s: %lld", "SCX_EV_DISPATCH_KEEP_LAST",
> -		   scx_read_event(&events, SCX_EV_DISPATCH_KEEP_LAST));
> -	bpf_printk("%35s: %lld", "SCX_EV_ENQ_SKIP_EXITING",
> -		   scx_read_event(&events, SCX_EV_ENQ_SKIP_EXITING));
> -	bpf_printk("%35s: %lld", "SCX_EV_REFILL_SLICE_DFL",
> -		   scx_read_event(&events, SCX_EV_REFILL_SLICE_DFL));
> -	bpf_printk("%35s: %lld", "SCX_EV_BYPASS_DURATION",
> -		   scx_read_event(&events, SCX_EV_BYPASS_DURATION));
> -	bpf_printk("%35s: %lld", "SCX_EV_BYPASS_DISPATCH",
> -		   scx_read_event(&events, SCX_EV_BYPASS_DISPATCH));
> -	bpf_printk("%35s: %lld", "SCX_EV_BYPASS_ACTIVATE",
> -		   scx_read_event(&events, SCX_EV_BYPASS_ACTIVATE));
> +		__COMPAT_scx_bpf_events(&events, sizeof(events));
> +
> +		bpf_printk("%35s: %lld", "SCX_EV_SELECT_CPU_FALLBACK",
> +			   scx_read_event(&events, SCX_EV_SELECT_CPU_FALLBACK));
> +		bpf_printk("%35s: %lld", "SCX_EV_DISPATCH_LOCAL_DSQ_OFFLINE",
> +			   scx_read_event(&events, SCX_EV_DISPATCH_LOCAL_DSQ_OFFLINE));
> +		bpf_printk("%35s: %lld", "SCX_EV_DISPATCH_KEEP_LAST",
> +			   scx_read_event(&events, SCX_EV_DISPATCH_KEEP_LAST));
> +		bpf_printk("%35s: %lld", "SCX_EV_ENQ_SKIP_EXITING",
> +			   scx_read_event(&events, SCX_EV_ENQ_SKIP_EXITING));
> +		bpf_printk("%35s: %lld", "SCX_EV_REFILL_SLICE_DFL",
> +			   scx_read_event(&events, SCX_EV_REFILL_SLICE_DFL));
> +		bpf_printk("%35s: %lld", "SCX_EV_BYPASS_DURATION",
> +			   scx_read_event(&events, SCX_EV_BYPASS_DURATION));
> +		bpf_printk("%35s: %lld", "SCX_EV_BYPASS_DISPATCH",
> +			   scx_read_event(&events, SCX_EV_BYPASS_DISPATCH));
> +		bpf_printk("%35s: %lld", "SCX_EV_BYPASS_ACTIVATE",
> +			   scx_read_event(&events, SCX_EV_BYPASS_ACTIVATE));
> +	}
>  
>  	bpf_timer_start(timer, ONE_SEC_IN_NS, 0);
>  	return 0;
> @@ -839,7 +848,8 @@ s32 BPF_STRUCT_OPS_SLEEPABLE(qmap_init)
>  	struct bpf_timer *timer;
>  	s32 ret;
>  
> -	print_cpus();
> +	if (print_msgs)
> +		print_cpus();
>  
>  	ret = scx_bpf_create_dsq(SHARED_DSQ, -1);
>  	if (ret)
> diff --git a/tools/sched_ext/scx_qmap.c b/tools/sched_ext/scx_qmap.c
> index c4912ab2e76f..ef701d45ba43 100644
> --- a/tools/sched_ext/scx_qmap.c
> +++ b/tools/sched_ext/scx_qmap.c
> @@ -20,7 +20,7 @@ const char help_fmt[] =
>  "See the top-level comment in .bpf.c for more details.\n"
>  "\n"
>  "Usage: %s [-s SLICE_US] [-e COUNT] [-t COUNT] [-T COUNT] [-l COUNT] [-b COUNT]\n"
> -"       [-P] [-d PID] [-D LEN] [-p] [-v]\n"
> +"       [-P] [-M] [-d PID] [-D LEN] [-p] [-v]\n"
>  "\n"
>  "  -s SLICE_US   Override slice duration\n"
>  "  -e COUNT      Trigger scx_bpf_error() after COUNT enqueues\n"
> @@ -28,7 +28,8 @@ const char help_fmt[] =
>  "  -T COUNT      Stall every COUNT'th kernel thread\n"
>  "  -l COUNT      Trigger dispatch infinite looping after COUNT dispatches\n"
>  "  -b COUNT      Dispatch upto COUNT tasks together\n"
> -"  -P            Print out DSQ content to trace_pipe every second, use with -b\n"
> +"  -P            Print out DSQ content and event counters to trace_pipe every second\n"
> +"  -M            Print out debug messages to trace_pipe\n"
>  "  -H            Boost nice -20 tasks in SHARED_DSQ, use with -b\n"
>  "  -d PID        Disallow a process from switching into SCHED_EXT (-1 for self)\n"
>  "  -D LEN        Set scx_exit_info.dump buffer length\n"
> @@ -66,7 +67,7 @@ int main(int argc, char **argv)
>  
>  	skel->rodata->slice_ns = __COMPAT_ENUM_OR_ZERO("scx_public_consts", "SCX_SLICE_DFL");
>  
> -	while ((opt = getopt(argc, argv, "s:e:t:T:l:b:PHd:D:Spvh")) != -1) {
> +	while ((opt = getopt(argc, argv, "s:e:t:T:l:b:PMHd:D:Spvh")) != -1) {
>  		switch (opt) {
>  		case 's':
>  			skel->rodata->slice_ns = strtoull(optarg, NULL, 0) * 1000;
> @@ -87,7 +88,10 @@ int main(int argc, char **argv)
>  			skel->rodata->dsp_batch = strtoul(optarg, NULL, 0);
>  			break;
>  		case 'P':
> -			skel->rodata->print_shared_dsq = true;
> +			skel->rodata->print_dsqs_and_events = true;
> +			break;
> +		case 'M':
> +			skel->rodata->print_msgs = true;
>  			break;
>  		case 'H':
>  			skel->rodata->highpri_boosting = true;
> -- 
> 2.51.0
> 


* Re: [PATCH 1/7] sched_ext: Use rhashtable_lookup() instead of rhashtable_lookup_fast()
  2025-09-23  7:18 ` [PATCH 1/7] sched_ext: Use rhashtable_lookup() instead of rhashtable_lookup_fast() Andrea Righi
@ 2025-09-23 15:59   ` Tejun Heo
  2025-09-23 18:47     ` Andrea Righi
  0 siblings, 1 reply; 26+ messages in thread
From: Tejun Heo @ 2025-09-23 15:59 UTC (permalink / raw)
  To: Andrea Righi; +Cc: David Vernet, Changwoo Min, linux-kernel, sched-ext

On Tue, Sep 23, 2025 at 09:18:33AM +0200, Andrea Righi wrote:
> Hi Tejun,
> 
> On Sun, Sep 21, 2025 at 03:32:40PM -1000, Tejun Heo wrote:
> > The find_user_dsq() function is called from contexts that are already
> > under RCU read lock protection. Switch from rhashtable_lookup_fast() to
> > rhashtable_lookup() to avoid redundant RCU locking.
> > 
> > Signed-off-by: Tejun Heo <tj@kernel.org>
> > ---
> >  kernel/sched/ext.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> > index f5873f8ed669..df433f6fab4b 100644
> > --- a/kernel/sched/ext.c
> > +++ b/kernel/sched/ext.c
> > @@ -207,7 +207,7 @@ static struct scx_dispatch_q *find_global_dsq(struct task_struct *p)
> >  
> >  static struct scx_dispatch_q *find_user_dsq(struct scx_sched *sch, u64 dsq_id)
> >  {
> 
> Maybe we should add a WARN_ON_ONCE(!rcu_read_lock_held()) to make sure the
> assumption is correct and catch potential non RCU read locked usage in the
> future?

rhashtable_lookup() already has that as it uses rcu_dereference_check()
internally.

Thanks.

-- 
tejun


* Re: [PATCH 4/7] sched_ext: Use bitfields for boolean warning flags
  2025-09-23  7:45   ` Andrea Righi
@ 2025-09-23 16:00     ` Tejun Heo
  0 siblings, 0 replies; 26+ messages in thread
From: Tejun Heo @ 2025-09-23 16:00 UTC (permalink / raw)
  To: Andrea Righi; +Cc: David Vernet, Changwoo Min, linux-kernel, sched-ext

On Tue, Sep 23, 2025 at 09:45:26AM +0200, Andrea Righi wrote:
> On Sun, Sep 21, 2025 at 03:32:43PM -1000, Tejun Heo wrote:
> > Convert warned_zero_slice and warned_deprecated_rq in scx_sched struct to
> > single-bit bitfields to reduce struct size.
> > 
> > Signed-off-by: Tejun Heo <tj@kernel.org>
> 
> Hm... I think the struct size remains the same, at least in my build:
>  - before: /* size: 1072, cachelines: 17, members: 14 */
>  - after:  /* size: 1072, cachelines: 17, members: 14 */
> 
> Maybe if we add more attributes in the future?

Yeah, just so that we don't keep expanding the struct as we add flags. I'll
update the description.

Thanks.

-- 
tejun


* [PATCH v2 4/7] sched_ext: Use bitfields for boolean warning flags
  2025-09-22  1:32 ` [PATCH 4/7] sched_ext: Use bitfields for boolean warning flags Tejun Heo
  2025-09-23  7:45   ` Andrea Righi
@ 2025-09-23 16:58   ` Tejun Heo
  2025-09-23 18:45     ` Andrea Righi
  1 sibling, 1 reply; 26+ messages in thread
From: Tejun Heo @ 2025-09-23 16:58 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min; +Cc: linux-kernel, sched-ext

Convert warned_zero_slice and warned_deprecated_rq in scx_sched struct to
single-bit bitfields. While this doesn't reduce struct size immediately,
it prepares for future bitfield additions.

v2: Update patch description.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/sched/ext_internal.h |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/kernel/sched/ext_internal.h
+++ b/kernel/sched/ext_internal.h
@@ -871,8 +871,8 @@ struct scx_sched {
 	struct scx_dispatch_q	**global_dsqs;
 	struct scx_sched_pcpu __percpu *pcpu;
 
-	bool			warned_zero_slice;
-	bool			warned_deprecated_rq;
+	bool			warned_zero_slice:1;
+	bool			warned_deprecated_rq:1;
 
 	atomic_t		exit_kind;
 	struct scx_exit_info	*exit_info;


* Re: [PATCH v2 4/7] sched_ext: Use bitfields for boolean warning flags
  2025-09-23 16:58   ` [PATCH v2 " Tejun Heo
@ 2025-09-23 18:45     ` Andrea Righi
  0 siblings, 0 replies; 26+ messages in thread
From: Andrea Righi @ 2025-09-23 18:45 UTC (permalink / raw)
  To: Tejun Heo; +Cc: David Vernet, Changwoo Min, linux-kernel, sched-ext

On Tue, Sep 23, 2025 at 06:58:06AM -1000, Tejun Heo wrote:
> Convert warned_zero_slice and warned_deprecated_rq in scx_sched struct to
> single-bit bitfields. While this doesn't reduce struct size immediately,
> it prepares for future bitfield additions.
> 
> v2: Update patch description.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>

Acked-by: Andrea Righi <arighi@nvidia.com>

Thanks,
-Andrea

> ---
>  kernel/sched/ext_internal.h |    4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> --- a/kernel/sched/ext_internal.h
> +++ b/kernel/sched/ext_internal.h
> @@ -871,8 +871,8 @@ struct scx_sched {
>  	struct scx_dispatch_q	**global_dsqs;
>  	struct scx_sched_pcpu __percpu *pcpu;
>  
> -	bool			warned_zero_slice;
> -	bool			warned_deprecated_rq;
> +	bool			warned_zero_slice:1;
> +	bool			warned_deprecated_rq:1;
>  
>  	atomic_t		exit_kind;
>  	struct scx_exit_info	*exit_info;


* Re: [PATCH 1/7] sched_ext: Use rhashtable_lookup() instead of rhashtable_lookup_fast()
  2025-09-23 15:59   ` Tejun Heo
@ 2025-09-23 18:47     ` Andrea Righi
  0 siblings, 0 replies; 26+ messages in thread
From: Andrea Righi @ 2025-09-23 18:47 UTC (permalink / raw)
  To: Tejun Heo; +Cc: David Vernet, Changwoo Min, linux-kernel, sched-ext

On Tue, Sep 23, 2025 at 05:59:47AM -1000, Tejun Heo wrote:
> On Tue, Sep 23, 2025 at 09:18:33AM +0200, Andrea Righi wrote:
> > Hi Tejun,
> > 
> > On Sun, Sep 21, 2025 at 03:32:40PM -1000, Tejun Heo wrote:
> > > The find_user_dsq() function is called from contexts that are already
> > > under RCU read lock protection. Switch from rhashtable_lookup_fast() to
> > > rhashtable_lookup() to avoid redundant RCU locking.
> > > 
> > > Signed-off-by: Tejun Heo <tj@kernel.org>
> > > ---
> > >  kernel/sched/ext.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> > > index f5873f8ed669..df433f6fab4b 100644
> > > --- a/kernel/sched/ext.c
> > > +++ b/kernel/sched/ext.c
> > > @@ -207,7 +207,7 @@ static struct scx_dispatch_q *find_global_dsq(struct task_struct *p)
> > >  
> > >  static struct scx_dispatch_q *find_user_dsq(struct scx_sched *sch, u64 dsq_id)
> > >  {
> > 
> > Maybe we should add a WARN_ON_ONCE(!rcu_read_lock_held()) to make sure the
> > assumption is correct and catch potential non RCU read locked usage in the
> > future?
> 
> rhashtable_lookup() already has that as it uses rcu_dereference_check()
> internally.

Oh that's right, I see it now. Then looks good.

Acked-by: Andrea Righi <arighi@nvidia.com>

-Andrea


* Re: [PATCH 1/7] sched_ext: Use rhashtable_lookup() instead of rhashtable_lookup_fast()
  2025-09-22  1:32 [PATCH 1/7] sched_ext: Use rhashtable_lookup() instead of rhashtable_lookup_fast() Tejun Heo
                   ` (6 preceding siblings ...)
  2025-09-23  7:18 ` [PATCH 1/7] sched_ext: Use rhashtable_lookup() instead of rhashtable_lookup_fast() Andrea Righi
@ 2025-09-23 19:13 ` Tejun Heo
  2025-09-24  6:14 ` Andrea Righi
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 26+ messages in thread
From: Tejun Heo @ 2025-09-23 19:13 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min; +Cc: linux-kernel, sched-ext

Applied 1-7 to sched_ext/for-6.18.

Thanks.

-- 
tejun


* Re: [PATCH 1/7] sched_ext: Use rhashtable_lookup() instead of rhashtable_lookup_fast()
  2025-09-22  1:32 [PATCH 1/7] sched_ext: Use rhashtable_lookup() instead of rhashtable_lookup_fast() Tejun Heo
                   ` (7 preceding siblings ...)
  2025-09-23 19:13 ` Tejun Heo
@ 2025-09-24  6:14 ` Andrea Righi
  2025-09-24  6:38   ` Tejun Heo
  2025-10-22 21:37 ` Andrea Righi
  2025-10-22 21:56 ` Tejun Heo
  10 siblings, 1 reply; 26+ messages in thread
From: Andrea Righi @ 2025-09-24  6:14 UTC (permalink / raw)
  To: Tejun Heo; +Cc: David Vernet, Changwoo Min, linux-kernel, sched-ext

Hi Tejun,

On Sun, Sep 21, 2025 at 03:32:40PM -1000, Tejun Heo wrote:
> The find_user_dsq() function is called from contexts that are already
> under RCU read lock protection. Switch from rhashtable_lookup_fast() to
> rhashtable_lookup() to avoid redundant RCU locking.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>

It looks like the ttwu_queue() path isn't RCU read lock protected.
With this applied:

[    6.647598] =============================
[    6.647603] WARNING: suspicious RCU usage
[    6.647605] 6.17.0-rc7-virtme #1 Not tainted
[    6.647608] -----------------------------
[    6.647608] ./include/linux/rhashtable.h:602 suspicious rcu_dereference_check() usage!
[    6.647610]
[    6.647610] other info that might help us debug this:
[    6.647610]
[    6.647612]
[    6.647612] rcu_scheduler_active = 2, debug_locks = 1
[    6.647613] 1 lock held by swapper/10/0:
[    6.647614]  #0: ffff8b14bbb3cc98 (&rq->__lock){-.-.}-{2:2}, at: raw_spin_rq_lock_nested+0x20/0x90
[    6.647630]
[    6.647630] stack backtrace:
[    6.647633] CPU: 10 UID: 0 PID: 0 Comm: swapper/10 Not tainted 6.17.0-rc7-virtme #1 PREEMPT(full)
[    6.647643] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[    6.647646] Sched_ext: beerland_1.0.2_g27d63fc3_x86_64_unknown_linux_gnu (enabled+all)
[    6.647648] Call Trace:
[    6.647652]  <IRQ>
[    6.647655]  dump_stack_lvl+0x78/0xe0
[    6.647665]  lockdep_rcu_suspicious+0x14a/0x1b0
[    6.647672]  __rhashtable_lookup.constprop.0+0x1d5/0x250
[    6.647680]  find_dsq_for_dispatch+0xbc/0x190
[    6.647684]  do_enqueue_task+0x25b/0x550
[    6.647689]  enqueue_task_scx+0x21d/0x360
[    6.647692]  ? trace_lock_acquire+0x22/0xb0
[    6.647695]  enqueue_task+0x2e/0xd0
[    6.647698]  ttwu_do_activate+0xa2/0x290
[    6.647703]  sched_ttwu_pending+0xfd/0x250
[    6.647706]  __flush_smp_call_function_queue+0x1cd/0x610
[    6.647714]  __sysvec_call_function_single+0x34/0x150
[    6.647720]  sysvec_call_function_single+0x6e/0x80
[    6.647726]  </IRQ>
[    6.647726]  <TASK>
[    6.647727]  asm_sysvec_call_function_single+0x1a/0x20

Should we revert this?

Thanks,
-Andrea

> ---
>  kernel/sched/ext.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index f5873f8ed669..df433f6fab4b 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -207,7 +207,7 @@ static struct scx_dispatch_q *find_global_dsq(struct task_struct *p)
>  
>  static struct scx_dispatch_q *find_user_dsq(struct scx_sched *sch, u64 dsq_id)
>  {
> -	return rhashtable_lookup_fast(&sch->dsq_hash, &dsq_id, dsq_hash_params);
> +	return rhashtable_lookup(&sch->dsq_hash, &dsq_id, dsq_hash_params);
>  }
>  
>  /*
> -- 
> 2.51.0
> 


* Re: [PATCH 1/7] sched_ext: Use rhashtable_lookup() instead of rhashtable_lookup_fast()
  2025-09-24  6:14 ` Andrea Righi
@ 2025-09-24  6:38   ` Tejun Heo
  2025-09-24  8:26     ` Andrea Righi
  0 siblings, 1 reply; 26+ messages in thread
From: Tejun Heo @ 2025-09-24  6:38 UTC (permalink / raw)
  To: Andrea Righi; +Cc: David Vernet, Changwoo Min, linux-kernel, sched-ext

Hello,

On Wed, Sep 24, 2025 at 08:14:03AM +0200, Andrea Righi wrote:
> Hi Tejun
> 
> On Sun, Sep 21, 2025 at 03:32:40PM -1000, Tejun Heo wrote:
> > The find_user_dsq() function is called from contexts that are already
> > under RCU read lock protection. Switch from rhashtable_lookup_fast() to
> > rhashtable_lookup() to avoid redundant RCU locking.
> > 
> > Signed-off-by: Tejun Heo <tj@kernel.org>
> 
> It looks like the ttwu_queue() path isn't RCU read lock protected.
> With this applied:
> 
> [    6.647598] =============================
> [    6.647603] WARNING: suspicious RCU usage
> [    6.647605] 6.17.0-rc7-virtme #1 Not tainted
> [    6.647608] -----------------------------
> [    6.647608] ./include/linux/rhashtable.h:602 suspicious rcu_dereference_check() usage!
> [    6.647610]
> [    6.647610] other info that might help us debug this:
> [    6.647610]
> [    6.647612]
> [    6.647612] rcu_scheduler_active = 2, debug_locks = 1
> [    6.647613] 1 lock held by swapper/10/0:
> [    6.647614]  #0: ffff8b14bbb3cc98 (&rq->__lock){-.-.}-{2:2}, at: raw_spin_rq_lock_nested+0x20/0x90
> [    6.647630]
> [    6.647630] stack backtrace:
> [    6.647633] CPU: 10 UID: 0 PID: 0 Comm: swapper/10 Not tainted 6.17.0-rc7-virtme #1 PREEMPT(full)
> [    6.647643] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> [    6.647646] Sched_ext: beerland_1.0.2_g27d63fc3_x86_64_unknown_linux_gnu (enabled+all)
> [    6.647648] Call Trace:
> [    6.647652]  <IRQ>
> [    6.647655]  dump_stack_lvl+0x78/0xe0
> [    6.647665]  lockdep_rcu_suspicious+0x14a/0x1b0
> [    6.647672]  __rhashtable_lookup.constprop.0+0x1d5/0x250
> [    6.647680]  find_dsq_for_dispatch+0xbc/0x190
> [    6.647684]  do_enqueue_task+0x25b/0x550
> [    6.647689]  enqueue_task_scx+0x21d/0x360
> [    6.647692]  ? trace_lock_acquire+0x22/0xb0
> [    6.647695]  enqueue_task+0x2e/0xd0
> [    6.647698]  ttwu_do_activate+0xa2/0x290
> [    6.647703]  sched_ttwu_pending+0xfd/0x250
> [    6.647706]  __flush_smp_call_function_queue+0x1cd/0x610
> [    6.647714]  __sysvec_call_function_single+0x34/0x150
> [    6.647720]  sysvec_call_function_single+0x6e/0x80
> [    6.647726]  </IRQ>
> [    6.647726]  <TASK>
> [    6.647727]  asm_sysvec_call_function_single+0x1a/0x20
> 
> Should we revert this?

IRQ is disabled, so it is in RCU protected region but the lockdep annotation
isn't happy with it. :-( I'll revert the patch for now.

Thanks.

-- 
tejun


* Re: [PATCH 1/7] sched_ext: Use rhashtable_lookup() instead of rhashtable_lookup_fast()
  2025-09-24  6:38   ` Tejun Heo
@ 2025-09-24  8:26     ` Andrea Righi
  2025-09-24 18:00       ` Tejun Heo
  0 siblings, 1 reply; 26+ messages in thread
From: Andrea Righi @ 2025-09-24  8:26 UTC (permalink / raw)
  To: Tejun Heo; +Cc: David Vernet, Changwoo Min, linux-kernel, sched-ext

On Tue, Sep 23, 2025 at 08:38:05PM -1000, Tejun Heo wrote:
> Hello,
> 
> On Wed, Sep 24, 2025 at 08:14:03AM +0200, Andrea Righi wrote:
> > Hi Tejun
> > 
> > On Sun, Sep 21, 2025 at 03:32:40PM -1000, Tejun Heo wrote:
> > > The find_user_dsq() function is called from contexts that are already
> > > under RCU read lock protection. Switch from rhashtable_lookup_fast() to
> > > rhashtable_lookup() to avoid redundant RCU locking.
> > > 
> > > Signed-off-by: Tejun Heo <tj@kernel.org>
> > 
> > It looks like the ttwu_queue() path isn't RCU read lock protected.
> > With this applied:
> > 
> > [    6.647598] =============================
> > [    6.647603] WARNING: suspicious RCU usage
> > [    6.647605] 6.17.0-rc7-virtme #1 Not tainted
> > [    6.647608] -----------------------------
> > [    6.647608] ./include/linux/rhashtable.h:602 suspicious rcu_dereference_check() usage!
> > [    6.647610]
> > [    6.647610] other info that might help us debug this:
> > [    6.647610]
> > [    6.647612]
> > [    6.647612] rcu_scheduler_active = 2, debug_locks = 1
> > [    6.647613] 1 lock held by swapper/10/0:
> > [    6.647614]  #0: ffff8b14bbb3cc98 (&rq->__lock){-.-.}-{2:2}, at: raw_spin_rq_lock_nested+0x20/0x90
> > [    6.647630]
> > [    6.647630] stack backtrace:
> > [    6.647633] CPU: 10 UID: 0 PID: 0 Comm: swapper/10 Not tainted 6.17.0-rc7-virtme #1 PREEMPT(full)
> > [    6.647643] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> > [    6.647646] Sched_ext: beerland_1.0.2_g27d63fc3_x86_64_unknown_linux_gnu (enabled+all)
> > [    6.647648] Call Trace:
> > [    6.647652]  <IRQ>
> > [    6.647655]  dump_stack_lvl+0x78/0xe0
> > [    6.647665]  lockdep_rcu_suspicious+0x14a/0x1b0
> > [    6.647672]  __rhashtable_lookup.constprop.0+0x1d5/0x250
> > [    6.647680]  find_dsq_for_dispatch+0xbc/0x190
> > [    6.647684]  do_enqueue_task+0x25b/0x550
> > [    6.647689]  enqueue_task_scx+0x21d/0x360
> > [    6.647692]  ? trace_lock_acquire+0x22/0xb0
> > [    6.647695]  enqueue_task+0x2e/0xd0
> > [    6.647698]  ttwu_do_activate+0xa2/0x290
> > [    6.647703]  sched_ttwu_pending+0xfd/0x250
> > [    6.647706]  __flush_smp_call_function_queue+0x1cd/0x610
> > [    6.647714]  __sysvec_call_function_single+0x34/0x150
> > [    6.647720]  sysvec_call_function_single+0x6e/0x80
> > [    6.647726]  </IRQ>
> > [    6.647726]  <TASK>
> > [    6.647727]  asm_sysvec_call_function_single+0x1a/0x20
> > 
> > Should we revert this?
> 
> IRQ is disabled, so it is in RCU protected region but the lockdep annotation
> isn't happy with it. :-( I'll revert the patch for now.

We just need this: https://lore.kernel.org/all/aL_4gCJibYMW0J98@gondor.apana.org.au/

With this one lockdep is happy.

-Andrea


* Re: [PATCH 1/7] sched_ext: Use rhashtable_lookup() instead of rhashtable_lookup_fast()
  2025-09-24  8:26     ` Andrea Righi
@ 2025-09-24 18:00       ` Tejun Heo
  0 siblings, 0 replies; 26+ messages in thread
From: Tejun Heo @ 2025-09-24 18:00 UTC (permalink / raw)
  To: Andrea Righi; +Cc: David Vernet, Changwoo Min, linux-kernel, sched-ext

On Wed, Sep 24, 2025 at 10:26:24AM +0200, Andrea Righi wrote:
> > IRQ is disabled, so it is in RCU protected region but the lockdep annotation
> > isn't happy with it. :-( I'll revert the patch for now.
> 
> We just need this: https://lore.kernel.org/all/aL_4gCJibYMW0J98@gondor.apana.org.au/
> 
> With this one lockdep is happy.

Oh yeah, we can try switching again once that patch lands.

Thanks.

-- 
tejun


* Re: [PATCH 1/7] sched_ext: Use rhashtable_lookup() instead of rhashtable_lookup_fast()
  2025-09-22  1:32 [PATCH 1/7] sched_ext: Use rhashtable_lookup() instead of rhashtable_lookup_fast() Tejun Heo
                   ` (8 preceding siblings ...)
  2025-09-24  6:14 ` Andrea Righi
@ 2025-10-22 21:37 ` Andrea Righi
  2025-10-22 21:56 ` Tejun Heo
  10 siblings, 0 replies; 26+ messages in thread
From: Andrea Righi @ 2025-10-22 21:37 UTC (permalink / raw)
  To: Tejun Heo; +Cc: David Vernet, Changwoo Min, linux-kernel, sched-ext

Hi Tejun,

On Sun, Sep 21, 2025 at 03:32:40PM -1000, Tejun Heo wrote:
> The find_user_dsq() function is called from contexts that are already
> under RCU read lock protection. Switch from rhashtable_lookup_fast() to
> rhashtable_lookup() to avoid redundant RCU locking.

Now that bee8a520eb849 ("rhashtable: Use rcu_dereference_all and
rcu_dereference_all_check") landed upstream, can we re-apply this?
Maybe add:

Requires: bee8a520eb849 ("rhashtable: Use rcu_dereference_all and rcu_dereference_all_check")

Thanks,
-Andrea

> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
>  kernel/sched/ext.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index f5873f8ed669..df433f6fab4b 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -207,7 +207,7 @@ static struct scx_dispatch_q *find_global_dsq(struct task_struct *p)
>  
>  static struct scx_dispatch_q *find_user_dsq(struct scx_sched *sch, u64 dsq_id)
>  {
> -	return rhashtable_lookup_fast(&sch->dsq_hash, &dsq_id, dsq_hash_params);
> +	return rhashtable_lookup(&sch->dsq_hash, &dsq_id, dsq_hash_params);
>  }
>  
>  /*
> -- 
> 2.51.0
> 


* Re: [PATCH 1/7] sched_ext: Use rhashtable_lookup() instead of rhashtable_lookup_fast()
  2025-09-22  1:32 [PATCH 1/7] sched_ext: Use rhashtable_lookup() instead of rhashtable_lookup_fast() Tejun Heo
                   ` (9 preceding siblings ...)
  2025-10-22 21:37 ` Andrea Righi
@ 2025-10-22 21:56 ` Tejun Heo
  10 siblings, 0 replies; 26+ messages in thread
From: Tejun Heo @ 2025-10-22 21:56 UTC (permalink / raw)
  To: void, arighi, changwoo; +Cc: linux-kernel, sched-ext

> Tejun Heo (1):
>   sched_ext: Use rhashtable_lookup() instead of rhashtable_lookup_fast()

Applied to sched_ext/for-6.19 with the suggested Requires tag.

Thanks.
--
tejun


Thread overview: 26+ messages
2025-09-22  1:32 [PATCH 1/7] sched_ext: Use rhashtable_lookup() instead of rhashtable_lookup_fast() Tejun Heo
2025-09-22  1:32 ` [PATCH 2/7] sched_ext: Improve SCX_KF_DISPATCH comment Tejun Heo
2025-09-23  7:20   ` Andrea Righi
2025-09-22  1:32 ` [PATCH 3/7] sched_ext: Fix stray scx_root usage in task_can_run_on_remote_rq() Tejun Heo
2025-09-23  7:22   ` Andrea Righi
2025-09-22  1:32 ` [PATCH 4/7] sched_ext: Use bitfields for boolean warning flags Tejun Heo
2025-09-23  7:45   ` Andrea Righi
2025-09-23 16:00     ` Tejun Heo
2025-09-23 16:58   ` [PATCH v2 " Tejun Heo
2025-09-23 18:45     ` Andrea Righi
2025-09-22  1:32 ` [PATCH 5/7] sched_ext: Add SCX_EFLAG_INITIALIZED to indicate successful ops.init() Tejun Heo
2025-09-23  8:08   ` Andrea Righi
2025-09-22  1:32 ` [PATCH 6/7] sched_ext: Make qmap dump operation non-destructive Tejun Heo
2025-09-23  8:09   ` Andrea Righi
2025-09-22  1:32 ` [PATCH 7/7] tools/sched_ext: scx_qmap: Make debug output quieter by default Tejun Heo
2025-09-23  8:11   ` Andrea Righi
2025-09-23  7:18 ` [PATCH 1/7] sched_ext: Use rhashtable_lookup() instead of rhashtable_lookup_fast() Andrea Righi
2025-09-23 15:59   ` Tejun Heo
2025-09-23 18:47     ` Andrea Righi
2025-09-23 19:13 ` Tejun Heo
2025-09-24  6:14 ` Andrea Righi
2025-09-24  6:38   ` Tejun Heo
2025-09-24  8:26     ` Andrea Righi
2025-09-24 18:00       ` Tejun Heo
2025-10-22 21:37 ` Andrea Righi
2025-10-22 21:56 ` Tejun Heo
