public inbox for linux-kernel@vger.kernel.org
* [PATCH 0/7] sched_ext: Implement core event counters
@ 2025-01-16 15:15 Changwoo Min
  2025-01-16 15:15 ` [PATCH 1/7] sched_ext: Implement event counter infrastructure and add an event Changwoo Min
                   ` (6 more replies)
  0 siblings, 7 replies; 28+ messages in thread
From: Changwoo Min @ 2025-01-16 15:15 UTC (permalink / raw)
  To: tj, void, arighi; +Cc: kernel-dev, linux-kernel, Changwoo Min

The sched_ext core often has to override the BPF scheduler's decisions,
and such events, while interesting to the BPF scheduler, are not easily
visible.

This patchset aims to address such a problem in the following manner:
- Introduce an infrastructure to collect such events in a scalable and
  extensible way and to expose the collected events to the BPF scheduler
  in a compatible way.
- Define five events to be collected.
- Modify the scx_central scheduler to demonstrate usage of the new BPF APIs.

Changwoo Min (7):
  sched_ext: Implement event counter infrastructure and add an event
  sched_ext: Add an event, SCX_EVENT_OFFLINE_LOCAL_DSQ
  sched_ext: Add an event, SCX_EVENT_CNTD_RUN_WO_ENQ
  sched_ext: Add an event, SCX_EVENT_ENQ_LOCAL_EXITING
  sched_ext: Add an event, SCX_EVENT_RQ_BYPASSING_OPS
  sched_ext: Add scx_bpf_event_stat() and scx_read_event() for BPF
    schedulers
  sched_ext: Print core event count in scx_central scheduler

 kernel/sched/ext.c                       | 173 ++++++++++++++++++++++-
 tools/sched_ext/include/scx/common.bpf.h |   4 +
 tools/sched_ext/scx_central.bpf.c        |  18 +++
 3 files changed, 190 insertions(+), 5 deletions(-)

-- 
2.48.1


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 1/7] sched_ext: Implement event counter infrastructure and add an event
  2025-01-16 15:15 [PATCH 0/7] sched_ext: Implement core event counters Changwoo Min
@ 2025-01-16 15:15 ` Changwoo Min
  2025-01-17  1:33   ` Tejun Heo
                     ` (2 more replies)
  2025-01-16 15:15 ` [PATCH 2/7] sched_ext: Add an event, SCX_EVENT_OFFLINE_LOCAL_DSQ Changwoo Min
                   ` (5 subsequent siblings)
  6 siblings, 3 replies; 28+ messages in thread
From: Changwoo Min @ 2025-01-16 15:15 UTC (permalink / raw)
  To: tj, void, arighi; +Cc: kernel-dev, linux-kernel, Changwoo Min

Collect the statistics of specific types of behavior in the sched_ext core,
which are not easily visible but still interesting to an scx scheduler.

Also, add a core event, SCX_EVENT_INVAL_SELECT_CPU, which represents how
many times ops.select_cpu() returns a CPU that the task can't use.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
---
 kernel/sched/ext.c | 120 ++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 118 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 0bcdd1a31676..7e12d5b8322e 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -1458,6 +1458,66 @@ static struct task_struct *scx_task_iter_next_locked(struct scx_task_iter *iter)
 	return p;
 }
 
+/*
+ * Collection of event counters.
+ */
+struct scx_event_stat {
+	/*
+	 * If ops.select_cpu() returns a CPU which can't be used by the task,
+	 * the core scheduler code silently picks a fallback CPU.
+	 */
+	u64		INVAL_SELECT_CPU;
+};
+
+#define SCX_EVENT_IDX(e)	(offsetof(struct scx_event_stat, e)/sizeof(u64))
+#define SCX_EVENT_END_IDX()	(sizeof(struct scx_event_stat)/sizeof(u64))
+#define SCX_EVENT_DEFINE(e)	SCX_EVENT_##e = SCX_EVENT_IDX(e)
+
+/*
+ * Types of event counters.
+ */
+enum scx_event_kind {
+	SCX_EVENT_BEGIN = 0,
+	SCX_EVENT_DEFINE(INVAL_SELECT_CPU),
+	SCX_EVENT_END = SCX_EVENT_END_IDX(),
+};
+
+static const char *scx_event_stat_str[] = {
+	[SCX_EVENT_INVAL_SELECT_CPU]	= "invalid_select_cpu",
+};
+
+/*
+ * The event counter is organized by a per-CPU variable to minimize the
+ * accounting overhead without synchronization. A system-wide view on the
+ * event counter is constructed when requested by scx_bpf_event_stat().
+ */
+static DEFINE_PER_CPU(struct scx_event_stat, event_stats);
+
+/**
+ * scx_add_event - Increase an event counter for 'name' by 'cnt'
+ * @name: an event name defined in struct scx_event_stat
+ * @cnt: the number of times the event occurred
+ */
+#define scx_add_event(name, cnt) ({						\
+	struct scx_event_stat *__e;						\
+	__e = get_cpu_ptr(&event_stats);					\
+	WRITE_ONCE(__e->name, __e->name + (cnt));				\
+	put_cpu_ptr(&event_stats);						\
+})
+
+
+/**
+ * scx_read_event_kind - Read an event from 'e' with 'kind'
+ * @e: a pointer to an event collected by scx_bpf_event_stat()
+ * @kind: an event type defined in enum scx_event_kind
+ */
+#define scx_read_event_kind(e, kind) ({						\
+	u64 *__e64 = (u64 *)(e);						\
+	__e64[kind];								\
+})
+
+static void scx_bpf_event_stat(struct scx_event_stat *event, size_t event__sz);
+
 static enum scx_ops_enable_state scx_ops_enable_state(void)
 {
 	return atomic_read(&scx_ops_enable_state_var);
@@ -3607,8 +3667,10 @@ static int select_task_rq_scx(struct task_struct *p, int prev_cpu, int wake_flag
 		*ddsp_taskp = NULL;
 		if (ops_cpu_valid(cpu, "from ops.select_cpu()"))
 			return cpu;
-		else
+		else {
+			scx_add_event(INVAL_SELECT_CPU, 1);
 			return prev_cpu;
+		}
 	} else {
 		bool found;
 		s32 cpu;
@@ -5053,6 +5115,15 @@ static void scx_ops_disable_workfn(struct kthread_work *work)
 		scx_rq_clock_invalidate(rq);
 	}
 
+	/*
+	 * Clear event counters so the next scx scheduler always gets
+	 * fresh event counter values.
+	 */
+	for_each_possible_cpu(cpu) {
+		struct scx_event_stat *e = per_cpu_ptr(&event_stats, cpu);
+		memset(e, 0, sizeof(*e));
+	}
+
 	/* no task is on scx, turn off all the switches and flush in-progress calls */
 	static_branch_disable(&__scx_ops_enabled);
 	for (i = SCX_OPI_BEGIN; i < SCX_OPI_END; i++)
@@ -5309,9 +5380,10 @@ static void scx_dump_state(struct scx_exit_info *ei, size_t dump_len)
 		.at_jiffies = jiffies,
 	};
 	struct seq_buf s;
+	struct scx_event_stat event;
 	unsigned long flags;
 	char *buf;
-	int cpu;
+	int cpu, kind;
 
 	spin_lock_irqsave(&dump_lock, flags);
 
@@ -5417,6 +5489,16 @@ static void scx_dump_state(struct scx_exit_info *ei, size_t dump_len)
 		rq_unlock(rq, &rf);
 	}
 
+	dump_newline(&s);
+	dump_line(&s, "Event counters");
+	dump_line(&s, "--------------");
+
+	scx_bpf_event_stat(&event, sizeof(event));
+	for (kind = SCX_EVENT_BEGIN; kind < SCX_EVENT_END; kind++) {
+		dump_line(&s, "%25s : %llu", scx_event_stat_str[kind],
+			  scx_read_event_kind(&event, kind));
+	}
+
 	if (seq_buf_has_overflowed(&s) && dump_len >= sizeof(trunc_marker))
 		memcpy(ei->dump + dump_len - sizeof(trunc_marker),
 		       trunc_marker, sizeof(trunc_marker));
@@ -7720,6 +7802,39 @@ __bpf_kfunc u64 scx_bpf_now(void)
 	return clock;
 }
 
+/**
+ * scx_bpf_event_stat - Get a system-wide view of the event counters
+ * @event: output buffer from a BPF program
+ * @event__sz: @event len, must end in '__sz' for the verifier
+ */
+__bpf_kfunc void scx_bpf_event_stat(struct scx_event_stat *event,
+				    size_t event__sz)
+{
+	struct scx_event_stat *e;
+	u64 *event64, *e64;
+	int cpu, kind, event_end;
+
+	/*
+	 * We cannot entirely trust a BPF-provided size since a BPF program
+	 * might be compiled against a different vmlinux.h, of which
+	 * scx_event_stat would be larger (a newer vmlinux.h) or smaller
+	 * (an older vmlinux.h). Hence, we use the smaller size to avoid
+	 * memory corruption.
+	 */
+	event__sz = min(event__sz, sizeof(*event));
+	event_end = event__sz / sizeof(u64);
+
+	event64 = (u64 *)event;
+	memset(event, 0, event__sz);
+	for_each_possible_cpu(cpu) {
+		e = per_cpu_ptr(&event_stats, cpu);
+		e64 = (u64 *)e;
+		for (kind = 0; kind < event_end; kind++) {
+			event64[kind] += READ_ONCE(e64[kind]);
+		}
+	}
+}
+
 __bpf_kfunc_end_defs();
 
 BTF_KFUNCS_START(scx_kfunc_ids_any)
@@ -7752,6 +7867,7 @@ BTF_ID_FLAGS(func, scx_bpf_cpu_rq)
 BTF_ID_FLAGS(func, scx_bpf_task_cgroup, KF_RCU | KF_ACQUIRE)
 #endif
 BTF_ID_FLAGS(func, scx_bpf_now)
+BTF_ID_FLAGS(func, scx_bpf_event_stat, KF_TRUSTED_ARGS)
 BTF_KFUNCS_END(scx_kfunc_ids_any)
 
 static const struct btf_kfunc_id_set scx_kfunc_set_any = {
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 2/7] sched_ext: Add an event, SCX_EVENT_OFFLINE_LOCAL_DSQ
  2025-01-16 15:15 [PATCH 0/7] sched_ext: Implement core event counters Changwoo Min
  2025-01-16 15:15 ` [PATCH 1/7] sched_ext: Implement event counter infrastructure and add an event Changwoo Min
@ 2025-01-16 15:15 ` Changwoo Min
  2025-01-17  1:37   ` Tejun Heo
  2025-01-16 15:15 ` [PATCH 3/7] sched_ext: Add an event, SCX_EVENT_CNTD_RUN_WO_ENQ Changwoo Min
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 28+ messages in thread
From: Changwoo Min @ 2025-01-16 15:15 UTC (permalink / raw)
  To: tj, void, arighi; +Cc: kernel-dev, linux-kernel, Changwoo Min

Add a core event, SCX_EVENT_OFFLINE_LOCAL_DSQ, which represents how many
times a BPF scheduler tries to dispatch to an offlined local DSQ.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
---
 kernel/sched/ext.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 7e12d5b8322e..8054e4e5ed0c 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -1467,6 +1467,12 @@ struct scx_event_stat {
 	 * the core scheduler code silently picks a fallback CPU.
 	 */
 	u64		INVAL_SELECT_CPU;
+
+	/*
+	 * When dispatching to a local DSQ, the CPU may have gone offline in
+	 * the meantime. In this case, the task is bounced to the global DSQ.
+	 */
+	u64		OFFLINE_LOCAL_DSQ;
 };
 
 #define SCX_EVENT_IDX(e)	(offsetof(struct scx_event_stat, e)/sizeof(u64))
@@ -1479,11 +1485,13 @@ struct scx_event_stat {
 enum scx_event_kind {
 	SCX_EVENT_BEGIN = 0,
 	SCX_EVENT_DEFINE(INVAL_SELECT_CPU),
+	SCX_EVENT_DEFINE(OFFLINE_LOCAL_DSQ),
 	SCX_EVENT_END = SCX_EVENT_END_IDX(),
 };
 
 static const char *scx_event_stat_str[] = {
 	[SCX_EVENT_INVAL_SELECT_CPU]	= "invalid_select_cpu",
+	[SCX_EVENT_OFFLINE_LOCAL_DSQ]	= "offline_local_dsq",
 };
 
 /*
@@ -2651,6 +2659,7 @@ static void dispatch_to_local_dsq(struct rq *rq, struct scx_dispatch_q *dst_dsq,
 	if (unlikely(!task_can_run_on_remote_rq(p, dst_rq, true))) {
 		dispatch_enqueue(find_global_dsq(p), p,
 				 enq_flags | SCX_ENQ_CLEAR_OPSS);
+		scx_add_event(OFFLINE_LOCAL_DSQ, 1);
 		return;
 	}
 
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 3/7] sched_ext: Add an event, SCX_EVENT_CNTD_RUN_WO_ENQ
  2025-01-16 15:15 [PATCH 0/7] sched_ext: Implement core event counters Changwoo Min
  2025-01-16 15:15 ` [PATCH 1/7] sched_ext: Implement event counter infrastructure and add an event Changwoo Min
  2025-01-16 15:15 ` [PATCH 2/7] sched_ext: Add an event, SCX_EVENT_OFFLINE_LOCAL_DSQ Changwoo Min
@ 2025-01-16 15:15 ` Changwoo Min
  2025-01-17  1:39   ` Tejun Heo
  2025-01-16 15:15 ` [PATCH 4/7] sched_ext: Add an event, SCX_EVENT_ENQ_LOCAL_EXITING Changwoo Min
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 28+ messages in thread
From: Changwoo Min @ 2025-01-16 15:15 UTC (permalink / raw)
  To: tj, void, arighi; +Cc: kernel-dev, linux-kernel, Changwoo Min

Add a core event, SCX_EVENT_CNTD_RUN_WO_ENQ, which represents how many
times a task continues to run without ops.enqueue() being called when
SCX_OPS_ENQ_LAST is not set.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
---
 kernel/sched/ext.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 8054e4e5ed0c..909f12a41934 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -1473,6 +1473,12 @@ struct scx_event_stat {
 	 * the meantime. In this case, the task is bounced to the global DSQ.
 	 */
 	u64		OFFLINE_LOCAL_DSQ;
+
+	/*
+	 * If SCX_OPS_ENQ_LAST is not set, the number of times that a task
+	 * continued to run because there were no other tasks on the CPU.
+	 */
+	u64		CNTD_RUN_WO_ENQ;
 };
 
 #define SCX_EVENT_IDX(e)	(offsetof(struct scx_event_stat, e)/sizeof(u64))
@@ -1486,12 +1492,14 @@ enum scx_event_kind {
 	SCX_EVENT_BEGIN = 0,
 	SCX_EVENT_DEFINE(INVAL_SELECT_CPU),
 	SCX_EVENT_DEFINE(OFFLINE_LOCAL_DSQ),
+	SCX_EVENT_DEFINE(CNTD_RUN_WO_ENQ),
 	SCX_EVENT_END = SCX_EVENT_END_IDX(),
 };
 
 static const char *scx_event_stat_str[] = {
 	[SCX_EVENT_INVAL_SELECT_CPU]	= "invalid_select_cpu",
 	[SCX_EVENT_OFFLINE_LOCAL_DSQ]	= "offline_local_dsq",
+	[SCX_EVENT_CNTD_RUN_WO_ENQ]	= "cntd_run_wo_enq",
 };
 
 /*
@@ -2914,6 +2922,7 @@ static int balance_one(struct rq *rq, struct task_struct *prev)
 	if (prev_on_rq && (!static_branch_unlikely(&scx_ops_enq_last) ||
 	     scx_rq_bypassing(rq))) {
 		rq->scx.flags |= SCX_RQ_BAL_KEEP;
+		scx_add_event(CNTD_RUN_WO_ENQ, 1);
 		goto has_tasks;
 	}
 	rq->scx.flags &= ~SCX_RQ_IN_BALANCE;
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 4/7] sched_ext: Add an event, SCX_EVENT_ENQ_LOCAL_EXITING
  2025-01-16 15:15 [PATCH 0/7] sched_ext: Implement core event counters Changwoo Min
                   ` (2 preceding siblings ...)
  2025-01-16 15:15 ` [PATCH 3/7] sched_ext: Add an event, SCX_EVENT_CNTD_RUN_WO_ENQ Changwoo Min
@ 2025-01-16 15:15 ` Changwoo Min
  2025-01-17  1:40   ` Tejun Heo
  2025-01-16 15:15 ` [PATCH 5/7] sched_ext: Add an event, SCX_EVENT_RQ_BYPASSING_OPS Changwoo Min
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 28+ messages in thread
From: Changwoo Min @ 2025-01-16 15:15 UTC (permalink / raw)
  To: tj, void, arighi; +Cc: kernel-dev, linux-kernel, Changwoo Min

Add a core event, SCX_EVENT_ENQ_LOCAL_EXITING, which represents how many
times a task is enqueued to a local DSQ when exiting if SCX_OPS_ENQ_EXITING
is not set.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
---
 kernel/sched/ext.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 909f12a41934..094e19f5fb78 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -1479,6 +1479,12 @@ struct scx_event_stat {
 	 * continued to run because there were no other tasks on the CPU.
 	 */
 	u64		CNTD_RUN_WO_ENQ;
+
+	/*
+	 * If SCX_OPS_ENQ_EXITING is not set, the number of times that a task
+	 * is dispatched to a local DSQ when exiting.
+	 */
+	u64		ENQ_LOCAL_EXITING;
 };
 
 #define SCX_EVENT_IDX(e)	(offsetof(struct scx_event_stat, e)/sizeof(u64))
@@ -1493,6 +1499,7 @@ enum scx_event_kind {
 	SCX_EVENT_DEFINE(INVAL_SELECT_CPU),
 	SCX_EVENT_DEFINE(OFFLINE_LOCAL_DSQ),
 	SCX_EVENT_DEFINE(CNTD_RUN_WO_ENQ),
+	SCX_EVENT_DEFINE(ENQ_LOCAL_EXITING),
 	SCX_EVENT_END = SCX_EVENT_END_IDX(),
 };
 
@@ -1500,6 +1507,7 @@ static const char *scx_event_stat_str[] = {
 	[SCX_EVENT_INVAL_SELECT_CPU]	= "invalid_select_cpu",
 	[SCX_EVENT_OFFLINE_LOCAL_DSQ]	= "offline_local_dsq",
 	[SCX_EVENT_CNTD_RUN_WO_ENQ]	= "cntd_run_wo_enq",
+	[SCX_EVENT_ENQ_LOCAL_EXITING]	= "enq_local_exiting",
 };
 
 /*
@@ -2087,8 +2095,10 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
 
 	/* see %SCX_OPS_ENQ_EXITING */
 	if (!static_branch_unlikely(&scx_ops_enq_exiting) &&
-	    unlikely(p->flags & PF_EXITING))
+	    unlikely(p->flags & PF_EXITING)) {
+		scx_add_event(ENQ_LOCAL_EXITING, 1);
 		goto local;
+	}
 
 	if (!SCX_HAS_OP(enqueue))
 		goto global;
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 5/7] sched_ext: Add an event, SCX_EVENT_RQ_BYPASSING_OPS
  2025-01-16 15:15 [PATCH 0/7] sched_ext: Implement core event counters Changwoo Min
                   ` (3 preceding siblings ...)
  2025-01-16 15:15 ` [PATCH 4/7] sched_ext: Add an event, SCX_EVENT_ENQ_LOCAL_EXITING Changwoo Min
@ 2025-01-16 15:15 ` Changwoo Min
  2025-01-17  1:41   ` Tejun Heo
  2025-01-16 15:15 ` [PATCH 6/7] sched_ext: Add scx_bpf_event_stat() and scx_read_event() for BPF schedulers Changwoo Min
  2025-01-16 15:15 ` [PATCH 7/7] sched_ext: Print core event count in scx_central scheduler Changwoo Min
  6 siblings, 1 reply; 28+ messages in thread
From: Changwoo Min @ 2025-01-16 15:15 UTC (permalink / raw)
  To: tj, void, arighi; +Cc: kernel-dev, linux-kernel, Changwoo Min

Add a core event, SCX_EVENT_RQ_BYPASSING_OPS, which represents how many
operations are bypassed once the bypass mode is set.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
---
 kernel/sched/ext.c | 23 +++++++++++++++++++++--
 1 file changed, 21 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 094e19f5fb78..44b44d963a0c 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -1485,6 +1485,11 @@ struct scx_event_stat {
 	 * is dispatched to a local DSQ when exiting.
 	 */
 	u64		ENQ_LOCAL_EXITING;
+
+	/*
+	 * When the bypassing mode is set, the number of bypassed operations.
+	 */
+	u64		RQ_BYPASSING_OPS;
 };
 
 #define SCX_EVENT_IDX(e)	(offsetof(struct scx_event_stat, e)/sizeof(u64))
@@ -1500,6 +1505,7 @@ enum scx_event_kind {
 	SCX_EVENT_DEFINE(OFFLINE_LOCAL_DSQ),
 	SCX_EVENT_DEFINE(CNTD_RUN_WO_ENQ),
 	SCX_EVENT_DEFINE(ENQ_LOCAL_EXITING),
+	SCX_EVENT_DEFINE(RQ_BYPASSING_OPS),
 	SCX_EVENT_END = SCX_EVENT_END_IDX(),
 };
 
@@ -1508,6 +1514,7 @@ static const char *scx_event_stat_str[] = {
 	[SCX_EVENT_OFFLINE_LOCAL_DSQ]	= "offline_local_dsq",
 	[SCX_EVENT_CNTD_RUN_WO_ENQ]	= "cntd_run_wo_enq",
 	[SCX_EVENT_ENQ_LOCAL_EXITING]	= "enq_local_exiting",
+	[SCX_EVENT_RQ_BYPASSING_OPS]	= "rq_bypassing_ops",
 };
 
 /*
@@ -2087,8 +2094,10 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
 	if (!scx_rq_online(rq))
 		goto local;
 
-	if (scx_rq_bypassing(rq))
+	if (scx_rq_bypassing(rq)) {
+		scx_add_event(RQ_BYPASSING_OPS, 1);
 		goto global;
+	}
 
 	if (p->scx.ddsp_dsq_id != SCX_DSQ_INVALID)
 		goto direct;
@@ -2933,6 +2942,8 @@ static int balance_one(struct rq *rq, struct task_struct *prev)
 	     scx_rq_bypassing(rq))) {
 		rq->scx.flags |= SCX_RQ_BAL_KEEP;
 		scx_add_event(CNTD_RUN_WO_ENQ, 1);
+		if (scx_rq_bypassing(rq))
+			scx_add_event(RQ_BYPASSING_OPS, 1);
 		goto has_tasks;
 	}
 	rq->scx.flags &= ~SCX_RQ_IN_BALANCE;
@@ -3708,6 +3719,9 @@ static int select_task_rq_scx(struct task_struct *p, int prev_cpu, int wake_flag
 			p->scx.slice = SCX_SLICE_DFL;
 			p->scx.ddsp_dsq_id = SCX_DSQ_LOCAL;
 		}
+
+		if (scx_rq_bypassing(task_rq(p)))
+			scx_add_event(RQ_BYPASSING_OPS, 1);
 		return cpu;
 	}
 }
@@ -3799,6 +3813,8 @@ void __scx_update_idle(struct rq *rq, bool idle, bool do_notify)
 	 */
 	if (SCX_HAS_OP(update_idle) && do_notify && !scx_rq_bypassing(rq))
 		SCX_CALL_OP(SCX_KF_REST, update_idle, cpu_of(rq), idle);
+	else if (scx_rq_bypassing(rq))
+		scx_add_event(RQ_BYPASSING_OPS, 1);
 
 	/*
 	 * Update the idle masks:
@@ -3940,6 +3956,7 @@ static void task_tick_scx(struct rq *rq, struct task_struct *curr, int queued)
 	if (scx_rq_bypassing(rq)) {
 		curr->scx.slice = 0;
 		touch_core_sched(rq, curr);
+		scx_add_event(RQ_BYPASSING_OPS, 1);
 	} else if (SCX_HAS_OP(tick)) {
 		SCX_CALL_OP(SCX_KF_REST, tick, curr);
 	}
@@ -7131,8 +7148,10 @@ __bpf_kfunc void scx_bpf_kick_cpu(s32 cpu, u64 flags)
 	 * lead to irq_work_queue() malfunction such as infinite busy wait for
 	 * IRQ status update. Suppress kicking.
 	 */
-	if (scx_rq_bypassing(this_rq))
+	if (scx_rq_bypassing(this_rq)) {
+		scx_add_event(RQ_BYPASSING_OPS, 1);
 		goto out;
+	}
 
 	/*
 	 * Actual kicking is bounced to kick_cpus_irq_workfn() to avoid nesting
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 6/7] sched_ext: Add scx_bpf_event_stat() and scx_read_event() for BPF schedulers
  2025-01-16 15:15 [PATCH 0/7] sched_ext: Implement core event counters Changwoo Min
                   ` (4 preceding siblings ...)
  2025-01-16 15:15 ` [PATCH 5/7] sched_ext: Add an event, SCX_EVENT_RQ_BYPASSING_OPS Changwoo Min
@ 2025-01-16 15:15 ` Changwoo Min
  2025-01-16 15:15 ` [PATCH 7/7] sched_ext: Print core event count in scx_central scheduler Changwoo Min
  6 siblings, 0 replies; 28+ messages in thread
From: Changwoo Min @ 2025-01-16 15:15 UTC (permalink / raw)
  To: tj, void, arighi; +Cc: kernel-dev, linux-kernel, Changwoo Min

scx_bpf_event_stat() is added to the BPF header file so that a BPF
scheduler can use it. Also, scx_read_event() is added to read an event
counter in a backward-compatible way.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
---
 tools/sched_ext/include/scx/common.bpf.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/tools/sched_ext/include/scx/common.bpf.h b/tools/sched_ext/include/scx/common.bpf.h
index f3e15e9efa76..c526aa8a949a 100644
--- a/tools/sched_ext/include/scx/common.bpf.h
+++ b/tools/sched_ext/include/scx/common.bpf.h
@@ -78,6 +78,10 @@ struct rq *scx_bpf_cpu_rq(s32 cpu) __ksym;
 struct cgroup *scx_bpf_task_cgroup(struct task_struct *p) __ksym __weak;
 u64 scx_bpf_now(void) __ksym __weak;
 
+void scx_bpf_event_stat(struct scx_event_stat *event, size_t event__sz) __ksym __weak;
+#define scx_read_event(e, name)							\
+	(bpf_core_field_exists((e)->name) ? (e)->name : 0)
+
 /*
  * Use the following as @it__iter when calling scx_bpf_dsq_move[_vtime]() from
  * within bpf_for_each() loops.
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 7/7] sched_ext: Print core event count in scx_central scheduler
  2025-01-16 15:15 [PATCH 0/7] sched_ext: Implement core event counters Changwoo Min
                   ` (5 preceding siblings ...)
  2025-01-16 15:15 ` [PATCH 6/7] sched_ext: Add scx_bpf_event_stat() and scx_read_event() for BPF schedulers Changwoo Min
@ 2025-01-16 15:15 ` Changwoo Min
  6 siblings, 0 replies; 28+ messages in thread
From: Changwoo Min @ 2025-01-16 15:15 UTC (permalink / raw)
  To: tj, void, arighi; +Cc: kernel-dev, linux-kernel, Changwoo Min

Modify the scx_central scheduler to print the core event counters
every second.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
---
 tools/sched_ext/scx_central.bpf.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/tools/sched_ext/scx_central.bpf.c b/tools/sched_ext/scx_central.bpf.c
index 50bc1737c167..1f0dc9861418 100644
--- a/tools/sched_ext/scx_central.bpf.c
+++ b/tools/sched_ext/scx_central.bpf.c
@@ -256,6 +256,8 @@ static int central_timerfn(void *map, int *key, struct bpf_timer *timer)
 	u64 now = scx_bpf_now();
 	u64 nr_to_kick = nr_queued;
 	s32 i, curr_cpu;
+	struct scx_event_stat event;
+	int kind;
 
 	curr_cpu = bpf_get_smp_processor_id();
 	if (timer_pinned && (curr_cpu != central_cpu)) {
@@ -291,6 +293,22 @@ static int central_timerfn(void *map, int *key, struct bpf_timer *timer)
 
 	bpf_timer_start(timer, TIMER_INTERVAL_NS, BPF_F_TIMER_CPU_PIN);
 	__sync_fetch_and_add(&nr_timers, 1);
+
+	/* print event counters every second */
+	if (nr_timers % 1000 == 0) {
+		scx_bpf_event_stat(&event, sizeof(event));
+
+		bpf_printk("%20s: %llu\n", "invalid_select_cpu",
+			   scx_read_event(&event, INVAL_SELECT_CPU));
+		bpf_printk("%20s: %llu\n", "offline_local_dsq",
+			   scx_read_event(&event, OFFLINE_LOCAL_DSQ));
+		bpf_printk("%20s: %llu\n", "cntd_run_wo_enq",
+			   scx_read_event(&event, CNTD_RUN_WO_ENQ));
+		bpf_printk("%20s: %llu\n", "enq_local_exiting",
+			   scx_read_event(&event, ENQ_LOCAL_EXITING));
+		bpf_printk("%20s: %llu\n", "rq_bypassing_ops",
+			   scx_read_event(&event, RQ_BYPASSING_OPS));
+	}
 	return 0;
 }
 
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/7] sched_ext: Implement event counter infrastructure and add an event
  2025-01-16 15:15 ` [PATCH 1/7] sched_ext: Implement event counter infrastructure and add an event Changwoo Min
@ 2025-01-17  1:33   ` Tejun Heo
  2025-01-17  7:08     ` Changwoo Min
  2025-01-17  9:49   ` Andrea Righi
  2025-01-21 15:00   ` Leonardo Temperanza
  2 siblings, 1 reply; 28+ messages in thread
From: Tejun Heo @ 2025-01-17  1:33 UTC (permalink / raw)
  To: Changwoo Min; +Cc: void, arighi, kernel-dev, linux-kernel

Hello,

On Fri, Jan 17, 2025 at 12:15:37AM +0900, Changwoo Min wrote:
> +/*
> + * Collection of event counters.
> + */
> +struct scx_event_stat {

If we keep this struct, maybe name it scx_event_stats?

> +	/*
> +	 * If ops.select_cpu() returns a CPU which can't be used by the task,
> +	 * the core scheduler code silently picks a fallback CPU.
> +	 */
> +	u64		INVAL_SELECT_CPU;

Let's do $COMPONENT_$EVENT, so SELECT_CPU_INVALID (or maybe something more
specific than invalid, like disallowed or fallback?).

> +#define SCX_EVENT_IDX(e)	(offsetof(struct scx_event_stat, e)/sizeof(u64))
> +#define SCX_EVENT_END_IDX()	(sizeof(struct scx_event_stat)/sizeof(u64))
> +#define SCX_EVENT_DEFINE(e)	SCX_EVENT_##e = SCX_EVENT_IDX(e)
> +
> +/*
> + * Types of event counters.
> + */
> +enum scx_event_kind {
> +	SCX_EVENT_BEGIN = 0,

I think 0 BEGIN value can be implicit.

> +	SCX_EVENT_DEFINE(INVAL_SELECT_CPU),
> +	SCX_EVENT_END = SCX_EVENT_END_IDX(),

SCX_NR_EVENTS? Also, we don't really need to macro to count the enums.

> +};

This feels a bit odd to me. Why does it need both the struct fields and
indices? Can we do either one of those?

> +static const char *scx_event_stat_str[] = {
> +	[SCX_EVENT_INVAL_SELECT_CPU]	= "invalid_select_cpu",
> +};

and hopefully derive this through # macro arg expansion?

> +/*
> + * The event counter is organized by a per-CPU variable to minimize the
> + * accounting overhead without synchronization. A system-wide view on the
> + * event counter is constructed when requested by scx_bpf_event_stat().
> + */
> +static DEFINE_PER_CPU(struct scx_event_stat, event_stats);

May be better to include "cpu" in the variable name.

> +/**
> + * scx_add_event - Increase an event counter for 'name' by 'cnt'
> + * @name: an event name defined in struct scx_event_stat
> + * @cnt: the number of times the event occurred
> + */
> +#define scx_add_event(name, cnt) ({						\
> +	struct scx_event_stat *__e;						\
> +	__e = get_cpu_ptr(&event_stats);					\
> +	WRITE_ONCE(__e->name, __e->name+ (cnt));				\
> +	put_cpu_ptr(&event_stats);						\

this_cpu_add()?

> @@ -3607,8 +3667,10 @@ static int select_task_rq_scx(struct task_struct *p, int prev_cpu, int wake_flag
>  		*ddsp_taskp = NULL;
>  		if (ops_cpu_valid(cpu, "from ops.select_cpu()"))
>  			return cpu;
> -		else
> +		else {
> +			scx_add_event(INVAL_SELECT_CPU, 1);
>  			return prev_cpu;
> +		}

formatting:

        if () {
        } else {
        }

Also, I'm not sure this is a useful event to count. False ops_cpu_valid()
indicates that the returned CPU is not even possible and the scheduler is
ejected right away. What's more interesting is
kernel/sched/core.c::select_task_rq() tripping on !is_cpu_allowed() and
falling back using select_fallback_rq().

We can either hook into core.c or just compare the ops.select_cpu() picked
CPU against the CPU the task ends up on in enqueue_task_scx().

> @@ -5053,6 +5115,15 @@ static void scx_ops_disable_workfn(struct kthread_work *work)
>  		scx_rq_clock_invalidate(rq);
>  	}
>  
> +	/*
> +	 * Clear event counters so the next scx scheduler always gets
> +	 * fresh event counter values.
> +	 */
> +	for_each_possible_cpu(cpu) {
> +		struct scx_event_stat *e = per_cpu_ptr(&event_stats, cpu);
> +		memset(e, 0, sizeof(*e));
> +	}
> +

Wouldn't we want to keep these counters intact on ejection so that the
scheduler's ejection path can capture the counters if necessary? Resetting
on load probably is better.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/7] sched_ext: Add an event, SCX_EVENT_OFFLINE_LOCAL_DSQ
  2025-01-16 15:15 ` [PATCH 2/7] sched_ext: Add an event, SCX_EVENT_OFFLINE_LOCAL_DSQ Changwoo Min
@ 2025-01-17  1:37   ` Tejun Heo
  2025-01-17  7:11     ` Changwoo Min
  0 siblings, 1 reply; 28+ messages in thread
From: Tejun Heo @ 2025-01-17  1:37 UTC (permalink / raw)
  To: Changwoo Min; +Cc: void, arighi, kernel-dev, linux-kernel

On Fri, Jan 17, 2025 at 12:15:38AM +0900, Changwoo Min wrote:
...
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 7e12d5b8322e..8054e4e5ed0c 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -1467,6 +1467,12 @@ struct scx_event_stat {
>  	 * the core scheduler code silently picks a fallback CPU.
>  	 */
>  	u64		INVAL_SELECT_CPU;
> +
> +	/*
> +	 * When dispatching to a local DSQ, the CPU may have gone offline in
> +	 * the meantime. In this case, the task is bounced to the global DSQ.
> +	 */
> +	u64		OFFLINE_LOCAL_DSQ;

Let's do components in descending order + event:

- SELECT_CPU_INVAL (or SELECT_CPU_FALLBACK)
- DISPATCH_LOCAL_DSQ_OFFLINE

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/7] sched_ext: Add an event, SCX_EVENT_CNTD_RUN_WO_ENQ
  2025-01-16 15:15 ` [PATCH 3/7] sched_ext: Add an event, SCX_EVENT_CNTD_RUN_WO_ENQ Changwoo Min
@ 2025-01-17  1:39   ` Tejun Heo
  2025-01-17  7:12     ` Changwoo Min
  0 siblings, 1 reply; 28+ messages in thread
From: Tejun Heo @ 2025-01-17  1:39 UTC (permalink / raw)
  To: Changwoo Min; +Cc: void, arighi, kernel-dev, linux-kernel

On Fri, Jan 17, 2025 at 12:15:39AM +0900, Changwoo Min wrote:
> Add a core event, SCX_EVENT_CNTD_RUN_WO_ENQ, which represents how many
> times a task continues to run without ops.enqueue() being called when
> SCX_OPS_ENQ_LAST is not set.
> 
> Signed-off-by: Changwoo Min <changwoo@igalia.com>
> ---
>  kernel/sched/ext.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 8054e4e5ed0c..909f12a41934 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -1473,6 +1473,12 @@ struct scx_event_stat {
>  	 * the meantime. In this case, the task is bounced to the global DSQ.
>  	 */
>  	u64		OFFLINE_LOCAL_DSQ;
> +
> +	/*
> +	 * If SCX_OPS_ENQ_LAST is not set, the number of times that a task
> +	 * continued to run because there were no other tasks on the CPU.
> +	 */
> +	u64		CNTD_RUN_WO_ENQ;

DISPATCH_KEEP_LAST?

-- 
tejun

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 4/7] sched_ext: Add an event, SCX_EVENT_ENQ_LOCAL_EXITING
  2025-01-16 15:15 ` [PATCH 4/7] sched_ext: Add an event, SCX_EVENT_ENQ_LOCAL_EXITING Changwoo Min
@ 2025-01-17  1:40   ` Tejun Heo
  2025-01-17  7:12     ` Changwoo Min
  0 siblings, 1 reply; 28+ messages in thread
From: Tejun Heo @ 2025-01-17  1:40 UTC (permalink / raw)
  To: Changwoo Min; +Cc: void, arighi, kernel-dev, linux-kernel

On Fri, Jan 17, 2025 at 12:15:40AM +0900, Changwoo Min wrote:
> Add a core event, SCX_EVENT_ENQ_LOCAL_EXITING, which represents how many
> times a task is enqueued to a local DSQ when exiting if SCX_OPS_ENQ_EXITING
> is not set.
> 
> Signed-off-by: Changwoo Min <changwoo@igalia.com>
> ---
>  kernel/sched/ext.c | 12 +++++++++++-
>  1 file changed, 11 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 909f12a41934..094e19f5fb78 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -1479,6 +1479,12 @@ struct scx_event_stat {
>  	 * continued to run because there were no other tasks on the CPU.
>  	 */
>  	u64		CNTD_RUN_WO_ENQ;
> +
> +	/*
> +	 * If SCX_OPS_ENQ_EXITING is not set, the number of times that a task
> +	 * is dispatched to a local DSQ when exiting.
> +	 */
> +	u64		ENQ_LOCAL_EXITING;

ENQ_SKIP_EXITING?

-- 
tejun

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 5/7] sched_ext: Add an event, SCX_EVENT_RQ_BYPASSING_OPS
  2025-01-16 15:15 ` [PATCH 5/7] sched_ext: Add an event, SCX_EVENT_RQ_BYPASSING_OPS Changwoo Min
@ 2025-01-17  1:41   ` Tejun Heo
  2025-01-17  7:31     ` Changwoo Min
  0 siblings, 1 reply; 28+ messages in thread
From: Tejun Heo @ 2025-01-17  1:41 UTC (permalink / raw)
  To: Changwoo Min; +Cc: void, arighi, kernel-dev, linux-kernel

On Fri, Jan 17, 2025 at 12:15:41AM +0900, Changwoo Min wrote:
> Add a core event, SCX_EVENT_RQ_BYPASSING_OPS, which represents how many
> operations are bypassed once the bypass mode is set.
> 
> Signed-off-by: Changwoo Min <changwoo@igalia.com>
> ---
>  kernel/sched/ext.c | 23 +++++++++++++++++++++--
>  1 file changed, 21 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 094e19f5fb78..44b44d963a0c 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -1485,6 +1485,11 @@ struct scx_event_stat {
>  	 * is dispatched to a local DSQ when exiting.
>  	 */
>  	u64		ENQ_LOCAL_EXITING;
> +
> +	/*
> +	 * When the bypassing mode is set, the number of bypassed operations.
> +	 */
> +	u64		RQ_BYPASSING_OPS;
>  };

I'm not sure this is a particularly good way to account for bypass mode.
Maybe account the total duration of bypass modes, the number of times it was
activated and the number of times tasks were dispatched in bypass mode?

Thanks.

-- 
tejun


* Re: [PATCH 1/7] sched_ext: Implement event counter infrastructure and add an event
  2025-01-17  1:33   ` Tejun Heo
@ 2025-01-17  7:08     ` Changwoo Min
  2025-01-17 16:24       ` Tejun Heo
  0 siblings, 1 reply; 28+ messages in thread
From: Changwoo Min @ 2025-01-17  7:08 UTC (permalink / raw)
  To: Tejun Heo; +Cc: void, arighi, kernel-dev, linux-kernel

Hello,

Thank you for the review!

On 25. 1. 17. 10:33, Tejun Heo wrote:
> Hello,
> 
> On Fri, Jan 17, 2025 at 12:15:37AM +0900, Changwoo Min wrote:
>> +/*
>> + * Collection of event counters.
>> + */
>> +struct scx_event_stat {
> 
> If we keep this struct, maybe name it scx_event_stats?

Sure, I will change it as suggested. Also, I will change
scx_bpf_event_stat() to scx_bpf_event_stats() for consistency.

> 
>> +	/*
>> +	 * If ops.select_cpu() returns a CPU which can't be used by the task,
>> +	 * the core scheduler code silently picks a fallback CPU.
>> +	 */
>> +	u64		INVAL_SELECT_CPU;
> 
> Let's do $COMPONENT_$EVENT, so SELECT_CPU_INVALID (or maybe something more
> specific than invalid, like disallowed or fallback?).

$COMPONENT_$EVENT sounds more systematic. Thanks for the
suggestion. What about SELECT_CPU_FALLBACK?


>> +#define SCX_EVENT_IDX(e)	(offsetof(struct scx_event_stat, e)/sizeof(u64))
>> +#define SCX_EVENT_END_IDX()	(sizeof(struct scx_event_stat)/sizeof(u64))
>> +#define SCX_EVENT_DEFINE(e)	SCX_EVENT_##e = SCX_EVENT_IDX(e)
>> +
>> +/*
>> + * Types of event counters.
>> + */
>> +enum scx_event_kind {
>> +	SCX_EVENT_BEGIN = 0,
> 
> I think 0 BEGIN value can be implicit.

Yup, I will remove it in the next version.

>> +	SCX_EVENT_DEFINE(INVAL_SELECT_CPU),
>> +	SCX_EVENT_END = SCX_EVENT_END_IDX(),
> 
> SCX_NR_EVENTS? Also, we don't really need to macro to count the enums.

For SCX_EVENT_BEGIN/END, I followed the style of
SCX_OPI_BEGIN/END. You are right. I will remove
SCX_EVENT_END_IDX() in the next version.

> This feels a bit odd to me. Why does it need both the struct fields and
> indices? Can we do either one of those?

Before submitting the patch set, I tested one approach using
purely enums storing the actual counters in an array and another
(current) approach using struct fields. I rejected the first
approach because the BPF verifier does not allow changing the
array size, causing the BPF binary compatibility problem. The
second approach is satisfactory regarding the BPF binary
compatibility by CO-RE. However, it is a little bit cumbersome to
iterate all the events, so I added the enums. I know the enums
are only usable in the kernel code due to the binary
compatibility, but I thought adding the enums rather than
bloating the scx_dump_state() code at the end would be better.
Does it make sense to you?


> 
>> +static const char *scx_event_stat_str[] = {
>> +	[SCX_EVENT_INVAL_SELECT_CPU]	= "invalid_select_cpu",
>> +};
> 
> and hopefully derive this through # macro arg expansion?

That's a good idea. I will change it.

> 
>> +/*
>> + * The event counter is organized by a per-CPU variable to minimize the
>> + * accounting overhead without synchronization. A system-wide view on the
>> + * event counter is constructed when requested by scx_bpf_get_event_stat().
>> + */
>> +static DEFINE_PER_CPU(struct scx_event_stat, event_stats);
> 
> May be better to include "cpu" in the variable name.

I will add a "_cpu" suffix, renaming it to "event_stats_cpu".

>> +/**
>> + * scx_add_event - Increase an event counter for 'name' by 'cnt'
>> + * @name: an event name defined in struct scx_event_stat
>> + * @cnt: the number of the event occured
>> + */
>> +#define scx_add_event(name, cnt) ({						\
>> +	struct scx_event_stat *__e;						\
>> +	__e = get_cpu_ptr(&event_stats);					\
>> +	WRITE_ONCE(__e->name, __e->name+ (cnt));				\
>> +	put_cpu_ptr(&event_stats);						\
> 
> this_cpu_add()?

That's handier. I will change it.

> 
>> @@ -3607,8 +3667,10 @@ static int select_task_rq_scx(struct task_struct *p, int prev_cpu, int wake_flag
>>   		*ddsp_taskp = NULL;
>>   		if (ops_cpu_valid(cpu, "from ops.select_cpu()"))
>>   			return cpu;
>> -		else
>> +		else {
>> +			scx_add_event(INVAL_SELECT_CPU, 1);
>>   			return prev_cpu;
>> +		}
> 
> formatting:
> 
>          if () {
>          } else {
>          }

My bad. Will fix it.

> Also, I'm not sure this is a useful event to count. False ops_cpu_valid()
> indicates that the returned CPU is not even possible and the scheduler is
> ejected right away. What's more interesting is
> kernel/sched/core.c::select_task_rq() tripping on !is_cpu_allowed() and
> falling back using select_fallback_rq().
> 
> We can either hook into core.c or just compare the ops.select_cpu() picked
> CPU against the CPU the task ends up on in enqueue_task_scx().

Modifying core.c will be more direct and straightforward. Also,
I think it would be better to separate this commit into two: one
for the infrastructure and another for the SELECT_CPU_FALLBACK
event, which touches core.c. I will move the necessary code in
the infrastructure into kernel/sched/sched.h, so we can use
scx_add_event() in core.c.

> 
>> @@ -5053,6 +5115,15 @@ static void scx_ops_disable_workfn(struct kthread_work *work)
>>   		scx_rq_clock_invalidate(rq);
>>   	}
>>   
>> +	/*
>> +	 * Clear event counters so the next scx scheduler always gets
>> +	 * fresh event counter values.
>> +	 */
>> +	for_each_possible_cpu(cpu) {
>> +		struct scx_event_stat *e = per_cpu_ptr(&event_stats, cpu);
>> +		memset(e, 0, sizeof(*e));
>> +	}
>> +
> 
> Wouldn't we want to keep these counters intact on ejection so that the
> scheduler's ejection path can capture the counters if necessary? Resetting
> on load probably is better.

That makes sense. I will move the resetting code on load.

Regards,
Changwoo Min


* Re: [PATCH 2/7] sched_ext: Add an event, SCX_EVENT_OFFLINE_LOCAL_DSQ
  2025-01-17  1:37   ` Tejun Heo
@ 2025-01-17  7:11     ` Changwoo Min
  0 siblings, 0 replies; 28+ messages in thread
From: Changwoo Min @ 2025-01-17  7:11 UTC (permalink / raw)
  To: Tejun Heo; +Cc: void, arighi, kernel-dev, linux-kernel

Hello,

On 25. 1. 17. 10:37, Tejun Heo wrote:
> On Fri, Jan 17, 2025 at 12:15:38AM +0900, Changwoo Min wrote:
> ...
>> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
>> index 7e12d5b8322e..8054e4e5ed0c 100644
>> --- a/kernel/sched/ext.c
>> +++ b/kernel/sched/ext.c
>> @@ -1467,6 +1467,12 @@ struct scx_event_stat {
>>   	 * the core scheduler code silently picks a fallback CPU.
>>   	 */
>>   	u64		INVAL_SELECT_CPU;
>> +
>> +	/*
>> +	 * When dispatching to a local DSQ, the CPU may have gone offline in
>> +	 * the meantime. In this case, the task is bounced to the global DSQ.
>> +	 */
>> +	u64		OFFLINE_LOCAL_DSQ;
> 
> Let's do components in descending order + event:
> 
> - SELECT_CPU_INVAL (or SELECT_CPU_FALLBACK)
> - DISPATCH_LOCAL_DSQ_OFFLINE

Sure. Will change it in the next version.

Regards,
Changwoo Min


* Re: [PATCH 3/7] sched_ext: Add an event, SCX_EVENT_CNTD_RUN_WO_ENQ
  2025-01-17  1:39   ` Tejun Heo
@ 2025-01-17  7:12     ` Changwoo Min
  0 siblings, 0 replies; 28+ messages in thread
From: Changwoo Min @ 2025-01-17  7:12 UTC (permalink / raw)
  To: Tejun Heo; +Cc: void, arighi, kernel-dev, linux-kernel

Hello,

On 25. 1. 17. 10:39, Tejun Heo wrote:
> On Fri, Jan 17, 2025 at 12:15:39AM +0900, Changwoo Min wrote:
>> Add a core event, SCX_EVENT_CNTD_RUN_WO_ENQ, which represents how many
>> times a task is continued to run without ops.enqueue() when SCX_OPS_ENQ_LAST
>> is not set.
>>
>> Signed-off-by: Changwoo Min <changwoo@igalia.com>
>> ---
>>   kernel/sched/ext.c | 9 +++++++++
>>   1 file changed, 9 insertions(+)
>>
>> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
>> index 8054e4e5ed0c..909f12a41934 100644
>> --- a/kernel/sched/ext.c
>> +++ b/kernel/sched/ext.c
>> @@ -1473,6 +1473,12 @@ struct scx_event_stat {
>>   	 * the meantime. In this case, the task is bounced to the global DSQ.
>>   	 */
>>   	u64		OFFLINE_LOCAL_DSQ;
>> +
>> +	/*
>> +	 * If SCX_OPS_ENQ_LAST is not set, the number of times that a task
>> +	 * continued to run because there were no other tasks on the CPU.
>> +	 */
>> +	u64		CNTD_RUN_WO_ENQ;
> 
> DISPATCH_KEEP_LAST?

That's good.

Regards,
Changwoo Min


* Re: [PATCH 4/7] sched_ext: Add an event, SCX_EVENT_ENQ_LOCAL_EXITING
  2025-01-17  1:40   ` Tejun Heo
@ 2025-01-17  7:12     ` Changwoo Min
  0 siblings, 0 replies; 28+ messages in thread
From: Changwoo Min @ 2025-01-17  7:12 UTC (permalink / raw)
  To: Tejun Heo; +Cc: void, arighi, kernel-dev, linux-kernel

Hello,

On 25. 1. 17. 10:40, Tejun Heo wrote:
> On Fri, Jan 17, 2025 at 12:15:40AM +0900, Changwoo Min wrote:
>> Add a core event, SCX_EVENT_ENQ_LOCAL_EXITING, which represents how many
>> times a task is enqueued to a local DSQ when exiting if SCX_OPS_ENQ_EXITING
>> is not set.
>>
>> Signed-off-by: Changwoo Min <changwoo@igalia.com>
>> ---
>>   kernel/sched/ext.c | 12 +++++++++++-
>>   1 file changed, 11 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
>> index 909f12a41934..094e19f5fb78 100644
>> --- a/kernel/sched/ext.c
>> +++ b/kernel/sched/ext.c
>> @@ -1479,6 +1479,12 @@ struct scx_event_stat {
>>   	 * continued to run because there were no other tasks on the CPU.
>>   	 */
>>   	u64		CNTD_RUN_WO_ENQ;
>> +
>> +	/*
>> +	 * If SCX_OPS_ENQ_EXITING is not set, the number of times that a task
>> +	 * is dispatched to a local DSQ when exiting.
>> +	 */
>> +	u64		ENQ_LOCAL_EXITING;
> 
> ENQ_SKIP_EXITING?

Will change it as suggested.

Regards,
Changwoo Min


* Re: [PATCH 5/7] sched_ext: Add an event, SCX_EVENT_RQ_BYPASSING_OPS
  2025-01-17  1:41   ` Tejun Heo
@ 2025-01-17  7:31     ` Changwoo Min
  2025-01-17 16:14       ` Tejun Heo
  0 siblings, 1 reply; 28+ messages in thread
From: Changwoo Min @ 2025-01-17  7:31 UTC (permalink / raw)
  To: Tejun Heo; +Cc: void, arighi, kernel-dev, linux-kernel

Hello,

On 25. 1. 17. 10:41, Tejun Heo wrote:
>> +
>> +	/*
>> +	 * When the bypassing mode is set, the number of bypassed operations.
>> +	 */
>> +	u64		RQ_BYPASSING_OPS;
>>   };
> 
> I'm not sure this is a particularly good way to account for bypass mode.
> Maybe account the total duration of bypass modes, the number of times it was
> activated and the number of times tasks were dispatched in bypass mode?
I think it is a good idea to further specialize RQ_BYPASSING events.

For the number of times the bypassing mode activated, what about
BYPASS_NR_ACTIVATED?

For the number of tasks dispatched, what about
BYPASS_NR_TASK_DISPATCHED?

I think BYPASS_NR_ACTIVATED and BYPASS_NR_TASK_DISPATCHED will be
a good proxy for the total duration, so we can skip it until we
have a clear use case. If we need the total duration now (maybe
BYPASS_DURATION?), we can measure it directly in
scx_ops_bypass(). What do you think?

Regards,
Changwoo Min


* Re: [PATCH 1/7] sched_ext: Implement event counter infrastructure and add an event
  2025-01-16 15:15 ` [PATCH 1/7] sched_ext: Implement event counter infrastructure and add an event Changwoo Min
  2025-01-17  1:33   ` Tejun Heo
@ 2025-01-17  9:49   ` Andrea Righi
  2025-01-17 16:26     ` Tejun Heo
  2025-01-18  0:00     ` Changwoo Min
  2025-01-21 15:00   ` Leonardo Temperanza
  2 siblings, 2 replies; 28+ messages in thread
From: Andrea Righi @ 2025-01-17  9:49 UTC (permalink / raw)
  To: Changwoo Min; +Cc: tj, void, kernel-dev, linux-kernel

On Fri, Jan 17, 2025 at 12:15:37AM +0900, Changwoo Min wrote:
...
> +/*
> + * The event counter is organized by a per-CPU variable to minimize the
> + * accounting overhead without synchronization. A system-wide view on the
> + * event counter is constructed when requested by scx_bpf_get_event_stat().
> + */
> +static DEFINE_PER_CPU(struct scx_event_stat, event_stats);

Should we consider tracking these statistics per-scheduler rather than
globally (like adding scx_event_stat to sched_ext_ops)?

It's not particularly important for now, but in the future, if we allow
multiple scx schedulers to be loaded at the same time, tracking separate
stats per-scheduler would be preferable.

Thanks,
-Andrea


* Re: [PATCH 5/7] sched_ext: Add an event, SCX_EVENT_RQ_BYPASSING_OPS
  2025-01-17  7:31     ` Changwoo Min
@ 2025-01-17 16:14       ` Tejun Heo
  0 siblings, 0 replies; 28+ messages in thread
From: Tejun Heo @ 2025-01-17 16:14 UTC (permalink / raw)
  To: Changwoo Min; +Cc: void, arighi, kernel-dev, linux-kernel

Hello,

On Fri, Jan 17, 2025 at 04:31:55PM +0900, Changwoo Min wrote:
...
> For the number of times the bypassing mode activated, what about
> BYPASS_NR_ACTIVATED?
> 
> For the number of tasks dispatched, what about
> BYPASS_NR_TASK_DISPATCHED?

Those names are fine but we used simple imperatives for other names, so the
following might be more consistent:

- BYPASS_ACTIVATE
- BYPASS_DISPATCH

> I think BYPASS_NR_ACTIVATED and BYPASS_NR_TASK_DISPATCHED will be
> a good proxy for the total duration, so we can skip it until we
> have a clear use case. If we need the total duration now (maybe
> BYPASS_DURATION?), we can measure it directly in
> scx_ops_bypass(). What do you think?

I think it'd be a useful counter to have and measuring from scx_ops_bypass()
makes sense to me.

Thanks.

-- 
tejun


* Re: [PATCH 1/7] sched_ext: Implement event counter infrastructure and add an event
  2025-01-17  7:08     ` Changwoo Min
@ 2025-01-17 16:24       ` Tejun Heo
  2025-01-18  0:27         ` Changwoo Min
  0 siblings, 1 reply; 28+ messages in thread
From: Tejun Heo @ 2025-01-17 16:24 UTC (permalink / raw)
  To: Changwoo Min; +Cc: void, arighi, kernel-dev, linux-kernel

Hello,

On Fri, Jan 17, 2025 at 04:08:45PM +0900, Changwoo Min wrote:
> $COMPONENT_$EVENT sounds more systematic. Thanks for the
> suggestion. What about SELECT_CPU_FALLBACK?

Sounds good.

> > This feels a bit odd to me. Why does it need both the struct fields and
> > indices? Can we do either one of those?
> 
> Before submitting the patch set, I tested one approach using
> purely enums storing the actual counters in an array and another
> (current) approach using struct fields. I rejected the first
> approach because the BPF verifier does not allow changing the
> array size, causing the BPF binary compatibility problem. The

That makes sense. It'd be useful to note that on top of the struct
definition.

> second approach is satisfactory regarding the BPF binary
> compatibility by CO-RE. However, it is a little bit cumbersome to
> iterate all the events, so I added the enums. I know the enums
> are only usable in the kernel code due to the binary
> compatibility, but I thought adding the enums rather than
> bloating the scx_dump_state() code at the end would be better.
> Does it make sense to you?

Let's just use the structs and open code dumping. We can pack the dump
output better that way too and I don't think BPF scheds would have much use
for enum iterations.

...
> > > +/**
> > > + * scx_add_event - Increase an event counter for 'name' by 'cnt'
> > > + * @name: an event name defined in struct scx_event_stat
> > > + * @cnt: the number of the event occured
> > > + */
> > > +#define scx_add_event(name, cnt) ({						\
> > > +	struct scx_event_stat *__e;						\
> > > +	__e = get_cpu_ptr(&event_stats);					\
> > > +	WRITE_ONCE(__e->name, __e->name+ (cnt));				\
> > > +	put_cpu_ptr(&event_stats);						\
> > 
> > this_cpu_add()?
> 
> That's handier. I will change it.

Note that there's __this_cpu_add() too which can be used from sections of
code that already disables preemption. On x86, both variants cost the same
but on arm __this_cpu_add() is cheaper as it can skip preemption off/on.
Given that most of these counters are modified while holding rq lock, it may
make sense to provide __ version too.

...
> > Also, I'm not sure this is a useful event to count. False ops_cpu_valid()
> > indicates that the returned CPU is not even possible and the scheduler is
> > ejected right away. What's more interesting is
> > kernel/sched/core.c::select_task_rq() tripping on !is_cpu_allowed() and
> > falling back using select_fallback_rq().
> > 
> > We can either hook into core.c or just compare the ops.select_cpu() picked
> > CPU against the CPU the task ends up on in enqueue_task_scx().
> 
> Modifying core.c will be more direct and straightforward. Also,
> I think it would be better to separate this commit into two: one
> for the infra-structure and another for the SELECT_CPU_FALLBACK
> event, which touches core.c. I will move the necessary code in
> the infrastructure into kernel/sched/sched.h, so we can use
> scx_add_event() in core.c.

Hmm.... yeah, both approaches have pros and cons but I kinda like the idea
of restricting the events within ext.c and here detecting the fallback
condition is pretty trivial. I don't know.

Thanks.

-- 
tejun


* Re: [PATCH 1/7] sched_ext: Implement event counter infrastructure and add an event
  2025-01-17  9:49   ` Andrea Righi
@ 2025-01-17 16:26     ` Tejun Heo
  2025-01-18  0:00     ` Changwoo Min
  1 sibling, 0 replies; 28+ messages in thread
From: Tejun Heo @ 2025-01-17 16:26 UTC (permalink / raw)
  To: Andrea Righi; +Cc: Changwoo Min, void, kernel-dev, linux-kernel

Hello,

On Fri, Jan 17, 2025 at 10:49:55AM +0100, Andrea Righi wrote:
> On Fri, Jan 17, 2025 at 12:15:37AM +0900, Changwoo Min wrote:
> ...
> > +/*
> > + * The event counter is organized by a per-CPU variable to minimize the
> > + * accounting overhead without synchronization. A system-wide view on the
> > + * event counter is constructed when requested by scx_bpf_get_event_stat().
> > + */
> > +static DEFINE_PER_CPU(struct scx_event_stat, event_stats);
> 
> Should we consider tracking these statistics per-scheduler rather than
> globally (like adding scx_event_stat to sched_ext_ops)?
> 
> It's not particularly important for now, but in the future, if we allow
> multiple scx schedulers to be loaded at the same time, tracking separate
> stats per-scheduler would be preferable.

Yeah, eventually, but I don't think it affects anything user visible right
now. Later, we may add an interface to query the stats for a specific
scheduler provided the scheduler is a child of the current one and so on but
we'd still need an interface to query "my stats" which the current interface
can become.

Thanks.

-- 
tejun


* Re: [PATCH 1/7] sched_ext: Implement event counter infrastructure and add an event
  2025-01-17  9:49   ` Andrea Righi
  2025-01-17 16:26     ` Tejun Heo
@ 2025-01-18  0:00     ` Changwoo Min
  1 sibling, 0 replies; 28+ messages in thread
From: Changwoo Min @ 2025-01-18  0:00 UTC (permalink / raw)
  To: Andrea Righi; +Cc: tj, void, kernel-dev, linux-kernel

Hello,

Thank you for the suggestion, Andrea!

On 25. 1. 17. 18:49, Andrea Righi wrote:
> On Fri, Jan 17, 2025 at 12:15:37AM +0900, Changwoo Min wrote:
> ...
>> +/*
>> + * The event counter is organized by a per-CPU variable to minimize the
>> + * accounting overhead without synchronization. A system-wide view on the
>> + * event counter is constructed when requested by scx_bpf_get_event_stat().
>> + */
>> +static DEFINE_PER_CPU(struct scx_event_stat, event_stats);
> 
> Should we consider tracking these statistics per-scheduler rather than
> globally (like adding scx_event_stat to sched_ext_ops)?
> 
> It's not particularly important for now, but in the future, if we allow
> multiple scx schedulers to be loaded at the same time, tracking separate
> stats per-scheduler would be preferable.

It will be useful to get per-scheduler information. Since how
scheduler composability will be implemented is still wide open at
this point, I will revisit the interface design as the
composability design gets more concrete.

Regards,
Changwoo Min




* Re: [PATCH 1/7] sched_ext: Implement event counter infrastructure and add an event
  2025-01-17 16:24       ` Tejun Heo
@ 2025-01-18  0:27         ` Changwoo Min
  2025-01-22  0:42           ` Tejun Heo
  0 siblings, 1 reply; 28+ messages in thread
From: Changwoo Min @ 2025-01-18  0:27 UTC (permalink / raw)
  To: Tejun Heo; +Cc: void, arighi, kernel-dev, linux-kernel

Hello,

On 25. 1. 18. 01:24, Tejun Heo wrote:
>> second approach is satisfactory regarding the BPF binary
>> compatibility by CO-RE. However, it is a little bit cumbersome to
>> iterate all the events, so I added the enums. I know the enums
>> are only usable in the kernel code due to the binary
>> compatibility, but I thought adding the enums rather than
>> bloating the scx_dump_state() code at the end would be better.
>> Does it make sense to you?
> 
> Let's just use the structs and open code dumping. We can pack the dump
> output better that way too and I don't think BPF scheds would have much use
> for enum iterations.

Currently, enums are not exposed to BPF schedulers. Also, BPF
schedulers should not rely on the enums because the enums are
coupled with a specific layout of the struct, so that would cause
a BPF binary compatibility problem.

Could you explain more about the code dumping? I want to at
least print the event string for easier interpretation.

> ...
>>>> +/**
>>>> + * scx_add_event - Increase an event counter for 'name' by 'cnt'
>>>> + * @name: an event name defined in struct scx_event_stat
>>>> + * @cnt: the number of the event occured
>>>> + */
>>>> +#define scx_add_event(name, cnt) ({						\
>>>> +	struct scx_event_stat *__e;						\
>>>> +	__e = get_cpu_ptr(&event_stats);					\
>>>> +	WRITE_ONCE(__e->name, __e->name+ (cnt));				\
>>>> +	put_cpu_ptr(&event_stats);						\
>>>
>>> this_cpu_add()?
>>
>> That's handier. I will change it.
> 
> Note that there's __this_cpu_add() too which can be used from sections of
> code that already disables preemption. On x86, both variants cost the same
> but on arm __this_cpu_add() is cheaper as it can skip preemption off/on.
> Given that most of these counters are modified while holding rq lock, it may
> make sense to provide __ version too.

Sounds good. I will add __scx_add_event() following the
convention of this_cpu_add() and __this_cpu_add().

> 
> ...
>>> Also, I'm not sure this is a useful event to count. False ops_cpu_valid()
>>> indicates that the returned CPU is not even possible and the scheduler is
>>> ejected right away. What's more interesting is
>>> kernel/sched/core.c::select_task_rq() tripping on !is_cpu_allowed() and
>>> falling back using select_fallback_rq().
>>>
>>> We can either hook into core.c or just compare the ops.select_cpu() picked
>>> CPU against the CPU the task ends up on in enqueue_task_scx().
>>
>> Modifying core.c will be more direct and straightforward. Also,
>> I think it would be better to separate this commit into two: one
>> for the infrastructure and another for the SELECT_CPU_FALLBACK
>> event, which touches core.c. I will move the necessary code in
>> the infrastructure into kernel/sched/sched.h, so we can use
>> scx_add_event() in core.c.
> 
> Hmm.... yeah, both approaches have pros and cons but I kinda like the idea
> of restricting the events within ext.c and here detecting the fallback
> condition is pretty trivial. I don't know.

Right, it will be possible to track the same thing without
touching core.c (that's too much) by replicating the conditions
from core.c in ext.c. I will dig into the code more to figure out
what the best approach is.

Regards,
Changwoo Min


* Re: [PATCH 1/7] sched_ext: Implement event counter infrastructure and add an event
  2025-01-16 15:15 ` [PATCH 1/7] sched_ext: Implement event counter infrastructure and add an event Changwoo Min
  2025-01-17  1:33   ` Tejun Heo
  2025-01-17  9:49   ` Andrea Righi
@ 2025-01-21 15:00   ` Leonardo Temperanza
  2025-01-22  1:45     ` Changwoo Min
  2 siblings, 1 reply; 28+ messages in thread
From: Leonardo Temperanza @ 2025-01-21 15:00 UTC (permalink / raw)
  To: Changwoo Min, tj, void, arighi; +Cc: kernel-dev, linux-kernel

Hello,

On 1/16/2025 4:15 PM, Changwoo Min wrote:
> Collect the statistics of specific types of behavior in the sched_ext core,
> which are not easily visible but still interesting to an scx scheduler.
>
> Also, add a core event, SCX_EVENT_INVAL_SELECT_CPU, which represents how
> many times ops.select_cpu() returns a CPU that the task can't use.
>
> Signed-off-by: Changwoo Min <changwoo@igalia.com>
> ---
>   kernel/sched/ext.c | 120 ++++++++++++++++++++++++++++++++++++++++++++-
>   1 file changed, 118 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 0bcdd1a31676..7e12d5b8322e 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -1458,6 +1458,66 @@ static struct task_struct *scx_task_iter_next_locked(struct scx_task_iter *iter)
>   	return p;
>   }
>   
> +/*
> + * Collection of event counters.
> + */
> +struct scx_event_stat {
> +	/*
> +	 * If ops.select_cpu() returns a CPU which can't be used by the task,
> +	 * the core scheduler code silently picks a fallback CPU.
> +	 */
> +	u64		INVAL_SELECT_CPU;
> +};
> +
> +#define SCX_EVENT_IDX(e)	(offsetof(struct scx_event_stat, e)/sizeof(u64))
> +#define SCX_EVENT_END_IDX()	(sizeof(struct scx_event_stat)/sizeof(u64))
> +#define SCX_EVENT_DEFINE(e)	SCX_EVENT_##e = SCX_EVENT_IDX(e)
> +


scx_event_stat fields are required to be u64, otherwise the macros below
don't work correctly. Perhaps a comment could highlight this "hidden"
constraint? As well as a BUILD_BUG_ON to enforce the constraint at
build time.


> +/*
> + * Types of event counters.
> + */
> +enum scx_event_kind {
> +	SCX_EVENT_BEGIN = 0,
> +	SCX_EVENT_DEFINE(INVAL_SELECT_CPU),
> +	SCX_EVENT_END = SCX_EVENT_END_IDX(),
> +};
> +
> +static const char *scx_event_stat_str[] = {
> +	[SCX_EVENT_INVAL_SELECT_CPU]	= "invalid_select_cpu",
> +};
> +


If an event is added to scx_event_kind but not scx_event_stat_str, the 
array can possibly be accessed out of bounds (or into a NULL string 
depending on the missing index). The GNU C extension could be used to 
initialize all elements to the empty string "", like this:

static const char *scx_event_stat_str[SCX_EVENT_END] = {
     [0 ... SCX_EVENT_END - 1] = "",
     [SCX_EVENT_INVAL_SELECT_CPU] = "invalid_select_cpu",
};

Alternatively a helper could be used:

static const char *scx_event_name(enum scx_event_kind kind)
{
     return scx_event_stat_str[kind] ? : "";
}


> +/*
> + * The event counter is organized by a per-CPU variable to minimize the
> + * accounting overhead without synchronization. A system-wide view on the
> + * event counter is constructed when requested by scx_bpf_get_event_stat().
> + */
> +static DEFINE_PER_CPU(struct scx_event_stat, event_stats);
> +
> +/**
> + * scx_add_event - Increase an event counter for 'name' by 'cnt'
> + * @name: an event name defined in struct scx_event_stat
> + * @cnt: the number of the event occured
> + */
> +#define scx_add_event(name, cnt) ({						\
> +	struct scx_event_stat *__e;						\
> +	__e = get_cpu_ptr(&event_stats);					\
> +	WRITE_ONCE(__e->name, __e->name+ (cnt));				\
> +	put_cpu_ptr(&event_stats);						\
> +})
> +
> +
> +/**
> + * scx_read_event_kind - Read an event from 'e' with 'kind'
> + * @e: a pointer to an event collected by scx_bpf_event_stat()
> + * @kine: an event type defined in scx_event_kind
> + */
> +#define scx_read_event_kind(e, kind) ({						\
> +	u64 *__e64 = (u64 *)(e);						\
> +	__e64[kind];								\
> +})
> +


nit: typo, "@kine" instead of "@kind"


> +static void scx_bpf_event_stat(struct scx_event_stat *event, size_t event__sz);
> +
>   static enum scx_ops_enable_state scx_ops_enable_state(void)
>   {
>   	return atomic_read(&scx_ops_enable_state_var);
> @@ -3607,8 +3667,10 @@ static int select_task_rq_scx(struct task_struct *p, int prev_cpu, int wake_flag
>   		*ddsp_taskp = NULL;
>   		if (ops_cpu_valid(cpu, "from ops.select_cpu()"))
>   			return cpu;
> -		else
> +		else {
> +			scx_add_event(INVAL_SELECT_CPU, 1);
>   			return prev_cpu;
> +		}
>   	} else {
>   		bool found;
>   		s32 cpu;
> @@ -5053,6 +5115,15 @@ static void scx_ops_disable_workfn(struct kthread_work *work)
>   		scx_rq_clock_invalidate(rq);
>   	}
>   
> +	/*
> +	 * Clear event counters so the next scx scheduler always gets
> +	 * fresh event counter values.
> +	 */
> +	for_each_possible_cpu(cpu) {
> +		struct scx_event_stat *e = per_cpu_ptr(&event_stats, cpu);
> +		memset(e, 0, sizeof(*e));
> +	}
> +
>   	/* no task is on scx, turn off all the switches and flush in-progress calls */
>   	static_branch_disable(&__scx_ops_enabled);
>   	for (i = SCX_OPI_BEGIN; i < SCX_OPI_END; i++)
> @@ -5309,9 +5380,10 @@ static void scx_dump_state(struct scx_exit_info *ei, size_t dump_len)
>   		.at_jiffies = jiffies,
>   	};
>   	struct seq_buf s;
> +	struct scx_event_stat event;
>   	unsigned long flags;
>   	char *buf;
> -	int cpu;
> +	int cpu, kind;
>   
>   	spin_lock_irqsave(&dump_lock, flags);
>   
> @@ -5417,6 +5489,16 @@ static void scx_dump_state(struct scx_exit_info *ei, size_t dump_len)
>   		rq_unlock(rq, &rf);
>   	}
>   
> +	dump_newline(&s);
> +	dump_line(&s, "Event counters");
> +	dump_line(&s, "--------------");
> +
> +	scx_bpf_event_stat(&event, sizeof(event));
> +	for (kind = SCX_EVENT_BEGIN; kind < SCX_EVENT_END; kind++) {
> +		dump_line(&s, "%25s : %llu", scx_event_stat_str[kind],
> +			  scx_read_event_kind(&event, kind));
> +	}
> +
>   	if (seq_buf_has_overflowed(&s) && dump_len >= sizeof(trunc_marker))
>   		memcpy(ei->dump + dump_len - sizeof(trunc_marker),
>   		       trunc_marker, sizeof(trunc_marker));
> @@ -7720,6 +7802,39 @@ __bpf_kfunc u64 scx_bpf_now(void)
>   	return clock;
>   }
>   
> +/*
> + * scx_bpf_event_stat - Get a system-wide event counter to
> + * @event: output buffer from a BPF program
> + * @event__sz: @event len, must end in '__sz'' for the verifier
> + */
> +__bpf_kfunc void scx_bpf_event_stat(struct scx_event_stat *event,
> +				    size_t event__sz)
> +{
> +	struct scx_event_stat *e;
> +	u64 *event64, *e64;
> +	int cpu, kind, event_end;
> +
> +	/*
> +	 * We cannot entirely trust a BPF-provided size since a BPF program
> +	 * might be compiled against a different vmlinux.h, of which
> +	 * scx_event_stat would be larger (a newer vmlinux.h) or smaller
> +	 * (an older vmlinux.h). Hence, we use the smaller size to avoid
> +	 * memory corruption.
> +	 */
> +	event__sz = min(event__sz, sizeof(*event));
> +	event_end = event__sz / sizeof(u64);
> +
> +	event64 = (u64 *)event;
> +	memset(event, 0, event__sz);
> +	for_each_possible_cpu(cpu) {
> +		e = per_cpu_ptr(&event_stats, cpu);
> +		e64 = (u64 *)e;
> +		for (kind = 0; kind < event_end; kind++) {
> +			event64[kind] += READ_ONCE(e64[kind]);
> +		}
> +	}
> +}
> +
>   __bpf_kfunc_end_defs();
>   
>   BTF_KFUNCS_START(scx_kfunc_ids_any)
> @@ -7752,6 +7867,7 @@ BTF_ID_FLAGS(func, scx_bpf_cpu_rq)
>   BTF_ID_FLAGS(func, scx_bpf_task_cgroup, KF_RCU | KF_ACQUIRE)
>   #endif
>   BTF_ID_FLAGS(func, scx_bpf_now)
> +BTF_ID_FLAGS(func, scx_bpf_event_stat, KF_TRUSTED_ARGS)
>   BTF_KFUNCS_END(scx_kfunc_ids_any)
>   
>   static const struct btf_kfunc_id_set scx_kfunc_set_any = {

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/7] sched_ext: Implement event counter infrastructure and add an event
  2025-01-18  0:27         ` Changwoo Min
@ 2025-01-22  0:42           ` Tejun Heo
  2025-01-22  1:37             ` Changwoo Min
  0 siblings, 1 reply; 28+ messages in thread
From: Tejun Heo @ 2025-01-22  0:42 UTC (permalink / raw)
  To: Changwoo Min; +Cc: void, arighi, kernel-dev, linux-kernel

Hello,

On Sat, Jan 18, 2025 at 09:27:18AM +0900, Changwoo Min wrote:
...
> Could you explain more on the code dumping? I want at least print
> the event string for easier interpretation.

I meant that we don't need the enums to print out the counters, e.g. we
can just do:

        dump_line(&s, "selcpu_inval: %16llu    dsp_keep_last: %16llu",
                  events.select_cpu_inval, events.dispatch_keep_last);

Thanks.

-- 
tejun


* Re: [PATCH 1/7] sched_ext: Implement event counter infrastructure and add an event
  2025-01-22  0:42           ` Tejun Heo
@ 2025-01-22  1:37             ` Changwoo Min
  0 siblings, 0 replies; 28+ messages in thread
From: Changwoo Min @ 2025-01-22  1:37 UTC (permalink / raw)
  To: Tejun Heo; +Cc: void, arighi, kernel-dev, linux-kernel

Hello,

On 25. 1. 22. 09:42, Tejun Heo wrote:
> I meant that we don't need the enums to print out the counters, e.g. we
> can just do:
> 
>          dump_line(&s, "selcpu_inval: %16llu    dsp_keep_last: %16llu",
>                    events.select_cpu_inval, events.dispatch_keep_last);

Haha. Thanks for the clarification. I got it.

Regards,
Changwoo Min


* Re: [PATCH 1/7] sched_ext: Implement event counter infrastructure and add an event
  2025-01-21 15:00   ` Leonardo Temperanza
@ 2025-01-22  1:45     ` Changwoo Min
  0 siblings, 0 replies; 28+ messages in thread
From: Changwoo Min @ 2025-01-22  1:45 UTC (permalink / raw)
  To: Leonardo Temperanza, tj, void, arighi; +Cc: kernel-dev, linux-kernel

Hello Leonardo,

Thanks for the review.

On 25. 1. 22. 00:00, Leonardo Temperanza wrote:
> scx_event_stat fields are required to be u64, otherwise the macros below 
> don't work correctly. Perhaps a comment could highlight this "hidden" 
> constraint? As well as a BUILD_BUG_ON to check the constraint at
> build time.

That's a good suggestion. I will document any hidden constraints in the comments.

> If an event is added to scx_event_kind but not scx_event_stat_str, the 
> array can possibly be accessed out of bounds (or into a NULL string 
> depending on the missing index). The GNU C extension could be used to 
> initialize all elements to the empty string "", like this:
> 
> static const char *scx_event_stat_str[SCX_EVENT_END] = {
>      [0 ... SCX_EVENT_END - 1] = "",
>      [SCX_EVENT_INVAL_SELECT_CPU] = "invalid_select_cpu",
> };
> 
> Alternatively a helper could be used:
> 
> static const char *scx_event_name(enum scx_event_kind kind)
> {
>      return scx_event_stat_str[kind] ? : "";
> }

Thanks for the suggestion. I am considering dropping event and event_str
in the next version, since their only purpose is printing the dump
messages. At this point, the simpler the better, I think.


>> +/**
>> + * scx_read_event_kind - Read an event from 'e' with 'kind'
>> + * @e: a pointer to an event collected by scx_bpf_event_stat()
>> + * @kine: an event type defined in scx_event_kind
>> + */
>> +#define scx_read_event_kind(e, kind) ({                        \
>> +    u64 *__e64 = (u64 *)(e);                        \
>> +    __e64[kind];                                \
>> +})
>> +
> 
> 
> nit: typo, "@kine" instead of "@kind"

Good catch. Will fix it.

Regards,
Changwoo Min


end of thread, other threads:[~2025-01-22  1:45 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-01-16 15:15 [PATCH 0/7] sched_ext: Implement core event counters Changwoo Min
2025-01-16 15:15 ` [PATCH 1/7] sched_ext: Implement event counter infrastructure and add an event Changwoo Min
2025-01-17  1:33   ` Tejun Heo
2025-01-17  7:08     ` Changwoo Min
2025-01-17 16:24       ` Tejun Heo
2025-01-18  0:27         ` Changwoo Min
2025-01-22  0:42           ` Tejun Heo
2025-01-22  1:37             ` Changwoo Min
2025-01-17  9:49   ` Andrea Righi
2025-01-17 16:26     ` Tejun Heo
2025-01-18  0:00     ` Changwoo Min
2025-01-21 15:00   ` Leonardo Temperanza
2025-01-22  1:45     ` Changwoo Min
2025-01-16 15:15 ` [PATCH 2/7] sched_ext: Add an event, SCX_EVENT_OFFLINE_LOCAL_DSQ Changwoo Min
2025-01-17  1:37   ` Tejun Heo
2025-01-17  7:11     ` Changwoo Min
2025-01-16 15:15 ` [PATCH 3/7] sched_ext: Add an event, SCX_EVENT_CNTD_RUN_WO_ENQ Changwoo Min
2025-01-17  1:39   ` Tejun Heo
2025-01-17  7:12     ` Changwoo Min
2025-01-16 15:15 ` [PATCH 4/7] sched_ext: Add an event, SCX_EVENT_ENQ_LOCAL_EXITING Changwoo Min
2025-01-17  1:40   ` Tejun Heo
2025-01-17  7:12     ` Changwoo Min
2025-01-16 15:15 ` [PATCH 5/7] sched_ext: Add an event, SCX_EVENT_RQ_BYPASSING_OPS Changwoo Min
2025-01-17  1:41   ` Tejun Heo
2025-01-17  7:31     ` Changwoo Min
2025-01-17 16:14       ` Tejun Heo
2025-01-16 15:15 ` [PATCH 6/7] sched_ext: Add scx_bpf_event_stat() and scx_read_event() for BPF schedulers Changwoo Min
2025-01-16 15:15 ` [PATCH 7/7] sched_ext: Print core event count in scx_central scheduler Changwoo Min

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox